Designing the append-only decision log for forensic auditability

Nklave Team · May 8, 2026

Tags: audit-log, forensics, checkpoints

A signing security layer that cannot tell you what it did is not a security layer. It is an opaque RPC that occasionally refuses requests, and when something goes wrong, you are left with the validator client’s logs and a confused conversation with whoever was on-call. We did not want to ship that. So nklave was designed from the start around an append-only log that records every decision — allow or refuse — in a format that supports after-the-fact forensic reconstruction.

This post walks through how that log is structured, what the checkpoint chain guarantees, where the operator key lives, and how to use the log in the situations you actually care about: explaining a refusal, proving non-equivocation, and detecting tampering after a host compromise.

The format

Every signing evaluation lands in the log as a newline-delimited JSON record. There are two record types: decision entries (one per signing request) and checkpoint markers (one per sealed batch). The decision entries look like this:

{"ts": 1742054400, "validator": "0xabc...", "type": "ATTESTATION", "decision": "allow", "signing_root": "0xdef..."}
{"ts": 1742054401, "validator": "0xabc...", "type": "ATTESTATION", "decision": "refuse", "policy": "slashing-protection-attestation", "reason": "double vote at target_epoch=12345"}

A few deliberate choices in that format are worth calling out.

Refusals carry both the policy name that refused and a human-readable reason. The policy name lets you grep for systemic problems (“how many refusals from fork-allowlist last week?”); the reason gives you the actual values that triggered the refusal so you can correlate against validator-client behavior. Including both is not redundancy; it is two different lookup paths for two different questions.

Allow entries carry the signing_root of the message that was signed. This means the log is sufficient evidence of what was signed, without dragging in the validator client’s own records. If a slashing claim is ever made against one of your validators, the log can prove that nklave did not sign the contested attestation.

The log does not record signatures, only signing roots. Signatures are public once they hit the chain; the signing root is the message digest that the signature was computed over. Recording the signing root and the timestamp is enough to bind nklave’s decision to the on-chain artifact, because the BLS signature on chain can be verified against that root. Recording signatures would add nothing forensically and would make the log a juicier target if it ever leaked.
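As a rough illustration of the "grep for systemic problems" lookup path, the sketch below parses newline-delimited records and tallies refusals per policy. It relies only on the fields shown in the sample records above; the helper names `parse_log` and `refusals_by_policy` are hypothetical and are not part of nklave.

```python
import json
from collections import Counter

# The two sample records from this post, as they would appear on disk.
SAMPLE = """\
{"ts": 1742054400, "validator": "0xabc...", "type": "ATTESTATION", "decision": "allow", "signing_root": "0xdef..."}
{"ts": 1742054401, "validator": "0xabc...", "type": "ATTESTATION", "decision": "refuse", "policy": "slashing-protection-attestation", "reason": "double vote at target_epoch=12345"}
"""


def parse_log(text):
    """Yield one dict per non-empty NDJSON line."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)


def refusals_by_policy(records):
    """Count refusal entries per policy name.

    Records without a "decision" field (for example checkpoint markers)
    are ignored.
    """
    return Counter(
        rec["policy"] for rec in records if rec.get("decision") == "refuse"
    )
```

Calling `refusals_by_policy(parse_log(SAMPLE))` on the sample yields a single count for `slashing-protection-attestation`. The `reason` field is left untouched here because it answers the other question: which concrete values triggered one specific refusal.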

Checkpoints, Merkle roots, and the chain

A log that an attacker can edit is worse than no log at all — it is a log you trust. nklave makes the log tamper-evident through periodic checkpoints.

Every checkpoint_interval_seconds (default: 60), nklave does four things:

1. Computes the Merkle root over all decision entries since the last checkpoint.
2. Signs the root with the operator key (more on this below).
3. Writes a checkpoint marker into the log: {ts, prev_root, root, signature, entry_count}.
4. Continues appending new decision entries against the next checkpoint window.

The checkpoint chain is what gives the log its security property. Each checkpoint commits to (a) the entries since the last checkpoint, via the Merkle root, and (b) the previous checkpoint, via prev_root. Verifying the chain means re-walking it from any known-good point forward: re-compute each checkpoint’s Merkle root from the entries it covers, confirm it matches the marker, confirm the operator signature is valid, confirm prev_root chains back correctly. If any of those checks fails, you have detected tampering and you know roughly when it happened.
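The sealing and verification steps can be sketched in a few dozen lines. Everything below is an assumption made for illustration, not nklave's actual encoding: SHA-256 as the hash, canonical JSON as the leaf serialization, promotion of an odd node to the next tree level, an all-zero `prev_root` for the first checkpoint, and a signature that covers only the root. Signing is passed in as a callable; the HMAC pair in the demo is a stand-in so the example runs with the standard library alone, and it is not a substitute for a real operator-key signature.

```python
import hashlib
import hmac
import json

ZERO_ROOT = "00" * 32  # assumed prev_root of the very first checkpoint


def leaf_hash(entry):
    # Canonical JSON so that the same entry always hashes to the same leaf.
    data = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(b"\x00" + data).digest()


def merkle_root(entries):
    """Hex Merkle root over a batch of decision entries."""
    level = [leaf_hash(e) for e in entries]
    if not level:
        return hashlib.sha256(b"").hexdigest()
    while len(level) > 1:
        nxt = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            nxt.append(level[-1])  # odd node is promoted unchanged
        level = nxt
    return level[0].hex()


def seal(entries, prev_root, ts, sign):
    """Build a checkpoint marker with the fields named in the post."""
    root = merkle_root(entries)
    return {
        "ts": ts,
        "prev_root": prev_root,
        "root": root,
        "signature": sign(bytes.fromhex(root)),
        "entry_count": len(entries),
    }


def verify_chain(batches, verify_sig, start_root=ZERO_ROOT):
    """Walk (entries, marker) pairs in log order.

    Returns the index of the first checkpoint that fails a check,
    or None if the whole chain holds.
    """
    prev = start_root
    for i, (entries, cp) in enumerate(batches):
        ok = (
            cp["entry_count"] == len(entries)
            and cp["root"] == merkle_root(entries)
            and cp["prev_root"] == prev
            and verify_sig(bytes.fromhex(cp["root"]), cp["signature"])
        )
        if not ok:
            return i
        prev = cp["root"]
    return None


# Demo-only stand-in for the operator key: HMAC is NOT a public-key signature.
DEMO_KEY = b"demo-only"


def demo_sign(msg):
    return hmac.new(DEMO_KEY, msg, hashlib.sha256).hexdigest()


def demo_verify(msg, sig):
    return hmac.compare_digest(demo_sign(msg), sig)


def build_demo():
    b1 = [{"ts": 1, "decision": "allow", "signing_root": "0x01"}]
    b2 = [
        {"ts": 61, "decision": "refuse", "policy": "p", "reason": "r"},
        {"ts": 62, "decision": "allow", "signing_root": "0x02"},
    ]
    cp1 = seal(b1, ZERO_ROOT, 60, demo_sign)
    cp2 = seal(b2, cp1["root"], 120, demo_sign)
    return [(b1, cp1), (b2, cp2)]
```

Editing any entry in the first batch of `build_demo()` makes `verify_chain` return index 0, which is the "you know roughly when it happened" property: the failing checkpoint bounds the tampered window. The sketch also shows a limit worth keeping in mind: a walk like this detects edits and removals inside the chain, but entries truncated after the last checkpoint leave nothing to mismatch.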

nklave log verify --from 0 --to latest

The verify command does exactly this, end-to-end, and exits non-zero on any failure. It is intended to be runnable as a cron job, and we recommend wiring it up that way, with a non-zero exit code triggering your alerting.
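One way to wire the exit code into alerting is a thin wrapper that cron invokes. The sketch below assumes only what this post states about the command (the invocation shown above and the non-zero exit on failure); `run_check` and `main` are hypothetical names, the one-hour timeout is an arbitrary choice, and the alert itself is reduced to a line on stderr.

```python
import subprocess
import sys

# The invocation shown in this post.
VERIFY_CMD = ["nklave", "log", "verify", "--from", "0", "--to", "latest"]


def run_check(cmd=None, timeout=3600):
    """Run the verifier and return (ok, detail).

    A missing binary or a timeout counts as a failure: a check that
    cannot run should alert, not pass silently.
    """
    cmd = list(cmd) if cmd is not None else list(VERIFY_CMD)
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return False, f"command not found: {cmd[0]}"
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    if proc.returncode != 0:
        return False, f"exit {proc.returncode}: {proc.stderr.strip()}"
    return True, "chain verified"


def main():
    """Return a process exit code suitable for cron-based alerting."""
    ok, detail = run_check()
    print(detail, file=sys.stdout if ok else sys.stderr)
    return 0 if ok else 1
```

A script entry point would end with `sys.exit(main())`, so that the wrapper's own exit status mirrors the verifier's and whatever watches the cron job sees the failure.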

Operator key custody

The operator key is the keystone of the whole scheme: if an attacker can forge operator signatures, they can forge checkpoint markers, and a forged checkpoint chain validates cleanly. So the operator key needs to be no easier to compromise than a validator signing key.

nklave supports three operator-key backends, with the same trust hierarchy as the validator-key backends:

An Ed25519 keypair in a separate keystore file, password-protected. Adequate for dev and single-host setups where the threat model accepts host compromise.
A YubiHSM 2 slot. The operator key never leaves the device; nklave authenticates against an auth key and requests checkpoint-root signatures over an authenticated session with the device.
An AWS KMS key. nklave uses IAM-role credentials to call Sign; the key material never leaves AWS HSMs.

In all three cases, the operator key is separate from any validator signing key. They serve different security purposes: the validator key authorizes signatures the chain accepts; the operator key authorizes claims that the log was not tampered with. They should not share custody, and on a production deployment they should not share a host.

A reasonable production posture: operator key in a YubiHSM slot on the validator host, validator keys in a different slot (on the same or a separate YubiHSM) behind a different auth key. An attacker who pops the host can stop new checkpoints from being signed (denial-of-service) and cannot extract the operator key. They also cannot re-sign edited checkpoint markers unless they obtain the operator auth-key credential, because the YubiHSM will not produce a signature without it. The audit chain therefore survives a host compromise to the extent that the auth-key credential does.

What the log lets you do

The forensic queries we built nklave’s CLI around all share a shape: “filter the log down to a time window and a condition, dump the matching entries.”

Why was this signature refused?

nklave log query --validator 0xabc... --decision refuse --since 2026-05-01

Returns every refusal for that validator since the date, with policy and reason. Useful when the validator-client logs say “signing failed” and you want to know which rule blocked it.

Did we ever sign this contested attestation?

nklave log query --signing-root 0xdeadbeef...

If the log has an entry with decision: allow and that signing root, you signed it. If not, you didn’t, and a checkpoint chain that verifies cleanly shows the log was not edited to hide it.

What happened between 09:55 and 10:05?

nklave log query --since "2026-05-08 09:55" --until "2026-05-08 10:05"

Returns every decision in the window. Pair this with validator-client logs to reconstruct end-to-end behavior during an incident.
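The shared shape of these three queries (a time window plus a condition) is easy to reproduce against raw records when the CLI is not at hand. The sketch below is an assumption-laden stand-in, not the CLI's implementation: the function names are hypothetical, date strings are interpreted as UTC, both window bounds are inclusive, and checkpoint markers are recognized by their lack of a `decision` field.

```python
from datetime import datetime, timezone


def to_epoch(text):
    """Parse 'YYYY-MM-DD' or 'YYYY-MM-DD HH:MM' as UTC (an assumption)."""
    fmt = "%Y-%m-%d %H:%M" if " " in text else "%Y-%m-%d"
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def query(records, validator=None, decision=None, signing_root=None,
          since=None, until=None):
    """Filter decision entries by exact-match fields and a time window."""
    lo = to_epoch(since) if since else None
    hi = to_epoch(until) if until else None
    wanted = {
        "validator": validator,
        "decision": decision,
        "signing_root": signing_root,
    }
    matches = []
    for rec in records:
        if "decision" not in rec:
            continue  # checkpoint marker, not a decision entry
        if any(v is not None and rec.get(k) != v for k, v in wanted.items()):
            continue
        if lo is not None and rec["ts"] < lo:
            continue
        if hi is not None and rec["ts"] > hi:
            continue
        matches.append(rec)
    return matches


# The sample records from earlier in the post, plus one made-up marker.
RECORDS = [
    {"ts": 1742054400, "validator": "0xabc...", "type": "ATTESTATION",
     "decision": "allow", "signing_root": "0xdef..."},
    {"ts": 1742054401, "validator": "0xabc...", "type": "ATTESTATION",
     "decision": "refuse", "policy": "slashing-protection-attestation",
     "reason": "double vote at target_epoch=12345"},
    {"ts": 1742054460, "prev_root": "00", "root": "11",
     "signature": "22", "entry_count": 2},
]
```

For example, `query(RECORDS, decision="refuse", since="2025-03-15")` mirrors the first question above, and `query(RECORDS, signing_root="0xdef...")` mirrors the second. An empty result for a signing root is only meaningful evidence once the checkpoint chain over the same records has been verified.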

Is the log itself trustworthy?

nklave log verify --from 0 --to latest

Re-walks every checkpoint, recomputes every Merkle root, validates every signature against the registered operator pubkey. Exit code zero means the chain holds.

Retention and cold storage

The log does not grow forever in a single file. nklave rotates by both size (default 1 GB per file) and age (default 365 days). Older files are sealed, meaning they accept no further appends, and can be moved to cold storage.

The checkpoint chain is designed to survive that move. Every checkpoint marker contains both its own Merkle root and the prev_root from the previous checkpoint, regardless of whether the previous checkpoint is in the same file. As long as you keep the operator’s public key around and the chain is contiguous, you can verify a year-old log file in cold storage as easily as the current one.
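The contiguity condition across rotated files can be checked on its own, before any Merkle roots or signatures are recomputed. The following sketch assumes that checkpoint markers are the records carrying a `prev_root` field and that files are supplied oldest first; `checkpoints` and `find_gap` are hypothetical helper names.

```python
import json


def checkpoints(lines):
    """Yield checkpoint markers from NDJSON lines.

    Assumption: a record is a checkpoint marker if it has "prev_root".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if "prev_root" in rec:
            yield rec


def find_gap(files):
    """Check that prev_root links hold across file boundaries.

    files: list of (name, lines) pairs in rotation order, oldest first.
    Returns (name, index) of the first checkpoint whose prev_root does
    not match the root of the checkpoint before it, or None if the
    chain is contiguous.
    """
    prev = None
    for name, lines in files:
        for i, cp in enumerate(checkpoints(lines)):
            if prev is not None and cp["prev_root"] != prev:
                return name, i
            prev = cp["root"]
    return None
```

A gap reported at index 0 of a file usually points at a missing or misordered neighbour file, not at tampering inside that file. This check covers linkage only; the Merkle roots and operator signatures still need the full verification pass.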

The implication for compliance: nklave makes “show me every signing decision from Q3 2025, and prove the log was not edited” answerable as a CLI command. Because verification can start from any known-good checkpoint, the cost of producing that answer scales with the window being examined rather than with the total log volume behind it.

What the log is not

The decision log is not a replacement for monitoring, and it is not the right place to look for live operational problems.

For live signals — saturated HSM, surging refusal rate, stale checkpoint — the right place is Prometheus metrics:

nklave_signing_requests_total{type, decision} — counters by message type and outcome.
nklave_policy_refusals_total{policy} — refusals by policy name.
nklave_signing_latency_seconds — end-to-end histograms.
nklave_log_checkpoint_age_seconds — seconds since the last checkpoint. Alert above 600.
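The staleness threshold would normally live in a Prometheus alerting rule. For a quick manual check, the same comparison can be made against scraped metrics text, as in this sketch. It assumes the gauge is exposed without labels in the Prometheus text exposition format; `read_gauge` and `checkpoint_is_stale` are hypothetical names, and treating a missing metric as stale is a design choice of the sketch.

```python
def read_gauge(exposition, name):
    """Return the first sample value of an unlabelled metric, or None."""
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


def checkpoint_is_stale(exposition, threshold=600.0):
    """True if the last checkpoint is older than the threshold.

    A missing metric is treated as stale: no data is not good news.
    """
    age = read_gauge(exposition, "nklave_log_checkpoint_age_seconds")
    return age is None or age > threshold
```

With the default checkpoint interval of 60 seconds, a threshold of 600 corresponds to roughly ten missed checkpoints in a row.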

The decision log answers “what happened?” after the fact. Metrics answer “what is happening?” right now. Both ship with nklave. Wire up both.

Why this matters

Most validator operators have at some point been in the meeting where the question is “did we equivocate?” and the answer is “let me check the logs.” The check usually takes too long, the logs usually do not have what is needed, and the answer is usually “we think not.”

The append-only log makes that meeting different. The answer is “no, here is the log entry showing the signing root for the contested slot, and here is the checkpoint chain proving the log was not edited.” That answer takes minutes to assemble, not days. And it is the kind of answer that is good enough for the people who need to hear it.

The point of the log is not to make signing safer. The policy engine does that. The point of the log is to make the question of what happened answerable, exactly, after the fact. That is a different security property, and it is worth building for explicitly.