When you ask an LLM to write code, you’re usually handing it a natural language description and hoping the output matches what you meant. The result is often plausible-looking code that passes a quick skim but harbors subtle bugs, the kind that only surface in production, under adversarial inputs, at 2 AM. The industry has a word for this: slop. Code that looks right, compiles cleanly, and is complete nonsense. There’s a better way. Instead of describing what you want in English, describe it in math. Build the formal model first (the proofs, the invariants, the state machine) and then let the AI implement against that specification. The model becomes both the blueprint and the test suite, and the AI can’t reason its way past a model checker: either the property holds or it doesn’t. This post walks through a real project where we did exactly that, what we found, and why it matters for AI-assisted development.

The Setup

The project is a distributed file sharing platform (DFS) with authentication, file CRUD, storage quotas, permission sharing, and account lifecycle state machines. Before writing a single line of implementation, we built 32 CBMC proof harnesses across seven categories:
  • Authentication: login, logout, session integrity, password verification, unauthorized access prevention
  • Storage: per-user quotas, system-wide limits, rejection when full, accounting invariants
  • File operations: create, read, write, delete
  • Sharing: grant, revoke, read access semantics, write access semantics, owner semantics
  • Concurrency: login races, file races, quota races, session bounds
  • FSM properties: legal transitions, no illegal states, deadlock freedom, authentication waypoints
  • Invariants: storage sum consistency, permission consistency, session consistency
Each proof harness constructs a valid system state, performs operations, and asserts that specific properties hold. CBMC exhaustively explores all reachable states within the configured bounds, checking for memory safety, arithmetic overflow, pointer validity, and our custom behavioral assertions. If any assertion can fail on any path, CBMC finds it. The headers define the type contracts: the data structures, function signatures, and configuration constants. The proofs define the behavioral contracts: what each function must do, what it must reject, and what invariants it must preserve. Together, they form a machine-checkable specification that is completely unambiguous.
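As a concrete sketch, a harness following that shape might look like the code below. All types and names here are hypothetical stand-ins for the project's headers, which aren't shown in the post; under CBMC, the concrete input would be a nondeterministic value so that every possible input is explored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types standing in for the real headers. */
typedef enum { AUTH_NONE, AUTH_AUTHENTICATED } auth_state_t;
typedef struct { auth_state_t auth_state; uint64_t used_bytes; uint64_t quota_bytes; } user_t;

/* Behavioral contract: a charge must never push usage past the quota. */
int charge_storage(user_t *u, uint64_t additional) {
    if (additional > u->quota_bytes - u->used_bytes) return -1; /* reject when full */
    u->used_bytes += additional;
    return 0;
}

/* Harness skeleton: under CBMC, `additional` would be nondeterministic
 * (e.g. uint64_t additional = nondet_uint64();) so every value is checked.
 * Here we pin a concrete value so the sketch runs as plain C. */
void storage_quota_harness(void) {
    user_t u = { AUTH_AUTHENTICATED, 900, 1000 };
    uint64_t additional = 200;           /* CBMC: any uint64_t */
    int rc = charge_storage(&u, additional);
    /* Property: either the charge was rejected, or usage stays within quota. */
    assert(rc != 0 || u.used_bytes <= u.quota_bytes);
    assert(u.used_bytes <= u.quota_bytes);
}
```

The key point is that the assertion at the end is the behavioral contract; the implementation is free to satisfy it however it likes, but it cannot talk its way around it.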

Handing the Spec to Claude

With 32 proofs, four header files, and a proof infrastructure library (allocators, nondeterministic generators, helper functions), we handed Claude the entire specification and asked it to implement the seven source files. Claude read every proof harness, understood the behavioral contracts, and produced implementations that:
  • Used constant-time XOR accumulation for password comparison (because the proof checks for it)
  • Implemented lock arbitration on file writes (because the concurrency proofs assert it)
  • Charged storage deltas to the file owner, not the writing user (because the sharing proofs assert accounting on the owner’s tally)
  • Cascaded share deletion on file delete (because the invariant proofs check post-deletion consistency)
  • Rolled back authentication state on session creation failure (because the auth proofs test the failure path)
This isn’t Claude being clever. This is Claude reading a formal specification and doing what it says. The proofs didn’t leave room for interpretation. They defined exact error codes for exact scenarios, exact state transitions for exact events, exact arithmetic for exact storage accounting. All 32 proofs passed on the first complete implementation. Zero verification failures across 1,739 verification conditions.
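For instance, the constant-time comparison idiom the first bullet refers to generally looks like the following (illustrative names, not the project's actual API): every byte is examined regardless of where the first mismatch occurs, so execution time doesn't leak the mismatch position.

```c
#include <stddef.h>
#include <stdint.h>

/* Constant-time comparison via XOR accumulation. Unlike memcmp, there is
 * no early exit on the first differing byte. */
static int password_matches(const uint8_t *a, const uint8_t *b, size_t len) {
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++) {
        diff |= (uint8_t)(a[i] ^ b[i]);   /* accumulate differences */
    }
    return diff == 0;                     /* true only if all bytes equal */
}
```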

The Interesting Part: What the Proofs Missed

Here’s where the story gets instructive. We ran a systematic audit of the implementation against the invariant predicates, not asking “does the code pass the proofs?” but asking “does the code preserve the invariants that the proofs claim to verify?” We found three bugs.

Bug 1: DFS_PERM_NONE grants were accepted. The share grant function rejected DFS_PERM_OWNER (you can’t grant ownership through shares) but had no guard against DFS_PERM_NONE. An owner could create a share entry with permission NONE, which is semantically meaningless and violates the system’s own permission invariant: “every share entry has a permission that is READ or WRITE.”

Bug 2: Non-owners could grant shares to the file owner. The self-share check tested target == caller, not target == file_owner. If user B had a reshare-enabled share on a file owned by user A, B could grant a share to A. This creates a redundant, contradictory entry (the owner already has implicit OWNER permission) and violates the invariant: “no share entry references the file’s owner.”

Bug 3: Missing system-level overflow guard. The storage quota checker had an overflow guard for per-user arithmetic (used + additional < used detects uint64 wraparound) but no equivalent guard for the system-wide total. The per-user guard was there because someone thought about overflow. The system-wide guard was missing because no one thought about it twice.
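To make bug 3 concrete, here is a sketch of the asymmetry (function and parameter names are hypothetical; the post doesn't show the real code). Because unsigned arithmetic wraps modulo 2^64, the `used + additional < used` pattern is a valid overflow check, but it was only applied on one side.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the quota checker with the asymmetric guards of bug 3.
 * Returns 1 if the charge would be accepted, 0 if rejected. */
static int quota_check_buggy(uint64_t user_used, uint64_t user_quota,
                             uint64_t sys_used, uint64_t sys_limit,
                             uint64_t additional) {
    if (user_used + additional < user_used) return 0;  /* per-user wraparound guard */
    if (user_used + additional > user_quota) return 0;
    /* BUG: no equivalent wraparound guard for the system-wide total.
     * If sys_used is near UINT64_MAX, the sum wraps to a small value
     * and the limit check below passes. */
    if (sys_used + additional > sys_limit) return 0;
    return 1;
}
```

With `sys_used = UINT64_MAX - 1` and `additional = 10`, the system-wide sum wraps to 8, sails under any limit, and the charge is accepted.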

Why the Proofs Didn’t Catch Them

This is the critical lesson. The proofs were correct: they verified exactly what they claimed. But there was a gap between what the operation proofs tested and what the invariant proofs tested. That gap is on me for moving quickly, but even so, this was far higher-quality output than you usually get from LLM code, and it was easy to understand. That, more than anything else, is what matters. The operation proofs (e.g., share_grant_permission_harness) called dfs_share_grant and checked return codes: does granting READ return OK? Does granting without reshare return CANNOT_RESHARE? They verified the function’s interface contract. The invariant proofs (e.g., invariant_permission_consistency_harness) called operations and then asserted dfs_permission_invariant_holds(). But they constrained their inputs too tightly: permissions were bounded to [READ, WRITE], excluding NONE, and only the owner granted shares, excluding the non-owner-grants-to-owner path. The operation proofs never checked invariant preservation. The invariant proofs never explored adversarial inputs. The bugs lived in the gap between them.
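The over-constraining can be illustrated like this. DFS_ASSUME is the post's macro, presumably a wrapper around CBMC's `__CPROVER_assume`; outside CBMC we can model an assumption as an early return, because inputs that violate it are simply never explored. Everything else here is an illustrative stand-in.

```c
#include <assert.h>

typedef enum { PERM_NONE, PERM_READ, PERM_WRITE, PERM_OWNER } perm_t;

/* Plain-C model of an assumption: excluded inputs never reach the code below. */
#define DFS_ASSUME(cond) do { if (!(cond)) return; } while (0)

static int none_was_explored = 0;

static void invariant_permission_consistency_harness(perm_t perm) {
    /* Too tight: PERM_NONE and PERM_OWNER never get past this line,
     * so the invariant is never checked against them. */
    DFS_ASSUME(perm == PERM_READ || perm == PERM_WRITE);
    if (perm == PERM_NONE) none_was_explored = 1;  /* unreachable under the assume */
}
```

The harness is sound for the inputs it admits; the bug lives entirely in the inputs it doesn't.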

The Fix

We fixed the code:
  • Added permission != READ && permission != WRITE rejection (NONE returns INVALID_ARGUMENT, OWNER returns CANNOT_GRANT_OWNER)
  • Added target_user_id == file->owner_id rejection (returns SELF_SHARE)
  • Added symmetric overflow guard for system-wide storage
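The first two fixes combine into a single validation guard that might look like the following; the enum and error-code names are guesses at the project's conventions, not its actual API.

```c
#include <stdint.h>

/* Hypothetical enums/codes mirroring the fixes described above. */
typedef enum { DFS_PERM_NONE, DFS_PERM_READ, DFS_PERM_WRITE, DFS_PERM_OWNER } dfs_perm_t;
enum { DFS_OK = 0, DFS_ERR_INVALID_ARGUMENT, DFS_ERR_CANNOT_GRANT_OWNER, DFS_ERR_SELF_SHARE };

static int validate_grant(dfs_perm_t perm, uint32_t target_user_id,
                          uint32_t caller_id, uint32_t file_owner_id) {
    if (perm == DFS_PERM_OWNER) return DFS_ERR_CANNOT_GRANT_OWNER;  /* can't grant ownership */
    if (perm != DFS_PERM_READ && perm != DFS_PERM_WRITE)
        return DFS_ERR_INVALID_ARGUMENT;                            /* rejects NONE (bug 1) */
    if (target_user_id == caller_id) return DFS_ERR_SELF_SHARE;
    if (target_user_id == file_owner_id) return DFS_ERR_SELF_SHARE; /* owner target (bug 2) */
    return DFS_OK;
}
```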
Then we strengthened the proofs:
  • Added dfs_permission_invariant_holds() assertions after every grant operation in the operation proofs
  • Removed the DFS_ASSUME that constrained permissions to [READ, WRITE]; CBMC now explores all enum values, including NONE and OWNER
  • Added a third user with reshare capability so CBMC can explore the grant-to-owner path
  • Added explicit test cases for both rejected paths
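In plain-C miniature, the strengthened harness idea, exploring every enum value and asserting the invariant after every grant, looks like the sketch below. Under CBMC the loop would be a single nondeterministic choice; all names and state are illustrative.

```c
#include <assert.h>

typedef enum { PERM_NONE, PERM_READ, PERM_WRITE, PERM_OWNER } perm_t;

static int share_count = 0;
static perm_t share_perms[8];

/* Grant with the fixed guard: only READ or WRITE is recorded. */
static int grant(perm_t p) {
    if (p != PERM_READ && p != PERM_WRITE) return -1;
    share_perms[share_count++] = p;
    return 0;
}

/* Invariant: every recorded share is READ or WRITE. */
static int permission_invariant_holds(void) {
    for (int i = 0; i < share_count; i++)
        if (share_perms[i] != PERM_READ && share_perms[i] != PERM_WRITE) return 0;
    return 1;
}

static void strengthened_grant_harness(void) {
    for (int p = PERM_NONE; p <= PERM_OWNER; p++) {  /* CBMC: nondet perm, no assume */
        (void)grant((perm_t)p);
        assert(permission_invariant_holds());        /* asserted after every grant */
    }
}
```

Had the guard still accepted PERM_NONE, the assertion would fire on the first iteration, which is exactly the regression protection the post describes.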
After the fixes, all 32 proofs pass: 0 of 1,752 verification conditions failed. And now, if anyone reintroduces any of these bugs, CBMC will catch it.

What This Teaches Us About AI-Assisted Development

Formal models compress ambiguity to zero. When you tell an LLM “implement authentication,” you get whatever the LLM thinks authentication means. When you hand it a proof that says “after a successful login, user.auth_state == AUTHENTICATED and user.active_session_id == session.id and session.state == ACTIVE and dfs_session_invariant_holds(sys),” there’s exactly one correct behavior. The model eliminates the class of bugs that come from misunderstood requirements.

Proofs are better test suites, but they’re not magic. A proof verifies what it asserts, nothing more. If you don’t assert invariant preservation after an operation, the proof won’t check it. The bugs we found weren’t in the model checker; they were in the proof harnesses. Coverage gaps in proofs are just as real as coverage gaps in tests, and they’re more dangerous because proofs carry an aura of completeness.

The feedback loop matters. The most valuable moment wasn’t the initial implementation; it was the audit. Having the formal specification made the audit tractable: we could mechanically ask “does this function preserve this predicate for all inputs?” instead of guessing at edge cases. And when we found bugs, we could fix both the code and the specification, closing the gap permanently.

AI is better at implementation than specification. Claude produced a correct implementation from a formal spec on the first try. But when we asked “are there bugs the proofs should have caught?”, the interesting findings required reasoning about proof coverage: what inputs the harnesses didn’t explore, what assertions they didn’t make. Writing good specifications requires adversarial thinking about your own assumptions. Writing implementations from good specifications is mechanical. Play to each tool’s strengths.

The Workflow

Here’s the workflow I followed. This was purely a toy project; next I’ll be working on something that could be considered “production capable” (not necessarily “production ready”).
  1. Write the types first. Data structures, enums, configuration constants. These are your vocabulary.
  2. Write the invariants. What must always be true about your system state? Write predicates that check it.
  3. Write the proofs. For each operation, construct a valid state, call the operation, and assert the postconditions including invariant preservation.
  4. Hand everything to Claude. The headers, the proofs, the proof infrastructure. Ask it to implement.
  5. Verify. Run the model checker. Fix failures.
  6. Audit the proofs. Ask: what inputs didn’t we explore? What invariants didn’t we check after which operations? Strengthen the harnesses.
  7. Verify again. This is where you find the real bugs, not in the implementation, but in the gap between what you proved and what you thought you proved.
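For step 2, an invariant predicate is just a boolean function over system state that every harness can call. A minimal sketch, with a hypothetical state layout:

```c
#include <stdint.h>

/* Hypothetical system state: fixed user slots plus a running total. */
typedef struct {
    uint64_t used[4];      /* per-user storage accounting */
    uint64_t total_used;   /* system-wide tally */
    int user_count;
} sys_t;

/* Invariant: the system-wide total equals the sum of per-user usage.
 * Assert this after every operation that touches storage. */
static int storage_sum_invariant_holds(const sys_t *s) {
    uint64_t sum = 0;
    for (int i = 0; i < s->user_count; i++) sum += s->used[i];
    return sum == s->total_used;
}
```

Predicates like this are what make step 6 mechanical: the audit question becomes “which operations are not followed by this call?”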
The formal model doesn’t just improve the AI’s code. It improves your understanding of what correct means. And that understanding is the thing that actually matters.