The Setup
The project is a distributed file sharing platform (DFS) with authentication, file CRUD, storage quotas, permission sharing, and account lifecycle state machines. Before writing a single line of implementation, we built 32 CBMC proof harnesses across seven categories:
- Authentication: login, logout, session integrity, password verification, unauthorized access prevention
- Storage: per-user quotas, system-wide limits, rejection when full, accounting invariants
- File operations: create, read, write, delete
- Sharing: grant, revoke, read access semantics, write access semantics, owner semantics
- Concurrency: login races, file races, quota races, session bounds
- FSM properties: legal transitions, no illegal states, deadlock freedom, authentication waypoints
- Invariants: storage sum consistency, permission consistency, session consistency
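To make the shape of these harnesses concrete, here is a minimal sketch of what one looks like. Everything in it is illustrative, not the project's actual API: under CBMC, `nondet_u64()` yields every possible value and each assertion is checked against all of them, while the stub here just lets the file compile and run as plain C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of a harness's shape, not the project's real API.
 * Under CBMC, nondet_u64() returns every possible value and each assert
 * is checked against all of them; the stub below just lets this compile
 * and run as plain C. CBMC would use harness() as the entry point. */
static uint64_t nondet_u64(void) { return 123; }  /* CBMC stand-in */

/* Toy operation under proof: charge bytes against a quota. */
static int quota_charge(uint64_t *used, uint64_t bytes, uint64_t limit) {
    if (bytes > limit - *used) return -1;  /* reject when over quota */
    *used += bytes;
    return 0;
}

void harness(void) {
    uint64_t limit = 1000, used = 0;
    uint64_t bytes = nondet_u64();  /* adversarial, unconstrained input */
    int rc = quota_charge(&used, bytes, limit);
    /* postconditions asserted for every possible input */
    if (rc == 0) assert(used <= limit);
    else assert(used == 0);  /* a rejected call must not change state */
}
```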
Handing the Spec to Claude
With 32 proofs, four header files, and a proof infrastructure library (allocators, nondeterministic generators, helper functions), we handed Claude the entire specification and asked it to implement the seven source files. Claude read every proof harness, understood the behavioral contracts, and produced implementations that:
- Used constant-time XOR accumulation for password comparison (because the proof checks for it)
- Implemented lock arbitration on file writes (because the concurrency proofs assert it)
- Charged storage deltas to the file owner, not the writing user (because the sharing proofs assert accounting on the owner’s tally)
- Cascaded share deletion on file delete (because the invariant proofs check post-deletion consistency)
- Rolled back authentication state on session creation failure (because the auth proofs test the failure path)
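The first item above, constant-time XOR accumulation, can be sketched like this (the function name and signature are mine, not the project's). The point is that the loop does the same amount of work regardless of where the first mismatch occurs, so timing leaks nothing about the secret:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Sketch of constant-time comparison via XOR accumulation; the name and
 * signature are illustrative, not the project's real dfs_* API. */
static bool password_equal_ct(const uint8_t *a, const uint8_t *b, size_t len) {
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++) {
        diff |= (uint8_t)(a[i] ^ b[i]);  /* accumulate every difference */
    }
    return diff == 0;  /* zero only if all bytes matched */
}
```

An early-exit `memcmp`-style loop would return as soon as bytes differ, which is exactly the timing side channel this style avoids.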
The Interesting Part: What the Proofs Missed
Here’s where the story gets instructive. We ran a systematic audit of the implementation against the invariant predicates, not asking “does the code pass the proofs?” but asking “does the code preserve the invariants that the proofs claim to verify?” We found three bugs.

Bug 1: DFS_PERM_NONE grants were accepted. The share grant function rejected DFS_PERM_OWNER (you can’t grant ownership through shares) but had no guard against DFS_PERM_NONE. An owner could create a share entry with permission NONE, which is semantically meaningless and violates the system’s own permission invariant: “every share entry has a permission that is READ or WRITE.”
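A minimal sketch of the missing guard, with enum names assumed to mirror the text. The buggy version deny-listed only OWNER; an allow-list enforces the invariant directly:

```c
#include <stdbool.h>

/* Illustrative permission enum mirroring the names used in the text. */
typedef enum {
    DFS_PERM_NONE,
    DFS_PERM_READ,
    DFS_PERM_WRITE,
    DFS_PERM_OWNER
} dfs_perm_t;

/* The buggy check deny-listed only OWNER, so NONE slipped through:
 *     if (perm == DFS_PERM_OWNER) return error;
 * An allow-list enforces "every share is READ or WRITE" by construction. */
static bool grant_permission_valid(dfs_perm_t perm) {
    return perm == DFS_PERM_READ || perm == DFS_PERM_WRITE;
}
```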
Bug 2: Non-owners could grant shares to the file owner. The self-share check tested target == caller, not target == file_owner. If user B had a reshare-enabled share on a file owned by user A, B could grant a share to A. This creates a redundant, contradictory entry (owner already has implicit OWNER permission) and violates the invariant: “no share entry references the file’s owner.”
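The fix is a one-token change, but the semantics differ completely. Sketched with assumed field and parameter names:

```c
#include <stdint.h>
#include <stdbool.h>

/* Field and parameter names are assumptions based on the text. */
typedef struct { uint32_t owner_id; } dfs_file_t;

/* The buggy check compared target_user_id == caller_id, which only blocks
 * a caller sharing with themselves. The invariant "no share entry
 * references the file's owner" requires checking against the owner. */
static bool share_target_allowed(const dfs_file_t *file, uint32_t target_user_id) {
    return target_user_id != file->owner_id;
}
```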
Bug 3: Missing system-level overflow guard. The storage quota checker had an overflow guard for per-user arithmetic (used + additional < used detects uint64 wraparound) but no equivalent guard for the system-wide total. The per-user guard was there because someone thought about overflow. The system-wide guard was missing because someone didn’t think about it twice.
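The guard itself is one comparison; the bug was applying it at only one of the two accounting levels. A sketch with illustrative names:

```c
#include <stdint.h>
#include <stdbool.h>

/* used + additional < used is true exactly when uint64 addition wraps. */
static bool fits_within(uint64_t used, uint64_t additional, uint64_t limit) {
    if (used + additional < used) return false;  /* wraparound detected */
    return used + additional <= limit;
}

/* The same guard must run at both accounting levels; the original code
 * had only the per-user half. */
static bool quota_check(uint64_t user_used, uint64_t system_used,
                        uint64_t additional,
                        uint64_t user_quota, uint64_t system_limit) {
    return fits_within(user_used, additional, user_quota) &&
           fits_within(system_used, additional, system_limit);  /* missing half */
}
```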
Why the Proofs Didn’t Catch Them
This is the critical lesson. The proofs were correct: they verified exactly what they claimed. But there was a gap between what the operation proofs tested and what the invariant proofs tested. That gap is on me for moving quickly, but the output was still far better quality than you usually get from LLM code, and it was easy to understand. That more than anything else is critical.

The operation proofs (e.g., share_grant_permission_harness) called dfs_share_grant and checked return codes: “does granting READ return OK? Does granting without reshare return CANNOT_RESHARE?” They verified the function’s interface contract.
The invariant proofs (e.g., invariant_permission_consistency_harness) called operations and then asserted dfs_permission_invariant_holds(). But they constrained their inputs too tightly: permissions were bounded to [READ, WRITE], excluding NONE. Only the owner granted shares, excluding the non-owner-grants-to-owner path.
The operation proofs never checked invariant preservation. The invariant proofs never explored adversarial inputs. The bugs lived in the gap between them.
The Fix
We fixed the code:
- Added permission != READ && permission != WRITE rejection (NONE returns INVALID_ARGUMENT, OWNER returns CANNOT_GRANT_OWNER)
- Added target_user_id == file->owner_id rejection (returns SELF_SHARE)
- Added a symmetric overflow guard for system-wide storage

And we strengthened the proofs:
- Added dfs_permission_invariant_holds() assertions after every grant operation in the operation proofs
- Removed the DFS_ASSUME that constrained permissions to [READ, WRITE]; now CBMC explores all enum values including NONE and OWNER
- Added a third user with reshare capability so CBMC can explore the grant-to-owner path
- Added explicit test cases for both rejected paths
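Put together, the strengthened operation proof looks roughly like this. All names are assumptions based on the text; the nondet stub stands in for CBMC's nondeterministic choice, and the invariant stub for the real predicate:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { PERM_NONE, PERM_READ, PERM_WRITE, PERM_OWNER } perm_t;

enum { DFS_OK = 0, DFS_INVALID_ARGUMENT = -1, DFS_CANNOT_GRANT_OWNER = -2 };

static perm_t nondet_perm(void) { return PERM_NONE; }  /* CBMC stand-in */
static bool dfs_permission_invariant_holds(void) { return true; }  /* stub */

/* Grant with the new rejections from the fix list above. */
static int dfs_share_grant(perm_t perm) {
    if (perm == PERM_NONE)  return DFS_INVALID_ARGUMENT;
    if (perm == PERM_OWNER) return DFS_CANNOT_GRANT_OWNER;
    return DFS_OK;
}

void harness(void) {
    perm_t perm = nondet_perm();  /* no DFS_ASSUME bounding to [READ, WRITE] */
    int rc = dfs_share_grant(perm);
    if (perm == PERM_NONE)  assert(rc == DFS_INVALID_ARGUMENT);
    if (perm == PERM_OWNER) assert(rc == DFS_CANNOT_GRANT_OWNER);
    assert(dfs_permission_invariant_holds());  /* now asserted after the grant */
}
```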
What This Teaches Us About AI-Assisted Development
Formal models compress ambiguity to zero. When you tell an LLM “implement authentication,” you get whatever the LLM thinks authentication means. When you hand it a proof that says “after a successful login, user.auth_state == AUTHENTICATED and user.active_session_id == session.id and session.state == ACTIVE and dfs_session_invariant_holds(sys),” there’s exactly one correct behavior. The model eliminates the class of bugs that come from misunderstood requirements.
Proofs are better test suites, but they’re not magic. A proof verifies what it asserts, nothing more. If you don’t assert invariant preservation after an operation, the proof won’t check it. The bugs we found weren’t in the model checker, they were in the proof harnesses. Coverage gaps in proofs are just as real as coverage gaps in tests, but they’re more dangerous because proofs carry an aura of completeness.
The feedback loop matters. The most valuable moment wasn’t the initial implementation, it was the audit. Having the formal specification made the audit tractable: we could mechanically ask “does this function preserve this predicate for all inputs?” instead of guessing at edge cases. And when we found bugs, we could fix both the code and the specification, closing the gap permanently.
AI is better at implementation than specification. Claude produced a correct implementation from a formal spec on the first try. But when we asked “are there bugs the proofs should have caught?”, the interesting findings required reasoning about proof coverage: what inputs the harnesses didn’t explore, what assertions they didn’t make. Writing good specifications requires adversarial thinking about your own assumptions. Writing implementations from good specifications is mechanical. Play to each tool’s strengths.
The Workflow
Here’s the workflow I followed. This was purely a toy project; next I’ll be working on something that could be considered “production capable” (not necessarily “production ready”).
- Write the types first. Data structures, enums, configuration constants. These are your vocabulary.
- Write the invariants. What must always be true about your system state? Write predicates that check it.
- Write the proofs. For each operation, construct a valid state, call the operation, and assert the postconditions including invariant preservation.
- Hand everything to Claude. The headers, the proofs, the proof infrastructure. Ask it to implement.
- Verify. Run the model checker. Fix failures.
- Audit the proofs. Ask: what inputs didn’t we explore? What invariants didn’t we check after which operations? Strengthen the harnesses.
- Verify again. This is where you find the real bugs, not in the implementation, but in the gap between what you proved and what you thought you proved.