<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://zfhuang99.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://zfhuang99.github.io/" rel="alternate" type="text/html" /><updated>2025-12-02T18:11:01+00:00</updated><id>https://zfhuang99.github.io/feed.xml</id><title type="html">Cheng Huang’s corner</title><subtitle>a corner of learning and sharing</subtitle><entry><title type="html">Learnings from 100K Lines of Rust with AI</title><link href="https://zfhuang99.github.io/rust/claude%20code/codex/contracts/spec-driven%20development/2025/12/01/rust-with-ai.html" rel="alternate" type="text/html" title="Learnings from 100K Lines of Rust with AI" /><published>2025-12-01T00:00:00+00:00</published><updated>2025-12-01T00:00:00+00:00</updated><id>https://zfhuang99.github.io/rust/claude%20code/codex/contracts/spec-driven%20development/2025/12/01/rust-with-ai</id><content type="html" xml:base="https://zfhuang99.github.io/rust/claude%20code/codex/contracts/spec-driven%20development/2025/12/01/rust-with-ai.html"><![CDATA[<style>
pre code { white-space: pre-wrap; word-break: break-word; }
</style>

<p>In the past few months, I’ve been stress-testing how far AI coding agents can take us when building real, production-grade distributed systems.</p>

<p>The result: a Rust-based multi-Paxos consensus engine that not only implements all the features of Azure’s Replicated State Library (RSL) [<a href="https://github.com/Azure/RSL">1</a>] — which underpins most major Azure services — but also modernizes it for today’s hardware.</p>

<p>The entire project took me ~3 months, with 100K lines of Rust code written in ~4 weeks and throughput optimized from ~23K ops/sec to ~300K ops/sec in ~3 weeks.</p>

<p>Beyond the unprecedented productivity, I discovered several techniques that proved instrumental. This post shares my most valuable learnings: ensuring correctness with code contracts, applying lightweight spec-driven development, and pursuing aggressive performance optimization — plus my wish list for the future of AI-assisted coding.</p>

<h2 id="why-modernize-rsl">Why Modernize RSL?</h2>

<p>Azure’s RSL implements the <em>multi-Paxos</em> consensus protocol and forms the backbone of replication in many Azure services. However, RSL was written more than a decade ago. While robust, it hasn’t evolved to match modern hardware and workloads.</p>

<p>Three key gaps motivated this project:</p>

<ol>
  <li><strong>No pipelining:</strong> When a vote is in flight, new requests must wait, inflating latency.</li>
  <li><strong>No NVM support:</strong> Non-volatile memory is now common in Azure datacenters and can drastically reduce commit time.</li>
  <li><strong>Limited hardware awareness:</strong> RSL wasn’t built to leverage RDMA, which is now pervasive in Azure datacenters.</li>
</ol>

<p>Removing these limitations could unlock significantly lower latency and higher throughput — critical for modern cloud workloads and AI-driven services.</p>

<p>Given my interest in Rust and AI-accelerated development, I set out to build a modern RSL equivalent from scratch.</p>

<h2 id="massive-productivity-boost">Massive Productivity Boost</h2>

<p>In roughly six weeks, I drove AI agents to implement over 130K lines of Rust code covering the full feature set of RSL, including multi-Paxos, leader election, log replication, snapshotting, and configuration changes.</p>

<p>I utilized many available AI coding agents: GitHub Copilot, Claude Code, Codex, Augment Code, Kiro, and Trae. My workflow evolved quickly, but today my main drivers are <strong>Claude Code</strong> and <strong>Codex CLI</strong>, with VS Code handling diffs and minor edits.</p>

<p>I’ve found that coding from the CLI creates a perfect asynchronous flow that maximizes my productivity. I also discovered a simple psychological trick:</p>

<blockquote>
  <p>I pay $100/month for Anthropic’s max plan. This became a forcing function — if I don’t kick off a coding task with Claude before bed, I feel like I’m wasting money.</p>
</blockquote>

<p>When Codex CLI arrived, I added a second ChatGPT Plus subscription to handle rate limits — one subscription for Monday–Wednesday, the other for Thursday–Sunday.</p>

<h2 id="code-contracts--by-ai-for-ai">Code Contracts — By AI, For AI</h2>

<p>The question I get most often is: <em>How can AI possibly implement something as complex as Paxos correctly?</em></p>

<p>Testing is the first layer of defense. My system now includes 1,300+ tests — from unit tests to minimal integration tests (e.g., proposer + acceptor only), all the way to multi-replica full integration tests with injected failures. See the <a href="#appendix-project-status">project status</a>.</p>

<p>But the real breakthrough came from AI-driven <strong>code contracts</strong>.</p>

<p>Code contracts specify <em>preconditions</em>, <em>postconditions</em>, and <em>invariants</em> for critical functions. These contracts are converted into runtime asserts during testing but can be disabled in production builds for performance. While I started using this approach long ago with .NET [<a href="https://learn.microsoft.com/en-us/dotnet/framework/debug-trace-profile/code-contracts">2</a>], AI has made contracts vastly more powerful.</p>
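<p>To make this concrete, here is a minimal sketch of what a contract-style function can look like in Rust — purely illustrative, assuming a plain <code class="language-plaintext highlighter-rouge">debug_assert!</code> approach rather than the project’s actual contract machinery (the struct and its fields are hypothetical):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Hypothetical acceptor state, for illustration only.
struct Acceptor {
    promised_ballot: u64,
    accepted_ballot: u64,
}

impl Acceptor {
    /// Sketch of a phase-2a handler with contract-style asserts.
    /// The asserts fire in test/debug builds and compile away in release.
    fn process_2a(&amp;mut self, ballot: u64) -&gt; bool {
        // Precondition: state is consistent before we touch it.
        debug_assert!(self.accepted_ballot &lt;= self.promised_ballot);

        if ballot &lt; self.promised_ballot {
            return false; // reject a stale proposal
        }
        self.promised_ballot = ballot;
        self.accepted_ballot = ballot;

        // Postcondition: we never accept below what we promised.
        debug_assert!(self.accepted_ballot &lt;= self.promised_ballot);
        // Postcondition: the promised ballot never moves backwards.
        debug_assert!(self.promised_ballot &gt;= ballot);
        true
    }
}
</code></pre></div></div>

<p>In test builds a violated assert panics immediately; in release builds the checks disappear, which matches the “runtime asserts during testing, disabled in production” pattern above.</p>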

<p>Here’s how I apply them at three levels:</p>

<p><strong>1. Ask AI to write contracts.</strong> Opus 4.1 writes good contracts, but GPT-5 High writes excellent ones. I focus on reviewing and refining. For example, the <code class="language-plaintext highlighter-rouge">process_2a</code> method (handling phase 2a messages in Paxos) has <strong>16 contracts</strong>, including this one:</p>

<p align="center">
  <img src="/assets/images/rsml/contract_7_new.png" width="100%" />
</p>

<p><strong>2. Generate tests from contracts.</strong> Once contracts are defined, I ask AI to create targeted test cases for each postcondition. It excels at this, generating meaningful edge cases automatically.</p>

<p><strong>3. Property-based tests for contracts.</strong> This is my favorite. AI translates contracts into property-based tests, exploring a vast space of randomized inputs. Any contract violation triggers a panic, exposing deep bugs early.</p>
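<p>As an illustration of the idea (not the project’s actual tests), a property-based test might look like the following sketch, assuming the <code class="language-plaintext highlighter-rouge">proptest</code> crate and a hypothetical contract-checked helper:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use proptest::prelude::*;

/// Hypothetical helper with a simple contract: the ballot must never
/// move backwards. A violated contract panics, and proptest reports a
/// minimized counterexample.
fn advance_ballot(current: u64, proposed: u64) -&gt; u64 {
    let next = current.max(proposed);
    // Postcondition-style contract, checked at runtime.
    assert!(next &gt;= current, "contract violated: ballot went backwards");
    next
}

proptest! {
    // Explore a large space of randomized inputs; any contract
    // violation fails the test with a shrunken counterexample.
    #[test]
    fn ballot_never_regresses(current in 0u64..10_000, proposed in 0u64..10_000) {
        let next = advance_ballot(current, proposed);
        prop_assert!(next &gt;= current);
        prop_assert!(next &gt;= proposed);
    }
}
</code></pre></div></div>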

<p>For instance, one AI-generated contract found a subtle Paxos safety violation:</p>

<p align="center">
  <img src="/assets/images/rsml/contract_catch_bug_new.png" width="100%" />
</p>

<p>That single contract caught what could have been a serious replication consistency issue — well before it ever hit production.</p>

<h2 id="lightweight-spec-driven-development">Lightweight Spec-Driven Development</h2>

<p>I’ve tried various Spec-Driven Development (SDD) tools. In fact, the earlier components (such as leader election, proposer, acceptor, and learner) were all implemented following a rigid SDD approach. I would start with a requirements markdown, turn it into a design markdown, and then into a task list markdown. However, I gradually found this process too rigid; making changes along the way and ensuring all documents remained consistent became a headache.</p>

<p>I’ve now switched to a more lightweight approach. When I work on a feature (e.g., snapshotting), I use <code class="language-plaintext highlighter-rouge">/specify</code> from <strong>spec kit</strong> [<a href="https://github.com/github/spec-kit">3</a>] to generate a spec markdown. This spec consists of a few user stories and acceptance criteria.</p>

<p>Here is an example user story for snapshotting:</p>

<p align="center">
  <img src="/assets/images/rsml/user_story.png" width="100%" />
</p>

<p>I then use <code class="language-plaintext highlighter-rouge">/clarify</code> to ask the AI to self-critique and improve the user stories and criteria. I also ask it to suggest additional user stories not covered in the initial spec. I spend the majority of my time here.</p>

<p>Once satisfied, I jump into <strong>plan mode</strong> and ask AI to generate a plan for a specific user story. Given the capability of today’s AI coding agents, a single user story feels like the “sweet spot” unit of work they can manage effectively. Along the way, we may discover additions or tweaks, which are easy to handle in the same coding session (agents might compress context, but I typically don’t worry about losing too much).</p>

<p>Here is a sample <code class="language-plaintext highlighter-rouge">/clarify</code> interaction for configuration changes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>● Question 4: Starting Slot Determination

  How should the starting slot for the new configuration be determined relative to the ending slot?

  Recommended: Option A - Always exactly ending_slot + 1

  This ensures no gaps in the slot sequence and maintains continuity between configurations, preventing any slots from being orphaned or unassigned.

  | Option | Description                                                    |
  |--------|----------------------------------------------------------------|
  | A      | Always exactly ending_slot + 1                                 |
  | B      | Allow operator to specify any value &gt; ending_slot              |
  | C      | Next multiple of 1000 after ending_slot for cleaner boundaries |
  | D      | Same as ending_slot (both configs share final slot)            |

  You can reply with the option letter (e.g., "A"), accept the recommendation by saying "yes" or "recommended", or provide your own short answer.
</code></pre></div></div>

<h2 id="aggressive-performance-optimization">Aggressive Performance Optimization</h2>

<p>Performance optimization is where AI really shines. After ensuring initial correctness, I spent about three weeks purely on throughput tuning — and AI became my co-pilot in performance engineering.</p>

<p>Through iterative cycles, we boosted throughput from ~23K ops/sec to ~300K ops/sec on a single laptop.
Here’s the loop I followed repeatedly:</p>

<ol>
  <li>Ask AI to instrument latency metrics across all code paths (see the sketch after this list).</li>
  <li>Run performance tests and output trace logs.</li>
  <li>Let AI analyze latency breakdowns (it writes Python scripts to calculate quantiles and identify bottlenecks).</li>
  <li>Ask AI to propose optimizations, implement one, re-measure, and repeat.</li>
</ol>
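<p>For step 1, the instrumentation itself can stay very simple. Below is a minimal sketch using only the Rust standard library — a stand-in for the project’s actual metrics layer, with hypothetical names:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::time::Instant;

/// Minimal latency recorder: collects microsecond samples and reports
/// simple quantiles. A real system would use a proper histogram, but
/// this is enough to spot a bottleneck in a trace log.
#[derive(Default)]
struct LatencyRecorder {
    samples_us: Vec&lt;u64&gt;,
}

impl LatencyRecorder {
    fn record(&amp;mut self, start: Instant) {
        self.samples_us.push(start.elapsed().as_micros() as u64);
    }

    fn quantile(&amp;mut self, q: f64) -&gt; u64 {
        self.samples_us.sort_unstable();
        let idx = ((self.samples_us.len() - 1) as f64 * q).round() as usize;
        self.samples_us[idx]
    }
}

fn main() {
    let mut append_path = LatencyRecorder::default();
    for _ in 0..1_000 {
        let start = Instant::now();
        // ... code path under measurement, e.g. append + replicate ...
        append_path.record(start);
    }
    println!("p50={}us p99={}us",
        append_path.quantile(0.50), append_path.quantile(0.99));
}
</code></pre></div></div>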

<p>This process surfaced insights I might have missed — for example, lock contention on async paths, redundant memory copies, and unnecessary task spawns.</p>

<p>Rust’s safety model made it easy to push these optimizations confidently. Key gains came from minimizing allocations, applying zero-copy techniques, avoiding locks, and selectively removing async overhead. Each improvement felt like peeling another layer of latency off a high-performance engine — without fear of corrupting memory.</p>
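<p>As one hedged example of the zero-copy idea (again a sketch, not the project’s actual code), the <code class="language-plaintext highlighter-rouge">bytes</code> crate lets several consumers share one buffer instead of cloning payloads:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use bytes::Bytes;

fn main() {
    // One heap allocation for a whole batch of log entries.
    let batch = Bytes::from(vec![0u8; 4096]);

    // slice() creates cheap views that share the same allocation:
    // it bumps a reference count and adjusts offsets, with no copying.
    let for_replica_a = batch.slice(0..2048);
    let for_replica_b = batch.slice(2048..4096);

    assert_eq!(for_replica_a.len(), 2048);
    assert_eq!(for_replica_b.len(), 2048);
}
</code></pre></div></div>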

<h2 id="wish-list-for-ai-assisted-coding">Wish List for AI-Assisted Coding</h2>

<p>Reflecting on my journey, I keep wondering where AI could deliver even more value. Here are some items on my wish list:</p>

<p><strong>End-to-End User Story Execution:</strong> I still prefer to define the user stories myself. As an architect, I feel I have a better sense of what I’m building and how I’d like to build it. However, I believe AI can handle the execution itself increasingly well. Today, I still have to spend a fair amount of time steering the AI — telling it to continue when it pauses, suggesting refactoring, reviewing test coverage, and suggesting additional tests. I would prefer the AI to take more autonomy and drive this end-to-end.</p>

<p><strong>Automated Contract Workflows:</strong> The flow of applying contracts seems largely automatable. While I’d still want to review the contracts and offer suggestions, I’d like the AI to drive the rest: generating tests based on contracts, debugging individual test cases, ensuring consistency between tests and contracts, and writing property-based tests. When a test fails, I’d like the AI to debug and fix trivial issues automatically, only notifying me when there are genuine correctness issues in the contracts or the implementation.</p>

<p><strong>Autonomous Performance Optimization:</strong> Performance tuning seems ripe for more automation. Much of what I’ve done is repetitive and parallelizable. Projects like AlphaEvolve (or OpenEvolve) show promise in this direction. Ideally, I would suggest potential optimization avenues, and the AI would execute the experiments completely by itself. While current tools handle small bodies of code, applying similar techniques to larger codebases with end-to-end measurement seems feasible.</p>

<h2 id="appendix-project-status">Appendix: Project Status</h2>

<p>The seed of the project is an elegant design markdown authored by Jay Lorch [<a href="https://jaylorch.net/">4</a>] from Microsoft Research. This design greatly simplifies all the components in multi-Paxos, making it easier to implement and reason about.</p>

<p>So far, 2 out of the 3 RSL limitations have been addressed: pipelining and NVM support (Jay integrated the fully verified persistence log for NVM, which was published in the <code class="language-plaintext highlighter-rouge">PoWER Never Corrupts</code> paper [<a href="https://www.usenix.org/conference/osdi25/presentation/leblanc">5</a>] at OSDI 2025). The RDMA support is still TBD.</p>

<p>To date, the project has grown to over <strong>130K lines of Rust code</strong>, with <strong>1,300+ tests</strong> accounting for more than <strong>65%</strong> of the codebase.</p>

<p align="center">
  <img src="/assets/images/rsml/count_lines.png" width="30%" />
</p>

<p align="center">
  <img src="/assets/images/rsml/test_coverage.png" width="80%" />
</p>]]></content><author><name></name></author><category term="Rust" /><category term="Claude Code" /><category term="Codex" /><category term="Contracts" /><category term="Spec-Driven Development" /><summary type="html"><![CDATA[In the past few months, I’ve been stress-testing how far AI coding agents can take us when building real, production-grade distributed systems. The result: a Rust-based multi-Paxos consensus engine that not only implements all the features of Azure’s Replicated State Library (RSL) [1] — which underpins most major Azure services — but also modernizes it for today’s hardware. The entire project took me ~3 months, with 100K lines of Rust code written in ~4 weeks and performance optimization from 23K operations/sec to 300K ops/sec achieved in ~3 weeks. Besides unprecedented productivity, I discovered several techniques that were instrumental. This post shares my most valuable learnings on: ensuring correctness with code contracts, applying lightweight spec-driven development, and pursuing aggressive performance optimization — plus my wish list for the future of AI-assisted coding. Why Modernize RSL? Azure’s RSL implements the multi-Paxos consensus protocol and forms the backbone of replication in many Azure services. However, RSL was written more than a decade ago. While robust, it hasn’t evolved to match modern hardware and workloads. There are three key gaps motivated this project: No pipelining: When a vote is in flight, new requests must wait, inflating latency. No NVM support: Non-volatile memory is now common in Azure datacenters and can drastically reduce commit time. Limited hardware awareness: RSL wasn’t built to leverage RDMA, which is now pervasive in Azure data centers. Removing these limitations could unlock significantly lower latency and higher throughput — critical for modern cloud workloads and AI-driven services. Given my interest in Rust and AI-accelerated development, I set out to build a modern RSL equivalent from scratch. Massive Productivity Boost In roughly six weeks, I’ve driven AI and implemented over 130K lines of Rust code covering the full feature set of RSL, including multi-Paxos, leader election, log replication, snapshotting, and configuration changes. I utilized many available AI coding agents: GitHub Copilot, Claude Code, Codex, Augment Code, Kiro, and Trae. My workflow evolved quickly, but today my main drivers are Claude Code and Codex CLI, with VS Code handling diffs and minor edits. I’ve found that coding from the CLI creates a perfect asynchronous flow that maximizes my productivity. I also discovered a simple psychological trick: I pay $100/month for Anthropic’s max plan. This became a forcing function — if I don’t kick off a coding task with Claude before bed, I feel like I’m wasting money. When Codex CLI arrived, I added a second ChatGPT Plus subscription to handle rate limits — one subscription for Monday–Wednesday, the other for Thursday–Sunday. Code Contracts — By AI, For AI The question I get most often is: How can AI possibly implement something as complex as Paxos correctly? Testing is the first layer of defense. My system now includes 1,300+ tests — from unit tests to minimal integration tests (e.g., proposer + acceptor only), all the way to multi-replica full integration tests with injected failures. See the project status. But the real breakthrough came from AI-driven code contracts. Code contracts specify preconditions, postconditions, and invariants for critical functions. 
These contracts are converted into runtime asserts during testing but can be disabled in production builds for performance. While I started using this approach long ago with .NET [2], AI has made contracts vastly more powerful. Here’s how I apply them at three levels: 1. Ask AI to write contracts. Opus 4.1 writes good contracts, but GPT-5 High writes excellent ones. I focus on reviewing and refining. For example, the process_2a method (handling phase 2a messages in Paxos) has 16 contracts, including this one: 2. Generate tests from contracts. Once contracts are defined, I ask AI to create targeted test cases for each post-condition. It excels at this, generating meaningful edge cases automatically. 3. Property-based tests for contracts. This is my favorite. AI translates contracts into property-based tests, exploring a vast space of randomized inputs. Any contract violation triggers a panic, exposing deep bugs early. For instance, one AI-generated contract found a subtle Paxos safety violation: That single contract saved what could have been a serious replication consistency issue — well before it ever hits production. Lightweight Spec-Driven Development I’ve tried various Spec-Driven Development (SDD) tools. In fact, the earlier components (such as leader election, proposer, acceptor, and learner) were all implemented following a rigid SDD approach. I would start with a requirement markdown, turn it into a design markdown, and then a task list markdown. However, I gradually found this process too rigid; making changes along the way and ensuring all documents remained consistent became a headache. I’ve now switched to a more lightweight approach. When I work on a feature (e.g., snapshotting), I use /specify from spec kit [3] to generate a spec markdown. This spec consists of a few user stories and acceptance criteria. Here is an example user story for snapshotting: I then use /clarify to ask the AI to self-critique and improve the user stories and criteria. I also ask it to suggest additional user stories not covered in the initial spec. I spend the majority of my time here. Once satisfied, I jump into plan mode and ask AI to generate a plan for a specific user story. Given the capability of today’s AI coding agents, a single user story feels like the “sweet spot” unit of work they can manage effectively. Along the way, we may discover additions or tweaks, which are easy to handle in the same coding session (agents might compress context, but I typically don’t worry about losing too much). Here is a sample /clarify interaction for configuration changes: ● Question 4: Starting Slot Determination How should the starting slot for the new configuration be determined relative to the ending slot? Recommended: Option A - Always exactly ending_slot + 1 This ensures no gaps in the slot sequence and maintains continuity between configurations, preventing any slots from being orphaned or unassigned. | Option | Description | |--------|----------------------------------------------------------------| | A | Always exactly ending_slot + 1 | | B | Allow operator to specify any value &gt; ending_slot | | C | Next multiple of 1000 after ending_slot for cleaner boundaries | | D | Same as ending_slot (both configs share final slot) | You can reply with the option letter (e.g., "A"), accept the recommendation by saying "yes" or "recommended", or provide your own short answer. Aggressive Performance Optimization Performance optimization is where AI really shines. 
After ensuring initial correctness, I spent about three weeks purely on throughput tuning — and AI became my co-pilot in performance engineering. Through iterative cycles, we boosted throughput from ~23K ops/sec to ~300K ops/sec on a single laptop. Here’s the loop I followed repeatedly: Ask AI to instrument latency metrics across all code paths. Run performance tests and output trace logs. Let AI analyze latency breakdowns (it writes Python scripts to calculate quantiles and identify bottlenecks). Ask AI to propose optimizations, implement one, re-measure, and repeat. This process surfaced insights I might have missed — for example, lock contention on async paths, redundant memory copies, and unnecessary task spawns. Rust’s safety model made it easy to push these optimizations confidently. Key gains came from minimizing allocations, applying zero-copy techniques, avoiding locks, and selectively removing async overhead. Each improvement felt like peeling another layer of latency off a high-performance engine — without fear of corrupting memory. Wish List for AI-Assisted Coding Reflecting on my journey, I keep wondering where AI could deliver even more value. Here are some items on my wish list: End-to-End User Story Execution: I still prefer to define the user stories myself. As an architect, I feel I have a better sense of what I’m building and how I’d like to build it. However, the delivery of a perfect execution is something I believe AI can handle increasingly well. Today, I still have to spend a fair amount of time steering the AI — telling it to continue when it pauses, suggesting refactoring, reviewing test coverage, and suggesting additional tests. I would prefer the AI take more autonomy to drive this end-to-end. Automated Contract Workflows: The flow of applying contracts seems largely automatable. While I’d still want to review the contracts and offer suggestions, I’d like the AI to drive the rest: generating tests based on contracts, debugging individual test cases, ensuring consistency between tests and contracts, and writing property-based tests. When a test fails, I’d like the AI to debug and fix trivial issues automatically, only notifying me when there are genuine correctness issues in the contracts or the implementation. Autonomous Performance Optimization: Performance tuning seems ripe for more automation. Much of what I’ve done is repetitive and parallelizable. Projects like AlphaEvolve (or OpenEvolve) show promise in this direction. Ideally, I would suggest potential optimization avenues, and the AI would execute the experiments completely by itself. While current tools handle small bodies of code, applying similar techniques to larger codebases with end-to-end measurement seems feasible. Appendix: Project Status The seed of the project is an elegant design markdown authored by Jay Lorch [4] from Microsoft Research. This design greatly simplifies all the components in multi-Paxos, making it easier to implement and reason about. So far, 2 out of the 3 RSL limitations have been addressed: pipelining and NVM support (Jay integrated the fully verified persistence log for NVM which was published in the PoWER Never Corrupts paper [5] at OSDI 2025). The RDMA support is still TBD. 
To date, the project has grown to over 130K lines of Rust code, with 1,300+ tests accounting for more than 65% of the codebase.]]></summary></entry><entry><title type="html">Lamport Agent - AI-assisted Formal Specification</title><link href="https://zfhuang99.github.io/github%20copilot/formal%20verification/tla+/2025/11/14/lamport-agent.html" rel="alternate" type="text/html" title="Lamport Agent - AI-assisted Formal Specification" /><published>2025-11-14T00:00:00+00:00</published><updated>2025-11-14T00:00:00+00:00</updated><id>https://zfhuang99.github.io/github%20copilot/formal%20verification/tla+/2025/11/14/lamport-agent</id><content type="html" xml:base="https://zfhuang99.github.io/github%20copilot/formal%20verification/tla+/2025/11/14/lamport-agent.html"><![CDATA[<p>This blog post was prompted by an invitation to present to Nvidia engineers (<a href="/assets/attachments/lamport_agent.pdf">slides</a>).</p>

<p>In the previous post [<a href="https://zfhuang99.github.io/github%20copilot/formal%20verification/tla+/2025/05/24/ai-revolution-in-distributed-systems.html">1</a>], I discussed how large language models (LLMs) are now capable of producing precise formal specifications directly from large production codebases and identifying nuanced race conditions. This process can be largely automated, making it broadly applicable. Accordingly, we have developed Lamport Agent so that others can use it with their own codebases. The following is an example demonstrating this agent using DeepSeek’s open-source distributed file system, 3FS [<a href="https://github.com/deepseek-ai/3FS">2</a>]. The complete GitHub Copilot chat history and all the artifacts produced by the agent are available here [<a href="https://github.com/zfhuang99/lamport-agent">3</a>].</p>

<h2 id="craq-in-deepseek-3fs">CRAQ in DeepSeek 3FS</h2>

<p>3FS implements an optimized chain replication protocol, called CRAQ [<a href="https://www.usenix.org/legacy/event/usenix09/tech/full_papers/terrace/terrace.pdf">4</a>]. We
aim to model this system and check its consistency.</p>

<p>The charts below compare standard chain replication with CRAQ. In standard
chain replication, only the tail node serves reads, to preserve consistency. CRAQ
allows any node to serve reads by using version numbers: writes
increment the object's version at the head and are marked dirty until
they commit at the tail. Once committed, the update propagates back,
converting all dirty writes to committed ones. Nodes can serve reads
directly unless they have dirty writes; in that case, they fetch the
latest version from the tail. The process grows more complex when
additional steps are needed to address node failures and repairs.</p>
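<p>To make the read rule concrete, here is a small Rust sketch of the logic described above — purely illustrative, not 3FS code; the types and names are hypothetical:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Hypothetical per-object state held on a chain node.
struct ObjectState {
    committed_version: u64,
    // Set while a newer write is still propagating toward the tail.
    dirty_version: Option&lt;u64&gt;,
}

enum ReadResult {
    /// Safe to serve locally: no dirty write is outstanding.
    Local(u64),
    /// Must ask the tail which version is the latest committed one.
    AskTail,
}

/// CRAQ read rule: any node may serve a read, unless it holds a dirty
/// (not yet committed) version, in which case it defers to the tail.
fn read(state: &amp;ObjectState) -&gt; ReadResult {
    match state.dirty_version {
        None =&gt; ReadResult::Local(state.committed_version),
        Some(_) =&gt; ReadResult::AskTail,
    }
}
</code></pre></div></div>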

<p>To ensure strong consistency, all reads must return the most recent
committed write. Our experiment evaluates whether 3FS’ implementation
meets this requirement.</p>

<hr />
<p float="left">
<img src="/assets/images/lamport/image1.png" width="45%" />
<img src="/assets/images/lamport/image2.png" width="45%" />
</p>

<hr />

<h2 id="lamport-agent-in-action">Lamport Agent in action</h2>

<p>To begin, we create a custom agent in GitHub Copilot by following the
provided instructions [<a href="https://github.com/zfhuang99/lamport-agent">3</a>].</p>

<p align="center">
  <img src="/assets/images/lamport/image3.png" width="80%" />
</p>

<p>The procedure may be initiated through a straightforward initial prompt.</p>

<p><img src="/assets/images/lamport/image4.png" alt="" width="100%" /></p>

<p>The agent develops a 6-phase plan, starting with phase 1: studying the
implementation and documenting the architecture and the happy path
logic.</p>

<p><img src="/assets/images/lamport/image5.png" alt="" width="100%" />
<img src="/assets/images/lamport/image6.png" alt="" width="100%" /></p>

<p>A snippet of the happy path markdown.</p>

<p><img src="/assets/images/lamport/image7.png" alt="" width="100%" /></p>

<p><img src="/assets/images/lamport/image8.png" alt="" width="100%" /></p>

<p>Phase 2 covers failure paths.</p>

<p><img src="/assets/images/lamport/image9.png" alt="" width="100%" /></p>

<p><img src="/assets/images/lamport/image10.png" alt="" width="100%" /></p>

<p>Phase 3 produces invariants – 7 safety and 1 liveness in this example.</p>

<p><img src="/assets/images/lamport/image11.png" alt="" width="100%" /></p>

<p><img src="/assets/images/lamport/image12.png" alt="" width="100%" /></p>

<p>One may choose to stop at this stage and use the agent solely to enhance the understanding of the codebase, without proceeding to the TLA+ section.</p>

<p><img src="/assets/images/lamport/image13.png" alt="" width="100%" /></p>

<p>AI is able to identify and fix its own errors, whether they are related
to syntax or logic.</p>

<p><img src="/assets/images/lamport/image14.png" alt="" width="100%" /></p>

<p>AI engages in a back-and-forth argument with me, much like a human team
member. This appears to be an emergent behavior in GPT-5.1!</p>

<p><img src="/assets/images/lamport/image15.png" alt="" width="100%" /></p>

<p><img src="/assets/images/lamport/image16.png" alt="" width="100%" /></p>

<p><img src="/assets/images/lamport/image17.png" alt="" width="100%" /></p>

<p>We can continue driving AI to improve the specification with minimal
additional effort.</p>

<p><img src="/assets/images/lamport/image18.png" alt="" width="100%" /></p>

<p><img src="/assets/images/lamport/image19.png" alt="" width="100%" /></p>]]></content><author><name></name></author><category term="GitHub Copilot" /><category term="Formal verification" /><category term="TLA+" /><summary type="html"><![CDATA[This blog post was prompted by an invitation to present to Nvidia engineers (slides).]]></summary></entry><entry><title type="html">The Coming AI Revolution in Distributed Systems</title><link href="https://zfhuang99.github.io/github%20copilot/formal%20verification/tla+/2025/05/24/ai-revolution-in-distributed-systems.html" rel="alternate" type="text/html" title="The Coming AI Revolution in Distributed Systems" /><published>2025-05-24T00:00:00+00:00</published><updated>2025-05-24T00:00:00+00:00</updated><id>https://zfhuang99.github.io/github%20copilot/formal%20verification/tla+/2025/05/24/ai-revolution-in-distributed-systems</id><content type="html" xml:base="https://zfhuang99.github.io/github%20copilot/formal%20verification/tla+/2025/05/24/ai-revolution-in-distributed-systems.html"><![CDATA[<p>Formal verification has long been the gold standard for uncovering subtle bugs in distributed system design [<a href="https://lamport.azurewebsites.net/tla/industrial-use.html">1</a>]. While AI has already proven its ability to accelerate verification processes [<a href="https://zfhuang99.github.io/tla+/pluscal/chatgpt/2023/09/24/TLA-made-simple-with-chatgpt.html">2</a>], recent breakthroughs suggest a far more transformative potential: AI can now autonomously generate accurate formal specifications directly from very large production codebases. This capability marks a pivotal moment in software engineering, pointing toward a future where AI-driven correctness verification becomes not just standard practice, but potentially superior to human efforts <sup id="fnref:disclaimer" role="doc-noteref"><a href="#fn:disclaimer" class="footnote" rel="footnote">1</a></sup>.</p>

<h2 id="ai-driven-bug-discovery">AI-driven Bug Discovery</h2>

<p>Our recent work with GitHub Copilot demonstrates this potential in action. The AI autonomously produced precise TLA+ specifications from Azure Storage’s production source code, uncovering a subtle race condition that had evaded traditional code reviews and extensive testing. This achievement illustrates how AI can transform our approach to ensuring correctness in complex distributed systems.</p>

<h3 id="strategic-planning-ais-methodical-approach">Strategic Planning: AI’s Methodical Approach</h3>

<p>We began by tasking GitHub Copilot (with o3 model) to formulate a comprehensive plan for analyzing a specific feature and generating a corresponding TLA+ specification. The AI responded with an 8-step methodology that demonstrated deep understanding of formal verification principles:</p>

<ul>
  <li>Systematic cataloging of code paths, data structures, and Paxos/server commands</li>
  <li>Extraction of behavioral patterns and invariants from source code</li>
  <li>Construction of a minimal yet complete TLA+ model</li>
  <li>Integration of failure modes and safety/liveness properties</li>
</ul>

<h3 id="from-plan-to-execution-ai-in-action">From Plan to Execution: AI in Action</h3>

<p><strong>1. Autonomous Code Analysis and Initial Specification</strong></p>

<ul>
  <li>
    <p>After reviewing the AI-generated plan, we instructed GitHub Copilot Agent (with Claude 3.7 Sonnet model) to execute each step, marking completion as it progressed. Notably, we only provided the feature name and the relevant component directory to narrow the search scope. The AI autonomously identified pertinent files and extracted essential information without further guidance.</p>
  </li>
  <li>
    <p>Subsequently, the AI produced comprehensive architecture and behavior documentation, along with an initial TLA+ specification containing over 10 invariants. Impressively, the primary safety invariant matched precisely what we anticipated.</p>
  </li>
</ul>

<p><strong>2. Iterative Refinement of the Specification</strong></p>

<ul>
  <li>
    <p>Upon reviewing the initial specification, we noticed the absence of an etag-based optimistic concurrency control mechanism. We prompted the AI to investigate how etags were utilized to prevent race conditions. In response, the AI revisited the source code, updated the documentation accordingly, and refined the TLA+ specification to incorporate this critical mechanism.</p>
  </li>
  <li>
    <p>We observed that the AI refined the specification iteratively, updating one atomic action at a time. This incremental approach likely helped the AI maintain precision and accuracy throughout the refinement process.</p>
  </li>
</ul>

<p><strong>3. Validation through Model Checking</strong></p>

<ul>
  <li>With the refined specification ready, we proceeded to validate it using the TLA+ model checker. Initially, the tool reported a language error, which the AI swiftly corrected. Subsequent model checking uncovered a violation of the primary safety invariant. After providing the violation trace to the AI, it quickly identified the problematic sequence — a concurrent deletion and reference addition scenario.</li>
</ul>

<p><strong>4. Enhancing the Specification via Git History Analysis</strong></p>

<ul>
  <li>To ensure comprehensive coverage, we asked the AI to analyze the git commit history for the feature using git MCP. This analysis revealed an additional critical safety mechanism involving pessimistic locking, previously overlooked. The AI promptly updated both the documentation and the TLA+ specification to reflect this important discovery.</li>
</ul>

<h3 id="the-critical-discovery">The Critical Discovery</h3>

<p>Within hours of iterative refinement, the AI had surfaced a critical race condition: an old Paxos primary could perform a deletion while a new primary simultaneously added a reference. This subtle bug had escaped detection through traditional methods, yet given Azure Storage’s scale, it would likely have manifested in production eventually.</p>

<h2 id="looking-ahead-the-autonomous-future">Looking Ahead: The Autonomous Future</h2>

<p>After a decade of manually crafting TLA+ specifications, I must acknowledge that this AI-generated specification rivals human work. This achievement represents more than incremental progress — it demonstrates that the fundamental components for fully automating correctness verification are now available.</p>

<p>Based on current capabilities, I envision the following evolution:</p>

<ul>
  <li><strong>Autonomous System Analysis</strong>: AI will independently dissect large-scale distributed systems, mapping critical components and their interactions.</li>
  <li><strong>Intelligent Invariant Discovery</strong>: AI will identify and formalize key system invariants without human guidance.</li>
  <li><strong>Self-Validating Specifications</strong>: AI-generated TLA+ models will undergo automatic validation, with AI analyzing any violations
    <ul>
      <li>Violations will trigger AI-crafted targeted tests</li>
      <li>Confirmed bugs will be automatically corrected</li>
      <li>Model-implementation discrepancies will drive continuous specification refinement</li>
    </ul>
  </li>
  <li><strong>Reinforcement Learning Feedback Loop</strong>: Each iteration will generate valuable data, enabling AI to continuously improve its verification capabilities.</li>
  <li><strong>Mastery Through Clear Objectives</strong>: Like other domains with verifiable outcomes, distributed system correctness will become an area where AI consistently outperforms human practitioners.</li>
</ul>

<p>While we haven’t reached this destination yet, the path forward is clear. Even if AI development paused today, the foundation for an AI-driven correctness revolution has already been laid.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:disclaimer" role="doc-endnote">
      <p>Opinions are personal. <a href="#fnref:disclaimer" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="GitHub Copilot" /><category term="Formal verification" /><category term="TLA+" /><summary type="html"><![CDATA[Formal verification has long been the gold standard for uncovering subtle bugs in distributed system design [1]. While AI has already proven its ability to accelerate verification processes [2], recent breakthroughs suggest a far more transformative potential: AI can now autonomously generate accurate formal specifications directly from very large production codebases. This capability marks a pivotal moment in software engineering, pointing toward a future where AI-driven correctness verification becomes not just standard practice, but potentially superior to human efforts 1. AI-driven Bug Discovery Our recent work with GitHub Copilot demonstrates this potential in action. The AI autonomously produced precise TLA+ specifications from Azure Storage’s production source code, uncovering a subtle race condition that had evaded traditional code reviews and extensive testing. This achievement illustrates how AI can transform our approach to ensuring correctness in complex distributed systems. Strategic Planning: AI’s Methodical Approach We began by tasking GitHub Copilot (with o3 model) to formulate a comprehensive plan for analyzing a specific feature and generating a corresponding TLA+ specification. The AI responded with an 8-step methodology that demonstrated deep understanding of formal verification principles: Systematic cataloging of code paths, data structures, and Paxos/server commands Extraction of behavioral patterns and invariants from source code Construction of a minimal yet complete TLA+ model Integration of failure modes and safety/liveness properties From Plan to Execution: AI in Action 1. Autonomous Code Analysis and Initial Specification After reviewing the AI-generated plan, we instructed GitHub Copilot Agent (with Claude 3.7 Sonnet model) to execute each step, marking completion as it progressed. Notably, we only provided the feature name and the relevant component directory to narrow the search scope. The AI autonomously identified pertinent files and extracted essential information without further guidance. Subsequently, the AI produced comprehensive architecture and behavior documentation, along with an initial TLA+ specification containing over 10 invariants. Impressively, the primary safety invariant matched precisely what we anticipated. 2. Iterative Refinement of the Specification Upon reviewing the initial specification, we noticed the absence of an etag-based optimistic concurrency control mechanism. We prompted the AI to investigate how etags were utilized to prevent race conditions. In response, the AI revisited the source code, updated the documentation accordingly, and refined the TLA+ specification to incorporate this critical mechanism. We observed that the AI refined the specification iteratively, updating one atomic action at a time. This incremental approach likely helped the AI maintain precision and accuracy throughout the refinement process. 3. Validation through Model Checking With the refined specification ready, we proceeded to validate it using the TLA+ model checker. Initially, the tool reported a language error, which the AI swiftly corrected. Subsequent model checking uncovered a violation of the primary safety invariant. After providing the violation trace to the AI, it quickly identified the problematic sequence — a concurrent deletion and reference addition scenario. 4. 
Enhancing the Specification via Git History Analysis To ensure comprehensive coverage, we asked the AI to analyze the git commit history for the feature using git MCP. This analysis revealed an additional critical safety mechanism involving pessimistic locking, previously overlooked. The AI promptly updated both the documentation and the TLA+ specification to reflect this important discovery. The Critical Discovery Within hours of iterative refinement, the AI had surfaced a critical race condition: an old Paxos primary could perform a deletion while a new primary simultaneously added a reference. This subtle bug had escaped detection through traditional methods, yet given Azure Storage’s scale, it would likely have manifested in production eventually. Looking Ahead: The Autonomous Future After a decade of manually crafting TLA+ specifications, I must acknowledge that this AI-generated specification rivals human work. This achievement represents more than incremental progress — it demonstrates that the fundamental components for fully automating correctness verification are now available. Based on current capabilities, I envision the following evolution: Autonomous System Analysis: AI will independently dissect large-scale distributed systems, mapping critical components and their interactions. Intelligent Invariant Discovery: AI will identify and formalize key system invariants without human guidance. Self-Validating Specifications: AI-generated TLA+ models will undergo automatic validation, with AI analyzing any violations Violations will trigger AI-crafted targeted tests Confirmed bugs will be automatically corrected Model-implementation discrepancies will drive continuous specification refinement Reinforcement Learning Feedback Loop: Each iteration will generate valuable data, enabling AI to continuously improve its verification capabilities. Mastery Through Clear Objectives: Like other domains with verifiable outcomes, distributed system correctness will become an area where AI consistently outperforms human practitioners. While we haven’t reached this destination yet, the path forward is clear. Even if AI development paused today, the foundation for an AI-driven correctness revolution has already been laid. Opinions are personal. &#8617;]]></summary></entry><entry><title type="html">Evaluating Windsurf - A Production System Experiment</title><link href="https://zfhuang99.github.io/windsurf/2024/12/15/evaluating-windsurf.html" rel="alternate" type="text/html" title="Evaluating Windsurf - A Production System Experiment" /><published>2024-12-15T00:00:00+00:00</published><updated>2024-12-15T00:00:00+00:00</updated><id>https://zfhuang99.github.io/windsurf/2024/12/15/evaluating-windsurf</id><content type="html" xml:base="https://zfhuang99.github.io/windsurf/2024/12/15/evaluating-windsurf.html"><![CDATA[<h2 id="summary">Summary</h2>

<p>2025 is projected to be a breakthrough year for AI agents, particularly
in software development. Agentic coding assistants like Cursor and
Windsurf (both forking from VS Code) are evolving rapidly, with
increasingly sophisticated capabilities. While social media abounds with
success stories of these tools democratizing coding — including
accounts of children as young as eight building functional games —
their effectiveness in handling complex, real-world codebases remains
largely unexplored.</p>

<p>This experiment examines how Windsurf can be leveraged to develop a new
feature for a sophisticated production component. We specifically
highlight how agentic coding differs from code completion as pioneered
by GitHub Copilot.</p>

<p>Our experiment subject is Microsoft's RSL [<a href="https://github.com/Azure/RSL">1</a>] (Replicated State
Library), an implementation of the Paxos consensus algorithm that serves
as the core metadata engine for numerous large-scale distributed systems
within Azure. Similar components are used by other major cloud providers
in their production environments. RSL's open-source nature makes it an
ideal candidate for evaluating cutting-edge agentic coding assistants.</p>

<p>RSL enables consensus among a group of nodes when a quorum (majority)
remains operational. While this works effectively for single-group
scenarios, we've encountered challenges in sharded systems with
multiple RSL groups. Specifically, we've observed cross-talk — nodes
from different RSL groups attempting to communicate with each other.
This interference can disrupt RSL's operation, potentially causing
system unavailability or compromising correctness. To address this, we
aimed to implement a group ID feature that restricts communication to
nodes sharing the same group identifier.</p>
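<p>As a simple illustration of the intended semantics (a sketch in Rust rather than
the RSL C++ codebase; the names are hypothetical):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Hypothetical message header carrying the sender's replication group.
struct MessageHeader {
    group_id: u32,
    ballot: u64,
}

/// Drop any message that originates from a different RSL group, so that
/// co-located groups in a sharded deployment cannot cross-talk.
fn accept_message(local_group_id: u32, header: &amp;MessageHeader) -&gt; bool {
    if header.group_id != local_group_id {
        // Reject (and ideally log); processing it could break safety.
        return false;
    }
    true
}
</code></pre></div></div>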

<p>The implementation of such features is non-trivial. While we had a broad
understanding of RSL, navigating its complex codebase — with the
single core file alone containing over 7,500 lines of code — required
deep familiarity with implementation details at the functional level.
Under normal circumstances, just gaining sufficient understanding of the
codebase to implement this feature would have required focused effort
spanning several days. However, with Windsurf's agentic capabilities,
we were able to complete the implementation and resolve all build issues
in just <strong>two hours</strong> (not including the development of test cases).</p>

<h2 id="agentic-flow-highlights">Agentic Flow Highlights</h2>

<p>In this section, we present two examples that led to our eureka
moments in appreciating the power of agentic flow. In subsequent
sections, we cover the flow of our feature development and additional
interactions with the AI agent.</p>

<h3 id="example-1">Example 1</h3>

<p>Upon completion of the group ID feature (as detailed in the next
section), we requested the AI agent to review a critical change. Our
objective was to ensure that the change had been applied across all
necessary code paths. With a single prompt, the AI agent meticulously
reviewed the entire file (comprising thousands of lines of code),
identified all relevant methods, determined where changes were
necessary, and generated the requisite modifications.</p>

<p align="center">
  <img src="/assets/images/windsurf/image1.png" width="90%" />
</p>

<p align="center">
  <img src="/assets/images/windsurf/image2.png" width="90%" />
</p>

<p align="center">
  <img src="/assets/images/windsurf/image3.png" width="90%" />
</p>

<h3 id="example-2">Example 2</h3>

<p>RSL employs a complex test setup, including a test engine with simulated
legislators, to facilitate testing. In order to augment the test cases
to cover our new feature, we wanted a comprehensive understanding of the
test case setup. Specifically, we sought to determine whether message
passing in the test cases was implemented through mocking or transferred
via TCP. Following a straightforward prompt, the AI agent traced through
both the core RSL library codebase and the test code implementation. It
successfully pieced together relevant information and provided a
detailed answer for our investigation.</p>

<p align="center">
  <img src="/assets/images/windsurf/image4.png" width="90%" />
</p>

<p align="center">
  <img src="/assets/images/windsurf/image5.png" width="90%" />
</p>

<p align="center">
  <img src="/assets/images/windsurf/image6.png" width="90%" />
</p>

<h2 id="first-prompt">First prompt</h2>

<p>Our initial interaction with the AI agent began by directly requesting
the feature implementation, without providing additional context. While
the agent couldn't immediately access the codebase and relevant files,
it demonstrated understanding of the task by proposing a viable
implementation strategy.</p>

<p align="center">
  <img src="/assets/images/windsurf/image7.png" width="90%" />
</p>

<h2 id="clarifying-requirements">Clarifying requirements</h2>

<p>In its initial response, the agent proactively sought clarification
through targeted questions to better understand the requirements and
context.</p>

<p align="center">
  <img src="/assets/images/windsurf/image8.png" width="90%" />
</p>

<h2 id="asking-for-help">Asking for help</h2>

<p>When unable to locate the necessary files, the agent explicitly
requested assistance rather than making assumptions or proceeding with
incomplete information.</p>

<p align="center">
  <img src="/assets/images/windsurf/image9.png" width="90%" />
</p>

<h2 id="proposing-detailed-implementations">Proposing detailed implementations</h2>

<p>Once the agent successfully located the necessary files, it proposed a
comprehensive implementation strategy.</p>

<p align="center">
  <img src="/assets/images/windsurf/image10.png" width="90%" />
</p>

<h2 id="updating-multiple-files">Updating multiple files</h2>

<p>The agent successfully modified multiple files, some containing more
than 1,000 lines of code. However, when faced with legislator.cpp
— the largest file at over 7,500 lines — the agent encountered
limitations in directly implementing changes. After several unsuccessful
attempts, we adapted our approach by requesting the agent to specify the
necessary modifications, which we then applied manually.</p>

<p align="center">
  <img src="/assets/images/windsurf/image11.png" width="90%" />
</p>

<h2 id="identifying-gaps">Identifying gaps</h2>

<p>After implementing the core functionality, we initiated a new
conversation thread to review the central logic — the validation and
rejection of messages based on group IDs. This decision to start fresh
helped avoid potential confusion from an oversized context window.
Although we needed to reorient the agent with the correct file paths, it
quickly engaged in meaningful code review.</p>

<p>The agent not only successfully analyzed the code but also identified
additional locations requiring modifications. Through this interactive
process, we (the pilot) discovered a simpler implementation approach.
When presented with this insight, the agent readily adapted and updated
the implementation to align with the simplified solution.</p>

<p align="center">
  <img src="/assets/images/windsurf/image12.png" width="90%" />
</p>

<p align="center">
  <img src="/assets/images/windsurf/image13.png" width="90%" />
</p>

<p align="center">
  <img src="/assets/images/windsurf/image14.png" width="90%" />
</p>

<p align="center">
  <img src="/assets/images/windsurf/image15.png" width="90%" />
</p>

<h2 id="fixing-build-errors">Fixing build errors</h2>

<p>As expected, build errors were easily resolved through prompting the
agent for fixes. While RSL's specialized build environment precluded
direct agent interaction with the build process, we've observed more
autonomous capabilities in other contexts, as described below.</p>

<p align="center">
  <img src="/assets/images/windsurf/image16.png" width="90%" />
</p>

<p align="center">
  <img src="/assets/images/windsurf/image17.png" width="90%" />
</p>

<h3 id="autonomous-iterations">Autonomous iterations</h3>

<p>In a separate Rust project with a standard build environment, the agent
demonstrated full autonomy by successfully resolving all build errors
after four consecutive build-fix iterations.</p>

<p align="center">
  <img src="/assets/images/windsurf/image18.png" width="90%" />
</p>

<p align="center">
  <img src="/assets/images/windsurf/image19.png" width="90%" />
</p>

<p align="center">
  <img src="/assets/images/windsurf/image20.png" width="90%" />
</p>]]></content><author><name></name></author><category term="Windsurf" /><summary type="html"><![CDATA[Summary 2025 is projected to be a breakthrough year for AI agents, particularly in software development. Agentic coding assistants like Cursor and Windsurf (both forking from VS Code) are evolving rapidly, with increasingly sophisticated capabilities. While social media abounds with success stories of these tools democratizing coding — including accounts of children as young as eight building functional games — their effectiveness in handling complex, real-world codebases remains largely unexplored. This experiment examines how Windsurf can be leveraged to develop a new feature for a sophisticated production component. We specifically highlight how agentic coding differs from code completion as pioneered by GitHub Copilot. Our experiment subject is Microsoft's RSL [1] (Replicated State Library), an implementation of the Paxos consensus algorithm that serves as the core metadata engine for numerous large-scale distributed systems within Azure. Similar components are used by other major cloud providers in their production environments. RSL's open-source nature makes it an ideal candidate for evaluating cutting-edge agentic coding assistants. RSL enables consensus among a group of nodes when a quorum (majority) remains operational. While this works effectively for single-group scenarios, we've encountered challenges in sharded systems with multiple RSL groups. Specifically, we've observed cross-talk — nodes from different RSL groups attempting to communicate with each other. This interference can disrupt RSL's operation, potentially causing system unavailability or compromising correctness. To address this, we aimed to implement a group ID feature that restricts communication to nodes sharing the same group identifier. The implementation of such features is non-trivial. While we had a broad understanding of RSL, navigating its complex codebase — with the single core file alone containing over 7,500 lines of code — required deep familiarity with implementation details at the functional level. Under normal circumstances, just gaining sufficient understanding of the codebase to implement this feature would have required focused effort spanning several days. However, with Windsurf's agentic capabilities, we were able to complete the implementation and resolve all build issues in just two hours (not including the development of test cases). Agentic Flow Highlights In this section, we present two examples that led to our eureka moments in appreciating the power of agentic flow. In subsequent sections, we cover the flow of our feature development and additional interactions with the AI agent. Example 1 Upon completion of the group ID feature (as detailed in the next section), we requested the AI agent to review a critical change. Our objective was to ensure that the change had been applied across all necessary code paths. With a single prompt, the AI agent meticulously reviewed the entire file (comprising thousands of lines of code), identified all relevant methods, determined where changes were necessary, and generated the requisite modifications. Example 2 RSL employs a complex test setup, including a test engine with simulated legislators, to facilitate testing. In order to augment the test cases to cover our new feature, we wanted a comprehensive understanding of the test case setup. 
Specifically, we sought to determine whether message passing in the test cases was implemented through mocking or transferred via TCP. Following a straightforward prompt, the AI agent traced through both the core RSL library codebase and the test code implementation. It successfully pieced together relevant information and provided a detailed answer for our investigation. First prompt Our initial interaction with the AI agent began by directly requesting the feature implementation, without providing additional context. While the agent couldn't immediately access the codebase and relevant files, it demonstrated understanding of the task by proposing a viable implementation strategy. Clarifying requirements In its initial response, the agent proactively sought clarification through targeted questions to better understand the requirements and context. Asking for help When unable to locate the necessary files, the agent explicitly requested assistance rather than making assumptions or proceeding with incomplete information. Proposing detailed implementations Once the agent successfully located the necessary files, it proposed a comprehensive implementation strategy. Updating multiple files The agent successfully modified multiple files, some containing up to more than 1,000 lines of code. However, when faced with legislator.cpp — the largest file at over 7,500 lines — the agent encountered limitations in directly implementing changes. After several unsuccessful attempts, we adapted our approach by requesting the agent to specify the necessary modifications, which we then applied manually. Identifying gaps After implementing the core functionality, we initiated a new conversation thread to review the central logic — the validation and rejection of messages based on group IDs. This decision to start fresh helped avoid potential confusion from an oversized context window. Although we needed to reorient the agent with the correct file paths, it quickly engaged in meaningful code review. The agent not only successfully analyzed the code but also identified additional locations requiring modifications. Through this interactive process, we (the pilot) discovered a simpler implementation approach. When presented with this insight, the agent readily adapted and updated the implementation to align with the simplified solution. Fixing build errors As expected, build errors were easily resolved through prompting the agent for fixes. While RSL's specialized build environment precluded direct agent interaction with the build process, we've observed more autonomous capabilities in other contexts. In a separate Rust project, the agent demonstrated full autonomy by successfully resolving all build errors after four consecutive build-fix iterations. 
Autonomous iterations In a separate project with a standard build environment, the agent demonstrated full autonomy by successfully resolving all build errors after four consecutive build-fix iterations.]]></summary></entry><entry><title type="html">TLA+ Made Simple with ChatGPT</title><link href="https://zfhuang99.github.io/tla+/pluscal/chatgpt/2023/09/24/TLA-made-simple-with-chatgpt.html" rel="alternate" type="text/html" title="TLA+ Made Simple with ChatGPT" /><published>2023-09-24T00:00:00+00:00</published><updated>2023-09-24T00:00:00+00:00</updated><id>https://zfhuang99.github.io/tla+/pluscal/chatgpt/2023/09/24/TLA-made-simple-with-chatgpt</id><content type="html" xml:base="https://zfhuang99.github.io/tla+/pluscal/chatgpt/2023/09/24/TLA-made-simple-with-chatgpt.html"><![CDATA[<h2 id="more-than-just-a-first-impression">More than just a first impression</h2>

<p>Back in 2013, I was working hard on my first implementation of the Paxos consensus protocol. When everything was going smoothly, it worked great. But when I tried some tricky test cases, things often went wrong. It was tough making sure my implementation was perfect, especially with so many possible tests to think of.</p>

<p>I thought to myself, “I can’t be the only one facing this problem.” So, I started looking for better ways to handle these challenges. That’s when I talked to some folks at Microsoft Research and heard about the P programming language [<a href="https://github.com/p-org/P">1</a>]. It felt like I was onto something.</p>

<p>Around the same time, Leslie Lamport was awarded the Turing Award. Not long after the announcement, Leslie gave a lecture at MSR. The biggest room in building 99 was packed. Everyone wanted to hear from the newest Turing Award winner.</p>

<p>Leslie’s talk had a profound impact on me. I can still remember the title like it was yesterday: “Who Builds a Skyscraper without a Blueprint?”. He was talking about how some of us try to figure out complex distributed systems while we’re writing code, like trying to design a skyscraper while you’re already building it.</p>

<p>That was the first time I heard about TLA+ [<a href="https://lamport.azurewebsites.net/tla/tla.html">2</a>]. Over the next few years, I realized how important TLA+ was for making sure our distributed systems worked right. I became a big fan of TLA+ and even helped to host the first few multi-day TLA+ training sessions by Leslie himself for everyone at Microsoft.</p>

<h2 id="challenges-in-tla-adoption">Challenges in TLA+ adoption</h2>

<p>Despite TLA+ showing promise in real-world systems [<a href="https://lamport.azurewebsites.net/tla/industrial-use.html">3</a>], its widespread adoption remains limited. What seems to be holding it back?</p>

<p>One primary challenge is its relatively steep learning curve. Among the available resources, Leslie’s lectures [<a href="https://lamport.azurewebsites.net/video/videos.html">4</a>] stand out as the most comprehensive guide. To illustrate, one developer from Azure was able to craft detailed specifications after diving into Leslie’s lectures for an entire week. However, for many, this might be an optimistic timeline, with a more realistic learning period often spanning several weeks or even longer.</p>

<p>Another significant concern is the scarcity of resources when facing difficulties. During my time assisting other developers, I observed the struggles they faced in adopting abstract thinking. Without prompt and constructive feedback, refining such skills can be a prolonged journey.</p>

<p>Moreover, the lack of readily available support compounds the issue. When developers grapple with specific aspects of the TLA+ language, finding guidance can often be a challenge in itself.</p>

<h2 id="tla-in-the-age-of-llm">TLA+ in the age of LLM</h2>

<p>The arrival of Large Language Models (LLMs) promises a transformative shift, potentially making TLA+ more accessible to a broader range of developers.</p>

<p>Drawing from my recent experiences, I’ve recognized the immense value of using ChatGPT to draft TLA+ specifications and iron out language quirks. A particular moment that struck me was when I realized a flaw in my initial design of a distributed system protocol. After modeling the protocol in TLA+ and running the validation, I was presented with an invariant violation, highlighted by a comprehensive error trace. Out of curiosity, I fed this error trace to ChatGPT. Astonishingly, ChatGPT not only pinpointed the core of the mistake but also offered a list of options to refine the protocol. In this endeavor, ChatGPT emerged as a truly invaluable assistant.</p>

<p>Interestingly, each time I’ve shared this experience with colleagues, they’ve expressed genuine surprise. It seems that the potential synergy between TLA+ and ChatGPT remains largely undiscovered. This realization motivates this article, aiming to enlighten a broader audience.</p>

<h2 id="simplifying-tla-with-chatgpt">Simplifying TLA+ with ChatGPT</h2>

<p>For illustrative purposes, I chose a simple toy consensus protocol to interact with ChatGPT. It’s worth noting that even with more complex and real-world protocol designs, the insights remained consistent. The complete ChatGPT session is available for those curious [<a href="https://chat.openai.com/share/52147e84-ff08-434a-94c8-769701f4d246">5</a>]. Here are some of the key takeaways.</p>

<h3 id="drafting-specification">Drafting specification</h3>

<p>We’ll begin by outlining the distributed system challenge at hand. Then, we’ll prompt ChatGPT to produce a TLA+ specification using PlusCal.</p>

<p align="center">
  <img src="/assets/images/tla_draft_spec.jpeg" width="100%" />
</p>

<h3 id="resolving-language-errors">Resolving language errors</h3>

<p>After copying the specification into a .tla file, we employ the TLA+ toolbox for compilation. If the compilation encounters issues, we turn to ChatGPT for error resolution. Interestingly, ChatGPT tends to repeat certain minor errors. Recognizing these patterns allows us to preemptively address them in subsequent prompts by setting a few guiding rules.</p>

<p align="center">
  <img src="/assets/images/tla_fix_compile_error.jpeg" width="100%" />
</p>

<h3 id="reviewing-specification">Reviewing specification</h3>

<p>Upon reviewing the specification, I noticed that ChatGPT mistook CHOOSE for representing non-determinism; in TLA+, CHOOSE denotes a single arbitrary-but-fixed value satisfying a predicate, whereas non-deterministic choice in PlusCal is expressed with the <em>with</em> construct. It’s important to clarify this with the model. Moving forward, incorporating this clarification into our guidelines for future prompts will be beneficial.</p>

<p align="center">
  <img src="/assets/images/tla_choose_vs_with.jpeg" width="100%" />
</p>

<h3 id="defining-invariants">Defining invariants</h3>

<p>Having acquired a full TLA+ specification and confirming its correctness through model checking, we can now proceed to establish invariants.</p>

<p align="center">
  <img src="/assets/images/tla_define_invariant.jpeg" width="100%" />
</p>

<h3 id="interpreting-error-trace">Interpreting error trace</h3>

<p>When I ran the model checking, it flagged an invariant violation. I turned to ChatGPT to help break down and understand this error trace. I was genuinely taken aback by ChatGPT’s ability to not just delineate the error trace clearly, but also to shed light on the underlying cause of the error.</p>

<p align="center">
  <img src="/assets/images/tla_analyze_error_trace_1.jpeg" width="100%" />
  <img src="/assets/images/tla_analyze_error_trace_2.jpeg" width="100%" />
</p>

<h3 id="can-chatgpt-mend-distributed-protocols">Can ChatGPT mend distributed protocols?</h3>

<p>It might seem like a tall order, but I decided to challenge ChatGPT: Could it suggest ways to rectify the toy consensus protocol? What stunned me was how ChatGPT didn’t just offer one, but a spectrum of potential solutions, weighing the advantages and drawbacks of each. In the end, I went with the most straightforward solution, and ChatGPT re-generated the specification. The revised spec sailed through the verification process seamlessly. Truly remarkable!</p>

<p align="center">
  <img src="/assets/images/tla_mend_protocol.jpeg" width="100%" />
</p>

<h2 id="conclusion">Conclusion</h2>

<p>As AI delivers an unprecedented surge in programming productivity, ensuring the robustness and fail-proof nature of our cloud-scale distributed systems becomes even more paramount. Embracing formal methods like TLA+ and software verification is imperative [<a href="https://zfhuang99.github.io/rust/chatgpt/2023/03/14/implementing-Paxos-in-Rust-with-ChatGPT.html#future-of-ai-assisted-programming">6</a>]. Through this article, I’ve aimed to highlight how ChatGPT can be a game-changer in making TLA+ more approachable. Looking ahead, I envision a future where, with tools like ChatGPT, developers can effortlessly construct more resilient infrastructures at scale.</p>]]></content><author><name></name></author><category term="TLA+" /><category term="PlusCal" /><category term="ChatGPT" /><summary type="html"><![CDATA[More than just a first impression]]></summary></entry><entry><title type="html">Practical Formal Verification for Distributed Systems</title><link href="https://zfhuang99.github.io/formal%20verification/ivy/2023/06/18/practical-formal-verification.html" rel="alternate" type="text/html" title="Practical Formal Verification for Distributed Systems" /><published>2023-06-18T00:00:00+00:00</published><updated>2023-06-18T00:00:00+00:00</updated><id>https://zfhuang99.github.io/formal%20verification/ivy/2023/06/18/practical-formal-verification</id><content type="html" xml:base="https://zfhuang99.github.io/formal%20verification/ivy/2023/06/18/practical-formal-verification.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In a rapidly evolving tech landscape where cloud systems scale exponentially and AI-assisted programming takes center stage [<a href="https://zfhuang99.github.io/rust/chatgpt/2023/03/14/implementing-Paxos-in-Rust-with-ChatGPT.html#future-of-ai-assisted-programming">1</a>], mastering the art of balancing outage costs with preventive measures is more critical than ever. This pressing need has invigorated the exploration of formal methods, a domain that includes both model checking, exemplified by tools like TLA+ [<a href="https://lamport.azurewebsites.net/tla/tla.html">2</a>], and formal verification techniques.</p>

<p>While model checking has found its way into practical engineering workflows [<a href="https://lamport.azurewebsites.net/tla/industrial-use.html">3</a>], formal verification remains a largely untapped resource for ensuring the reliability of cloud-scale distributed systems.</p>

<p>In this article, we advocate for a pragmatic approach to formal verification, emphasizing its value as a complement to model checking. Our thesis is straightforward: as AI-assisted programming becomes ubiquitous, the ability to rigorously reason about system correctness will be a key differentiator. Practical formal verification, we argue, could be the missing piece that takes the robustness of cloud-scale distributed systems to the next level.</p>

<h2 id="outage-cost-vs-prevention">Outage: cost vs. prevention</h2>

<!-- ![outage_cost](/assets/images/outage_cost_vs_investment_v2.png){:width="85%"} -->
<p align="center">
  <img src="/assets/images/outage_cost_vs_investment_v2.png" width="85%" />
</p>

<p>The above graph illustrates the relationship between the cost of outages and the preventive measures taken beforehand. On the x-axis, we have the preventive investments, and on the y-axis, we see the resulting cost when outages cannot be fully prevented.</p>

<p>The blue curve shows the expected cost of outages, i.e., the impact of software defects weighted by the probability that they occur. This cost diminishes as preventative investment rises. The upper section of this curve highlights the benefits of conventional engineering practices, such as unit testing, integration testing, and strategic release stages. In this zone, the majority of straightforward software defects are rectified. Moreover, these practices diminish the likelihood of more obscure defects, leading to a steep decline in outage costs with increased investment.</p>

<p>However, as we move towards the bottom section of the curve, the scenario becomes more challenging. Here, latent defects, often deeply embedded in the software stack, remain hidden until significant scaling takes place. These defects, though rare, can have a profound impact. To counter them, a substantial investment is necessary, but the returns (in terms of outage cost reduction) are smaller than in the initial phase. Yet, when weighing the potential damage from these defects against the preventative investment, the decision is clear: it’s prudent to keep investing until an equilibrium is reached.</p>

<p>Compounding this scenario, the blue curve is set to rise with the scaling of cloud systems. Even with a conservative annual growth rate of 20%, the cost of outages will double within four years. In contrast, the green curve, representing engineering costs, increases linearly over time. This discrepancy means that the challenges represented by the bottom section of the blue curve will grow exponentially, making investment in this area increasingly valuable.</p>

<h2 id="landscape-of-formal-methods">Landscape of formal methods</h2>

<!-- ![formal_methods](/assets/images/formal_methods.png){:width="100%"} -->
<p align="center">
  <img src="/assets/images/formal_methods.png" width="100%" />
</p>

<p>Formal methods can be visualized across four quadrants in a two-dimensional space, categorized by their approach and applicability. The Y-axis differentiates between model checking and formal verification, while the X-axis distinguishes between techniques applied to abstract specifications versus real implementations. Let’s delve into each quadrant:</p>

<ul>
  <li>Model checking on real implementation (top-right quadrant):</li>
</ul>

<p>Example: Take Coyote [<a href="https://microsoft.github.io/coyote/">4</a>], originally developed at Microsoft Research and now open-sourced. In traditional testing, executing a test case 100 times yields roughly identical results each time. With Coyote, each execution explores different task interleavings, thanks to its control over asynchronous task scheduling. As a result, invariant violations surface more quickly, and when detected, Coyote captures a trace, allowing for deterministic replay and debugging.</p>

<ul>
  <li>Model checking on abstract specification (top-left quadrant):</li>
</ul>

<p>Example: Consider TLA+ [<a href="https://lamport.azurewebsites.net/tla/tla.html">2</a>], a specification language. Real-world systems contain myriad details, many of which are non-essential to verifying correctness. By abstracting core logic and eliminating extraneous details, specifications in TLA+ focus solely on crucial behaviors. Model checking these specifications explores all defined interleavings, capturing and replaying traces whenever an invariant is violated.</p>

<ul>
  <li>Formal verification on abstract specification (bottom-left quadrant):</li>
</ul>

<p>Example: Instead of verifying through exhaustive exploration, formal verification offers mathematical proof of correctness. For instance, using the IVy language [<a href="https://kenmcmil.github.io/ivy/">5</a>], one can abstract essential distributed system logic, write specifications, and then mathematically prove their correctness using the IVy toolchain.</p>

<ul>
  <li>Formal verification on real implementation (bottom-right quadrant):</li>
</ul>

<p>Example: Verus [<a href="https://github.com/verus-lang/verus">6</a>] is a pioneering open-source project marrying formal verification with real-world implementation in the Rust language. It exemplifies the potential of integrating rigorous mathematical proofs with tangible, working code.</p>
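
<p>To make this bottom-right quadrant more concrete, here is a minimal sketch of what a Verus-style verified function can look like. It is illustrative only, written from my reading of the project’s public tutorial material rather than taken from any production code, and the exact syntax may differ across Verus versions.</p>

<pre><code class="language-rust">// Illustrative sketch only: the postconditions below are proven by the
// Verus toolchain at build time rather than checked at runtime.
use vstd::prelude::*;

verus! {

fn max(a: u64, b: u64) -&gt; (r: u64)
    ensures
        r &gt;= a,
        r &gt;= b,
        r == a || r == b,
{
    if a &gt;= b { a } else { b }
}

} // verus!
</code></pre>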

<h2 id="formal-verification-a-practical-perspective">Formal verification: a practical perspective</h2>

<p>Diving into the realm of formal verification, it’s essential to distinguish between “practical” formal verification and its academic sibling. While both aim for the same overarching goal — ensuring system correctness — their approaches and concerns differ markedly.</p>

<p>From an engineering standpoint, formal verification fills certain voids that model checking doesn’t address. Model checking’s primary challenge lies in the definition of safety properties or invariants. Engineers can usually spell out ultimate safety properties for distributed systems with ease: “ensure no data loss in a storage system” or “a consensus protocol must converge to a single value.” Yet, the intricacies arise due to the global nature of these invariants. To verify them, collaboration among multiple distributed entities becomes indispensable. For instance, confirming no data loss requires consulting all storage nodes, while verifying a consensus protocol demands interrogation of numerous nodes. These global checks are cumbersome in runtime, pushing developers towards local, intermediate invariants that individual nodes can easily enforce. But this approach isn’t without its pitfalls: How many intermediate invariants are enough? Are they comprehensive in ensuring the overarching safety goals?</p>

<p>Enter formal verification. It necessitates the formulation of inductive invariants which, collectively, can vouch for a protocol’s correctness. Interestingly, these inductive invariants often pertain to individual nodes, making their runtime enforcement more feasible.</p>

<p>However, there’s a nuance to appreciate. In practical engineering, the aim of formal verification isn’t just about achieving system correctness. While academics may laud formal verification for its prowess in managing unbounded state spaces, on the ground, this advantage doesn’t always translate to practical superiority. Even bounded model checking often suffices to ensure system correctness. Whether scrutinizing data safety across a set number of nodes or checking a consensus protocol with limited participants, it’s unlikely for a system to work in a restricted setting but fail when scaled. The true value of formal verification lies in its methodical approach — systematically proving correctness, leading us to pinpoint essential inductive invariants. Once identified and transformed into runtime checks, these invariants become the vigilant guardians of system reliability.</p>

<p>To distill our perspective on practical formal verification: It’s about pinpointing the comprehensive set of inductive invariants that vouch for a system’s safety. We’re content with bounded settings, like consensus among a specific number of nodes or ensuring linearizability within a set operation count. By doing so, we often streamline our specifications, even if it means forgoing generalization — a primary concern in academic formal verification.</p>

<h2 id="ivy">IVy</h2>

<p>This is where IVy [<a href="https://kenmcmil.github.io/ivy/">5</a>] comes in. IVy offers both a language for specifying distributed protocols and a toolkit for formal verification. For those who have worked with TLA+, transitioning to IVy is relatively straightforward.</p>

<p>Using IVy is an interactive experience. It starts with specifying a distributed protocol. With a target safety property, you ask IVy to verify it. Since most safety properties aren’t automatically inductive invariants, IVy often rejects the proof and provides counterexamples. These show scenarios where a legitimate state leads to a violation of the safety property after a valid action. This feedback helps users recognize missing base invariants.</p>

<p>The process is iterative: you add these base invariants, and IVy checks them. If IVy still flags issues, its feedback points to the need for more complex inductive invariants. As users propose these invariants, IVy confirms their validity or provides counterexamples. The cycle continues until IVy can’t find any more counterexamples, indicating that the safety property has been proven.</p>

<p>From an engineering viewpoint, this iterative method is useful. Each cycle deepens our understanding of the distributed protocol. Moreover, many of the identified inductive invariants are local, making them easy to implement as runtime checks. By adding these checks to the real system, we transition the correctness assurance from the specification to the actual implementation.</p>

<h3 id="why-not-dafny">Why not Dafny?</h3>

<p>Dafny [<a href="https://dafny.org/dafny/">7</a>] has garnered attention as a tool for the formal verification of distributed protocols. However, based on our experiences, Dafny falls short in serving the specific needs of practical formal verification. A key limitation lies in Dafny’s inability to consistently generate concise and easily understandable counterexamples, a crucial aspect in the journey of identifying inductive invariants.</p>

<h2 id="toy-consensus-example">Toy consensus example</h2>

<p>To grasp the concept of inductive invariants and their discovery, let’s walk through a simplified consensus protocol.</p>

<p><strong>Protocol overview</strong></p>
<ul>
  <li>We have three nodes: <code class="language-plaintext highlighter-rouge">n_1</code>, <code class="language-plaintext highlighter-rouge">n_2</code>, and <code class="language-plaintext highlighter-rouge">n_3</code>.</li>
  <li>We have two values: <code class="language-plaintext highlighter-rouge">v_1</code> and <code class="language-plaintext highlighter-rouge">v_2</code>.</li>
  <li>Nodes can cast a vote for any value, provided they haven’t voted already.</li>
  <li>Consensus (or a value being decided) is achieved when 2 out of the 3 nodes vote for the same value.</li>
</ul>

<p><strong>Safety property</strong>
The protocol must ensure that only one value is decided upon. In other words, two different values can’t both achieve consensus.</p>

<p>Take a moment to consider this setup. A node, if it hasn’t previously voted, can vote for any value. Once a value receives votes from two nodes, that value is deemed “decided”. The core safety aspect we’re ensuring is that only one value reaches this “decided” status.</p>

<p><strong>IVy specification</strong></p>
<ul>
  <li>Apart from nodes and values, our specification includes two quorums, <code class="language-plaintext highlighter-rouge">q_1</code> and <code class="language-plaintext highlighter-rouge">q_2</code>. The axioms <code class="language-plaintext highlighter-rouge">quorum_q1</code> and <code class="language-plaintext highlighter-rouge">quorum_q2</code> provide the simplest possible definitions for these quorums (e.g., <code class="language-plaintext highlighter-rouge">q_1</code> consists of two nodes <code class="language-plaintext highlighter-rouge">n_1</code> and <code class="language-plaintext highlighter-rouge">n_2</code>), which we’ll elaborate on later.</li>
  <li>We employ several relations, which are boolean-returning functions. For those versed in Prolog, these relations should feel familiar. Capitalized variables act as wildcards, matching any instance, whereas lowercase ones pinpoint specific instances. For instance, <code class="language-plaintext highlighter-rouge">vote(N, V)</code> indicates voting status for any node-value pairing, while <code class="language-plaintext highlighter-rouge">vote(n, v)</code> refers to a specific node-value combination.</li>
  <li>The initial state, <code class="language-plaintext highlighter-rouge">vote(N, V) := false</code>, signifies that no node has voted. Conversely, the statement <code class="language-plaintext highlighter-rouge">vote(n, v) := true</code> in the <code class="language-plaintext highlighter-rouge">cast_vote</code> action indicates a chosen node (denoted as <code class="language-plaintext highlighter-rouge">n</code>) voting for a specific value (<code class="language-plaintext highlighter-rouge">v</code>).</li>
  <li>The protocol delineates two actions:
    <ol>
      <li><code class="language-plaintext highlighter-rouge">case_vote</code>: Here, a node <code class="language-plaintext highlighter-rouge">n</code> can vote if it hasn’t voted before, as represented by the <code class="language-plaintext highlighter-rouge">~vote(n, V)</code> precondition (notice the capitalized V includes all values).</li>
      <li><code class="language-plaintext highlighter-rouge">decide</code>: This action is triggered when a quorum of nodes have voted for the same value <code class="language-plaintext highlighter-rouge">v</code>.</li>
    </ol>
  </li>
</ul>

<pre><code class="language-Ivy">type node = {n1, n2, n3}
type value = {v1, v2}
type quorum = {q1, q2}

relation member(N:node, Q:quorum)
axiom [quorum_q1] member(n1, q1) &amp; member(n2, q1) &amp; ~member(n3, q1)
axiom [quorum_q2] member(n1, q2) &amp; ~member(n2, q2) &amp; member(n3, q2)

relation vote(N:node, V:value)
relation decision(V:value)

action cast_vote(n:node, v:value) = {
    require ~vote(n, V);
    vote(n, v) := true
}

action decide(q:quorum, v:value) = {
    require member(N, q) -&gt; vote(N, v);
    decision(v) := true
}

after init {
    vote(N, V) := false;
    decision(V) := false;
}

export cast_vote
export decide
</code></pre>

<h2 id="interactive-invariant-discovery">Interactive invariant discovery</h2>

<p>Diving into the discovery of inductive invariants, let’s start by stating our primary safety invariant: only a single value can be decided upon. In essence, any two decided values must be identical.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>invariant decision(V1) &amp; decision(V2) -&gt; V1 = V2
</code></pre></div></div>

<p>IVy flags this invariant as non-inductive. The counterexample it provides begins with a state where <code class="language-plaintext highlighter-rouge">decision(v2) = true</code> and <code class="language-plaintext highlighter-rouge">decision(v1) = false</code>, with both <code class="language-plaintext highlighter-rouge">vote(n1, v1)</code> and <code class="language-plaintext highlighter-rouge">vote(n2, v1)</code> being true. This state is valid as it adheres to our invariant. However, executing the <code class="language-plaintext highlighter-rouge">decide</code> action would result in <code class="language-plaintext highlighter-rouge">decision(v1) = true</code>, violating the invariant since both <code class="language-plaintext highlighter-rouge">v1</code> and <code class="language-plaintext highlighter-rouge">v2</code> have now been decided.</p>

<p>This is a pivotal moment in understanding inductive invariants. One might argue that the state of <code class="language-plaintext highlighter-rouge">decision(v2) = true</code> and both votes for <code class="language-plaintext highlighter-rouge">v1</code> isn’t a realistic scenario. This is true, but the essence of inductive invariants lies in the fact that the reachability of the beginning state is irrelevant. As long as it fits the invariant, it’s deemed valid. This counterexample, albeit adversarial, showcases IVy’s robustness.</p>

<p>From this counterexample, it’s clear we must prevent <code class="language-plaintext highlighter-rouge">decision(v2) = true</code> when two nodes have voted for <code class="language-plaintext highlighter-rouge">v1</code>. To fortify our proof, we add an intermediate invariant ensuring that a decided value has the backing of a quorum:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>invariant decision(V) -&gt; exists Q:quorum. forall N:node. member(N, Q) -&gt; vote(N, V)
</code></pre></div></div>

<p>Yet, IVy rejects the proof again. The new counterexample retains the original state but adds both <code class="language-plaintext highlighter-rouge">vote(n1, v2) = true</code> and <code class="language-plaintext highlighter-rouge">vote(n2, v2) = true</code> because of the added invariant.</p>

<p>It’s evident that having both <code class="language-plaintext highlighter-rouge">vote(n1, v1)</code> and <code class="language-plaintext highlighter-rouge">vote(n1, v2)</code> as true is problematic, indicating a node has voted twice. So, we introduce another invariant, which states that all the values voted by an arbitrary node must be identical.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>invariant vote(N, V1) &amp; vote(N, V2) -&gt; V1 = V2
</code></pre></div></div>

<p>With this addition, IVy accepts the invariants, collectively making them inductive. This confirms the safety of our target invariant under the given specification — no two values can simultaneously be decided. Crucially, the final invariant is local, pivotal for correctness, and can be continuously validated during runtime. We can envision each node maintaining a tally of its votes, with any node exceeding one vote indicating a potential breach of the invariant. This methodology extends to more intricate protocols, like Paxos, where individual nodes can also monitor local conditions to ensure system safety.</p>
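
<p>To illustrate how such a local invariant can travel from the specification into code, here is a minimal Rust sketch. The type and method names are hypothetical, invented purely for illustration: each node records the single value it has voted for and flags any attempt to vote twice.</p>

<pre><code class="language-rust">// Hypothetical node-local state for the toy consensus protocol. The
// inductive invariant "a node votes for at most one value" becomes a
// cheap runtime check.
#[derive(Default)]
struct NodeState {
    voted_for: Option&lt;u32&gt;, // value this node has voted for, if any
}

#[derive(Debug)]
enum VoteError {
    AlreadyVoted { previous: u32, attempted: u32 },
}

impl NodeState {
    // Cast a vote, enforcing the local invariant at runtime.
    fn cast_vote(&amp;mut self, value: u32) -&gt; Result&lt;(), VoteError&gt; {
        match self.voted_for {
            None =&gt; {
                self.voted_for = Some(value);
                Ok(())
            }
            Some(previous) =&gt; Err(VoteError::AlreadyVoted { previous, attempted: value }),
        }
    }
}

fn main() {
    let mut node = NodeState::default();
    assert!(node.cast_vote(1).is_ok());
    // A second vote, for any value, violates the invariant.
    assert!(node.cast_vote(2).is_err());
}
</code></pre>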

<h2 id="simplification-via-specificity">Simplification via specificity</h2>

<p>One of the key hallmarks of “practical” formal verification lies in the art of simplification. Let’s delve into how this is highlighted in our toy consensus specification.</p>

<p>Firstly, we limit the scope to just 3 nodes and 2 values. This might seem restrictive, but it aligns perfectly with our primary objective: the discovery of all inductive invariants. Smaller specifications not only speed up the verification process but also yield counterexamples that are easier to interpret.</p>

<p>Secondly, we use specific axioms to define our quorums. Instead of adopting a generic representation that demands overlapping nodes between any two quorums, we go for a more straightforward approach. This simplification doesn’t compromise our ability to identify all the required inductive invariants.</p>

<p>However, it’s worth noting that these simplifications come with their trade-offs. For example, if we decide to extend the system to handle more nodes and values, we’ll need to manually update the specification. This is generally a one-time effort to avoid unexpected surprises.</p>

<p>Also, keep in mind that extending the specification can significantly increase the time required for verification. For instance, in a specification modeling the Paxos consensus protocol, when expanding the sets of Ballot IDs and values, the verification time shot up exponentially. To illustrate, allowing ballot IDs to range from 0 to 7, and values from a set of 3, took nearly 3 hours for verification. While it’s possible to craft a more complex specification that verifies faster, it contradicts our aim in practical formal verification: simplicity is key, as long as it enables us to identify all inductive invariants.</p>

<table>
  <thead>
    <tr>
      <th>Ballot ID</th>
      <th>Value</th>
      <th>Verification Time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0, 1, 2, 3</td>
      <td>{v1, v2}</td>
      <td>4.5s</td>
    </tr>
    <tr>
      <td>0, 1, 2, 3</td>
      <td>{v1, v2, v3}</td>
      <td>5.9s</td>
    </tr>
    <tr>
      <td>0, 1, …, 7</td>
      <td>{v1, v2}</td>
      <td>6m27s</td>
    </tr>
    <tr>
      <td>0, 1, …, 7</td>
      <td>{v1, v2, v3}</td>
      <td>171m26s</td>
    </tr>
  </tbody>
</table>

<h2 id="conclusion">Conclusion</h2>

<p>In an imminent future, where AI-assisted programming takes center stage [<a href="https://zfhuang99.github.io/rust/chatgpt/2023/03/14/implementing-Paxos-in-Rust-with-ChatGPT.html#future-of-ai-assisted-programming">1</a>], the capability to reason and ascertain high-level system correctness will stand out as a pivotal differentiating skill. For those of us crafting the foundational layers of cloud-scale distributed infrastructure, it’s imperative to not only familiarize ourselves with tools like TLA+ but to also become adept at practical formal verification. It isn’t just about knowing the tools — it’s about seamlessly integrating them into our everyday workflow, ensuring that the systems we build are robust and fail-proof.</p>]]></content><author><name></name></author><category term="Formal verification" /><category term="IVy" /><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Implementing Paxos in Rust with ChatGPT</title><link href="https://zfhuang99.github.io/rust/chatgpt/2023/03/14/implementing-Paxos-in-Rust-with-ChatGPT.html" rel="alternate" type="text/html" title="Implementing Paxos in Rust with ChatGPT" /><published>2023-03-14T00:00:00+00:00</published><updated>2023-03-14T00:00:00+00:00</updated><id>https://zfhuang99.github.io/rust/chatgpt/2023/03/14/implementing-Paxos-in-Rust-with-ChatGPT</id><content type="html" xml:base="https://zfhuang99.github.io/rust/chatgpt/2023/03/14/implementing-Paxos-in-Rust-with-ChatGPT.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>This article provides an overview of my experience implementing Paxos in Rust, aided by the advanced language model, ChatGPT. This journey, which involved a thorough ChatGPT session spanning multiple days, covered an extensive range of topics. These stretched from mastering fundamental language constructs to troubleshooting compiler errors and exploring advanced areas such as concurrency, asynchronous programming, and model checking.</p>

<p>The dynamism of this interaction led to a noticeable surge in my productivity, enabling me to complete a substantial amount of work in a rather condensed timeframe. Impressively, the entire project was wrapped up in less than a week, with roughly 2-3 hours invested on weekdays and an additional half-day over the weekend.</p>

<p>I would be remiss not to mention that this was my <em>inaugural</em> Rust project. Despite having a basic grasp of the language from prior resources like “Rust by Example” and the completion of “rustlings” — a compilation of 94 mini assignments designed to introduce Rust concepts — this project truly underscored the remarkable capabilities of ChatGPT in alleviating Rust’s infamously steep learning curve.</p>

<p>The experience left me profoundly convinced of the transformative potential that the synergy of Rust and ChatGPT holds for the field of <em>infrastructure software development</em>. Rust’s unique ability to deliver code devoid of memory leaks, crashes, and race conditions ensures a robust and efficient software infrastructure. Coupling this with the problem-solving prowess of ChatGPT can help us navigate the more intimidating facets of Rust, including its compiler errors and learning curve.</p>

<p>Moreover, this expedition offered an insight into the probable future of AI-assisted programming. With the support of AI, developers’ productivity could reach new heights, enabling the generation of significantly larger volumes of code. However, it’s essential to bear in mind that AI-generated code is not inherently bug-free. To ensure the robustness of our software infrastructure in the face of an influx of new code, a proactive approach is needed. Rust, with its prowess in minimizing low-level bugs, is a formidable tool for this task. To navigate the realm of high-level bugs, we must rely on formal methods like TLA+ and other formal verification techniques.</p>

<h3 id="why-paxos">Why Paxos</h3>

<p>Paxos is one of the most fundamental protocols anchoring many distributed systems. A simple implementation of Paxos might involve two proposers each trying to commit their own value among three acceptors. This process involves concurrency, as the requests from the proposers may compete with each other at any of the acceptors in an arbitrary sequence. Furthermore, the process involves asynchrony. When a proposer sends requests to the acceptors, the responses might return in any order or not at all, in the event of network issues. Implementing Paxos provides an opportunity to thoroughly comprehend both the concurrency and asynchrony aspects of Rust, making it an appealing choice for my inaugural project.</p>
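
<p>For readers less familiar with the protocol’s moving parts, the sketch below shows roughly the shape of the core data structures in single-decree Paxos. It is a simplified illustration for orientation only, not the code that came out of the ChatGPT session.</p>

<pre><code class="language-rust">// Simplified single-decree Paxos types, for orientation only.
type Ballot = u64;

// Messages exchanged between a proposer and the acceptors.
#[allow(dead_code)]
enum Message {
    // Phase 1a: the proposer asks acceptors not to accept lower ballots.
    Prepare { ballot: Ballot },
    // Phase 1b: an acceptor promises, reporting any previously accepted proposal.
    Promise { ballot: Ballot, accepted: Option&lt;(Ballot, String)&gt; },
    // Phase 2a: the proposer asks acceptors to accept a value at this ballot.
    Accept { ballot: Ballot, value: String },
    // Phase 2b: an acceptor acknowledges acceptance.
    Accepted { ballot: Ballot, value: String },
}

// Per-acceptor state: the highest ballot promised and the last accepted proposal.
#[allow(dead_code)]
#[derive(Default)]
struct Acceptor {
    promised: Option&lt;Ballot&gt;,
    accepted: Option&lt;(Ballot, String)&gt;,
}

fn main() {
    let msg = Message::Prepare { ballot: 1 };
    if let Message::Prepare { ballot } = msg {
        println!("sending Prepare with ballot {ballot}");
    }
}
</code></pre>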

<h2 id="highlights-of-chatgpt">Highlights of ChatGPT</h2>

<p>The comprehensive ChatGPT session, including numerous interactions around compiler errors, is accessible [<a href="https://chat.openai.com/share/263d8cb4-0001-46de-bbea-d2a07de60f9c">1</a>] for those interested. To offer a snapshot of the expansive spectrum of assistance provided by ChatGPT, I’ve compiled a few standout moments below.</p>

<h3 id="getting-started">Getting started</h3>

<p>Here is the initial prompt! Although ChatGPT’s comprehensive knowledge of Paxos is impressive, what amazed me even more was its ability to infer that I intended to develop Paxos on a key-value store, deduced merely from the project name.</p>

<p>This also implies that, to maximize the potential of Large Language Models like ChatGPT, we should try to translate the problems we’re tackling into concepts that are familiar and easily understood.</p>

<!-- ![getting_started](/assets/images/rkvpaxos_getting_started_dark_v0.jpeg){:width="90%"} -->
<p><img src="/assets/images/rkvpaxos_getting_started_dark.jpg" alt="getting_started" width="90%" /></p>

<h3 id="define-data-structure">Define data structure</h3>

<p>Here are a few examples of me asking ChatGPT to define some basic data structures and tailoring the definitions based on my need.</p>

<p><img src="/assets/images/rkvpaxos_define_data_structure_dark.jpeg" alt="define_data_structure" width="90%" /></p>

<h3 id="program-async">Program async</h3>

<p><img src="/assets/images/rkvpaxos_write_async_method_dark.jpeg" alt="write_async_method" width="90%" /></p>

<h3 id="create-unit-tests">Create unit tests</h3>

<p><img src="/assets/images/rkvpaxos_create_unit_test_for_given_method_dark.jpeg" alt="create_unit_test_for_given_method" width="90%" /></p>

<h3 id="rewrite-as-rust-native">Rewrite as Rust native</h3>

<p><img src="/assets/images/rkvpaxos_pattern_matching_dark.jpeg" alt="pattern_matching" width="90%" /></p>

<h3 id="revamp-implementation">Revamp implementation</h3>

<p>My interaction with ChatGPT led me to realize that ‘channels’ might be a powerful primitive for implementing Paxos. Specifically, consider a scenario where a proposer sends a voting request to three acceptors. As soon as it receives responses from at least two of the three, it can conclude a voting round. This can be implemented elegantly using channels — the proposer simply listens on a channel, with each individual response from different acceptors delivered via the same channel. This way, the proposer can implement a straightforward loop and tally the responses, sidestepping the need for complex concurrency primitives. With this insight, I requested ChatGPT to revamp the implementation.</p>
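
<p>The pattern I had in mind looks roughly like the sketch below. It is a simplified stand-in that uses <code class="language-plaintext highlighter-rouge">std::sync::mpsc</code> and threads instead of the async channels in the actual project, with names invented for illustration.</p>

<pre><code class="language-rust">use std::sync::mpsc;
use std::thread;

// A toy acceptor reply: which acceptor responded and whether it granted the request.
struct Reply {
    acceptor_id: usize,
    granted: bool,
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Spawn three "acceptors"; each delivers its reply on the shared channel.
    for acceptor_id in 0..3 {
        let tx = tx.clone();
        thread::spawn(move || {
            // In a real system this would be the outcome of handling a Prepare/Accept message.
            let _ = tx.send(Reply { acceptor_id, granted: true });
        });
    }
    drop(tx); // the proposer keeps only the receiving end

    // The proposer loops and tallies replies until it reaches a quorum (2 of 3).
    let mut votes = 0;
    for reply in &amp;rx {
        if reply.granted {
            votes += 1;
            println!("acceptor {} granted the request", reply.acceptor_id);
        }
        if votes &gt;= 2 {
            println!("quorum reached");
            break;
        }
    }
}
</code></pre>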

<p>Given that most of my time was spent prompting and editing, this request for a significant rewrite didn’t feel particularly taxing.</p>

<p><img src="/assets/images/rkvpaxos_entire_method_with_updated_design_dark.jpeg" alt="entire_method_with_updated_design" width="90%" /></p>

<h3 id="refactor">Refactor</h3>

<p>By this stage, I had successfully implemented Paxos Phase 1 and Phase 2. Adhering to the principle of prioritizing functionality before optimization, there was some degree of redundancy between the two phase implementations. To my delight, I realized I could simply request ChatGPT to refactor the duplicated code on my behalf.</p>

<p><img src="/assets/images/rkvpaxos_refactor_duplicated_code_dark.jpeg" alt="refactor_duplicated_code" width="90%" /></p>

<h3 id="model-checking">Model checking</h3>

<p>When AWS unveiled their ShardStore paper [<a href="https://www.amazon.science/publications/using-lightweight-formal-methods-to-validate-a-key-value-storage-node-in-amazon-s3">2</a>] at SOSP, the authors open-sourced a model checking framework for Rust, named Shuttle. Having never explored Shuttle before and having little interest in poring over its documentation, I decided to task ChatGPT with explaining its workings. With just a few prompts, I quickly grasped the essence of the framework and understood how I could potentially incorporate it into my own testing regime — a task for another day.</p>

<p><img src="/assets/images/rkvpaxos_model_checking_dark.jpeg" alt="model_checking" width="90%" /></p>

<h2 id="future-of-ai-assisted-programming">Future of AI-assisted programming</h2>

<p>This journey has deeply convinced me of the imminent rise of Rust as a dominating force in infrastructure software development, potentially surpassing C++.</p>

<p>As more developers begin to embrace AI assistance, a tremendous increase in productivity is predicted, facilitating the generation of much larger volumes of code. However, it’s crucial to note that AI-generated code isn’t exempt from bugs. A recent study [<a href="https://arxiv.org/abs/2211.03622">3</a>] revealed that “participants who had access to an AI assistant wrote significantly less secure code” and “were more likely to believe they wrote secure code.” While this study centered around security, it’s reasonable to infer that its conclusions could be applicable to other areas such as availability and reliability.</p>

<p>In the wake of this anticipated surge in productivity, the need to maintain high reliability within our infrastructure software stack is of utmost importance. The escalating demand for stability is likely to elevate Rust as the most desirable choice. This is largely due to Rust’s proficiency in mitigating low-level bugs such as memory leaks, crashes, and race conditions.</p>

<p>However, Rust alone can’t guarantee high-level correctness, such as ensuring safety (preventing data loss in storage systems) or liveness (always enabling users to upload and read blobs). Until LLMs evolve to develop profound reasoning capabilities, these aspects will remain as significant areas of expertise for developers. To address high-level bugs, formal methods, such as TLA+ and formal verification, are increasingly being adopted as effective strategies.</p>

<p>As AI-assisted programming becomes the norm, every developer’s AI co-pilot will become standardized, making them largely interchangeable. Therefore, the ability to tackle high-level correctness issues effectively will emerge as a key differentiator. This skill will be instrumental in maintaining the relevance of human developers in the rapidly evolving landscape of infrastructure software development.</p>]]></content><author><name></name></author><category term="Rust" /><category term="ChatGPT" /><summary type="html"><![CDATA[Introduction]]></summary></entry></feed>