<h1>How to Build an AI Skill for Diagnosing Flaky Tests</h1>

<h2>Introduction</h2><p>If you've spent any time in software development, you've likely encountered flaky tests—those unpredictable failures that drive teams crazy. They undermine trust in your test suite and waste countless hours. But what if you could teach an AI agent to systematically hunt down the root cause? With AI Agent Skills—reusable instruction sets for AI—you can. This guide walks you through creating a skill that empowers your AI to diagnose flaky tests methodically. We'll use a real-world example: a TOCTOU (time-of-check to time-of-use) bug causing duplicate invoice numbers in a Spring Boot webshop. By the end, you'll have a working skill that turns your AI into a flaky-test detective.</p><h2>What You Need</h2><ul><li><strong>An AI Agent Platform</strong> that supports Skills (e.g., a custom LLM integration or a tool like LangChain).</li><li><strong>Access to the source code</strong> of the project you want to debug (the example uses a Spring Boot webshop).</li><li><strong>A flaky test case</strong>—ideally one that fails sporadically. For our example, it's <code>InvoiceServiceTest.firstTwoOrdersGetInvoiceNumbersOneAndTwo</code>.</li><li><strong>Familiarity with Java, Spring Boot, and concurrent programming</strong> concepts (or the equivalent in your language).</li><li><strong>Developer tools</strong>: a debugger, log analysis tools, and a CI/CD pipeline to reproduce flaky behavior.</li></ul><h2>Step-by-Step Guide</h2><h3 id="step1">Step 1: Understand the Nature of Flaky Tests</h3><p>Before writing any skill, you must grasp what makes a test flaky. 
Flaky tests often stem from non-deterministic behaviors like race conditions, network timeouts, or resource contention. In our example, the test <code>firstTwoOrdersGetInvoiceNumbersOneAndTwo</code> places two orders concurrently (via <code>CompletableFuture</code>) and expects unique invoice numbers. The bug is a TOCTOU issue: the invoice service reads the last number and then increments it, but another thread can slip in between the read and the write, causing duplicates. The test passes or fails randomly depending on thread scheduling.</p><p>Your AI skill needs to recognize such patterns. Begin by documenting the common causes of flakiness in your environment (e.g., timing dependencies, shared mutable state). This knowledge becomes part of the Skill's context.</p><h3 id="step2">Step 2: Define the Skill's Purpose and Scope</h3><p>Decide exactly what your AI Skill will do. For our case: "Given a flaky test report and source code, identify the root cause by analyzing race conditions, shared state, and concurrency patterns." Keep the scope narrow to avoid overwhelming the AI. Write this as a clear one-sentence objective in the Skill document.</p><h3 id="step3">Step 3: Structure the Skill Document</h3><p>An AI Skill is a plain text file with a consistent format. Use these sections:</p><ol><li><strong>Title and Description</strong></li><li><strong>Input Requirements</strong> (e.g., test name, code file paths, logs)</li><li><strong>Analysis Steps</strong> (the core procedure)</li><li><strong>Output Format</strong> (e.g., a JSON report with root cause, confidence, reproduction steps)</li></ol><p>For our example, the analysis steps should include: check for concurrent execution (like <code>CompletableFuture</code>), inspect shared resources (e.g., invoice number generation), verify atomicity of read-modify-write operations, and suggest fixes.</p><h3 id="step4">Step 4: Write the Core Diagnosis Logic</h3><p>This is the heart of the Skill. 
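</p><p>To make the checklist concrete, here is a minimal sketch of the pattern the Skill must catch. This is a hypothetical reconstruction (class and method names are assumptions), not the project's actual code:</p>

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the racy check-then-act counter behind the
// duplicate invoice numbers.
class RacyInvoiceService {
    private long lastNumber = 0;

    long getNextInvoiceNumber() {
        long next = lastNumber + 1; // time of check: read the last number
        // A second thread can run both steps here and read the same
        // lastNumber...
        lastNumber = next;          // time of use: write it back
        return next;                // ...so both calls may return the same value.
    }
}

public class FlakyDemo {
    public static void main(String[] args) throws Exception {
        RacyInvoiceService service = new RacyInvoiceService();
        // Mirrors the flaky test: two checkouts race for an invoice number.
        CompletableFuture<Long> first =
                CompletableFuture.supplyAsync(service::getNextInvoiceNumber);
        CompletableFuture<Long> second =
                CompletableFuture.supplyAsync(service::getNextInvoiceNumber);
        // Usually 1 and 2; under an unlucky schedule both threads read the
        // same lastNumber, which is exactly why the test only fails sometimes.
        System.out.println(first.get() + " " + second.get());
    }
}
```

<p>The window between the read and the write is the TOCTOU gap the following analysis steps are designed to expose.</p><p>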
In bullet points, describe what the AI must look for:</p><ul><li><strong>Identify concurrent operations</strong>: Look for multi-threading constructs (e.g., <code>@Async</code>, <code>CompletableFuture</code>, threads).</li><li><strong>Spot shared mutable state</strong>: Find variables or objects accessed from multiple threads without synchronization.</li><li><strong>Check atomicity</strong>: Does the code check a condition (e.g., <code>getLastNumber()</code>) and then act (e.g., <code>setLastNumber()</code>) in a way that can be interrupted? That's a TOCTOU bug.</li><li><strong>Reproduce the flakiness</strong>: Suggest increasing thread count or adding intentional delays to force failure.</li></ul><p>Provide concrete examples from your project. For the invoice service, point to the <code>InvoiceService</code> class where <code>synchronized</code> blocks are missing.</p><h3 id="step5">Step 5: Integrate Developer Tools</h3><p>An AI alone isn't enough. Your Skill should instruct the AI to leverage tools like:</p><ul><li><strong>Static analyzers</strong> (e.g., SpotBugs, the successor to FindBugs) to detect race conditions.</li><li><strong>Log analysis</strong> to correlate failures with timestamps.</li><li><strong>Debugger breakpoints</strong> to pause threads at critical sections.</li></ul><p>In the Skill, include commands or API calls the AI can execute to run these tools. 
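</p><p>The Skill can also instruct the AI to generate a small stress harness that makes the race fire far more often than the original two-thread test. Here is a sketch under the same assumptions (hypothetical names, racy counter inlined for self-containment):</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stress harness: hammer the racy counter from many threads
// and count how many distinct values come back.
public class RaceStress {
    static class RacyCounter {
        private long last = 0;
        long next() { return ++last; } // same non-atomic read-modify-write
    }

    public static void main(String[] args) throws Exception {
        RacyCounter counter = new RacyCounter();
        int calls = 10_000;
        ExecutorService pool = Executors.newFixedThreadPool(64);
        Set<Long> seen = ConcurrentHashMap.newKeySet();
        List<Future<?>> pending = new ArrayList<>();
        for (int i = 0; i < calls; i++) {
            pending.add(pool.submit(() -> seen.add(counter.next())));
        }
        for (Future<?> f : pending) {
            f.get(); // wait for every call to finish
        }
        pool.shutdown();
        // Fewer distinct values than calls means duplicates were produced.
        System.out.println("distinct: " + seen.size() + " of " + calls);
    }
}
```

<p>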
For instance: "Run <code>mvn spotbugs:check</code> and examine the output for multithreaded-correctness warnings such as <code>IS2_INCONSISTENT_SYNC</code> or <code>AT_OPERATION_SEQUENCE_ON_CONCURRENT_ABSTRACTION</code>."</p><h3 id="step6">Step 6: Test the Skill on the Example Project</h3><p>Load the example webshop project (see <a href="#example-project">Example Project</a>). Feed the flaky test report to your AI with the Skill activated. The AI should:</p><ul><li>Recognize the two concurrent <code>CompletableFuture</code> calls.</li><li>Trace the <code>checkout</code> method to <code>InvoiceService.getNextInvoiceNumber()</code>.</li><li>Identify the non-atomic read-modify-write.</li><li>Recommend using <code>synchronized</code> or <code>AtomicInteger</code>.</li></ul><p>Iterate until the AI consistently produces accurate diagnoses.</p><h3 id="step7">Step 7: Refine and Expand the Skill</h3><p>After initial success, add more root causes (e.g., network flakiness, database contention). Update the Skill document with new patterns. Also include <strong>remediation steps</strong> for each cause, so the AI can suggest fixes. 
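</p><p>A remediation entry can pair the diagnosis with a sketch of the corrected code. Under the same assumptions as before (hypothetical names), an <code>AtomicLong</code> collapses the read-modify-write into one atomic step:</p>

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical remediation sketch: incrementAndGet() performs the read and
// the write as a single atomic operation, so two concurrent checkouts can
// never observe the same value.
class FixedInvoiceService {
    private final AtomicLong lastNumber = new AtomicLong(0);

    long getNextInvoiceNumber() {
        return lastNumber.incrementAndGet();
    }
}

public class FixedDemo {
    public static void main(String[] args) throws Exception {
        FixedInvoiceService service = new FixedInvoiceService();
        CompletableFuture<Long> first =
                CompletableFuture.supplyAsync(service::getNextInvoiceNumber);
        CompletableFuture<Long> second =
                CompletableFuture.supplyAsync(service::getNextInvoiceNumber);
        long a = first.get();
        long b = second.get();
        // The two numbers are always distinct: 1 and 2 in some order.
        System.out.println(Math.min(a, b) + " " + Math.max(a, b)); // prints "1 2"
    }
}
```

<p>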
For our example, the fix is to make <code>getNextInvoiceNumber</code> atomic via synchronization or <code>AtomicLong</code>.</p><h2>Tips for Success</h2><ul><li><strong>Start Small</strong>: Focus on one type of flakiness (like concurrency) before generalizing.</li><li><strong>Use Clear Language</strong>: Avoid ambiguous terms; define technical jargon in the Skill.</li><li><strong>Include Negative Examples</strong>: Show cases where a test is <em>not</em> flaky to sharpen the AI's detection.</li><li><strong>Version Control Your Skill</strong>: Treat the Skill document like code—track changes and review updates.</li><li><strong>Combine with Human Review</strong>: Let developers validate the AI's findings before acting on them.</li><li><strong>Monitor Performance</strong>: Keep a log of accuracy and false positives, and adjust the Skill accordingly.</li></ul><p>By following these steps, you'll transform your AI agent into a reliable debugger for flaky tests, saving your team time and frustration. Ready to give it a try? Start with the <a href="#step1">first step</a> and build your Skill today.</p>