Can a clear set of expectations change the way teams and machines build software?
Over the past few months we pushed the limits of AI-assisted development by pairing Cursor and Claude Code on practical projects. We focused on defining requirements before writing any code to reduce confusion and speed up delivery.
We treat an LLM like a teammate: we give it precise details, explain edge cases, and keep feedback loops tight. This approach has cut back on hallucinated features and made our daily work more predictable.
In this guide we share our refined process. We show how clear communication and strict expectations help us build reliable, maintainable software while still moving fast in development.
Key Takeaways
- Define requirements up front to guide model-led coding efforts.
- Provide granular details to reduce unwanted outputs.
- Treat LLMs as collaborative partners for smoother work.
- Strict expectations improve reliability and maintainability.
- Our process balances speed and high quality in development.
The Challenges of AI-Assisted Development
Fast model output sounds great until it creates more work for the team.
We found that faster model output can create more overhead than help when teams try to absorb large code dumps. That overload makes review shallow and leaves gaps in quality.
Common LLM Pitfalls
One big problem is that an LLM often produces long stretches of code quickly. Developers then face review fatigue and miss edge cases.
Without clear constraints, the model leaks implementation details into core logic. That complicates maintenance and makes future tests harder to write.
The Problem of Implementation-First Coding
When we let implementation lead, features end up fragile. The model skips checks that a proper test suite would catch.
- Pressure to accept code can let unvetted changes slip in.
- Time saved at first is often lost fixing issues later.
- We need strict constraints and clear details to keep control of development.
Why We Use BDD with Claude to Improve Our Workflow
We began shaping features as readable scenarios that both people and models can act on. This change makes intent explicit and reduces time spent clarifying goals.
Our adoption grew out of the Cucumber tradition and Gherkin syntax. Those tools give us a clear, structured way to write acceptance criteria. We write specs that are easy for humans to read and for LLMs to parse.
We found that a standardized specification format helps our teams collaborate faster. It makes expectations measurable and avoids vague tickets.
- Clear acceptance criteria: Models follow rules; humans review outcomes.
- Startup-friendly: Fast planning without losing quality.
- Predictable results: Less ambiguity, fewer surprises in delivery.
| Specification | Human Readability | Machine Parsability |
|---|---|---|
| Gherkin scenario | High | High |
| Informal ticket | Medium | Low |
| Automated test | Low | High |
Bridging the Gap Between Requirements and Code
We translate human requirements into machine-readable scenarios to keep intent clear during development.
Leveraging Gherkin Syntax
We use Gherkin as a clear format that links requirements to working code. It reads like plain language and maps to automated tests.
Why this works:
- Gherkin lets us describe user behavior instead of implementation details, so tests stay valid when architecture shifts.
- The yurenju/llm-bdd-coding-demo repo shows how Model Context Protocol (MCP) tooling can verify development results end-to-end.
- We let the LLM drive browsers and emulators as tools to confirm that a feature meets its acceptance criteria.
The logic is simple: a shared format creates a contract the model must satisfy. This reduces manual glue code and shortens the feedback loop.
Result: every test and code change ties back to a requirement. That makes development predictable and keeps testing practical for the whole team.
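To make the contract concrete, here is a minimal sketch of one scenario and its step definitions. It is illustrative rather than taken from the demo repo: we assume cucumber-js (@cucumber/cucumber) as the runner and a hypothetical `Cart` module.

```typescript
// features/cart.feature (Gherkin, shown as a comment for context):
//
//   Feature: Shopping cart totals
//     Scenario: Adding an item updates the total
//       Given an empty cart
//       When I add a "notebook" priced at 12.50
//       Then the cart total should be 12.50

// features/steps/cart.steps.ts — step definitions binding the scenario to code.
import { Given, When, Then } from "@cucumber/cucumber";
import assert from "node:assert";
import { Cart } from "../../src/cart"; // hypothetical domain module

let cart: Cart;

Given("an empty cart", function () {
  cart = new Cart();
});

When("I add a {string} priced at {float}", function (name: string, price: number) {
  cart.add(name, price);
});

Then("the cart total should be {float}", function (expected: number) {
  assert.strictEqual(cart.total(), expected);
});
```

Because each step names observable behavior rather than internal structure, the scenario keeps its meaning even if `Cart` is later reimplemented.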
Overcoming Cognitive Overload with Incremental TDD

To keep review feasible, we split each requested feature into tiny, testable pieces that the model could handle.
We combine our scenario-first style with incremental TDD cycles from the yurenju/cursor-tdd-rules template. This keeps model output small and focused. The LLM breaks a feature into components we can verify in one pass.
In the red phase we add tests and make sure each new test fails. That confirms the testing framework and points out gaps in requirements early.
Before any implementation, we ask the LLM to write only test descriptions. This lets us review granularity and catch issues before code arrives. It also reduces cognitive load on reviewers and shortens review time.
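As an illustration, the artifact we review at this point can be as small as a list of pending test names. Below is a minimal sketch using Vitest (Jest's `it.todo` behaves the same way); the feature and the test names are hypothetical.

```typescript
// cart.test.ts — test descriptions only, reviewed before any implementation exists.
import { describe, it } from "vitest";

describe("Cart totals", () => {
  // it.todo registers a pending test: a name, no body, no code yet.
  it.todo("starts with a total of zero");
  it.todo("adds a single item's price to the total");
  it.todo("sums the prices of several items");
  it.todo("rejects items with a negative price");
});
```

Reviewing the list first lets us debate granularity and spot missing edge cases, such as the negative-price rule, before any implementation arrives.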
| Step | Goal | Outcome | Key benefit |
|---|---|---|---|
| Divide feature | Smaller scope | Focused tests | Lower review burden |
| Red phase | Failing tests | Validated test framework | Early gap detection |
| Green phase | Implement one test | Targeted code | Higher code quality |
Architectural Isolation Through Subagents
We solved a persistent coordination issue by building an architectural layer that keeps responsibilities separate. Our system assigns narrow roles so every agent only sees the context it needs. This design keeps the work focused and predictable.
Context Pollution Explained
Context pollution is a common problem when implementation plans bleed into test design. That leak lets a model anticipate code, which weakens test objectivity.
We fixed this by assigning each skill to its own subagent. The test writer no longer knows implementation details, so tests stay honest.
Benefits of Isolated Contexts
Isolated agents prevent cheating: they enforce true test-first behavior and reduce hidden assumptions.
- Each skill runs in a sealed environment, improving focus and reliability.
- Giving agents only the exact details they need reduced accidental coupling.
- Our structure scales: the system routes complex work to specialized agents and keeps projects manageable.
Implementing the Red-Green-Refactor Cycle
Every new requirement starts as a failing assertion so we can locate gaps before code appears.
We strictly enforce the Red-Green-Refactor cycle: a test must fail before any implementation begins. This rule keeps our TDD rhythm clear and predictable.
In the red phase, a focused subagent writes a test that defines the requested behavior and the logic it must satisfy. That failing test shows precisely what is missing.
During the green phase we write the minimal code to make the test pass. We target only the required implementation so each feature stays small and reviewable.
In refactor we improve structure and readability while keeping all tests green. This step protects the framework and prevents regressions.
- Fail first: prevents assumptions from seeping into tests.
- Minimal code: avoids unnecessary implementation.
- Always backed: every change ships with tests that prove it works.
Our approach ensures testing drives development, not the other way around. By repeating this cycle we keep quality high, reviews fast, and the project steady.
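Here is a minimal sketch of one full cycle, using Vitest and a hypothetical `sumPrices` helper; the sequence is the point, not the specific function.

```typescript
// pricing.test.ts
// Red: the test is written first and must fail, because pricing.ts does not exist yet.
import { describe, it, expect } from "vitest";
import { sumPrices } from "./pricing";

describe("sumPrices", () => {
  it("returns 0 for an empty list", () => {
    expect(sumPrices([])).toBe(0);
  });

  it("adds item prices together", () => {
    expect(sumPrices([12.5, 7.5])).toBe(20);
  });
});
```

```typescript
// pricing.ts
// Green: the minimal implementation that makes both tests pass, and nothing more.
export function sumPrices(prices: number[]): number {
  return prices.reduce((total, price) => total + price, 0);
}

// Refactor: with the tests green we can rename, extract, or simplify freely;
// any regression immediately turns a test red again.
```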
Automating Skill Activation with Hooks
We automated how skills start so the model treats activation as an explicit decision.
We use hooks in Claude Code to inject instructions at defined lifecycle points. This ensures our TDD skill triggers reliably and in the correct phase.
Using a forced evaluation hook increased our skill activation rate from 20% to 84% by requiring the model to state whether a skill is needed before it proceeds.
Forced Evaluation Techniques
How it works:
- Hooks run before every prompt, so the model evaluates required behavior early.
- The forced check asks the model to declare if a skill should activate.
- When the model affirms a skill, we attach a structured test-first format to that session.
This automated activation keeps the model focused on the right behavior and reduces context drift. It also preserves our test format and structure across projects.
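As an illustration, a forced evaluation hook can be a small script whose standard output is injected into the session before the model answers. The sketch below assumes a Node script attached to Claude Code's prompt-submission hook and registered in `.claude/settings.json`; the exact event names and payload shape should be checked against the current hooks documentation, and the wording of the check is our own.

```typescript
#!/usr/bin/env node
// forced-evaluation-hook.ts — makes the model state whether the TDD skill
// applies before it starts working on the request.
import { readFileSync } from "node:fs";

// The hook payload arrives as JSON on stdin; we rely only on an optional
// `prompt` field, since the exact shape may differ between versions.
const payload = JSON.parse(readFileSync(0, "utf8")) as { prompt?: string };

// Skip the check for trivial prompts such as "yes" or "continue".
if ((payload.prompt ?? "").trim().length < 20) {
  process.exit(0);
}

// Whatever the hook writes to stdout is added to the model's context for this turn.
process.stdout.write(
  [
    "Before responding, state explicitly whether the TDD skill applies to this request.",
    "If it applies, begin with the red phase: propose failing tests only, no implementation.",
    "If it does not apply, say why in one sentence and continue normally.",
  ].join("\n"),
);
```

The length threshold and the instruction text are tunable; what matters is that skill activation becomes an explicit decision the model makes on every prompt rather than something it may or may not remember.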
| Metric | Before Hooks | After Hooks |
|---|---|---|
| Skill activation rate | 20% | 84% |
| Consistent test output | Low | High |
| Manual reminders needed | Frequent | Rare |
| Phase alignment (red/green) | Inconsistent | Consistent |
Validating Quality with Acceptance Testing

Before we consider a feature done, we run acceptance tests that describe the desired behavior in plain terms.
We use acceptance testing to confirm that our implementation matches the user’s intent. Each test targets a single, observable behavior so reviews stay focused and fast.
Acceptance checks run in a separate phase. That separation prevents premature implementation from masking missing requirements. It also makes it simple to fail fast and iterate.
Following Robert C. Martin’s approach, we run two test streams to constrain the model and improve code quality. One stream pins the contract; the other verifies execution.
- Clear plain-language format for each test so the model can self-verify outcomes.
- One behavior per test to keep the suite maintainable.
- Acceptance tests act as the final check that the implementation is complete.
Result: reliable testing that ties requirements to working code. Learn more about how we applied this in practice in our acceptance tests.
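Our reading of those two streams, sketched below with Vitest and hypothetical `registerUser` and `normalizeEmail` functions: the first test pins the externally visible contract in plain language, while the second verifies an execution detail the contract leaves open.

```typescript
// registration.test.ts — two complementary test streams for one feature.
import { describe, it, expect } from "vitest";
import { registerUser, normalizeEmail } from "./registration"; // hypothetical module

describe("registering a user (contract stream)", () => {
  // Acceptance-style: one observable behavior, named in plain language.
  it("a new user can sign up with a unique email address", async () => {
    const result = await registerUser("Ada@Example.com");
    expect(result.status).toBe("registered");
  });
});

describe("registering a user (execution stream)", () => {
  // Unit-style: verifies one detail of how registration is carried out.
  it("normalizes the email address to lower case before storing it", () => {
    expect(normalizeEmail("Ada@Example.com")).toBe("ada@example.com");
  });
});
```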
Scaling Development with Specialized Agent Teams
To scale reliably, we organized focused agent teams that mirror a small engineering org.
We assign each agent a narrow skill and a strict set of constraints. The team lead coordinates priorities and preserves the overall structure.
Defining Agent Roles
Our implementer runs the TDD cycle and writes minimal code for one failing test at a time. A spec writer drafts clear behavior so a user story becomes an executable specification.
Other agents review output, manage integration, and keep the system aligned to the project timeline.
Mutation Testing for Reliability
As a final verification, we run mutation testing. It injects deliberate errors into the code to check that our tests actually fail when they should.
Result: a higher quality test suite and confidence that tests catch real problems before release.
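To make the idea concrete, the sketch below does by hand what a mutation testing tool automates. The `applyDiscount` function is hypothetical; in practice a tool generates and runs the mutants for us.

```typescript
// discount.test.ts — a hand-made illustration of what mutation testing checks.
import { describe, it, expect } from "vitest";

// Implementation under test (hypothetical).
function applyDiscount(price: number, percent: number): number {
  return price - price * (percent / 100);
}

// A mutation tool would generate small variants ("mutants") of this function,
// for example flipping the '-' to '+', and re-run the whole suite on each one.

describe("applyDiscount", () => {
  // This assertion kills the operator-flip mutant: with '+' the result becomes
  // 110, the test fails, and that failure is exactly the evidence we want.
  it("reduces a 100.00 price by 10% to 90.00", () => {
    expect(applyDiscount(100, 10)).toBe(90);
  });

  // A looser assertion such as toBeGreaterThan(0) would let the mutant survive,
  // flagging this test as too weak to protect the behavior.
});
```

A surviving mutant points at an assertion to tighten or a missing test to add.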
- Clear roles reduce review time and keep implementation focused.
- Strict constraints prevent context leaks and inconsistent output.
- Mutation testing proves test strength and protects long-term quality.
| Role | Primary focus | Key outcome |
|---|---|---|
| Team lead | Orchestration | Consistent development flow |
| Implementer | Code & test | Small, reviewable changes |
| Spec writer | Behavior & format | Clear, machine-readable tests |
Refining Our Collaborative Future with AI
We believe precise behavior definitions let teams and machines build better together.
By tightening how we describe intent and manage context, we make each LLM session more predictable. Our ongoing work improves skill activation and keeps daily work focused on the right outcomes.
We stay current on new LLMs and tools so the system can deliver reliable code and faster development cycles. Our commitment to clear behavior, small testable steps, and strong skill boundaries keeps quality high.
We encourage teams to learn these practices and consult an AI tools guide to explore practical options and extend their own workflows.


