The rise of artificial intelligence coding agents has split the software industry down the middle. Proponents argue they're a productivity multiplier, allowing engineers to tackle more work and solve problems that they'd otherwise have skipped. Opponents argue that "vibe coding" your software leaves you without an understanding of how it works, inevitably biting you down the road when you need to troubleshoot it.

This piece is a high‑level introduction to agentic software engineering for software engineers. It shares practical workflows and configurations that have worked for me, grounded in emerging best practice. I hope it offers useful context for later adopters and guidance for those already experimenting but not yet seeing consistent results, or who find themselves relying on more expensive models to get there. I also aim to dispel the idea that adopting these tools necessarily trades off quality for throughput.

I am intentionally not addressing the ethics of the production of the frontier models as I have nothing more to say. Given the scale of the theft from the commons, I think it’s fair to extract as much value as you can from these VCs.

# Emerging best practice

Some good pre-reads here.

# Spec Driven Development (SDD)

Microsoft has a good write-up of SDD in their introductory post for GitHub's Spec Kit. Addy Osmani builds on this, outlining a similar process in his "How to write a good spec for AI agents" article. Broadly speaking:

  1. Define a high-level specification of functional and non-functional requirements.
  2. Use the generated specification to formulate a technical plan.
  3. Break the plan down into manageable tasks.
  4. Implement the tasks, either sequentially or in parallel.
  5. Reflect on how the process went, and propagate learnings to rules/instructions for future sessions.

Addy describes a practical approach that manages the model's context by using dedicated sub-agents (or sessions), which increases fidelity and reduces the room for confusion.

# O16g

O16g, or Outcome Engineering, is a manifesto that defines 16 principles for agentic software engineering:

# Goals

  1. The Voyage: Human Intent: do not abdicate vision to the machine.
  2. The Truth: Verified Reality is the Only Truth: predict, measure, and prove it works.
  3. The Teamwork: No More Single Player Mode: outcome engineering is a team sport and it requires a defined approach.
  4. The Liberation: The Backlog is Dead: time is no longer the constraint, cost is, so if the outcome is worth the cost you should action immediately.
  5. The Joy: Unleash the Builders: delegate toil, write code only for joy.
  6. The Map: No Wandering in the Dark: "map the territory before building".
  7. The Tech Island: Build It all: code is now cheap, so test your hypotheses and glean knowledge.
  8. The Artifacts: Failures are Artifacts: reflect and learn from negative outcomes.

# The Building

  1. The Orchestration: Agentic Orchestration is a New Org: design an organisational structure for agents as you would for people.
  2. The Law: Code the Constitution: encode mission, vision and goals to clarify intent. Eliminate ambiguity.
  3. The Graph: All the Context, Everywhere: agents operating in a vacuum will make assumptions; grant them access to knowledge so they can validate those assumptions.
  4. The Order: Priorities Drive Compute: compute is reasonably expensive, so optimise and order for outcomes.
  5. The Documentation: Show Your Work: record discoveries, rejected paths, and logic to grant insight into the machine.
  6. The Immune System: Continuous Improvement: reflect on outcomes, prevent repeating failures.
  7. The Gate: Risk Stops the Line: gate on unmitigated/unknown risks.
  8. The Validation: Audit the Outcomes: "verify the tool is sharp before you use it".

# Some terms

  • Agent Harnesses are the exoskeletons for models, exposing tools (read, write, execute, etc.) to them.
  • Agent Client Protocol (ACP) provides a way for agent harnesses to expose their functionality to an editor.
  • Commands are directly invoked prompts that may be templated with arguments, usually exposed as slash-commands (/do-a-barrel-roll).
  • Model Context Protocol (MCP) defines an interface for embedding external tools into agent harnesses, expanding context and capability.
  • Rules are the constraints that govern the behavior of the agent. Think .cursor/rules, CLAUDE.md, AGENTS.md.
  • Skills are similar to commands, but are discovered from a description rather than being directly invoked.

# Remain agile

This space is rapidly evolving, and the models that serve us well today may not serve us so well tomorrow. Set yourself up with tools that you can easily carry between models, and that grant you access to agents in as many modalities as possible.

# Avoid lock-in

You can consume these models through several modalities:

  • Dedicated editors, like Antigravity, Cursor, and Windsurf. These are tied to the hosted models and billing systems of their vendors, and tend to obfuscate choices. They also enforce conventions like CLAUDE.md on you and have their own proprietary implementations of tools, MCP clients, and skills. I'd strongly advise against these black boxes.
  • Online services like GitHub Copilot Workspaces and Claude Code Web, which are entirely tied to their vendors' hosted models and conventions.
  • Dedicated agent harnesses from vendors like Anthropic (Claude Code) and OpenAI (Codex). Although it is possible to configure these for third-party models, I would recommend avoiding these as they also enforce their own conventions.
  • Editor extensions from vendors like Anthropic, GitHub, and OpenAI. These favour their own models and enforce their own conventions.
  • Third-party harnesses like OpenCode and Crush allow easily switching between models on different providers, and offer a way of taking your MCP server configurations, skills and rules with you.

When choosing a harness, consider:

  • Does it have a desktop app?
  • Can you run it as a server, for remote access/sharing sessions?
  • Does it have editor extensions, so you have the agent available where you want to work?
  • Does it have a CLI/TUI app for smaller edits, or polishing up bits of work ahead of checking it in?

Generally, try to favour tools that opt into emerging standards and let you switch models whilst retaining your MCP servers, skills, and rules.

# Choose the best model for the job

Different models are better suited to different tasks, and their capabilities change often. Models can differ on many axes. The ones relevant to us are:

  • Reasoning ability, the ability of a model to construct and follow chains of thought.
  • Context window size, how many tokens a model can handle across the system prompt, user prompts, the context it has read, and its own outputs.
  • Per-call input/output token limits, which constrain how much work a single prompt can do.
  • Cost, on a few dimensions:
    • Input and output token cost.
    • Cache read/write cost, where cache allows future calls to recall context rather than needing all tokens to be re-sent.
  • Temperature control, a dial for tuning entropy in generation. Lower temperatures increase determinism, higher temperatures increase creativity.
  • Knowledge cutoff, the date at which the model's training data ends. Older models will suggest older API and package versions, for instance.
  • Ability to call tools, a hard requirement for a coding agent.
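The temperature dial in the list above can be illustrated with the textbook softmax scaling. This is a minimal sketch, not how any particular provider implements sampling; the logits are made-up scores for three candidate tokens.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in bits; higher means a less predictable choice."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

At a low temperature almost all of the probability lands on the top-scoring token; as the temperature rises the distribution flattens and the entropy goes up.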

models.dev provides a searchable reference for many available models. Knowing which model to reach for is partly skill and partly personal preference, and you will develop a taste for it over time.
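Once you have per-token prices for a model, the cost axes above reduce to simple arithmetic. The sketch below uses hypothetical prices and token counts, and ignores cache write cost for brevity; substitute real figures for whichever model you are evaluating.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    """Prices in dollars per million tokens (the values used below are hypothetical)."""
    input: float
    output: float
    cache_read: float

def call_cost(pricing, fresh_input, cached_input, output):
    """Estimate the dollar cost of one call from its token counts."""
    total = (
        fresh_input * pricing.input
        + cached_input * pricing.cache_read
        + output * pricing.output
    )
    return total / 1_000_000

example = Pricing(input=3.00, output=15.00, cache_read=0.30)

# The same 50k-token context, sent fresh versus mostly recalled from cache.
without_cache = call_cost(example, fresh_input=50_000, cached_input=0, output=2_000)
with_cache = call_cost(example, fresh_input=5_000, cached_input=45_000, output=2_000)
print(f"{without_cache:.4f} {with_cache:.4f}")
```

With these made-up numbers the cached call costs roughly a third of the uncached one, which is why cache pricing is worth checking alongside the headline token prices.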

# Claude

Claude ships in three different flavours:

  • Haiku's speed makes it well-suited to smaller, iterative tasks, like generating boilerplate. Don't write it off for well-specified implementation grunt work.
  • Sonnet is a great general purpose model, striking a good balance between reasoning ability, speed, and cost. This is a good daily driver.
  • Opus is very expensive, but may be better suited to very large contexts and complex architectural tasks. I'd consider extensive/default use of it to be a crutch for poor planning.

# Gemini

Gemini offers two different flavours:

  • Flash models are very fast, but are poor at maintaining context. Good for boilerplate and lightweight planning work, but cross-check work with a more thorough agent. Worse than Haiku.
  • Pro models are good for parsing structured data, general refactoring work. Approximately comparable with Sonnet.

# GPT

OpenAI's models are almost on a par with Claude:

  • 5.1/5.2 Codex is a generally capable model, somewhere between Sonnet and Opus. It can be prone to reasoning for extended periods of time without output if given too broad a remit, so be sure to prompt it carefully.
  • 4o may be useful for frontend grunt work, due to its ability to read images.

# Use a subscription

All of Anthropic's and OpenAI's subscription plans are heavily subsidised and offer much better value for money than token-based billing, at the cost of a mix of short- and long-term rate limits. Prefer them, but understand that at some point the free lunch will end.

# Don't abdicate reasoning

If you allow it, the model will act as a one-armed bandit: you can pull that arm all you want, but know that you are gambling with your output and shouldn't expect a consistent return.

Surrendering your thought process to a large language model will leave you unable to understand and operate the result. Retain your ownership and understanding of the built artifacts, and remain accountable for the artifacts you check in.

# Plan upfront

Specification Driven Development (SDD) has emerged with the likes of GitHub's Spec Kit and Fission's OpenSpec.

To structure my thinking, I've adopted an Architecture Decision Records (ADR) process that frontloads as much thinking and decision making as possible. The process is made up of five stages, all but implementation being actively supervised by a human:

| Stage | Mode | Outputs |
| --- | --- | --- |
| /adr.specify | Supervised | /adrs/<date>-<slug>/spec.md |
| /adr.plan | Supervised | /adrs/<date>-<slug>/plan.md |
| /adr.tasks | Supervised | /adrs/<date>-<slug>/tasks.md |
| /adr.implement | Unsupervised | Code, and updates to /adrs/<date>-<slug>/tasks.md |
| /adr.reflect | Supervised | /AGENTS.md, /adrs/README.md |

Each stage emits an artifact, all of which live in and are versioned with the repository. Any proof of concept work for language/framework/library selection is retained inside the related ADR. Separation of tasks allows parallelisation and shifts supervision from hands-on to review.
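To make the artifact layout concrete, here is a small sketch that derives the per-ADR paths from a title and a date. The ISO date format, the slug rule, and the function names are my assumptions for illustration; the process itself only fixes the /adrs/<date>-<slug>/ shape.

```python
import re
from datetime import date
from pathlib import PurePosixPath

def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))

def adr_paths(title, on, root="/adrs"):
    """Return the artifact path for each document-producing stage of one ADR."""
    # Assumption: <date> is an ISO 8601 date such as 2026-01-15.
    folder = PurePosixPath(root) / f"{on.isoformat()}-{slugify(title)}"
    return {
        "specify": folder / "spec.md",
        "plan": folder / "plan.md",
        "tasks": folder / "tasks.md",
    }

paths = adr_paths("Add rate limiting", date(2026, 1, 15))
print(paths["specify"])  # /adrs/2026-01-15-add-rate-limiting/spec.md
```

Keeping all three documents in one dated folder means a later session can be pointed at a single directory to recover the intent, the plan, and the task state.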

    flowchart TD
    reflectCmd[Reflect]
    tasks@{ shape: doc, label: "Tasks" }

    start([Start]) -->
    specCmd[Spec] -->
    planCmd[Plan] -->
    tasksCmd[Tasks]

    tasksCmd --> implementCmd0[Implement] --> reflectCmd
    tasksCmd --> implementCmd1[Implement] --> reflectCmd
    tasksCmd --> implementCmd2[Implement] --> reflectCmd
    reflectCmd --> finish([Finish])

    specCmd -->|Produces| spec@{ shape: doc, label: "Spec" }
    planCmd -->|Produces| plan@{ shape: doc, label: "Plan" }
    tasksCmd -->|Produces| tasks
    implementCmd0 -->|Updates| tasks
    implementCmd1 -->|Updates| tasks
    implementCmd2 -->|Updates| tasks
    reflectCmd -->|Revises| agents@{ shape: doc, label: "Agents" }

The models can be surprisingly useful rubber ducks during planning -- reason with them before you start writing code.

This is implemented by an OpenCode agent configuration defining a common prompt and a series of commands that consume it.

# Retain planning documents

These ADR documents and the reflections within explain both what the intent of a given feature was, and why implementation decisions were made. They're useful references for humans and agents alike, and they provide a source of truth for intent against which the code can be measured.

Being able to point an agent at prior decisions has proven a very useful exercise.

# Know your harness

# Rules (aka instructions)

Rules provide static context, such as guidance, to an LLM. I'd recommend writing these up in AGENTS.md and directing other agents (Cursor, Claude Code, etc.) to it from their respective files.

I'm not convinced there's a great deal of value to be had here with general purpose rules, given the extensive system and agent harness prompts already in use. Keep your rules short and signpost key documentation about how to develop and run your applications. I treat rules as a routing layer into real docs, not a place to restate everything. What's useful for a new developer is likely also of value to an agent.

  • What is the tech stack?
  • What do I need to install?
    • If you use something like Nix or direnv, how can I invoke commands in a development shell with the dependencies available?
  • How do I build the thing?
  • How do I test the thing?

# Agents

That said, I find a lot of time can be saved by writing your commonly used prompts up as dedicated agents. Most harnesses provide a means of defining agents, which may be either primary (presented to the user as personas they can use in sessions) or sub-agents (usable by commands or skills, but not directly assumable). These agents can have their available tools and permissions restricted.

For instance:

  • Explore this codebase, make no writes.
  • Specify this feature, changing no code.
  • Work exclusively via a sandbox, with limited tools and permissions.
  • Quality assurance engineer, tasked with ensuring adequate code coverage and identifying common faults.
  • Performance optimisation and analysis, maybe with access to your observability tools.
  • Security analyst, with a prompt detailing common vulnerabilities (sourced from, say, OWASP).
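The idea of an agent as a prompt plus a restricted toolset can be sketched independently of any harness. Everything here (the AgentProfile type, its field names, and the tool names) is hypothetical; real harnesses each have their own configuration format for this.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentProfile:
    """A reusable prompt bundled with the tools it is allowed to call."""
    name: str
    prompt: str
    mode: str = "subagent"  # "primary" agents are selectable by the user
    allowed_tools: frozenset = frozenset()

    def may_use(self, tool):
        """Deny by default: only explicitly listed tools are permitted."""
        return tool in self.allowed_tools

explorer = AgentProfile(
    name="explorer",
    prompt="Explore this codebase, make no writes.",
    mode="primary",
    allowed_tools=frozenset({"read", "grep"}),
)

for tool in ("read", "write", "exec"):
    print(tool, explorer.may_use(tool))
```

The deny-by-default check is the important design choice: a new tool added to the harness stays unavailable to the read-only agent until someone deliberately allows it.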

I'm actively extending my use of agents.

# Tools

A model operating in a vacuum is relying on guesswork to understand what it needs to do. Text is a versatile interface, and LLMs are excellent at parsing both structured and unstructured data. We have two primary extension points that are uncoupled from the agent harness.

# exec external tools

External tools require no special implementation beyond ensuring the agent has access to the necessary configuration to invoke them. Allowing agents to execute code directly on your host is not advisable, though.

# MCP servers

MCP lets a local or remote service expose tools to an agent. Tools have a name and a set of arguments.

Unfortunately, since the model must know this interface before it invokes the tools, the schema itself consumes tokens. For this reason, limit MCP tools to only those which are necessary for a given task, and consider using subagents to limit their token consumption.
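To get a feel for that overhead, the sketch below serialises a tool definition and estimates its size. The tool itself (search_issues) is hypothetical, the name/description/inputSchema shape follows my understanding of MCP tool definitions, and the four-characters-per-token figure is a rough rule of thumb rather than a real tokenizer.

```python
import json

def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate using ceiling division; not a real tokenizer."""
    return -(-len(text) // chars_per_token)

# Hypothetical tool definition, shaped like an MCP tool listing entry.
tool = {
    "name": "search_issues",
    "description": "Search issues in the tracker by free-text query.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search terms."},
            "limit": {"type": "integer", "description": "Maximum results."},
        },
        "required": ["query"],
    },
}

def schema_overhead(tools):
    """Estimated tokens spent describing the tools, before any is called."""
    return sum(estimate_tokens(json.dumps(t)) for t in tools)

print(schema_overhead([tool]), schema_overhead([tool] * 40))
```

The cost scales linearly with the number of tools exposed, and it is paid in every session whether or not a tool is ever invoked.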

# Examples

Consider how the model might benefit from:

  • Source control and issue management tools (GitHub, Jira) for access to issues.
  • Observability tools (Coralogix's only redeeming feature is that their MCP server means I no longer have to deal with their awful platform).
  • Documentation RAG, to somewhat extend the knowledge cutoff (rtfmbro, RIP Context7).

I suggest splitting read and write access early on. OpenCode makes it very easy for you to toggle MCP servers on and off, and access to external tools can be gated via permissions.

# Manage context

Conversational interfaces are a bit of a lie: models emulate this by reviewing the entire context (conversation history) on each message. To limit compute complexity, a limit on the size of the context is enforced.
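A quick back-of-the-envelope model shows why this matters. The sketch assumes every turn adds a fixed number of tokens to the history and that the whole history is read again on each turn; it ignores caching, which can lower the price of the repeated tokens but not the space they occupy in the context window.

```python
def tokens_processed(turn_sizes):
    """Total input tokens read across a session when every turn
    re-reads the full history accumulated so far."""
    total = 0
    history = 0
    for size in turn_sizes:
        history += size   # the new turn joins the history
        total += history  # and the model reads all of it again
    return total

one_long_session = tokens_processed([1_000] * 20)
two_short_sessions = 2 * tokens_processed([1_000] * 10)
print(one_long_session, two_short_sessions)  # 210000 110000
```

Twenty turns of 1,000 tokens each add up to only 20,000 tokens of content, yet the model reads 210,000 tokens over the session. Splitting the same work across two fresh sessions nearly halves that.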

The purest ways to address this problem require no support from the harness:

  • Limit your use of external tools in the first place.
  • When reading files, encourage the agent to favour grep over direct reads.
  • Make use of subagents to do complex analysis, and have only the distilled learnings bubble up into the parent session.
  • Start fresh sessions.

Some agent harnesses or plugins for them offer commands like the following to trim the fat:

  • distill, which extracts valuable content from tool output into a concise summary before removing the rest.
  • compress, which summarises sections of the history into single messages.
  • prune, which removes completed/noisy messages from the history.

Other agents may provide session management tools for summarising an existing session into a continuation prompt for a future session.

# Favour starting new sessions

Once a session starts hallucinating or meandering, start a fresh one early rather than squabbling with it. If starting a fresh session feels alarming, consider whether you're relying too heavily on session history rather than more durable artifacts.

# Security

Security Operations organisations everywhere are asleep at the wheel, and I predict it'll start having some fairly catastrophic consequences this year. Allowing a model unfettered access to your system (the default for many harnesses, and a very rapidly learned behaviour given the constant stream of permission prompts) is irresponsible.

# Securing secrets

Secrets will appear in many places:

  • Your application will need some at runtime to authenticate with other services or data stores.
  • MCP servers may require bearer tokens for authentication via OAuth.
  • Command-line tools (e.g. GitHub's CLI gh) need bearer tokens too.

If an agent reads them directly, they end up in the session context and potentially in vendor logs, prompts, and future completions. You should assume anything in context can leak. There are some best practices you can follow right now to minimise this risk:

  • Avoid printing secrets in your application and any supporting bootstrapping scripts. This is something you should already be doing.
  • Instead of using .env files, store your credentials in a password manager, and automate sourcing them into your shell using its CLI, maybe using direnv. This will help avoid rogue read and grep operations picking up secrets.
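The second point amounts to injecting secrets into a process environment at launch rather than leaving them in files an agent can read. Here is a minimal sketch of that pattern; run_with_secrets and fake_fetch are hypothetical names, and in practice the fetch function would wrap your password manager's CLI.

```python
import os
import subprocess
import sys

def run_with_secrets(cmd, secret_refs, fetch):
    """Run cmd with secrets present only in the child's environment."""
    env = dict(os.environ)
    for name, ref in secret_refs.items():
        env[name] = fetch(ref)  # resolved at launch, never written to disk
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)

def fake_fetch(ref):
    # Stand-in for a password manager lookup (hypothetical).
    return "s3cr3t"

# The child reports only the secret's length, never its value.
child = [sys.executable, "-c", "import os; print(len(os.environ['DEMO_API_TOKEN']))"]
result = run_with_secrets(child, {"DEMO_API_TOKEN": "vault/item/field"}, fake_fetch)
print(result.stdout.strip())  # 6
```

The secret exists in the child process's environment but not in the parent's, and not in any file that a stray read or grep could pick up.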

# Sandboxing

Most harnesses are adopting Git worktrees to provide each agent its own working copy to persist changes to. This allows parallelisation of work, but it does nothing to protect the host machine from rogue file and execute operations.
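For reference, giving each task its own worktree and branch comes down to one git command per task. The sketch below builds that command; the agent/<task> branch naming and the sibling-directory layout are my own assumptions, not something any harness prescribes, and PurePosixPath is used only to keep the example deterministic.

```python
import subprocess
from pathlib import PurePosixPath

def worktree_command(repo, task):
    """Build the git command that creates a sibling worktree on a new branch."""
    repo_path = PurePosixPath(repo)
    target = repo_path.parent / f"{repo_path.name}-{task}"
    return [
        "git", "-C", str(repo_path),
        "worktree", "add",
        "-b", f"agent/{task}",  # hypothetical branch naming scheme
        str(target),
    ]

def add_worktree(repo, task):
    """Create the worktree; raises CalledProcessError if git fails."""
    subprocess.run(worktree_command(repo, task), check=True)

print(" ".join(worktree_command("/src/app", "task-1")))
```

Note that the new directory is an ordinary checkout on the host filesystem, so it isolates changes between agents without restricting what any of them can read or execute.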

Additionally, each harness has chosen its own dubiously useful means of filtering commands that may be run via their exec tools:

  • Claude Code supports a notion of hooks, where you can pattern match on tool invocations.
  • Codex has permissions, built around some extremely basic presets.
  • OpenCode has permissions.

They're all too clumsy and too coarse to be useful. By contrast, sandboxing both limits the blast radius of a failure and lets you parallelise implementation, with merges later.

There aren't many tools in this space yet:

  • Dagger's Container Use is the most mature at the moment. It provides an MCP server that launches Docker containers based on a configuration file in each repository, copying source code into them. You configure the harness to deny access to all write tools except those provided by Container Use, and the agent writes changes via the MCP server's tools rather than the harness's tools. You must invoke container-use merge <env> to accept changes into your repository.
  • Litterbox is my own attempt, which is broadly similar but aims to seamlessly update your Git index on every change to the sandbox.

I'm not aware of any established tools that extend to egress traffic and secrets management.

# Consider dictation

This one's simple, but Handy saves a bunch of time expressing complex concepts, especially when specifying new features, and also makes for some super neat demos. I enable push-to-talk and bind it to Cmd + Shift + . (the period key).

# What's next

# Extending the sandbox to secrets

Ideally, I don't think agents should have the ability to see any secrets. This would need to take several forms:

  • Limiting network access to prevent exfiltration of secrets.
  • Completely removing access to secrets from the agent itself, by having the agent make unauthenticated requests to an external proxy, which in turn would authenticate the requests and relay them to the intended destination. This is already the case for MCP servers, but additional work is required to extend this to tools like webfetch.
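The core of such a proxy is small: check the destination against an allow-list, discard whatever credentials the agent sent, and attach the real ones. This is only a sketch of that decision logic under my own assumptions (bearer tokens, one credential per host, a hypothetical authorise function); it is not a working proxy and not how Litterbox or any existing tool does it.

```python
from urllib.parse import urlsplit

def authorise(url, headers, credentials):
    """Return the headers to send upstream, or raise if egress is not allowed."""
    host = urlsplit(url).hostname
    if host not in credentials:
        # Allow-list: unknown destinations are refused outright.
        raise PermissionError(f"egress to {host!r} is not allowed")
    # Drop anything the agent supplied, then inject the real credential.
    upstream = {k: v for k, v in headers.items() if k.lower() != "authorization"}
    upstream["Authorization"] = f"Bearer {credentials[host]}"
    return upstream

credentials = {"api.example.com": "real-token"}  # held by the proxy only
sent = authorise(
    "https://api.example.com/v1/items",
    {"Accept": "application/json", "authorization": "Bearer guessed"},
    credentials,
)
print(sent)
```

Because the agent only ever issues unauthenticated requests, the token never enters its context, and the allow-list doubles as the egress restriction from the first point.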

I've yet to figure this one out, but it's likely a future focus of Litterbox.

# Agent teams

Agent teams build on subagents, defining routes for message passing between separate agents, allowing for collaboration. I've not yet started exploring this, but I can see huge potential for automating some supervision tasks with it. There's some progress being made to port Claude Code's agent teams implementation to OpenCode at the moment.

# Easier collaboration

How do you take a local session and share it with your team? OpenCode has session sharing built in, but this doesn't allow interactive use. Likewise, running an agent in a server configuration on a developer workstation raises security concerns.

# Look to localise

Many organisations are unwilling to entrust organisations in foreign countries with their source code.

All of Anthropic's and OpenAI's subscription plans are heavily subsidised, and their pricing appears completely unsustainable.

If we wish to continue using these tools, I think we need to put a lot more thought into how we transition away from the frontier models to smaller, more focused ones that we're able to run locally. With GPU prices what they are at the moment, I'm putting this on the backburner for now.

# Resources