
Harness Engineering

Harness LLMs Best Practices

AI Harness Engineering is the discipline of building the system, constraints, and infrastructure around an AI model to make it reliable and useful in production.

It Answers Questions Like:

Levers

1. Input Levers

Controls that shape the model's knowledge, permissions, and available tools before execution starts.

  • Initial Prompts
  • System Prompt
  • Context
  • Tools / MCPs / Skills / Commands
  • Sub-Agents
  • Hooks
  • Orchestration Logic

2. Execution Levers

Runtime controls governing orchestration patterns, context refinement, and safety boundaries.

  • Orchestration
  • Parallel / Sequential / Loop Agents
  • Dynamic Context Refinement
  • Safe Execution Sandboxes
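The sequential, parallel, and loop patterns listed above can be sketched in a few lines. This is a minimal illustration, assuming each agent is just a function from a task string to a result string; the `Agent` alias and the `run_*` names are hypothetical and do not come from any framework.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical stand-in: an "agent" is any function from task text to result text.
Agent = Callable[[str], str]


def run_sequential(agents: list[Agent], task: str) -> str:
    """Feed each agent's output to the next one."""
    for agent in agents:
        task = agent(task)
    return task


def run_parallel(agents: list[Agent], task: str) -> list[str]:
    """Give every agent the same task and collect results in input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda agent: agent(task), agents))


def run_loop(agent: Agent, task: str, done: Callable[[str], bool], max_iters: int = 5) -> str:
    """Re-run one agent on its own output until done() is true or the cap is hit."""
    for _ in range(max_iters):
        task = agent(task)
        if done(task):
            break
    return task
```

The iteration cap in `run_loop` is the important design choice: without it, an agent that never satisfies `done` would run forever.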

3. Output Levers

Observability and evaluation metrics used to verify, benchmark, and analyze the agent's work.

  • Observability
  • Evaluation
  • Logs
  • Token Cost
  • Test Results

Equipping the LLM

There are some actions that we need to let the LLM know how to handle:


A Review of Core Practices

1. Save Progress Safely

Managing instructions and keeping track of state is a core harnessing requirement. Two types of CLAUDE.md files serve as less frequently updated rule documentation:

For short-term progress, maintain a frequently updated progress.md or task.md file directly in the codebase and reference it in the context.
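As a rough sketch of what maintaining such a file could look like, the helper below appends timestamped checkbox lines and reads back the unfinished ones. The one-line-per-task layout and the function names are assumptions made for illustration, not a format any tool requires.

```python
from datetime import datetime, timezone
from pathlib import Path


def record_progress(path: Path, task: str, done: bool = False) -> None:
    """Append one timestamped checkbox line to the progress file."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    box = "x" if done else " "
    with path.open("a", encoding="utf-8") as f:
        f.write(f"- [{box}] {task} ({stamp} UTC)\n")


def open_tasks(path: Path) -> list[str]:
    """Return the unfinished lines so they can be placed in the context."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line for line in lines if line.startswith("- [ ]")]
```

Calling `record_progress(Path("progress.md"), "write parser")` adds an open item, and `open_tasks(Path("progress.md"))` returns only the lines that still need work.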

For code safety, tracking changes is a classic problem with robust, native solutions. Rather than building custom logic to log changes or file histories, rely entirely on Git branch history, commits, and diffs to track changes, debug regressions, and revert code safely.
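One way a harness might lean on Git for this is sketched below. It assumes `git` is on the PATH and that the agent works inside an existing repository; the helper names are hypothetical, while the Git subcommands themselves (`add -A`, `commit -m`, `revert --no-edit`) are standard.

```python
import subprocess


def checkpoint_commands(message: str) -> list[list[str]]:
    """Build the Git commands that snapshot the whole working tree."""
    return [
        ["git", "add", "-A"],
        ["git", "commit", "-m", message],
    ]


def revert_command(commit: str) -> list[str]:
    """Build the command that undoes one commit by adding a new commit."""
    return ["git", "revert", "--no-edit", commit]


def run_all(commands: list[list[str]], repo_dir: str) -> None:
    """Run each command inside the repository, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=repo_dir, check=True)
```

A harness could call `run_all(checkpoint_commands("agent: step 3"), repo_dir)` after each completed step. Note that `git commit` exits non-zero when there is nothing to commit, so `run_all` raises in that case.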


2. Run Code

Avoid custom scripts and one-off implementations when standard command-line tools can do the job. Shell built-ins and common command-line tools (like cat, grep, find, sed, and jq) are mature, widely available, and well suited to file operations, data extraction, and search.
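To illustrate, a tool wrapper can delegate a search to grep instead of reimplementing it. This is a sketch that assumes grep is installed; `grep_lines` is a hypothetical helper name, and the flags used (`-n` for line numbers, `-F` for fixed-string matching) are standard grep options.

```python
import shutil
import subprocess


def grep_lines(pattern: str, path: str) -> list[str]:
    """Return matching lines of one file as 'N:text', using grep."""
    if shutil.which("grep") is None:
        raise FileNotFoundError("grep is not available on PATH")
    result = subprocess.run(
        ["grep", "-n", "-F", "--", pattern, path],
        capture_output=True,
        text=True,
    )
    # grep exits 0 on a match, 1 on no match, and 2 or more on error.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.splitlines()
```

Passing the arguments as a list rather than a shell string means the pattern is never interpreted by a shell.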


3. Stay Safe

To keep your computer secure from rogue command execution or buggy script loops, agent code must always run within isolated terminal sandboxes. Read our detailed guide on how sandboxing boundaries keep your environment secure: Understanding the Terminal Sandbox for AI Agents.
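The sketch below shows only the lightweight guards a harness might add around command execution: a timeout against runaway loops, a throwaway working directory, and an environment with secret-looking variables filtered out. It is not an isolation boundary and does not replace a real sandbox; `run_guarded` and the variable-name filter are illustrative assumptions.

```python
import os
import subprocess
import tempfile

# Assumed naming convention for sensitive environment variables.
SENSITIVE = ("KEY", "TOKEN", "SECRET", "PASSWORD")


def run_guarded(argv: list[str], timeout_s: float = 10.0) -> tuple[int, str]:
    """Run argv in a scratch directory; return (exit_code, output).

    Returns (-1, "timed out") if the command exceeds the timeout.
    NOTE: this limits damage from accidents; it does not isolate the process.
    """
    env = {k: v for k, v in os.environ.items()
           if not any(s in k.upper() for s in SENSITIVE)}
    with tempfile.TemporaryDirectory() as scratch:
        try:
            result = subprocess.run(
                argv,
                cwd=scratch,        # relative writes land in the scratch dir
                env=env,            # secret-looking variables are not passed on
                capture_output=True,
                text=True,
                timeout=timeout_s,  # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return -1, "timed out"
        return result.returncode, result.stdout + result.stderr
```

A process started this way can still read and write anywhere the user can, which is why these guards belong inside a sandbox rather than in place of one.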


4. Learn and Remember

Rather than forcing the agent to rely solely on static training weights or loading huge manuals into the active context, use Model Context Protocol (MCP) servers to retrieve information on demand.

For example, with Context7 configured as an MCP server, the agent can query live library documentation (e.g. Next.js, Stripe, or React) dynamically. This pulls version-specific API reference docs on demand, which saves tokens and prevents outdated assumptions.
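As a hedged sketch, the snippet below renders the kind of JSON entry many MCP clients use to register a server that is launched as a local command. Both the `mcpServers` key and the `npx -y @upstash/context7-mcp` launch command are assumptions here; check your client's documentation and the Context7 documentation for the exact values.

```python
import json


def mcp_config(name: str, command: str, args: list[str]) -> str:
    """Render a client config that registers one command-launched MCP server."""
    config = {"mcpServers": {name: {"command": command, "args": args}}}
    return json.dumps(config, indent=2)


# Assumed launch command; verify it against the Context7 documentation.
print(mcp_config("context7", "npx", ["-y", "@upstash/context7-mcp"]))
```

The printed JSON would go into whichever configuration file your MCP client reads; that location varies by client.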


5. Don't Get Confused

As conversations stretch, context windows fill up with bulky log traces and old file contents, which degrades output quality and can send the agent into loops. To prevent this, compact your session regularly. Check out our comprehensive guide on context management, pruning, and compaction commands: Context Fix Strategies.
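A very simple form of pruning is sketched below. It assumes a hypothetical message format (dicts with `role` and `content` keys) and uses plain truncation as a stand-in for real summarization, which a production harness would more likely delegate to the model itself.

```python
def compact(messages: list[dict], keep_recent: int = 4, max_chars: int = 200) -> list[dict]:
    """Keep system messages and the newest turns intact; shorten older ones."""
    cutoff = max(len(messages) - keep_recent, 0)
    compacted = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        is_old = i < cutoff and msg["role"] != "system"
        if is_old and len(content) > max_chars:
            dropped = len(content) - max_chars
            content = content[:max_chars] + f" [truncated {dropped} chars]"
        # Copy the message so the caller's history is left unmodified.
        compacted.append({**msg, "content": content})
    return compacted
```

Leaving a visible truncation marker tells the model that content was removed, so it can re-read the file or log if it needs the details again.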


6. Tackle Big Projects

Complex, multi-file software engineering tasks cannot be completed in a single prompt. They require rigorous research, plans, specifications, and test loops before writing production code. We will explore this structured methodology in a future post about Spec-Driven Development.


7. Understand Huge Codebases

Locating utility functions, modules, and API logic across millions of lines of code requires advanced search indices. We will detail techniques like vector search, semantic embeddings, and project maps in a future post about Navigating and Understanding Huge Codebases.


8. See and Test User Interfaces

Modern frontend testing requires visual verification. State-of-the-art agents use Computer Use capabilities (such as the Google Antigravity browser tools) combined with Vision Language Models (VLMs) to spin up local headless browsers, click elements, capture page screenshots, and inspect layout rendering to verify UI changes.


9. Avoid Bad Assumptions

To prevent agents from building wrong features based on ambiguous prompt instructions, establish guardrails that enforce human-in-the-loop confirmation. The agent must pause and ask clarifying questions instead of making blind assumptions. We will discuss evaluation frameworks and interactive guardrails in a future post on Agent Evaluation.
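A confirmation gate of this kind can be very small. The sketch below assumes the harness wraps each risky action in a callable; `confirm_then_run` and its `ask` parameter are hypothetical names, with `ask` defaulting to the built-in `input` so a human answers at the terminal.

```python
from typing import Callable


def confirm_then_run(
    description: str,
    action: Callable[[], str],
    ask: Callable[[str], str] = input,
) -> str:
    """Run action only after an explicit yes from the human."""
    answer = ask(f"About to: {description}. Proceed? [y/N] ")
    if answer.strip().lower() not in ("y", "yes"):
        # Anything other than a clear yes is treated as "stop and clarify".
        return "skipped: awaiting clarification"
    return action()
```

Defaulting to "no" means an empty or ambiguous reply never triggers the action, and injecting `ask` makes the gate easy to exercise in automated tests.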