awaf v1.3  ·  Open Specification

Production-ready
AI agents.
Measurably.

awaf is an open framework for evaluating AI agent architecture across 10 pillars. Score your agent. Find the gaps. Ship with confidence.

Read the Spec
pip install awaf
Score scale (0-100): Not Ready · High Risk · Needs Work · Near Ready · Production Ready

One command. Ten scores.

Point awaf at your repo. It reads your code and returns a structured assessment across all 10 pillars — findings ordered by severity, recommendations included.

~ awaf run --provider openai --model gpt-4o
   _      _  _  _    _      ___
  /_\    | || || |  /_\    | __|
 / _ \   | \/ \/ | / _ \   | _|
/_/ \_\   \_/\_/  /_/ \_\  |_|      Agent Well-Architected Framework

AWAF Assessment: my-agent
AWAF v1.3  |  2026-03-29  |  openai / gpt-4o
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Overall Score    78/100   Near Ready
  Close to production. Address findings before deploying.

  Scale: Production Ready 85-100 · Near Ready 70-84 · Needs Work 50-69
         High Risk 25-49 · Not Ready 0-24
  Foundation <40 = automatic FAIL regardless of overall score.
  Tier 2 pillars (Reasoning, Controllability, Context Integrity) carry 1.5x weight.

┌──────────────────────┬───────┬──────────────┬────────────┬─────────┐
│ Pillar               │ Score │ Progress     │ Confidence │ Status  │
╞══════════════════════╪═══════╪══════════════╪════════════╪═════════╡
│ TIER 0 -- FOUNDATION                                               │
├──────────────────────┼───────┼──────────────┼────────────┼─────────┤
│ Foundation           │    85 │ [########  ] │ verified   │ PASS    │
╞══════════════════════╪═══════╪══════════════╪════════════╪═════════╡
│ TIER 1 -- CLOUD WAF ADAPTED                                        │
├──────────────────────┼───────┼──────────────┼────────────┼─────────┤
│ Op. Excellence       │    74 │ [#######   ] │ verified   │         │
│ Security             │    82 │ [########  ] │ verified   │         │
│ Reliability          │    71 │ [#######   ] │ verified   │         │
│ Performance          │    80 │ [########  ] │ verified   │         │
│ Cost Optim.          │    65 │ [######    ] │ partial    │         │
│ Sustainability       │    79 │ [########  ] │ verified   │         │
╞══════════════════════╪═══════╪══════════════╪════════════╪═════════╡
│ TIER 2 -- AGENT-NATIVE  (1.5x weight)                              │
├──────────────────────┼───────┼──────────────┼────────────┼─────────┤
│ Reasoning Integ.     │    71 │ [#######   ] │ partial    │ 1.5x    │
│ Controllability      │    78 │ [########  ] │ verified   │ 1.5x    │
│ Context Integrity    │    80 │ [########  ] │ verified   │ 1.5x    │
└──────────────────────┴───────┴──────────────┴────────────┴─────────┘

  FILES ANALYZED     12 files
  TOKENS             182,340 in / 8,920 out  (peak call: 14% of 128K window)
  COST (est)         ~$0.1821
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  FINDINGS  (ordered by severity)
  [High     ]  Cost Optim.          No session budget cap; runaway token spend possible
  [Medium   ]  Reasoning Integ.     Evals present but hallucination rate not measured
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  RECOMMENDATIONS
  Cost Optim.         Add AWAF_SESSION_BUDGET_USD env var and wire hard stop in
                      agent loop before tool dispatch
  Reasoning Integ.    Instrument LangSmith eval run to capture hallucination rate
                      alongside tool selection accuracy
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  TO IMPROVE THIS ASSESSMENT
  Share LangSmith or Braintrust eval output to upgrade Reasoning Integ.
  from partial to verified
  Share token usage dashboard or budget alert config to verify Cost Optim.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
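The report header states the weighting rules: Tier 2 pillars count 1.5x, and a Foundation score below 40 fails the run outright. Here is a rough sketch of how those rules could combine the per-pillar scores into one number; the spec defines the exact aggregation and rounding, and the names below are illustrative, not the awaf API.

# Hypothetical reconstruction of the headline score from per-pillar scores.
# Illustrates only the stated rules: Tier 2 pillars carry 1.5x weight, and
# Foundation < 40 is an automatic FAIL regardless of the weighted mean.
TIER2 = {"Reasoning Integ.", "Controllability", "Context Integrity"}

def overall_score(scores: dict[str, int]) -> tuple[float, bool]:
    weights = {pillar: (1.5 if pillar in TIER2 else 1.0) for pillar in scores}
    weighted_mean = sum(scores[p] * weights[p] for p in scores) / sum(weights.values())
    foundation_pass = scores["Foundation"] >= 40   # gate, independent of the mean
    return weighted_mean, foundation_pass

pillar_scores = {
    "Foundation": 85, "Op. Excellence": 74, "Security": 82, "Reliability": 71,
    "Performance": 80, "Cost Optim.": 65, "Sustainability": 79,
    "Reasoning Integ.": 71, "Controllability": 78, "Context Integrity": 80,
}
score, passed = overall_score(pillar_scores)   # lands in the Near Ready band (70-84)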

Honest scoring.

awaf uses a mechanical risk tally — not holistic LLM estimation — so models cannot anchor on a comfortable number. Two distinct suspect flags tell you exactly what to do next.

anchoring detected — review tallies
SUSPECT RESULTS  (included in score — flagged for review)
! 3 pillars returned score 42
  possible model anchoring or guessing
  Foundation     score 42 shared by 3 pillars (cluster pattern)
  Security       score 42 shared by 3 pillars (cluster pattern)
  Reliability    score 42 shared by 3 pillars (cluster pattern)
score 100 — verify with multiple runs
SUSPECT RESULTS  (included in score — flagged for review)
! 3 pillars returned score 100
  score 100 is difficult to achieve consistently;
  confirm by averaging multiple runs and checking
  std deviation before treating as reliable
  Reliability    score 100 on 3 pillars — verify with multi-run average
  Cost Optim.    score 100 on 3 pillars — verify with multi-run average
  Sustainability score 100 on 3 pillars — verify with multi-run average

Suspect pillars still count toward the overall score: the suspect flag is a warning for operators, not a veto.
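Both flags can be computed mechanically from the per-pillar numbers, with no model in the loop. A minimal sketch of that check; the cluster threshold and function names are illustrative, not awaf internals.

from collections import Counter

def suspect_flags(scores: dict[str, int]) -> list[str]:
    """Flag identical-score clusters and perfect 100s for operator review."""
    flags = []
    # cluster pattern: several pillars returning the exact same score
    for value, count in Counter(scores.values()).items():
        if count >= 3:
            flags.append(f"score {value} shared by {count} pillars (cluster pattern)")
    # perfect scores: hard to achieve consistently, so ask for a multi-run average
    for pillar, value in scores.items():
        if value == 100:
            flags.append(f"{pillar}: score 100, verify with a multi-run average")
    return flags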

10 Pillars. 3 Tiers.

A complete architectural model for agent systems, from foundational requirements to agent-native concerns that have no cloud equivalent. Read the intro post →

Foundation
Prerequisite

Foundation

Agents must own their domain end-to-end. A vertically sliced agent owns its tools, its context, and its data independently.

0 – 100
FAIL < 40
Cloud WAF Adapted
1.0×

Operational Excellence

SLOs, playbooks, and postmortems. Determines whether the other pillars remain effective in production.

Security

Enforced in code, not prompts. Credentials must never enter the agent. Blast radius must be explicitly bounded.

Reliability

Designed for failure, not just uptime. Chain boundaries as fault domains. Fail-loud behavior and circuit breakers at the MCP layer. Checkpoint/resume for multi-step runs.

Performance Efficiency

Optimizes execution speed and resource usage across agent operations.

Cost Optimization

Tracks every token and tool call. Session budgets and loop detection from day one. Hard stop at 100% budget. Non-negotiable. Prevents solutions that cost more than the problems they solve.

Sustainability

Long-term viability and environmental considerations adapted from cloud WAF principles.

Agent-Native
1.5×

Reasoning Integrity

Addresses silent, confident failures — the worst failure type. Agents can hallucinate arguments, select wrong tools, or derail without visible errors. Requires evals covering tool selection, argument accuracy, and chain-of-thought faithfulness.

Controllability

Human control through code-level enforcement, not prompts. Any in-flight agent must be externally stoppable. Requires pause, notify, and resume/abort primitives.

Context Integrity

Manages agent perception of reality. Prevents stale context from corrupting reasoning. Requires external content sanitization through MCP and active lifecycle management for long sessions. The agent must understand its own knowledge limitations.
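A minimal sketch of what this looks like in code, under two assumptions: external content is sanitized before it enters the context, and entries past a freshness cutoff are dropped rather than carried forward. All names are illustrative; this is not an awaf or MCP API.

import re
import time

MAX_AGE_S = 15 * 60  # assumed freshness cutoff; tune per session length

def sanitize(external_text: str) -> str:
    """Scrub obvious injection phrasing from fetched content before it enters context."""
    cleaned = re.sub(r"(?i)ignore (all )?previous instructions", "[removed]", external_text)
    return cleaned[:4000]  # cap size so one fetch cannot flood the window

class ContextStore:
    def __init__(self):
        self.entries = []  # (timestamp, source, text)

    def add_external(self, source: str, text: str):
        self.entries.append((time.time(), source, sanitize(text)))

    def fresh(self):
        """Drop stale entries instead of letting them corrupt later reasoning."""
        cutoff = time.time() - MAX_AGE_S
        self.entries = [e for e in self.entries if e[0] >= cutoff]
        return self.entries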

Bad agent. Good agent.

These are the patterns awaf verifies in your code. The difference between a production-ready agent and a liability is usually one of these.

Controllability Tier 2 · 1.5×
bad — control in the prompt
# system prompt
"""
You are a helpful agent. If the user says
'stop', please stop what you are doing.
Never take irreversible actions unless asked.
"""

# no kill switch, no cancel primitive,
# no external signal handler — agent
# cannot be stopped from outside
result = agent.run_forever(task)
good — code-level enforcement
# checked before every tool dispatch
async def dispatch_tool(tool, args, ctx):
    await kill_switch.check(ctx.run_id)  # raises if flagged
    await pause_gate.wait(ctx.run_id)   # blocks if paused
    return await tool.call(args)

# operator can stop or pause any in-flight run
# via API — no prompt required
kill_switch.flag(run_id="abc123")
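kill_switch and pause_gate are placeholders for whatever control plane you run. One possible in-process shape for them, using asyncio primitives; a production deployment would back the flags with an external store so an operator API can reach any worker.

import asyncio

class RunAborted(Exception):
    """Raised when an operator has flagged the run for termination."""

class KillSwitch:
    """Abort flag checked before every tool dispatch."""
    def __init__(self):
        self._flagged: set[str] = set()

    def flag(self, run_id: str):
        self._flagged.add(run_id)

    async def check(self, run_id: str):
        if run_id in self._flagged:
            raise RunAborted(f"run {run_id} aborted by operator")

class PauseGate:
    """Blocks a paused run at the next dispatch point until it is resumed."""
    def __init__(self):
        self._events: dict[str, asyncio.Event] = {}

    def pause(self, run_id: str):
        self._events.setdefault(run_id, asyncio.Event()).clear()

    def resume(self, run_id: str):
        self._events.setdefault(run_id, asyncio.Event()).set()

    async def wait(self, run_id: str):
        event = self._events.get(run_id)
        if event is not None:
            await event.wait()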
Security Tier 1 · 1.0×
bad — credentials in context
# credentials passed into the agent's context
agent = Agent(
    system=f"Use API key {api_key} to call...",
    tools=[slack_tool, github_tool, db_tool],
    # no scope limits — agent can call anything
)
good — credentials never reach the model
# tool injects credentials at call time; model never sees them
class SlackTool(MCPTool):
    def call(self, channel, message):
        # key read from env inside the tool, not from agent context
        token = os.environ["SLACK_BOT_TOKEN"]
        # scope is read-only by default; write requires explicit grant
        return slack_client.post(token, channel, message)
Cost Optimization Tier 1 · 1.0×
bad — unbounded token spend
while not task.complete():
    response = llm.call(context)
    context.append(response)
    # no loop detection, no budget cap
    # a stuck task can spend $100s unchecked
good — hard stop before tool dispatch
budget = SessionBudget(limit_usd=float(os.environ["AWAF_SESSION_BUDGET_USD"]))

while not task.complete():
    budget.check()        # raises BudgetExceededError if over limit
    loop_guard.check()    # raises LoopDetectedError if repeating
    response = llm.call(context)
    budget.record(response.usage)
    context.append(response)
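SessionBudget and loop_guard are placeholders for whatever guard your stack provides. A minimal sketch of the budget half, assuming an OpenAI-style usage object and placeholder per-token rates:

import os

class BudgetExceededError(RuntimeError):
    pass

class SessionBudget:
    """Tracks estimated spend and hard-stops the loop once the cap is reached."""
    def __init__(self, limit_usd: float, usd_per_1k_in: float = 0.0025,
                 usd_per_1k_out: float = 0.01):
        # per-1K-token rates are placeholders; substitute your provider's pricing
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.usd_per_1k_in = usd_per_1k_in
        self.usd_per_1k_out = usd_per_1k_out

    def record(self, usage):
        # assumes a usage object exposing prompt_tokens / completion_tokens
        self.spent_usd += (usage.prompt_tokens / 1000) * self.usd_per_1k_in
        self.spent_usd += (usage.completion_tokens / 1000) * self.usd_per_1k_out

    def check(self):
        if self.spent_usd >= self.limit_usd:
            raise BudgetExceededError(
                f"session spend ${self.spent_usd:.2f} reached cap ${self.limit_usd:.2f}")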
Reasoning Integrity Tier 2 · 1.5×
bad — no evals for tool selection
# tested manually a few times and it seemed fine
# agent has 12 tools; no automated evals exist
# hallucination rate is unknown
# no tracking of which tool was selected vs expected
agent.deploy()
good — evals with measurable pass rate
# eval suite run in CI — fails if hallucination rate > 3%
suite = EvalSuite.load("evals/tool_selection.yaml")
results = suite.run(agent)

assert results.tool_accuracy >= 0.95, \
    f"Tool selection accuracy {results.tool_accuracy:.0%} below threshold"
assert results.hallucination_rate <= 0.03, \
    f"Hallucination rate {results.hallucination_rate:.1%} exceeds 3%"

The Ecosystem

Spec-first. Multiple implementations. Community-owned.

awaf is community-owned.

The spec is open. Implementations are open.

If you build agents in production, your patterns belong here.