awaf is an open framework for evaluating AI agent architecture across 10 pillars. Score your agent. Find the gaps. Ship with confidence.
awaf run
Point awaf at your repo. It reads your code and returns a structured assessment across all 10 pillars — findings ordered by severity, recommendations included.
   _    _  _  _    _    ___
  /_\  | || || |  /_\  | __|
 / _ \ | \/ \/ | / _ \ | _|
/_/ \_\ \_/\_/ /_/ \_\ |_|

 Agent Well-Architected Framework

 AWAF Assessment: my-agent
 AWAF v1.3 | 2026-03-29 | openai / gpt-4o
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 Overall Score  78/100   Near Ready
 Close to production. Address findings before deploying.

 Scale: Production Ready 85-100 · Near Ready 70-84 · Needs Work 50-69
        High Risk 25-49 · Not Ready 0-24
 Foundation <40 = automatic FAIL regardless of overall score.
 Tier 2 pillars (Reasoning, Controllability, Context Integrity) carry 1.5x weight.

┌──────────────────────┬───────┬──────────────┬────────────┬─────────┐
│ Pillar               │ Score │ Progress     │ Confidence │ Status  │
╞══════════════════════╪═══════╪══════════════╪════════════╪═════════╡
│ TIER 0 -- FOUNDATION                                               │
├──────────────────────┼───────┼──────────────┼────────────┼─────────┤
│ Foundation           │ 85    │ [########  ] │ verified   │ PASS    │
╞══════════════════════╪═══════╪══════════════╪════════════╪═════════╡
│ TIER 1 -- CLOUD WAF ADAPTED                                        │
├──────────────────────┼───────┼──────────────┼────────────┼─────────┤
│ Op. Excellence       │ 74    │ [#######   ] │ verified   │         │
│ Security             │ 82    │ [########  ] │ verified   │         │
│ Reliability          │ 71    │ [#######   ] │ verified   │         │
│ Performance          │ 80    │ [########  ] │ verified   │         │
│ Cost Optim.          │ 65    │ [######    ] │ partial    │         │
│ Sustainability       │ 79    │ [########  ] │ verified   │         │
╞══════════════════════╪═══════╪══════════════╪════════════╪═════════╡
│ TIER 2 -- AGENT-NATIVE (1.5x weight)                               │
├──────────────────────┼───────┼──────────────┼────────────┼─────────┤
│ Reasoning Integ.     │ 71    │ [#######   ] │ partial    │ 1.5x    │
│ Controllability      │ 78    │ [########  ] │ verified   │ 1.5x    │
│ Context Integrity    │ 80    │ [########  ] │ verified   │ 1.5x    │
└──────────────────────┴───────┴──────────────┴────────────┴─────────┘

 FILES ANALYZED  12 files
 TOKENS          182,340 in / 8,920 out (peak call: 14% of 128K window)
 COST (est)      ~$0.1821
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 FINDINGS (ordered by severity)
 [High   ]  Cost Optim.       No session budget cap; runaway token spend possible
 [Medium ]  Reasoning Integ.  Evals present but hallucination rate not measured
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 RECOMMENDATIONS
 Cost Optim.        Add AWAF_SESSION_BUDGET_USD env var and wire hard stop
                    in agent loop before tool dispatch
 Reasoning Integ.   Instrument LangSmith eval run to capture hallucination
                    rate alongside tool selection accuracy
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 TO IMPROVE THIS ASSESSMENT
 Share LangSmith or Braintrust eval output to upgrade Reasoning Integ.
 from partial to verified
 Share token usage dashboard or budget alert config to verify Cost Optim.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
integrity checks
awaf uses a mechanical risk tally — not holistic LLM estimation — so models cannot anchor on a comfortable number. Two distinct suspect flags tell you exactly what to do next.
 SUSPECT RESULTS (included in score — flagged for review)
 !  3 pillars returned score 42    possible model anchoring or guessing
    Foundation     score 42 shared by 3 pillars (cluster pattern)
    Security       score 42 shared by 3 pillars (cluster pattern)
    Reliability    score 42 shared by 3 pillars (cluster pattern)
 SUSPECT RESULTS (included in score — flagged for review)
 !  3 pillars returned score 100
    score 100 is difficult to achieve consistently; confirm by averaging
    multiple runs and checking std deviation before treating as reliable
    Reliability      score 100 on 3 pillars — verify with multi-run average
    Cost Optim.      score 100 on 3 pillars — verify with multi-run average
    Sustainability   score 100 on 3 pillars — verify with multi-run average
Suspect pillars are included in the overall score. Suspect is a warning for operators, not a veto.
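To make the idea concrete, here is a minimal sketch of a cluster check in that spirit. The function name, threshold, and scores are hypothetical, not awaf's actual implementation:

from collections import Counter

# Illustrative sketch only, not awaf's real code. The idea: if several
# pillars return the exact same score, the cluster itself is evidence of
# anchoring or guessing, so those pillars get flagged for review.
def find_suspect_clusters(pillar_scores, min_cluster=3):
    counts = Counter(pillar_scores.values())
    return {
        pillar: f"score {score} shared by {counts[score]} pillars (cluster pattern)"
        for pillar, score in pillar_scores.items()
        if counts[score] >= min_cluster
    }

# Three pillars anchored on 42 are flagged; the rest pass through untouched.
flags = find_suspect_clusters({
    "Foundation": 42, "Security": 42, "Reliability": 42,
    "Performance": 80, "Cost Optim.": 65,
})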
A complete architectural model for agent systems, from foundational requirements to agent-native concerns that have no cloud equivalent. Read the intro post →
Agents must own their domain end-to-end: a vertically sliced agent owns its tools, its context, and its data.
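As a rough illustration (the class, field, and tool names below are hypothetical, not an awaf API), a vertically sliced agent bundles those three things in one place instead of reaching into shared global state:

from dataclasses import dataclass, field
from typing import Callable

def parse_invoice(pdf_bytes: bytes) -> dict:
    # a domain-specific tool owned by this agent alone
    return {"total": 0.0}

# Hypothetical sketch of a vertical slice: the agent carries its own
# tools, its own context, and its own data scope; no shared globals.
@dataclass
class InvoiceAgent:
    tools: list[Callable] = field(default_factory=lambda: [parse_invoice])
    context: list[dict] = field(default_factory=list)   # owned, not shared
    data_scope: str = "invoicing"                        # its data, its schema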
SLOs, playbooks, and postmortems. Determines whether the other pillars remain effective in production.
Enforced in code, not prompts. Credentials must never enter the agent. Blast radius must be explicitly bounded.
Designed for failure, not just uptime. Chain boundaries as fault domains. Fail-loud behavior and circuit breakers at the MCP layer. Checkpoint/resume for multi-step runs.
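A minimal sketch of the checkpoint/resume and fail-loud parts, with assumed names and a local JSON file standing in for real state storage; this is illustrative, not an awaf or MCP interface:

import json
from pathlib import Path

# Hypothetical sketch: each step in a multi-step run is its own fault
# domain. A checkpoint is written before every step so a crashed run can
# resume where it left off, and failures raise loudly instead of being
# swallowed and retried in silence.
def run_chain(steps, run_id, state):
    ckpt = Path(f"checkpoints/{run_id}.json")
    start = 0
    if ckpt.exists():  # resume from the last saved step
        saved = json.loads(ckpt.read_text())
        start, state = saved["step"], saved["state"]

    for i in range(start, len(steps)):
        ckpt.parent.mkdir(exist_ok=True)
        ckpt.write_text(json.dumps({"step": i, "state": state}))
        try:
            state = steps[i](state)
        except Exception as exc:  # fail loud: name the step that broke
            raise RuntimeError(f"step {i} ({steps[i].__name__}) failed") from exc

    ckpt.unlink(missing_ok=True)  # run finished; checkpoint no longer needed
    return state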
Optimizes execution speed and resource usage across agent operations.
Tracks every token and tool call. Session budgets and loop detection from day one. Hard stop at 100% budget. Non-negotiable. Prevents solutions that cost more than the problems they solve.
Long-term viability and environmental considerations adapted from cloud WAF principles.
Addresses silent, confident failures — the worst failure type. Agents can hallucinate arguments, select wrong tools, or derail without visible errors. Requires evals covering tool selection, argument accuracy, and chain-of-thought faithfulness.
Human control through code-level enforcement, not prompts. Any in-flight agent must be externally stoppable. Requires pause, notify, and resume/abort primitives.
Manages agent perception of reality. Prevents stale context from corrupting reasoning. Requires external content sanitization through MCP and active lifecycle management for long sessions. The agent must understand its own knowledge limitations.
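For illustration, a minimal sketch of two of those requirements: sanitizing external content before it enters the context and expiring stale entries. The regex, helper names, and entry shape are assumptions, not awaf's or MCP's actual interface:

import re
import time

# Hypothetical sketch: external content gets a source label and a crude
# injection scrub before it reaches the model, and every context entry
# carries a timestamp so stale entries can be expired in long sessions.
INJECTION = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)

def sanitize_external(text, source):
    cleaned = INJECTION.sub("[removed]", text)
    return {"role": "tool", "source": source, "content": cleaned, "ts": time.time()}

def expire_stale(context, max_age_s=3600.0):
    now = time.time()
    return [entry for entry in context if now - entry.get("ts", now) <= max_age_s]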
What AWAF looks for
These are the patterns awaf verifies in your code. The difference between a production-ready agent and a liability is usually one of these.
# system prompt
"""
You are a helpful agent.
If the user says 'stop', please stop what you are doing.
Never take irreversible actions unless asked.
"""

# no kill switch, no cancel primitive,
# no external signal handler — agent
# cannot be stopped from outside
result = agent.run_forever(task)
# checked before every tool dispatch
async def dispatch_tool(tool, args, ctx):
    await kill_switch.check(ctx.run_id)   # raises if flagged
    await pause_gate.wait(ctx.run_id)     # blocks if paused
    return await tool.call(args)

# operator can stop or pause any in-flight run
# via API — no prompt required
kill_switch.flag(run_id="abc123")
# credentials passed into the agent's context
agent = Agent(
    system=f"Use API key {api_key} to call...",
    tools=[slack_tool, github_tool, db_tool],
    # no scope limits — agent can call anything
)
# tool injects credentials at call time; model never sees them
class SlackTool(MCPTool):
    def call(self, channel, message):
        # key read from env inside the tool, not from agent context
        token = os.environ["SLACK_BOT_TOKEN"]
        # scope is read-only by default; write requires explicit grant
        return slack_client.post(token, channel, message)
while not task.complete():
    response = llm.call(context)
    context.append(response)

# no loop detection, no budget cap
# a stuck task can spend $100s unchecked
budget = SessionBudget(limit_usd=float(os.environ["AWAF_SESSION_BUDGET_USD"]))

while not task.complete():
    budget.check()       # raises BudgetExceededError if over limit
    loop_guard.check()   # raises LoopDetectedError if repeating
    response = llm.call(context)
    budget.record(response.usage)
    context.append(response)
# tested manually a few times and it seemed fine
# agent has 12 tools; no automated evals exist
# hallucination rate is unknown
# no tracking of which tool was selected vs expected
agent.deploy()
# eval suite run in CI — fails if hallucination rate > 3%
suite = EvalSuite.load("evals/tool_selection.yaml")
results = suite.run(agent)

assert results.tool_accuracy >= 0.95, \
    f"Tool selection accuracy {results.tool_accuracy:.0%} below threshold"
assert results.hallucination_rate <= 0.03, \
    f"Hallucination rate {results.hallucination_rate:.1%} exceeds 3%"
Spec-first. Multiple implementations. Community-owned.
The canonical spec. FRAMEWORK.md defines all 10 pillars, scoring, and readiness ratings.
Reference implementation. Multi-provider (Anthropic, OpenAI, Azure, Google, LiteLLM). CI/CD integration, GitHub Action. pip install awaf.
Dialogue-driven assessments in Claude Code. Accepts code, docs, exports, or verbal descriptions.
AI-powered planning tool grounded in market analysis and Anthropic Economic Index data. Built on awaf.
The spec is open. Implementations are open.
If you build agents in production, your patterns belong here.