CROW BLUE AGENT READINESS STANDARD · V0.1 · CALL FOR COMMENTS
The standard for
production-ready agentic AI.
Ten domains. What a grumpy senior developer would ask before signing off on an agent going anywhere near production. Not a regulatory framework. Not a compliance checklist. A punchlist written for the people who build agents and the organizations that depend on them.
Version 0.1 · April 2026 · crow.blue/standard
Items marked Required are verified by Signet before a full governance credential is issued. Items marked Advisory are surfaced during registration and periodic review. Items marked Attestation require a human to confirm — they cannot be verified programmatically.
This is version 0.1. It is a call for comments. Submit feedback at the form below. All substantive comments will be acknowledged and addressed in the public change log.
01 · DATA HANDLING
Data Handling
Every data source the agent accesses is documented and classified. Classification uses one of: public, internal, confidential, restricted, personal_data, sensitive_personal_data, financial, health, regulated. Unclassified data sources are flagged and must be resolved before a full credential is issued.
The agent does not send personally identifiable information to a model endpoint unless it is necessary for the task. If PII must be sent, it is documented in the registration record and the model provider's data retention policy is reviewed and accepted.
The builder has reviewed the model provider's data processing terms and confirmed that inputs to the model are not retained for training purposes, or has accepted the risk and documented the decision.
The agent's outputs are classified at least as high as its most sensitive input. An agent that reads confidential data does not write outputs to a public channel without explicit human review.
Prompts do not include more data than necessary. If a full record is available but only a subset is needed, the agent sends the subset. Prompt engineering reviews are part of the change management process.
Application logs do not contain raw model inputs or outputs that include PII or confidential data. Log sanitization is implemented before logs are written to any storage or aggregation system.
For agents operating under jurisdictional data requirements (GDPR, state privacy laws, sector-specific regulations), the builder has confirmed that data processed by the agent does not leave the required jurisdiction.
If the agent stores outputs, a retention policy is defined. Outputs are deleted on schedule. If the agent processes personal data, subject deletion requests can be fulfilled — the agent's outputs can be identified and removed.
02 · RESILIENCY AND FAILURE HANDLING
Resiliency and Failure Handling
The agent has explicit handling for model API unavailability. It does not silently fail, loop indefinitely, or propagate an error as a valid result. The failure mode is documented: fail fast, retry with backoff, queue for later, or alert and stop.
Where retries are appropriate, the agent implements exponential backoff with jitter. It does not hammer the model API on failure. Maximum retry count and total timeout are defined and bounded.
The agent validates model output before acting on it. If the output does not conform to the expected schema or contains nonsense, the agent does not proceed. Downstream systems are not corrupted by garbage outputs.
For multi-step workflows, the agent handles partial completion. If step 3 of 5 fails, the state is known, the partial work is either rolled back or held for review, and the agent does not leave the system in an inconsistent state.
Every external call the agent makes — to the model, to APIs, to databases — has an explicit timeout. The agent does not wait indefinitely. Timeouts are appropriate to the operation and documented.
For agents that call downstream services repeatedly, a circuit breaker pattern is implemented. If a downstream service is consistently failing, the agent backs off rather than contributing to a cascade failure.
Where possible, the agent has a degraded mode. If the model is unavailable but the task is critical, a fallback path exists: a human queue, a rule-based fallback, or a held queue for processing when the model recovers.
For agents that write to databases, send messages, or trigger workflows, operations are idempotent where possible. Running the same operation twice does not produce duplicate records, duplicate messages, or duplicate charges. (Required for agents that write to external systems.)
03 · HUMAN OVERSIGHT
Human Oversight
The registration record specifies the agent's human oversight model: always reviewed before action, sampled review, exception-only review, or fully automated. The declared model is appropriate to the risk tier and the consequences of errors.
There is a defined escalation path for cases the agent cannot handle with sufficient confidence. The escalation destination is a real person or queue, not a dead end.
For agents that make classifications or decisions, a confidence threshold is defined below which the agent escalates to human review rather than acting. The threshold is calibrated to the risk of false positives and false negatives.
Humans can override the agent's output or decision after the fact. The override mechanism is documented, accessible to the appropriate people, and the override is logged. (Required for high and critical risk tiers.)
The agent can be stopped. There is a documented, tested path to halt the agent immediately if it is producing harmful outputs or behaving unexpectedly. The person responsible for triggering the stop is identified.
Every human escalation event is emitted to the governance telemetry stream. The escalation rate is monitored. An unusual change in escalation rate — spike or drop — triggers a governance review.
For agents that make decisions with direct consequences for people — hiring, lending, benefits, healthcare — the builder has confirmed that a human reviews decisions before they take effect, or that an appeal mechanism is available to affected individuals.
The oversight model is documented in plain language that affected stakeholders can understand. "The system automatically approves requests under $500" is documentation. "AI-powered decisioning" is not.
04 · OUTPUT VALIDATION AND QUALITY
Output Validation and Quality
If the agent produces structured output (JSON, CSV, a database record, a classification), the output is validated against a schema before it is used. Invalid outputs are rejected, not silently coerced.
For numerical or quantitative outputs, plausibility checks are implemented. An agent that produces a price estimate of $0.00 or $999,999,999.99 should not forward that to a billing system without review.
The builder has considered the agent's exposure to hallucination and documented the mitigations. For tasks where hallucinated outputs are high-risk (medical, legal, financial), additional validation or human review is required.
Where the agent's outputs reference facts, documents, or data, sources are traceable. The agent does not fabricate citations. If the agent uses retrieval-augmented generation, the retrieved content is logged.
Agent outputs that are rendered in a UI, written to a database, or forwarded to another system are sanitized appropriately. SQL injection, XSS, and prompt injection via output are considered and mitigated.
A set of known-good inputs and expected outputs exists for the agent's primary tasks. This evaluation suite is run before any significant change to prompts, model versions, or tool configurations. Results are logged.
The builder has documented the inputs for which the agent's behavior is uncertain or known to be poor. These edge cases are either handled explicitly or routed to human review.
05 · SECURITY
Security
The agent's input handling considers prompt injection. Inputs from untrusted sources — user text, retrieved documents, external API responses — are not directly concatenated into system prompts without sanitization or structural separation.
The agent has access to only the data sources and tools it needs for its stated purpose. Credentials are scoped to the minimum necessary permissions. An agent that reads a database does not have write access unless it writes.
The agent's credentials — API keys, database passwords, OAuth tokens — are stored in a secrets manager or environment variables. They are not hardcoded in source code, prompt templates, or configuration files checked into version control.
The agent's credentials are rotated on a defined schedule or on personnel change. The rotation process is documented and tested. Rotation does not require downtime.
For agents that invoke tools (APIs, code execution, file system operations), the tool call parameters are validated before execution. The agent cannot be instructed to call a tool with parameters outside its expected range.
The agent's outputs are not forwarded directly to systems that interpret them as instructions. An agent that writes to a message queue does not produce messages that downstream consumers will execute as code or commands.
The agent's dependencies — SDK versions, model client libraries, tool libraries — are tracked and updated. Known vulnerabilities in dependencies are remediated on a defined schedule.
Prompts do not contain secrets, credentials, or sensitive configuration values. If prompts must reference configuration, they use references to secrets manager values, not the values themselves.
If the agent receives instructions from another agent, those instructions are treated with appropriate skepticism. The source agent's identity is verified where possible. Instructions that expand the receiving agent's permissions or scope are rejected.
06 · BIAS AND FAIRNESS
Bias and Fairness
For agents that make or influence decisions affecting people, the builder has assessed whether the agent's outputs vary systematically by demographic group. The assessment is documented and mitigations are implemented where disparate impact is found. (Required for high and critical risk tiers.)
The builder has considered whether the base model's training data is appropriate for the agent's use case. Known limitations of the model — demographic biases, knowledge cutoffs, domain gaps — are documented and handled.
The agent does not use protected characteristics (race, gender, age, religion, national origin, disability status) as inputs to decisions unless explicitly permitted by law and documented. Even where permitted, their use is reviewed by legal counsel. (Required for agents making consequential decisions about people.)
If the agent's outputs influence future inputs — recommendations that shape behavior, classifications that affect what data the agent sees — the feedback loop is documented and monitored for drift toward harmful patterns.
The agent has defined boundaries for what it will and will not do. These boundaries are implemented in the prompt and tested. The agent refuses requests outside its intended scope rather than attempting them with lower quality.
For agents that interact with or make decisions about vulnerable populations (minors, elderly, people with disabilities, people in crisis), additional safeguards are implemented and documented. (Where applicable.)
07 · TRANSPARENCY AND DISCLOSURE
Transparency and Disclosure
People who interact with or are significantly affected by the agent's outputs are informed that AI is involved. The disclosure is clear, not buried in terms of service. (Required for high and critical risk tiers.)
Where the agent makes or contributes to a decision, a plain-language explanation of the decision is available to the affected person on request. "The system flagged your application for manual review" is not sufficient. "Your application was flagged because the income documentation did not match the stated salary" is.
Affected individuals can request human review of an agent's decision. The request path is documented and accessible. Requests are fulfilled on a defined timeline.
Agents that interact with humans in real time — chat, voice, email — identify themselves as AI when directly asked. They do not claim to be human. (Required for agents that interact with humans directly.)
Known limitations of the agent are disclosed to the people who depend on its outputs. Users who act on agent outputs know the conditions under which those outputs are unreliable.
08 · CODE QUALITY AND MAINTAINABILITY
Code Quality and Maintainability
System prompts are documented. The intent of each instruction is explained in comments or accompanying documentation. Someone other than the original author can understand why the prompt is structured as it is.
All agent code, prompt templates, configuration, and tool definitions are in version control. Changes are tracked. The current production version is identifiable.
Significant changes to the agent — prompt revisions, model version updates, tool additions, behavioral changes — are recorded in a change log with the date, the change, the reason, and the person responsible.
The agent's core logic — input parsing, output formatting, tool call construction, error handling — has unit tests. Tests cover happy paths and known edge cases. Tests pass before any deployment.
End-to-end tests exist that exercise the agent's full workflow with realistic inputs. These tests are run before significant changes and their results are logged.
The agent's code is written for the next developer, not just for the computer. Functions are named clearly. Complex logic is commented. Magic values are named constants. The codebase can be understood by a competent developer in a reasonable amount of time.
The agent's dependencies — libraries, external services, model endpoints — are documented. Versions are pinned in the dependency manifest. A new developer can set up a working environment from the documentation.
The code includes comments identifying known limitations, workarounds, and technical debt. Future maintainers are not surprised by behavior they cannot explain.
Commented-out code, unused imports, experimental features, and debug logging are not present in the production codebase. If something is not used, it is deleted.
09 · VERSIONING AND CHANGE MANAGEMENT
Versioning and Change Management
The agent specifies an explicit model version, not a floating alias like "latest." The behavior of a pinned version is stable. Model updates are deliberate, not automatic.
When the organization wants to update the model version, there is a documented process: update in staging, run the evaluation suite, compare outputs against the previous version, approve, deploy. The update is not made directly in production.
Changes to system prompts follow the same process as code changes: version control, review, testing against the evaluation suite, documented rationale. A prompt change that affects production behavior is not made casually.
Before any significant change — model update, prompt change, tool addition — the agent's outputs on a standard set of inputs are compared to the previous version. Unexpected regressions are investigated before the change is deployed.
The previous version of the agent — previous prompt, previous model version, previous tool configuration — can be restored within a defined time window. The rollback process is documented and has been tested.
For significant changes to high-traffic agents, changes are rolled out to a subset of traffic before full deployment. Monitoring confirms expected behavior at partial deployment before full rollout proceeds.
When the agent will be retired or replaced, the deprecation plan is documented and communicated to stakeholders before the agent is shut down. Data produced by the agent is retained or migrated per the relevant retention policy.
10 · COST AND RESOURCE MANAGEMENT
Cost and Resource Management
The agent's costs are attributed to a cost center, project, or owner. Finance can answer "how much did this agent cost last month" without a manual investigation.
A spend limit is defined for the agent. If the agent exceeds the limit, it stops or alerts. The limit is appropriate to the agent's expected workload and the organization's tolerance for unexpected spend.
The builder knows the approximate cost per invocation of the agent. Cost per call is documented. Input and output token counts are logged. Cost efficiency is a design consideration, not an afterthought.
Unusual changes in the agent's cost are detected and alerted. A 10x spike in cost triggers a notification to the agent owner. The owner can determine whether the spike is expected. (Covered by Signet telemetry where Signet is deployed.)
Prompts do not include unnecessary context. Conversation history is managed — old context is truncated or summarized rather than passed indefinitely. Token usage is periodically reviewed for waste.
Agents that are no longer in use are deprecated and their infrastructure is cleaned up. Zombie agents — deployed but no longer invoked — are identified in the registry and reviewed.
Mapping to Regulatory Frameworks
The following table maps CBARS domains to the primary regulatory frameworks relevant to enterprise AI governance. Mappings are conservative — only direct, defensible connections are listed.
| Domain | EU AI Act 2024 | NIST AI RMF 1.0 | ISO 42001 |
|---|---|---|---|
| 1. Data handling | Art. 10 (data governance) | MAP, MEASURE | 6.1, 8.4 |
| 2. Resiliency | Art. 15 (robustness) | MEASURE, MANAGE | 8.5 |
| 3. Human oversight | Art. 14 (human oversight) | GOVERN | 6.2 |
| 4. Output validation | Art. 15 (accuracy) | MEASURE | 8.5 |
| 5. Security | Art. 15 (cybersecurity) | MANAGE | 8.3 |
| 6. Bias and fairness | Art. 10, Art. 9 (risk management) | MAP, MEASURE | 6.1 |
| 7. Transparency | Art. 13 (transparency) | GOVERN | 7.5 |
| 8. Code quality | Art. 11 (technical documentation) | GOVERN | 8.2 |
| 9. Change management | Art. 9 (risk management) | MANAGE | 8.5 |
| 10. Cost management | — | GOVERN | — |
Submit a comment
CBARS is a call for comments. We want input from senior engineers, security teams, governance teams, and anyone who has been burned by the things it covers. All substantive comments are acknowledged and addressed in the public change log.
Download CBARS v0.1
PDF format. Freely reproducible with attribution.
Signet ships this standard with every provisioning event.