The following reference materials were provided as context. Use them as guidance only -- always prefer the latest, most current information over any potentially outdated content in these files.
--- Reference: instructor_8lsy243ftffjjy1cx9lm3o2bw_public_1773274827_Claude+Certified+Architect+-+Foundations+Certification+Exam+Guide.pdf ---
Claude Certified Architect - Foundations Certification Exam Guide
Introduction
The Claude Certified Architect - Foundations certification validates that practitioners can make
informed decisions about tradeoffs when implementing real-world solutions with Claude. This
exam tests foundational knowledge across Claude Code, the Claude Agent SDK, the Claude API,
and Model Context Protocol (MCP) -- the core technologies used to build production-grade
applications with Claude.
Questions on this exam are grounded in realistic scenarios drawn from actual customer use
cases, including building agentic systems for customer support, designing multi-agent research
pipelines, integrating Claude Code into CI/CD workflows, building developer productivity tools,
and extracting structured data from unstructured documents. Candidates must demonstrate not
only conceptual knowledge but practical judgment about architecture, configuration, and
tradeoffs in production deployments.
This guide describes the exam content, lists the domains and task statements tested, provides
sample questions, and recommends preparation strategies. Use it alongside hands-on
experience to prepare effectively.
Target Candidate Description
The ideal candidate for this certification is a solution architect who designs and implements
production applications with Claude. This candidate has hands-on experience with:
- Building agentic applications using the Claude Agent SDK, including multi-agent
orchestration, subagent delegation, tool integration, and lifecycle hooks
- Configuring and customizing Claude Code for team workflows using CLAUDE.md files,
Agent Skills, MCP server integrations, and plan mode
- Designing Model Context Protocol (MCP) tool and resource interfaces for backend
system integration
- Engineering prompts that produce reliable structured output, leveraging JSON schemas,
few-shot examples, and extraction patterns
- Managing context windows effectively across long documents, multi-turn conversations,
and multi-agent handoffs
Anthropic, PBC · Confidential Need to Know (NTK)
- Integrating Claude into CI/CD pipelines for automated code review, test generation, and
pull request feedback
- Making sound escalation and reliability decisions, including error handling,
human-in-the-loop workflows, and self-evaluation patterns
The candidate typically has 6+ months of practical experience building with Claude APIs, Agent
SDK, Claude Code, and MCP, understanding both the capabilities and limitations of large
language models in production environments.
Exam Content
Response Types
All questions on the exam are multiple choice format. Each question has one correct response
and three incorrect responses (distractors).
Select the single response that best completes the statement or answers the question.
Distractors are response options that a candidate with incomplete knowledge or experience
might choose.
Unanswered questions are scored as incorrect; there is no penalty for guessing.
Exam Results
The exam has a pass or fail designation. The exam is scored against a minimum standard
established by subject matter experts.
Your results are reported as a scaled score of 100-1,000. The minimum passing score is 720.
Scaled scoring models help equate scores across multiple exam forms that might have slightly
different difficulty levels.
Content Outline
This exam guide includes weightings, content domains, and task statements for the exam.
The exam has the following content domains and weightings:
- Domain 1: Agentic Architecture & Orchestration (27% of scored content)
- Domain 2: Tool Design & MCP Integration (18% of scored content)
Anthropic, PBC · Confidential Need to Know (NTK)
- Domain 3: Claude Code Configuration & Workflows (20% of scored content)
- Domain 4: Prompt Engineering & Structured Output (20% of scored content)
- Domain 5: Context Management & Reliability (15% of scored content)
Exam Scenarios
The exam uses scenario-based questions. Each scenario presents a realistic production context
that frames a set of questions. During the exam, 4 scenarios will be presented and picked at
random from the full set of the 6 scenarios below.
Scenario 1: Customer Support Resolution Agent
You are building a customer support resolution agent using the Claude Agent SDK. The agent
handles high-ambiguity requests like returns, billing disputes, and account issues. It has access
to your backend systems through custom Model Context Protocol (MCP) tools (get_customer,
lookup_order, process_refund, escalate_to_human). Your target is 80%+ first-contact
resolution while knowing when to escalate.
Primary domains: Agentic Architecture & Orchestration, Tool Design & MCP Integration, Context
Management & Reliability
Scenario 2: Code Generation with Claude Code
You are using Claude Code to accelerate software development. Your team uses it for code
generation, refactoring, debugging, and documentation. You need to integrate it into your
development workflow with custom slash commands, CLAUDE.md configurations, and
understand when to use plan mode vs direct execution.
Primary domains: Claude Code Configuration & Workflows, Context Management & Reliability
Scenario 3: Multi-Agent Research System
You are building a multi-agent research system using the Claude Agent SDK. A coordinator
agent delegates to specialized subagents: one searches the web, one analyzes documents, one
synthesizes findings, and one generates reports. The system researches topics and produces
comprehensive, cited reports.
Anthropic, PBC · Confidential Need to Know (NTK)
Primary domains: Agentic Architecture & Orchestration, Tool Design & MCP Integration, Context
Management & Reliability
Scenario 4: Developer Productivity with Claude
You are building developer productivity tools using the Claude Agent SDK. The agent helps
engineers explore unfamiliar codebases, understand legacy systems, generate boilerplate code,
and automate repetitive tasks. It uses the built-in tools (Read, Write, Bash, Grep, Glob) and
integrates with Model Context Protocol (MCP) servers.
Primary domains: Tool Design & MCP Integration, Claude Code Configuration & Workflows,
Agentic Architecture & Orchestration
Scenario 5: Claude Code for Continuous Integration
You are integrating Claude Code into your Continuous Integration/Continuous Deployment
(CI/CD) pipeline. The system runs automated code reviews, generates test cases, and provides
feedback on pull requests. You need to design prompts that provide actionable feedback and
minimize false positives.
Primary domains: Claude Code Configuration & Workflows, Prompt Engineering & Structured
Output
Scenario 6: Structured Data Extraction
You are building a structured data extraction system using Claude. The system extracts
information from unstructured documents, validates the output using JavaScript Object
Notation (JSON) schemas, and maintains high accuracy. It must handle edge cases gracefully and
integrate with downstream systems.
Primary domains: Prompt Engineering & Structured Output, Context Management & Reliability
Domain 1: Agentic Architecture & Orchestration
Task Statement 1.1: Design and implement agentic loops for autonomous task
execution
Knowledge of:
Anthropic, PBC · Confidential Need to Know (NTK)
- The agentic loop lifecycle: sending requests to Claude, inspecting stop_reason
("tool_use" vs "end_turn"), executing requested tools, and returning results for the
next iteration
- How tool results are appended to conversation history so the model can reason about the
next action
- The distinction between model-driven decision-making (Claude reasons about which tool
to call next based on context) and pre-configured decision trees or tool sequences
Skills in:
- Implementing agentic loop control flow that continues when stop_reason is
"tool_use" and terminates when stop_reason is "end_turn"
- Adding tool results to conversation context between iterations so the model can
incorporate new information into its reasoning
- Avoiding anti-patterns such as parsing natural language signals to determine loop
termination, setting arbitrary iteration caps as the primary stopping mechanism, or
checking for assistant text content as a completion indicator
Task Statement 1.2: Orchestrate multi-agent systems with coordinator-subagent
patterns
Knowledge of:
- Hub-and-spoke architecture where a coordinator agent manages all inter-subagent
communication, error handling, and information routing
- How subagents operate with isolated context--they do not inherit the coordinator's
conversation history automatically
- The role of the coordinator in task decomposition, delegation, result aggregation, and
deciding which subagents to invoke based on query complexity
- Risks of overly narrow task decomposition by the coordinator, leading to incomplete
coverage of broad research topics
Skills in:
- Designing coordinator agents that analyze query requirements and dynamically select
which subagents to invoke rather than always routing through the full pipeline
Anthropic, PBC · Confidential Need to Know (NTK)
- Partitioning research scope across subagents to minimize duplication (e.g., assigning
distinct subtopics or source types to each agent)
- Implementing iterative refinement loops where the coordinator evaluates synthesis
output for gaps, re-delegates to search and analysis subagents with targeted queries, and
re-invokes synthesis until coverage is sufficient
- Routing all subagent communication through the coordinator for observability, consistent
error handling, and controlled information flow
Task Statement 1.3: Configure subagent invocation, context passing, and
spawning
Knowledge of:
- The Task tool as the mechanism for spawning subagents, and the requirement that
allowedTools must include "Task" for a coordinator to invoke subagents
- That subagent context must be explicitly provided in the prompt--subagents do not
automatically inherit parent context or share memory between invocations
- The AgentDefinition configuration including descriptions, system prompts, and tool
restrictions for each subagent type
- Fork-based session management for exploring divergent approaches from a shared
analysis baseline
Skills in:
- Including complete findings from prior agents directly in the subagent's prompt (e.g.,
passing web search results and document analysis outputs to the synthesis subagent)
- Using structured data formats to separate content from metadata (source URLs,
document names, page numbers) when passing context between agents to preserve
attribution
- Spawning parallel subagents by emitting multiple Task tool calls in a single coordinator
response rather than across separate turns
- Designing coordinator prompts that specify research goals and quality criteria rather than
step-by-step procedural instructions, to enable subagent adaptability
Task Statement 1.4: Implement multi-step workflows with enforcement and
handoff patterns
Knowledge of:
Anthropic, PBC · Confidential Need to Know (NTK)
- The difference between programmatic enforcement (hooks, prerequisite gates) and
prompt-based guidance for workflow ordering
- When deterministic compliance is required (e.g., identity verification before financial
operations), prompt instructions alone have a non-zero failure rate
- Structured handoff protocols for mid-process escalation that include customer details,
root cause analysis, and recommended actions
Skills in:
- Implementing programmatic prerequisites that block downstream tool calls until
prerequisite steps have completed (e.g., blocking process_refund until get_customer
has returned a verified customer ID)
- Decomposing multi-concern customer requests into distinct items, then investigating
each in parallel using shared context before synthesizing a unified resolution
- Compiling structured handoff summaries (customer ID, root cause, refund amount,
recommended action) when escalating to human agents who lack access to the
conversation transcript
Task Statement 1.5: Apply Agent SDK hooks for tool call interception and data
normalization
Knowledge of:
- Hook patterns (e.g., PostToolUse) that intercept tool results for transformation before
the model processes them
- Hook patterns that intercept outgoing tool calls to enforce compliance rules (e.g., blocking
refunds above a threshold)
- The distinction between using hooks for deterministic guarantees versus relying on
prompt instructions for probabilistic compliance
Skills in:
- Implementing PostToolUse hooks to normalize heterogeneous data formats (Unix
timestamps, ISO 8601, numeric status codes) from different MCP tools before the agent
processes them
- Implementing tool call interception hooks that block policy-violating actions (e.g., refunds
exceeding $500) and redirect to alternative workflows (e.g., human escalation)
Anthropic, PBC · Confidential Need to Know (NTK)
- Choosing hooks over prompt-based enforcement when business rules require guaranteed
compliance
Task Statement 1.6: Design task decomposition strategies for complex workflows
Knowledge of:
- When to use fixed sequential pipelines (prompt chaining) versus dynamic adaptive
decomposition based on intermediate findings
- Prompt chaining patterns that break reviews into sequential steps (e.g., analyze each file
individually, then run a cross-file integration pass)
- The value of adaptive investigation plans that generate subtasks based on what is
discovered at each step
Skills in:
- Selecting task decomposition patterns appropriate to the workflow: prompt chaining for
predictable multi-aspect reviews, dynamic decomposition for open-ended investigation
tasks
- Splitting large code reviews into per-file local analysis passes plus a separate cross-file
integration pass to avoid attention dilution
- Decomposing open-ended tasks (e.g., "add comprehensive tests to a legacy codebase") by
first mapping structure, identifying high-impact areas, then creating a prioritized plan
that adapts as dependencies are discovered
Task Statement 1.7: Manage session state, resumption, and forking
Knowledge of:
- Named session resumption using --resume to continue a specific
prior conversation
- fork_session for creating independent branches from a shared analysis baseline to
explore divergent approaches
- The importance of informing the agent about changes to previously analyzed files when
resuming sessions after code modifications
- Why starting a new session with a structured summary is more reliable than resuming
with stale tool results
Skills in:
Anthropic, PBC · Confidential Need to Know (NTK)
- Using --resume with session names to continue named investigation sessions across
work sessions
- Using fork_session to create parallel exploration branches (e.g., comparing two testing
strategies or refactoring approaches from a shared codebase analysis)
- Choosing between session resumption (when prior context is mostly valid) and starting
fresh with injected summaries (when prior tool results are stale)
- Informing a resumed session about specific file changes for targeted re-analysis rather
than requiring full re-exploration
Domain 2: Tool Design & MCP Integration
Task Statement 2.1: Design effective tool interfaces with clear descriptions and
boundaries
Knowledge of:
- Tool descriptions as the primary mechanism LLMs use for tool selection; minimal
descriptions lead to unreliable selection among similar tools
- The importance of including input formats, example queries, edge cases, and boundary
explanations in tool descriptions
- How ambiguous or overlapping tool descriptions cause misrouting (e.g.,
analyze_content vs analyze_document with near-identical descriptions)
- The impact of system prompt wording on tool selection: keyword-sensitive instructions
can create unintended tool associations
Skills in:
- Writing tool descriptions that clearly differentiate each tool's purpose, expected inputs,
outputs, and when to use it versus similar alternatives
- Renaming tools and updating descriptions to eliminate functional overlap (e.g., renaming
analyze_content to extract_web_results with a web-specific description)
- Splitting generic tools into purpose-specific tools with defined input/output contracts
(e.g., splitting a generic analyze_document into extract_data_points,
summarize_content, and verify_claim_against_source)
Anthropic, PBC · Confidential Need to Know (NTK)
- Reviewing system prompts for keyword-sensitive instructions that might override
well-written tool descriptions
Task Statement 2.2: Implement structured error responses for MCP tools
Knowledge of:
- The MCP isError flag pattern for communicating tool failures back to the agent
- The distinction between transient errors (timeouts, service unavailability), validation
errors (invalid input), business errors (policy violations), and permission errors
- Why uniform error responses (generic "Operation failed") prevent the agent from making
appropriate recovery decisions
- The difference between retryable and non-retryable errors, and how returning structured
metadata prevents wasted retry attempts
Skills in:
- Returning structured error metadata including errorCategory
(transient/validation/permission), isRetryable boolean, and human-readable
descriptions
- Including retriable: false flags and customer-friendly explanations for business rule
violations so the agent can communicate appropriately
- Implementing local error recovery within subagents for transient failures, propagating to
the coordinator only errors that cannot be resolved locally along with partial results and
what was attempted
- Distinguishing between access failures (needing retry decisions) and valid empty results
(representing successful queries with no matches)
Task Statement 2.3: Distribute tools appropriately across agents and configure
tool choice
Knowledge of:
- The principle that giving an agent access to too many tools (e.g., 18 instead of 4-5)
degrades tool selection reliability by increasing decision complexity
- Why agents with tools outside their specialization tend to misuse them (e.g., a synthesis
agent attempting web searches)
- Scoped tool access: giving agents only the tools needed for their role, with limited
cross-role tools for specific high-frequency needs
Anthropic, PBC · Confidential Need to Know (NTK)
- tool_choice configuration options: "auto", "any", and forced tool selection ({"type":
"tool", "name": "..."})
Skills in:
- Restricting each subagent's tool set to those relevant to its role, preventing
cross-specialization misuse
- Replacing generic tools with constrained alternatives (e.g., replacing fetch_url with
load_document that validates document URLs)
- Providing scoped cross-role tools for high-frequency needs (e.g., a verify_fact tool for
the synthesis agent) while routing complex cases through the coordinator
- Using tool_choice forced selection to ensure a specific tool is called first (e.g., forcing
extract_metadata before enrichment tools), then processing subsequent steps in
follow-up turns
- Setting tool_choice: "any" to guarantee the model calls a tool rather than returning
conversational text
Task Statement 2.4: Integrate MCP servers into Claude Code and agent workflows
Knowledge of:
- MCP server scoping: project-level (.mcp.json) for shared team tooling vs user-level
(~/.claude.json) for personal/experimental servers
- Environment variable expansion in .mcp.json (e.g., ${GITHUB_TOKEN}) for credential
management without committing secrets
- That tools from all configured MCP servers are discovered at connection time and
available simultaneously to the agent
- MCP resources as a mechanism for exposing content catalogs (e.g., issue summaries,
documentation hierarchies, database schemas) to reduce exploratory tool calls
Skills in:
- Configuring shared MCP servers in project-scoped .mcp.json with environment variable
expansion for authentication tokens
- Configuring personal/experimental MCP servers in user-scoped ~/.claude.json
- Enhancing MCP tool descriptions to explain capabilities and outputs in detail, preventing
the agent from preferring built-in tools (like Grep) over more capable MCP tools
Anthropic, PBC · Confidential Need to Know (NTK)
- Choosing existing community MCP servers over custom implementations for standard
integrations (e.g., Jira), reserving custom servers for team-specific workflows
- Exposing content catalogs as MCP resources to give agents visibility into available data
without requiring exploratory tool calls
Task Statement 2.5: Select and apply built-in tools (Read, Write, Edit, Bash, Grep,
Glob) effectively
Knowledge of:
- Grep for content search (searching file contents for patterns like function names, error
messages, or import statements)
- Glob for file path pattern matching (finding files by name or extension patterns)
- Read/Write for full file operations; Edit for targeted modifications using unique text
matching
- When Edit fails due to non-unique text matches, using Read + Write as a fallback for
reliable file modifications
Skills in:
- Selecting Grep for searching code content across a codebase (e.g., finding all callers of a
function, locating error messages)
- Selecting Glob for finding files matching naming patterns (e.g., **/*.test.tsx)
- Using Read to load full file contents followed by Write when Edit cannot find unique
anchor text
- Building codebase understanding incrementally: starting with Grep to find entry points,
then using Read to follow imports and trace flows, rather than reading all files upfront
- Tracing function usage across wrapper modules by first identifying all exported names,
then searching for each name across the codebase
Domain 3: Claude Code Configuration & Workflows
Task Statement 3.1: Configure CLAUDE.md files with appropriate hierarchy,
scoping, and modular organization
Knowledge of:
Anthropic, PBC · Confidential Need to Know (NTK)
- The CLAUDE.md configuration hierarchy: user-level (~/.claude/CLAUDE.md),
project-level (.claude/CLAUDE.md or root CLAUDE.md), and directory-level
(subdirectory CLAUDE.md files)
- That user-level settings apply only to that user--instructions in ~/.claude/CLAUDE.md
are not shared with teammates via version control
- The @import syntax for referencing external files to keep CLAUDE.md modular (e.g.,
importing specific standards files relevant to each package)
- .claude/rules/ directory for organizing topic-specific rule files as an alternative to a
monolithic CLAUDE.md
Skills in:
- Diagnosing configuration hierarchy issues (e.g., a new team member not receiving
instructions because they're in user-level rather than project-level configuration)
- Using @import to selectively include relevant standards files in each package's
CLAUDE.md based on maintainer domain knowledge
- Splitting large CLAUDE.md files into focused topic-specific files in .claude/rules/ (e.g.,
testing.md, api-conventions.md, deployment.md)
- Using the /memory command to verify which memory files are loaded and diagnose
inconsistent behavior across sessions
Task Statement 3.2: Create and configure custom slash commands and skills
Knowledge of:
- Project-scoped commands in .claude/commands/ (shared via version control) vs
user-scoped commands in ~/.claude/commands/ (personal)
- Skills in .claude/skills/ with SKILL.md files that support frontmatter configuration
including context: fork, allowed-tools, and argument-hint
- The context: fork frontmatter option for running skills in an isolated sub-agent
context, preventing skill outputs from polluting the main conversation
- Personal skill customization: creating personal variants in ~/.claude/skills/ with
different names to avoid affecting teammates
Skills in:
Anthropic, PBC · Confidential Need to Know (NTK)
- Creating project-scoped slash commands in .claude/commands/ for team-wide
availability via version control
- Using context: fork to isolate skills that produce verbose output (e.g., codebase
analysis) or exploratory context (e.g., brainstorming alternatives) from the main session
- Configuring allowed-tools in skill frontmatter to restrict tool access during skill
execution (e.g., limiting to file write operations to prevent destructive actions)
- Using argument-hint frontmatter to prompt developers for required parameters when
they invoke the skill without arguments
- Choosing between skills (on-demand invocation for task-specific workflows) and
CLAUDE.md (always-loaded universal standards)
Task Statement 3.3: Apply path-specific rules for conditional convention loading
Knowledge of:
- .claude/rules/ files with YAML frontmatter paths fields containing glob patterns for
conditional rule activation
- How path-scoped rules load only when editing matching files, reducing irrelevant context
and token usage
- The advantage of glob-pattern rules over directory-level CLAUDE.md files for conventions
that span multiple directories (e.g., test files spread throughout a codebase)
Skills in:
- Creating .claude/rules/ files with YAML frontmatter path scoping (e.g., paths:
["terraform/**/*"]) so rules load only when editing matching files
- Using glob patterns in path-specific rules to apply conventions to files by type regardless
of directory location (e.g., **/*.test.tsx for all test files)
- Choosing path-specific rules over subdirectory CLAUDE.md files when conventions must
apply to files spread across the codebase
Task Statement 3.4: Determine when to use plan mode vs direct execution
Knowledge of:
- Plan mode is designed for complex tasks involving large-scale changes, multiple valid
approaches, architectural decisions, and multi-file modifications
- Direct execution is appropriate for simple, well-scoped changes (e.g., adding a single
validation check to one function)
Anthropic, PBC · Confidential Need to Know (NTK)
- Plan mode enables safe codebase exploration and design before committing to changes,
preventing costly rework
- The Explore subagent for isolating verbose discovery output and returning summaries to
preserve main conversation context
Skills in:
- Selecting plan mode for tasks with architectural implications (e.g., microservice
restructuring, library migrations affecting 45+ files, choosing between integration
approaches with different infrastructure requirements)
- Selecting direct execution for well-understood changes with clear scope (e.g., a single-file
bug fix with a clear stack trace, adding a date validation conditional)
- Using the Explore subagent for verbose discovery phases to prevent context window
exhaustion during multi-phase tasks
- Combining plan mode for investigation with direct execution for implementation (e.g.,
planning a library migration, then executing the planned approach)
Task Statement 3.5: Apply iterative refinement techniques for progressive
improvement
Knowledge of:
- Concrete input/output examples as the most effective way to communicate expected
transformations when prose descriptions are interpreted inconsistently
- Test-driven iteration: writing test suites first, then iterating by sharing test failures to
guide progressive improvement
- The interview pattern: having Claude ask questions to surface considerations the
developer may not have anticipated before implementing
- When to provide all issues in a single message (interacting problems) versus fixing them
sequentially (independent problems)
Skills in:
- Providing 2-3 concrete input/output examples to clarify transformation requirements
when natural language descriptions produce inconsistent results
- Writing test suites covering expected behavior, edge cases, and performance
requirements before implementation, then iterating by sharing test failures
Anthropic, PBC · Confidential Need to Know (NTK)
- Using the interview pattern to surface design considerations (e.g., cache invalidation
strategies, failure modes) before implementing solutions in unfamiliar domains
- Providing specific test cases with example input and expected output to fix edge case
handling (e.g., null values in migration scripts)
- Addressing multiple interacting issues in a single detailed message when fixes interact,
versus sequential iteration for independent issues
Task Statement 3.6: Integrate Claude Code into CI/CD pipelines
Knowledge of:
- The -p (or --print) flag for running Claude Code in non-interactive mode in automated
pipelines
- --output-format json and --json-schema CLI flags for enforcing structured output
in CI contexts
- CLAUDE.md as the mechanism for providing project context (testing standards, fixture
conventions, review criteria) to CI-invoked Claude Code
- Session context isolation: why the same Claude session that generated code is less
effective at reviewing its own changes compared to an independent review instance
Skills in:
- Running Claude Code in CI with the -p flag to prevent interactive input hangs
- Using --output-format json with --json-schema to produce machine-parseable
structured findings for automated posting as inline PR comments
- Including prior review findings in context when re-running reviews after new commits,
instructing Claude to report only new or still-unaddressed issues to avoid duplicate
comments
- Providing existing test files in context so test generation avoids suggesting duplicate
scenarios already covered by the test suite
- Documenting testing standards, valuable test criteria, and available fixtures in
CLAUDE.md to improve test generation quality and reduce low-value test output
Anthropic, PBC · Confidential Need to Know (NTK)
Domain 4: Prompt Engineering & Structured Output
Task Statement 4.1: Design prompts with explicit criteria to improve precision and
reduce false positives
Knowledge of:
- The importance of explicit criteria over vague instructions (e.g., "flag comments only
when claimed behavior contradicts actual code behavior" vs "check that comments are
accurate")
- How general instructions like "be conservative" or "only report high-confidence findings"
fail to improve precision compared to specific categorical criteria
- The impact of false positive rates on developer trust: high false positive categories
undermine confidence in accurate categories
Skills in:
- Writing specific review criteria that define which issues to report (bugs, security) versus
skip (minor style, local patterns) rather than relying on confidence-based filtering
- Temporarily disabling high false-positive categories to restore developer trust while
improving prompts for those categories
- Defining explicit severity criteria with concrete code examples for each severity level to
achieve consistent classification
Task Statement 4.2: Apply few-shot prompting to improve output consistency and
quality
Knowledge of:
- Few-shot examples as the most effective technique for achieving consistently formatted,
actionable output when detailed instructions alone produce inconsistent results
- The role of few-shot examples in demonstrating ambiguous-case handling (e.g., tool
selection for ambiguous requests, branch-level test coverage gaps)
- How few-shot examples enable the model to generalize judgment to novel patterns rather
than matching only pre-specified cases
- The effectiveness of few-shot examples for reducing hallucination in extraction tasks (e.g.,
handling informal measurements, varied document structures)
Anthropic, PBC · Confidential Need to Know (NTK)
Skills in:
- Creating 2-4 targeted few-shot examples for ambiguous scenarios that show reasoning for
why one action was chosen over plausible alternatives
- Including few-shot examples that demonstrate specific desired output format (location,
issue, severity, suggested fix) to achieve consistency
- Providing few-shot examples distinguishing acceptable code patterns from genuine issues
to reduce false positives while enabling generalization
- Using few-shot examples to demonstrate correct handling of varied document structures
(inline citations vs bibliographies, methodology sections vs embedded details)
- Adding few-shot examples showing correct extraction from documents with varied
formats to address empty/null extraction of required fields
Task Statement 4.3: Enforce structured output using tool use and JSON schemas
Knowledge of:
- Tool use (tool_use) with JSON schemas as the most reliable approach for guaranteed
schema-compliant structured output, eliminating JSON syntax errors
- The distinction between tool_choice: "auto" (model may return text instead of
calling a tool), "any" (model must call a tool but can choose which), and forced tool
selection (model must call a specific named tool)
- That strict JSON schemas via tool use eliminate syntax errors but do not prevent semantic
errors (e.g., line items that don't sum to total, values in wrong fields)
- Schema design considerations: required vs optional fields, enum fields with "other" +
detail string patterns for extensible categories
Skills in:
- Defining extraction tools with JSON schemas as input parameters and extracting
structured data from the tool_use response
- Setting tool_choice: "any" to guarantee structured output when multiple extraction
schemas exist and the document type is unknown
- Forcing a specific tool with tool_choice: {"type": "tool", "name":
"extract_metadata"} to ensure a particular extraction runs before enrichment steps
- Designing schema fields as optional (nullable) when source documents may not contain
the information, preventing the model from fabricating values to satisfy required fields
Anthropic, PBC · Confidential Need to Know (NTK)
- Adding enum values like "unclear" for ambiguous cases and "other" + detail fields for
extensible categorization
- Including format normalization rules in prompts alongside strict output schemas to
handle inconsistent source formatting
Task Statement 4.4: Implement validation, retry, and feedback loops for extraction
quality
Knowledge of:
- Retry-with-error-feedback: appending specific validation errors to the prompt on retry to
guide the model toward correction
- The limits of retry: retries are ineffective when the required information is simply absent
from the source document (vs format or structural errors)
- Feedback loop design: tracking which code constructs trigger findings
(detected_pattern field) to enable systematic analysis of dismissal patterns
- The difference between semantic validation errors (values don't sum, wrong field
placement) and schema syntax errors (eliminated by tool use)
Skills in:
- Implementing follow-up requests that include the original document, the failed
extraction, and specific validation errors for model self-correction
- Identifying when retries will be ineffective (e.g., information exists only in an external
document not provided) versus when they will succeed (format mismatches, structural
output errors)
- Adding detected_pattern fields to structured findings to enable analysis of false
positive patterns when developers dismiss findings
- Designing self-correction validation flows: extracting "calculated_total" alongside
"stated_total" to flag discrepancies, adding "conflict_detected" booleans for inconsistent
source data
Task Statement 4.5: Design efficient batch processing strategies
Knowledge of:
- The Message Batches API: 50% cost savings, up to 24-hour processing window, no
guaranteed latency SLA
Anthropic, PBC · Confidential Need to Know (NTK)
- Batch processing is appropriate for non-blocking, latency-tolerant workloads (overnight
reports, weekly audits, nightly test generation) and inappropriate for blocking workflows
(pre-merge checks)
- The batch API does not support multi-turn tool calling within a single request (cannot
execute tools mid-request and return results)
- custom_id fields for correlating batch request/response pairs
Skills in:
- Matching API approach to workflow latency requirements: synchronous API for blocking
pre-merge checks, batch API for overnight/weekly analysis
- Calculating batch submission frequency based on SLA constraints (e.g., 4-hour windows to
guarantee 30-hour SLA with 24-hour batch processing)
- Handling batch failures: resubmitting only failed documents (identified by custom_id)
with appropriate modifications (e.g., chunking documents that exceeded context limits)
- Using prompt refinement on a sample set before batch-processing large volumes to
maximize first-pass success rates and reduce iterative resubmission costs
Task Statement 4.6: Design multi-instance and multi-pass review architectures
Knowledge of:
- Self-review limitations: a model retains reasoning context from generation, making it less
likely to question its own decisions in the same session
- Independent review instances (without prior reasoning context) are more effective at
catching subtle issues than self-review instructions or extended thinking
- Multi-pass review: splitting large reviews into per-file local analysis passes plus cross-file
integration passes to avoid attention dilution and contradictory findings
Skills in:
- Using a second independent Claude instance to review generated code without the
generator's reasoning context
- Splitting large multi-file reviews into focused per-file passes for local issues plus separate
integration passes for cross-file data flow analysis
- Running verification passes where the model self-reports confidence alongside each
finding to enable calibrated review routing
Anthropic, PBC · Confidential Need to Know (NTK)
Domain 5: Context Management & Reliability
Task Statement 5.1: Manage conversation context to preserve critical information
across long interactions
Knowledge of:
- Progressive summarization risks: condensing numerical values, percentages, dates, and
customer-stated expectations into vague summaries
- The "lost in the middle" effect: models reliably process information at the beginning and
end of long inputs but may omit findings from middle sections
- How tool results accumulate in context and consume tokens disproportionately to their
relevance (e.g., 40+ fields per order lookup when only 5 are relevant)
- The importance of passing complete conversation history in subsequent API requests to
maintain conversational coherence
Skills in:
- Extracting transactional facts (amounts, dates, order numbers, statuses) into a persistent
"case facts" block included in each prompt, outside summarized history
- Extracting and persisting structured issue data (order IDs, amounts, statuses) into a
separate context layer for multi-issue sessions
- Trimming verbose tool outputs to only relevant fields before they accumulate in context
(e.g., keeping only return-relevant fields from order lookups)
- Placing key findings summaries at the beginning of aggregated inputs and organizing
detailed results with explicit section headers to mitigate position effects
- Requiring subagents to include metadata (dates, source locations, methodological context)
in structured outputs to support accurate downstream synthesis
- Modifying upstream agents to return structured data (key facts, citations, relevance
scores) instead of verbose content and reasoning chains when downstream agents have
limited context budgets
Task Statement 5.2: Design effective escalation and ambiguity resolution patterns
Knowledge of:
Anthropic, PBC · Confidential Need to Know (NTK)
- Appropriate escalation triggers: customer requests for a human, policy exceptions/gaps
(not just complex cases), and inability to make meaningful progress
- The distinction between escalating immediately when a customer explicitly demands it
versus offering to resolve when the issue is straightforward
- Why sentiment-based escalation and self-reported confidence scores are unreliable
proxies for actual case complexity
- How multiple customer matches require clarification (requesting additional identifiers)
rather than heuristic selection
Skills in:
- Adding explicit escalation criteria with few-shot examples to the system prompt
demonstrating when to escalate versus resolve autonomously
- Honoring explicit customer requests for human agents immediately without first
attempting investigation
- Acknowledging frustration while offering resolution when the issue is within the agent's
capability, escalating only if the customer reiterates their preference
- Escalating when policy is ambiguous or silent on the customer's specific request (e.g.,
competitor price matching when policy only addresses own-site adjustments)
- Instructing the agent to ask for additional identifiers when tool results return multiple
matches, rather than selecting based on heuristics
Task Statement 5.3: Implement error propagation strategies across multi-agent
systems
Knowledge of:
- Structured error context (failure type, attempted query, partial results, alternative
approaches) as enabling intelligent coordinator recovery decisions
- The distinction between access failures (timeouts needing retry decisions) and valid
empty results (successful queries with no matches)
- Why generic error statuses ("search unavailable") hide valuable context from the
coordinator
- Why silently suppressing errors (returning empty results as success) or terminating entire
workflows on single failures are both anti-patterns
Skills in:
Anthropic, PBC · Confidential Need to Know (NTK)
- Returning structured error context including failure type, what was attempted, partial
results, and potential alternatives to enable coordinator recovery
- Distinguishing access failures from valid empty results in error reporting so the
coordinator can make appropriate decisions
- Having subagents implement local recovery for transient failures and only propagate
errors they cannot resolve, including what was attempted and partial results
- Structuring synthesis output with coverage annotations indicating which findings are
well-supported versus which topic areas have gaps due to unavailable sources
Task Statement 5.4: Manage context effectively in large codebase exploration
Knowledge of:
- Context degradation in extended sessions: models start giving inconsistent answers and
referencing "typical patterns" rather than specific classes discovered earlier
- The role of scratchpad files for persisting key findings across context boundaries
- Subagent delegation for isolating verbose exploration output while the main agent
coordinates high-level understanding
- Structured state persistence for crash recovery: each agent exports state to a known
location, and the coordinator loads a manifest on resume
Skills in:
- Spawning subagents to investigate specific questions (e.g., "find all test files," "trace refund
flow dependencies") while the main agent preserves high-level coordination
- Having agents maintain scratchpad files recording key findings, referencing them for
subsequent questions to counteract context degradation
- Summarizing key findings from one exploration phase before spawning sub-agents for the
next phase, injecting summaries into initial context
- Designing crash recovery using structured agent state exports (manifests) that the
coordinator loads on resume and injects into agent prompts
- Using /compact to reduce context usage during extended exploration sessions when
context fills with verbose discovery output
Task Statement 5.5: Design human review workflows and confidence calibration
Knowledge of:
Anthropic, PBC · Confidential Need to Know (NTK)
- The risk that aggregate accuracy metrics (e.g., 97% overall) may mask poor performance
on specific document types or fields
- Stratified random sampling for measuring error rates in high-confidence extractions and
detecting novel error patterns
- Field-level confidence scores calibrated using labeled validation sets for routing review
attention
- The importance of validating accuracy by document type and field segment before
automating high-confidence extractions
Skills in:
- Implementing stratified random sampling of high-confidence extractions for ongoing
error rate measurement and novel pattern detection
- Analyzing accuracy by document type and field to verify consistent performance across all
segments before reducing human review
- Having models output field-level confidence scores, then calibrating review thresholds
using labeled validation sets
- Routing extractions with low model confidence or ambiguous/contradictory source
documents to human review, prioritizing limited reviewer capacity
Task Statement 5.6: Preserve information provenance and handle uncertainty in
multi-source synthesis
Knowledge of:
- How source attribution is lost during summarization steps when findings are compressed
without preserving claim-source mappings
- The importance of structured claim-source mappings that the synthesis agent must
preserve and merge when combining findings
- How to handle conflicting statistics from credible sources: annotating conflicts with
source attribution rather than arbitrarily selecting one value
- Temporal data: requiring publication/collection dates in structured outputs to prevent
temporal differences from being misinterpreted as contradictions
Skills in:
- Requiring subagents to output structured claim-source mappings (source URLs,
document names, relevant excerpts) that downstream agents preserve through synthesis
Anthropic, PBC · Confidential Need to Know (NTK)
- Structuring reports with explicit sections distinguishing well-established findings from
contested ones, preserving original source characterizations and methodological context
- Completing document analysis with conflicting values included and explicitly annotated,
letting the coordinator decide how to reconcile before passing to synthesis
- Requiring subagents to include publication or data collection dates in structured outputs
to enable correct temporal interpretation
- Rendering different content types appropriately in synthesis outputs--financial data as
tables, news as prose, technical findings as structured lists--rather than converting
everything to a uniform format
Sample Questions
The following sample questions illustrate the format and difficulty level of the exam. These are
drawn from the practice test and include explanations to aid learning.
Scenario: Customer Support Resolution Agent
Question 1: Production data shows that in 12% of cases, your agent skips get_customer entirely
and calls lookup_order using only the customer's stated name, occasionally leading to
misidentified accounts and incorrect refunds. What change would most effectively address this
reliability issue?
A) Add a programmatic prerequisite that blocks lookup_order and process_refund calls until
get_customer has returned a verified customer ID. B) Enhance the system prompt to state that
customer verification via get_customer is mandatory before any order operations. C) Add
few-shot examples showing the agent always calling get_customer first, even when customers
volunteer order details. D) Implement a routing classifier that analyzes each request and enables
only the subset of tools appropriate for that request type.
Correct Answer: A
When a specific tool sequence is required for critical business logic (like verifying customer
identity before processing refunds), programmatic enforcement provides deterministic
guarantees that prompt-based approaches cannot. Options B and C rely on probabilistic LLM
compliance, which is insufficient when errors have financial consequences. Option D addresses
tool availability rather than tool ordering, which is not the actual problem.
Anthropic, PBC · Confidential Need to Know (NTK)
Question 2: Production logs show the agent frequently calls get_customer when users ask
about orders (e.g., "check my order #12345"), instead of calling lookup_order. Both tools have
minimal descriptions ("Retrieves customer information" / "Retrieves order details") and accept
similar identifier formats. What's the most effective first step to improve tool selection
reliability?
A) Add few-shot examples to the system prompt demonstrating correct tool selection patterns,
with 5-8 examples showing order-related queries routing to lookup_order. B) Expand each
tool's description to include input formats it handles, example queries, edge cases, and
boundaries explaining when to use it versus similar tools. C) Implement a routing layer that
parses user input before each turn and pre-selects the appropriate tool based on detected
keywords and identifier patterns. D) Consolidate both tools into a single lookup_entity tool
that accepts any identifier and internally determines which backend to query.
Correct Answer: B
Tool descriptions are the primary mechanism LLMs use for tool selection. When descriptions
are minimal, models lack the context to differentiate between similar tools. Option B directly
addresses this root cause with a low-effort, high-leverage fix. Few-shot examples (A) add token
overhead without fixing the underlying issue. A routing layer (C) is over-engineered and
bypasses the LLM's natural language understanding. Consolidating tools (D) is a valid
architectural choice but requires more effort than a "first step" warrants when the immediate
problem is inadequate descriptions.
Question 3: Your agent achieves 55% first-contact resolution, well below the 80% target. Logs
show it escalates straightforward cases (standard damage replacements with photo evidence)
while attempting to autonomously handle complex situations requiring policy exceptions.
What's the most effective way to improve escalation calibration?
A) Add explicit escalation criteria to your system prompt with few-shot examples demonstrating
when to escalate versus resolve autonomously. B) Have the agent self-report a confidence score
(1-10) before each response and automatically route requests to humans when confidence falls
below a threshold. C) Deploy a separate classifier model trained on historical tickets to predict
which requests need escalation before the main agent begins processing. D) Implement
Anthropic, PBC · Confidential Need to Know (NTK)
sentiment analysis to detect customer frustration levels and automatically escalate when
negative sentiment exceeds a threshold.
Correct Answer: A
Adding explicit escalation criteria with few-shot examples directly addresses the root cause:
unclear decision boundaries. This is the proportionate first response before adding
infrastructure. Option B fails because LLM self-reported confidence is poorly calibrated--the
agent is already incorrectly confident on hard cases. Option C is over-engineered, requiring
labeled data and ML infrastructure when prompt optimization hasn't been tried. Option D solves
a different problem entirely; sentiment doesn't correlate with case complexity, which is the
actual issue.
Scenario: Code Generation with Claude Code
Question 4: You want to create a custom /review slash command that runs your team's
standard code review checklist. This command should be available to every developer when they
clone or pull the repository. Where should you create this command file?
A) In the .claude/commands/ directory in the project repository B) In ~/.claude/commands/
in each developer's home directory C) In the CLAUDE.md file at the project root D) In a
.claude/config.json file with a commands array
Correct Answer: A
Project-scoped custom slash commands should be stored in the .claude/commands/ directory
within the repository. These commands are version-controlled and automatically available to all
developers when they clone or pull the repo. Option B (~/.claude/commands/) is for personal
commands that aren't shared via version control. Option C (CLAUDE.md) is for project
instructions and context, not command definitions. Option D describes a configuration
mechanism that doesn't exist in Claude Code.
Anthropic, PBC · Confidential Need to Know (NTK)
Question 5: You've been assigned to restructure the team's monolithic application into
microservices. This will involve changes across dozens of files and requires decisions about
service boundaries and module dependencies. Which approach should you take?
A) Enter plan mode to explore the codebase, understand dependencies, and design an
implementation approach before making changes. B) Start with direct execution and make
changes incrementally, letting the implementation reveal the natural service boundaries. C) Use
direct execution with comprehensive upfront instructions detailing exactly how each service
should be structured. D) Begin in direct execution mode and only switch to plan mode if you
encounter unexpected complexity during implementation.
Correct Answer: A
Plan mode is designed for complex tasks involving large-scale changes, multiple valid
approaches, and architectural decisions--exactly what monolith-to-microservices restructuring
requires. It enables safe codebase exploration and design before committing to changes. Option
B risks costly rework when dependencies are discovered late. Option C assumes you already
know the right structure without exploring the code. Option D ignores that the complexity is
already stated in the requirements, not something that might emerge later.
Question 6: Your codebase has distinct areas with different coding conventions: React
components use functional style with hooks, API handlers use async/await with specific error
handling, and database models follow a repository pattern. Test files are spread throughout the
codebase alongside the code they test (e.g., Button.test.tsx next to Button.tsx), and you
want all tests to follow the same conventions regardless of location. What's the most
maintainable way to ensure Claude automatically applies the correct conventions when
generating code?
A) Create rule files in .claude/rules/ with YAML frontmatter specifying glob patterns to
conditionally apply conventions based on file paths B) Consolidate all conventions in the root
CLAUDE.md file under headers for each area, relying on Claude to infer which section applies C)
Create skills in .claude/skills/ for each code type that include the relevant conventions in
their SKILL.md files D) Place a separate CLAUDE.md file in each subdirectory containing that
area's specific conventions
Correct Answer: A
Anthropic, PBC · Confidential Need to Know (NTK)
Option A is correct because .claude/rules/ with glob patterns (e.g., **/*.test.tsx) allows
conventions to be automatically applied based on file paths regardless of directory
location--essential for test files spread throughout the codebase. Option B relies on inference
rather than explicit matching, making it unreliable. Option C requires manual skill invocation or
relies on Claude choosing to load them, contradicting the need for deterministic "automatic"
application based on file paths. Option D can't easily handle files spread across many directories
since CLAUDE.md files are directory-bound.
Scenario: Multi-Agent Research System
Question 7: After running the system on the topic "impact of AI on creative industries," you
observe that each subagent completes successfully: the web search agent finds relevant articles,
the document analysis agent summarizes papers correctly, and the synthesis agent produces
coherent output. However, the final reports cover only visual arts, completely missing music,
writing, and film production. When you examine the coordinator's logs, you see it decomposed
the topic into three subtasks: "AI in digital art creation," "AI in graphic design," and "AI in
photography." What is the most likely root cause?
A) The synthesis agent lacks instructions for identifying coverage gaps in the findings it receives
from other agents. B) The coordinator agent's task decomposition is too narrow, resulting in
subagent assignments that don't cover all relevant domains of the topic. C) The web search
agent's queries are not comprehensive enough and need to be expanded to cover more creative
industry sectors. D) The document analysis agent is filtering out sources related to non-visual
creative industries due to overly restrictive relevance criteria.
Correct Answer: B
The coordinator's logs reveal the root cause directly: it decomposed "creative industries" into
only visual arts subtasks (digital art, graphic design, photography), completely omitting music,
writing, and film. The subagents executed their assigned tasks correctly--the problem is what
they were assigned. Options A, C, and D incorrectly blame downstream agents that are working
correctly within their assigned scope.
Anthropic, PBC · Confidential Need to Know (NTK)
Question 8: The web search subagent times out while researching a complex topic. You need to
design how this failure information flows back to the coordinator agent. Which error
propagation approach best enables intelligent recovery?
A) Return structured error context to the coordinator including the failure type, the attempted
query, any partial results, and potential alternative approaches. B) Implement automatic retry
logic with exponential backoff within the subagent, returning a generic "search unavailable"
status only after all retries are exhausted. C) Catch the timeout within the subagent and return
an empty result set marked as successful. D) Propagate the timeout exception directly to a
top-level handler that terminates the entire research workflow.
Correct Answer: A
Structured error context gives the coordinator the information it needs to make intelligent
recovery decisions--whether to retry with a modified query, try an alternative approach, or
proceed with partial results. Option B's generic status hides valuable context from the
coordinator, preventing informed decisions. Option C suppresses the error by marking failure as
success, which prevents any recovery and risks incomplete research outputs. Option D
terminates the entire workflow unnecessarily when recovery strategies could succeed.
Question 9: During testing, you observe that the synthesis agent frequently needs to verify
specific claims while combining findings. Currently, when verification is needed, the synthesis
agent returns control to the coordinator, which invokes the web search agent, then re-invokes
synthesis with results. This adds 2-3 round trips per task and increases latency by 40%. Your
evaluation shows that 85% of these verifications are simple fact-checks (dates, names, statistics)
while 15% require deeper investigation. What's the most effective approach to reduce overhead
while maintaining system reliability?
A) Give the synthesis agent a scoped verify_fact tool for simple lookups, while complex
verifications continue delegating to the web search agent through the coordinator. B) Have the
synthesis agent accumulate all verification needs and return them as a batch to the coordinator
at the end of its pass, which then sends them all to the web search agent at once. C) Give the
synthesis agent access to all web search tools so it can handle any verification need directly
without round-trips through the coordinator. D) Have the web search agent proactively cache
extra context around each source during initial research, anticipating what the synthesis agent
might need to verify.
Anthropic, PBC · Confidential Need to Know (NTK)
Correct Answer: A
Option A applies the principle of least privilege by giving the synthesis agent only what it needs
for the 85% common case (simple fact verification) while preserving the existing coordination
pattern for complex cases. Option B's batching approach creates blocking dependencies since
synthesis steps may depend on earlier verified facts. Option C over-provisions the synthesis
agent, violating separation of concerns. Option D relies on speculative caching that cannot
reliably predict what the synthesis agent will need to verify.
Scenario: Claude Code for Continuous Integration
Question 10: Your pipeline script runs claude "Analyze this pull request for
security issues" but the job hangs indefinitely. Logs indicate Claude Code is waiting for
interactive input. What's the correct approach to run Claude Code in an automated pipeline?
A) Add the -p flag: claude -p "Analyze this pull request for security issues" B)
Set the environment variable CLAUDE_HEADLESS=true before running the command C)
Redirect stdin from /dev/null: claude "Analyze this pull request for security
issues" < /dev/null D) Add the --batch flag: claude --batch "Analyze this pull
request for security issues"
Correct Answer: A
The -p (or --print) flag is the documented way to run Claude Code in non-interactive mode. It
processes the prompt, outputs the result to stdout, and exits without waiting for user
input--exactly what CI/CD pipelines require. The other options reference non-existent features
(CLAUDE_HEADLESS environment variable, --batch flag) or use Unix workarounds that don't
properly address Claude Code's command syntax.
Question 11: Your team wants to reduce API costs for automated analysis. Currently, real-time
Claude calls power two workflows: (1) a blocking pre-merge check that must complete before
developers can merge, and (2) a technical debt report generated overnight for review the next
Anthropic, PBC · Confidential Need to Know (NTK)
morning. Your manager proposes switching both to the Message Batches API for its 50% cost
savings. How should you evaluate this proposal?
A) Use batch processing for the technical debt reports only; keep real-time calls for pre-merge
checks. B) Switch both workflows to batch processing with status polling to check for
completion. C) Keep real-time calls for both workflows to avoid batch result ordering issues. D)
Switch both to batch processing with a timeout fallback to real-time if batches take too long.
Correct Answer: A
The Message Batches API offers 50% cost savings but has processing times up to 24 hours with
no guaranteed latency SLA. This makes it unsuitable for blocking pre-merge checks where
developers wait for results, but ideal for overnight batch jobs like technical debt reports. Option
B is wrong because relying on "often faster" completion isn't acceptable for blocking workflows.
Option C reflects a misconception--batch results can be correlated using custom_id fields.
Option D adds unnecessary complexity when the simpler solution is matching each API to its
appropriate use case.
Question 12: A pull request modifies 14 files across the stock tracking module. Your single-pass
review analyzing all files together produces inconsistent results: detailed feedback for some files
but superficial comments for others, obvious bugs missed, and contradictory feedback--flagging
a pattern as problematic in one file while approving identical code elsewhere in the same PR.
How should you restructure the review?
A) Split into focused passes: analyze each file individually for local issues, then run a separate
integration-focused pass examining cross-file data flow. B) Require developers to split large PRs
into smaller submissions of 3-4 files before the automated review runs. C) Switch to a
higher-tier model with a larger context window to give all 14 files adequate attention in one
pass. D) Run three independent review passes on the full PR and only flag issues that appear in at
least two of the three runs.
Correct Answer: A
Splitting reviews into focused passes directly addresses the root cause: attention dilution when
processing many files at once. File-by-file analysis ensures consistent depth, while a separate
integration pass catches cross-file issues. Option B shifts burden to developers without
Anthropic, PBC · Confidential Need to Know (NTK)
improving the system. Option C misunderstands that larger context windows don't solve
attention quality issues. Option D would actually suppress detection of real bugs by requiring
consensus on issues that may only be caught intermittently.
Preparation Exercises
Complete these hands-on exercises to build practical familiarity with the topics covered on the
exam. Each exercise is designed to reinforce knowledge across one or more exam domains.
Exercise 1: Build a Multi-Tool Agent with Escalation Logic
Objective: Practice designing an agentic loop with tool integration, structured error handling,
and escalation patterns.
Steps:
1. Define 3-4 MCP tools with detailed descriptions that clearly differentiate each tool's
purpose, expected inputs, and boundary conditions. Include at least two tools with similar
functionality that require careful description to avoid selection confusion.
2. Implement an agentic loop that checks stop_reason to determine whether to continue
tool execution or present the final response. Handle both "tool_use" and "end_turn"
stop reasons correctly.
3. Add structured error responses to your tools: include errorCategory
(transient/validation/permission), isRetryable boolean, and human-readable
descriptions. Test that the agent handles each error type appropriately (retrying transient
errors, explaining business errors to the user).
4. Implement a programmatic hook that intercepts tool calls to enforce a business rule (e.g.,
blocking operations above a threshold amount), redirecting to an escalation workflow
when triggered.
5. Test with multi-concern messages (e.g., requests involving multiple issues) and verify the
agent decomposes the request, handles each concern, and synthesizes a unified response.
Domains reinforced: Domain 1 (Agentic Architecture & Orchestration), Domain 2 (Tool Design &
MCP Integration), Domain 5 (Context Management & Reliability)
Anthropic, PBC · Confidential Need to Know (NTK)
Exercise 2: Configure Claude Code for a Team Development Workflow
Objective: Practice configuring CLAUDE.md hierarchies, custom slash commands, path-specific
rules, and MCP server integration for a multi-developer project.
Steps:
1. Create a project-level CLAUDE.md with universal coding standards and testing
conventions. Verify that instructions placed at the project level are consistently applied
across all team members.
2. Create .claude/rules/ files with YAML frontmatter glob patterns for different code
areas (e.g., paths: ["src/api/**/*"] for API conventions, paths: ["**/*.test.*"]
for testing conventions). Test that rules load only when editing matching files.
3. Create a project-scoped skill in .claude/skills/ with context: fork and
allowed-tools restrictions. Verify the skill runs in isolation without polluting the main
conversation context.
4. Configure an MCP server in .mcp.json with environment variable expansion for
credentials. Add a personal experimental MCP server in ~/.claude.json and verify both
are available simultaneously.
5. Test plan mode versus direct execution on tasks of varying complexity: a single-file bug
fix, a multi-file library migration, and a new feature with multiple valid implementation
approaches. Observe when plan mode provides value.
Domains reinforced: Domain 3 (Claude Code Configuration & Workflows), Domain 2 (Tool
Design & MCP Integration)
Exercise 3: Build a Structured Data Extraction Pipeline
Objective: Practice designing JSON schemas, using tool_use for structured output,
implementing validation-retry loops, and designing batch processing strategies.
Steps:
1. Define an extraction tool with a JSON schema containing required and optional fields, an
enum with an "other" + detail string pattern, and nullable fields for information that may
not exist in source documents. Process documents where some fields are absent and
verify the model returns null rather than fabricating values.
Anthropic, PBC · Confidential Need to Know (NTK)
2. Implement a validation-retry loop: when Pydantic or JSON schema validation fails, send a
follow-up request including the document, the failed extraction, and the specific
validation error. Track which errors are resolvable via retry (format mismatches) versus
which are not (information absent from source).
3. Add few-shot examples demonstrating extraction from documents with varied formats
(e.g., inline citations vs bibliographies, narrative descriptions vs structured tables) and
verify improved handling of structural variety.
4. Design a batch processing strategy: submit a batch of 100 documents using the Message
Batches API, handle failures by custom_id, resubmit failed documents with modifications
(e.g., chunking oversized documents), and calculate total processing time relative to SLA
constraints.
5. Implement a human review routing strategy: have the model output field-level confidence
scores, route low-confidence extractions to human review, and analyze accuracy by
document type and field to verify consistent performance.
Domains reinforced: Domain 4 (Prompt Engineering & Structured Output), Domain 5 (Context
Management & Reliability)
Exercise 4: Design and Debug a Multi-Agent Research Pipeline
Objective: Practice orchestrating subagents, managing context passing, implementing error
propagation, and handling synthesis with provenance tracking.
Steps:
1. Build a coordinator agent that delegates to at least two subagents (e.g., web search and
document analysis). Ensure the coordinator's allowedTools includes "Task" and that
each subagent receives its research findings directly in its prompt rather than relying on
automatic context inheritance.
2. Implement parallel subagent execution by having the coordinator emit multiple Task tool
calls in a single response. Measure the latency improvement compared to sequential
execution.
3. Design structured output for subagents that separates content from metadata: each
finding should include a claim, evidence excerpt, source URL/document name, and
publication date. Verify that the synthesis subagent preserves source attribution when
combining findings.
Anthropic, PBC · Confidential Need to Know (NTK)
4. Implement error propagation: simulate a subagent timeout and verify the coordinator
receives structured error context (failure type, attempted query, partial results). Test that
the coordinator can proceed with partial results and annotate the final output with
coverage gaps.
5. Test with conflicting source data (e.g., two credible sources with different statistics) and
verify the synthesis output preserves both values with source attribution rather than
arbitrarily selecting one, and structures the report to distinguish well-established from
contested findings.
Domains reinforced: Domain 1 (Agentic Architecture & Orchestration), Domain 2 (Tool Design &
MCP Integration), Domain 5 (Context Management & Reliability)
Appendix
Technologies and Concepts
The following list contains technologies and concepts that might appear on the exam:
- Claude Agent SDK -- Agent definitions, agentic loops, stop_reason handling, hooks
(PostToolUse, tool call interception), subagent spawning via Task tool, allowedTools
configuration
- Model Context Protocol (MCP) -- MCP servers, MCP tools, MCP resources, isError flag,
tool descriptions, tool distribution, .mcp.json configuration, environment variable
expansion
- Claude Code -- CLAUDE.md configuration hierarchy (user/project/directory),
.claude/rules/ with YAML frontmatter path-scoping, .claude/commands/ for slash
commands, .claude/skills/ with SKILL.md frontmatter (context: fork,
allowed-tools, argument-hint), plan mode, direct execution, /memory command,
/compact, --resume, fork_session, Explore subagent
- Claude Code CLI -- -p / --print flag for non-interactive mode, --output-format
json, --json-schema for structured CI output
- Claude API -- tool_use with JSON schemas, tool_choice options ("auto", "any",
forced tool selection), stop_reason values ("tool_use", "end_turn"), max_tokens,
system prompts
Anthropic, PBC · Confidential Need to Know (NTK)
- Message Batches API -- 50% cost savings, up to 24-hour processing window, custom_id
for request/response correlation, polling for completion, no multi-turn tool calling
support
- JSON Schema -- Required vs optional fields, enum types, nullable fields, "other" + detail
string patterns, strict mode for syntax error elimination
- Pydantic -- Schema validation, semantic validation errors, validation-retry loops
- Built-in tools -- Read, Write, Edit, Bash, Grep, Glob -- their purposes and selection
criteria
- Few-shot prompting -- Targeted examples for ambiguous scenarios, format
demonstration, generalization to novel patterns
- Prompt chaining -- Sequential task decomposition into focused passes
- Context window management -- Token budgets, progressive summarization,
lost-in-the-middle effects, context extraction, scratchpad files
- Session management -- Session resumption, fork_session, named sessions, session
context isolation
- Confidence scoring -- Field-level confidence, calibration with labeled validation sets,
stratified sampling for error rate measurement
In-Scope Topics
The following topics are explicitly tested on the exam:
- Agentic loop implementation: Control flow based on stop_reason, tool result handling,
loop termination conditions
- Multi-agent orchestration: Coordinator-subagent patterns, task decomposition, parallel
subagent execution, iterative refinement loops
- Subagent context management: Explicit context passing, structured state persistence,
crash recovery using manifests
- Tool interface design: Writing effective tool descriptions, splitting vs consolidating tools,
tool naming to reduce ambiguity
- MCP tool and resource design: Resources for content catalogs, tools for actions,
description quality for adoption
- MCP server configuration: Project vs user scope, environment variable expansion,
multi-server simultaneous access
- Error handling and propagation: Structured error responses, transient vs business vs
permission errors, local recovery before escalation
- Escalation decision-making: Explicit criteria, honoring customer preferences, policy gap
identification
Anthropic, PBC · Confidential Need to Know (NTK)
- CLAUDE.md configuration: Hierarchy (user/project/directory), @import patterns,
.claude/rules/ with glob patterns
- Custom commands and skills: Project vs user scope, context: fork, allowed-tools,
argument-hint frontmatter
- Plan mode vs direct execution: Complexity assessment, architectural decisions,
single-file changes
- Iterative refinement: Input/output examples, test-driven iteration, interview pattern,
sequential vs parallel issue resolution
- Structured output via tool_use: Schema design, tool_choice configuration, nullable
fields to prevent hallucination
- Few-shot prompting: Ambiguous scenario targeting, format consistency, false positive
reduction
- Batch processing: Message Batches API appropriateness, latency tolerance assessment,
failure handling by custom_id
- Context window optimization: Trimming verbose tool outputs, structured fact extraction,
position-aware input ordering
- Human review workflows: Confidence calibration, stratified sampling, accuracy
segmentation by document type and field
- Information provenance: Claim-source mappings, temporal data handling, conflict
annotation, coverage gap reporting
Out-of-Scope Topics
The following related topics will NOT appear on the exam:
- Fine-tuning Claude models or training custom models
- Claude API authentication, billing, or account management
- Detailed implementation of specific programming languages or frameworks (beyond
what's needed for tool and schema configuration)
- Deploying or hosting MCP servers (infrastructure, networking, container orchestration)
- Claude's internal architecture, training process, or model weights
- Constitutional AI, RLHF, or safety training methodologies
- Embedding models or vector database implementation details
- Computer use (browser automation, desktop interaction)
- Vision/image analysis capabilities
- Streaming API implementation or server-sent events
- Rate limiting, quotas, or API pricing calculations
- OAuth, API key rotation, or authentication protocol details
Anthropic, PBC · Confidential Need to Know (NTK)
- Specific cloud provider configurations (AWS, GCP, Azure)
- Performance benchmarking or model comparison metrics
- Prompt caching implementation details (beyond knowing it exists)
- Token counting algorithms or tokenization specifics
Exam Preparation Recommendations
To prepare for this certification exam:
1. Build an agent with the Claude Agent SDK: Implement a complete agentic loop with tool
calling, error handling, and session management. Practice spawning subagents and
passing context between them.
2. Configure Claude Code for a real project: Set up CLAUDE.md with a configuration
hierarchy, create path-specific rules in .claude/rules/, build custom skills with
frontmatter options (context: fork, allowed-tools), and integrate at least one MCP
server.
3. Design and test MCP tools: Write tool descriptions that clearly differentiate similar tools.
Implement structured error responses with error categories and retryable flags. Test tool
selection reliability with ambiguous requests.
4. Build a structured data extraction pipeline: Use tool_use with JSON schemas, implement
validation-retry loops, design schemas with optional/nullable fields, and practice batch
processing with the Message Batches API.
5. Practice prompt engineering techniques: Write few-shot examples for ambiguous
scenarios. Define explicit review criteria to reduce false positives. Design multi-pass
review architectures for large code reviews.
6. Study context management patterns: Practice extracting structured facts from verbose
tool outputs, implementing scratchpad files for long sessions, and designing subagent
delegation to manage context limits.
7. Review escalation and human-in-the-loop patterns: Understand when to escalate (policy
gaps, customer requests, inability to progress) versus resolve autonomously. Practice
designing human review workflows with confidence-based routing.
Anthropic, PBC · Confidential Need to Know (NTK)
8. Complete the Practice Exam: Before sitting for the real exam, complete the practice exam
(the link will be provided separately). The practice exam covers the same scenarios and
question format as the real exam and shows explanations after each answer to help
reinforce your understanding.
Version 0.1 Last Updated: Feb 10 2025
Anthropic, PBC · Confidential Need to Know (NTK)
Skill Level: Beginner