Weight: 15% of total exam score — smallest weighting, but concepts here cascade into Domains 1, 2, and 4. Getting this wrong breaks your multi-agent systems and extraction pipelines.
Condensing conversation history compresses critical details into vague summaries:
BEFORE: "Customer wants a refund of $247.83 for order #8891 placed on March 3rd"
AFTER: "customer wants a refund for a recent order"

Fix: Extract transactional facts into a persistent "case facts" block. Include it in every prompt. Never summarise it.
```python
case_facts = {
    "customer_id": "C-4492",
    "order_id": "#8891",
    "order_date": "2025-03-03",
    "refund_amount": 247.83,
    "issue": "Defective product - screen flickering on laptop"
}
# Include case_facts in every prompt; never summarise it
```

Models process the beginning and end of long inputs reliably. Findings buried in the middle may be missed.
Fix:
```python
# Order lookup returns 40+ fields. You need 5.
# WRONG: append the full result to context
# RIGHT: trim to the relevant fields
trimmed = {
    "order_id": result["order_id"],
    "status": result["status"],
    "total": result["total"],
    "items": result["items"],
    "shipping_tracking": result["shipping_tracking"]
}
```

Subsequent API requests must include the complete conversation history. Omitting earlier messages breaks conversational coherence.
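The history requirement can be sketched as follows. This is a minimal illustration, not a real client: `call_model` is a hypothetical stand-in for the actual API call.

```python
# Sketch: every request carries the COMPLETE message history.
# `call_model` is a hypothetical placeholder for a real model API client.
def call_model(messages):
    # A real implementation would send `messages` to the model API here.
    return {"role": "assistant", "content": f"(reply to {len(messages)} messages)"}

messages = []

def send_turn(user_text):
    messages.append({"role": "user", "content": user_text})
    reply = call_model(messages)   # full history, not just the latest turn
    messages.append(reply)         # keep the assistant turn for the next call
    return reply

send_turn("Where is order #8891?")
send_turn("Can I get a refund?")   # history now holds all four messages
```

The key point is that `messages` only ever grows; trimming it for a later request is what breaks coherence.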
Modify agents to return structured data (key facts, citations, relevance scores) instead of verbose content and reasoning chains. Critical when downstream agents have limited context budgets.
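A sketch of what such a structured return value might look like; the field names (`key_facts`, `citations`, `relevance_scores`) are illustrative, not a prescribed schema.

```python
# Hypothetical structured output for a research subagent: facts, citations,
# and relevance scores instead of verbose prose and reasoning chains.
def summarise_findings(raw_findings):
    return {
        "key_facts": [f["claim"] for f in raw_findings],
        "citations": [{"url": f["source_url"], "excerpt": f["excerpt"]}
                      for f in raw_findings],
        "relevance_scores": [f.get("relevance", 0.0) for f in raw_findings],
    }
```

A downstream agent with a tight context budget can then consume `key_facts` alone and fetch `citations` only when needed.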
| Trigger | Action |
|---|---|
| Customer explicitly requests a human | Escalate immediately. Do NOT attempt to resolve first. |
| Policy exception or gap | Escalate (request falls outside documented policy) |
| Cannot make meaningful progress | Escalate after exhausting available approaches |
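The trigger table above is deterministic, so it can be encoded directly; this sketch assumes hypothetical boolean flags on an `event` dict.

```python
# Sketch: deterministic escalation triggers, checked in priority order.
# The event keys are assumptions for illustration, not a real schema.
def should_escalate(event):
    if event.get("human_requested"):       # explicit request: escalate immediately
        return True, "customer requested a human"
    if event.get("outside_policy"):        # policy exception or gap
        return True, "request falls outside documented policy"
    if event.get("approaches_exhausted"):  # no meaningful progress left
        return True, "available approaches exhausted"
    return False, None
```

Note what is absent: no sentiment check and no model-reported confidence, per the table of unreliable triggers that follows.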
| Trigger | Why It's Unreliable |
|---|---|
| Sentiment-based escalation | Frustration does not correlate with case complexity |
| Self-reported confidence scores | Model is often incorrectly confident on hard cases and uncertain on easy ones |
Multiple customers match a search query: ask a clarifying question to disambiguate rather than guessing which record is correct.
When propagating errors, include structured context: the failure type, the attempted query, and any partial results.
| Anti-Pattern | Problem |
|---|---|
| Silent suppression | Returns empty results marked as success. Prevents any recovery. |
| Workflow termination | Kills entire pipeline on single failure. Throws away partial results. |
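One way to avoid both anti-patterns is to return a structured error object instead of suppressing the failure or raising an exception that kills the pipeline. A minimal sketch, with assumed field names:

```python
# Sketch: structured error propagation. Field names are illustrative.
def tool_error(failure_type, query, partial_results=None):
    return {
        "status": "error",
        "failure_type": failure_type,        # e.g. "timeout", "access_denied"
        "attempted_query": query,
        "partial_results": partial_results or [],
    }
```

A coordinator receiving this can proceed with the partial results and annotate the final output with coverage gaps, rather than either silently succeeding or dying.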
| Scenario | Meaning | Retry? |
|---|---|---|
| Access failure | Tool could not reach data source | Consider retry |
| Valid empty result | Tool reached source, found no matches | No. This IS the answer. |
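The distinction in the table can be made explicit in the result handler; this sketch assumes a result dict with `status` and `matches` keys.

```python
# Sketch: only retry when the source was never reached.
def handle_result(result):
    if result["status"] == "success":
        # Success with zero matches IS the answer; retrying will not change it.
        return {"action": "accept", "matches": result.get("matches", [])}
    # An access failure means the source was unreachable; a retry may help.
    return {"action": "retry"}
```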
Synthesis output should note gaps:
"Section on geothermal energy is limited due to unavailable journal access"
Better than silently omitting the section.
Extended sessions: model starts referencing "typical patterns" instead of specific classes it discovered earlier. Context fills with verbose discovery output.
| Strategy | Purpose |
|---|---|
| Scratchpad files | Write key findings to a file, reference later |
| Subagent delegation | Spawn subagents for specific investigations; main agent keeps coordination |
| Summary injection | Summarise findings from one phase before starting next |
| /compact | Reduce context usage when filled with verbose output |
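The scratchpad strategy is simple enough to sketch directly: write findings to a known file as they are discovered, and read specific keys back later instead of holding everything in context. The file location here is an assumption.

```python
import json
import os
import tempfile

# Assumed scratchpad location for this sketch.
SCRATCHPAD = os.path.join(tempfile.gettempdir(), "findings.json")

def note(key, value):
    # Persist a finding outside the context window.
    try:
        with open(SCRATCHPAD) as f:
            data = json.load(f)
    except FileNotFoundError:
        data = {}
    data[key] = value
    with open(SCRATCHPAD, "w") as f:
        json.dump(data, f)

def recall(key):
    # Reference a specific finding later without replaying discovery output.
    with open(SCRATCHPAD) as f:
        return json.load(f)[key]
```

This keeps specifics (exact class names, IDs) retrievable even after the context has been compacted.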
Each agent exports structured state to a known file location (manifest). On resume, coordinator loads manifest and injects into agent prompts.
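A minimal sketch of the manifest pattern, assuming a known directory for exported state; the directory name and helper names are illustrative.

```python
import json
import os
import tempfile

# Assumed known location for exported agent state.
MANIFEST_DIR = os.path.join(tempfile.gettempdir(), "agent_state")

def export_state(agent_name, state):
    # Each agent writes its structured state to the shared location.
    os.makedirs(MANIFEST_DIR, exist_ok=True)
    with open(os.path.join(MANIFEST_DIR, agent_name + ".json"), "w") as f:
        json.dump(state, f)

def load_manifest():
    # On resume, the coordinator loads every agent's exported state.
    manifest = {}
    for fname in os.listdir(MANIFEST_DIR):
        with open(os.path.join(MANIFEST_DIR, fname)) as f:
            manifest[fname[:-5]] = json.load(f)  # strip ".json"
    return manifest

def inject(prompt, manifest):
    # Injected verbatim into each agent prompt on resume.
    return prompt + "\n\nPrior state:\n" + json.dumps(manifest)
```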
97% overall accuracy can hide 40% error rates on a specific document type.
Always validate accuracy by document type AND field segment before automating.
Sample high-confidence extractions for ongoing verification. Detects novel error patterns that would otherwise slip through.
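Per-segment validation is a straightforward groupby; this sketch assumes evaluation records carrying a document type, field name, and correctness flag.

```python
from collections import defaultdict

# Sketch: accuracy broken down by (document type, field) instead of overall.
def accuracy_by_segment(records):
    # records: dicts with "doc_type", "field", "correct" (bool) — assumed shape
    totals = defaultdict(lambda: [0, 0])
    for r in records:
        key = (r["doc_type"], r["field"])
        totals[key][0] += int(r["correct"])
        totals[key][1] += 1
    return {k: hits / n for k, (hits, n) in totals.items()}
```

A strong overall number can coexist with an unacceptable segment, which is exactly what this breakdown surfaces.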
```json
{
  "claim": "Global solar capacity increased 30% in 2025",
  "source_url": "https://example.com/iea-report",
  "document_name": "IEA Solar Capacity Report 2025",
  "relevant_excerpt": "Total installed capacity reached 2.4 TW...",
  "publication_date": "2026-01-15"
}
```

Downstream agents preserve and merge these mappings through synthesis. Without this, attribution dies during summarisation.
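One way to preserve attribution through a merge: key findings by claim and accumulate sources, rather than flattening to prose. A sketch over the field names shown above:

```python
# Sketch: merge subagent findings while keeping claim-to-source mappings.
def merge_findings(finding_lists):
    merged = {}
    for findings in finding_lists:
        for f in findings:
            entry = merged.setdefault(f["claim"], {"claim": f["claim"], "sources": []})
            entry["sources"].append({
                "url": f["source_url"],
                "document": f["document_name"],
                "date": f["publication_date"],
            })
    return list(merged.values())
```

The same claim reported by two subagents ends up once, with both sources attached, so synthesis never has to reconstruct attribution.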
Two credible sources report different statistics:
Require publication/data collection dates in structured outputs. Different dates explain different numbers — these are not contradictions, they are temporal differences.
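A sketch of how synthesis might classify and present such a pair; the input shape (`value`, `source`, `date`) is an assumption for illustration.

```python
# Sketch: keep BOTH values with attribution, and label the discrepancy.
def present_both(metric, a, b):
    # Different dates: a temporal difference. Same date: a contested finding.
    kind = "temporal difference" if a["date"] != b["date"] else "contested"
    return {"metric": metric, "kind": kind, "values": [a, b]}
```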
| Content Type | Format |
|---|---|
| Financial data | Tables |
| News | Prose |
| Technical findings | Structured lists |
Do not flatten everything into one uniform format.
Q1. A customer support agent refers to a "$200 refund" when the customer originally requested $247.83. What went wrong?
Q2. A customer says "I want to speak to a manager." The agent responds by offering to investigate the issue first. Is this correct?
Q3. A tool returns an empty result set with status: "success". The agent retries three times. What is the problem?
Answer (Q3): `status: "success"` + empty results = valid empty result. No retry needed.

Q4. After a long exploration session, the agent references "typical patterns" instead of specific classes it found earlier. What is the issue?
Q5. A system achieves 96% extraction accuracy overall, but only 55% on handwritten forms. Should it be deployed for handwritten forms?
Q6. A synthesis report presents a single statistic when two credible sources report different values. What should happen instead?
Design and Debug a Multi-Agent Research Pipeline
- Build a coordinator agent that delegates to at least two subagents (e.g., web search and document analysis). Ensure the coordinator's `allowedTools` includes `"Task"` and that each subagent receives its research findings directly in its prompt rather than relying on automatic context inheritance.
- Implement parallel subagent execution by having the coordinator emit multiple `Task` tool calls in a single response. Measure the latency improvement compared to sequential execution.
- Design structured output for subagents that separates content from metadata: each finding should include a claim, evidence excerpt, source URL/document name, and publication date. Verify that the synthesis subagent preserves source attribution when combining findings.
- Implement error propagation: simulate a subagent timeout and verify the coordinator receives structured error context (failure type, attempted query, partial results). Test that the coordinator can proceed with partial results and annotate the final output with coverage gaps.
- Test with conflicting source data (e.g., two credible sources with different statistics) and verify the synthesis output preserves both values with source attribution rather than arbitrarily selecting one, and structures the report to distinguish well-established from contested findings.
Domains reinforced: Domain 1 (Agentic Architecture), Domain 2 (Tool Design & MCP), Domain 5 (Context Management & Reliability)