Warning: This is where the exam gets sneaky. Wrong answers sound like good engineering. Right answers require knowing which technique applies to which specific problem.
Specific categorical criteria obliterate vague confidence-based instructions.
```python
# WRONG: Vague
"Be conservative. Only report high-confidence findings."

# RIGHT: Specific categorical criteria
"""Flag comments only when:
- Claimed behaviour contradicts actual code behaviour
- Security vulnerabilities (SQL injection, XSS, auth bypass)
- Bugs that would cause runtime errors
Skip:
- Minor style preferences
- Local patterns that differ from team convention but are functional
"""
```

High false-positive rates in one category destroy trust in ALL categories.
Fix: Temporarily disable high false-positive categories while improving prompts for those categories. Restores trust while you iterate.
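The category kill-switch can be sketched as a simple filter. A minimal sketch, assuming a hypothetical finding shape and category names (none of these come from a specific tool):

```python
# Sketch: temporarily disable noisy categories while their prompts are
# improved. Category names and the finding dict shape are illustrative.
ENABLED_CATEGORIES = {
    "contract_violation": True,
    "security": True,
    "runtime_bug": True,
    "style": False,  # disabled: high false-positive rate under review
}

def filter_findings(findings):
    """Drop findings whose category is currently disabled."""
    return [f for f in findings if ENABLED_CATEGORIES.get(f["category"], False)]

findings = [
    {"category": "security", "message": "SQL injection risk"},
    {"category": "style", "message": "prefer snake_case"},
]
print(filter_findings(findings))  # only the security finding survives
```

Re-enabling a category is a one-line config change, which makes the trust/coverage trade-off explicit and reversible.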
Define explicit severity criteria with concrete code examples for each level — not prose descriptions.
```python
severity_examples = {
    "critical": "# Unvalidated user input in SQL query\n"
                "cursor.execute(f'SELECT * FROM users WHERE id = {user_input}')",
    "major": "# Missing null check before method call\n"
             "user.profile.get_preferences()  # user.profile may be None",
    "minor": "# Variable name could be more descriptive\n"
             "x = calculate_total(items)",
}
```

Few-shot examples are the most effective technique for consistency. Not more instructions. Not confidence thresholds.
<example>
Input: "The contract expires on the 15th"
Output: {"expiry_date": "unclear - no month/year specified", "confidence": "low"}
Reasoning: Only a day number is given. Do not fabricate month or year.
</example>
<example>
Input: "Agreement valid through December 2025"
Output: {"expiry_date": "2025-12-31", "confidence": "high"}
Reasoning: Month and year specified. Use last day of month as default.
</example>

Few-shot examples showing correct handling of varied document structures (inline citations vs bibliographies, narrative vs structured tables) dramatically improve extraction quality.
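Assembling few-shot pairs into the request can be sketched as below, assuming the common user/assistant alternation; the example texts mirror the two examples above:

```python
import json

# Sketch: turn few-shot examples into user/assistant message pairs that
# precede the real document in the messages array.
FEW_SHOTS = [
    ("The contract expires on the 15th",
     {"expiry_date": "unclear - no month/year specified", "confidence": "low"}),
    ("Agreement valid through December 2025",
     {"expiry_date": "2025-12-31", "confidence": "high"}),
]

def build_messages(document):
    messages = []
    for text, output in FEW_SHOTS:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": json.dumps(output)})
    messages.append({"role": "user", "content": document})
    return messages

msgs = build_messages("Lease runs through 2026-03-31")
# five messages: two few-shot pairs followed by the real document
```

Adding one pair per document structure (table, narrative, form) is usually enough to stabilise output format across structural variety.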
| Method | Syntax Errors | Semantic Errors |
|---|---|---|
| tool_use with JSON schema | Eliminated | Still possible |
| Prompt-based JSON | Possible | Still possible |
| Setting | Behaviour | Use Case |
|---|---|---|
| `"auto"` | May return text instead of tool call | Default |
| `"any"` | MUST call a tool, chooses which | Guaranteed structured output with unknown document types |
| `{"type": "tool", "name": "..."}` | MUST call specific tool | Force mandatory first steps |
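The three settings can be sketched as request fragments. The dict shapes follow the Messages API `tool_choice` parameter; the tool name `extract_record` is a hypothetical placeholder:

```python
# Sketch of the three tool_choice settings as request fragments.
def tool_choice_for(mode, tool_name=None):
    if mode == "auto":
        return {"type": "auto"}  # model may answer in plain text instead
    if mode == "any":
        return {"type": "any"}   # model must call some tool, its choice which
    if mode == "tool":
        # model must call this specific tool (e.g. a mandatory first step)
        return {"type": "tool", "name": tool_name}
    raise ValueError(f"unknown mode: {mode}")

print(tool_choice_for("tool", "extract_record"))
```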
```json
{
  "type": "object",
  "properties": {
    "company_name": {"type": "string"},
    "revenue": {
      "type": ["number", "null"],
      "description": "Annual revenue in USD. null if not stated in document."
    },
    "fiscal_year": {
      "type": ["string", "null"],
      "description": "Fiscal year for the revenue figure. null if not stated."
    },
    "industry": {
      "type": "string",
      "enum": ["technology", "healthcare", "finance", "manufacturing", "other"]
    },
    "industry_detail": {
      "type": ["string", "null"],
      "description": "If industry is 'other', specify here."
    },
    "data_quality": {
      "type": "string",
      "enum": ["complete", "partial", "unclear"]
    }
  },
  "required": ["company_name", "data_quality"]
}
```

Key patterns:
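A minimal check of the schema's two key rules, sketched without a full JSON Schema validator (field names mirror the schema above):

```python
# Minimal sketch: required fields must be present, and information absent
# from the source must come back as null rather than being fabricated.
REQUIRED = {"company_name", "data_quality"}
NULLABLE = {"revenue", "fiscal_year", "industry_detail"}

def validate_extraction(result):
    errors = [f"missing required field: {f}" for f in REQUIRED if f not in result]
    for field in NULLABLE:
        if field not in result:
            errors.append(f"nullable field omitted (should be null): {field}")
    return errors

good = {"company_name": "Acme", "data_quality": "partial",
        "revenue": None, "fiscal_year": None, "industry_detail": None}
print(validate_extraction(good))  # []
```

In production a library such as `jsonschema` or Pydantic would do this check; the point is that `null` for absent data passes while omission or fabrication is caught.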
- `"unclear"` enum value for ambiguous cases
- `"other"` + freeform detail string for extensible categorisation

```python
# Validation-retry loop
for attempt in range(3):
    result = extract(document)
    errors = validate(result)
    if not errors:
        break
    # Send back: original doc + failed extraction + specific error
    messages.append({
        "role": "user",
        "content": f"Extraction had errors: {errors}\n"
                   f"Original document: {document}\n"
                   f"Your extraction: {json.dumps(result)}\n"
                   f"Please fix the specific errors and re-extract."
    })
```

| Scenario | Retry Effective? | Why |
|---|---|---|
| Format mismatch | Yes | Model can fix structural errors |
| Misplaced values | Yes | Model can correct field placement |
| Information absent from source | No | No amount of retrying will find what isn't there |
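The distinction in the table can be encoded as a small triage helper before burning retries (error labels are illustrative, not from a specific validator):

```python
# Sketch: decide whether a validation error is worth a retry.
# A real pipeline would derive the label from the validator's output.
RETRYABLE = {"format_mismatch", "misplaced_value", "wrong_type"}

def should_retry(error_kind):
    """Retry only when the model can plausibly fix the error."""
    return error_kind in RETRYABLE

assert should_retry("format_mismatch")
assert not should_retry("information_absent")  # retries cannot invent data
```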
The exam presents both scenarios. You must identify which is fixable.
Add a field to each structured finding that records which code construct triggered it. Enables analysis of dismissal patterns → improves prompts over time.
```json
{
  "stated_total": 1250.00,
  "calculated_total": 1237.50,
  "conflict_detected": true,
  "line_items": [...]
}
```

Extract `calculated_total` alongside `stated_total` to flag discrepancies.
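The cross-check can be sketched as a post-extraction step; field names mirror the JSON above, and the line-item shape is an assumption:

```python
# Sketch: recompute the total from extracted line items and flag any
# mismatch with the stated total.
def check_totals(extraction, tolerance=0.01):
    calculated = round(sum(item["amount"] for item in extraction["line_items"]), 2)
    extraction["calculated_total"] = calculated
    extraction["conflict_detected"] = (
        abs(calculated - extraction["stated_total"]) > tolerance
    )
    return extraction

doc = {"stated_total": 1250.00,
       "line_items": [{"amount": 1000.00}, {"amount": 237.50}]}
print(check_totals(doc)["conflict_detected"])  # True: 1237.50 != 1250.00
```

Doing the arithmetic in code rather than trusting the model's sum means the conflict flag is deterministic.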
| Property | Value |
|---|---|
| Cost savings | 50% |
| Processing window | Up to 24 hours |
| Latency SLA | None guaranteed |
| Multi-turn tool calling | NOT supported within a single request |
| Correlation | custom_id per request/response pair |
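Correlating results back to requests by `custom_id` can be sketched as follows; the result shapes here are illustrative, not the exact Batches API response format:

```python
# Sketch: match batch results to their requests by custom_id and collect
# failures (errored or missing results) for resubmission.
def triage_batch(requests, results):
    by_id = {r["custom_id"]: r for r in results}
    succeeded, to_resubmit = [], []
    for req in requests:
        res = by_id.get(req["custom_id"])
        if res and res["status"] == "succeeded":
            succeeded.append(res)
        else:
            to_resubmit.append(req)  # missing or errored: retry later
    return succeeded, to_resubmit

reqs = [{"custom_id": "doc-1"}, {"custom_id": "doc-2"}]
ress = [{"custom_id": "doc-1", "status": "succeeded"}]
ok, retry = triage_batch(reqs, ress)
print(len(ok), len(retry))  # 1 1
```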
| Workflow Type | API |
|---|---|
| Blocking (pre-merge checks, developer waiting) | Synchronous |
| Latency-tolerant (overnight reports, weekly audits) | Batch |
Exam trap: A manager proposes using batch for everything. Correct answer: keep blocking workflows synchronous.
A model reviewing its own output in the same session retains reasoning context. It is less likely to question its own decisions. An independent instance without prior context catches more subtle issues.
Pass 1: Per-file local analysis → consistent depth per file
Pass 2: Cross-file integration → catches data flow issues

Prevents attention dilution and contradictory findings.
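The two-pass orchestration can be sketched with a stub in place of the model call, so only the control flow is shown:

```python
# Sketch of two-pass review orchestration. `review` stands in for a model
# call; here it is a stub so the control flow is runnable.
def review(prompt):
    return [f"finding for: {prompt[:30]}"]

def two_pass_review(files):
    # Pass 1: each file analysed in isolation, for consistent depth
    per_file = {path: review(f"review file {path}: {src}")
                for path, src in files.items()}
    # Pass 2: one cross-file pass over pass-1 summaries, for data flow
    summary = "; ".join(f"{p}: {len(f)} findings" for p, f in per_file.items())
    cross_file = review(f"check integration across files: {summary}")
    return per_file, cross_file

files = {"a.py": "def f(): ...", "b.py": "from a import f"}
per_file, cross = two_pass_review(files)
print(len(per_file))  # 2
```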
Q1. A code review tool flags style preferences as critical bugs. Developers lose trust and ignore all findings. What is the fix?
Q2. An extraction pipeline produces inconsistent output formats despite detailed prose instructions. What should be tried first?
Q3. A tool_use schema has all fields marked as required. Documents sometimes lack certain information. What happens?
Q4. A validation-retry loop keeps failing because the source document genuinely does not contain a required field. How many retries will fix this?
Q5. A manager proposes using the Batch API for pre-merge code review checks that developers wait on. Is this appropriate?
Q6. A code review in a single session generates code and then reviews it, finding no issues. A separate review session finds 5 bugs. Why?
Q7. An extraction task encounters documents in three different formats: tables, narrative text, and forms. Output quality is inconsistent. What helps most?
Q8. A code review tool's overall accuracy is 97%, but accuracy on configuration files is only 60%. Should the tool be deployed for configuration files?
Build a Structured Data Extraction Pipeline
- Define an extraction tool with a JSON schema containing required and optional fields, an enum with an `"other"` + detail string pattern, and nullable fields for information that may not exist in source documents. Process documents where some fields are absent and verify the model returns `null` rather than fabricating values.
- Implement a validation-retry loop: when Pydantic or JSON schema validation fails, send a follow-up request including the document, the failed extraction, and the specific validation error. Track which errors are resolvable via retry (format mismatches) versus which are not (information absent from source).
- Add few-shot examples demonstrating extraction from documents with varied formats (e.g., inline citations vs bibliographies, narrative descriptions vs structured tables) and verify improved handling of structural variety.
- Design a batch processing strategy: submit a batch of 100 documents using the Message Batches API, handle failures by `custom_id`, resubmit failed documents with modifications (e.g., chunking oversized documents), and calculate total processing time relative to SLA constraints.
- Implement a human review routing strategy: have the model output field-level confidence scores, route low-confidence extractions to human review, and analyse accuracy by document type and field to verify consistent performance.
Domains reinforced: Domain 4 (Prompt Engineering & Structured Output), Domain 5 (Context Management & Reliability)