A deep research engine that tackles complex questions through iterative planning, multi-source search, chain-of-thought and tree-of-thought reasoning, self-reflection, and automatic report generation. Powered by Claude's extended thinking mode, it demonstrates how inference-time compute scaling produces dramatically better answers on hard problems.
| Concept | Description |
|---|---|
| Reasoning LLMs | How models like Claude with extended thinking and OpenAI o1/o3 allocate extra compute at inference time |
| Chain-of-Thought (CoT) | Step-by-step reasoning that makes the model "show its work" |
| Tree-of-Thought (ToT) | Exploring multiple reasoning branches in parallel, evaluating, and pruning |
| Self-Reflection | The model critiques and revises its own answers when confidence is low |
| Inference-Time Scaling | Why spending more tokens thinking leads to better answers on complex tasks |
| Research Planning | Decomposing complex questions into dependency-ordered sub-questions |
| Iterative Deepening | Verifying findings, identifying gaps, and refining the research plan |
| Report Generation | Structured output with executive summaries, sections, citations, and confidence assessments |
```text
  ┌───────────────────────────┐
  │     Research Question     │
  │ "What are the long-term   │
  │  economic impacts of AI   │
  │  on the labor market?"    │
  └─────────────┬─────────────┘
                │
      ┌─────────▼─────────┐
      │     PLAN NODE     │
      │                   │
      │ Decompose into    │
      │ 3-7 sub-questions │
      │ with dependency   │
      │ graph + priority  │
      └─────────┬─────────┘
                │
┌───────────────▼──────────────────────────────────────────────┐
│ ITERATION LOOP (depth 1..N)                                  │
│                                                              │
│  ┌──────────────────────────┐                                │
│  │       SEARCH NODE        │◀─────────────────────────┐     │
│  │                          │                          │     │
│  │ Parallel web searches    │                          │     │
│  │ via Tavily for each      │                          │     │
│  │ ready sub-question       │                          │     │
│  └────────────┬─────────────┘                          │     │
│               │                                        │     │
│  ┌────────────▼─────────────┐                          │     │
│  │       ANALYSE NODE       │                          │     │
│  │                          │                          │     │
│  │ Extract key findings     │                          │     │
│  │ from sources per         │                          │     │
│  │ sub-question             │                          │     │
│  └────────────┬─────────────┘                          │     │
│               │                                        │     │
│  ┌────────────▼─────────────┐                          │     │
│  │        REASON NODE       │                          │     │
│  │                          │                          │     │
│  │ CoT + Extended Thinking  │                          │     │
│  │ Synthesize findings per  │                          │     │
│  │ sub-question             │                          │     │
│  │ Self-reflect if low conf │                          │     │
│  └────────────┬─────────────┘                          │     │
│               │                                        │     │
│  ┌────────────▼─────────────┐                          │     │
│  │        VERIFY NODE       │                          │     │
│  │                          │                          │     │
│  │ Cross-check findings     │                          │     │
│  │ Identify contradictions  │                          │     │
│  │ Find knowledge gaps      │                          │     │
│  └────────────┬─────────────┘                          │     │
│               │                                        │     │
│      ┌────────▼────────┐      ┌──────────────────────┐ │     │
│      │ Gaps found AND  │─YES─▶│     ITERATE NODE     │ │     │
│      │ depth <         │      │                      │ │     │
│      │ max_depth?      │      │ Refine plan:         │─┘     │
│      └────────┬────────┘      │ • Skip resolved      │       │
│           NO  │               │   questions          │       │
│               │               │ • Add new            │       │
│               │               │   sub-questions      │       │
│               │               │ • Re-prioritize      │       │
│               │               └──────────────────────┘       │
└───────────────┼──────────────────────────────────────────────┘
                │
       ┌────────▼─────────┐
       │   REPORT NODE    │
       │                  │
       │ Extended thinking│
       │ report generation│
       │                  │
       │ Output:          │
       │ • Title          │
       │ • Exec Summary   │
       │ • Key Findings   │
       │ • Sections       │
       │ • Citations      │
       │ • Confidence     │
       │ • Further        │
       │   Research       │
       └────────┬─────────┘
                │
                ▼
               END
```
```python
def should_iterate(state: ResearchState) -> Literal["iterate", "report"]:
    """Decide whether to research deeper or generate the report."""
    gaps = state.get("gaps", [])
    current_depth = state.get("current_depth", 0)
    max_depth = state.get("max_depth", 5)
    if gaps and current_depth < max_depth:
        return "iterate"  # Go deeper: refine plan → search again
    return "report"       # Satisfied: generate final report
```

```bash
docker build -f Dockerfile -t deep-research .
docker run -p 8000:8000 \
  -e DEEP_RESEARCH_ANTHROPIC_API_KEY=your-key \
  -e DEEP_RESEARCH_TAVILY_API_KEY=your-tavily-key \
  deep-research
```

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
export DEEP_RESEARCH_ANTHROPIC_API_KEY=your-key
export DEEP_RESEARCH_TAVILY_API_KEY=your-tavily-key
# Already in project root
python -m deep_research.main
```

The API will be available at http://localhost:8000, with interactive docs at http://localhost:8000/docs.

```bash
curl http://localhost:8000/health
```

```bash
# Start research (returns immediately with a task ID)
curl -X POST http://localhost:8000/api/v1/research \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the long-term economic impacts of generative AI on the global labor market?",
    "max_depth": 5
  }'

# Poll for results
curl http://localhost:8000/api/v1/research/{task_id}
```

```bash
curl -N -X POST http://localhost:8000/api/v1/research/stream \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Compare the effectiveness of mRNA vs protein subunit COVID vaccines",
    "max_depth": 3
  }'
```

Events emitted: `planning`, `searching`, `analysing`, `reasoning`, `verifying`, `iterating`, `reporting`, `completed`.
Each event includes task_id, status, message, progress_pct, and metadata.
```bash
curl -X POST http://localhost:8000/api/v1/reason \
  -H "Content-Type: application/json" \
  -d '{
    "query": "A farmer has 17 sheep. All but 9 die. How many sheep does the farmer have left?",
    "strategy": "chain_of_thought",
    "use_extended_thinking": true
  }'
```

Available strategies: `direct`, `chain_of_thought`, `tree_of_thought`.
```bash
curl -X POST http://localhost:8000/api/v1/compare-reasoning \
  -H "Content-Type: application/json" \
  -d '{
    "query": "If a ball is placed on top of a hill and rolls down, will it end up at the bottom? Consider the shape of the terrain, obstacles, and wind.",
    "strategies": ["direct", "chain_of_thought", "tree_of_thought"]
  }'
```

This returns a side-by-side comparison of each strategy's answer, confidence, reasoning steps, and token usage -- letting you see firsthand how additional inference-time compute improves results.
Chain-of-Thought (CoT) instructs the model to reason step by step rather than jumping to an answer. This is implemented in `reasoning.py` with a structured system prompt:

```text
For EACH step you produce, output it in this format:

[Step N]
Description: <one-line summary>
Reasoning: <detailed reasoning>
Confidence: <0.0 to 1.0>
```
Input: "A farmer has 17 sheep. All but 9 die. How many are left?"
```text
WITHOUT CoT (direct):        WITH CoT:
┌─────────────────────┐      ┌─────────────────────────────────┐
│ Answer: 8           │      │ [Step 1]                        │
│ Confidence: 0.50    │      │ Description: Parse the question │
│                     │      │ Reasoning: "All but 9 die"      │
│ (WRONG)             │      │   means 9 survive.              │
│                     │      │ Confidence: 0.95                │
│                     │      │                                 │
│                     │      │ [Step 2]                        │
│                     │      │ Description: Calculate answer   │
│                     │      │ Reasoning: 9 sheep remain       │
│                     │      │   alive regardless of the 17.   │
│                     │      │ Confidence: 0.95                │
│                     │      │                                 │
│                     │      │ [Final Answer]                  │
│                     │      │ 9 sheep                         │
│                     │      │ Overall Confidence: 0.95        │
│                     │      │                                 │
│                     │      │ (CORRECT)                       │
└─────────────────────┘      └─────────────────────────────────┘
```
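One reason to force the `[Step N]` output format is that it is machine-parseable, so per-step confidences can drive later decisions (such as triggering self-reflection). A sketch of such a parser; the regex and field names are illustrative, not the project's actual parsing code:

```python
import re

# Matches one "[Step N] / Description / Reasoning / Confidence" block
STEP_RE = re.compile(
    r"\[Step (?P<n>\d+)\]\s*"
    r"Description:\s*(?P<desc>.+?)\s*"
    r"Reasoning:\s*(?P<reason>.+?)\s*"
    r"Confidence:\s*(?P<conf>[01](?:\.\d+)?)",
    re.DOTALL,
)

def parse_steps(text: str) -> list[dict]:
    """Extract structured reasoning steps from a CoT response."""
    return [
        {
            "n": int(m["n"]),
            "description": m["desc"].strip(),
            "reasoning": m["reason"].strip(),
            "confidence": float(m["conf"]),
        }
        for m in STEP_RE.finditer(text)
    ]
```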
Why does CoT work? By generating intermediate steps, the model effectively allocates more compute to the problem. Each step conditions the next, reducing the chance of errors propagating silently. Research shows CoT improves accuracy on math, logic, and multi-step reasoning by 15-40%.
Self-Reflection Loop: When the CoT confidence falls below the confidence_threshold (default: 0.7), the engine automatically triggers a self-reflection round:
```text
CoT answer (confidence: 0.55) ──▶ Self-Reflection
                                        │
                                        ▼
        Critique: "Step 2 assumes X without justification..."
        Revised answer: "..."
        New confidence: 0.82
```
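The control flow of this loop is simple. Below is a minimal sketch with the reasoning and reflection steps abstracted as callables; `ReasoningResult` and the function names here are stand-ins for the project's own types, not its actual code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReasoningResult:
    answer: str
    confidence: float

def reason_with_reflection(
    reason: Callable[[str], ReasoningResult],
    reflect: Callable[[str, ReasoningResult], ReasoningResult],
    query: str,
    confidence_threshold: float = 0.7,
) -> ReasoningResult:
    """Run CoT reasoning; trigger one self-reflection pass if confidence is low."""
    result = reason(query)
    if result.confidence < confidence_threshold:
        # Low confidence: critique the answer and produce a revised one
        result = reflect(query, result)
    return result
```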
Where CoT follows a single reasoning path, Tree-of-Thought (ToT) explores multiple paths simultaneously and picks the best one:
```text
                        Root Question
                              │
             ┌────────────────┼────────────────┐
             │                │                │
        Approach 1       Approach 2       Approach 3
       (score: 0.4)     (score: 0.85)    (score: 0.6)
        [PRUNED]              │           [PRUNED]
                              │
             ┌────────────────┼────────────────┐
             │                │                │
        Sub-idea 1       Sub-idea 2       Sub-idea 3
       (score: 0.7)     (score: 0.9)     (score: 0.5)
             │                │            [PRUNED]
            ...               │
                     ┌────────┼────────┐
                     │        │        │
                    ...      ...      ...
                              │
                              ▼
                          Best Leaf
                         (score: 0.9)
                              │
                   ┌──────────▼──────────┐
                   │     SYNTHESIZE      │
                   │  Final answer from  │
                   │    the best path    │
                   └─────────────────────┘
```
The ToT algorithm (from `reasoning.py`):

- **Generate** -- ask the LLM to produce `branching_factor` (default: 3) distinct approaches
- **Evaluate** -- a separate LLM call scores each approach (0.0-1.0) for logical soundness, relevance, and promise
- **Prune** -- keep the top half, discard the rest
- **Recurse** -- expand surviving branches up to `tot_max_depth` (default: 3)
- **Select** -- DFS to find the highest-scored leaf node
- **Synthesize** -- generate the final answer using the best reasoning path
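The generate/evaluate/prune/recurse loop above reduces to a small beam-style search. The sketch below is an illustrative reduction under assumed interfaces, not the project's implementation: `Thought`, `generate`, and `evaluate` stand in for the LLM calls in `reasoning.py`:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Thought:
    text: str
    score: float = 0.0
    children: list = field(default_factory=list)

def tree_of_thought(
    question: str,
    generate: Callable[[str, int], list[str]],  # LLM: propose k approaches
    evaluate: Callable[[str], float],           # LLM: score an approach 0.0-1.0
    branching_factor: int = 3,
    max_depth: int = 3,
) -> Thought:
    """Expand each frontier node, score candidates, prune to the top half,
    recurse to max_depth, and return the highest-scored leaf."""
    root = Thought(question, score=1.0)
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            candidates = [Thought(t) for t in generate(node.text, branching_factor)]
            for c in candidates:
                c.score = evaluate(c.text)
            candidates.sort(key=lambda c: c.score, reverse=True)
            kept = candidates[: max(1, len(candidates) // 2)]  # prune bottom half
            node.children = kept
            next_frontier.extend(kept)
        frontier = next_frontier
    return max(frontier, key=lambda n: n.score)  # best leaf
```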
When to use ToT vs CoT:
| Factor | Chain-of-Thought | Tree-of-Thought |
|---|---|---|
| Best for | Problems with a clear solution path | Ambiguous problems with multiple valid approaches |
| Token cost | ~1x | ~6-15x (multiple branches) |
| Latency | Low (1 LLM call) | High (many sequential calls) |
| Accuracy gain | +15-40% over direct | +5-15% over CoT on hard problems |
Traditional ML scaling improves models by training longer on more data (training-time scaling). Inference-time scaling takes a different approach: give the model more compute at inference time.
| Scaling Dimension | What Changes | Examples |
|---|---|---|
| Training-time scaling | Model size, dataset, epochs | GPT-3 → GPT-4 |
| Inference-time scaling | Thinking tokens, search depth | Standard → Extended Thinking |

```text
               Accuracy vs Inference Compute

 Accuracy
    │                              ╱── ToT
    │                            ╱
    │                         ╱──── CoT + reflection
    │                       ╱
    │                    ╱──── CoT
    │                  ╱
    │              ╱──────── Direct
    │            ╱
    │          ╱
    └──────────────────────────────────────▶
                  Tokens Used
```
How this project implements inference-time scaling:
- **Extended Thinking Mode**: Claude's extended thinking API allocates a dedicated thinking budget (default: 10,000 tokens) before generating the visible answer:

```python
response = await client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16_384,
    temperature=1,  # Required for extended thinking
    thinking={
        "type": "enabled",
        "budget_tokens": 10_000,  # Up to 128K for hard problems
    },
    messages=[...],
)

# Response contains both thinking and visible content
for block in response.content:
    if block.type == "thinking":
        print(f"Internal reasoning: {block.thinking}")
    elif block.type == "text":
        print(f"Final answer: {block.text}")
```

- **Adaptive Compute Allocation**: The `DeepResearcher` automatically gives harder sub-questions (those with lower initial confidence) a larger thinking budget.
- **Strategy Comparison**: The `/api/v1/compare-reasoning` endpoint shows how `direct` (low compute), `chain_of_thought` (medium), and `tree_of_thought` (high) trade off cost against quality on the same question.
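One way adaptive allocation could map confidence to a budget is a simple linear interpolation between the default and the maximum. This helper is hypothetical, shown only to make the idea concrete; the actual mapping in `DeepResearcher` may differ:

```python
def thinking_budget(confidence: float, base: int = 10_000, max_budget: int = 128_000) -> int:
    """Scale the thinking budget inversely with initial confidence
    (hypothetical helper: confidence 1.0 -> base, confidence 0.0 -> max_budget)."""
    scale = 1.0 - max(0.0, min(1.0, confidence))  # clamp to [0, 1]
    return int(base + scale * (max_budget - base))
```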
The full pipeline orchestrates all the above into a 7-stage iterative workflow:
Stage 1: PLAN
Input: "What are the long-term economic impacts of AI on labor?"
Process: LLM decomposes into 3-7 sub-questions with dependencies
Output: ResearchPlan with prioritized SubQuestion graph
Example sub-questions:
SQ-1: Historical precedents of technology displacing labor (Priority: 1)
SQ-2: Current AI adoption rates across industries (Priority: 1)
SQ-3: Economic models for AI-driven productivity gains (Priority: 2, depends on SQ-2)
SQ-4: Policy responses and retraining programs (Priority: 3, depends on SQ-1, SQ-3)
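The dependency ordering above determines which sub-questions are "ready" at each search pass: a sub-question can run once everything it depends on has completed. A sketch of that readiness check, with `SubQuestion` and its field names assumed for illustration (the project's actual model may differ):

```python
from dataclasses import dataclass, field

@dataclass
class SubQuestion:
    id: str
    question: str
    depends_on: list[str] = field(default_factory=list)
    completed: bool = False

def ready_subquestions(plan: list[SubQuestion]) -> list[SubQuestion]:
    """Return sub-questions whose dependencies have all been completed."""
    done = {sq.id for sq in plan if sq.completed}
    return [
        sq for sq in plan
        if not sq.completed and set(sq.depends_on) <= done
    ]
```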
Stage 2: SEARCH
Input: Ready sub-questions (all dependencies met)
Process: Parallel Tavily searches with semaphore-limited concurrency
Output: SearchResult list linked to sub-questions
Stage 3: ANALYSE
Input: Search results per sub-question
Process: LLM extracts key findings from raw source content
Output: ResearchFinding list (claim + evidence + sources)
Stage 4: REASON
Input: Findings per sub-question
Process: CoT + extended thinking synthesis
Output: ReasoningResult with confidence score
Side effect: Marks sub-question as completed in the plan
Stage 5: VERIFY
Input: All findings so far
Process: LLM cross-checks for contradictions and gaps
Output: Gap list (empty = all verified)
Stage 6: ITERATE (conditional)
Condition: Gaps exist AND current_depth < max_depth
Process: Refine the plan -- add new sub-questions, skip resolved ones
Then: Loop back to Stage 2
Stage 7: REPORT
Input: All findings, reasoning results, and sources
Process: Extended thinking report generation
Output: ResearchReport with sections, citations, confidence assessment
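As a rough mental model of the seven stages (the real orchestration lives in a LangGraph `StateGraph` in `workflow.py`, not a hand-rolled loop), the pipeline reduces to:

```python
def run_pipeline(state: dict, nodes: dict, should_iterate) -> dict:
    """Minimal driver mirroring the 7-stage iterative workflow (sketch only).

    `nodes` maps stage name -> callable(state) -> state;
    `should_iterate` returns "iterate" or "report"."""
    state = nodes["plan"](state)                      # Stage 1
    while True:
        for stage in ("search", "analyse", "reason", "verify"):  # Stages 2-5
            state = nodes[stage](state)
        if should_iterate(state) == "iterate":        # Stage 6 (conditional)
            state = nodes["iterate"](state)
            state["current_depth"] = state.get("current_depth", 0) + 1
        else:
            break
    return nodes["report"](state)                     # Stage 7
```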
Example output structure:
```markdown
# Research Report: Economic Impacts of AI on the Global Labor Market

## Executive Summary
AI is projected to automate 25-40% of current work tasks by 2035...

## Key Findings
1. Historical technology transitions created more jobs than they displaced...
2. Current AI adoption is concentrated in knowledge work sectors...

## Section: Historical Precedents
[detailed analysis with inline citations]

## Section: Current AI Adoption Rates
[detailed analysis with inline citations]

## Confidence Assessment
High confidence in near-term projections (0.85), moderate confidence
in long-term economic models (0.65) due to limited historical precedent.

## Areas for Further Research
- Impact on developing economies
- Role of AI in creating new job categories
- Effectiveness of retraining programs

## References
1. [McKinsey Global Institute - AI and the Future of Work](https://...)
2. [OECD Employment Outlook 2024](https://...)
```

| Layer | Technology | Purpose |
|---|---|---|
| Framework | FastAPI | Async REST API with SSE streaming |
| Orchestration | LangGraph | State graph with iterative deepening loop |
| Reasoning Model | Claude Opus 4.6 | Complex reasoning with extended thinking |
| Fast Model | Claude Sonnet 4.5 | Quick extractions, analysis, verification |
| Extended Thinking | Anthropic API | Inference-time compute scaling (up to 128K thinking tokens) |
| Web Search | Tavily API | Real-time information retrieval |
| Streaming | SSE-Starlette | Real-time progress events |
| Data Models | Pydantic v2 + dataclasses | Type-safe configuration and structured output |
| Config | Pydantic Settings | Environment-based configuration with env prefix |
| Logging | structlog | Structured JSON logging |
| Database | PostgreSQL + asyncpg | (Optional) persistent research storage |
| Cache | Redis | (Optional) result caching |
| Containerization | Docker | Multi-stage production builds |
```text
04-deep-research/
├── src/deep_research/
│   ├── __init__.py
│   ├── main.py        # Uvicorn entry point
│   ├── api.py         # FastAPI app: research, reason, compare-reasoning
│   ├── config.py      # Settings (models, reasoning params, research params, ToT config)
│   ├── reasoning.py   # ReasoningEngine: Direct, CoT, ToT, self-reflection
│   ├── planner.py     # ResearchPlanner: question decomposition, plan refinement
│   ├── researcher.py  # DeepResearcher: full pipeline orchestration + WebSearcher
│   ├── report.py      # ReportGenerator: structured report with citations
│   └── workflow.py    # LangGraph StateGraph: plan→search→analyse→reason→verify→report
├── tests/
│   ├── conftest.py
│   ├── test_api.py
│   ├── test_reasoning.py
│   └── test_workflow.py
├── k8s/
│   └── deployment.yaml
├── Dockerfile
├── pyproject.toml
└── README.md
```
- Fork the repository
- Create a feature branch: `git checkout -b feature/my-feature`
- Install dev dependencies: `pip install -e ".[dev]"`
- Run tests: `pytest tests/ -v`
- Submit a pull request
Reasoning strategies are modular -- adding a new strategy means implementing a `_reason_X` method and registering it in the dispatch dict.
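A sketch of what that dispatch pattern looks like; the class shape and method names are assumptions for illustration, and the stub bodies stand in for the real LLM-backed implementations:

```python
class ReasoningEngine:
    """Strategy dispatch sketch: each strategy is a method registered by name."""

    def __init__(self) -> None:
        # Strategy name -> handler; a new strategy only needs an entry here
        self._strategies = {
            "direct": self._reason_direct,
            "chain_of_thought": self._reason_chain_of_thought,
            "tree_of_thought": self._reason_tree_of_thought,
        }

    def reason(self, query: str, strategy: str = "chain_of_thought") -> str:
        if strategy not in self._strategies:
            raise ValueError(f"Unknown strategy: {strategy!r}")
        return self._strategies[strategy](query)

    # Stubs standing in for the real LLM-backed implementations
    def _reason_direct(self, query: str) -> str:
        return f"direct({query})"

    def _reason_chain_of_thought(self, query: str) -> str:
        return f"cot({query})"

    def _reason_tree_of_thought(self, query: str) -> str:
        return f"tot({query})"
```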
This project is part of the AI Engineer Portfolio and is licensed under the MIT License.