the map of the agent universe
Silicons
single silicon, building a single project. (like claude building thoughtee)
multi-agent system of talking silicons. (like v1 of shubham's silicon)
|_ manager workers
|_ distributed system of workers
|_ multi manager multi worker system
( no manager big enough to know everything about the system,
otherwise it becomes a dictator )
1. single silicon: how is context managed?
200k tokens is about 500 pages.
first, claude saves all the messages i have sent and the files it has read.
do you pass it all in the prompt every time? FREAKING YES.
doesn't it keep piling up? kinda does.
and performance doesn't degrade? yeah, that's why the limit exists.
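the piling-up is literal. a minimal sketch of the loop, where `call_model` is a stand-in, not a real API:

```python
# minimal sketch of how a chat session manages context:
# the FULL history is re-sent on every turn. `call_model` is a
# stand-in for the real model call.

def call_model(prompt: str) -> str:
    # pretend model: just reports how big the prompt has grown
    return f"reply (prompt was {len(prompt)} chars)"

history = []  # every message ever sent, in order

def send(user_message: str) -> str:
    history.append(f"user: {user_message}")
    prompt = "\n".join(history)       # the whole history, every time
    reply = call_model(prompt)
    history.append(f"assistant: {reply}")
    return reply

send("read main.py")
send("now fix the bug")
# each new call's prompt contains all previous turns,
# which is why the 200k-token limit eventually bites
```

every turn makes the next prompt strictly longer, which is the whole story behind the limit.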
persistent context — CLAUDE.md ( it can exist at sub directory level as well )
project overview and architecture
tech stack and conventions
commands to run (build, test, lint)
preferences and rules you've set
claude's todo.md
project settings.json (which tools are allowed)
and all the tools and MCP servers i connect, their skills and MCP docs
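to make the list concrete, a hypothetical CLAUDE.md (project name, commands, and rules are all made up):

```markdown
# thoughtee

Next.js 14 + TypeScript, Postgres via Prisma.

## commands
- build: `npm run build`
- test: `npm test`
- lint: `npm run lint`

## rules
- never commit directly to main
- prefer server components; no client state unless needed
```

everything in it rides along in the prompt, which is exactly why it works and exactly why it costs tokens.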
gets appended to the prompt, right? so the more tools i connect, the fewer files i can read.
yes, true. it happens.
interesting thing.
so you can literally give claude ~500 pages of prompt and it will still perform.
DEEPER.
how does this happen at the architecture level?
SUB AGENTS
each gets context from the parent and returns a summary; the parent keeps only that little summary.
so now the parent is not splitting its 200k among the 3 tasks i had to do.
the parent is just distributing tasks and collecting results.
each worker is also not sharing context with the other 2. it just does its own thing.
each gets better.
IT IS WIN WIN
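the win-win in code form. a sketch of the sub-agent pattern, where `run_worker` stands in for spawning a real sub-agent:

```python
# sketch of the sub-agent pattern: each worker gets a fresh, isolated
# context seeded by the parent, and only a summary flows back.
# `run_worker` is a stand-in for a real sub-agent spawn.

def run_worker(task: str, seed_context: str) -> str:
    # the worker's context starts at just the seed + its task,
    # not the parent's whole 200k history
    worker_context = [seed_context, f"task: {task}"]
    # ... worker reads files, runs tools, fills its OWN window ...
    return f"summary: {task} done"   # only this goes back up

parent_context = ["project overview", "user goal: ship feature X"]
tasks = ["write the migration", "update the API", "fix the tests"]

for task in tasks:
    seed = parent_context[0]       # parent shares a slice, not everything
    parent_context.append(run_worker(task, seed))

# parent ends with 3 short summaries instead of 3 full transcripts
```

the parent's window grows by three lines instead of three transcripts; each worker gets a full fresh window.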
BRANCH
super human.
parallel universes.
it's like you make a decision at time t1 and reach t2 (t2 > t1)
and then YOU FREAKING GO BACK TO TIME t1 and decide something else
and reach time t3.
and come back and keep doing it.
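mechanically, the time machine is just checkpointing the session state. a toy sketch using `deepcopy` as the freeze button:

```python
# sketch of branching: snapshot the session at t1, explore two
# different continuations, and only then commit to one.
from copy import deepcopy

session = ["user: build me a parser"]     # state at time t1
checkpoint = deepcopy(session)            # freeze t1

# timeline A: decide one way, reach t2
branch_a = deepcopy(checkpoint)
branch_a.append("user: use a recursive descent parser")

# GO BACK to t1 and decide something else, reach t3
branch_b = deepcopy(checkpoint)
branch_b.append("user: use a parser generator instead")

# both timelines exist side by side; pick a winner after
# seeing actual outcomes (len() here is a stand-in for real evaluation)
best = min([branch_a, branch_b], key=len)
```

the checkpoint is never mutated, so you can return to t1 as many times as you like.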
The Deeper Implication
This breaks a fundamental constraint of intelligence that has existed since the beginning of thinking.
Every philosopher, scientist, engineer in history made a decision and lived with it.
The sunk cost of time meant commitment was inevitable.
Branching eliminates sunk cost entirely.
You never have to commit until you've seen the actual outcomes of multiple timelines.
It's not just a software feature.
It's a different shape of reasoning that biology never had access to.
SKILLS
like a bachelor's degree at lightspeed.
i need this claude to be able to draw? pass it SKILL.md and it knows how to draw.
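a hypothetical SKILL.md sketch. the name/description frontmatter follows the agent-skills convention; the drawing instructions themselves are made up:

```markdown
---
name: drawing
description: draw diagrams and illustrations as SVG
---

# drawing

To draw, emit SVG. Prefer simple shapes:
- `<rect>`, `<circle>`, `<line>`, `<path>`
- viewBox of 0 0 800 600 unless told otherwise
```

one file in, one capability out. that's the whole bachelor's degree.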
MULTI AGENT SYSTEMS
|_ manager workers
|_ distributed system of workers
|_ multi manager multi worker system
( no manager big enough to know everything about the system,
otherwise it becomes a dictator )
1. manager + tiny workers
the manager can work on 100s of tasks simultaneously.
( can benchmark whether one-token-per-task tracking works, and what the max number of tasks claude can juggle is )
and each task can be 200k again. wait.
what if i have manager delegating to manager and so on.
huh.
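the "manager delegating to manager" idea is just recursion with a depth limit. a toy sketch (the task split and merge are placeholders):

```python
# toy sketch of managers delegating to managers: each level splits
# the work and spawns sub-managers until tasks are leaf-sized.

def delegate(task: str, depth: int = 0, max_depth: int = 2) -> str:
    if depth == max_depth:
        return f"done: {task}"          # leaf worker just does the work
    # a manager splits its task and hands each piece down a level
    subtasks = [f"{task}.{i}" for i in range(3)]
    results = [delegate(t, depth + 1, max_depth) for t in subtasks]
    return f"merged({len(results)} results for {task})"

print(delegate("ship the release"))
# 3 sub-managers x 3 workers each = 9 leaf tasks,
# and every node in the tree could have its own full 200k window
```

fan-out grows geometrically with depth, which is the "wait. huh." moment in numbers.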
2. distributed system of workers
each worker working with its own 200k-token window.
and they spare a tiny fraction of it to track what the others are doing.
each one connected to a few others.
3. multi manager multi worker
a soup of whole spectrum of sessions.
each agent using fraction n of its window for its own task, 10n to talk.
and n varies from agent to agent, from 0 up to 1/11 (since n + 10n ≤ 1).
RAGs
this is how, theoretically, LLMs can have infinite memory.
retrieve what is needed into the 200k context window,
and write updates back out too.
RAG primarily means pre-retrieval. LLM doesn't know it happened.
and text-to-SQL is the LLM writing the query to retrieve. like a tool call.
claude has no embedding model of its own; a common default is text-embedding-3-large from OpenAI.
so what matters here is how you embed, both the user's message and the document.
simple embedding
contextual embeddings.
Anthropic's Recommended Embedding Provider
Voyage AI's embedding models. voyage-3-large for general text, voyage-code-3 for code.
Anthropic's Contextual Retrieval — The Real Recommendation
reduces failed retrievals by 49%, and with reranking by 67%.
core problem: traditional RAG destroys context.
when you chunk a document, individual chunks lose surrounding context.
a chunk saying "the revenue was $2.3 billion" doesn't know it came from Q3 2024 earnings.
their solution — two things combined:
1. Contextual Embeddings
before embedding each chunk, use Claude to prepend a context snippet.
original chunk: "Revenue was $2.3 billion"
after: "From Q3 2024 earnings report, discussing financial results. Revenue was $2.3 billion"
now the embedding carries full context, not just the isolated fragment.
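a sketch of step 1. `situate` stands in for the Claude call that describes where the chunk sits in the document, and `embed` is a fake bag-of-words embedder just to keep it runnable:

```python
# sketch of contextual embeddings: before embedding, prepend a
# model-written context snippet to each chunk.
from collections import Counter

def situate(doc_title: str, chunk: str) -> str:
    # real version: ask Claude "where does this chunk sit in the doc?"
    return f"From {doc_title}."

def embed(text: str) -> Counter:
    # fake embedding: bag of words, stands in for a real vector
    return Counter(text.lower().split())

doc_title = "Q3 2024 earnings report"
chunk = "Revenue was $2.3 billion"

plain = embed(chunk)
contextual = embed(situate(doc_title, chunk) + " " + chunk)

# the contextual vector now carries "q3", "2024", "earnings" --
# terms the raw chunk never contained
```

the raw chunk can never match a query about Q3; the situated one can.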
2. Contextual BM25 (Hybrid Search)
embedding models miss exact matches. BM25 looks for specific text strings.
query "Error code TS-999" — embedding might miss the exact TS-999 match.
so combine both:
Query → semantic search + BM25 exact match
→ Rank Fusion (RRF)
→ Top 150 chunks
→ Reranker
→ Top 20 chunks
→ Claude answers
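the rank fusion step in that pipeline is small enough to write down. a minimal reciprocal rank fusion; the k = 60 constant is the usual default, an assumption here:

```python
# minimal reciprocal rank fusion (RRF): each retriever contributes
# 1 / (k + rank) per document, summed across retrievers.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["chunk_a", "chunk_b", "chunk_c"]   # embedding search order
bm25     = ["chunk_c", "chunk_a", "chunk_d"]   # exact-match order

fused = rrf([semantic, bm25])
# chunk_a ranks high in BOTH lists, so it wins the fusion;
# in the real pipeline you'd keep the top 150, then rerank to 20
```

documents that both retrievers agree on float to the top, which is exactly what you want before the reranker.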
The Full Anthropic-Recommended Stack
Embedding model → Voyage AI (voyage-3-large or voyage-code-3)
Chunking → contextual chunks (Claude adds context to each chunk)
Retrieval → hybrid: semantic + BM25 combined via rank fusion
Reranking → pass top 150 through reranker, keep top 20
Generation → Claude
One More Thing Anthropic Points Out
if your knowledge base is smaller than 200k tokens (about 500 pages),
just include the entire knowledge base in the prompt — no need for RAG at all.
prompt caching makes this approach significantly faster and more cost-effective.
so ironically — for small enough knowledge bases, RAG is overkill.
RAG only becomes necessary when you exceed 200k tokens.
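the decision rule is one line. a sketch using the rough 4-chars-per-token heuristic (an approximation for english text, not a real tokenizer):

```python
# sketch of the "do i even need RAG?" check.
# ~4 chars/token is a rough english heuristic, not a real tokenizer.

CONTEXT_LIMIT = 200_000

def approx_tokens(text: str) -> int:
    return len(text) // 4

def needs_rag(knowledge_base: str) -> bool:
    # under the limit: stuff the whole KB in the prompt (and cache it);
    # over the limit: retrieval becomes necessary
    return approx_tokens(knowledge_base) > CONTEXT_LIMIT

small_kb = "docs " * 10_000          # ~12.5k tokens, fits easily
assert not needs_rag(small_kb)
```

for a real system you'd use the provider's token counter instead of the heuristic, but the shape of the decision is the same.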
graph RAGs.
agentic RAGs.
LLM MEMORY
Types of Memory in Agents
In-context — what's in the prompt right now. Expensive, limited, temporary.
RAG / Vector DB — external knowledge base. Searched semantically. Cheap, unlimited, read-only.
Key-value store — structured facts. Fast lookup. Like a cache. user_preference: dark_mode.
Episodic — logs of past interactions. "Last time this user asked X, we did Y."
In-weights — what the model already knows from training. Always there, can't be changed at runtime.
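the non-RAG rows above are mostly plain data structures. a toy sketch of key-value plus episodic memory (names and shapes are illustrative, not any library's API):

```python
# toy sketch of two agent memory types: a key-value store for
# structured facts and an episodic log of past interactions.

kv_memory: dict[str, str] = {}
episodes: list[tuple[str, str]] = []   # (what user asked, what we did)

def remember_fact(key: str, value: str) -> None:
    kv_memory[key] = value             # fast lookup, like a cache

def log_episode(asked: str, did: str) -> None:
    episodes.append((asked, did))

remember_fact("user_preference", "dark_mode")
log_episode("asked X", "we did Y")

# "last time this user asked X, we did Y"
recall = [did for asked, did in episodes if asked == "asked X"]
```

the point: most agent "memory" is this boring, and that's fine; only in-context and in-weights memory involve the model itself.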
1. KV caching: cache the attention keys/values of a prompt prefix that is being used again and again.
fine tuning
WHEN TO FINE TUNE WHEN TO USE RAG?
parameters to compare
RAM
compute
inference time
accuracy
how much data is enough to fine tune.
pre-training ( only labs can do it. ) HOW THE FREAK DO THEY DO IT. ENTIRE ARTICLE ON IT, OF COURSE.
NEXT ARTICLE, ANOTHER OF COURSE. qwen dissection. architecture. in pytorch.
NEXT on model distillation again. how does the thinking we see happen in the architecture.
IT AIN'T REAL THINKING, OF COURSE.