Reverse Engineering AI Answers: What I Learned Building a Custom AEO Pipeline

I built a custom AEO pipeline to understand how models generate answers and why those answers break when you rely on them for technical work. I wanted a tool that pulls research, tracks citations, and assembles reproducible implementation notes for engineering teams. I built Google Cloud Tasks dispatchers to route requests to stateless Python workers, used n8n as the entry and delivery layer for Slack integration, and wired a custom MCP server as the sensory hub through which all retrieval tools (web search, RAG, scraping, and so on) are routed. The orchestration exposed two hard truths. Models hallucinate when their retrieval layer feeds stale or misaligned context, and retrievers, not generators, often trigger failure modes.
I reverse engineered answers by instrumenting each stage. I logged retrieval queries, normalized metadata, and enforced timestamps on every ingested event to close the gap between when a source was retrieved and when the agent trusted it. That empirical approach forced me to separate confidence scores from source trails. Confidence shows the model's self-belief. A source trail shows where that belief came from. Only the latter supports engineering decisions.
I also confronted an emergent form of manipulation that I call Keyword Stuffing 2.0. Instead of stuffing pages with tokens, pipelines over-index lightweight citations and recycled summaries to achieve surface relevance. Surface relevance does not equate to implementable architecture. Coupled with opaque synthesis heuristics, those citations anchor faulty design choices.
This project demanded technical rigor over persuasion. I avoided high-level cliché and built instrumentation that answered precise questions. Which source produced a claim? When did that source publish? Which API version did the examples use? Those fields transformed the pipeline from a black box into an auditable trail.
The Death of the Matrix — Semantic Intent Replaces Keyword Grids
I started by trusting the keyword matrix. It promised clear signals and measurable coverage. That faith collapsed as retrieval architectures shifted. Literal token matches became brittle when queries expressed intent rather than exact phrasing. TF‑IDF thinking misaligned with user behavior because systems stopped surfacing relevant passages unless queries matched training vocabularies.
Vector embeddings encode semantics across many dimensions rather than counting tokens. Dense representations group latent intent into clusters, whereas sparse lexical matches catch exact phrasing. That shift improved recall, but semantic grouping carries its own maintenance weight — cluster boundaries drift as usage patterns evolve, and index refresh cadences need active governance.
Operational migration created friction. When I moved to Firestore vector search, I expected the index to behave like a faster keyword store. It did not. The embedding index size, query latency, and cost profile were fundamentally different operational variables. Real-time constraints demanded hybrid pipelines that preserve exact-match filters while applying semantic reranking. I also confronted an explainability tradeoff. Debugging a TF-IDF score felt straightforward, whereas tracing similarity in high-dimensional space required entirely different tooling.
In practice I architected hybrid flows that treat lexical and semantic signals as complementary. I prioritize intent signals for recall, apply lexical checks for precision, and keep lexical fallbacks for edge cases. The lesson was stubborn and simple. I abandoned the matrix mindset but kept the matrix as a safety net.
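A minimal sketch of that hybrid flow, with toy vectors and a hand-rolled cosine standing in for a real embedding store. The document shape (`text` and `vec` keys), the `alpha` blend weight, and the `min_lexical` filter threshold are all assumptions for illustration, not the production configuration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two toy embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query_terms: list[str], query_vec: list[float],
                docs: list[dict], alpha: float = 0.7,
                min_lexical: float = 0.0) -> list[dict]:
    """Rank docs by a blend of semantic similarity and lexical overlap.

    alpha weights the semantic signal (recall); lexical overlap acts both
    as a precision filter (min_lexical) and as the fallback signal.
    """
    scored = []
    for d in docs:
        tokens = set(d["text"].lower().split())
        lexical = len(set(query_terms) & tokens) / max(len(query_terms), 1)
        if lexical < min_lexical:
            continue  # the exact-match filter preserved from the keyword era
        semantic = cosine(query_vec, d["vec"])
        scored.append((alpha * semantic + (1 - alpha) * lexical, d))
    return [d for score, d in sorted(scored, key=lambda p: -p[0])]
```

Raising `min_lexical` above zero is the "safety net" in code form: a document with zero term overlap never reaches the semantic reranker.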
I learned to treat nearest‑neighbor stores as operational components with their own lifecycle. Index sharding, refresh frequency, and retraining cadence influence latency and consistency. I instrumented failure detection, rolled progressive rollouts, and enforced lexical fallbacks where needed.
Context Rot and the War Story of a Failed Roadmap
I built a custom AI citation tracker and a marketing research agent using n8n, Python, and the Google ADK. I expected a reliable pipeline where a retrieval agent fed clean grounding to a writer agent and synthesis followed. Instead I watched Context Rot erode trust across the system. I call this episode the War Story of a Failed Roadmap because the failure sprang from an architectural blind spot rather than a flaky library or a misconfigured container.
Here is what happened. Agent 1, the retrieval stage, returned grounding from my 2025 v1beta implementation of the Google Merchant API. That material matched query semantics but dated from an older internal spec. I later labeled that symptom Temporal Drift because subsequently retrieved content retained valid tokens and plausible structure while carrying obsolete assumptions. Agent 2, the writer, consumed those tokens as if they were current authority and synthesized an Operational Playbook that used Web 2.0 primitives such as CRLs, simple JWKS endpoints, and "signing tokens". Those primitives contradicted the Web 3.0 Verifiable Credential design required by the incoming V1 API.
The cascade looked healthy to classic observability. Logs showed reads and normal latencies. Vector distances and snippet confidences reported as expected. Yet the output proposed the wrong security model. Standard traces do not expose semantic freshness, and that is exactly where the corruption lived. I relied on architectural intuition to detect the mismatch and to trace its origin.
I instrumented the retrieval prompt to extract explicit temporal metadata and lineage. The change required Agent 1 to return the original document timestamp, an authoritative source label, and a confidence flag that indicated whether the document matched or preceded {current_year}. I injected a runtime substitution for {current_year} and made temporal validation mandatory. The writer then had to cite those fields before synthesizing architecture recommendations.
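The temporal gate reduces to a small validator. This is a sketch of the contract, not the actual prompt schema; the field names (`document_timestamp`, `source_label`, `freshness_flag`) are illustrative stand-ins, and the runtime `{current_year}` substitution appears here as a plain constant.

```python
from datetime import datetime

CURRENT_YEAR = datetime.now().year  # runtime substitution for {current_year}

REQUIRED_FIELDS = ("document_timestamp", "source_label")

def validate_grounding(fragment: dict) -> dict:
    """Reject retrieved fragments that lack temporal lineage.

    `fragment` is the retrieval agent's output; the writer may only
    consume fragments that pass this gate, and must cite its fields.
    """
    missing = [f for f in REQUIRED_FIELDS if f not in fragment]
    if missing:
        raise ValueError(f"grounding rejected, missing fields: {missing}")
    year = datetime.fromisoformat(fragment["document_timestamp"]).year
    # The confidence flag: does the document match or precede the current year?
    fragment["freshness_flag"] = "current" if year >= CURRENT_YEAR else "stale"
    return fragment
```

Making this validation mandatory, rather than advisory, is what converted implicit trust into an explicit contract.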
The remediation worked because it converted implicit trust into an explicit contract. The writer could no longer assume a fragment was authoritative. It had to justify each architectural claim against a stamped source. That shift prevented the synthetic leap into obsolete Web 2.0 conventions and reoriented synthesis toward current API reality. The problem and its fix align with broader findings about context degradation in multi‑agent retrieval pipelines, a failure mode Google's own ADK architecture team describes as the 'scaling bottleneck' — where stale tool outputs or deprecated state cause models to fixate on past patterns rather than the immediate instruction.
Chroma's research on context window degradation shows that what enters the context window is as important as how large it is — a finding that extends naturally to retrieval pipelines, where the quality of retrieved content is the upstream variable.
- I enforced temporal hygiene at retrieval by requiring timestamp and source fields.
- I made source attribution blocks explicit and machine‑readable so writers could gate synthesis on them.
- I introduced failure‑mode tests that simulate deprecated docs and assert that synthesis refuses deprecated primitives.
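A failure-mode test of the third kind might look like the sketch below, with a toy writer gate standing in for the real agent. The `DEPRECATED_PRIMITIVES` list comes from this war story; the `synthesize` function and its fragment shape are assumptions for illustration, not the ADK pipeline's interface.

```python
# Web 2.0 leftovers that the failed roadmap actually proposed
DEPRECATED_PRIMITIVES = ("crl", "jwks", "signing token")

def synthesize(grounding: list[dict]) -> str:
    """Toy stand-in for the writer agent, gated on the source contract."""
    for fragment in grounding:
        if fragment.get("freshness_flag") == "stale":
            raise RuntimeError("refusing synthesis: stale grounding")
        text = fragment.get("text", "").lower()
        if any(p in text for p in DEPRECATED_PRIMITIVES):
            raise RuntimeError("refusing synthesis: deprecated primitive in grounding")
    return "playbook grounded in current primitives"

def test_refuses_deprecated_docs():
    """Simulate a deprecated doc and assert the writer refuses to build on it."""
    deprecated = [{"text": "rotate the CRL and publish a JWKS endpoint",
                   "freshness_flag": "current"}]
    try:
        synthesize(deprecated)
    except RuntimeError:
        return  # the correct behavior
    raise AssertionError("writer accepted deprecated primitives")
```

Tests like this run against deliberately stale fixtures, so a regression in the gate fails loudly instead of shipping an obsolete security model.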
Keyword Stuffing 2.0
I built a pipeline and watched Keyword Stuffing 2.0 fail in the wild. Semantic Over‑Optimization describes the practice of cramming facts, entities, and structured markers into content to game retrieval. Authors flood documents with dense entity clusters, repeated JSON‑like markers, and forced anchor phrases. The aim is signal amplification. The effect is brittle pipelines and noisy attention.
Piling semantic cues into prose does not strengthen relevance. Hierarchical attention mechanisms distribute focus across tokens and segments. When I overloaded documents with dense clusters, attention heads began to scatter; they prioritized repeated tokens over contextual connectors. That scattering turned useful signal into noise and eroded ranking fidelity. Citation validators then struggled because excessive semantic density forced costly disambiguation and increased false positives.
I observed brittle agents that consumed indexes stuffed with synthetic markers, then suggested spurious citation chains or chased irrelevant entity variants. I favor structural remedies over surface tricks. Split factual assertions into verifiable segments. Store fielded metadata separate from narrative text. Expose canonical source pointers that agents can follow. Those patterns streamline verification and reduce ambiguous anchors.
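Those remedies can be sketched in a few lines. The `Assertion` shape and the density threshold are assumptions for illustration, not a standard schema: the point is fielded metadata beside (not inside) the prose, plus a cheap linter for the over-stuffing smell.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Assertion:
    """One verifiable factual segment, with metadata fielded apart from prose."""
    claim: str
    canonical_source: str  # pointer an agent can follow to verify
    entity: str            # a single anchor entity, not a stuffed cluster

def split_assertions(rows: list[tuple[str, str, str]]) -> list[Assertion]:
    """Build fielded assertions instead of cramming markers into narrative text."""
    return [Assertion(claim=c, canonical_source=s, entity=e) for c, s, e in rows]

def entity_density_flags(text: str, entities: list[str],
                         max_mentions: int = 3) -> list[str]:
    """Flag entities repeated past a threshold, the Keyword Stuffing 2.0 smell."""
    lowered = text.lower()
    counts = Counter({e: lowered.count(e.lower()) for e in entities})
    return [e for e, n in counts.items() if n > max_mentions]
```

Running the density check at ingestion time keeps over-optimized documents out of the index before they can scatter attention downstream.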
ToC Versus Answer Blocks — Structuring for LLM Attention
I found that a well‑formed Table of Contents outperforms fifty‑word answer blocks when agents must cite long‑form content. A ToC acts as a hierarchical semantic map that anchors the model's attention and reduces repeated token processing. When an agent receives a ToC it can target a subtree instead of rereading the entire document, which conserves the model's limited context budget and improves citation precision.
Technically this works because language models operate inside bounded token windows and use attention weighting to prioritize tokens. A ToC supplies coarse‑grain tokens that label major sections and subsections. The model assigns higher attention weights to tokens that match the requested semantic intent, and it expands only the nodes that matter for the current query. Recent work on hierarchical context and retrieval engineering supports this pattern.
Programmatic navigation follows naturally. I parse the ToC into a lightweight index of pointers. The agent resolves a user's question to a ToC node, fetches the corresponding subsection text, and computes local attention over that chunk. This pattern prevents the agent from treating each micro summary as an isolated fact. Answer blocks often interrupt attention flow and introduce redundant tokens, which pushes useful context out of the window.
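A minimal version of that ToC-to-pointer index, assuming markdown-style `#` headers and a naive token-overlap resolver; a production corpus would need its own header pattern and a proper intent model.

```python
import re

def build_toc_index(document: str) -> dict[str, tuple[int, int]]:
    """Map each header to the (start, end) line offsets of its section."""
    lines = document.splitlines()
    headers = [(i, line.lstrip("# ").strip()) for i, line in enumerate(lines)
               if re.match(r"^#{1,6}\s", line)]
    index = {}
    for n, (start, title) in enumerate(headers):
        end = headers[n + 1][0] if n + 1 < len(headers) else len(lines)
        index[title.lower()] = (start, end)
    return index

def fetch_section(document: str, index: dict[str, tuple[int, int]],
                  query: str) -> str:
    """Resolve a query to the ToC node with the most token overlap, return its text.

    Only the matched subtree is loaded, conserving the context budget.
    """
    q = set(query.lower().split())
    best = max(index, key=lambda title: len(q & set(title.split())))
    start, end = index[best]
    return "\n".join(document.splitlines()[start:end])
```

The agent computes local attention over the returned chunk only, instead of rereading the whole document per query.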
Practical rules I apply are straightforward. Craft headers with explicit semantic intent. Lead with an action verb or an outcome and add subject and qualifier. Keep shallow navigational depth and deeper nesting where technical detail belongs. Store section offsets so the agent can jump straight to a section and return a line‑limited quote plus header context. That header then justifies the citation and explains why the block is authoritative.
Final Reflection
Building a custom AEO pipeline taught me that control beats convenience. Reversing AI answers forced explicit assumptions and surfaced brittle heuristics, so I replaced guesswork with structured tests and clear decision rules. I moved faster because experiments produced observable effects rather than vague outputs. Grounding answers in verified sources anchored stakeholder trust and reduced rework. The work required engineers, product owners, and content leads to align on minimal interfaces and quality gates. The result lowered ambiguity and sped deployment.
The biggest win came from clearer questions, not fancier models. I build with guardrails, instrument for feedback, and prioritize clarity of intent. Those practices turn AI answers from surprises into reliable features.
The most expensive lesson was not architectural. It was epistemic. I had built instrumentation that logged every retrieval call, every vector distance, every latency metric, and none of it told me the draft was wrong. The pipeline looked healthy right up until an engineer would have shipped a security model built on deprecated primitives. No dashboard catches that. Only knowing the system deeply enough to ask the right question does.

