From Linear Pipelines to Recursive Cores: Rethinking Enterprise-Grade AI Reasoning Systems
A technical reflection on system design under real constraints.
Executive Summary
We designed and built a decision-support reasoning system intended to produce institutional-grade research artifacts — analysis defensible enough to inform capital allocation, strategic pivots, and risk assessment. Our first production architecture was an 8-stage linear pipeline with strict epistemic guardrails: structural separation between evidence gathering and interpretation, independent adversarial review at every stage, and explicit boundary contracts preventing any agent from evaluating its own output.
The architecture embodied genuine engineering discipline. In practice, it produced shallow output at a fixed cost of approximately $1.00–$1.50 per run — regardless of query complexity. The problem was not a deficiency in any individual stage. It was a loss of cognitive continuity across eight sequential handoffs. Context fragmented. Nuance dissipated. Token budgets compounded without proportional depth.
We redesigned the system around a recursive core: three phases, five cognitive roles, and bounded iterative loops that concentrate reasoning where evidence is weak while locking sections that are already strong. Every epistemic principle from the original architecture was retained. What changed was the topology — the shape of the computation itself.
This article describes what we built, what we observed under production conditions, why the topology constrained reasoning depth, and what the redesign revealed about building AI systems for high-stakes, auditable decision support.
Why This Matters for Enterprise Leaders
Most AI systems marketed as "research tools" are, architecturally, summarization engines. They retrieve relevant passages and reformat them into readable prose. This is adequate for information gathering. It is inadequate for decision-grade reasoning.
Capital allocation decisions, regulatory compliance assessments, competitive due diligence, and strategic risk evaluation share a common requirement: the reasoning behind the conclusion must be defensible. Not merely plausible — defensible. Every claim must trace to a source. Contradictions must be surfaced, not smoothed over. Uncertainty must be marked explicitly, not hidden behind balanced language.
This imposes specific architectural demands that summarization systems do not meet. Evidence gathering and interpretation must be structurally separated to prevent hallucination laundering — the well-documented failure mode where retrieval and synthesis contaminate each other. Review must be independent. Termination must be guaranteed. Cost must be predictable and bounded.
Our experience building and then redesigning a system under these constraints surfaced an insight that we believe generalizes: process rigor and cognitive depth are not the same thing. A system can satisfy every checklist for disciplined multi-agent design — role separation, independent review, structured checkpoints — and still produce shallow analysis. The determining factor is not how many stages the system has, or how cleanly their responsibilities are separated. It is the topology of reasoning — whether the architecture permits iterative deepening where evidence is weak, or merely passes incomplete work forward.
Architecture topology determines reasoning quality. This is the central observation of this article.
1. The Problem Space: Decision-Grade Reasoning
There is a meaningful architectural distinction between AI-assisted search and AI-assisted reasoning. Search retrieves relevant information. Reasoning constructs a defensible point of view from that information, pressure-tests it against counter-evidence, surfaces internal contradictions, and marks residual uncertainty explicitly.
Under conditions of ambiguity — where the most consequential decisions tend to live — summarization systems default to balanced, non-committal output. They present information from multiple angles without constructing or defending a position. This is precisely the failure mode that decision-makers cannot afford. The value of analysis lies not in presenting options but in constructing and stress-testing arguments.
The system we set out to build was designed to operate as a reasoning engine, not a retrieval engine. The requirements were specific and non-negotiable:
Every claim in the output must trace to a cited source. Evidence gathering and interpretation must be structurally separated — never handled by the same agent in the same step. The system must interrogate its own assumptions before committing resources to research. Cost must be predictable and bounded by deterministic ceilings. The system must terminate within a known envelope — it cannot loop indefinitely in pursuit of completeness.
These are not features. They are constraints that define the boundary between a research tool and a decision-support reasoning system.
2. The Original Architecture: Discipline-First Design
2.1 Structural Overview
The original system was an 8-stage linear pipeline: Framing, Exploration, Synthesis, Critical Review, Counter-arguments, Refinement, Narrative Construction, and Final Output. Each stage was implemented as a separate agent with its own prompt contract, model assignment, and explicit boundary declarations defining what the agent does and does not do.
The architecture was organized around principles that we considered — and still consider — non-negotiable for any system whose output informs real decisions.
2.2 Design Principles
Separation of evidence and interpretation. We enforce a discipline we refer to as the Two-Gap Model. The Information Gap (evidence collection) and the Generation Gap (reasoning and synthesis) are never handled by the same agent in the same step. The exploration stage produces Working Research Notes — structured, source-attributed evidence artifacts containing raw facts, data points, and quotations. No conclusions. No rankings. No synthesis. This material is passed, unmodified, to the synthesis stage, which constructs meaning from it. This structural separation prevents the failure mode where a model retrieves thin evidence and then confidently synthesizes conclusions from it — because the same cognitive process handled both retrieval and interpretation.
Independent adversarial review. No stage in the pipeline evaluates its own output. The synthesis stage constructs arguments; a separate critical review stage challenges them. A dedicated counter-arguments stage generates adversarial perspectives. This separation was enforced architecturally, not suggested through prompting.
Structured epistemic checkpoints. At designated intervention points, the system generates targeted questions about its own reasoning — probing assumptions embedded in the framing, identifying what would change the conclusion entirely, and surfacing what remains unknown. This forces the system to interrogate its own premises before committing resources to downstream analysis.
Boundary contracts per agent. Every agent prompt included explicit declarations of scope: what the agent is responsible for, and what it is prohibited from doing. The exploration agent does not synthesize. The synthesis agent does not critique. The critical review agent does not retrieve new evidence. These contracts prevent stage bleed — the gradual erosion of role separation that degrades multi-agent systems over time and produces work that appears complete but lacks the depth that specialization provides.
Multi-model orchestration. Different cognitive tasks were assigned to different models based on demonstrated strengths: reasoning-class models for framing and adversarial critique, diverse model ensembles for exploration to reduce confirmation bias, writing-optimized models for narrative construction.
2.3 Why This Architecture Was Sound
Every design decision addressed a documented failure mode in production AI systems. The Two-Gap Model prevents hallucination laundering. Independent adversarial review prevents self-reinforcement. Boundary contracts prevent scope drift. Multi-model orchestration reduces confirmation bias in evidence gathering.
This was a considered, principled architecture built to meet the requirements of institutional-grade analysis.
3. What We Observed Under Production Conditions
3.1 Output Quality
The system consistently produced output that was polished, balanced, and shallow. Under ambiguity — precisely the conditions where deep analysis matters most — it defaulted to hedged, non-committal language. It surfaced relevant information but rarely constructed a genuinely defensible argument from it.
The critical review stage, designed to challenge the synthesis, produced generic critique. The counter-arguments stage generated plausible-sounding objections without the evidential depth to make them substantive. The refinement stage smoothed prose but rarely deepened reasoning.
All eight stages executed. None thought deeply.
3.2 Cost Structure
The pipeline consumed approximately 87,000 tokens per run at a fixed cost of $1.00–$1.50, regardless of query complexity. A straightforward factual question consumed the same budget as a nuanced strategic analysis. Significant overhead accumulated through context duplication — each stage received the serialized outputs of all prior stages, meaning the same evidence was re-processed multiple times. By the final stages, the context window was dominated by accumulated artifacts rather than fresh reasoning.
For a decision-maker running five to ten queries per day, this fixed cost structure was operationally unsustainable.
3.3 Cognitive Fragmentation
This was the root cause. Eight agents passed artifacts to each other in sequence. Each operated on accumulated context, but none had the cognitive continuity to build and refine a sustained line of reasoning across iterations.
The synthesis stage drafted an argument. The critical review stage challenged it. The counter-arguments stage added objections. The refinement stage attempted reconciliation. But reconciliation is not depth. Each stage contributed a thin layer of processing. The final output was the sum of eight shallow passes — not the product of deep, iterative reasoning.
Context fragmented across handoffs. Model switching between stages introduced subtle discontinuities in reasoning emphasis and style. The pipeline had process discipline. It did not have cognitive continuity.
The architecture was disciplined. It was not deep.
4. The Diagnosis: Topology as the Binding Constraint
The limitation was not in the agents. Each agent, evaluated in isolation, was capable of strong reasoning. The limitation was in how they were connected.
Linear pipelines encode a specific structural assumption: that reasoning proceeds in a single forward pass, where each stage transforms its input and delivers it to the next. This model works well for manufacturing and data processing. It works poorly for analysis, where the quality of a conclusion depends on the ability to revisit and strengthen the evidence that supports it.
In a linear topology, the critical review stage can identify that a section lacks sufficient evidence. But it cannot direct the exploration agent to retrieve targeted evidence for that specific weakness. It can only pass its critique forward, where subsequent stages attempt to work around the gap. The system optimizes each stage locally but has no mechanism for targeted remediation at the global level.
This is what we came to recognize as process ceremony — the structural appearance of rigor without the cognitive substance of depth. Every stage executes. Every boundary is respected. Every handoff is clean. But the system as a whole produces analysis that is less than the sum of its parts, because the forward-only topology prevents the iterative deepening that defensible reasoning requires.
The insight was precise: we needed to change the shape of the computation, not the quality of the components.
5. The Redesigned Architecture: Recursive Core
5.1 What Was Retained
Every epistemic principle from the original architecture carries forward unchanged. Evidence gathering and interpretation remain structurally separated. No role evaluates its own output. Boundary contracts are enforced. Source traceability is mandatory. The Two-Gap Model is retained as the foundational design constraint.
The redesign was not a retreat from discipline. It was a change in how discipline is applied.
5.2 Structural Overview
The 8-stage linear pipeline was replaced with three phases and five cognitive roles, organized around a bounded iterative core that operates at the dimension level.
Phase 1 — The Architect. The Architect decomposes the research question into discrete, independent dimensions — for a startup evaluation, this might produce dimensions such as market sizing, competitive landscape, unit economics, and regulatory exposure. The Architect performs no evidence retrieval. Its function is purely structural: producing a research plan as structured output that defines dimensions, hypotheses, and initial retrieval queries for each.
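As an illustration, the Architect's structured output might look like the following. The schema and field names here are assumptions made for this sketch, not the production contract:

```python
from dataclasses import dataclass, field

@dataclass
class Dimension:
    """One independent axis of the research question (illustrative schema)."""
    name: str                      # e.g. "unit economics"
    hypothesis: str                # the initial position to test
    retrieval_queries: list[str]   # seed queries handed to the Scout

@dataclass
class ResearchPlan:
    """The Architect's only output: structure, no evidence, no synthesis."""
    question: str
    dimensions: list[Dimension] = field(default_factory=list)

plan = ResearchPlan(
    question="Should we acquire vendor X?",
    dimensions=[
        Dimension(
            name="unit economics",
            hypothesis="Gross margins support the asking price",
            retrieval_queries=["vendor X gross margin", "vendor X CAC payback"],
        ),
    ],
)
```

Because the plan is pure structure, it can be validated deterministically before any retrieval spend occurs.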
Phase 2 — The Recursive Core. This is where the substantive architectural change resides. For each dimension independently, three roles execute in a targeted remediation cycle:
The Scout retrieves evidence. It produces Working Research Notes — structured, source-attributed facts with no interpretation. The Scout never synthesizes. Its boundary is absolute: it fetches evidence and returns it. Nothing more.
The Analyst constructs an argument from the Scout's evidence. It drafts a dimension section, maps every claim to a source entry in a structured evidence ledger, and identifies remaining gaps. The Analyst builds arguments. It does not evaluate them.
The Skeptic evaluates the Analyst's draft against a structured scorecard. Evidence quality, completeness, and reasoning coherence are scored independently on numerical scales. The Skeptic produces a pass/fail decision with explicit rationale. If the dimension fails, the Skeptic generates specific missing-evidence queries — not vague requests for improvement, but targeted retrieval instructions that direct the Scout to address the identified gap.
This is the critical design shift. The loop is targeted. Only dimensions with weak evidence re-loop. Dimensions that pass on the first iteration are locked and excluded from further computation. The system invests in depth only where depth is needed.
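The cycle described above can be sketched as a bounded loop. The `scout`, `analyst`, and `skeptic` callables and the dictionary shapes are hypothetical stand-ins for the real role contracts:

```python
MAX_ITERATIONS = 3  # per-dimension cap (see the termination discussion below)

def run_dimension(dimension, scout, analyst, skeptic):
    """Scout -> Analyst -> Skeptic, re-looping only while the Skeptic fails the draft."""
    queries = dimension["retrieval_queries"]
    draft = None
    for iteration in range(1, MAX_ITERATIONS + 1):
        notes = scout(queries)               # evidence only, no interpretation
        draft = analyst(dimension, notes)    # argument construction, claim -> source
        verdict = skeptic(draft)             # structured scorecard with pass/fail
        if verdict["passed"]:
            # The dimension is locked; it never re-enters the loop.
            return {"draft": draft, "status": "approved", "iterations": iteration}
        # Targeted remediation: the Skeptic's gaps become the Scout's next queries.
        queries = verdict["missing_evidence_queries"]
    return {"draft": draft, "status": "approved_with_caveats",
            "iterations": MAX_ITERATIONS}
```

Note that only the failing path re-loops: a dimension that passes on the first iteration exits immediately and consumes no further budget.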
Phase 3 — The Integrator. The Integrator assembles approved dimension sections into the final artifact. It executes a cross-dimension stress test — detecting contradictions between dimensions, identifying overarching blind spots, computing a global confidence score derived from evidence density and individual dimension scores. The Integrator is architecturally blocked from triggering new evidence retrieval. It synthesizes and delivers. This constraint is a deliberate, hard termination guarantee.
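The global confidence computation is described here only in outline. One hedged possibility, purely for illustration, is a dimension-score average weighted by evidence density:

```python
def global_confidence(dimension_scores, evidence_counts):
    """Illustrative only: average dimension score, weighted by evidence density.

    dimension_scores: Skeptic scores per approved dimension (0.0-1.0).
    evidence_counts: number of ledger entries supporting each dimension.
    """
    total_evidence = sum(evidence_counts)
    if total_evidence == 0:
        return 0.0  # no evidence, no confidence
    weighted = sum(score * count
                   for score, count in zip(dimension_scores, evidence_counts))
    return weighted / total_evidence
```

The weighting reflects the intuition in the text: a high score backed by thin evidence should move the global number less than the same score backed by a dense ledger.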
5.3 Cognitive Continuity
Within the Recursive Core, a single model handles both the Analyst and Skeptic roles by default. This is intentional. The multi-model orchestration in the original architecture introduced subtle reasoning discontinuities at every handoff — shifts in emphasis, analytical style, and inferential patterns. A single model, reasoning iteratively within a tight loop, maintains the thread of an argument across iterations in a way that sequential model handoffs cannot.
The system also maintains an Emerging Thesis — a rolling global point of view that updates after each dimension reaches its terminal state. This allows later dimensions to incorporate insights generated by earlier ones, compounding analytical coherence across the research run without triggering additional retrieval.
5.4 Memory Discipline
Recursive systems face a specific and well-understood risk: context window bloat across iterations. The redesigned architecture addresses this through a structured memory protocol. After the Analyst generates its output from the Scout's Working Research Notes, the raw evidence notes are discarded. What persists across loops is the drafted section, the structured evidence ledger (source URLs, excerpts, confidence ratings), and the Skeptic's scorecard. On subsequent loops, the Scout starts fresh with targeted queries, and the Analyst rebuilds from new evidence plus the retained ledger.
This keeps token consumption approximately flat across iterations, preventing the cost escalation that renders naive recursive architectures impractical at scale.
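The memory protocol amounts to a pruning step between iterations. A minimal sketch, with illustrative field names:

```python
def prune_loop_state(state):
    """Keep only the durable artifacts between iterations; discard raw notes.

    'state' is a dict accumulated during one Scout -> Analyst -> Skeptic pass.
    """
    return {
        "draft_section": state["draft_section"],          # the Analyst's output
        "evidence_ledger": state["evidence_ledger"],      # URLs, excerpts, confidence
        "skeptic_scorecard": state["skeptic_scorecard"],  # drives the next queries
        # state["working_research_notes"] is deliberately dropped: the next
        # loop's Scout starts fresh with targeted queries, keeping tokens flat.
    }
```

Because the discarded notes are the largest artifact by token count, dropping them after each Analyst pass is what holds per-iteration context roughly constant.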
6. Auditability and Evidence Reconstruction
For any system whose output informs consequential decisions — capital allocation, regulatory compliance, competitive strategy — the ability to reconstruct the reasoning chain after the fact is not a convenience. It is a requirement.
The structured evidence ledger maintained across the Recursive Core serves as a durable audit trail. Every claim in the final artifact maps to a ledger entry containing the source URL, a verbatim excerpt, a source-type classification (primary, regulatory, filing, credible secondary), and a confidence rating. This mapping is enforced structurally — the Analyst cannot produce a claim without a corresponding ledger entry. The ledger deduplicates across loop iterations, ensuring that the evidence base is traceable even when a dimension has undergone multiple retrieval cycles.
This architecture supports board-level defensibility. When a stakeholder asks "Where did this conclusion come from?", the system can reconstruct the evidence chain from claim to source to retrieval query to the Skeptic's scorecard that triggered the retrieval. For organizations operating in regulated environments, this level of transparency is not optional — it is the baseline expectation for any analytical system that touches decision-making processes.
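A sketch of what that claim-to-source reconstruction can look like. The entry fields and identifiers below are illustrative, not the production schema:

```python
LEDGER = {
    "claim-017": {
        "source_url": "https://example.com/10-K",
        "excerpt": "Gross margin was 62% in FY2024.",
        "source_type": "filing",  # primary | regulatory | filing | credible secondary
        "confidence": 0.9,
        "retrieval_query": "vendor X gross margin 10-K",
    },
}

def reconstruct(claim_id, ledger=LEDGER):
    """Walk a claim back to its source and the query that fetched it."""
    entry = ledger.get(claim_id)
    if entry is None:
        # Structurally, a claim without a ledger entry should be impossible.
        raise KeyError(f"Unsupported claim: {claim_id}")
    return (entry["source_url"], entry["retrieval_query"])
```

The hard failure on a missing entry mirrors the structural enforcement described above: an unsupported claim is a bug, not a formatting problem.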
7. Termination and Cost Governance
Recursive systems that lack explicit termination controls are operationally hazardous. A system that can always elect to "reason further" will eventually exhaust its budget on a single run, or loop indefinitely on a dimension where no amount of additional evidence will satisfy a miscalibrated quality gate.
Deterministic cost ceilings are not an optimization — they are a prerequisite for enterprise deployment. Without bounded, predictable cost envelopes, a reasoning system cannot be operationalized at scale regardless of its analytical quality. The redesigned architecture enforces termination at multiple levels.
Per-dimension iteration limits. Each dimension may loop a maximum of three times through the Scout-Analyst-Skeptic cycle. If the Skeptic has not approved the dimension after three iterations, it is classified as "Approved with Caveats" and its unresolved blind spots are recorded for the Integrator's cross-dimension stress test.
Architectural termination lock. Phase 3 (The Integrator) can flag contradictions and mark uncertainty, but it cannot trigger new evidence retrieval. This is a hard architectural constraint — not a configurable parameter. It guarantees that every run terminates after the Integrator completes its assembly.
Deterministic cost checkpoints. Cost is evaluated at four explicit points in the execution path: before each Scout retrieval call, after each Skeptic evaluation, before any retrieval tool escalation, and before entering a new loop iteration. If accumulated cost meets or exceeds the run ceiling, further loops are halted immediately. Active dimensions are marked with caveats. The Integrator assembles the best available output and marks confidence accordingly.
Dynamic cost proportionality. A straightforward query where most dimensions pass on the first iteration consumes minimal resources. A complex query where multiple dimensions require targeted deepening consumes more — but only in proportion to the actual depth required. Cost scales with analytical complexity, not with pipeline length.
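The four checkpoints reduce to a single deterministic guard consulted at fixed points in the execution path. A minimal sketch, with an illustrative ceiling value:

```python
RUN_CEILING_USD = 1.00  # illustrative ceiling, not the production value

class CostGovernor:
    """Deterministic cost gate, consulted before every expensive step."""

    def __init__(self, ceiling=RUN_CEILING_USD):
        self.ceiling = ceiling
        self.spent = 0.0

    def record(self, cost):
        """Accumulate the actual cost of a completed call."""
        self.spent += cost

    def may_proceed(self):
        """Checked before Scout retrieval, after Skeptic evaluation,
        before tool escalation, and before entering a new loop iteration.
        Halts when accumulated cost meets or exceeds the ceiling."""
        return self.spent < self.ceiling
```

Because the guard is pure arithmetic over accumulated spend, the halting behavior is reproducible run-to-run — no model judgment is involved in the decision to stop.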
8. Observed Improvements
The improvements are structural, not incremental.
Argument depth. The bounded iterative cycle forces the system to confront weak evidence directly rather than passing it forward for subsequent stages to accommodate. When the Skeptic identifies an evidentiary gap, the Scout retrieves targeted evidence for that specific gap. The Analyst rebuilds with stronger material. The resulting arguments are denser, more specific, and more defensible under scrutiny.
Evidence density and traceability. The structured evidence ledger, maintained across iterations with deduplication, produces a cleaner and more auditable evidence base than the original architecture's accumulated Working Research Notes, which grew unwieldy across eight sequential stages.
Contradiction detection. The Integrator's cross-dimension stress test surfaces tensions that the original linear flow structurally could not detect. When one dimension concludes that premium pricing is viable and another identifies rapid market commoditization, the Integrator flags this contradiction explicitly — with source references for both positions — rather than allowing incompatible conclusions to coexist without acknowledgment.
Cost efficiency. Average cost per run decreased because straightforward queries no longer pay for eight full stages of processing. Complex queries may consume more resources per dimension than the original architecture — the iterative loops add retrieval calls — but overall cost is proportional to the depth actually required, not to a fixed pipeline length.
9. Broader Observations for AI System Design
Several observations from this redesign generalize beyond our specific system.
Additional stages do not produce additional depth. Adding stages to a pipeline adds processing steps. It does not add depth. Each stage operates on the accumulated context of its predecessors, but a forward-only topology prevents any stage from improving the evidential foundation it builds upon. Depth requires the architectural capacity to revisit and strengthen — not merely to append.
Process ceremony can substitute for cognitive rigor. A system with eight stages, boundary contracts, independent review roles, and structured checkpoints satisfies every reasonable checklist for disciplined multi-agent design. But if no component can trigger iterative improvement of the evidence base, the discipline is procedural rather than cognitive. The system follows a rigorous process. It does not reason deeply.
Cognitive continuity outweighs component diversity. Multi-model orchestration was designed to reduce confirmation bias by assigning different models to different stages. In practice, the reasoning discontinuities introduced by model switching outweighed the diversity benefit. A single model, reasoning iteratively within a bounded loop, produces more coherent and defensible analysis than multiple models reasoning once each in sequence.
Recursive capability requires explicit governance. The capacity for iterative reasoning — the ability to loop until quality thresholds are met — is simultaneously the most valuable and the most operationally dangerous property of a reasoning system. Without hard termination limits, deterministic cost checkpoints, and architectural constraints that prevent certain roles from initiating new loops, recursive systems will spiral. Governance is not supplementary to the architecture. It is integral to it.
Targeted iteration outperforms blanket repetition. Per-dimension looping, where only weak sections re-enter the remediation cycle and strong sections are locked, is both more effective and more resource-efficient than architectures that re-process everything when any component is weak. Precision in where computational depth is invested matters as much as the depth itself.
10. Implications for Enterprise Decision Systems
These observations carry direct implications for organizations building, procuring, or evaluating AI systems for decision support.
Summarization architectures fail under ambiguity. When the answer is not contained in any single source — when it must be constructed from fragmentary, sometimes contradictory evidence — systems optimized for retrieval and reformatting produce output that reads well but does not withstand scrutiny. Decision-support systems require the ability to construct, pressure-test, and defend arguments. Retrieval alone is insufficient.
Cost predictability is an operational requirement, not a technical preference. A system that produces excellent reasoning at unpredictable cost cannot be operationalized at enterprise scale. Deterministic cost governance — with bounded ceilings, explicit checkpoints, and graceful degradation when limits are reached — is a deployment prerequisite.
Termination guarantees are non-negotiable. Any system that reasons iteratively must provide hard, verifiable guarantees about when and how it stops. This includes cost-based halting, iteration limits, and architectural constraints that structurally prevent certain operations from initiating new cycles. An enterprise user must know, before initiating a run, that the system will terminate within a bounded cost and time envelope.
Auditability separates decision systems from chat systems. The structural separation of evidence and interpretation, the refusal to allow any agent to evaluate its own output, the requirement that every claim trace to a cited source through a structured evidence ledger — these are not differentiating features. They are baseline requirements for any system whose output will inform consequential decisions.
11. Current Areas of Exploration
The recursive core is domain-agnostic by design. New research verticals — equity analysis, policy research, due diligence, energy assessment — are configured through domain definition files without modification to the core reasoning engine.
We are currently exploring several areas: plateau detection within dimensions, where iterative loops terminate early if successive iterations do not materially improve scores; domain-specific source policies that enforce evidence standards appropriate to the regulatory and analytical context of each field; evaluation frameworks for calibrating quality-gate thresholds across domains; and reliability scoring that provides users with a transparent, evidence-derived measure of confidence in the output.
These remain active engineering problems. We will publish what we learn as the work matures.
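As one example of the kind of mechanism under exploration, plateau detection might reduce to a simple score-delta check. The epsilon value and function shape here are purely illustrative:

```python
PLATEAU_EPSILON = 0.02  # minimum score gain that justifies another loop (assumed)

def has_plateaued(score_history, epsilon=PLATEAU_EPSILON):
    """True when the latest iteration did not materially improve the Skeptic score.

    score_history: Skeptic scores for one dimension, oldest first.
    """
    if len(score_history) < 2:
        return False  # need at least two iterations to measure a delta
    return (score_history[-1] - score_history[-2]) < epsilon
```

A dimension that plateaus would exit early with caveats rather than spending its remaining iterations on marginal gains.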
Appendix: Architecture Mapping
For readers interested in the specific structural correspondence between the two architectures, this table maps every original stage to its redesigned equivalent.
| Original Stage | Redesigned Equivalent | What Changed |
|---|---|---|
| Stage 1: Framing | The Architect | Output is now structured JSON (research plan with dimensions), not prose |
| Stage 2: Exploration | The Scout (per dimension) | Executes per-dimension with targeted queries, not as a single global retrieval pass |
| Stage 3: Synthesis | The Analyst (per dimension) | Scoped to a single dimension with structured evidence ledger, not full-corpus synthesis |
| Stage 4: Critical Review | The Skeptic (per dimension) | Outputs a structured scorecard with pass/fail and targeted remediation queries, not prose critique |
| Stage 5: Counter-arguments | Folded into Skeptic cycle | The Skeptic's missing-evidence queries trigger targeted Scout re-retrieval |
| Stage 6: Refinement | Analyst re-execution on loop | Automatic via the iterative cycle when the Skeptic requires improvement |
| Stage 7: Narrative Construction | The Integrator | Assembles from individually approved dimensions with cross-dimension stress test |
| Stage 8: Final Output | Merged into Integrator | Single delivery phase with contradiction detection and confidence scoring |
Eight stages became three phases. Eight sequential handoffs became targeted, per-dimension iterative cycles. The epistemic principles remained unchanged. The topology of reasoning changed entirely.
*ValueNova builds AI reasoning systems for high-stakes decision environments. We welcome structured conversations with enterprise leaders, transformation architects, and investors exploring the intersection of AI capability and enterprise strategy.*