From Desks to Agents
Trading firms spent decades optimizing their organizational structure. Fundamental analysts process balance sheets and earnings. Technical analysts interpret price charts and momentum. Sentiment analysts parse news flow and social media. Risk managers evaluate portfolio exposure. Portfolio managers synthesize these inputs into decisions. Execution traders handle order routing. Each role is specialized. Each contributes a different cognitive capability to the collective decision.
A systematic review of 88 academic papers published between 2022 and 2026 reveals that researchers building multi-agent LLM trading systems are recapitulating this organizational evolution in compressed form.
In 2023, TradingGPT introduced a simple two-agent architecture with peer-to-peer dialogue. By 2024, TradingAgents described five or more specialized agents (fundamental analyst, technical analyst, sentiment analyst, risk manager, coordinator) that explicitly mirror the structure of real trading firms. The 2025 cohort pushed further with graph-structured topologies, code-generation pipelines, multi-modal integration, and portfolio-aware hedging. By early 2026, proposed systems like AGORA-F, FinEvo, and TrustTrade added gradient-orchestrated coordination, ecological competition between agent strategies, and trust-weighted consensus mechanisms.
The convergence on canonical roles (analyst, trader, risk manager, coordinator) across 29 independently proposed research systems is not a coincidence. It mirrors what socio-technical systems theory predicts: organizational structures evolve toward configurations that optimize the interaction between social and technical subsystems. Researchers are arriving at the same architecture by different routes.
Of the 88 analyzed papers, 72% were published in 2025 or 2026. This field barely existed three years ago.
TradingAgents: The Most Complete Architecture in the Field
With 49,000 GitHub stars and active maintenance through version 0.2.3 (April 2026), TradingAgents by Tauric Research is by a wide margin the most adopted implementation in this review. The star count reflects something real. This is not a minimal proof-of-concept. It is the most fully developed instantiation of the architectural principles the field has converged on, and it is worth examining in detail.
The framework deploys 11 named agents across four sequential layers, with two internal debate sub-loops. The input is a ticker symbol and a date. The output is a five-tier trade signal: Buy, Overweight, Hold, Underweight, or Sell. Between input and output, the system runs what amounts to a structured deliberation process across specialized cognitive roles.
Four Layers, Two Debate Loops
The analyst layer runs four agents sequentially: Market Analyst (price data and technical indicators), Social Analyst (news-as-sentiment proxy), News Analyst (global news and insider transactions), and Fundamentals Analyst (balance sheet, cash flow, income statement). Each agent calls its own data tools, writes a structured report into shared state, then passes control. There is no parallelism in the current implementation. Each analyst clears the message history before the next one begins.
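The sequential hand-off described above can be sketched as follows. This is a minimal illustration, not the framework's actual code: `PipelineState`, `make_stub_analyst`, and `run_analyst_layer` are hypothetical names, and the analysts are stubs standing in for agents that would call data tools and an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PipelineState:
    """Hypothetical shared state; the framework's real field names may differ."""
    ticker: str
    date: str
    messages: List[str] = field(default_factory=list)
    reports: Dict[str, str] = field(default_factory=dict)


def make_stub_analyst(name: str) -> Callable[[PipelineState], str]:
    """Build a placeholder analyst. A real one would call data tools and an LLM."""
    def run(state: PipelineState) -> str:
        state.messages.append(f"{name}: fetched data for {state.ticker}")
        return f"{name} report for {state.ticker} on {state.date}"
    return run


# Order taken from the description above: Market, Social, News, Fundamentals.
ANALYST_ORDER = ["market", "social", "news", "fundamentals"]


def run_analyst_layer(ticker: str, date: str) -> PipelineState:
    state = PipelineState(ticker=ticker, date=date)
    for name in ANALYST_ORDER:            # strictly sequential, no parallelism
        report = make_stub_analyst(name)(state)
        state.reports[name] = report      # report is written into shared state
        state.messages.clear()            # history is wiped before the next analyst
    return state
```

Only the reports survive from one analyst to the next; the working message history does not, which keeps each analyst's context small at the cost of losing intermediate tool output.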
The research layer takes those four reports and runs a structured debate. Bull Researcher and Bear Researcher alternate for a configurable number of rounds (default: one), each reading the full analyst reports plus the opponent's last argument. A Research Manager, using a reasoning-optimized model, acts as debate judge and synthesizes the exchange into an investment plan. This is the first point in the pipeline where a "slow thinking" model appears.
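The alternating structure of this debate can be sketched in a few lines. The function and type names below are assumptions made for illustration, and the debaters and judge are stubs; in the real system each would be an LLM call with its own prompt.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]                      # (speaker, argument)
Debater = Callable[[str, str], str]         # (reports, opponent's last argument) -> argument
Judge = Callable[[str, List[Turn]], str]    # (reports, transcript) -> investment plan


def run_debate(reports: str, bull: Debater, bear: Debater,
               judge: Judge, rounds: int = 1) -> Tuple[str, List[Turn]]:
    """Alternate bull and bear for `rounds` rounds, then let the judge synthesize."""
    transcript: List[Turn] = []
    bear_last = ""                          # nothing to rebut on the first turn
    for _ in range(rounds):
        bull_arg = bull(reports, bear_last)
        transcript.append(("bull", bull_arg))
        bear_last = bear(reports, bull_arg)
        transcript.append(("bear", bear_last))
    return judge(reports, transcript), transcript


# Stub participants, for illustration only.
def stub_bull(reports: str, opponent_last: str) -> str:
    return f"Bull case, rebutting: {opponent_last or 'nothing yet'}"


def stub_bear(reports: str, opponent_last: str) -> str:
    return f"Bear case, rebutting: {opponent_last}"


def stub_judge(reports: str, transcript: List[Turn]) -> str:
    return f"plan after {len(transcript)} turns"
```

Each debater sees the full reports but only the opponent's most recent argument, so the cost per turn stays roughly constant as the number of rounds grows.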
The Trader reads the investment plan plus all four analyst reports and must conclude its output with a literal string: "FINAL TRANSACTION PROPOSAL: BUY / HOLD / SELL". This constraint forces a parseable decision at a defined point in the chain, which is the kind of design choice that matters for any downstream processing.
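To show why a fixed closing string helps, here is a small sketch of how a downstream component might extract the decision. The regular expression and the `parse_proposal` helper are illustrative assumptions, not the framework's own parser; tolerating optional markdown asterisks is also an assumption about how a model might format its answer.

```python
import re
from typing import Optional

# Matches e.g. "FINAL TRANSACTION PROPOSAL: SELL" or "... PROPOSAL: **BUY**".
_PROPOSAL = re.compile(
    r"FINAL TRANSACTION PROPOSAL:\s*\**\s*(BUY|HOLD|SELL)\b",
    re.IGNORECASE,
)


def parse_proposal(text: str) -> Optional[str]:
    """Return the last BUY/HOLD/SELL proposal found in the text, or None."""
    matches = _PROPOSAL.findall(text)
    return matches[-1].upper() if matches else None
```

Taking the last match rather than the first guards against the model restating the instruction earlier in its reasoning. Returning `None` when nothing matches lets the caller decide whether to retry or fail.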
The risk layer then subjects the Trader's proposal to a three-way debate: Aggressive Analyst (upside, risk-on), Conservative Analyst (downside, caution), and Neutral Analyst (balanced synthesis) rotate for a configurable number of rounds. The Portfolio Manager, again using a deep-think model, reads the full risk debate transcript and produces the final five-tier decision with an executive summary including entry strategy, position sizing guidance, key risk levels, and time horizon.
Two-Tier LLM Design
TradingAgents draws a clear distinction between agents that gather and argue, and agents that decide. Analysts, researchers, the Trader, and risk debate agents use a fast, cost-efficient model (quick_think_llm). Research Manager and Portfolio Manager use a reasoning-optimized model (deep_think_llm). In the original paper experiments, this was gpt-4o-mini versus o1-preview. In v0.2.3, the framework supports OpenAI, Anthropic (Claude 4.x), Google (Gemini 3.x), xAI, OpenRouter, and local Ollama models, configurable independently for each tier.
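The role-to-tier mapping can be expressed as a small lookup. The sketch below reuses the `quick_think_llm` and `deep_think_llm` names and the model pairing from the paper experiments mentioned above; the `LLMTiers` class, the role identifiers, and `model_for` are hypothetical and do not reflect the framework's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMTiers:
    quick_think_llm: str   # gatherers and debaters
    deep_think_llm: str    # judges and final decision makers


# Hypothetical role identifiers for the two deciding agents.
DEEP_ROLES = frozenset({"research_manager", "portfolio_manager"})


def model_for(role: str, tiers: LLMTiers) -> str:
    """Pick the model tier for a role: deep for deciders, quick for everyone else."""
    return tiers.deep_think_llm if role in DEEP_ROLES else tiers.quick_think_llm


# Pairing reported for the original paper experiments.
paper_tiers = LLMTiers(quick_think_llm="gpt-4o-mini", deep_think_llm="o1-preview")
```

Because the tiers are plain strings, either one can be swapped for a different provider's model without touching the role logic.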
This tiering is architecturally significant. It encodes a judgment about which cognitive tasks in the pipeline require slow reasoning and which can be handled fast. The field has not converged on this distinction. Most of the other 28 systems use a single model throughout. TradingAgents' two-tier approach is one of the design choices that makes it worth studying regardless of backtesting results.
The Memory System: BM25, Not Embeddings
Five agents in the TradingAgents pipeline have persistent memory: Bull Researcher, Bear Researcher, Trader, Research Manager, and Portfolio Manager. Each has a separate BM25 (Best Match 25) index. After each trading day, a Reflector process takes the full market situation (all four analyst reports concatenated), the agent's output, and the actual P&L result, and generates a structured reflection: what was correct, what was incorrect, why, and what to do differently. This reflection is stored as a (situation, lesson) pair in the agent's BM25 index.
At the start of each subsequent trading session, the agent retrieves the two most lexically similar past situations and injects the corresponding lessons directly into its prompt. The retrieval is entirely offline, with no vector database, no embedding API calls, and no additional latency beyond a BM25 score computation. This is an episodic, lexical memory: fast, transparent, and deterministic. It is also the most robust memory architecture in the research review: of the 31% of papers that address memory at all, TradingAgents is the only system with a credible implementation.
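A lexical memory of this kind fits in a few dozen lines. The sketch below implements standard Okapi BM25 scoring in pure Python over (situation, lesson) pairs and returns the lessons for the top two matches. The `BM25Memory` class, the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration; the framework's own implementation may differ.

```python
import math
from collections import Counter
from typing import List


class BM25Memory:
    """Toy episodic memory: store (situation, lesson) pairs, retrieve by BM25."""

    def __init__(self, k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.docs: List[List[str]] = []    # tokenized situations
        self.lessons: List[str] = []

    @staticmethod
    def _tokenize(text: str) -> List[str]:
        return text.lower().split()        # naive whitespace tokenizer

    def add(self, situation: str, lesson: str) -> None:
        self.docs.append(self._tokenize(situation))
        self.lessons.append(lesson)

    def retrieve(self, query: str, n: int = 2) -> List[str]:
        if not self.docs:
            return []
        total = len(self.docs)
        avgdl = sum(len(d) for d in self.docs) / total
        # Document frequency: in how many situations each term appears.
        df = Counter(term for d in self.docs for term in set(d))
        scored = []
        for i, doc in enumerate(self.docs):
            tf = Counter(doc)
            score = 0.0
            for term in self._tokenize(query):
                if term not in tf:
                    continue
                idf = math.log((total - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
                norm = tf[term] + self.k1 * (1 - self.b + self.b * len(doc) / avgdl)
                score += idf * tf[term] * (self.k1 + 1) / norm
            scored.append((score, i))
        # Highest score first; ties broken by insertion order for determinism.
        top = sorted(scored, key=lambda s: (-s[0], s[1]))[:n]
        return [self.lessons[i] for _, i in top]
```

The whole index is rebuilt from plain token counts at query time, which is why no embedding model or vector store is needed. The trade-off is that retrieval only matches shared vocabulary, not paraphrases.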
Backtesting Results and Their Limits
The paper reports a Sharpe ratio of 8.21, cumulative returns of 26.62%, and maximum drawdown of approximately 2% on US equity backtests. These numbers beat standard rule-based baselines (Buy-and-Hold, MACD, SMA) across the test period. The ablation study shows that removing the debate mechanism degrades decision quality, and that the risk management layer specifically reduces drawdown without proportionally reducing returns.
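For readers who want to see how such metrics are typically computed, here is a minimal sketch. It assumes daily simple returns, a constant daily risk-free rate, and annualization by the square root of 252 trading days, which is a common convention; the source does not state which exact method the paper used.

```python
import math
from typing import Sequence


def sharpe_ratio(daily_returns: Sequence[float],
                 risk_free_daily: float = 0.0,
                 periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio: mean excess return over its sample std deviation."""
    if len(daily_returns) < 2:
        raise ValueError("need at least two returns")
    excess = [r - risk_free_daily for r in daily_returns]
    mean = sum(excess) / len(excess)
    variance = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    if variance == 0:
        raise ValueError("zero variance: Sharpe ratio is undefined")
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


def max_drawdown(daily_returns: Sequence[float]) -> float:
    """Largest peak-to-trough decline of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst
```

Because the denominator is the standard deviation over the chosen window, a short, calm test period with steady gains can produce a very high ratio that does not persist once a more volatile stretch is included.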
The README includes a prominent disclaimer that results vary significantly by backbone LLM, temperature settings, trading period, and data quality. The Sharpe ratio in particular is highly sensitive to test period selection. These are backtesting results under controlled conditions, not live performance records. The gap between reported backtesting numbers and live trading outcomes is exactly the third critical gap identified in the research review, and TradingAgents is not an exception.
What TradingAgents Still Does Not Have
No governance framework. No compliance audit trail. No formal attribution of which agent contributed what weight to the final decision. The Portfolio Manager's output is a human-readable text block, not a structured, machine-parseable decision record with provenance metadata. When the system makes a trade that loses money, there is no formal mechanism for determining whether the fault lay with the Market Analyst's indicator selection, the Bull Researcher's argument, the Trader's interpretation, or the Portfolio Manager's final judgment.
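To make the missing piece concrete, the sketch below shows one possible shape for a machine-parseable decision record with basic provenance. Everything here is hypothetical: the `DecisionRecord` and `AgentContribution` classes, their fields, and the use of a SHA-256 digest of each agent's output are illustrative assumptions, not something any of the reviewed systems provide. Only the five signal tiers come from the description of TradingAgents above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List

# Five-tier signal as described for the Portfolio Manager's output.
SIGNALS = ("Buy", "Overweight", "Hold", "Underweight", "Sell")


@dataclass
class AgentContribution:
    agent: str            # e.g. "market_analyst" (hypothetical identifier)
    model: str            # which LLM produced the output
    output_sha256: str    # digest that ties the record to the exact text produced


@dataclass
class DecisionRecord:
    ticker: str
    date: str
    signal: str
    contributions: List[AgentContribution] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.signal not in SIGNALS:
            raise ValueError(f"unknown signal: {self.signal!r}")

    def add(self, agent: str, model: str, output: str) -> None:
        digest = hashlib.sha256(output.encode("utf-8")).hexdigest()
        self.contributions.append(AgentContribution(agent, model, digest))

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

A record like this answers "who said what, with which model" but not "how much did each agent's output weigh in the final decision", which is the harder attribution problem.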
This is not a criticism of TradingAgents specifically. It is the governance gap (0% coverage across all 29 systems) demonstrated in the most prominent example in the field. The most sophisticated research prototype available still has no answer to the accountability question that any institutional compliance function would ask on day one. That gap is precisely where the research frontier ends and the institutional deployment challenge begins.
Five Dimensions That Define the Architecture
Across the 88 papers, a taxonomy of five architectural dimensions emerges that differentiates these research systems. Each represents a design choice with direct consequences for performance, robustness, and, eventually, institutional viability. The figures below reflect how many of the 29 named systems explicitly address each dimension in their design or evaluation.
1. Agent Roles: Convergence on a canonical set
The large majority of research systems include some form of analyst agent that processes market data, and a substantial proportion include a decision or trading agent. Risk management agents appear in only about 28% of systems, a notable gap given the centrality of risk management in real trading operations. Coordination and manager agents appear in most hierarchical designs, mirroring the portfolio manager role in institutional finance.
2. Communication Topology: Four dominant patterns
Hierarchical topologies place a manager at top, with analysts reporting upward. Debate and deliberation topologies enable peer-to-peer refinement through structured dialogue. Pipeline topologies flow information through a sequential chain. Competitive topologies let agents compete, forwarding only the best signal. In practice, hybrid approaches (hierarchical with embedded debate) appear to offer the best tradeoff between decision consistency and adversarial robustness.
3. Memory: The most under-specified dimension
Only 31% of reviewed papers explicitly address memory mechanisms. This is striking given that memory and context management are among the most significant limitations of LLM-based systems. Approaches in the literature range from simple shared context buffers to layered cognitive models (short-term, medium-term, long-term memory), knowledge graphs for structured causal reasoning, and multi-agent RAG architectures. Research systems without robust memory cannot learn from past trades or adapt to regime changes, which goes a long way toward explaining why none have demonstrated sustained performance across multiple market regimes.
4. Decision Aggregation: Where collective intelligence lives or dies
The mechanism through which multiple agent outputs combine into a single trading decision is where multi-agent systems either realize or fail to realize collective intelligence. Patterns include: manager-decides (simplest, most common), voting and consensus, verbal reinforcement (where the manager provides feedback that shapes analyst behavior over time), trust-weighted selective consensus, and ecological fitness selection. The field has not converged on an optimal approach. This is the hardest problem in multi-agent system design.
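As one concrete instance of these patterns, here is a generic trust-weighted vote. It is a simplified sketch and does not reproduce TrustTrade or any other reviewed system; the score mapping, the `threshold` parameter, and the function name are assumptions made for illustration.

```python
from typing import Dict

# Map each discrete signal to a numeric score so votes can be averaged.
SIGNAL_SCORE = {"SELL": -1.0, "HOLD": 0.0, "BUY": 1.0}


def trust_weighted_signal(votes: Dict[str, str],
                          trust: Dict[str, float],
                          threshold: float = 0.25) -> str:
    """Combine agent votes into one signal, weighting each vote by agent trust."""
    total_trust = sum(trust[agent] for agent in votes)
    if total_trust <= 0:
        raise ValueError("total trust must be positive")
    score = sum(trust[agent] * SIGNAL_SCORE[vote]
                for agent, vote in votes.items()) / total_trust
    if score > threshold:
        return "BUY"
    if score < -threshold:
        return "SELL"
    return "HOLD"
```

With votes of BUY, SELL, BUY from three agents, a simple majority says BUY. If the dissenting agent holds most of the trust, the weighted result flips to SELL, which is exactly the behavior that makes the choice of aggregation rule consequential.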
5. Tool Integration: From advisory to autonomous
Tool integration transforms agents from advisory systems into autonomous traders. All systems integrate financial data feeds, but sophistication varies from single-source (Yahoo Finance) to multi-source aggregations combining market data, news, SEC filings, social media, and macroeconomic indicators. Some systems enable agents to write and execute code, generating and backtesting quantitative strategies programmatically. Few integrate directly with trading execution APIs. The gap between signal generation and order execution remains a significant barrier to production deployment.
Architectural Patterns in the Literature
Behind the matrix, four communication topologies account for virtually all 29 systems. The topology determines how information flows between agents, how disagreement is resolved, and how the final signal reaches the execution layer. Hierarchical structures dominate the research field. Debate-based and competitive designs represent the more sophisticated end. Pipeline architectures are the simplest. Trust-weighted and graph-based topologies define the 2025-2026 frontier.
Open-Source Adoption: Which Systems Have Public Code
Nine of the 29 research systems have confirmed public repositories. The distribution is highly concentrated. TradingAgents alone accounts for more than 90% of the combined star count, reflecting its role as the most complete, actively maintained, and broadly cited implementation in the field. Most 2026 preprints have not yet released code. AlphaAgents (BlackRock research group) has no official public release.
Architecture Beats the Model
Several controlled experiments in the literature report that a well-designed multi-agent architecture using GPT-3.5 can outperform a single GPT-4 agent on standardized trading tasks. The collaborative intelligence emerging from agent interaction appears to matter more than the raw capability of the underlying LLM. This is one of the more practically important findings in the review, because it implies that institutions do not need to wait for the next frontier model. The design of the system is the lever.
This is consistent with distributed cognition theory: the benefits of distributing reasoning across specialized agents are most pronounced when the task exceeds the capacity of any single cognitive agent. In trading, where decisions require integrating fundamental analysis, technical signals, sentiment data, and risk constraints simultaneously, no single agent can maintain expertise across all domains.
Experimental results across the 29 systems show the performance advantage of multi-agent designs is most visible during volatile or uncertain market periods, where the diversity of analytical perspectives provided by specialized agents proves most valuable. During calm, trending markets, the advantage narrows. Even simple single-agent strategies can capture directional moves without the overhead of multi-agent deliberation.
The implication for institutions evaluating agent architectures is direct: invest in the structure of collaboration (role design, communication protocols, aggregation mechanisms), not in chasing the latest foundation model. The research already shows that architecture is the variable that determines outcomes. A model-agnostic design that can route between providers without code changes will outperform one locked to a single vendor, regardless of that vendor's current benchmark scores.
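What "routing between providers without code changes" might look like in practice is sketched below. The `ChatModel` protocol, `StubModel`, and `ModelRouter` are hypothetical names invented for this example; a real deployment would wrap each vendor's SDK behind the same `complete` method.

```python
from typing import Dict, Protocol


class ChatModel(Protocol):
    """Minimal provider-neutral interface that every backend must satisfy."""
    def complete(self, prompt: str) -> str: ...


class StubModel:
    """Stand-in for a vendor client; echoes the prompt with its own name."""
    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class ModelRouter:
    """Resolve an agent role to a provider through a routing table."""
    def __init__(self, providers: Dict[str, ChatModel], routing: Dict[str, str]) -> None:
        self.providers = providers
        self.routing = routing       # role -> provider key; pure configuration

    def complete(self, role: str, prompt: str) -> str:
        return self.providers[self.routing[role]].complete(prompt)
```

Agents call `router.complete(role, prompt)` and never name a vendor, so moving a role to a different provider is a change to the routing table rather than to agent code.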
Three Areas the Research Has Left to Practitioners
The concept matrix reveals three areas that the academic literature has systematically left unaddressed: governance, adversarial robustness, and the backtest-to-live transition. This is worth reading carefully: the absence of governance frameworks in research systems does not mean that institutions deploying agent technology have no governance. Financial institutions have compliance infrastructure, risk frameworks, and regulatory oversight that exist independently of any AI system. What the research lacks is a model for how multi-agent systems interface with that infrastructure. That integration problem is where the engineering work happens, and it happens in deployment teams, not in papers.
What This Means
The honest summary of the research field is this: academic teams have built 29 sophisticated systems that demonstrate the architectural viability of multi-agent LLM trading. They have not solved governance, adversarial robustness, or the backtest-to-live transition. None of these systems has been validated in live production at institutional scale.
That gap between academic demonstration and institutional deployment is itself the most important finding of this review. The technology is capable. The scaffolding required to deploy it responsibly does not yet exist in the literature.
In our previous analysis, we documented how AI agents are repricing an entire software sector. McKinsey's data showed enterprises redirecting budgets from SaaS toward agentic AI systems. The MGI study quantified the scale: 44% of cognitive work hours automatable, $2.9 trillion in economic value by 2030.
The research reviewed here shows what that agentic infrastructure looks like at the architectural level, in one of the most demanding domains for AI. The research systems that perform best are not the ones with the most powerful models. They are the ones with the most thoughtful organizational design: specialized roles, structured communication, robust memory, and adversarial verification.
The institutions that will capture value from multi-agent AI in finance are the ones that treat governance as a design input, not an afterthought. Not by waiting for the research community to close the gaps, but by building accountability, explainability, and sovereign control into their agent architectures from the start. The competitive advantage of the next era will not come from the model. It will come from the institution that knows how to deploy one responsibly.
What Does the Post-SaaS Operating Layer Look Like?
We are building it. Sovereign, model-agnostic, designed for institutions that want to control their own infrastructure and their own future.
Reference: All 29 Systems
The full concept matrix mapping all 29 named research systems across ten architectural dimensions. Role design and tool integration are well-covered. Memory, explainability, and agent-level accountability remain systematically absent from the research literature.
| System | Year | Agents | Roles | Comm. | Memory | Dec. Agg. | Tools | Backtest | Explain. | Risk | Gov. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TradingGPT | 2023 | 2+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| TradingAgents | 2024 | 5+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| FinCon | 2024 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| StockAgent | 2025 | Mult. | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| HedgeAgents | 2025 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ |
| ATLAS | 2025 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| ElliottAgents | 2025 | 4+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| GWise | 2025 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| MountainLion | 2025 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| AlphaQuanter | 2025 | 2+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Automate Str. Finding | 2025 | 2+ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Oprea & Bara MAS | 2025 | 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ |
| Trade in Minutes | 2025 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| QuantAgents | 2025 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| AlphaAgents | 2025 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| FactorMAD | 2025 | 2+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| QuantAgent HFT | 2025 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| ContestTrade | 2025 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Agentic Portfolio | 2025 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Trading-R1 | 2025 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Multi-Agent Alpha | 2025 | 2+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| FLAG-Trader | 2025 | 3+ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| TrustTrade | 2026 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| MARAG-Fin | 2026 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Expert Teams | 2026 | 5+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Apex Quant | 2026 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| AGORA-F | 2026 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| FinEvo | 2026 | 5+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Adaptive LLM | 2026 | 3+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
Concept matrix based on systematic review of 88 papers (2022-2026). Binary coding: each dimension scored 1 if explicitly addressed through system design, evaluation, or substantive discussion.