When Agents Become the Trading Desk – April 2026

From Desks to Agents

Trading firms spent decades optimizing their organizational structure. Fundamental analysts process balance sheets and earnings. Technical analysts interpret price charts and momentum. Sentiment analysts parse news flow and social media. Risk managers evaluate portfolio exposure. Portfolio managers synthesize these inputs into decisions. Execution traders handle order routing. Each role is specialized. Each contributes a different cognitive capability to the collective decision.

A systematic review of 88 academic papers published between 2022 and 2026 reveals that researchers building multi-agent LLM trading systems are recapitulating this organizational evolution in compressed form.

In 2023, TradingGPT introduced a simple two-agent architecture with peer-to-peer dialogue. By 2024, TradingAgents described five or more specialized agents (fundamental analyst, technical analyst, sentiment analyst, risk manager, coordinator) that explicitly mirror the structure of real trading firms. The 2025 cohort pushed further with graph-structured topologies, code-generation pipelines, multi-modal integration, and portfolio-aware hedging. By early 2026, proposed systems like AGORA-F, FinEvo, and TrustTrade added gradient-orchestrated coordination, ecological competition between agent strategies, and trust-weighted consensus mechanisms.

The convergence on canonical roles (analyst, trader, risk manager, coordinator) across 29 independently proposed research systems is not a coincidence. It is what socio-technical systems theory predicts: organizational structures evolve toward configurations that optimize the interaction between social and technical subsystems. Researchers are arriving at the same architecture by different routes.

Of the 88 analyzed papers, 72 % were published in 2025 or 2026. This field barely existed three years ago.

TradingAgents: The Most Complete Architecture in the Field

With 49,000 GitHub stars and active maintenance through version 0.2.3 (April 2026), TradingAgents by Tauric Research is by a wide margin the most adopted implementation in this review. The star count reflects something real. This is not a minimal proof-of-concept. It is the most fully developed instantiation of the architectural principles the field has converged on, and it is worth examining in detail.

The framework deploys 11 named agents across four sequential layers, with two internal debate sub-loops. The input is a ticker symbol and a date. The output is a five-tier trade signal: Buy, Overweight, Hold, Underweight, or Sell. Between input and output, the system runs what amounts to a structured deliberation process across specialized cognitive roles.

TradingAgents architecture (arXiv:2412.20138 · TauricResearch/TradingAgents). Based on source code and paper. M = BM25 episodic memory bank (per agent).

Four Layers, Two Debate Loops

The analyst layer runs four agents sequentially: Market Analyst (price data and technical indicators), Social Analyst (news-as-sentiment proxy), News Analyst (global news and insider transactions), and Fundamentals Analyst (balance sheet, cash flow, income statement). Each agent calls its own data tools, writes a structured report into shared state, then passes control. There is no parallelism in the current implementation. Each analyst clears the message history before the next one begins.
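The sequential hand-off described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the framework's actual code: `run_analyst_layer`, the state keys, and the stub analysts are hypothetical stand-ins for the LLM-backed agents and their data tools, and the ticker and date are placeholder inputs.

```python
from typing import Any, Callable, Dict, List

# Hypothetical shared state: one report slot per analyst plus a scratch message history.
State = Dict[str, Any]

ANALYST_ORDER: List[str] = ["market", "social", "news", "fundamentals"]


def make_stub_analyst(name: str) -> Callable[[State], str]:
    """Return a stand-in for an LLM-backed analyst (real agents would call data tools)."""
    def analyst(state: State) -> str:
        state["messages"].append(f"{name}: tool calls for {state['ticker']}")
        return f"{name} report for {state['ticker']} on {state['date']}"
    return analyst


def run_analyst_layer(ticker: str, date: str) -> State:
    state: State = {"ticker": ticker, "date": date, "messages": [], "reports": {}}
    for name in ANALYST_ORDER:               # strictly sequential, no parallelism
        report = make_stub_analyst(name)(state)
        state["reports"][name] = report      # structured report written into shared state
        state["messages"].clear()            # history cleared before the next analyst runs
    return state


state = run_analyst_layer("AAPL", "2024-05-10")  # placeholder ticker and date
print(list(state["reports"]))
```

Because each analyst only leaves its report behind, later stages see four compact documents rather than four full tool-call transcripts.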

The research layer takes those four reports and runs a structured debate. Bull Researcher and Bear Researcher alternate for a configurable number of rounds (default: one), each reading the full analyst reports plus the opponent's last argument. A Research Manager, using a reasoning-optimized model, acts as debate judge and synthesizes the exchange into an investment plan. This is the first point in the pipeline where a "slow thinking" model appears.
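A rough sketch of that debate loop, assuming the bull speaks first in each round (the source does not state the order) and using plain functions in place of LLM calls. The names `run_debate`, `Speaker`, and the stub agents are hypothetical.

```python
from typing import Callable, List, Tuple

# (analyst_reports, opponent_last_argument) -> new argument
Speaker = Callable[[str, str], str]
Judge = Callable[[str, List[Tuple[str, str]]], str]


def run_debate(reports: str, bull: Speaker, bear: Speaker,
               judge: Judge, max_rounds: int = 1) -> str:
    transcript: List[Tuple[str, str]] = []
    last_bull, last_bear = "", ""
    for _ in range(max_rounds):
        last_bull = bull(reports, last_bear)   # bull sees the bear's previous argument
        transcript.append(("bull", last_bull))
        last_bear = bear(reports, last_bull)   # bear answers the bull's latest argument
        transcript.append(("bear", last_bear))
    return judge(reports, transcript)          # manager synthesizes an investment plan


# Stub agents standing in for LLM calls (assumption for illustration only).
def bull(reports: str, opponent: str) -> str:
    return f"bull case (rebutting: {opponent or 'nothing yet'})"


def bear(reports: str, opponent: str) -> str:
    return f"bear case (rebutting: {opponent})"


def judge(reports: str, transcript: List[Tuple[str, str]]) -> str:
    return f"plan after {len(transcript)} turns"


plan = run_debate("four analyst reports", bull, bear, judge, max_rounds=1)
print(plan)
```

With the default of one round the judge sees exactly two turns, so raising `max_rounds` is the main knob for trading cost against depth of argument.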

The Trader reads the investment plan plus all four analyst reports and must conclude its output with a literal string: "FINAL TRANSACTION PROPOSAL: BUY / HOLD / SELL". This constraint forces a parseable decision at a defined point in the chain, so downstream code can extract the action with a simple string match instead of interpreting free text.
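To show why the literal marker matters, here is one way a downstream consumer might extract the action. The function name `parse_proposal` and the tolerance for surrounding asterisks are assumptions for illustration; the source only specifies the marker string itself.

```python
import re
from typing import Optional

# Matches the literal marker followed by one action; tolerates markdown asterisks.
_PROPOSAL = re.compile(
    r"FINAL TRANSACTION PROPOSAL:\s*\**\s*(BUY|HOLD|SELL)\b", re.IGNORECASE
)


def parse_proposal(trader_output: str) -> Optional[str]:
    """Return the last proposed action in upper case, or None if the marker is missing."""
    matches = _PROPOSAL.findall(trader_output)
    return matches[-1].upper() if matches else None


print(parse_proposal("Momentum is intact. FINAL TRANSACTION PROPOSAL: **BUY**"))
```

Returning `None` rather than a default action keeps a malformed Trader output from silently becoming a trade.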

The risk layer then subjects the Trader's proposal to a three-way debate: Aggressive Analyst (upside, risk-on), Conservative Analyst (downside, caution), and Neutral Analyst (balanced synthesis) rotate for a configurable number of rounds. The Portfolio Manager, again using a deep-think model, reads the full risk debate transcript and produces the final five-tier decision with an executive summary including entry strategy, position sizing guidance, key risk levels, and time horizon.

Two-Tier LLM Design

TradingAgents draws a clear distinction between agents that gather and argue, and agents that decide. Analysts, researchers, the Trader, and risk debate agents use a fast, cost-efficient model (quick_think_llm). Research Manager and Portfolio Manager use a reasoning-optimized model (deep_think_llm). In the original paper experiments, this was gpt-4o-mini versus o1-preview. In v0.2.3, the framework supports OpenAI, Anthropic (Claude 4.x), Google (Gemini 3.x), xAI, OpenRouter, and local Ollama models, configurable independently for each tier.
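The tier split can be expressed as a small routing table. This sketch reuses the `quick_think_llm` and `deep_think_llm` names and the model pair from the paper experiments mentioned above; the `LLMConfig` class, the role labels, and `model_for` are hypothetical and do not reflect the framework's real configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMConfig:
    quick_think_llm: str = "gpt-4o-mini"  # fast tier in the original paper experiments
    deep_think_llm: str = "o1-preview"    # reasoning tier in the original paper experiments


# Illustrative role labels; only the two deciding roles use the deep tier.
DEEP_ROLES = frozenset({"research_manager", "portfolio_manager"})


def model_for(role: str, cfg: LLMConfig) -> str:
    """Pick the model tier for a role: deciders think slowly, everyone else thinks fast."""
    return cfg.deep_think_llm if role in DEEP_ROLES else cfg.quick_think_llm


cfg = LLMConfig()
for role in ("market_analyst", "bull_researcher", "trader", "portfolio_manager"):
    print(role, "->", model_for(role, cfg))
```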

This tiering is architecturally significant. It encodes a judgment about which cognitive tasks in the pipeline require slow reasoning and which can be handled fast. The field has not converged on this distinction. Most of the other 28 systems use a single model throughout. TradingAgents' two-tier approach is one of the design choices that makes it worth studying regardless of backtesting results.

The Memory System: BM25, Not Embeddings

Five agents in the TradingAgents pipeline have persistent memory: Bull Researcher, Bear Researcher, Trader, Research Manager, and Portfolio Manager. Each has a separate BM25 (Best Match 25) index. After each trading day, a Reflector process takes the full market situation (all four analyst reports concatenated), the agent's output, and the actual P&L result, and generates a structured reflection: what was correct, what was incorrect, why, and what to do differently. This reflection is stored as a (situation, lesson) pair in the agent's BM25 index.

At the start of each subsequent trading session, the agent retrieves the two most lexically similar past situations and injects the corresponding lessons directly into its prompt. The retrieval is entirely offline, with no vector database, no embedding API calls, and no additional latency beyond a BM25 score computation. This is an episodic, lexical memory: fast, transparent, and deterministic. Among the 31 % of systems in the review that address memory at all, it is also the most robust design and the only one with a credible implementation.
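The store-and-retrieve cycle can be illustrated with a hand-rolled Okapi BM25 scorer. This is a sketch, not the framework's implementation: the `BM25Memory` class, the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the sample situations are all assumptions chosen so the example runs without third-party packages.

```python
import math
from collections import Counter
from typing import List, Tuple


class BM25Memory:
    """Tiny lexical memory: stores (situation, lesson) pairs, ranks with Okapi BM25."""

    def __init__(self, k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.docs: List[List[str]] = []
        self.lessons: List[str] = []

    @staticmethod
    def _tokenize(text: str) -> List[str]:
        return text.lower().split()

    def add(self, situation: str, lesson: str) -> None:
        self.docs.append(self._tokenize(situation))
        self.lessons.append(lesson)

    def retrieve(self, query: str, n: int = 2) -> List[str]:
        if not self.docs:
            return []
        n_docs = len(self.docs)
        avg_len = sum(len(d) for d in self.docs) / n_docs
        df = Counter(term for d in self.docs for term in set(d))  # document frequency
        terms = set(self._tokenize(query))
        scored: List[Tuple[float, int]] = []
        for i, doc in enumerate(self.docs):
            tf = Counter(doc)
            score = 0.0
            for term in terms:
                if term not in tf:
                    continue
                idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
                norm = tf[term] + self.k1 * (1 - self.b + self.b * len(doc) / avg_len)
                score += idf * tf[term] * (self.k1 + 1) / norm
            scored.append((score, i))
        scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, stable on ties
        return [self.lessons[i] for _, i in scored[:n]]


memory = BM25Memory()
memory.add("rates rising tech earnings miss guidance cut",
           "Trim growth exposure when guidance is cut.")
memory.add("oil supply shock energy rally",
           "Do not fade energy momentum early.")
memory.add("tech earnings beat strong guidance",
           "Hold winners through strong guidance.")
lessons = memory.retrieve("tech earnings miss with guidance cut", n=2)
print(lessons)
```

The same query against the same store always returns the same lessons in the same order, which is the determinism the paragraph above refers to. The trade-off is that purely lexical matching misses situations described in different words.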

Backtesting Results and Their Limits

The paper reports a Sharpe ratio of 8.21, cumulative returns of 26.62%, and maximum drawdown of approximately 2% on US equity backtests. These numbers beat standard rule-based baselines (Buy-and-Hold, MACD, SMA) across the test period. The ablation study shows that removing the debate mechanism degrades decision quality, and that the risk management layer specifically reduces drawdown without proportionally reducing returns.
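For readers who want to sanity-check such figures, the three metrics follow the conventional definitions below. The annualization factor of 252 trading days, the zero risk-free rate, and the use of simple per-period returns are assumptions; the paper's exact conventions are not restated here.

```python
import math
from typing import List


def cumulative_return(returns: List[float]) -> float:
    """Compound simple per-period returns into a total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(returns: List[float], risk_free: float = 0.0, periods: int = 252) -> float:
    """Annualized Sharpe ratio using the sample standard deviation of excess returns."""
    excess = [r - risk_free / periods for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods)


def max_drawdown(returns: List[float]) -> float:
    """Largest peak-to-trough loss of the equity curve, as a positive fraction."""
    peak, equity, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


sample = [0.10, -0.50, 0.20]  # made-up returns, for illustration only
print(cumulative_return(sample), max_drawdown(sample))
```

Because the Sharpe ratio divides by the standard deviation of returns, a short test window with unusually steady gains can produce a very large value, which is one reason the figure is so sensitive to period selection.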

The README includes a prominent disclaimer that results vary significantly by backbone LLM, temperature settings, trading period, and data quality. The Sharpe ratio in particular is highly sensitive to test period selection. These are backtesting results under controlled conditions, not live performance records. The gap between reported backtesting numbers and live trading outcomes is exactly the third critical gap identified in the research review, and TradingAgents is not an exception.

What TradingAgents Still Does Not Have

No governance framework. No compliance audit trail. No formal attribution of which agent contributed what weight to the final decision. The Portfolio Manager's output is a human-readable text block, not a structured, machine-parseable decision record with provenance metadata. When the system makes a trade that loses money, there is no formal mechanism for determining whether the fault lay with the Market Analyst's indicator selection, the Bull Researcher's argument, the Trader's interpretation, or the Portfolio Manager's final judgment.

This is not a criticism of TradingAgents specifically. It is the governance gap (0% coverage across all 29 systems) demonstrated in the most prominent example in the field. The most sophisticated research prototype available still has no answer to the accountability question that any institutional compliance function would ask on day one. That gap is precisely where the research frontier ends and the institutional deployment challenge begins.

Five Dimensions That Define the Architecture

Across the 88 papers, a taxonomy of five architectural dimensions emerges that differentiates these research systems. Each represents a design choice with direct consequences for performance, robustness, and, eventually, institutional viability. The figures below reflect how many of the 29 named systems explicitly address each dimension in their design or evaluation.

Dimension	Share of the 29 systems
Define Agent Roles	100 %
Specify Communication Topology	83 %
Address Memory Mechanisms	31 %
Define Decision Aggregation	72 %
Integrate External Tools	83 %
Include Governance Frameworks	0 %

1. Agent Roles: Convergence on a canonical set

The large majority of research systems include some form of analyst agent that processes market data, and a substantial proportion include a decision or trading agent. Risk management agents appear in only about 28 % of systems, a notable gap given the centrality of risk management in real trading operations. Coordination and manager agents appear in most hierarchical designs, mirroring the portfolio manager role in institutional finance.

2. Communication Topology: Four dominant patterns

Hierarchical topologies place a manager at top, with analysts reporting upward. Debate and deliberation topologies enable peer-to-peer refinement through structured dialogue. Pipeline topologies flow information through a sequential chain. Competitive topologies let agents compete, forwarding only the best signal. In practice, hybrid approaches (hierarchical with embedded debate) appear to offer the best tradeoff between decision consistency and adversarial robustness.

3. Memory: The most under-specified dimension

Only 31 % of the 29 systems explicitly address memory mechanisms. This is striking given that memory and context management are among the most significant limitations of LLM-based systems. Approaches in the literature range from simple shared context buffers to layered cognitive models (short-term, medium-term, long-term memory), knowledge graphs for structured causal reasoning, and multi-agent RAG architectures. Research systems without robust memory cannot learn from past trades or adapt to regime changes, which goes a long way toward explaining why none have demonstrated sustained performance across multiple market regimes.

4. Decision Aggregation: Where collective intelligence lives or dies

The mechanism through which multiple agent outputs combine into a single trading decision is where multi-agent systems either realize or fail to realize collective intelligence. Patterns include: manager-decides (simplest, most common), voting and consensus, verbal reinforcement (where the manager provides feedback that shapes analyst behavior over time), trust-weighted selective consensus, and ecological fitness selection. The field has not converged on an optimal approach. This is the hardest problem in multi-agent system design.
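As one concrete instance of these patterns, the sketch below contrasts plain voting with a trust-weighted vote. It is a generic illustration, not the mechanism of TrustTrade or any other reviewed system; the function name, the agent labels, the trust scores, and the fallback to "HOLD" when no trust is assigned are assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple


def trust_weighted_vote(signals: Dict[str, str],
                        trust: Dict[str, float]) -> Tuple[str, float]:
    """Return the winning action and its share of the total trust mass."""
    totals: Dict[str, float] = defaultdict(float)
    for agent, action in signals.items():
        totals[action] += trust.get(agent, 0.0)  # unknown agents carry no weight
    total = sum(totals.values())
    if total == 0:
        return "HOLD", 0.0                       # assumed neutral fallback
    winner = max(sorted(totals), key=totals.__getitem__)  # sorted() makes ties deterministic
    return winner, totals[winner] / total


signals = {"fundamental": "BUY", "technical": "SELL", "sentiment": "SELL"}
equal = {agent: 1.0 for agent in signals}                 # plain majority voting
earned = {"fundamental": 0.7, "technical": 0.2, "sentiment": 0.2}

print(trust_weighted_vote(signals, equal))
print(trust_weighted_vote(signals, earned))
```

The two calls disagree: majority voting follows the two SELL agents, while the trust-weighted version follows the single agent with the strongest track record. Which behavior is preferable is exactly the open question the paragraph above describes.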

5. Tool Integration: From advisory to autonomous

Tool integration transforms agents from advisory systems into autonomous traders. All systems integrate financial data feeds, but sophistication varies from single-source (Yahoo Finance) to multi-source aggregations combining market data, news, SEC filings, social media, and macroeconomic indicators. Some systems enable agents to write and execute code, generating and backtesting quantitative strategies programmatically. Few integrate directly with trading execution APIs. The gap between signal generation and order execution remains a significant barrier to production deployment.

Architectural Patterns in the Literature

Behind the matrix, four communication topologies account for virtually all 29 systems. The topology determines how information flows between agents, how disagreement is resolved, and how the final signal reaches the execution layer. Hierarchical structures dominate the research field. Debate-based and competitive designs represent the more sophisticated end. Pipeline architectures are the simplest. Trust-weighted and graph-based topologies define the 2025-2026 frontier.

Hierarchical

TradingAgents · FinCon · HedgeAgents · ATLAS · ~15 of 29 systems

Analyst agents report upward to a coordinator who holds final decision authority. The most common pattern in the literature and the one that most directly mirrors the institutional trading desk. Risk management agents, when present, typically act as a constraint layer below the analyst tier.

Debate

TradingGPT · FLAG-Trader · bull-bear systems

Agents with opposing mandates run structured debate rounds before a synthesizer produces the final signal. Research consistently finds this topology improves decision quality under high uncertainty, because it forces the system to surface and engage with contradictory evidence before committing.

Pipeline

Automate Strategy Finding · AlphaQuanter · Trading-R1

Information flows sequentially through dedicated processing stages with no feedback loops. Computationally efficient and easy to reason about. Brittle when upstream agents produce errors that compound downstream without correction. Best suited to structured, well-defined sub-tasks.

Competitive

ContestTrade · FinEvo · Apex Quant

Multiple strategy agents generate parallel signals. A judge or fitness function selects the best output before execution. FinEvo extends this into an ecological model: underperforming strategy agents are replaced in an evolutionary loop. Research shows competitive topologies are less susceptible to groupthink than hierarchical ones.
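The select-then-replace idea can be sketched as two small functions. This is a simplified illustration, not FinEvo's or ContestTrade's actual algorithm: the fitness scores (for example, a trailing risk-adjusted return), the agent names, and the rule of re-seeding culled agents from the current leader are assumptions.

```python
from typing import Dict, List, Tuple


def select_best(signals: Dict[str, str], fitness: Dict[str, float]) -> Tuple[str, str]:
    """Forward only the signal of the fittest strategy agent."""
    best = max(sorted(signals), key=lambda name: fitness[name])
    return best, signals[best]


def plan_replacements(fitness: Dict[str, float], n_replace: int = 1) -> List[Tuple[str, str]]:
    """Pair each of the weakest agents with the leader it would be re-seeded from."""
    if not 0 <= n_replace < len(fitness):
        raise ValueError("n_replace must leave at least one survivor")
    ranked = sorted(fitness, key=lambda name: (-fitness[name], name))
    cut = len(ranked) - n_replace
    survivors, culled = ranked[:cut], ranked[cut:]
    return [(loser, survivors[0]) for loser in culled]


fitness = {"momentum": 0.8, "mean_reversion": -0.1, "news": 0.3}   # hypothetical scores
signals = {"momentum": "BUY", "mean_reversion": "SELL", "news": "HOLD"}

print(select_best(signals, fitness))
print(plan_replacements(fitness, n_replace=1))
```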

Open-Source Adoption: Which Systems Have Public Code

Nine of the 29 research systems have confirmed public repositories. The distribution is highly concentrated. TradingAgents alone accounts for more than 90% of the combined star count, reflecting its role as the most complete, actively maintained, and broadly cited implementation in the field. Most 2026 preprints have not yet released code. AlphaAgents (BlackRock research group) has no official public release.

GitHub stars as of April 2026 · logarithmic scale · 20 of 29 systems have no confirmed public repository

Architecture Beats the Model

The research finding is consistent: the architecture of collaboration matters more than the capability of any single model.

Several controlled experiments in the literature report that a well-designed multi-agent architecture using GPT-3.5 can outperform a single GPT-4 agent on standardized trading tasks. The collaborative intelligence emerging from agent interaction appears to matter more than the raw capability of the underlying LLM. This is one of the more practically important findings in the review, because it implies that institutions do not need to wait for the next frontier model. The design of the system is the lever.

This is consistent with distributed cognition theory: the benefits of distributing reasoning across specialized agents are most pronounced when the task exceeds the capacity of any single cognitive agent. In trading, where decisions require integrating fundamental analysis, technical signals, sentiment data, and risk constraints simultaneously, no single agent can maintain expertise across all domains.

Experimental results across the 29 systems show the performance advantage of multi-agent designs is most visible during volatile or uncertain market periods, where the diversity of analytical perspectives provided by specialized agents proves most valuable. During calm, trending markets, the advantage narrows. Even simple single-agent strategies can capture directional moves without the overhead of multi-agent deliberation.

The implication for institutions evaluating agent architectures is direct: invest in the structure of collaboration (role design, communication protocols, aggregation mechanisms), not in chasing the latest foundation model. The research reviewed here points to architecture, more than model choice, as the variable that drives outcomes. A model-agnostic design that can route between providers without code changes is better positioned than one locked to a single vendor, regardless of that vendor's current benchmark scores.

Three Areas the Research Has Left to Practitioners

The concept matrix reveals three dimensions that the academic literature has systematically left unaddressed. This needs careful reading: the absence of governance frameworks in research systems does not mean that institutions deploying agent technology have no governance. Financial institutions have compliance infrastructure, risk frameworks, and regulatory oversight that exist independently of any AI system. What the research lacks is a model for how multi-agent systems interface with that infrastructure. That integration problem is where the engineering work happens, and it happens in deployment teams, not in papers.

Agent-level accountability is unspecified in the research

None of the 29 systems implements a formal model of decision attribution within the agent pipeline. When a multi-agent system produces a trade signal, the research does not specify how much influence each agent contributed, which agent's output was decisive, or how to reconstruct the reasoning chain after the fact. Institutions deploying these systems will need to build this attribution layer themselves, connecting agent outputs to existing audit trail requirements. The research has not provided a template. Practitioners are building it from scratch, which means the architectural decisions made now will define the standard.
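To make the missing attribution layer concrete, here is one possible shape for a machine-parseable decision record. Everything in it is hypothetical: the field names, the use of SHA-256 digests to reference agent outputs, and especially the `weight` values, since none of the reviewed systems defines how influence should be measured.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AgentContribution:
    agent: str
    model: str
    output_sha256: str  # digest of the agent's full output, so the text can be audited later
    weight: float       # attributed influence; how to estimate it is an open question


@dataclass
class DecisionRecord:
    ticker: str
    trade_date: str
    signal: str
    contributions: List[AgentContribution] = field(default_factory=list)

    def add(self, agent: str, model: str, output: str, weight: float) -> None:
        digest = hashlib.sha256(output.encode("utf-8")).hexdigest()
        self.contributions.append(AgentContribution(agent, model, digest, weight))

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values for illustration only.
record = DecisionRecord("AAPL", "2024-05-10", "Overweight")
record.add("market_analyst", "quick_think_llm", "RSI and MACD summary ...", 0.25)
record.add("portfolio_manager", "deep_think_llm", "Final decision text ...", 0.75)
print(record.to_json())
```

A record like this answers "who said what, with which model" after the fact. It does not answer how the weights should be derived, which remains the hard part.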

Adversarial testing is absent from the evaluation protocols

The 29 systems are almost universally evaluated under benign conditions: clean historical data, no fabricated inputs, no simulated adversarial counterparties. Research demonstrates that LLM trading agents can be manipulated through subtly biased news or crafted prompt injections, but this finding appears in separate adversarial robustness papers, not in the evaluation frameworks of the trading systems themselves. Practitioners operating in live markets face adversarial conditions as the baseline, not the exception. Bridging this gap requires red-teaming protocols that the research community has not yet standardized.
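A minimal red-teaming check might measure how often a single injected headline flips an agent's signal. The harness below is a sketch under heavy assumptions: `flip_rate` is a hypothetical metric, not a standardized protocol, and `keyword_agent` is a deliberately naive stand-in for an LLM agent so the example runs offline.

```python
from typing import Callable, List

Agent = Callable[[List[str]], str]


def flip_rate(agent: Agent, headlines: List[str], injections: List[str]) -> float:
    """Fraction of injected headlines that change the agent's signal versus the baseline."""
    baseline = agent(headlines)
    flips = sum(agent(headlines + [bad]) != baseline for bad in injections)
    return flips / len(injections)


def keyword_agent(headlines: List[str]) -> str:
    """Naive stand-in for an LLM agent: counts bullish versus bearish keywords."""
    text = " ".join(headlines).lower()
    score = text.count("beat") + text.count("surge") - text.count("fraud") - text.count("miss")
    return "BUY" if score > 0 else "SELL" if score < 0 else "HOLD"


clean = ["Company beat estimates"]
injections = [
    "Unverified report alleges fraud and accounting fraud",  # crafted negative item
    "Weather is mild today",                                 # irrelevant control item
]
rate = flip_rate(keyword_agent, clean, injections)
print(rate)
```

In a real evaluation the agent under test would be the full pipeline and the injections would be drawn from a curated adversarial set; the point of the sketch is only that the measurement itself is cheap to add to a backtest.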

The backtest-to-live transition is empirically thin

Research results are dominated by backtests. Live performance data is sparse, and where it exists, the sample sizes are small. LLM-specific production challenges compound the classic quant problem: multi-agent deliberation introduces latency that millisecond-sensitive strategies cannot absorb; backtests use clean historical data while live systems face delayed feeds and corrections; memory implementations are too early-stage to adapt to regime shifts in real time. None of this is insurmountable. Quant teams have solved analogous problems for rule-based systems over decades. The difference is that LLM-based systems are newer, less predictable, and the tooling for production monitoring is still being built.

On systemic risk: Research experiments show that LLM agents exhibit herd behavior, converging on similar strategies when exposed to similar inputs. If multiple institutions deploy architecturally similar multi-agent systems at scale, the behavioral correlation could amplify market moves in ways that individual firms cannot predict or hedge against. This is a coordination problem, not a firm-level problem, and it has no clear owner yet.

What This Means

The honest summary of the research field is this: academic teams have built 29 sophisticated systems that demonstrate the architectural viability of multi-agent LLM trading. They have not solved governance, adversarial robustness, or the backtest-to-live transition. None of these systems has been validated in live production at institutional scale.

That gap between academic demonstration and institutional deployment is itself the most important finding of this review. The technology is capable. The scaffolding required to deploy it responsibly does not yet exist in the literature.

In our previous analysis, we documented how AI agents are repricing an entire software sector. McKinsey's data showed enterprises redirecting budgets from SaaS toward agentic AI systems. The MGI study quantified the scale: 44 % of cognitive work hours automatable, $2.9 trillion in economic value by 2030.

The research reviewed here shows what that agentic infrastructure looks like at the architectural level, in one of the most demanding domains for AI. The research systems that perform best are not the ones with the most powerful models. They are the ones with the most thoughtful organizational design: specialized roles, structured communication, robust memory, and adversarial verification.

The pattern across both analyses is the same. In enterprise software, the budget split is backwards: most spend on technology, almost nothing on process and governance. In multi-agent trading research, the effort split is backwards: nearly all work goes into architectural sophistication, almost none into governance, explainability, or production readiness. The institutions that invert these priorities will be the ones that close the research-to-production gap. The ones that don't will replicate what the academic literature has already shown: impressive prototypes that never reach live deployment.

The institutions that will capture value from multi-agent AI in finance are the ones that treat governance as a design input, not an afterthought. Not by waiting for the research community to close the gaps, but by building accountability, explainability, and sovereign control into their agent architectures from the start. The competitive advantage of the next era will not come from the model. It will come from the institution that knows how to deploy one responsibly.



Reference: All 29 Systems

The full concept matrix mapping all 29 named research systems across ten architectural dimensions. Role design and tool integration are well-covered. Memory, explainability, and agent-level accountability remain systematically absent from the research literature.

System	Year	Agents	Roles	Communication	Memory	Decision Aggregation	Tools	Backtest	Explainability	Risk	Governance
TradingGPT	2023	2+	✓	✓	✓	✓	✓	✓	✗	✗	✗
TradingAgents	2024	5+	✓	✓	✓	✓	✓	✓	✓	✓	✗
FinCon	2024	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
StockAgent	2025	Mult.	✓	✓	✓	✓	✓	✓	✗	✗	✗
HedgeAgents	2025	3+	✓	✓	✓	✓	✓	✓	✗	✓	✗
ATLAS	2025	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
ElliottAgents	2025	4+	✓	✓	✓	✓	✓	✓	✗	✗	✗
GWise	2025	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
MountainLion	2025	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
AlphaQuanter	2025	2+	✓	✓	✓	✓	✓	✓	✗	✗	✗
Automate Str. Finding	2025	2+	✓	✓	✗	✓	✓	✓	✗	✗	✗
Oprea & Bara MAS	2025	3	✓	✓	✓	✓	✓	✓	✗	✓	✗
Trade in Minutes	2025	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
QuantAgents	2025	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
AlphaAgents	2025	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
FactorMAD	2025	2+	✓	✓	✓	✓	✓	✓	✗	✗	✗
QuantAgent HFT	2025	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
ContestTrade	2025	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
Agentic Portfolio	2025	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
Trading-R1	2025	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
Multi-Agent Alpha	2025	2+	✓	✓	✓	✓	✓	✓	✗	✗	✗
FLAG-Trader	2025	3+	✓	✓	✗	✓	✓	✓	✗	✗	✗
TrustTrade	2026	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
MARAG-Fin	2026	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
Expert Teams	2026	5+	✓	✓	✓	✓	✓	✓	✗	✗	✗
Apex Quant	2026	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
AGORA-F	2026	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗
FinEvo	2026	5+	✓	✓	✓	✓	✓	✓	✗	✗	✗
Adaptive LLM	2026	3+	✓	✓	✓	✓	✓	✓	✗	✗	✗

Concept matrix based on systematic review of 88 papers (2022-2026). Binary coding: each dimension scored 1 if explicitly addressed through system design, evaluation, or substantive discussion.