Rethinking Data Agent Benchmarks: DAComp Exposes Critical Gaps in Enterprise Intelligence
Executive Summary
The challenge of automating complex enterprise data workflows remains immense. Traditional AI benchmarks often focus narrowly on code generation or static analysis, failing to replicate the real-world, multi-stage process that runs from raw source data ingestion to actionable business insights. This research introduces DAComp, a comprehensive benchmark comprising 210 tasks that meticulously map the full data intelligence lifecycle, spanning repository-level Data Engineering (DE) and open-ended Data Analysis (DA). The primary takeaway is sobering: even state-of-the-art data agents achieve success rates under 20% on DE tasks and average below 40% on DA tasks. This stark performance gap signals that current models lack the holistic orchestration and strategic reasoning required for autonomous operation in complex enterprise environments.
The Motivation: What Problem Does This Solve?
Enterprise data intelligence isn't a single step; it's a dynamic, evolving workflow. Existing evaluation methods generally compartmentalize this process, testing SQL generation against a pre-defined schema or probing basic querying abilities in isolation. This approach fails to capture the complexity inherent in industrial settings: designing and building multi-stage SQL pipelines from scratch, evolving existing systems under new requirements, and tackling open-ended business questions that demand iterative exploratory analysis. The core problem is that agents capable of passing narrow tests often fail catastrophically when presented with holistic, end-to-end data tasks that require planning, execution, iteration, and interpretation.
How the Method Works
DAComp is structured around two primary task categories that mimic industrial complexity.
Architecture: Task Design
The Data Engineering (DE) component focuses heavily on system evolution and pipeline creation. Tasks require agents to interact with a repository environment, often necessitating modifications to the schema or the construction of complex, multi-stage data transformation pipelines written in SQL. These aren't isolated queries; they demand architectural understanding of how data flows and integrates across different stages and tables.
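To make the shape of such a task concrete, here is a minimal sketch of a staged SQL pipeline driven from Python. The table names, columns, and two-stage layout are hypothetical illustrations rather than DAComp's actual repositories, and SQLite stands in for whatever warehouse the benchmark environment provides.

```python
# Illustrative only: table names, columns, and the two-stage layout are
# hypothetical, and SQLite stands in for the benchmark's actual warehouse.
import sqlite3

STAGES = [
    # Stage 1: clean and type the raw events into a staging table.
    """
    CREATE TABLE stg_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount   AS REAL)    AS amount,
           DATE(ordered_at)          AS order_date
    FROM raw_orders
    WHERE order_id IS NOT NULL;
    """,
    # Stage 2: aggregate the staging table into a reporting mart.
    """
    CREATE TABLE mart_daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue, COUNT(*) AS order_count
    FROM stg_orders
    GROUP BY order_date;
    """,
]

def run_pipeline(conn: sqlite3.Connection) -> None:
    """Execute each stage in order; later stages read earlier outputs."""
    for stage_sql in STAGES:
        conn.executescript(stage_sql)
```

The point of the sketch is the dependency structure: stage 2 is only meaningful if stage 1 produced the right table, which is exactly the kind of cross-stage reasoning the DE tasks probe.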
Evaluation
For DE tasks, evaluation is execution-based. The agent's generated pipeline must execute correctly and produce the desired outcome, measured using multiple metrics to ensure completeness and efficiency.
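A minimal sketch of what execution-based checking could look like follows; the specific metrics DAComp reports are not detailed here, so this simply runs an agent's pipeline in a fresh database and compares the resulting table against a reference result set.

```python
# A sketch of execution-based checking, assuming the harness holds a
# reference result set per task; DAComp's actual metrics may differ.
import sqlite3

def evaluate_pipeline(setup_sql: str, agent_sql: str,
                      check_query: str, expected_rows: set) -> bool:
    """Run the agent-generated pipeline in a fresh database and compare the
    produced rows against the reference rows (order-insensitive)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)   # load the task's source tables
        conn.executescript(agent_sql)   # execute the agent's pipeline
        produced = set(conn.execute(check_query).fetchall())
    except sqlite3.Error:
        return False                    # a pipeline that fails to run scores zero
    finally:
        conn.close()
    return produced == expected_rows
```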
The Data Analysis (DA) component presents agents with open-ended business problems. The agent must first plan an approach, then execute iterative exploratory analysis using code (likely Python/Pandas), interpret the results of intermediate steps, and ultimately synthesize findings into a clear, actionable recommendation for a hypothetical business stakeholder. Since these tasks are open-ended and subjective, the research utilizes an experimentally validated LLM-judge. This judge uses meticulously crafted, hierarchical rubrics to assess the quality of the planning, execution steps, interpretation, and final synthesis, ensuring reliable scoring of inherently qualitative output.
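The rubric below is purely illustrative; the dimension names, weights, and the `llm_score` callable are assumptions meant to show the shape of rubric-guided judging, not DAComp's actual prompts or criteria.

```python
# Illustrative rubric; dimensions, weights, and the llm_score callable are
# assumptions, not DAComp's actual hierarchical rubrics or judge prompts.
from typing import Callable

RUBRIC = {
    "planning":       (0.25, "Does the plan decompose the business question sensibly?"),
    "execution":      (0.25, "Are the analysis steps correct and reproducible?"),
    "interpretation": (0.25, "Are intermediate results read correctly?"),
    "synthesis":      (0.25, "Is the final recommendation actionable and well supported?"),
}

def judge_report(report: str, llm_score: Callable[[str, str], float]) -> float:
    """Score a DA report dimension by dimension and return the weighted total.
    llm_score(criterion, report) is assumed to return a value in [0, 1]."""
    return sum(weight * llm_score(criterion, report)
               for weight, criterion in RUBRIC.values())
```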
Results & Benchmarks
The findings from benchmarking state-of-the-art (SOTA) agents on DAComp reveal significant limitations across the board.
| Task Category | Average Success Rate | Primary Bottleneck |
|---|---|---|
| Data Engineering (DE) | Under 20% | Holistic Pipeline Orchestration |
| Data Analysis (DA) | Below 40% | Open-Ended Reasoning and Interpretation |
The sub-20% performance on DE tasks is particularly striking. It confirms that while modern agents might generate syntactically correct SQL snippets, they fail when required to stitch those snippets into a coherent, multi-step pipeline that operates within the constraints of an evolving industrial repository. Additionally, the sub-40% average on DA tasks underscores a profound deficiency in higher-order cognitive capabilities, namely strategic planning and synthesizing actionable intelligence from iterative analysis. This strongly suggests that DE and DA require distinct, non-overlapping capabilities that current monolithic agent architectures struggle to master simultaneously.
Strengths: What This Research Achieves
The core strength of DAComp is its fidelity to real-world enterprise complexity. By forcing agents to handle the full lifecycle, from raw data engineering to final analytical recommendation, it moves beyond simplistic code generation benchmarks. Additionally, the dual evaluation methodology is robust: execution-based testing provides objective metrics for engineering, while the LLM-judge, guided by detailed rubrics, introduces a scalable yet reliable means to assess the qualitative nature of open-ended analysis. This rigor provides the community with a necessary diagnostic tool to pinpoint exactly where autonomous data agents fail in industrial settings.
Limitations & Failure Cases
While DAComp is highly rigorous, implementing such a complex benchmark introduces limitations. The use of an LLM-judge, though validated, inherently carries some risk of subjective bias, regardless of the rubrics' quality. Furthermore, the bottleneck identified in DE is "holistic pipeline orchestration." This definition is broad and suggests that current agents lack the robust internal planning and state management required for sequential, complex steps. The specific constraints and scale of the "industrial schemas" used in the benchmark are crucial details that could affect generalizability: if the complexity is too low, the sub-20% figure might be inflated; if it is too high, it might discourage immediate progress. Scalability remains a significant risk; complex agent planning often hits prompt-length limits or exponential growth in planning complexity in real-time enterprise scenarios.
Real-World Implications & Applications
If agents could successfully navigate the DAComp benchmark, the implications for enterprise data management would be transformative. Data Engineering teams, currently bottlenecked by manual schema evolution and pipeline debugging, could leverage autonomous agents for dynamic system maintenance and rapid response to changing business requirements. Time-to-insight would also accelerate, as analysts could delegate initial exploratory data analysis (EDA) and iterative querying to agents, freeing them to focus on strategic decision-making based on the agents' synthesized reports. Ultimately, this research provides the necessary target specifications for developing truly autonomous data intelligence platforms that minimize human intervention across the entire data-to-decision pipeline.
Relation to Prior Work
The state-of-the-art prior work largely centered on discrete aspects: agents excelling at text-to-SQL translation (e.g., Spider benchmark variants) or models trained for descriptive statistical analysis on clean datasets. However, these models operated under idealized conditions-assuming a stable schema and focused tasks. DAComp fills the critical gap by integrating the messy reality of data manipulation (DE) with the ambiguity of business problem solving (DA). It sets a higher bar than preceding benchmarks by requiring agents not only to generate code but to understand the context, manage state, and interpret results strategically, demonstrating that holistic intelligence is necessary, not just technical proficiency.
Conclusion: Why This Paper Matters
DAComp is a pivotal contribution because it shatters the illusion that current SOTA agents are close to achieving autonomy in enterprise data intelligence. The extremely low success rates (particularly the sub-20% figure in data engineering) provide clear, undeniable evidence of fundamental architectural deficiencies in planning and orchestration. For architects and researchers, this paper is not just a benchmark; it's a diagnostic tool that clearly delineates the path forward: future agent development must prioritize robust sequential reasoning, architectural awareness, and integrated planning capabilities to successfully bridge the chasm between narrow task performance and full-spectrum enterprise deployment.
Appendix
The benchmark data and code are open-sourced at https://da-comp.github.io, offering the community the ability to reproduce results and directly contribute improvements. The design methodology for the LLM-judge assessment provides a valuable template for evaluating complex, open-ended tasks where simple execution correctness is insufficient. The research emphasizes the distinction between simple code generation and holistic pipeline management.
Commercial Applications
Autonomous ETL Pipeline Maintenance
Enterprise data teams can utilize advanced agents to automatically evolve existing data transformation pipelines (ETL/ELT) in response to upstream source schema changes or evolving compliance requirements, minimizing manual intervention and reducing data downtime.
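As a rough illustration of the detection half of that loop, the snippet below compares a table's live columns against what the downstream pipeline expects; the table name, column names, and static expectation map are hypothetical assumptions, not a prescribed workflow.

```python
# Hypothetical drift check: table and column names, and the static
# expectation map, are illustrative assumptions.
import sqlite3

EXPECTED_COLUMNS = {"raw_orders": {"order_id", "amount", "ordered_at"}}

def detect_schema_drift(conn: sqlite3.Connection) -> dict:
    """Report missing/added columns per table so an agent (or a human)
    can plan the corresponding pipeline changes."""
    drift = {}
    for table, expected in EXPECTED_COLUMNS.items():
        info = conn.execute(f"PRAGMA table_info({table})").fetchall()
        actual = {row[1] for row in info}   # row[1] is the column name
        if actual != expected:
            drift[table] = {"missing": expected - actual,
                            "added": actual - expected}
    return drift
```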
Strategic Business Intelligence Generation
Agents capable of passing the DA tasks can be deployed within BI platforms to perform initial exploratory data analysis on open-ended business questions ('Why did sales drop last quarter?'), autonomously identify relevant datasets, iterate through hypotheses via code execution, and synthesize a narrative report for human review.
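A toy version of that hypothesis loop, using pandas, might look like the following; the column names and the two candidate explanations are illustrative assumptions rather than benchmark content.

```python
# Toy hypothesis loop; the columns ("quarter", "region", "revenue") and the
# candidate explanations are illustrative assumptions.
import pandas as pd

def investigate_sales_drop(df: pd.DataFrame) -> list[str]:
    """Test a couple of candidate explanations for a revenue decline and
    return plain-language findings for a later narrative-synthesis step."""
    findings = []
    by_q = df.groupby("quarter")["revenue"].sum().sort_index()
    last, prev = by_q.index[-1], by_q.index[-2]
    if by_q[last] < by_q[prev]:
        # Hypothesis 1: the decline is concentrated in a single region.
        delta = (df[df["quarter"] == last].groupby("region")["revenue"].sum()
                 - df[df["quarter"] == prev].groupby("region")["revenue"].sum())
        worst = delta.idxmin()
        findings.append(f"Largest regional decline: {worst} ({delta[worst]:+.0f})")
        # Hypothesis 2: order volume fell rather than average order value.
        n_last, n_prev = (df["quarter"] == last).sum(), (df["quarter"] == prev).sum()
        findings.append(f"Order count moved from {n_prev} to {n_last}")
    return findings
```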
Data Governance and Auditing
Leveraging the DE capabilities, agents could be used to audit data quality and lineage by designing and executing complex validation pipelines across diverse data repositories, ensuring data integrity and adherence to internal governance standards without constant human oversight of SQL logic.
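A minimal flavor of such a validation pipeline is sketched below; the rule names, tables, and SQL checks are hypothetical examples, not governance rules drawn from the benchmark.

```python
# Hypothetical governance checks; rule names, tables, and thresholds are
# illustrative, not taken from DAComp.
import sqlite3

VALIDATION_RULES = {
    "no_orphan_orders": """
        SELECT COUNT(*) FROM orders o
        LEFT JOIN customers c ON o.customer_id = c.customer_id
        WHERE c.customer_id IS NULL
    """,
    "no_negative_amounts": "SELECT COUNT(*) FROM orders WHERE amount < 0",
}

def run_validations(conn: sqlite3.Connection) -> dict:
    """Each rule counts violating rows; a rule passes when the count is zero."""
    return {name: conn.execute(sql).fetchone()[0] == 0
            for name, sql in VALIDATION_RULES.items()}
```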