adobe-research/multi-agent-eval-bench

MP-Bench: Multi-Perspective Failure Attribution Benchmark for Multi-Agent Systems

Official dataset and benchmark for the paper:

"Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation"


🚀 Overview

Multi-agent systems (MAS) powered by large language models (LLMs) are increasingly used to solve complex tasks through collaboration. However, these systems frequently fail due to complex inter-agent dependencies and multi-step reasoning processes.

Understanding where and why failures occur is critical for improving reliability.

Key insight: Failure attribution in MAS is inherently non-deterministic — multiple valid explanations for a failure can coexist.

To address this, we introduce:

  • MP-Bench: The first benchmark for multi-perspective failure attribution
  • A new evaluation paradigm that moves beyond single “ground-truth” failure labels

🧠 Motivation

❌ Limitations of Existing Benchmarks

Existing benchmarks assume:

  • A single deterministic failure step
  • A unique correct answer

However, in real-world MAS:

  • Multiple reasoning paths may be valid
  • Different analysts may attribute failure to different steps

✅ Our Perspective

We argue that:

Failure attribution should be modeled as a set of plausible explanations, not a single label.


📊 Illustration

Multi-Perspective Failure Attribution

  • (a) Execution trace of a MAS
  • (b) Deterministic attribution (existing assumption)
  • (c) Multi-perspective attribution (our formulation)

Different experts may:

  • Blame different steps
  • Provide different reasoning
  • Still be equally valid

📦 Dataset Description

🔹 Key Features

  • 289 execution logs, collected from the existing MAST and Who&When benchmarks
  • 121 diverse MAS configurations, spanning both hand-crafted and automatically generated setups
  • 3 expert annotators per instance (experts hired through a rigorous interview process)
  • Step-level annotations with accompanying reasoning

📁 Repository layout

All released JSON instances live under MP-Bench/. The top-level directory indicates which of the three expert annotators produced the labels; the second level indicates how the underlying MAS was configured:

| Path | Meaning |
| --- | --- |
| `MP-Bench/1/`, `2/`, `3/` | Annotations from experts 1, 2, and 3 (three independent perspectives per the benchmark design). |
| `…/automatic/` | Traces from automated / algorithm-generated MAS setups (e.g., configurations aligned with algorithmically generated pipelines). |
| `…/manual/` | Traces from hand-crafted MAS setups (e.g., expert-designed agent workflows). |

Each *.json file is one execution log instance; the numeric filename is an instance identifier within that split.

```text
MP-Bench/
├── 1/
│   ├── automatic/   # Expert 1 · automated MAS configurations
│   └── manual/      # Expert 1 · hand-crafted MAS configurations
├── 2/
│   ├── automatic/
│   └── manual/
└── 3/
    ├── automatic/
    └── manual/
```

Each JSON instance includes:

```json
{
    "log_source": "source link",
    "annotation": [
        {
            "step": "5",
            "fail_annotation": "1",
            "fail_category": "Resource Access Error",
            "fail_reason": "Direct access to ScienceDirect was blocked with error message 'There was a problem providing the content you requested' and reference number, indicating institutional access restrictions",
            "ideal_action": "The WebSurfer agent should have immediately recognized the access restriction and pivoted to alternative approaches like searching for published research that might contain this statistical data instead of continuing attempts to access restricted content"
        },
        {
            "step": "16",
            "fail_annotation": "1",
            "fail_category": "System Error",
            "fail_reason": "The WebSurfer agent invocation raised an error",
            "ideal_action": "The WebSurfer agent could have had better exception handling."
        }
    ]
}
```
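The fields above suggest a simple per-instance loader. A minimal sketch (the helper name `load_failure_steps` is ours; it assumes, as in the example, that `fail_annotation == "1"` marks an annotated failure step):

```python
import json

def load_failure_steps(path):
    """Parse one MP-Bench instance file and return its annotated
    failure steps as integers, in file order."""
    with open(path) as f:
        instance = json.load(f)
    return [
        int(entry["step"])
        for entry in instance["annotation"]
        if entry.get("fail_annotation") == "1"
    ]
```

Applying this loader to the same instance identifier under `1/`, `2/`, and `3/` yields the set of plausible failure steps across annotators, which is the unit the multi-perspective evaluation operates on.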

📝 Citation

```bibtex
@article{in2026rethinking,
  title={Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation},
  author={In, Yeonjun and Tanjim, Mehrab and Subramanian, Jayakumar and Kim, Sungchul and Bhattacharya, Uttaran and Kim, Wonjoong and Park, Sangwu and Sarkhel, Somdeb and Park, Chanyoung},
  journal={arXiv preprint arXiv:2603.25001},
  year={2026}
}
```

📜 License

This dataset is released under the Adobe Research License.

🔒 Usage Restrictions

  • The dataset is limited to non-commercial research purposes
  • Allowed uses include:
    • Academic research
    • Teaching
  • Commercial use is strictly prohibited, including:
    • Product development
    • Commercial distribution
    • Any activity leading to financial gain

🔁 Redistribution

If you redistribute this dataset (or modified versions):

  • You must include a copy of the original Adobe Research License
  • Any derivative work must also be restricted to non-commercial research use

📝 Attribution

  • You must retain all original copyright notices and disclaimers

⚠️ Disclaimer

The dataset is provided "as is" without warranty of any kind.
Adobe is not liable for any damages resulting from its use.


For full details, please refer to `Adobe Research License v1.2.txt`.
