adobe-research/multi-agent-eval-bench

MP-Bench: Multi-Perspective Failure Attribution Benchmark for Multi-Agent Systems

Official dataset and benchmark for the paper:

"Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation"


🚀 Overview

Multi-agent systems (MAS) powered by large language models (LLMs) are increasingly used to solve complex tasks through collaboration. However, these systems frequently fail due to complex inter-agent dependencies and multi-step reasoning processes.

Understanding where and why failures occur is critical for improving reliability.

Key insight: Failure attribution in MAS is inherently non-deterministic — multiple valid explanations for a failure can coexist.

To address this, we introduce:

  • MP-Bench: The first benchmark for multi-perspective failure attribution
  • A new evaluation paradigm that moves beyond single “ground-truth” failure labels

🧠 Motivation

❌ Limitations of Existing Benchmarks

Existing benchmarks assume:

  • A single deterministic failure step
  • A unique correct answer

However, in real-world MAS:

  • Multiple reasoning paths may be valid
  • Different analysts may attribute failure to different steps

✅ Our Perspective

We argue that:

Failure attribution should be modeled as a set of plausible explanations, not a single label.


📊 Illustration

Multi-Perspective Failure Attribution

  • (a) Execution trace of a MAS
  • (b) Deterministic attribution (existing assumption)
  • (c) Multi-perspective attribution (our formulation)

Different experts may:

  • Blame different steps
  • Provide different reasoning
  • Still be equally valid

📦 Dataset Description

🔹 Key Features

  • 289 execution logs, collected from the existing MAST and Who&When benchmarks
  • 121 diverse MAS configurations, spanning both hand-crafted and automatically generated setups
  • 3 expert annotators per instance (experts hired through a rigorous interview process)
  • Step-level annotations with accompanying reasoning

📁 Repository layout

All released JSON instances live under MP-Bench/. The top-level directory indicates which of the three expert annotators produced the labels; the second level indicates how the underlying MAS was configured:

| Path | Meaning |
| --- | --- |
| `MP-Bench/1/`, `2/`, `3/` | Annotations from experts 1, 2, and 3 (three independent perspectives per the benchmark design). |
| `…/automatic/` | Traces from automated / algorithm-generated MAS setups (e.g., configurations aligned with algorithmically generated pipelines). |
| `…/manual/` | Traces from hand-crafted MAS setups (e.g., expert-designed agent workflows). |

Each *.json file is one execution log instance; the numeric filename is an instance identifier within that split.

```text
MP-Bench/
├── 1/
│   ├── automatic/   # Expert 1 · automated MAS configurations
│   └── manual/      # Expert 1 · hand-crafted MAS configurations
├── 2/
│   ├── automatic/
│   └── manual/
└── 3/
    ├── automatic/
    └── manual/
```

Each JSON instance includes:

```json
{
    "log_source": "source link",
    "annotation": [
        {
            "step": "5",
            "fail_annotation": "1",
            "fail_category": "Resource Access Error",
            "fail_reason": "Direct access to ScienceDirect was blocked with error message 'There was a problem providing the content you requested' and reference number, indicating institutional access restrictions",
            "ideal_action": "The WebSurfer agent should have immediately recognized the access restriction and pivoted to alternative approaches like searching for published research that might contain this statistical data instead of continuing attempts to access restricted content"
        },
        {
            "step": "16",
            "fail_annotation": "1",
            "fail_category": "System Error",
            "fail_reason": "The WebSurfer agent invocation raised an error",
            "ideal_action": "The WebSurfer agent could have had better exception handling."
        }
    ]
}
```
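The fields above suggest a simple per-instance loader. A minimal sketch (the helper name `load_failure_steps` is ours; it assumes, as in the example, that `fail_annotation == "1"` marks an annotated failure step):

```python
import json

def load_failure_steps(path):
    """Parse one MP-Bench instance file and return its annotated
    failure steps as integers, in file order."""
    with open(path) as f:
        instance = json.load(f)
    return [
        int(entry["step"])
        for entry in instance["annotation"]
        if entry.get("fail_annotation") == "1"
    ]
```

Applying this loader to the same instance identifier under `1/`, `2/`, and `3/` yields the set of plausible failure steps across annotators, which is the unit the multi-perspective evaluation operates on.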

📝 Citation

```bibtex
@article{in2026rethinking,
  title={Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation},
  author={In, Yeonjun and Tanjim, Mehrab and Subramanian, Jayakumar and Kim, Sungchul and Bhattacharya, Uttaran and Kim, Wonjoong and Park, Sangwu and Sarkhel, Somdeb and Park, Chanyoung},
  journal={arXiv preprint arXiv:2603.25001},
  year={2026}
}
```

📜 License

This dataset is released under the Adobe Research License.

🔒 Usage Restrictions

  • The dataset is limited to non-commercial research purposes
  • Allowed uses include:
    • Academic research
    • Teaching
  • Commercial use is strictly prohibited, including:
    • Product development
    • Commercial distribution
    • Any activity leading to financial gain

🔁 Redistribution

If you redistribute this dataset (or modified versions):

  • You must include a copy of the original Adobe Research License
  • Any derivative work must also be restricted to non-commercial research use

📝 Attribution

  • You must retain all original copyright notices and disclaimers

⚠️ Disclaimer

The dataset is provided "as is" without warranty of any kind.
Adobe is not liable for any damages resulting from its use.


For full details, please refer to `Adobe Research License v1.2.txt`.
