
feat: add multimodal evaluators and prompt templates for image-to-text evaluation#187

Merged
jjbuck merged 11 commits into strands-agents:main from sangminwoo:main on Apr 30, 2026
Conversation

sangminwoo (Collaborator) commented Apr 7, 2026

Description

This PR adds multimodal (MLLM-as-a-Judge) evaluators for image-to-text tasks with reference-free and reference-based evaluation support.

  • Introduced MultimodalCorrectnessEvaluator, MultimodalFaithfulnessEvaluator, MultimodalInstructionFollowingEvaluator, and MultimodalOverallQualityEvaluator to assess various aspects of multimodal responses.
  • Implemented MultimodalOutputEvaluator as a base class for handling multimodal inputs and outputs.
  • Created prompt templates for evaluation rubrics including correctness, faithfulness, instruction following, and overall quality.
  • Developed a structured approach for composing evaluation prompts with support for media content.
  • Added ImageData class to manage image sources and formats, enabling flexible input handling for evaluators.
  • Established a unified module for multimodal evaluation types and data structures.
  • Supports reference-free evaluation (default) and reference-based evaluation (automatically toggled when expected_output is provided in the test case).
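
A minimal usage sketch (illustrative only; import paths, constructor arguments, and Case fields below are assumptions inferred from this PR's file layout rather than a confirmed API):

# Illustrative sketch: module paths and field names are assumptions inferred
# from this PR's file layout (src/strands_evals/...), not a confirmed API.
from strands_evals import Case  # assumed top-level export
from strands_evals.evaluators import MultimodalCorrectnessEvaluator
from strands_evals.types.multimodal import ImageData, MultimodalInput

evaluator = MultimodalCorrectnessEvaluator()

# Reference-free evaluation (the default): a multimodal input with one image.
case = Case(
    input=MultimodalInput(
        text="Describe the chart in this image.",
        media=[ImageData(source="charts/revenue_q3.png", format="png")],  # hypothetical file
    ),
)
result = evaluator.evaluate(case)

# Reference-based evaluation is toggled automatically when expected_output is set.
case_with_reference = case.model_copy(update={"expected_output": "Revenue grew about 12% QoQ."})
result_with_reference = evaluator.evaluate(case_with_reference)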

Related Issues

#128

Documentation PR

strands-agents/docs#674

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

afarntrog (Contributor) left a comment:

I would like to see some tests for this new multimodal functionality. Please add unit and integration tests.

sangminwoo (Collaborator, Author) commented Apr 22, 2026

@afarntrog Added unit and integration tests:

  • test_multimodal.py - 45 tests for ImageData, MultimodalInput, resolve_image_bytes, format detection
  • test_multimodal_output_evaluator.py - 18 tests for base evaluator init, rubric selection, prompt building, evaluate/evaluate_async
  • test_multimodal_specialized_evaluators.py - 23 tests for all four specialized subclasses (Correctness, Faithfulness, InstructionFollowing, OverallQuality)
  • test_multimodal_case_prompt_template.py - 16 tests for media resolution, content block building, prompt composition
  • test_evaluation_report.py - 11 tests for _format_input_for_display and flatten()
  • tests_integ/test_multimodal_output_evaluator.py - 9 integration tests covering MLLM/LLM modes, reference-based/free, multi-image, file path, sync/async, context, and batch

Also, could we add Sungyeon Kim (@sung-yeon-kim) as a collaborator on this repo? He's been co-authoring the work on this PR.

jjbuck (Collaborator) commented Apr 29, 2026

1. Drop include_media entirely

As noted in one of my previous comments [https://github.com//pull/187#discussion_r3048446548], I still think we should drop this boolean flag. Boolean flags on a class are typically an anti-pattern that makes extensibility difficult: they almost always signal that the class has two modes and should either be split into two classes, or the "mode" should be driven by the data. Here it's the latter. The class has a single coherent purpose, i.e., "evaluate a multimodal response with an MLLM judge, using content blocks when media is available." The flag re-introduces optionality that the _build_prompt refactor was specifically designed to eliminate.

This would touch all four specialized evaluators, multimodal_output_evaluator.py, and multimodal_case_prompt_template.py. Additionally, there's a silent corruption path here wherein include_media: false in an evaluator config can round-trip a multimodal case into silent text-only evaluation.

What to do:

  1. Remove include_media from MultimodalOutputEvaluator.__init__, all four specialized subclasses, and compose_multimodal_test_prompt.
  2. Remove the self.include_media attribute.
  3. Make dispatch purely data-driven in compose_multimodal_test_prompt:
    if isinstance(input_, MultimodalInput) and input_.media:
        ...  # build content-block list (text block plus one block per image)
    else:
        # text-only prompt
        if isinstance(input_, MultimodalInput) and not input_.media:
            warnings.warn("MultimodalInput has empty media; evaluating as text-only.", ...)
        ...  # compose the plain-text prompt

Warn only when the input declares itself multimodal but has no media — that's an actual data inconsistency. Don't warn on plain-text inputs (since nothing is wrong with those).

  4. Remove the duplicated warning at the evaluator level (the prompt-composer is the right place for it).

2. Silent corruption on save/load round-trip

File: src/strands_evals/evaluators/multimodal_output_evaluator.py, line ~85

Blocker: silent corruption on save/load round-trip.

Experiment.from_dict (in experiment.py:662) calls Case.model_validate(case_data) without the [MultimodalInput, str] parameterization.
Pydantic has no way to know the original Case was parameterized, so case.input comes back as a raw dict rather than a MultimodalInput.
Then in _build_prompt:

if (self.include_media
    and isinstance(evaluation_case.input, MultimodalInput)  # False — input is a dict
    and not evaluation_case.input.media):

the isinstance guard returns False, and we fall through to text-only mode with no warning. You can verify this by saving a multimodal Case to JSON and reloading it: the case gets evaluated as text-only, with the image data preserved in the file but never reaching the judge. Users see plausible but wrong scores with no indication anything went wrong.

What I'd recommend is

  1. In compose_multimodal_test_prompt, coerce the input before the isinstance dispatch:
      input_ = evaluation_case.input
      if isinstance(input_, dict):
          input_ = MultimodalInput.model_validate(input_)

Do this at one well-defined boundary (the prompt composer) rather than scattering coercion through the evaluators.

  2. Add a round-trip regression test that saves a MultimodalInput-bearing experiment, reloads it, builds the prompt, and asserts the prompt is a list (content blocks), not a str. Asserting on prompt shape is the only way to catch silent fallthroughs like this; asserting on round-trip JSON equality won't catch it because the JSON content is preserved. (Sketch below.)
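
Something along these lines (pytest; names like compose_multimodal_test_prompt, Case, MultimodalInput, and ImageData come from this thread, but import paths, the helper's exact signature, and the content-block shape are assumptions):

# Sketch of the suggested round-trip regression test. Import paths, the
# helper's signature, and the content-block shape are assumptions; only the
# names used in this thread are real.
from strands_evals import Case  # assumed import path
from strands_evals.multimodal_case_prompt_template import compose_multimodal_test_prompt  # assumed module location
from strands_evals.types.multimodal import ImageData, MultimodalInput


def test_multimodal_prompt_survives_json_round_trip():
    case = Case[MultimodalInput, str](
        input=MultimodalInput(
            text="What does the image show?",
            media=[ImageData(source=b"\x89PNG...", format="png")],  # hypothetical inline bytes
        ),
    )

    # Simulate the Experiment.from_dict path: re-validate without the
    # [MultimodalInput, str] parameterization, so input comes back as a dict.
    reloaded = Case.model_validate(case.model_dump())

    prompt = compose_multimodal_test_prompt(reloaded)

    # Assert on prompt shape: content blocks (a list), not a plain string.
    assert isinstance(prompt, list)
    assert any(isinstance(block, dict) and "image" in block for block in prompt)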

3. uses_environment_state kwarg leak

File: src/strands_evals/evaluators/multimodal_output_evaluator.py, __init__

Blocker: uses_environment_state kwarg leak breaks from_dict even with custom_evaluators passed.

MultimodalOutputEvaluator.__init__ calls super().__init__(...) which sets self.uses_environment_state = False. But MultimodalOutputEvaluator.__init__'s own signature doesn't have uses_environment_state. Evaluator.to_dict compares self.__dict__ against inspect.signature(self.__class__.__init__)'s defaults — the attribute isn't in that signature, so it's treated as a non-default value and always serialized.

The result is that MultimodalCorrectnessEvaluator().to_dict() emits uses_environment_state: False, and feeding that dict back through from_dict raises TypeError: MultimodalCorrectnessEvaluator.__init__() got an unexpected keyword argument 'uses_environment_state'.

This fails even when the user passes custom_evaluators=[MultimodalCorrectnessEvaluator]. There's no user workaround today.

Fix: surface uses_environment_state on MultimodalOutputEvaluator.__init__ and forward it. Same for all four specialized subclasses — they need to accept it and pass it up, keeping the default False.

def __init__(
    self,
    rubric: str,
    model: Model | str | None = None,
    include_inputs: bool = True,
    system_prompt: str | None = None,
    reference_suffix: str | None = None,
    uses_environment_state: bool = False,
):
    super().__init__(
        rubric=rubric, model=model,
        system_prompt=system_prompt or MLLM_JUDGE_SYSTEM_PROMPT,
        include_inputs=include_inputs,
        uses_environment_state=uses_environment_state,
    )
    ...

sangminwoo (Collaborator, Author) commented:

Thanks @jjbuck for the detailed review and suggestions!

1. Drop include_media entirely

Dispatch is now data-driven in compose_multimodal_test_prompt:

  • MultimodalInput with media → MLLM mode
  • MultimodalInput with empty media → LLM mode + warning
  • Plain text input → LLM mode, no warning (if someone wants text-only judging, they pass plain-text input instead of a MultimodalInput)

Removed include_media from MultimodalOutputEvaluator, the four specialized subclasses, and compose_multimodal_test_prompt, plus the duplicated warning at the evaluator level.
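
In rough form, the dispatch now looks like this (simplified sketch; the real helper also composes the rubric and reference text, and the field names and content-block shape below are illustrative rather than the exact structure in the PR):

# Simplified sketch of the data-driven dispatch described above; helper
# signature, ImageData field names, and block shapes are assumptions.
import warnings

from strands_evals.types.multimodal import MultimodalInput  # assumed import path


def compose_multimodal_test_prompt(evaluation_case):
    input_ = evaluation_case.input
    if isinstance(input_, dict):
        # Coerce dicts produced by an unparameterized Case.model_validate round-trip.
        input_ = MultimodalInput.model_validate(input_)

    if isinstance(input_, MultimodalInput) and input_.media:
        # MLLM mode: one text block plus one block per image.
        return [{"text": input_.text}] + [
            {"image": {"format": img.format, "source": img.source}} for img in input_.media
        ]

    if isinstance(input_, MultimodalInput):
        # Declared multimodal but carries no media: warn, then evaluate as text-only.
        warnings.warn("MultimodalInput has empty media; evaluating as text-only.")
        return input_.text

    # Plain-text input: LLM mode, no warning.
    return str(input_)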

2. Silent corruption on save/load round-trip

The fix follows your suggestion: coerce the input in compose_multimodal_test_prompt, so there's one place handling it. Also added a regression test: it saves a MultimodalInput, reloads it as a raw dict (simulating the from_dict path), and checks the built prompt is a list with an image block, not a plain string. This way, if the coercion ever breaks again, the test will catch it.

3. uses_environment_state kwarg leak

As you suggested, I've added uses_environment_state: bool = False to MultimodalOutputEvaluator.__init__ and all four subclasses, and forwarded it to super().__init__.

While fixing this, I noticed reference_suffix had the same issue. Fixed it in all four subclasses and added a round-trip test covering all four evaluators.
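
The round-trip check is roughly of this shape (sketch only; it assumes to_dict() emits just constructor kwargs, so the real test may need to filter metadata keys or go through from_dict instead):

# Sketch of the to_dict round-trip check; assumes to_dict() returns only
# constructor kwargs. If it also emits metadata keys (e.g. a class
# identifier), the real test would rebuild through from_dict instead.
import pytest

from strands_evals.evaluators import (
    MultimodalCorrectnessEvaluator,
    MultimodalFaithfulnessEvaluator,
    MultimodalInstructionFollowingEvaluator,
    MultimodalOverallQualityEvaluator,
)


@pytest.mark.parametrize(
    "evaluator_cls",
    [
        MultimodalCorrectnessEvaluator,
        MultimodalFaithfulnessEvaluator,
        MultimodalInstructionFollowingEvaluator,
        MultimodalOverallQualityEvaluator,
    ],
)
def test_evaluator_kwargs_round_trip(evaluator_cls):
    config = evaluator_cls().to_dict()
    # Before the fix, leaked keys like uses_environment_state / reference_suffix
    # made this re-instantiation raise TypeError.
    rebuilt = evaluator_cls(**config)
    assert rebuilt.to_dict() == config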

Thanks again!

@jjbuck jjbuck enabled auto-merge (squash) April 30, 2026 16:37
@jjbuck jjbuck merged commit 4ef54fa into strands-agents:main Apr 30, 2026
13 checks passed