
feat: add multimodal evaluators and prompt templates for image-to-text evaluation#187

Merged
jjbuck merged 11 commits into strands-agents:main from sangminwoo:main on Apr 30, 2026
Conversation

sangminwoo (Collaborator) commented Apr 7, 2026

Description

This PR adds multimodal (MLLM-as-a-Judge) evaluators for image-to-text tasks with reference-free and reference-based evaluation support.

  • Introduced MultimodalCorrectnessEvaluator, MultimodalFaithfulnessEvaluator, MultimodalInstructionFollowingEvaluator, and MultimodalOverallQualityEvaluator to assess various aspects of multimodal responses.
  • Implemented MultimodalOutputEvaluator as a base class for handling multimodal inputs and outputs.
  • Created prompt templates for evaluation rubrics including correctness, faithfulness, instruction following, and overall quality.
  • Developed a structured approach for composing evaluation prompts with support for media content.
  • Added ImageData class to manage image sources and formats, enabling flexible input handling for evaluators.
  • Established a unified module for multimodal evaluation types and data structures.
  • Supports reference-free evaluation (default) and reference-based evaluation (automatically toggled when expected_output is provided in the test case).
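
A minimal usage sketch (illustrative only; import paths, constructor arguments, and Case fields below are assumptions inferred from this PR's file layout rather than a confirmed API):

# Illustrative sketch: module paths and field names are assumptions inferred
# from this PR's file layout (src/strands_evals/...), not a confirmed API.
from strands_evals import Case  # assumed top-level export
from strands_evals.evaluators import MultimodalCorrectnessEvaluator
from strands_evals.types.multimodal import ImageData, MultimodalInput

evaluator = MultimodalCorrectnessEvaluator()

# Reference-free evaluation (the default): a multimodal input with one image.
case = Case(
    input=MultimodalInput(
        text="Describe the chart in this image.",
        media=[ImageData(source="charts/revenue_q3.png", format="png")],  # hypothetical file
    ),
)
result = evaluator.evaluate(case)

# Reference-based evaluation is toggled automatically when expected_output is set.
case_with_reference = case.model_copy(update={"expected_output": "Revenue grew about 12% QoQ."})
result_with_reference = evaluator.evaluate(case_with_reference)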

Related Issues

#128

Documentation PR

strands-agents/docs#674

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

afarntrog (Contributor) left a comment:

I would like to see some tests for this new multimodal functionality. Please add unit and integration tests.

sangminwoo (Collaborator, Author) commented Apr 22, 2026

@afarntrog Added unit and integration tests:

  • test_multimodal.py - 45 tests for ImageData, MultimodalInput, resolve_image_bytes, format detection
  • test_multimodal_output_evaluator.py - 18 tests for base evaluator init, rubric selection, prompt building, evaluate/evaluate_async
  • test_multimodal_specialized_evaluators.py - 23 tests for all four specialized subclasses (Correctness, Faithfulness, InstructionFollowing, OverallQuality)
  • test_multimodal_case_prompt_template.py - 16 tests for media resolution, content block building, prompt composition
  • test_evaluation_report.py - 11 tests for _format_input_for_display and flatten()
  • tests_integ/test_multimodal_output_evaluator.py - 9 integration tests covering MLLM/LLM modes, reference-based/free, multi-image, file path, sync/async, context, and batch

Also, could we add Sungyeon Kim (@sung-yeon-kim) as a collaborator on this repo? He's been co-authoring the work on this PR.

jjbuck (Collaborator) commented Apr 29, 2026

1. Drop include_media entirely

As noted in one of my previous comments [https://github.com//pull/187#discussion_r3048446548], I still think we should drop this boolean flag. Boolean flags on a class are typically an anti-pattern that makes extensibility difficult: they almost always signal that the class has two modes and should either be split into two classes, or the "mode" should be driven by the data. Here it's the latter. The class has a single coherent purpose, i.e., "evaluate a multimodal response with an MLLM judge, using content blocks when media is available." The flag re-introduces optionality that the _build_prompt refactor was specifically designed to eliminate.

This would touch all four specialized evaluators, multimodal_output_evaluator.py, and multimodal_case_prompt_template.py. Additionally, there's a silent corruption path here wherein include_media: false in an evaluator config can round-trip a multimodal case into silent text-only evaluation.

What to do:

  1. Remove include_media from MultimodalOutputEvaluator.__init__, all four specialized subclasses, and compose_multimodal_test_prompt.
  2. Remove the self.include_media attribute.
  3. Make dispatch purely data-driven in compose_multimodal_test_prompt:
    if isinstance(input_, MultimodalInput) and input_.media:
        ...  # build content-block list (text block plus one block per image)
    else:
        # text-only prompt
        if isinstance(input_, MultimodalInput) and not input_.media:
            warnings.warn("MultimodalInput has empty media; evaluating as text-only.", ...)
        ...  # compose the plain-text prompt

Warn only when the input declares itself multimodal but has no media — that's an actual data inconsistency. Don't warn on plain-text inputs (since nothing is wrong with those).

  4. Remove the duplicated warning at the evaluator level (the prompt-composer is the right place for it).

2. Silent corruption on save/load round-trip

File: src/strands_evals/evaluators/multimodal_output_evaluator.py, line ~85

Blocker: silent corruption on save/load round-trip.

Experiment.from_dict (in experiment.py:662) calls Case.model_validate(case_data) without the [MultimodalInput, str] parameterization.
Pydantic has no way to know the original Case was parameterized, so case.input comes back as a raw dict rather than a MultimodalInput.
Then in _build_prompt:

if (self.include_media
    and isinstance(evaluation_case.input, MultimodalInput)  # False — input is a dict
    and not evaluation_case.input.media):

the isinstance guard returns False, and we fall through to text-only mode with no warning. You can verify this by saving a multimodal Case to JSON and reloading it: the case gets evaluated as text-only, with the image data preserved in the file but never reaching the judge. Users see plausible but wrong scores with no indication anything went wrong.

What I'd recommend is

  1. In compose_multimodal_test_prompt, coerce the input before the isinstance dispatch:
      input_ = evaluation_case.input
      if isinstance(input_, dict):
          input_ = MultimodalInput.model_validate(input_)

Do this at one well-defined boundary (the prompt composer) rather than scattering coercion through the evaluators.

  2. Add a round-trip regression test that saves a MultimodalInput-bearing experiment, reloads it, builds the prompt, and asserts the prompt is a list (content blocks), not a str. Asserting on prompt shape is the only way to catch silent fallthroughs like this; asserting on round-trip JSON equality won't catch it because the JSON content is preserved. (Sketch below.)
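
Something along these lines (pytest; names like compose_multimodal_test_prompt, Case, MultimodalInput, and ImageData come from this thread, but import paths, the helper's exact signature, and the content-block shape are assumptions):

# Sketch of the suggested round-trip regression test. Import paths, the
# helper's signature, and the content-block shape are assumptions; only the
# names used in this thread are real.
from strands_evals import Case  # assumed import path
from strands_evals.multimodal_case_prompt_template import compose_multimodal_test_prompt  # assumed module location
from strands_evals.types.multimodal import ImageData, MultimodalInput


def test_multimodal_prompt_survives_json_round_trip():
    case = Case[MultimodalInput, str](
        input=MultimodalInput(
            text="What does the image show?",
            media=[ImageData(source=b"\x89PNG...", format="png")],  # hypothetical inline bytes
        ),
    )

    # Simulate the Experiment.from_dict path: re-validate without the
    # [MultimodalInput, str] parameterization, so input comes back as a dict.
    reloaded = Case.model_validate(case.model_dump())

    prompt = compose_multimodal_test_prompt(reloaded)

    # Assert on prompt shape: content blocks (a list), not a plain string.
    assert isinstance(prompt, list)
    assert any(isinstance(block, dict) and "image" in block for block in prompt)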

3. uses_environment_state kwarg leak

File: src/strands_evals/evaluators/multimodal_output_evaluator.py, __init__

Blocker: uses_environment_state kwarg leak breaks from_dict even with custom_evaluators passed.

MultimodalOutputEvaluator.__init__ calls super().__init__(...) which sets self.uses_environment_state = False. But MultimodalOutputEvaluator.__init__'s own signature doesn't have uses_environment_state. Evaluator.to_dict compares self.__dict__ against inspect.signature(self.__class__.__init__)'s defaults — the attribute isn't in that signature, so it's treated as a non-default value and always serialized.

The result is that MultimodalCorrectnessEvaluator().to_dict() emits uses_environment_state: False, and feeding that dict back through from_dict raises TypeError: MultimodalCorrectnessEvaluator.__init__() got an unexpected keyword argument 'uses_environment_state'.

This fails even when the user passes custom_evaluators=[MultimodalCorrectnessEvaluator]. There's no user workaround today.

Fix: surface uses_environment_state on MultimodalOutputEvaluator.__init__ and forward it. Same for all four specialized subclasses — they need to accept it and pass it up, keeping the default False.

def __init__(
    self,
    rubric: str,
    model: Model | str | None = None,
    include_inputs: bool = True,
    system_prompt: str | None = None,
    reference_suffix: str | None = None,
    uses_environment_state: bool = False,
):
    super().__init__(
        rubric=rubric, model=model,
        system_prompt=system_prompt or MLLM_JUDGE_SYSTEM_PROMPT,
        include_inputs=include_inputs,
        uses_environment_state=uses_environment_state,
    )
    ...

sangminwoo (Collaborator, Author) commented:

Thanks @jjbuck for the detailed review and suggestions!

1. Drop include_media entirely

Dispatch is now data-driven in compose_multimodal_test_prompt:

  • MultimodalInput with media → MLLM mode
  • MultimodalInput with empty media → LLM mode + warning
  • Plain text input → LLM mode, no warning (if someone wants text-only judging, they pass plain-text input instead of a MultimodalInput)

Removed include_media from MultimodalOutputEvaluator, the four specialized subclasses, and compose_multimodal_test_prompt, plus the duplicated warning at the evaluator level.
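
In rough form, the dispatch now looks like this (simplified sketch; the real helper also composes the rubric and reference text, and the field names and content-block shape below are illustrative rather than the exact structure in the PR):

# Simplified sketch of the data-driven dispatch described above; helper
# signature, ImageData field names, and block shapes are assumptions.
import warnings

from strands_evals.types.multimodal import MultimodalInput  # assumed import path


def compose_multimodal_test_prompt(evaluation_case):
    input_ = evaluation_case.input
    if isinstance(input_, dict):
        # Coerce dicts produced by an unparameterized Case.model_validate round-trip.
        input_ = MultimodalInput.model_validate(input_)

    if isinstance(input_, MultimodalInput) and input_.media:
        # MLLM mode: one text block plus one block per image.
        return [{"text": input_.text}] + [
            {"image": {"format": img.format, "source": img.source}} for img in input_.media
        ]

    if isinstance(input_, MultimodalInput):
        # Declared multimodal but carries no media: warn, then evaluate as text-only.
        warnings.warn("MultimodalInput has empty media; evaluating as text-only.")
        return input_.text

    # Plain-text input: LLM mode, no warning.
    return str(input_)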

2. Silent corruption on save/load round-trip

The fix follows your suggestion: coerce the input in compose_multimodal_test_prompt, so there's one place handling it. Also added a regression test: it saves a MultimodalInput, reloads it as a raw dict (simulating the from_dict path), and checks the built prompt is a list with an image block, not a plain string. This way, if the coercion ever breaks again, the test will catch it.

3. uses_environment_state kwarg leak

As you suggested, I've added uses_environment_state: bool = False to MultimodalOutputEvaluator.__init__ and all four subclasses, and forwarded it to super().__init__.

While fixing this, I noticed reference_suffix had the same issue. Fixed it in all four subclasses and added a round-trip test covering all four evaluators.
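
The round-trip check is roughly of this shape (sketch only; it assumes to_dict() emits just constructor kwargs, so the real test may need to filter metadata keys or go through from_dict instead):

# Sketch of the to_dict round-trip check; assumes to_dict() returns only
# constructor kwargs. If it also emits metadata keys (e.g. a class
# identifier), the real test would rebuild through from_dict instead.
import pytest

from strands_evals.evaluators import (
    MultimodalCorrectnessEvaluator,
    MultimodalFaithfulnessEvaluator,
    MultimodalInstructionFollowingEvaluator,
    MultimodalOverallQualityEvaluator,
)


@pytest.mark.parametrize(
    "evaluator_cls",
    [
        MultimodalCorrectnessEvaluator,
        MultimodalFaithfulnessEvaluator,
        MultimodalInstructionFollowingEvaluator,
        MultimodalOverallQualityEvaluator,
    ],
)
def test_evaluator_kwargs_round_trip(evaluator_cls):
    config = evaluator_cls().to_dict()
    # Before the fix, leaked keys like uses_environment_state / reference_suffix
    # made this re-instantiation raise TypeError.
    rebuilt = evaluator_cls(**config)
    assert rebuilt.to_dict() == config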

Thanks again!

@jjbuck jjbuck enabled auto-merge (squash) April 30, 2026 16:37
@jjbuck jjbuck merged commit 4ef54fa into strands-agents:main Apr 30, 2026
13 checks passed