
design: Add 0004-multimodal-i2t proposal#674

Open
sangminwoo wants to merge 6 commits into strands-agents:main from sangminwoo:main

Conversation


@sangminwoo sangminwoo commented Mar 17, 2026

Description

Add design doc for multimodal image-to-text evaluation support in strands-evals SDK.

Introduces MultimodalOutputEvaluator extending OutputEvaluator to enable MLLM-as-a-Judge evaluation for multimodal tasks, starting with image/document-to-text. The evaluator composes multimodal prompts using strands SDK ContentBlock format and supports both reference-free and reference-based evaluation across four dimensions: Overall Quality (P0), Correctness (P0), Faithfulness (P1), and Instruction Following (P1).

Key design decisions:

  • Extends OutputEvaluator by overriding a new _build_prompt() hook on the parent; evaluate() and evaluate_async() are inherited unchanged
  • Same Agent.__call__(prompt, structured_output_model=...) invocation pattern as the parent (accepts both str and list[ContentBlock])
  • MultimodalInput is a Pydantic BaseModel with fields media: ImageData | list[ImageData] | str, instruction: str, and optional context: str
  • Data-driven mode dispatch: a MultimodalInput carrying media yields content blocks; any other input yields a text-only prompt (no include_media flag)
  • Reference handling appends a configurable reference_suffix to the rubric when expected_output is present (_select_rubric()); no parallel *_REF rubric variants
  • Built-in rubrics CORRECTNESS_RUBRIC_V0, FAITHFULNESS_RUBRIC_V0, INSTRUCTION_FOLLOWING_RUBRIC_V0, OVERALL_QUALITY_RUBRIC_V0 + convenience subclasses per dimension (MultimodalOverallQualityEvaluator uses a dimension-specific default suffix)
  • ImageData supports file paths, base64, data URLs, HTTP(S) URLs (auto-fetched via stdlib urllib.request), bytes, and PIL Images.
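The data-driven mode dispatch and ContentBlock composition described above can be sketched as follows. The `ImageData` and `MultimodalInput` stand-ins and the block-dict shapes here are illustrative assumptions drawn from the bullets above, not the SDK's actual definitions:

```python
from dataclasses import dataclass
from typing import Optional, Union

# Stand-in types for illustration; the real classes live in strands-evals.
@dataclass
class ImageData:
    source: str  # file path, base64 string, data URL, or HTTP(S) URL

@dataclass
class MultimodalInput:
    media: Union[ImageData, list, str]
    instruction: str
    context: Optional[str] = None

def build_prompt(case_input) -> list:
    """Data-driven dispatch: a MultimodalInput carrying media yields
    ContentBlock-style dicts; any other input yields a text-only prompt."""
    if isinstance(case_input, MultimodalInput) and case_input.media:
        media = case_input.media if isinstance(case_input.media, list) else [case_input.media]
        blocks = [{"image": {"source": m.source}} for m in media]
        if case_input.context:
            blocks.append({"text": case_input.context})
        blocks.append({"text": case_input.instruction})
        return blocks
    return [{"text": str(case_input)}]  # text-only fallback, no include_media flag
```

The resulting list would then be passed through the same `Agent.__call__(prompt, structured_output_model=...)` path as a plain string, since that invocation accepts both `str` and `list[ContentBlock]`.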

Related Issues

Type of Change

  • New content

Checklist

  • I have read the CONTRIBUTING document
  • My changes follow the project's documentation style
  • I have tested the documentation locally using npm run dev
  • Links in the documentation are valid and working

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sangminwoo sangminwoo marked this pull request as draft March 18, 2026 00:08
@sangminwoo sangminwoo marked this pull request as ready for review March 18, 2026 00:08

* `InputT=dict` is less type-safe than a dataclass (`MultimodalInput` TypedDict provides partial typing)
* Multimodal judge calls are more expensive/slower than text-only (image tokens cost more)
* Remote image sources (S3, HTTP URLs) require user to download before evaluation — no built-in fetching to avoid heavy dependencies (boto3, requests)
Contributor


I think we can probably support remote images in this:

```python
# Define cases with image data in input dict
cases = [Case[dict, str](
    input={"image": ImageData(source="chart.png"), "instruction": "What is the revenue trend?"},
)]
```

Author


Good point. We can support HTTP URLs using urllib.request (stdlib), so no new dependency is needed. For S3 URIs, we can make boto3 an optional dependency:

  • HTTP/HTTPS: auto fetched via urllib.request
  • S3: auto fetched if boto3 is installed, error message otherwise
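The fetching behavior proposed in these two bullets could look roughly like the sketch below. The function name and the data-URL/local-path fallbacks are assumptions for illustration, not the design doc's actual API:

```python
import base64
import importlib.util
import urllib.request
from urllib.parse import urlparse

def fetch_media_bytes(source: str) -> bytes:
    """Resolve a media source string to raw bytes (illustrative sketch)."""
    scheme = urlparse(source).scheme
    if scheme in ("http", "https"):
        # Auto-fetch via stdlib urllib.request; no new dependency needed.
        with urllib.request.urlopen(source) as resp:
            return resp.read()
    if scheme == "s3":
        # boto3 is treated as an optional dependency; error out if missing.
        if importlib.util.find_spec("boto3") is None:
            raise RuntimeError("S3 sources require the optional boto3 dependency")
        import boto3
        parsed = urlparse(source)
        obj = boto3.client("s3").get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
        return obj["Body"].read()
    if source.startswith("data:"):
        # data URL: decode the base64 payload after the comma.
        return base64.b64decode(source.split(",", 1)[1])
    with open(source, "rb") as f:  # assume a local file path otherwise
        return f.read()
```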

Does this make sense to you?

afarntrog previously approved these changes Apr 1, 2026
```python
include_image: bool = True, include_inputs: bool = True):
    super().__init__(rubric=rubric, model=model,
                     system_prompt=system_prompt, include_inputs=include_inputs)
    self.include_image = include_image
```
Contributor


Can we make this more generic to account for other types of objects such as audio or documents? I want to avoid pigeonholing ourselves to just images with the MultimodalOutputEvaluator

Author

@sangminwoo sangminwoo Apr 3, 2026


Updated the doc to reflect broader multimodal support for future use.

Author

Hi @afarntrog, this PR is ready for the final review. I've updated the design doc to reflect your comments: 1/ added support for remote URIs/URLs and 2/ broadened the multimodality class to accommodate future media types. Would appreciate an approval when you get a chance so we can get this merged.
