
design: Add 0004-multimodal-i2t proposal#674

Open
sangminwoo wants to merge 6 commits into strands-agents:main from sangminwoo:main

Conversation


@sangminwoo sangminwoo commented Mar 17, 2026

Description

Add design doc for multimodal image-to-text evaluation support in strands-evals SDK.

Introduces MultimodalOutputEvaluator extending OutputEvaluator to enable MLLM-as-a-Judge evaluation for multimodal tasks, starting with image/document-to-text. The evaluator composes multimodal prompts using strands SDK ContentBlock format and supports both reference-free and reference-based evaluation across four dimensions: Overall Quality (P0), Correctness (P0), Faithfulness (P1), and Instruction Following (P1).

Key design decisions:

  • Extends OutputEvaluator by overriding a new _build_prompt() hook on the parent; evaluate() and evaluate_async() are inherited unchanged
  • Same Agent.__call__(prompt, structured_output_model=...) invocation pattern as the parent (accepts both str and list[ContentBlock])
  • MultimodalInput is a Pydantic BaseModel with fields media: ImageData | list[ImageData] | str, instruction: str, and optional context: str
  • Data-driven mode dispatch: a MultimodalInput carrying media yields content blocks; any other input yields a text-only prompt (no include_media flag)
  • Reference handling appends a configurable reference_suffix to the rubric when expected_output is present (_select_rubric()); no parallel *_REF rubric variants
  • Built-in rubrics CORRECTNESS_RUBRIC_V0, FAITHFULNESS_RUBRIC_V0, INSTRUCTION_FOLLOWING_RUBRIC_V0, OVERALL_QUALITY_RUBRIC_V0 + convenience subclasses per dimension (MultimodalOverallQualityEvaluator uses a dimension-specific default suffix)
  • ImageData supports file paths, base64, data URLs, HTTP(S) URLs (auto-fetched via stdlib urllib.request), bytes, and PIL Images.
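The data-driven mode dispatch and ContentBlock composition described above can be sketched as follows. The `ImageData` and `MultimodalInput` stand-ins and the block-dict shapes here are illustrative assumptions drawn from the bullets above, not the SDK's actual definitions:

```python
from dataclasses import dataclass
from typing import Optional, Union

# Stand-in types for illustration; the real classes live in strands-evals.
@dataclass
class ImageData:
    source: str  # file path, base64 string, data URL, or HTTP(S) URL

@dataclass
class MultimodalInput:
    media: Union[ImageData, list, str]
    instruction: str
    context: Optional[str] = None

def build_prompt(case_input) -> list:
    """Data-driven dispatch: a MultimodalInput carrying media yields
    ContentBlock-style dicts; any other input yields a text-only prompt."""
    if isinstance(case_input, MultimodalInput) and case_input.media:
        media = case_input.media if isinstance(case_input.media, list) else [case_input.media]
        blocks = [{"image": {"source": m.source}} for m in media]
        if case_input.context:
            blocks.append({"text": case_input.context})
        blocks.append({"text": case_input.instruction})
        return blocks
    return [{"text": str(case_input)}]  # text-only fallback, no include_media flag
```

The resulting list would then be passed through the same `Agent.__call__(prompt, structured_output_model=...)` path as a plain string, since that invocation accepts both `str` and `list[ContentBlock]`.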

Related Issues

Type of Change

  • New content

Checklist

  • I have read the CONTRIBUTING document
  • My changes follow the project's documentation style
  • I have tested the documentation locally using npm run dev
  • Links in the documentation are valid and working

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sangminwoo sangminwoo marked this pull request as draft March 18, 2026 00:08
@sangminwoo sangminwoo marked this pull request as ready for review March 18, 2026 00:08

* `InputT=dict` is less type-safe than a dataclass (`MultimodalInput` TypedDict provides partial typing)
* Multimodal judge calls are more expensive/slower than text-only (image tokens cost more)
* Remote image sources (S3, HTTP URLs) require user to download before evaluation — no built-in fetching to avoid heavy dependencies (boto3, requests)
Contributor


I think we can probably support remote images in this:

```python
# Define cases with image data in input dict
cases = [Case[dict, str](
    input={"image": ImageData(source="chart.png"), "instruction": "What is the revenue trend?"},
)]
```

Author


Good point. We can support HTTP URLs using urllib.request (stdlib), so no new dependency is needed. For S3 URIs, we can make boto3 an optional dependency:

  • HTTP/HTTPS: auto fetched via urllib.request
  • S3: auto fetched if boto3 is installed, error message otherwise
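The fetching behavior proposed in these two bullets could look roughly like the sketch below. The function name and the data-URL/local-path fallbacks are assumptions for illustration, not the design doc's actual API:

```python
import base64
import importlib.util
import urllib.request
from urllib.parse import urlparse

def fetch_media_bytes(source: str) -> bytes:
    """Resolve a media source string to raw bytes (illustrative sketch)."""
    scheme = urlparse(source).scheme
    if scheme in ("http", "https"):
        # Auto-fetch via stdlib urllib.request; no new dependency needed.
        with urllib.request.urlopen(source) as resp:
            return resp.read()
    if scheme == "s3":
        # boto3 is treated as an optional dependency; error out if missing.
        if importlib.util.find_spec("boto3") is None:
            raise RuntimeError("S3 sources require the optional boto3 dependency")
        import boto3
        parsed = urlparse(source)
        obj = boto3.client("s3").get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
        return obj["Body"].read()
    if source.startswith("data:"):
        # data URL: decode the base64 payload after the comma.
        return base64.b64decode(source.split(",", 1)[1])
    with open(source, "rb") as f:  # assume a local file path otherwise
        return f.read()
```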

Does this make sense to you?

afarntrog previously approved these changes Apr 1, 2026
```python
include_image: bool = True, include_inputs: bool = True):
    super().__init__(rubric=rubric, model=model,
                     system_prompt=system_prompt, include_inputs=include_inputs)
    self.include_image = include_image
```
Contributor


Can we make this more generic to account for other types of objects such as audio or documents? I want to avoid pigeonholing ourselves to just images with the MultimodalOutputEvaluator

Author

@sangminwoo sangminwoo Apr 3, 2026


Updated the doc to reflect broader multimodal support for future use.

Author

Hi @afarntrog, this PR is ready for the final review. I've updated the design doc to reflect your comments: 1/ added support for remote URIs/URLs and 2/ broadened the multimodality class to accommodate future media types. Would appreciate an approval when you get a chance so we can get this merged.
