design: Add 0004-multimodal-i2t proposal #674
sangminwoo wants to merge 6 commits into strands-agents:main from
Conversation
* `InputT=dict` is less type-safe than a dataclass (`MultimodalInput` TypedDict provides partial typing)
* Multimodal judge calls are more expensive/slower than text-only (image tokens cost more)
* Remote image sources (S3, HTTP URLs) require the user to download them before evaluation; there is no built-in fetching, to avoid heavy dependencies (boto3, requests)
I think we can probably support remote images in this:
# Define cases with image data in input dict
cases = [Case[dict, str](
    input={"image": ImageData(source="chart.png"), "instruction": "What is the revenue trend?"},
)]
Good point. We can support HTTP URLs using `urllib.request` (stdlib), so no new dependency is needed. For S3 URIs, we can make `boto3` an optional dependency:

- HTTP/HTTPS: auto-fetched via `urllib.request`
- S3: auto-fetched if `boto3` is installed, with a clear error message otherwise

Does this make sense to you?
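To make that concrete, here is a rough sketch of how the fetching could work; `_fetch_remote` is a hypothetical helper for illustration, not code from the design doc:

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def _fetch_remote(source: str) -> bytes:
    """Hypothetical helper: resolve a remote image source to raw bytes."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # HTTP(S) is fetched with the standard library, so no new dependency.
        with urlopen(source) as response:
            return response.read()
    if parsed.scheme == "s3":
        try:
            import boto3  # optional dependency, only needed for s3:// sources
        except ImportError as exc:
            raise ImportError(
                "Fetching s3:// sources requires the optional boto3 dependency"
            ) from exc
        bucket, key = parsed.netloc, parsed.path.lstrip("/")
        return boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    raise ValueError(f"Unsupported remote source: {source}")
```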
include_image: bool = True, include_inputs: bool = True):
    super().__init__(rubric=rubric, model=model,
                     system_prompt=system_prompt, include_inputs=include_inputs)
    self.include_image = include_image
Can we make this more generic to account for other types of objects, such as audio or documents? I want to avoid pigeonholing ourselves to just images with the `MultimodalOutputEvaluator`.
Updated the doc to reflect broader multimodal support for future use.
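For illustration, one direction for keeping prompt construction media-agnostic instead of image-specific; the `ImageData` stand-in, its fields, and the block dict shapes below are assumptions, not the strands SDK's confirmed `ContentBlock` schema:

```python
from dataclasses import dataclass


@dataclass
class ImageData:
    """Stand-in for the proposal's ImageData; the fields here are assumed."""
    source: str
    format: str = "png"

    def to_bytes(self) -> bytes:
        # Illustrative only: read a local file path.
        with open(self.source, "rb") as f:
            return f.read()


def _media_to_blocks(media) -> list[dict]:
    """Turn media payloads into ContentBlock-style dicts.

    The dict shapes follow the Bedrock Converse style; treat them as an
    assumption and check the strands SDK's ContentBlock definition.
    """
    items = media if isinstance(media, list) else [media]
    blocks = []
    for item in items:
        if isinstance(item, ImageData):
            blocks.append(
                {"image": {"format": item.format, "source": {"bytes": item.to_bytes()}}}
            )
        elif isinstance(item, str):
            blocks.append({"text": item})
        else:
            # Audio/document/video support would add branches here rather than
            # widening the evaluator's constructor with per-modality flags.
            raise TypeError(f"Unsupported media type: {type(item).__name__}")
    return blocks
```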
Hi @afarntrog, this PR is ready for final review. I've updated the design doc to reflect your comments: 1/ added support for remote URIs/URLs, and 2/ broadened the multimodal evaluator class to accommodate future media types. Would appreciate an approval when you get a chance so we can get this merged.
Description
Add design doc for multimodal image-to-text evaluation support in strands-evals SDK.
Introduces `MultimodalOutputEvaluator`, extending `OutputEvaluator` to enable MLLM-as-a-Judge evaluation for multimodal tasks, starting with image/document-to-text. The evaluator composes multimodal prompts using the strands SDK `ContentBlock` format and supports both reference-free and reference-based evaluation across four dimensions: Overall Quality (P0), Correctness (P0), Faithfulness (P1), and Instruction Following (P1).

Key design decisions:

* Extends `OutputEvaluator` by overriding a new `_build_prompt()` hook on the parent; `evaluate()` and `evaluate_async()` are inherited unchanged
* Uses the same `Agent.__call__(prompt, structured_output_model=...)` invocation pattern as the parent (accepts both `str` and `list[ContentBlock]`)
* `MultimodalInput` is a Pydantic `BaseModel` with fields `media: ImageData | list[ImageData] | str`, `instruction: str`, and optional `context: str`
* A `MultimodalInput` carrying media yields content blocks; any other input yields a text-only prompt (no `include_media` flag)
* Appends a `reference_suffix` to the rubric when `expected_output` is present (`_select_rubric()`); no parallel `*_REF` rubric variants
* Provides `CORRECTNESS_RUBRIC_V0`, `FAITHFULNESS_RUBRIC_V0`, `INSTRUCTION_FOLLOWING_RUBRIC_V0`, and `OVERALL_QUALITY_RUBRIC_V0`, plus convenience subclasses per dimension (`MultimodalOverallQualityEvaluator` uses a dimension-specific default suffix)
* `ImageData` supports file paths, base64, data URLs, HTTP(S) URLs (auto-fetched via stdlib `urllib.request`), bytes, and PIL Images
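For orientation, a rough usage sketch of the surface described above; the module paths, constructor signatures, and field names are assumptions about the proposal rather than a confirmed API:

```python
# Hypothetical usage of the proposed API; import paths are assumptions.
from strands_evals import Case
from strands_evals.evaluators import MultimodalOverallQualityEvaluator
from strands_evals.types import ImageData, MultimodalInput

case = Case[MultimodalInput, str](
    input=MultimodalInput(
        media=ImageData(source="https://example.com/q3-revenue-chart.png"),
        instruction="What is the revenue trend?",
    ),
    # expected_output is optional: when present, the reference_suffix is
    # appended to the rubric; when absent, evaluation is reference-free.
    expected_output="Revenue grows quarter over quarter.",
)

evaluator = MultimodalOverallQualityEvaluator()
# evaluate() / evaluate_async() are inherited unchanged from OutputEvaluator.
```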
Related Issues

Type of Change
Checklist
`npm run dev`

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.