Self-contained web application environments for evaluating browser-use agents. Each app is a standalone HTML/CSS/JS application served by a small Python HTTP server.
Based on WebArena-Infinity.
# Install dependencies
bash setup.sh
# Run a single app
cd apps/gmail && python3 server.py --port 8000
# Run all apps with a hub page
python3 serve_all.py --port 9000 --demo

Open http://localhost:8000 (single app) or http://localhost:9000 (hub) in your browser.
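
When a script launches a server and then immediately drives a browser against it, it can help to wait until the port actually answers. The sketch below is one possible readiness check using only the standard library. The function name `wait_for_server` and its timeout defaults are illustrative assumptions, not part of this repository.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout=10.0, interval=0.5):
    """Poll `url` until it answers over HTTP or `timeout` seconds elapse.

    Hypothetical helper (not shipped with the apps). Returns True as soon as
    any HTTP response is received, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server replied with an error status, so it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: not listening yet, retry.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # Ports taken from the commands above: 8000 for a single app, 9000 for the hub.
    for url in ("http://localhost:8000", "http://localhost:9000"):
        print(url, "ready" if wait_for_server(url, timeout=5) else "not reachable")
```

The check treats any HTTP response as "up" because it only needs to know that the process is listening, not that a particular page exists.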
| App | Description |
|---|---|
| elation-clinical-records | EHR clinical records management |
| elation-patient-communication | EHR patient messaging |
| elation-prescriptions | EHR prescription management |
| figma-slides | Slide deck editor (Figma-style) |
| figma-text-and-typography | Text/typography editor (Figma-style) |
| gitlab-plan-and-track | Project planning and issue tracking |
| gmail | Email client |
| gmail-accounts-and-contacts | Gmail account and contacts management |
| google-sheets | Spreadsheet editor with formulas, charts, and multi-sheet workbooks |
| handshake-career-exploration | Career exploration platform |
| linear-account-settings | Project management account settings |
| paypal-my-wallet | Digital wallet management |
| superhuman-general | Email client (Superhuman-style) |
| xero-invoicing | Invoice management |
# Run a single task
uv run python evaluation/run_eval_parallel.py \
--model gpt \
--task-id task_e1 \
--workers 1 \
--web-app apps/gmail
# Run all easy tasks with visible browser
uv run python evaluation/run_eval_parallel.py \
--model gpt \
--difficulty easy \
--workers 1 \
--web-app apps/google-sheets \
--headed

| Flag | Model | API Key |
|---|---|---|
| gpt | GPT-4o | OPENAI_API_KEY |
| gemini-flash | Gemini Flash 3 | GOOGLE_API_KEY |
| gemini-pro | Gemini Pro 3 | GOOGLE_API_KEY |
| claude | Claude Sonnet 4.6 | ANTHROPIC_API_KEY |
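
The model table and the evaluation commands can be combined into a small pre-flight script. The following is a minimal sketch, not part of the repository: it checks that the environment variable for the chosen `--model` value is set, then assembles the `run_eval_parallel.py` command from the flags shown in this README. The helper names `missing_api_key` and `build_eval_command` are hypothetical, and the sketch assumes the script accepts the flags in the order used above.

```python
import os

# Mapping copied from the table above: --model value -> required environment variable.
MODEL_API_KEYS = {
    "gpt": "OPENAI_API_KEY",
    "gemini-flash": "GOOGLE_API_KEY",
    "gemini-pro": "GOOGLE_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
}


def missing_api_key(model, env=None):
    """Return the name of the required variable if it is unset, else None."""
    env = os.environ if env is None else env
    if model not in MODEL_API_KEYS:
        raise ValueError(f"unknown model flag: {model!r}")
    var = MODEL_API_KEYS[model]
    return None if env.get(var) else var


def build_eval_command(model, web_app, task_id=None, difficulty=None,
                       workers=1, headed=False):
    """Assemble the argument list using only flags that appear in this README."""
    cmd = ["uv", "run", "python", "evaluation/run_eval_parallel.py",
           "--model", model]
    if task_id is not None:
        cmd += ["--task-id", task_id]
    if difficulty is not None:
        cmd += ["--difficulty", difficulty]
    cmd += ["--workers", str(workers), "--web-app", web_app]
    if headed:
        cmd.append("--headed")
    return cmd


if __name__ == "__main__":
    missing = missing_api_key("gpt")
    if missing:
        raise SystemExit(f"Set {missing} before running the evaluation.")
    print(" ".join(build_eval_command("gpt", "apps/gmail", task_id="task_e1")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to start the run.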
Add --test-mode when launching a server to get an in-browser test panel for manually running and verifying tasks:
cd apps/google-sheets && python3 server.py --port 8000 --test-mode