Compare commits

...

35 Commits

Author SHA1 Message Date
Neel Gupta
74b7ec397e fix: improve click eval point parsing 2026-04-29 17:23:54 +01:00
Neel Gupta
6dec35475b feat: Added new datapoints 2026-04-29 16:19:00 +01:00
Neel Gupta
f9c56546cc fix: include raw preview for unparsed clicks 2026-04-29 14:14:39 +01:00
Neel Gupta
670b8d9745 fix: recover constructor-wrapped click points 2026-04-28 17:36:04 +01:00
Neel Gupta
76b6869219 refactor: trim click eval dead code 2026-04-28 17:00:02 +01:00
Neel Gupta
95303f4374 fix: make click extraction model tolerant 2026-04-28 16:56:42 +01:00
Neel Gupta
bccccce0a7 fix: harden click model adapters 2026-04-28 15:57:39 +01:00
Neel Gupta
cda9965927 fix: sort annotated predictions by distance 2026-04-28 15:53:21 +01:00
Neel Gupta
31eb93bdf8 feat: task major -> model major 2026-04-28 15:49:40 +01:00
Neel Gupta
3386f0a5ce feat: more tasks 2026-04-28 12:36:45 +01:00
Neel Gupta
b8ed09eaba fix: bump up the res 2026-04-27 23:51:33 +01:00
Neel Gupta
907f10e7c8 fix: OAI model
fix: OAI reasoning
2026-04-27 23:43:46 +01:00
Neel Gupta
db8bec9b59 fix: judge error logging + ss rez 2026-04-27 23:27:43 +01:00
Neel Gupta
2879574219 fix: screenshot res 2026-04-27 23:00:35 +01:00
Neel Gupta
7125029049 fix: Added GT 2026-04-27 21:21:24 +01:00
Neel Gupta
e6c6c29472 fix: remove GT fallback and wrong OAI req 2026-04-27 21:03:58 +01:00
Neel Gupta
d35c02b223 fix: judge models 2026-04-27 18:15:42 +01:00
Neel Gupta
b0c5383407 feat: allow limiting to N models 2026-04-27 17:42:44 +01:00
Neel Gupta
bf9fc96f42 fix: GLM bugfix 2026-04-27 15:13:07 +01:00
Neel Gupta
307c02c44a feat: More HN tasks 2026-04-27 12:26:12 +01:00
Neel Gupta
d865931db3 fix: Use all 3 judges 2026-04-27 12:08:03 +01:00
Neel Gupta
4e55be8b9f feat: Added 3 judges + updates screenshot 2026-04-27 11:38:52 +01:00
Neel Gupta
a39dfa52f3 fix: last attempt for fix 2026-04-26 13:43:04 +01:00
Neel Gupta
a4584142a1 feat: Added even more models 2026-04-26 13:21:38 +01:00
Neel Gupta
f538615a4f fix: more patches 2026-04-26 13:07:29 +01:00
Neel Gupta
08968ff16e feat: gemini computer use model 2026-04-26 12:58:00 +01:00
Neel Gupta
dca8b8555f fix: infigui model 2026-04-26 12:52:24 +01:00
Neel Gupta
98cc128d3b fix: more deps & model bugs 2026-04-26 12:42:11 +01:00
Neel Gupta
a6ae8bba56 fix: more bugs w/ models 2026-04-26 11:52:52 +01:00
Neel Gupta
a5c3769e4e fix: bugs w/ model & ernie 2026-04-26 11:31:02 +01:00
Neel Gupta
4051fe189b fix: Clear VRAM after loading models 2026-04-26 11:08:41 +01:00
Neel Gupta
eb08cac743 feat: CPU offload + fp16 + 4bit 2026-04-26 09:57:43 +01:00
Neel Gupta
144a10946d fix: cuda timeout 2026-04-26 09:57:43 +01:00
Neel Gupta
e30f29dd06 fix: deps
fix: downgrade python
2026-04-26 09:57:42 +01:00
Neel Gupta
b4e08d3a13 feat: initial click eval harness commit 2026-04-25 23:06:56 +01:00
39 changed files with 6112 additions and 1 deletion

1
.gitignore vendored
View File

@@ -33,3 +33,4 @@ packages/browseros/build/tools/
# AI SDK DevTools traces
.devtools/
.omc/
packages/browseros-agent/tools/dogfood/browseros-dogfood

View File

@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
name = "browseros"
version = "0.0.1"
description = "BrowserOS Build System"
-requires-python = ">=3.12"
+requires-python = ">=3.11"
dependencies = [
"click>=8.0.0",
"typer>=0.12.0",

8
prototypes/click_eval/.gitignore vendored Normal file
View File

@@ -0,0 +1,8 @@
/.venv/
/.env
/runs/
/.pytest_cache/
/.ruff_cache/
/src/*.egg-info/
__pycache__/
*.pyc

View File

@@ -0,0 +1,302 @@
# Click Eval Prototype
Tiny VLM click-point evaluation harness.
## Layout
- `src/click_eval/`: runtime package and CLI
- `tests/`: fixture-based tests with no network calls
- `examples/`: default task/model config files and sample screenshot
- `runs/`: suggested output location
## Input
Create a JSONL task file:
```jsonl
{"task_id":"chat_1","image_path":"screenshots/page.png","instruction":"click the chat button"}
{"task_id":"send_1","image_path":"screenshots/page.png","instruction":"click send","gt_point":[510,742]}
```
`image_path` is resolved relative to the task file. Configured judge model(s)
are called and cached in the run output. If `gt_point` is absent, the task is
left unscored: judge outputs are recorded for inspection, but no median fallback
GT is created and no green `GT` marker is rendered. If `gt_point` is present,
that provided point is the scoring GT.
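To make the manifest semantics concrete, here is a minimal loading sketch. It is
illustrative only; the repo's real loader is `load_tasks` in
`src/click_eval/io.py`:
```python
import json
from pathlib import Path

def load_tasks_sketch(tasks_path: Path) -> list[dict]:
    """Illustrative sketch; see src/click_eval/io.py for the real loader."""
    tasks = []
    for line in tasks_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        # image_path is resolved relative to the task file itself
        row["image_path"] = (tasks_path.parent / row["image_path"]).resolve()
        # gt_point stays None when absent: the task is recorded but unscored
        row["gt_point"] = row.get("gt_point")
        tasks.append(row)
    return tasks
```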
The bundled Hacker News sample screenshot is captured at `1920x1152` with DPR 1:
a higher-resolution desktop sample that is still far smaller than a 4K capture,
which keeps it tractable for local GUI models.
The default model config is `examples/models.json`. The abbreviated cloud/API
portion is:
```json
{
"judge_models": [
{
"name": "openai-computer-use-judge",
"provider": "openai_computer_use",
"model": "computer-use-preview"
},
{
"name": "claude-opus-4.7-judge",
"provider": "openrouter",
"model": "anthropic/claude-opus-4.7"
},
{
"name": "gpt-5.5-judge",
"provider": "openrouter",
"model": "openai/gpt-5.5"
},
{
"name": "gemini-3.1-pro-judge",
"provider": "openrouter",
"model": "google/gemini-3.1-pro-preview"
}
],
"candidate_models": [
{
"name": "qwen3-vl-8b-instruct",
"provider": "openrouter",
"model": "qwen/qwen3-vl-8b-instruct"
},
{
"name": "qwen3-vl-8b-thinking",
"provider": "openrouter",
"model": "qwen/qwen3-vl-8b-thinking"
},
{
"name": "ui-tars-1.5-7b",
"provider": "openrouter",
"model": "bytedance/ui-tars-1.5-7b"
},
{"name": "glm-4.5v", "provider": "openrouter", "model": "z-ai/glm-4.5v"},
{"name": "glm-4.6v", "provider": "openrouter", "model": "z-ai/glm-4.6v"},
{
"name": "glm-5v-turbo",
"provider": "openrouter",
"model": "z-ai/glm-5v-turbo"
},
{"name": "moondream", "provider": "moondream", "model": "moondream-cloud"},
{
"name": "gemini-3.1-pro",
"provider": "openrouter",
"model": "google/gemini-3.1-pro-preview"
}
]
}
```
The `name` is only the short label shown in plots and summary files. OpenRouter
is the default provider, but the examples keep it explicit for routability
audits. The default judge IDs were checked against
`https://openrouter.ai/api/v1/models` on 2026-04-27 and are:
- `anthropic/claude-opus-4.7`
- `openai/gpt-5.5`
- `google/gemini-3.1-pro-preview`
The active OpenRouter click-model shortlist was checked against
`https://openrouter.ai/api/v1/models` on 2026-04-26 and includes:
- `qwen/qwen3-vl-8b-instruct`
- `qwen/qwen3-vl-8b-thinking`
- `bytedance/ui-tars-1.5-7b`
- `z-ai/glm-4.5v`
- `z-ai/glm-4.6v`
- `z-ai/glm-5v-turbo`
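A quick way to repeat that routability check is to diff the configured IDs
against the public catalog. A hedged sketch (the `data[].id` response shape
follows OpenRouter's documented models API):
```python
import requests

def routable_ids() -> set[str]:
    # The models catalog is public; no API key is needed to list it.
    resp = requests.get("https://openrouter.ai/api/v1/models", timeout=30)
    resp.raise_for_status()
    return {entry["id"] for entry in resp.json()["data"]}

wanted = {
    "anthropic/claude-opus-4.7",
    "openai/gpt-5.5",
    "google/gemini-3.1-pro-preview",
}
missing = wanted - routable_ids()
if missing:
    print("Not currently routable:", sorted(missing))
```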
Shortlist models not found in the current OpenRouter catalog are documented
below as `local_hf` candidates. They are included in `examples/models.json`, but
the provider checks for a CUDA/NVIDIA GPU before importing local inference
dependencies or downloading weights. If no usable CUDA GPU is present, they are
recorded as skipped with the CUDA detection reason.
| Model | Hosting | Setup needed |
| --- | --- | --- |
| `Qwen/Qwen3-VL-2B-Instruct` | Hugging Face | Included as `local_hf`; small generic Qwen3-VL baseline. |
| `Qwen/Qwen3-VL-2B-Thinking` | Hugging Face | Included as `local_hf`; small generic Qwen3-VL thinking baseline. |
| `Qwen/Qwen2.5-VL-3B-Instruct` | Hugging Face | Included as `local_hf`; generic Qwen2.5-VL baseline using a relative point prompt. |
| `mPLUG/GUI-Owl-1.5-2B-Instruct` | Hugging Face | Included as `local_hf`; Qwen3-VL GUI-agent adapter. |
| `mPLUG/GUI-Owl-1.5-4B-Instruct` | Hugging Face | Included as `local_hf`; Qwen3-VL GUI-agent adapter. |
| `mPLUG/GUI-Owl-1.5-8B-Instruct` | Hugging Face | Included as `local_hf`; Qwen3-VL GUI-agent adapter. |
| `vocaela/KV-Ground-8B-BaseGuiOwl1.5-0315` | Hugging Face | Included as `local_hf`; high-performing ScreenSpot-Pro GUI grounder, non-commercial license. |
| `inclusionAI/UI-Venus-1.5-2B` | Hugging Face | Included as `local_hf`; small Qwen3-VL GUI agent. |
| `inclusionAI/UI-Venus-1.5-8B` | Hugging Face | Included as `local_hf`; strong Apache-2.0 GUI agent/grounder. |
| `Hcompany/Holo2-4B` | Hugging Face | Included as `local_hf`; Qwen3-VL computer-use model. |
| `Hcompany/Holo2-8B` | Hugging Face | Included as `local_hf`; Qwen3-VL computer-use model. |
| `Salesforce/GTA1-7B` | Hugging Face | Included as `local_hf`; outputs `pyautogui.click(...)` coordinates after Qwen smart resize. |
| `xlangai/OpenCUA-7B` | Hugging Face | Included as `local_hf`; outputs `pyautogui.click(...)` coordinates after Qwen smart resize. |
| `InfiX-ai/InfiGUI-G1-3B` | Hugging Face | Included as `local_hf`; outputs JSON `point_2d` coordinates after Qwen smart resize. |
| `InfiX-ai/InfiGUI-G1-7B` | Hugging Face | Included as `local_hf`; outputs JSON `point_2d` coordinates after Qwen smart resize. |
| `tencent/POINTS-GUI-G` | Hugging Face | Included as `local_hf`; outputs normalized `(x, y)` coordinates and needs `WePOINTS`. |
| `Tongyi-MAI/MAI-UI-8B` | Hugging Face | Included as `local_hf`; Qwen3-VL GUI agent, may need `HF_TOKEN` depending on access. |
| `allenai/MolmoPoint-GUI-8B` | Hugging Face | Included as `local_hf`; outputs pointing tokens, so model-specific parser tuning may improve results. |
| `microsoft/Fara-7B` | Hugging Face and Microsoft Foundry | Included as `local_hf`; Foundry use would need endpoint credentials and a separate adapter. |
| `ServiceNow/GroundNext-7B-V0` | Hugging Face and Azure AI Foundry | Included as `local_hf`; Azure use would need endpoint credentials and a separate adapter. |
| `osunlp/UGround-V1-2B` | Hugging Face | Included as `local_hf`; smaller UGround model using normalized coordinates. |
| `osunlp/UGround-V1-7B` | Hugging Face | Included as `local_hf`; the model card also documents vLLM OpenAI-compatible serving. |
| `ByteDance-Seed/UI-TARS-2B-SFT` | Hugging Face | Included as `local_hf`; small UI-TARS model using the normalized point adapter. |
| `zonghanHZH/ZonUI-3B` | Hugging Face | Included as `local_hf`; lightweight Qwen2.5-VL GUI grounding model. |
| `Yuqi-Zhou/GUI-G1-3B-v1` | Hugging Face | Included as `local_hf`; 3B GUI grounding model using JSON `point_2d` output. |
| `xlangai/Jedi-3B-1080p` | Hugging Face | Included as `local_hf`; OSWorld-G Qwen2.5-VL click/tool-call model. |
| `xlangai/Jedi-7B-1080p` | Hugging Face | Included as `local_hf`; larger Jedi click/tool-call model. |
| `Tongyi-MiA/UI-Ins-7B` | Hugging Face | Included as `local_hf`; GUI grounding model using tool-call coordinates. |
| `osunlp/GUI-Drag-7B` | Hugging Face | Included as `local_hf`; drag-focused GUI model with preserved click behavior. |
| `OS-Copilot/OS-Atlas-Base-4B` | Hugging Face | Included as `local_hf`; outputs normalized coordinates/boxes, so parser tuning may improve results. |
| `OS-Copilot/OS-Atlas-Base-7B` | Hugging Face | Included as `local_hf`; outputs normalized coordinates/boxes, so parser tuning may improve results. |
| `showlab/ShowUI-2B` | Hugging Face | Included as `local_hf`; parser tuning may be needed for action-dictionary outputs. |
| `Qwen/Qwen3-VL-4B-Instruct` | Hugging Face | Included as `local_hf`; not currently routable through OpenRouter. |
| `Qwen/Qwen3-VL-4B-Thinking` | Hugging Face | Included as `local_hf`; not currently routable through OpenRouter. |
The 2026-04-26 pass also found promising custom-head local models such as
`microsoft/GUI-Actor-3B-Qwen2.5-VL`, `inclusionAI/V2P-7B`, and
`TESS-Computer/qwen-click-dit`. Those are not in the default list yet because
their model cards require custom Python model classes or action heads beyond
plain `transformers` loading.
For HF-local models, install optional local dependencies with:
```bash
uv sync --extra local
```
This installs `torch`, `torchvision`, `transformers`, `accelerate`, `einops`,
`qwen-vl-utils`, `safetensors`, `timm`, `sentencepiece`, `protobuf`,
`requests`, `tiktoken`, and `WePOINTS`. `torch>=2.6` is required for models
that still ship PyTorch `.bin` weights because older PyTorch releases are
blocked by the CVE-2025-32434 `torch.load` guard. MolmoPoint also expects
`einops`, and the Qwen-derived GUI models use `qwen-vl-utils` for image
preprocessing. POINTS-GUI-G requires FlashAttention 2 at runtime, but it is not
installed by the local extra because its native build must match the active
Python, PyTorch, and CUDA environment.
The local provider is intentionally conservative: it only runs when PyTorch can
use CUDA, and it skips non-offloaded models whose estimated VRAM exceeds the
detected GPU memory. Models marked `allow_cpu_offload` use Transformers
`device_map="auto"`; other local models load directly onto `cuda:0`. This means
that in a misconfigured container where `nvidia-smi` works but `torch.cuda` does
not, models are skipped instead of silently running an 8B model on CPU. Local
generation uses the CLI `--timeout` value as the Transformers `max_time` budget.
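A sketch of that gate, assuming the provider raises the repo's `ModelSkipped`
(from `src/click_eval/contracts.py`) with the detection reason; the real
provider's checks may differ in detail:
```python
import torch  # the real provider imports this lazily, only for local models

from click_eval.contracts import ModelSkipped

def gate_local_model(estimated_vram_gb: float | None, allow_cpu_offload: bool) -> None:
    if not torch.cuda.is_available():
        # nvidia-smi may still work here; torch.cuda is the deciding check
        raise ModelSkipped("no usable CUDA GPU: torch.cuda.is_available() is False")
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if not allow_cpu_offload and estimated_vram_gb and estimated_vram_gb > total_gb:
        raise ModelSkipped(
            f"estimated {estimated_vram_gb} GiB exceeds {total_gb:.1f} GiB of GPU memory"
        )
```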
Several model-specific adapters are included for MolmoPoint, GroundNext,
UGround, OS-Atlas, ShowUI, Qwen3-VL/MAI-UI, OpenCUA, GTA1, InfiGUI, and POINTS-GUI-G.
Local model configs use `fp16` and CPU offload for the larger checkpoints
instead of quantization. MolmoPoint is the
exception: its official inference path uses BF16 autocast, and FP16 overflows in
its pointing-token generation path. Timing for offloaded models will include
CPU-GPU transfer overhead. The local runner keeps one HF model loaded while it
runs every task for that model, then unloads it and clears the CUDA cache before
the next local model. For gated/private downloads, set `HF_TOKEN`. For
Azure/Foundry-hosted variants, expect an endpoint URL plus API key and a
dedicated provider adapter.
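The load/run/unload cycle looks roughly like this. This is a sketch: the
concrete model class varies per adapter, and `AutoModelForVision2Seq` is only a
stand-in here:
```python
import gc

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

def run_one_local_model(model_id: str, tasks: list, allow_cpu_offload: bool) -> None:
    kwargs: dict = {"torch_dtype": torch.float16}
    if allow_cpu_offload:
        kwargs["device_map"] = "auto"  # Transformers spreads layers across GPU/CPU
    model = AutoModelForVision2Seq.from_pretrained(model_id, **kwargs)
    if not allow_cpu_offload:
        model = model.to("cuda:0")  # non-offloaded models go straight to the GPU
    processor = AutoProcessor.from_pretrained(model_id)
    try:
        for task in tasks:
            ...  # one model stays resident for every task
    finally:
        del model, processor
        gc.collect()
        torch.cuda.empty_cache()  # clear VRAM before loading the next local model
```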
Moondream candidates use a provider-qualified entry. OpenAI Computer Use judges
or candidates use `provider: "openai_computer_use"` and require
`OPENAI_API_KEY`. The default Gemini candidate uses OpenRouter, so it only needs
`OPENROUTER_API_KEY`.
```json
{
"candidate_models": [
{
"name": "openai-computer-use-judge",
"provider": "openai_computer_use",
"model": "computer-use-preview"
},
{
"name": "moondream",
"provider": "moondream",
"model": "moondream-cloud"
},
{
"name": "gemini-3.1-pro",
"provider": "openrouter",
"model": "google/gemini-3.1-pro-preview"
}
]
}
```
## Run
```bash
cd prototypes/click_eval
uv sync
export OPENROUTER_API_KEY=...
# Optional, for Moondream candidates:
export MOONDREAM_API_KEY=...
# Optional, for OpenAI Computer Use judges/candidates:
export OPENAI_API_KEY=...
uv run click-eval run
```
Without `uv`, use:
```bash
cd prototypes/click_eval
python -m pip install -r requirements.txt
python -m click_eval run
```
On an interactive terminal, `run` shows tqdm progress bars for tasks and model
calls. In non-interactive output, it prints plain status lines instead. Use
`--no-progress` to suppress both.
Use `--limit N` to run only the first N tasks, and `--model-limit N` to run
only the first N candidate models.
The CLI also loads `OPENAI_API_KEY`, `MOONDREAM_API_KEY`, `GEMINI_API_KEY`,
`GOOGLE_API_KEY`, and `OPENROUTER_API_KEY` from a local `.env` file in
`prototypes/click_eval/` or the current working directory. `GEMINI_API_KEY` is
only needed if you manually add a native `provider: "gemini"` entry.
Moondream calls use `POST https://api.moondream.ai/v1/point` with the screenshot
as a base64 data URL and the click instruction converted to an object query.
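The call shape is roughly as follows. Note this is an assumption-laden sketch:
the auth header name and response handling are from memory of the Moondream
docs, not from this repo:
```python
import base64
from pathlib import Path

import requests

def moondream_point(image_path: Path, query: str, api_key: str) -> dict:
    data_url = "data:image/png;base64," + base64.b64encode(
        image_path.read_bytes()
    ).decode("ascii")
    resp = requests.post(
        "https://api.moondream.ai/v1/point",
        headers={"X-Moondream-Auth": api_key},  # assumption: header name
        json={"image_url": data_url, "object": query},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```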
OpenRouter Claude calls resize screenshots client-side before upload when the
image exceeds Claude's no-resize long-edge limit, then remap parsed coordinates
from the resized image back to original screenshot pixels. Claude Opus 4.7 uses a
2576 px long-edge target; older Claude models use 1568 px.
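The remap is a single uniform scale factor; a minimal sketch:
```python
def remap_to_original(
    x: float, y: float, orig_w: int, orig_h: int, long_edge_limit: int = 2576
) -> tuple[float, float]:
    # scale < 1 only when the screenshot had to be downscaled before upload;
    # parsed model coordinates are scaled back by the inverse factor.
    scale = min(1.0, long_edge_limit / max(orig_w, orig_h))
    return x / scale, y / scale
```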
OpenAI Computer Use calls use the Responses API with `computer_use_preview`,
request `detail: "original"`, send the original screenshot dimensions, and
return a `click` action in that display coordinate space.
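A hedged sketch of that call with the OpenAI Python SDK; the field names follow
the published computer-use pattern and may differ from the repo's adapter:
```python
import base64

from openai import OpenAI

def openai_cu_click(image_png: bytes, width: int, height: int, instruction: str):
    client = OpenAI()  # reads OPENAI_API_KEY
    b64 = base64.b64encode(image_png).decode("ascii")
    return client.responses.create(
        model="computer-use-preview",
        truncation="auto",  # required by the computer-use tool
        tools=[{
            "type": "computer_use_preview",
            "display_width": width,    # original screenshot dimensions
            "display_height": height,
            "environment": "browser",
        }],
        input=[{
            "role": "user",
            "content": [
                {"type": "input_text", "text": instruction},
                {"type": "input_image", "detail": "original",
                 "image_url": f"data:image/png;base64,{b64}"},
            ],
        }],
    )  # the click arrives as a computer_call item in the response output
```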
The default Gemini candidate uses OpenRouter's regular multimodal chat API and
the same JSON point prompt as the other OpenRouter VLMs. Native Gemini Computer
Use support remains available for manually configured `provider: "gemini"`
entries.
During a run, the CLI first resolves task GT/judge overlays, then runs
candidate predictions model-major: one model is evaluated across all tasks
before moving to the next model. It also prints compact status lines for GT
resolution, provider/model calls, prediction failures, and the output directory.
OpenRouter GT judges are sent concurrently in bounded batches of 4. OpenRouter
candidate calls are also concurrent across tasks for the current model in
batches of 4. Local HF/GPU candidates stay synchronous and serial per model to
avoid GPU memory contention; Moondream and native Gemini provider calls remain
synchronous.
Outputs:
- `resolved_tasks.jsonl`: task manifest with `gt_point` when provided
- `predictions.jsonl`: raw candidate responses and parsed points
- `scores.csv`: per-task L2 distances when a scoring GT is available
- `summary.json`: aggregate metrics per model
- `annotated/*.png`: screenshot overlays with the scoring `GT` when present,
judge points (`GT1`, `GT2`, ...), and candidate predictions
`predictions.jsonl`, `scores.csv`, and `summary.json` include per-model
`duration_seconds` timing fields. Skipped local models are marked with
`skipped=true` and an error message explaining the skip reason.
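For reference, the per-task score is plain pixel-space Euclidean distance (the
real implementation lives in `src/click_eval/scoring.py`):
```python
import math

def l2_distance(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    # distance in screenshot pixels between a prediction and the scoring GT
    return math.hypot(pred[0] - gt[0], pred[1] - gt[1])
```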
By default, `click-eval run` uses:
- `examples/tasks.jsonl`
- `examples/models.json`
- `runs/<timestamp>`
## Development
```bash
cd prototypes/click_eval
uv run pytest
uv run ruff check .
```

View File

@@ -0,0 +1,440 @@
{
"judge_models": [
{
"name": "openai-computer-use-judge",
"provider": "openai_computer_use",
"model": "computer-use-preview"
},
{
"name": "claude-opus-4.7-judge",
"provider": "openrouter",
"model": "anthropic/claude-opus-4.7"
},
{
"name": "gpt-5.5-judge",
"provider": "openrouter",
"model": "openai/gpt-5.5"
},
{
"name": "gemini-3.1-pro-judge",
"provider": "openrouter",
"model": "google/gemini-3.1-pro-preview"
}
],
"candidate_models": [
{
"name": "qwen3-vl-8b-instruct",
"provider": "openrouter",
"model": "qwen/qwen3-vl-8b-instruct"
},
{
"name": "qwen3-vl-8b-thinking",
"provider": "openrouter",
"model": "qwen/qwen3-vl-8b-thinking"
},
{
"name": "ui-tars-1.5-7b",
"provider": "openrouter",
"model": "bytedance/ui-tars-1.5-7b"
},
{
"name": "glm-4.5v",
"provider": "openrouter",
"model": "z-ai/glm-4.5v"
},
{
"name": "glm-4.6v",
"provider": "openrouter",
"model": "z-ai/glm-4.6v"
},
{
"name": "glm-5v-turbo",
"provider": "openrouter",
"model": "z-ai/glm-5v-turbo"
},
{
"name": "moondream",
"provider": "moondream",
"model": "moondream-cloud"
},
{
"name": "gemini-3.1-pro",
"provider": "openrouter",
"model": "google/gemini-3.1-pro-preview"
},
{
"name": "qwen3-vl-2b-instruct-local",
"provider": "local_hf",
"model": "Qwen/Qwen3-VL-2B-Instruct",
"adapter": "qwen3_vl",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 8,
"max_pixels": 1310720,
"max_new_tokens": 128
},
{
"name": "qwen3-vl-2b-thinking-local",
"provider": "local_hf",
"model": "Qwen/Qwen3-VL-2B-Thinking",
"adapter": "qwen3_vl",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 8,
"max_pixels": 1310720,
"max_new_tokens": 1024
},
{
"name": "qwen3-vl-4b-instruct-local",
"provider": "local_hf",
"model": "Qwen/Qwen3-VL-4B-Instruct",
"adapter": "qwen3_vl",
"dtype": "fp16",
"estimated_vram_gb": 12,
"max_pixels": 1310720,
"max_new_tokens": 128
},
{
"name": "qwen3-vl-4b-thinking-local",
"provider": "local_hf",
"model": "Qwen/Qwen3-VL-4B-Thinking",
"adapter": "qwen3_vl",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 12,
"max_pixels": 1310720,
"max_new_tokens": 1024
},
{
"name": "qwen2.5-vl-3b-instruct-local",
"provider": "local_hf",
"model": "Qwen/Qwen2.5-VL-3B-Instruct",
"adapter": "qwen25_point_1000",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 10,
"min_pixels": 3136,
"max_pixels": 1003520,
"max_new_tokens": 128
},
{
"name": "gui-owl-1.5-2b-local",
"provider": "local_hf",
"model": "mPLUG/GUI-Owl-1.5-2B-Instruct",
"adapter": "qwen3_vl",
"dtype": "fp16",
"estimated_vram_gb": 8,
"max_pixels": 1310720,
"max_new_tokens": 128
},
{
"name": "gui-owl-1.5-4b-local",
"provider": "local_hf",
"model": "mPLUG/GUI-Owl-1.5-4B-Instruct",
"adapter": "qwen3_vl",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 12,
"max_pixels": 1310720,
"max_new_tokens": 128
},
{
"name": "gui-owl-1.5-8b-local",
"provider": "local_hf",
"model": "mPLUG/GUI-Owl-1.5-8B-Instruct",
"adapter": "qwen3_vl",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"max_pixels": 1310720,
"max_new_tokens": 128
},
{
"name": "kv-ground-8b-local",
"provider": "local_hf",
"model": "vocaela/KV-Ground-8B-BaseGuiOwl1.5-0315",
"adapter": "qwen3_vl",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"max_pixels": 1310720,
"max_new_tokens": 128
},
{
"name": "ui-venus-1.5-2b-local",
"provider": "local_hf",
"model": "inclusionAI/UI-Venus-1.5-2B",
"adapter": "qwen3_vl",
"dtype": "fp16",
"estimated_vram_gb": 8,
"max_pixels": 1310720,
"max_new_tokens": 128
},
{
"name": "ui-venus-1.5-8b-local",
"provider": "local_hf",
"model": "inclusionAI/UI-Venus-1.5-8B",
"adapter": "qwen3_vl",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"max_pixels": 1310720,
"max_new_tokens": 128
},
{
"name": "holo2-4b-local",
"provider": "local_hf",
"model": "Hcompany/Holo2-4B",
"adapter": "holo2",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 12,
"max_pixels": 1310720,
"max_new_tokens": 64
},
{
"name": "holo2-8b-local",
"provider": "local_hf",
"model": "Hcompany/Holo2-8B",
"adapter": "holo2",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"max_pixels": 1310720,
"max_new_tokens": 64
},
{
"name": "gta1-7b-local",
"provider": "local_hf",
"model": "Salesforce/GTA1-7B",
"adapter": "gta1",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"min_pixels": 3136,
"max_pixels": 1003520,
"max_new_tokens": 512
},
{
"name": "opencua-7b-local",
"provider": "local_hf",
"model": "xlangai/OpenCUA-7B",
"adapter": "opencua",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"min_pixels": 3136,
"max_pixels": 1003520,
"max_new_tokens": 512
},
{
"name": "infigui-g1-3b-local",
"provider": "local_hf",
"model": "InfiX-ai/InfiGUI-G1-3B",
"adapter": "infigui",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 12,
"min_pixels": 3136,
"max_pixels": 1003520,
"max_new_tokens": 512
},
{
"name": "infigui-g1-7b-local",
"provider": "local_hf",
"model": "InfiX-ai/InfiGUI-G1-7B",
"adapter": "infigui",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"min_pixels": 3136,
"max_pixels": 1003520,
"max_new_tokens": 512
},
{
"name": "points-gui-g-local",
"provider": "local_hf",
"model": "tencent/POINTS-GUI-G",
"adapter": "points_gui_g",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"max_new_tokens": 128
},
{
"name": "mai-ui-8b-local",
"provider": "local_hf",
"model": "Tongyi-MAI/MAI-UI-8B",
"adapter": "qwen3_vl",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"max_pixels": 1310720,
"max_new_tokens": 128
},
{
"name": "molmopoint-gui-8b-local",
"provider": "local_hf",
"model": "allenai/MolmoPoint-GUI-8B",
"adapter": "molmopoint",
"dtype": "bf16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"max_new_tokens": 200
},
{
"name": "fara-7b-local",
"provider": "local_hf",
"model": "microsoft/Fara-7B",
"adapter": "qwen25_tool_absolute",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 20,
"min_pixels": 3136,
"max_pixels": 1003520,
"max_new_tokens": 512
},
{
"name": "groundnext-7b-local",
"provider": "local_hf",
"model": "ServiceNow/GroundNext-7B-V0",
"adapter": "groundnext",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"max_pixels": 1003520,
"max_new_tokens": 128
},
{
"name": "uground-v1-7b-local",
"provider": "local_hf",
"model": "osunlp/UGround-V1-7B",
"adapter": "uground",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"max_pixels": 1003520
},
{
"name": "uground-v1-2b-local",
"provider": "local_hf",
"model": "osunlp/UGround-V1-2B",
"adapter": "uground",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 8,
"max_pixels": 1003520
},
{
"name": "ui-tars-2b-sft-local",
"provider": "local_hf",
"model": "ByteDance-Seed/UI-TARS-2B-SFT",
"adapter": "uground",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 8,
"max_pixels": 1003520
},
{
"name": "zonui-3b-local",
"provider": "local_hf",
"model": "zonghanHZH/ZonUI-3B",
"adapter": "zonui",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 10,
"min_pixels": 3136,
"max_pixels": 1003520,
"max_new_tokens": 64
},
{
"name": "gui-g1-3b-local",
"provider": "local_hf",
"model": "Yuqi-Zhou/GUI-G1-3B-v1",
"adapter": "qwen25_point_1000",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 10,
"min_pixels": 3136,
"max_pixels": 1003520,
"max_new_tokens": 128
},
{
"name": "jedi-3b-1080p-local",
"provider": "local_hf",
"model": "xlangai/Jedi-3B-1080p",
"adapter": "qwen25_tool_absolute",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 10,
"min_pixels": 3136,
"max_pixels": 1003520,
"max_new_tokens": 128
},
{
"name": "jedi-7b-1080p-local",
"provider": "local_hf",
"model": "xlangai/Jedi-7B-1080p",
"adapter": "qwen25_tool_absolute",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"min_pixels": 3136,
"max_pixels": 1003520,
"max_new_tokens": 128
},
{
"name": "ui-ins-7b-local",
"provider": "local_hf",
"model": "Tongyi-MiA/UI-Ins-7B",
"adapter": "qwen25_tool_absolute",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"min_pixels": 3136,
"max_pixels": 1003520,
"max_new_tokens": 256
},
{
"name": "gui-drag-7b-local",
"provider": "local_hf",
"model": "osunlp/GUI-Drag-7B",
"adapter": "gui_drag",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"min_pixels": 12544,
"max_pixels": 2116800,
"max_new_tokens": 1024
},
{
"name": "os-atlas-base-4b-local",
"provider": "local_hf",
"model": "OS-Copilot/OS-Atlas-Base-4B",
"adapter": "os_atlas_4b",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 12,
"max_new_tokens": 128
},
{
"name": "os-atlas-base-7b-local",
"provider": "local_hf",
"model": "OS-Copilot/OS-Atlas-Base-7B",
"adapter": "os_atlas_7b",
"dtype": "fp16",
"allow_cpu_offload": true,
"estimated_vram_gb": 24,
"max_pixels": 1003520
},
{
"name": "showui-2b-local",
"provider": "local_hf",
"model": "showlab/ShowUI-2B",
"adapter": "showui",
"dtype": "fp16",
"estimated_vram_gb": 8,
"max_pixels": 1053696
}
]
}

View File

@@ -0,0 +1,49 @@
{
"url": "https://www.aliexpress.com/",
"screenshot": "prototypes/click_eval/examples/screenshots/aliexpress_search_by_image.png",
"image_size": {
"width": 1920,
"height": 1152
},
"device_scale_factor": 1,
"target": {
"label": "Search by image",
"selector": "img[alt=\"Search by image\"]",
"rect": {
"x": 1189.171875,
"y": 25.5,
"width": 24,
"height": 29
},
"image_rect": {
"x": 1189.171875,
"y": 25.5,
"width": 24,
"height": 24
},
"center": {
"x": 1201.171875,
"y": 40
}
},
"viewport": {
"width": 1920,
"height": 1152,
"dpr": 1,
"scrollX": 0,
"scrollY": 0,
"url": "https://www.aliexpress.com/",
"title": "AliExpress - Affordable Chinese Stores & Free Shipping - Online Shopping"
},
"captured_at_utc": "2026-04-29T15:07:12.753Z",
"dismissed_overlays": [
"Free gift modal",
"Cookie banner",
"Notification prompt"
],
"caveats": [
"Live AliExpress homepage; products, placeholder text, and layout can change upstream.",
"Popups were dismissed before capture so the search by image button is unobstructed.",
"Animations/transitions were reduced before capture for determinism."
]
}

Binary file not shown (new file, 760 KiB).

View File

@@ -0,0 +1,289 @@
{
"url": "https://www.bustracker.ie/?q=208",
"screenshot": "prototypes/click_eval/examples/screenshots/bustracker_208_cork_city_centre.png",
"image_size": {
"width": 1920,
"height": 1152
},
"device_scale_factor": 1,
"target": {
"label": "Route 208 bus closest to Cork city centre",
"selector": ".leaflet-marker-icon.route-rectangle",
"center": {
"x": 1068,
"y": 511
},
"rect": {
"x": 1038.0191650390625,
"y": 490.70770263671875,
"width": 59.961669921875,
"height": 40.5845947265625,
"top": 490.70770263671875,
"left": 1038.0191650390625,
"right": 1097.9808349609375,
"bottom": 531.2922973632812
},
"text": "208",
"latLon": {
"lat": 51.89778723113112,
"lon": -8.476724624633789
},
"distanceToCorkCentreM": 110.6165039682958,
"markerIndex": 4
},
"viewport": {
"width": 1920,
"height": 1152,
"dpr": 1,
"scrollX": 0,
"scrollY": 0,
"url": "https://www.bustracker.ie/?q=208",
"title": "bustracker"
},
"mapRect": {
"x": 0,
"y": 64,
"width": 1920,
"height": 1088,
"top": 64,
"left": 0,
"right": 1920,
"bottom": 1152
},
"updated": "Updated 31 seconds ago",
"corkCityCentre": {
"lat": 51.8985,
"lon": -8.4756
},
"tileRef": {
"src": "https://a.tile.openstreetmap.org/14/7805/5419.png",
"z": 14,
"xTile": 7805,
"yTile": 5419,
"rect": {
"x": 757,
"y": 401,
"width": 256,
"height": 256,
"top": 401,
"left": 757,
"right": 1013,
"bottom": 657
}
},
"rankedMarkers": [
{
"index": 4,
"center": {
"x": 1068,
"y": 511
},
"rect": {
"x": 1038.0191650390625,
"y": 490.70770263671875,
"width": 59.961669921875,
"height": 40.5845947265625,
"top": 490.70770263671875,
"left": 1038.0191650390625,
"right": 1097.9808349609375,
"bottom": 531.2922973632812
},
"text": "208",
"latLon": {
"lat": 51.89778723113112,
"lon": -8.476724624633789
},
"distanceToCorkCentreM": 110.6165039682958
},
{
"index": 2,
"center": {
"x": 1199,
"y": 547
},
"rect": {
"x": 1171.151611328125,
"y": 531.351318359375,
"width": 55.69677734375,
"height": 31.29736328125,
"top": 531.351318359375,
"left": 1171.151611328125,
"right": 1226.848388671875,
"bottom": 562.648681640625
},
"text": "208",
"latLon": {
"lat": 51.895880514665095,
"lon": -8.46548080444336
},
"distanceToCorkCentreM": 752.9542979626465
},
{
"index": 5,
"center": {
"x": 1316,
"y": 304.99998474121094
},
"rect": {
"x": 1290.1763916015625,
"y": 273.68878173828125,
"width": 51.647216796875,
"height": 62.622406005859375,
"top": 273.68878173828125,
"left": 1290.1763916015625,
"right": 1341.8236083984375,
"bottom": 336.3111877441406
},
"text": "208",
"latLon": {
"lat": 51.9086963310864,
"lon": -8.455438613891602
},
"distanceToCorkCentreM": 1788.4811355104382
},
{
"index": 7,
"center": {
"x": 719,
"y": 690
},
"rect": {
"x": 703.7605590820312,
"y": 662.3701171875,
"width": 30.4788818359375,
"height": 55.259765625,
"top": 662.3701171875,
"left": 703.7605590820312,
"right": 734.2394409179688,
"bottom": 717.6298828125
},
"text": "208",
"latLon": {
"lat": 51.88830581406175,
"lon": -8.50667953491211
},
"distanceToCorkCentreM": 2415.2448491301698
},
{
"index": 3,
"center": {
"x": 720,
"y": 697
},
"rect": {
"x": 704.7605590820312,
"y": 669.3701171875,
"width": 30.4788818359375,
"height": 55.259765625,
"top": 669.3701171875,
"left": 704.7605590820312,
"right": 735.2394409179688,
"bottom": 724.6298828125
},
"text": "208",
"latLon": {
"lat": 51.88793499174197,
"lon": -8.506593704223633
},
"distanceToCorkCentreM": 2429.719568050658
},
{
"index": 1,
"center": {
"x": 1623,
"y": 201.00000762939453
},
"rect": {
"x": 1607.5078125,
"y": 173.23468017578125,
"width": 30.984375,
"height": 55.53065490722656,
"top": 173.23468017578125,
"left": 1607.5078125,
"right": 1638.4921875,
"bottom": 228.7653350830078
},
"text": "208",
"latLon": {
"lat": 51.91420282988519,
"lon": -8.429088592529297
},
"distanceToCorkCentreM": 3637.2647634017035
},
{
"index": 8,
"center": {
"x": 1623,
"y": 200.00000762939453
},
"rect": {
"x": 1607.5078125,
"y": 172.23468017578125,
"width": 30.984375,
"height": 55.53065490722656,
"top": 172.23468017578125,
"left": 1607.5078125,
"right": 1638.4921875,
"bottom": 227.7653350830078
},
"text": "208",
"latLon": {
"lat": 51.91425577372331,
"lon": -8.429088592529297
},
"distanceToCorkCentreM": 3640.0928836883722
},
{
"index": 6,
"center": {
"x": 386,
"y": 985
},
"rect": {
"x": 354.7040710449219,
"y": 957.9580078125,
"width": 62.59185791015625,
"height": 54.083984375,
"top": 957.9580078125,
"left": 354.7040710449219,
"right": 417.2959289550781,
"bottom": 1012.0419921875
},
"text": "208",
"latLon": {
"lat": 51.872675649487675,
"lon": -8.535261154174805
},
"distanceToCorkCentreM": 5001.261433750701
},
{
"index": 0,
"center": {
"x": 297,
"y": 1016
},
"rect": {
"x": 268.0628662109375,
"y": 998.1650390625,
"width": 57.874267578125,
"height": 35.669921875,
"top": 998.1650390625,
"left": 268.0628662109375,
"right": 325.9371337890625,
"bottom": 1033.8349609375
},
"text": "208",
"latLon": {
"lat": 51.87103284208375,
"lon": -8.542900085449219
},
"distanceToCorkCentreM": 5537.540585075056
}
],
"captured_at_utc": "2026-04-29T15:11:01.299Z",
"caveats": [
"BusTracker is live data; bus marker positions and closest vehicle can change between captures.",
"Closest marker was selected by projecting visible Leaflet marker centers to OSM tile lat/lon and comparing to Cork city centre lat/lon.",
"The GT point is the center of the selected route-rectangle marker in viewport/screenshot coordinates."
]
}

Binary file not shown (new file, 2.4 MiB).

View File

@@ -0,0 +1,44 @@
{
"url": "https://www.formula1.com/en/results/2026/drivers",
"screenshot": "prototypes/click_eval/examples/screenshots/f1_2026_drivers_fia_logo.png",
"image_size": {
"width": 1920,
"height": 1152
},
"target": {
"selector": "a[href=\"https://www.fia.com/\"] svg[role=\"img\"]",
"label": "FIA",
"href": "https://www.fia.com/",
"rect": {
"x": 1707.328125,
"y": 64,
"width": 45.171875,
"height": 32
},
"center": {
"x": 1729.9140625,
"y": 80
},
"occludingElement": {
"tag": "path",
"text": "",
"id": null,
"class": null
},
"viewport": {
"width": 1920,
"height": 1152,
"dpr": 1,
"scrollX": 0,
"scrollY": 0,
"url": "https://www.formula1.com/en/results/2026/drivers",
"title": "2026 DRIVERS' STANDINGS"
}
},
"consent_dismissed": true,
"captured_at_utc": "2026-04-28T11:22:17.556Z",
"caveats": [
"Live Formula 1 page; content can change upstream.",
"Animations/transitions were reduced before capture for determinism."
]
}

Binary file not shown (new file, 456 KiB).

Binary file not shown (new file, 296 KiB).

View File

@@ -0,0 +1,66 @@
{
"url": "https://izz.ie/pages/table-booking",
"screenshot": "prototypes/click_eval/examples/screenshots/izz_table_booking_footer.png",
"image_size": {
"width": 1920,
"height": 1152
},
"device_scale_factor": 1,
"targets": {
"mastercard": {
"label": "Mastercard logo",
"selector": "footer svg[aria-labelledby=\"pi-master\"]",
"rect": {
"x": 1343,
"y": 961.1875,
"width": 38,
"height": 24,
"top": 961.1875,
"left": 1343,
"right": 1381,
"bottom": 985.1875
},
"center": {
"x": 1362,
"y": 973.1875
},
"text": "Mastercard"
},
"tiktok": {
"label": "TikTok footer logo",
"selector": "footer a[href*=\"tiktok.com\"]",
"rect": {
"x": 1420,
"y": 1059.5,
"width": 30,
"height": 30,
"top": 1059.5,
"left": 1420,
"right": 1450,
"bottom": 1089.5
},
"center": {
"x": 1435,
"y": 1074.5
},
"href": "https://www.tiktok.com/@izzcafeireland",
"text": "TikTok"
},
"viewport": {
"width": 1920,
"height": 1152,
"dpr": 1,
"scrollX": 0,
"scrollY": 5322,
"url": "https://izz.ie/pages/table-booking",
"title": "Table Booking Izz Cafe",
"docHeight": 6474
}
},
"captured_at_utc": "2026-04-29T15:10:52.819Z",
"caveats": [
"Footer capture after scrolling to the bottom of the live Shopify page.",
"The TikTok target is the footer Follow-us row icon, not the duplicated announcement-bar/social icon.",
"A small cookie banner remains visible near the bottom but does not cover the Mastercard payment logo or the selected footer TikTok logo."
]
}

Binary file not shown (new file, 319 KiB).

View File

@@ -0,0 +1,45 @@
{
"leave": {
"text": "leave",
"selector": ".subscribe-button .option.remove.active",
"rect": {
"x": 1605,
"y": 308,
"width": 43.890625,
"height": 16
}
},
"sixthPermalink": {
"text": "permalink",
"href": "https://old.reddit.com/r/pics/comments/1sx21hf/someone_paid_8700_dollars_for_this_in_1995_i_paid/oimt6n0/",
"selector": ".comment:nth-of-type(6) a:has-text(\"permalink\")",
"commentId": "thing_t1_oimt6n0",
"author": "DizzyTelevision09",
"bodyPreview": "You've been paying off student loans for the past 30 years? That's crazy.",
"rect": {
"x": 166,
"y": 1125.265625,
"width": 58.546875,
"height": 12
}
},
"title": "Someone paid $8700 dollars for this in 1995. I paid $15 today. : pics",
"pageUrl": "https://old.reddit.com/r/pics/comments/1sx21hf/someone_paid_8700_dollars_for_this_in_1995_i_paid/",
"scrollHeight": 17771,
"imageSize": {
"width": 1920,
"height": 1600
},
"deviceScaleFactor": 1,
"targets": {
"reddit_pics_leave_button": [
1626.9,
316
],
"reddit_pics_6th_comment_permalink": [
195.3,
1131.3
]
},
"caveat": "Fresh Old Reddit session is logged out, so the real visible button is join. The hidden leave link exists in the DOM; screenshot flips the subscribe widget into its subscribed/leave visual state for this requested target."
}

Binary file not shown (new file, 627 KiB).

View File

@@ -0,0 +1,26 @@
{
"url": "https://www.temu.com/",
"screenshot": "prototypes/click_eval/examples/screenshots/temu_feedback.png",
"image_size": {
"width": 1920,
"height": 1152
},
"device_scale_factor": 1,
"target": {
"label": "Feedback",
"selector": "#mainToolbar [role=\"button\"]",
"rect": {
"x": 1860,
"y": 1016,
"width": 56,
"height": 56
},
"center": {
"x": 1888,
"y": 1044
}
},
"caveats": [
"Temu displayed a privacy/cookie settings dialog that dims the page; the bottom-right Feedback button remains visible in the screenshot."
]
}

Binary file not shown (new file, 1.2 MiB).

View File

@@ -0,0 +1,11 @@
{"task_id":"hn_second_comments","image_path":"screenshots/hacker_news_browser.png","instruction":"Click the comments link under the second item.","gt_point":[442.8,101.5]}
{"task_id":"hn_7th_title_link","image_path":"screenshots/hacker_news_browser.png","instruction":"Click the title link of the 7th post.","gt_point":[359.2,261.5]}
{"task_id":"hn_19th_hide","image_path":"screenshots/hacker_news_browser.png","instruction":"Click the hide link on the 19th post.","gt_point":[354.9,696.5]}
{"task_id":"temu_feedback","image_path":"screenshots/temu_feedback.png","instruction":"Click the Feedback button on the bottom right.","gt_point":[1888.0,1044.0]}
{"task_id":"f1_2026_drivers_fia_logo","image_path":"screenshots/f1_2026_drivers_fia_logo.png","instruction":"Click the FIA logo on the top right.","gt_point":[1729.914,80.0]}
{"task_id":"reddit_pics_leave_button","image_path":"screenshots/reddit_pics_leave_permalink.png","instruction":"Click the leave button near the top of the right sidebar.","gt_point":[1626.9,316.0]}
{"task_id":"reddit_pics_6th_comment_permalink","image_path":"screenshots/reddit_pics_leave_permalink.png","instruction":"Click the permalink button for the 6th comment.","gt_point":[195.3,1131.3]}
{"task_id":"izz_table_booking_footer_mastercard","image_path":"screenshots/izz_table_booking_footer.png","instruction":"Click the Mastercard logo at the very bottom of the page.","gt_point":[1362.0,973.2]}
{"task_id":"izz_table_booking_footer_tiktok","image_path":"screenshots/izz_table_booking_footer.png","instruction":"Click the TikTok logo in the footer social row.","gt_point":[1435.0,1074.5]}
{"task_id":"aliexpress_search_by_image","image_path":"screenshots/aliexpress_search_by_image.png","instruction":"Click the search by image button.","gt_point":[1201.171875,40]}
{"task_id":"bustracker_208_cork_city_centre_bus","image_path":"screenshots/bustracker_208_cork_city_centre.png","instruction":"Click the bus closest to Cork city centre.","gt_point":[1068.0,511.0]}

View File

@@ -0,0 +1,56 @@
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"
[project]
name = "click-eval"
version = "0.1.0"
description = "Quick OpenRouter VLM click-point evaluation prototype"
requires-python = ">=3.12"
dependencies = [
"google-genai>=1.0.0",
"Pillow>=10.0.0",
"tqdm>=4.67.3",
]
[project.optional-dependencies]
local = [
"accelerate>=1.0.0",
"decord2>=0.2.0",
"einops>=0.8.0",
"hf_transfer>=0.1.9",
"protobuf>=4.25.0",
"qwen-vl-utils>=0.0.14",
"requests>=2.32.0",
"safetensors>=0.4.3",
"sentencepiece>=0.2.0",
"timm>=1.0.0",
"torch>=2.6.0",
"torchvision>=0.21.0",
"transformers>=4.57.1,<5.0.0",
"WePOINTS @ git+https://github.com/WePOINTS/WePOINTS.git",
"tiktoken>=0.7.0",
]
[project.scripts]
click-eval = "click_eval.cli:main"
[dependency-groups]
dev = [
"basedpyright>=1.39.3",
"pylsp-rope>=0.1.17",
"pytest>=8.0.0",
"python-lsp-server>=1.14.0",
"ruff>=0.14.0",
]
[tool.setuptools.packages.find]
where = ["src"]
[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["src"]
[tool.ruff]
line-length = 88
target-version = "py312"

View File

@@ -0,0 +1 @@
-e .[local]

View File

@@ -0,0 +1,5 @@
"""Quick VLM click-point evaluation prototype."""
__all__ = ["__version__"]
__version__ = "0.1.0"

View File

@@ -0,0 +1,5 @@
from .cli import main
if __name__ == "__main__":
raise SystemExit(main())

View File

@@ -0,0 +1,128 @@
from __future__ import annotations
import argparse
import sys
from datetime import datetime
from pathlib import Path
from .env import load_dotenv
from .harness import RunOptions, run_eval
from .providers import ProviderClient
PROJECT_ROOT = Path(__file__).resolve().parents[2]
DEFAULT_TASKS = PROJECT_ROOT / "examples" / "tasks.jsonl"
DEFAULT_MODELS = PROJECT_ROOT / "examples" / "models.json"
def main(argv: list[str] | None = None) -> int:
parser = argparse.ArgumentParser(description="Quick VLM click-point eval harness")
subparsers = parser.add_subparsers(dest="command", required=True)
run_parser = subparsers.add_parser("run", help="run a click-point evaluation")
run_parser.add_argument(
"--tasks",
type=Path,
default=DEFAULT_TASKS,
help=f"JSONL task manifest (default: {DEFAULT_TASKS})",
)
run_parser.add_argument(
"--models",
type=Path,
default=DEFAULT_MODELS,
help=f"model config JSON (default: {DEFAULT_MODELS})",
)
run_parser.add_argument(
"--out",
type=Path,
help="output run directory (default: runs/<timestamp>)",
)
run_parser.add_argument("--no-annotate", action="store_true", help="skip annotated PNGs")
run_parser.add_argument("--fail-fast", action="store_true", help="stop on first GT error")
run_parser.add_argument("--no-progress", action="store_true", help="hide progress/log output")
run_parser.add_argument("--limit", type=int, help="only run the first N tasks")
run_parser.add_argument(
"--model-limit",
type=int,
help="only run the first N candidate models",
)
run_parser.add_argument(
"--timeout",
type=int,
default=60 * 4,
help="API timeout seconds; also used as local HF generation max_time",
)
args = parser.parse_args(argv)
if args.command == "run":
load_dotenv(PROJECT_ROOT / ".env")
load_dotenv(Path.cwd() / ".env")
out_dir = args.out or _default_out_dir()
client = ProviderClient(
timeout_seconds=args.timeout,
log_callback=None if args.no_progress else _stderr_log,
)
summary = run_eval(
RunOptions(
tasks_path=args.tasks,
models_path=args.models,
out_dir=out_dir,
annotate=not args.no_annotate,
fail_fast=args.fail_fast,
limit=args.limit,
model_limit=args.model_limit,
progress=not args.no_progress,
),
client.predict_point,
)
print(f"Wrote run to {out_dir}")
result_rows = summary.get("result_rows", [])
if isinstance(result_rows, list):
print(_format_result_table(result_rows))
return 0
parser.error(f"unknown command: {args.command}")
return 2  # unreachable: parser.error() raises SystemExit(2)
def _default_out_dir() -> Path:
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
return PROJECT_ROOT / "runs" / timestamp
def _stderr_log(message: str) -> None:
print(message, file=sys.stderr)
def _format_result_table(rows: list[dict[str, object]]) -> str:
headers = ["Model", "Status", "L2", "Duration", "Reason"]
table_rows = [
[
str(row.get("model", "")),
str(row.get("status", "")),
_format_l2(row.get("l2")),
_format_duration(row.get("duration_seconds")),
str(row.get("reason") or ""),
]
for row in rows
]
widths = [
max([len(headers[index]), *(len(row[index]) for row in table_rows)])
for index in range(len(headers))
]
lines = ["Results"]
lines.append(_format_table_line(headers, widths))
lines.append(_format_table_line(["-" * width for width in widths], widths))
lines.extend(_format_table_line(row, widths) for row in table_rows)
return "\n".join(lines)
def _format_table_line(cells: list[str], widths: list[int]) -> str:
return " ".join(cell.ljust(widths[index]) for index, cell in enumerate(cells))
def _format_l2(value: object) -> str:
return "n/a" if value is None else f"{float(value):.2f}px"
def _format_duration(value: object) -> str:
return "n/a" if value is None else f"{float(value):.2f}s"

View File

@@ -0,0 +1,59 @@
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Any
@dataclass(frozen=True)
class Point:
x: float
y: float
def as_list(self) -> list[float]:
return [self.x, self.y]
@dataclass(frozen=True)
class ClickTask:
task_id: str
image_path: Path
image_path_text: str
instruction: str
gt_point: Point | None
raw: dict[str, Any]
@dataclass(frozen=True)
class ModelSpec:
name: str
model_id: str
provider: str = "openrouter"
estimated_vram_gb: float | None = None
adapter: str | None = None
quantization: str | None = None
allow_cpu_offload: bool = False
dtype: str | None = None
attn_implementation: str | None = None
min_pixels: int | None = None
max_pixels: int | None = None
max_new_tokens: int | None = None
revision: str | None = None
use_safetensors: bool | None = None
@dataclass(frozen=True)
class ModelReply:
text: str
raw: dict[str, Any] | None = None
class ModelSkipped(RuntimeError):
"""Raised when a model is intentionally skipped for environment reasons."""
@dataclass(frozen=True)
class ParsedPoint:
point: Point | None
reason: str | None = None
error: str | None = None

View File

@@ -0,0 +1,27 @@
from __future__ import annotations
import os
from pathlib import Path
def load_dotenv(path: Path) -> None:
if not path.exists():
return
for line in path.read_text(encoding="utf-8").splitlines():
stripped = line.strip()
# skip blanks, comments, and anything that is not KEY=VALUE
if not stripped or stripped.startswith("#") or "=" not in stripped:
continue
key, value = stripped.split("=", 1)
key = key.strip()
# never override variables already set in the real environment
if not key or key in os.environ:
continue
os.environ[key] = _unquote(value.strip())
def _unquote(value: str) -> str:
if len(value) >= 2 and value[0] == value[-1] and value[0] in {"'", '"'}:
return value[1:-1]
return value

View File

@@ -0,0 +1,241 @@
from __future__ import annotations
import json
import os
from pathlib import Path
from typing import Any
from .contracts import ModelReply, Point
from .image_utils import image_size
class GeminiComputerUseClient:
def __init__(
self,
api_key: str | None = None,
timeout_seconds: int = 90,
) -> None:
self.api_key = (
api_key
or os.environ.get("GEMINI_API_KEY")
or os.environ.get("GOOGLE_API_KEY")
)
if not self.api_key:
raise RuntimeError("GEMINI_API_KEY or GOOGLE_API_KEY is required")
self.timeout_seconds = timeout_seconds
def predict_point(
self,
model_id: str,
image_path: Path,
instruction: str,
purpose: str,
) -> ModelReply:
try:
from google import genai
from google.genai import types
from google.genai.types import Content, Part
except ImportError as exc:
raise RuntimeError(
"google-genai is required for provider=gemini; run `uv sync`"
) from exc
width, height = image_size(image_path)
client = genai.Client(api_key=self.api_key)
contents = [
Content(
role="user",
parts=[
Part(text=_computer_use_prompt(instruction, purpose)),
Part.from_bytes(
data=image_path.read_bytes(),
mime_type="image/png",
),
],
)
]
config = types.GenerateContentConfig(
tools=[
types.Tool(
computer_use=types.ComputerUse(
environment=types.Environment.ENVIRONMENT_BROWSER,
excluded_predefined_functions=_excluded_functions(),
)
)
],
temperature=0,
)
_set_high_media_resolution(config, types)
response = client.models.generate_content(
model=model_id,
contents=contents,
config=config,
)
call = _first_function_call(response)
raw = _raw_response(response)
if call is None:
return ModelReply(text=_response_text(response), raw=raw)
point = _point_from_call(call)
if point is None:
return ModelReply(text=_response_text(response), raw=raw)
scaled = _scale_computer_use_point(model_id, point, width, height)
return ModelReply(
text=json.dumps(
{
"x": scaled.x,
"y": scaled.y,
"reason": f"Gemini Computer Use function_call {call['name']}",
"display_x": point.x,
"display_y": point.y,
}
),
raw=raw,
)
def _computer_use_prompt(instruction: str, purpose: str) -> str:
role_line = (
"Choose the ground-truth click point for this instruction."
if purpose == "ground_truth"
else "Predict the click point for this instruction."
)
return (
f"{role_line}\n\n"
"Use the screenshot and emit exactly one Computer Use `click_at` action. "
"Do not navigate, type, scroll, hover, or wait. Choose the center of the "
"target UI element when possible.\n\n"
f"Instruction: {instruction}"
)
def _excluded_functions() -> list[str]:
return [
"open_web_browser",
"wait_5_seconds",
"go_back",
"go_forward",
"search",
"navigate",
"hover_at",
"type_text_at",
"key_combination",
"scroll_document",
"drag_and_drop",
]
def _set_high_media_resolution(config, types) -> None:
media_resolution = getattr(types, "MediaResolution", None)
if media_resolution is None:
return
value = (
getattr(media_resolution, "MEDIA_RESOLUTION_HIGH", None)
or getattr(media_resolution, "HIGH", None)
)
if value is not None:
try:
config.media_resolution = value
except (AttributeError, TypeError, ValueError):
return
def _first_function_call(response) -> dict[str, Any] | None:
raw_call = _first_function_call_from_dict(_raw_response(response))
if raw_call is not None:
return raw_call
for candidate in getattr(response, "candidates", []) or []:
content = getattr(candidate, "content", None)
for part in getattr(content, "parts", []) or []:
function_call = getattr(part, "function_call", None)
if function_call is not None:
return _function_call_dict(function_call)
return None
def _function_call_dict(function_call) -> dict[str, Any]:
name = getattr(function_call, "name", None)
args = getattr(function_call, "args", None)
return {"name": str(name or ""), "args": _plain_dict(args)}
def _first_function_call_from_dict(value: Any) -> dict[str, Any] | None:
if isinstance(value, dict):
function_call = value.get("functionCall") or value.get("function_call")
if isinstance(function_call, dict):
return {
"name": str(function_call.get("name") or ""),
"args": _plain_dict(function_call.get("args")),
}
for child in value.values():
found = _first_function_call_from_dict(child)
if found is not None:
return found
if isinstance(value, list):
for child in value:
found = _first_function_call_from_dict(child)
if found is not None:
return found
return None
def _point_from_call(call: dict[str, Any]) -> Point | None:
args = call.get("args")
if not isinstance(args, dict):
return None
try:
return Point(x=float(args["x"]), y=float(args["y"]))
except (KeyError, TypeError, ValueError):
return None
def _scale_computer_use_point(
model_id: str, point: Point, width: int, height: int
) -> Point:
# "gemini-3-" models already emit display-space pixel coordinates; older
# Computer Use models emit 0-1000 normalized coordinates that are scaled here.
if model_id.startswith("gemini-3-"):
return point
return Point(x=point.x / 1000 * width, y=point.y / 1000 * height)
def _response_text(response) -> str:
parts = []
for candidate in getattr(response, "candidates", []) or []:
content = getattr(candidate, "content", None)
for part in getattr(content, "parts", []) or []:
text = getattr(part, "text", None)
if text:
parts.append(str(text))
if parts:
return "\n".join(parts)
return json.dumps(_raw_response(response), ensure_ascii=False)
def _raw_response(response) -> dict[str, Any]:
for method_name in ("to_json_dict", "model_dump", "dict"):
method = getattr(response, method_name, None)
if method is None:
continue
try:
value = method()
except TypeError:
continue
if isinstance(value, dict):
return _plain_dict(value)
return {"repr": repr(response)}
def _plain_dict(value: Any) -> Any:
if isinstance(value, dict):
return {str(key): _plain_dict(item) for key, item in value.items()}
if isinstance(value, list):
return [_plain_dict(item) for item in value]
if isinstance(value, tuple):
return [_plain_dict(item) for item in value]
if isinstance(value, (str, int, float, bool)) or value is None:
return value
try:
return dict(value)
except (TypeError, ValueError):
return repr(value)

View File

@@ -0,0 +1,653 @@
from __future__ import annotations
import csv
import concurrent.futures
import json
import math
import statistics
import sys
import time
from contextlib import nullcontext
from dataclasses import dataclass
from pathlib import Path
from typing import Callable
from tqdm import tqdm
from .contracts import ClickTask, ModelReply, ModelSkipped, ModelSpec, Point
from .io import load_model_config, load_tasks, write_jsonl
from .parsing import parse_point_response, parse_point_value
from .scoring import SCORE_FIELDNAMES, score_point, summarize_scores
from .viz import annotate_image
PredictPoint = Callable[[ModelSpec, Path, str, str], ModelReply]
OPENROUTER_CANDIDATE_CONCURRENCY = 4
@dataclass(frozen=True)
class RunOptions:
tasks_path: Path
models_path: Path
out_dir: Path
annotate: bool = True
fail_fast: bool = False
limit: int | None = None
model_limit: int | None = None
progress: bool = True
@dataclass
class TaskRunState:
task: ClickTask
gt_point: Point | None
judge_annotations: list[dict[str, object]]
annotations: list[dict[str, object]]
def run_eval(options: RunOptions, predict_point: PredictPoint) -> dict[str, object]:
judges, candidates, config = load_model_config(options.models_path)
tasks = load_tasks(options.tasks_path)
if options.limit is not None:
tasks = tasks[: options.limit]
if options.model_limit is not None:
if options.model_limit < 1:
raise ValueError("model_limit must be at least 1")
candidates = candidates[: options.model_limit]
options.out_dir.mkdir(parents=True, exist_ok=True)
_log(
options,
f"Loaded {len(tasks)} task(s), {len(candidates)} candidate model(s). "
f"Output: {options.out_dir}",
)
resolved_rows: list[dict[str, object]] = []
prediction_rows: list[dict[str, object]] = []
score_rows: list[dict[str, object]] = []
task_states: list[TaskRunState] = []
task_iter = _progress(options, tasks, desc="GT", unit="task")
for task in task_iter:
if task.gt_point is not None and judges:
_log(options, _judge_overlay_log_message(task.task_id, judges))
elif task.gt_point is not None:
_log(options, f"[{task.task_id}] Using provided GT")
elif judges:
_log(options, _judge_overlay_without_gt_log_message(task.task_id, judges))
else:
_log(options, f"[{task.task_id}] No GT; scoring will be n/a")
try:
gt_point, resolved = _resolve_ground_truth(task, judges, predict_point)
except Exception as exc:
_log(options, f"[{task.task_id}] GT failed: {exc}")
if options.fail_fast:
raise
resolved = dict(task.raw)
resolved["ground_truth_error"] = str(exc)
resolved_rows.append(resolved)
continue
if gt_point is None:
_log(options, f"[{task.task_id}] GT: n/a")
else:
_log(options, f"[{task.task_id}] GT: ({gt_point.x:.1f}, {gt_point.y:.1f})")
_log_judge_statuses(options, task.task_id, resolved)
resolved_rows.append(resolved)
judge_annotations = _judge_annotations(resolved, gt_point)
task_states.append(
TaskRunState(
task=task,
gt_point=gt_point,
judge_annotations=judge_annotations,
annotations=[],
)
)
for model, state, prediction in _predict_candidates_for_tasks(
options, task_states, candidates, predict_point
):
prediction_rows.append(prediction)
parsed_point = prediction.get("_point")
point = parsed_point if isinstance(parsed_point, Point) else None
score = score_point(
state.task.task_id,
model.name,
state.gt_point,
point,
error=str(prediction.get("error") or ""),
)
score["duration_seconds"] = prediction.get("duration_seconds", "")
score["skipped"] = bool(prediction.get("skipped"))
score_rows.append(score)
if point is not None:
state.annotations.append(
{
"model": model.name,
"point": point,
"l2": score["l2"] if score["l2"] != "" else None,
}
)
for state in task_states:
if options.annotate:
annotate_image(
state.task.image_path,
options.out_dir / "annotated" / f"{state.task.task_id}.png",
state.gt_point,
state.annotations,
judge_points=state.judge_annotations,
)
for row in prediction_rows:
row.pop("_point", None)
summary = {
"tasks": len(tasks),
"models": [model.name for model in candidates],
"judge_models": [judge.model_id for judge in judges],
"config": {
key: value
for key, value in config.items()
if key not in {"candidate_models", "judge_model", "judge_models"}
},
"summary": summarize_scores(score_rows),
"result_rows": _build_result_rows(score_rows),
}
write_jsonl(options.out_dir / "resolved_tasks.jsonl", resolved_rows)
write_jsonl(options.out_dir / "predictions.jsonl", prediction_rows)
_write_scores_csv(options.out_dir / "scores.csv", score_rows)
(options.out_dir / "summary.json").write_text(
json.dumps(summary, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
)
_log(options, f"Wrote results to {options.out_dir}")
return summary
def _predict_candidates_for_tasks(
options: RunOptions,
task_states: list[TaskRunState],
candidates: list[ModelSpec],
predict_point: PredictPoint,
) -> list[tuple[ModelSpec, TaskRunState, dict[str, object]]]:
predictions: list[tuple[ModelSpec, TaskRunState, dict[str, object]]] = []
progress_bar = _candidate_progress(
options, len(task_states) * len(candidates), "Candidates"
)
try:
for model in candidates:
with _model_run_context(predict_point, model):
if model.provider.lower() == "openrouter":
for state, prediction in _predict_openrouter_model_for_tasks(
options, model, task_states, predict_point
):
predictions.append((model, state, prediction))
_log_prediction_status(
options, state.task.task_id, model, prediction
)
_update_progress(progress_bar)
continue
for state in task_states:
_log_running(options, state.task.task_id, model)
prediction = _predict_candidate(
state.task, model, predict_point
)
predictions.append((model, state, prediction))
_log_prediction_status(
options, state.task.task_id, model, prediction
)
_update_progress(progress_bar)
finally:
if progress_bar is not None:
progress_bar.close()
return predictions
def _predict_openrouter_model_for_tasks(
options: RunOptions,
model: ModelSpec,
task_states: list[TaskRunState],
predict_point: PredictPoint,
) -> list[tuple[TaskRunState, dict[str, object]]]:
if not task_states:
return []
for state in task_states:
_log_running(options, state.task.task_id, model)
max_workers = min(OPENROUTER_CANDIDATE_CONCURRENCY, len(task_states))
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = [
executor.submit(_predict_candidate, state.task, model, predict_point)
for state in task_states
]
return [
(state, future.result())
for state, future in zip(task_states, futures, strict=True)
]
def _model_run_context(predict_point: PredictPoint, model: ModelSpec):
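    # predict_point is typically a bound method; when its owner exposes a
    # model_run_context() factory (ProviderClient does, to keep local HF
    # weights loaded across a model's tasks), wrap the run in that context.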
owner = getattr(predict_point, "__self__", None)
context_factory = getattr(owner, "model_run_context", None)
if context_factory is None:
return nullcontext()
return context_factory(model)
def _resolve_ground_truth(
task, judges: list[ModelSpec], predict_point: PredictPoint
) -> tuple[Point | None, dict[str, object]]:
resolved = dict(task.raw)
if task.gt_point is not None:
resolved["gt_point"] = task.gt_point.as_list()
if judges:
judge_rows, successful = _resolve_judges(task, judges, predict_point)
resolved["gt_judges"] = judge_rows
resolved["gt_models"] = [
judge.model_id for judge, _point, _reason, _raw in successful
]
resolved["gt_model"] = "provided"
resolved["gt_reason"] = "provided gt_point; judges recorded for overlay"
return task.gt_point, resolved
resolved["gt_point"] = None
resolved["gt_model"] = None
resolved["gt_reason"] = "missing gt_point; candidates are unscored"
if judges:
judge_rows, successful = _resolve_judges(task, judges, predict_point)
resolved["gt_judges"] = judge_rows
resolved["gt_models"] = [
judge.model_id for judge, _point, _reason, _raw in successful
]
resolved["gt_reason"] = (
"missing gt_point; judges recorded for overlay only"
)
return None, resolved
def _resolve_judges(
task,
judges: list[ModelSpec],
predict_point: PredictPoint,
) -> tuple[list[dict[str, object]], list[tuple[ModelSpec, Point, str | None, str]]]:
successful: list[tuple[ModelSpec, Point, str | None, str]] = []
judge_results = _predict_judges_for_task(task, judges, predict_point)
judge_rows = [row for row, _point, _reason, _raw_text in judge_results]
for judge, (_row, point, reason, raw_text) in zip(
judges, judge_results, strict=True
):
if point is not None:
successful.append((judge, point, reason, raw_text))
return judge_rows, successful
def _predict_judge(
task,
judge: ModelSpec,
predict_point: PredictPoint,
) -> tuple[dict[str, object], Point | None, str | None, str]:
row: dict[str, object] = {
"name": judge.name,
"provider": judge.provider,
"model_id": judge.model_id,
"point": None,
"reason": None,
"raw_text": None,
"error": None,
"skipped": False,
"duration_seconds": None,
}
started = time.perf_counter()
try:
reply = predict_point(judge, task.image_path, task.instruction, "ground_truth")
except ModelSkipped as exc:
row["duration_seconds"] = time.perf_counter() - started
row["skipped"] = True
row["error"] = str(exc)
return row, None, None, ""
except Exception as exc:
row["duration_seconds"] = time.perf_counter() - started
row["error"] = str(exc)
return row, None, None, ""
row["duration_seconds"] = time.perf_counter() - started
parsed = parse_point_response(reply.text)
row["raw_text"] = reply.text
row["reason"] = parsed.reason
if parsed.point is None:
row["error"] = parsed.error
return row, None, parsed.reason, reply.text
row["point"] = parsed.point.as_list()
return row, parsed.point, parsed.reason, reply.text
def _judge_overlay_log_message(task_id: str, judges: list[ModelSpec]) -> str:
names = ", ".join(judge.name for judge in judges)
return (
f"[{task_id}] Using provided GT and resolving "
f"{len(judges)} judge overlay(s): {names}"
)
def _judge_overlay_without_gt_log_message(task_id: str, judges: list[ModelSpec]) -> str:
names = ", ".join(judge.name for judge in judges)
return (
f"[{task_id}] No GT; resolving {len(judges)} judge overlay(s) "
f"without scoring fallback: {names}"
)
def _judge_annotations(
resolved: dict[str, object], gt_point: Point | None
) -> list[dict[str, object]]:
rows = resolved.get("gt_judges")
if not isinstance(rows, list):
return []
annotations: list[dict[str, object]] = []
for index, row in enumerate(rows, start=1):
if not isinstance(row, dict):
continue
point = parse_point_value(row.get("point"))
annotations.append(
{
"label": f"GT{index}",
"model": str(
row.get("name") or row.get("model_id") or f"judge-{index}"
),
"point": point,
"l2": (
math.hypot(point.x - gt_point.x, point.y - gt_point.y)
if point is not None and gt_point is not None
else None
),
"error": row.get("error"),
"skipped": row.get("skipped"),
}
)
return annotations
def _log_judge_statuses(
options: RunOptions, task_id: str, resolved: dict[str, object]
) -> None:
rows = resolved.get("gt_judges")
if not isinstance(rows, list):
return
for index, row in enumerate(rows, start=1):
if not isinstance(row, dict):
continue
name = str(row.get("name") or row.get("model_id") or f"judge-{index}")
duration = row.get("duration_seconds")
duration_text = (
f" in {float(duration):.2f}s"
if isinstance(duration, int | float)
else ""
)
error = row.get("error")
if row.get("skipped"):
_log(options, f"[{task_id}] GT{index} {name} skipped{duration_text}: {error}")
continue
if error:
_log(options, f"[{task_id}] GT{index} {name} failed{duration_text}: {error}")
continue
point = parse_point_value(row.get("point"))
if point is None:
_log(options, f"[{task_id}] GT{index} {name} finished{duration_text}: no point")
continue
_log(
options,
f"[{task_id}] GT{index} {name} finished{duration_text}: "
f"({point.x:.1f}, {point.y:.1f})",
)
def _predict_judges_for_task(
task,
judges: list[ModelSpec],
predict_point: PredictPoint,
) -> list[tuple[dict[str, object], Point | None, str | None, str]]:
results: list[tuple[dict[str, object], Point | None, str | None, str] | None] = [
None
] * len(judges)
start = 0
while start < len(judges):
judge = judges[start]
if judge.provider.lower() == "openrouter":
end = start + 1
while (
end < len(judges)
and judges[end].provider.lower() == "openrouter"
):
end += 1
group_results = _predict_openrouter_judges(
task, judges[start:end], predict_point
)
for offset, result in enumerate(group_results):
results[start + offset] = result
start = end
continue
results[start] = _predict_judge(task, judge, predict_point)
start += 1
return [result for result in results if result is not None]
def _predict_openrouter_judges(
task,
judges: list[ModelSpec],
predict_point: PredictPoint,
) -> list[tuple[dict[str, object], Point | None, str | None, str]]:
max_workers = min(OPENROUTER_CANDIDATE_CONCURRENCY, len(judges))
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = [
executor.submit(_predict_judge, task, judge, predict_point)
for judge in judges
]
return [future.result() for future in futures]
def _predict_candidate(
task, model: ModelSpec, predict_point: PredictPoint
) -> dict[str, object]:
base: dict[str, object] = {
"task_id": task.task_id,
"image_path": task.image_path_text,
"instruction": task.instruction,
"model": model.name,
"model_id": model.model_id,
"point": None,
"reason": None,
"raw_text": None,
"error": None,
"skipped": False,
"duration_seconds": None,
}
started = time.perf_counter()
try:
reply = predict_point(model, task.image_path, task.instruction, "candidate")
except ModelSkipped as exc:
base["duration_seconds"] = time.perf_counter() - started
base["skipped"] = True
base["error"] = str(exc)
return base
except Exception as exc:
base["duration_seconds"] = time.perf_counter() - started
base["error"] = str(exc)
return base
base["duration_seconds"] = time.perf_counter() - started
parsed = parse_point_response(reply.text)
base["raw_text"] = reply.text
base["reason"] = parsed.reason
if parsed.point is None:
base["error"] = parsed.error
return base
base["point"] = parsed.point.as_list()
base["_point"] = parsed.point
return base
def _log_running(options: RunOptions, task_id: str, model: ModelSpec) -> None:
_log(
options,
f"[{task_id}] Running {model.name} ({model.provider}/{model.model_id})",
)
def _log_prediction_status(
options: RunOptions,
task_id: str,
model: ModelSpec,
prediction: dict[str, object],
) -> None:
duration = prediction.get("duration_seconds")
duration_text = f" in {float(duration):.2f}s" if isinstance(duration, float) else ""
if prediction.get("skipped"):
_log(
options,
f"[{task_id}] {model.name} skipped{duration_text}: {prediction['error']}",
)
elif prediction.get("error"):
_log(
options,
f"[{task_id}] {model.name} failed{duration_text}: {prediction['error']}",
)
else:
_log(options, f"[{task_id}] {model.name} finished{duration_text}")
def _write_scores_csv(path: Path, rows: list[dict[str, object]]) -> None:
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", encoding="utf-8", newline="") as handle:
writer = csv.DictWriter(handle, fieldnames=SCORE_FIELDNAMES)
writer.writeheader()
writer.writerows(rows)
def _build_result_rows(score_rows: list[dict[str, object]]) -> list[dict[str, object]]:
grouped: dict[str, list[dict[str, object]]] = {}
for row in score_rows:
grouped.setdefault(str(row["model"]), []).append(row)
result_rows: list[dict[str, object]] = []
for model_name, rows in grouped.items():
distances = [float(row["l2"]) for row in rows if row.get("l2") != ""]
durations = [
float(row["duration_seconds"])
for row in rows
if row.get("duration_seconds") not in {"", None}
]
skipped = sum(1 for row in rows if row.get("skipped") is True)
errors = sum(
1 for row in rows if row.get("error") and row.get("skipped") is not True
)
status = _result_status(len(rows), len(distances), errors, skipped)
result_rows.append(
{
"model": model_name,
"status": status,
"l2": statistics.fmean(distances) if distances else None,
"duration_seconds": statistics.fmean(durations) if durations else None,
"reason": _result_reason(rows, status),
}
)
return sorted(result_rows, key=_result_sort_key)
def _result_status(total: int, scored: int, errors: int, skipped: int) -> str:
if scored == total and errors == 0 and skipped == 0:
return "ok"
if scored > 0 and (errors or skipped):
return "partial"
if errors:
return "error"
if skipped:
return "skipped"
if scored == 0:
return "unscored"
return "ok"
def _result_reason(rows: list[dict[str, object]], status: str) -> str:
if status == "ok":
return ""
if status == "partial":
return _partial_result_reason(rows)
for row in rows:
if status == "skipped" and row.get("skipped") is not True:
continue
reason = row.get("error")
if reason:
return str(reason)
return "no score"
def _partial_result_reason(rows: list[dict[str, object]]) -> str:
errors = [
str(row.get("error"))
for row in rows
if row.get("error") and row.get("skipped") is not True
]
skipped = [row for row in rows if row.get("skipped") is True]
parts: list[str] = []
if errors:
parts.append(f"{len(errors)} error(s)")
if skipped:
parts.append(f"{len(skipped)} skipped")
prefix = ", ".join(parts) if parts else "partial"
return f"{prefix}; first: {errors[0]}" if errors else prefix
def _result_sort_key(row: dict[str, object]) -> tuple[int, float, str]:
l2 = row.get("l2")
is_ranked = l2 is not None
return (
0 if is_ranked else 1,
float(l2) if l2 is not None else float("inf"),
str(row["model"]),
)
def _progress(options: RunOptions, items, **kwargs):
if not _show_progress(options):
return items
return tqdm(items, dynamic_ncols=True, **kwargs)
def _candidate_progress(options: RunOptions, total: int, task_id: str):
if not _show_progress(options):
return None
desc = task_id if task_id == "Candidates" else f"{task_id} candidates"
return tqdm(
total=total,
desc=desc,
unit="call",
leave=False,
dynamic_ncols=True,
)
def _update_progress(progress_bar) -> None:
if progress_bar is not None:
progress_bar.update(1)
def _log(options: RunOptions, message: str) -> None:
if not options.progress:
return
if _show_progress(options):
tqdm.write(message, file=sys.stderr)
return
print(message, file=sys.stderr)
def _show_progress(options: RunOptions) -> bool:
return options.progress and sys.stderr.isatty()
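
For orientation, a minimal driver sketch for this runner. The import locations and the exact RunOptions constructor are assumptions inferred from the usage above, not a confirmed API:

from pathlib import Path

from click_eval.providers import ProviderClient  # module name assumed
from click_eval.runner import RunOptions, run_eval  # module name assumed

client = ProviderClient(timeout_seconds=90)
options = RunOptions(
    tasks_path=Path("tasks.jsonl"),
    models_path=Path("models.json"),
    out_dir=Path("runs/demo"),
    limit=5,            # first five tasks only
    model_limit=None,   # keep every candidate model
    fail_fast=False,
    annotate=True,      # write annotated screenshots
    progress=True,      # tqdm progress plus stderr logs
)
summary = run_eval(options, client.predict_point)
print(summary["summary"])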

View File

@@ -0,0 +1,30 @@
from __future__ import annotations
import base64
import mimetypes
from pathlib import Path
def require_pillow():
try:
from PIL import Image, ImageDraw, ImageFont
except ImportError as exc:
raise RuntimeError(
"Pillow is required for image dimensions and annotations. "
"Install prototypes/click_eval/requirements.txt or run with "
"`uv run --with pillow ...`."
) from exc
return Image, ImageDraw, ImageFont
def image_size(path: Path) -> tuple[int, int]:
Image, _, _ = require_pillow()
with Image.open(path) as image:
return image.size
def image_data_url(path: Path) -> str:
mime_type = mimetypes.guess_type(path.name)[0] or "image/png"
encoded = base64.b64encode(path.read_bytes()).decode("ascii")
return f"data:{mime_type};base64,{encoded}"

View File

@@ -0,0 +1,145 @@
from __future__ import annotations
import json
from pathlib import Path
from typing import Any, Iterable
from .contracts import ClickTask, ModelSpec
from .parsing import parse_point_value
def read_jsonl(path: Path) -> list[dict[str, Any]]:
rows: list[dict[str, Any]] = []
with path.open("r", encoding="utf-8") as handle:
for line_number, line in enumerate(handle, start=1):
stripped = line.strip()
if not stripped:
continue
try:
rows.append(json.loads(stripped))
except json.JSONDecodeError as exc:
raise ValueError(f"{path}:{line_number}: invalid JSONL: {exc}") from exc
return rows
def write_jsonl(path: Path, rows: Iterable[dict[str, Any]]) -> None:
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", encoding="utf-8") as handle:
for row in rows:
handle.write(json.dumps(row, ensure_ascii=False) + "\n")
def load_tasks(path: Path) -> list[ClickTask]:
rows = read_jsonl(path)
tasks: list[ClickTask] = []
base = path.parent
for index, row in enumerate(rows, start=1):
try:
task_id = str(row["task_id"])
image_path_text = str(row["image_path"])
instruction = str(row["instruction"])
except KeyError as exc:
raise ValueError(f"{path}:{index}: missing required field {exc}") from exc
image_path = Path(image_path_text)
if not image_path.is_absolute():
image_path = base / image_path
gt_point = parse_point_value(row.get("gt_point"))
tasks.append(
ClickTask(
task_id=task_id,
image_path=image_path,
image_path_text=image_path_text,
instruction=instruction,
gt_point=gt_point,
raw=dict(row),
)
)
return tasks
def load_model_config(
path: Path,
) -> tuple[list[ModelSpec], list[ModelSpec], dict[str, Any]]:
config = json.loads(path.read_text(encoding="utf-8"))
judges = _judge_specs(config)
candidate_entries = config.get("candidate_models") or []
candidates = [_model_spec(entry) for entry in candidate_entries]
if not candidates:
raise ValueError(f"{path}: candidate_models must contain at least one model")
return judges, candidates, config
def _judge_specs(config: dict[str, Any]) -> list[ModelSpec]:
if "judge_models" in config:
entries = config.get("judge_models") or []
if isinstance(entries, (str, dict)):
entries = [entries]
if not isinstance(entries, list):
raise ValueError("judge_models must be a model entry or list of entries")
return [
_model_spec(entry, default_name=f"judge-{index}")
for index, entry in enumerate(entries, start=1)
]
if config.get("judge_model"):
return [_model_spec(config["judge_model"], default_name="judge")]
return []
def _model_spec(entry: Any, default_name: str | None = None) -> ModelSpec:
if isinstance(entry, str):
return ModelSpec(name=default_name or entry, model_id=entry)
if isinstance(entry, dict):
model_id = str(entry.get("model") or entry.get("id") or "")
if not model_id:
raise ValueError(f"model entry is missing model/id: {entry}")
return ModelSpec(
name=str(entry.get("name") or default_name or model_id),
model_id=model_id,
provider=str(entry.get("provider") or "openrouter"),
estimated_vram_gb=_optional_float(
entry.get("estimated_vram_gb") or entry.get("vram_gb")
),
adapter=_optional_string(entry.get("adapter")),
quantization=_optional_string(entry.get("quantization")),
allow_cpu_offload=_optional_bool(entry.get("allow_cpu_offload")) or False,
dtype=_optional_string(entry.get("dtype")),
attn_implementation=_optional_string(entry.get("attn_implementation")),
            min_pixels=_optional_int(
                entry.get("min_pixels") or entry.get("image_min_pixels")
            ),
            max_pixels=_optional_int(
                entry.get("max_pixels") or entry.get("image_max_pixels")
            ),
max_new_tokens=_optional_int(entry.get("max_new_tokens")),
revision=_optional_string(entry.get("revision")),
use_safetensors=_optional_bool(entry.get("use_safetensors")),
)
raise ValueError(f"invalid model entry: {entry!r}")
def _optional_float(value: Any) -> float | None:
if value is None or value == "":
return None
return float(value)
def _optional_int(value: Any) -> int | None:
if value is None or value == "":
return None
return int(value)
def _optional_string(value: Any) -> str | None:
if value is None or value == "":
return None
return str(value)
def _optional_bool(value: Any) -> bool | None:
if value is None or value == "":
return None
if isinstance(value, str):
return value.strip().lower() in {"1", "true", "yes", "on"}
return bool(value)
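
For reference, the input shapes these loaders accept, written out as a sketch (model IDs and paths are placeholders):

import json
from pathlib import Path

# models.json: judge entries may be bare model-ID strings or full dicts;
# string candidates default to provider "openrouter".
config = {
    "judge_models": ["anthropic/claude-sonnet-4"],  # placeholder ID
    "candidate_models": [
        "qwen/qwen2.5-vl-72b-instruct",  # placeholder ID
        {
            "name": "local-qwen",
            "provider": "local_hf",
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder ID
            "dtype": "float16",
            "allow_cpu_offload": True,
        },
    ],
}
Path("models.json").write_text(json.dumps(config, indent=2), encoding="utf-8")

# tasks.jsonl: one task per line; gt_point is optional, and relative
# image paths resolve against the tasks file's directory.
task = {
    "task_id": "hn-001",
    "image_path": "screens/hn.png",
    "instruction": "Click the 'new' tab",
    "gt_point": [412, 23],
}
Path("tasks.jsonl").write_text(json.dumps(task) + "\n", encoding="utf-8")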

File diff suppressed because it is too large

View File

@@ -0,0 +1,96 @@
from __future__ import annotations
import json
import os
import re
import urllib.error
import urllib.request
from pathlib import Path
from typing import Any
from .contracts import ModelReply
from .image_utils import image_data_url, image_size
MOONDREAM_POINT_URL = "https://api.moondream.ai/v1/point"
class MoondreamClient:
def __init__(
self,
api_key: str | None = None,
base_url: str = MOONDREAM_POINT_URL,
timeout_seconds: int = 90,
) -> None:
self.api_key = api_key or os.environ.get("MOONDREAM_API_KEY")
if not self.api_key:
raise RuntimeError("MOONDREAM_API_KEY is required")
self.base_url = base_url
self.timeout_seconds = timeout_seconds
def predict_point(
self,
model_id: str,
image_path: Path,
instruction: str,
purpose: str,
) -> ModelReply:
width, height = image_size(image_path)
payload = {
"image_url": image_data_url(image_path),
"object": _object_query(instruction),
}
raw = self._post(payload)
raw.setdefault("model", model_id)
return ModelReply(text=_point_text(raw, width, height), raw=raw)
def _post(self, payload: dict[str, Any]) -> dict[str, Any]:
request = urllib.request.Request(
self.base_url,
data=json.dumps(payload).encode("utf-8"),
headers={
"X-Moondream-Auth": self.api_key,
"Content-Type": "application/json",
"User-Agent": "click-eval/0.1",
},
method="POST",
)
try:
with urllib.request.urlopen(
request, timeout=self.timeout_seconds
) as response:
body = response.read().decode("utf-8")
except urllib.error.HTTPError as exc:
detail = exc.read().decode("utf-8", errors="replace")
raise RuntimeError(f"Moondream HTTP {exc.code}: {detail}") from exc
except urllib.error.URLError as exc:
raise RuntimeError(f"Moondream request failed: {exc}") from exc
return json.loads(body)
def _point_text(raw: dict[str, Any], width: int, height: int) -> str:
try:
first_point = raw["points"][0]
x = float(first_point["x"]) * width
y = float(first_point["y"]) * height
except (KeyError, IndexError, TypeError, ValueError) as exc:
raise RuntimeError(f"Unexpected Moondream response shape: {raw}") from exc
return json.dumps(
{
"x": x,
"y": y,
"reason": "first point returned by Moondream point API",
}
)
def _object_query(instruction: str) -> str:
stripped = instruction.strip().rstrip(".")
query = re.sub(
r"^(?:please\s+)?(?:click|tap|press|select|choose|open)\s+(?:on\s+)?",
"",
stripped,
flags=re.IGNORECASE,
).strip()
return query or stripped
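
The query normalization above strips a leading action verb before hitting the point API; a few behavioral checks:

from click_eval.moondream import _object_query  # module name assumed

assert _object_query("Click on the Submit button.") == "the Submit button"
assert _object_query("please open Settings") == "Settings"
assert _object_query("Submit") == "Submit"  # no leading verb: passes through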

View File

@@ -0,0 +1,191 @@
from __future__ import annotations
import base64
import io
import json
import os
import urllib.error
import urllib.request
from pathlib import Path
from typing import Any
from .contracts import ModelReply, Point
from .image_utils import image_size, require_pillow
OPENAI_RESPONSES_URL = "https://api.openai.com/v1/responses"
class OpenAIComputerUseClient:
def __init__(
self,
api_key: str | None = None,
base_url: str = OPENAI_RESPONSES_URL,
timeout_seconds: int = 90,
max_output_tokens: int = 1024,
) -> None:
self.api_key = api_key or os.environ.get("OPENAI_API_KEY")
if not self.api_key:
raise RuntimeError("OPENAI_API_KEY is required")
self.base_url = base_url
self.timeout_seconds = timeout_seconds
self.max_output_tokens = max_output_tokens
def predict_point(
self,
model_id: str,
image_path: Path,
instruction: str,
purpose: str,
) -> ModelReply:
original_width, original_height = image_size(image_path)
screenshot = _original_screenshot(image_path)
raw = self._post(
{
"model": model_id,
"max_output_tokens": self.max_output_tokens,
"truncation": "auto",
"tools": [
{
"type": "computer_use_preview",
"display_width": screenshot.width,
"display_height": screenshot.height,
"environment": "browser",
}
],
"input": [
{
"role": "user",
"content": [
{
"type": "input_text",
"text": _computer_use_prompt(instruction, purpose),
},
{
"type": "input_image",
"image_url": screenshot.data_url,
"detail": "original",
},
],
}
],
}
)
point = _point_from_computer_response(raw)
if point is None:
return ModelReply(text=_raw_text(raw), raw=raw)
return _reply_from_point(
point, screenshot, original_width, original_height, raw
)
def _post(self, payload: dict[str, Any]) -> dict[str, Any]:
request = urllib.request.Request(
self.base_url,
data=json.dumps(payload).encode("utf-8"),
headers={
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json",
},
method="POST",
)
try:
with urllib.request.urlopen(
request, timeout=self.timeout_seconds
) as response:
body = response.read().decode("utf-8")
except urllib.error.HTTPError as exc:
detail = exc.read().decode("utf-8", errors="replace")
raise RuntimeError(f"OpenAI HTTP {exc.code}: {detail}") from exc
except urllib.error.URLError as exc:
raise RuntimeError(f"OpenAI request failed: {exc}") from exc
return json.loads(body)
def _computer_use_prompt(instruction: str, purpose: str) -> str:
role_line = (
"Choose the ground-truth click point for this instruction."
if purpose == "ground_truth"
else "Predict the click point for this instruction."
)
return (
f"{role_line}\n\n"
"Use the computer tool and emit exactly one left-button click action at "
"the center of the target UI element. Do not type, scroll, navigate, "
"wait, or use any other action.\n\n"
f"Instruction: {instruction}"
)
class _ScaledScreenshot:
def __init__(self, width: int, height: int, data_url: str) -> None:
self.width = width
self.height = height
self.data_url = data_url
def _original_screenshot(path: Path) -> _ScaledScreenshot:
Image, _, _ = require_pillow()
with Image.open(path) as source:
image = source.convert("RGB")
buffer = io.BytesIO()
image.save(buffer, format="PNG")
encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
return _ScaledScreenshot(
width=image.width,
height=image.height,
data_url=f"data:image/png;base64,{encoded}",
)
def _point_from_computer_response(raw: dict[str, Any]) -> Point | None:
for item in raw.get("output", []):
if not isinstance(item, dict) or item.get("type") != "computer_call":
continue
action = item.get("action")
if not isinstance(action, dict) or action.get("type") != "click":
continue
try:
return Point(x=float(action["x"]), y=float(action["y"]))
except (KeyError, TypeError, ValueError):
return None
return None
def _reply_from_point(
point: Point,
screenshot: _ScaledScreenshot,
original_width: int,
original_height: int,
raw: dict[str, Any],
) -> ModelReply:
scaled = Point(
x=point.x * original_width / screenshot.width,
y=point.y * original_height / screenshot.height,
)
return ModelReply(
text=json.dumps(
{
"x": scaled.x,
"y": scaled.y,
"reason": "OpenAI Computer Use click action",
"display_x": point.x,
"display_y": point.y,
"display_width": screenshot.width,
"display_height": screenshot.height,
"original_width": original_width,
"original_height": original_height,
}
),
raw=raw,
)
def _raw_text(raw: dict[str, Any]) -> str:
text_parts = []
for item in raw.get("output", []):
if not isinstance(item, dict):
continue
if item.get("type") == "message":
for content in item.get("content", []):
if isinstance(content, dict) and "text" in content:
text_parts.append(str(content["text"]))
return "\n".join(text_parts) if text_parts else json.dumps(raw, ensure_ascii=False)

View File

@@ -0,0 +1,319 @@
from __future__ import annotations
import base64
import io
import json
import os
import urllib.error
import urllib.request
from pathlib import Path
from typing import Any
from .contracts import ModelReply
from .image_utils import image_data_url, image_size, require_pillow
from .parsing import parse_point_response
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
GPT_POINT_MAX_TOKENS = 8192
LENGTH_RETRY_MAX_TOKENS = 16384
class OpenRouterClient:
def __init__(
self,
api_key: str | None = None,
base_url: str = OPENROUTER_URL,
timeout_seconds: int = 90,
temperature: float = 0.0,
max_tokens: int = 512,
) -> None:
self.api_key = api_key or os.environ.get("OPENROUTER_API_KEY")
if not self.api_key:
raise RuntimeError("OPENROUTER_API_KEY is required")
self.base_url = base_url
self.timeout_seconds = timeout_seconds
self.temperature = temperature
self.max_tokens = max_tokens
def predict_point(
self,
model_id: str,
image_path: Path,
instruction: str,
purpose: str,
) -> ModelReply:
width, height = image_size(image_path)
image_payload = _image_payload_for_model(model_id, image_path, width, height)
prompt = _point_prompt(
instruction,
image_payload.width,
image_payload.height,
purpose,
original_width=width,
original_height=height,
resized=image_payload.resized,
)
payload = {
"model": model_id,
"temperature": self.temperature,
"max_tokens": _max_tokens_for_point_call(model_id, self.max_tokens),
"messages": [
{
"role": "system",
"content": (
"You identify a single click coordinate in screenshots. "
"Return only valid JSON."
),
},
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {"url": image_payload.data_url},
},
],
},
],
}
reasoning = _reasoning_for_point_call(model_id)
if reasoning is not None:
payload["reasoning"] = reasoning
if _force_json_response_format(model_id):
payload["response_format"] = {"type": "json_object"}
raw = self._post(payload)
try:
text = _message_text(raw)
except RuntimeError:
if not _is_null_content_length_response(raw):
raise
payload["max_tokens"] = max(
int(payload.get("max_tokens") or 0), LENGTH_RETRY_MAX_TOKENS
)
raw = self._post(payload)
text = _message_text(raw)
if image_payload.resized:
return _rescaled_reply(text, raw, image_payload, width, height)
return ModelReply(text=text, raw=raw)
def _post(self, payload: dict[str, Any]) -> dict[str, Any]:
headers = {
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json",
}
referer = os.environ.get("OPENROUTER_HTTP_REFERER")
title = os.environ.get("OPENROUTER_TITLE", "BrowserOS click eval")
if referer:
headers["HTTP-Referer"] = referer
if title:
headers["X-Title"] = title
request = urllib.request.Request(
self.base_url,
data=json.dumps(payload).encode("utf-8"),
headers=headers,
method="POST",
)
try:
with urllib.request.urlopen(
request, timeout=self.timeout_seconds
) as response:
body = response.read().decode("utf-8")
except urllib.error.HTTPError as exc:
detail = exc.read().decode("utf-8", errors="replace")
raise RuntimeError(f"OpenRouter HTTP {exc.code}: {detail}") from exc
except urllib.error.URLError as exc:
raise RuntimeError(f"OpenRouter request failed: {exc}") from exc
return json.loads(body)
def _point_prompt(
instruction: str,
width: int,
height: int,
purpose: str,
*,
original_width: int,
original_height: int,
resized: bool,
) -> str:
role_line = (
"Choose the ground-truth click point for this instruction."
if purpose == "ground_truth"
else "Predict where this model should click for this instruction."
)
resize_line = ""
if resized:
resize_line = (
"The attached image was resized client-side from the original "
f"{original_width}x{original_height} screenshot. Return coordinates "
"in the attached image's pixel coordinate space; the harness will "
"map them back to the original screenshot.\n"
)
return (
f"{role_line}\n\n"
f"Screenshot size: {width}x{height} pixels.\n"
f"{resize_line}"
"Coordinate system: x increases left to right, y increases top to bottom, "
"origin is the top-left pixel of the screenshot.\n\n"
f"Instruction: {instruction}\n\n"
"Choose the center of the target UI element. If your natural output is "
"a bounding box, convert it to its center point. Always estimate a point; "
"do not answer with a label, description, placeholder, or bounding box.\n\n"
"The requested target is present in the screenshot. Never answer that no "
"target exists; choose the closest matching visible UI element if uncertain.\n\n"
"Return only this JSON shape with numeric pixel coordinates, no markdown:\n"
'{"x": 123, "y": 456, "reason": "short reason"}'
)
def _message_text(raw: dict[str, Any]) -> str:
try:
choice = raw["choices"][0]
message = choice["message"]
content = message["content"]
except (KeyError, IndexError, TypeError) as exc:
raise RuntimeError(f"Unexpected OpenRouter response shape: {raw}") from exc
if isinstance(content, str):
return content
if isinstance(content, list):
parts = []
for item in content:
if isinstance(item, dict) and item.get("type") == "text":
parts.append(str(item.get("text", "")))
return "\n".join(parts)
if content is None:
finish_reason = choice.get("finish_reason")
has_reasoning = bool(message.get("reasoning"))
raise RuntimeError(
"OpenRouter returned null message.content "
f"(finish_reason={finish_reason}, has_reasoning={has_reasoning})"
)
return str(content)
def _reasoning_for_point_call(model_id: str) -> dict[str, object] | None:
lowered = model_id.lower()
if lowered.startswith("z-ai/glm-"):
return {"effort": "none", "exclude": True}
if lowered.startswith("openai/") or "gpt-5" in lowered:
effort = os.environ.get("OPENROUTER_GPT_POINT_REASONING_EFFORT", "low")
return {"effort": effort, "exclude": True}
return None
def _force_json_response_format(model_id: str) -> bool:
lowered = model_id.lower()
if lowered.startswith("z-ai/glm-5v"):
return False
return lowered.startswith("z-ai/glm-") or lowered.startswith("openai/")
def _max_tokens_for_point_call(model_id: str, default: int) -> int:
lowered = model_id.lower()
if lowered.startswith("openai/") or "gpt-5" in lowered:
return max(default, GPT_POINT_MAX_TOKENS)
return default
def _is_null_content_length_response(raw: dict[str, Any]) -> bool:
try:
choice = raw["choices"][0]
return (
choice.get("finish_reason") == "length"
and choice.get("message", {}).get("content") is None
)
except (KeyError, IndexError, TypeError, AttributeError):
return False
class _ImagePayload:
def __init__(
self,
width: int,
height: int,
data_url: str,
resized: bool,
) -> None:
self.width = width
self.height = height
self.data_url = data_url
self.resized = resized
def _image_payload_for_model(
model_id: str, image_path: Path, width: int, height: int
) -> _ImagePayload:
max_long_edge = _claude_max_long_edge(model_id)
if max_long_edge is None or max(width, height) <= max_long_edge:
return _ImagePayload(
width=width,
height=height,
data_url=image_data_url(image_path),
resized=False,
)
Image, _, _ = require_pillow()
with Image.open(image_path) as source:
image = source.convert("RGB")
scale = max_long_edge / max(width, height)
image = image.resize(
(max(1, round(width * scale)), max(1, round(height * scale))),
Image.Resampling.LANCZOS,
)
buffer = io.BytesIO()
image.save(buffer, format="PNG")
encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
return _ImagePayload(
width=image.width,
height=image.height,
data_url=f"data:image/png;base64,{encoded}",
resized=True,
)
def _claude_max_long_edge(model_id: str) -> int | None:
lowered = model_id.lower()
if not lowered.startswith("anthropic/claude"):
return None
if "opus-4.7" in lowered or "opus-4-7" in lowered:
return 2576
return 1568
def _rescaled_reply(
text: str,
raw: dict[str, Any],
image_payload: _ImagePayload,
original_width: int,
original_height: int,
) -> ModelReply:
parsed = parse_point_response(text)
if parsed.point is None:
return ModelReply(text=text, raw=raw)
point = parsed.point
scaled_x = point.x * original_width / image_payload.width
scaled_y = point.y * original_height / image_payload.height
return ModelReply(
text=json.dumps(
{
"x": scaled_x,
"y": scaled_y,
"reason": parsed.reason or "OpenRouter image resized client-side",
"display_x": point.x,
"display_y": point.y,
"display_width": image_payload.width,
"display_height": image_payload.height,
"original_width": original_width,
"original_height": original_height,
"raw_text": text,
}
),
raw=raw,
)
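
Worked numbers for the client-side resize path: a 3840x2160 screenshot sent to a Claude model with a 1568px long-edge cap, and a returned point mapped back to the original coordinate space:

scale = 1568 / 3840                                # ~0.408
sent = (round(3840 * scale), round(2160 * scale))  # (1568, 882)
back_x = 784 * 3840 / 1568                         # 1920.0 in original space
back_y = 441 * 2160 / 882                          # 1080.0 in original space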

View File

@@ -0,0 +1,539 @@
from __future__ import annotations
import ast
import json
import math
import re
from typing import Any
from .contracts import ParsedPoint, Point
_NUMBER_PATTERN = r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)"
_POINT_KEY_PATTERN = (
r"['\"]?(?:click_point|point_2d|POINT_2D|POINT|bbox_2d|coordinates|"
r"coordinate|coords|xy|screen_point|click_position|click_coordinates|"
r"cursor_position|target_position|start_box|position|bbox|box|target|"
r"point|center|location)['\"]?"
)
_ACTION_CONTEXT_PATTERN = (
r"(?:tool_call|computer_use|pyautogui|left_click|right_click|double_click|"
r"middle_click|mouse_move|tap|click|Action\.[A-Za-z_]+)"
)
def parse_point_value(value: Any) -> Point | None:
if isinstance(value, Point):
return value
if isinstance(value, str):
point = _point_from_keyed_text(value)
if point is not None:
return point
return _point_from_standalone_numeric_text(value)
if isinstance(value, (list, tuple)):
if len(value) == 1:
return parse_point_value(value[0])
if len(value) >= 4:
x1 = _to_float(value[0])
y1 = _to_float(value[1])
x2 = _to_float(value[2])
y2 = _to_float(value[3])
if None not in {x1, y1, x2, y2}:
return Point(x=(x1 + x2) / 2, y=(y1 + y2) / 2)
if len(value) >= 2:
direct_point = _point_from_numbers(value[0], value[1])
if direct_point is not None:
return direct_point
if len(value) >= 2 and isinstance(value[0], (list, tuple)):
first = parse_point_value(value[0])
second = parse_point_value(value[1])
if first is not None and second is not None:
return Point(x=(first.x + second.x) / 2, y=(first.y + second.y) / 2)
for item in value:
point = parse_point_value(item)
if point is not None:
return point
if isinstance(value, dict):
if "x" in value and "y" in value:
point = _point_from_numbers(value["x"], value["y"])
if point is not None:
return point
for key in ("x", "y"):
point = parse_point_value(value[key])
if point is not None:
return point
for key in (
"point",
"POINT",
"click_point",
"clickPoint",
"target_point",
"coordinate",
"Coordinate",
"coordinates",
"coords",
"xy",
"point_2d",
"POINT_2D",
"bbox_2d",
"click_position",
"click_coordinates",
"cursor_position",
"target_position",
"position",
"bbox",
"box",
"start_box",
"arguments",
):
if key in value:
point = parse_point_value(value[key])
if point is not None:
return point
return None
def _strip_thinking(text: str) -> str:
marker = "</think>"
if marker in text:
return text.split(marker, 1)[1].strip()
return text
def parse_point_response(text: str) -> ParsedPoint:
text = _strip_thinking(text)
fallback_error: str | None = None
for value_text in _structured_value_candidates(text):
obj, error = _parse_structured_value(value_text)
if error is not None:
fallback_error = fallback_error or error
point = _point_from_text_fragment(value_text)
if point is not None:
return ParsedPoint(point=point)
continue
point = parse_point_value(obj)
if point is not None:
reason = obj.get("reason") if isinstance(obj, dict) else None
return ParsedPoint(
point=point, reason=str(reason) if reason is not None else None
)
point = _point_from_text_fragment(value_text)
if point is not None:
return ParsedPoint(point=point)
fallback_error = fallback_error or "response did not contain numeric x/y"
point = _point_from_text_fragment(text)
if point is not None:
return ParsedPoint(point=point)
return ParsedPoint(
point=None,
error=_parse_failure_message(text, fallback_error),
)
def _point_from_numbers(x_value: Any, y_value: Any) -> Point | None:
x = _to_float(x_value)
y = _to_float(y_value)
if x is None or y is None:
return None
if not math.isfinite(x) or not math.isfinite(y):
return None
return Point(x=x, y=y)
def _to_float(value: Any) -> float | None:
try:
return float(value)
except (TypeError, ValueError):
return None
def _point_from_text_fragment(text: str) -> Point | None:
for parser in (
_point_from_keyed_text,
_point_from_action_context,
_point_from_coordinate_context,
_point_from_standalone_numeric_text,
):
point = parser(text)
if point is not None:
return point
return None
def _point_from_keyed_text(text: str) -> Point | None:
click_call = re.search(
rf"(?:pyautogui\.)?click\s*\(\s*(?:x\s*=\s*)?({_NUMBER_PATTERN})"
rf"\s*,\s*(?:y\s*=\s*)?({_NUMBER_PATTERN})",
text,
flags=re.IGNORECASE,
)
if click_call:
return _point_from_numbers(click_call.group(1), click_call.group(2))
keyed_numbers = _point_from_keyed_numeric_tail(text)
if keyed_numbers is not None:
return keyed_numbers
keyed_sequence = re.search(
rf"(?<![A-Za-z0-9_]){_POINT_KEY_PATTERN}"
rf"(?![A-Za-z0-9_])\s*(?::|=)?\s*['\"]?[\[(]\s*"
rf"({_NUMBER_PATTERN}(?:\s*,\s*{_NUMBER_PATTERN}){{1,3}})",
text,
flags=re.IGNORECASE,
)
if keyed_sequence:
numbers = re.findall(_NUMBER_PATTERN, keyed_sequence.group(1))
point = parse_point_value(numbers[:4])
if point is not None:
return point
keyed_pair = re.search(
rf"(?<![A-Za-z0-9_]){_POINT_KEY_PATTERN}"
rf"(?![A-Za-z0-9_])\s*(?::|=)?\s*"
rf"({_NUMBER_PATTERN})\s*,\s*({_NUMBER_PATTERN})",
text,
flags=re.IGNORECASE,
)
if keyed_pair:
return _point_from_numbers(keyed_pair.group(1), keyed_pair.group(2))
xy_point = _point_from_xy_text(text)
if xy_point is not None:
return xy_point
malformed_x_pair = re.search(
rf"(?<![A-Za-z0-9_])['\"]?x['\"]?(?![A-Za-z0-9_])"
rf"\s*(?::|=)\s*({_NUMBER_PATTERN})\s*,\s*['\"]?\s*({_NUMBER_PATTERN})"
rf"(?=\s*(?:['\"]?\s*,|\}}|\]|\)|$))",
text,
flags=re.IGNORECASE,
)
if malformed_x_pair:
return _point_from_numbers(malformed_x_pair.group(1), malformed_x_pair.group(2))
action_pair = re.search(
rf"(?:left_click|double_click|right_click|mouse_move|click|tap)"
rf"\s*(?:at|to|\(|coordinate|coordinates)\s*[^0-9+\-.]{{0,80}}"
rf"({_NUMBER_PATTERN})\s*[, ]\s*({_NUMBER_PATTERN})",
text,
flags=re.IGNORECASE,
)
if action_pair:
return _point_from_numbers(action_pair.group(1), action_pair.group(2))
return None
def _point_from_action_context(text: str) -> Point | None:
if not re.search(_ACTION_CONTEXT_PATTERN, text, flags=re.IGNORECASE):
return None
action_window = _point_from_action_windows(text)
if action_window is not None:
return action_window
tagged_window = _point_from_tagged_tool_calls(text)
if tagged_window is not None:
return tagged_window
return _point_from_bracketed_pairs(text)
def _point_from_action_windows(text: str) -> Point | None:
patterns = (
r"(?:left_click|right_click|double_click|middle_click|tap|click|"
r"Action\.[A-Za-z_]*CLICK[A-Za-z_]*)",
_ACTION_CONTEXT_PATTERN,
)
for pattern in patterns:
for match in re.finditer(pattern, text, flags=re.IGNORECASE):
window = text[match.start() : match.end() + 260]
point = _point_from_keyed_numeric_tail(window)
if point is not None:
return point
point = _point_from_bracketed_pairs(window)
if point is not None:
return point
point = _point_from_colon_numeric_pair(window)
if point is not None:
return point
point = _point_from_xy_text(window)
if point is not None:
return point
return None
def _point_from_tagged_tool_calls(text: str) -> Point | None:
for tagged in re.finditer(
r"<tool_call>\s*(.*?)\s*</tool_call>",
text,
flags=re.DOTALL | re.IGNORECASE,
):
content = tagged.group(1)
point = _point_from_keyed_numeric_tail(content)
if point is not None:
return point
point = _point_from_bracketed_pairs(content)
if point is not None:
return point
return None
def _point_from_bracketed_pairs(text: str) -> Point | None:
for match in re.finditer(
rf"[\[(]\s*({_NUMBER_PATTERN}(?:\s*[, ]\s*{_NUMBER_PATTERN}){{1,3}})",
text,
):
point = parse_point_value(re.findall(_NUMBER_PATTERN, match.group(1))[:4])
if point is not None:
return point
return None
def _point_from_colon_numeric_pair(text: str) -> Point | None:
for match in re.finditer(
rf"(?::|[\[(])\s*({_NUMBER_PATTERN})\s*,\s*({_NUMBER_PATTERN})"
rf"(?=\s*(?:[\]}}\)]|,|$))",
text,
):
point = _point_from_numbers(match.group(1), match.group(2))
if point is not None:
return point
return None
def _point_from_coordinate_context(text: str) -> Point | None:
context = (
r"(?:coordinates?|coords?|point|position|center|centre|location|"
r"target|click(?:\s+(?:point|position|coordinate|location))?)"
)
relation = (
r"(?:(?:is|are|at|to|around|approximately|approx\.?|about|near|"
r"would\s+be|should\s+be)\b|=|:)"
)
patterns = (
rf"\b{context}\b[^\n\r.;]{{0,120}}\b{relation}\s*"
rf"[\[(]?\s*(?:x\s*(?:=|:)\s*)?({_NUMBER_PATTERN})\s*,\s*"
rf"(?:y\s*(?:=|:)\s*)?({_NUMBER_PATTERN})",
rf"\b(?:at|around|approximately|approx\.?|about|near)\b"
rf"[^\n\r.;]{{0,40}}[\[(]\s*({_NUMBER_PATTERN})\s*,\s*"
rf"({_NUMBER_PATTERN})",
)
for pattern in patterns:
for match in re.finditer(pattern, text, flags=re.IGNORECASE):
point = _point_from_numbers(match.group(1), match.group(2))
if point is not None:
return point
return None
def _point_from_keyed_numeric_tail(text: str) -> Point | None:
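    # Find a point-like key (e.g. "coordinate", "start_box"), require an
    # assignment marker after it, then parse the numeric tail that follows.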
for match in re.finditer(
rf"(?<![A-Za-z0-9_]){_POINT_KEY_PATTERN}(?![A-Za-z0-9_])",
text,
flags=re.IGNORECASE,
):
tail = text[match.end() : match.end() + 220]
assigned = re.match(r"\s*(?::|=|\(|is\b|are\b|at\b)\s*", tail)
if assigned is None:
continue
point = _point_from_numeric_text(tail[assigned.end() :])
if point is not None:
return point
return None
def _point_from_standalone_numeric_text(text: str) -> Point | None:
stripped = text.strip().strip("`")
if not re.fullmatch(r"[\s\[\]\(\)\{\},;:'\"xXyY=.+\-0-9]+", stripped):
return None
return _point_from_numeric_text(stripped)
def _point_from_numeric_text(text: str) -> Point | None:
segment = text[:220]
xy_point = _point_from_xy_text(segment)
if xy_point is not None:
return xy_point
wrapped_numbers = _constructor_wrapped_numbers(segment)
if len(wrapped_numbers) >= 2:
return parse_point_value(wrapped_numbers[:4])
sequence = re.search(
rf"[\[(]\s*({_NUMBER_PATTERN}(?:\s*[, ]\s*{_NUMBER_PATTERN}){{1,3}})",
segment,
)
numbers = (
re.findall(_NUMBER_PATTERN, sequence.group(1))
if sequence
else re.findall(_NUMBER_PATTERN, segment)
)
if len(numbers) >= 4:
return parse_point_value(numbers[:4])
if len(numbers) >= 2:
return parse_point_value(numbers[:2])
return None
def _constructor_wrapped_numbers(text: str) -> list[str]:
return [
match.group(1)
for match in re.finditer(
rf"(?:[A-Za-z_][\w]*\.)?[A-Za-z_][\w]*\(\s*({_NUMBER_PATTERN})\s*\)",
text,
)
]
def _point_from_xy_text(text: str) -> Point | None:
x_match = re.search(
rf"(?<![A-Za-z0-9_])['\"]?x['\"]?(?![A-Za-z0-9_])"
rf"\s*(?::|=|is\b)\s*(?:[A-Za-z_][\w.]*\(\s*)?({_NUMBER_PATTERN})",
text,
flags=re.IGNORECASE,
)
y_match = re.search(
rf"(?<![A-Za-z0-9_])['\"]?y['\"]?(?![A-Za-z0-9_])"
rf"\s*(?::|=|is\b)\s*(?:[A-Za-z_][\w.]*\(\s*)?({_NUMBER_PATTERN})",
text,
flags=re.IGNORECASE,
)
if x_match and y_match:
return _point_from_numbers(x_match.group(1), y_match.group(1))
return None
def _parse_structured_value(text: str) -> tuple[Any | None, str | None]:
try:
return json.loads(text), None
except json.JSONDecodeError as json_error:
try:
return ast.literal_eval(text), None
except (SyntaxError, ValueError) as literal_error:
return (
None,
f"invalid JSON/Python literal: {json_error.msg}; {literal_error}",
)
def _parse_failure_message(text: str, structured_error: str | None) -> str:
message = "response did not contain a numeric point value"
if structured_error:
message += "; structured JSON/Python parse failed"
return f"{message}; raw preview: {_raw_preview(text)}"
def _raw_preview(text: str, max_chars: int = 240) -> str:
compact = " ".join(text.strip().split())
if len(compact) <= max_chars:
return compact
return f"{compact[: max_chars - 1]}..."
def _structured_value_candidates(text: str) -> list[str]:
candidates: list[str] = []
seen: set[str] = set()
def add(value: str | None) -> None:
if value is None:
return
stripped = value.strip()
if not stripped or stripped in seen:
return
seen.add(stripped)
candidates.append(stripped)
for fenced in re.finditer(r"```(?:json)?\s*(.*?)\s*```", text, flags=re.DOTALL):
add(fenced.group(1))
for tagged in re.finditer(
r"<tool_call>\s*(.*?)\s*</tool_call>",
text,
flags=re.DOTALL | re.IGNORECASE,
):
add(tagged.group(1))
for tagged in re.finditer(
r"<(?:answer|final|json)[^>]*>\s*(.*?)\s*</(?:answer|final|json)>",
text,
flags=re.DOTALL | re.IGNORECASE,
):
add(tagged.group(1))
for tagged in re.finditer(
r"<\|(?:point|box)_start\|>\s*(.*?)\s*<\|(?:point|box)_end\|>",
text,
flags=re.DOTALL | re.IGNORECASE,
):
add(tagged.group(1))
for tagged in re.finditer(
r"<(?:point|box)[^>]*>\s*(.*?)\s*</(?:point|box)>",
text,
flags=re.DOTALL | re.IGNORECASE,
):
add(tagged.group(1))
for opener, closer in (("{", "}"), ("[", "]"), ("(", ")")):
for value in _balanced_values(text, opener, closer):
add(value)
return candidates
def _balanced_values(text: str, opener: str, closer: str) -> list[str]:
values: list[str] = []
index = 0
while index < len(text):
start = text.find(opener, index)
if start == -1:
break
value = _balanced_from(text, start, opener, closer)
if value is None:
index = start + 1
continue
values.append(value)
index = start + len(value)
return values
def _balanced_from(text: str, start: int, opener: str, closer: str) -> str | None:
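    # Walk forward from `start`, tracking bracket depth while skipping quoted
    # strings (with backslash escapes) so punctuation inside string literals
    # cannot unbalance the span.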
depth = 0
in_string = False
escaped = False
string_quote = ""
for index in range(start, len(text)):
char = text[index]
if in_string:
if escaped:
escaped = False
elif char == "\\":
escaped = True
elif char == string_quote:
in_string = False
continue
if char in {"'", '"'}:
in_string = True
string_quote = char
elif char == opener:
depth += 1
elif char == closer:
depth -= 1
if depth == 0:
return text[start : index + 1]
return None
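
A few of the reply shapes this parser recovers, as a smoke test (module name assumed):

from click_eval.parsing import parse_point_response

replies = (
    '{"x": 123, "y": 456, "reason": "button center"}',
    '<tool_call>{"arguments": {"coordinate": [200, 340]}}</tool_call>',
    "I would click at approximately (512, 64).",
    "pyautogui.click(x=640, y=360)",
)
for reply in replies:
    assert parse_point_response(reply).point is not None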

View File

@@ -0,0 +1,93 @@
from __future__ import annotations
from contextlib import nullcontext
from pathlib import Path
from typing import Callable
from .contracts import ModelReply, ModelSpec
from .gemini import GeminiComputerUseClient
from .local_hf import LocalHFClient
from .moondream import MoondreamClient
from .openai_cu import OpenAIComputerUseClient
from .openrouter import OpenRouterClient
class ProviderClient:
def __init__(
self,
timeout_seconds: int = 90,
log_callback: Callable[[str], None] | None = None,
) -> None:
self.timeout_seconds = timeout_seconds
self._log_callback = log_callback
self._openrouter: OpenRouterClient | None = None
self._moondream: MoondreamClient | None = None
self._gemini: GeminiComputerUseClient | None = None
self._openai_cu: OpenAIComputerUseClient | None = None
self._local_hf: LocalHFClient | None = None
def predict_point(
self,
model: ModelSpec,
image_path: Path,
instruction: str,
purpose: str,
) -> ModelReply:
provider = model.provider.lower()
if provider == "openrouter":
return self._openrouter_client().predict_point(
model.model_id, image_path, instruction, purpose
)
if provider == "moondream":
return self._moondream_client().predict_point(
model.model_id, image_path, instruction, purpose
)
if provider == "gemini":
return self._gemini_client().predict_point(
model.model_id, image_path, instruction, purpose
)
if provider == "openai_computer_use":
return self._openai_cu_client().predict_point(
model.model_id, image_path, instruction, purpose
)
if provider == "local_hf":
return self._local_hf_client().predict_point(
model, image_path, instruction, purpose
)
raise RuntimeError(f"Unsupported model provider: {model.provider}")
def model_run_context(self, model: ModelSpec):
if model.provider.lower() == "local_hf":
return self._local_hf_client().retain_loaded_models()
return nullcontext()
def _openrouter_client(self) -> OpenRouterClient:
if self._openrouter is None:
self._openrouter = OpenRouterClient(timeout_seconds=self.timeout_seconds)
return self._openrouter
def _moondream_client(self) -> MoondreamClient:
if self._moondream is None:
self._moondream = MoondreamClient(timeout_seconds=self.timeout_seconds)
return self._moondream
def _gemini_client(self) -> GeminiComputerUseClient:
if self._gemini is None:
self._gemini = GeminiComputerUseClient(timeout_seconds=self.timeout_seconds)
return self._gemini
def _openai_cu_client(self) -> OpenAIComputerUseClient:
if self._openai_cu is None:
self._openai_cu = OpenAIComputerUseClient(
timeout_seconds=self.timeout_seconds
)
return self._openai_cu
def _local_hf_client(self) -> LocalHFClient:
if self._local_hf is None:
self._local_hf = LocalHFClient(
timeout_seconds=self.timeout_seconds,
log_callback=self._log_callback,
)
return self._local_hf
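
A dispatch sketch; the ModelSpec constructor is inferred from the config loader above, the model ID is a placeholder, and OPENROUTER_API_KEY must be set:

from pathlib import Path

from click_eval.contracts import ModelSpec
from click_eval.providers import ProviderClient  # module name assumed

client = ProviderClient(timeout_seconds=120)
model = ModelSpec(
    name="demo",
    model_id="qwen/qwen2.5-vl-72b-instruct",  # placeholder ID
    provider="openrouter",
)
reply = client.predict_point(
    model, Path("screenshot.png"), "Click the search box", "candidate"
)
print(reply.text)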

View File

@@ -0,0 +1,101 @@
from __future__ import annotations
import math
import statistics
from collections import defaultdict
from typing import Any
from .contracts import Point
SCORE_FIELDNAMES = [
"task_id",
"model",
"gt_x",
"gt_y",
"pred_x",
"pred_y",
"dx",
"dy",
"l2",
"duration_seconds",
"skipped",
"error",
]
def score_point(
task_id: str,
model_name: str,
gt: Point | None,
pred: Point | None,
error: str | None = None,
) -> dict[str, Any]:
row: dict[str, Any] = {
"task_id": task_id,
"model": model_name,
"gt_x": gt.x if gt is not None else "",
"gt_y": gt.y if gt is not None else "",
"pred_x": "",
"pred_y": "",
"dx": "",
"dy": "",
"l2": "",
"error": error or "",
}
if pred is None:
return row
row.update({"pred_x": pred.x, "pred_y": pred.y})
if gt is None:
return row
dx = pred.x - gt.x
dy = pred.y - gt.y
l2 = math.hypot(dx, dy)
row.update(
{
"dx": dx,
"dy": dy,
"l2": l2,
}
)
return row
def summarize_scores(score_rows: list[dict[str, Any]]) -> dict[str, Any]:
grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
for row in score_rows:
grouped[str(row["model"])].append(row)
summary: dict[str, Any] = {}
for model_name, rows in grouped.items():
distances = [float(row["l2"]) for row in rows if row["l2"] != ""]
durations = [
float(row["duration_seconds"])
for row in rows
if row.get("duration_seconds") not in {"", None}
]
model_summary: dict[str, Any] = {
"total": len(rows),
"scored": len(distances),
"skipped": sum(1 for row in rows if row.get("skipped") is True),
"errors": sum(
1
for row in rows
if row.get("error") and row.get("skipped") is not True
),
"mean_l2": statistics.fmean(distances) if distances else None,
"median_l2": statistics.median(distances) if distances else None,
"mean_duration_seconds": statistics.fmean(durations)
if durations
else None,
"median_duration_seconds": statistics.median(durations)
if durations
else None,
}
summary[model_name] = model_summary
return summary
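
A worked example of the scoring path: a prediction 3px right and 4px below ground truth scores an L2 of 5 (module names assumed):

from click_eval.contracts import Point
from click_eval.scoring import score_point, summarize_scores

row = score_point("t1", "demo", Point(x=100, y=100), Point(x=103, y=104))
assert row["l2"] == 5.0  # hypot(3, 4)
assert summarize_scores([row])["demo"]["mean_l2"] == 5.0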

View File

@@ -0,0 +1,116 @@
from __future__ import annotations
from pathlib import Path
from typing import Any
from .contracts import Point
from .image_utils import require_pillow
COLORS = [
(56, 132, 255),
(255, 149, 0),
(175, 82, 222),
(255, 45, 85),
(90, 200, 250),
(255, 204, 0),
]
GT_COLOR = (20, 180, 70)
JUDGE_COLORS = [
(0, 128, 128),
(180, 90, 0),
(120, 80, 220),
]
def annotate_image(
image_path: Path,
output_path: Path,
gt_point: Point | None,
predictions: list[dict[str, Any]],
judge_points: list[dict[str, Any]] | None = None,
) -> None:
Image, ImageDraw, ImageFont = require_pillow()
with Image.open(image_path) as source:
image = source.convert("RGB")
draw = ImageDraw.Draw(image)
font = ImageFont.load_default()
legend_lines: list[tuple[str, tuple[int, int, int]]] = []
if gt_point is not None:
legend_lines.append(("GT", GT_COLOR))
_draw_marker(draw, gt_point, GT_COLOR, "GT")
for index, judge in enumerate(judge_points or []):
color = JUDGE_COLORS[index % len(JUDGE_COLORS)]
label = str(judge.get("label") or f"GT{index + 1}")
model_name = str(judge.get("model") or f"judge-{index + 1}")
point = judge.get("point")
l2 = judge.get("l2")
if isinstance(point, Point):
_draw_marker(draw, point, color, label)
distance = f"{float(l2):.1f}px" if l2 is not None else "n/a"
legend_lines.append((f"{label} {model_name}: {distance}", color))
else:
error = str(judge.get("error") or "no point")
legend_lines.append((f"{label} {model_name}: {_short_text(error)}", color))
for index, prediction in enumerate(_sort_predictions_by_l2(predictions)):
color = COLORS[index % len(COLORS)]
model_name = str(prediction["model"])
point = prediction.get("point")
l2 = prediction.get("l2")
if isinstance(point, Point):
_draw_marker(draw, point, color, str(index + 1))
distance = f"{float(l2):.1f}px" if l2 is not None else "n/a"
legend_lines.append((f"{index + 1} {model_name}: {distance}", color))
else:
legend_lines.append((f"{index + 1} {model_name}: error", color))
_draw_legend(draw, legend_lines, font)
output_path.parent.mkdir(parents=True, exist_ok=True)
image.save(output_path)
def _sort_predictions_by_l2(predictions: list[dict[str, Any]]) -> list[dict[str, Any]]:
def sort_key(item: tuple[int, dict[str, Any]]) -> tuple[int, float, int]:
index, prediction = item
l2 = prediction.get("l2")
if l2 is None:
return (1, 0.0, index)
try:
return (0, float(l2), index)
except (TypeError, ValueError):
return (1, 0.0, index)
return [prediction for _, prediction in sorted(enumerate(predictions), key=sort_key)]
def _draw_marker(draw, point: Point, color: tuple[int, int, int], label: str) -> None:
x = int(round(point.x))
y = int(round(point.y))
radius = 8
draw.ellipse((x - radius, y - radius, x + radius, y + radius), outline=color, width=3)
draw.line((x - 12, y, x + 12, y), fill=color, width=2)
draw.line((x, y - 12, x, y + 12), fill=color, width=2)
draw.text((x + 10, y + 8), label, fill=color)
def _draw_legend(draw, lines, font) -> None:
if not lines:
return
padding = 6
line_height = 14
width = max(90, max(len(text) for text, _ in lines) * 7 + padding * 2)
height = padding * 2 + line_height * len(lines)
draw.rectangle((0, 0, width, height), fill=(255, 255, 255), outline=(30, 30, 30))
for index, (text, color) in enumerate(lines):
y = padding + index * line_height
draw.text((padding, y), text, fill=color, font=font)
def _short_text(text: str, max_chars: int = 80) -> str:
compact = " ".join(text.split())
if len(compact) <= max_chars:
return compact
return compact[: max_chars - 1] + "..."
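
A usage sketch for the annotator; paths are placeholders and the l2 value is precomputed as hypot(12, -5) = 13:

from pathlib import Path

from click_eval.annotate import annotate_image  # module name assumed
from click_eval.contracts import Point

annotate_image(
    Path("screenshot.png"),                  # placeholder input
    Path("runs/demo/annotated/task-1.png"),  # placeholder output
    Point(x=640, y=360),                     # ground-truth marker
    [{"model": "demo", "point": Point(x=652, y=355), "l2": 13.0}],
)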

View File

@@ -0,0 +1,53 @@
from types import SimpleNamespace
from click_eval.local_hf import (
_patch_qwen25_mrope_section,
_qwen25_manual_prompt_text,
)
def test_qwen25_manual_prompt_text_includes_image_placeholder():
text = _qwen25_manual_prompt_text(
[
{"role": "system", "content": "system prompt"},
{
"role": "user",
"content": [
{"type": "image", "image": object()},
{"type": "text", "text": "click the button"},
],
},
]
)
assert "<|vision_start|><|image_pad|><|vision_end|>" in text
assert text.endswith("<|im_start|>assistant\n")
assert "click the button" in text
def test_patch_qwen25_mrope_section_adds_missing_default():
text_config = SimpleNamespace(rope_scaling={"type": "default"})
hf_model = SimpleNamespace(
config=SimpleNamespace(text_config=text_config),
model=None,
language_model=None,
)
_patch_qwen25_mrope_section(hf_model)
assert text_config.rope_scaling["mrope_section"] == [16, 24, 24]
def test_patch_qwen25_mrope_section_preserves_existing_value():
text_config = SimpleNamespace(
rope_scaling={"type": "default", "mrope_section": [1, 2, 3]}
)
hf_model = SimpleNamespace(
config=SimpleNamespace(text_config=text_config),
model=None,
language_model=None,
)
_patch_qwen25_mrope_section(hf_model)
assert text_config.rope_scaling["mrope_section"] == [1, 2, 3]