Commit Graph

6 Commits

Author SHA1 Message Date
shivammittal274
d383b5e344 feat(eval): add claude-generated run report artifact (#892)
* feat(eval): add claude-generated run report artifact

* fix(eval): install claude code cli for CI evals

* fix(eval): bypass claude code tool permissions

* Eval metrics configs (#932)

* feat(eval): add agisdk comparison metrics configs

* fix(eval): keep cdp crashes from aborting run
2026-05-04 21:09:06 +05:30
shivammittal274
c81906ecbf feat(eval): add claude code eval agent (#885)
2026-05-01 02:25:08 +05:30
Nikhil
26afb826c6 feat(eval): add viewer manifest contract (#878)
* refactor(eval): canonicalize viewer manifest contract

* refactor(eval): publish canonical viewer manifests

* feat(eval): make r2 viewer use manifest artifact paths

* fix(eval): keep weekly report compatible with viewer manifests

* docs(eval): document r2 viewer manifest contract

* chore: self-review fixes

* fix: address review feedback for PR #878
2026-04-29 20:50:35 -07:00
Nikhil
b2340c8afa refactor(eval): split orchestrated executor backends (#876)
* refactor(eval): split orchestrated executor backends

* fix(eval): address executor backend review comments
2026-04-29 18:02:32 -07:00
Nikhil
84a79ba0a1 feat: refactor eval pipeline workflow (#875)
* feat(eval): add suite variant config bridge

* feat(eval): add stable run artifacts

* refactor(eval): add shared grader contract

* feat(eval): persist grader artifacts

* refactor(eval): rename runner layers

* refactor(eval): add executor backend boundary

* refactor(eval): split clado backend

* feat(eval): add workflow compatible cli

* feat(eval): add r2 publisher module

* ci(eval): migrate weekly workflow to eval cli

* docs(eval): document suite pipeline

* chore(eval): verify pipeline refactor

* fix: address review feedback for PR #875

* docs(eval): add env example

* docs(eval): explain suites and variants

* chore(eval): organize config layouts

* chore(eval): colocate grader python evaluators
2026-04-29 17:21:02 -07:00
shivammittal274
0babc05077 feat(eval): NopeCHA CAPTCHA solver integration (#537)
* feat(eval): show mean score instead of pass/fail in report and viewer

* feat(eval): integrate NopeCHA CAPTCHA solver into eval pipeline

Add CAPTCHA detection and waiting so screenshots capture post-solve state.
Run headed with xvfb on CI since headless breaks extension content scripts.

- Add CaptchaWaiter module (detect reCAPTCHA/hCaptcha/Turnstile, poll until solved)
- Add optional `captcha` config block to EvalConfigSchema
- Wait for CAPTCHA solve before screenshot in single-agent and orchestrator-executor
- Patch NopeCHA manifest with API key before launching workers
- Fix CAPTCHA_EXT_DIR path (was pointing one level too high)
- Remove --incognito (extensions don't run in incognito; fresh user-data-dir isolates)
- CI: install xvfb, run headed via xvfb-run, pass NOPECHA_API_KEY secret
2026-03-24 00:14:16 +05:30
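
The "poll until solved" behavior described for CaptchaWaiter in 0babc05077 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the module's actual code: `CaptchaProbe`, `detectUnsolved`, `WaitOptions`, and `waitForCaptchaSolved` are hypothetical names, and the real detection logic for reCAPTCHA/hCaptcha/Turnstile is abstracted behind the probe.

```typescript
// Hypothetical sketch; none of these names come from the repository.
type CaptchaKind = "recaptcha" | "hcaptcha" | "turnstile";

interface CaptchaProbe {
  // Assumed to return the kinds of CAPTCHA still unsolved on the page.
  detectUnsolved(): Promise<CaptchaKind[]>;
}

interface WaitOptions {
  timeoutMs: number; // give up once this much time has elapsed
  pollIntervalMs: number; // delay between consecutive probes
}

interface WaitResult {
  solved: boolean; // true if no unsolved CAPTCHA remained before the deadline
  polls: number; // how many times the probe was called
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Poll the probe until it reports no unsolved CAPTCHA or the timeout expires.
// The probe is always called at least once, even with a zero timeout.
async function waitForCaptchaSolved(
  probe: CaptchaProbe,
  options: WaitOptions,
): Promise<WaitResult> {
  const deadline = Date.now() + options.timeoutMs;
  let polls = 0;
  while (true) {
    polls += 1;
    const pending = await probe.detectUnsolved();
    if (pending.length === 0) {
      return { solved: true, polls };
    }
    if (Date.now() >= deadline) {
      return { solved: false, polls };
    }
    await sleep(options.pollIntervalMs);
  }
}
```

A caller would await `waitForCaptchaSolved` before taking the screenshot, and would decide separately whether a `solved: false` result should fail the task or capture the unsolved state anyway.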