[STG-2173] feat(evals): add EDGAR 10-Q multi-company extraction agent eval #2191

Draft
shrey150 wants to merge 1 commit into main from shrey/edgar-10q-eval

shrey150 (Contributor) commented Jun 4, 2026

What

Adds a long-horizon agent eval to packages/evals: agent/edgar_10q.

For three companies (Snowflake / Datadog / MongoDB) the agent must:

  1. Find each company's most recent 10-Q on SEC EDGAR,
  2. Open the actual primary document (not the filing index/cover/exhibit),
  3. Extract quarterly revenue, YoY growth, RPO, and the top risk factor,
  4. Return a 3-company comparison table.

It exercises long-horizon navigation (unknown number of pages, nested filing index → primary doc) plus multi-document synthesis.
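
To make the expected deliverable concrete, here is a minimal sketch of one possible shape for the per-company result and the 3-company comparison table. The `FilingSummary` interface, its field names, and `renderComparisonTable` are hypothetical illustrations and are not taken from the eval's source; the task itself only requires that the final answer contain the table.

```typescript
// Hypothetical row shape for the comparison table (names are illustrative only).
interface FilingSummary {
  ticker: string;
  periodEnd: string;             // fiscal period end of the 10-Q, ISO date
  revenueThousands: number;      // quarterly revenue, thousands of USD
  yoyGrowthPct: number | null;   // null when the value was not found
  rpoThousands: number | null;   // remaining performance obligations, thousands of USD
  topRiskFactor: string;
}

// Format an integer amount in thousands of USD, e.g. 1390951 -> "$1,390,951K".
function formatThousands(value: number | null): string {
  if (value === null) return "n/a";
  const grouped = String(Math.round(value)).replace(/\B(?=(\d{3})+(?!\d))/g, ",");
  return "$" + grouped + "K";
}

// Render the rows as a pipe-delimited table, one row per company.
function renderComparisonTable(rows: FilingSummary[]): string {
  const header = "| Company | Period end | Revenue | YoY growth | RPO | Top risk factor |";
  const divider = "| --- | --- | --- | --- | --- | --- |";
  const body = rows.map((r) => {
    const cells = [
      r.ticker,
      r.periodEnd,
      formatThousands(r.revenueThousands),
      r.yoyGrowthPct === null ? "n/a" : r.yoyGrowthPct.toFixed(1) + "%",
      formatThousands(r.rpoThousands),
      r.topRiskFactor.replace(/\|/g, "\\|"), // escape pipes so the row stays well-formed
    ];
    return "| " + cells.join(" | ") + " |";
  });
  return [header, divider, ...body].join("\n");
}
```

Keeping missing values as explicit `n/a` cells, rather than dropping the row, keeps a partially successful run readable while still letting a revenue-based scorer fail it.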

Why

We needed a realistic long-horizon eval to measure task closure — does the agent actually finish and return the synthesized result, not just do the work. Scoring is objective: the agent's final answer must contain the correct quarterly revenue for all three companies (ground truth verified against SEC XBRL, data.sec.gov/api/xbrl/companyconcept). Scoring the final answer (not intermediate tool output) is deliberate — an agent that extracts the data but reports only "task complete" has not finished the job, and this catches that.
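
The final-answer scoring described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the scorer shipped in this PR: the function names, the unit handling, and the 0.5% tolerance are all hypothetical, and only the three revenue figures come from this PR's test notes.

```typescript
// Ground-truth quarterly revenue in thousands of USD (2026-06 snapshot from this PR).
const GROUND_TRUTH: { ticker: string; revenueThousands: number }[] = [
  { ticker: "SNOW", revenueThousands: 1390951 },
  { ticker: "DDOG", revenueThousands: 1006426 },
  { ticker: "MDB", revenueThousands: 687616 },
];

// Multipliers that convert a unit suffix into thousands of USD (assumed unit vocabulary).
const UNIT_TO_THOUSANDS: Record<string, number> = {
  k: 1, thousand: 1, m: 1000, million: 1000, b: 1000000, bn: 1000000, billion: 1000000,
};

// Collect every number in the text as one or more candidate amounts in thousands of USD.
function candidateAmountsInThousands(text: string): number[] {
  const pattern = /(\d[\d,]*(?:\.\d+)?)(?:\s*(billion|million|thousand|bn|[bmk])\b)?/gi;
  const amounts: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const value = Number(String(match[1]).replace(/,/g, ""));
    if (!Number.isFinite(value)) continue;
    const unit = match[2];
    const factor = unit ? UNIT_TO_THOUSANDS[unit.toLowerCase()] : undefined;
    if (factor !== undefined) {
      amounts.push(value * factor);
    } else {
      amounts.push(value);        // unitless: may already be in thousands (as filed)
      amounts.push(value / 1000); // unitless: or may be raw dollars
    }
  }
  return amounts;
}

// Pass only if every company's revenue appears in the final answer within the tolerance.
function scoreFinalAnswer(finalAnswer: string, tolerance = 0.005): { passed: boolean; missing: string[] } {
  const amounts = candidateAmountsInThousands(finalAnswer);
  const missing = GROUND_TRUTH
    .filter((g) => !amounts.some((a) => Math.abs(a - g.revenueThousands) / g.revenueThousands <= tolerance))
    .map((g) => g.ticker);
  return { passed: missing.length === 0, missing };
}
```

Because only the final answer string is inspected, a run that ends with "task complete" fails with all three tickers missing, no matter what the agent extracted along the way. Note that this sketch does not check that each figure is attributed to the right company; a stricter scorer would need to parse the table rows.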

Notes for reviewers

  • Follows the defineBenchTask pattern (auto-discovered from tasks/bench/agent/, no registration).
  • No changeset: @browserbasehq/stagehand-evals is private: true (not published).
  • Ground-truth drift: the task targets the "MOST RECENT" 10-Q, so the hardcoded figures are a dated snapshot (verified 2026-06). They must be refreshed when newer filings post. Alternative: pin the instruction to specific filing periods so ground truth never drifts — happy to switch if preferred (flagged in a code comment).
  • Draft for review.
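
One way to detect the ground-truth drift mentioned above is to compare the snapshot's period end against the newest quarterly fact reported for the same concept. The sketch below works on an already-fetched list of facts and makes no network calls. The `XbrlFact` shape is an assumption about the entries under `units.USD` in a data.sec.gov companyconcept response, and the function names and the 80 to 100 day window are hypothetical.

```typescript
// Assumed shape of one fact under `units.USD` in a companyconcept payload.
interface XbrlFact {
  start?: string; // period start, ISO date (present for duration facts such as revenue)
  end: string;    // period end, ISO date
  val: number;    // reported value in USD
  form: string;   // filing form, e.g. "10-Q" or "10-K"
  filed: string;  // filing date, ISO date
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// A 10-Q reports both three-month and year-to-date figures, so keep only
// facts whose duration is roughly one quarter.
function isQuarterlyFact(fact: XbrlFact): boolean {
  if (fact.form !== "10-Q" || fact.start === undefined) return false;
  const days = (Date.parse(fact.end) - Date.parse(fact.start)) / MS_PER_DAY;
  return days >= 80 && days <= 100;
}

// Newest quarterly fact by period end, breaking ties by filing date.
function latestQuarterlyFact(facts: XbrlFact[]): XbrlFact | undefined {
  return facts
    .filter(isQuarterlyFact)
    .sort((a, b) => b.end.localeCompare(a.end) || b.filed.localeCompare(a.filed))[0];
}

// True when a newer quarter exists than the one the hardcoded snapshot was taken from.
function isSnapshotStale(snapshotPeriodEnd: string, facts: XbrlFact[]): boolean {
  const latest = latestQuarterlyFact(facts);
  return latest !== undefined && latest.end > snapshotPeriodEnd;
}
```

A check like this could run before the eval and warn that the hardcoded figures need refreshing. Pinning the instruction to specific filing periods, the alternative raised in the note, would remove the need for it.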

E2E Test Matrix

| Command / flow | Observed output | Confidence / sufficiency |
| --- | --- | --- |
| `evals run agent/edgar_10q -e browserbase -t 5 --agent-mode hybrid` (Opus 4.8) | 5/5 passed (100%); correct revenue for SNOW/DDOG/MDB in every trial | Proves the task runs end-to-end on a live Browserbase cloud session and the scorer validates real output |
| Same task, Opus 4.7 vs 4.8 (cumulative across batches) | 4.8 task-closure 9/9; 4.7 ~2/8 | Proves the eval discriminates: it separates "extracted the data" from "returned the answer," which is its purpose |
| `pnpm build:esm` then `evals list` | exit 0; agent/edgar_10q auto-discovered | Compiles cleanly and registers with no config edit |

Ground-truth revenue independently confirmed against SEC XBRL (SNOW $1,390,951K, DDOG $1,006,426K, MDB $687,616K).

Linear: STG-2173

🤖 Generated with Claude Code


Summary by cubic

Adds agent/edgar_10q, a long-horizon eval that tests multi-company 10‑Q extraction from SEC EDGAR and scores agents on returning the correct quarterly revenue. Addresses Linear STG-2173 by measuring task closure with objective, final-answer scoring.

  • New Features
    • Added packages/evals/tasks/bench/agent/edgar_10q.ts, auto-discovered via defineBenchTask.
    • Requires finding each company’s most recent 10‑Q, opening the primary document, extracting revenue, YoY growth, RPO, and top risk; outputs a 3‑company table for SNOW, DDOG, and MDB.
    • Scorer checks the final answer for correct revenue across all three (ground truth verified against SEC XBRL; figures are a 2026‑06 snapshot).

Written for commit 59501a7. Summary will update on new commits.

Long-horizon agent eval: for SNOW/DDOG/MDB, find the most recent 10-Q on
EDGAR, open the primary document, extract revenue/YoY/RPO/top risk, and
return a comparison table. Objective scoring requires the correct quarterly
revenue for all three companies in the final answer (ground truth verified
against SEC XBRL). Useful for measuring long-horizon task closure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

changeset-bot Bot commented Jun 4, 2026

⚠️ No Changeset found

Latest commit: 59501a7

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types
