SAN FRANCISCO: Judgment Labs, a three-month-old infrastructure startup building tools to help AI-powered agents improve from real-world production data, has raised $32 million in combined seed and Series A funding, with both rounds led by Lightspeed Venture Partners.

The funding will expand the company’s research team and platform, which is already in use at several agent-native companies. Judgment’s technology helps developers turn production data — the actual, sometimes messy interactions agents have with users and tools — into continuous improvements for those same agents.

Lightspeed led the seed round earlier this year and doubled down less than six months later to lead the Series A. Nova Global, SV Angel, Valor Equity Partners, and Dynamic also participated.

“Judgment is solving the hardest problem in the agent stack — how do you measure and improve something that thinks, plans, uses tools, and remembers?” said James Alcorn, a partner at Lightspeed. “We led the seed because the bet was obvious, and we led the Series A because the results have been extraordinary.”

The startup was founded by three childhood friends: CEO Alex Shan, 22; Chief Scientist Andrew Li, 23; and CTO Joseph Camyre, 23. Shan was an AI researcher at Stanford’s NLP group, Li was an early research hire at TogetherAI, and Camyre built large-scale infrastructure at Datadog.

The company is betting on a fundamental shift in how artificial intelligence is deployed. For the past several years, most LLM-powered products were chatbots: a user sends a question, the model returns an answer. A new generation of “deep agents” — including Anthropic’s Claude Code, OpenAI’s Codex, and Cognition’s Devin — perform end-to-end tasks autonomously, reasoning through open-ended problems, writing and running code, and sometimes running for hours on a single objective.

That shift breaks the evaluation methods inherited from the chatbot era, which assumed a single input and a single output. Deep agents produce trajectories: long chains of decisions, search queries, partial results, and self-corrections. When an agent fails, the error is often buried somewhere in that chain rather than in the final answer.
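To make the distinction concrete, here is a minimal Python sketch. It is an illustration only, not Judgment's actual tooling. The `Step` type, its `ok` flag, and the sample trajectory are all hypothetical. It shows how a check on the final answer alone can pass while a step-by-step check finds the error buried mid-chain.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    kind: str     # hypothetical step label, e.g. "search", "tool_call", "answer"
    content: str  # what the agent did at this step
    ok: bool      # result of a hypothetical per-step check


def final_answer_eval(trajectory: List[Step]) -> bool:
    """Input-output style check: inspects only the last step."""
    return trajectory[-1].ok


def first_failing_step(trajectory: List[Step]) -> Optional[int]:
    """Trajectory-level check: index of the first failing step, or None."""
    for index, step in enumerate(trajectory):
        if not step.ok:
            return index
    return None


# Made-up trajectory: the final answer looks well-formed, but it was
# built on a bad tool call earlier in the chain.
trajectory = [
    Step("search", "look up the refund policy", True),
    Step("tool_call", "fetch order details (wrong order)", False),
    Step("answer", "well-formed refund summary", True),
]

print(final_answer_eval(trajectory))   # final-answer check passes
print(first_failing_step(trajectory))  # step-level check points at index 1
```

In this toy case the input-output check reports success, while the trajectory check locates the faulty tool call at step 1.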

Judgment’s platform traces those trajectories, surfaces recurring failure patterns, and turns each real interaction into a fix teams can ship back into their products.
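The company has not published how its platform does this. As a rough illustration of what "surfacing recurring failure patterns" could mean, the sketch below counts failure labels across several traced runs and keeps those that appear in more than one. The `recurring_failures` function, the labels, and the sample runs are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Tuple

# Each run is a list of (label, ok) pairs; the labels are hypothetical.
Run = List[Tuple[str, bool]]


def recurring_failures(runs: List[Run], min_count: int = 2) -> List[Tuple[str, int]]:
    """Return failure labels seen in at least `min_count` distinct runs."""
    counts: Counter = Counter()
    for steps in runs:
        # Count each failure label at most once per run.
        failed_labels = {label for label, ok in steps if not ok}
        counts.update(failed_labels)
    return [(label, n) for label, n in counts.most_common() if n >= min_count]


# Made-up traces from three agent runs.
runs = [
    [("search", True), ("wrong_tool_args", False), ("answer", True)],
    [("wrong_tool_args", False), ("answer", True)],
    [("search", True), ("stale_memory", False)],
]

print(recurring_failures(runs))  # [('wrong_tool_args', 2)]
```

A failure that shows up once may be noise. One that repeats across runs is the kind of pattern a team would want to fix first.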

“The teams building deep agents didn’t have tools that understood what their agents were actually doing,” Shan said. “Input-output evals miss so much of where agents go wrong.”

The company plans to use the new capital primarily to hire AI researchers and engineers in San Francisco, and to expand its forward-deployed engineering team for its growing customer base.

Editor’s Commentary: Let’s be honest. “Evals” is one of the least sexy words in AI, and that’s exactly why Judgment Labs matters. The industry has spent billions teaching models to generate, but almost nothing teaching them to get better at generating the right thing after they’re already in the wild. Every AI demo looks magical until you see an agent spend 40 minutes and $12 in API credits confidently solving the wrong problem.

What’s striking here isn’t just the speed of Lightspeed’s double-down — seed to Series A in under six months is aggressive even by 2021 standards — but the founders’ ages. Three people who can’t yet rent a car without a young driver fee are building infrastructure for companies that can’t afford to have their agents fail silently. That’s either visionary or terrifying, and in AI right now, those are often the same thing.

The real test will be whether Judgment can stay ahead of the model providers themselves. OpenAI, Anthropic, and Google are all building their own agentic evaluation tooling. A startup’s best defense is speed and specialization. At 22, Shan and his team have time on their side. Whether they have enough of it is the question.

Judgment Labs raises $32M to help AI agents learn from their own mistakes
