monday Service + LangSmith: Building a Code-First Evaluation Strategy from Day 1
Breaking News · Mar 8, 2026 · 4 min read


monday Service Builds Code-First AI Agent Evaluation with LangSmith

NEW YORK — monday.com’s enterprise service division has achieved an 8.7x speedup in AI agent evaluation by adopting a code-first testing framework built on LangSmith, reducing feedback loops from 162 seconds to 18 seconds per test cycle.

The company’s monday Service team developed an eval-driven development framework for its customer-facing service agents, treating evaluations as version-controlled code from day one. The framework combines LangSmith, LangGraph agents, and Vitest, enabling test runs across hundreds of examples in minutes alongside real-time, end-to-end quality monitoring of production traces via multi-turn evaluators.
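The "evaluations as code" idea can be sketched as a plain test harness: a versioned dataset, a target function, and an evaluator, all living in the repository. The dataset, agent stub, and exact-match judge below are illustrative assumptions, not monday Service's actual implementation (their real judges may be LLM-based and run through LangSmith):

```typescript
// Minimal sketch of evaluations managed as version-controlled code.
// Everything here is a hypothetical stand-in for illustration.

interface Example {
  input: string;
  expected: string;
}

type Evaluator = (output: string, example: Example) => number; // score in [0, 1]

// Hypothetical stand-in for a LangGraph-powered service agent.
function agentStub(input: string): string {
  return input.includes("refund") ? "escalate_to_billing" : "answer_directly";
}

// A trivial exact-match judge; production judges are often LLM-based.
const exactMatch: Evaluator = (output, example) =>
  output === example.expected ? 1 : 0;

// Runs every example through the target and aggregates scores,
// collecting failures for inspection.
function runEvalSuite(
  examples: Example[],
  target: (input: string) => string,
  evaluator: Evaluator
): { score: number; failures: Example[] } {
  const failures: Example[] = [];
  let total = 0;
  for (const ex of examples) {
    const s = evaluator(target(ex.input), ex);
    total += s;
    if (s < 1) failures.push(ex);
  }
  return { score: examples.length ? total / examples.length : 0, failures };
}

// The dataset is checked into the repo and reviewed like any other code.
const dataset: Example[] = [
  { input: "I want a refund for last month", expected: "escalate_to_billing" },
  { input: "How do I reset my password?", expected: "answer_directly" },
];

const result = runEvalSuite(dataset, agentStub, exactMatch);
console.log(result.score); // 1 on this toy dataset
```

Because the suite is ordinary code, it can run under Vitest in CI and be edited in an AI IDE like any other test file.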

“By treating evaluations as code, we could leverage AI IDEs like Cursor and Claude Code to refine complex prompts directly within our primary workspace,” the monday Service team explained in a LangChain blog post. “It felt natural to write offline evaluations for our judges to ensure accuracy before they ever touch production traffic.”

The framework represents a shift toward systematic, developer-centric evaluation strategies for production AI agents. Rather than relying on ad-hoc testing or manual review, monday Service implemented GitOps-style CI/CD deployment where evaluations are managed as version-controlled code. This allows the team to maintain rigorous quality standards while accelerating iteration cycles for their LangGraph-powered service agents.
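A GitOps-style setup of this kind might wire the eval suite into CI so that evaluations run on every pull request. The workflow below is a hedged sketch assuming GitHub Actions and a hypothetical `npm run eval` script; the article does not specify monday Service's actual pipeline:

```yaml
# Hypothetical CI job: run the version-controlled eval suite on every PR.
name: agent-evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Assumed script name; fails the build if quality benchmarks regress.
      - run: npm run eval
        env:
          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
```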

Technical Implementation and Results

The integration delivers multiple operational improvements. Teams can now run comprehensive test suites across hundreds of examples in minutes instead of hours. Production monitoring uses multi-turn evaluators that analyze complete conversation traces in real time, providing immediate visibility into agent performance on live customer interactions.
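A multi-turn evaluator scores a whole conversation rather than a single response. The trace shape and "resolution" heuristic below are assumptions for illustration; LangSmith evaluators receive richer run objects and real judges are typically more sophisticated:

```typescript
// Hedged sketch of a multi-turn evaluator over a full conversation trace.

interface Turn {
  role: "user" | "agent";
  content: string;
}

// Scores a whole conversation: did the agent ever confirm resolution,
// and did it stay within a turn budget? (Illustrative heuristic only.)
function multiTurnResolutionEval(trace: Turn[], maxTurns = 10): number {
  const agentTurns = trace.filter((t) => t.role === "agent");
  const resolved = agentTurns.some((t) =>
    /resolved|anything else/i.test(t.content)
  );
  if (!resolved) return 0;
  return trace.length <= maxTurns ? 1 : 0.5; // penalize drawn-out conversations
}

const trace: Turn[] = [
  { role: "user", content: "My board permissions look wrong." },
  { role: "agent", content: "Let me check your workspace settings." },
  { role: "user", content: "Thanks, that fixed it." },
  { role: "agent", content: "Great, marking this as resolved. Anything else?" },
];

console.log(multiTurnResolutionEval(trace)); // 1
```

The same function can run both offline against a dataset and online against sampled production traces, which is what makes end-to-end monitoring on live traffic possible.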

The 8.7x reduction in evaluation time — from 162 seconds to 18 seconds per test cycle — dramatically shortens the feedback loop between making changes and validating their impact. This acceleration is particularly valuable for customer-facing agents where reliability and response quality directly affect user experience.

The approach also addresses a core challenge in agent development: the difficulty of validating improvements without systematic evaluation. By embedding evaluation logic directly into the codebase, monday Service ensures that every prompt change, model update, or workflow modification can be tested against established quality benchmarks before deployment.

Industry Context

The monday Service implementation comes as more enterprises seek reliable methods for building and maintaining production AI agents. LangSmith, developed by LangChain, provides observability and evaluation tools specifically designed for LLM applications. The platform enables developers to trace agent reasoning, create custom evaluators, and monitor performance at scale.

monday.com’s enterprise service division, which handles complex customer workflows and support processes, represents a typical use case where AI agents must handle nuanced, multi-turn conversations while maintaining high accuracy and consistency.

The success of this code-first strategy highlights growing industry recognition that reliable agents require systematic evaluation infrastructure. Organizations building customer-facing AI systems increasingly view evaluation not as an afterthought but as a foundational component of the development process.

Impact on Developers and Teams

For developers, the monday Service approach offers a blueprint for integrating evaluation into existing software engineering workflows. By managing evaluations as code, teams can apply familiar practices like version control, code review, and automated testing to AI development.

The integration with tools like Cursor and Claude Code further demonstrates how AI-assisted development can enhance the process of building AI systems themselves. Developers can iterate on complex evaluation logic using natural language tools while maintaining the rigor of traditional software engineering practices.

The framework also enables better collaboration between development, quality assurance, and product teams. When evaluations are version-controlled and part of the CI/CD pipeline, quality metrics become transparent and actionable across the organization.

What's Next

The LangChain blog post detailing monday Service’s implementation serves as a practical case study for other organizations building production AI agents. While specific timelines for broader adoption of similar frameworks are not detailed, the approach aligns with industry movement toward more systematic LLMOps practices.

As AI agent deployments continue to scale, the ability to rapidly evaluate and validate changes will become increasingly critical. monday Service’s results suggest that code-first evaluation strategies can significantly reduce the operational overhead of maintaining reliable AI systems while improving overall development velocity.

The full case study is available on the LangChain blog, providing technical details and implementation guidance for teams looking to adopt similar evaluation-first approaches to agent development.

Original Source

blog.langchain.com
