At Metaview, we build AI agents for recruiting. Our stack runs on AWS Lambda for most backend tasks, with LangSmith for tracing agentic workflows, Postgres on AWS, Datadog for logs and RUM, and Slack for error notifications — both automated alerts and issues flagged by our customer success team.

Like most teams running AI agents in production, we quickly discovered that observability is hard. When something breaks in a traditional web service, you look at the logs, find the error, and fix it. When something breaks in an AI agent, you're dealing with a different beast: execution graphs, tool calls, state that evolves across multiple steps, and errors that can surface anywhere in the pipeline.

Triage came from two sources: automated Datadog alerts hitting Slack, and issues flagged by CS from user reports. Either way, the debugging process looked the same: open Datadog, find the error logs, extract the relevant IDs, check the agent traces, check database records, watch RUM recordings. This took 30-45 minutes per incident. We were spending hours on triage alone.

The insight: triage is just data gathering

When we analyzed what engineers actually did during triage, we realized something obvious in retrospect: almost all of the time was spent gathering data, not thinking. Collecting the logs, traces, and database context was mechanical. But it required access to multiple systems and institutional knowledge about how they connected: which Datadog queries to run, which database tables to check, how to get from a log line to the relevant LangSmith trace.

Once engineers had all the context in front of them, figuring out the root cause was usually fast. The bottleneck was never the thinking. It was the data gathering.

Enter Devin

We connected Devin to our observability infrastructure using MCPs. Not copy-pasting logs into a chat window — proper integrations that let Devin query our systems the same way engineers do.

Devin has an MCP marketplace with integrations for common tools. We added Datadog and Postgres from there, plus LangSmith as a custom integration. No code required on our end. Our setup gives Devin:

  • Datadog for querying logs and RUM data
  • Postgres with read-only access to a replica for database context
  • LangSmith for inspecting agent execution history, tool calls, and state transitions
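Under the hood, an MCP integration is just a small server that exposes tools the agent can call over a standard protocol. We didn't have to write any of this ourselves, but for a sense of what a custom integration involves, here is a minimal TypeScript sketch of a server exposing a single LangSmith run-lookup tool. The server name, tool name, endpoint path, and env var are illustrative assumptions, not our production setup.

```typescript
// Minimal custom MCP server sketch (illustrative only, not our production integration).
// It exposes one tool that fetches a LangSmith run by ID so an agent can inspect it.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "langsmith-triage", version: "0.1.0" });

server.tool(
  "get_langsmith_run",
  // The agent passes the run ID it pulled out of a Datadog log line.
  { runId: z.string() },
  async ({ runId }) => {
    // Endpoint and auth header are assumptions based on LangSmith's public REST API.
    const res = await fetch(`https://api.smith.langchain.com/runs/${runId}`, {
      headers: { "x-api-key": process.env.LANGSMITH_API_KEY ?? "" },
    });
    return { content: [{ type: "text", text: await res.text() }] };
  },
);

// Devin (or any other MCP client) talks to this server over stdio.
await server.connect(new StdioServerTransport());
```

The specific tool matters less than the fact that the agent can chain calls like this: pull an ID out of a Datadog log, feed it into LangSmith, then join against the replica.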

Then we wrote playbooks: markdown files in Devin's settings that encode our triage workflow. When triggered, Devin follows the same steps an engineer would: summarize the Slack thread to understand what's failing and when it started, query Datadog for error groups and representative logs, confirm the issue is recurring rather than a one-off, trace the failure to the exact code path, and pull database context where relevant.

The playbooks include issue-specific guidance (e.g. how to handle DLQ failures versus timeouts versus database errors) and strict rules like "never speculate" and "never swallow errors as a fix." They also enforce evidence standards: Devin has to back up its conclusions with logs, traces, and timestamps rather than just inspecting the code and guessing.
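To make that concrete, here is roughly what a trimmed-down playbook entry looks like. This is an illustrative sketch rather than a verbatim excerpt; the real files carry more issue-specific detail.

```markdown
# Triage playbook: production error alerts

## Steps
1. Summarize the Slack thread: what is failing, for whom, and since when.
2. Query Datadog for matching error groups and pull representative logs.
3. Confirm the issue is recurring rather than a one-off.
4. Follow the IDs in the logs into LangSmith traces and the read-only replica.
5. Trace the failure to the exact code path and note the relevant database state.

## Rules
- Never speculate. Back every conclusion with logs, traces, and timestamps.
- Never swallow errors as a fix.
```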

We also maintain AGENTS.md files in each repository. These describe our coding conventions and patterns, so when Devin proposes a fix, it matches how we'd write it ourselves.

The output

The primary deliverable is a root cause analysis posted to Slack: what failed, the evidence trail, the failure mechanism, and why safeguards didn't catch it. But Devin also takes a first pass at fixing the issue. Sometimes the fix is straightforward and we merge it directly. Sometimes it gives engineers a head start on a more complex change. Either way, engineers review a proposed solution rather than starting from scratch.

A lot of bugs get fixed with almost no human involvement beyond clicking merge. Others need meaningful product decisions or deeper investigation. But even in those cases, the data gathering is done.

Results

After a few weeks of iteration, our triage workflow went from 30-45 minutes per incident to about 5 minutes of human review. Engineers verify Devin's analysis and decide on next steps rather than doing the data gathering themselves.

We're seeing roughly an 80% reduction in engineering time spent on triage. The tickets are also better — they include the full context that used to live only in an engineer's head during debugging.

And because it's automated, triage doesn't stop when engineers go offline. Bug reports that come in overnight get investigated immediately. Engineers wake up to a root cause analysis and proposed fix for everything that broke while they were asleep.

What we learned

The bottleneck was data gathering, not thinking. We assumed triage required experienced engineers because it involved judgment. It turns out most of it was mechanical: querying the right systems in the right order. That's what we automated.

Read-only database access is safe and valuable. We were initially nervous about giving an AI access to production data. But read-only access to a replica is exactly what engineers use during triage. Devin uses it the same way: to understand context, not to make changes.

MCPs make real integration possible. The difference between "paste your logs here" and "query Datadog directly" is enormous. MCPs let Devin follow iterative workflows the same way engineers do: query one system, find an ID, use it to query the next.

The setup is surprisingly simple. We didn't write custom integration code. We used off-the-shelf MCPs and wrote markdown playbooks describing our workflow. The hardest part was articulating the tribal knowledge that experienced engineers use unconsciously. But that knowledge was already in our heads; we just had to write it down.

What comes next

This is version one. The MCP ecosystem is growing fast, with more tools, better integrations, richer context. The playbooks will get smarter as we encode more edge cases. The models will get better at reasoning through ambiguous failures.

We automated the mechanical parts of triage. The creative work still requires humans: understanding novel failures, making product decisions, designing systemic fixes. But that's the work engineers actually want to do. The rest was just overhead.


We're hiring engineers who want to build AI products that work in production. If this kind of problem sounds interesting, check out m.careers.