
Why DevOps & MLOps Isn’t Good Enough for Scaling Robotics

The $50B observability stack was built for web apps. Robots need something fundamentally different.

The problem isn’t the robots anymore

Waymo is doing 450,000 paid rides per week.

Amazon has 750,000+ robots in its warehouses.

Tesla’s FSD is here and it feels like magic.

And VCs are trying to cash in, with $8 billion invested in 2025, up 10% from the previous year.

At last, robots can perceive, plan, and act in the real world! 🤖

But there’s a catch.

Waymo has been at this for 15+ years. Alphabet has poured over $30 billion into it. Tesla spent $6 billion on R&D last year alone. Amazon paid $775 million for Kiva back in 2012, then spent billions more scaling it.

These companies made the operational problem manageable with massive teams, custom internal tooling, and years of iteration.

What about everyone else?

The Series B warehouse robotics company. The agricultural startup with 50 robots in the field. The defense contractor scaling from 3 drones to 300.

They’re hitting the same debugging wall the giants have been solving. They just can’t throw billions at it.

What debugging actually looks like

A failure happens in the field.

A field operator reports it via Slack or a written report.

An engineer pulls the data.

But there’s a problem: a single robot can generate up to gigabytes of data per minute.

A fleet of 50 generates more data in a day than most teams can review in a month.

It’s not just logs: video, lidar, IMU readings, joint encoders, state machine transitions, point clouds.
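Some back-of-the-envelope arithmetic makes the scale concrete. The per-sensor rates below are illustrative assumptions, not measurements from any particular platform:

```python
# Rough fleet data-volume estimate. All rates are illustrative assumptions.
SENSOR_RATES_MB_PER_MIN = {
    "cameras (4x 1080p, compressed)": 960,
    "lidar point clouds": 480,
    "imu + joint encoders": 12,
    "logs + state transitions": 6,
}

per_robot_mb_per_min = sum(SENSOR_RATES_MB_PER_MIN.values())

fleet_size = 50
hours_per_day = 8

# MB/min -> GB/day across the whole fleet
daily_fleet_gb = per_robot_mb_per_min * 60 * hours_per_day * fleet_size / 1024

print(f"Per robot: {per_robot_mb_per_min / 1024:.1f} GB/min")
print(f"Fleet of {fleet_size}, {hours_per_day}h/day: {daily_fleet_gb:,.0f} GB/day")
```

Even with these conservative placeholder rates, a 50-robot fleet lands in the tens of terabytes per day, which is why "an engineer scrubs through the data" stops being a viable plan.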

So the engineer scrubs through hours of video like it’s a movie, cross-referencing sensor telemetry, logs, state transitions. Trying to manually reconstruct what went wrong. Relying on tribal knowledge.

Debugging a 30-minute incident can take hours.

And that’s if you know where to look.

This pattern shows up everywhere:

  1. Debugging time doesn’t scale. More robots means more hours, not smarter analysis
  2. There’s no memory of previous debugging sessions
  3. When you’ve got ten robots in the field and one operator, you’ve hit your ceiling

There’s nowhere to go from there without something fundamentally different.

Fleet scale changes everything

A defense robotics engineer told us: “If you look at one bag, you don’t notice it. If you look at all three, you can suddenly see what’s going on.”

They were running multi-drone operations and kept finding anomalies that were invisible in single-unit analysis: network congestion across vehicles simultaneously, GPS signal loss affecting the whole fleet, timestamp drift, communication failures that only showed up in coordination.

None of this appeared when you looked at one drone’s logs.
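The fleet-wide view is mostly a correlation problem: align events from every vehicle on a shared clock and look for clusters. A minimal sketch of that idea, with made-up event times and a hypothetical 5-second window:

```python
# Hypothetical sketch: flag moments when GPS loss hits several vehicles at
# once -- invisible in any single vehicle's log, obvious fleet-wide.
# Event data and the window size are illustrative, not from a real system.
from collections import Counter

# (vehicle_id, seconds-into-mission of a GPS-loss event)
gps_loss_events = [
    ("drone_a", 12), ("drone_a", 341),
    ("drone_b", 340), ("drone_b", 512),
    ("drone_c", 342),
]

WINDOW_S = 5  # events in the same 5-second bucket count as simultaneous

# Count events per time bucket across the whole fleet
buckets = Counter(t // WINDOW_S for _, t in gps_loss_events)

# A bucket with 3+ affected vehicles is a fleet-wide anomaly
fleet_wide = [b * WINDOW_S for b, n in buckets.items() if n >= 3]

print(f"Fleet-wide GPS loss around t={fleet_wide} s")  # → [340]
```

Each drone here loses GPS only once or twice, which looks like noise per vehicle; only the cross-fleet bucket count reveals the shared event at t≈340 s.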

This is the same pattern we’re hearing from warehouse robotics, agricultural robots, marine autonomy. Anyone with multiple robots in the field.

The debugging surface doesn’t scale linearly. It scales combinatorially.

Configuration drift across identical hardware creates “works on my machine” scenarios, except now it’s “works on robot 247 but not 248.”

The whole paradigm of “pull one robot’s logs and figure it out” breaks down past a handful of simultaneous deployments.
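Catching that kind of drift is simple in principle: diff each robot’s config against the fleet-wide majority. A minimal sketch, with hypothetical per-robot configs:

```python
# Illustrative sketch: detect configuration drift across nominally identical
# robots by comparing each value against the fleet-wide majority.
# Robot names and config keys are made up for the example.
from collections import Counter

configs = {
    "robot_246": {"fw": "2.4.1", "lidar_hz": 10},
    "robot_247": {"fw": "2.4.1", "lidar_hz": 10},
    "robot_248": {"fw": "2.3.9", "lidar_hz": 10},  # drifted firmware
}

def find_drift(configs):
    """Return (robot, key, actual, majority) for every value that disagrees
    with the most common value across the fleet."""
    drift = []
    keys = {k for c in configs.values() for k in c}
    for key in keys:
        majority, _ = Counter(c[key] for c in configs.values()).most_common(1)[0]
        for robot, c in configs.items():
            if c[key] != majority:
                drift.append((robot, key, c[key], majority))
    return drift

print(find_drift(configs))  # → [('robot_248', 'fw', '2.3.9', '2.4.1')]
```

A majority vote is a crude baseline (it assumes drift is the exception, not the rule), but it is enough to turn “works on 247 but not 248” from a mystery into a one-line diff.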

Why the Datadog model breaks

The web observability stack was built for request-response. Discrete transactions. Known failure modes. You instrument your system, set thresholds, get alerted when something crosses a line. Over $50 billion in combined market cap was built to solve this problem.

But this whole stack assumes you know what to look for ahead of time.

Engineers can’t watch every log line. So we built abstractions: dashboards, rollups, alerts tuned to fire only when something’s definitively wrong. We traded fidelity for digestibility because we had to.

Robots don’t work that way.

You don’t know what “good” looks like until you’ve seen enough “bad.” And even then, the failure modes keep surprising you.

Multimodal AI changes this. You can ingest everything (video, sensor streams, state transitions, logs) and surface what’s relevant. Ask “what happened?” and get a story back, not a dashboard.

The right answer is probably somewhere in the middle: deterministic checks for failures you know are catastrophic, semantic analysis for the long tail of weird stuff you haven’t seen before.
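One way to picture that middle ground is a two-stage triage: exact rules catch the known-catastrophic failures, and everything else falls through to the semantic layer. The rule set and the `semantic_analyze` hook below are illustrative placeholders, not a real pipeline:

```python
# Sketch of the hybrid idea: deterministic checks for known-catastrophic
# failures, with the long tail routed to a semantic/model-based layer.
# Event kinds, thresholds, and the semantic hook are assumptions.
from dataclasses import dataclass

@dataclass
class Event:
    kind: str
    value: float

def deterministic_checks(e: Event):
    """Known-catastrophic conditions: cheap, exact, no surprises."""
    if e.kind == "estop":
        return "CRITICAL: emergency stop triggered"
    if e.kind == "battery" and e.value < 5.0:
        return "CRITICAL: battery below 5%"
    return None  # not a known failure -- fall through to semantic analysis

def semantic_analyze(e: Event) -> str:
    """Placeholder for the multimodal layer that handles novel failures."""
    return f"queued for semantic analysis: {e.kind}={e.value}"

def triage(events):
    # Deterministic rules win when they fire; otherwise defer to the model.
    return [deterministic_checks(e) or semantic_analyze(e) for e in events]

report = triage([Event("battery", 3.2), Event("lidar_dropout", 1.0)])
```

The ordering is the design choice that matters: deterministic checks run first because a missed e-stop is unacceptable, while a lidar dropout nobody has categorized yet is exactly the “weird stuff” the semantic layer exists for.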

The possibility space is bigger than it used to be.

Most tooling hasn’t caught up.

The birth of RobotOps

People hear “observability” and think dashboards. This is different: rich, multimodal understanding, where the system watches everything and surfaces what broke and why.

The same way DevOps and MLOps became categories, RobotOps names what comes next.

Imagine custom mission reports that tell you what happened, searchable context across every test you’ve run, gap analysis of your ML models, and the ability to ask questions of your data and get precise, detailed answers.

That’s what we’re building at Alloy.

(P.S. If you’re running robots in production and want to find out why senior leaders at Waymo, Tesla & OpenAI have invested in us, reach out.)

See what 10× faster analysis looks like

Book a call to see how Alloy works with your data.