Why The Product Is Split In Two
Every robotics platform eventually has to answer a fundamental design question: where does the authority live? A purely cloud-centric product assumes persistent, low-latency connectivity between the robot and a remote server. That assumption fails in warehouses with steel racking that blocks Wi-Fi, in outdoor environments where cellular coverage is spotty, and in any scenario where a network partition should not cause a robot to stop moving. A purely edge-centric product keeps everything local but sacrifices the fleet-wide visibility, cross-robot coordination, and operational workflows that teams need once they move past a single-site pilot.
ROBOFLOW AI resolves this tension with a two-part architecture: a lightweight edge agent that runs close to the robot and a cloud control plane that provides centralized operations. The split is not arbitrary. It follows from three hard constraints that govern real robot deployments.
Latency and safety. Safety-critical decisions (obstacle avoidance, emergency stops, mission state transitions) must happen in single-digit milliseconds. They cannot tolerate a round trip to a cloud server. The edge agent keeps these decisions local, operating directly against the robot runtime.
Connectivity and resilience. Network links to robot fleets are unreliable by nature. Robots operate in loading docks, mine shafts, hospital corridors, and open fields. The architecture must treat connectivity as a resource that comes and goes, not as a prerequisite for correct operation. The edge agent buffers, queues, and reconciles. The cloud control plane accepts eventual consistency.
Scale and coordination. A single operator managing three robots at one site can get by with SSH sessions and a shared spreadsheet. Fifty robots across ten sites demand centralized fleet state, role-based access, deployment orchestration, analytics, and integration with business systems. That coordination layer belongs in the cloud, where it can serve multiple teams, aggregate data across environments, and evolve independently of the edge runtime.
This design draws directly from patterns proven in distributed infrastructure. The Kubernetes architecture uses a nearly identical split: the kubelet runs on each node, managing local container lifecycle and reporting status, while the API server and controller managers in the control plane handle scheduling, desired-state reconciliation, and cluster-wide policy. ROBOFLOW AI applies the same principle to physical machines. The edge agent is the kubelet for your robot. The cloud control plane is the API server for your fleet.
What The Edge Agent Does
The edge agent is a single, statically linked process that runs on the robot's onboard computer or on a co-located gateway device. Its design priorities are minimal resource footprint, offline-first operation, and controlled synchronization with the cloud.
Resource footprint. The agent targets less than 128 MB of resident memory and under 2% CPU utilization on a modern ARM-based compute module. It avoids heavy runtimes, language VMs, and framework dependencies. The binary is self-contained. This matters because robot compute budgets are tight: perception, planning, and control stacks already contend for CPU, GPU, and memory. An operational agent that competes with the autonomy stack for resources is an agent that will be removed by the first engineer who notices a latency spike.
Local-first operation. The edge agent maintains a local state store (a lightweight embedded database) that holds the robot's identity, current configuration, mission context, buffered telemetry, and pending workflow triggers. If the network disappears, the agent continues to collect data, evaluate local trigger conditions, and queue outbound messages. There is no degraded mode. Local-first is the mode. Cloud sync is an optimization on top of it.
Data buffering and selective forwarding. Not all data generated at the edge needs to reach the cloud. A lidar sensor producing 300,000 points per scan at 10 Hz yields about 3 million points per second, roughly 70 MB/s of raw data at typical per-point payloads. Forwarding that in its entirety is neither feasible nor useful for most operational purposes. The edge agent applies selective telemetry forwarding: it captures high-frequency signals locally and forwards aggregated summaries, event markers, and anomaly snapshots to the cloud. Configuration policies, pushed from the control plane, determine what gets forwarded at what resolution. Operators can temporarily increase telemetry resolution for a specific robot during an active investigation, then revert to baseline.
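A forwarding policy of this shape can be modeled as per-channel rules. The channel names, fields, and intervals below are hypothetical, a sketch of the decision the agent makes rather than the platform's actual policy schema:

```python
# Hypothetical per-channel forwarding policy: which channels leave the
# robot, and at what minimum interval. Field names are illustrative.
DEFAULT_POLICY = {
    "health_summary": {"forward": True,  "interval_s": 60},    # 1-min summaries
    "lidar_points":   {"forward": False, "interval_s": None},  # edge-only by default
    "fault_events":   {"forward": True,  "interval_s": 0},     # forward immediately
}

def should_forward(policy, channel, seconds_since_last):
    """Decide whether a sample on `channel` is forwarded to the cloud."""
    rule = policy.get(channel)
    if rule is None or not rule["forward"]:
        return False                      # unknown or edge-only channel
    if rule["interval_s"] in (0, None):
        return True                       # no rate limit: forward every sample
    return seconds_since_last >= rule["interval_s"]
```

A telemetry-escalation command from the control plane would then amount to swapping in a more permissive rule for a bounded window.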
Sync protocol. The agent communicates with the cloud control plane over mTLS-encrypted gRPC streams. The sync protocol is designed around three message types: state reports (robot health, mission status, software version), event emissions (mission completions, faults, workflow triggers), and command acknowledgments (configuration updates, deployment directives). When connectivity resumes after a gap, the agent replays buffered events in order, and the control plane applies idempotent reconciliation to avoid duplicate processing. This is closer to a log-based replication model than a request-response API: the agent appends to a local log, and the sync process tails that log to the cloud.
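The replay-and-deduplicate behavior can be sketched in a few lines. Everything here is illustrative (class names, the sequence-number scheme, an in-memory log standing in for a disk-backed one); the real protocol runs over mTLS gRPC and persists its log:

```python
class EdgeEventLog:
    """Append-only local log; the sync process tails it to the cloud."""
    def __init__(self):
        self.entries = []        # a real agent would persist this to disk
        self.next_seq = 0

    def append(self, event_type, payload):
        entry = {"seq": self.next_seq, "type": event_type, "payload": payload}
        self.entries.append(entry)
        self.next_seq += 1
        return entry

    def tail(self, after_seq=-1):
        """Entries newer than the last sequence the cloud acknowledged."""
        return [e for e in self.entries if e["seq"] > after_seq]

class ControlPlaneSink:
    """Cloud side: idempotent apply keyed on (robot_id, seq)."""
    def __init__(self):
        self.applied = {}        # (robot_id, seq) -> entry

    def ingest(self, robot_id, entries):
        accepted = 0
        for e in entries:
            key = (robot_id, e["seq"])
            if key in self.applied:
                continue         # duplicate from an overlapping replay window
            self.applied[key] = e
            accepted += 1
        return accepted
```

Because the cloud keys on the log position, replaying the same window twice after a reconnect is harmless: the second pass accepts nothing.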
Integration surface. The agent exposes a local API (Unix socket or localhost HTTP) that the robot's existing software can use to publish events, register telemetry channels, and query configuration. This means teams do not need to rewrite their ROS 2 nodes, vendor SDK integrations, or custom mission managers. They instrument at the boundary. The agent handles the rest.
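Instrumenting at the boundary might look like the following, with a socketpair standing in for the agent's Unix socket and a newline-delimited JSON wire format that is an assumption, not the documented schema:

```python
import json
import socket

def publish_event(sock, event_type, attrs):
    """Send one newline-delimited JSON event over the agent's local socket.
    (The wire format here is an assumption for illustration.)"""
    msg = json.dumps({"type": event_type, "attributes": attrs}) + "\n"
    sock.sendall(msg.encode("utf-8"))

# In-process stand-in for the agent's local socket endpoint:
client, agent = socket.socketpair()
publish_event(client, "mission_complete",
              {"mission_id": "m-17", "outcome": "success"})
received = json.loads(agent.recv(4096).decode("utf-8").strip())
```

A ROS 2 node or vendor SDK callback would call something like `publish_event` at its boundary and leave buffering, retry, and cloud sync to the agent.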
What The Cloud Control Plane Does
The cloud control plane is a multi-tenant platform that serves as the operational hub for every robot connected through an edge agent. It is where fleet state converges, where operators interact with the system, and where workflows, analytics, and integrations execute.
Fleet state management. The control plane maintains a real-time materialized view of every robot's status: connectivity, software version, mission state, health signals, and configuration. This view is assembled from the stream of state reports and events arriving from edge agents across the fleet. The state model uses last-writer-wins with vector clocks for conflict resolution in cases where an agent replays buffered state after reconnection. Operators see a live dashboard that reflects ground truth within seconds of an event occurring at the edge.
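The conflict-resolution rule, causal order first and wall-clock last-writer-wins only for concurrent updates, can be sketched as follows (the state shape and field names are illustrative):

```python
def vc_compare(a, b):
    """Compare two vector clocks (dicts of writer -> counter).
    Returns 'before', 'after', or 'concurrent' for a relative to b."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and not b_le_a:
        return "before"
    if b_le_a and not a_le_b:
        return "after"
    return "concurrent"   # equal clocks also land here; a no-op upstream

def resolve(current, incoming):
    """Keep current unless incoming causally follows it; for concurrent
    updates fall back to last-writer-wins by wall-clock timestamp."""
    order = vc_compare(incoming["clock"], current["clock"])
    if order == "after":
        return incoming
    if order == "concurrent" and incoming["ts"] > current["ts"]:
        return incoming
    return current
```

This is why a replayed state report from a reconnecting agent cannot clobber a newer cloud-side view: its clock shows it causally precedes what the control plane already holds.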
Deployment orchestration. Pushing a software update, a new ML model, or a configuration change to a fleet of robots is operationally risky. The control plane supports staged rollout strategies: canary deployments that target a single robot first, percentage-based rollouts that gradually expand coverage, and environment-scoped deployments that target a specific site. Each rollout is tracked as a first-class object with status, history, and rollback capability. This follows the GitOps pattern: the desired state of the fleet is declared in the control plane, and edge agents converge toward that desired state at their own pace. If an agent is offline during a rollout, it picks up the new target state when it reconnects.
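A percentage-based rollout needs a stable robot-to-bucket assignment so that expanding from 25% to 50% only adds robots, never swaps them. One common way to get that, shown here as a sketch rather than the platform's actual selection logic, is hashing robot IDs into fixed buckets:

```python
import hashlib

def rollout_targets(robot_ids, percent):
    """Deterministically select a stable subset for a percentage rollout.
    Hashing keeps each robot's bucket fixed as the percentage expands."""
    def bucket(rid):
        return int(hashlib.sha256(rid.encode()).hexdigest(), 16) % 100
    return sorted(r for r in robot_ids if bucket(r) < percent)
```

A canary is the degenerate case: a rollout scoped to a single named robot before the percentage stages begin.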
Workflow engine. The workflow builder in ROBOFLOW AI is backed by a directed-acyclic-graph execution engine in the control plane. Workflows are triggered by robot events (mission failure, anomaly detection, geofence breach), and they orchestrate sequences of actions: send a notification, create an incident ticket, escalate to a human operator, pause a robot, trigger a diagnostic capture. Workflows execute in the cloud because they often involve cross-robot logic, external integrations, and human-in-the-loop steps that do not belong on the edge.
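DAG execution in miniature, using Python's standard-library topological sort (Python 3.9+); the step names and the results-passing convention are illustrative, not the engine's actual API:

```python
from graphlib import TopologicalSorter

def run_workflow(dag, actions):
    """Run workflow steps in dependency order.
    dag: {step: set of prerequisite steps}; actions: {step: callable}
    taking the results of already-finished steps."""
    results = {}
    for step in TopologicalSorter(dag).static_order():
        results[step] = actions[step](results)
    return results

# A hypothetical mission-failure workflow: capture diagnostics first,
# open a ticket referencing them, then notify the operator.
dag = {
    "capture_diagnostics": set(),
    "create_ticket": {"capture_diagnostics"},
    "notify_operator": {"create_ticket"},
}
actions = {
    "capture_diagnostics": lambda r: "snapshot-ready",
    "create_ticket": lambda r: f"ticket({r['capture_diagnostics']})",
    "notify_operator": lambda r: "sent",
}
results = run_workflow(dag, actions)
```

The real engine additionally handles retries, timeouts, and human-in-the-loop pauses, which is precisely why it lives in the cloud rather than on the robot.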
API layer and integrations. The control plane exposes a RESTful API and webhook system for connecting robot operations to external business systems: warehouse management systems, ticketing platforms, communication tools, BI dashboards. Every event that flows through the platform can be forwarded to an external endpoint, filtered and transformed as needed. The API also supports programmatic fleet management for teams that want to build custom tooling on top of the platform.
Multi-tenancy and access control. The platform supports multiple organizations, teams, and roles. Role-based access control governs who can view fleet data, trigger workflows, approve deployments, or modify configurations. This is critical for enterprise deployments where operations teams, engineering teams, and site managers need different levels of access to the same fleet.
Data Flow: What Stays On The Edge vs. What Syncs To The Cloud
The boundary between edge and cloud is not just architectural. It is a data flow design decision that affects bandwidth costs, storage requirements, latency, and privacy. Getting this wrong, either by forwarding too much or too little, creates operational problems.
ROBOFLOW AI uses a tiered data model with three categories:
Edge-only data. High-frequency sensor streams (lidar point clouds, camera frames, IMU readings) stay on the robot by default. This data is consumed by the local autonomy stack for perception and control. It is not forwarded to the cloud unless explicitly requested for a bounded time window during debugging or data collection campaigns. Keeping this data local avoids the bandwidth problem entirely: a fleet of 50 robots each generating 50 MB/s of sensor data would require 2.5 GB/s of sustained upstream bandwidth to forward everything, which is neither economical nor necessary.
Summarized and event-driven data. The edge agent computes local aggregations (1-minute health summaries, mission outcome events, fault signatures, and resource utilization snapshots) and forwards these to the cloud. This is the primary data channel for fleet observability. It provides enough resolution for operators to understand what is happening without the cost of raw data forwarding. Typical bandwidth per robot for this tier is 5-50 KB/s, which is manageable even over constrained cellular links.
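A 1-minute health summary of the kind described above might be computed like this (field names are illustrative):

```python
from statistics import mean

def summarize_window(samples):
    """Collapse one minute of high-rate health samples into the compact
    summary the agent forwards to the cloud."""
    values = [s["cpu_pct"] for s in samples]
    return {
        "count": len(samples),
        "cpu_mean": round(mean(values), 2),
        "cpu_max": max(values),
        "faults": sum(1 for s in samples if s.get("fault")),
    }
```

The summary is a few hundred bytes regardless of how many raw samples fed it, which is what keeps this tier inside the 5-50 KB/s budget.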
On-demand high-resolution data. When an operator or an automated workflow needs deeper visibility into a specific robot, the control plane can issue a telemetry escalation request to the edge agent. This temporarily increases the forwarding resolution for selected data channels: uploading a 30-second camera clip around a fault event, streaming a higher-frequency diagnostic signal, or capturing a snapshot of the local state store. Once the investigation window closes, the agent reverts to baseline forwarding policy.
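Escalation with automatic reversion is essentially a baseline policy plus time-bounded overrides; a minimal sketch, with illustrative channel names and a caller-supplied clock:

```python
class ForwardingPolicy:
    """Baseline per-channel resolution plus expiring escalations (sketch)."""
    def __init__(self, baseline):
        self.baseline = dict(baseline)     # channel -> baseline resolution
        self.escalations = {}              # channel -> (resolution, expires_at)

    def escalate(self, channel, resolution, now, window_s):
        """Control-plane request: raise resolution for a bounded window."""
        self.escalations[channel] = (resolution, now + window_s)

    def resolution(self, channel, now):
        """Current effective resolution; reverts once the window closes."""
        esc = self.escalations.get(channel)
        if esc and now < esc[1]:
            return esc[0]
        self.escalations.pop(channel, None)   # window closed: back to baseline
        return self.baseline.get(channel, 0)
```

Expiry lives on the agent, so an escalation still reverts on schedule even if the control plane becomes unreachable before sending a revert command.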
This tiered model is the practical answer to a question every robotics platform has to confront: how do you provide cloud-scale analytics without cloud-scale bandwidth? The answer is that you push aggregation and filtering to the edge, treat the cloud as the system of record for operational events, and use on-demand escalation for the cases that require raw data. It is the same pattern that modern observability platforms like Datadog and Honeycomb use for application telemetry: sample aggressively by default, and retain full fidelity only where it matters.
Security Model: mTLS, Certificate Rotation, and Access Control
A robotics platform that bridges edge devices and cloud infrastructure presents a significant attack surface. The edge agent runs on hardware that may be physically accessible. The communication channel traverses networks that the platform does not control. The cloud control plane holds fleet-wide operational data. ROBOFLOW AI's security model addresses each layer.
Mutual TLS for all edge-cloud communication. Every connection between an edge agent and the cloud control plane uses mutual TLS (mTLS). Both sides present and verify certificates. This ensures that the cloud only accepts connections from authenticated agents, and that agents only send data to the legitimate control plane. The certificates are issued by a platform-managed certificate authority, and the enrollment process uses short-lived bootstrap tokens that expire after initial registration.
Automated certificate rotation. Long-lived credentials on edge devices are a liability. If a robot is decommissioned, lost, or compromised, its credentials must be revocable without disrupting the rest of the fleet. The platform implements automated certificate rotation on a configurable schedule (default: 24 hours). The edge agent requests a new certificate before the current one expires, using the existing valid certificate as proof of identity. If an agent fails to rotate (because it was offline for an extended period), it must re-enroll through an operator-approved process. Revocation is immediate: the control plane maintains a certificate revocation list that is checked on every connection.
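The agent's per-cycle rotation decision reduces to three outcomes; a sketch with illustrative thresholds, not the agent's actual defaults:

```python
def rotation_action(now, cert_expiry, margin_s):
    """Decide what the agent does this rotation cycle (times in seconds).
    Thresholds and return values are illustrative."""
    if now >= cert_expiry:
        return "re-enroll"    # cert lapsed while offline; operator approval needed
    if now >= cert_expiry - margin_s:
        return "rotate"       # request a new cert using the current one as identity
    return "ok"
```

The key property is that "rotate" only works while the current certificate is still valid, which is why an agent offline past expiry falls into the operator-approved re-enrollment path instead.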
Role-based access control in the cloud. The control plane enforces RBAC at the API, dashboard, and workflow level. Roles include fleet viewer (read-only access to dashboards and analytics), operator (can trigger workflows and approve interventions), deployer (can initiate software rollouts), and administrator (full configuration and access management). Permissions are scoped to organizations, teams, and optionally to specific environments or robot groups. All access is logged in an immutable audit trail.
Edge agent isolation. The agent process runs with minimal system privileges on the robot. It does not require root access. It communicates with the robot runtime through a defined local API surface and does not have arbitrary access to the host filesystem or network interfaces. On Linux-based robot computers, the agent can optionally run in a container or under a dedicated system user with AppArmor or seccomp profiles applied.
The overall security posture follows the zero-trust principle: no component in the system implicitly trusts another. Every connection is authenticated, every action is authorized, and every event is auditable.
Failure Modes: What Happens When Things Break
Distributed systems fail. Networks partition. Processes crash. Hardware degrades. The measure of an architecture is not whether it prevents failures but how it behaves during and after them. ROBOFLOW AI is designed around explicit failure mode handling for the most common scenarios.
Network partition (agent loses cloud connectivity). This is the most frequent failure mode and the one the architecture is most explicitly designed for. The edge agent continues operating in local-first mode. Telemetry is buffered to the local state store, which is sized to hold at least 72 hours of summarized data at default forwarding resolution. Workflow triggers that require cloud-side execution are queued locally with timestamps. When connectivity resumes, the agent replays buffered data in chronological order. The control plane applies idempotent event processing: duplicate events (from retries or overlapping replay windows) are deduplicated using event IDs and timestamps. Fleet dashboard state converges within seconds of reconnection.
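The 72-hour sizing target follows directly from the summarized-tier bandwidth figures; a quick back-of-envelope:

```python
def buffer_bytes(hours, kb_per_s):
    """Local store sizing: hours of summarized telemetry at a given
    forwarding rate (rates taken from the tiered data model above)."""
    return hours * 3600 * kb_per_s * 1024

upper = buffer_bytes(72, 50)   # 72 h at the tier's 50 KB/s upper bound
lower = buffer_bytes(72, 5)    # 72 h at the tier's 5 KB/s lower bound
```

At the upper bound this is roughly 12 GiB, and at the lower bound just over 1 GiB, both comfortably within the storage available on typical robot compute modules.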
Edge agent process crash. The agent is managed by a system-level process supervisor (systemd on Linux, launchd on macOS-based development environments). If the process exits unexpectedly, the supervisor restarts it within seconds. The local state store is persisted to disk, so the agent recovers its buffered data, configuration, and identity after restart. The control plane detects the brief connectivity gap and marks the robot as temporarily unreachable, then clears the status once the agent reconnects and resumes reporting.
Cloud control plane degradation. The control plane is deployed across multiple availability zones with active-active redundancy for the state management and API layers. If a zone fails, traffic is routed to healthy zones. From the edge agent's perspective, a cloud-side outage looks like a network partition: the agent buffers locally and retries. Operators using the dashboard may experience degraded responsiveness during a failover, but the fleet itself is unaffected because robots do not depend on the cloud for safe operation. This is the critical safety property of the architecture: a cloud outage is an operational inconvenience, not a safety incident.
Conflict resolution after extended partition. When an agent has been offline for an extended period, there is potential for state conflicts. The robot may have been reconfigured locally during the partition. The control plane may have issued a new deployment target that the agent never received. ROBOFLOW AI resolves these conflicts using a control-plane-authoritative model with manual override: the control plane's desired state is treated as the source of truth, and the agent converges toward it after reconnection. However, operators can flag specific robots for manual reconciliation if the local state includes changes that should not be overwritten. This avoids the worst failure mode in fleet management: a silent overwrite of field-level configuration that an operator deliberately set during an outage.
Degraded telemetry. If the robot's compute resources are constrained (high CPU from the autonomy stack, low disk space), the edge agent automatically reduces its own footprint. It drops to a survival mode that forwards only critical health signals and fault events, pausing summary aggregation and non-essential telemetry. This self-throttling ensures the agent never becomes the reason a robot mission fails.
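The mode switch is a simple threshold check on host pressure; the thresholds below are illustrative, not the agent's actual defaults:

```python
def agent_mode(cpu_pct, free_disk_mb, cpu_limit=85, disk_floor_mb=500):
    """Pick the agent's operating mode from host resource pressure.
    Survival mode forwards only critical health and fault signals."""
    if cpu_pct > cpu_limit or free_disk_mb < disk_floor_mb:
        return "survival"
    return "normal"
```

Because the check runs on the agent itself, the throttling decision never depends on cloud connectivity.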
How The Architecture Enables Platform Capabilities
The edge-cloud split is not an end in itself. It is the foundation that makes every user-facing capability in ROBOFLOW AI possible. Each platform feature maps directly to a property of the architecture.
Fleet Operations Dashboard depends on the real-time materialized view of fleet state in the control plane. Because every agent reports health, mission status, and software version through a structured state protocol, the dashboard can present a live, consistent view of the entire fleet without polling individual robots. Operators see what they need (connection status, active missions, recent faults, deployment versions) without SSH access or vendor-specific portals.
Workflow Builder depends on the event stream from edge agents feeding into the cloud-side DAG execution engine. Workflows are triggered by events that originate at the edge (a robot completes a mission, enters a geofence, reports a hardware fault) but execute in the cloud where they can coordinate across robots, call external APIs, and involve human decision points. The edge agent's role is to emit events reliably. The cloud's role is to orchestrate the response.
Analytics depends on the tiered data flow model. Summarized telemetry from every robot in the fleet is aggregated in the cloud, enabling fleet-wide trend analysis: mission success rates by environment, intervention frequency by robot model, uptime patterns over time. Because the edge agent handles local aggregation, the analytics pipeline receives pre-processed data that is ready for indexing and querying, not raw sensor streams that would require expensive cloud-side processing.
Integrations depend on the API layer and webhook system in the control plane. When a robot event needs to trigger an action in an external system, the integration fires from the cloud, where it has reliable network access, retry logic, and credential management. The edge agent never needs direct connectivity to third-party services. This simplifies network configuration at deployment sites and centralizes integration management for the operations team.
Deployment orchestration depends on the GitOps-inspired desired-state model. The control plane declares what each robot (or group of robots) should be running. Edge agents converge toward that desired state independently. This decoupling means deployments are resilient to the messy realities of fleet operations: some robots are online, some are not, some are mid-mission, some are in maintenance mode. The control plane tracks convergence and reports progress without requiring synchronous coordination with every agent.
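Convergence tracking on the control-plane side is, at its core, a comparison of reported versions against the declared target; a sketch with illustrative field names:

```python
def rollout_progress(reported_versions, target):
    """Summarize fleet convergence toward a desired software version.
    reported_versions: {robot_id: version string from latest state report}."""
    total = len(reported_versions)
    converged = sum(1 for v in reported_versions.values() if v == target)
    return {
        "converged": converged,
        "total": total,
        "pct": round(100 * converged / total, 1) if total else 0.0,
    }
```

Offline robots simply keep reporting their old version until they reconnect and converge, so progress is always an honest reflection of fleet state rather than of commands sent.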
The architecture is designed so that each capability is a natural consequence of the edge-cloud split rather than a feature bolted on afterward. The edge agent provides reliable data collection and local autonomy. The cloud control plane provides coordination, analysis, and integration. The user-facing features are the interface between those two layers, giving teams the operational leverage they need to manage robots at fleet scale.