Inside the Engine Room: How OpenAI Streams Intelligence at Scale

Date

October 15, 2025

Author

Akshay Atam

In Part 1, we explored why OpenAI had to move beyond batch processing and build a real-time streaming system to keep its AI models learning continuously.

Now it is time to open the hood and see how they actually built it.

This part of the story is where software engineering meets research infrastructure. It is not just about moving data faster but also about building a system that survives failures, scales seamlessly, and feels natural to Python-first AI teams.

The Big Picture

OpenAI’s streaming platform is built on top of Apache Flink, accessed primarily through its Python API, PyFlink. Flink is the distributed processing engine that executes computations over continuous data streams. However, by itself, Flink was not enough to meet OpenAI’s needs for reliability and flexibility at massive scale.

To fill the gaps, the engineering team built several layers around it:

  1. A control plane for multi-cluster job management and automatic failover.

  2. A Kubernetes-based setup for container orchestration and team isolation.

  3. Watchdog services that monitor Kafka’s health and adapt pipelines automatically.

  4. State and storage management that decouples Flink’s memory from clusters and keeps checkpoints resilient.

Together, these layers create a self-healing, Python-friendly ecosystem that can process terabytes of data while staying responsive to infrastructure failures.

The Control Plane: Keeping the System Aware

The control plane is the command center of OpenAI’s stream processing platform. It manages every job that runs across multiple Flink clusters and ensures they remain healthy.

If a Kubernetes cluster experiences an outage, the control plane automatically triggers job failover. Instead of letting a pipeline go down, it moves the job to another cluster while restoring its state from storage. This ensures that data streams do not break even when parts of the infrastructure fail.

The control plane also connects directly to OpenAI’s internal deployment system. That means developers can submit, upgrade, or roll back streaming jobs using the same workflows they use for deploying other services. This consistency keeps the developer experience smooth and minimizes friction between research and production operations.

In short, the control plane handles the "when and where" of streaming jobs, while Flink focuses on the "how."
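OpenAI has not published the control plane's code, but its failover decision can be sketched at a high level. The snippet below is a hypothetical Python sketch: the cluster names and the is_healthy, latest_checkpoint, and submit_job helpers are stand-ins invented for illustration, not OpenAI's actual APIs.

    # Hypothetical sketch of a control-plane failover pass. All helpers and
    # names are illustrative stand-ins, not OpenAI's real interfaces.

    CLUSTERS = ["cluster-a", "cluster-b", "cluster-c"]

    def is_healthy(cluster: str) -> bool:
        # In reality: probe the Kubernetes API / Flink REST endpoint.
        return cluster != "cluster-a"          # pretend cluster-a just failed

    def latest_checkpoint(job_id: str) -> str:
        # In reality: look up the newest retained checkpoint in blob storage.
        return f"abfss://checkpoints@team.dfs.core.windows.net/{job_id}/chk-42"

    def submit_job(cluster: str, job_id: str, restore_path: str) -> None:
        # In reality: create or patch the job's deployment on the target cluster.
        print(f"restoring {job_id} on {cluster} from {restore_path}")

    def reconcile(job_id: str, active: str) -> str:
        """One reconciliation pass: keep the job running on a healthy cluster."""
        if is_healthy(active):
            return active
        for candidate in CLUSTERS:
            if candidate != active and is_healthy(candidate):
                submit_job(candidate, job_id, latest_checkpoint(job_id))
                return candidate
        raise RuntimeError("no healthy cluster available")

    print(reconcile("events-enrichment", active="cluster-a"))

The real control plane runs this kind of reconciliation continuously and ties it into OpenAI's deployment tooling, but the essential move is the same: detect an unhealthy cluster, pick a healthy one, and restart the job from durable state.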

Running Apache Flink on Kubernetes

Instead of deploying Flink on bare-metal servers, OpenAI uses Kubernetes to manage it. They rely on the Flink Kubernetes Operator, which automates deployments, scaling, and recovery.

Every research team or project runs in its own Kubernetes namespace. This isolation provides three key benefits:

  • Reliability: A crash in one namespace does not affect other pipelines.

  • Security: Each team accesses only its own data and storage accounts.

  • Scalability: Teams can tune resources independently without coordination bottlenecks.

The result is a modular and flexible architecture. Each namespace behaves like its own mini cluster, which allows OpenAI to scale individual pipelines without risking others.
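To make the per-namespace setup concrete, here is a rough sketch of what deploying a job through the Flink Kubernetes Operator can look like, using the official Kubernetes Python client to create a FlinkDeployment custom resource in a team's namespace. The namespace, image, paths, and job settings are invented placeholders; OpenAI's actual specs are not public.

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside a cluster

    # Minimal FlinkDeployment (Flink Kubernetes Operator CRD). Values below are
    # illustrative placeholders, not OpenAI's configuration.
    flink_deployment = {
        "apiVersion": "flink.apache.org/v1beta1",
        "kind": "FlinkDeployment",
        "metadata": {"name": "events-enrichment", "namespace": "team-research-a"},
        "spec": {
            "image": "flink:1.19",
            "flinkVersion": "v1_19",
            "serviceAccount": "flink",
            "jobManager": {"resource": {"memory": "2048m", "cpu": 1}},
            "taskManager": {"resource": {"memory": "4096m", "cpu": 2}},
            "job": {
                # Jar path and args depend on the Flink image; this is the usual
                # pattern for launching a PyFlink script through the operator.
                "jarURI": "local:///opt/flink/opt/flink-python-1.19.0.jar",
                "entryClass": "org.apache.flink.client.python.PythonDriver",
                "args": ["-py", "/opt/flink/usrlib/enrichment.py"],
                "parallelism": 4,
                "upgradeMode": "last-state",  # resume from the latest checkpoint on upgrade
            },
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="flink.apache.org",
        version="v1beta1",
        namespace="team-research-a",
        plural="flinkdeployments",
        body=flink_deployment,
    )

Because each team submits resources like this into its own namespace, resource limits, service accounts, and storage credentials stay scoped to that team.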

Watchdogs: The System's Silent Guardians

Flink pipelines are tightly coupled with Kafka, the event-streaming platform that delivers continuous flows of data to and from models, experiments, and services. But Kafka itself is a living system. Topics evolve, partitions move, and sometimes clusters fail over.

To keep Flink pipelines stable through all that change, OpenAI created watchdog services. These watchdogs continuously monitor Kafka’s topology and automatically adjust Flink jobs when something shifts.

  • If a Kafka topic gains new partitions, the watchdog scales the pipeline to read from them.

  • If a cluster fails, it ensures that another source takes over without human intervention.

These background services act like automated operations engineers, preventing downtime and keeping the system resilient even under unpredictable load.
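The watchdogs themselves are internal services, but the core partition check can be sketched with a plain Kafka admin client. In this hypothetical sketch, rescale_pipeline is a placeholder for whatever corrective action the watchdog takes (for example, redeploying the job with higher parallelism); it is not a real API.

    import time
    from confluent_kafka.admin import AdminClient

    def rescale_pipeline(topic: str, partitions: int) -> None:
        # Placeholder for the real action, e.g. patching the job's deployment
        # so it is restarted with parallelism matching the new partition count.
        print(f"rescaling pipeline for {topic} to cover {partitions} partitions")

    def watch(bootstrap_servers: str, topic: str, poll_seconds: int = 60) -> None:
        admin = AdminClient({"bootstrap.servers": bootstrap_servers})
        known_partitions = None
        while True:
            metadata = admin.list_topics(topic=topic, timeout=10)
            partitions = len(metadata.topics[topic].partitions)
            if known_partitions is not None and partitions != known_partitions:
                rescale_pipeline(topic, partitions)
            known_partitions = partitions
            time.sleep(poll_seconds)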

State and Storage: How the System Remembers

Streaming jobs depend heavily on state, which is the memory of what a job has processed so far. Losing that state during an outage would mean losing progress and potentially corrupting results.

OpenAI uses RocksDB to store local operator state within Flink. RocksDB is a high-performance embedded key-value store designed to handle huge amounts of intermediate data efficiently.

To make recovery possible across clusters, the team built per-namespace blob storage accounts for durable state backups and checkpoints. Because these storage accounts exist outside any specific Kubernetes cluster, Flink jobs can be migrated seamlessly. If one cluster fails, the new one simply restores its last known checkpoint from blob storage and continues processing without data loss.

Security was also a priority. By upgrading hadoop-azure to version 3.4.1, OpenAI enabled Azure workload identity authentication, which allows pipelines to access blob storage securely without hard-coded credentials. It is a small but essential step for reducing operational risk and improving security hygiene.
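The exact configuration is internal, but in PyFlink these ideas, RocksDB for local state, durable checkpoints in an external blob store, and checkpoints retained so another cluster can pick them up, map onto standard Flink options. A minimal sketch follows; the storage account, container, and intervals are invented, and option names vary slightly across Flink versions.

    from pyflink.common import Configuration
    from pyflink.datastream import StreamExecutionEnvironment

    conf = Configuration()
    # Keep operator state in an embedded RocksDB instance on local disk.
    conf.set_string("state.backend.type", "rocksdb")
    # Write durable checkpoints to a per-namespace blob storage account
    # (illustrative abfss:// URI) that lives outside any single cluster.
    conf.set_string("state.checkpoints.dir",
                    "abfss://checkpoints@team-research-a.dfs.core.windows.net/flink")
    # Checkpoint regularly and retain checkpoints on cancellation so a job
    # can be restored on a different cluster after a failover.
    conf.set_string("execution.checkpointing.interval", "60 s")
    conf.set_string("execution.checkpointing.externalized-checkpoint-retention",
                    "RETAIN_ON_CANCELLATION")

    env = StreamExecutionEnvironment.get_execution_environment(conf)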

Bringing Streaming to Python with PyFlink

Most AI researchers at OpenAI work in Python, so the streaming platform had to feel natural to them. That is where PyFlink, the Python API for Apache Flink, plays a central role.

PyFlink provides two main interfaces:

  1. DataStream API, for granular control of how data flows through operators.

  2. Table/SQL API, for writing SQL-like transformations that feel closer to data science workflows.

Both APIs integrate directly with OpenAI’s Python monorepo, allowing developers to import familiar libraries and tools.
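As a tiny, self-contained illustration of the DataStream API (using an in-memory collection instead of OpenAI's Kafka topics), a minimal PyFlink job looks roughly like this:

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    # A stand-in source; production pipelines read from Kafka instead.
    events = env.from_collection(["prompt_logged", "feedback_submitted", "tool_called"])

    # A plain Python function becomes a streaming operator.
    events.map(lambda event: {"event": event, "source": "demo"}).print()

    env.execute("minimal_pyflink_job")

The Table/SQL API expresses the same kind of pipeline declaratively, which is often the more natural fit for researchers used to dataframe-style workflows.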

However, PyFlink is not without challenges. It can run in two modes:

  • Process Mode, where Python code runs in its own process. This offers isolation but adds communication overhead.

  • Thread Mode, where Python runs in the same process as Java, which reduces latency but sacrifices some safety.

The team continues to balance these trade-offs to make streaming both performant and developer-friendly.
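Choosing between the two modes is a configuration setting rather than a code change. The option below is a standard PyFlink setting; whether thread mode actually helps depends on the job's operators and libraries.

    from pyflink.common import Configuration
    from pyflink.datastream import StreamExecutionEnvironment

    conf = Configuration()
    # "process" (default): Python code runs in separate worker processes,
    # isolated from the JVM but paying inter-process communication overhead.
    # "thread": Python runs embedded in the JVM process, trading isolation
    # for lower latency.
    conf.set_string("python.execution-mode", "thread")

    env = StreamExecutionEnvironment.get_execution_environment(conf)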

Kafka Connector Design

The link between Flink and Kafka is at the heart of OpenAI’s data flow. Since OpenAI runs multiple primary Kafka clusters for high availability, the engineering team had to go beyond standard Flink connectors.

They created a custom multi-cluster connector that can read from multiple Kafka primaries simultaneously and write data reliably even during failovers. For writing, they built the Prism Sink, which sends data back into Kafka. Although it does not yet guarantee perfect “exactly-once” semantics, it significantly improves resilience.
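The multi-cluster connector and the Prism Sink are internal to OpenAI, so they are not shown here. For context, this is roughly what the standard single-cluster PyFlink Kafka source, the building block such a connector generalizes, looks like; the broker address, topic, and group id are placeholders.

    from pyflink.common import WatermarkStrategy
    from pyflink.common.serialization import SimpleStringSchema
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.datastream.connectors.kafka import KafkaSource, KafkaOffsetsInitializer

    env = StreamExecutionEnvironment.get_execution_environment()
    # Note: running this also requires the Flink Kafka connector jar on the classpath.

    # Standard single-cluster source; OpenAI's custom connector extends this
    # idea to read from several primary Kafka clusters and survive failovers.
    source = (
        KafkaSource.builder()
        .set_bootstrap_servers("kafka-primary-1:9092")
        .set_topics("model-feedback")
        .set_group_id("streaming-demo")
        .set_starting_offsets(KafkaOffsetsInitializer.latest())
        .set_value_only_deserializer(SimpleStringSchema())
        .build()
    )

    events = env.from_source(source, WatermarkStrategy.no_watermarks(), "kafka-source")
    events.print()
    env.execute("kafka_read_demo")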

In addition, OpenAI engineers contributed open-source improvements back to the Flink community. These include features like metadata fetch retries (FLINK-37366) and dynamic Kafka sources that adapt at runtime. This collaboration not only benefits OpenAI but also strengthens the streaming ecosystem for everyone.

High Availability and Failover

Failures are not hypothetical in cloud-scale systems; they are expected. Entire Kubernetes clusters can go down because of provider outages.

To prepare for such events, OpenAI’s control plane coordinates failover using state and high-availability (HA) storage that are decoupled from any single cluster. If one cluster fails, the system automatically restarts the job on another healthy cluster and restores its state from checkpoints. Even the storage accounts themselves can fail over independently, creating a multi-layered resilience system.

The result is a streaming backbone that keeps operating even when parts of the cloud go dark.

Conclusion

Building a stream processing system at OpenAI’s scale is not only about using powerful tools like Flink or Kafka. It is about integrating them into a fault-tolerant, Python-friendly, and experiment-driven ecosystem.

Every layer, from watchdogs to Kubernetes namespaces, is designed to reduce friction and keep the data flywheel spinning.

  • Fresh data keeps models learning faster.

  • Faster learning leads to better AI.

  • And better AI generates even more useful data to feed back into the system.

That is how OpenAI keeps its intelligence alive and continuously evolving.


Got questions?

I’m always excited to collaborate on innovative and exciting projects!