The Data Flywheel: How OpenAI Uses Streaming to Keep AI Fresh

Date

October 7, 2025

Author

Akshay Atam

What's one thing that comes to mind when you think about OpenAI? Most of you would probably think of ChatGPT, the LLM that reshaped the way we interact with technology, or Sora 2, the video generation model that made text-to-film a reality.

But beneath the dazzling demos and billion-parameter architectures lies something less visible yet far more foundational: data that never rests.

Over the weekend, I found myself thinking back to the first time I saw the barebones GPT-3 model during my Master's. I remember being amazed not just by what it could do, but by what it implied. How does a model like that evolve into the world's most used AI application? GPUs and model architectures play their part, sure. But what about the pipeline that feeds it?

That curiosity led me to explore how OpenAI engineered its data backbone - a platform that turns raw, fast-moving data into an intelligent feedback loop the team calls the data flywheel. It’s a system where fresher data drives faster learning, and faster learning leads to better models.

The insights I share are based on details made public by OpenAI and the Confluent Engineering Team. All technical credit belongs to them; my goal is simply to unpack how such systems come together and why streaming has become indispensable for modern AI.

This blog kicks off a two-part series exploring that system.

  • Part 1 focuses on the why - why OpenAI needed to rethink its entire data backbone.

  • Part 2 will focus on the how - how they built it, from PyFlink and Kubernetes to Kafka and the custom watchdog services that make it all happen.

From Batch Processing to Stream Processing

AI systems live and die by data quality and freshness. In the early days, most companies (even the ones running large-scale ML models) used batch processing - collecting data for hours or days, then processing it all at once. That's fine for nightly analytics. But for a system like ChatGPT, which learns from user interactions every minute, batch-collected data is already outdated and stale by the time it's processed.

Enter stream processing.

Stream processing flips that model: process the data as it arrives. Instead of collecting data and waiting for a batch job, the system transforms, cleans, and routes each record in near real time.
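To make the contrast concrete, here's a minimal, hypothetical Python sketch (not OpenAI's code); `events()` and `clean()` are stand-ins for a real event source and a real transformation step:

```python
import time

def events():
    """Stand-in for a real event source (e.g., user interaction logs)."""
    for i in range(10):
        yield {"id": i, "ts": time.time(), "text": f"Event {i} "}

def clean(event):
    """Stand-in for a transformation step (filtering, normalizing, etc.)."""
    return {**event, "text": event["text"].strip().lower()}

# Batch: accumulate everything, then process in one pass.
# Records at the front of the buffer are already old when the job runs.
batch = list(events())
processed = [clean(e) for e in batch]

# Streaming: process each record the moment it arrives, so
# downstream consumers see it with minimal latency.
for event in events():
    record = clean(event)
    print(record)  # in practice: route to a sink such as Kafka
```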

For OpenAI, that change unlocked two huge advantages:

  1. Fresher training data gives smarter models: The faster new data reaches the model, the faster it learns. This strengthens the so-called data flywheel: new data is incorporated in near real time, producing better models; better models attract more usage, which generates more fresh data. The cycle repeats with every new wave of data fed into the system.

  2. Faster experimentation results in faster innovation: With a constant stream of data, researchers at OpenAI can run experiments daily, backed by real-time logs and analytics. What took days of work now takes hours, and that boost in iteration speed gives a competitive edge.

Why Streaming Is Hard at OpenAI's Scale

On paper, feeding data in real time sounds easy. "Just stream everything!" In practice, the engineers at OpenAI had several challenges to overcome.

  1. Python First, Flink Later

Python is the language of ML, and almost every researcher at OpenAI writes in it. However, Apache Flink - the industry's workhorse for distributed stream processing - has its source code in Java and Scala. So the team at OpenAI had to extend PyFlink, Apache Flink's Python API, to feel natural for ML engineers while retaining the reliability of the Java Virtual Machine (JVM) backend.

This meant deep customization, new tools, and patience for quirks that came with bridging two languages.
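For readers who haven't seen PyFlink, here's a minimal word-count-style job using the public PyFlink DataStream API. This is generic PyFlink, not OpenAI's internal extensions; the Python lambdas you pass in run as operators on the JVM-backed Flink runtime:

```python
from pyflink.datastream import StreamExecutionEnvironment

# The execution environment is the Python handle to the JVM-based Flink runtime.
env = StreamExecutionEnvironment.get_execution_environment()

# A bounded in-memory source keeps the example self-contained;
# a production job would read from a connector such as Kafka.
ds = env.from_collection(["fresh data", "fast data", "fresh models"])

# Plain Python functions become Flink operators in the pipeline.
(ds.flat_map(lambda line: line.split(" "))
   .map(lambda word: (word, 1))
   .key_by(lambda pair: pair[0])
   .reduce(lambda a, b: (a[0], a[1] + b[1]))
   .print())

env.execute("word_count_sketch")
```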

  2. Cloud Capacity and Scalability Constraints

Running continuous data pipelines pushes cloud limits such as compute quotas, storage I/O, and network bandwidth. Streaming data in real time meant the system had to survive partial outages and shifting compute capacity without losing data or bringing pipelines down.
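Flink's standard answer to surviving failures is checkpointing plus a restart strategy. The sketch below shows how these are typically enabled from PyFlink; the intervals and retry counts are illustrative assumptions, not OpenAI's settings:

```python
from pyflink.common.restart_strategy import RestartStrategies
from pyflink.datastream import CheckpointingMode, StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Snapshot operator state every 60s so a crashed job can resume
# from the last checkpoint instead of losing in-flight data.
env.enable_checkpointing(60_000)
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)

# On failure, retry up to 3 times with a 10s pause between attempts,
# rather than tearing the whole pipeline down on a transient outage.
env.set_restart_strategy(RestartStrategies.fixed_delay_restart(3, 10_000))
```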

  3. Multi-Primary Kafka Complexity

For reliability, OpenAI runs multiple Kafka clusters instead of one. That's great for uptime, but terrible for off-the-shelf connectors: standard Flink-Kafka integrations assume a single cluster, so a simple network blip on one cluster could crash the entire job.

Solving that meant re-engineering connectors to gracefully handle multi-cluster failover, one of the trickiest pieces of distributed data design.
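OpenAI's actual connector isn't public, but the core idea - isolating each cluster's failures so one blip can't take down the job - can be sketched in plain Python with the confluent-kafka client. The cluster addresses, topic, and group ID below are placeholders:

```python
from confluent_kafka import Consumer, KafkaException

# Placeholder addresses for two independent "primary" clusters.
CLUSTERS = ["kafka-a.internal:9092", "kafka-b.internal:9092"]
TOPIC = "user-events"  # hypothetical topic name

def make_consumer(bootstrap):
    return Consumer({
        "bootstrap.servers": bootstrap,
        "group.id": "flywheel-sketch",
        "auto.offset.reset": "earliest",
    })

# One consumer per cluster, so failures stay scoped to that cluster.
consumers = {addr: make_consumer(addr) for addr in CLUSTERS}
for c in consumers.values():
    c.subscribe([TOPIC])

# Round-robin poll across clusters. Errors from one cluster are
# logged and skipped instead of crashing the loop, so a network
# blip on cluster A never stalls data flowing from cluster B.
while True:
    for addr, consumer in consumers.items():
        try:
            msg = consumer.poll(timeout=0.1)
            if msg is None:
                continue
            if msg.error():
                print(f"[{addr}] broker error: {msg.error()}")
                continue
            print(f"[{addr}] {msg.value().decode('utf-8')}")
        except KafkaException as exc:
            print(f"[{addr}] poll failed, will retry: {exc}")
```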

Enter the Data Flywheel

The engineering effort to overcome these challenges delivered exactly what OpenAI was looking for: lower data latency at every step. Every hour saved in data ingestion or preprocessing compresses the loop between user feedback and model improvement.

That's the Data Flywheel!

And it's why OpenAI's systems keep feeling fresher. Looking back at how GPT-2 and GPT-3 worked and comparing them to the newest GPT-5 showcases one thing: each release was built on the lessons learned from previous builds.

What's Next?

This was a brief look at why OpenAI rebuilt its data backbone and the challenges that came with it.

In the next part, we'll go deeper into how OpenAI actually runs this in production. Stay tuned!


Got questions?

I’m always excited to collaborate on innovative and exciting projects!
