Thursday, October 23, 2025
HomeData Modelling & AIBuilding systems for massive scale data applications

Building systems for massive scale data applications

In this episode of the neveropen Data Show, I sat down with Tyler Akidau one of the lead engineers in Google’s streaming and Dataflow technologies. He recently wrote an extremely popular article that provided a framework for how to think about bounded and unbounded data processing (a follow-up article is due out soon). We talked about the evolution of stream processing, the challenges of building systems that scale to massive data sets, and the recent surge in interest in all things real time:

On the need for MillWheel: A new stream processing engine

At the time [that MillWheel was built], there was, as far as I know, literally nothing externally that could handle the scale that we needed to handle. A lot of the existing streaming systems didn’t focus on out-of-order processing, which was a big deal for us internally. Also we really wanted to hit a strong focus on consistency — being able to get absolutely correct answers. … All three of these things were lacking in at least some area in [the systems we examined].

Learn faster. Dig deeper. See farther.

Join the O’Reilly online learning platform. Get a free trial today and find answers on the fly, or master something new and useful.

Learn more

The Dataflow model

There are two projects that we say Dataflow came out of. The FlumeJava project, which, for anybody who is not familiar, is a higher level language for describing large-scale, massive-scale data processing systems and then running it through an optimizer and coming up with an execution plan. … We had all sorts of use cases at Google where people were stringing together these series of MapReduce [jobs]. It was complex and difficult to deal with, and you had to try to manually optimize them for performance. If you do what the database folks have done,[you] run it through an optimizer.

Flume is the primary data processing system, so as part of that for the last few years, we’ve been moving MillWheel to be essentially a secondary execution engine for FlumeJava. You can either do it on batch mode and run on MapReduce or you can execute it on MillWheel. … FlumeJava plus MillWheel — it’s this evolution that’s happened internally, and now we’ve externalized it.

Balancing correctness, latency, and cost

There’s a wide variety of use cases out there. Sometimes you need high correctness; sometimes you don’t; sometimes you need low latency; sometimes higher latency is okay. Sometimes you’re willing to pay a lot for those other two features; sometimes you don’t want to pay as much. The real key, at least as far as having a system that is broadly applicable, is being able to be flexible and give people the choices to make the trade-offs they have to make.

There is a single knob which is, which runner am I going to use: batch or streaming? Aside from that, the other level at which you get to make these choices is when you’re deciding exactly when you materialize your results within the pipeline.

Once you have a streaming system or streaming execution engine that gives you this automatic-scaling, like Dataflow does, and it gives you consistency and strong tools for working with your data, then people start to build these really complicated services on them. It may not just be data processing. It actually becomes a nice platform for orchestrating events or orchestrating distributed state machines and things like that. We have a lot of users internally doing this stuff.

Related resources:

Post topics: AI & ML, Data, O’Reilly Data Show Podcast
Post tags: Podcast
Share:

RELATED ARTICLES

Most Popular

Dominic
32361 POSTS0 COMMENTS
Milvus
88 POSTS0 COMMENTS
Nango Kala
6728 POSTS0 COMMENTS
Nicole Veronica
11892 POSTS0 COMMENTS
Nokonwaba Nkukhwana
11954 POSTS0 COMMENTS
Shaida Kate Naidoo
6852 POSTS0 COMMENTS
Ted Musemwa
7113 POSTS0 COMMENTS
Thapelo Manthata
6805 POSTS0 COMMENTS
Umr Jansen
6801 POSTS0 COMMENTS