In this episode of the neveropen Data Show, I spoke with M.C. Srivas, co-founder of MapR and currently chief architect for data at Uber. We discussed his long career in data management and his experience building a variety of distributed systems. Over the course of his career, Srivas has architected key components now found in many data platforms (distributed file systems, databases, query engines, messaging systems, etc.).
As Srivas is quick to point out, for these systems to be widely deployed in the enterprise, they require features like security, disaster recovery, and support for multiple data centers. We covered many topics in big data, with particular emphasis on real-time systems and applications. Below are some highlights from our conversation:
Applications and systems that run on multiple data centers
Ad serving has a limit of about 70 to 80 milliseconds in which you have to reply with an advertisement. When you click on a Web page, the banner ads and the ads on the side and the bottom have to be served within 80 milliseconds. People place data centers across the world near each of the major population centers, where there are a lot of people and a lot of activity. Many of our customers have data centers in Japan, in China, in Singapore, in Hong Kong, in India, in Russia, in Germany, across the United States, and worldwide. However, billing is typically consolidated: they bring data from all of these data centers into central data centers, where they process the entire clickstream and figure out how to bill it back to their customers.
…
Then they need a clean way to bring these clickstreams back into the central data centers, maybe running in the U.S. or in Japan or Germany, or wherever the consolidated view of the customer is created. Typically, this has been done by running completely independent Kafka systems in each place. As soon as that happens, the producers and the consumers are not coordinated across data centers. Think about a data center in Japan that has a Kafka cluster running. Well, it cannot fail over to the Kafka cluster in Hong Kong because that's a completely independent cluster and doesn't understand what has been consumed and what has been produced in Japan. If a consumer who was consuming from the Japanese Kafka moved to the Hong Kong Kafka, they would get garbage. This is the main problem that a lot of customers asked us to solve.
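The failure mode Srivas describes can be illustrated with a toy simulation (this is not MapR's or Kafka's implementation; the `IndependentCluster` class and all names in it are hypothetical, built only to show why a committed offset in one independent cluster is meaningless in another):

```python
# Toy illustration: offsets committed in one stand-alone cluster do not
# transfer to another, because each cluster has its own independent log.

class IndependentCluster:
    """A stand-alone log; offsets are positions in THIS cluster's log only."""
    def __init__(self, name):
        self.name = name
        self.log = []       # append-only message log
        self.offsets = {}   # committed offset per consumer group

    def produce(self, msg):
        self.log.append(msg)

    def consume(self, group):
        """Return messages past the group's committed offset, then commit."""
        start = self.offsets.get(group, 0)
        msgs = self.log[start:]
        self.offsets[group] = len(self.log)
        return msgs

tokyo = IndependentCluster("tokyo")
hongkong = IndependentCluster("hongkong")

# The clickstream arrives independently (different order, volume) at each site.
for m in ["click-1", "click-2", "click-3"]:
    tokyo.produce(m)
for m in ["click-9", "click-1"]:
    hongkong.produce(m)

billing = "billing-group"
tokyo.consume(billing)   # reads click-1..3, commits offset 3 -- in Tokyo only

# "Fail over" to Hong Kong: Tokyo's committed offset is garbage there.
tokyo_offset = tokyo.offsets[billing]      # 3
print(tokyo_offset > len(hongkong.log))    # True: offset points past the log
print(hongkong.consume(billing))           # starts over: re-reads click-1
```

A coordinated multi-data-center stream, by contrast, would replicate both the messages and the consumer positions so that a consumer could move between sites without duplicating or losing data.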
…
The data sources have now gone not into a few data centers, but into millions of data centers. Think about every self-driving car. Every self-driving car is a data center in itself. It generates so much data. Think about a plane flying. A plane flying is a full data center. There are 400 people on the plane, it's a captive audience, and there's enough data generated just for preventive maintenance on the plane anyway. That's the thinking behind MapR Streams: what do we need at Internet of Things scale?
Streaming and messaging systems for IoT
A file system is very passive. You write some files, read some files, and how interesting can that get? If I look at a streaming system, what we're looking for is completely real time. That is, if a publisher publishes something, then all listeners who want to hear what the publisher is saying will be notified within five milliseconds inside the same data center. Five milliseconds to get a notification saying, "Hey, this was published." Almost instantaneous. If I cross data centers, say to a data center halfway across the world, and you publish something in Japan, then a person in South Africa can get that information in under a second. They'll be notified of it. There's a push that we do, so they get notified of it in under a second, at a scale of billions of messages per second.
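The key distinction here is push versus poll: listeners are notified as soon as a message is published, rather than asking for it. A minimal sketch of that push model (illustrative only; the `Topic` class is hypothetical and says nothing about MapR Streams' actual API or delivery guarantees):

```python
# Minimal push-style pub/sub sketch: the topic calls each registered
# listener at publish time, instead of listeners polling for new messages.

class Topic:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        """Register a listener; every future message is pushed to it."""
        self.subscribers.append(callback)

    def publish(self, msg):
        # Push model: deliver to every listener immediately on publish.
        for cb in self.subscribers:
            cb(msg)

received = []
topic = Topic()
topic.subscribe(received.append)
topic.publish("sensor-reading-42")
print(received)   # ['sensor-reading-42']
```

In a real system, the "callback" would be a network delivery to a subscriber possibly in another data center, which is where the five-millisecond (intra-data-center) and sub-second (cross-data-center) notification targets come in.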
…
We have learned from Kafka, we have learned from Tibco, we have learned from RabbitMQ and so many other technologies that have preceded us. We learned a lot from watching all of those things, and they have paved the way for us. I think what we've done is take it to the next level: what we really need for IoT.
Powering the world’s largest biometric identity system
We implemented this in Aadhaar, India's biometric identity project. It links you to all your records, whether it's banking, school admissions, hospital admissions, even airport entry, passports, and pension payments.
…
There are about a billion people online right now. There are another 300 million to go, but what I wanted to point out is that it's completely digitized. If you want to withdraw money from an ATM, you put in your fingerprint and take the money out. You don't need a card.
…
There was a flood in Chennai last November/December. Massive floods. It rained like it's never rained before. It rained continuously for two months, and houses were submerged in 10 feet of water. People across the entire state of Tamil Nadu in India lost everything. But when they were rescued, they still had their fingerprints, and so they could access everything: their bank accounts, their records, all of it, because the Aadhaar project is biometrics-based. They really had lost everything, yet they could get to everything right away. Think about what happens here if you lose your wallet: all your credit cards, your driver's license, everything. You don't have that kind of issue anymore. That problem was solved.
Editor's note: This interview took place in mid-January 2016; at that time, M.C. Srivas served as CTO of MapR. Srivas currently serves as the chief architect for data at Uber and is a member of the Board of Directors of MapR.
Related resources:
- Strata San Jose 2016 session: “Real-time Hadoop: What an ideal messaging system should bring to Hadoop” (featuring Ted Dunning of MapR)
- Strata San Jose 2016 session: “When one data center is not enough: Building large-scale stream infrastructure across multiple data centers with Apache Kafka”
- I Heart Logs
- Streaming 101 and Streaming 102
- Architecting the World’s Largest Biometric Identity System
- Srivas was on a panel on Stream Processing Systems that I moderated in early January.