So you need to redesign your company’s data infrastructure.
Do you buy a solution from a big integration company like IBM, Cloudera, or Amazon? Do you engage many small startups, each focused on one part of the problem? A little of both? We see trends shifting towards focused best-of-breed platforms. That is, products that are laser-focused on one aspect of the data science and machine learning workflows, in contrast to all-in-one platforms that attempt to solve the entire space of data workflows.
This article, which examines this shift in more depth, is an opinionated result of countless conversations with data scientists about their needs in modern data science workflows.
The Two Cultures of Data Tooling
Today we see two different kinds of offerings in the marketplace:
- All-in-one platforms like Amazon Sagemaker, AzureML, Cloudera Data Science Workbench, and Databricks (which is now a unified analytics platform);
- Best of Breed products that are laser-focused on one aspect of the data science or the machine learning process like Snowflake, Confluent/Kafka, MongoDB/Atlas, Coiled/Dask and Plotly.1
Integrated all-in-one platforms assemble many tools together, and can therefore provide a full solution to common workflows. They’re reliable and steady, but they tend not to be exceptional at any part of that workflow and they tend to move slowly. For this reason, such platforms may be a good choice for companies that don’t have the culture or skills to assemble their own platform.
In contrast, best-of-breed products take a more craftsman approach: they do one thing well and move quickly (often they are the ones driving technological change). They usually meet the needs of end users more effectively, are cheaper, and easier to work with. However some assembly is required because they need to be used alongside other products to create full solutions. Best-of-breed products require a DIY spirit that may not be appropriate for slow-moving companies.
Which path is best? This is an open question, but we’re putting our money on best-of-breed products. We’ll share why in a moment, but first, we want to look at a historical perspective with what happened to data warehouses and data engineering platforms.
Lessons Learned from Data Warehouse and Data Engineering Platforms
Historically, companies bought Oracle, SAS, Teradata or other data all-in-one data warehousing solutions. These were rock solid at what they did–and “what they did” includes offering packages that are valuable to other parts of the company, such as accounting–but it was difficult for customers to adapt to new workloads over time.
Next came data engineering platforms like Cloudera, Hortonworks, and MapR, which broke open the Oracle/SAS hegemony with open source tooling. These provided a greater level of flexibility with Hadoop, Hive, and Spark.
However, while Cloudera, Hortonworks, and MapR worked well for a set of common data engineering workloads, they didn’t generalize well to workloads that didn’t fit the MapReduce paradigm, including deep learning and new natural language models. As companies moved to cloud, embraced interactive Python, integrated GPUs, or moved to a greater diversity of data science and machine learning use cases, these data engineering platforms weren’t ideal. Data scientists rejected these platforms and went back to working on their laptops where they had full control to play around and experiment with new libraries and hardware.
While data engineering platforms provided a great place for companies to start building data assets, their rigidity becomes especially challenging when companies embrace data science and machine learning, both of which are highly dynamic fields with heavy churn that require much more flexibility in order to stay relevant. An all-in-one platform makes it easy to get started, but can become a problem when your data science practice outgrows it.
So if data engineering platforms like Cloudera displaced data warehousing platforms like SAS/Oracle, what will displace Cloudera as we move into the data science/machine learning age?
Why we think Best-of-Breed will displace walled garden platforms
The worlds of data science and machine learning move at a much faster pace than data warehousing and much of data engineering. All-in-one platforms are too large and rigid to keep up. Additionally, the benefits of integration are less relevant today with technologies like Kubernetes. Let’s dive into these reasons in more depth.
Data Science and Machine Learning Require Flexibility
“Data science” is an incredibly broad term that encompasses dozens of activities like ETL, machine learning, model management, and user interfaces, each of which have many rapidly evolving choices. Only part of a data scientist’s workflow is typically supported by even the most mature data science platforms. Any attempt to build a one-size-fits-all integrated platform would have to include such a wide range of features, and such a wide range of choices within each feature, that it would be extremely difficult to maintain and keep up to date. What happens when you want to incorporate real-time data feeds? What happens when you want to start analyzing time series data? Yes, the all-in-one platforms will have tools to meet these needs; but will they be the tools you want, or the tools you’d choose if you had the opportunity?
Consider user interfaces. Data scientists use many tools like Jupyter notebooks, IDEs, custom dashboards, text editors, and others throughout their day. Platforms offering only “Jupyter notebooks in the cloud” cover only a small fraction of what actual data scientists use in a given day. This leaves data scientists spending half of their time in the platform, half outside the platform, and a new third half migrating between the two environments.
Consider also the computational libraries that all-in-one platforms support, and the speed at which they go out of date quickly. Famously, Cloudera ran Spark 1.6 for years after Spark 2.0 was released–even though (and perhaps because) Spark 2.0 was released only 6 months after 1.6. It’s quite hard for a platform to stay on top of all of the rapid changes that are happening today. They’re too broad and numerous to keep up with.
Kubernetes and the cloud commoditize integration
While the variety of data science has made all-in-one platforms harder, at the same time advances in infrastructure have made integrating best-of-breed products easier.
Cloudera, Hortonworks, and MapR were necessary at the time because Hadoop, Hive, and Spark were notoriously difficult to set up and coordinate. Companies that lacked technical skills needed to buy an integrated solution.
But today things are different. Modern data technologies are simpler to set up and configure. Also, technologies like Kubernetes and the cloud help to commoditize configuration and reduce integration pains with many narrowly-scoped products. Kubernetes lowers the barrier to integrating new products, which allows modern companies to assimilate and retire best-of-breed products on an as-needed basis without a painful onboarding process. For example, Kubernetes helps data scientists deploy APIs that serve models (machine learning or otherwise), build machine learning workflow systems, and is an increasingly common substrate for web applications that allows data scientists to integrate OSS technologies, as reported here by Hamel Hussain, Staff Machine Learning Engineer at Github.
Kubernetes provides a common framework in which most deployment concerns can be specified programmatically. This puts more control into the hands of library authors, rather than individual integrators. As a result the work of integration is greatly reduced, often just specifying some configuration values and hitting deploy. A good example here is the Zero to JupyterHub guide. Anyone with modest computer skills can deploy JupyterHub on Kubernetes without knowing too much in about an hour. Previously this would have taken a trained professional with pretty deep expertise several days.
Final Thoughts
We believe that companies that adopt a best-of-breed data platform will be more able to adapt to technology shifts that we know are coming. Rather than being tied into a monolithic data science platform on a multi-year time scale, they will be able to adopt, use, and swap out products as their needs change. Best of breed platforms enable companies to evolve and respond to today’s rapidly changing environment.
The rise of the data analyst, data scientist, machine learning engineer and all the satellite roles that tie the decision function of organizations to data, along with increasing amounts of automation and machine intelligence, require tooling that meet these end users’ needs. These needs are rapidly evolving and tied to open source tooling that is also evolving rapidly. Our strong opinion (strongly held) is that best-of-breed platforms are better positioned to serve these rapidly evolving needs by building on these OSS tools than all-in-platforms. We look forward to finding out.
Footnote
1 Note that we’re discussing data platforms that are built on top of OSS technologies, rather than the OSS technologies themselves. This is not another Dask vs Spark post, but a piece weighing up the utility of two distinct types of modern data platforms.