More and more, we’re seeing data science and data engineering teams turning toward feature stores to help manage the data sets and data pipelines needed to productionize machine learning applications. Features, or more formally “feature variables,” are arguably the most important ingredient for successful machine learning. The process of creating features is called “feature engineering,” a complex yet critical step in any machine learning workflow. Well-crafted features mean better models, and better models mean better business outcomes. In fact, many past Kaggle grandmasters indicate that the key to their success in data science competitions isn’t necessarily the most powerful algorithm but rather “clever feature engineering.”
As a quick refresher, a “feature” is data used as an input signal to a predictive model. Other common terms for a feature include predictor, independent variable, explanatory variable, and covariate. The set of features used by a machine learning model is called a “feature vector,” and building the right feature vector often leads to the best predictive results.
Features are instrumental to the process of machine learning. Features, along with a target (aka response) variable, comprise the observations (aka examples) found in a data set used for a machine learning project. A feature is an individual, quantifiable measurement that can be used to build a machine learning model. For example, a model trying to predict the likelihood of heart disease in patients could include features such as gender, age, height, weight, and blood pressure. Choosing informative, discriminating, and independent features is a critical step toward the effective use of machine learning, and a model’s accuracy depends on a well-thought-out set of features. Using irrelevant features can actually decrease accuracy, since the model ends up learning from signals with little predictive value. At its most basic level, a “feature store” serves as a principal warehouse for the most useful features.
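To make the heart disease example concrete, here is a minimal sketch of what a handful of observations might look like as a feature matrix plus target vector; the column names and values are illustrative assumptions, not a real data set.

```python
import pandas as pd

# Each row is one observation (a patient); every column except the target is a feature.
# Names and values are purely illustrative.
observations = pd.DataFrame({
    "gender":        ["F", "M", "F"],
    "age":           [54, 61, 47],
    "height_cm":     [165, 178, 160],
    "weight_kg":     [70, 92, 64],
    "systolic_bp":   [128, 145, 117],
    "heart_disease": [0, 1, 0],   # target (aka response) variable
})

X = observations.drop(columns=["heart_disease"])  # feature vectors
y = observations["heart_disease"]                 # target variable

# The feature vector for the first patient:
print(X.iloc[0].to_dict())
```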
Feature stores were first discussed back in September 2017 in a blog post describing the Michelangelo platform developed by Uber. The company found the need for a feature store that allowed teams to share, discover, and use a highly curated set of features for their machine learning problems. Uber found that many modeling problems use identical or similar features, and that there was substantial value in enabling teams to share features both across their own projects and with teams in other areas of the organization.
This article describes the fundamental components of a contemporary feature store and how, taken together, they act as an important facilitator for enterprises by minimizing duplication of data engineering effort, accelerating the machine learning lifecycle, and opening up new levels of collaboration across data science teams. We’ll take a dive into what feature stores are all about and why they’re needed. We’ll also take a high-level view of the technologies that support this effort. We’ll consider some use case examples of where feature stores make a difference. And then we’ll wrap up with a short list of important players in this space. Be sure to check out my previous article on ModelOps, which offers a related perspective on enterprise-class machine learning efforts.
What is a Feature Store?
Feature stores have emerged as a necessary component of the operational machine learning stack. Feature stores accelerate feature engineering by removing bottlenecks in data preparation, data transformation, and feature selection.
Data scientists working with data engineers use a variety of tools to create “features.” Building features for training and inference is time-consuming work, and it’s not uncommon for data scientists to build features from scratch every time they start a new project. Data scientists can spend up to 80% of their time on feature engineering, and because teams typically don’t have a way to collaborate on this work, the same work is repeated throughout the organization. Feature stores allow data scientists to build more accurate features and deploy them to production at a much more rapid pace. Prior to feature stores, there was no central place to store and access features from previous projects. As data repositories continue to be important to every business, demand is growing to make features reusable. Feature stores are seen as a critical component of the infrastructure stack for machine learning because they solve the hardest problem in operationalizing machine learning: building, managing, and serving feature data to models in production.
A feature store has several primary components, as shown in the figure below: transformation, storage, serving, monitoring, and a feature registry. A simplified code sketch of how these pieces fit together follows the list.
- Transformation – operational machine learning applications require regular refreshing and processing of new data into feature variable values so models can make predictions using an updated view of the data domain. Feature stores ingest data provided by external sources and systems and then manage and orchestrate data transformations that yield these values.
- Storage – feature stores persist feature data in order to support retrieval through feature serving layers.
- Serving – feature stores serve feature data for use by machine learning models.
- Monitoring – feature stores are well-positioned to detect and surface issues involving machine learning systems. They’re able to compute metrics on the features they store that describe correctness and quality. Feature stores monitor these metrics to deliver an important signal reflecting the overall health of a machine learning model.
- Feature registry – another central component in feature stores is a centralized registry of standardized feature definitions and metadata. The registry serves as a sole source of truth for information about a feature in an enterprise.
Source: Tecton – 5 main components of a modern feature store
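To ground these five components, here is a deliberately simplified, hypothetical sketch of how they might fit together in code. It is not the API of Tecton or any other product; the class and method names are assumptions chosen only to mirror the components above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict
import pandas as pd

@dataclass
class FeatureDefinition:
    """Feature registry entry: standardized definition and metadata for one feature."""
    name: str
    entity: str                                          # e.g. "customer"
    transformation: Callable[[pd.DataFrame], pd.Series]  # how to compute it from raw data
    description: str = ""
    version: int = 1

class MiniFeatureStore:
    """Toy store illustrating registry, transformation, storage, serving, and monitoring."""

    def __init__(self) -> None:
        self.registry: Dict[str, FeatureDefinition] = {}   # feature registry
        self.offline_store = pd.DataFrame()                # storage: historical values
        self.online_store: Dict[Any, Dict[str, Any]] = {}  # storage: latest values per entity

    def register(self, feature: FeatureDefinition) -> None:
        self.registry[feature.name] = feature

    def materialize(self, raw: pd.DataFrame, entity_key: str) -> None:
        """Transformation: compute every registered feature from raw data and persist it."""
        out = raw[[entity_key]].copy()
        for f in self.registry.values():
            out[f.name] = f.transformation(raw)
        self.offline_store = pd.concat([self.offline_store, out], ignore_index=True)
        for _, row in out.iterrows():
            self.online_store[row[entity_key]] = row.drop(entity_key).to_dict()

    def get_online_features(self, entity_id: Any) -> Dict[str, Any]:
        """Serving: low-latency lookup of the latest feature values for one entity."""
        return self.online_store.get(entity_id, {})

    def monitor(self) -> pd.Series:
        """Monitoring: a simple quality signal (share of missing values per feature)."""
        return self.offline_store.isna().mean()
```

In a real deployment each of these pieces is a distributed service, typically a metadata database for the registry, a warehouse or data lake for offline storage, and a key-value store for online serving, but the division of responsibilities is the same.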
“In working with thousands of data scientists, we saw first-hand that data science projects continuously fail due to technical limitations and inefficiencies in the feature engineering process,” said Jared Parker, Rasgo founder and CEO. “Many ML projects never made it to production, and those that did were plagued with end-user frustration due to the sheer time it took to clean, join, and transform data. At Rasgo, we are building a platform that enables data scientists to access and transform data into highly curated ML features in minutes, not weeks. Our early customers have eliminated the feature engineering bottleneck and are now creating more accurate features and models, which have allowed them to finally achieve tangible financial value from ML.”
Data science teams have realized that operational machine learning requires solving data problems that go far beyond the management of data pipelines. Increasingly, data science and engineering teams are turning toward feature stores to manage the data sets and data pipelines needed to productionize their machine learning applications.
Common Problems Feature Stores Can Solve
Every data science project begins with a search for the right features to feed the algorithm. The problem is that there typically isn’t a single, central location to search; features are hosted in many places, especially in a large enterprise. A feature store solves this by providing a principal repository for sharing all available features. When data scientists start a new project, they can go to this collection and easily find the features they need. A feature store is also a data transformation service, enabling data scientists to take raw data and store it as features ready to be used by a machine learning model.
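As a sketch of that transformation step, the hypothetical snippet below turns a raw order-event table into reusable per-customer features and writes them to shared storage; the schema, column names, and output path are assumptions for illustration only.

```python
import pandas as pd

# Raw event data (schema is assumed for illustration).
raw_orders = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "order_value": [25.0, 40.0, 15.5],
    "order_ts": pd.to_datetime(["2021-06-01", "2021-06-03", "2021-06-02"]),
})

# Transform raw events into per-customer features.
customer_features = (
    raw_orders.groupby("customer_id")
    .agg(order_count=("order_value", "size"),
         avg_order_value=("order_value", "mean"),
         last_order_ts=("order_ts", "max"))
    .reset_index()
)

# Persist the curated features so other teams and projects can discover and reuse them.
customer_features.to_parquet("customer_order_features.parquet")
```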
Feature stores aim to solve the full complement of data management challenges encountered when building and operationalizing machine learning applications. Here is a short list of enterprise machine learning issues that feature stores can help resolve:
- Productionize features without extensive support from data engineering
- Automate feature computation via transformations
- Share and reuse features across pipelines for various teams
- Provide feature versioning, lineage, regulatory compliance, and metadata
- Attain consistency between training and inference (a common approach is sketched after this list)
- Monitor the condition of feature pipelines in production environments
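On the training/inference consistency point, one common pattern is to have both the offline training pipeline and the online prediction path call the same registered transformation instead of re-implementing it twice. The following minimal, hypothetical illustration assumes a single feature, days_since_last_order.

```python
import pandas as pd

def days_since_last_order(last_order_ts: pd.Series, as_of: pd.Timestamp) -> pd.Series:
    """Single feature definition shared by the training and serving paths."""
    return (as_of - last_order_ts).dt.days

# --- Offline/training path: compute the feature over historical data ---
history = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "last_order_ts": pd.to_datetime(["2021-05-20", "2021-06-01"]),
})
train_cutoff = pd.Timestamp("2021-06-10")
history["days_since_last_order"] = days_since_last_order(history["last_order_ts"], train_cutoff)

# --- Online/inference path: compute the same feature for a single request ---
request = pd.DataFrame({"last_order_ts": pd.to_datetime(["2021-06-08"])})
request["days_since_last_order"] = days_since_last_order(request["last_order_ts"], pd.Timestamp.now())
```

Because both paths call the same function, the model sees identically defined inputs at training time and at prediction time, which is exactly the training/serving skew a feature store's shared transformation layer is designed to prevent.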
Use Case Examples
Some of the largest tech companies that deal extensively with machine learning models have built their own feature stores – Google, Uber, Twitter, Netflix, Facebook, Airbnb, etc. This is a good indication to the rest of the industry of how important it is to use a feature store as an element of an effective machine learning pipeline. Let’s take a look at a couple of use case examples of how feature stores are successfully being used in large enterprise environments.
AT&T
AT&T carries more than 465 petabytes of data traffic across its global network on an average day. When you add in the data generated internally from different applications, in the company’s retail stores, among their field technicians, and across other parts of their business, turning data into actionable intelligence as quickly as possible is vital to competitive advantage. AT&T’s use of a feature store has been instrumental in helping turn this massive trove of data into actionable intelligence.
AT&T co-developed feature store technology with H2O.ai, and the companies are now offering the production-tested feature store as a software platform for other companies and organizations to use with their own data. From financial services to healthcare organizations, pharmaceutical makers, retailers, software developers, and more, demand for reliable, easy-to-use, and secure feature stores is booming. Any organization currently using machine learning, or planning to, will want to consider the value of a feature store. AT&T is using the feature store for network optimization, fraud prevention, tax calculations, and predictive maintenance.
The feature store’s capabilities include integration with multiple data and machine learning pipelines, whether applied to an on-premises data lake or via cloud and SaaS providers. It also offers automatic feature recommendations, which let data scientists select the features they want to update and improve and then receive recommendations for doing so.
“Feature stores are one of the hottest areas of AI development right now, because being able to reuse and repurpose data engineering tools is critical as those tools become increasingly complex and expensive to build,” commented Andy Markus, Chief Data Officer, AT&T. “These storehouses are vital not only to our own work, but to other businesses, as well.”
iFood
iFood is the largest food-tech company in Latin America, serving more than 26 million orders each month from more than 150,000 restaurants. The company’s operation generates large amounts of data every second: which dishes were requested, by whom, each driver location update, and much more. To provide the best possible customer experience and maximize the number of orders, iFood built several machine learning models to answer questions such as: how long will an order take to be completed; which restaurants and dishes should be recommended to a consumer; and whether a given payment is fraudulent.
Generating the training data sets for those models, and serving features in real time so the models’ predictions can be made correctly, requires efficient, distributed data processing pipelines. To address these requirements, iFood built a real-time feature store using Databricks and Spark Structured Streaming to process event streams, persisting them both to historical Delta Lake table storage and to a Redis cluster for low-latency access. The company structured its development processes so that all of this runs on production-grade, reliable, and validated code.
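iFood’s actual code isn’t published here, but a hedged sketch of the general pattern it describes, Spark Structured Streaming feeding both a historical Delta table and a low-latency Redis store, might look like the following; the Kafka topic, event schema, hosts, and paths are all assumptions.

```python
import redis
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("order-feature-pipeline").getOrCreate()

# Assumed schema for incoming order events.
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("restaurant_id", StringType()),
    StructField("order_value", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the event stream (topic name and broker address are assumptions).
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "order-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_features(batch_df, batch_id):
    # Offline store: append each micro-batch to a historical Delta Lake table.
    batch_df.write.format("delta").mode("append").save("/feature_store/order_events")
    # Online store: push the latest values to Redis for low-latency serving.
    # (collect() keeps the sketch simple; a real pipeline would write per partition.)
    r = redis.Redis(host="redis-host", port=6379)
    for row in batch_df.collect():
        r.hset(f"order_features:{row['customer_id']}",
               mapping={k: str(v) for k, v in row.asDict().items()})

(events.writeStream
 .foreachBatch(write_features)
 .option("checkpointLocation", "/checkpoints/order_features")
 .start())
```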
Vendors
A number of companies have become players in this new space, many of them only a couple of years old. Here is a short list to consider, in alphabetical order:
Amazon SageMaker Feature Store | Databricks | Feast | Featureform | H2O.ai | Hopsworks | Iguazio | Kaskada | Molecula | Rasgo | Scribble Data | Splice Machine | Tecton
Conclusion
Feature stores are a very powerful tool for enterprises that aim to build many machine learning models on top of a well-defined set of data entities. The overarching benefit of a feature store is that it encapsulates the logic of feature transformations to automatically prepare new data and serve up examples for training and inference. If your data science team finds itself repeatedly coding up feature transformations, or copying and pasting feature-engineering code from one project’s data pipeline to another, a feature store could significantly simplify your process.
For more information, visit the Feature Store Blog.