It’s standard industry practice to prototype Machine Learning pipelines in Jupyter notebooks, refactor them into Python modules and then deploy using production tools such as Airflow or Kubernetes. However, this process slows down development as it requires significant changes to the code.
Ploomber enables a leaner approach where data scientists can keep using Jupyter while still adhering to software development best practices such as code review and continuous integration. To show that this approach is a better alternative to the current prototype-then-refactor workflow, this presentation develops and deploys a Machine Learning pipeline in 45 minutes.
The rest of this post describes how Ploomber achieves such a lean workflow.
Break down logic into multiple files
One of the main issues with notebook-based pipelines is that they often live in a single notebook. Debugging large notebooks is a nightmare, which makes these pipelines hard to maintain. In contrast, Ploomber allows us to break down the logic into multiple, smaller steps that we declare in a pipeline.yaml file. For example, assume we’re working on a model to predict user activity using demographics and past activity. Our training pipeline would look like this:
Figure 1. Example pipeline
To create such a pipeline, we create a pipeline.yaml file and list our tasks (source) with their corresponding outputs (product):
# pipeline.yaml
tasks:
  # get user demographics
  - source: get-demographics.py
    product:
      nb: output/demographics.ipynb
      data: output/demographics.csv

  # get user activity
  - source: get-activity.py
    product:
      nb: output/activity.ipynb
      data: output/activity.csv

  # features from user demographics
  - source: fts-demographics.py
    product:
      nb: output/fts-demographics.ipynb
      data: output/fts-demographics.csv

  # features from user activity
  - source: fts-activity.py
    product:
      nb: output/fts-activity.ipynb
      data: output/fts-activity.csv

  # train model
  - source: train.py
    product:
      nb: output/train.ipynb
      data: output/model.pickle
Since each .py file has a clearly defined objective, it is easier to maintain and test than a single, monolithic notebook.
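To make this concrete, here is a minimal sketch of what one of these task scripts might contain (the data source and column names are hypothetical; when Ploomber runs the task, it passes in the product paths declared in pipeline.yaml, written out explicitly here so the sketch is self-contained):

# get-demographics.py - illustrative sketch; data source and columns are hypothetical
import pandas as pd

# Ploomber injects a `product` dict taken from pipeline.yaml when it executes the task;
# shown explicitly here so the sketch runs on its own
product = {'nb': 'output/demographics.ipynb', 'data': 'output/demographics.csv'}

# load raw user demographics (hypothetical source file)
df = pd.read_csv('raw/demographics.csv')

# keep only the columns the downstream feature tasks need
df = df[['user_id', 'age', 'country']]

# save the task's output where pipeline.yaml expects it
df.to_csv(product['data'], index=False)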
Write code in .py and interact with it using Jupyter
Jupyter is a fantastic tool for developing data pipelines. It gives us quick feedback, such as metrics or visualizations, which is essential for understanding our data. However, traditional .ipynb files have a lot of problems. For example, they make code reviews difficult because comparing versions yields illegible results. The following image shows the diff view of a notebook whose only change is a new cell with a comment:
Figure 2. Illegible notebook diff on GitHub
To fix those problems, Ploomber allows users to open .py files as notebooks, which enables code reviews while still providing the power of interactive development with Jupyter. The following image shows the same .py file rendered as a notebook in Jupyter and as a script in VS Code:
Figure 3. Same .py file rendered as a notebook in Jupyter and script in VS Code
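Under the hood, Ploomber relies on jupytext, so a script written with # %% cell markers (the percent format) opens as a regular notebook in Jupyter while remaining a plain, diff-friendly text file. A rough sketch with illustrative content:

# %%
import pandas as pd

# %%
df = pd.DataFrame({'age': [25, 31, 47], 'country': ['MX', 'US', 'BR']})  # toy data

# %%
df.describe()  # when opened as a notebook, this renders right below the cell

Each # %% marker becomes a notebook cell, and a code review only sees the plain-text lines that actually changed.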
However, Ploomber still leverages the .ipynb format as an output: each .py executes as a notebook, generating a .ipynb file that we can use during code review to check visual results such as tables or charts. Note that in the pipeline.yaml file, each task lists a .ipynb file in its product section. See the fragment below:
# pipeline.yaml (fragment)
tasks:
  # the source script...
  - source: get-demographics.py
    product:
      # ...generates a notebook as output
      nb: output/demographics.ipynb
      data: output/demographics.csv
  # pipeline.yaml continues...
Retrieve results from previous tasks
Another essential feature is how we establish execution order. For example, to generate features from activity data, we need the raw data:
Figure 4. Declaring upstream dependencies
To establish this dependency, we edit fts-activity.py and add a special upstream variable at the top of the file:
upstream = ['activity']
We are stating that the activity task (get-activity.py) must execute before fts-activity.py. Once we provide this information, Ploomber adds a new cell with the location of our input files; we will see something like this:
# what we write
upstream = ['activity']

# what Ploomber adds in a new cell
upstream = {
    'activity': {
        # extracted from pipeline.yaml
        'nb': 'output/activity.ipynb',
        'data': 'output/activity.csv',
    }
}
No need to hardcode paths to files!
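Putting it together, a minimal sketch of fts-activity.py might look like this (the feature logic and column names are hypothetical, and the injected values are written out explicitly so the sketch is self-contained):

# fts-activity.py - illustrative sketch
import pandas as pd

# what we declare
upstream = ['activity']

# what Ploomber injects when the task runs (resolved from pipeline.yaml)
upstream = {'activity': {'nb': 'output/activity.ipynb',
                         'data': 'output/activity.csv'}}
product = {'nb': 'output/fts-activity.ipynb',
           'data': 'output/fts-activity.csv'}

# read the output of the upstream task
df = pd.read_csv(upstream['activity']['data'])

# hypothetical feature: number of events per user
features = df.groupby('user_id').size().rename('n_events').reset_index()

# save the features where pipeline.yaml expects them
features.to_csv(product['data'], index=False)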
Pipeline composition
A training pipeline and its serving counterpart have a lot of overlap. The only difference is that the training pipeline gets historical records, processes them, and trains a model, while the serving version gets new observations, processes them, and makes predictions.
Figure 5. The training and serving pipelines are mostly the same
All the data processing steps must be identical in both pipelines to prevent discrepancies at serving time (training-serving skew). Once we have the training pipeline, we can easily create the serving version. The first step is to create a new file with our processing tasks:
# features.yaml - extracted from the original pipeline.yaml

# features from user demographics
- source: fts-demographics.py
  product:
    nb: output/fts-demographics.ipynb
    data: output/fts-demographics.csv

# features from user activity
- source: fts-activity.py
  product:
    nb: output/fts-activity.ipynb
    data: output/fts-activity.csv
Then we compose the training and serving pipelines by importing those shared tasks and adding the remaining ones:
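Below is a rough sketch of what that composition can look like, assuming Ploomber's import_tasks_from option in the meta section; the serving task names are hypothetical, and details such as matching upstream names are omitted for brevity:

# pipeline.yaml - training (sketch)
meta:
  import_tasks_from: features.yaml   # shared feature tasks

tasks:
  - source: get-demographics.py
    product:
      nb: output/demographics.ipynb
      data: output/demographics.csv

  - source: get-activity.py
    product:
      nb: output/activity.ipynb
      data: output/activity.csv

  - source: train.py
    product:
      nb: output/train.ipynb
      data: output/model.pickle

# pipeline.serve.yaml - serving (sketch; task names are hypothetical)
meta:
  import_tasks_from: features.yaml   # same feature tasks, no duplication

tasks:
  - source: get-new-observations.py
    product:
      nb: output/new-observations.ipynb
      data: output/new-observations.csv

  - source: predict.py
    product:
      nb: output/predict.ipynb
      data: output/predictions.csv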
We can now deploy our serving pipeline!
Deployment using Ploomber
Once we have our serving pipeline, we can deploy it to any supported production backend: Kubernetes (via Argo Workflows), Airflow, or AWS Batch, using our second command-line tool: Soopervisor. It only requires a few additional configuration settings to build a Docker image and push our pipeline to production.
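As a rough sketch (the exact keys can vary across Soopervisor versions), that configuration lives in a soopervisor.yaml file mapping a target environment to a backend and a container image repository:

# soopervisor.yaml - illustrative sketch; keys may differ across versions
serve:
  backend: argo-workflows                        # or airflow, aws-batch
  repository: my-registry.io/activity-pipeline   # hypothetical image repository

From this configuration, Soopervisor builds the Docker image and generates the backend-specific specification for the target platform.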
That’s it! Ploomber allows us to move back and forth between Jupyter and a production environment without any compromise on software engineering best practices.
If you are looking forward to our presentation, show your support with a star on GitHub, or join our community. See you in November during my session at ODSC West 2021, “Develop and Deploy a Machine Learning Pipeline in 45 Minutes with Ploomber.”
About the author/ODSC West 2021 speaker on Ploomber:
Eduardo Blancas is interested in developing tools to deliver reliable Machine Learning products. Towards that end, he developed Ploomber, an open-source Python library for reproducible Data Science, first introduced at JupyterCon 2020. He holds an M.S. in Data Science from Columbia University, where he took part in Computational Neuroscience research. He started his Data Science career in 2015 at the Center for Data Science and Public Policy at The University of Chicago.