Machine Learning for Continuous Integration

23 August 2024


Editor’s Note: Andrea Frittoli and Kyra Wulffert are presenting their talk “Machine Learning for Continuous Integration” at ODSC 2019 Europe.

Continuous Integration and Data

As more applications move to a DevOps model with CI/CD pipelines, the testing required for this development model to work inevitably generates lots of data. This is also true for large open-source projects, which may see millions of tests executed every day.

The data produced by such CI systems contains information about several aspects of the continuous testing system. Engineers with specific domain experience usually parse this data daily to keep the system running smoothly.

After years of experience in the field, we wanted to investigate if machine learning could help us extract valuable insights from CI data with minimal human intervention.

Open Source and Open Data

The first requirement for any machine learning project is the data, and we have an open dataset available to use. We used data generated by the OpenStack CI, which runs on Zuul, a CI system. Both Zuul and OpenStack are open source projects.

Open source code, however, does not guarantee that the data produced by the platform is open as well. Luckily, the OpenStack community also maintains its data in the open!

Not all the data produced by Zuul is suitable for our machine learning work. Zuul tests the code before it is merged into git, and it does so following a two-pipeline approach. The check pipeline tests changes to code that may be broken and that still require human review. Once a change is approved by humans and passes the check pipeline, it is queued into the gate pipeline, where tests are mostly expected to pass. We consider the check pipeline too noisy, while gate represents a clean source of data. Failures in gate may be related to temporary instability in the testing infrastructure, flakiness of tests, race conditions in the code, or other changes to the code that were merged since the check pipeline was last run.

The data produced by the gate pipeline is what we use to create our datasets.
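As a rough illustration of that selection step, the sketch below keeps only gate runs from a list of job records. The record layout (the `id`, `pipeline`, and `status` keys) and the helper name `select_gate_runs` are assumptions made for this example; they are not the actual Zuul or OpenStack data schema.

```python
from typing import Dict, Iterable, List


def select_gate_runs(runs: Iterable[Dict]) -> List[Dict]:
    """Keep only the runs executed in the gate pipeline.

    Assumes each record is a dict with a "pipeline" key (hypothetical layout).
    Records without that key are dropped.
    """
    return [run for run in runs if run.get("pipeline") == "gate"]


# Made-up sample records, for illustration only.
sample_runs = [
    {"id": "a1", "pipeline": "check", "status": "fail"},
    {"id": "b2", "pipeline": "gate", "status": "success"},
    {"id": "c3", "pipeline": "gate", "status": "fail"},
]

gate_runs = select_gate_runs(sample_runs)
gate_ids = [run["id"] for run in gate_runs]
```

Gate failures are kept on purpose: as described above, they are the interesting signal, since they point at infrastructure instability, flaky tests, or race conditions rather than at unreviewed code.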

Creating a Stable Dataset

The OpenStack community stores the CI data only for a limited amount of time, because new data is produced daily and the available storage is limited. To have reproducible experiments we needed a stable dataset, so we decided to pull and filter the data on a daily basis.
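One way to picture a daily snapshot is a dated file that is written once and never overwritten, so that an experiment run later sees exactly the same data. The sketch below is a minimal version of that idea under stated assumptions: the function name `snapshot_runs`, the `gate-runs-<date>.jsonl` file naming, and the JSON Lines format are all hypothetical choices for this example, not the authors' tooling, and the date used is arbitrary.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def snapshot_runs(runs, out_dir, day):
    """Write one day's filtered runs to a dated JSON Lines file.

    An existing snapshot for the same day is left untouched, so data that
    past experiments relied on does not change (hypothetical design).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"gate-runs-{day.isoformat()}.jsonl"
    if target.exists():
        return target
    with target.open("w", encoding="utf-8") as fh:
        for run in runs:
            fh.write(json.dumps(run, sort_keys=True) + "\n")
    return target


# Usage with made-up records and a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    day = date(2019, 11, 1)
    first = snapshot_runs([{"id": "b2", "status": "success"}], tmp, day)
    # A second call for the same day does not overwrite the first snapshot.
    second = snapshot_runs([{"id": "zz", "status": "fail"}], tmp, day)
    snapshot_name = first.name
    stored_lines = first.read_text(encoding="utf-8").splitlines()
```

In a real setup the snapshots would live in durable storage rather than a temporary directory; the temporary directory only keeps the example self-contained.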

Machine Learning Experiments

We structured our work into several separate stages. The first one is storing the data for our experiments in the cloud. The second stage is data preparation and visualization, which is often an iterative process. The third stage is establishing our metrics, so we have a clear definition of what we aim to optimize. The final stage is running multiple experiments against the datasets we created, fine-tuning the model and analyzing the results.

We wrote tooling in Python to help us keep track of datasets, experiments and results.
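To give a feel for what tracking datasets, experiments and results can involve, here is a small sketch. It is not the authors' tooling: the `Experiment` class, its fields, the `record` helper, and the sample dataset name, parameters and metric value are all hypothetical. The idea shown is to derive a stable identifier from the dataset name and the parameters, so the same configuration always maps to the same entry.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Experiment:
    """One experiment: which dataset, which parameters, which results."""
    dataset: str
    params: dict
    metrics: dict = field(default_factory=dict)

    def experiment_id(self) -> str:
        # The ID depends only on dataset and params, not on the results.
        payload = json.dumps(
            {"dataset": self.dataset, "params": self.params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def record(registry: dict, experiment: Experiment) -> str:
    """Store the experiment under its ID and return that ID."""
    exp_id = experiment.experiment_id()
    registry[exp_id] = asdict(experiment)
    return exp_id


# Usage with made-up values.
registry = {}
run_a = Experiment("gate-2019-11", {"lr": 0.01, "epochs": 5})
run_b = Experiment("gate-2019-11", {"epochs": 5, "lr": 0.01}, {"accuracy": 0.5})
id_a = record(registry, run_a)
id_b = record(registry, run_b)  # same dataset and params, so same ID
```

Because the keys are sorted before hashing, the order in which parameters are written does not affect the identifier, and re-recording a configuration updates its entry instead of creating a duplicate.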

If you are curious about the data we have used, the experiments we ran and our results, come and join us at our talk “Machine Learning for Continuous Integration” at ODSC 2019 Europe.

More about the authors:

Andrea Frittoli is an Open Source Developer Advocate at IBM and a Machine Learning enthusiast. He’s a strong advocate for transparency in open source. He likes working on IaaS projects as well as machine learning, trying to combine the two worlds. Andrea has previously been a speaker at FOSSASIA, FOSS Backstage, OpenStack summits, Open Source Summits, and various meetups.
https://www.linkedin.com/in/andreafrittoli/

Kyra Wulffert is a solution architect and IT expert with over a decade of experience in the telecommunications industry working in international environments with local and remote teams. She’s a Machine Learning and Open Source enthusiast.
https://www.linkedin.com/in/kyrawulffert
