Editor’s note: There is an updated version of this article for 2021. Please read it here for the most up-to-date listing on machine learning datasets!
Your machine learning program is only as good as your training sets. The quality of your datasets directly shapes the quality of your models, but you may not always have access to data behind closed doors or the budget to purchase (or rent) the key.
Don’t despair. There are plenty of free datasets out there that you can use to train your machine learning models. Here are our top 25 picks for open source machine learning datasets. Each one offers clean data with neat columns and rows so that your training runs more smoothly. Let’s take a look.
25 Machine Learning Open Datasets To Get You Started
Each of these datasets can help answer an interesting question in your primary field. They’re already scrubbed, and they’re simple enough to run cleanly without leaving out so much information that they stop being useful.
Natural Language Processing
– Amazon Reviews: A collection of over 35 million reviews from the last 18 years. It includes things like ratings, reviews in plain text, and user information. It also contains complete product information for reference.
– Wikipedia Links Data: The full power of Wikipedia including four million articles containing 1.9 billion words. Your search options are varied and include both word and phrase searches as well as pieces of paragraphs.
Sentiment Analysis
– Stanford Sentiment Treebank: Dataset containing sentiment annotations for over 10,000 pieces of data from Rotten Tomatoes reviews, rendered in HTML.
– Twitter US Airline Sentiment: Tweets collected about US airlines with clear markers for positive, negative, and neutral tones, dated from 2015.
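A useful first step with any labeled sentiment dataset is checking how the labels are distributed, since a heavily skewed set can mislead a classifier. Here is a minimal sketch of that check using only the Python standard library. The inline sample and the column names "sentiment" and "text" are assumptions made for illustration, not the actual schema of either dataset above.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV file. The column
# names "sentiment" and "text" are assumptions, not a real dataset schema.
SAMPLE_CSV = """sentiment,text
positive,Great flight and friendly crew
negative,Two hour delay and no updates
neutral,What time does boarding start
negative,Lost my bag again
"""


def count_labels(csv_file, label_column="sentiment"):
    """Count how many rows carry each label in the given column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


label_counts = count_labels(io.StringIO(SAMPLE_CSV))
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

To try this on a real download, you would pass an open file object instead of the in-memory sample and set label_column to whatever the file actually calls its label field.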
Public Government Data
– Data USA: A comprehensive overview of various sets of US public data in fun visualizations. It includes things like population, health, and jobs.
– EU Open Data Portal: Much like Data USA except with a concentration on countries belonging to the EU. It includes fields such as population, culture, energy, and health, among others.
Finance and Economics
– World Bank Open Data: Data concerning population demographics and key indicators for development.
– IMF Data: International Monetary Fund’s collection of open data for things like debt rates, commodity pricing, international markets, and foreign exchange reserves.
Facial Recognition
– Labeled Faces In The Wild: Common dataset for facial recognition training. It includes more than 13,000 cropped faces, and a subset of the people pictured have two or more distinct photos in the dataset.
– UMDFaces Dataset: Includes both still and video images. The dataset is annotated and features around 367,000 faces of over 8,000 subjects.
Image Datasets
– ImageNet: Dataset containing over 14 million images available for download in different formats. It also includes API integration and is organized according to the WordNet hierarchy.
– Google’s Open Images: 9 million URLs to categorized public images in over 6,000 categories. Each image is licensed under Creative Commons.
Health
– Healthdata.gov: A resource from the US federal government providing data to improve health outcomes for the US population.
– MIMIC Critical Care Database: Datasets for computational physiology with de-identified health data from 40,000 critical care patients (demographics, vital signs, medications, etc.).
Media
– FiveThirtyEight Journalism: The numbers behind some of this journalism hub’s stories. Useful for visualizations and data stories.
– BuzzFeed Media: Open source data hub for everything in the realm of BuzzFeed. It holds the data their journalists used to produce their stories (the organization recommends reading the articles to get a better idea of how the data was used).
Transportation
– US National Travel and Tourism Office: Provides trustworthy datasets that give a big-picture view of the tourism industry, including things like inbound and outbound travel and international visitor data.
– Department of Transportation: Datasets on each field that falls under the DOT, including National Parks, driver registers, bridges and rail information, and port systems.
Speech
– Flickr Audio Caption Corpus: 40,000 spoken captions for 8,000 images, in a manageable size. It was initially designed for unsupervised speech pattern discovery.
– Speech Commands Dataset: A continuously evolving collection of one-second-long utterances from thousands of different people. It’s still receiving contributions and is useful for building basic voice interfaces.
Sound
– FSD (Freesound): A collection of everyday sounds gathered from contributors under an open source license.
– Environmental Audio Datasets: It contains some proprietary information, but a large portion is open source. It includes tables of sound events and tables of acoustic scenes.
Dataset Aggregators
– OpenDataSoft: 2,600 data portals arranged on an interactive map or in a list by country. If you’re looking for it, chances are it’s here.
– Kaggle: An online community of data scientists where users can work with and upload datasets. It’s a community and a resource in one.
– UCI Machine Learning Repository: User-contributed datasets in various levels of cleanliness. It’s one of the originals, and you can download datasets without having to register for anything.
Getting Started With Machine Learning Open Datasets
This is far from an exhaustive list of datasets. When you’re beginning your next data project, having a place to start based on the subject matter could help you cut down on your initial start time. These offer excellent information sets and are freely available for you to play with. So whether you have a project for your organization, or you’re experimenting with something on your own, there’s a dataset to get you started.
Interested in learning more about machine learning and machine learning open datasets? Check out these Ai+ training sessions:
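Even with a dataset described as clean, it can be worth a quick look before training anything. The sketch below counts rows and empty cells per column using only the Python standard library. The sample data and its column names are made up for illustration, and the profile_csv helper is hypothetical rather than part of any dataset’s tooling.

```python
import csv
import io

# Hypothetical sample; in practice this would be a file downloaded from
# one of the sources above. Column names are made up for illustration.
SAMPLE_CSV = """country,year,population
A,2019,1000
B,2019,
C,,3000
"""


def profile_csv(csv_file):
    """Return (row_count, {column: number_of_empty_cells})."""
    reader = csv.DictReader(csv_file)
    missing = {name: 0 for name in reader.fieldnames}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in reader.fieldnames:
            value = row[name]
            # A short row yields None; an empty field yields "".
            if value is None or value.strip() == "":
                missing[name] += 1
    return row_count, missing


row_count, missing = profile_csv(io.StringIO(SAMPLE_CSV))
print(f"rows: {row_count}")
for name, empty in missing.items():
    print(f"{name}: {empty} empty")
```

A column with many empty cells is a cue to decide up front whether to drop it, fill it in, or pick a different dataset.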
Machine Learning Foundations: Linear Algebra
This first installment in the Machine Learning Foundations series covers the topic at the heart of most machine learning approaches. Through the combination of theory and interactive examples, you’ll develop an understanding of how linear algebra is used to solve for unknown values in high-dimensional spaces, thereby enabling machines to recognize patterns and make predictions.
Supervised Machine Learning Series
Data Annotation at Scale: Active and Semi-Supervised Learning in Python
Explaining and Interpreting Gradient Boosting Models in Machine Learning
ODSC West 2020: Intelligibility Throughout the Machine Learning Lifecycle