Generate Test Datasets for Machine Learning

28 July 2024

Whenever we think of machine learning, the first thing that comes to mind is a dataset. Many datasets are available on websites such as Kaggle, but it is sometimes useful to generate your own. A synthetic dataset gives you control over its size, dimensionality, noise level and class structure, which makes it well suited to testing and debugging a machine learning model. In this article, we generate random datasets using the sklearn.datasets module in Python.

Generate test datasets for Classification:

Binary Classification

Example 1: make_circles() generates 2D binary classification data made of a large circle containing a smaller one, so the decision boundary between the two classes is circular.

Python3

# Import necessary libraries
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
 
# Generate 2d classification dataset
X, y = make_circles(n_samples=200, shuffle=True,
                    noise=0.1, random_state=42)
# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Output:

make_circles()
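
To make the circular boundary concrete, here is a minimal sketch that switches the noise off and inspects the distance of each point from the origin. The factor value of 0.5 and the radius threshold of 0.75 are choices made for this illustration, not values used in the example above.

```python
import numpy as np
from sklearn.datasets import make_circles

# noise=None keeps every point exactly on its circle;
# factor is the ratio of the inner radius to the outer radius
X, y = make_circles(n_samples=200, noise=None, factor=0.5, random_state=42)

# Distance of every point from the origin
radius = np.linalg.norm(X, axis=1)
outer_radius = radius[y == 0]  # class 0 lies on the outer circle
inner_radius = radius[y == 1]  # class 1 lies on the inner circle
print("outer:", outer_radius.min(), outer_radius.max())
print("inner:", inner_radius.min(), inner_radius.max())

# A single threshold on the radius separates the two classes
y_pred = (radius < 0.75).astype(int)
accuracy = (y_pred == y).mean()
print("accuracy of radius threshold:", accuracy)
```

With noise added, as in the example above, the two rings overlap slightly and a radius threshold is no longer perfect.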

Example 2: make_moons() generates 2D binary classification data in the shape of two interleaving half circles.

Python3

#import the necessary libraries
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
# generate 2d classification dataset
X, y = make_moons(n_samples=500, shuffle=True,
                  noise=0.15, random_state=42)
# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Output:

make_moons()

Multi-Class Classification

Example 1: make_blobs() generates isotropic Gaussian blobs of points, which can be used for clustering as well as multi-class classification.

Python3

#import the necessary libraries
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
 
# Generate 2d classification dataset
X, y = make_blobs(n_samples=500, centers=3, n_features=2, random_state=23)
 
# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Output:

make_blobs()
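
In the example above, passing an integer as centers lets make_blobs() pick the blob locations at random. The sketch below instead passes explicit coordinates, which can be handy when a test needs blobs in known positions. The center coordinates and the cluster_std value are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Hypothetical blob centers chosen for this illustration
centers = np.array([[-5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])

# cluster_std controls how widely each blob spreads around its center
X, y = make_blobs(n_samples=600, centers=centers,
                  cluster_std=0.5, random_state=23)

# Label i corresponds to centers[i], so each class mean is close to its center
class_means = np.array([X[y == label].mean(axis=0) for label in range(3)])
print(class_means.round(2))
print(np.bincount(y))
```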

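Example 2: When generating data with make_classification(), the n_informative, n_redundant, n_repeated and n_classes arguments must be consistent with each other: n_informative + n_redundant + n_repeated must not exceed n_features, and n_classes * n_clusters_per_class must not exceed 2 ** n_informative. When shuffle=False, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].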

Python3

#import the necessary libraries
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
 
# generate 2d classification dataset
X, y = make_classification(n_samples=100,
                           n_features=2,
                           n_redundant=0,
                           n_informative=2,
                           n_repeated=0,
                           n_classes=3,
                           n_clusters_per_class=1)
 
# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Output:

make_classification()
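
The sketch below shows what happens when the arguments of make_classification() are inconsistent. It repeats the configuration from the example above, then changes n_clusters_per_class to 2, so that 3 classes * 2 clusters = 6 exceeds 2 ** 2 = 4. The random_state value is an addition for reproducibility and is not part of the original example.

```python
from sklearn.datasets import make_classification

# Valid: n_classes * n_clusters_per_class = 3 <= 2 ** n_informative = 4
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
print(X.shape, sorted(set(y)))

# Invalid: 3 classes * 2 clusters = 6 > 2 ** 2 = 4, so a ValueError is raised
try:
    make_classification(n_samples=100, n_features=2, n_informative=2,
                        n_redundant=0, n_repeated=0, n_classes=3,
                        n_clusters_per_class=2, random_state=0)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print("error:", error_message)
```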

Example 3: make_multilabel_classification() generates a random multi-label classification dataset, in which each sample can carry several labels at once.

Python3

# Import necessary libraries
from sklearn.datasets import make_multilabel_classification
import pandas as pd
import matplotlib.pyplot as plt
 
# Generate 2d classification dataset
X, y = make_multilabel_classification(n_samples=500, n_features=2,
                                      n_classes=2, n_labels=2,
                                      allow_unlabeled=True,
                                      random_state=23)
# create pandas dataframe from generated dataset
df = pd.concat([pd.DataFrame(X, columns=['X1', 'X2']),
                pd.DataFrame(y, columns=['Label1', 'Label2'])],
               axis=1)
# Show the first rows (print works outside Jupyter, unlike display)
print(df.head())
 
# Plot the generated datasets
plt.scatter(df['X1'], df['X2'], c=df['Label1'])
plt.show()

Output:

    X1    X2    Label1    Label2
0    14.0    34.0    0    1
1    30.0    22.0    1    1
2    29.0    19.0    1    1
3    21.0    19.0    1    1
4    16.0    32.0    0    1

make_multilabel_classification()
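
Because the target here is a 0/1 indicator matrix rather than a single label column, it can help to count how many labels each sample carries. The following is a small sketch using the same arguments as the example above; the variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(n_samples=500, n_features=2,
                                      n_classes=2, n_labels=2,
                                      allow_unlabeled=True,
                                      random_state=23)

# y has one 0/1 column per class; a row sum is the number of labels of a sample
labels_per_sample = y.sum(axis=1).astype(int)

# counts[k] is the number of samples carrying exactly k labels (k = 0, 1, 2)
counts = np.bincount(labels_per_sample, minlength=3)
print("indicator matrix shape:", y.shape)
print("samples with 0, 1, 2 labels:", counts)
```

With allow_unlabeled=True, some samples may have no label at all, which shows up as a non-zero first count.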

Generate test datasets for Regression:

Example 1: Generate a 1-dimensional feature and target for linear regression using make_regression

Python3

# Import necessary libraries
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
# Generate 1d Regression dataset
X, y = make_regression(n_samples=50, n_features=1, noise=20, random_state=23)
# Plot the generated datasets
plt.scatter(X, y)
plt.show()

Output:

make_regression()
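
make_regression() can also return the coefficients of the underlying linear model when coef=True is passed. The sketch below uses three features and no noise, both choices made for this illustration, so that the target can be rebuilt exactly from the features.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True additionally returns the true coefficients of the linear model;
# noise=0.0 makes the target an exact linear function of the features
X, y, coef = make_regression(n_samples=50, n_features=3, noise=0.0,
                             coef=True, random_state=23)
print("true coefficients:", coef)

# With the default bias of 0.0, the target is simply X @ coef
reconstructed = X @ coef
print("target matches X @ coef:", np.allclose(y, reconstructed))
```

Knowing the true coefficients is useful for checking whether a regression model recovers them.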

Example 2: Generate a regression dataset with several uncorrelated features using make_sparse_uncorrelated(). Only the first four features influence the target.

Python3

# Import necessary libraries
from sklearn.datasets import make_sparse_uncorrelated
import matplotlib.pyplot as plt
# Generate a regression dataset with 4 features
X, y = make_sparse_uncorrelated(n_samples=100, n_features=4, random_state=23)
# Plot the generated datasets
plt.figure(figsize=(12,10))
for i in range(4):
    plt.subplot(2,2, i+1)
    plt.scatter(X[:,i], y)
    plt.xlabel('X'+str(i+1))
    plt.ylabel('Y')
plt.show()

Output:

make_sparse_uncorrelated()
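
According to the scikit-learn documentation, the target produced by make_sparse_uncorrelated() is X1 + 2*X2 - 2*X3 - 1.5*X4 plus standard normal noise, and any further features are ignored. The sketch below checks this with an ordinary least-squares fit. The sample size and the two extra features are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_sparse_uncorrelated

# Six features: per the documentation only the first four influence the target
X, y = make_sparse_uncorrelated(n_samples=5000, n_features=6, random_state=23)

# Least-squares fit without an intercept
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

# Expect values near [1, 2, -2, -1.5, 0, 0]
print(np.round(weights, 2))
```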

Example 3: Generate a non-linear regression dataset with four features using make_friedman2()

Python3

# Import necessary libraries
from sklearn.datasets import make_friedman2
import matplotlib.pyplot as plt
# Generate a non-linear regression dataset with 4 features
X, y = make_friedman2(n_samples=100, random_state=23)
# Plot the generated datasets
plt.figure(figsize=(12,10))
for i in range(4):
    plt.subplot(2,2, i+1)
    plt.scatter(X[:,i], y)
    plt.xlabel('X'+str(i+1))
    plt.ylabel('Y')
plt.show()

Output:

make_friedman2()
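
The scatter plots above look irregular because the target is a non-linear function of all four features. The scikit-learn documentation gives it as sqrt(X1^2 + (X2*X3 - 1/(X2*X4))^2) plus optional noise. The sketch below recomputes that formula with noise switched off; the variable name expected is chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman2

# noise=0.0 makes the target an exact function of the four features
X, y = make_friedman2(n_samples=100, noise=0.0, random_state=23)

# Recompute the documented Friedman #2 formula by hand
expected = np.sqrt(
    X[:, 0] ** 2
    + (X[:, 1] * X[:, 2] - 1.0 / (X[:, 1] * X[:, 3])) ** 2
)
print("formula reproduces y:", np.allclose(y, expected))
```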
