Data and Bias Generators

generateBias(data, labels, num_biasedClusters, prob=0.05, seed=None)

Generates an artificial bias.

A dataset sampled from a multivariate Gaussian is biased by rotating a hyperplane around its center by a random angle. Most data points above the hyperplane are removed; the prob parameter controls how many remain. This bias generation strategy is described in our paper [1].
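The strategy described above can be sketched in two dimensions with NumPy alone. This is a minimal illustration under stated assumptions, not the library's implementation: the variable names (`alpha`, `above`, `spared`, `removed`), the cluster parameters, and the use of `np.arctan2` are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample one 2D Gaussian cluster (illustrative parameters).
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=500)

# Draw a random rotation angle for the line through the cluster center.
alpha = rng.random() * 2 * np.pi
center = X.mean(axis=0)

# Polar angle of every point relative to the center, mapped to [0, 2*pi).
angles = np.arctan2(X[:, 1] - center[1], X[:, 0] - center[0]) % (2 * np.pi)

# A point counts as "above" the line if its angle lies within pi after alpha.
above = ((angles - alpha) % (2 * np.pi)) < np.pi

# Spare a fraction `prob` of the points above the line and remove the rest.
prob = 0.05
candidates = np.where(above)[0]
spared = rng.choice(candidates, int(prob * len(candidates)), replace=False)
removed = np.setdiff1d(candidates, spared)

X_biased = np.delete(X, removed, axis=0)
print(len(X), len(X_biased), len(removed))
```

Roughly half of the cluster lies on each side of a line through its center, so with a small `prob` close to half of the points disappear.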

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | np.ndarray (2D) | The dataset to be biased artificially. | required |
| labels | np.array (1D) | The corresponding set of labels indicating classes / clusters. | required |
| num_biasedClusters | int (> 0) | The number of clusters in the dataset that should be biased. | required |
| prob | float | The probability for each point above the random hyperplane to remain in the dataset. | 0.05 |
| seed | int, optional | The random seed for reproducible generation of the bias. | None |
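In the source listing for this function, seed is passed to `np.random.default_rng`. The following sketch only illustrates what that implies for reproducibility; the variable names are made up for the example.

```python
import numpy as np

# Two generators created with the same seed produce identical draws.
rng_a = np.random.default_rng(2210)
rng_b = np.random.default_rng(2210)
draws_a = rng_a.random(3)
draws_b = rng_b.random(3)
print(np.array_equal(draws_a, draws_b))  # True

# With seed=None, NumPy pulls fresh entropy from the operating system,
# so results generally differ from run to run.
rng_c = np.random.default_rng(None)
```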

Returns:

| Type | Description |
| --- | --- |
| np.ndarray (2D) | The biased dataset. |
| np.array (1D) | The corresponding labels. |
| np.array (1D) | The list of indices of points in the original dataset that have been removed. |
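The three return values are related: the biased data and labels are the originals with the listed rows deleted. The sketch below shows that relationship on toy arrays; the values and the `kept` mask are hypothetical and serve only as illustration.

```python
import numpy as np

# Toy stand-ins for the original data and labels (hypothetical values).
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])

# Hypothetical indices of removed points, shaped like the third return value.
idcs_deleted = np.array([1, 4])

# The biased data and labels are the originals without those rows.
X_b = np.delete(X, idcs_deleted, axis=0)
y_b = np.delete(y, idcs_deleted)

# A boolean mask over the original dataset is often handier than an index list.
kept = np.ones(len(X), dtype=bool)
kept[idcs_deleted] = False

print(X_b.shape)                      # (4, 2)
print(np.array_equal(X[kept], X_b))   # True
```

A mask like `kept` makes it easy to compare the retained and removed parts of the original dataset side by side.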

References

[1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. "Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

Examples:

>>> from imitatebias.generators import *
>>> import matplotlib.pyplot as plt

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

Plot the biased dataset.

>>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)

Plot the removed points.

>>> plt.scatter(X[idcs_deleted,0], X[idcs_deleted,1], c='red', label='deleted points')
>>> plt.legend()
>>> plt.show()

Source code in imitatebias\generators.py
def generateBias(data, labels, num_biasedClusters, prob=0.05, seed=None):
    """Generates an artificial bias.

    A dataset sampled from a multivariate Gaussian is biased by rotating a
    hyperplane around its center by a random angle. Most data points (the user
    controls how many) above the hyperplane are removed. This bias generation
    strategy has been described in our paper [1].

    Parameters
    ----------
    data : np.ndarray (2D)
        The dataset to be biased artificially. 
    labels : np.array (1D)
        The corresponding set of labels indicating classes / clusters.
    num_biasedClusters : int (> 0)
        The number of clusters in the dataset that should be biased.
    prob : float, default=0.05
        The probability for each point above the random hyperplane to remain in the 
        dataset.
    seed : int, optional
        The random seed for reproducible generation of the bias.

    Returns
    -------
    np.ndarray (2D)
        The biased dataset.
    np.array (1D)
        The corresponding labels.
    np.array (1D)
        The list of indices of points in the original dataset that have been removed.

    References
    ----------
    .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. 
       "Your Best Guess When You Know Nothing: Identification and Mitigation of 
       Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
       pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

    Examples
    --------
    >>> from imitatebias.generators import *
    >>> import matplotlib.pyplot as plt

    Generate a dataset.
    >>> X, y = generateData(1000, 2, 2, seed=2210)

    Generate a biased dataset.
    >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

    Plot the biased dataset.
    >>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)
    Plot the removed points.
    >>> plt.scatter(X[idcs_deleted,0], X[idcs_deleted,1], c='red', label='deleted points')
    >>> plt.legend()
    >>> plt.show()
    """
    rng = np.random.default_rng(seed)
    delete_this = []
    clusters = np.unique(labels)
    # cannot bias more clusters than the dataset contains
    if num_biasedClusters > len(clusters): num_biasedClusters = len(clusters)

    # select blobs that will be biased
    bias_these = rng.choice(clusters, num_biasedClusters, replace=False)
    alphas = rng.random(num_biasedClusters) * 2*np.pi # angle for plane
    for blob, alpha in zip(bias_these, alphas):
        # pick two random dimensions; the plane is rotated within this 2D projection
        dims = rng.choice(range(len(data[0])), 2, replace=False)
        mean = data[labels == blob].mean(0)[dims]
        # polar angle in [0, 2*pi) of every point relative to the blob's mean
        d = np.sqrt(np.sum((data[:,dims] - mean)**2, axis=1))
        angles = np.arcsin((data[:, dims[1]] - mean[1]) / d)
        angles = np.array([(np.pi - angles[i] if data[i, dims[0]] < mean[0] else angles[i]) for i in range(len(angles))])
        angles[angles < 0] += 2*np.pi
        # flag points whose angle lies in the half-plane between alpha and alpha + pi
        if alpha >= np.pi:
            b = np.logical_or(angles > alpha, angles < alpha - np.pi)
        else:
            b = np.logical_and(angles > alpha, angles < (alpha + np.pi) % (2 * np.pi))
        # only points of the current blob are affected
        b = np.logical_and(b, labels == blob)
        b = np.where(b)[0]
        # spare a fraction prob of the flagged points from deletion
        b = np.delete(b, rng.choice(range(len(b)), int(prob*len(b)), replace=False))
        delete_this = np.append(delete_this, b)

    # remove the collected indices from data and labels
    delete_this = delete_this.astype(int)
    d, l = np.delete(data, delete_this, axis=0), np.delete(labels, delete_this)
    return d, l, delete_this
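The angle computation in the source above (an `arcsin`, a mirror step for points left of the mean, and a shift of negative angles) yields the polar angle in [0, 2π). As a sanity check, the sketch below reproduces those three steps on illustrative random points and compares them with `np.arctan2`. The `arctan2` variant is an assumption of this example and is not used by the library.

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.normal(size=(200, 2))   # illustrative 2D points
mean = pts.mean(axis=0)
dx, dy = pts[:, 0] - mean[0], pts[:, 1] - mean[1]

# Angle computation following the three steps in the source.
d = np.sqrt(dx**2 + dy**2)
angles = np.arcsin(dy / d)                          # in [-pi/2, pi/2]
angles = np.where(dx < 0, np.pi - angles, angles)   # mirror points left of the mean
angles[angles < 0] += 2 * np.pi                     # shift to [0, 2*pi)

# Compact alternative for comparison (not the library's code).
angles_atan2 = np.arctan2(dy, dx) % (2 * np.pi)

print(np.allclose(angles, angles_atan2))
```

One practical difference: `np.arctan2` is defined for a point that coincides exactly with the mean, whereas the `arcsin` form divides by a zero distance in that case.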

generateData(num_instances, num_clusters, num_dims, return_params=False, seed=None, mean_low=1, mean_high=100)

Generates random data drawn from multivariate Gaussian(s).

The covariance matrices of the multivariate Gaussians are generated randomly via their Cholesky decomposition (i.e., for every real-valued symmetric positive-definite (SPD) matrix M, there is a unique lower-triangular matrix L with positive diagonal entries and LL^T = M). That is, we generate lower-triangular matrices m with positive diagonal and obtain the covariance matrices as Cov = mm^T.
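The construction can be checked numerically. The sketch below builds one such matrix in the spirit of the source listing and verifies that the result is symmetric positive-definite; the seed, the dimensionality, and the checks themselves are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_dims = 3

# Random matrix with entries in [-1, 1), scaled by a random integer factor.
m = rng.integers(1, 50) * (2 * rng.random((num_dims, num_dims)) - 1)

# Keep the lower triangle and force a positive diagonal.
m = np.tril(m)
m[np.diag_indices(num_dims)] = np.abs(np.diag(m))

# Cov = m m^T is symmetric positive-definite by construction.
cov = m @ m.T
is_symmetric = np.allclose(cov, cov.T)
is_positive_definite = bool(np.all(np.linalg.eigvalsh(cov) > 0))

# Because the factor is unique, np.linalg.cholesky recovers m from cov.
recovered = np.linalg.cholesky(cov)
print(is_symmetric, is_positive_definite, np.allclose(recovered, m))
```

Generating the factor rather than the covariance matrix itself avoids having to reject random matrices that are not positive-definite.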

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| num_instances | int (> 0) | The size of the generated dataset. | required |
| num_clusters | int (> 0) | The number of clusters / classes in the generated dataset. | required |
| num_dims | int (> 0) | The dimensionality of the generated dataset. | required |
| return_params | bool | If True, the parameters (mean and covariance) of the generated Gaussians are returned alongside the data and labels, which are returned either way. | False |
| seed | int, optional | The random seed for reproducible generation of the dataset. | None |
| mean_low | float | Controls the range in which the means of the Gaussians are generated (lower boundary). | 1 |
| mean_high | float | Controls the range in which the means of the Gaussians are generated (upper boundary). | 100 |
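Judging from the source listing for this function, mean_low and mean_high bound an integer scale factor, and each mean coordinate is then drawn uniformly between minus that scale and plus that scale. The sketch below mirrors that single step to show the resulting range; the loop, the sample count, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mean_low, mean_high, num_dims = 1, 100, 2

means = []
for _ in range(1000):
    # Integer scale drawn from [mean_low, mean_high); the upper bound is exclusive.
    scale = rng.integers(mean_low, mean_high)
    # Each coordinate is uniform in [-scale, scale).
    means.append(scale * (2 * rng.random(num_dims) - 1))
means = np.array(means)

print(np.abs(means).max() < mean_high)  # True
```

Under this reading, means can be negative or close to zero even when mean_low is positive, because the boundaries limit the scale and not the coordinates themselves.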

Returns:

| Type | Description |
| --- | --- |
| np.ndarray (2D) | Generated data points. |
| np.array (1D) | Corresponding class / cluster labels. |

Examples:

>>> from imitatebias.generators import *
>>> import matplotlib.pyplot as plt

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Plot the dataset.

>>> plt.scatter(X[:,0], X[:,1], c=y)
>>> plt.show()
Source code in imitatebias\generators.py
def generateData(num_instances, num_clusters, num_dims, return_params=False, seed=None, mean_low=1, mean_high=100):
    """Generates random data drawn from multivariate Gaussian(s).

    The covariance matrices of the multivariate Gaussians are generated randomly
    via their Cholesky decomposition (i.e., for every real-valued symmetric
    positive-definite (SPD) matrix M, there is a unique lower-triangular matrix L
    with positive diagonal entries and LL^T = M). That is, we generate
    lower-triangular matrices m with positive diagonal and obtain the covariance
    matrices as Cov = mm^T.

    Parameters
    ----------
    num_instances : int (> 0)
        The size of the generated dataset.
    num_clusters : int (> 0)
        The number of clusters / classes in the generated dataset.
    num_dims : int (> 0)
        The dimensionality of the generated dataset.
    return_params : bool, default=False
        If True, the parameters (mean and covariance) of the generated Gaussians
        are returned alongside the data and labels, which are returned either way.
    seed : int, optional
        The random seed for reproducible generation of the dataset.
    mean_low : float, default=1
        Controls the range in which the means of the Gaussians are generated (lower
        boundary).
    mean_high : float, default=100
        Controls the range in which the means of the Gaussians are generated (upper
        boundary).

    Returns
    -------
    np.ndarray (2D)
        Generated data points.
    np.array (1D)
        Corresponding class / cluster labels. 

    Examples
    --------
    >>> from imitatebias.generators import *
    >>> import matplotlib.pyplot as plt

    Generate a dataset.
    >>> X, y = generateData(1000, 2, 2, seed=2210)

    Plot the dataset.
    >>> plt.scatter(X[:,0], X[:,1], c=y)
    >>> plt.show()
    """
    rng = np.random.default_rng(seed)
    points = np.empty((0,num_dims))
    labels = []
    params = []
    # distribute the instances as evenly as possible; the first clusters absorb the remainder
    num_cl = [num_instances // num_clusters + (1 if x < num_instances % num_clusters else 0) for x in range(num_clusters)]
    for i in range(len(num_cl)):

        # generate Cov using Cholesky decomposition
        m = rng.integers(1,50)*(2*rng.random((num_dims, num_dims))-1)
        for j in range(len(m)):
            m[j,j] = np.abs(m[j,j])
        m = np.tril(m)
        cov = m.dot(m.transpose())

        # generate mean
        mean = rng.integers(mean_low,mean_high)*(2*rng.random(num_dims)-1)

        # sample points
        pts = rng.multivariate_normal(mean, cov, size=num_cl[i])

        points = np.concatenate((points, pts), axis=0)
        labels = np.append(labels, [i]*num_cl[i])
        params.append([mean, cov])
    if return_params: return points, np.array(labels), params
    return points, np.array(labels)
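The source above splits num_instances across the clusters so that cluster sizes differ by at most one point. The same arithmetic can be written with `divmod`; `split_evenly` below is a hypothetical helper for illustration and is not part of the library.

```python
def split_evenly(num_instances, num_clusters):
    """Hypothetical helper: cluster sizes that differ by at most one point."""
    base, remainder = divmod(num_instances, num_clusters)
    # The first `remainder` clusters receive one extra point.
    return [base + (1 if i < remainder else 0) for i in range(num_clusters)]

print(split_evenly(1000, 3))  # [334, 333, 333]
print(split_evenly(10, 4))    # [3, 3, 2, 2]
```

The sizes always sum to num_instances, so the generated dataset has exactly the requested number of rows.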