Data and Bias Generators
generateBias(data, labels, num_biasedClusters, prob=0.05, seed=None)
Generates an artificial bias.
A dataset sampled from a multivariate Gaussian is biased by rotating a hyperplane around its center by a random angle. Most data points above the hyperplane are removed; each such point remains in the dataset with probability `prob`. This bias generation strategy is described in our paper [1].
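The idea can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the function name `hyperplane_bias_sketch` is hypothetical, the hyperplane is assumed to pass through the sample mean, and its random orientation is drawn as a random unit normal vector rather than an explicit rotation angle.

```python
import numpy as np

def hyperplane_bias_sketch(points, prob=0.05, seed=None):
    """Illustrative only (hypothetical helper, not part of imitatebias)."""
    rng = np.random.default_rng(seed)
    center = points.mean(axis=0)              # assumed: hyperplane passes through the mean
    normal = rng.normal(size=points.shape[1])
    normal /= np.linalg.norm(normal)          # random orientation as a unit normal vector
    above = (points - center) @ normal > 0    # points on the positive side of the hyperplane
    survives = rng.random(points.shape[0]) < prob
    keep = ~above | survives                  # points below always stay; points above stay with prob
    return points[keep], np.flatnonzero(~keep)

rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 2))
kept, removed = hyperplane_bias_sketch(pts, prob=0.05, seed=0)
```

With `prob=1.0` no point is removed, and with `prob=0.0` every point above the hyperplane is removed, which matches the role of `prob` described in the parameter table.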
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | np.ndarray (2D) | The dataset to be biased artificially. | required |
labels | np.array (1D) | The corresponding set of labels indicating classes / clusters. | required |
num_biasedClusters | int | The number of clusters in the dataset that should be biased. | required |
prob | float | The probability for each point above the random hyperplane to remain in the dataset. | 0.05 |
seed | int, optional | The random seed for reproducible generation of the bias. | None |
Returns:
Type | Description |
---|---|
np.ndarray (2D) | The biased dataset. |
np.array (1D) | The corresponding labels. |
np.array (1D) | The list of indices of points in the original dataset that have been removed. |
References
[1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. "Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001. IEEE, 2020. ISSN: 2374-8486.
Examples:
>>> from imitatebias.generators import generateData, generateBias
>>> import matplotlib.pyplot as plt
Generate a dataset.
>>> X, y = generateData(1000, 2, 2, seed=2210)
Generate a biased dataset.
>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)
Plot the biased dataset.
>>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)
Plot the removed points.
>>> plt.scatter(X[idcs_deleted,0], X[idcs_deleted,1], c='red', label='deleted points')
>>> plt.legend()
>>> plt.show()
Source code in imitatebias\generators.py
generateData(num_instances, num_clusters, num_dims, return_params=False, seed=None, mean_low=1, mean_high=100)
Generates random data drawn from multivariate Gaussian(s).
The covariance matrices of the multivariate Gaussians are generated randomly via their Cholesky decomposition: for every real-valued symmetric positive-definite (SPD) matrix M, there is a unique lower-triangular matrix L with positive diagonal entries such that LL^T = M. That is, we generate lower-triangular matrices m with positive diagonal entries and obtain the covariance matrices as Cov = mm^T.
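The construction can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the code in `generators.py`: the helper name `random_covariance` and the sampling ranges for the matrix entries are chosen here for the example only.

```python
import numpy as np

def random_covariance(num_dims, seed=None):
    """Illustrative only (hypothetical helper, not part of imitatebias)."""
    rng = np.random.default_rng(seed)
    # Random lower-triangular factor: entries above the diagonal are zero.
    m = np.tril(rng.uniform(-1.0, 1.0, size=(num_dims, num_dims)))
    # Force a strictly positive diagonal (assumed range) so that m is invertible.
    m[np.diag_indices(num_dims)] = rng.uniform(0.5, 1.5, size=num_dims)
    # m @ m.T is symmetric positive-definite by construction.
    return m @ m.T

cov = random_covariance(3, seed=0)
# A valid covariance matrix can be passed directly to a Gaussian sampler.
samples = np.random.default_rng(1).multivariate_normal(np.zeros(3), cov, size=100)
```

Because the factor has a positive diagonal, the product is always positive-definite, so no rejection step or eigenvalue repair is needed before sampling.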
Parameters:
Name | Type | Description | Default |
---|---|---|---|
num_instances | int | The size of the generated dataset. | required |
num_clusters | int | The number of clusters / classes in the generated dataset. | required |
num_dims | int | The dimensionality of the generated dataset. | required |
return_params | bool | If True, the parameters of the generated Gaussians are returned alongside the data and labels, which are returned either way. | False |
seed | int, optional | The random seed for reproducible generation of the dataset. | None |
mean_low | float | Lower boundary of the range in which the means of the Gaussians are generated. | 1 |
mean_high | float | Upper boundary of the range in which the means of the Gaussians are generated. | 100 |
Returns:
Type | Description |
---|---|
np.ndarray (2D) | Generated data points. |
np.array (1D) | Corresponding class / cluster labels. |
Examples:
>>> from imitatebias.generators import generateData
>>> import matplotlib.pyplot as plt
Generate a dataset.
>>> X, y = generateData(1000, 2, 2, seed=2210)
Plot the dataset.
>>> plt.scatter(X[:,0], X[:,1], c=y)
>>> plt.show()
Source code in imitatebias\generators.py