
Imitate

Imitate generates points to mitigate a dataset's bias.

Imitate investigates the dataset's probability density, then adds generated points in order to smooth out the density and have it resemble a Gaussian, the most common density occurring in real-world applications. If the artificial points focus on certain areas and are not widespread, this could indicate a Selection Bias where these areas are underrepresented in the sample.
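To make the idea concrete, here is a minimal 1D sketch of the density-gap intuition. It uses scipy's gaussian_kde and norm with a naive moment fit; Imitate's actual fit is bias-aware, so this illustrates the idea rather than the library's own routine.

import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
sample = rng.normal(0, 1, 2000)
biased = sample[~((sample > 0.5) & (sample < 1.5))]  # cut a hole -> selection bias

grid = np.linspace(-4, 4, 200)
kde_vals = gaussian_kde(biased)(grid)                 # observed density (KDE)
fitted = norm.pdf(grid, biased.mean(), biased.std())  # naive Gaussian fit
fill_up = np.clip(fitted - kde_vals, 0, None)         # density gap to fill

print(grid[np.argmax(fill_up)])  # the largest gap lies inside the cut-out region

The concentrated, non-widespread gap is exactly the signature Imitate interprets as a potential selection bias.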

See our paper [1]_ for details.

Attributes:

icas : dict(string or int or float: sklearn.decomposition.FastICA)
    A dictionary mapping each label in the training set to a FastICA object trained on that label's subset.

grids : dict(string or int or float: numpy.ndarray (2D))
    A dictionary mapping a class label to its grids per dimension, over which the KDE was evaluated.

vals : dict(string or int or float: numpy.ndarray (2D))
    A KDE density representation of the dataset, evaluated over grids.

fitted : dict(string or int or float: numpy.ndarray (2D))
    The fitted Gaussian PDF, evaluated over grids.

fill_up : dict(string or int or float: numpy.ndarray (2D))
    vals - fitted, evaluated over grids.

num_fill_up : dict(string or int or float: numpy.array (1D))
    The number of points to add to mitigate the bias, per label and dimension.
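After fitting, these attributes can be inspected directly; a short sketch, assuming integer labels and the variables from the Examples below:

>>> imi = Imitate()
>>> imi.fit(X_b, labels=y_b)
>>> imi.num_fill_up[0]  # points to add for label 0, per ICA dimension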

Methods

fit(data, labels=[], bounds={}, bounds_set=None, strength=1000)
    Fits the Imitate Gaussians to a dataset.
score(data, score_type='fill')
    Scores new data based on Imitate's fitted Gaussians.
augment()
    Augments the fitted dataset to mitigate its bias.

References

.. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. "Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.imitate import *
>>> import matplotlib.pyplot as plt
>>> import numpy as np

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Imitate

>>> imi = Imitate()

fit Imitate to the biased dataset

>>> imi.fit(X_b, labels=y_b)

visualize data per cluster in ICA space

>>> for l in np.unique(y_b):
...     data_transformed = imi.icas[l].transform(X_b[y_b == l])
...     plt.scatter(data_transformed[:,0], data_transformed[:,1])
...     plt.title('Class '+str(l))
...     plt.show()

create some random points to score

>>> rnd_points = np.column_stack((np.random.uniform(min(X[:,0]), max(X[:,0]), size=1000),
...                               np.random.uniform(min(X[:,1]), max(X[:,1]), size=1000)))

score the random points

>>> scores_fill = imi.score(rnd_points, score_type='fill')
>>> scores_balanced = imi.score(rnd_points, score_type='balanced')

visualize data per cluster in ICA space

>>> for l in np.unique(y_b):
...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_fill[:,int(l)])
...     plt.title('Class '+str(l)+'; Score type = fill')
...     plt.colorbar()
...     plt.show()
...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_balanced[:,int(l)])
...     plt.title('Class '+str(l)+'; Score type = balanced')
...     plt.colorbar()
...     plt.show()

augment the dataset

>>> X_gen, y_gen = imi.augment()

visualize data per cluster in ICA space

>>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)
>>> plt.scatter(X_gen[:,0], X_gen[:,1], c=y_gen, edgecolors='red')
>>> plt.title('Dataset with generated points (red)')
>>> plt.show()
Source code in imitatebias\imitate.py
class Imitate:
    """Imitate generates points to mitigate a dataset's bias.

    Imitate investigates the dataset's probability density, then adds generated points 
    in order to smooth out the density and have it resemble a Gaussian, the most common 
    density occurring in real-world applications. If the artificial points focus on 
    certain areas and are not widespread, this could indicate a Selection Bias where 
    these areas are underrepresented in the sample.

    See our paper [1]_ for details.

    Attributes
    ----------
    icas : dict(string or int or float: sklearn.decomposition.FastICA)
        A dictionary mapping each label in the training set to a `FastICA` object
        trained on that label's subset.
    grids : dict(string or int or float: numpy.ndarray (2D))
        A dictionary mapping a class label to its corresponding grids per dimension 
        over which KDE was evaluated.
    vals : dict(string or int or float: numpy.ndarray (2D))
        A KDE density representation of the dataset evaluated over `grids`.
    fitted : dict(string or int or float: numpy.ndarray (2D))
        Fitted Gaussian PDF evaluated over `grids`.
    fill_up : dict(string or int or float: numpy.ndarray (2D))
        `vals - fitted`, evaluated over `grids`.
    num_fill_up : dict(string or int or float: numpy.array (1D))
        The necessary number of points to add to mitigate the bias; per label and 
        dimension.

    Methods
    -------
    fit(data, labels=[], bounds={}, bounds_set=None, strength=1000)
        Fits the Imitate Gaussians to a dataset.
    score(data, score_type='fill')
        Scores new data based on Imitate's fitted Gaussians.
    augment()
        Augments the fitted dataset to mitigate its bias.

    References
    ----------
    .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. 
       "Your Best Guess When You Know Nothing: Identification and Mitigation of 
       Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
       pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

    Examples
    --------
    >>> from imitatebias.generators import *
    >>> from imitatebias.imitate import *
    >>> import matplotlib.pyplot as plt
    >>> import numpy as np

    Generate a dataset.
    >>> X, y = generateData(1000, 2, 2, seed=2210)

    Generate a biased dataset.
    >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

    initialize Imitate
    >>> imi = Imitate()

    fit Imitate to the biased dataset
    >>> imi.fit(X_b, labels=y_b)

    visualize data per cluster in ICA space
    >>> for l in np.unique(y_b):
    ...     data_transformed = imi.icas[l].transform(X_b[y_b == l])
    ...     plt.scatter(data_transformed[:,0], data_transformed[:,1])
    ...     plt.title('Class '+str(l))
    ...     plt.show()

    create some random points to score
    >>> rnd_points = np.column_stack((np.random.uniform(min(X[:,0]), max(X[:,0]), size=1000),
    ...                               np.random.uniform(min(X[:,1]), max(X[:,1]), size=1000)))

    score the random points
    >>> scores_fill = imi.score(rnd_points, score_type='fill')
    >>> scores_balanced = imi.score(rnd_points, score_type='balanced')

    visualize data per cluster in ICA space
    >>> for l in np.unique(y_b):
    ...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_fill[:,int(l)])
    ...     plt.title('Class '+str(l)+'; Score type = fill')
    ...     plt.colorbar()
    ...     plt.show()
    ...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_balanced[:,int(l)])
    ...     plt.title('Class '+str(l)+'; Score type = balanced')
    ...     plt.colorbar()
    ...     plt.show()

    augment the dataset
    >>> X_gen, y_gen = imi.augment()

    visualize data per cluster in ICA space
    >>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)
    >>> plt.scatter(X_gen[:,0], X_gen[:,1], c=y_gen, edgecolors='red')
    >>> plt.title('Dataset with generated points (red)')
    >>> plt.show()
    """

    def __init__(self):
        """Imitate Constructor."""
        self.icas = {}
        self.grids = {}
        self.vals = {}
        self.fitted = {}
        self.fill_up = {}
        self.num_fill_up = {}

    def fit(self, data, labels=[], bounds={}, bounds_set=None, strength=1000):
        """Fits a bias-aware multivariate Gaussian per label to the data.

        Given a dataset and an optional label array, Imitate splits the data per
        class and operates on each subset individually. For each label, fit fits
        a multivariate Gaussian to the subset in a way that accounts for potential
        biases. See our paper [1]_ for details.
        Custom bounds can be defined to constrain the fitting process. The strength
        parameter controls how strongly these bounds are enforced (the unbounded
        version uses `strength=1`).

        Parameters
        ----------
        data : numpy.ndarray (2D)
            Potentially biased input dataset.
        labels : numpy.array (1D), optional
            Labels corresponding to the dataset if available.
        bounds : dict(string or int or float: numpy.ndarray (2D)), optional
            Bounds for Imitate, given per label, in the shape
            ``[[min_0, max_0], ..., [min_d, max_d]]``
            for d dimensions. Use a dictionary to map each label to its bounds.
        bounds_set : numpy.ndarray (2D), optional
            If Imitate should be bounded to the ranges of a certain dataset, that set
            can be passed directly. Ignored if `bounds` is specified.
        strength : int, default=1000
            Controls how strongly the bounds are enforced. Will be ignored if no
            bounds are specified.

        References
        ----------
        .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker.
           "Your Best Guess When You Know Nothing: Identification and Mitigation of 
           Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
           pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.imitate import *
        >>> import matplotlib.pyplot as plt
        >>> import numpy as np

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Imitate
        >>> imi = Imitate()

        fit Imitate to the biased dataset
        >>> imi.fit(X_b, labels=y_b)

        visualize data per cluster in ICA space
        >>> for l in np.unique(y_b):
        ...     data_transformed = imi.icas[l].transform(X_b[y_b == l])
        ...     plt.scatter(data_transformed[:,0], data_transformed[:,1])
        ...     plt.title('Class '+str(l))
        ...     plt.show()
        """
        self.data = data
        self.labels = np.zeros(len(data)).astype(int) if len(labels)==0 else labels
        for l in np.unique(self.labels):
            d = data[self.labels == l]
            self.icas[l] = FastICA(n_components=len(d[0]), whiten='arbitrary-variance')
            self.icas[l].fit(d)
            d_trf = self.icas[l].transform(d)

            if len(bounds) > 0:
                b = bounds.get(l)
                p_gen = np.column_stack([np.random.uniform(*b[i], 1000) for i in range(len(d[0]))])
                p_trf = self.icas[l].transform(p_gen)
                bounds_trf = np.vstack((p_trf.min(axis=0), p_trf.max(axis=0))).transpose()
                range_trf = bounds_trf[:,1] - bounds_trf[:,0]
                bounds_relaxed = np.vstack((bounds_trf[:,0]-0.1*range_trf, bounds_trf[:,1]+0.1*range_trf)).transpose()

                self.grids[l], self.vals[l], self.fitted[l], self.fill_up[l], self.num_fill_up[l], _, _, _ = run_imitate(
                    d_trf, bounds=bounds_relaxed, strength=strength)
            elif bounds_set is not None:
                p_trf = self.icas[l].transform(bounds_set)
                bounds_trf = np.vstack((p_trf.min(axis=0), p_trf.max(axis=0))).transpose()
                range_trf = bounds_trf[:,1] - bounds_trf[:,0]
                bounds_relaxed = np.vstack((bounds_trf[:,0]-0.1*range_trf, bounds_trf[:,1]+0.1*range_trf)).transpose()

                self.grids[l], self.vals[l], self.fitted[l], self.fill_up[l], self.num_fill_up[l], _, _, _ = run_imitate(
                    d_trf, bounds=bounds_relaxed, strength=strength)
            else:
                self.grids[l], self.vals[l], self.fitted[l], self.fill_up[l], self.num_fill_up[l], _, _, _ = run_imitate(
                    d_trf, bounds=None, strength=1)

    def score(self, data, score_type='fill'):
        """Scores new data based on Imitate's fitted Gaussian.

        Imitate fits one multivariate Gaussian per label in a dataset. Scores are 
        obtained via the difference of those Gaussians' PDFs and the input data
        (represented via a KDE estimate). See our paper [1]_ for details.

        Parameters
        ----------
        data : numpy.ndarray (2D)
            Data that shall be scored. This dataset does not need to match the input data,
            but it is required to have the same dimensionality.
        score_type : {'fill', 'balanced'}, default='fill'
            Selects the type of score. `'fill'` measures how well a data point fills in
            the identified bias, i.e., it quantifies the difference between the fitted
            and the observed dataset distribution. The score is set to 0 if the
            3-std-truncated fitted Gaussian's PDF at this point evaluates to 0.
            `'balanced'` additionally takes into account how likely a point is to be
            observed in this dataset. See our paper [2]_ for details.

        Returns
        -------
        np.ndarray (2D)
            Score (i,j) corresponds to data point D_i and input data label j. 

        References
        ----------
        .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. 
           "Your Best Guess When You Know Nothing: Identification and Mitigation of 
           Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
           pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

        .. [2] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg 
           Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of 
           Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th 
           Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 
           13281, pp. 149-160. Springer, Cham (2022).

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.imitate import *
        >>> import matplotlib.pyplot as plt
        >>> import numpy as np

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Imitate
        >>> imi = Imitate()

        fit Imitate to the biased dataset
        >>> imi.fit(X_b, labels=y_b)

        create some random points to score
        >>> rnd_points = np.column_stack((np.random.uniform(min(X[:,0]), max(X[:,0]), size=1000),
        ...                               np.random.uniform(min(X[:,1]), max(X[:,1]), size=1000)))

        score the random points
        >>> scores_fill = imi.score(rnd_points, score_type='fill')
        >>> scores_balanced = imi.score(rnd_points, score_type='balanced')

        visualize data per cluster in ICA space
        >>> for l in np.unique(y_b):
        ...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_fill[:,int(l)])
        ...     plt.title('Class '+str(l)+'; Score type = fill')
        ...     plt.colorbar()
        ...     plt.show()
        ...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_balanced[:,int(l)])
        ...     plt.title('Class '+str(l)+'; Score type = balanced')
        ...     plt.colorbar()
        ...     plt.show()
        """
        scores = np.zeros((len(data), len(self.icas)))
        for i,l in enumerate(np.unique(self.labels)): # fill scores[:,i]
            data_trf = self.icas[l].transform(data)
            grids, fitted, fill_up = self.grids[l], self.fitted[l], self.fill_up[l]

            fitted_grid = np.zeros((len(data_trf), len(data_trf[0]))) # points x dims
            fill_grid = np.zeros((len(data_trf), len(data_trf[0]))) # points x dims
            for d in range(len(data_trf[0])):
                # organize in grid cells: 0 = smaller; len(grids[0]) = larger
                grid_dim = np.digitize(data_trf[:,d], grids[d]) # points x dims
                map_to_fitted = np.vectorize(lambda idx: 0 if idx<=0 or idx>=len(grids[d]) else fitted[d][idx-1])
                map_to_fill = np.vectorize(lambda idx: 0 if idx<=0 or idx>=len(grids[d]) else fill_up[d][idx-1])
                fitted_grid[:, d] = map_to_fitted(grid_dim)
                fill_grid[:, d] = map_to_fill(grid_dim)
            if score_type == 'fill':
                scores[:, i] = np.sum(fill_grid, axis=1)
                scores[np.prod(fitted_grid, axis=1) == 0, i] = 0  # 0 score for improbable entries
            elif score_type == 'balanced':
                s1 = np.sum(np.log(fitted_grid + 1), axis=1)  # fitted distribution
                s2 = np.sum(np.log(fill_grid + 1), axis=1)    # fill-up density
                scores[:, i] = s1 + len(data_trf[0])*s2       # combined score; fill-up term weighted by dimensionality
                scores[np.sum(fill_grid, axis=1) == 0, i] = 0     # 0 score where nothing is filled up
                scores[np.prod(fitted_grid, axis=1) == 0, i] = 0  # 0 score for improbable entries
        return scores

    def augment(self):
        """Augments the fitted dataset to mitigate its bias.

        Generates points to mitigate the bias in the input dataset provided to the `fit` method.
        The number of generated points per label is determined by `Imitate.num_fill_up`.

        Returns
        -------
        numpy.ndarray (2D)
            Generated points.
        numpy.array (1D)
            Corresponding labels.

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.imitate import *
        >>> import matplotlib.pyplot as plt

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Imitate
        >>> imi = Imitate()

        fit Imitate to the biased dataset
        >>> imi.fit(X_b, labels=y_b)

        augment the dataset
        >>> X_gen, y_gen = imi.augment()

        visualize data per cluster in ICA space
        >>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)
        >>> plt.scatter(X_gen[:,0], X_gen[:,1], c=y_gen, edgecolors='red')
        >>> plt.title('Dataset with generated points (red)')
        >>> plt.show()
        """
        gen_points = np.empty((0, len(self.data[0])))
        gen_labels = []

        for l in np.unique(self.labels):
            num_fill_up = self.num_fill_up[l]
            num_gen = np.max(num_fill_up).astype(int)
            if num_gen == 0: continue
            grids, fitted, fill_up = self.grids[l], self.fitted[l], self.fill_up[l]

            points = np.empty((num_gen, 0))
            for d in range(len(self.data[0])):
                # mixed distribution: rescaled fitted Gaussian plus the fill-up density
                fill = fitted[d] / np.sum(fitted[d]) * (num_gen - num_fill_up[d]) + fill_up[d]
                fill_cdf = np.cumsum(fill) / num_gen  # normalize to a CDF

                # inverse-CDF sampling: draw uniforms, find their grid cells,
                # then sample uniformly within each cell
                u = np.random.rand(num_gen)
                val_bins = np.searchsorted(fill_cdf, u)
                coords = np.array([np.random.uniform(grids[d][val_bins[i]], grids[d][val_bins[i]+1])
                                   for i in range(num_gen)]).reshape(num_gen, 1)
                points = np.concatenate((points, coords), axis=1)

            gen_points = np.concatenate((gen_points, self.icas[l].inverse_transform(points)))
            gen_labels = np.append(gen_labels, [l]*num_gen)
        return gen_points, gen_labels

__init__()

Imitate Constructor.

Source code in imitatebias\imitate.py
def __init__(self):
    """Imitate Constructor."""
    self.icas = {}
    self.grids = {}
    self.vals = {}
    self.fitted = {}
    self.fill_up = {}
    self.num_fill_up = {}

augment()

Augments the fitted dataset to mitigate its bias.

Generates points to mitigate the bias in the input dataset provided to the fit method. The number of generated points per label is determined by Imitate.num_fill_up.
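A typical next step is to merge the generated points with the biased dataset before training a downstream model; a short sketch, assuming X_b, y_b, X_gen, and y_gen from the Examples below:

>>> import numpy as np
>>> X_aug = np.vstack((X_b, X_gen))
>>> y_aug = np.append(y_b, y_gen)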

Returns:

numpy.ndarray (2D)
    Generated points.
numpy.array (1D)
    Corresponding labels.

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.imitate import *
>>> import matplotlib.pyplot as plt

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Imitate

>>> imi = Imitate()

fit Imitate to the biased dataset

>>> imi.fit(X_b, labels=y_b)

augment the dataset

>>> X_gen, y_gen = imi.augment()

visualize data per cluster in ICA space

>>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)
>>> plt.scatter(X_gen[:,0], X_gen[:,1], c=y_gen, edgecolors='red')
>>> plt.title('Dataset with generated points (red)')
>>> plt.show()
Source code in imitatebias\imitate.py
def augment(self):
    """Augments the fitted dataset to mitigate its bias.

    Generates points to mitigate the bias in the input dataset provided to the `fit` method.
    The number of generated points per label is determined by `Imitate.num_fill_up`.

    Returns
    -------
    numpy.ndarray (2D)
        Generated points.
    numpy.array (1D)
        Corresponding labels.

    Examples
    --------
    >>> from imitatebias.generators import *
    >>> from imitatebias.imitate import *
    >>> import matplotlib.pyplot as plt

    Generate a dataset.
    >>> X, y = generateData(1000, 2, 2, seed=2210)

    Generate a biased dataset.
    >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

    initialize Imitate
    >>> imi = Imitate()

    fit Imitate to the biased dataset
    >>> imi.fit(X_b, labels=y_b)

    augment the dataset
    >>> X_gen, y_gen = imi.augment()

    visualize data per cluster in ICA space
    >>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)
    >>> plt.scatter(X_gen[:,0], X_gen[:,1], c=y_gen, edgecolors='red')
    >>> plt.title('Dataset with generated points (red)')
    >>> plt.show()
    """
    gen_points = np.empty((0, len(self.data[0])))
    gen_labels = []

    for l in np.unique(self.labels):
        num_fill_up = self.num_fill_up[l]
        num_gen = np.max(num_fill_up).astype(int)
        if num_gen == 0: continue
        grids, fitted, fill_up = self.grids[l], self.fitted[l], self.fill_up[l]

        points = np.empty((num_gen, 0))
        for d in range(len(self.data[0])):
            # mixed distribution: rescaled fitted Gaussian plus the fill-up density
            fill = fitted[d] / np.sum(fitted[d]) * (num_gen - num_fill_up[d]) + fill_up[d]
            fill_cdf = np.cumsum(fill) / num_gen  # normalize to a CDF

            # inverse-CDF sampling: draw uniforms, find their grid cells,
            # then sample uniformly within each cell
            u = np.random.rand(num_gen)
            val_bins = np.searchsorted(fill_cdf, u)
            coords = np.array([np.random.uniform(grids[d][val_bins[i]], grids[d][val_bins[i]+1])
                               for i in range(num_gen)]).reshape(num_gen, 1)
            points = np.concatenate((points, coords), axis=1)

        gen_points = np.concatenate((gen_points, self.icas[l].inverse_transform(points)))
        gen_labels = np.append(gen_labels, [l]*num_gen)
    return gen_points, gen_labels

fit(data, labels=[], bounds={}, bounds_set=None, strength=1000)

Fits a bias-aware multivariate Gaussian per label to the data.

Given a dataset and an optional label array, Imitate splits the data per class and operates on each subset individually. For each label, fit fits a multivariate Gaussian to the subset in a way that accounts for potential biases. See our paper [1]_ for details. Custom bounds can be defined to constrain the fitting process; the strength parameter controls how strongly these bounds are enforced (the unbounded version uses strength=1).
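For example, if the features are known to lie within fixed ranges, a bounds dictionary can be passed per label; a sketch in which the ranges and the labels 0 and 1 are hypothetical:

>>> import numpy as np
>>> my_bounds = {0: np.array([[0.0, 10.0], [-5.0, 5.0]]),
...              1: np.array([[0.0, 10.0], [-5.0, 5.0]])}
>>> imi = Imitate()
>>> imi.fit(X_b, labels=y_b, bounds=my_bounds, strength=1000)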

Parameters:

data : numpy.ndarray (2D), required
    Potentially biased input dataset.
labels : numpy.array (1D), optional, default=[]
    Labels corresponding to the dataset, if available.
bounds : dict(string or int or float: numpy.ndarray (2D)), optional, default={}
    Bounds for Imitate, given per label in the shape [[min_0, max_0], ..., [min_d, max_d]] for d dimensions. Use a dictionary to map each label to its bounds.
bounds_set : numpy.ndarray (2D), optional, default=None
    If Imitate should be bounded to the ranges of a certain dataset, that set can be passed directly. Ignored if bounds is specified.
strength : int, default=1000
    Controls how strongly the bounds are enforced. Ignored if no bounds are specified.
References

.. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. "Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.imitate import *
>>> import matplotlib.pyplot as plt
>>> import numpy as np

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Imitate

>>> imi = Imitate()

fit Imitate to the biased dataset

>>> imi.fit(X_b, labels=y_b)

visualize data per cluster in ICA space

>>> for l in np.unique(y_b):
...     data_transformed = imi.icas[l].transform(X_b[y_b == l])
...     plt.scatter(data_transformed[:,0], data_transformed[:,1])
...     plt.title('Class '+str(l))
...     plt.show()
Source code in imitatebias\imitate.py
    def fit(self, data, labels=[], bounds={}, bounds_set=None, strength=1000):
        """Fits a bias-aware multivariate Gaussian per label to the data.

        Given a dataset and an optional label array, Imitate splits the data per
        class and operates on each subset individually. For each label, fit fits
        a multivariate Gaussian to the subset in a way that accounts for potential
        biases. See our paper [1]_ for details.
        Custom bounds can be defined to constrain the fitting process. The strength
        parameter controls how strongly these bounds are enforced (the unbounded
        version uses `strength=1`).

        Parameters
        ----------
        data : numpy.ndarray (2D)
            Potentially biased input dataset.
        labels : numpy.array (1D), optional
            Labels corresponding to the dataset if available.
        bounds : dict(string or int or float: numpy.ndarray (2D)), optional
            Bounds for Imitate, given per label, in the shape
            ``[[min_0, max_0], ..., [min_d, max_d]]``
            for d dimensions. Use a dictionary to map each label to its bounds.
        bounds_set : numpy.ndarray (2D), optional
            If Imitate should be bounded to the ranges of a certain dataset, that set
            can be passed directly. Ignored if `bounds` is specified.
        strength : int, default=1000
            Controls how strongly the bounds are enforced. Will be ignored if no
            bounds are specified.

        References
        ----------
        .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker.
           "Your Best Guess When You Know Nothing: Identification and Mitigation of 
           Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
           pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.imitate import *
        >>> import matplotlib.pyplot as plt
        >>> import numpy as np

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Imitate
        >>> imi = Imitate()

        fit Imitate to the biased dataset
        >>> imi.fit(X_b, labels=y_b)

        visualize data per cluster in ICA space
        >>> for l in np.unique(y_b):
        ...     data_transformed = imi.icas[l].transform(X_b[y_b == l])
        ...     plt.scatter(data_transformed[:,0], data_transformed[:,1])
        ...     plt.title('Class '+str(l))
        ...     plt.show()
        """
        self.data = data
        self.labels = np.zeros(len(data)).astype(int) if len(labels)==0 else labels
        for l in np.unique(self.labels):
            d = data[self.labels == l]
            self.icas[l] = FastICA(n_components=len(d[0]), whiten='arbitrary-variance')
            self.icas[l].fit(d)
            d_trf = self.icas[l].transform(d)

            if len(bounds) > 0:
                b = bounds.get(l)
                p_gen = np.column_stack([np.random.uniform(*b[i], 1000) for i in range(len(d[0]))])
                p_trf = self.icas[l].transform(p_gen)
                bounds_trf = np.vstack((p_trf.min(axis=0), p_trf.max(axis=0))).transpose()
                range_trf = bounds_trf[:,1] - bounds_trf[:,0]
                bounds_relaxed = np.vstack((bounds_trf[:,0]-0.1*range_trf, bounds_trf[:,1]+0.1*range_trf)).transpose()

                self.grids[l], self.vals[l], self.fitted[l], self.fill_up[l], self.num_fill_up[l], _, _, _ = run_imitate(
                    d_trf, bounds=bounds_relaxed, strength=strength)
            elif bounds_set is not None:
                p_trf = self.icas[l].transform(bounds_set)
                bounds_trf = np.vstack((p_trf.min(axis=0), p_trf.max(axis=0))).transpose()
                range_trf = bounds_trf[:,1] - bounds_trf[:,0]
                bounds_relaxed = np.vstack((bounds_trf[:,0]-0.1*range_trf, bounds_trf[:,1]+0.1*range_trf)).transpose()

                self.grids[l], self.vals[l], self.fitted[l], self.fill_up[l], self.num_fill_up[l], _, _, _ = run_imitate(
                    d_trf, bounds=bounds_relaxed, strength=strength)
            else:
                self.grids[l], self.vals[l], self.fitted[l], self.fill_up[l], self.num_fill_up[l], _, _, _ = run_imitate(
                    d_trf, bounds=None, strength=1)

score(data, score_type='fill')

Scores new data based on Imitate's fitted Gaussians.

Imitate fits one multivariate Gaussian per label in a dataset. Scores are obtained via the difference of those Gaussians' PDFs and the input data (represented via a KDE estimate). See our paper [1]_ for details.

Parameters:

data : numpy.ndarray (2D), required
    Data to be scored. This dataset does not need to match the input data, but it must have the same dimensionality.
score_type : {'fill', 'balanced'}, optional, default='fill'
    Selects the type of score. 'fill' measures how well a data point fills in the identified bias, i.e., it quantifies the difference between the fitted and the observed dataset distribution. The score is set to 0 if the 3-std-truncated fitted Gaussian's PDF at this point evaluates to 0. 'balanced' additionally takes into account how likely a point is to be observed in this dataset. See our paper [2]_ for details.

Returns:

numpy.ndarray (2D)
    Score (i, j) corresponds to data point D_i and input-data label j.
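The score matrix can be used, for instance, to pick the candidate points that best fill the identified bias for one label; a sketch, assuming scores_fill and rnd_points from the Examples below and integer labels:

>>> import numpy as np
>>> label = 0
>>> best = np.argsort(scores_fill[:, label])[::-1][:10]  # ten highest-scoring points
>>> candidates = rnd_points[best]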

References

.. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. "Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

.. [2] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 13281, pp. 149-160. Springer, Cham (2022).

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.imitate import *
>>> import matplotlib.pyplot as plt
>>> import numpy as np

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Imitate

>>> imi = Imitate()

fit Imitate to the biased dataset

>>> imi.fit(X_b, labels=y_b)

create some random points to score

>>> rnd_points = np.column_stack((np.random.uniform(min(X[:,0]), max(X[:,0]), size=1000),
...                               np.random.uniform(min(X[:,1]), max(X[:,1]), size=1000)))

score the random points

>>> scores_fill = imi.score(rnd_points, score_type='fill')
>>> scores_balanced = imi.score(rnd_points, score_type='balanced')

visualize data per cluster in ICA space

>>> for l in np.unique(y_b):
...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_fill[:,int(l)])
...     plt.title('Class '+str(l)+'; Score type = fill')
...     plt.colorbar()
...     plt.show()
...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_balanced[:,int(l)])
...     plt.title('Class '+str(l)+'; Score type = balanced')
...     plt.colorbar()
...     plt.show()
Source code in imitatebias\imitate.py
    def score(self, data, score_type='fill'):
        """Scores new data based on Imitate's fitted Gaussian.

        Imitate fits one multivariate Gaussian per label in a dataset. Scores are 
        obtained via the difference of those Gaussians' PDFs and the input data
        (represented via a KDE estimate). See our paper [1]_ for details.

        Parameters
        ----------
        data : numpy.ndarray (2D)
            Data that shall be scored. This dataset does not need to match the input data,
            but it is required to have the same dimensionality.
        score_type : {'fill', 'balanced'}, default='fill'
            Selects the type of score. `'fill'` measures how well a data point fills in
            the identified bias, i.e., it quantifies the difference between the fitted
            and the observed dataset distribution. The score is set to 0 if the
            3-std-truncated fitted Gaussian's PDF at this point evaluates to 0.
            `'balanced'` additionally takes into account how likely a point is to be
            observed in this dataset. See our paper [2]_ for details.

        Returns
        -------
        np.ndarray (2D)
            Score (i,j) corresponds to data point D_i and input data label j. 

        References
        ----------
        .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. 
           "Your Best Guess When You Know Nothing: Identification and Mitigation of 
           Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
           pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

        .. [2] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg 
           Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of 
           Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th 
           Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 
           13281, pp. 149-160. Springer, Cham (2022).

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.imitate import *
        >>> import matplotlib.pyplot as plt
        >>> import numpy as np

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Imitate
        >>> imi = Imitate()

        fit Imitate to the biased dataset
        >>> imi.fit(X_b, labels=y_b)

        create some random points to score
        >>> rnd_points = np.column_stack((np.random.uniform(min(X[:,0]), max(X[:,0]), size=1000),
        ...                               np.random.uniform(min(X[:,1]), max(X[:,1]), size=1000)))

        score the random points
        >>> scores_fill = imi.score(rnd_points, score_type='fill')
        >>> scores_balanced = imi.score(rnd_points, score_type='balanced')

        visualize data per cluster in ICA space
        >>> for l in np.unique(y_b):
        ...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_fill[:,int(l)])
        ...     plt.title('Class '+str(l)+'; Score type = fill')
        ...     plt.colorbar()
        ...     plt.show()
        ...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_balanced[:,int(l)])
        ...     plt.title('Class '+str(l)+'; Score type = balanced')
        ...     plt.colorbar()
        ...     plt.show()
        """
        scores = np.zeros((len(data), len(self.icas)))
        for i,l in enumerate(np.unique(self.labels)): # fill scores[:,i]
            data_trf = self.icas[l].transform(data)
            grids, fitted, fill_up = self.grids[l], self.fitted[l], self.fill_up[l]

            fitted_grid = np.zeros((len(data_trf), len(data_trf[0]))) # points x dims
            fill_grid = np.zeros((len(data_trf), len(data_trf[0]))) # points x dims
            for d in range(len(data_trf[0])):
                # organize in grid cells: 0 = smaller; len(grids[0]) = larger
                grid_dim = np.digitize(data_trf[:,d], grids[d]) # points x dims
                map_to_fitted = np.vectorize(lambda idx: 0 if idx<=0 or idx>=len(grids[d]) else fitted[d][idx-1])
                map_to_fill = np.vectorize(lambda idx: 0 if idx<=0 or idx>=len(grids[d]) else fill_up[d][idx-1])
                fitted_grid[:, d] = map_to_fitted(grid_dim)
                fill_grid[:, d] = map_to_fill(grid_dim)
            if score_type == 'fill':
                scores[:, i] = np.sum(fill_grid, axis=1)
                scores[np.prod(fitted_grid, axis=1) == 0, i] = 0  # 0 score for improbable entries
            elif score_type == 'balanced':
                s1 = np.sum(np.log(fitted_grid + 1), axis=1)  # fitted distribution
                s2 = np.sum(np.log(fill_grid + 1), axis=1)    # fill-up density
                scores[:, i] = s1 + len(data_trf[0])*s2       # combined score; fill-up term weighted by dimensionality
                scores[np.sum(fill_grid, axis=1) == 0, i] = 0     # 0 score where nothing is filled up
                scores[np.prod(fitted_grid, axis=1) == 0, i] = 0  # 0 score for improbable entries
        return scores