
Imitate

Imitate generates points to mitigate a dataset's bias.

Imitate investigates the dataset's probability density, then adds generated points in order to smooth out the density and have it resemble a Gaussian, the most common density occurring in real-world applications. If the artificial points focus on certain areas and are not widespread, this could indicate a Selection Bias where these areas are underrepresented in the sample.
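To make the idea concrete, here is a minimal 1D sketch of the density-gap intuition. It uses scipy's gaussian_kde and norm with a naive moment fit; Imitate's actual fit is bias-aware, so this illustrates the idea rather than the library's own routine.

import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
sample = rng.normal(0, 1, 2000)
biased = sample[~((sample > 0.5) & (sample < 1.5))]  # cut a hole -> selection bias

grid = np.linspace(-4, 4, 200)
kde_vals = gaussian_kde(biased)(grid)                 # observed density (KDE)
fitted = norm.pdf(grid, biased.mean(), biased.std())  # naive Gaussian fit
fill_up = np.clip(fitted - kde_vals, 0, None)         # density gap to fill

print(grid[np.argmax(fill_up)])  # the largest gap lies inside the cut-out region

The concentrated, non-widespread gap is exactly the signature Imitate interprets as a potential selection bias.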

See our paper [1]_ for details.

Attributes:

icas : dict(string or int or float: sklearn.decomposition.FastICA)
    A dictionary mapping each label in the training set to a FastICA object trained on that label's subset.

grids : dict(string or int or float: numpy.ndarray (2D))
    A dictionary mapping a class label to its grids per dimension, over which the KDE was evaluated.

vals : dict(string or int or float: numpy.ndarray (2D))
    A KDE density representation of the dataset, evaluated over grids.

fitted : dict(string or int or float: numpy.ndarray (2D))
    The fitted Gaussian PDF, evaluated over grids.

fill_up : dict(string or int or float: numpy.ndarray (2D))
    vals - fitted, evaluated over grids.

num_fill_up : dict(string or int or float: numpy.array (1D))
    The number of points to add to mitigate the bias, per label and dimension.
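After fitting, these attributes can be inspected directly; a short sketch, assuming integer labels and the variables from the Examples below:

>>> imi = Imitate()
>>> imi.fit(X_b, labels=y_b)
>>> imi.num_fill_up[0]  # points to add for label 0, per ICA dimension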

Methods

fit(data, labels=[], bounds={}, bounds_set=None, strength=1000)
    Fits the Imitate Gaussians to a dataset.
score(data, score_type='fill')
    Scores new data based on Imitate's fitted Gaussians.
augment()
    Augments the fitted dataset to mitigate its bias.

References

.. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. "Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.imitate import *
>>> import matplotlib.pyplot as plt
>>> import numpy as np

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Imitate

>>> imi = Imitate()

fit Imitate to the biased dataset

>>> imi.fit(X_b, labels=y_b)

visualize data per cluster in ICA space

>>> for l in np.unique(y_b):
...     data_transformed = imi.icas[l].transform(X_b[y_b == l])
...     plt.scatter(data_transformed[:,0], data_transformed[:,1])
...     plt.title('Class '+str(l))
...     plt.show()

create some random points to score

>>> rnd_points = np.column_stack((np.random.uniform(min(X[:,0]), max(X[:,0]), size=1000),
...                               np.random.uniform(min(X[:,1]), max(X[:,1]), size=1000)))

score the random points

>>> scores_fill = imi.score(rnd_points, score_type='fill')
>>> scores_balanced = imi.score(rnd_points, score_type='balanced')

visualize data per cluster in ICA space

>>> for l in np.unique(y_b):
...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_fill[:,int(l)])
...     plt.title('Class '+str(l)+'; Score type = fill')
...     plt.colorbar()
...     plt.show()
...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_balanced[:,int(l)])
...     plt.title('Class '+str(l)+'; Score type = balanced')
...     plt.colorbar()
...     plt.show()

augment the dataset

>>> X_gen, y_gen = imi.augment()

visualize data per cluster in ICA space

>>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)
>>> plt.scatter(X_gen[:,0], X_gen[:,1], c=y_gen, edgecolors='red')
>>> plt.title('Dataset with generated points (red)')
>>> plt.show()
Source code in imitatebias\imitate.py
class Imitate:
    """Imitate generates points to mitigate a dataset's bias.

    Imitate investigates the dataset's probability density, then adds generated points 
    in order to smooth out the density and have it resemble a Gaussian, the most common 
    density occurring in real-world applications. If the artificial points focus on 
    certain areas and are not widespread, this could indicate a Selection Bias where 
    these areas are underrepresented in the sample.

    See our paper [1]_ for details.

    Attributes
    ----------
    icas : dict(string or int or float: sklearn.decomposition.FastICA)
        A dictionary mapping each label in the training set to a `FastICA` object
        trained on that label's subset.
    grids : dict(string or int or float: numpy.ndarray (2D))
        A dictionary mapping a class label to its corresponding grids per dimension 
        over which KDE was evaluated.
    vals : dict(string or int or float: numpy.ndarray (2D))
        A KDE density representation of the dataset evaluated over `grids`.
    fitted : dict(string or int or float: numpy.ndarray (2D))
        Fitted Gaussian PDF evaluated over `grids`.
    fill_up : dict(string or int or float: numpy.ndarray (2D))
        `vals - fitted`, evaluated over `grids`.
    num_fill_up : dict(string or int or float: numpy.array (1D))
        The necessary number of points to add to mitigate the bias; per label and 
        dimension.

    Methods
    -------
    fit(data, labels=[], bounds={}, bounds_set=None, strength=1000)
        Fits the Imitate Gaussians to a dataset.
    score(data, score_type='fill')
        Scores new data based on Imitate's fitted Gaussians.
    augment()
        Augments the fitted dataset to mitigate its bias.

    References
    ----------
    .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. 
       "Your Best Guess When You Know Nothing: Identification and Mitigation of 
       Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
       pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

    Examples
    --------
    >>> from imitatebias.generators import *
    >>> from imitatebias.imitate import *
    >>> import matplotlib.pyplot as plt
    >>> import numpy as np

    Generate a dataset.
    >>> X, y = generateData(1000, 2, 2, seed=2210)

    Generate a biased dataset.
    >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

    initialize Imitate
    >>> imi = Imitate()

    fit Imitate to the biased dataset
    >>> imi.fit(X_b, labels=y_b)

    visualize data per cluster in ICA space
    >>> for l in np.unique(y_b):
    ...     data_transformed = imi.icas[l].transform(X_b[y_b == l])
    ...     plt.scatter(data_transformed[:,0], data_transformed[:,1])
    ...     plt.title('Class '+str(l))
    ...     plt.show()

    create some random points to score
    >>> rnd_points = np.column_stack((np.random.uniform(min(X[:,0]), max(X[:,0]), size=1000),
    ...                               np.random.uniform(min(X[:,1]), max(X[:,1]), size=1000)))

    score the random points
    >>> scores_fill = imi.score(rnd_points, score_type='fill')
    >>> scores_balanced = imi.score(rnd_points, score_type='balanced')

    visualize data per cluster in ICA space
    >>> for l in np.unique(y_b):
    ...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_fill[:,int(l)])
    ...     plt.title('Class '+str(l)+'; Score type = fill')
    ...     plt.colorbar()
    ...     plt.show()
    ...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_balanced[:,int(l)])
    ...     plt.title('Class '+str(l)+'; Score type = balanced')
    ...     plt.colorbar()
    ...     plt.show()

    augment the dataset
    >>> X_gen, y_gen = imi.augment()

    visualize data per cluster in ICA space
    >>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)
    >>> plt.scatter(X_gen[:,0], X_gen[:,1], c=y_gen, edgecolors='red')
    >>> plt.title('Dataset with generated points (red)')
    >>> plt.show()
    """

    def __init__(self):
        """Imitate Constructor."""
        self.icas = {}
        self.grids = {}
        self.vals = {}
        self.fitted = {}
        self.fill_up = {}
        self.num_fill_up = {}

    def fit(self, data, labels=[], bounds={}, bounds_set=None, strength=1000):
        """Fits a bias-aware multivariate Gaussian per label to the data.

        Given a dataset and an optional label array, Imitate splits the data per
        class and operates on each subset individually. For each label, fit fits
        a multivariate Gaussian to the subset in a way that accounts for potential
        biases. See our paper [1]_ for details.
        Custom bounds can be defined to constrain the fitting process. The strength
        parameter controls how strongly these bounds are enforced (the unbounded
        version uses `strength=1`).

        Parameters
        ----------
        data : numpy.ndarray (2D)
            Potentially biased input dataset.
        labels : numpy.array (1D), optional
            Labels corresponding to the dataset if available.
        bounds : dict(string or int or float: numpy.ndarray (2D)), optional
            Bounds for Imitate, given per label, in the shape
            ``[[min_0, max_0], ..., [min_d, max_d]]``
            for d dimensions. Use a dictionary to map each label to its bounds.
        bounds_set : numpy.ndarray (2D), optional
            If Imitate should be bounded to the ranges of a certain dataset, that set
            can be passed directly. Ignored if `bounds` is specified.
        strength : int, default=1000
            Controls how strongly the bounds are enforced. Will be ignored if no
            bounds are specified.

        References
        ----------
        .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker.
           "Your Best Guess When You Know Nothing: Identification and Mitigation of 
           Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
           pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.imitate import *
        >>> import matplotlib.pyplot as plt
        >>> import numpy as np

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Imitate
        >>> imi = Imitate()

        fit Imitate to the biased dataset
        >>> imi.fit(X_b, labels=y_b)

        visualize data per cluster in ICA space
        >>> for l in np.unique(y_b):
        ...     data_transformed = imi.icas[l].transform(X_b[y_b == l])
        ...     plt.scatter(data_transformed[:,0], data_transformed[:,1])
        ...     plt.title('Class '+str(l))
        ...     plt.show()
        """
        self.data = data
        self.labels = np.zeros(len(data)).astype(int) if len(labels)==0 else labels
        for l in np.unique(self.labels):
            d = data[self.labels == l]
            self.icas[l] = FastICA(n_components=len(d[0]), whiten='arbitrary-variance')
            self.icas[l].fit(d)
            d_trf = self.icas[l].transform(d)

            if len(bounds) > 0:
                b = bounds.get(l)
                p_gen = np.column_stack([np.random.uniform(*b[i], 1000) for i in range(len(d[0]))])
                p_trf = self.icas[l].transform(p_gen)
                bounds_trf = np.vstack((p_trf.min(axis=0), p_trf.max(axis=0))).transpose()
                range_trf = bounds_trf[:,1] - bounds_trf[:,0]
                bounds_relaxed = np.vstack((bounds_trf[:,0]-0.1*range_trf, bounds_trf[:,1]+0.1*range_trf)).transpose()

                self.grids[l], self.vals[l], self.fitted[l], self.fill_up[l], self.num_fill_up[l], _, _, _ = run_imitate(
                    d_trf, bounds=bounds_relaxed, strength=strength)
            elif bounds_set is not None:
                p_trf = self.icas[l].transform(bounds_set)
                bounds_trf = np.vstack((p_trf.min(axis=0), p_trf.max(axis=0))).transpose()
                range_trf = bounds_trf[:,1] - bounds_trf[:,0]
                bounds_relaxed = np.vstack((bounds_trf[:,0]-0.1*range_trf, bounds_trf[:,1]+0.1*range_trf)).transpose()

                self.grids[l], self.vals[l], self.fitted[l], self.fill_up[l], self.num_fill_up[l], _, _, _ = run_imitate(
                    d_trf, bounds=bounds_relaxed, strength=strength)
            else:
                self.grids[l], self.vals[l], self.fitted[l], self.fill_up[l], self.num_fill_up[l], _, _, _ = run_imitate(
                    d_trf, bounds=None, strength=1)

    def score(self, data, score_type='fill'):
        """Scores new data based on Imitate's fitted Gaussian.

        Imitate fits one multivariate Gaussian per label in a dataset. Scores are 
        obtained via the difference of those Gaussians' PDFs and the input data
        (represented via a KDE estimate). See our paper [1]_ for details.

        Parameters
        ----------
        data : numpy.ndarray (2D)
            Data that shall be scored. This dataset does not need to match the input data,
            but it is required to have the same dimensionality.
        score_type : {'fill', 'balanced'}, default='fill'
            Selects the type of score. `'fill'` measures how well a data point fills in
            the identified bias, i.e., it quantifies the difference between the fitted
            and the observed dataset distribution. The score is set to 0 if the
            3-std-truncated fitted Gaussian's PDF at this point evaluates to 0.
            `'balanced'` additionally takes into account how likely a point is to be
            observed in this dataset. See our paper [2]_ for details.

        Returns
        -------
        np.ndarray (2D)
            Score (i,j) corresponds to data point D_i and input data label j. 

        References
        ----------
        .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. 
           "Your Best Guess When You Know Nothing: Identification and Mitigation of 
           Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
           pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

        .. [2] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg 
           Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of 
           Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th 
           Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 
           13281, pp. 149-160. Springer, Cham (2022).

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.imitate import *
        >>> import matplotlib.pyplot as plt
        >>> import numpy as np

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Imitate
        >>> imi = Imitate()

        fit Imitate to the biased dataset
        >>> imi.fit(X_b, labels=y_b)

        create some random points to score
        >>> rnd_points = np.column_stack((np.random.uniform(min(X[:,0]), max(X[:,0]), size=1000),
        ...                               np.random.uniform(min(X[:,1]), max(X[:,1]), size=1000)))

        score the random points
        >>> scores_fill = imi.score(rnd_points, score_type='fill')
        >>> scores_balanced = imi.score(rnd_points, score_type='balanced')

        visualize data per cluster in ICA space
        >>> for l in np.unique(y_b):
        ...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_fill[:,int(l)])
        ...     plt.title('Class '+str(l)+'; Score type = fill')
        ...     plt.colorbar()
        ...     plt.show()
        ...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_balanced[:,int(l)])
        ...     plt.title('Class '+str(l)+'; Score type = balanced')
        ...     plt.colorbar()
        ...     plt.show()
        """
        scores = np.zeros((len(data), len(self.icas)))
        for i,l in enumerate(np.unique(self.labels)): # fill scores[:,i]
            data_trf = self.icas[l].transform(data)
            grids, fitted, fill_up = self.grids[l], self.fitted[l], self.fill_up[l]

            fitted_grid = np.zeros((len(data_trf), len(data_trf[0]))) # points x dims
            fill_grid = np.zeros((len(data_trf), len(data_trf[0]))) # points x dims
            for d in range(len(data_trf[0])):
                # organize in grid cells: 0 = smaller; len(grids[0]) = larger
                grid_dim = np.digitize(data_trf[:,d], grids[d]) # points x dims
                map_to_fitted = np.vectorize(lambda idx: 0 if idx<=0 or idx>=len(grids[d]) else fitted[d][idx-1])
                map_to_fill = np.vectorize(lambda idx: 0 if idx<=0 or idx>=len(grids[d]) else fill_up[d][idx-1])
                fitted_grid[:, d] = map_to_fitted(grid_dim)
                fill_grid[:, d] = map_to_fill(grid_dim)
            if score_type == 'fill':
                scores[:, i] = np.sum(fill_grid, axis=1)
                scores[np.prod(fitted_grid, axis=1) == 0, i] = 0  # 0 score for improbable entries
            elif score_type == 'balanced':
                s1 = np.sum(np.log(fitted_grid + 1), axis=1)  # fitted distribution
                s2 = np.sum(np.log(fill_grid + 1), axis=1)    # fill-up density
                scores[:, i] = s1 + len(data_trf[0])*s2       # combined score; fill-up term weighted by dimensionality
                scores[np.sum(fill_grid, axis=1) == 0, i] = 0     # 0 score where nothing is filled up
                scores[np.prod(fitted_grid, axis=1) == 0, i] = 0  # 0 score for improbable entries
        return scores

    def augment(self):
        """Augments the fitted dataset to mitigate its bias.

        Generates points to mitigate the bias in the input dataset provided to the `fit` method.
        The number of generated points per label is determined by `Imitate.num_fill_up`.

        Returns
        -------
        numpy.ndarray (2D)
            Generated points.
        numpy.array (1D)
            Corresponding labels.

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.imitate import *
        >>> import matplotlib.pyplot as plt

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Imitate
        >>> imi = Imitate()

        fit Imitate to the biased dataset
        >>> imi.fit(X_b, labels=y_b)

        augment the dataset
        >>> X_gen, y_gen = imi.augment()

        visualize data per cluster in ICA space
        >>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)
        >>> plt.scatter(X_gen[:,0], X_gen[:,1], c=y_gen, edgecolors='red')
        >>> plt.title('Dataset with generated points (red)')
        >>> plt.show()
        """
        gen_points = np.empty((0, len(self.data[0])))
        gen_labels = []

        for l in np.unique(self.labels):
            num_fill_up = self.num_fill_up[l]
            num_gen = np.max(num_fill_up).astype(int)
            if num_gen == 0: continue
            grids, fitted, fill_up = self.grids[l], self.fitted[l], self.fill_up[l]

            points = np.empty((num_gen, 0))
            for d in range(len(self.data[0])):
                # mixed distribution: rescaled fitted Gaussian plus the fill-up density
                fill = fitted[d] / np.sum(fitted[d]) * (num_gen - num_fill_up[d]) + fill_up[d]
                fill_cdf = np.cumsum(fill) / num_gen  # normalize to a CDF

                # inverse-CDF sampling: draw uniforms, find their grid cells,
                # then sample uniformly within each cell
                u = np.random.rand(num_gen)
                val_bins = np.searchsorted(fill_cdf, u)
                coords = np.array([np.random.uniform(grids[d][val_bins[i]], grids[d][val_bins[i]+1])
                                   for i in range(num_gen)]).reshape(num_gen, 1)
                points = np.concatenate((points, coords), axis=1)

            gen_points = np.concatenate((gen_points, self.icas[l].inverse_transform(points)))
            gen_labels = np.append(gen_labels, [l]*num_gen)
        return gen_points, gen_labels

__init__()

Imitate Constructor.

Source code in imitatebias\imitate.py
def __init__(self):
    """Imitate Constructor."""
    self.icas = {}
    self.grids = {}
    self.vals = {}
    self.fitted = {}
    self.fill_up = {}
    self.num_fill_up = {}

augment()

Augments the fitted dataset to mitigate its bias.

Generates points to mitigate the bias in the input dataset provided to the fit method. The number of generated points per label is determined by Imitate.num_fill_up.
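A typical next step is to merge the generated points with the biased dataset before training a downstream model; a short sketch, assuming X_b, y_b, X_gen, and y_gen from the Examples below:

>>> import numpy as np
>>> X_aug = np.vstack((X_b, X_gen))
>>> y_aug = np.append(y_b, y_gen)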

Returns:

numpy.ndarray (2D)
    Generated points.
numpy.array (1D)
    Corresponding labels.

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.imitate import *
>>> import matplotlib.pyplot as plt

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Imitate

>>> imi = Imitate()

fit Imitate to the biased dataset

>>> imi.fit(X_b, labels=y_b)

augment the dataset

>>> X_gen, y_gen = imi.augment()

visualize data per cluster in ICA space

>>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)
>>> plt.scatter(X_gen[:,0], X_gen[:,1], c=y_gen, edgecolors='red')
>>> plt.title('Dataset with generated points (red)')
>>> plt.show()
Source code in imitatebias\imitate.py
def augment(self):
    """Augments the fitted dataset to mitigate its bias.

    Generates points to mitigate the bias in the input dataset provided to the `fit` method.
    The number of generated points per label is determined by `Imitate.num_fill_up`.

    Returns
    -------
    numpy.ndarray (2D)
        Generated points.
    numpy.array (1D)
        Corresponding labels.

    Examples
    --------
    >>> from imitatebias.generators import *
    >>> from imitatebias.imitate import *
    >>> import matplotlib.pyplot as plt

    Generate a dataset.
    >>> X, y = generateData(1000, 2, 2, seed=2210)

    Generate a biased dataset.
    >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

    initialize Imitate
    >>> imi = Imitate()

    fit Imitate to the biased dataset
    >>> imi.fit(X_b, labels=y_b)

    augment the dataset
    >>> X_gen, y_gen = imi.augment()

    visualize data per cluster in ICA space
    >>> plt.scatter(X_b[:,0], X_b[:,1], c=y_b)
    >>> plt.scatter(X_gen[:,0], X_gen[:,1], c=y_gen, edgecolors='red')
    >>> plt.title('Dataset with generated points (red)')
    >>> plt.show()
    """
    gen_points = np.empty((0, len(self.data[0])))
    gen_labels = []

    for l in np.unique(self.labels):
        num_fill_up = self.num_fill_up[l]
        num_gen = np.max(num_fill_up).astype(int)
        if num_gen == 0: continue
        grids, fitted, fill_up = self.grids[l], self.fitted[l], self.fill_up[l]

        points = np.empty((num_gen, 0))
        for d in range(len(self.data[0])):
            # mixed distribution: rescaled fitted Gaussian plus the fill-up density
            fill = fitted[d] / np.sum(fitted[d]) * (num_gen - num_fill_up[d]) + fill_up[d]
            fill_cdf = np.cumsum(fill) / num_gen  # normalize to a CDF

            # inverse-CDF sampling: draw uniforms, find their grid cells,
            # then sample uniformly within each cell
            u = np.random.rand(num_gen)
            val_bins = np.searchsorted(fill_cdf, u)
            coords = np.array([np.random.uniform(grids[d][val_bins[i]], grids[d][val_bins[i]+1])
                               for i in range(num_gen)]).reshape(num_gen, 1)
            points = np.concatenate((points, coords), axis=1)

        gen_points = np.concatenate((gen_points, self.icas[l].inverse_transform(points)))
        gen_labels = np.append(gen_labels, [l]*num_gen)
    return gen_points, gen_labels

fit(data, labels=[], bounds={}, bounds_set=None, strength=1000)

Fits a bias-aware multivariate Gaussian per label to the data.

Given a dataset and an optional label array, Imitate splits the data per class and operates on each subset individually. For each label, fit fits a multivariate Gaussian to the subset in a way that accounts for potential biases. See our paper [1]_ for details. Custom bounds can be defined to constrain the fitting process; the strength parameter controls how strongly these bounds are enforced (the unbounded version uses strength=1).
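For example, if the features are known to lie within fixed ranges, a bounds dictionary can be passed per label; a sketch in which the ranges and the labels 0 and 1 are hypothetical:

>>> import numpy as np
>>> my_bounds = {0: np.array([[0.0, 10.0], [-5.0, 5.0]]),
...              1: np.array([[0.0, 10.0], [-5.0, 5.0]])}
>>> imi = Imitate()
>>> imi.fit(X_b, labels=y_b, bounds=my_bounds, strength=1000)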

Parameters:

data : numpy.ndarray (2D), required
    Potentially biased input dataset.
labels : numpy.array (1D), optional, default=[]
    Labels corresponding to the dataset, if available.
bounds : dict(string or int or float: numpy.ndarray (2D)), optional, default={}
    Bounds for Imitate, given per label in the shape [[min_0, max_0], ..., [min_d, max_d]] for d dimensions. Use a dictionary to map each label to its bounds.
bounds_set : numpy.ndarray (2D), optional, default=None
    If Imitate should be bounded to the ranges of a certain dataset, that set can be passed directly. Ignored if bounds is specified.
strength : int, default=1000
    Controls how strongly the bounds are enforced. Ignored if no bounds are specified.
References

.. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. "Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.imitate import *
>>> import matplotlib.pyplot as plt
>>> import numpy as np

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Imitate

>>> imi = Imitate()

fit Imitate to the biased dataset

>>> imi.fit(X_b, labels=y_b)

visualize data per cluster in ICA space

>>> for l in np.unique(y_b):
...     data_transformed = imi.icas[l].transform(X_b[y_b == l])
...     plt.scatter(data_transformed[:,0], data_transformed[:,1])
...     plt.title('Class '+str(l))
...     plt.show()
Source code in imitatebias\imitate.py
    def fit(self, data, labels=[], bounds={}, bounds_set=None, strength=1000):
        """Fits a bias-aware multivariate Gaussian per label to the data.

        Given a dataset and an optional label array, Imitate splits the data per
        class and operates on each subset individually. For each label, fit fits
        a multivariate Gaussian to the subset in a way that accounts for potential
        biases. See our paper [1]_ for details.
        Custom bounds can be defined to constrain the fitting process. The strength
        parameter controls how strongly these bounds are enforced (the unbounded
        version uses `strength=1`).

        Parameters
        ----------
        data : numpy.ndarray (2D)
            Potentially biased input dataset.
        labels : numpy.array (1D), optional
            Labels corresponding to the dataset if available.
        bounds : dict(string or int or float: numpy.ndarray (2D)), optional
            Bounds for Imitate, given per label, in the shape
            ``[[min_0, max_0], ..., [min_d, max_d]]``
            for d dimensions. Use a dictionary to map each label to its bounds.
        bounds_set : numpy.ndarray (2D), optional
            If Imitate should be bounded to the ranges of a certain dataset, that set
            can be passed directly. Ignored if `bounds` is specified.
        strength : int, default=1000
            Controls how strongly the bounds are enforced. Will be ignored if no
            bounds are specified.

        References
        ----------
        .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker.
           "Your Best Guess When You Know Nothing: Identification and Mitigation of 
           Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
           pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.imitate import *
        >>> import matplotlib.pyplot as plt
        >>> import numpy as np

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Imitate
        >>> imi = Imitate()

        fit Imitate to the biased dataset
        >>> imi.fit(X_b, labels=y_b)

        visualize data per cluster in ICA space
        >>> for l in np.unique(y_b):
        ...     data_transformed = imi.icas[l].transform(X_b[y_b == l])
        ...     plt.scatter(data_transformed[:,0], data_transformed[:,1])
        ...     plt.title('Class '+str(l))
        ...     plt.show()
        """
        self.data = data
        self.labels = np.zeros(len(data)).astype(int) if len(labels)==0 else labels
        for l in np.unique(self.labels):
            d = data[self.labels == l]
            self.icas[l] = FastICA(n_components=len(d[0]), whiten='arbitrary-variance')
            self.icas[l].fit(d)
            d_trf = self.icas[l].transform(d)

            if len(bounds) > 0:
                b = bounds.get(l)
                p_gen = np.column_stack([np.random.uniform(*b[i], 1000) for i in range(len(d[0]))])
                p_trf = self.icas[l].transform(p_gen)
                bounds_trf = np.vstack((p_trf.min(axis=0), p_trf.max(axis=0))).transpose()
                range_trf = bounds_trf[:,1] - bounds_trf[:,0]
                bounds_relaxed = np.vstack((bounds_trf[:,0]-0.1*range_trf, bounds_trf[:,1]+0.1*range_trf)).transpose()

                self.grids[l], self.vals[l], self.fitted[l], self.fill_up[l], self.num_fill_up[l], _, _, _ = run_imitate(
                    d_trf, bounds=bounds_relaxed, strength=strength)
            elif bounds_set is not None:
                p_trf = self.icas[l].transform(bounds_set)
                bounds_trf = np.vstack((p_trf.min(axis=0), p_trf.max(axis=0))).transpose()
                range_trf = bounds_trf[:,1] - bounds_trf[:,0]
                bounds_relaxed = np.vstack((bounds_trf[:,0]-0.1*range_trf, bounds_trf[:,1]+0.1*range_trf)).transpose()

                self.grids[l], self.vals[l], self.fitted[l], self.fill_up[l], self.num_fill_up[l], _, _, _ = run_imitate(
                    d_trf, bounds=bounds_relaxed, strength=strength)
            else:
                self.grids[l], self.vals[l], self.fitted[l], self.fill_up[l], self.num_fill_up[l], _, _, _ = run_imitate(
                    d_trf, bounds=None, strength=1)

score(data, score_type='fill')

Scores new data based on Imitate's fitted Gaussians.

Imitate fits one multivariate Gaussian per label in a dataset. Scores are obtained via the difference of those Gaussians' PDFs and the input data (represented via a KDE estimate). See our paper [1]_ for details.

Parameters:

data : numpy.ndarray (2D), required
    Data to be scored. This dataset does not need to match the input data, but it must have the same dimensionality.
score_type : {'fill', 'balanced'}, optional, default='fill'
    Selects the type of score. 'fill' measures how well a data point fills in the identified bias, i.e., it quantifies the difference between the fitted and the observed dataset distribution. The score is set to 0 if the 3-std-truncated fitted Gaussian's PDF at this point evaluates to 0. 'balanced' additionally takes into account how likely a point is to be observed in this dataset. See our paper [2]_ for details.

Returns:

numpy.ndarray (2D)
    Score (i, j) corresponds to data point D_i and input-data label j.
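The score matrix can be used, for instance, to pick the candidate points that best fill the identified bias for one label; a sketch, assuming scores_fill and rnd_points from the Examples below and integer labels:

>>> import numpy as np
>>> label = 0
>>> best = np.argsort(scores_fill[:, label])[::-1][:10]  # ten highest-scoring points
>>> candidates = rnd_points[best]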

References

.. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. "Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

.. [2] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 13281, pp. 149-160. Springer, Cham (2022).

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.imitate import *
>>> import matplotlib.pyplot as plt
>>> import numpy as np

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Imitate

>>> imi = Imitate()

fit Imitate to the biased dataset

>>> imi.fit(X_b, labels=y_b)

create some random points to score

>>> rnd_points = np.column_stack((np.random.uniform(min(X[:,0]), max(X[:,0]), size=1000),
...                               np.random.uniform(min(X[:,1]), max(X[:,1]), size=1000)))

score the random points

>>> scores_fill = imi.score(rnd_points, score_type='fill')
>>> scores_balanced = imi.score(rnd_points, score_type='balanced')

visualize data per cluster in ICA space

>>> for l in np.unique(y_b):
...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_fill[:,int(l)])
...     plt.title('Class '+str(l)+'; Score type = fill')
...     plt.colorbar()
...     plt.show()
...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_balanced[:,int(l)])
...     plt.title('Class '+str(l)+'; Score type = balanced')
...     plt.colorbar()
...     plt.show()
Source code in imitatebias\imitate.py
    def score(self, data, score_type='fill'):
        """Scores new data based on Imitate's fitted Gaussian.

        Imitate fits one multivariate Gaussian per label in a dataset. Scores are 
        obtained via the difference of those Gaussians' PDFs and the input data
        (represented via a KDE estimate). See our paper [1]_ for details.

        Parameters
        ----------
        data : numpy.ndarray (2D)
            Data that shall be scored. This dataset does not need to match the input data,
            but it is required to have the same dimensionality.
        score_type : {'fill', 'balanced'}, default='fill'
            Selects the type of score. `'fill'` measures how well a data point fills in
            the identified bias, i.e., it quantifies the difference between the fitted
            and the observed dataset distribution. The score is set to 0 if the
            3-std-truncated fitted Gaussian's PDF at this point evaluates to 0.
            `'balanced'` additionally takes into account how likely a point is to be
            observed in this dataset. See our paper [2]_ for details.

        Returns
        -------
        np.ndarray (2D)
            Score (i,j) corresponds to data point D_i and input data label j. 

        References
        ----------
        .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. 
           "Your Best Guess When You Know Nothing: Identification and Mitigation of 
           Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
           pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

        .. [2] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg 
           Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of 
           Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th 
           Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 
           13281, pp. 149-160. Springer, Cham (2022).

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.imitate import *
        >>> import matplotlib.pyplot as plt
        >>> import numpy as np

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Imitate
        >>> imi = Imitate()

        fit Imitate to the biased dataset
        >>> imi.fit(X_b, labels=y_b)

        create some random points to score
        >>> rnd_points = np.column_stack((np.random.uniform(min(X[:,0]), max(X[:,0]), size=1000),
        ...                               np.random.uniform(min(X[:,1]), max(X[:,1]), size=1000)))

        score the random points
        >>> scores_fill = imi.score(rnd_points, score_type='fill')
        >>> scores_balanced = imi.score(rnd_points, score_type='balanced')

        visualize data per cluster in ICA space
        >>> for l in np.unique(y_b):
        ...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_fill[:,int(l)])
        ...     plt.title('Class '+str(l)+'; Score type = fill')
        ...     plt.colorbar()
        ...     plt.show()
        ...     plt.scatter(rnd_points[:,0], rnd_points[:,1], c=scores_balanced[:,int(l)])
        ...     plt.title('Class '+str(l)+'; Score type = balanced')
        ...     plt.colorbar()
        ...     plt.show()
        """
        scores = np.zeros((len(data), len(self.icas)))
        for i,l in enumerate(np.unique(self.labels)): # fill scores[:,i]
            data_trf = self.icas[l].transform(data)
            grids, fitted, fill_up = self.grids[l], self.fitted[l], self.fill_up[l]

            fitted_grid = np.zeros((len(data_trf), len(data_trf[0]))) # points x dims
            fill_grid = np.zeros((len(data_trf), len(data_trf[0]))) # points x dims
            for d in range(len(data_trf[0])):
                # organize in grid cells: 0 = smaller; len(grids[0]) = larger
                grid_dim = np.digitize(data_trf[:,d], grids[d]) # points x dims
                map_to_fitted = np.vectorize(lambda idx: 0 if idx<=0 or idx>=len(grids[d]) else fitted[d][idx-1])
                map_to_fill = np.vectorize(lambda idx: 0 if idx<=0 or idx>=len(grids[d]) else fill_up[d][idx-1])
                fitted_grid[:, d] = map_to_fitted(grid_dim)
                fill_grid[:, d] = map_to_fill(grid_dim)
            if score_type == 'fill':
                scores[:, i] = np.sum(fill_grid, axis=1)
                scores[np.prod(fitted_grid, axis=1) == 0, i] = 0  # 0 score for improbable entries
            elif score_type == 'balanced':
                s1 = np.sum(np.log(fitted_grid + 1), axis=1)  # fitted distribution
                s2 = np.sum(np.log(fill_grid + 1), axis=1)    # fill-up density
                scores[:, i] = s1 + len(data_trf[0])*s2       # combined score; fill-up term weighted by dimensionality
                scores[np.sum(fill_grid, axis=1) == 0, i] = 0     # 0 score where nothing is filled up
                scores[np.prod(fitted_grid, axis=1) == 0, i] = 0  # 0 score for improbable entries
        return scores