Mimic

`Mimic`

Mimic generates points to mitigate a multi-cluster dataset's bias.

Machine learning can help overcome human biases in decision making by drawing purely logical conclusions from the training data. If the training data is biased, however, that bias is transferred to the model and remains undetected, because performance is validated on a test set drawn from the same biased distribution.

Existing strategies for identifying and mitigating selection bias generally rely on some knowledge of the bias or the ground truth. An exception is the Imitate [1] algorithm, which assumes no such knowledge but comes with a strong limitation: it can only model datasets with one normally distributed cluster per class.

Mimic uses Imitate as a building block but relaxes this limitation. By allowing mixtures of multivariate Gaussians, our technique can model multi-cluster datasets and provide solutions for a substantially wider set of problems.
See our paper [2] for details.
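To make the one-cluster limitation concrete, here is a small sketch that uses only NumPy and none of the library's own code. The cluster positions, sample sizes, and the distance threshold are arbitrary assumptions chosen for illustration. It shows that a single Gaussian fitted to a two-cluster class centers itself in a region with almost no data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated 2D Gaussian clusters that share one class label.
cluster_a = rng.multivariate_normal(mean=[-5.0, 0.0], cov=np.eye(2), size=500)
cluster_b = rng.multivariate_normal(mean=[5.0, 0.0], cov=np.eye(2), size=500)
points = np.vstack([cluster_a, cluster_b])

# A single Gaussian fitted to the whole class puts its mean between the
# clusters, in a region that contains almost no data.
single_mean = points.mean(axis=0)
dist_to_mean = np.linalg.norm(points - single_mean, axis=1)
share_near_mean = np.mean(dist_to_mean < 2.0)

print("single-Gaussian mean:", single_mean)
print("share of points within distance 2 of that mean:", share_near_mean)
```

Modelling each cluster with its own Gaussian avoids this mismatch, which is the motivation for the mixture approach described above.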

Attributes:

Name	Type	Description
`params`	`dict(int: numpy.ndarray (2D))`	A label-indexed dictionary. Each value holds one (mean, cov) tuple per cluster identified for that label.
`data`	`numpy.ndarray (2D)`	The dataset Mimic is fitted to.
`labels`	`numpy.array (1D)`	The corresponding labels. Labels need to be integer values.
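The layout of `params` can be read off the way `predict_cluster` indexes it in the source (`self.params[l][i][0]` for the mean, `self.params[l][i][1]` for the covariance). The sketch below builds a hand-written stand-in with that layout. The numeric values are made up, and a real fitted object may use a different container type than a plain list.

```python
import numpy as np

# Hypothetical stand-in for `Mimic.params` after fitting: class label ->
# sequence of (mean, cov) tuples, one tuple per identified cluster.
params = {
    0: [(np.array([0.0, 0.0]), np.eye(2)),
        (np.array([4.0, 4.0]), 0.5 * np.eye(2))],
    1: [(np.array([-3.0, 2.0]), np.eye(2))],
}

# Walk over every class and every cluster found for it.
for label, clusters in params.items():
    for i, (mean, cov) in enumerate(clusters):
        print(f"class {label}, cluster {i}: mean={mean}, cov shape={cov.shape}")

# Number of clusters per class.
n_clusters = {label: len(clusters) for label, clusters in params.items()}
print(n_clusters)
```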

Methods

Name	Description
`fit(data, labels=[], centers=None)`	Fits the Mimic Gaussians to a dataset.
`predict_cluster(which_class)`	Predicts clusters for the input dataset.
`augment()`	Augments the fitted dataset to mitigate its bias.

References

[1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. "Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

[2] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 13281, pp. 149-160. Springer, Cham (2022).

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.mimic import *
>>> import matplotlib.pyplot as plt

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Mimic

>>> mim = Mimic()

fit to the biased dataset

>>> mim.fit(X_b, labels=y_b)

predict cluster assignment for class 0

>>> predicted_clusters = mim.predict_cluster(0)

plot the resulting clusters for class 0

>>> plt.scatter(X_b[y_b == 0, 0], X_b[y_b == 0, 1], c=predicted_clusters)
>>> plt.show()

augment the data

>>> gen_p, gen_l = mim.augment()

plot the result

>>> plt.scatter(X_b[:,0], X_b[:,1], label='dataset')
>>> plt.scatter(gen_p[:,0], gen_p[:,1], label='generated points')
>>> plt.legend()
>>> plt.show()

Source code in imitatebias\mimic.py

class Mimic:
    """Mimic generates points to mitigate a multi-cluster dataset's bias.

    Machine Learning can help overcome human biases in decision making by focussing 
    on purely logical conclusions based on the training data. If the training data 
    is biased, however, that bias will be transferred to the model and remains 
    undetected as the performance is validated on a test set drawn from the same 
    biased distribution.
    Existing strategies for selection bias identification and mitigation generally 
    rely on some sort of knowledge of the bias or the ground-truth. An exception 
    is the Imitate [1]_ algorithm that assumes no knowledge but comes with a strong 
    limitation: It can only model datasets with one normally distributed cluster 
    per class.
    MIMIC uses Imitate as a building block but relaxes this limitation. By allowing 
    mixtures of multivariate Gaussians, our technique is able to model multi-cluster 
    datasets and provide solutions for a substantially wider set of problems.   
    See our paper [2]_ for details.

    Attributes
    ----------
    params : dict(int: numpy.ndarray (2D))
        A label-indexed dictionary containing (mean, cov) tuples for each identified
        cluster belonging to this label.
    data : numpy.ndarray (2D)
        The dataset Mimic is fitted to.
    labels : numpy.array (1D)
        The corresponding labels. Labels need to be integer values.

    Methods
    -------
    fit(data, labels=[], centers=None)
        Fits the Mimic Gaussians to a dataset.
    predict_cluster(which_class)
        Predicts clusters for the input dataset.
    augment()
        Augments the fitted dataset to mitigate its bias.

    References
    ----------
    .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. 
       "Your Best Guess When You Know Nothing: Identification and Mitigation of 
       Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
       pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

    .. [2] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg
       Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of 
       Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th 
       Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 
       13281, pp. 149-160. Springer, Cham (2022).

    Examples
    --------
    >>> from imitatebias.generators import *
    >>> from imitatebias.mimic import *
    >>> import matplotlib.pyplot as plt

    Generate a dataset.
    >>> X, y = generateData(1000, 2, 2, seed=2210)

    Generate a biased dataset.
    >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

    initialize Mimic
    >>> mim = Mimic()

    fit to the biased dataset
    >>> mim.fit(X_b, labels=y_b)

    predict cluster assignment for class 0
    >>> predicted_clusters = mim.predict_cluster(0)

    plot the resulting clusters for class 0
    >>> plt.scatter(X_b[y_b == 0, 0], X_b[y_b == 0, 1], c=predicted_clusters)
    >>> plt.show()

    augment the data
    >>> gen_p, gen_l = mim.augment()

    plot the result
    >>> plt.scatter(X_b[:,0], X_b[:,1], label='dataset')
    >>> plt.scatter(gen_p[:,0], gen_p[:,1], label='generated points')
    >>> plt.legend()
    >>> plt.show()

    """

    def __init__(self):
        """Mimic Constructor."""
        self.params = {}

    def fit(self, data, labels=[], centers=None):
        """Fits a bias-aware multivariate Gaussian Mixture Model per label to the data.

        See our paper [1]_ for details. This process is slow and substantially less
        powerful than the Imitate algorithm since it additionally needs to cluster the
        dataset into potentially biased overlapping clusters. We only recommend Mimic
        if the user is certain that the dataset contains multiple clusters. 

        Parameters
        ----------
        data : numpy.ndarray (2D)
            The input dataset.
        labels : numpy.array (1D), optional
            The corresponding labels if the dataset contains multiple classes.
        centers : numpy.ndarray (2D), optional
            A list [C1, ..., Cn] of n initial d-dimensional cluster centers 
            Ci = [Ci_0, ..., Ci_d]. If those centers are not provided, the clustering will
            be initialized with KMeans for the K that optimizes the Silhouette score.

        References
        ----------
        .. [1] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg
           Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of 
           Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th 
           Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 
           13281, pp. 149-160. Springer, Cham (2022).

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.mimic import *

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Mimic
        >>> mim = Mimic()

        fit to the biased dataset
        >>> mim.fit(X_b, labels=y_b)



        """
        self.data = data
        self.labels = np.zeros(len(data)).astype(int) if len(labels) == 0 else labels

        for l in np.unique(self.labels):
            d = data[self.labels == l]
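            # NOTE: `centers` is not used in this method body; the initial
            # number of clusters always comes from findK.
            k_init = findK(d)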
            # params = mean/cov for each cluster
            probs_imi, params = run_mimic(d, k_init=k_init)

            # merge the resulting clusters
            probs_merge, params_merge = merge(probs_imi, params, d)

            # store parameters
            self.params[l] = params_merge

    def predict_cluster(self, which_class):
        """Predicts clusters for the input data.

        Assigns clusters to the input data belonging to a specified class. Those clusters
        are selected based on the maximum probability that a point belongs to each of the 
        clusters.

        Parameters
        ----------
        which_class : int
            Filters the data based on the initial labels.

        Returns
        -------
        numpy.array (1D)
            The array containing the assigned clusters.

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.mimic import *
        >>> import matplotlib.pyplot as plt

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Mimic
        >>> mim = Mimic()

        fit to the biased dataset
        >>> mim.fit(X_b, labels=y_b)

        predict cluster assignment for class 0
        >>> predicted_clusters = mim.predict_cluster(0)

        plot the resulting clusters for class 0
        >>> plt.scatter(X_b[y_b == 0, 0], X_b[y_b == 0, 1], c=predicted_clusters)
        >>> plt.show()



        """
        l = which_class
        probs = np.column_stack([multivariate_normal(self.params[l][i][0], self.params[l][i][1]).pdf(
            self.data[self.labels==l]) for i in range(len(self.params[l]))])
        return prob_cluster_assignment(probs)

    def augment(self):
        """Augments the fitted dataset to mitigate its bias.

        Generates points to fill in the gap between fitted and observed distributions
        in the input dataset.

        Returns
        -------
        numpy.ndarray (2D)
            Generated points.
        numpy.array (1D)
            Corresponding class labels.

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.mimic import *
        >>> import matplotlib.pyplot as plt

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Mimic
        >>> mim = Mimic()

        fit to the biased dataset
        >>> mim.fit(X_b, labels=y_b)

        augment the data
        >>> gen_p, gen_l = mim.augment()

        plot the result
        >>> plt.scatter(X_b[:,0], X_b[:,1], label='dataset')
        >>> plt.scatter(gen_p[:,0], gen_p[:,1], label='generated points')
        >>> plt.legend()
        >>> plt.show()



        """
        gen_points = np.empty((0, len(self.data[0])))
        gen_labels = []

        for l in np.unique(self.labels):
            cl_labels = self.predict_cluster(l)
            data_clean = self.data[self.labels==l][cl_labels >= 0]
            cl_labels_clean = cl_labels[cl_labels >= 0]

            points, point_cl_labels = Mimic_augment(data_clean, cl_labels_clean)
            gen_points = np.concatenate((gen_points, points))
            gen_labels = np.append(gen_labels, [l]*len(points))
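        # np.append on the initial empty list yields float64 values, so cast
        # the class labels back to integers before returning.
        return gen_points, np.asarray(gen_labels).astype(int)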

`__init__()`

Mimic Constructor.

Source code in imitatebias\mimic.py

def __init__(self):
    """Mimic Constructor."""
    self.params = {}

`augment()`

Augments the fitted dataset to mitigate its bias.

Generates points to fill in the gap between fitted and observed distributions in the input dataset.

Returns:

Type	Description
`numpy.ndarray (2D)`	Generated points.
`numpy.array (1D)`	Corresponding class labels.
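A typical next step is to append the generated points to the biased dataset before training a model. The sketch below shows that step with small hand-written arrays standing in for the fitted data and for the two return values. The values are made up, and the integer cast is a precaution in case the labels come back as floats.

```python
import numpy as np

# Stand-ins for the fitted dataset and for the two arrays returned by augment().
X_b = np.array([[0.0, 0.1], [0.2, -0.1], [4.0, 4.1]])
y_b = np.array([0, 0, 1])
gen_p = np.array([[0.5, 0.4], [3.8, 4.3]])
gen_l = np.array([0.0, 1.0])

# Stack original and generated points; cast the generated labels to int so
# they match the dtype of the original labels.
X_aug = np.vstack([X_b, gen_p])
y_aug = np.concatenate([y_b, gen_l.astype(int)])

print(X_aug.shape, y_aug)
```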

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.mimic import *
>>> import matplotlib.pyplot as plt

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Mimic

>>> mim = Mimic()

fit to the biased dataset

>>> mim.fit(X_b, labels=y_b)

augment the data

>>> gen_p, gen_l = mim.augment()

plot the result

>>> plt.scatter(X_b[:,0], X_b[:,1], label='dataset')
>>> plt.scatter(gen_p[:,0], gen_p[:,1], label='generated points')
>>> plt.legend()
>>> plt.show()

Source code in imitatebias\mimic.py

def augment(self):
    """Augments the fitted dataset to mitigate its bias.

    Generates points to fill in the gap between fitted and observed distributions
    in the input dataset.

    Returns
    -------
    numpy.ndarray (2D)
        Generated points.
    numpy.array (1D)
        Corresponding class labels.

    Examples
    --------
    >>> from imitatebias.generators import *
    >>> from imitatebias.mimic import *
    >>> import matplotlib.pyplot as plt

    Generate a dataset.
    >>> X, y = generateData(1000, 2, 2, seed=2210)

    Generate a biased dataset.
    >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

    initialize Mimic
    >>> mim = Mimic()

    fit to the biased dataset
    >>> mim.fit(X_b, labels=y_b)

    augment the data
    >>> gen_p, gen_l = mim.augment()

    plot the result
    >>> plt.scatter(X_b[:,0], X_b[:,1], label='dataset')
    >>> plt.scatter(gen_p[:,0], gen_p[:,1], label='generated points')
    >>> plt.legend()
    >>> plt.show()



    """
    gen_points = np.empty((0, len(self.data[0])))
    gen_labels = []

    for l in np.unique(self.labels):
        cl_labels = self.predict_cluster(l)
        data_clean = self.data[self.labels==l][cl_labels >= 0]
        cl_labels_clean = cl_labels[cl_labels >= 0]

        points, point_cl_labels = Mimic_augment(data_clean, cl_labels_clean)
        gen_points = np.concatenate((gen_points, points))
        gen_labels = np.append(gen_labels, [l]*len(points))
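    # np.append on the initial empty list yields float64 values, so cast
    # the class labels back to integers before returning.
    return gen_points, np.asarray(gen_labels).astype(int)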

`fit(data, labels=[], centers=None)`

Fits a bias-aware multivariate Gaussian Mixture Model per label to the data.

See our paper [1] for details. This process is slow and substantially less powerful than the Imitate algorithm since it additionally needs to cluster the dataset into potentially biased overlapping clusters. We only recommend Mimic if the user is certain that the dataset contains multiple clusters.

Parameters:

Name	Type	Description	Default
`data`	`numpy.ndarray (2D)`	The input dataset.	required
`labels`	`numpy.array (1D), optional`	The corresponding labels if the dataset contains multiple classes.	`[]`
`centers`	`numpy.ndarray (2D), optional`	A list [C1, ..., Cn] of n initial d-dimensional cluster centers Ci = [Ci_0, ..., Ci_d]. If those centers are not provided, the clustering will be initialized with KMeans for the K that optimizes the Silhouette score.	`None`
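The idea of picking the K that optimizes the Silhouette score can be sketched with scikit-learn. This is not the library's `findK`, whose implementation is not shown here. The helper name `pick_k_by_silhouette`, the search range, and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k_by_silhouette(points, k_max=6, seed=0):
    """Return the K in [2, k_max] with the highest Silhouette score."""
    best_k, best_score = None, -1.0
    for k in range(2, k_max + 1):
        cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(points)
        score = silhouette_score(points, cluster_ids)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# Three tight, well-separated blobs as toy input.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[10.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[0.0, 10.0], scale=0.5, size=(100, 2)),
])
k = pick_k_by_silhouette(points)
print(k)
```

The Silhouette score is undefined for a single cluster, so a search like this one starts at K = 2 and cannot by itself conclude that the data forms one cluster.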

References

[1] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 13281, pp. 149-160. Springer, Cham (2022).

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.mimic import *

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Mimic

>>> mim = Mimic()

fit to the biased dataset

>>> mim.fit(X_b, labels=y_b)
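When `labels` is omitted, the source assigns every point to class 0 and then fits one model per unique label. The sketch below reproduces only that label handling with NumPy. `resolve_labels` is a hypothetical helper written for this illustration and is not part of the library.

```python
import numpy as np

data = np.array([[0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])


def resolve_labels(data, labels=()):
    # Same rule as in fit(): no labels means a single class 0.
    return np.zeros(len(data)).astype(int) if len(labels) == 0 else np.asarray(labels)


for labels in ((), [0, 0, 1, 1]):
    resolved = resolve_labels(data, labels)
    # Split the data per class, as the per-label loop in fit() does.
    groups = {int(l): data[resolved == l] for l in np.unique(resolved)}
    print({l: g.shape for l, g in groups.items()})
```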

Source code in imitatebias\mimic.py

    def fit(self, data, labels=[], centers=None):
        """Fits a bias-aware multivariate Gaussian Mixture Model per label to the data.

        See our paper [1]_ for details. This process is slow and substantially less
        powerful than the Imitate algorithm since it additionally needs to cluster the
        dataset into potentially biased overlapping clusters. We only recommend Mimic
        if the user is certain that the dataset contains multiple clusters. 

        Parameters
        ----------
        data : numpy.ndarray (2D)
            The input dataset.
        labels : numpy.array (1D), optional
            The corresponding labels if the dataset contains multiple classes.
        centers : numpy.ndarray (2D), optional
            A list [C1, ..., Cn] of n initial d-dimensional cluster centers 
            Ci = [Ci_0, ..., Ci_d]. If those centers are not provided, the clustering will
            be initialized with KMeans for the K that optimizes the Silhouette score.

        References
        ----------
        .. [1] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg
           Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of 
           Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th 
           Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 
           13281, pp. 149-160. Springer, Cham (2022).

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.mimic import *

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Mimic
        >>> mim = Mimic()

        fit to the biased dataset
        >>> mim.fit(X_b, labels=y_b)



        """
        self.data = data
        self.labels = np.zeros(len(data)).astype(int) if len(labels) == 0 else labels

        for l in np.unique(self.labels):
            d = data[self.labels == l]
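            # NOTE: `centers` is not used in this method body; the initial
            # number of clusters always comes from findK.
            k_init = findK(d)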
            # params = mean/cov for each cluster
            probs_imi, params = run_mimic(d, k_init=k_init)

            # merge the resulting clusters
            probs_merge, params_merge = merge(probs_imi, params, d)

            # store parameters
            self.params[l] = params_merge

`predict_cluster(which_class)`

Predicts clusters for the input data.

Assigns clusters to the input data belonging to a specified class. Those clusters are selected based on the maximum probability that a point belongs to each of the clusters.

Parameters:

Name	Type	Description	Default
`which_class`	`int`	Filters the data based on the initial labels.	required

Returns:

Type	Description
`numpy.array (1D)`	The array containing the assigned clusters.
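The maximum-probability assignment can be sketched with SciPy. The density matrix is built the same way as in the method's source, but the final step here is a plain `argmax`. The library's `prob_cluster_assignment` is not shown and may behave differently: `augment()` filters on `cl_labels >= 0`, which suggests that negative values can mark unassigned points. The cluster parameters and points below are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical fitted clusters for one class: a list of (mean, cov) tuples.
clusters = [
    (np.array([0.0, 0.0]), np.eye(2)),
    (np.array([6.0, 6.0]), np.eye(2)),
]
points = np.array([[0.2, -0.1], [5.8, 6.3], [0.9, 1.1]])

# One column of densities per cluster, one row per point.
probs = np.column_stack(
    [multivariate_normal(mean, cov).pdf(points) for mean, cov in clusters]
)

# Simplified assignment: pick the cluster with the highest density.
assigned = np.argmax(probs, axis=1)
print(assigned)
```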

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.mimic import *
>>> import matplotlib.pyplot as plt

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Mimic

>>> mim = Mimic()

fit to the biased dataset

>>> mim.fit(X_b, labels=y_b)

predict cluster assignment for class 0

>>> predicted_clusters = mim.predict_cluster(0)

plot the resulting clusters for class 0

>>> plt.scatter(X_b[y_b == 0, 0], X_b[y_b == 0, 1], c=predicted_clusters)
>>> plt.show()

Source code in imitatebias\mimic.py

def predict_cluster(self, which_class):
    """Predicts clusters for the input data.

    Assigns clusters to the input data belonging to a specified class. Those clusters
    are selected based on the maximum probability that a point belongs to each of the 
    clusters.

    Parameters
    ----------
    which_class : int
        Filters the data based on the initial labels.

    Returns
    -------
    numpy.array (1D)
        The array containing the assigned clusters.

    Examples
    --------
    >>> from imitatebias.generators import *
    >>> from imitatebias.mimic import *
    >>> import matplotlib.pyplot as plt

    Generate a dataset.
    >>> X, y = generateData(1000, 2, 2, seed=2210)

    Generate a biased dataset.
    >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

    initialize Mimic
    >>> mim = Mimic()

    fit to the biased dataset
    >>> mim.fit(X_b, labels=y_b)

    predict cluster assignment for class 0
    >>> predicted_clusters = mim.predict_cluster(0)

    plot the resulting clusters for class 0
    >>> plt.scatter(X_b[y_b == 0, 0], X_b[y_b == 0, 1], c=predicted_clusters)
    >>> plt.show()



    """
    l = which_class
    probs = np.column_stack([multivariate_normal(self.params[l][i][0], self.params[l][i][1]).pdf(
        self.data[self.labels==l]) for i in range(len(self.params[l]))])
    return prob_cluster_assignment(probs)
