Mimic

Mimic generates points to mitigate a multi-cluster dataset's bias.

Machine learning can help overcome human biases in decision making by focusing on purely logical conclusions drawn from the training data. If the training data is biased, however, that bias is transferred to the model and remains undetected, because performance is validated on a test set drawn from the same biased distribution. Existing strategies for identifying and mitigating selection bias generally rely on some knowledge of the bias or the ground truth. An exception is the Imitate [1] algorithm, which assumes no such knowledge but comes with a strong limitation: it can only model datasets with one normally distributed cluster per class. MIMIC uses Imitate as a building block but relaxes this limitation. By allowing mixtures of multivariate Gaussians, our technique is able to model multi-cluster datasets and provide solutions for a substantially wider set of problems.
See our paper [2] for details.
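
To illustrate why a single Gaussian per class is limiting, the following sketch compares a one-component and a two-component Gaussian mixture on synthetic two-cluster data. It uses scikit-learn's `GaussianMixture` as a stand-in: this is an ordinary mixture model, not the bias-aware one Mimic fits, and it is not part of imitatebias. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# One class made of two well-separated clusters (synthetic, for illustration only).
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[8.0, 8.0], scale=1.0, size=(200, 2)),
])

# Average log-likelihood per sample under a single Gaussian
# and under a two-component mixture.
single = GaussianMixture(n_components=1, random_state=0).fit(X)
mixture = GaussianMixture(n_components=2, random_state=0).fit(X)
score_single = single.score(X)
score_mixture = mixture.score(X)
print(f"single Gaussian: {score_single:.2f}, mixture: {score_mixture:.2f}")
```

The mixture should reach a noticeably higher average log-likelihood, which is the kind of mismatch a one-cluster-per-class model cannot represent.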

Attributes:

params : dict(int: numpy.ndarray (2D))
    A label-indexed dictionary containing (mean, cov) tuples for each identified cluster belonging to this label.
data : numpy.ndarray (2D)
    The dataset Mimic is fitted to.
labels : numpy.array (1D)
    The corresponding labels. Labels need to be integer values.
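
As a rough sketch of how `params` might be laid out, the snippet below builds a hand-written dictionary that maps each class label to a sequence of (mean, cov) pairs and summarizes it. The values are hypothetical and only mirror the way the source indexes `self.params[l][i][0]` and `self.params[l][i][1]`; a fitted Mimic object may store them in a different container type.

```python
import numpy as np

# Hypothetical fitted parameters: class 0 has two clusters, class 1 has one.
params = {
    0: [
        (np.array([0.0, 0.0]), np.eye(2)),
        (np.array([5.0, 5.0]), 2.0 * np.eye(2)),
    ],
    1: [
        (np.array([-3.0, 4.0]), np.array([[1.0, 0.3], [0.3, 1.0]])),
    ],
}

# Collect a short summary per class: number of clusters and their means.
summary = {
    label: (len(clusters), [mean.tolist() for mean, _cov in clusters])
    for label, clusters in params.items()
}
for label, (n_clusters, means) in summary.items():
    print(f"class {label}: {n_clusters} cluster(s), means={means}")
```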

Methods

fit(data, labels=[], centers=None)
    Fits the Mimic Gaussians to a dataset.
predict_cluster(which_class)
    Predicts clusters for the input dataset.
augment()
    Augments the fitted dataset to mitigate its bias.

References

[1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. "Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

[2] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 13281, pp. 149-160. Springer, Cham (2022).

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.mimic import *
>>> import matplotlib.pyplot as plt

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Mimic

>>> mim = Mimic()

fit to the biased dataset

>>> mim.fit(X_b, labels=y_b)

predict cluster assignment for class 0

>>> predicted_clusters = mim.predict_cluster(0)

plot the resulting clusters for class 0

>>> plt.scatter(X_b[y_b == 0, 0], X_b[y_b == 0, 1], c=predicted_clusters)
>>> plt.show()

augment the data

>>> gen_p, gen_l = mim.augment()

plot the result

>>> plt.scatter(X_b[:,0], X_b[:,1], label='dataset')
>>> plt.scatter(gen_p[:,0], gen_p[:,1], label='generated points')
>>> plt.legend()
>>> plt.show()
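
Once points have been generated, a typical next step is to merge them with the biased dataset before training a model. The sketch below shows one way to do that with plain NumPy. The small arrays are hypothetical stand-ins for `X_b`, `y_b`, `gen_p`, and `gen_l` from the example above, and the float-to-integer cast is a precaution based on how the listed source builds the label array.

```python
import numpy as np

# Hypothetical stand-ins for X_b, y_b (biased data) and gen_p, gen_l (augment() output).
X_b = np.array([[0.0, 0.1], [0.2, -0.1], [5.0, 5.1]])
y_b = np.array([0, 0, 1])
gen_p = np.array([[0.5, 0.4], [4.8, 5.3]])
gen_l = np.array([0.0, 1.0])  # generated labels may come back as floats

# Stack original and generated points; cast the generated labels to the label dtype.
X_aug = np.vstack([X_b, gen_p])
y_aug = np.concatenate([y_b, gen_l.astype(y_b.dtype)])

# Keep track of which rows are synthetic, e.g. for plotting or later filtering.
is_generated = np.concatenate([
    np.zeros(len(X_b), dtype=bool),
    np.ones(len(gen_p), dtype=bool),
])
print(X_aug.shape, y_aug, is_generated)
```
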
Source code in imitatebias\mimic.py
class Mimic:
    """Mimic generates points to mitigate a multi-cluster dataset's bias.

    Machine Learning can help overcome human biases in decision making by focussing 
    on purely logical conclusions based on the training data. If the training data 
    is biased, however, that bias will be transferred to the model and remains 
    undetected as the performance is validated on a test set drawn from the same 
    biased distribution.
    Existing strategies for selection bias identification and mitigation generally 
    rely on some sort of knowledge of the bias or the ground-truth. An exception 
    is the Imitate [1]_ algorithm that assumes no knowledge but comes with a strong 
    limitation: It can only model datasets with one normally distributed cluster 
    per class.
    MIMIC uses Imitate as a building block but relaxes this limitation. By allowing 
    mixtures of multivariate Gaussians, our technique is able to model multi-cluster 
    datasets and provide solutions for a substantially wider set of problems.   
    See our paper [2]_ for details.

    Attributes
    ----------
    params : dict(int: numpy.ndarray (2D))
        A label-indexed dictionary containing (mean, cov) tuples for each identified
        cluster belonging to this label.
    data : numpy.ndarray (2D)
        The dataset Mimic is fitted to.
    labels : numpy.array (1D)
        The corresponding labels. Labels need to be integer values.

    Methods
    -------
    fit(data, labels=[], centers=None)
        Fits the Mimic Gaussians to a dataset.
    predict_cluster(which_class)
        Predicts clusters for the input dataset.
    augment()
        Augments the fitted dataset to mitigate its bias.

    References
    ----------
    .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. 
       "Your Best Guess When You Know Nothing: Identification and Mitigation of 
       Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), 
       pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

    .. [2] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg
       Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of 
       Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th 
       Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 
       13281, pp. 149-160. Springer, Cham (2022).

    Examples
    --------
    >>> from imitatebias.generators import *
    >>> from imitatebias.mimic import *
    >>> import matplotlib.pyplot as plt

    Generate a dataset.
    >>> X, y = generateData(1000, 2, 2, seed=2210)

    Generate a biased dataset.
    >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

    initialize Mimic
    >>> mim = Mimic()

    fit to the biased dataset
    >>> mim.fit(X_b, labels=y_b)

    predict cluster assignment for class 0
    >>> predicted_clusters = mim.predict_cluster(0)

    plot the resulting clusters for class 0
    >>> plt.scatter(X_b[y_b == 0, 0], X_b[y_b == 0, 1], c=predicted_clusters)
    >>> plt.show()

    augment the data
    >>> gen_p, gen_l = mim.augment()

    plot the result
    >>> plt.scatter(X_b[:,0], X_b[:,1], label='dataset')
    >>> plt.scatter(gen_p[:,0], gen_p[:,1], label='generated points')
    >>> plt.legend()
    >>> plt.show()

    """

    def __init__(self):
        """Mimic Constructor."""
        self.params = {}

    def fit(self, data, labels=[], centers=None):
        """Fits a bias-aware multivariate Gaussian Mixture Model per label to the data.

        See our paper [1]_ for details. This process is slow and substantially less
        powerful than the Imitate algorithm since it additionally needs to cluster the
        dataset into potentially biased overlapping clusters. We only recommend Mimic
        if the user is certain that the dataset contains multiple clusters. 

        Parameters
        ----------
        data : numpy.ndarray (2D)
            The input dataset.
        labels : numpy.array (1D), optional
            The corresponding labels if the dataset contains multiple classes.
        centers : numpy.ndarray (2D), optional
            A list [C1, ..., Cn] of n initial d-dimensional cluster centers 
            Ci = [Ci_0, ..., Ci_d]. If those centers are not provided, the clustering will
            be initialized with KMeans for the K that optimizes the Silhouette score.

        References
        ----------
        .. [1] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg
           Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of 
           Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th 
           Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 
           13281, pp. 149-160. Springer, Cham (2022).

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.mimic import *

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Mimic
        >>> mim = Mimic()

        fit to the biased dataset
        >>> mim.fit(X_b, labels=y_b)



        """
        self.data = data
        self.labels = np.zeros(len(data)).astype(int) if len(labels) == 0 else labels

        for l in np.unique(self.labels):
            d = data[self.labels == l]
            k_init = findK(d)
            # params = mean/cov for each cluster
            probs_imi, params = run_mimic(d, k_init=k_init)

            # merge the resulting clusters
            probs_merge, params_merge = merge(probs_imi, params, d)

            # store parameters
            self.params[l] = params_merge

    def predict_cluster(self, which_class):
        """Predicts clusters for the input data.

        Assigns clusters to the input data belonging to a specified class. Each point
        is assigned to the cluster for which its membership probability is highest.

        Parameters
        ----------
        which_class : int
            Filters the data based on the initial labels.

        Returns
        -------
        numpy.array (1D)
            The array containing the assigned clusters.

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.mimic import *
        >>> import matplotlib.pyplot as plt

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Mimic
        >>> mim = Mimic()

        fit to the biased dataset
        >>> mim.fit(X_b, labels=y_b)

        predict cluster assignment for class 0
        >>> predicted_clusters = mim.predict_cluster(0)

        plot the resulting clusters for class 0
        >>> plt.scatter(X_b[y_b == 0, 0], X_b[y_b == 0, 1], c=predicted_clusters)
        >>> plt.show()



        """
        l = which_class
        probs = np.column_stack([multivariate_normal(self.params[l][i][0], self.params[l][i][1]).pdf(
            self.data[self.labels==l]) for i in range(len(self.params[l]))])
        return prob_cluster_assignment(probs)

    def augment(self):
        """Augments the fitted dataset to mitigate its bias.

        Generates points to fill in the gap between fitted and observed distributions
        in the input dataset.

        Returns
        -------
        numpy.ndarray (2D)
            Generated points.
        numpy.array (1D)
            Corresponding class labels.

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.mimic import *
        >>> import matplotlib.pyplot as plt

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Mimic
        >>> mim = Mimic()

        fit to the biased dataset
        >>> mim.fit(X_b, labels=y_b)

        augment the data
        >>> gen_p, gen_l = mim.augment()

        plot the result
        >>> plt.scatter(X_b[:,0], X_b[:,1], label='dataset')
        >>> plt.scatter(gen_p[:,0], gen_p[:,1], label='generated points')
        >>> plt.legend()
        >>> plt.show()



        """
        gen_points = np.empty((0, len(self.data[0])))
        # integer array so the returned class labels keep an integer dtype
        gen_labels = np.empty(0, dtype=int)

        for l in np.unique(self.labels):
            # cluster assignment for the points of class l
            cl_labels = self.predict_cluster(l)
            # keep only points with a non-negative cluster label
            data_clean = self.data[self.labels==l][cl_labels >= 0]
            cl_labels_clean = cl_labels[cl_labels >= 0]

            points, point_cl_labels = Mimic_augment(data_clean, cl_labels_clean)
            gen_points = np.concatenate((gen_points, points))
            gen_labels = np.append(gen_labels, np.full(len(points), l, dtype=int))
        return gen_points, gen_labels

__init__()

Mimic Constructor.

Source code in imitatebias\mimic.py
def __init__(self):
    """Mimic Constructor."""
    self.params = {}

augment()

Augments the fitted dataset to mitigate its bias.

Generates points to fill in the gap between fitted and observed distributions in the input dataset.

Returns:

numpy.ndarray (2D)
    Generated points.
numpy.array (1D)
    Corresponding class labels.
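
The idea of filling the gap between a fitted and an observed distribution can be sketched in one dimension. The example below is a deliberately simplified illustration, not the procedure `Mimic_augment` implements: it assumes the fitted distribution is already known (a standard normal), compares it with a histogram of truncated data, and draws uniform points in each bin where the observed count falls short. The bin grid, the scaling rule, and all names are assumptions made for this sketch.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Biased 1D sample: a standard normal with everything above 0.5 removed.
full = rng.normal(0.0, 1.0, size=5000)
observed = full[full <= 0.5]

# Assumed "fitted" distribution; in this sketch we simply take the true N(0, 1).
fitted = norm(loc=0.0, scale=1.0)

# Histogram of the observed data on a fixed grid.
edges = np.linspace(-4.0, 4.0, 41)
counts, _ = np.histogram(observed, bins=edges)

# Probability mass of the fitted distribution per bin, scaled so that the
# best-populated bin matches its observed count.
mass = np.diff(fitted.cdf(edges))
scale = counts.max() / mass[np.argmax(counts)]
expected = scale * mass

# The gap is the non-negative shortfall of observed versus expected counts.
gap = np.clip(np.round(expected - counts), 0, None).astype(int)

# Fill each bin's gap with points drawn uniformly inside that bin.
generated = np.concatenate([
    rng.uniform(lo, hi, size=n) for lo, hi, n in zip(edges[:-1], edges[1:], gap)
])
print(f"observed: {len(observed)}, generated: {len(generated)}")
```

Most generated points should land in the removed region above 0.5, with a few scattered elsewhere because of sampling noise in the histogram.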

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.mimic import *
>>> import matplotlib.pyplot as plt

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Mimic

>>> mim = Mimic()

fit to the biased dataset

>>> mim.fit(X_b, labels=y_b)

augment the data

>>> gen_p, gen_l = mim.augment()

plot the result

>>> plt.scatter(X_b[:,0], X_b[:,1], label='dataset')
>>> plt.scatter(gen_p[:,0], gen_p[:,1], label='generated points')
>>> plt.legend()
>>> plt.show()
Source code in imitatebias\mimic.py
def augment(self):
    """Augments the fitted dataset to mitigate its bias.

    Generates points to fill in the gap between fitted and observed distributions
    in the input dataset.

    Returns
    -------
    numpy.ndarray (2D)
        Generated points.
    numpy.array (1D)
        Corresponding class labels.

    Examples
    --------
    >>> from imitatebias.generators import *
    >>> from imitatebias.mimic import *
    >>> import matplotlib.pyplot as plt

    Generate a dataset.
    >>> X, y = generateData(1000, 2, 2, seed=2210)

    Generate a biased dataset.
    >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

    initialize Mimic
    >>> mim = Mimic()

    fit to the biased dataset
    >>> mim.fit(X_b, labels=y_b)

    augment the data
    >>> gen_p, gen_l = mim.augment()

    plot the result
    >>> plt.scatter(X_b[:,0], X_b[:,1], label='dataset')
    >>> plt.scatter(gen_p[:,0], gen_p[:,1], label='generated points')
    >>> plt.legend()
    >>> plt.show()



    """
    gen_points = np.empty((0, len(self.data[0])))
    # integer array so the returned class labels keep an integer dtype
    gen_labels = np.empty(0, dtype=int)

    for l in np.unique(self.labels):
        # cluster assignment for the points of class l
        cl_labels = self.predict_cluster(l)
        # keep only points with a non-negative cluster label
        data_clean = self.data[self.labels==l][cl_labels >= 0]
        cl_labels_clean = cl_labels[cl_labels >= 0]

        points, point_cl_labels = Mimic_augment(data_clean, cl_labels_clean)
        gen_points = np.concatenate((gen_points, points))
        gen_labels = np.append(gen_labels, np.full(len(points), l, dtype=int))
    return gen_points, gen_labels

fit(data, labels=[], centers=None)

Fits a bias-aware multivariate Gaussian Mixture Model per label to the data.

See our paper [1] for details. This process is slow and substantially less powerful than the Imitate algorithm because it additionally needs to cluster the dataset into potentially biased, overlapping clusters. We recommend Mimic only if the user is certain that the dataset contains multiple clusters.

Parameters:

data : numpy.ndarray (2D)
    The input dataset.
labels : numpy.array (1D), optional
    The corresponding labels if the dataset contains multiple classes. Default: [].
centers : numpy.ndarray (2D), optional
    A list [C1, ..., Cn] of n initial d-dimensional cluster centers Ci = [Ci_0, ..., Ci_d]. If those centers are not provided, the clustering will be initialized with KMeans for the K that optimizes the Silhouette score. Default: None.
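
The default initialization described for `centers` (KMeans with the K that optimizes the Silhouette score) can be approximated as follows. This is a minimal sketch using scikit-learn, not the library's `findK`: the helper name `best_k_by_silhouette`, the candidate range of K, and the seed are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values=range(2, 7), seed=0):
    """Return the k whose KMeans clustering has the highest silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


rng = np.random.default_rng(0)
# Three well-separated synthetic blobs.
X = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(100, 2))
    for center in ([0.0, 0.0], [10.0, 10.0], [-10.0, 10.0])
])
best_k, scores = best_k_by_silhouette(X)
print(best_k)
```

Note that the silhouette score is only defined for at least two clusters, so a search like this cannot select K = 1 on its own.
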
References

[1] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 13281, pp. 149-160. Springer, Cham (2022).

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.mimic import *

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Mimic

>>> mim = Mimic()

fit to the biased dataset

>>> mim.fit(X_b, labels=y_b)
Source code in imitatebias\mimic.py
    def fit(self, data, labels=[], centers=None):
        """Fits a bias-aware multivariate Gaussian Mixture Model per label to the data.

        See our paper [1]_ for details. This process is slow and substantially less
        powerful than the Imitate algorithm since it additionally needs to cluster the
        dataset into potentially biased overlapping clusters. We only recommend Mimic
        if the user is certain that the dataset contains multiple clusters. 

        Parameters
        ----------
        data : numpy.ndarray (2D)
            The input dataset.
        labels : numpy.array (1D), optional
            The corresponding labels if the dataset contains multiple classes.
        centers : numpy.ndarray (2D), optional
            A list [C1, ..., Cn] of n initial d-dimensional cluster centers 
            Ci = [Ci_0, ..., Ci_d]. If those centers are not provided, the clustering will
            be initialized with KMeans for the K that optimizes the Silhouette score.

        References
        ----------
        .. [1] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg
           Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of 
           Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th 
           Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 
           13281, pp. 149-160. Springer, Cham (2022).

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.mimic import *

        Generate a dataset.
        >>> X, y = generateData(1000, 2, 2, seed=2210)

        Generate a biased dataset.
        >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

        initialize Mimic
        >>> mim = Mimic()

        fit to the biased dataset
        >>> mim.fit(X_b, labels=y_b)



        """
        self.data = data
        self.labels = np.zeros(len(data)).astype(int) if len(labels) == 0 else labels

        for l in np.unique(self.labels):
            d = data[self.labels == l]
            k_init = findK(d)
            # params = mean/cov for each cluster
            probs_imi, params = run_mimic(d, k_init=k_init)

            # merge the resulting clusters
            probs_merge, params_merge = merge(probs_imi, params, d)

            # store parameters
            self.params[l] = params_merge

predict_cluster(which_class)

Predicts clusters for the input data.

Assigns clusters to the input data belonging to a specified class. Each point is assigned to the cluster for which its membership probability is highest.

Parameters:

which_class : int
    Filters the data based on the initial labels.

Returns:

numpy.array (1D)
    The array containing the assigned clusters.
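
The assignment step can be sketched with SciPy alone. The snippet evaluates each cluster's Gaussian density at every point, stacks the densities column-wise as the source does, and then takes a plain argmax. The argmax is an assumption: the library's `prob_cluster_assignment` may apply additional rules, and the cluster parameters and points here are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical (mean, cov) pairs for two clusters of one class.
clusters = [
    (np.array([0.0, 0.0]), np.eye(2)),
    (np.array([6.0, 6.0]), np.eye(2)),
]
points = np.array([[0.2, -0.1], [5.8, 6.3], [1.0, 0.5], [7.0, 5.5]])

# One column of densities per cluster, as in predict_cluster.
probs = np.column_stack([
    multivariate_normal(mean, cov).pdf(points) for mean, cov in clusters
])

# Assumption: plain argmax; the library's prob_cluster_assignment may differ.
assigned = probs.argmax(axis=1)
print(assigned)
```

The `augment` source filters on `cl_labels >= 0`, which suggests the library may mark some points with a negative label; this sketch does not model that case.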

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.mimic import *
>>> import matplotlib.pyplot as plt

Generate a dataset.

>>> X, y = generateData(1000, 2, 2, seed=2210)

Generate a biased dataset.

>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

initialize Mimic

>>> mim = Mimic()

fit to the biased dataset

>>> mim.fit(X_b, labels=y_b)

predict cluster assignment for class 0

>>> predicted_clusters = mim.predict_cluster(0)

plot the resulting clusters for class 0

>>> plt.scatter(X_b[y_b == 0, 0], X_b[y_b == 0, 1], c=predicted_clusters)
>>> plt.show()
Source code in imitatebias\mimic.py
def predict_cluster(self, which_class):
    """Predicts clusters for the input data.

    Assigns clusters to the input data belonging to a specified class. Each point
    is assigned to the cluster for which its membership probability is highest.

    Parameters
    ----------
    which_class : int
        Filters the data based on the initial labels.

    Returns
    -------
    numpy.array (1D)
        The array containing the assigned clusters.

    Examples
    --------
    >>> from imitatebias.generators import *
    >>> from imitatebias.mimic import *
    >>> import matplotlib.pyplot as plt

    Generate a dataset.
    >>> X, y = generateData(1000, 2, 2, seed=2210)

    Generate a biased dataset.
    >>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)

    initialize Mimic
    >>> mim = Mimic()

    fit to the biased dataset
    >>> mim.fit(X_b, labels=y_b)

    predict cluster assignment for class 0
    >>> predicted_clusters = mim.predict_cluster(0)

    plot the resulting clusters for class 0
    >>> plt.scatter(X_b[y_b == 0, 0], X_b[y_b == 0, 1], c=predicted_clusters)
    >>> plt.show()



    """
    l = which_class
    probs = np.column_stack([multivariate_normal(self.params[l][i][0], self.params[l][i][1]).pdf(
        self.data[self.labels==l]) for i in range(len(self.params[l]))])
    return prob_cluster_assignment(probs)