Mimic
Mimic generates points to mitigate a multi-cluster dataset's bias.
Machine learning can help overcome human biases in decision making by focussing
on purely logical conclusions based on the training data. If the training data
is biased, however, that bias will be transferred to the model and remain
undetected, because the performance is validated on a test set drawn from the
same biased distribution.
Existing strategies for selection bias identification and mitigation generally
rely on some knowledge of the bias or the ground truth. An exception is the
Imitate [1] algorithm, which assumes no such knowledge but comes with a strong
limitation: it can only model datasets with one normally distributed cluster
per class.
MIMIC uses Imitate as a building block but relaxes this limitation. By allowing
mixtures of multivariate Gaussians, our technique is able to model multi-cluster
datasets and provide solutions for a substantially wider set of problems.
See our paper [2] for details.
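To illustrate why the one-cluster-per-class assumption is limiting, here is a minimal NumPy sketch. It does not use the imitatebias API; the two component means and covariances are made-up values chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical clusters of a single class, each a 2-D Gaussian (mean, cov).
components = [
    (np.array([-4.0, 0.0]), np.eye(2)),
    (np.array([4.0, 0.0]), np.eye(2)),
]

# Draw 500 points per component and stack them into one class.
points = np.vstack(
    [rng.multivariate_normal(mean, cov, size=500) for mean, cov in components]
)

# A single Gaussian fitted to the whole class centers itself between the
# two clusters, in a region that contains almost no data.
single_mean = points.mean(axis=0)
n_near = int(np.sum(np.linalg.norm(points - single_mean, axis=1) < 1.0))
print("single-Gaussian mean:", single_mean)
print("points within distance 1 of that mean:", n_near)
```

A mixture with one Gaussian per cluster avoids this mismatch, which is the setting Mimic targets.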
Attributes:
Name | Type | Description |
---|---|---|
params | dict(int: numpy.ndarray (2D)) | A label-indexed dictionary containing (mean, cov) tuples for each identified cluster belonging to this label. |
data | numpy.ndarray (2D) | The dataset Mimic is fitted to. |
labels | numpy.array (1D) | The corresponding labels. Labels need to be integer values. |
Methods
Name | Description |
---|---|
fit(data, labels=[], centers=None) | Fits the Mimic Gaussians to a dataset. |
predict_cluster(which_class) | Predicts clusters for the input dataset. |
augment() | Augments the fitted dataset to mitigate its bias. |
References
[1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. "Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001. IEEE, 2020. ISSN: 2374-8486.
[2] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 13281, pp. 149-160. Springer, Cham, 2022.
Examples:
>>> from imitatebias.generators import *
>>> from imitatebias.mimic import *
>>> import matplotlib.pyplot as plt
Generate a dataset.
>>> X, y = generateData(1000, 2, 2, seed=2210)
Generate a biased dataset.
>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)
Initialize Mimic.
>>> mim = Mimic()
Fit to the biased dataset.
>>> mim.fit(X_b, labels=y_b)
Predict the cluster assignment for class 0.
>>> predicted_clusters = mim.predict_cluster(0)
Plot the resulting clusters for class 0.
>>> plt.scatter(X_b[y_b == 0, 0], X_b[y_b == 0, 1], c=predicted_clusters)
>>> plt.show()
Augment the data.
>>> gen_p, gen_l = mim.augment()
Plot the result.
>>> plt.scatter(X_b[:,0], X_b[:,1], label='dataset')
>>> plt.scatter(gen_p[:,0], gen_p[:,1], label='generated points')
>>> plt.legend()
>>> plt.show()
Source code in imitatebias\mimic.py
__init__()
Mimic Constructor.
Source code in imitatebias\mimic.py
augment()
Augments the fitted dataset to mitigate its bias.
Generates points to fill in the gap between fitted and observed distributions in the input dataset.
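The idea of filling a gap between a fitted and an observed distribution can be sketched in one dimension. This is not Mimic's implementation: the helper `fill_gaps_1d` is hypothetical, and the sketch assumes the true mean, standard deviation, and unbiased sample size are already known, whereas estimating them from biased data is the hard part the algorithm addresses.

```python
import numpy as np
from math import erf, sqrt

def fill_gaps_1d(observed, mean, std, n_total, bins=20, seed=0):
    """Hypothetical helper: sample points in histogram bins where a fitted
    Gaussian expects more data than was observed."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(mean - 4 * std, mean + 4 * std, bins + 1)
    counts, _ = np.histogram(observed, bins=edges)
    # Expected count per bin under the fitted Gaussian (difference of CDF values).
    cdf = np.array([0.5 * (1 + erf((e - mean) / (std * sqrt(2)))) for e in edges])
    expected = n_total * np.diff(cdf)
    # Only bins with fewer observed than expected points receive new points.
    deficit = np.maximum(np.round(expected - counts), 0).astype(int)
    generated = [
        rng.uniform(lo, hi, size=n)
        for lo, hi, n in zip(edges[:-1], edges[1:], deficit)
    ]
    return np.concatenate(generated)

rng = np.random.default_rng(1)
full = rng.normal(0.0, 1.0, size=2000)
biased = full[full < 0.5]  # simulated selection bias: the right tail is missing
generated = fill_gaps_1d(biased, mean=0.0, std=1.0, n_total=2000)
print(generated.size, "points generated, mostly in the missing right tail")
```

Most generated points land where the bias removed data; a few appear elsewhere because random fluctuations also leave some bins slightly under-filled.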
Returns:
Type | Description |
---|---|
numpy.ndarray (2D) | Generated points. |
numpy.array (1D) | Corresponding class labels. |
Examples:
>>> from imitatebias.generators import *
>>> from imitatebias.mimic import *
>>> import matplotlib.pyplot as plt
Generate a dataset.
>>> X, y = generateData(1000, 2, 2, seed=2210)
Generate a biased dataset.
>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)
Initialize Mimic.
>>> mim = Mimic()
Fit to the biased dataset.
>>> mim.fit(X_b, labels=y_b)
Augment the data.
>>> gen_p, gen_l = mim.augment()
Plot the result.
>>> plt.scatter(X_b[:,0], X_b[:,1], label='dataset')
>>> plt.scatter(gen_p[:,0], gen_p[:,1], label='generated points')
>>> plt.legend()
>>> plt.show()
Source code in imitatebias\mimic.py
fit(data, labels=[], centers=None)
Fits a bias-aware multivariate Gaussian Mixture Model per label to the data.
See our paper [1] for details. This process is slower and substantially less powerful than the Imitate algorithm because it must additionally cluster the dataset into potentially biased, overlapping clusters. We recommend Mimic only if the user is certain that the dataset contains multiple clusters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | numpy.ndarray (2D) | The input dataset. | required |
labels | numpy.array (1D), optional | The corresponding labels if the dataset contains multiple classes. | [] |
centers | numpy.ndarray (2D), optional | A list [C1, ..., Cn] of n initial d-dimensional cluster centers Ci = [Ci_1, ..., Ci_d]. If those centers are not provided, the clustering is initialized with KMeans, using the K that optimizes the Silhouette score. | None |
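The default initialization described for `centers` (KMeans with the K that optimizes the Silhouette score) can be approximated with scikit-learn. This is a sketch under assumptions, not the library's code: the helper `pick_initial_centers` and the candidate range for K are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_initial_centers(data, k_candidates=range(2, 7), seed=0):
    """Hypothetical helper: KMeans centers for the K with the best Silhouette score."""
    best_score, best_centers = -np.inf, None
    for k in k_candidates:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        score = silhouette_score(data, km.labels_)
        if score > best_score:
            best_score, best_centers = score, km.cluster_centers_
    return best_centers

# Three well-separated synthetic clusters in 2-D.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(-5, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 8), scale=0.5, size=(100, 2)),
])
centers = pick_initial_centers(data)
print(centers.shape)  # one row per selected cluster center
```

The result is a 2D array with one d-dimensional center per row, which matches the shape described for `centers`. Because the Silhouette score is only defined for at least two clusters, this sketch never selects K = 1.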
References
[1] Katharina Dost, Hamish Duncanson, Ioannis Ziogas, Patricia Riddle, and Jörg Wicker. "Divide and Imitate: Multi-Cluster Identification and Mitigation of Selection Bias." In: Advances in Knowledge Discovery and Data Mining - 26th Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol. 13281, pp. 149-160. Springer, Cham, 2022.
Examples:
>>> from imitatebias.generators import *
>>> from imitatebias.mimic import *
Generate a dataset.
>>> X, y = generateData(1000, 2, 2, seed=2210)
Generate a biased dataset.
>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)
Initialize Mimic.
>>> mim = Mimic()
Fit to the biased dataset.
>>> mim.fit(X_b, labels=y_b)
Source code in imitatebias\mimic.py
predict_cluster(which_class)
Predicts clusters for the input data.
Assigns clusters to the input data belonging to a specified class. Each point is assigned to the cluster to which it most probably belongs.
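A maximum-probability assignment over Gaussian components can be sketched with NumPy alone. The helpers `gaussian_log_pdf` and `assign_clusters` are hypothetical, the (mean, cov) values are made up, and the sketch assumes equally weighted components; Mimic's actual computation may differ.

```python
import numpy as np

def gaussian_log_pdf(points, mean, cov):
    """Log density of a multivariate Gaussian, evaluated for each row of points."""
    d = mean.shape[0]
    diff = points - mean
    # Squared Mahalanobis distance; solve a linear system instead of inverting cov.
    maha = np.einsum("ij,ji->i", diff, np.linalg.solve(cov, diff.T))
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def assign_clusters(points, params):
    """Hypothetical helper: index of the most likely (mean, cov) component per point."""
    log_probs = np.column_stack(
        [gaussian_log_pdf(points, mean, cov) for mean, cov in params]
    )
    return np.argmax(log_probs, axis=1)

# Two illustrative components and three query points.
params = [
    (np.array([0.0, 0.0]), np.eye(2)),
    (np.array([6.0, 6.0]), 2 * np.eye(2)),
]
points = np.array([[0.5, -0.2], [5.5, 6.3], [1.0, 1.0]])
clusters = assign_clusters(points, params)
print(clusters)  # [0 1 0]
```

Working with log densities avoids numerical underflow for points far from every component.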
Parameters:
Name | Type | Description | Default |
---|---|---|---|
which_class | int | Filters the data based on the initial labels. | required |
Returns:
Type | Description |
---|---|
numpy.array (1D) | The array containing the assigned clusters. |
Examples:
>>> from imitatebias.generators import *
>>> from imitatebias.mimic import *
>>> import matplotlib.pyplot as plt
Generate a dataset.
>>> X, y = generateData(1000, 2, 2, seed=2210)
Generate a biased dataset.
>>> X_b, y_b, idcs_deleted = generateBias(X, y, 1, seed=2210)
Initialize Mimic.
>>> mim = Mimic()
Fit to the biased dataset.
>>> mim.fit(X_b, labels=y_b)
Predict the cluster assignment for class 0.
>>> predicted_clusters = mim.predict_cluster(0)
Plot the resulting clusters for class 0.
>>> plt.scatter(X_b[y_b == 0, 0], X_b[y_b == 0, 1], c=predicted_clusters)
>>> plt.show()
Source code in imitatebias\mimic.py