Cancels

Cancels selects additional points/compounds to mitigate a bias.

Given a pool of potential candidates to be added to a dataset, Cancels investigates the dataset's distribution and selects those points or compounds that mitigate the dataset's bias without losing its specialization to its domain. See our paper [1]_ for details.

Attributes:

n_pc : int (> 0)
    Controls the number of Principal Components used in PCA to decrease the dataset dimensionality.

imi : `Imitate`
    Imitate object containing all information about the fitted multivariate Gaussian indicating a potential bias.

pca : sklearn.decomposition.PCA
    Stores the trained PCA transformation.

d_trf : numpy.ndarray (2D)
    The PCA-transformed input dataset.

Methods

fit(data, bounding_pool=None, bounding_range=None, strength=1000)
    Fits the Cancels method to a dataset.

score(pool)
    Scores all points / compounds in a pool.

augment(pool)
    Selects compounds from the pool that mitigate the bias.

References

.. [1] Katharina Dost, Zac Pullar-Strecker, Liam Brydon, Kunyang Zhang, Jasmin Hafner, Patricia Riddle, and Jörg Wicker. "Combatting Over-Specialization Bias in Growing Chemical Databases." 05 October 2022, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-2133331/v1]

Examples:

>>> import numpy as np
>>> from imitatebias.generators import *
>>> from imitatebias.cancels import *
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns

generate data

>>> X, y = generateData(500, 1, 10, seed=2210)
>>> X_b, _, _ = generateBias(X, y, 1, seed=2210)

fit Cancels

>>> can = Cancels(n_pc=2)
>>> can.fit(X_b)

generate data points to fill in the bias (for the sake of visualization)

>>> gen_p, _ = can.imi.augment()

plot Cancels' indicated biases in PCA space

>>> plt.scatter(can.pca.transform(X_b)[:,0], can.pca.transform(X_b)[:,1])
>>> sns.kdeplot(x=gen_p[:,0], y=gen_p[:,1], cut=10, thresh=0, cmap='Greens')
>>> plt.show()

create a random pool

>>> pool = np.column_stack([np.random.uniform(min(X[:,i]), max(X[:,i]), size=1000) for i in range(len(X[0]))])

score the pool

>>> scores = can.score(pool)

plot the pool's scores in PCA space

>>> plt.scatter(can.pca.transform(pool)[:,0], can.pca.transform(pool)[:,1], c=scores)
>>> plt.colorbar()
>>> plt.show()

select additional data points from the pool

>>> pool_idcs = can.augment(pool)

plot the dataset and the selected points in PCA space

>>> plt.scatter(can.pca.transform(X_b)[:,0], can.pca.transform(X_b)[:,1], label='Dataset')
>>> plt.scatter(can.pca.transform(pool)[pool_idcs,0], can.pca.transform(pool)[pool_idcs,1], label='Added')
>>> plt.legend()
>>> plt.show()
Source code in imitatebias\cancels.py
class Cancels:
    """Cancels selects additional points/compounds to mitigate a bias.

    Given a pool of potential candidates to be added to a dataset,
    Cancels investigates the dataset's distribution and selects those points
    or compounds that mitigate the dataset's bias without losing its 
    specialization to its domain. See our paper [1]_ for details.

    Attributes
    ----------
    n_pc : int (> 0)
        Controls the number of Principal Components used in PCA to decrease
        the dataset dimensionality.
    imi : `Imitate`
        Imitate object containing all information about the fitted
        multivariate Gaussian indicating a potential bias.
    pca : sklearn.decomposition.PCA
        Stores the trained PCA transformation.
    d_trf : numpy.ndarray (2D)
        The PCA-transformed input dataset.

    Methods
    -------
    fit(data, bounding_pool=None, bounding_range=None, strength=1000)
        Fits the Cancels method to a dataset.
    score(pool)
        Scores all points / compounds in a pool.
    augment(pool)
        Selects compounds from the pool that mitigate the bias.

    References
    ----------
    .. [1] Katharina Dost, Zac Pullar-Strecker, Liam Brydon, Kunyang Zhang, Jasmin Hafner, 
       Patricia Riddle, and Jörg Wicker. "Combatting Over-Specialization Bias in Growing 
       Chemical Databases." 05 October 2022, PREPRINT (Version 1) available at Research 
       Square [https://doi.org/10.21203/rs.3.rs-2133331/v1]

    Examples
    --------
    >>> import numpy as np
    >>> from imitatebias.generators import *
    >>> from imitatebias.cancels import *
    >>> import matplotlib.pyplot as plt
    >>> import seaborn as sns

    generate data
    >>> X, y = generateData(500, 1, 10, seed=2210)
    >>> X_b, _, _ = generateBias(X, y, 1, seed=2210)

    fit Cancels
    >>> can = Cancels(n_pc=2)
    >>> can.fit(X_b)

    generate data points to fill in the bias (for the sake of visualization)
    >>> gen_p, _ = can.imi.augment()

    plot Cancels' indicated biases in PCA space
    >>> plt.scatter(can.pca.transform(X_b)[:,0], can.pca.transform(X_b)[:,1])
    >>> sns.kdeplot(x=gen_p[:,0], y=gen_p[:,1], cut=10, thresh=0, cmap='Greens')
    >>> plt.show()

    create a random pool
    >>> pool = np.column_stack([np.random.uniform(min(X[:,i]), max(X[:,i]), size=1000) for i in range(len(X[0]))])

    score the pool
    >>> scores = can.score(pool)

    plot the pool's scores in PCA space
    >>> plt.scatter(can.pca.transform(pool)[:,0], can.pca.transform(pool)[:,1], c=scores)
    >>> plt.colorbar()
    >>> plt.show()

    select additional data points from the pool
    >>> pool_idcs = can.augment(pool)

    plot the dataset and the selected points in PCA space
    >>> plt.scatter(can.pca.transform(X_b)[:,0], can.pca.transform(X_b)[:,1], label='Dataset')
    >>> plt.scatter(can.pca.transform(pool)[pool_idcs,0], can.pca.transform(pool)[pool_idcs,1], label='Added')
    >>> plt.legend()
    >>> plt.show()
    """

    def __init__(self, n_pc=5):
        """Cancels Constructor.

        Parameters
        ----------
        n_pc : int
            The number of Principal Components to be used for dimensionality
            reduction.
        """
        self.n_pc = n_pc
        self.imi = Imitate()

    def fit(self, data, bounding_pool=None, bounding_range=None, strength=1000):
        """Fits a bias-aware multivariate Gaussian to the dataset.

        After reducing the dimensionality of the dataset using PCA, a
        bias-aware multivariate Gaussian is fitted to the data using the
        Imitate algorithm. See [1]_ for details on Imitate and [2]_ for
        details on Cancels.

        Parameters
        ----------
        data : numpy.ndarray (2D)
            The input data.
        bounding_pool : numpy.ndarray (2D), optional
            If the fitting of the Gaussian is supposed to be constrained to
            an existing pool, the pool can be provided to ensure that Cancels
            selects the best-possible points / compounds given this pool.
        bounding_range : numpy.ndarray (2D), optional
            Alternative to `bounding_pool`; ignored if a bounding pool is
            provided. Provide the range [R1, ..., Rd] for each of d
            dimensions where Ri = [min_i, max_i] is the range for the i-th
            dimension.
        strength : int or float
            Controls the strength of the boundary enforcement. See [1]_.

        References
        ----------
        .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker.
           "Your Best Guess When You Know Nothing: Identification and Mitigation
           of Selection Bias." In: 2020 IEEE International Conference on Data
           Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

        .. [2] Katharina Dost, Zac Pullar-Strecker, Liam Brydon, Kunyang Zhang, Jasmin Hafner,
           Patricia Riddle, and Jörg Wicker. "Combatting Over-Specialization Bias in Growing
           Chemical Databases." 05 October 2022, PREPRINT (Version 1) available at Research
           Square [https://doi.org/10.21203/rs.3.rs-2133331/v1]

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.cancels import *
        >>> import matplotlib.pyplot as plt
        >>> import seaborn as sns

        generate data and pool
        >>> X, y = generateData(500, 1, 10, seed=2210)
        >>> X_b, _, _ = generateBias(X, y, 1, seed=2210)

        fit Cancels
        >>> can = Cancels(n_pc=2)
        >>> can.fit(X_b)

        generate data points to fill in the bias (for the sake of visualization)
        >>> gen_p, _ = can.imi.augment()

        plot Cancels' indicated biases in PCA space
        >>> plt.scatter(can.pca.transform(X_b)[:,0], can.pca.transform(X_b)[:,1])
        >>> sns.kdeplot(x=gen_p[:,0], y=gen_p[:,1], cut=10, thresh=0, cmap='Greens')
        >>> plt.show()
        """
        self.pca = PCA(n_components = self.n_pc)
        self.pca.fit(data)
        self.d_trf = self.pca.transform(data)

        if bounding_pool is not None:
            self.imi.fit(self.d_trf, bounds_set=bounding_pool, strength=strength)
        elif bounding_range is not None:
            self.imi.fit(self.d_trf, bounds={0: bounding_range}, strength=strength)
        else:
            self.imi.fit(self.d_trf, labels=np.zeros(len(data)).astype(int))

    def score(self, pool):
        """Scores all elements in a pool on their bias-mitigating ability.

        See [1]_ for details on the score.

        Parameters
        ----------
        pool : numpy.ndarray (2D)
            The pool that shall be scored.

        Returns
        -------
        numpy.ndarray (1D)
            The non-normalized scores for each element of the pool.

        References
        ----------
        .. [1] Katharina Dost, Zac Pullar-Strecker, Liam Brydon, Kunyang Zhang, Jasmin Hafner,
           Patricia Riddle, and Jörg Wicker. "Combatting Over-Specialization Bias in Growing
           Chemical Databases." 05 October 2022, PREPRINT (Version 1) available at Research
           Square [https://doi.org/10.21203/rs.3.rs-2133331/v1]

        Examples
        --------
        >>> import numpy as np
        >>> from imitatebias.generators import *
        >>> from imitatebias.cancels import *
        >>> import matplotlib.pyplot as plt

        generate data and pool
        >>> X, y = generateData(500, 1, 10, seed=2210)
        >>> X_b, _, _ = generateBias(X, y, 1, seed=2210)
        >>> pool = np.column_stack([np.random.uniform(min(X[:,i]), max(X[:,i]), size=1000) for i in range(len(X[0]))])

        fit Cancels
        >>> can = Cancels(n_pc=2)
        >>> can.fit(X_b)

        score the pool
        >>> scores = can.score(pool)

        plot the pool's scores in PCA space
        >>> plt.scatter(can.pca.transform(pool)[:,0], can.pca.transform(pool)[:,1], c=scores)
        >>> plt.colorbar()
        >>> plt.show()
        """
        pool_trf = self.pca.transform(pool)
        return self.imi.score(pool_trf, score_type='balanced')[:,0]

    def augment(self, pool):
        """Augments the input dataset using the pool.

        Randomly selects points / compounds from the pool to mitigate the
        identified selection bias of the input dataset.

        Parameters
        ----------
        pool : numpy.ndarray (2D)
            The pool of candidates to select from.

        Returns
        -------
        numpy.ndarray (1D)
            The indices of those elements from the pool that have been selected.

        References
        ----------
        .. [1] Katharina Dost, Zac Pullar-Strecker, Liam Brydon, Kunyang Zhang, Jasmin Hafner,
           Patricia Riddle, and Jörg Wicker. "Combatting Over-Specialization Bias in Growing
           Chemical Databases." 05 October 2022, PREPRINT (Version 1) available at Research
           Square [https://doi.org/10.21203/rs.3.rs-2133331/v1]

        Examples
        --------
        >>> import numpy as np
        >>> from imitatebias.generators import *
        >>> from imitatebias.cancels import *
        >>> import matplotlib.pyplot as plt

        generate data and pool
        >>> X, y = generateData(500, 1, 10, seed=2210)
        >>> X_b, _, _ = generateBias(X, y, 1, seed=2210)
        >>> pool = np.column_stack([np.random.uniform(min(X[:,i]), max(X[:,i]), size=1000) for i in range(len(X[0]))])

        fit Cancels
        >>> can = Cancels(n_pc=2)
        >>> can.fit(X_b)

        select additional data points from the pool
        >>> pool_idcs = can.augment(pool)

        plot the dataset and the selected points in PCA space
        >>> plt.scatter(can.pca.transform(X_b)[:,0], can.pca.transform(X_b)[:,1], label='Dataset')
        >>> plt.scatter(can.pca.transform(pool)[pool_idcs,0], can.pca.transform(pool)[pool_idcs,1], label='Added')
        >>> plt.legend()
        >>> plt.show()
        """
        score = self.score(pool)
        score = score / np.sum(score) # convert to probability distribution
        pool_trf = self.imi.icas[0].transform(self.pca.transform(pool))
        data = self.imi.icas[0].transform(copy.deepcopy(self.d_trf))

        add_idcs = np.array([], dtype=np.int32)
        p_model_given_data = P_data_given_model(data, self.imi.grids[0], self.imi.fitted[0])
        p_model_given_data -= P_data(data, self.imi.grids[0], self.imi.fitted[0], s=3)
        # cap the number of points to add by the number of candidates with a positive score
        num_fill = int(min(max(self.imi.num_fill_up[0]), np.count_nonzero(score > 0)))
        batches = np.append([10] * (num_fill // 10), [num_fill % 10])
        tries = 0

        i = 0
        while i < len(batches): # explicit index so that a rejected batch can be retried
            candidates = np.random.choice(range(len(score)), int(batches[i]), p=score).astype(int)
            d_new = np.vstack((data, pool_trf[candidates]))
            #P_new = P_model_given_data(d_new, self.imi.grids[0], self.imi.fitted[0])
            P_new = P_data_given_model(d_new, self.imi.grids[0], self.imi.fitted[0])
            P_new -= P_data(d_new, self.imi.grids[0], self.imi.fitted[0], s=3)
            if P_new <= p_model_given_data: # reject the batch if the likelihood does not improve
                if tries < 3:
                    tries += 1 # try the same batch again
                else:
                    tries = 0
                    i += 1 # give up on this batch
                continue
            p_model_given_data = P_new
            add_idcs = np.append(add_idcs, candidates)
            score[add_idcs] = 0
            data = d_new # the accepted candidates are added to the data exactly once
            if np.sum(score) == 0: break
            score = score / np.sum(score)
            i += 1

        return add_idcs

__init__(n_pc=5)

Cancels Constructor.

Parameters:

n_pc : int, default 5
    The number of Principal Components to be used for dimensionality reduction.

Source code in imitatebias\cancels.py
def __init__(self, n_pc=5):
    """Cancels Constructor.

    Parameters
    ----------
    n_pc : int
        The number of Principal Components to be used for dimensionality
        reduction.
    """
    self.n_pc = n_pc
    self.imi = Imitate()

augment(pool)

Augments the input dataset using the pool.

Randomly selects points / compounds from the pool to mitigate the identified selection bias of the input dataset.
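The selection step can be pictured as score-weighted sampling: the scores are normalized into a probability distribution and candidate indices are drawn in small batches. The following is a minimal sketch of that idea only, assuming NumPy; `sample_batch` and the toy scores are hypothetical, and the likelihood check that `augment` applies to each batch is left out.

```python
import numpy as np

def sample_batch(scores, batch_size, rng):
    """Draw pool indices with probability proportional to their score."""
    scores = np.asarray(scores, dtype=float)
    probs = scores / scores.sum()  # convert to a probability distribution
    return rng.choice(len(scores), size=batch_size, p=probs)

rng = np.random.default_rng(0)
toy_scores = np.array([0.0, 1.0, 3.0, 0.0, 6.0])  # zero-score points are never drawn
batch = sample_batch(toy_scores, batch_size=4, rng=rng)
```

Points with a higher score are more likely to be drawn, while points with a score of zero cannot be selected at all.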

Parameters:

pool : numpy.ndarray (2D), required
    The pool of candidates to select from.

Returns:

numpy.ndarray (1D)
    The indices of those elements from the pool that have been selected.
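Because `augment` returns indices into the pool rather than the points themselves, the augmented dataset has to be assembled by the caller. A minimal sketch with stand-in arrays, assuming NumPy; the index array below is made up for illustration and is not produced by Cancels.

```python
import numpy as np

X_b = np.zeros((5, 3))                            # stand-in for the biased dataset
pool = np.arange(30, dtype=float).reshape(10, 3)  # stand-in for the candidate pool
pool_idcs = np.array([2, 7, 7])                   # stand-in for the result of can.augment(pool)

# Drop repeated indices before stacking, in case an index appears more than once.
unique_idcs = np.unique(pool_idcs)
X_augmented = np.vstack((X_b, pool[unique_idcs]))
```

Whether repeated indices can actually occur depends on the implementation; deduplicating is a cheap precaution.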

References

.. [1] Katharina Dost, Zac Pullar-Strecker, Liam Brydon, Kunyang Zhang, Jasmin Hafner, Patricia Riddle, and Jörg Wicker. "Combatting Over-Specialization Bias in Growing Chemical Databases." 05 October 2022, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-2133331/v1]

Examples:

>>> import numpy as np
>>> from imitatebias.generators import *
>>> from imitatebias.cancels import *
>>> import matplotlib.pyplot as plt

generate data and pool

>>> X, y = generateData(500, 1, 10, seed=2210)
>>> X_b, _, _ = generateBias(X, y, 1, seed=2210)
>>> pool = np.column_stack([np.random.uniform(min(X[:,i]), max(X[:,i]), size=1000) for i in range(len(X[0]))])

fit Cancels

>>> can = Cancels(n_pc=2)
>>> can.fit(X_b)

select additional data points from the pool

>>> pool_idcs = can.augment(pool)

plot the dataset and the selected points in PCA space

>>> plt.scatter(can.pca.transform(X_b)[:,0], can.pca.transform(X_b)[:,1], label='Dataset')
>>> plt.scatter(can.pca.transform(pool)[pool_idcs,0], can.pca.transform(pool)[pool_idcs,1], label='Added')
>>> plt.legend()
>>> plt.show()
Source code in imitatebias\cancels.py
    def augment(self, pool):
        """Augments the input dataset using the pool.

        Randomly selects points / compounds from the pool to mitigate the
        identified selection bias of the input dataset.

        Parameters
        ----------
        pool : numpy.ndarray (2D)
            The pool of candidates to select from.

        Returns
        -------
        numpy.ndarray (1D)
            The indices of those elements from the pool that have been selected.

        References
        ----------
        .. [1] Katharina Dost, Zac Pullar-Strecker, Liam Brydon, Kunyang Zhang, Jasmin Hafner,
           Patricia Riddle, and Jörg Wicker. "Combatting Over-Specialization Bias in Growing
           Chemical Databases." 05 October 2022, PREPRINT (Version 1) available at Research
           Square [https://doi.org/10.21203/rs.3.rs-2133331/v1]

        Examples
        --------
        >>> import numpy as np
        >>> from imitatebias.generators import *
        >>> from imitatebias.cancels import *
        >>> import matplotlib.pyplot as plt

        generate data and pool
        >>> X, y = generateData(500, 1, 10, seed=2210)
        >>> X_b, _, _ = generateBias(X, y, 1, seed=2210)
        >>> pool = np.column_stack([np.random.uniform(min(X[:,i]), max(X[:,i]), size=1000) for i in range(len(X[0]))])

        fit Cancels
        >>> can = Cancels(n_pc=2)
        >>> can.fit(X_b)

        select additional data points from the pool
        >>> pool_idcs = can.augment(pool)

        plot the dataset and the selected points in PCA space
        >>> plt.scatter(can.pca.transform(X_b)[:,0], can.pca.transform(X_b)[:,1], label='Dataset')
        >>> plt.scatter(can.pca.transform(pool)[pool_idcs,0], can.pca.transform(pool)[pool_idcs,1], label='Added')
        >>> plt.legend()
        >>> plt.show()
        """
        score = self.score(pool)
        score = score / np.sum(score) # convert to probability distribution
        pool_trf = self.imi.icas[0].transform(self.pca.transform(pool))
        data = self.imi.icas[0].transform(copy.deepcopy(self.d_trf))

        add_idcs = np.array([], dtype=np.int32)
        p_model_given_data = P_data_given_model(data, self.imi.grids[0], self.imi.fitted[0])
        p_model_given_data -= P_data(data, self.imi.grids[0], self.imi.fitted[0], s=3)
        # cap the number of points to add by the number of candidates with a positive score
        num_fill = int(min(max(self.imi.num_fill_up[0]), np.count_nonzero(score > 0)))
        batches = np.append([10] * (num_fill // 10), [num_fill % 10])
        tries = 0

        i = 0
        while i < len(batches): # explicit index so that a rejected batch can be retried
            candidates = np.random.choice(range(len(score)), int(batches[i]), p=score).astype(int)
            d_new = np.vstack((data, pool_trf[candidates]))
            #P_new = P_model_given_data(d_new, self.imi.grids[0], self.imi.fitted[0])
            P_new = P_data_given_model(d_new, self.imi.grids[0], self.imi.fitted[0])
            P_new -= P_data(d_new, self.imi.grids[0], self.imi.fitted[0], s=3)
            if P_new <= p_model_given_data: # reject the batch if the likelihood does not improve
                if tries < 3:
                    tries += 1 # try the same batch again
                else:
                    tries = 0
                    i += 1 # give up on this batch
                continue
            p_model_given_data = P_new
            add_idcs = np.append(add_idcs, candidates)
            score[add_idcs] = 0
            data = d_new # the accepted candidates are added to the data exactly once
            if np.sum(score) == 0: break
            score = score / np.sum(score)
            i += 1

        return add_idcs
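
A note on the retry pattern used when a batch is rejected: in Python, reassigning the loop variable inside `for i in range(...)` does not repeat an iteration, because the next value is taken from the iterator regardless. Retrying a step therefore needs explicit index control. The following is a minimal, self-contained sketch of such a loop; `run_with_retries` and the `attempt` callback are hypothetical and not part of imitatebias.

```python
def run_with_retries(n_steps, attempt, max_tries=3):
    """Run attempt(step) for each step, retrying a failed step up to max_tries times."""
    accepted = []
    step, tries = 0, 0
    while step < n_steps:
        if attempt(step):
            accepted.append(step)
        elif tries < max_tries:
            tries += 1  # same step again on the next pass
            continue
        # success, or retries exhausted: move on with a fresh retry budget
        tries = 0
        step += 1
    return accepted

# Toy callback: step 1 fails twice before succeeding, step 2 always fails.
calls = {}

def attempt(step):
    calls[step] = calls.get(step, 0) + 1
    if step == 1:
        return calls[step] >= 3
    return step != 2

accepted = run_with_retries(4, attempt)
```

In this sketch the retry budget is reset for every step, which is one possible design choice among several.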

fit(data, bounding_pool=None, bounding_range=None, strength=1000)

Fits a bias-aware multivariate Gaussian to the dataset.

After reducing the dimensionality of the dataset using PCA, a bias-aware multivariate Gaussian is fitted to the data using the Imitate algorithm. See [1] for details on Imitate and [2] for details on Cancels.
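The first step reduces the data to `n_pc` dimensions. Cancels uses `sklearn.decomposition.PCA` for this; the NumPy-only sketch below is an illustrative stand-in that performs the same kind of projection, with `pca_project` being a hypothetical helper rather than part of the library.

```python
import numpy as np

def pca_project(data, n_pc):
    """Project data onto its first n_pc principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_pc].T

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 10))
d_trf = pca_project(data, n_pc=2)  # shape (200, 2), analogous to the d_trf attribute
```

The bias-aware Gaussian is then fitted in this reduced space, which is why plots of the fitted model use PCA coordinates.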

Parameters:

data : numpy.ndarray (2D), required
    The input data.

bounding_pool : numpy.ndarray (2D), optional, default None
    If the fitting of the Gaussian is supposed to be constrained to an existing pool, the pool can be provided to ensure that Cancels selects the best-possible points / compounds given this pool.

bounding_range : numpy.ndarray (2D), optional, default None
    Alternative to `bounding_pool`; ignored if a bounding pool is provided. Provide the range [R1, ..., Rd] for each of d dimensions where Ri = [min_i, max_i] is the range for the i-th dimension.

strength : int or float, default 1000
    Controls the strength of the boundary enforcement. See [1]_.

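
The `bounding_range` layout, one `[min_i, max_i]` pair per dimension, can be built from any 2D array of points. A minimal sketch assuming NumPy; `range_from_points` is a hypothetical helper, and whether the range should be expressed in the original or in the PCA-transformed space is not stated here, so that should be checked before passing the result to `fit`.

```python
import numpy as np

def range_from_points(points):
    """Return an array of shape (d, 2) holding [min_i, max_i] per dimension."""
    points = np.asarray(points, dtype=float)
    return np.column_stack((points.min(axis=0), points.max(axis=0)))

points = np.array([[0.0, 5.0], [2.0, -1.0], [1.0, 3.0]])
bounding_range = range_from_points(points)
```
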
References

.. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker. "Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias." In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

.. [2] Katharina Dost, Zac Pullar-Strecker, Liam Brydon, Kunyang Zhang, Jasmin Hafner, Patricia Riddle, and Jörg Wicker. "Combatting Over-Specialization Bias in Growing Chemical Databases." 05 October 2022, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-2133331/v1]

Examples:

>>> from imitatebias.generators import *
>>> from imitatebias.cancels import *
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns

generate data and pool

>>> X, y = generateData(500, 1, 10, seed=2210)
>>> X_b, _, _ = generateBias(X, y, 1, seed=2210)

fit Cancels

>>> can = Cancels(n_pc=2)
>>> can.fit(X_b)

generate data points to fill in the bias (for the sake of visualization)

>>> gen_p, _ = can.imi.augment()

plot Cancels' indicated biases in PCA space

>>> plt.scatter(can.pca.transform(X_b)[:,0], can.pca.transform(X_b)[:,1])
>>> sns.kdeplot(x=gen_p[:,0], y=gen_p[:,1], cut=10, thresh=0, cmap='Greens')
>>> plt.show()
Source code in imitatebias\cancels.py
    def fit(self, data, bounding_pool=None, bounding_range=None, strength=1000):
        """Fits a bias-aware multivariate Gaussian to the dataset.

        After reducing the dimensionality of the dataset using PCA, a
        bias-aware multivariate Gaussian is fitted to the data using the
        Imitate algorithm. See [1]_ for details on Imitate and [2]_ for
        details on Cancels.

        Parameters
        ----------
        data : numpy.ndarray (2D)
            The input data.
        bounding_pool : numpy.ndarray (2D), optional
            If the fitting of the Gaussian is supposed to be constrained to
            an existing pool, the pool can be provided to ensure that Cancels
            selects the best-possible points / compounds given this pool.
        bounding_range : numpy.ndarray (2D), optional
            Alternative to `bounding_pool`; ignored if a bounding pool is
            provided. Provide the range [R1, ..., Rd] for each of d
            dimensions where Ri = [min_i, max_i] is the range for the i-th
            dimension.
        strength : int or float
            Controls the strength of the boundary enforcement. See [1]_.

        References
        ----------
        .. [1] Katharina Dost, Katerina Taskova, Patricia Riddle, and Jörg Wicker.
           "Your Best Guess When You Know Nothing: Identification and Mitigation
           of Selection Bias." In: 2020 IEEE International Conference on Data
           Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

        .. [2] Katharina Dost, Zac Pullar-Strecker, Liam Brydon, Kunyang Zhang, Jasmin Hafner,
           Patricia Riddle, and Jörg Wicker. "Combatting Over-Specialization Bias in Growing
           Chemical Databases." 05 October 2022, PREPRINT (Version 1) available at Research
           Square [https://doi.org/10.21203/rs.3.rs-2133331/v1]

        Examples
        --------
        >>> from imitatebias.generators import *
        >>> from imitatebias.cancels import *
        >>> import matplotlib.pyplot as plt
        >>> import seaborn as sns

        generate data and pool
        >>> X, y = generateData(500, 1, 10, seed=2210)
        >>> X_b, _, _ = generateBias(X, y, 1, seed=2210)

        fit Cancels
        >>> can = Cancels(n_pc=2)
        >>> can.fit(X_b)

        generate data points to fill in the bias (for the sake of visualization)
        >>> gen_p, _ = can.imi.augment()

        plot Cancels' indicated biases in PCA space
        >>> plt.scatter(can.pca.transform(X_b)[:,0], can.pca.transform(X_b)[:,1])
        >>> sns.kdeplot(x=gen_p[:,0], y=gen_p[:,1], cut=10, thresh=0, cmap='Greens')
        >>> plt.show()
        """
        self.pca = PCA(n_components = self.n_pc)
        self.pca.fit(data)
        self.d_trf = self.pca.transform(data)

        if bounding_pool is not None:
            self.imi.fit(self.d_trf, bounds_set=bounding_pool, strength=strength)
        elif bounding_range is not None:
            self.imi.fit(self.d_trf, bounds={0: bounding_range}, strength=strength)
        else:
            self.imi.fit(self.d_trf, labels=np.zeros(len(data)).astype(int))

score(pool)

Scores all elements in a pool on their bias-mitigating ability.

See [1]_ for details on the score.

Parameters:

pool : numpy.ndarray (2D), required
    The pool that shall be scored.

Returns:

numpy.ndarray (1D)
    The non-normalized scores for each element of the pool.
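
Since the returned scores are not normalized, they are mainly useful for comparing pool elements with each other. A minimal sketch, assuming NumPy and made-up score values, of ranking a pool by score and of rescaling the scores so that they sum to one:

```python
import numpy as np

scores = np.array([0.2, 0.0, 1.5, 0.3])  # stand-in for can.score(pool)

# Indices of the pool elements, best-scoring first.
ranking = np.argsort(scores)[::-1]

# Rescale to a probability distribution; guard against an all-zero score vector.
total = scores.sum()
probs = scores / total if total > 0 else np.full(len(scores), 1.0 / len(scores))
```
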

References

.. [1] Katharina Dost, Zac Pullar-Strecker, Liam Brydon, Kunyang Zhang, Jasmin Hafner, Patricia Riddle, and Jörg Wicker. "Combatting Over-Specialization Bias in Growing Chemical Databases." 05 October 2022, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-2133331/v1]

Examples:

>>> import numpy as np
>>> from imitatebias.generators import *
>>> from imitatebias.cancels import *
>>> import matplotlib.pyplot as plt

generate data and pool

>>> X, y = generateData(500, 1, 10, seed=2210)
>>> X_b, _, _ = generateBias(X, y, 1, seed=2210)
>>> pool = np.column_stack([np.random.uniform(min(X[:,i]), max(X[:,i]), size=1000) for i in range(len(X[0]))])

fit Cancels

>>> can = Cancels(n_pc=2)
>>> can.fit(X_b)

score the pool

>>> scores = can.score(pool)

plot the pool's scores in PCA space

>>> plt.scatter(can.pca.transform(pool)[:,0], can.pca.transform(pool)[:,1], c=scores)
>>> plt.colorbar()
>>> plt.show()
Source code in imitatebias\cancels.py
    def score(self, pool):
        """Scores all elements in a pool on their bias-mitigating ability.

        See [1]_ for details on the score.

        Parameters
        ----------
        pool : numpy.ndarray (2D)
            The pool that shall be scored.

        Returns
        -------
        numpy.ndarray (1D)
            The non-normalized scores for each element of the pool.

        References
        ----------
        .. [1] Katharina Dost, Zac Pullar-Strecker, Liam Brydon, Kunyang Zhang, Jasmin Hafner,
           Patricia Riddle, and Jörg Wicker. "Combatting Over-Specialization Bias in Growing
           Chemical Databases." 05 October 2022, PREPRINT (Version 1) available at Research
           Square [https://doi.org/10.21203/rs.3.rs-2133331/v1]

        Examples
        --------
        >>> import numpy as np
        >>> from imitatebias.generators import *
        >>> from imitatebias.cancels import *
        >>> import matplotlib.pyplot as plt

        generate data and pool
        >>> X, y = generateData(500, 1, 10, seed=2210)
        >>> X_b, _, _ = generateBias(X, y, 1, seed=2210)
        >>> pool = np.column_stack([np.random.uniform(min(X[:,i]), max(X[:,i]), size=1000) for i in range(len(X[0]))])

        fit Cancels
        >>> can = Cancels(n_pc=2)
        >>> can.fit(X_b)

        score the pool
        >>> scores = can.score(pool)

        plot the pool's scores in PCA space
        >>> plt.scatter(can.pca.transform(pool)[:,0], can.pca.transform(pool)[:,1], c=scores)
        >>> plt.colorbar()
        >>> plt.show()
        """
        pool_trf = self.pca.transform(pool)
        return self.imi.score(pool_trf, score_type='balanced')[:,0]