DiverseSelector.distance

Similarity Module.

DiverseSelector.distance.modified_tanimoto(a: array, b: array) -> float

Compute the modified tanimoto coefficient from bitstring vectors of data points A and B.

Adjusts calculation of the Tanimoto coefficient to counter its natural bias towards shorter vectors using a Bernoulli probability model.

.. math:: MT = \frac{2-p}{3}T_1 + \frac{1+p}{3}T_0

where \(p\) is the success probability of the independent trials, \(T_1\) is the number of common ‘1’ bits between the data points (\(T_1 = | A \cap B |\)), and \(T_0\) is the number of common ‘0’ bits between the data points (\(T_0 = |(1-A) \cap (1-B)|\)).
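The formula can be sketched as follows, taking the definitions above literally: \(T_1\) and \(T_0\) are raw counts of common ‘1’ and ‘0’ bits. The names `common_bits` and `modified_tanimoto_sketch` and the explicit `p` argument are illustrative assumptions. The documented function takes only `a` and `b`; how it obtains \(p\), and whether it normalizes the counts, is not stated here.

```python
def common_bits(a, b):
    """Return (T_1, T_0): positions where both bits are 1, and where both are 0."""
    t_1 = sum(x * y for x, y in zip(a, b))
    t_0 = sum((1 - x) * (1 - y) for x, y in zip(a, b))
    return t_1, t_0


def modified_tanimoto_sketch(a, b, p):
    """Combine T_1 and T_0 with the weights (2 - p) / 3 and (1 + p) / 3.

    `p` is passed in explicitly (an assumption made for illustration only).
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    t_1, t_0 = common_bits(a, b)
    return (2 - p) / 3 * t_1 + (1 + p) / 3 * t_0


a = [1, 1, 0, 0, 1]
b = [1, 0, 0, 1, 1]
t_1, t_0 = common_bits(a, b)  # two shared '1' bits, one shared '0' bit
mt = modified_tanimoto_sketch(a, b, p=0.6)  # p = 0.6 is an arbitrary example value
```

Plain Python lists stand in for the ndarray inputs to keep the sketch dependency-free.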

Parameters:
  • a (ndarray of shape (n_features,)) – The 1D bitstring feature array of sample \(A\) in an n_features dimensional space.

  • b (ndarray of shape (n_features,)) – The 1D bitstring feature array of sample \(B\) in an n_features dimensional space.

Returns:

mt – Modified tanimoto coefficient between bitstring feature arrays \(A\) and \(B\).

Return type:

float

Notes

The equation above has been derived from

.. math:: MT_\alpha = \alpha T_1 + (1-\alpha) T_0

where \(\alpha = \frac{2-p}{3}\), so that \(1 - \alpha = \frac{1+p}{3}\). This is done so that the expected value of the modified Tanimoto, \(E(MT)\), remains constant as the success probability \(p\) varies.
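As a quick numerical check of the substitution, the sketch below evaluates both forms of the formula for several values of \(p\) and confirms they agree. The function names and the sample values of \(T_1\) and \(T_0\) are hypothetical and chosen only for illustration.

```python
def alpha(p):
    """Weight on T_1 for a given success probability p."""
    return (2 - p) / 3


def mt_direct(p, t_1, t_0):
    """MT written with the explicit weights (2 - p) / 3 and (1 + p) / 3."""
    return (2 - p) / 3 * t_1 + (1 + p) / 3 * t_0


def mt_alpha(p, t_1, t_0):
    """MT written as alpha * T_1 + (1 - alpha) * T_0."""
    al = alpha(p)
    return al * t_1 + (1 - al) * t_0


# Compare the two forms over a small grid of p values (T_1 = 2, T_0 = 1 are arbitrary).
checks = [
    abs(mt_direct(p, 2, 1) - mt_alpha(p, 2, 1)) < 1e-12
    for p in (0.0, 0.25, 0.5, 0.75, 1.0)
]
```

The weight on \(T_1\) falls from 2/3 at \(p = 0\) to 1/3 at \(p = 1\), while the weight on \(T_0\) rises correspondingly.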

Fligner, M. A., Verducci, J. S., and Blower, P. E. (2002) A Modification of the Jaccard-Tanimoto Similarity Index for Diverse Selection of Chemical Compounds Using Binary Strings. Technometrics 44, 110-119.

DiverseSelector.distance.nearest_average_tanimoto(X: ndarray) -> float

Compute the average Tanimoto coefficient between each sample and its nearest neighbor, where nearness is measured by the Minkowski 2-norm.

For each sample, the closest neighbor is identified by computing its Minkowski 2-norm (i.e., Euclidean) distance to all other samples and selecting the sample with the shortest distance.
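The procedure can be sketched as below. This is a dependency-free illustration, not the library's implementation: the function names are hypothetical, plain lists stand in for the ndarray input, the Tanimoto coefficient uses the dot-product form \(A \cdot B / (\|A\|^2 + \|B\|^2 - A \cdot B)\), and ties between equally close neighbors are broken by taking the first one (an assumption).

```python
import math


def tanimoto_sketch(a, b):
    """Dot-product form of the Tanimoto coefficient (undefined for two zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)


def nearest_average_tanimoto_sketch(X):
    """Average Tanimoto coefficient between each sample and its Euclidean nearest neighbor."""
    coeffs = []
    for i, row in enumerate(X):
        # Index of the closest *other* sample by Euclidean distance.
        others = (k for k in range(len(X)) if k != i)
        j = min(others, key=lambda k: math.dist(row, X[k]))
        coeffs.append(tanimoto_sketch(row, X[j]))
    return sum(coeffs) / len(coeffs)


X = [[1, 1, 0], [1, 1, 1], [0, 0, 1]]
avg = nearest_average_tanimoto_sketch(X)
```

Here the nearest neighbors are sample 1 for sample 0, sample 0 for sample 1, and sample 1 for sample 2, giving coefficients 2/3, 2/3, and 1/3, so `avg` is 5/9.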

Parameters:

X (ndarray of shape (n_samples, n_features)) – Feature matrix of n_samples samples in n_features dimensional space.

Returns:

Average of the Tanimoto coefficients for each sample and its closest neighbor.

Return type:

float

Notes

Papp, Á., Gulyás-Forró, A., Gulyás, Z., Dormán, G., Ürge, L., and Darvas, F. (2006) Explicit Diversity Index (EDI): A Novel Measure for Assessing the Diversity of Compound Databases. Journal of Chemical Information and Modeling 46, 1898-1904.

DiverseSelector.distance.pairwise_similarity_bit(X: array, metric: str) -> ndarray

Compute pairwise similarity coefficient matrix.

Parameters:
  • X (ndarray of shape (n_samples, n_features)) – Feature matrix of n_samples samples in n_features dimensional space.

  • metric (str) – The method used to calculate the similarity coefficient between each pair of samples in the feature matrix. Options: “tanimoto”, “modified_tanimoto”.

Returns:

s – A symmetric similarity matrix between each pair of samples in the feature matrix. The diagonal elements are directly computed instead of assuming that they are 1.

Return type:

ndarray of shape (n_samples, n_samples)
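A minimal sketch of how such a matrix could be assembled, assuming the metric name is looked up in a table of pairwise functions. The names `METRICS`, `tanimoto_sketch`, and `pairwise_similarity_sketch` are hypothetical, only the “tanimoto” option is sketched, and nested lists stand in for the ndarray.

```python
def tanimoto_sketch(a, b):
    """Dot-product form of the Tanimoto coefficient (undefined for two zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)


# Hypothetical lookup table; only "tanimoto" is included in this sketch.
METRICS = {"tanimoto": tanimoto_sketch}


def pairwise_similarity_sketch(X, metric):
    """Return a symmetric n_samples x n_samples similarity matrix as nested lists."""
    if metric not in METRICS:
        raise ValueError(f"unsupported metric: {metric!r}")
    func = METRICS[metric]
    n = len(X)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Start at j = i so the diagonal is computed rather than assumed to be 1.
        for j in range(i, n):
            s[i][j] = s[j][i] = func(X[i], X[j])
    return s


X = [[1, 1, 0], [1, 1, 1], [0, 0, 1]]
S = pairwise_similarity_sketch(X, "tanimoto")
```

Computing only the upper triangle and mirroring it guarantees symmetry and halves the number of metric evaluations.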

DiverseSelector.distance.tanimoto(a: array, b: array) -> float

Compute Tanimoto coefficient or index (a.k.a. Jaccard similarity coefficient).

For two binary or non-binary arrays \(A\) and \(B\), the Tanimoto coefficient is defined as the size of their intersection divided by the size of their union:

.. math::

   T(A, B) = \frac{| A \cap B|}{| A \cup B |} = \frac{| A \cap B|}{|A| + |B| - | A \cap B|} = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}

where \(A \cdot B = \sum_i{A_i B_i}\) and \(\|A\|^2 = \sum_i{A_i^2}\).
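To illustrate that the set form and the dot-product form agree for binary arrays, the sketch below computes both and also applies the dot-product form to a non-binary pair. The function names are hypothetical, plain lists stand in for the ndarray inputs, and neither function guards against two all-zero arrays, where the denominator is zero.

```python
def tanimoto_sketch(a, b):
    """Dot-product form: A.B / (||A||^2 + ||B||^2 - A.B)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    return dot / (norm_a + norm_b - dot)


def tanimoto_sets(a, b):
    """Set form for binary arrays: |A intersect B| / |A union B| over the 'on' positions."""
    on_a = {i for i, x in enumerate(a) if x}
    on_b = {i for i, y in enumerate(b) if y}
    return len(on_a & on_b) / len(on_a | on_b)


a = [1, 1, 0, 0, 1]
b = [1, 0, 0, 1, 1]
binary_dot = tanimoto_sketch(a, b)   # 2 / (3 + 3 - 2)
binary_set = tanimoto_sets(a, b)     # 2 shared positions out of 4 in the union
real_valued = tanimoto_sketch([1.0, 2.0], [2.0, 1.0])  # 4 / (5 + 5 - 4)
```

For the binary pair both forms give 0.5; for the non-binary pair only the dot-product form applies and gives 2/3.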

Parameters:
  • a (ndarray of shape (n_features,)) – The 1D feature array of sample \(A\) in an n_features dimensional space.

  • b (ndarray of shape (n_features,)) – The 1D feature array of sample \(B\) in an n_features dimensional space.

Returns:

coeff – Tanimoto coefficient between feature arrays \(A\) and \(B\).

Return type:

float

Notes

Bajusz, D., Rácz, A., and Héberger, K. (2015) Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? Journal of Cheminformatics 7.