DiverseSelector.diversity

Molecule dataset diversity calculation module.

DiverseSelector.diversity.compute_diversity(features: array, div_type: str = 'hypersphere_overlap_of_subset') → float

Compute diversity metrics.

Parameters:
  • features (np.ndarray) – Feature matrix.

  • div_type (str, optional) – Method for calculating the diversity of a given molecule set. Options are “entropy”, “logdet”, “shannon_entropy”, “wdud”, “gini_coefficient” and “hypersphere_overlap_of_subset”. Default is “hypersphere_overlap_of_subset”.

  • mols (List[rdkit.Chem.rdchem.Mol], optional) – List of RDKit molecule objects. This is only needed when using the “explicit_diversity_index” method. Default=None.

Return type:

float, computed diversity.

DiverseSelector.diversity.entropy(x: ndarray) → float

Compute entropy of matrix.

The equation for entropy is

\[E = -\frac{\sum_{i=1}^{L} \frac{y_i}{N}\ln{\frac{y_i}{N}}}{L\frac{\ln{2}}{2}},\]

where \(N\) is the number of molecules in the set, \(L\) is the length of the fingerprint, and \(y_i\) is the bit count of feature \(i\) across the fingerprints.
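The entropy formula above can be sketched directly in NumPy. This is a minimal illustration, not the library's implementation: the name `entropy_sketch` is hypothetical, nonzero entries are assumed to count as set bits, and features with a zero bit count are assumed to contribute nothing to the sum.

```python
import numpy as np


def entropy_sketch(x):
    """Evaluate E = -sum((y_i/N) ln(y_i/N)) / (L ln(2)/2) for a feature matrix."""
    # Assumption: any nonzero entry is treated as a set bit.
    bits = (np.asarray(x) != 0).astype(float)
    n_mols, n_feats = bits.shape
    # y_i / N: fraction of molecules that have feature i.
    frac = bits.sum(axis=0) / n_mols
    # Assumption: features with y_i = 0 are skipped, since p ln(p) -> 0 as p -> 0.
    frac = frac[frac > 0]
    numerator = -np.sum(frac * np.log(frac))
    return float(numerator / (n_feats * np.log(2) / 2))


# Two molecules, two features, each feature set in exactly half the molecules.
print(entropy_sketch([[1, 0], [0, 1]]))
```

With this normalization, a feature present in exactly half of the molecules contributes the maximum amount, and a feature present in all molecules contributes zero.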

Parameters:

x (ndarray) – Feature matrix.

Returns:

e – Entropy of matrix.

Return type:

float

Notes

Feature matrices are converted to bits, so any information carried by the numerical values in the matrix is lost.

Weidlich, I. E., and Filippov, I. V. (2016) Using the Gini coefficient to measure the chemical diversity of small-molecule libraries. Journal of Computational Chemistry 37, 2091-2097.

DiverseSelector.diversity.gini_coefficient(a: ndarray)

Gini coefficient of bit-wise fingerprints of a database of molecules.

Measures the chemical diversity of a database of molecules defined by the following formula:

\[G = \frac{2 \sum_{i=1}^L i ||y_i||_1 }{L \sum_{i=1}^L ||y_i||_1} - \frac{L+1}{L},\]

where \(y_i \in \{0, 1\}^N\) is the vector of zeros and ones, of length equal to the number of molecules \(N\), for the \(i\)-th feature, and \(L\) is the feature length.
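A minimal NumPy sketch of a Gini coefficient over per-feature bit counts follows. It uses the standard sorted-value form of the Gini coefficient, with the bit counts \(||y_i||_1\) sorted in ascending order and the feature count \(L\) as the normalizer; both choices are assumptions of this sketch, and the name `gini_sketch` is hypothetical.

```python
import numpy as np


def gini_sketch(a):
    """Gini coefficient of the per-feature bit counts of a binary (N, L) matrix."""
    a = np.asarray(a, dtype=float)
    n_feats = a.shape[1]
    # ||y_i||_1: number of molecules in which feature i is set.
    # Assumption: counts are sorted in ascending order before ranking.
    counts = np.sort(a.sum(axis=0))
    ranks = np.arange(1, n_feats + 1)
    weighted = 2.0 * np.sum(ranks * counts) / (n_feats * np.sum(counts))
    return float(weighted - (n_feats + 1) / n_feats)


# Every feature is set equally often, so the bit counts are perfectly even.
print(gini_sketch(np.ones((3, 4))))
```

Evenly distributed bit counts give a value of zero, and concentrating the set bits in a few features pushes the value toward one.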

Parameters:

a (ndarray(N, L)) – Molecule features in L bits with N molecules.

Returns:

Gini coefficient between zero and one, where closer to zero indicates more diversity.

Return type:

float

References

[1]

Weidlich, Iwona E., and Igor V. Filippov. “Using the Gini coefficient to measure the chemical diversity of small‐molecule libraries.” Journal of Computational Chemistry 37 (2016): 2091-2097.

DiverseSelector.diversity.hypersphere_overlap_of_subset(lib: ndarray, x: array) → float

Computes the overlap of the subset with hyper-spheres around each point.

The edge penalty is also included, which disregards areas outside of the boundary of the full feature space/library. This is calculated as:

\[g(S) = \sum_{i < j}^k O(i, j) + \sum^k_m E(m),\]

where \(i, j\) run over the subset of molecules, \(O(i, j)\) is the approximate overlap between hyper-spheres, \(k\) is the number of molecules in the subset and \(E\) is the edge penalty of a molecule.

Parameters:
  • lib (ndarray) – Feature matrix of all molecules.

  • x (ndarray) – Feature matrix of selected subset of molecules.

Returns:

g_s – The total diversity volume of the matrix.

Return type:

float

Notes

Agrafiotis, D. K. (1997) Stochastic Algorithms for Maximizing Molecular Diversity. Journal of Chemical Information and Computer Sciences 37, 841-851.

DiverseSelector.diversity.logdet(x: ndarray) → float

Computes the log determinant function.

Input is an \(S \times n\) feature matrix with \(S\) molecules and \(n\) features.
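The text here does not spell out the formula, so the following is only a sketch under a stated assumption: it evaluates \(\log\det(XX^T + I)\), the regularized Gram-matrix form commonly used for log-determinant diversity functions. The name `logdet_sketch` is hypothetical, and the added identity matrix is an assumption rather than something confirmed by this page.

```python
import numpy as np


def logdet_sketch(x):
    """Log-determinant of the regularized Gram matrix X X^T + I for an (S, n) matrix."""
    x = np.asarray(x, dtype=float)
    n_mols = x.shape[0]
    # (S, S) matrix of inner products between molecules.
    gram = x @ x.T
    # Assumption: the identity is added so the determinant stays positive
    # even when molecules are linearly dependent.
    _sign, log_abs_det = np.linalg.slogdet(gram + np.eye(n_mols))
    return float(log_abs_det)


# Two orthogonal unit feature vectors: X X^T + I = 2 I, so the result is ln(4).
print(logdet_sketch(np.eye(2)))
```

`numpy.linalg.slogdet` is used instead of taking the logarithm of `numpy.linalg.det` because it avoids overflow and underflow for larger subsets.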

Parameters:

x (ndarray(S, n)) – Subset feature matrix.

Returns:

f_logdet – The volume of the parallelotope spanned by the matrix.

Return type:

float

Notes

Nakamura, T., Sakaue, S., Fujii, K., Harabuchi, Y., Maeda, S., and Iwata, S. (2022) Selecting molecules with diverse structures and properties by maximizing submodular functions of descriptors learned with graph neural networks. Scientific Reports 12.

DiverseSelector.diversity.shannon_entropy(x: ndarray) → float

Computes the Shannon entropy of a matrix.

The equation for Shannon entropy is

\[H(X) = \sum_{i=1}^{n}-P_i(X)\log{P_i(X)}\]

where \(X\) is the feature matrix, \(n\) is the number of features, and \(P_i(X)\) is the proportion of molecules that have feature \(i\) in \(X\).
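The Shannon entropy formula above can be sketched as follows. The name `shannon_entropy_sketch` is hypothetical, the natural logarithm is an assumption (the base is not stated here), and features absent from every molecule are skipped as an assumption rather than treated as an error.

```python
import numpy as np


def shannon_entropy_sketch(x):
    """Evaluate H(X) = sum_i -P_i(X) log P_i(X) for a bit-string matrix."""
    x = np.asarray(x)
    n_mols = x.shape[0]
    # P_i(X): proportion of molecules in which feature i is set.
    proportions = np.count_nonzero(x, axis=0) / n_mols
    # Assumption: features with P_i = 0 are skipped to avoid log(0).
    proportions = proportions[proportions > 0]
    # Assumption: natural logarithm.
    return float(-np.sum(proportions * np.log(proportions)))


# Each of the two features is set in half of the molecules.
print(shannon_entropy_sketch([[1, 0], [0, 1]]))
```

A feature that is set in every molecule has \(P_i = 1\) and therefore adds nothing to the total.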

Parameters:

x (ndarray) – Bit-string matrix.

Returns:

h_x – The Shannon entropy of the matrix.

Return type:

float

Notes

Leguy, J., Glavatskikh, M., Cauchy, T., and Benoit. (2021) Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization. Journal of Cheminformatics 13.

DiverseSelector.diversity.wdud(x: ndarray) → float

Computes the Wasserstein Distance to Uniform Distribution (WDUD).

The equation for the Wasserstein distance of a single feature to the uniform distribution is

\[WDUD(x) = \int_{0}^{1} |U(x) - V(x)|\,dx,\]

where the feature is normalized to [0, 1], \(U(x)=x\) is the cumulative distribution of the uniform distribution on [0, 1], and \(V(x) = \sum_{y \le x} 1/N\) is the discrete cumulative distribution of the feature values, with \(y\) running over the \(N\) values of the feature. The integral is calculated iteratively between consecutive sorted values \(y_i\) and \(y_{i+1}\) using the trapezoidal method.
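The integral above can be sketched in NumPy as shown below. This sketch integrates each segment exactly instead of using the trapezoidal rule, which is possible because \(V\) is constant between consecutive sorted values. The name `wdud_sketch` is hypothetical, min-max scaling is assumed for the normalization, and constant features are rejected as an assumption.

```python
import numpy as np


def wdud_sketch(x):
    """Mean over features of the integral of |U(t) - V(t)| on [0, 1]."""
    x = np.asarray(x, dtype=float)
    n_mols = x.shape[0]

    def antiderivative(t, c):
        # d/dt [0.5 * (t - c) * |t - c|] = |t - c|
        return 0.5 * (t - c) * np.abs(t - c)

    # V(t) equals k / N on the segment between the k-th and (k+1)-th sorted value.
    levels = np.arange(1, n_mols) / n_mols
    scores = []
    for column in x.T:
        span = column.max() - column.min()
        if span == 0:
            # Assumption: a constant feature cannot be scaled to [0, 1].
            raise ValueError("constant feature cannot be normalized")
        # Assumption: min-max scaling maps the feature onto [0, 1].
        y = np.sort((column - column.min()) / span)
        segments = antiderivative(y[1:], levels) - antiderivative(y[:-1], levels)
        scores.append(np.sum(segments))
    return float(np.mean(scores))


# Evenly spread values score lower than values clustered near one end.
print(wdud_sketch([[0.0], [1 / 3], [2 / 3], [1.0]]))
print(wdud_sketch([[0.0], [0.01], [0.02], [1.0]]))
```

Lower values indicate feature values that are spread more uniformly over their range.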

Parameters:

x (ndarray(N, K)) – Feature array of N molecules and K features.

Returns:

The mean of the WDUD of each feature over all molecules.

Return type:

float

Notes

Nakamura, T., Sakaue, S., Fujii, K., Harabuchi, Y., Maeda, S., and Iwata, S. (2022) Selecting molecules with diverse structures and properties by maximizing submodular functions of descriptors learned with graph neural networks. Scientific Reports 12.