Measuring Chemical Similarity in Metabolomics

One of the primary uses of chemical similarity in medicine is drug target prediction. Like much research in bioinformatics, the goal is to predict interaction or effect automatically, without testing each possibility in the lab. Chemical similarity between target substrates or products and existing well-studied compounds can assist in predicting outcome of pharmaceutical treatments.

But a problem arises when discussing chemical similarity: there isn’t a comprehensive definition of what it means. For its uses in the medical field, we are primarily interested in potential for interaction between compounds. This means that the shape of the molecule and its functional groups matter, but this is not necessarily all. Assumptions must be made by the methods used for determining the similarity of two compounds.

Below, I will go into a few key methods for measuring chemical similarity along with the assumptions made by these methods. For a comprehensive analysis and comparison of these methods, see Dr. Edmond Duesbury’s PhD thesis. Note: his thesis includes a discussion of hyperstructures, which I don’t discuss here, and it excludes ontology-based methods. It is also not specific to the metabolomics field.

Fingerprint-Based Methods

The most popular method for chemical similarity when applied to metabolites (my current area of focus) is representing compounds as fingerprints. These fingerprints are fixed-length bit vectors that encode information about the compound, and the encoding is specific to what researchers in the field believe is relevant. One may then use the Tanimoto/Jaccard coefficient between the two bit strings to obtain the similarity.

In metabolomics, the most common type of encoding is the PubChem ID, an 880 bit vector denoting substructures present in the compound. Computing the Tanimoto coefficient on these vectors is quick and easy, but there is something questionable about weighing all of these bits equally. Consider this: not all substructures are equally common, and some may be related to one another. Do we really want to place equal importance on each of these substructures? In spite of this potential problem, this method has been shown to perform well in some scenarios (although it hasn’t been tested in metabolomics with ground truth) and is often used because of its convenience.

Ontology-Based Methods

Several databases of chemical ontologies exist. These include ClassyFire, ChEBI, and LION\Web (for lipids only). In these ontologies, compounds are organized hierarchically into groups chosen by those designing the ontology. For instance, ClassyFire divides compounds into organic and inorganic, then into lipids, analogues, metal compounds, non-metal compounds, etc (see the paper for the full ontology).

For ontological databases, it is possible to compare two compounds by the height of their split in the ontology. This can be easily done once one queries the compound of interest and obtains the ontology. This is simple in ClassyFire: the classyfireR package in R returns a compound’s ontology up to four levels of the hierarchy using its InChIKey. For better resolution, one can also use the batch tool to manually query a list of InChIKeys. Notably, this chemical similarity is based on the organization of the hierarchy rather than on properties that can be observed from chemical structure alone.

Graph-Based Methods

Graph-based methods use graph mining to determine the similarity of two compounds. Atoms are often treated as nodes, with bonds as edges (although, in some cases, line graphs are used). As a computer scientist, I’m partial to these. The problem is that they are less scalable than the methods described above.

One simple graph-based method is atom similarity, developed in the 1980’s. In this method, the path (in non-hydrogen bonds) is computed between every pair of atoms in the molecule. The similarity is then based on the number of shared paths (i.e. shared bond structure between each pair of atoms). As far as I’m aware, this method is still being used in the ChemMineR R package.

One that provides a more intuitive graphical measure of chemical similarity, but is also more computationally intensive, is Maximum Common Substructure (MCS). In this method, a modular product graph is computed between two molecules. This graph includes all possible mappings between atoms and bonds from the chemical graphs of the compounds. Nodes denote pairs of atoms with the same identity in both compounds, and edges denote pairs of bonds shared between two pairs of atoms across compounds. Intuitively, any atom-bond substructure found in both compounds would manifest as a clique in the modular product graph. So, the problem of determining similarity is simply to find the maximum clique and compute the ratio of atoms and bonds in the clique to the union of atoms and bonds between the two structures.

…Except for one problem. The maximum clique problem is NP-hard. In comparison to fingerprint or ontology -based methods, MCS is painfully slow. Heuristics exist and are described in Duesbury’s thesis, but the available R package does not appear to include them.


Each method seems to provide its own potential benefit. Methods based on ontology or fingerprints rely on more assumptions than those based on chemical graphs. However, they are much faster and more scalable than methods based on chemical graphs. Fingerprint-based methods, combined with Tanimoto similarity, are currently the most common approach in the Metabolomics field.


Duesbury, Edmund (2015). Applications and Variations of the Maximum Common Subgraph for the Determination of Chemical Similarity. PhD thesis, University of Sheffield. [online] Available at: [Accessed 24 Mar. 2019].

Temple University. (2009) PubChem Substructure Fingerprint. [online] Available at: [Accessed 24 Mar. 2019].

Molenaar MJeucken AWassenaar T, et al. (2018). LION/web: a web-based ontology enrichment tool for lipidomic data analysis. bioRxiv. [online] Available at: [Accessed 24 Mar. 2019].

Degtyarenko K, de Matos P, Ennis M, et al. (2008). ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research, 36 (Database issue), D344–D350. [online] Available at: [Accessed 24 Mar. 2019].

Feunang Y, Eisner R, Knox C, et al. (2016). ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. Journal of Cheminformatics, 8(61). [online] Available at: %5BAccessed 24 Mar. 2019].

Fiehn O. ClassyFire Batch by Fiehn Lab [Software]. Available at: [Accessed 24 Mar. 2019].

Carhart R, Smith D, and Venkataraghavan R. (1985). Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 25(2), 64-73. [online] Available at: %5BAccessed 24 Mar. 2019].

Raymond J, Gardiner E, and Willet P. (2002). RASCAL: Calculation of Graph Similarity using Maximum Common Edge Subgraphs. The Computer Journal, 45(6), 631-644. [online] Available at:  [Accessed 24 Mar. 2019].

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s