In ecology, the concepts of alpha diversity and beta diversity are frequently used to characterize habitats. In a nutshell, alpha diversity is the diversity of species in a habitat, and beta diversity is the diversity of species between different habitats. An illustration is given below.
The reason why I (a non-ecologist) am writing about this is because these metrics are also used in microbiome analyses, and an R package for microbiome data is now available from Bioconductor which includes these metrics.
The authors of the package have given a nice introduction to their package here. In the section on alpha diversity, you can see that there is a plethora of diversity indices reported by the package. They also have some functions available for beta diversity analysis. I’m going to discuss some of the math behind these indices to motivate why a particular metric might be useful for a particular study.
Species richness in a sample is a simple concept: it is the number of species at the sample site. However, as some species may have abundances below the detection limit or may not be detected due to gaps in the database, true richness must estimate the number of undetected species. The microbiome package uses Chao richness  as a lower bound of richness, which is defined as:
Sobs is the number of observed species, f1 is the number of species observed in the first individual, and f2 is the number of species observed in the second individual sampled. The assumption being made here is that the number of undiscovered species is equal to the ratio of the square of discovered species in the first individual to twice the number of discovered species in the second individual. This may be a fair assumption for many applications, but it’s probably a good idea to think about whether this assumption makes sense for the data you are analyzing.
For the rest of this post, I will denote richness (in the general sense as number of species) using R.
Richness gives a count of the number of species, but alpha diversity takes that a step further and examines the proportions of species. Here, in addition to species count, species abundance also comes into play.
Inverse Simpson Diversity
The Inverse Simpson Diversity metric  was first published by Edward Simpson in 1949 and is fairly straightforward:
Here, ni is the abundance of species i, and N is the total abundance of all species. Summed over all species, the metric tells us the sum of squares of all abundance ratios. Note that, when the species are evenly distributed, the denominator is expected to be low, especially in systems with high numbers of species. Conversely, a system dominated by one or two species will have a high value in the denominator. Diversity will be maximized when there are N species, each with an abundance of 1, and minimized when there is a single species with an abundance of N.
Gini Simpson Diversity
This metric  is very similar to the Inverse Simpson metric. The only difference is that, instead of the inverse, the sum of squares is subtracted from 1, i.e:
The effect here is the same, but this metric scales the diversity metric between 0 and 1. The differences here between high and low sum of squares values are less extreme than with the Inverse Simpson metric.
Claude Shannon developed his metric of entropy to use in information theory . Today, it is used in many fields, such as measuring diversity in the microbiome.
The best way to understand how Shannon entropy/diversity works is to look at a plot of how it works with only two species and log base 2. The entropy is highest when both are in equal proportion, and lowest when one species is completely dominant. A word of caution: things change when more species are added. With 4 species, the entropy maxes out at 2 instead of 1.
Fisher’s diversity  is not so straightforward, and it assumes the number of species expected to be present in j individuals (i.e. fj) follows a log distribution, i.e.
The diversity metric is then the closest fitting alpha to satisfy each equation given the actual breakdown of species by individual. Empirically, alpha approximates the number of species per individual. This metric may be useful if you wish to examine the diversity within a group of individuals, rather than pooling together all samples in the group. It won’t be particularly useful for evaluating the dominance or evenness of species.
Coverage diversity tells you the number of species needed to cover at least half of the total abundance.
This also does not tell you whether there is evenness in the species abundance. To understand that, you would need to compare to the total number of species and possibly look at other cutoffs besides 0.5. But what this does tell you that the Simpson and Shannon metrics do not is the number of species needed to make up a portion of the abundance.
The alpha diversity metrics may focus on different aspects of diversity, such as count vs. dominance of a species or evenness in the abundance of species. Evenness metrics are specifically designed to compute evenness of abundance, without being affected by species count.
Camargo evenness  is based on a sum of pairwise differences in abundance between species (the term in the numerator). This sum is normalized by an unattainable “worst case scenario” for evenness in which the difference in abundance between each pair of species is itself the abundance of one member of the pair. When there is little difference in abundance between species, this value will be low, and it will become high after being subtracted from 1.
The Pielou Evenness  is related to Shannon entropy. In fact, the term in the numerator is exactly the Shannon entropy, which is fitting because high entropy indicates evenness of the distribution of species. It is normalized by the total log of the sum of abundances. This helps to mitigate the problem of differing species counts’ effects on the entropy range.
Simpson evenness is closely related to Simpson diversity. It is actually just the Inverse Simpson Diversity with the richness added to the denominator. As with the normalization in the Shannon metric, this is also to mitigate the effect of species count on the range of the metric.
This evenness metric is more complex, but is described in . What is being computed is a logged variance in abundances, which is then rescaled to be between 0 and 1 using an arctan function. Variance is a sensible metric here, because a system with evenness in species abundance is expected to have low variance.
Bulla Evenness  focuses on abundance ratios lower than expected under an even distribution of abundance of all species. It computes this using the minimization function in the summation. In an ideal scenario of evenness, the abundance will always be equivalent to the inverse of the number of species (i.e. the richness), and the resulting value will be 1. The more abundances dip below this expected value, the smaller the value in the numerator will be.
Absolute dominance is very simple: it is the maximum abundance value. This can help to identify the maximum abundance in a dominant species, but it does not identify the dominance of the additional species, nor does it take into account the total abundance.
Relative dominance does take into account the total abundance. There are variations of relative dominance, but all can be represented using the following formula:
Here, the k most abundant species will be used to compute the dominance. If k = 1, this is called the DBP metric , and it will be based only on the most abundant species. If k = 2, this is called the DMN metric , and it will be based on the top two most abundant species.
Simpson dominance is simply the sum of squared abundance ratios, i.e:
Note that this is the opposite of the inverse Simpson metric used to measure diversity. If one considers dominance to be the opposite of diversity, then this may be a good choice.
The core abundance measures the sum of all abundance ratios above a threshold. This is similar to the relative dominance, except that the cutoff is based on a threshold and not a ranking.
Finally, the Gini Index  was developed in 1912 by Corrado Gini to measure wealth distribution. The idea is somewhat similar to Camargo evenness, in that it makes use of absolute differences in abundance and normalizes them.
A good way to understand the Gini Index is by looking at what it was initially trying to measure. The Gini Index was originally developed to measure the deviation of an economic system from one in which wealth was distributed equally, called the Lorenz curve. The Gini Index is double the area of this curve.
Since we’re talking about species abundance rather than income, the Gini Index in this context measures the relationship between the cumulative share of species and the cumulative share of abundance. Intuitively, if only a few species were highly abundant, the area between the Lorenz curve and the line of equality will be higher than if many species were equally abundant, leading to a higher Gini Index.
Rarity indices measure the amount or proportion of rare species in a system, where a rare species is one that is present, but not abundant.
Log Modulo Skewness
This metric  is based on the skewness of the statistical distribution of the abundances. Intuitively, the more skewed a distribution is, the higher the number of rare species represented. The expression inside of the sign and log functions is the standard skewness expression.
For those not familiar with skewness, it is best understood using an illustration. The figure below illustrates positive skew, no skew, and negative skew.
Low abundance measures the proportion of abundance ratios that are comprised of rare species. Here, rarity is defined using abundance below a given threshold.
Non-core abundance is the exact opposite of core abundance (used as a dominance metric), i.e:
This tells us how much of the abundance is not accounted for by the core abundance.
Rare abundance is very similar to low abundance, with the difference being that low abundance thresholds the raw abundance value, whereas rare abundance thresholds the ratio of abundance in the species to total abundance. Specifically, for the microbiome package, a cutoff of 0.2 is used.
Only one type of beta diversity between samples is computed using the microbiome package, and that is the one used in , i.e. a statistical analysis of the Bray-Curtis beta diversity matrix between samples. The Bray-Curtis metric  is defined as follows:
Here, j and l are two samples in the data set. So, the second term is the sum of shared abundances for each species present in either sample, divided by the sum of total abundances for each either sample. If a small percentage of abundance is shared, this will result in high diversity.
There are other metrics of beta diversity as well, and these can be found in the vegan R package. You can find a description of those beta diversity metrics here.
 Chao,A. and Chiu,C.-H. (2016) Species Richness: Estimation and Comparison. In, Wiley StatsRef: Statistics Reference Online., pp. 1–26.
 Simpson,E.H. (1949) Measurement of diversity . Nature, 163, 688.
 Jost,L. (2006) Entropy and diversity. Oikos, 113, 363–375
 ] Shannon,C.E. (1948) A Mathematical Theory of Communication. Bell Syst. Tech. J., 27, 379–423.
 Fisher,R.A. et al. (1943) The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population. J. Anim. Ecol., 12, 42–58.
 Camargo,J.A. (1995) On Measuring Species Evenness and Other Associated Parameters of Community Structure. Oikos, 74, 538.
 Pielou,E.C. (1966) The measurement of diversity in different types of biological collections. J. Theor. Biol., 13, 131–144.
 Smith,B. and Wilson,J.B. (1996) A Consumer’s Guide to Evenness Indices. Oikos, 76, 70.
 Bulla,L. (1994) An Index of Evenness and Its Associated Diversity Measure. Oikos, 70, 167.
 Berger,W.H. and Parker,F.L. (1970) Diversity of planktonic foraminifera in deep-sea sediments. Science, 168, 1345–7.
 Mcnaughton,S.J. (1967) Relationships among functional properties of Californian Grassland. Nature, 216, 168–169.
 Gini, C. (1912) Variabilità e mutabilità. In: Pizetti, E. and Salvemini, T., Eds., Rome: Libreria Eredi Virgilio Veschi, Memorie di metodologica statistica.
 Locey,K.J. and Lennon,J.T. (2016) Scaling laws predict global microbial diversity. Proc. Natl. Acad. Sci. U. S. A., 113, 5970–5975.
 Salonen,A. et al. (2014) Impact of diet and individual variation on intestinal microbiota composition and fermentation products in obese men. ISME J., 8, 2218–2230.
 Bray,J.R. and Curtis,J.T. (1957) An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecol. Monogr., 27, 325–349.
The course introduced me to aspects of metabolomics data sets that I, and I think it’s safe to say most bioinformaticians, hadn’t commonly thought about. Data sources and variation factors are significant in metabolomics but are often overlooked by data scientists and statisticians, who generally assume data is “clean” before starting analysis. This post offers a highly condensed version of the course material for bioinformaticians from backgrounds similar to mine. My hope is that it will help others both understand the origins of their data and look at metabolomics papers and data sets with a critical eye.
Metabolomics Experiment Overview
Metabolomics is the study of all small molecules in a biological system. “small” is defined differently by different sources – one definition includes only metabolites < 1 kDa. Metabolomics studies fall into two groups:
Targeted metabolomics (designed to test a hypothesis, focuses on specific metabolites)
Untargeted metabolomics (designed to generate a hypothesis, focuses on surveying all metabolites).
In either case, both the sample processing and experimental workflow are chosen with the scientific question in mind, e.g.
How does the metabolomic profile of a tomato grown with fertilizer X differ from the metabolomic profile of a tomato grown with fertilizer Y?
How does the plasma metabolome of a patient given drug X differ from that of a patient given drug Y?
Samples from relevant populations are then collected, extracted, and run through an instrument (either a mass spectrometer or NMR instrument). These instruments have a detector to detect (for MS) mass-to-charge (m/z) ratios and retention times of the compounds and (for NMR) chemical shifts of the compounds in magentic environments. (The retention time of the compound is the time when it exits the column – more on that later). The end result is a three-dimensional plot, where the dimensions are m/z, intensity (frequency of detector hits), and retention time.
MS plots are often projected onto two dimensions for interpretability. There are a few types of two-dimensional plots used. One is the total ion chromatogram, which plots only retention time by intensity, summed over all m/z. Here, multiple metabolites can potentially be represented by a single peak. Another plot is the base peak chromatogram. Here, only the maximum intensity peak for a given retention time is shown in the chromatogram. Finally, there is the mass spectra, which includes only m/z by intensity for a given retention time.
Even 2D plots can be difficult to interpret: compounds don’t always separate neatly from one another, and they fragment during the ionization process used by the instrument. However, software (sometimes open source, but often proprietary) exists to separate out peaks in the chromatogram. This is called “peak picking” and is usually followed by detecting features (peak sets) believed to correspond to metabolites. Once features are detected, the researcher can look up the corresponding metabolites using mass-to-charge ratios and adducts or fragments detected by the software. Although NMR spectra are different, NMR also induces fragmentation and requires peak-picking software and spectrum matching for identification.
In any metabolomics study, most features will not be identified. This happens for multiple reasons. Sometimes, the configuration of the experiment or sample processing procedure prevents some metabolites from being detected. Sometimes, spectra are clear but cannot be matched to anything in a database or a standard (two ways of confirming metabolite identity – more on that later). And sometimes, metabolites are overly fragmented, so their spectra are not recognizable or misindentified. Even for those features identified, they often cannot be quantified exactly; all that can be determined is the abundance of a feature in one group of samples compared to another.
You as an analyst should think about how the experimental design, extraction method, components and settings of the instrument, and peak-picking procedures affect the metabolites detected. You should be especially aware of this when comparing results from different studies. The two labs were likely trying to answer slightly different questions, and they designed their experiments to answer these questions. People conducting primary research generally don’t think about secondary analysis when they design their experiments because they need to design them in the most efficient and cost-effective way for their lab.
Experimental Design Effects
There are many factors when it comes to experimental design, and one of the most important precautions to take is to ensure that external factors are not contributing to apparent metabolomic differences between samples. Here are a few ways that researchers minimize these external factors; however, they aren’t all feasible for every study. You should check whether these aspects of their design are documented in their study when interpreting results.
Storage time and procedure: All samples should be stored in the same way, in the same type of container, for the same amount of time. Freezing and thawing a sample can affect its metabolite profile, and so can its contact with the container. Keeping these factors consistent is an ideal scenario and doesn’t always happen in reality.
Processing: All samples should be extracted in the same way (and ideally by the same lab tech) and run through the same experiment all in one batch. But again, this doesn’t always happen, especially if there is a large number of samples. One practice is to flush the instrument with solvent after each batch to prevent contamination from previous samples and to randomize run order.
Replicates: Are the replicates technical (coming from the same organism) or biological (coming from different organisms in the same group)? This can have a significant effect on the results.
Quality Control: Quality control samples (QC) are often used for finding outliers after peaks are picked as well as for correcting for differences between runs. The QC sample is usually a pooled combination of all samples and is run along with other samples. Feature abundances can then be corrected when normalizing by QC.
Blanks: Blanks can be run to correct for effects due to sample processing and storage. Creating blanks may involve putting water or another neutral substance through the same process as the sample, storing water in the same container for the same amount of time as the sample, or adding pH-balancing buffers to both the sample and the blank. Hypothetically, it is also possible to have a blank created by each tech, have a separate blank for each batch, etc, to correct for other effects. In reality, this is often limited by resources and time.
Internal standards: These are pure compounds that are run through the instrument along with the samples. Because the quantity of the compound is known, they can be used to normalize and even give an absolute quantification of metabolite abundance. However, standards aren’t always available for a given experiment.
The following image shows an example of samples from an experiment using QC, blanks, and internal standards.
Before they can be run through the experiment, samples must be extracted. Extraction protocols differ widely depending on the metabolites of focus, and they also differ between instrument types. Depending on the goal of the experiment, samples may be frozen, dried, centrifuged, extracted using a solvent, and/or diluted. Some frequently-analyzed substances, such as human plasma, have established protocols. Others do not, in which case the extraction protocol must be defined ad hoc by the lab.
Aside from the type of sample, the extraction protocol is also going to depend on the instrument being used. In liquid chromatography and NMR, extraction procedures can vary and are usually dependent on the metabolites of focus. For GC-MS, the sample must be in the gaseous state, with the compounds distributed throughout the gas, so extraction options are limited. For substances that are not volatile, a chemical process called derivitization is used to make them volatile, and then the extraction procedure is performed.
One way to extract metabolites for GC-MS is by collecting the gas above the sample (called the headspace), which depends on the solubility of the compounds and the vapor pressure. It is also possible to flush nitrogen through the sample and then collect it. The efficacy of this method depends on the rate of transfer of the compounds through the nitrogen. Finally, a method called Dynamic Headspace Extraction (illustrated below) was devised to minimize these dependencies.
Another approach is solid phase microextraction (SPME), which uses a fiber to collect a heated sample. The fiber becomes saturated quickly, so the amount of sample that can be extracted is very small. This small sample becomes problematic when trying to separate true signal from noise. In addition, fibers are not always made consistently, making reproducibility difficult. In spite of these drawbacks, it is a simple approach that is commonly used.
Here, we will discuss two ways in which a sample can be analyzed after the preparation and extraction is complete.
It can be passed through a column (known as chromatography), ionized, and then detected using mass spectrometry.
It can be analyzed using an NMR instrument.
In general, MS methods have higher sensitivity and can therefore detect more compounds. But NMR methods have much higher reproducibility and selectivity (i.e the mass resolution is very accurate). For this reason, one might use MS if compound coverage is most important for a study and NMR if precise identification is most important. We will discuss each procedure further here.
After sample preparation and extraction, the sample is passed through a column, which is used to separate out the sample’s components. This may be done using an LC column or a GC column.
Liquid Chromatography (LC)
LC is generally used for samples that are not easy to derivitize (i.e. change to a gaseous state). It uses two key principles to separate out compounds for detection: a solvent (mobile phase) and a column (stationary phase).
Depending on the polarity of the compounds (i.e. how attracted they are to water), different columns and solvents will be used. This is very important for analyzing spectra. The goal is to have all molecules pass through the column and reach the detector so that their spectra can be quantified, but not all at once. Therefore, the ideal column is one to which all compounds of focus will be attracted to varying degrees, but to which no compound will be so attracted as to stick to it without passing through the column.
Different columns are optimized for different types of compounds, and compounds also react while in the column, so it is impossible to capture all types of compounds in a given column. A few key types of columns are used in LC: Normal Phase, Reverse Phase, and HILIC. Each of these columns is also used with a specific type of solvent and hydrophobicity. Normal phase columns are usually used for very hydrophobic compounds, whereas HILIC is used for very hydrophilic compounds. Reverse phase is used for compounds that fall into an intermediate category. Bearing this in mind, think about what will happen if a researcher has a solution with hydrophobic compounds but runs HILIC. Those compounds will immediately exit the column and reach the detector. They will show up as a jumbled mess at the beginning of the chromatogram, rather than being nicely separated for easy identification. So, the types of compounds identified in an experiment have a lot to do with the column used to run the experiment.
The reactivity and variable attraction of compounds while in the column is a key reason why MS methods are said to be comparative, not quantitative. Because the chemical environment is complex, it is difficult to determine how much of a compound is getting lost because of its attraction to the column or reaction with other compounds.
Liquid Chromatography (LC) Ionization
After compounds pass through the column, they are ionized and launched into a mass spectrometer for detection (called mass spectrometry). When this is done after liquid chromatography, it is called LC-MS. When it is done after gas chromatography, it is called GC-MS. Ionization is important because all MS are designed for detecting ions only, and ionization procedures differ according to chromatography.
For LC-MS, a common type of ionization is electrospray ionization (ESI). This is considered a type of soft ionization, which means that little fragmentation of the compound occurs during the process. ESI sprays droplets of the ionized sample (combined with solvent) into the MS, creating a vapor. The solvent evaporates, leaving behind the compounds to travel through the MS. Ionization for gas chromatography, as we will discuss, is different.
Gas Chromatography (GC-MS)
Gas chromatography is different from liquid chromatography in that it requires the sample to be derivitized (i.e. converted to a gaseous state) before it can pass through the column. GC columns are much larger than LC columns and are coiled. However, like LC columns, they are coated with a stationary phase and used for separating the compounds. The selectivity and effectiveness of a GC column is determined by its diameter, length, and film thickness.
Unlike LC, GC does not use soft ionization to launch compounds into the MS. It uses a type of hard ionization called Electron Impact (EI). This means that the spectra from a GC will be much more fragmented than the spectra for the same sample using LC because of the high amount of fragmentation caused by the ionization process. While this can be a factor in metabolite identification, it is noteworthy that the identification libraries for GC are more robust than those for LC, helping to mitigate this effect.
Mass Spectrometry (MS)
Once ions in the gaseous state are produced using either of the ionization techniques discussed above, the ionized compounds travel through a mass specrometer to ultimately meet the detector. The differences in trajectory through the mass spectrometer are what makes the compounds identifiable. Mass spectrometers can be of a few types, but they often contain a quadrupole component. Quadrupole components use voltage changes to filter the compounds by mass, yielding m/z ratios. They are very sensitive, but not very selective. For this reason, they are not typically used in conjunction with other components; three common types are Time Of Flight (TOF), Orbitrap, and Fourier Transform Ion Cyclotron Resonance (FT-ICR).
TOF instruments allow the molecule to drift freely through a tube, and they measure m/z ratios using the time taken for a molecule to travel through the tube. Using a TOF in conjunction with a quadrupole increases selectivity, yielding better separation of molecules.
Orbitrap instruments work by using electrodes to trap ions so that the ions must move around a spindle. The m/z ratio is computed using the trajectory of each ion.
FT-ICR uses a magnet to induce fluctuations in the compound. The frequency of the fluctuations are measured using a Fourier transform. Compared to TOF and Orbitrap instruments, these have the highest selectivity. However, they are also less affordable and slower than TOF or Orbitrap.
One important point about MS methods is that they are not quantitative. A compound may be measured across two samples to determine its relative abundance within the samples, but this does not tell us the compounds’s total abundance. MS/MS can be used to focus on a small portion of the MS spectra and quantify it. A second sample is needed for this process, and it requires a QTOF (a quadrupole attached to a TOF) with mass selection. The process is depicted below. Note that this can be done only when the researcher knows which mass range to consider. MS/MS is usually done as a follow-up to MS.
Nuclear Magnetic Resonance (NMR)
The mechanism used in NMR is quite different from that used in MS techniques. NMR instruments use large magnets to induce the oscillation of compounds. They do not separate compounds using a column, and they do not induce hard or soft ionization, and they do not destroy the sample. The compounds are identified using spectra generated from their net magnetization. This is dependent on the molecule’s spin, a property of magnetism.
The oscillation behavior of compounds in NMR is determined primarily by key atoms in the compound, called the nuclei. In any given NMR experiment, a specific type of nucleus will be targeted by the instrument. Common nuclei are hydrogen, the carbon-13 isotope, the phosphorus-31 isotope, nitrogen-15, silicon-29, and fluorine-19. The researcher’s choice of nucleus will depend on which of these they expect to be most abundant in the sample. The sample is enclosed in a holder (probe) optimized for different nuclei of interest, and each atom of this type in a compound will have a separate signal.
The chemical shift of a nucleus (i.e. the shift in signal) is determined by the atoms surrounding it, but it may also be affected by factors like the pH of the sample, the temperature of the sample, and the solvent used. Impurities and solvents may also affect the spectrum, but there are methods for handling these. To lock the optimal magnetic field for a given sample, a process called shimming needs to be conducted with the supervision of the researcher.
J-coupling, the interaction between a nucleus and other nuclei surrounding it in the same compound, also affects the final spectra. NMR peaks often do not have a single summit, but multiple summits (called the multiplicity of the peak). This multiplicity is affected by J-coupling. An example of J-coupling is shown below.
Not everything that needs to be known about each compound can be inferred directly from the NMR spectra described above. For instance, a compound’s fragments may include two or more that are identical; these will overlap completely and won’t be distinguishable. In addition, inference using J-coupling can reveal how many surrounding atoms of a certain type exist, but not exactly where they are in the molecule. 2-D approaches can help researchers to answer these questions.
Homonuclear COrrelation SpectroscopY (COSY) is a technique that is used to determine which interactions lead to J-coupling. It displays not only the spectra for each molecule, but a cross-peak showing where the coupling occurs.
TOtal Correlated SpectroscopY (TOCSY) is another technique that, in addition to showing interactions from J-coupling, also shows indirect interactions between nuclei of the same type. It is sometimes compared to running two steps of COSY. This can be used to resolve scenarios in which multiple fragments from the same compound are producting the same peak.
Nuclear Overhauser Effect Spectroscopy (NOESY) is used to determine the distances between nuclei in bonds. Like COSY and TOCSY, both dimensions use the same type of nucleus.
Heteronuclear Single Quantum Coherence Spectroscopy (HSQC) is used to find correlations between nuclei of different types. For instance, if used with hydrogen and carbon-13, it can find the fragments in which carbon and hydrogen nuclei are attached.
Metabolite Identification and Modeling
Metabolite identification consists of a few steps:
Detecting peaks in the signal intensity across m/z and retention time.
Grouping peaks to create features hypothesized to correspond to metabolites. This is often done using correlation of peaks across samples.
Matching the spectra of features to known metabolites. This can be done using a pure standard or by querying the spectrum or maximum intensity peak of the spectrum against a database.
A peak is a region of signal with intensity high enough that it is assumed not to be an anomaly. Peak picking often uses statistical methods, and they may be in the form of proprietary software or open source software such as XCMS.
The primary peak picking algorithm in XCMS is called centWave. The user must specify several parameters:
Signal to noise threshold
Maximum distance between peaks
Maximum change in m/z across scans for the peak
Minimum and maximum counts of sub-peaks per peak
Intensity noise threshold
Function to use in approximation of peak shape.
These parameters may not be intuitive to the average researcher. In addition, many researchers perform visual inspection on the list of final peaks to remove false peaks, and this is highly subjective. Both parameters and a list of additional removed peaks should be included for the experiment to be truly replicable.
For NMR, adjustments need to be made to the spectrum before peaks are picked. A Fourier transform is used to convert the signal into the frequency domain, which sometimes includes zero-filling to increase the resolution of the signal. The signal’s phase and baseline are adjusted, usually manually. Finally, an internal standard is typically used to normalize the signal. All of this, along with the peak picking, is often done using proprietary software called TopSpin on Bruker NMR instruments. These manual steps, in addition to the proprietary nature of the software, can present challenges in replication.
Grouping Peaks and Features
Once peaks are detected for each sample, they must be grouped across samples. In XCMS, this is done using the retcor and group functions. Again, there are multiple parameters that the user must specify:
Method for calculating signal drift
Method for grouping signals
Deviation tolerance, and
Minimum number of samples containing the feature.
A manual data cleanup is usually performed after this step. Researchers may remove peaks found in a specified number of blanks, peaks not found in a specified number of QC, or peaks with more than a specified coefficient of variance across QC.
An algorithm called CAMERA can be used as an add-on to the output from XCMS. CAMERA uses known chemical rules to group features found across samples, focusing especially isotopes and adducts in the spectra.
To determine which metabolites to report as significant, researchers have many methods they may use. In these models, metabolites are used as features or independent variables, and sample class labels are used as the outcome or dependent variable.
For class discovery, a PCA or hierarchical clustering method is often used, with the samples labels overlaid. This indicates whether the sample labels are in any way related to the metabolomic driving factors differentiating the samples. For class comparison, a t-test, Wilcoxon rank-sum test, or ANOVA may be used for each metabolite across sample groups, perhaps with p-value or FDR correction. For class prediction, researchers may use a random forest or support vector machine, but it is common to see PLS-DA being used. As with previous steps, it is important to report the parameters given to these algorithms.
The results of a class discovery, class comparison, or class prediction algorithm will include a list of important metabolites, and these are generally the metabolites reported as signficant. But they are still unknown! The next step is to identify them.
One option for identifying significant metabolites is to infer mass measurements from the m/z ratios and query them against metabolite databases. However, the most abundant mass within a feature doesn’t always represent the true mass of the compound. The mass detected by the instrument is called the accurate mass; it is subject to the sensitivity of the instrument and may not include the mass of adducts or isotopes. The discrepancy between this and the exact mass needs to be accounted for when querying the mass in a database.
In addition, there are different types of exact masses, and the researcher needs to be aware which one is contained in a given database. The nominal mass is computed using the integer masses of each element in a compound, the monoisotopic mass is computed using the masses of the most abundant isotope of each compound, and the average mass is computed using the masses of each compound for each isotope, averaged over the abundance of each isotope.
In spectral matching, the spectra found for metabolites of interest are matched to spectra of known compounds in a database. METLIN is a common database for this type of analysis. However, be aware that these spectra are curated from different types of studies, and some have not been verified using a standard; they may have merely been predicted.
A researcher who wants to be absolutely certain about a metabolite’s identity will compare the metabolite’s spectra to the spectrum of a known standard for that compound on the same instrument and subject to the same experimental conditions. This not only requires a lot of extra effort, but it is not possible for many compounds.
Critical Thinking in Computational Metabolomics
In conclusion, there are many factors involved in a metabolomics study. The following are, in my opinion, the most important questions to consider when reading a metabolomics paper, and especially when comparing the results of multiple studies:
Was this a targeted or untargeted study? In other words, was the focus on specific metabolites?
How were the samples processed and extracted?
Did the authors correct for variation adequately by using blanks and QC?
If MS was used, which column was used, and which type of metabolite is most likely to be detected using this column?
If MS was used, what is the mass accuracy of the MS instrument?
If NMR was used, which nuclei were used? Was any 2-D analysis performed?
How was peak picking performed? Which peak picking parameters were used?
Which steps were used in data cleanup?
Were isotopes and adducts resolved using CAMERA?
How were significant metabolites found? If a parametric algorithm or statistical test was used, what were the parameters?
At which level were the metabolites identified?
If you do not know the answers to these questions, it is best to ask the authors. Or, if you are not able to do this, at least err on the side of caution. Do not assume that the answers to these questions are likely to be the same across studies.
Open Computational Problems
Several open computational problems exist in metabolomics. You can read more about the current state of computational metabolomics here. In particular, researchers are seeking to address the following problems:
Identification of metabolites. This is an important one, since the vast majority of features detected using peak picking software are never matched to known spectra. Good solutions will address differences between experiments. Improved integration of known mass spectra is also important.
Biological inference for metabolite flux between phenotypes. This is often done using known biological and chemical pathways, which are assumed to be discrete (this is not always the case). Sometimes, chemical similarity or reaction networks are used instead of pathways.
Integration of metabolomics with other -omics data. Many researchers are interested in multi-omics approaches, but combining data types is not straightforward.
Integration of multi-instrument data. As we have seen, results returned by different types of instruments (NMR, GC-MS, and LC-MS) are expected to differ. The question of how to integrate these data types, like the question of how to integrate multi-omics data, has yet to be answered.
Integration of lipids and polar metabolites. The typical protocols for identifying lipids and for identifying polar metabolites differ at multiple steps of their protocols. However, it would be useful to perform an integrative computational analysis of these two data types. In a recent paper, this was addressed using a novel instrumentation workflow that lends itself well to integrative analysis.
Ferreira do Nascimento (2017) Advances in Chromatographic Analysis Avid Science.
Frainay,C. et al. (2018) Mind the Gap: Mapping Mass Spectral Databases in Genome-Scale Metabolic Networks Reveals Poorly Covered Areas. Metabolites, 8, 51.
Kuhl,C. et al. (2012) CAMERA: An Integrated Strategy for Compound Spectra Extraction and Annotation of Liquid Chromatography/Mass Spectrometry Data Sets. Anal. Chem., 84, 283–289.
Peisl,B.Y.L. et al. (2018) Dark matter in host-microbiome metabolomics: Tackling the unknowns–A review. Anal. Chim. Acta, 1037, 13–27.
Riekeberg,E. and Powers,R. (2017) New frontiers in metabolomics: from measurement to insight. F1000Research, 6, 1148.
Samuelsson,L.M. and Larsson,D.G.J. (2008) Contributions from metabolomics to fish research. Mol. Biosyst., 4, 974.
Schmidt,K. and Podmore,I. (2015) Current Challenges in Volatile Organic Compounds Analysis as Potential Biomarkers of Cancer. J. Biomarkers, 2015, 1–16.
Schwaiger,M. et al. (2019) Merging metabolomics and lipidomics into one analytical run. Analyst, 144, 220–229.
One of the primary uses of chemical similarity in medicine is drug target prediction. Like much research in bioinformatics, the goal is to predict interaction or effect automatically, without testing each possibility in the lab. Chemical similarity between target substrates or products and existing well-studied compounds can assist in predicting outcome of pharmaceutical treatments.
But a problem arises when discussing chemical similarity: there isn’t a comprehensive definition of what it means. For its uses in the medical field, we are primarily interested in potential for interaction between compounds. This means that the shape of the molecule and its functional groups matter, but this is not necessarily all. Assumptions must be made by the methods used for determining the similarity of two compounds.
Below, I will go into a few key methods for measuring chemical similarity along with the assumptions made by these methods. For a comprehensive analysis and comparison of these methods, see Dr. Edmond Duesbury’s PhD thesis. Note: his thesis includes a discussion of hyperstructures, which I don’t discuss here, and it excludes ontology-based methods. It is also not specific to the metabolomics field.
The most popular method for chemical similarity when applied to metabolites (my current area of focus) is representing compounds as fingerprints. These fingerprints are fixed-length bit vectors that encode information about the compound, and the encoding is specific to what researchers in the field believe is relevant. One may then use the Tanimoto/Jaccard coefficient between the two bit strings to obtain the similarity.
In metabolomics, the most common type of encoding is the PubChem ID, an 880 bit vector denoting substructures present in the compound. Computing the Tanimoto coefficient on these vectors is quick and easy, but there is something questionable about weighing all of these bits equally. Consider this: not all substructures are equally common, and some may be related to one another. Do we really want to place equal importance on each of these substructures? In spite of this potential problem, this method has been shown to perform well in some scenarios (although it hasn’t been tested in metabolomics with ground truth) and is often used because of its convenience.
Several databases of chemical ontologies exist. These include ClassyFire, ChEBI, and LION\Web (for lipids only). In these ontologies, compounds are organized hierarchically into groups chosen by those designing the ontology. For instance, ClassyFire divides compounds into organic and inorganic, then into lipids, analogues, metal compounds, non-metal compounds, etc (see the paper for the full ontology).
For ontological databases, it is possible to compare two compounds by the height of their split in the ontology. This can be easily done once one queries the compound of interest and obtains the ontology. This is simple in ClassyFire: the classyfireR package in R returns a compound’s ontology up to four levels of the hierarchy using its InChIKey. For better resolution, one can also use the batch tool to manually query a list of InChIKeys. Notably, this chemical similarity is based on the organization of the hierarchy rather than on properties that can be observed from chemical structure alone.
Graph-based methods use graph mining to determine the similarity of two compounds. Atoms are often treated as nodes, with bonds as edges (although, in some cases, line graphs are used). As a computer scientist, I’m partial to these. The problem is that they are less scalable than the methods described above.
One simple graph-based method is atom similarity, developed in the 1980’s. In this method, the path (in non-hydrogen bonds) is computed between every pair of atoms in the molecule. The similarity is then based on the number of shared paths (i.e. shared bond structure between each pair of atoms). As far as I’m aware, this method is still being used in the ChemMineR R package.
One that provides a more intuitive graphical measure of chemical similarity, but is also more computationally intensive, is Maximum Common Substructure (MCS). In this method, a modular product graph is computed between two molecules. This graph includes all possible mappings between atoms and bonds from the chemical graphs of the compounds. Nodes denote pairs of atoms with the same identity in both compounds, and edges denote pairs of bonds shared between two pairs of atoms across compounds. Intuitively, any atom-bond substructure found in both compounds would manifest as a clique in the modular product graph. So, the problem of determining similarity is simply to find the maximum clique and compute the ratio of atoms and bonds in the clique to the union of atoms and bonds between the two structures.
…Except for one problem. The maximum clique problem is NP-hard. In comparison to fingerprint or ontology -based methods, MCS is painfully slow. Heuristics exist and are described in Duesbury’s thesis, but the available R package does not appear to include them.
Each method seems to provide its own potential benefit. Methods based on ontology or fingerprints rely on more assumptions than those based on chemical graphs. However, they are much faster and more scalable than methods based on chemical graphs. Fingerprint-based methods, combined with Tanimoto similarity, are currently the most common approach in the Metabolomics field.
Duesbury, Edmund (2015). Applications and Variations of the Maximum Common Subgraph for the Determination of Chemical Similarity. PhD thesis, University of Sheffield. [online] Available at: http://etheses.whiterose.ac.uk/13063/ [Accessed 24 Mar. 2019].
Degtyarenko K, de Matos P, Ennis M, et al. (2008). ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research, 36 (Database issue), D344–D350. [online] Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2238832/ [Accessed 24 Mar. 2019].
Carhart R, Smith D, and Venkataraghavan R. (1985). Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 25(2), 64-73. [online] Available at: https://pubs.acs.org/doi/10.1021/ci00046a002 %5BAccessed 24 Mar. 2019].