Diversity Metrics in the Microbiome

In ecology, the concepts of alpha diversity and beta diversity are frequently used to characterize habitats. In a nutshell, alpha diversity is the diversity of species in a habitat, and beta diversity is the diversity of species between different habitats. An illustration is given below.

[Figure: illustration of alpha and beta diversity across habitats]
Source: https://eco-intelligent.com/2016/10/14/alpha-beta-gamma-diversity/

I (a non-ecologist) am writing about this because these metrics are also used in microbiome analyses, and an R package for microbiome data that includes them is now available from Bioconductor.

The authors of the package have given a nice introduction to their package here. In the section on alpha diversity, you can see that there is a plethora of diversity indices reported by the package. They also have some functions available for beta diversity analysis. I’m going to discuss some of the math behind these indices to motivate why a particular metric might be useful for a particular study.

Richness

Species richness in a sample is a simple concept: it is the number of species at the sample site. However, because some species may have abundances below the detection limit, or may go undetected due to gaps in the database, an estimate of true richness must account for the number of undetected species. The microbiome package uses Chao richness [1] as a lower bound of richness, which is defined as:

$$S_{Chao1} = S_{obs} + \frac{f_1^2}{2 f_2}$$

Here, $S_{obs}$ is the number of observed species, $f_1$ is the number of species observed exactly once (singletons), and $f_2$ is the number of species observed exactly twice (doubletons). The assumption being made here is that the number of undiscovered species is equal to the ratio of the square of the number of singletons to twice the number of doubletons. Intuitively, many singletons relative to doubletons suggests that many species are still unseen. This may be a fair assumption for many applications, but it’s probably a good idea to think about whether this assumption makes sense for the data you are analyzing.
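As a rough illustration, here is a minimal Python sketch of the Chao estimate, taking f1 and f2 to be the numbers of species seen exactly once and exactly twice (the standard Chao1 convention). The function name `chao1` and the fallback used when there are no doubletons are my own assumptions, not the microbiome package's implementation.

```python
def chao1(counts):
    """Chao1 lower-bound richness estimate from per-species counts."""
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)                      # observed richness
    f1 = sum(1 for c in observed if c == 1)    # singletons
    f2 = sum(1 for c in observed if c == 2)    # doubletons
    if f2 == 0:
        # Assumption: fall back to the bias-corrected form to avoid
        # dividing by zero when no species was seen exactly twice.
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 ** 2 / (2 * f2)


# 7 observed species, 3 singletons, 2 doubletons: 7 + 9/4
print(chao1([10, 5, 1, 1, 1, 2, 2]))  # 9.25
```

A sample with no singletons gets no correction at all, which matches the intuition that everything present has probably been seen.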

For the rest of this post, I will denote richness (in the general sense as number of species) using R.

Alpha Diversity

Richness gives a count of the number of species, but alpha diversity takes that a step further and examines the proportions of species. Here, in addition to species count, species abundance also comes into play.

Inverse Simpson Diversity

The Inverse Simpson Diversity metric [2] was first published by Edward Simpson in 1949 and is fairly straightforward:

$$D_{InvSimpson} = \frac{1}{\sum_{i=1}^{R} (n_i / N)^2}$$

Here, $n_i$ is the abundance of species i, and N is the total abundance of all species. The denominator is the sum of squares of all abundance ratios, taken over all species. Note that, when the species are evenly distributed, the denominator is expected to be low, especially in systems with high numbers of species. Conversely, a system dominated by one or two species will have a high value in the denominator. Diversity will be maximized when there are N species, each with an abundance of 1, and minimized when there is a single species with an abundance of N. More generally, for R species the metric lies between 1 (a single dominant species) and R (all species equally abundant).
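To make the behavior concrete, here is a small Python sketch (the function name and the example counts are mine, chosen only for illustration) that computes the Inverse Simpson metric for an even community and a dominated one.

```python
def inverse_simpson(counts):
    """Inverse Simpson diversity: 1 / sum((n_i / N)^2)."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts)


even = inverse_simpson([25, 25, 25, 25])    # four equally abundant species
dominated = inverse_simpson([97, 1, 1, 1])  # one dominant species
print(even, dominated)
```

The even community scores 4, the number of species, while the dominated community scores only slightly above 1.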

Gini Simpson Diversity

This metric [3] is very similar to the Inverse Simpson metric. The only difference is that, instead of taking the inverse, the sum of squares is subtracted from 1, i.e.:

$$D_{GiniSimpson} = 1 - \sum_{i=1}^{R} (n_i / N)^2$$

The effect here is the same, but this metric is bounded between 0 and 1 (for R species, the maximum is 1 - 1/R). The differences here between high and low sum of squares values are less extreme than with the Inverse Simpson metric.
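The difference in scaling is easy to see numerically. The sketch below (function names and example communities are my own, for illustration only) computes both metrics for two perfectly even communities of different sizes.

```python
def simpson_sum(counts):
    """Sum of squared abundance ratios."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def gini_simpson(counts):
    return 1.0 - simpson_sum(counts)


def inverse_simpson(counts):
    return 1.0 / simpson_sum(counts)


communities = {
    "4 even species": [25] * 4,
    "40 even species": [25] * 40,
}
for name, counts in communities.items():
    print(name, round(gini_simpson(counts), 3), round(inverse_simpson(counts), 3))
```

Going from 4 to 40 even species moves Gini Simpson only from 0.75 to 0.975, while Inverse Simpson moves from 4 to 40.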

Shannon Diversity

Claude Shannon developed his metric of entropy to use in information theory [4]. Today, it is used in many fields, such as measuring diversity in the microbiome.

$$H = -\sum_{i=1}^{R} p_i \log p_i, \quad p_i = \frac{n_i}{N}$$

The best way to understand how Shannon entropy/diversity works is to look at a plot of how it behaves with only two species and log base 2. The entropy is highest when both are in equal proportion, and lowest when one species is completely dominant. A word of caution: things change when more species are added. With R equally abundant species the maximum is $\log_2 R$, so with 4 species the entropy maxes out at 2 instead of 1.

[Figure: entropy versus probability for two outcomes]
Source: http://matlabdatamining.blogspot.com/2006/11/introduction-to-entropy.html
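For readers who would rather poke at numbers than a plot, here is a minimal Python sketch of Shannon entropy on abundance counts. The function name, the `base` parameter and the choice to skip absent species are my own assumptions.

```python
import math


def shannon(counts, base=2):
    """Shannon entropy of the abundance ratios, skipping absent species."""
    total = sum(counts)
    ratios = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in ratios) / math.log(base)


print(shannon([50, 50]))          # two equal species
print(shannon([25, 25, 25, 25]))  # four equal species
print(shannon([99, 1]))           # one species nearly dominant
```

Up to floating-point error, the first two values are 1 and 2, and the last is close to 0.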

Fisher Diversity

Fisher’s diversity [5] is not so straightforward. It assumes that the expected number of species represented by exactly j individuals (i.e. $f_j$) follows a log-series distribution, i.e.

$$f_j = \frac{\alpha x^j}{j}, \quad j = 1, 2, 3, \ldots$$

where x is a constant between 0 and 1. The diversity metric is then the closest fitting alpha given the actual breakdown of species by number of individuals. Because x is usually close to 1, alpha approximates the number of species represented by a single individual. This metric may be useful if you want a summary that depends mainly on how many species and how many individuals were sampled. It won’t be particularly useful for evaluating the dominance or evenness of species.
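One common way to obtain alpha, sketched below, relies on a consequence of the log-series model: summing the expected counts gives S = alpha * ln(1 + N / alpha), where S is the number of observed species and N the number of individuals. That relation and the bisection solver are my own choices for illustration and may differ from how the microbiome package fits alpha.

```python
import math


def fisher_alpha(counts, iterations=200):
    """Solve S = alpha * ln(1 + N / alpha) for alpha by bisection."""
    observed = [c for c in counts if c > 0]
    s, n = len(observed), sum(observed)
    if s >= n:
        raise ValueError("need more individuals than species")

    def predicted_species(alpha):
        # Increases with alpha, so a bracketing search is safe.
        return alpha * math.log(1 + n / alpha)

    lo, hi = 1e-9, 1.0
    while predicted_species(hi) < s:
        hi *= 2
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if predicted_species(mid) < s:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


counts = [120, 60, 30, 10, 5, 2, 1, 1, 1]
alpha = fisher_alpha(counts)
print(alpha)
```

Note that only the number of species and the total count enter the calculation, which is why this metric says little about evenness.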

Coverage Diversity

Coverage diversity tells you the number of species needed to cover at least half of the total abundance.

$$C = \min\left\{ k : \sum_{i=1}^{k} \frac{n_{(i)}}{N} \geq 0.5 \right\}$$

Here, $n_{(i)}$ is the abundance of the i-th most abundant species. This also does not tell you whether there is evenness in the species abundance. To understand that, you would need to compare to the total number of species and possibly look at other cutoffs besides 0.5. What it does tell you, and the Simpson and Shannon metrics do not, is the number of species needed to make up a given portion of the abundance.
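A minimal Python sketch of this idea follows. The function name and the `threshold` parameter are mine; the default of 0.5 matches the "at least half" definition above.

```python
def coverage(counts, threshold=0.5):
    """Smallest number of top-ranked species whose abundance reaches the threshold."""
    total = sum(counts)
    covered = 0
    for rank, c in enumerate(sorted(counts, reverse=True), start=1):
        covered += c
        if covered >= threshold * total:
            return rank
    return len(counts)


print(coverage([40, 30, 20, 10]))  # 2
print(coverage([91, 3, 3, 3]))     # 1
print(coverage([10] * 10))         # 5
```

The last two examples show why the value needs to be read alongside richness: 1 out of 4 species and 5 out of 10 species describe very different communities.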

Evenness

The alpha diversity metrics may focus on different aspects of diversity, such as count vs. dominance of a species or evenness in the abundance of species. Evenness metrics are specifically designed to compute evenness of abundance, without being affected by species count.

Camargo Evenness

Camargo evenness [6] is based on a sum of pairwise differences in abundance between species (the term in the numerator). This sum is normalized by an unattainable “worst case scenario” for evenness in which the difference in abundance between each pair of species is itself the abundance of one member of the pair. When there is little difference in abundance between species, this value will be low, and it will become high after being subtracted from 1.

$$E_{Camargo} = 1 - \sum_{i=1}^{R} \sum_{j=i+1}^{R} \frac{|p_i - p_j|}{R}, \quad p_i = \frac{n_i}{N}$$
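Here is a small Python sketch of Camargo evenness, assuming the commonly cited form in which the pairwise differences between abundance ratios are summed over each pair once and divided by the number of observed species. The function name is my own.

```python
def camargo(counts):
    """Camargo evenness: 1 - sum over pairs of |p_i - p_j| / R."""
    observed = [c for c in counts if c > 0]
    total, r = sum(observed), len(observed)
    ratios = [c / total for c in observed]
    pairwise = sum(
        abs(ratios[i] - ratios[j])
        for i in range(r)
        for j in range(i + 1, r)
    )
    return 1.0 - pairwise / r


print(camargo([25, 25, 25, 25]))  # perfectly even community
print(camargo([97, 1, 1, 1]))     # one dominant species
```

The even community scores 1, and the dominated one drops to roughly 0.28.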

Pielou Evenness

The Pielou Evenness [7] is related to Shannon entropy. In fact, the term in the numerator is exactly the Shannon entropy, which is fitting because high entropy indicates evenness of the distribution of species. It is normalized by the log of the richness, which is the maximum entropy attainable with R species. This helps to mitigate the problem of differing species counts’ effects on the entropy range.

$$J = \frac{-\sum_{i=1}^{R} p_i \ln p_i}{\ln R}, \quad p_i = \frac{n_i}{N}$$
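A minimal Python sketch of Pielou evenness is given below, assuming the standard definition in which Shannon entropy is divided by the log of the number of observed species. The function name and the error raised for single-species samples are my own choices.

```python
import math


def pielou(counts):
    """Pielou evenness: Shannon entropy divided by ln(number of species)."""
    observed = [c for c in counts if c > 0]
    total, r = sum(observed), len(observed)
    if r < 2:
        raise ValueError("Pielou evenness needs at least two species")
    entropy = -sum((c / total) * math.log(c / total) for c in observed)
    return entropy / math.log(r)


print(pielou([50, 50]))          # two even species
print(pielou([25, 25, 25, 25]))  # four even species
print(pielou([97, 1, 1, 1]))     # one dominant species
```

Both even communities score 1 regardless of how many species they contain, which is exactly the point of the normalization.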

Simpson Evenness

Simpson evenness is closely related to Simpson diversity. It is actually just the Inverse Simpson Diversity divided by the richness. As with the normalization in Pielou evenness, this is also to mitigate the effect of species count on the range of the metric.

$$E_{Simpson} = \frac{1}{R \sum_{i=1}^{R} (n_i / N)^2}$$

Evar Evenness

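This evenness metric is more complex, but is described in [8]. What is being computed is the variance of the log abundances, which is then rescaled to be between 0 and 1 using an arctan function and subtracted from 1 so that low variance gives high evenness. Variance is a sensible metric here, because a system with evenness in species abundance is expected to have low variance.

$$E_{var} = 1 - \frac{2}{\pi} \arctan\left( \frac{1}{R} \sum_{i=1}^{R} \left( \ln n_i - \frac{1}{R} \sum_{j=1}^{R} \ln n_j \right)^2 \right)$$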
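The following Python sketch shows one way to compute this, assuming the form in [8]: the variance of the natural-log abundances of the observed species, passed through arctan, scaled by 2/pi and subtracted from 1. The function name is mine.

```python
import math


def evar(counts):
    """Evar evenness: 1 - (2 / pi) * arctan(variance of log abundances)."""
    logs = [math.log(c) for c in counts if c > 0]
    mean_log = sum(logs) / len(logs)
    variance = sum((x - mean_log) ** 2 for x in logs) / len(logs)
    return 1.0 - (2.0 / math.pi) * math.atan(variance)


print(evar([5, 5, 5, 5]))      # no variance in log abundance
print(evar([10, 8, 12, 9]))    # mild variation
print(evar([1000, 1, 1, 1]))   # one dominant species
```

The values fall from 1 toward 0 as the spread of log abundances grows.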

Bulla Evenness

Bulla Evenness [9] focuses on abundance ratios lower than expected under an even distribution of abundance of all species. It computes this using the minimum function in the summation. In an ideal scenario of evenness, the abundance ratio will always be equivalent to the inverse of the number of species (i.e. the richness), and the resulting value will be 1. The more abundances dip below this expected value, the smaller the value in the numerator will be.

$$E_{Bulla} = \frac{O - 1/R}{1 - 1/R}, \quad O = \sum_{i=1}^{R} \min\left( \frac{n_i}{N}, \frac{1}{R} \right)$$
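Here is a short Python sketch of Bulla evenness, assuming the form in which each abundance ratio is capped at 1/R, the capped values are summed, and the sum is rescaled so that the result runs from 0 to 1. The function name is my own.

```python
def bulla(counts):
    """Bulla evenness from abundance ratios capped at 1 / R."""
    observed = [c for c in counts if c > 0]
    total, r = sum(observed), len(observed)
    if r < 2:
        raise ValueError("Bulla evenness needs at least two species")
    overlap = sum(min(c / total, 1.0 / r) for c in observed)
    return (overlap - 1.0 / r) / (1.0 - 1.0 / r)


print(bulla([25, 25, 25, 25]))  # perfectly even community
print(bulla([97, 1, 1, 1]))     # one dominant species
```

The even community scores 1, while the dominated community scores about 0.04 because three of its four species fall far below the expected ratio of 0.25.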

Dominance

Absolute Dominance

Absolute dominance is very simple: it is the maximum abundance value. This can help to identify the maximum abundance in a dominant species, but it does not identify the dominance of the additional species, nor does it take into account the total abundance.

Relative Dominance

Relative dominance does take into account the total abundance. There are variations of relative dominance, but all can be represented using the following formula:

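$$D_k = \frac{1}{N} \sum_{i=1}^{k} n_{(i)}$$

where $n_{(i)}$ is the abundance of the i-th most abundant species and N is the total abundance.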

Here, the k most abundant species will be used to compute the dominance. If k = 1, this is called the DBP metric [10], and it will be based only on the most abundant species. If k = 2, this is called the DMN metric [11], and it will be based on the top two most abundant species.
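In code, relative dominance is just a sort and a sum. The sketch below uses a function name and example counts of my own choosing.

```python
def relative_dominance(counts, k=1):
    """Share of total abundance held by the k most abundant species."""
    top_k = sorted(counts, reverse=True)[:k]
    return sum(top_k) / sum(counts)


counts = [50, 30, 15, 5]
dbp = relative_dominance(counts, k=1)  # top species only
dmn = relative_dominance(counts, k=2)  # top two species
print(dbp, dmn)  # 0.5 0.8
```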

Simpson Dominance

Simpson dominance is simply the sum of squared abundance ratios, i.e.:

$$D_{Simpson} = \sum_{i=1}^{R} (n_i / N)^2$$

Note that this is the reciprocal of the Inverse Simpson metric used to measure diversity. If one considers dominance to be the opposite of diversity, then this may be a good choice.

Core Abundance

The core abundance measures the sum of all abundance ratios above a threshold. This is similar to the relative dominance, except that the cutoff is based on a threshold and not a ranking.

$$A_{core} = \sum_{i \,:\, n_i / N > t} \frac{n_i}{N}$$

where t is the chosen threshold.

Gini Dominance

Finally, the Gini Index [12] was developed in 1912 by Corrado Gini to measure wealth distribution. The idea is somewhat similar to Camargo evenness, in that it makes use of absolute differences in abundance and normalizes them.

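$$G = \frac{\sum_{i=1}^{R} \sum_{j=1}^{R} |n_i - n_j|}{2 R \sum_{i=1}^{R} n_i}$$

A good way to understand the Gini Index is by looking at what it was initially trying to measure. The Gini Index was originally developed to measure how far an economic system deviates from one in which wealth is distributed equally. The Lorenz curve plots the cumulative share of wealth against the cumulative share of the population, and under perfect equality it coincides with the diagonal line of equality. The Gini Index is double the area between the line of equality and the Lorenz curve.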

[Figure: Lorenz curve and line of equality]
Source: https://towardsdatascience.com/gini-coefficient-and-lorenz-curve-f19bb8f46d66

Since we’re talking about species abundance rather than income, the Gini Index in this context measures the relationship between the cumulative share of species and the cumulative share of abundance. Intuitively, if only a few species were highly abundant, the area between the Lorenz curve and the line of equality will be higher than if many species were equally abundant, leading to a higher Gini Index.
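A minimal Python sketch of the Gini Index on species abundances follows, using the mean-absolute-difference form: the sum of absolute differences over all ordered pairs, divided by twice the number of species times the total abundance. The function name is mine, and zeros are treated as species that are present in the table but have no abundance.

```python
def gini(counts):
    """Gini index via the sum of absolute pairwise differences."""
    r, total = len(counts), sum(counts)
    diffs = sum(abs(a - b) for a in counts for b in counts)
    return diffs / (2 * r * total)


print(gini([25, 25, 25, 25]))  # 0.0
print(gini([100, 0, 0, 0]))    # 0.75
```

With R species, the largest value this form can reach is 1 - 1/R, which is why the fully dominated four-species example gives 0.75 rather than 1.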

Rarity

Rarity indices measure the amount or proportion of rare species in a system, where a rare species is one that is present, but not abundant.

Log Modulo Skewness

This metric [13] is based on the skewness of the statistical distribution of the abundances. Intuitively, the more skewed a distribution is, the higher the number of rare species represented. The expression inside of the sign and log functions is the standard skewness expression.

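$$\operatorname{sign}(\gamma) \, \log(1 + |\gamma|), \quad \gamma = \frac{\frac{1}{R} \sum_{i=1}^{R} (n_i - \bar{n})^3}{\left( \frac{1}{R} \sum_{i=1}^{R} (n_i - \bar{n})^2 \right)^{3/2}}$$

where $\bar{n}$ is the mean abundance.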

For those not familiar with skewness, it is best understood using an illustration. The figure below illustrates positive skew, no skew, and negative skew.

[Figure: positive skew, no skew, and negative skew]
Source: https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa

Low Abundance

Low abundance measures the proportion of total abundance that is made up of rare species. Here, rarity is defined using abundance below a given threshold t.

$$A_{low} = \sum_{i \,:\, n_i < t} \frac{n_i}{N}$$

Non-Core Abundance

Non-core abundance is the complement of core abundance (used as a dominance metric), i.e.:

$$A_{noncore} = 1 - A_{core}$$

where $A_{core}$ is the core abundance.

This tells us how much of the abundance is not accounted for by the core abundance.

Rare Abundance

Rare abundance is very similar to low abundance, with the difference being that low abundance thresholds the raw abundance value, whereas rare abundance thresholds the ratio of abundance in the species to total abundance. Specifically, for the microbiome package, a cutoff of 0.2 is used.

$$A_{rare} = \sum_{i \,:\, n_i / N < 0.2} \frac{n_i}{N}$$

Beta Diversity

Only one type of beta diversity between samples is computed using the microbiome package, and that is the one used in [14], i.e. a statistical analysis of the Bray-Curtis beta diversity matrix between samples. The Bray-Curtis metric [15] is defined as follows:

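$$BC_{jl} = 1 - \frac{2 \sum_{i} \min(n_{ij}, n_{il})}{\sum_{i} n_{ij} + \sum_{i} n_{il}}$$

Here, j and l are two samples in the data set, and $n_{ij}$ is the abundance of species i in sample j. So, the second term is twice the sum of shared abundances (the smaller of the two abundances for each species present in either sample), divided by the sum of the total abundances of the two samples. If a small percentage of abundance is shared, this will result in high diversity.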
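As a quick illustration, here is a Python sketch of the Bray-Curtis dissimilarity between two samples given as abundance lists in the same species order. The function name and the example samples are my own, and a real analysis would more likely use the implementation in an existing package.

```python
def bray_curtis(sample_j, sample_l):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    # Shared abundance per species is the smaller of the two values.
    shared = sum(min(a, b) for a, b in zip(sample_j, sample_l))
    total = sum(sample_j) + sum(sample_l)
    return 1.0 - 2.0 * shared / total


print(bray_curtis([10, 0, 5], [10, 0, 5]))  # 0.0
print(bray_curtis([10, 0, 0], [0, 7, 3]))   # 1.0
print(bray_curtis([10, 0, 5], [5, 5, 5]))
```

Identical samples give 0, samples with no species in common give 1, and the partially overlapping pair in the last line lands at about one third.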

There are other metrics of beta diversity as well, and these can be found in the vegan R package. You can find a description of those beta diversity metrics here.

References

[1] Chao,A. and Chiu,C.-H. (2016) Species Richness: Estimation and Comparison. In Wiley StatsRef: Statistics Reference Online, pp. 1–26.

[2] Simpson,E.H. (1949) Measurement of diversity. Nature, 163, 688.

[3] Jost,L. (2006) Entropy and diversity. Oikos, 113, 363–375.

[4] Shannon,C.E. (1948) A Mathematical Theory of Communication. Bell Syst. Tech. J., 27, 379–423.

[5] Fisher,R.A. et al. (1943) The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population. J. Anim. Ecol., 12, 42–58.

[6] Camargo,J.A. (1995) On Measuring Species Evenness and Other Associated Parameters of Community Structure. Oikos, 74, 538.

[7] Pielou,E.C. (1966) The measurement of diversity in different types of biological collections. J. Theor. Biol., 13, 131–144.

[8] Smith,B. and Wilson,J.B. (1996) A Consumer’s Guide to Evenness Indices. Oikos, 76, 70.

[9] Bulla,L. (1994) An Index of Evenness and Its Associated Diversity Measure. Oikos, 70, 167.

[10] Berger,W.H. and Parker,F.L. (1970) Diversity of planktonic foraminifera in deep-sea sediments. Science, 168, 1345–7.

[11] McNaughton,S.J. (1967) Relationships among functional properties of Californian Grassland. Nature, 216, 168–169.

[12] Gini, C. (1912) Variabilità e mutabilità. In: Pizetti, E. and Salvemini, T., Eds., Rome: Libreria Eredi Virgilio Veschi, Memorie di metodologica statistica.

[13] Locey,K.J. and Lennon,J.T. (2016) Scaling laws predict global microbial diversity. Proc. Natl. Acad. Sci. U. S. A., 113, 5970–5975.

[14] Salonen,A. et al. (2014) Impact of diet and individual variation on intestinal microbiota composition and fermentation products in obese men. ISME J., 8, 2218–2230.

[15] Bray,J.R. and Curtis,J.T. (1957) An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecol. Monogr., 27, 325–349.
