Ensemble Learning

We are all familiar with machine learning, right? In simplest terms, the goal of machine learning is to learn the underlying model or function that can reliably produce a desired or known outcome from given data. This can mean many things depending on one’s field of research, including:

  • Learning the function that maps a given set of gene expression levels, metabolite abundance, or clinical images to a phenotype
  • Learning the function that maps an audio signal to an English sentence
  • Learning the function that maps an image to an image subject
  • Learning a mapping from a graph’s nodes to a set of modules optimized for a modularity index
  • Learning a mapping from an image to a set of regions of interest based on color or saturation gradients

machine_learning.png
A machine learning model, abstractly, is a function mapping data to outcome. This model is generally assumed to take a form chosen by the researcher.

Assumptions in Machine Learning

While the true underlying model is unknown, we generally make some assumptions about the form it takes. If we don’t, the set of possible solutions is effectively unbounded: because model parameters are real-valued, between any two candidate parameter values there are always infinitely many others, so there is no way to enumerate all candidates. We need to bound the set of possible solutions to make the problem tractable.

Here are some of the common assumptions made:

  • The outcome is produced by a linear function of features in the data (linear regression).
  • The outcome is produced by multiplying together a set of conditional probability distributions, where each feature is conditioned on the outcome and the features are assumed to be independent (Naive Bayes).
  • The outcome is produced by a series of chained decision boundaries (decision tree).
  • The outcome is produced by a series of stacked perceptrons on the data features, with a fixed number of perceptrons at each layer and a fixed number of layers (neural network).
  • The outcome is produced using a series of mean clustering iterations on the input space (k-means clustering).

When we make these assumptions, there is always the possibility that they could be wrong. Yes, this is even true of neural networks, the so-called universal function approximators. While a single neural network can approximate a wide range of functions after parameter tuning, it may still miss the mark; for instance, neural networks trained to classify gender from eye images have been found to model the presence or absence of eye makeup rather than the iris itself [1], and it is also possible to change the outcome of an image-classification neural network by changing a single pixel [2].

Motivation for Ensembles

A way to mitigate this is to learn a combination of these models, i.e. an ensemble. This ensemble may include multiple types of models (for instance, linear regression and k-means clustering) or it may include combinations of the same type of model but with different parameterizations (an example of this is a random forest, which is a combination of many decision trees). In the image below, each musician’s part represents a single model, with the true model only being realized when all parts are present.

ensemble_machine_learning
In this variation, an instrument represents a class of model (e.g. the violin represents k-nearest neighbor). A class may be used more than once with different parameterizations. The ensemble image was taken from the home page of the Salzburg Orchestra.

Concretely, though, you may wonder why an ensemble is an improvement over a single model. As long as the model is complex enough, shouldn’t that be sufficient? There are two references I would like to recommend that nicely describe the utility of ensemble models, by Dietterich [3] and Polikar [4], respectively. I will summarize a few reasons why ensemble methods are useful below:

  1. Robustness to sample size. With small sample sizes, it is difficult to learn all parameters of a complex model with high accuracy. Rather, given different initializations, different models of the same class will learn different parameterizations, all equally accurate. Combining these models may yield a more accurate prediction.
  2. Robustness to local minima. Some models have a tendency to become caught in local minima. Combining models that settle in different local minima can yield a solution closer to the true optimum than any single one of them.
  3. Ability to expand the space of possible solutions. Again, a model is limited to a solution within a given parameter space. If the true solution is not within this parameter space, combining models can expand the set of possible solutions to include a parameterization closer to the optimal model.
  4. Ability to combine multiple data sources. Ensembles can be used as a way of fusing multiple data sources. In this case, a separate model is often learned for each source, and these models are then combined to obtain a final model.

Types of Ensemble Methods

Here, I describe the methods for combining models in an ensemble and delve into some basic ensemble frameworks.

Methods for Combining Models

Classification models are often combined using majority voting. In this framework, the class predicted by the majority of models in the ensemble is reported as the true class. This is reasonable if all models are equally accurate predictors, but that is not always the case. A variant of this method is weighted majority voting, in which the weight of each model’s output toward the final output is determined by its accuracy on the training data. Other common methods include algebraic combination, selection of a single “best” model for each input datum, or combination using fuzzy logic [5].

weighted_majority_voting.png
A weighted majority voting ensemble. Here, each model predicts an outcome, its outcome is weighted by the training accuracy, and all outcomes are combined to produce a final outcome.
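As a concrete illustration, here is a minimal sketch of weighted majority voting in Python with scikit-learn. The choice of base models, the toy dataset, and the use of training accuracy as the weight are assumptions made for illustration, not a prescription.

```python
# A minimal weighted majority voting sketch (binary classification, toy data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(), GaussianNB()]
weights = []
for model in models:
    model.fit(X_train, y_train)
    weights.append(model.score(X_train, y_train))  # weight = training accuracy

# Each model casts a vote for its predicted class, scaled by its weight.
votes = np.zeros((X_test.shape[0], 2))
for model, w in zip(models, weights):
    votes[np.arange(X_test.shape[0]), model.predict(X_test)] += w

final_prediction = votes.argmax(axis=1)
print("Ensemble accuracy:", (final_prediction == y_test).mean())
```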

Basic Ensemble Frameworks / Techniques

Bagging [6]

In bagging, the aim is to have the models in the ensemble learn diverse parameters by training each model on a random subset of the data. This is useful because, if each model learns different parameters, the chance that the models’ errors are correlated is reduced. This allows each model to provide a useful contribution to the outcome.

bagging.png
An illustration of a bagging model, where the set of samples selected for each model is represented using color.
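A minimal bagging sketch using scikit-learn’s BaggingClassifier follows; the dataset and parameter values are placeholders, and the default base model (a decision tree) is used.

```python
# Bagging: each base model is trained on a bootstrap sample of the data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bagged = BaggingClassifier(
    n_estimators=50,    # number of models in the ensemble
    max_samples=0.8,    # fraction of samples drawn for each model
    bootstrap=True,     # sample with replacement
    random_state=0,
)
print(cross_val_score(bagged, X, y, cv=5).mean())
```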

Addition of Gaussian Noise [7]

Another approach toward the same goal is the addition of Gaussian noise to each model. Like bagging, the addition of noise helps to diversify the set of parameters learned for each model. It can be used in conjunction with bagging, as shown below, but that need not be the case.

gaussian_noise.png
An illustration of an ensemble with added Gaussian noise. This ensemble combines Gaussian noise with bagging.

Boosting [8]

Boosting can be thought of as a “smarter” variant of bagging. Rather than sampling randomly, each model’s training samples are determined using the predictions of the previous models: samples on which the previous models disagree or err are emphasized when training the current model. In the AdaBoost variant, samples misclassified by the previous models are given higher weight when training the current model. Thus, the last models to be trained handle the most difficult classification cases.

boosting.png
An illustration of boosting. The samples used for training each model are color-coded, as in the bagging illustration.
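A minimal AdaBoost sketch with scikit-learn; the dataset and the parameter values are placeholders chosen for illustration.

```python
# AdaBoost: each new weak learner gives more weight to previously misclassified samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

boosted = AdaBoostClassifier(
    n_estimators=100,    # number of sequentially trained weak learners
    learning_rate=0.5,   # shrinks each learner's contribution; a tuning choice
    random_state=0,
)
print(cross_val_score(boosted, X, y, cv=5).mean())
```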

Feature Manipulation [9][10]

In feature manipulation, it is not the samples which are divided when training the models, but the features. Each model is trained using a subset of features from the input, after which the models are combined. Again, the goal is to diversify the set of models so that error is uncorrelated. Random forests often use this technique. Other than simply subsetting the features, some techniques involve projecting the features onto novel vector spaces, such as the eigenspace.

feature_manipulation.png
An illustration of feature manipulation.

Error-Correcting Output Codes [11]

This can be useful for data with many classes, and it relies on the ability to decompose the set of output classes into uncorrelated bits (this can be a difficult task). Here, the same data is input into each model, but each model is a binary classifier focused on predicting just one output bit. Assuming that the bits are uncorrelated, the models’ errors should be as well. Once each bit is predicted, a nearest neighbor technique is used to determine the class associated with that set of bits.

error_correcting_output_codes.png
An illustration of Error-Correcting Output Codes.
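A minimal error-correcting output codes sketch using scikit-learn’s OutputCodeClassifier on a toy multi-class problem; the base classifier, code size, and dataset are assumptions for illustration.

```python
# ECOC: each binary classifier predicts one bit of a class codeword.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OutputCodeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=6, random_state=0)

ecoc = OutputCodeClassifier(
    estimator=LogisticRegression(max_iter=1000),  # one binary model per code bit
    code_size=2.0,   # number of code bits relative to the number of classes
    random_state=0,
)
print(cross_val_score(ecoc, X, y, cv=5).mean())
```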

Random Weight Initialization [12]

For models that are not robust to weight initialization, the use of random weights to initialize each model can diversify the parameter space. This is often used in combination with one of the other techniques described above.

random_weight_initialization.png
An illustration of an ensemble model with randomized weights.

Stacking [13]

This looks like a neural network but, unless each model is a perceptron, it isn’t. Here, rather than using sampling techniques to diversify the parameter space, decision models at the second layer learn directly how the outputs of the first-layer models should be combined. Note that, like a neural network, this technique increases the space of parameters to learn and may not be a good choice for small sample sizes.

stacking.png
An illustration of a stacked ensemble model.
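A minimal stacking sketch with scikit-learn’s StackingClassifier; the first-layer models, the combiner, and the dataset are placeholders.

```python
# Stacking: a second-level model learns how to combine the first-level outputs.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

stacked = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),  # the second-layer combiner
    cv=5,  # out-of-fold predictions are used to train the combiner
)
print(cross_val_score(stacked, X, y, cv=5).mean())
```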

Mixture of Experts [14]

A mixture of experts model is similar to a stacked ensemble, but instead of learning a layer of additional models on top of the first layer, it learns a gating network over the model outputs. This is called a mixture of experts because, for a given input, the set of “experts” that can best classify that input varies. In this way, it is similar in principle to boosting, but the mixture of experts uses only the output of a subset of relevant models. The illustration below shows different classes of models being used but, notably, this need not be the case. Likewise, the frameworks illustrated above with a single class of model can also combine different classes.

mixture_of_experts.png
An illustration of a Mixture of Experts model. Different shapes are used to illustrate different potential classes of models.

Composite Classifier [15]

In a composite classifier, a space of data that cannot be reliably classified by the first model is fed into the second for classification. This is similar in theory to boosting, but the unclassifiable space is actually estimated directly during training of the first model. This was a very early type of ensemble method first described in 1979 and consisting of a linear classifier (a type of proto-SVM) fed into a KNN model. In theory, other types of models could be used.

composite_classifier.png
An illustration of the initial composite classifier described in 1979.

Dynamic Classifier Selection [16]

In this framework, a single model is used to predict each datum based on its similarity to training data. The way this works is that each training datum is classified using each of the models, and the best model for that datum is chosen and tracked. Then, each datum in the testing set is mapped to its nearest neighbor in the training set, and the best predictor for that neighbor is used to predict the datum. Like the mixture of experts or the stacked framework, this technique assumes that not all models are useful for all data. However, it is more stringent in that it allows only a single model to be used for each datum.

dynamic_classifier_selection.png

Conclusion

Ensemble models provide multiple advantages over single models for predicting outcome, but a necessary condition for high performance is that the models in the ensemble not make the same mistakes at the same time. There are multiple techniques described above which work to achieve this goal in different ways. Note that the techniques described above represent basic classes of ensemble frameworks (as per the dates cited below) and are not necessarily an exhaustive list.

References

  1. Kuehlkamp,A. et al. (2017) Gender-From-Iris or Gender-From-Mascara? In, IEEE Winter Conference on Applications of Computer Vision.
  2. Su,J. et al. (2019) One Pixel Attack for Fooling Deep Neural Networks. In, IEEE Transactions on Evolutionary Computation.
  3. Dietterich,T.G. (2000) Ensemble Methods in Machine Learning. Int. Work. Mult. Classif. Syst., 1857, 1–15.
  4. Polikar,R. (2009) Ensemble learning. Scholarpedia, 4, 2776.
  5. Scherer,R. (2010) Designing boosting ensemble of relational fuzzy systems. Int. J. Neural Syst., 20, 381–8.
  6. Breiman,L. (1994) Bagging Predictors. Berkeley, California.
  7. Raviv,Y. and Intrator,N. (1996) Bootstrapping with Noise: An Effective Regularization Technique. Conn. Sci., 8, 355–372.
  8. Freund,Y. and Schapire,R.E. (1999) A Short Introduction to Boosting.
  9. Cherkauer,K.J. (1996) Human Expert-Level Performance on a Scientific Image Analysis Task by a System Using Combined Artificial Neural Networks. AAAI Press.
  10. Tumer,K. and Ghosh,J. (1996) Error Correlation and Error Reduction in Ensemble Classifiers. Conn. Sci., 8, 385–404.
  11. Dietterich,T.G. and Bakiri,G. (1995) Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research, 2, 263-286.
  12. Kolen,J.F. and Pollack,J.B. (1990) Back Propagation is Sensitive to Initial Conditions. Complex Syst., 4.
  13. Wolpert,D.H. (1992) Stacked Generalization. Los Alamos, NM.
  14. Jacobs,R.A. et al. (1991) Adaptive Mixtures of Local Experts. Neural Comput., 3, 79–87.
  15. Dasarathy,B. V. and Sheela,B. V. (1979) A composite classifier system design: Concepts and methodology. Proc. IEEE, 67, 708–713.
  16. Giacinto,G. and Roli,F. (1999) Methods for dynamic classifier selection. In, Proceedings – International Conference on Image Analysis and Processing, ICIAP 1999., pp. 659

 

Diversity Metrics in the Microbiome

In ecology, the concepts of alpha diversity and beta diversity are frequently used to characterize habitats. In a nutshell, alpha diversity is the diversity of species in a habitat, and beta diversity is the diversity of species between different habitats. An illustration is given below.

diversity-perspectives.jpeg
Source: https://eco-intelligent.com/2016/10/14/alpha-beta-gamma-diversity/

The reason I (a non-ecologist) am writing about this is that these metrics are also used in microbiome analyses, and an R package for microbiome data, which includes these metrics, is now available from Bioconductor.

The authors of the package have given a nice introduction to their package here. In the section on alpha diversity, you can see that there is a plethora of diversity indices reported by the package. They also have some functions available for beta diversity analysis. I’m going to discuss some of the math behind these indices to motivate why a particular metric might be useful for a particular study.

Richness

Species richness in a sample is a simple concept: it is the number of species at the sample site. However, because some species may have abundances below the detection limit or may be missed due to gaps in the database, estimating true richness requires estimating the number of undetected species. The microbiome package uses Chao richness [1] as a lower bound of richness, which is defined as:

chao_richness_eq

Sobs is the number of observed species, f1 is the number of singleton species (species observed exactly once), and f2 is the number of doubleton species (species observed exactly twice). The assumption being made here is that the number of undiscovered species can be estimated as the square of the singleton count divided by twice the doubleton count; intuitively, if many species have been seen only once, many more have probably not been seen at all. This may be a fair assumption for many applications, but it’s a good idea to think about whether it makes sense for the data you are analyzing.

For the rest of this post, I will denote richness (in the general sense as number of species) using R.
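To make the estimator concrete, here is a minimal sketch in Python. The counts vector, variable names, and the fallback used when there are no doubletons are my own assumptions for illustration; the microbiome package’s implementation may differ in its details.

```python
# A minimal Chao1 richness sketch, assuming `counts` holds per-species counts
# for a single sample (the data and names are placeholders).
import numpy as np

def chao1(counts):
    counts = np.asarray(counts)
    s_obs = np.sum(counts > 0)    # observed species
    f1 = np.sum(counts == 1)      # singletons: species seen exactly once
    f2 = np.sum(counts == 2)      # doubletons: species seen exactly twice
    if f2 == 0:
        # One common bias-corrected fallback when there are no doubletons.
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + f1 ** 2 / (2.0 * f2)

print(chao1([10, 4, 1, 1, 2, 1, 0, 3]))  # 7 observed species, estimate 11.5
```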

Alpha Diversity

Richness gives a count of the number of species, but alpha diversity takes that a step further and examines the proportions of species. Here, in addition to species count, species abundance also comes into play.

Inverse Simpson Diversity

The Inverse Simpson Diversity metric [2] was first published by Edward Simpson in 1949 and is fairly straightforward:

inverse_simpson_diversity

Here, ni is the abundance of species i, and N is the total abundance of all species, so the denominator is the sum of squares of all abundance ratios. When species are evenly distributed, this sum is expected to be low, especially in systems with high numbers of species; conversely, a system dominated by one or two species will have a high value in the denominator. Diversity will be maximized when there are N species, each with an abundance of 1, and minimized when there is a single species with an abundance of N.

Gini Simpson Diversity

This metric [3] is very similar to the Inverse Simpson metric. The only difference is that, instead of taking the inverse, the sum of squares is subtracted from 1, i.e.:

gini_simpson_diversity.PNG

The effect is the same, but this metric is scaled between 0 and 1, so the differences between high and low sum-of-squares values are less extreme than with the Inverse Simpson metric.

Shannon Diversity

Claude Shannon developed his metric of entropy to use in information theory [4]. Today, it is used in many fields, such as measuring diversity in the microbiome.

shannon_eq.PNG

The best way to understand Shannon entropy/diversity is to look at a plot of its behavior with only two species and log base 2. The entropy is highest when both are in equal proportion, and lowest when one species is completely dominant. A word of caution: things change when more species are added. With 4 species, the entropy maxes out at 2 instead of 1.

EntropyVersusProbability
Source: http://matlabdatamining.blogspot.com/2006/11/introduction-to-entropy.html
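Here is a minimal Python sketch of the three abundance-based metrics discussed so far; the toy abundance vectors are placeholders, and the log base for Shannon entropy is a choice (base 2 below, to match the plot above).

```python
# Inverse Simpson, Gini-Simpson, and Shannon diversity from a vector of abundances.
import numpy as np

def alpha_diversity(counts, base=2):
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()          # abundance ratios n_i / N
    sum_sq = np.sum(p ** 2)            # sum of squared abundance ratios
    inverse_simpson = 1.0 / sum_sq
    gini_simpson = 1.0 - sum_sq
    p_nonzero = p[p > 0]
    shannon = -np.sum(p_nonzero * np.log(p_nonzero) / np.log(base))
    return inverse_simpson, gini_simpson, shannon

print(alpha_diversity([25, 25, 25, 25]))  # perfectly even, 4 species: Shannon = 2
print(alpha_diversity([97, 1, 1, 1]))     # dominated by a single species
```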

Fisher Diversity

Fisher’s diversity [5] is not so straightforward. It assumes that the number of species expected to be represented by exactly j individuals (i.e. fj) follows a log-series distribution, i.e.

fisher.PNG

The diversity metric is then the value of alpha that best fits these equations given the observed breakdown of species by number of individuals. Empirically, alpha approximates the number of species per individual. This metric may be useful if you wish to examine the diversity within a group of individuals, rather than pooling together all samples in the group. It won’t be particularly useful for evaluating the dominance or evenness of species.

Coverage Diversity

Coverage diversity tells you the number of species needed to cover at least half of the total abundance.

coverage.PNG

This also does not tell you whether the species abundances are even. To assess that, you would need to compare against the total number of species and possibly look at cutoffs other than 0.5. What it does tell you, which the Simpson and Shannon metrics do not, is the number of species needed to make up a given portion of the abundance.

Evenness

The alpha diversity metrics may focus on different aspects of diversity, such as count vs. dominance of a species or evenness in the abundance of species. Evenness metrics are specifically designed to compute evenness of abundance, without being affected by species count.

Camargo Evenness

Camargo evenness [6] is based on a sum of pairwise differences in abundance between species (the term in the numerator). This sum is normalized by an unattainable “worst case scenario” for evenness in which the difference in abundance between each pair of species is itself the abundance of one member of the pair. When there is little difference in abundance between species, the normalized sum is low, and subtracting it from 1 yields a high evenness value.

camargo

Pielou Evenness

The Pielou Evenness [7] is related to Shannon entropy. In fact, the term in the numerator is exactly the Shannon entropy, which is fitting because high entropy indicates evenness of the distribution of species. It is normalized by the logarithm of the species richness, which is the maximum possible entropy. This helps to mitigate the problem of differing species counts’ effects on the entropy range.

pielou.PNG

Simpson Evenness

Simpson evenness is closely related to Simpson diversity. It is just the Inverse Simpson Diversity divided by the richness. As with the normalization in the Shannon metric, this is done to mitigate the effect of species count on the range of the metric.

simpson_evenness.PNG

Evar Evenness

This evenness metric is more complex, but is described in [8]. What is being computed is a logged variance in abundances, which is then rescaled to be between 0 and 1 using an arctan function. Variance is a sensible metric here, because a system with evenness in species abundance is expected to have low variance.

evar.PNG

Bulla Evenness

Bulla Evenness [9] focuses on abundance ratios lower than expected under an even distribution of abundance across all species. It computes this using the minimization function in the summation. In an ideal scenario of evenness, each abundance ratio equals the inverse of the richness (the number of species), and the resulting value is 1. The more abundances dip below this expected value, the smaller the value in the numerator will be.

bulla.PNG
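As a concrete illustration, here is a minimal Python sketch of two of the evenness metrics above, Pielou and Bulla, reconstructed from the standard formulas; the rescaling used for Bulla and the toy abundance vectors are my own assumptions, and the microbiome package’s implementation may differ slightly.

```python
# Pielou and Bulla evenness from a vector of abundances (placeholder data).
import numpy as np

def pielou_evenness(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    shannon = -np.sum(p * np.log(p))
    return shannon / np.log(len(p))          # entropy normalized by log(richness)

def bulla_evenness(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    richness = np.sum(counts > 0)
    # Sum of min(p_i, 1/R): equals 1 only when every abundance ratio is exactly 1/R.
    o = np.sum(np.minimum(p, 1.0 / richness))
    # A common rescaling so the index runs from 0 (maximally uneven) to 1 (even).
    return (o - 1.0 / richness) / (1.0 - 1.0 / richness)

print(pielou_evenness([25, 25, 25, 25]), bulla_evenness([25, 25, 25, 25]))  # both 1.0
print(pielou_evenness([97, 1, 1, 1]), bulla_evenness([97, 1, 1, 1]))        # both low
```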

Dominance

Absolute Dominance

Absolute dominance is very simple: it is the maximum abundance value. This can help to identify the maximum abundance in a dominant species, but it does not identify the dominance of the additional species, nor does it take into account the total abundance.

Relative Dominance

Relative dominance does take into account the total abundance. There are variations of relative dominance, but all can be represented using the following formula:

relative.PNG

Here, the k most abundant species will be used to compute the dominance. If k = 1, this is called the DBP metric [10], and it will be based only on the most abundant species. If k = 2, this is called the DMN metric [11], and it will be based on the top two most abundant species.
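A minimal sketch of relative dominance in Python; the abundance vector is a placeholder, and k is the only parameter.

```python
# Relative dominance: share of total abundance held by the k most abundant species.
import numpy as np

def relative_dominance(counts, k=1):
    counts = np.asarray(counts, dtype=float)
    top_k = np.sort(counts)[::-1][:k]     # the k most abundant species
    return top_k.sum() / counts.sum()

counts = [50, 30, 10, 5, 5]
print(relative_dominance(counts, k=1))    # DBP (Berger-Parker): 0.5
print(relative_dominance(counts, k=2))    # DMN (McNaughton): 0.8
```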

Simpson Dominance

Simpson dominance is simply the sum of squared abundance ratios, i.e:

simpson_dominance

Note that this is the reciprocal of the Inverse Simpson metric used to measure diversity. If one considers dominance to be the opposite of diversity, then this may be a good choice.

Core Abundance

The core abundance measures the sum of all abundance ratios above a threshold. This is similar to the relative dominance, except that the cutoff is based on a threshold and not a ranking.

core.PNG

Gini Dominance

Finally, the Gini Index [12] was developed in 1912 by Corrado Gini to measure wealth distribution. The idea is somewhat similar to Camargo evenness, in that it makes use of absolute differences in abundance and normalizes them.

gini.PNG

A good way to understand the Gini Index is by looking at what it was initially trying to measure: the deviation of an economic system from one in which wealth is distributed equally. The Lorenz curve plots the cumulative share of wealth against the cumulative share of the population, and the Gini Index is double the area between the Lorenz curve and the line of perfect equality.

gini_lorenz.png
Source: https://towardsdatascience.com/gini-coefficient-and-lorenz-curve-f19bb8f46d66

Since we’re talking about species abundance rather than income, the Gini Index in this context measures the relationship between the cumulative share of species and the cumulative share of abundance. Intuitively, if only a few species are highly abundant, the area between the Lorenz curve and the line of equality will be larger than if many species are equally abundant, leading to a higher Gini Index.

Rarity

Rarity indices measure the amount or proportion of rare species in a system, where a rare species is one that is present, but not abundant.

Log Modulo Skewness

This metric [13] is based on the skewness of the statistical distribution of the abundances. Intuitively, the more skewed a distribution is, the higher the number of rare species represented. The expression inside of the sign and log functions is the standard skewness expression.

log_modulo_skewness.PNG

For those not familiar with skewness, it is best understood using an illustration. The figure below illustrates positive skew, no skew, and negative skew.

skew.jpeg
Source: https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa

Low Abundance

Low abundance measures the proportion of abundance ratios that are comprised of rare species. Here, rarity is defined using abundance below a given threshold.

low_abundance.PNG

Non-Core Abundance

Non-core abundance is the exact opposite of core abundance (used as a dominance metric), i.e:

noncore.PNG

This tells us how much of the abundance is not accounted for by the core abundance.

Rare Abundance

Rare abundance is very similar to low abundance, with the difference being that low abundance thresholds the raw abundance value, whereas rare abundance thresholds the ratio of abundance in the species to total abundance. Specifically, for the microbiome package, a cutoff of 0.2 is used.

rare.PNG

Beta Diversity

Only one type of beta diversity between samples is computed using the microbiome package, and that is the one used in [14], i.e. a statistical analysis of the Bray-Curtis beta diversity matrix between samples. The Bray-Curtis metric [15] is defined as follows:

bray_curtis.PNG

Here, j and l are two samples in the data set. The second term is the abundance shared between the two samples (summed across species present in either sample), divided by the sum of the two samples’ total abundances. If only a small fraction of abundance is shared, the result is high diversity.
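Here is a minimal Python sketch of the Bray-Curtis dissimilarity between two samples; it assumes the two abundance vectors are aligned over the same ordered set of species, and the toy vectors are placeholders.

```python
# Bray-Curtis dissimilarity between two abundance vectors over the same species.
import numpy as np

def bray_curtis(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    shared = np.minimum(x, y).sum()              # abundance shared by the two samples
    return 1.0 - 2.0 * shared / (x.sum() + y.sum())

print(bray_curtis([10, 5, 0, 3], [10, 5, 0, 3]))  # identical samples: 0
print(bray_curtis([10, 0, 0, 0], [0, 0, 5, 5]))   # no shared species: 1
```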

There are other metrics of beta diversity as well, and these can be found in the vegan R package. You can find a description of those beta diversity metrics here.

References

[1] Chao,A. and Chiu,C.-H. (2016) Species Richness: Estimation and Comparison. In, Wiley StatsRef: Statistics Reference Online., pp. 1–26.

[2] Simpson,E.H. (1949) Measurement of diversity. Nature, 163, 688.

[3] Jost,L. (2006) Entropy and diversity. Oikos, 113, 363–375

[4] Shannon,C.E. (1948) A Mathematical Theory of Communication. Bell Syst. Tech. J., 27, 379–423.

[5] Fisher,R.A. et al. (1943) The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population. J. Anim. Ecol., 12, 42–58.

[6] Camargo,J.A. (1995) On Measuring Species Evenness and Other Associated Parameters of Community Structure. Oikos, 74, 538.

[7] Pielou,E.C. (1966) The measurement of diversity in different types of biological collections. J. Theor. Biol., 13, 131–144.

[8] Smith,B. and Wilson,J.B. (1996) A Consumer’s Guide to Evenness Indices. Oikos, 76, 70.

[9] Bulla,L. (1994) An Index of Evenness and Its Associated Diversity Measure. Oikos, 70, 167.

[10] Berger,W.H. and Parker,F.L. (1970) Diversity of planktonic foraminifera in deep-sea sediments. Science, 168, 1345–7.

[11] Mcnaughton,S.J. (1967) Relationships among functional properties of Californian Grassland. Nature, 216, 168–169.

[12] Gini, C. (1912) Variabilità e mutabilità. In: Pizetti, E. and Salvemini, T., Eds., Rome: Libreria Eredi Virgilio Veschi, Memorie di metodologica statistica.

[13] Locey,K.J. and Lennon,J.T. (2016) Scaling laws predict global microbial diversity. Proc. Natl. Acad. Sci. U. S. A., 113, 5970–5975.

[14] Salonen,A. et al. (2014) Impact of diet and individual variation on intestinal microbiota composition and fermentation products in obese men. ISME J., 8, 2218–2230.

[15] Bray,J.R. and Curtis,J.T. (1957) An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecol. Monogr., 27, 325–349.

Understanding Metabolomics Data: An Overview for Bioinformaticians

Last semester, I took a course called Metabolomics Practicum at The Ohio State University, taught by several professors, each an expert in a different area of metabolomics.

The course introduced me to aspects of metabolomics data sets that I, and I think it’s safe to say most bioinformaticians, hadn’t commonly thought about. Data sources and variation factors are significant in metabolomics but are often overlooked by data scientists and statisticians, who generally assume data is “clean” before starting analysis. This post offers a highly condensed version of the course material for bioinformaticians from backgrounds similar to mine. My hope is that it will help others both understand the origins of their data and look at metabolomics papers and data sets with a critical eye.

Metabolomics Experiment Overview

Metabolomics is the study of all small molecules in a biological system. “Small” is defined differently by different sources – one definition includes only metabolites < 1 kDa. Metabolomics studies fall into two groups:

  1. Targeted metabolomics (designed to test a hypothesis, focuses on specific metabolites)
  2. Untargeted metabolomics (designed to generate a hypothesis, focuses on surveying all metabolites).

In either case, both the sample processing and experimental workflow are chosen with the scientific question in mind, e.g.

  • How does the metabolomic profile of a tomato grown with fertilizer X differ from the metabolomic profile of a tomato grown with fertilizer Y?
  • How does the plasma metabolome of a patient given drug X differ from that of a patient given drug Y?

Samples from relevant populations are then collected, extracted, and run through an instrument (either a mass spectrometer or an NMR instrument). These instruments have a detector that measures (for MS) the mass-to-charge (m/z) ratios and retention times of the compounds or (for NMR) the chemical shifts of the compounds in a magnetic environment. (The retention time of a compound is the time when it exits the column – more on that later). The end result is a three-dimensional plot, where the dimensions are m/z, intensity (frequency of detector hits), and retention time.

Picture1
A full 3D mass spectrometry output, including intensity, retention time, and m/z. 2D simplifications, namely a chromatogram and mass spectra, are also shown. Image source: Wikipedia

MS plots are often projected onto two dimensions for interpretability. There are a few types of two-dimensional plots used. One is the total ion chromatogram, which plots retention time by intensity, summed over all m/z; here, multiple metabolites can potentially be represented by a single peak. Another is the base peak chromatogram, in which only the maximum-intensity peak for a given retention time is shown. Finally, there is the mass spectrum, which shows m/z by intensity for a given retention time.
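To make these projections concrete, here is a minimal Python sketch using a toy intensity matrix; the array shape, bin indices, and random data are placeholders, not real instrument output.

```python
# The 2D projections of a (retention time x m/z) intensity matrix described above.
import numpy as np

rng = np.random.default_rng(0)
intensity = rng.random((100, 500))    # rows: retention time bins, columns: m/z bins

tic = intensity.sum(axis=1)           # total ion chromatogram: sum over all m/z
bpc = intensity.max(axis=1)           # base peak chromatogram: max intensity per RT bin
mass_spectrum = intensity[42, :]      # mass spectrum at one retention time bin

print(tic.shape, bpc.shape, mass_spectrum.shape)
```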

Even 2D plots can be difficult to interpret: compounds don’t always separate neatly from one another, and they fragment during the ionization process used by the instrument. However, software (sometimes open source, but often proprietary) exists to separate out peaks in the chromatogram. This is called “peak picking” and is usually followed by detecting features (peak sets) believed to correspond to metabolites. Once features are detected, the researcher can look up the corresponding metabolites using mass-to-charge ratios and adducts or fragments detected by the software. Although NMR spectra are different, they too require peak-picking software and spectrum matching for identification.

In any metabolomics study, most features will not be identified. This happens for multiple reasons. Sometimes, the configuration of the experiment or sample processing procedure prevents some metabolites from being detected. Sometimes, spectra are clear but cannot be matched to anything in a database or a standard (two ways of confirming metabolite identity – more on that later). And sometimes, metabolites are overly fragmented, so their spectra are not recognizable or are misidentified. Even for those features that are identified, they often cannot be quantified exactly; all that can be determined is the abundance of a feature in one group of samples compared to another.

known_known
The above categories of metabolites will all likely be present in any metabolomics study.

You as an analyst should think about how the experimental design, extraction method, components and settings of the instrument, and peak-picking procedures affect the metabolites detected. You should be especially aware of this when comparing results from different studies. Two different labs were likely trying to answer slightly different questions, and they designed their experiments to answer those questions. People conducting primary research generally don’t think about secondary analysis when they design their experiments because they need to design them in the most efficient and cost-effective way for their lab.

Experimental Design Effects

There are many factors when it comes to experimental design, and one of the most important precautions is to ensure that external factors are not contributing to apparent metabolomic differences between samples. Here are a few ways that researchers minimize these external factors; however, they aren’t all feasible for every study. When interpreting results, you should check whether these aspects of the design are documented in the study.

  • Storage time and procedure: All samples should be stored in the same way, in the same type of container, for the same amount of time. Freezing and thawing a sample can affect its metabolite profile, and so can its contact with the container. Keeping these factors consistent is an ideal scenario and doesn’t always happen in reality.
  • Processing: All samples should be extracted in the same way (and ideally by the same lab tech) and run through the same experiment all in one batch. But again, this doesn’t always happen, especially if there is a large number of samples. Common practices include flushing the instrument with solvent after each batch to prevent contamination from previous samples and randomizing the run order.
  • Replicates: Are the replicates technical (coming from the same organism) or biological (coming from different organisms in the same group)? This can have a significant effect on the results.
  • Quality Control: Quality control samples (QC) are often used for finding outliers after peaks are picked as well as for correcting for differences between runs. The QC sample is usually a pooled combination of all samples and is run along with other samples. Feature abundances can then be corrected when normalizing by QC (a minimal sketch of this appears after this list).
  • Blanks: Blanks can be run to correct for effects due to sample processing and storage. Creating blanks may involve putting water or another neutral substance through the same process as the sample, storing water in the same container for the same amount of time as the sample, or adding pH-balancing buffers to both the sample and the blank. Hypothetically, it is also possible to have a blank created by each tech, have a separate blank for each batch, etc, to correct for other effects. In reality, this is often limited by resources and time.
  • Internal standards: These are pure compounds that are run through the instrument along with the samples. Because the quantity of the compound is known, they can be used to normalize and even give an absolute quantification of metabolite abundance. However, standards aren’t always available for a given experiment.
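Here is a minimal Python sketch of the QC-based correction mentioned above: each feature is divided by its median abundance across the pooled QC injections, and features that vary too much across QC are flagged. The matrix, QC layout, and the simple median-based scheme are my own assumptions for illustration; real studies often use more sophisticated drift-correction methods.

```python
# A toy QC-normalization sketch: `abundances` is (injections x features), `is_qc`
# flags the pooled QC injections (all data and names are placeholders).
import numpy as np

rng = np.random.default_rng(0)
abundances = rng.lognormal(mean=10, sigma=1, size=(20, 5))
is_qc = np.zeros(20, dtype=bool)
is_qc[::5] = True                                   # e.g. every fifth injection is a QC

qc_median = np.median(abundances[is_qc], axis=0)    # per-feature median across QC runs
normalized = abundances / qc_median                 # abundances relative to QC

# Features that are highly variable across QC injections are often flagged or removed.
qc_cv = abundances[is_qc].std(axis=0) / abundances[is_qc].mean(axis=0)
print(normalized.shape, qc_cv.round(2))
```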

The following image shows an example of samples from an experiment using QC, blanks, and internal standards.

sample_layout
This experiment includes multiple QC samples, a plasma standard, and a few types of blanks in a tray along with the study samples. Source: John-Williams et al., Sci Data 2017

Extraction Methods

Before they can be run through the experiment, samples must be extracted. Extraction protocols differ widely depending on the metabolites of focus, and they also differ between instrument types. Depending on the goal of the experiment, samples may be frozen, dried, centrifuged, extracted using a solvent, and/or diluted. Some frequently-analyzed substances, such as human plasma, have established protocols. Others do not, in which case the extraction protocol must be defined ad hoc by the lab.

Aside from the type of sample, the extraction protocol is also going to depend on the instrument being used. In liquid chromatography and NMR, extraction procedures can vary and are usually dependent on the metabolites of focus. For GC-MS, the sample must be in the gaseous state, with the compounds distributed throughout the gas, so extraction options are limited. For substances that are not volatile, a chemical process called derivatization is used to make them volatile, and then the extraction procedure is performed.

One way to extract metabolites for GC-MS is by collecting the gas above the sample (called the headspace), which depends on the solubility of the compounds and the vapor pressure. It is also possible to flush nitrogen through the sample and then collect it. The efficacy of this method depends on the rate of transfer of the compounds through the nitrogen. Finally, a method called Dynamic Headspace Extraction (illustrated below) was devised to minimize these dependencies.

Picture2
Dynamic Headspace Extraction. Image source: http://www.gerstel.com/en/dhs-scheme.htm

Another approach is solid phase microextraction (SPME), which uses a fiber to collect a heated sample. The fiber becomes saturated quickly, so the amount of sample that can be extracted is very small. This small sample becomes problematic when trying to separate true signal from noise. In addition, fibers are not always made consistently, making reproducibility difficult. In spite of these drawbacks, it is a simple approach that is commonly used.

spme
Solid Phase Microextraction (SPME). Image source: K Schmidt, 2015.

Instruments

Here, we will discuss two ways in which a sample can be analyzed after the preparation and extraction is complete.

  1. It can be passed through a column (known as chromatography), ionized, and then detected using mass spectrometry.
  2. It can be analyzed using an NMR instrument.

In general, MS methods have higher sensitivity and can therefore detect more compounds. But NMR methods have much higher reproducibility and selectivity (i.e. similar compounds can be distinguished more reliably). For this reason, one might use MS if compound coverage is most important for a study and NMR if precise identification is most important. We will discuss each procedure further here.

Chromatography

After sample preparation and extraction, the sample is passed through a column, which is used to separate out the sample’s components. This may be done using an LC column or a GC column.

Liquid Chromatography (LC)

LC is generally used for samples that are not easy to derivatize (i.e. chemically modify so that their compounds become volatile). It uses two key components to separate out compounds for detection: a solvent (mobile phase) and a column (stationary phase).

lcms
An overview of the LCMS process using an HPLC column. The solvent is pumped into the column along with the sample, where metabolites separate before being run through a mass spectrometer and detected. Image Source: Waters: The Science of What’s Possible.

Depending on the polarity of the compounds (i.e. how attracted they are to water), different columns and solvents will be used. This is very important for analyzing spectra. The goal is to have all molecules pass through the column and reach the detector so that their spectra can be quantified, but not all at once. Therefore, the ideal column is one to which all compounds of focus will be attracted to varying degrees, but to which no compound will be so attracted as to stick to it without passing through the column.

Different columns are optimized for different types of compounds, and compounds also react while in the column, so it is impossible to capture all types of compounds with a single column. A few key types of columns are used in LC: Normal Phase, Reverse Phase, and HILIC. Each of these column types is paired with a specific type of solvent and targets a particular range of hydrophobicity. Normal phase columns are usually used for very hydrophobic compounds, whereas HILIC is used for very hydrophilic compounds; reverse phase is used for compounds that fall into an intermediate category. Bearing this in mind, think about what will happen if a researcher has a solution of hydrophobic compounds but runs HILIC. Those compounds will immediately exit the column and reach the detector. They will show up as a jumbled mess at the beginning of the chromatogram, rather than being nicely separated for easy identification. So, the types of compounds identified in an experiment have a lot to do with the column used to run the experiment.

The reactivity and variable attraction of compounds while in the column is a key reason why MS methods are said to be comparative, not quantitative. Because the chemical environment is complex, it is difficult to determine how much of a compound is getting lost because of its attraction to the column or reaction with other compounds.

Liquid Chromatography (LC) Ionization

After compounds pass through the column, they are ionized and launched into a mass spectrometer for detection. When this is done after liquid chromatography, it is called LC-MS; when it is done after gas chromatography, it is called GC-MS. Ionization is important because mass spectrometers can detect only ions, and ionization procedures differ according to the type of chromatography.

For LC-MS, a common type of ionization is electrospray ionization (ESI). This is considered a type of soft ionization, which means that little fragmentation of the compound occurs during the process. ESI sprays droplets of the ionized sample (combined with solvent) into the MS, creating a vapor. The solvent evaporates, leaving behind the compounds to travel through the MS. Ionization for gas chromatography, as we will discuss, is different.

 

Gas Chromatography (GC-MS)

Gas chromatography is different from liquid chromatography in that it requires the sample to be in the gas phase, which often means it must be derivatized (i.e. chemically modified so that its compounds become volatile) before it can pass through the column. GC columns are much longer than LC columns and are coiled. However, like LC columns, they are coated with a stationary phase and used for separating the compounds. The selectivity and effectiveness of a GC column is determined by its diameter, length, and film thickness.

gc_col
The left and center images display the coiled shape of two types of GC columns. The image on the right depicts the layers found on the inside of the column.

Unlike LC, GC does not use soft ionization to launch compounds into the MS. It uses a type of hard ionization called Electron Impact (EI). This means that the spectra from GC will be much more fragmented than the spectra for the same sample run through LC. While this can be a factor in metabolite identification, it is noteworthy that the identification libraries for GC are more robust than those for LC, helping to mitigate this effect.

electron_impact
Electron Impact (EI), considered a type of hard ionization. Image source: http://science.widener.edu/svb/massspec/intro_to_ms/ei.html

Mass Spectrometry (MS)

Once ions in the gaseous state are produced using either of the ionization techniques discussed above, the ionized compounds travel through a mass spectrometer to ultimately meet the detector. The differences in trajectory through the mass spectrometer are what make the compounds identifiable. Mass spectrometers can be of a few types, but they often contain a quadrupole component. Quadrupole components use voltage changes to filter the compounds by mass, yielding m/z ratios. They are very sensitive, but not very selective. For this reason, they are typically used in conjunction with other, higher-resolution components; three common types are Time Of Flight (TOF), Orbitrap, and Fourier Transform Ion Cyclotron Resonance (FT-ICR).

quadrupole
A quadrupole contains rods with positive and negative charges, which are used to filter compounds using voltage. Image Source: http://www.chm.bris.ac.uk/ms/quadrupole.xhtml

TOF instruments allow the molecule to drift freely through a tube, and they measure m/z ratios using the time taken for a molecule to travel through the tube. Using a TOF in conjunction with a quadrupole increases selectivity, yielding better separation of molecules.

tof-sche
A diagram of a Time-Of-Flight (TOF) instrument. Image source: https://www.tissuegroup.chem.vt.edu/chem-ed/ms/tof.html
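As a rough illustration of how a TOF converts flight time into m/z, here is a minimal Python sketch of the idealized linear relationship (the kinetic energy gained from the accelerating voltage equals ½mv²); the drift length and voltage are placeholder values, and real instruments include reflectrons and calibration steps not modeled here.

```python
# Idealized TOF relationship: t = L * sqrt(m / (2 * z * e * V)).
import numpy as np

E_CHARGE = 1.602176634e-19    # elementary charge, C
AMU = 1.66053906660e-27       # atomic mass unit, kg

def flight_time(mz, drift_length=1.0, voltage=20000.0):
    """Flight time (s) for an ion with mass-to-charge ratio mz (Da per unit charge)."""
    return drift_length * np.sqrt(mz * AMU / (2.0 * E_CHARGE * voltage))

for mz in (100.0, 400.0, 900.0):
    print(mz, flight_time(mz))   # heavier ions take longer to reach the detector
```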

Orbitrap instruments work by using electrodes to trap ions so that the ions must move around a spindle. The m/z ratio is computed using the trajectory of each ion.

orbitrap
An Orbitrap instrument with electrodes and a spindle. Image Source: R. Do Ferreira Nascimento, 2017.

FT-ICR uses a strong magnet to induce cyclotron motion in the ions, and the frequency of this motion is measured using a Fourier transform. Compared to TOF and Orbitrap instruments, these have the highest selectivity. However, they are also less affordable and slower than TOF or Orbitrap.

ft_icr
An FT-ICR instrument with a magnetic field. Image source: http://www.mslab.ulg.ac.be/mslab/equipment/fticr-ms-bruker-apex-qe-9-4t/

MS/MS

One important point about MS methods is that they are not quantitative. A compound may be measured across two samples to determine its relative abundance within the samples, but this does not tell us the compound’s total abundance. MS/MS can be used to focus on a small portion of the MS spectra and quantify it. A second sample is needed for this process, and it requires a QTOF (a quadrupole attached to a TOF) with mass selection. The process is depicted below. Note that this can be done only when the researcher knows which mass range to consider. MS/MS is usually done as a follow-up to MS.

msms
A QTOF instrument being used for MS/MS. Selection of a precursor mass is an important component of MS/MS. Image source: J. Cooperstone, course lecture slides.

Nuclear Magnetic Resonance (NMR)

The mechanism used in NMR is quite different from that used in MS techniques. NMR instruments use large magnets to induce the oscillation of compounds. They do not separate compounds using a column, do not induce hard or soft ionization, and do not destroy the sample. The compounds are identified using spectra generated from their net magnetization, which depends on nuclear spin, a magnetic property of the nuclei.

nmr
An NMR instrument, with a magnet (center), workstation (left), and console (right). Source: E. Hatzakis, lecture slides.

The oscillation behavior of compounds in NMR is determined primarily by the nuclei of key atoms in the compound. In any given NMR experiment, a specific type of nucleus will be targeted by the instrument. Common nuclei are hydrogen, the carbon-13 isotope, the phosphorus-31 isotope, nitrogen-15, silicon-29, and fluorine-19. The researcher’s choice of nucleus will depend on which of these they expect to be most abundant in the sample. The sample is enclosed in a holder (probe) optimized for the nucleus of interest, and each atom of this type in a compound will have a separate signal.

The chemical shift of a nucleus (i.e. the shift in signal) is determined by the atoms surrounding it, but it may also be affected by factors like the pH of the sample, the temperature of the sample, and the solvent used. Impurities and solvents may also affect the spectrum, but there are methods for handling these. To achieve a homogeneous, optimal magnetic field for a given sample, a process called shimming needs to be conducted with the supervision of the researcher.

chem_shift
These general guidelines can be used to determine a molecule’s identity using chemical shift. In reality, identification of a compound using chemical shift can be complex and requires experience. Source: E. Hatzakis, lecture slides

J-coupling, the interaction between a nucleus and other nuclei surrounding it in the same compound, also affects the final spectra. NMR peaks often do not have a single summit, but multiple summits (called the multiplicity of the peak). This multiplicity is affected by J-coupling. An example of J-coupling is shown below.

jcoupling
A sample NMR spectrum. For the region from 9.2 to 7.6 ppm, there are clear examples of peaks with two summits (duplets), three summits (triplets) or more. Source: E. Hatzakis, lecture slides

2-D Approaches

Not everything that needs to be known about each compound can be inferred directly from the NMR spectra described above. For instance, a compound may contain two or more identical structural fragments; their signals will overlap completely and won’t be distinguishable. In addition, inference using J-coupling can reveal how many surrounding atoms of a certain type exist, but not exactly where they are in the molecule. 2-D approaches can help researchers to answer these questions.

  • Homonuclear COrrelation SpectroscopY (COSY) is a technique that is used to determine which interactions lead to J-coupling. It displays not only the spectra for each molecule, but a cross-peak showing where the coupling occurs.

cosy
A 2D COSY Spectrum with cross-peaks. Source: E. Hatzakis, lecture slides

  • TOtal Correlated SpectroscopY (TOCSY) is another technique that, in addition to showing interactions from J-coupling, also shows indirect interactions between nuclei of the same type. It is sometimes compared to running two steps of COSY. This can be used to resolve scenarios in which multiple fragments from the same compound are producing the same peak.
  • Nuclear Overhauser Effect Spectroscopy (NOESY) is used to determine the distances between nuclei in bonds. Like COSY and TOCSY, both dimensions use the same type of nucleus.

noesy
A 2D NOESY Spectrum. Source: E. Hatzakis, lecture slides

  • Heteronuclear Single Quantum Coherence Spectroscopy (HSQC) is used to find correlations between nuclei of different types. For instance, if used with hydrogen and carbon-13, it can find the fragments in which carbon and hydrogen nuclei are attached.

hsqc
A 2D HSQC spectrum using hydrogen and carbon-13. Source: E. Hatzakis, lecture slides

Metabolite Identification and Modeling

Metabolite identification consists of a few steps:

  1. Detecting peaks in the signal intensity across m/z and retention time.
  2. Grouping peaks to create features hypothesized to correspond to metabolites. This is often done using correlation of peaks across samples.
  3. Matching the spectra of features to known metabolites. This can be done using a pure standard or by querying the spectrum or maximum intensity peak of the spectrum against a database.

Peak Picking

A peak is a region of signal with intensity high enough that it is assumed not to be an anomaly. Peak picking often relies on statistical methods, which may be implemented in proprietary software or in open source software such as XCMS.

The primary peak picking algorithm in XCMS is called centWave. The user must specify several parameters:

  • Peak width
  • Signal to noise threshold
  • Maximum distance between peaks
  • Maximum change in m/z across scans for the peak
  • Minimum and maximum counts of sub-peaks per peak
  • Intensity noise threshold
  • Function to use in approximation of peak shape.

These parameters may not be intuitive to the average researcher. In addition, many researchers perform visual inspection on the list of final peaks to remove false peaks, and this is highly subjective. Both the parameters and a list of any manually removed peaks should be reported for the experiment to be truly replicable.

For NMR, adjustments need to be made to the spectrum before peaks are picked. A Fourier transform is used to convert the signal into the frequency domain, which sometimes includes zero-filling to increase the resolution of the signal. The signal’s phase and baseline are adjusted, usually manually. Finally, an internal standard is typically used to normalize the signal. All of this, along with the peak picking, is often done using proprietary software called TopSpin on Bruker NMR instruments. These manual steps, in addition to the proprietary nature of the software, can present challenges in replication.

Grouping Peaks and Features

Once peaks are detected for each sample, they must be grouped across samples. In XCMS, this is done using the retcor and group functions. Again, there are multiple parameters that the user must specify:

  • Method for calculating signal drift
  • Step size
  • Method for grouping signals
  • Deviation tolerance, and
  • Minimum number of samples containing the feature.

A manual data cleanup is usually performed after this step. Researchers may remove peaks found in a specified number of blanks, peaks not found in a specified number of QC samples, or peaks with more than a specified coefficient of variation across QC samples.

An algorithm called CAMERA can be used as an add-on to the output from XCMS. CAMERA uses known chemical rules to group features found across samples, focusing especially on isotopes and adducts in the spectra.

Modeling Data

To determine which metabolites to report as significant, researchers have many methods they may use. In these models, metabolites are used as features or independent variables, and sample class labels are used as the outcome or dependent variable.

For class discovery, a PCA or hierarchical clustering method is often used, with the sample labels overlaid. This indicates whether the sample labels are related to the metabolomic factors that drive the differences between samples. For class comparison, a t-test, Wilcoxon rank-sum test, or ANOVA may be used for each metabolite across sample groups, ideally with p-value or FDR correction. For class prediction, researchers may use a random forest or support vector machine, but it is common to see PLS-DA being used. As with previous steps, it is important to report the parameters given to these algorithms.
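Here is a minimal Python sketch of these three analysis styles on a toy abundance matrix; the data, the log transform, and the use of PLS regression on a binary label as a stand-in for PLS-DA are all my own assumptions for illustration.

```python
# Class discovery (PCA), class comparison (per-metabolite t-tests), and class
# prediction (PLS on a binary outcome) on toy data: (samples x metabolites).
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
features = np.log(rng.lognormal(size=(40, 200)))   # log-transformed abundances
labels = np.array([0] * 20 + [1] * 20)             # two sample groups

scores = PCA(n_components=2).fit_transform(features)   # class discovery
t, p = stats.ttest_ind(features[labels == 0],          # class comparison
                       features[labels == 1])          # (FDR correction would follow)
plsda = PLSRegression(n_components=2)                  # class prediction
cv = cross_val_score(plsda, features, labels, cv=5)

print(scores.shape, p.min().round(3), cv.mean().round(3))
```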

Identifying Metabolites

The results of a class discovery, class comparison, or class prediction algorithm will include a list of important metabolites, and these are generally the metabolites reported as significant. But they are still unknown! The next step is to identify them.

Mass Measurements

One option for identifying significant metabolites is to infer mass measurements from the m/z ratios and query them against metabolite databases. However, the most abundant mass within a feature doesn’t always represent the true mass of the compound. The mass detected by the instrument is called the accurate mass; it is subject to the sensitivity of the instrument and may not include the mass of adducts or isotopes. The discrepancy between this and the exact mass needs to be accounted for when querying the mass in a database.

In addition, there are different types of exact masses, and the researcher needs to be aware which one is contained in a given database. The nominal mass is computed using the integer masses of each element in a compound, the monoisotopic mass is computed using the mass of the most abundant isotope of each element, and the average mass is computed using the masses of all isotopes of each element, weighted by their natural abundances.
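A minimal Python sketch of the three mass types for water (H2O); the isotope masses are rounded literature values, and the small lookup tables are simplifications for illustration.

```python
# Nominal, monoisotopic, and average mass of H2O from per-element mass tables.
NOMINAL = {"H": 1, "O": 16}                       # integer masses
MONOISOTOPIC = {"H": 1.007825, "O": 15.994915}    # most abundant isotope of each element
AVERAGE = {"H": 1.008, "O": 15.999}               # abundance-weighted over all isotopes

formula = {"H": 2, "O": 1}

def mass(formula, table):
    return sum(n * table[element] for element, n in formula.items())

print("nominal:", mass(formula, NOMINAL))             # 18
print("monoisotopic:", mass(formula, MONOISOTOPIC))   # ~18.0106
print("average:", mass(formula, AVERAGE))             # ~18.015
```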

Matching Spectra

In spectral matching, the spectra found for metabolites of interest are matched to spectra of known compounds in a database. METLIN is a common database for this type of analysis. However, be aware that these spectra are curated from different types of studies, and some have not been verified using a standard; they may have merely been predicted.

Using Standards

A researcher who wants to be absolutely certain about a metabolite’s identity will compare the metabolite’s spectrum to the spectrum of a known standard for that compound, run on the same instrument and under the same experimental conditions. This not only requires a lot of extra effort, but is also not possible for many compounds.

id_levels
This table is from Sumner et al, 2007. Level 1 is considered the best level of identification.

Critical Thinking in Computational Metabolomics

In conclusion, there are many factors involved in a metabolomics study. The following are, in my opinion, the most important questions to consider when reading a metabolomics paper, and especially when comparing the results of multiple studies:

  • Was this a targeted or untargeted study? In other words, was the focus on specific metabolites?
  • How were the samples processed and extracted?
  • Did the authors correct for variation adequately by using blanks and QC?
  • If MS was used, which column was used, and which type of metabolite is most likely to be detected using this column?
  • If MS was used, what is the mass accuracy of the MS instrument?
  • If NMR was used, which nuclei were used? Was any 2-D analysis performed?
  • How was peak picking performed? Which peak picking parameters were used?
  • Which steps were used in data cleanup?
  • Were isotopes and adducts resolved using CAMERA?
  • How were significant metabolites found? If a parametric algorithm or statistical test was used, what were the parameters?
  • At which level were the metabolites identified?

If you do not know the answers to these questions, it is best to ask the authors. Or, if you are not able to do this, at least err on the side of caution. Do not assume that the answers to these questions are likely to be the same across studies.

Open Computational Problems

Several open computational problems exist in metabolomics. You can read more about the current state of computational metabolomics here. In particular, researchers are seeking to address the following problems:

  • Identification of metabolites. This is an important one, since the vast majority of features detected using peak picking software are never matched to known spectra. Good solutions will address differences between experiments. Improved integration of known mass spectra is also important.
  • Biological inference for metabolite flux between phenotypes. This is often done using known biological and chemical pathways, which are assumed to be discrete (this is not always the case). Sometimes, chemical similarity or reaction networks are used instead of pathways.
  • Integration of metabolomics with other -omics data. Many researchers are interested in multi-omics approaches, but combining data types is not straightforward.
  • Integration of multi-instrument data. As we have seen, results returned by different types of instruments (NMR, GC-MS, and LC-MS) are expected to differ. The question of how to integrate these data types, like the question of how to integrate multi-omics data, has yet to be answered.
  • Integration of lipids and polar metabolites. The typical protocols for identifying lipids and for identifying polar metabolites differ at multiple steps. However, it would be useful to perform an integrative computational analysis of these two data types. In a recent paper, this was addressed using a novel instrumentation workflow that lends itself well to integrative analysis.

References

Ferreira do Nascimento (2017) Advances in Chromatographic Analysis. Avid Science.

Frainay,C. et al. (2018) Mind the Gap: Mapping Mass Spectral Databases in Genome-Scale Metabolic Networks Reveals Poorly Covered Areas. Metabolites, 8, 51.

Kuhl,C. et al. (2012) CAMERA: An Integrated Strategy for Compound Spectra Extraction and Annotation of Liquid Chromatography/Mass Spectrometry Data Sets. Anal. Chem., 84, 283–289.

Peisl,B.Y.L. et al. (2018) Dark matter in host-microbiome metabolomics: Tackling the unknowns–A review. Anal. Chim. Acta, 1037, 13–27.

Riekeberg,E. and Powers,R. (2017) New frontiers in metabolomics: from measurement to insight. F1000Research, 6, 1148.

Samuelsson,L.M. and Larsson,D.G.J. (2008) Contributions from metabolomics to fish research. Mol. Biosyst., 4, 974.

Schmidt,K. and Podmore,I. (2015) Current Challenges in Volatile Organic Compounds Analysis as Potential Biomarkers of Cancer. J. Biomarkers, 2015, 1–16.

Schwaiger,M. et al. (2019) Merging metabolomics and lipidomics into one analytical run. Analyst, 144, 220–229.

Sellers,K. (2010) Why Derivatize? Improve GC Separations with Derivatization. Available from: https://www.restek.com/pdfs/CFTS1269.pdf

St John-Williams,L. et al. (2017) Targeted metabolomics and medication classification data from participants in the ADNI1 cohort. Sci. Data, 4, 170140.

Sumner,L.W. et al. (2007) Proposed minimum reporting standards for chemical analysis Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI). Metabolomics, 3, 211–221.

Tautenhahn,R. et al. (2008) Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics, 9, 504.

Tautenhahn,R. et al. (2012) XCMS Online: a web-based platform to process untargeted metabolomic data. Anal. Chem., 84, 5035–9.

Thermo-Fisher Scientific. (2007). Plasma and Serum Preparation. Available from: https://www.thermofisher.com/us/en/home/references/protocols/cell-and-tissue-analysis/elisa-protocol/elisa-sample-preparation-protocols/plasma-and-serum-preparation.html

Wishart,D.S. et al. (2018) HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res., 46, D608–D617.

Xu,Y.-F. et al. (2015) Avoiding Misannotation of In-Source Fragmentation Products as Cellular Metabolites in Liquid Chromatography–Mass Spectrometry-Based Metabolomics. Anal. Chem., 87, 2273–2281.

An Introduction to Bash Scripting

blank_computer_screen
Source: http://robertmuth.blogspot.com/2012/08/better-bash-scripting-in-15-minutes.html

What is Bash Scripting?

Bash (an abbreviation for “Bourne Again Shell“) is the shell scripting language and interpreter on most Linux systems. If you’ve used the command line in a Linux environment, you’ve probably already written some commands in Bash. This tutorial explains how to package these shell commands neatly into a script, which can be useful for anyone wanting to streamline commands for a routine task. Bash scripting is also a vital skill for anyone conducting research in a supercomputer environment.

Creating a Bash Script

When you create a bash script in a Unix environment, it is customary to give it a .sh extension. Technically this isn’t necessary, since the shebang (see below), not the extension, determines how the script is run, but it’s a good idea. Think about it: have you ever inherited someone’s old project and been faced with the task of figuring out what each of their files is for? Imagine doing that without extensions. So, give your file a .sh extension.

To open a new script in emacs:

C-x C-f my_script.sh

Or, if you’re like me and use vim:

vi my_script.sh

Then, add the following header to the top of your file. All other code will go below this. This header is called a shebang and is used to ensure that Bash (and no other interpreter) will be used to run the script.

#!/bin/bash

Running a Bash Script

Source: https://www.guru99.com/sql-interview-questions-answers.html

You do not need to run a compiler or specify an interpreter to run a Bash script (the shebang already takes care of that for you). To execute it, make sure the file is executable (e.g. by running chmod +x my_script.sh once), then navigate to the directory where the script is saved and run:

./my_script.sh

If your Bash script takes command line arguments (more on how to do this later), execute your script as in the following example, where "my_file.csv" and 0.4 are the parameters to the script.

./my_script.sh "my_file.csv" 0.4

Some Bash scripts accept named parameters (more on this later). To run a Bash script with named parameters, execute it in the following way. In this example, -f references a file parameter and -t a threshold parameter.

./my_script.sh -f "my_file.csv" -t 0.4

Declaring and Using Variables

Source: https://www.researchgate.net/figure/Global-and-local-variables-in-procedural-programming_fig1_283347855

In bash scripting, the type of a variable is automatically inferred, so you don’t need to declare it. You also don’t have a lot of special data types: basically, there are numbers, character strings, and lists (arrays). Note that, by default, numeric values are stored as character strings; they are only treated as numeric during mathematical or arithmetic operations. If you need anything more complicated than that, my advice is to write your code in R, Python, Java, or Perl and call it from your bash script rather than coding everything up in bash.

Single Values

Again, you don’t need to declare the type of your variable. All you need to do is this:

my_character_variable="hello world"
my_float_variable=4.0
my_integer_variable=1

One thing to note about bash is that it’s finicky about whitespace. No matter how much you want to, don’t put spaces around the equals sign.

When you want to use these values, put a dollar sign in front of them. It is a good idea to enclose them in curly braces as well, although it isn’t always required. In some scenarios (like concatenating two strings with an underscore), curly braces are important for distinguishing variable names from other text. For example, the code below will store “hello world_1” in the variable.

full_string=${my_character_variable}_${my_integer_variable}

Arrays and Lists

The syntax for lists is also quite simple, and bash arrays can contain both strings and numeric values, e.g.

my_array=(1 2 3 4 5 6 7 8 9 10 4.0 "hello" "world")
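
To pull values back out of the array, bash uses zero-based indices in square brackets. Here is a quick sketch using the array defined above:

echo ${my_array[0]}     #prints the first element, 1
echo ${my_array[@]}     #prints every element in the array
echo ${#my_array[@]}    #prints the number of elements, 13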

Variable Scoping

If you’ve done a lot of coding, you know that variables are local to the block of code in which they’re declared, right? Right. Except in bash. By default, every variable declared in a bash script is global. If you want a local variable, you can use the local keyword, but note that it only works inside a function (functions are covered in more detail later in this tutorial):

local my_local_character_variable="hello neighborhood"
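
As a minimal sketch of that behavior (my_function and greeting are just placeholder names):

my_function() {
   local greeting="hello neighborhood"
   echo ${greeting}
}
my_function       #prints "hello neighborhood"
echo ${greeting}  #prints an empty line; greeting is out of scope here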

Arithmetic

Arithmetic on variables is straightforward. Use the same operators that you would use in typical arithmetic. However, to ensure that your variables are treated as numeric and not string variables, you must use double parentheses around the expression or the let keyword in front of the variable in which you want to store your result. Some examples are shown below. Remember, be careful about whitespace. Also note that this built-in arithmetic handles integers only; see the sketch after the examples for floating point math.

x=5
y=2
modulus=$((${x}%${y}))
let addition=${x}+${y}
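
Because the forms above are integer-only, floating point math is usually handed off to another tool. Here is a quick sketch using the standard bc calculator and the x and y variables from above:

#Stores 2.500 in the variable; bash's built-in arithmetic would give 2.
division=$(echo "scale=3; ${x}/${y}" | bc)
echo ${division}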

String Concatenation

String concatenation is also very simple in Bash. Simply type the variables, one after the other, with any additional text filled in exactly where you want it! It doesn’t get easier than that. Here’s an example.

#This code stores "Why hello there, world!" in a new variable.
word1="hello"
word2="world"
combined="Why ${word1} there, ${word2}!"

This also works if you’re dealing with numeric values, without any need for parsing.

#This code stores the string "We're number 1" in a new variable.
val=1
combined="We're number ${val}"

Running External Programs

If you’ve ever used the command line on a Unix system, then you already know how to execute external programs in a bash script. Simply call the program as you would do from the command line. Here are a few examples.

R Scripts

Many people like to run R interactively. Unfortunately, you can’t do this within a Bash script, because scripts are not interactive by virtue of being scripts. You will need to save your code in a .r file and run it using Rscript. If you have run Rscript on the command line before, this should look very familiar.

param_1=2
param_2="/root/my_file.csv"
param_3="/root/my_output.csv"
Rscript my_r_script.r $param_1 $param_2 $param_3

Utilities

You can run utilities from within your bash script just as you would run them on the command line. Here is an example using bedtools, a utility common in bioinformatics. Called in this way, it will print the resulting file to the console rather than saving it. To avoid this behavior, see the section on Output.

bedtools sort -i my_bed_file.bed

Unix Tools

There are a few Unix tools that can be handy to use in Bash scripts. Each of these tools is really a topic on its own, but here is a brief introduction to them.

grep

The grep utility is primarily used for searching text. It is often used with pipes (discussed later) and can be called directly within the Bash script. Learn more about the capabilities of grep here. The following command will return every line in my_file.txt containing my_word.

grep "my_word" my_file.txt
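
grep has many useful flags beyond the basic search; here is a small sketch of two I use often:

#Count the matching lines instead of printing them.
grep -c "my_word" my_file.txt

#Ignore case, so "My_Word" and "MY_WORD" also match.
grep -i "my_word" my_file.txt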

cut

cut is used for selecting substrings of each line in a file or modifying the lines in a file. Here is an example of how it can be used to select only the first two columns in a tab-delimited file (tab is cut’s default delimiter, so no -d flag is needed here). You can see more examples of how to use cut here.

cut -f 1,2 my_file.txt
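
For other delimiters, pass the character to the -d flag. A quick sketch for a hypothetical comma-delimited file:

#Select the first two columns of a comma-delimited (CSV) file.
cut -d "," -f 1,2 my_file.csv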

shuf

If you are doing any work that involves permuting data, shuf is a convenient tool. It can be used either to shuffle an entire file or to select a random set of lines from a file. Read more about shuf here. The example below shows how to shuffle an entire file’s lines using shuf.

shuf my_file.txt
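
shuf can also draw a random sample rather than shuffling everything; a quick sketch:

#Select 10 lines from the file, chosen at random.
shuf -n 10 my_file.txt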

awk

The awk utility is convenient for selecting and modifying lines that meet specified criteria. It is more powerful than cut, but it can also be more complicated to use. To really use awk well, you should understand regular expressions. The following example shows how to select columns 1 and 2 from a file (similar to the cut example). You can see more examples of awk here.

awk '{print $1, $2}' my_file.txt
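
Where awk really shines is filtering lines on a condition before printing. A small sketch, assuming a whitespace-delimited file whose third column is numeric:

#Print the first column of every line whose third column is greater than 0.05.
awk '$3 > 0.05 {print $1}' my_file.txt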

Branching

Source: https://www.classes.cs.uchicago.edu/archive/2019/winter/15200-1/lecs/notes/Lec4ComplexCondNotes.html

Branching in Bash uses the following syntax. Within the double square brackets, you can construct tests using the comparison operators available in Bash, and you can chain tests together using Bash logical operators (see the sketch at the end of this section). Single square brackets are also supported, but double square brackets have some added features. Note that the whitespace between the test and the brackets is important!

if [[ $num -eq 42 ]]
   then
      Rscript my_r_script.r "file_42.csv"
   else
      Rscript my_r_script.r "file_not_42.csv"
fi

Tests in an if statement can include more than just arithmetic. The following code checks whether a directory exists, and creates it if it doesn’t.

dir="../my_dir"
if [[ ! -e $dir ]]; then
   mkdir $dir
fi
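
Tests can also be chained with logical operators inside the double brackets, as mentioned above. A short sketch combining a file test with a numeric comparison (the variable names here are just placeholders):

#Run the R script only if the input file exists and the counter is large enough.
if [[ -f ${filename} && ${num} -gt 10 ]]; then
   Rscript my_r_script.r ${filename}
fi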

Looping

Source: http://www.functionx.com/java/Lesson08.htm

You can use for loops or while loops in Bash. For loops are used for looping over lists. The example below shows looping over a list of numbers.

for f in 0 1 2 3 4 5 6 7; 
   do 
      Rscript my_r_script.r $f 
   done

This could also be done using a range.

for f in {0..7};
   do
      Rscript my_r_script.r $f
   done

Finally, you could loop over a pre-defined array.

for f in "${my_array[@]}";
   do
      Rscript my_r_script.r $f
   done
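
Another pattern that comes up constantly is looping over files matched by a glob; a quick sketch:

for f in *.csv;
   do
      Rscript my_r_script.r $f
   done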

While loop test conditions have similar syntax to if statement test conditions. The following while loop does the same as the first two for loops above. Note that the counter has to be incremented inside the loop; otherwise, the loop would run forever.

i=0
while [[ $i -le 7 ]]
   do
      Rscript my_r_script.r $i
      ((i++))
   done

Functions

Source: http://www.desy.de/gna/html/cc/Tutorial/node3.htm

You can define and call functions in Bash scripts, but note that you need to define your function before you call it. This is notable because many programming languages do not have this restriction. Another thing that is different about functions in Bash scripting is the way parameters are passed. When calling the function, you simply pass the parameter directly after the function call, like a command-line argument. Inside your function definition, your first parameter will be referred to as $1, your second as $2, and so on. What about returning values from a function? Bash doesn’t really allow this: a function can only return an exit status (with the return keyword) or print output for the caller to capture (a sketch of this appears after the example below). So strictly speaking, Bash functions behave more like procedures than true functions.

my_function() {
   local c=$1
   Rscript my_r_script.r $c
}
for f in 0 1 2 3 4 5 6 7;
   do 
      my_function $f
   done
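
If you do need a value back from a function, the usual workaround is to echo it and capture the output with command substitution. A minimal sketch (get_output_name is a hypothetical helper):

get_output_name() {
   #"Return" a value by printing it.
   echo "results_${1}.csv"
}
outfile=$(get_output_name 3)   #outfile now holds "results_3.csv"
echo ${outfile}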

Input

Input in Bash scripting can take two forms. You can pass command-line arguments when calling your script, or you can store your input as a file.

Command Line Arguments

Basic command line arguments work similarly to parameters in Bash functions: $1 refers to argument 1, $2 to argument 2, and so on.

However, if you want to make your script more user-friendly and allow for named parameters, that is also possible. The code below allows for three named parameters: -n for a name, -f for a file name, and -t for a threshold. All arguments are optional. The realpath command returns the full path of the file name given, if the file exists.

while getopts n:f:t: option; 
   do
      case "${option}" in
         n) name=$OPTARG;;
         f) filename=$(realpath $OPTARG);;
         t) threshold=$OPTARG;;
      esac
   done
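
Since all three arguments above are optional, it is worth checking whether each one was actually supplied before using it. A small sketch with hypothetical defaults:

#Fall back to default values if the optional parameters were not supplied.
if [[ -z ${name} ]]; then
   name="default_run"
fi
if [[ -z ${threshold} ]]; then
   threshold=0.5
fi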

File Input

Of course, you can also simply hard code file names into your Bash script and use them as your input. If you want to input a list of values rather than a single value, storing them in a file is probably the best way to do this. There are several ways to load your data from the file into a list.

The first option prints the file using the cat utility and stores each line in a list. Note that the parentheses here are different from the double parentheses described in the arithmetic section. Double parentheses (()) run arithmetic operations, and single parentheses () allow you to run commands in a subshell (essentially a child process) that can then be returned using the dollar sign $.

my_list=$(cat my_file.txt)

Another option uses the shell redirection operator to read each line of the file in a loop.

my_list=()
while read infile;
do
    my_list+=($infile)
done < my_file.txt

The IFS Variable

Note that the code above will only work as expected if there is no whitespace (spaces, tabs, etc) within each line. If your lines have spaces or tabs, Bash will automatically split on each space or tab. You can change this by setting a special variable called IFS to split on new lines only.

For example, say your input file is formatted like this.

Hi, I'm a file.
You should input me into your Bash script.
But it needs to be done line-by-line.

If you want to read each of these lines in as a single list element, you could do

IFS=$'\n'
my_list=()
while read infile;
do
    my_list+=($infile)
done < my_file.txt

You can use IFS to split on other characters as well. See this page for more information on IFS.

Output

In Bash, you can print output to a file or direct it to stdout or stderr (by default, stdout is usually the main console).

Shell Redirection Operator

The shell redirection operator allows you to redirect output to a file. For instance, the following line redirects the output of the bedtools sort utility to the file my_sorted_bed_file.bed. Normally, this output would print to stdout.

bedtools sort -i my_bed_file.bed > my_sorted_bed_file.bed

It is also possible to append to the file, like so:

bedtools sort -i my_bed_file.bed >> my_sorted_bed_file.bed

Finally, if your line of code prints to stderr, you can redirect both streams as follows:

bedtools sort -i my_bed_file.bed > my_sorted_bed_file.bed 2> my_errors.log
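
If you would rather collect stdout and stderr in a single file, redirect stderr into stdout; a quick sketch:

#Send both stdout and stderr to the same log file.
bedtools sort -i my_bed_file.bed > my_log.txt 2>&1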

echo Utility

The main way to output to the console (stdout) is to use the Unix echo utility. The following examples show how echo can be used.

#This command prints "Hello World" to stdout.
echo "Hello World"

#This command prints every element of my_array to stdout.
#(echo $my_array on its own would print only the first element.)
my_array=(1 2 3 4 5 6 7 8 9 10 4.0 "hello" "world")
echo ${my_array[@]}

#This command prints the contents of the file my_file.txt to stdout.
echo $(cat my_file.txt)

Piping

Source: https://bash.cyberciti.biz/guide/Pipes

When you pipe a command, you are redirecting its output to another command. This is done using the | operator. Pipes are used in many scenarios, but here are some examples.

#The following code prints only the names of files in a directory ending in ".png".
ls | grep "\.png$"

#The following command sorts the first 1000 lines of a file.
head -n 1000 my_file.txt | sort -V -k1,1 -k2,2
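
Pipes can also be chained into longer pipelines. Here is a sketch that combines several of the utilities introduced earlier:

#Count how many times each value appears in column 1 of a tab-delimited file,
#sorted from most to least frequent.
cut -f 1 my_file.txt | sort | uniq -c | sort -rn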

Quotes

Three types of quotes are used in Bash: double quotes, single quotes, and backtick quotes. They are all used for different purposes.

Double Quotes

Double quotes are used around text. If variables are included in the double quotes, they are expanded. Here is an example. The code below prints “Why hello there, world!”

word1="hello"
word2="world"
echo "Why ${word1} there, ${word2}!"

Note that if you want to include quotes within the string, you need to use an escape character. The code below prints “The script name is “my_script.r””

echo "The script name is \"my_script.r\""

Single Quotes

Single quotes are also used around text, but the difference is that they do not expand variables. In the similar example below, the variable ends up holding the literal string “Why ${word1} there, ${word2}”.

#This code stores the literal string "Why ${word1} there, ${word2}" in a new variable.
word1="hello"
word2="world"
combined='Why ${word1} there, ${word2}'

Backtick Quotes

These are usually just called “backticks”; however, many people consider them a type of quote or mistake them for single quotes. Backticks have an entirely different function from other quotes, which is to return the output of a command. In this way, they function the same as $(). For instance, in the File Input section, we could have also written the command like so.

my_list=`cat my_file.txt`

Background Processes

background_process
Source: https://turbofuture.com/computers/Run-process-in-background-linux-terminal

Sometimes, you may want to run part of your script in the background so that it doesn’t block additional processes from accessing the shell. To do this, you run those commands as background processes. One easy way to do it is to put all code you wish to run in the background into its own function, then call that function in a for loop. The ampersand (&) launches each call as a background process.

Note the pids variable and the wait statement. These are important if you want to make sure no other code executes until all background processes have completed. The code below tells the script to track all process IDs and wait until they have completed before running the next line of code.

pids=""
for f in $CHROMS;
   do
      my_function $f &
      pids="$pids $!"
   done
wait $pids

The Logical Flaw in Hjernevask’s “Gender Equality Paradox”

For those who have never heard of it, Hjernevask (“Brainwash” in Norwegian) is a documentary by a Norwegian comedian about gender studies research. “Men’s rights activists” tout it as a gospel of sorts for their movement and use it frequently to justify why they think we shouldn’t push for more female representation in STEM fields. I decided to watch it. Why did I bother? Primarily because I think it’s important to understand where people are sourcing their information and why that information leads to erroneous conclusions. After seeing the episode, I realized why it was so frequently referenced: it seemed close enough to being logical that a person watching without a keen eye could be fooled. It even included a few studies, albeit interpreted in a biased way.

The basic argument of Hjernevask was this: Countries that have the highest gender empowerment measure (GEM) also have some of the lowest rates of women going into STEM fields, so it must be the case that female empowerment “empowers” women to not choose STEM careers because there is something fundamentally incompatible between women and STEM.

…Whew. Evel Knievel couldn’t make that leap.

We all know that correlation does not equal causation. So, a negative correlation between GEM and the rate of women entering STEM fields does not indicate that women are failing to choose STEM careers because of female empowerment. Let’s look at possible confounding factors and see what could play into this correlative effect.

Perceived Ability

A recent paper explores the self-perceived mathematical abilities of boys and girls around the world by country. The findings? While girls in countries with low GEM scores had lower self-perceived ability than boys, their self-perceived ability was still much higher than that of girls in countries with high GEM scores! What’s more, in many high-GEM countries, girls’ perceptions of their ability had little to do with their actual performance on an administered exam. This was less true of many low-GEM countries. It seems quite likely, then, that girls in many high-GEM countries do not pursue careers in STEM because they don’t feel confident in their mathematical skills, not because they don’t want those careers.

Perception aside, what do studies tell us about the difference in mathematical skill between boys and girls? Well, many studies indicate that there is little difference in mathematical skill between the sexes. Other studies show that boys and girls excel at different types of math problems, all of which matter in STEM fields.

One may wonder why we see this effect; we might assume that girls in high-GEM countries would be encouraged more in their mathematical skill. Read on.

Social Norms

The paper discussed above provides the following hypothesized explanation for the outcome:

This research suggests that gender egalitarian contexts can encourage girls and boys to express societally approved gender ideals as part of their gender performance. Education in these contexts thus serves not only an instrumental function, but also an expressive function, so that students’ educational choices are thought to reflect important aspects of who they are. From this perspective, educational aspirations can be seen as not simply reflecting the world of possibilities that girls believe are open to them, but also become an arena in which girls perform the gendered identities that they have internalized. Thus, while girls in gender egalitarian contexts may have more opportunities to pursue STEM fields, they may also decide that it is more feminine to be interested in other fields of study, and choose these other fields that they believe more closely match the image that they are seeking to realize.

This is an intriguing explanation. Let’s make one thing clear: there is no reason to believe that countries with high GEM are devoid of gender-specific social norms. GEM is based on things like maternal mortality rate, percentage of women working outside the home, percentage of parliament members who are women, and the percentage of women with a secondary education. These are useful for gaining an overall picture of women’s status in a country. They do not tell us the assumptions the culture makes about very specific women’s issues (like STEM education).

Furthermore, studies conducted in the United States (which is generally considered to be a high-GEM country) show that boys frequently believe they are better at math than girls. This reflects the gender stereotype discussed above; not only do girls have low mathematical self-esteem, but they are also perceived by others as having poor mathematical skill simply by being girls. I cannot make assertions regarding boys’ perceived ability in comparison to girls worldwide, as I don’t know of any such assessment. But certainly, the United States hasn’t yet reached a point when we can say that no gender-specific social norms exist. And depending on regional differences in social norms, they may be another confounding factor which is related to perceived self-ability as described above.

Marketing

One point Hjernevask makes based on the results of several studies is that girls are often more people-oriented than boys and that this may be at least partially biological. However, Hjernevask errs in concluding that women are just not suited for engineering because of this. Engineering has the potential for significant societal benefit, and viewing engineering from this angle is quite valuable, but it isn’t how engineering is marketed. Notably, marketing is important when students in high-income countries choose fields of study because many options are open to them. To observe the importance of marketing, let’s look at a sub-field of STEM that defies the norm: healthcare. Although many executives in the medical field are still men, the percentage of women in healthcare is much higher than in other STEM fields. And healthcare, unlike other STEM fields, is primarily marketed with societal benefit in mind. A survey conducted by the Journal of Nursing Education found that, unsurprisingly, care for others is a primary reason why many women choose to become nurses.

There is actually an interesting history of women in computer science (my own field) that relates to marketing as well. To summarize, most programmers were women until the 1980s and the advent of video games, which were marketed to boys. Tinkering with gaming systems provided boys with an introduction to computer science, and the field came to be associated with gaming. While I think it’s great that gaming can spark an interest in computer science, girls are often left out. The solution? Computer science programs should emphasize the social importance of bioinformatics, NLP, and cybersecurity. And, make video games designed for girls too!

Conclusion

It’s probably a combination of all factors discussed here leading to the effect described in Hjernevask. Part of this effect is due to girls in high-GEM countries shying away from STEM careers because STEM careers are not marketed towards them. Another is due to societies’ harmful assumptions about women’s mathematical and technical skill. Finally, girls internalize these harmful assumptions, leading to poor self-perception in mathematics. The last thing we should do is ignore the STEM crisis as the producers of Hjernevask would have us do. We need to address it both by challenging our ideas about gender and STEM and by showing girls why STEM is a great field for making an impact on the world.

References

Goldman A and Penner A. (2016). Exploring international gender differences in mathematics self-concept. International Journal of Adolescence and Youth, 21(4), 403-418. [online] Available at:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5100974/ [Accessed 24 Mar. 2019].

Lindberg S, Hyde M,  Petersen J, et al. (2010). New trends in gender and mathematics performance: A meta-analysis. Psychological Bulletin, 136(6), 1123-1135. [online] Available at:
https://psycnet.apa.org/record/2010-22162-004 [Accessed 24 Mar. 2019].

Susac S, Bubic A, Vrbanc A, et al. (2014) Development of abstract mathematical reasoning: the case of algebra. Frontiers in Human Neuroscience, 8 (679) [online] Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4151197/ [Accessed 24 Mar. 2019].

Correll S. (2001) Gender and the Career Choice Process: The Role of Biased Self‐Assessments. American Journal of Sociology, Vol. 106, No. 6, pp. 1691-1730. [online] Available at: https://sociology.stanford.edu/sites/g/files/sbiybj9501/f/publications/gender_and_the_career_choice_process-_the_role_of_biased_self-assessments.pdf [Accessed 24 Mar. 2019].

Advisory.com. (2019). Women make up 80% of health care workers—but just 40% of executives. [online] Available at: https://www.advisory.com/daily-briefing/blog/2014/08/women-in-leadership [Accessed 24 Mar. 2019].

Medical School Headquarters. (2019). 5 Reasons to Go To Medical School, and 5 to Not. [online] Available at: https://medicalschoolhq.net/mshq-045-5-reasons-to-go-to-medical-school-and-5-to-not/ [Accessed 24 Mar. 2019].

Boughn S and Lentini A. (1999). Why do women choose nursing? Journal of Nursing Education, 38(4), 156-61. [online] Available at: https://www.ncbi.nlm.nih.gov/pubmed/10225263 [Accessed 24 Mar. 2019].

Nytimes.com. (2019). The Secret History of Women in Coding. [online] Available at: https://www.nytimes.com/2019/02/13/magazine/women-coding-computer-programming.html [Accessed 24 Mar. 2019].

Marie Claire. (2019). How Female Gamers Are Trying to Reinvent the Industry. [online] Available at: https://www.marieclaire.com/career-advice/a25379999/female-gamers/ [Accessed 24 Mar. 2019].

Ohio Supercomputer Center Tutorial

This tutorial is intended for users who have access to the Ohio Supercomputer Center (OSC) for research purposes but who are not familiar with conducting research in a supercomputing environment. It gives an overview of the resources available from OSC and the structure of the OSC computing environment, methods for connecting to OSC, job submission, software installation, and help desk requests.

An Overview of OSC

The term “supercomputer” is somewhat vague, but what it typically means in today’s context is actually a supercomputing cluster, or a large collection of high-powered servers connected via a local network. Clusters allow users to run computationally intensive, parallelizable tasks in a single environment. To illustrate the concept, I have provided an image of one of OSC’s retired clusters, Oakley, from the OSC website.

2015_1222-cluster-graphic_oak_all
Source: https://www.osc.edu/sites/osc.edu/files/2015_1222-Cluster-Graphic_Oak_all.png

In a supercomputing center like OSC, many users share these resources at once, and resources (nodes, memory, and time) are allocated to users as users request them. You do this by submitting jobs (discussed later). When you submit a job, you request the resources you need. When those resources are available on the cluster you are using, your job will run.

Computing Clusters

OSC has 3 clusters: Owens, Ruby, and Pitzer. To decide which of these clusters best meets your needs, check the specifications of each on the OSC website. Depending on your research, you may want to think about the type of network connections used or how many GPUs are available. You may also want to browse the software list to determine which clusters contain the software you need.

Note: While not all clusters have the same resources, user files are mirrored across all clusters. This means that, if you are using Owens and Owens is unavailable due to periodic system updates, you can use another cluster until Owens is available again.

Login Nodes and Compute Nodes

Most of the nodes in any OSC cluster are compute nodes, with a few login nodes. The difference between the two is mainly this:

  • Login nodes are used only for logging in and for very basic tasks, like moving or deleting files. There are only a few of them because the tasks performed on them are not intensive, so many users can share only a few nodes.
  • Compute nodes are used for intensive computational tasks. All of your work should be done on compute nodes, not on login nodes.

When you connect to OSC (no matter which method you use), you are connecting to a login node. The only way to access a compute node is by submitting a job. Any work you do on OSC must be done within a job – otherwise, you risk putting a heavy workload on the login nodes. If you do this, any script you run will be killed almost immediately, and you will receive an e-mail from the OSC administrators reminding you to never do that again.

File Systems

All OSC clusters have the same file systems, described below.

  • /fs/home is the file system you are directed to automatically when you log in, and it is organized by user ID. You can store data and output files there, but your collaborators won’t be able to access them.
  • /fs/project is a file system organized by research project ID. To get to your project folder, you will need to navigate to /fs/project/. Storing your files here makes them available to your collaborators.
  • /fs/scratch is a large file system for temporary storage and is available to everyone. Use it if you need to share files with someone who does not have access to your project folder or if your files are too large to fit in your project folder.

Connecting to OSC

Personally, I prefer to use the command line for most of the work I do, but many users like the look and feel of a GUI. If you prefer using a GUI, access OSC through the OnDemand Web Portal. If you prefer working from a command line, use PuTTY (for Windows users) or SSH (for Mac or Unix users).

Using the OnDemand Web Portal

OnDemand is a web-based service run by OSC administrators for accessing OSC using a GUI (No installation required). OSC has a tutorial for using OnDemand. Note that, instead of using CILogon, you will want to choose “Log in with your OSC account”. You will be asked to register your username and password the first time you log in.

Using PuTTY (Windows)

To connect using Windows, use PuTTY. If you haven’t used PuTTY before, just download it and it’s ready to use! Once you have PuTTY installed, configure PuTTY to connect to OSC as shown below (the example below is for the Owens cluster). Other than what is highlighted, no changes need to be made to the settings.

putty1

Note: I have saved this connection under the name “Owens” by entering “Owens” under Saved Sessions and clicking Save. If you do this, it saves all information for future logins. You can select Owens and click Load rather than entering everything again.

Using SSH (Mac / Unix)

From your terminal, simply use one of the following commands, depending on which cluster you wish to use.

ssh "user-id"@pitzer.osc.edu
ssh "user-id"@owens.osc.edu
ssh "user-id"@ruby.osc.edu

Submitting Jobs

Here we are…the most important part of the tutorial! Once you have logged into your cluster, you will need to submit your job so that your code can run on the compute nodes. When you submit a job, you need to think about the following things:

  • How many nodes you will need. This is going to depend on how you’ve parallelized your code. To run on multiple nodes, you will need to use a parallelization package that supports MPI or OpenMP in your code. Examples of this are Rmpi and snow in R, and mpi4py in Python.
  • How much memory (or processors) you will need. If your program uses threading, you will need to request processors accordingly. Note that, in addition, the amount of memory allocated for you is proportional to the number of processors you request. For instance, standard Owens nodes have 28 processors and 64 GB memory, so requesting 4 processors on Owens will allow you to use about 9 GB of memory. If you need to use more than 64 GB, you can request a large memory node by requesting as many processors as the large memory node contains (e.g. 48 on Owens). Note: If you are using multiple nodes, you must request all processors for each node.
  • How much time you will need. It is best to overestimate on your first run. Then, you can see how much time the job actually takes interactively or using a batch job by checking the log files.

There are two types of jobs that you can run: interactive jobs and batch jobs. You can create these yourself or (if you are using OnDemand) you can use the Job Composer templates provided by OnDemand as shown in the image below. The next two sections describe the types of jobs and how to create them without using templates.

job_composer

Interactive Jobs

An interactive job allows you to connect to the compute node and interactively run tasks. This approach has the following advantages over running in batch mode:

  • You can catch bugs as soon as they occur.
  • You can view output in real time.
  • If you prefer running code line-by-line (such as with an R or Python interpreter), you can do that.

However, note that an interactive job does not keep running in the background when you close your connection to the cluster. For time-intensive tasks, it is not a good choice. Sometimes, it can also be inconvenient when a job is queued, because you should periodically check whether it has been dequeued so that you can run your tasks.

To run an interactive job, issue the following command.

qsub -I -l nodes=4:ppn=28 -l walltime=3:00:00 -A PAS0001

Here, the request is for 4 nodes with 28 processors each (To run on large memory nodes, we could have requested 48 processors each). The interactive job will run for 3 hours, and resources will be charged to project PAS0001.

Batch Jobs

In a batch job, all of your tasks are run from a shell script, which you must create. Batch jobs have the following advantages:

  • They will continue running even after you close your connection to the cluster, making them a good choice for time-intensive tasks.
  • You do not need to check whether a job has been dequeued before running anything. Since your commands are in a script, they will run as soon as the job is dequeued.
  • Putting your commands in a script can help you to better organize your code.

For a batch job, you should create a shell script, e.g. myscript.sh, with the following format.

#!/bin/bash
#PBS -l nodes=4:ppn=28
#PBS -l walltime=3:00:00
for i in 0 1 2 3 4;
do
   echo $i
   Rscript some_r_script.r $i
done

Here, we are again requesting 4 nodes with 28 processors each and 3 hours of runtime. This is specified in the #PBS directives. The rest of the code is an example: it loops through the numbers 0 to 4, prints them out, and calls an R script. This is meant to illustrate how you can use a shell script for simple looping and branching, input and output, and running code.

Now that you have created your script, you will need to submit it. To do this, run:

qsub -A PAS0001 my_script.sh

After your script completes, it will save your console output to a file called my_script.sh.o."some-number" and any errors to my_script.sh.e."the-same-number".

Checking Job Status

To check the status of a job, run the following command:

qstat -u "user-id"

If you have a job running, your output will look something like this:

                                                                  Req'd     Req'd     Elap
Job ID                  Username  Queue   Jobname  SessID NDS TSK Memory    Time    S Time
----------------------- --------- ------- -------- ------ --- --- ------ --------- - ---------
4723639.owens-batch.te  user-id   serial  STDIN     86768   1  28     --  03:00:00 R  00:00:06

This tells us the job ID, that the job is running from STDIN (i.e. it is an interactive job), that it is running on one node with 28 cores, and that it has been running for 6 seconds. Instead of R, jobs that are queued will show Q as their status. Jobs that have recently completed will show C.

Deleting a Job

To delete a job, you must first obtain its ID, which can be done by checking the job status as shown above. Once you have the ID, run:

qdel "job-id"

In this case, the job ID is 4723639.

Transferring Files

To/From a Local Machine

Using OnDemand

To use OnDemand for file transfers, please see this tutorial from OSC.

Using SFTP or SCP (Mac / Unix)

SCP is generally faster than SFTP, but SFTP allows for file management on the remote server, such as file deletion and creation of directories. Learn more about the differences here.

To use SFTP for file transfers, connect to the OSC SFTP server as shown:

sftp "user-id"@sftp.osc.edu

To use SCP for file transfers, use the OSC SCP server as shown below. The first command is for transferring from the SCP server to your local directory, and the second is for transferring from your local directory to the SCP server.

scp "user-id"@scp.osc.edu:"your-file.txt" "your-local-directory"

scp "your-file.txt" "user-id"@scp.osc.edu:"your-remote-directory"

Using FileZilla or WinSCP (Windows)

To connect via FileZilla or WinSCP, you will need the same connection information as you used for the PuTTY connection, but you will use the SFTP server instead of the cluster name for FileZilla and the SCP server for WinSCP. Here is an example using FileZilla.

filezilla

Downloading from Online

One option for retrieving files from online is to download them to your local system and transfer the files using one of the methods above. But this isn’t very efficient, especially if you are downloading large data files.

My favorite method for downloading data from online is to use wget. For example:

wget https://www.encodeproject.org/files/ENCFF001CUR/@@download/ENCFF001CUR.fastq.gz

This downloads a 1.41 GB file from the ENCODE Consortium. For me, it downloads at a speed of 29.7 MB/s.

If you want to obtain data from GitHub and are familiar with Git commands, you can also clone a repository from the OSC command line using git clone. Git is automatically installed on the OSC clusters, so you don’t need to worry about installing it.

It is a good idea to check whether your download was successful. If the file you are downloading has an md5 checksum, you can use that to verify your download. For instructions on using md5 checksums, see this tutorial.
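
As a minimal sketch (assuming the provider publishes an md5 value for the ENCODE file downloaded above), compute the checksum on OSC and compare it to the published value:

#Compute the checksum of the downloaded file.
md5sum ENCFF001CUR.fastq.gz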

Using Software

Available Software

Each cluster on OSC has software pre-installed that you can use. So before trying to install new software yourself, check whether it is already available using:

module spider "name-of-program"

An example of the output for R would be:

---------------------------------------------------------------------------------------------------------------------------------------------------------
  R:
---------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        R/3.3.1
        R/3.3.2
        R/3.4.0
        R/3.4.2
        R/3.5.0
        R/3.5.2
     Other possible modules matches:
        amber  arm-ddt  arm-map  arm-pr  blender  darshan  espresso  express  freesurfer  gromacs  hdf5-serial  homer  hyperworks  libjpeg-turbo  ...

---------------------------------------------------------------------------------------------------------------------------------------------------------
  To find other possible module matches execute:

      $ module -r spider '.*R.*'

---------------------------------------------------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "R" module (including how to load the modules) use the module's full name.
  For example:

     $ module spider R/3.5.2
---------------------------------------------------------------------------------------------------------------------------------------------------------

Note that this also tells you which versions are available. Sometimes, the version of software you want to use is available on OSC, but there is another version loaded by default. To load the version you want, just use module load. For example, let’s say that you want to use version 3.5.2 of R, but 3.5.2 is not the default.

Then, you can run:

module load R/3.5.2

Configuring Paths

Setting your $PATH environment variables can be useful in a Unix environment. It allows you to simply type the name of the software or package you wish to use, without specifying the full path. In OSC, you do this by modifying two files: .bashrc and .bash_profile. These can be found in your /fs/home directory. Below is an example of .bashrc file content:

PATH=$PATH:$HOME/.bds
PATH=${PATH}:$HOME/gosr/bin
PATH=${PATH}:$HOME/Samtools/bin
PATH=${PATH}:$HOME/FastQC/
PERL5LIB=$HOME/bioperl-1.2.3
PERL5LIB=${PERL5LIB}:$HOME/ensembl/modules
PERL5LIB=${PERL5LIB}:$HOME/ensembl-compara/modules
PYTHONPATH=$HOME/pythonmodules/pysam-0.7.5/lib/python2.7/site-packages
PYTHONPATH=$PYTHONPATH:$HOME/usr/local/anaconda/anaconda2/pkgs/
PYTHONPATH=$PYTHONPATH:$HOME/pythonmodules/gosr/lib/python2.7/site-packages

Here, I have added system paths, Perl-specific paths, and Python-specific paths. Simply add a similar line to both the .bashrc and .bash_profile files to include any additional path. You will need to close and reopen your connection to the cluster before these changes can take effect. Note: It is important to modify both files. One is used for the login node, and the other is used for the compute node. Ideally, you want them to be consistent.

Installing Software and Packages

Installing software on OSC is not as simple as on a Unix system where you are the root user. You do not have sudo access, so you are limited to user installations. For example, I often use the following for Python modules:

pip install --user "package-name"

If there is software or a package you wish to install that requires root access, you will need to contact the OSC Help Desk.

Support

To request support, contact oschelp@osc.edu with a description of the problem you are facing, and include your user ID. The help desk usually responds quickly.

Measuring Chemical Similarity in Metabolomics

One of the primary uses of chemical similarity in medicine is drug target prediction. Like much research in bioinformatics, the goal is to predict interaction or effect automatically, without testing each possibility in the lab. Chemical similarity between target substrates or products and existing well-studied compounds can assist in predicting outcome of pharmaceutical treatments.

But a problem arises when discussing chemical similarity: there isn’t a comprehensive definition of what it means. For its uses in the medical field, we are primarily interested in potential for interaction between compounds. This means that the shape of the molecule and its functional groups matter, but this is not necessarily all. Assumptions must be made by the methods used for determining the similarity of two compounds.

Below, I will go into a few key methods for measuring chemical similarity along with the assumptions made by these methods. For a comprehensive analysis and comparison of these methods, see Dr. Edmond Duesbury’s PhD thesis. Note: his thesis includes a discussion of hyperstructures, which I don’t discuss here, and it excludes ontology-based methods. It is also not specific to the metabolomics field.

Fingerprint-Based Methods

The most popular method for chemical similarity when applied to metabolites (my current area of focus) is representing compounds as fingerprints. These fingerprints are fixed-length bit vectors that encode information about the compound, and the encoding is specific to what researchers in the field believe is relevant. One may then use the Tanimoto/Jaccard coefficient between the two bit strings to obtain the similarity.
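
For two fingerprints A and B, the Tanimoto (Jaccard) coefficient is T(A, B) = c / (a + b - c), where a is the number of bits set in A, b is the number of bits set in B, and c is the number of bits set in both. The value ranges from 0 (no shared substructure bits) to 1 (identical fingerprints).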

In metabolomics, the most common type of encoding is the PubChem substructure fingerprint, an 881-bit vector denoting substructures present in the compound. Computing the Tanimoto coefficient on these vectors is quick and easy, but there is something questionable about weighting all of these bits equally. Consider this: not all substructures are equally common, and some may be related to one another. Do we really want to place equal importance on each of these substructures? In spite of this potential problem, this method has been shown to perform well in some scenarios (although it hasn’t been tested in metabolomics with ground truth) and is often used because of its convenience.

Ontology-Based Methods

Several databases of chemical ontologies exist. These include ClassyFire, ChEBI, and LION/web (for lipids only). In these ontologies, compounds are organized hierarchically into groups chosen by those designing the ontology. For instance, ClassyFire divides compounds into organic and inorganic, then into lipids, analogues, metal compounds, non-metal compounds, etc (see the paper for the full ontology).

For ontological databases, it is possible to compare two compounds by the height of their split in the ontology. This can be easily done once one queries the compound of interest and obtains the ontology. This is simple in ClassyFire: the classyfireR package in R returns a compound’s ontology up to four levels of the hierarchy using its InChIKey. For better resolution, one can also use the batch tool to manually query a list of InChIKeys. Notably, this chemical similarity is based on the organization of the hierarchy rather than on properties that can be observed from chemical structure alone.

Graph-Based Methods

Graph-based methods use graph mining to determine the similarity of two compounds. Atoms are often treated as nodes, with bonds as edges (although, in some cases, line graphs are used). As a computer scientist, I’m partial to these. The problem is that they are less scalable than the methods described above.

One simple graph-based method is atom pair similarity, developed in the 1980s. In this method, the path (in non-hydrogen bonds) is computed between every pair of atoms in the molecule. The similarity is then based on the number of shared paths (i.e. shared bond structure between each pair of atoms). As far as I’m aware, this method is still available in the ChemMineR R package.

One that provides a more intuitive graphical measure of chemical similarity, but is also more computationally intensive, is Maximum Common Substructure (MCS). In this method, a modular product graph is computed between two molecules. This graph includes all possible mappings between atoms and bonds from the chemical graphs of the compounds. Nodes denote pairs of atoms with the same identity in both compounds, and edges denote pairs of bonds shared between two pairs of atoms across compounds. Intuitively, any atom-bond substructure found in both compounds would manifest as a clique in the modular product graph. So, the problem of determining similarity is simply to find the maximum clique and compute the ratio of atoms and bonds in the clique to the union of atoms and bonds between the two structures.

…Except for one problem. The maximum clique problem is NP-hard. In comparison to fingerprint- or ontology-based methods, MCS is painfully slow. Heuristics exist and are described in Duesbury’s thesis, but the available R package does not appear to include them.

Synopsis

Each method seems to provide its own potential benefit. Methods based on ontologies or fingerprints rely on more assumptions than those based on chemical graphs, but they are much faster and more scalable. Fingerprint-based methods, combined with Tanimoto similarity, are currently the most common approach in the metabolomics field.

References

Duesbury, Edmund (2015). Applications and Variations of the Maximum Common Subgraph for the Determination of Chemical Similarity. PhD thesis, University of Sheffield. [online] Available at: http://etheses.whiterose.ac.uk/13063/ [Accessed 24 Mar. 2019].

Temple University. (2009) PubChem Substructure Fingerprint. [online] Available at: https://astro.temple.edu/~tua87106/list_fingerprints.pdf [Accessed 24 Mar. 2019].

Molenaar M, Jeucken A, Wassenaar T, et al. (2018). LION/web: a web-based ontology enrichment tool for lipidomic data analysis. bioRxiv. [online] Available at: https://www.biorxiv.org/content/10.1101/398040v1 [Accessed 24 Mar. 2019].

Degtyarenko K, de Matos P, Ennis M, et al. (2008). ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research, 36 (Database issue), D344–D350. [online] Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2238832/ [Accessed 24 Mar. 2019].

Feunang Y, Eisner R, Knox C, et al. (2016). ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. Journal of Cheminformatics, 8(61). [online] Available at:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-016-0174-y [Accessed 24 Mar. 2019].

Fiehn O. ClassyFire Batch by Fiehn Lab [Software]. Available at: https://cfb.fiehnlab.ucdavis.edu/ [Accessed 24 Mar. 2019].

Carhart R, Smith D, and Venkataraghavan R. (1985). Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 25(2), 64-73. [online] Available at:
https://pubs.acs.org/doi/10.1021/ci00046a002 [Accessed 24 Mar. 2019].

Raymond J, Gardiner E, and Willet P. (2002). RASCAL: Calculation of Graph Similarity using Maximum Common Edge Subgraphs. The Computer Journal, 45(6), 631-644. [online] Available at:
https://academic.oup.com/comjnl/article-abstract/45/6/631/429187?redirectedFrom=fulltext  [Accessed 24 Mar. 2019].

A Little About Me

In a nutshell, I’m a computer scientist who focuses on bioinformatics research, specifically in computationally undersaturated -omics fields. To me, this is an exciting place to be, because it is a place in which I can apply a good deal of critical thought and explore the depth of the field to organically develop new algorithmic solutions. I’m a PhD student and, while it’s a stressful lifestyle, I honestly love what I do. People often see their work as a burden, but to me, my research is my true passion.

The role of science in society is important to me, and I believe there is much to talk about regarding how we interact with science, especially computational science. I enjoy teaching and mentoring, and I am involved with several organizations dedicated to increasing women’s involvement in science. In addition, I am interested in how cultural and philosophical perspectives can impact our interactions with science and technology.

Take a look around, and perhaps you may find something of interest 🙂