An Introduction to Bash Scripting

Source: http://robertmuth.blogspot.com/2012/08/better-bash-scripting-in-15-minutes.html

What is Bash Scripting?

Bash (short for “Bourne Again Shell”) is the shell scripting language and interpreter on most Linux systems. If you’ve used the command line in a Linux environment, you’ve probably already written some commands in Bash. This tutorial explains how to package these shell commands neatly into a script, which can be useful for anyone wanting to streamline commands for a routine task. Bash scripting is also a vital skill for anyone conducting research in a supercomputer environment.

Creating a Bash Script

When you create a bash script in a Unix environment, it is customary to give it a .sh extension. Technically it isn’t necessary, because Unix relies on the shebang line (described below) rather than the extension to decide how to run the file, but it’s a good idea. Think about it: have you ever inherited someone’s old project and been faced with the task of figuring out what each of their files is for? Imagine doing that without extensions. So, give your file a .sh extension.

To open a new script in emacs:

C-x C-f my_script.sh

Or, if you’re like me and use vim:

vi my_script.sh

Then, add the following header to the top of your file. All other code will go below this. This header is called a shebang and is used to ensure that Bash (and no other interpreter) will be used to run the script.

#!/bin/bash

Running a Bash Script

Source: https://www.guru99.com/sql-interview-questions-answers.html

You do not need to run a compiler or specify an interpreter to run a Bash script (the shebang already takes care of that for you). You do need to make the file executable once, for example with chmod +x my_script.sh. After that, simply navigate to the directory where the script is saved and run:

./my_script.sh

If your Bash script takes command line arguments (more on how to do this later), execute your script as in the following example, where "my_file.csv" and 0.4 are the parameters to the script.

./my_script.sh "my_file.csv" 0.4

Some Bash scripts accept named parameters (more on this later). To run a Bash script with named parameters, execute it in the following way. In this example, -f references a file parameter and -t a threshold parameter.

./my_script.sh -f "my_file.csv" -t 0.4

Declaring and Using Variables

Source: https://www.researchgate.net/figure/Global-and-local-variables-in-procedural-programming_fig1_283347855

In bash scripting, you don’t need to declare the type of a variable, and you don’t have a lot of special data types. Basically, there are character strings, integers, and arrays (lists). Note that, by default, every value is stored as a string, including numbers such as 4.0. Values are only treated as numeric during arithmetic operations, and Bash arithmetic is integer-only: for floating point math you need an external tool such as bc or awk. If you need anything more complicated than that, my advice is to write your code in R, Python, Java, or Perl and call it from your bash script rather than coding everything up in bash.

Single Values

Again, you don’t need to declare the type of your variable. All you need to do is this:

my_character_variable="hello world"
my_float_variable=4.0
my_integer_variable=1

One thing to note about bash is that it’s finicky about whitespace. No matter how much you want to, don’t put spaces on either side of the equals sign. With spaces, Bash tries to run the variable name as a command.

When you want to use these values, put a dollar sign in front of them. It is a good idea to enclose them in curly braces as well, although it isn’t always required. In some scenarios (like concatenating two strings with an underscore), curly braces are important for distinguishing variable names from other text. For example, the code below will store “hello world_1” in the variable.

full_string=${my_character_variable}_${my_integer_variable}
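To see why the braces matter, here is a minimal sketch that builds the same string with and without them. The names without_braces and with_braces are made up for this illustration.

```shell
#!/bin/bash
my_character_variable="hello world"
my_integer_variable=1

# Without braces, Bash looks for a variable named "my_character_variable_"
# (note the trailing underscore). It is unset, so it expands to nothing.
without_braces="$my_character_variable_$my_integer_variable"

# With braces, the underscore is treated as literal text.
with_braces="${my_character_variable}_${my_integer_variable}"

echo "without braces: ${without_braces}"
echo "with braces:    ${with_braces}"
```

The first echo shows only the integer, because the misread variable name silently expands to an empty string.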

Arrays and Lists

The syntax for lists is also quite simple, and bash arrays can contain both strings and numeric values, e.g.

my_array=(1 2 3 4 5 6 7 8 9 10 4.0 "hello" "world")
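Once an array exists, you will usually want to read elements back out of it. The sketch below shows a few common operations on the array above; the variable names first, last, count, and tail_items are only illustrative.

```shell
#!/bin/bash
my_array=(1 2 3 4 5 6 7 8 9 10 4.0 "hello" "world")

first="${my_array[0]}"                      # indices start at 0
count="${#my_array[@]}"                     # number of elements
last="${my_array[${#my_array[@]}-1]}"       # last element, by computed index
tail_items="${my_array[*]:11:2}"            # 2 elements starting at index 11

my_array+=("again")                         # append one element
new_count="${#my_array[@]}"

echo "first=${first} last=${last} count=${count} new_count=${new_count}"
echo "tail_items=${tail_items}"
```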

Variable Scoping

If you’ve done a lot of coding, you know that variables are local to the block of code in which they’re declared, right? Right. Except in bash. By default, every variable declared in a bash script is global, even if it is assigned inside a function. If you want a local variable, you need to declare it with the local keyword, which is only valid inside a function:

local my_local_character_variable="hello neighborhood"
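The following sketch illustrates the difference. The function change_greeting and the variables greeting and leaked are hypothetical names chosen for the demonstration.

```shell
#!/bin/bash
greeting="hello world"     # global

change_greeting() {
   # local only works inside a function; this copy shadows the global one.
   local greeting="hello neighborhood"
   # No local keyword here, so this variable becomes global.
   leaked="set inside the function"
   echo "inside:  ${greeting}"
}

change_greeting
echo "outside: ${greeting}"   # the global value is unchanged
echo "leaked:  ${leaked}"     # still visible after the function returns
```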

Arithmetic

Arithmetic on variables is straightforward. Use the same operators that you would use in typical arithmetic. However, to ensure that your variables are treated as numeric and not string variables, you must use double parentheses around the expression or the let keyword in front of the variable in which you want to store your result. Keep in mind that both forms perform integer arithmetic only. Some examples are shown below. Remember, be careful about whitespace.

x=5
y=2
modulus=$((${x}%${y}))
let addition=${x}+${y}
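Because Bash arithmetic only handles integers, division discards the remainder. A rough sketch of the behavior, with one possible workaround that hands the floating point calculation to awk (bc would be another option):

```shell
#!/bin/bash
x=5
y=2

quotient=$(( x / y ))      # integer division: the remainder is discarded
power=$(( x ** y ))        # exponentiation

# Bash has no floating point arithmetic of its own, so delegate to awk.
ratio=$(awk -v a="${x}" -v b="${y}" 'BEGIN { printf "%.2f", a / b }')

echo "quotient=${quotient} power=${power} ratio=${ratio}"
```

Inside double parentheses, variable names can be written without the dollar sign and whitespace is allowed.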

String Concatenation

String concatenation is also very simple in Bash. Simply type the variables, one after the other, with any additional text filled in exactly where you want it! It doesn’t get easier than that. Here’s an example.

#This code stores "Why hello there, world!" in a new variable.
word1="hello"
word2="world"
combined="Why ${word1} there, ${word2}!"

This also works if you’re dealing with numeric values, without any need for parsing.

#This code stores the string "We're number 1" in a new variable.
val=1
combined="We're number ${val}"

Running External Programs

If you’ve ever used the command line on a Unix system, then you already know how to execute external programs in a bash script. Simply call the program as you would do from the command line. Here are a few examples.

R Scripts

Many people like to run R interactively. Unfortunately, you can’t do this within a Bash script, because scripts are not interactive by virtue of being scripts. You will need to save your code in a .r file and run it using Rscript. If you have run Rscript on the command line before, this should look very familiar.

param1=2
param2="/root/my_file.csv"
param3="/root/my_output.csv"
Rscript my_r_script.r "$param1" "$param2" "$param3"

Utilities

You can run utilities from within your bash script just as you would run them on the command line. Here is an example using bedtools, a utility common in bioinformatics. Called in this way, it will print the resulting file to the console rather than saving it. To avoid this behavior, see the section on Output.

bedtools sort -i my_bed_file.bed

Unix Tools

There are a few Unix tools that can be handy to use in Bash scripts. Each of these tools is really a topic on its own, but here is a brief introduction to them.

grep

The grep utility is primarily used for searching text. It is often used with pipes (discussed later) and can be called directly within the Bash script. Learn more about the capabilities of grep in its manual page (man grep). The following command will return every line in my_file.txt containing my_word.

grep "my_word" my_file.txt

cut

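cut is used for selecting parts of each line in a file, such as particular columns or character ranges. It prints the selection and leaves the file itself unchanged. Here is an example of how it can be used to select only the first two columns in a tab-delimited file. Tab is the default delimiter for cut, so the -d option is not needed here. You can see more examples in the cut manual page (man cut).

cut -f 1,2 my_file.txt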

shuf

If you are doing any work that involves permuting data, shuf is a convenient tool. It can be used either to shuffle an entire file or to select a random set of lines from a file. Read more about shuf in its manual page (man shuf). The example below shows how to shuffle an entire file’s lines using shuf.

shuf my_file.txt

awk

The awk utility is convenient for selecting and modifying lines that meet specified criteria. It is more powerful than cut, but it can also be more complicated to use. To really use awk well, you should understand regular expressions. The following example shows how to select columns 1 and 2 from a tab-delimited file (similar to the cut example). The comma in the print statement inserts the output separator between the columns; without it, awk would join them with nothing in between. See the awk manual page (man awk) for more examples.

awk -F'\t' -v OFS='\t' '{print $1, $2}' my_file.txt
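As a sketch of selecting lines that meet a criterion, the example below keeps only rows whose third column exceeds a cutoff. The file contents and the 0.5 cutoff are invented for illustration.

```shell
#!/bin/bash
# Build a small tab-delimited file to work on (contents are made up).
tmp_file=$(mktemp)
printf 'chr1\t100\t0.9\nchr2\t200\t0.2\nchr3\t300\t0.7\n' > "${tmp_file}"

# Keep rows whose third column is greater than 0.5, printing columns 1 and 2.
selected=$(awk -F'\t' -v OFS='\t' '$3 > 0.5 { print $1, $2 }' "${tmp_file}")
echo "${selected}"

rm "${tmp_file}"
```

The pattern before the braces is the selection criterion, and the block inside the braces says what to do with each matching line.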

Branching

Source: https://www.classes.cs.uchicago.edu/archive/2019/winter/15200-1/lecs/notes/Lec4ComplexCondNotes.html

Branching in Bash uses the following syntax. Within the double square brackets, you can construct tests using the comparison operators available in Bash, and you can chain tests together using Bash logical operators. Single square brackets are also supported, but double square brackets have some added features. Note that the whitespace after if and between the test and the brackets is important!

if [[ $num -eq 42 ]]
   then
      Rscript my_r_script.r "file_42.csv"
   else
      Rscript my_r_script.r "file_not_42.csv"
fi

Tests in an if statement can include more than just arithmetic. The following code checks whether a directory exists (the -d test), and creates it if it doesn’t.

dir="../my_dir"
if [[ ! -d $dir ]]; then
   mkdir "$dir"
fi
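The sketch below combines a few other kinds of test: file tests chained with a logical operator, and string pattern matching with elif. All file and variable names here are hypothetical.

```shell
#!/bin/bash
tmp_dir=$(mktemp -d)
touch "${tmp_dir}/data.csv"
file_name="data.csv"

# -d tests for a directory, -f for a regular file; && requires both.
if [[ -d "${tmp_dir}" && -f "${tmp_dir}/${file_name}" ]]; then
   file_status="found"
else
   file_status="missing"
fi

# Inside [[ ]], == with an unquoted right-hand side matches a glob pattern.
if [[ "${file_name}" == *.bed ]]; then
   file_type="bed"
elif [[ "${file_name}" == *.csv ]]; then
   file_type="csv"
else
   file_type="unknown"
fi

echo "${file_status} ${file_type}"
rm -r "${tmp_dir}"
```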

Looping

Source: http://www.functionx.com/java/Lesson08.htm

You can use for loops or while loops in Bash. For loops are used for looping over lists. The example below shows looping over a list of numbers.

for f in 0 1 2 3 4 5 6 7; 
   do 
      Rscript my_r_script.r $f 
   done

This could also be done using a range.

for f in {0..7};
   do
      Rscript my_r_script.r $f
   done

Finally, you could loop over a pre-defined array.

for f in "${my_array[@]}";
   do
      Rscript my_r_script.r $f
   done

The test conditions of while loops have similar syntax to those of if statements. The following while loop does the same as the first two for loops above. Note that the counter must be incremented inside the loop, or the loop never ends.

i=0
while [[ $i -le 7 ]]
   do
     Rscript my_r_script.r $i
     i=$((i+1))
   done

Functions

Source: http://www.desy.de/gna/html/cc/Tutorial/node3.htm

You can define and call functions in Bash scripts, but note that you need to define your function before you call it. This is notable because many programming languages do not have this restriction. Another thing that is different about functions in Bash scripting is the way parameters are passed. When calling the function, you simply pass the parameter directly after the function call like a command-line argument. Inside your function definition, your first parameter will be referred to as $1, your second as $2, and so on. What about returning values from a function? Bash doesn’t allow this in the usual sense. The return statement can only set a numeric exit status between 0 and 255, so to pass data back you either print it and capture the output with $() or assign to a global variable. So strictly speaking, Bash functions behave more like procedures than functions.

my_function() {
   local c=$1
   Rscript my_r_script.r "$c"
}
for f in 0 1 2 3 4 5 6 7;
   do 
      my_function $f
   done
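If you do need to get a result back from a function, one common approach is to print it and capture the output, and to use the exit status for yes/no answers. A minimal sketch, with add_numbers and is_even as made-up example functions:

```shell
#!/bin/bash
# Print the result on stdout so the caller can capture it with $( ).
add_numbers() {
   local a=$1
   local b=$2
   echo $(( a + b ))
}

# Report true or false through the exit status (0 means success/true).
is_even() {
   local n=$1
   (( n % 2 == 0 ))
}

total=$(add_numbers 3 4)

if is_even "${total}"; then
   parity="even"
else
   parity="odd"
fi

echo "${total} is ${parity}"
```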

Input

Input in Bash scripting can take two forms. You can pass command-line arguments when calling your script, or you can store your input as a file.

Command Line Arguments

Basic command line arguments work similarly to parameters in Bash functions: $1 refers to argument 1, $2 to argument 2, and so on.
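As a quick sketch of positional arguments, the example below uses set -- to simulate the arguments a script would receive, so it can be run without passing anything on the command line. The variable names are illustrative only.

```shell
#!/bin/bash
# Simulate running: ./my_script.sh "my file.csv" 0.4
set -- "my file.csv" 0.4

arg_count=$#        # number of arguments
input_file=$1       # first argument
threshold=$2        # second argument

# "$@" expands to all arguments, keeping each one intact even with spaces.
for arg in "$@"; do
   echo "argument: ${arg}"
done

echo "count=${arg_count} file=${input_file} threshold=${threshold}"
```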

However, if you want to make your script more user-friendly and allow for named parameters, that is also possible. The code below allows for three named parameters: -n for a name, -f for a file name, and -t for a threshold. All arguments are optional. The realpath utility converts the file name given into an absolute path.

while getopts n:f:t: option; 
   do
      case "${option}" in
         n) name=$OPTARG;;
         f) filename=$(realpath "$OPTARG");;
         t) threshold=$OPTARG;;
      esac
   done
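Since every option is optional, it can help to set defaults first and to handle unknown options. The sketch below wraps the same getopts loop in a hypothetical parse_args function so it can be tried without command-line input; the default values are assumptions, not part of the original script.

```shell
#!/bin/bash
parse_args() {
   # Reset OPTIND so the function can be called more than once.
   local OPTIND=1 option
   # Hypothetical defaults, used when an option is not supplied.
   name="anonymous"
   filename=""
   threshold=0.5
   while getopts "n:f:t:" option; do
      case "${option}" in
         n) name=$OPTARG ;;
         f) filename=$OPTARG ;;
         t) threshold=$OPTARG ;;
         *) echo "Usage: [-n name] [-f file] [-t threshold]" >&2
            return 1 ;;
      esac
   done
}

parse_args -n "Ada" -t 0.4
echo "name=${name} filename=${filename} threshold=${threshold}"
```

In a real script you would typically call parse_args "$@" to forward the script's own arguments.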

File Input

Of course, you can also simply hard code file names into your Bash script and use them as your input. If you want to input a list of values rather than a single value, storing them in a file is probably the best way to do this. There are several ways to load your data from the file into a list.

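The first option reads the file using the cat utility and stores its contents in a variable. Strictly speaking, the result is a single string rather than a list, but Bash splits it on whitespace whenever the variable is used unquoted (for example, in a for loop). To get a true array, wrap the expression in an extra pair of parentheses: my_list=($(cat my_file.txt)). Note that the parentheses here are different from the double parentheses described in the arithmetic section. Double parentheses (()) run arithmetic operations, while $() runs the enclosed commands in a subshell (essentially a child process) and substitutes their output.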

my_list=$(cat my_file.txt)

Another option uses the shell redirection operator to read each line of the file in a loop.

my_list=()
while read infile;
do
    my_list+=($infile)
done < my_file.txt

The IFS Variable

Note that the code above will only work as expected if there is no whitespace (spaces, tabs, etc) within each line. If your lines have spaces or tabs, Bash will automatically split on each space or tab. You can change this by setting a special variable called IFS to split on new lines only.

For example, say your input file is formatted like this.

Hi, I'm a file.
You should input me into your Bash script.
But it needs to be done line-by-line.

If you want to read each of these lines as a single list element, you could do

IFS=$'\n'
my_list=()
while read infile;
do
    my_list+=($infile)
done < my_file.txt

You can also use IFS to split on other characters, such as commas or colons. See the Bash manual for more information on IFS.
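An alternative that avoids changing IFS for the whole script is to clear it only for the read command and to quote the variable when appending. This is a sketch of that variant, using a temporary file with the three example lines; the variable name line is arbitrary.

```shell
#!/bin/bash
tmp_file=$(mktemp)
printf '%s\n' "Hi, I'm a file." \
   "You should input me into your Bash script." \
   "But it needs to be done line-by-line." > "${tmp_file}"

my_list=()
# IFS= applies to this read only and preserves leading/trailing whitespace;
# -r stops read from treating backslashes as escape characters.
while IFS= read -r line; do
    my_list+=("${line}")   # the quotes keep the whole line as one element
done < "${tmp_file}"

line_count="${#my_list[@]}"
echo "read ${line_count} lines; the second is: ${my_list[1]}"

rm "${tmp_file}"
```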

Output

In Bash, you can print output to a file or direct it to stdout or stderr (by default, stdout is usually the main console).

Shell Redirection Operator

The shell redirection operator allows you to redirect output to a file. For instance, the following line redirects the output of the bedtools sort utility to the file my_sorted_bed_file.bed. Normally, this output would print to stdout.

bedtools sort -i my_bed_file.bed > my_sorted_bed_file.bed

It is also possible to append to the file, like so:

bedtools sort -i my_bed_file.bed >> my_sorted_bed_file.bed

Finally, if your line of code prints to stderr, you can redirect both streams as follows:

bedtools sort -i my_bed_file.bed > my_sorted_bed_file.bed 2> my_errors.log
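The sketch below shows a few redirection variants using a stand-in function, noisy, that writes one line to each stream. The function and file names are hypothetical.

```shell
#!/bin/bash
tmp_dir=$(mktemp -d)

# A stand-in command that writes one line to stdout and one to stderr.
noisy() {
   echo "result line"
   echo "warning line" >&2
}

noisy > "${tmp_dir}/out.txt" 2> "${tmp_dir}/err.log"   # one file per stream
noisy > "${tmp_dir}/both.txt" 2>&1                     # both streams, one file
noisy > /dev/null 2>&1                                 # discard everything

out=$(cat "${tmp_dir}/out.txt")
err=$(cat "${tmp_dir}/err.log")
both=$(cat "${tmp_dir}/both.txt")
echo "out: ${out}"
echo "err: ${err}"

rm -r "${tmp_dir}"
```

Note that 2>&1 must come after the file redirection, because redirections are processed from left to right.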

echo Utility

The main way to output to the console (stdout) is to use the Unix echo utility. The following examples show how echo can be used.

#This command prints "Hello World" to stdout.
echo "Hello World"

#This command prints the contents of my_array to stdout.
my_array=(1 2 3 4 5 6 7 8 9 10 4.0 "hello" "world")
echo "${my_array[@]}"

#This command prints the contents of the file my_file.txt to stdout.
echo "$(cat my_file.txt)"

Piping

Source: https://bash.cyberciti.biz/guide/Pipes

When you pipe a command, you are redirecting its output to another command. This is done using the | operator. Pipes are used in many scenarios, but here are some examples.

#The following code prints only the names of files in the current directory that end in ".png".
ls | grep "\.png$"

#The following command sorts the first 1000 lines of a file.
head -n 1000 my_file.txt | sort -V -k1,1 -k2,2
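Pipes can also be chained, with each command consuming the output of the one before it. As a small sketch with made-up input, the following finds the most frequent word in a list:

```shell
#!/bin/bash
# sort groups identical words, uniq -c counts each group,
# sort -rn puts the largest count first, head keeps the top row,
# and awk prints the word (the second field).
top_word=$(printf '%s\n' apple banana apple cherry banana apple \
   | sort | uniq -c | sort -rn | head -n 1 | awk '{ print $2 }')

echo "most frequent word: ${top_word}"
```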

Quotes

Three types of quotes are used in Bash: double quotes, single quotes, and backtick quotes. They are all used for different purposes.

Double Quotes

Double quotes are used around text. If variables are included in the double quotes, they are expanded. Here is an example. The code below prints “Why hello there, world!”

word1="hello"
word2="world"
echo "Why ${word1} there, ${word2}!"

Note that if you want to include quotes within the string, you need to use an escape character. The code below prints “The script name is “my_script.r””

echo "The script name is \"my_script.r\""

Single Quotes

Single quotes are also used around text, but the difference is that they do not expand variables. In a similar example (below), the variable holds the literal text “Why ${word1} there, ${word2}”.

#This code stores the literal text "Why ${word1} there, ${word2}" in a new variable.
word1="hello"
word2="world"
combined='Why ${word1} there, ${word2}'
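Quoting also controls whether a value containing spaces is passed as one argument or several. The sketch below uses a hypothetical helper, count_args, that simply reports how many arguments it received.

```shell
#!/bin/bash
file_name="my file.csv"

# Helper that prints the number of arguments it was given.
count_args() {
   echo $#
}

unquoted=$(count_args $file_name)      # split into "my" and "file.csv"
quoted=$(count_args "$file_name")      # kept together as one argument
literal='${file_name}'                 # single quotes: no expansion at all

echo "unquoted=${unquoted} quoted=${quoted} literal=${literal}"
```

This is why it is generally safer to wrap variable expansions in double quotes, especially for file names.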

Backtick Quotes

These are usually just called “backticks”; however, many people consider them a type of quote or mistake them for single quotes. Backticks have an entirely different function from other quotes, which is to return the output of a command. In this way, they function the same as $(). For instance, in the File Input section, we could have also written the command like so.

my_list=`cat my_file.txt`
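One practical difference between the two forms shows up when commands are nested. In the sketch below, the $() form nests as written, while the inner backticks have to be escaped with backslashes. The variable names are illustrative.

```shell
#!/bin/bash
# $( ) can be nested directly.
nested_dollar=$(echo "outer $(echo inner)")

# Nested backticks must be escaped, which quickly becomes hard to read.
nested_backticks=`echo "outer \`echo inner\`"`

echo "${nested_dollar}"
echo "${nested_backticks}"
```

Both variables end up with the same value, which is one reason many people prefer $() in new scripts.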

Background Processes

Source: https://turbofuture.com/computers/Run-process-in-background-linux-terminal

Sometimes, you may want to run part of your script in the background so that it doesn’t block the rest of the script. To do this, you launch the code as a separate background process. One easy way to do it is to put all code you wish to run in the background into its own function. Then, call that function in a for loop. The ampersand at the end of the call runs it as a background process.

Note the pids variable and the wait statement. These are important if you want to make sure no other code executes until all background processes have completed. The code below tells the script to track all process IDs (the special variable $! holds the ID of the most recent background process) and wait until they have completed before running the next line of code.

pids=""
for f in $CHROMS;
   do
      my_function $f &
      pids="$pids $!"
   done
wait $pids
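A variation on this pattern stores the process IDs in an array and waits on each one separately, which makes it possible to count failures. This is only a sketch: my_function here is a stand-in that writes a small result file, and CHROMS is assumed to hold a whitespace-separated list.

```shell
#!/bin/bash
tmp_dir=$(mktemp -d)

# Stand-in for real work: pause briefly, then write a result file.
my_function() {
   local chrom=$1
   sleep 0.1
   echo "done ${chrom}" > "${tmp_dir}/${chrom}.txt"
}

CHROMS="chr1 chr2 chr3"   # assumed: a whitespace-separated list
pids=()
for f in $CHROMS; do
   my_function "$f" &     # run in the background
   pids+=("$!")           # remember its process ID
done

# wait returns the exit status of the job, so failures can be counted.
failures=0
for pid in "${pids[@]}"; do
   wait "${pid}" || failures=$(( failures + 1 ))
done

results=("${tmp_dir}"/*.txt)
finished="${#results[@]}"
echo "finished=${finished} failures=${failures}"

rm -r "${tmp_dir}"
```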
