Regular expressions

Contents

Intro

[back to top]

A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for “find” or “find and replace” operations on strings.

Multiple tools in LINUX can be utilized to find regular expressions. The most common one is grep which you already used during the morning.

Now there are a couple of more powerful tools that work with regular expressions, these are awk and sed.

For example: the most basic awk command looks like this:

$ awk '/pattern/{ print $0 }' file

Let’s move to one of our afternoon directories.

$ cd /scratch/asecas86/afternoon

Basic awk

[back to top]

We will be using basic awk with our fna and gff files. Lets start with fna files.

Let’s move to genomes

$ cd genomes

Now select the following file: GCA_002443195.1_AflaGuard_genomic.fna and run the following command:

$ awk '/NNNNN/{ print $0 }' GCA_002443195.1_AflaGuard_genomic.fna

Pretty nasty output. We must linearize first the fasta files. There is this very helpful script that linearizes any fasta file. Let’s download it first.

$ wget --no-check-certificate https://raw.githubusercontent.com/andrese52/Training_datasets_scripting/master/scripts/linearize.sh

Let’s check how many lines GCA_002443195.1_AflaGuard_genomic.fna has prior linearizing it.

$ wc -l GCA_002443195.1_AflaGuard_genomic.fna
453747 GCA_002443195.1_AflaGuard_genomic.fna

Run the script as follows:

$ bash linearize.sh GCA_002443195.1_AflaGuard_genomic.fna

This script manipulates the fasta file and just reformats it to have the following format:

>seq1
ATCCCCCCC
>seq2
ATCTCTCT

Once the file is linearized, count the lines again like so:

$ wc -l linearized_GCA_002443195.1_AflaGuard_genomic.fna
196 linearized_GCA_002443195.1_AflaGuard_genomic.fna

BOOM!! You have linearized a .fasta file. This is very helpful to manipulate the file and get metrics and regular expressions.

Let’s make sure the file was correctly linearized by counting the number of fasta headers in the original file. The count should give 1962 = 98. Let’s check!!

$ grep -c ">" GCA_002443195.1_AflaGuard_genomic.fna
98

GREAT, you just linearized a fasta file. Now let’s get some metrics.

Identifying patterns in fasta files with awk

[back to top]

An important pattern to find are those nucleotides that have not been identified yet due to low quality sequencing. We might want to eliminate contigs or scaffolds containing these NNNN patterns since they are not useful for downstream analyses (this varies depending on the researcher’s needs). Let’s check:

$ awk '/NNNNN/{ print $0 }' linearized_GCA_002443195.1_AflaGuard_genomic.fna | wc -l
32

So, 32 out of 98 contigs have some sort of NNNNN patterns of at least 5 Ns.

Let’s get the sequence lengths of the current scaffolds

$ cat linearized_GCA_002443195.1_AflaGuard_genomic.fna  | awk 'NR%2==0' | awk '{print length($1)}'
  • What is happening above is that NR%2==0 is requesting that every even line in the file is printed, then these lines are piped into another awk that gives the length of the sequence.

  • We can also redirect the output to a new file.

$ cat linearized_GCA_002443195.1_AflaGuard_genomic.fna  | awk 'NR%2==0' | awk '{print length($1)}' > read_dist_GCA_002443195.1_AflaGuard_genomic.fna

Now let’s check the read distribution file

$ head -n 5 read_dist_GCA_002443195.1_AflaGuard_genomic.fna

It appears like some scaffolds have 4 million nucleotides

4471384
4150194
2714413
2659415
2557007

Task 3

[back to top] Linearize the fasta file GCA_000006275.2_JCVI-afl1-v2.0_genomic.fna and count the number of nucleotides on each scaffold.

$ bash linearize.sh GCA_000006275.2_JCVI-afl1-v2.0_genomic.fna
$ cat linearized_GCA_000006275.2_JCVI-afl1-v2.0_genomic.fna  | awk 'NR%2==0' | awk '{print length($1)}' > read_dist_GCA_000006275.2_JCVI-afl1-v2.0_genomic.fna

Once you have both files let’s get the shortest contig lenghts:

$ sort -n read_dist_GCA_002443195.1_AflaGuard_genomic.fna | head -n 1
$ sort -n read_dist_GCA_000006275.2_JCVI-afl1-v2.0_genomic.fna | head -n 1

Task 4

[back to top] Obtain the largest contig length on both genomes

Annotation files and awk

[back to top]

Annotation files of these genomes contain information about the genes that the nucleotides encode. We will be doing some basic filtering with awk to find genes of interest.

First let’s move to the annotations directory

$ cd ../annotations

We will be working on the annotations of the same two genomes that we used for counting the length of each scaffold.

Important information about gff files can be found online here

What is most important about these files is that they usually have 9 columns.

Position index Position name Description
1 sequence The name of the sequence where the feature is located.
2 source Keyword identifying the source of the feature, like a program (e.g. Augustus or RepeatMasker) or an organization (like TAIR).
3 feature The feature type name, like “gene” or “exon”. In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent “transcript” feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project.
4 start Genomic start of the feature, with a 1-base offset. This is in contrast with other 0-offset half-open sequence formats, like BED files.
5 end Genomic end of the feature, with a 1-base offset. This is the same end coordinate as it is in 0-offset half-open sequence formats, like BED files.[citation needed]
6 score Numeric value that generally indicates the confidence of the source on the annotated feature. A value of “.” (a dot) is used to define a null value.
7 strand Single character that indicates the Sense (molecular biology) strand of the feature; it can assume the values of “+” (positive, or 5’->3’), “-”, (negative, or 3’->5’), “.” (undetermined).
8 frame (GTF, GFF2) or phase (GFF3) Frame or phase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or “.” (for everything else). Frame and Phase are not the same, See following subsection.
9 Attributes All the other information pertaining to this feature. The format, structure and content of this field is the one which varies the most between the three competing file formats

It would be good if we first explore the header:

$ head -n 5 GCA_000006275.2_JCVI-afl1-v2.0_genomic.gff

The header contains a lot of information about generalities of the annotation and the genome itself. We must skip these first lines. They are usually not tab-delimited.

The best way to do it is just by targetting lines of interest, for example I am interested in aflatoxin genes. Therefore I will grep the file for the pattern aflatoxin.

$ grep -i "aflatoxin" GCA_000006275.2_JCVI-afl1-v2.0_genomic.gff

Since I am interested only on those genes I will redirect the output.

$ grep -i "aflatoxin" GCA_000006275.2_JCVI-afl1-v2.0_genomic.gff > aflatoxin_genes.gff

Now let’s analyze by column. Usually the word aflatoxin should be found in the Attributes column which is column 9. So I can also get those same lines by doing this:

Print each line where the 9th field is similar to aflatoxin:

$ awk -F "\t" '$9 ~ "aflatoxin"' GCA_000006275.2_JCVI-afl1-v2.0_genomic.gff

Now let’s check the difference between grep and awk

$ grep -i "aflatoxin" GCA_000006275.2_JCVI-afl1-v2.0_genomic.gff | wc -l
$ awk -F "\t" '$9 ~ "aflatoxin"' GCA_000006275.2_JCVI-afl1-v2.0_genomic.gff | wc -l

We must get the same result for both which is 107 entries.

We have already redirected the aflatoxin annotations to a separate file. Now let’s play with that file.

We can get all the positive sense CDS and count them

$ awk -F "\t" '$7 == "+"' aflatoxin_genes.gff | wc -l
45

Task 5

[back to top] Count the number of CDS in the aflatoxin annotation file

Extras with gff files

[back to top]

Print all sequences annotated in a GFF3 file: This command will work only if we use the original gff file.

$ cut -s -f 1,9 GCA_000006275.2_JCVI-afl1-v2.0_genomic.gff | grep $'\t' | cut -f 1 | sort | uniq

Determine all feature types annotated in a GFF3 file.

$ grep -v '^#' GCA_000006275.2_JCVI-afl1-v2.0_genomic.gff | cut -s -f 3 | sort | uniq
CDS
exon
gene
mobile_genetic_element
mRNA
region

Here the -v in grep is saying to target the reverse of the search. If you remember whenever you did head in the annotation file there were many comments. It is trying to get rid of those lines first.

The output is piped to a cut where the -f is saying to select only the column 3 and the -s asks to avoid printing lines that do not contain the field 3 (empty lines or entries).

Determine the number of genes annotated in a GFF3 file.

$ grep -c $'\tgene\t' GCA_000006275.2_JCVI-afl1-v2.0_genomic.gff
13485