NGS Data Analysis for Illumina Platform—Overview and Workflow

Processing of next-generation sequencing (NGS) data for downstream applications is a critical part of the NGS workflow since it has a direct impact on data interpretation and experimental conclusions. Understanding the basics of NGS data analysis, common bioinformatic tools, and a general framework of the data analysis pipeline can facilitate the design and planning of your NGS experiments. This section provides an overview of NGS data analysis workflows for common DNA and RNA sequencing using Illumina systems.

1. Key considerations

NGS data analysis is a computationally intensive process, requiring the storage, transfer, and processing of very large data files (generally 1–3 GB in size). This means access to advanced computing centers on-site via a private network or in the cloud is highly recommended. While many off-the-shelf and “plug-and-play” tools are available, it is very likely that some level of scripting and coding skills will be necessary. Python, Perl, R, and Bash scripting are among the most prevalent, and they are typically performed within Linux or Unix-like operating systems and in command-line environments. At the very least, these skill sets can markedly facilitate the task of data analysis [1,2].

NGS data analysis generally involves three core steps: primary, secondary, and tertiary analysis (Figure 1).

Primary analysis assesses raw sequencing data for quality and is commonly performed by software built into the sequencer.
Secondary analysis converts data to results, such as alignment and expression, with the use of several bioinformatic tools.
In tertiary analysis, conclusions are made about genetic features, expression, or mutations of interest, often with recommendations for certain downstream action in a final report.

Figure 1. NGS data analysis workflow features three core stages: primary, secondary, and tertiary analysis. Key steps and input/output files of the process are shown as examples.

Top

2. Primary analysis

The primary analysis provides total reads and quality metrics of the input libraries for assessment of sequencing efficiency and quality. In Illumina sequencing, the input for primary analysis is a raw binary file with nucleotide bases which are identified during the sequencing run (also called base calls). This file has the extension .bcl. The output from primary analysis is a text-based (ASCII) file in FASTQ format [3]. Software built within, or in some cases linked directly to, the sequencer manages this conversion (historically this software has been the bcl2fastq Conversion Software).

a. Raw data assessment

Some of the quality metrics measured in the primary analysis include:

Yield: The total number of base reads achieved in a run; low numbers are indicative of poor loading and/or sequencing efficiency.
Error rate: The number of incorrect reads, based on an internal control included in Illumina sequencing runs.
Phred quality score: Each base gets assigned a quality score based on the Phred scale, which is also known as the Q score. The Q score measures the probability (P) of an incorrect base call using the equation Q = –10 log₁₀P. Q>30, which represents a <0.1% base call error, is acceptable in most cases (Figure 2) [4].
% aligned: The percentage of sequences aligned (% aligned) is errors introduced are measured using an exogenous control such as the PhiX genome.
Cluster density: The density of clonal clusters generated for sequencing is assessed by clusters passing filter (CFP) or the percentage of passed-filtered (%PF). It is a measure of the purity level of base call signals, and >80% PF is considered optimal clustering.
Phas/Prephas (%): This represents the percentage of base signal lost in each sequencing cycle by calculating the number of cluster molecules falling behind (phasing) or moving ahead (prephasing) during sequencing. A low percentage (e.g., <0.5%) is desired.

Figure 2. Phred quality score, or Q score, and its relationship to base call accuracy.

b. Demultiplexing

Because multiple library samples can be sequenced concurrently (i.e., multiplexing), they are identified and separated by index reads during primary analysis. Demultiplexing results in multiple FASTQ files, each corresponding to one unique sample. Demultiplexed files contain read name, flow cell location, etc., for sample identification (Figure 3).

Figure 3. FASTQ file format and annotation. The sequence label often contains flow cell ID, lane, coordinate, etc., which are not shown here for simplicity.

Top

3. Secondary Analysis

Demultiplexed FASTQ files are used to process each sample individually in secondary analysis. Secondary processing includes the following steps and output files (Table 1).

Table 1. Summary of secondary analysis steps.

Step	Purpose
Read cleanup	This step removes low-quality sequence reads and/or portions of reads (also called “soft-clipping”) from the data. The output from this step is a “cleaned” FASTQ file.
Alignment	This step matches, or maps, each base call to its corresponding location in the genome (or transcriptome for RNA) of the organism of interest. Although each individual organism generally has a unique genome, there are reference genomes and transcriptomes published for use as common alignment tools for the NGS community. The alignment output is a Binary Alignment Map (BAM) file that contains the base call alignments to the reference sequences.
Mutation calling	In this step, mutations, features, and/or other anomalies that do not match the reference genome of interest are identified (or called) from the BAM file; these calls are represented in a text-based tab-delimited file in VCF format.
Gene expression analysis	In the context of RNA sequencing, this can also include gene count and (differential) expression data. A standard file format (like VCF) does not exist for such data, but most tools will output in a tab-delimited format (e.g., Tab Separated Values (TSV) files) with each column representing samples, genes, amplicon IDs, raw counts, normalized counts, etc.

a. Read cleanup

The very first step of secondary analysis is to perform cleanup of sequencing reads. It usually includes trimming adapters, tracking indexes or barcodes, merging paired-end reads, and removing low-quality or suspect reads (or portions of reads) from the data. While steps like removing reads of base calls below a given Phred score cutoff (typically 30) are common, processing can also be both highly subjective and dependent on the nature or quality of samples, e.g., sequencing of archived samples that are degraded. However, reads shorter than a certain length are commonly thrown out because they are difficult to map or representative of unwanted degraded nucleic acids. Additionally, portions of reads beyond a specific read length cutoff (e.g., 200–250 bp) may be eliminated, or “soft-clipped”, due to diminishing sequencing quality at read ends.

Common tools (e.g., FastQC) can be used to check general FASTQ data [5]. FastQC provides quality scores per base and per sequence overall, sequence and GC content, and duplicate or overrepresented sequences (Figure 4). To account for specific chemistry and experiment-specific quality controls, custom scripting is usually necessary. The output from this cleanup would be another “cleaned” FASTQ file.

Click image to enlarge

Figure 4. Quality scoring in a FastQC file.

Another optional step at this stage is de-duplication of reads. This step removes repeat reads of the same library fragments and reads of PCR amplicons of the same sequence. If a unique molecular identifier (UMI, also called barcoding) is integrated in the adapters, de-duplication will be more informative since UMI fingerprints every individual nucleic acid molecule sequenced. By comparing duplicate reads to correct for errors from PCR and/or sequencing [5], this step helps ensure biological mutations, anomalies, and gene expression levels are correctly represented in sequencing data.

For RNA sequencing (RNA-Seq) data, additional steps may also be required during cleanup.

Correction of sequence bias introduced during library preparation (e.g., from cDNA synthesis [6]). Note that sequence bias can also be introduced from PCR amplification (if performed) and the sequencing step itself (e.g., from GC-rich regions).
Quantitation of RNA types (e.g., ribosomal RNA (rRNA) contaminants if rRNA is removed during library preparation). This step is a type of mapping to rRNA reference sequences using alignment tools such as Bowtie 2 or TopHat (details are in the alignment section below).
Determination of strandedness, i.e., whether the RNA has been transcribed from the forward or reverse strand of the genome. Note that this can only be done if a “directional” or “stranded” RNA sequencing kit has been used for library preparation.

Top

b. Sequence alignment

If an established reference genome exists for the sequenced sample, alignment of all short reads in a FASTQ file can be done with various off-the-shelf tools. In general, there is a trade-off between computational cost (i.e., power and speed) and the quality of mapping among these tools. Nevertheless, common alignment tools such as BWA and Bowtie 2 offer a reliable balance and are often used together to reach a satisfactory consensus. Another alignment tool, BLAST, is lower in throughput but comprehensive for verification of read sections and troubleshooting of ambiguous reads.

Note that the reference genome chosen is critical in this process. Given that no reference genome is perfect, be clear and consistent about the reference being used and understand potential weaknesses associated with the chosen reference. In the case of human samples, the latest standard is GRCh38 (also called hg38), but the prior standard, GRCh37 (also called hg19), is still commonly used.

The output file from alignment is a Binary Alignment Map (BAM) file. Although alignment output may be in the form of a text-based Sequence Alignment Map (SAM) file [7], BAM files are generally preferred because they are more compact. Interfacing and manipulation of alignment files are done by scripting or with purpose-built software such as SAMtools.

Sequence alignment can be visualized by uploading BAM or SAM files to a genome browser such as Integrative Genomic Viewer (IGV), University of California Santa Cruz (UCSC): Genome Browser, or Golden Helix. Overlaps of reads in a given region (called pileups) and mismatches from the reference (which may be due to SNPs, mutations, or other errors) can be observed (Figure 5).

Pileups of short reads mapped against the human

Click image to enlarge

Figure 5. (A) Pileups of short reads mapped against the human gene FMR1-AS1, as viewed in the Integrative Genomic Viewer (IGV) from the Broad Institute. Red and blue represent reads 1 and 2, respectively, from Illumina NGS paired-end sequencing. (B) In a view of reads mapped against the CHODL gene in IGV, mismatched regions (mutations) can be quickly scanned in the browser via color-coded highlighting, in this case a C/T SNP. Further details on the reference gene are shown at the top of the viewer, including the genomic address as well as the exact bases of interest.

To build a new reference genome (or sections of a genome), or to provide higher confidence in the alignment process itself, de novo assembly may be performed prior to alignment. In this process, short and long reads with overlapping regions are combined to yield single, longer consensus reads. De novo assembly is useful particularly when large mutations or anomalies might make alignment of shorter reads impossible.

To check that prepared libraries and data processing have been successful to this point, examine the following alignment metrics.

Percentage of mapped reads: This metric represents the percentage of reads mapped to the reference sequence (Figure 6). A low number may indicate an overfragmented library or poor sequencing quality.
On-target percentage: Percentage of reads that match the genomic or transcriptomic target regions (if performing targeted sequencing). For example, a low on-target percentage may indicate overfragmentation of input nucleic acids, suboptimal adapter ligation, or poor target enrichment.
Duplicate vs. unique reads: Duplicate reads are multiple reads of the same fragment; these should be de-duplicated to unique reads to accurately measure coverage and allele fraction.
Overall coverage and depth: Coverage measures the extent and uniformity of the genomic or transcriptomic regions from a prepared library that are captured in the sequencing data. Low coverage may be a result of suboptimal fragmentation, size selection, and/or library amplification [8]. Depth is the number of unique reads that were sequenced at a given location (Figure 7). Greater depth means greater redundancy and thus more confidence in the data, as well as a higher likelihood of capturing low–allele fraction variants.

Figure 6. Comparison of percentage of mapped reads from libraries prepared with four different kits.

Figure 6. Comparison of percentage of mapped reads from libraries prepared with four different kits.

Mapped read depth: Coverage histogram plots the total number of mapped reads (x-axis) versus number of reference bases (y-axis). A high-quality sequencing run often results in a Poisson-like distribution with a small interquartile range (IQR), representing more uniform sequence coverage (Figure 8).

Figure 8. Coverage histograms and interquartile range (IQR). The IQR represents the range of the middle 50% of the sequencing coverage histogram; a smaller IQR indicates more uniform coverage and reliable results.

For alignment of RNA-Seq data, tools such as Bowtie and BWA can be used for a first pass because many reads will likely not span exon–exon boundaries. RNA-Seq reads across splice sites, however, should be mapped with reference transcript sequences using transcriptome alignment tools such as TopHat and STAR [9,10]. In cases of incomplete or missing reference transcriptome data, de novo spliced alignment can be performed using TopHat, Cufflinks, or other open-source tools. The de novo alignment process may also allow identification of novel transcripts and new splice isoforms.

Top

c. Mutation calling

Once alignment is complete, BAM or SAM files are fed into multiple off-the-shelf tools to collate lists of mutations. These mutation lists are most commonly reported in a Variant Call Format (VCF) file. Note that poorly mapped—as defined by MAPping Quality (MAPQ) values in the SAM file, or based on the Phred scale—and unmappable reads should be removed prior to generating VCF files.

Genome Analysis Toolkit (GATK) is one of the most advanced general-purpose variant callers available. GATK serves as a good starting point for highlighting single-nucleotide variants (SNVs), multiple-nucleotide variants (MNVs), insertion deletions (indels), etc. Other alignment tools such as SAMtools and Mpileup should also be used to cross-reference the results, as each tool has strengths and weaknesses in finding specific types of mutations and variant allele fractions (VAFs).

More complicated genetic variations will often require specialized mutation callers and pipelines, such as using an off-the-shelf tool in tandem with some custom scripting. Examples of such genetic variations include structural rearrangements (e.g., gene fusions), chromosomal abnormalities, and copy number variations (CNVs). Their mutation reporting often does not fit within the context of VCF files and needs their own custom format. For instance, detection of gene fusion may occur as a supplementary step to a de novo alignment in RNA-Seq because adjacent exons in a given read are mapped to different genes (i.e., chimeric reads). Such mappings can be tabulated in a purpose-built tab-delimited (TSV) file for further investigation during tertiary analysis.

Note that understanding the nature of a mutation or other anomaly of the sample is critical in this step. Sample origin, library preparation technique, challenges in sequence alignment, and allele fraction can all impact calling mutations in a sample.

Top

d. Gene expression analysis

One of the goals of RNA sequencing is gene expression analysis. For screening, researchers may look for changes in genes of interest, e.g., those expressed at high or low levels after certain treatment. In some cases, researchers may investigate differential expression of genes by comparing their relative levels in the same sample at different time points (temporal), or in samples of different origins at a certain time point (spatial).

Regardless of the research goal, gene/transcript counting and normalization are two common steps in expression analysis.

Gene/transcript counting: This step organizes and tallies all reads that are mapped to a specific gene or transcript as an initial representation of expression level. Many off-the-shelf tools exist for gene counting, and some of them, such as Cufflinks, can also perform other RNA-Seq analysis steps [11,12].
Normalization: Raw gene count data are usually inadequate for accurate analysis of differential expression. Sample type, transcript length, library preparation, and sequencing bias can influence the total number of reads reported in a cleaned FASTQ file. Therefore, the data are usually normalized by one of the following strategies.
- Using control genes that have been experimentally shown to be expressed reliably in the sample of interest
- Spiking in known quantities of nucleic acid (exogenous control) into the sample [13]
- Using overall read data

Table 2. Normalized metrics commonly used in gene expression analysis [14].

Metric	Calculation	Analysis
Reads per kilobase of transcript per million mapped reads (RPKM)	Divides total mapped reads (counts in millions, for depth) by transcript length (in kilobases)	Normalizes counts from shorter and longer transcripts Suitable for single-end sequencing
Fragments per kilobase of transcript per million mapped reads (FPKM)	Counts the number of cDNA fragments corresponding to transcript length (in kilobases), per total mapped reads (in millions)	Avoids double counting of two reads that are mapped to the same fragment Designed for paired-end sequencing
Transcripts per million mapped reads (TPM)	Divides total mapped reads by transcript length (in kilobases), to obtain reads per kilobase (RPK). Sum of RPK values of all transcripts in a sample is then expressed per (i.e., divided by) million—known as a scaling factor. TPM is RPK of each transcript, divided by the scaling factor.	Helps compare RNA-Seq data from two different samples

An alternative to FPKM and TPM described in Table 2, DESeq2 has become a popular program to identify differential gene expression among samples because it provides more quantitative analysis of comparative RNA-Seq data from high-throughput sequencing [15].

Top

4. Tertiary analysis

Tertiary analysis is perhaps the most subjective step in the NGS data analysis workflow. This step includes annotation of genes and transcripts; interpretation of sequencing data; and prediction of mutational effects on signaling pathways, proteins, and phenotypes. Therefore, experimental hypotheses (e.g., hypothesizing which mutations and/or genetic features are of interest), the statistical confidence of data obtained, and the availability and quality of databases on studied genes and mutations can all impact tertiary analysis.

a. Mutation, gene, and transcript annotation

For the organism being studied, databases may be available for annotating mutations in the VCF file. There are also databases for known and predicted transcripts and common SNPs. One example is Ensembl, which provides basic transcript information and prediction of mutational effects on proteins for multiple organisms. Ensembl can operate as an online tool as well as a downloaded program for direct integration into a data analysis pipeline.

Annotating RNA transcripts is often more challenging because multiple transcripts (e.g., coding, noncoding, different start/stop sites, splice variants) may stem from the same genomic sequence. In some cases, library fragments that are sequenced may not cover all exons of a transcript.

In reporting mutations or sequence variants, it is critical to be clear and consistent on the position and nature of mutated nucleotides being annotated. One of the widely accepted recommendations for identifying and recording variants is a nomenclature proposed by the Human Genome Variation Society (HGVS) [16].

After identification, mutations and other variants should be prioritized based on:

Validity of, or confidence in, the mutation calls
Significance of the mutation or genetic feature to the research goal
Potential impact of the mutation

Despite quality control steps taken in primary and secondary analyses, questionable NGS data that lead to false-positive or false-negative mutation calls cannot be avoided. These scenarios arise especially when quality scores are close to their prescribed cutoff or when the allele fraction of a mutation is approaching that of the known limitation of errors introduced by library preparation and sequencing processes [17,18].

In preparing a final report, a general recommendation is to rank or categorize mutations to guide appropriate actions to take based on the sequencing data (Table 3). Also, consult sorting and ranking processes reported in the literature [19], because the method chosen depends on the types of genetic features and phenotypes of interest, as well as the actions expected from the NGS report.

Table 3. Mutation ranking.

Categories	Scenarios
High confidence	High mapping quality High allele fraction (above detection threshold) High reproducibility (based on previous reports, experiments, predictions, etc.)
Uncertain	Borderline quality metrics Ambiguous implications
Questionable but potentially impactful	Low allele fraction (below detection threshold) High implications

b. Differential expression

Tertiary analysis for RNA-Seq often includes visualization of advanced statistical analysis of the normalized expression metrics determined from the secondary analysis. Tools such as Cufflinks are used for comparison of gene expression across multiple libraries. Different models may be applied to detect systematic biases, e.g., skewed data from highly expressed genes. Different probability distributions, like Poisson vs. binomial, may be applied to quantify reproducibility and confidence levels [14].