Metabolomics Data Analysis

Turning data into knowledge

Like the other "omics" technologies, metabolomics generates large datasets. These data may contain many experimental artifacts, so sophisticated software is required for efficient, high-throughput analysis: to provide the statistical power needed to eliminate systematic bias, to identify compounds confidently, and to explore significant findings.

Metabolomics Data Analysis workflow

Metabolomics data analysis usually consists of feature extraction, compound identification, statistical analysis and interpretation. Data analysis is a significant part of the metabolomics workflow, with compound identification being the major bottleneck. This overview reviews the challenges of data analysis for metabolomics and the strategies used today to address them.

Once data acquisition is complete, spectral data pre-processing occurs through the following steps:

  1. Baseline correction removes low-frequency artifacts and differences between samples that arise from experimental and instrumental variation.
  2. Spectral alignment can happen before or after feature/compound extraction. It is one of the main processing steps in metabolomics studies involving multiple samples, because chromatographic retention time can vary from run to run.
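
As a rough illustration of the baseline correction step, the sketch below estimates a slowly varying baseline with a rolling minimum followed by a rolling mean, then subtracts it. This is only one simple approach among many; the function names, the window size and the toy chromatogram are assumptions made for the example, not part of any particular software package.

```python
def estimate_baseline(signal, half_window):
    """Estimate a slowly varying baseline: rolling minimum, then rolling mean."""
    n = len(signal)

    def window(values, i):
        # Slice of `values` centred on index i, clipped at the edges.
        return values[max(0, i - half_window): i + half_window + 1]

    rolling_min = [min(window(signal, i)) for i in range(n)]
    return [sum(window(rolling_min, i)) / len(window(rolling_min, i))
            for i in range(n)]


def correct_baseline(signal, half_window=5):
    """Subtract the estimated baseline and clip negative values to zero."""
    baseline = estimate_baseline(signal, half_window)
    return [max(0.0, s - b) for s, b in zip(signal, baseline)]


# Toy chromatogram (hypothetical): linear drift plus one narrow peak at index 25.
raw = [0.1 * i for i in range(50)]
raw[24] += 5.0
raw[25] += 10.0
raw[26] += 5.0

corrected = correct_baseline(raw)
```

The window must be wider than the peaks of interest, otherwise the rolling minimum follows the peak itself and removes real signal.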

  

Data analysis and processing

Feature extraction

This step involves finding and quantifying all the known and unknown metabolites and extracting all relevant spectral and chromatographic information from them. Peak-based algorithms are the method of choice for MS-based studies, and peaks are detected across the entire spectrum.

Once detected, related ions indicative of a single-component chromatographic peak (adducts, multiply charged ions) are identified and grouped.

Their areas are then integrated to provide a quantification of the underlying metabolite.
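
The detect-then-integrate idea described above can be sketched as follows. This is a minimal example under simplifying assumptions: peaks are taken to be local maxima above a threshold, peak boundaries are where the signal drops to a fixed floor, and the area uses the trapezoidal rule. The function names, thresholds and data are hypothetical, and ion grouping is not shown.

```python
def find_apexes(intensity, threshold):
    """Return indices of local maxima at or above the intensity threshold."""
    return [
        i for i in range(1, len(intensity) - 1)
        if intensity[i] >= threshold
        and intensity[i] > intensity[i - 1]
        and intensity[i] >= intensity[i + 1]
    ]


def integrate_peak(times, intensity, apex, floor):
    """Trapezoidal area of the peak around `apex`, bounded where signal <= floor."""
    left = apex
    while left > 0 and intensity[left] > floor:
        left -= 1
    right = apex
    while right < len(intensity) - 1 and intensity[right] > floor:
        right += 1
    return sum(
        0.5 * (intensity[i] + intensity[i + 1]) * (times[i + 1] - times[i])
        for i in range(left, right)
    )


# Toy extracted-ion chromatogram (hypothetical values).
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
intensity = [0.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0]

apexes = find_apexes(intensity, threshold=1.0)
areas = {apex: integrate_peak(times, intensity, apex, floor=0.5) for apex in apexes}
```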

Metabolite identification

Compound or metabolite identification is one of the major challenges of untargeted metabolomics research. However, this step must be performed in order to infer any biological or scientific meaning from a novel spectral peak.

Whether you use an MS reference database, MS/MS spectral library matching, or one of the many other commercial and open-source databases, several factors influence which resource to select:

  1. The number and types of compounds.
  2. The nominal or accurate mass data.
  3. The quality and curation of data.
  4. The ability to process data batches.
  5. The ability to customize databases/libraries.

MS Database Searching

When dealing with high-resolution accurate mass data (full scan MS), it is fairly common to compare the neutral molecular mass (derived from the m/z value) against MS databases such as METLIN and mzCloud. This approach provides compound candidates, but it lacks sufficient specificity for identity confirmation.
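
To make the mass search concrete, here is a small sketch that converts an observed m/z to a neutral mass and looks for candidates within a ppm tolerance. It assumes a singly protonated ion ([M+H]+) and uses a tiny hard-coded reference list in place of a real database; the function names and the 5 ppm tolerance are illustrative choices, not the interface of METLIN or mzCloud.

```python
PROTON_MASS = 1.007276  # Da; assumes a singly charged [M+H]+ ion

# Toy reference list of monoisotopic neutral masses (Da).
# A real search would query a database instead.
REFERENCE_MASSES = {
    "glucose": 180.06339,
    "fructose": 180.06339,
    "citric acid": 192.02700,
    "caffeine": 194.08038,
}


def neutral_mass_from_mz(mz):
    """Neutral mass for a singly protonated ion."""
    return mz - PROTON_MASS


def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def search_candidates(mz, tolerance_ppm=5.0):
    """Names of all reference compounds within the ppm tolerance."""
    neutral = neutral_mass_from_mz(mz)
    return sorted(
        name for name, mass in REFERENCE_MASSES.items()
        if abs(ppm_error(neutral, mass)) <= tolerance_ppm
    )


candidates = search_candidates(181.07066)
```

The query returns both glucose and fructose: isomers share an exact mass, which is why a mass match alone yields candidates rather than a confirmed identity.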

This is why isotope pattern matching is used to confirm the empirical formula. If retention time information is also included, confident compound identification can be achieved.
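
One simple way to see how an isotope pattern constrains the formula is to compare the observed M+1/M intensity ratio with the ratio expected from natural isotope abundances. The sketch below uses a first-order approximation (each atom contributes its heavy-to-light isotope ratio), covers only C, H, N and O, and uses an arbitrary tolerance; the names and the observed value are hypothetical.

```python
# Approximate natural-abundance ratios of the +1 isotope to the light isotope.
PLUS_ONE_RATIO = {"C": 0.0108, "H": 0.000115, "N": 0.00365, "O": 0.00038}


def expected_m_plus_1_ratio(formula):
    """First-order estimate of the M+1/M ratio.

    `formula` maps element symbol -> atom count. Elements missing from
    PLUS_ONE_RATIO are treated as contributing nothing (an assumption).
    """
    return sum(PLUS_ONE_RATIO.get(element, 0.0) * count
               for element, count in formula.items())


def formula_matches(formula, observed_ratio, tolerance=0.01):
    """True if the observed M+1/M ratio is within `tolerance` of the estimate."""
    return abs(expected_m_plus_1_ratio(formula) - observed_ratio) <= tolerance


glucose = {"C": 6, "H": 12, "O": 6}
carbon_rich = {"C": 10, "H": 12, "O": 3}  # same nominal mass, more carbons

observed = 0.069  # hypothetical measured M+1/M ratio
glucose_ok = formula_matches(glucose, observed)
carbon_rich_ok = formula_matches(carbon_rich, observed)
```

Because carbon dominates the M+1 peak, the ratio mainly constrains the number of carbon atoms; full isotope pattern matching also uses M+2 and the exact mass spacing.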

Such an approach works well with data acquired from either LC-MS or IC-MS analysis, where the molecular ion is left intact during full scan MS. With GC-MS using electron ionization (EI), the molecular ion is typically fragmented, so these additional approaches are generally not required: the fragment spectrum itself can be matched against EI libraries. Chemical ionization is softer and usually preserves more of the molecular ion.

MS/MS Spectral Library Matching

Fragmented molecular ions can be compared against MS/MS spectral libraries or EI libraries to generate more confident identification results. Combining retention time information with MS/MS library or EI library searching provides the highest level of confidence. The quality of the data in these libraries is critical for confident identification, as is the number of metabolite spectra they contain. Today, there are libraries that contain spectral data beyond MS/MS alone. As data are continuously added to and curated within these spectral libraries, routine peak identification will improve.
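
Library matching is commonly scored by comparing the query fragment spectrum with each reference spectrum. The sketch below uses a plain cosine similarity on binned peaks, which is one widely used scoring idea; commercial and open-source search engines apply their own, more elaborate scoring. The spectra, names and bin width here are made up for illustration.

```python
import math


def bin_spectrum(peaks, bin_width=0.01):
    """Map (m/z, intensity) pairs to {bin index: summed intensity}."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_score(query, reference, bin_width=0.01):
    """Cosine similarity between two spectra (0 = no shared peaks, 1 = identical shape)."""
    q = bin_spectrum(query, bin_width)
    r = bin_spectrum(reference, bin_width)
    dot = sum(q[b] * r[b] for b in q if b in r)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_r = math.sqrt(sum(v * v for v in r.values()))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0
    return dot / (norm_q * norm_r)


# Hypothetical spectra: lists of (m/z, intensity).
query = [(69.03, 20.0), (85.03, 100.0), (127.04, 45.0)]
library = {
    "reference A": [(69.03, 25.0), (85.03, 100.0), (127.04, 40.0)],
    "reference B": [(55.02, 100.0), (97.03, 60.0)],
}

scores = {name: cosine_score(query, spectrum) for name, spectrum in library.items()}
best_match = max(scores, key=scores.get)
```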

Mass Spectral Interpretation

If the metabolite or compound is not identified using the above approaches, a more in-depth mass spectrometry analysis can be performed, using MSn and several dissociation techniques to obtain multiple fragmentation patterns. The compound fragmentation spectra are then interpreted to propose a rational structure. This is a time-consuming process.

Two approaches exist:

De novo interpretation. Without using any prior knowledge, a chemical structure is reconstructed based on its fragmentation data.

Structure correlation. MS/MS spectra are correlated with a list of searched database structures using their calculated molecular formulae.

Metabolomics statistical analysis

Metabolomics samples are typically complex and there are many interactions between metabolites and biological states. To uncover significant differences, univariate and multivariate statistical analyses (chemometric methods) use the abundance relationships between the different metabolomics components. Visualization tools to interact more productively with the data are also an integral part of this process.

1) Univariate methods (the most common statistical approach) analyze metabolomics features separately. Their main advantage is ease of use and interpretation. There are several univariate methods for metabolomics. When assessing differences between two or more groups, parametric tests such as Student’s t-test and ANOVA (analysis of variance) are commonly used, often alongside box-and-whisker plots to visualize the group distributions.

The disadvantage is that this approach doesn’t take into account interactions between the different metabolic features (correlations between metabolites from the same pathway) or metadata such as diet and gender, which increases the probability of obtaining false positive or false negative results.
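
A per-feature test of the kind described above can be sketched as follows: each feature is tested on its own with a two-sample Student's t statistic (pooled variance). The feature names and abundances are hypothetical. Turning the statistic into a p-value needs the t distribution, which the Python standard library does not provide, so that step is left out here.

```python
import math
from statistics import mean, variance


def student_t(group_a, group_b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, df


# Hypothetical abundances: feature -> (control group, treated group).
features = {
    "feature_1": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "feature_2": ([5.0, 6.0, 7.0], [5.5, 6.5, 6.0]),
}

# Each feature is analyzed independently of all the others.
results = {name: student_t(a, b) for name, (a, b) in features.items()}
```

Here feature_1 shows a clear group difference, while feature_2 has identical group means and a t statistic of zero.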

2) Multivariate methods analyze metabolomics features simultaneously and can identify relationship patterns between them. There are two groups of pattern-recognition methods: unsupervised and supervised.

Unsupervised methods are an effective way to detect patterns that are correlated with experimental or biological variables. Similarity patterns within the data are identified without taking into account the type or class of the study samples. Principal component analysis (PCA) is a common example.
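
To show what PCA does without any class labels, the sketch below finds the first principal component of a small data matrix by power iteration on the covariance matrix and projects the samples onto it. It is a teaching sketch in plain Python with hypothetical data; in practice a numerical library would be used, and further components would be extracted as well.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    `rows` is a list of samples, each a list of feature values.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]

    # Sample covariance matrix of the features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]

    # Power iteration converges to the dominant eigenvector of the covariance.
    direction = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        direction = [x / norm for x in product]

    scores = [sum(c[j] * direction[j] for j in range(p)) for c in centered]
    return direction, scores


# Hypothetical data: the second feature is always twice the first.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(samples)
```

Sample labels never enter the calculation; any grouping seen in the scores comes from the data alone.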

Supervised methods take sample labels into account to identify features that are associated with a phenotype of interest, and down-weight variance that is unrelated to it. These methods are also the basis for building prediction models. Partial least squares (PLS) is one of the most widely used supervised methods in metabolomics.

Figure: PLS-DA model of the decomposition data. A supervised multivariate analysis that collapses high-dimensional data (e.g. a large number of metabolites with varying intensities) into a few components that capture most of the relevant variance in the dataset. In this case the X axis is component 1 and the Y axis is component 2. Note that the samples cluster appropriately: each group clusters together and T0 is distinctly separated from the other groups.
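
For contrast with PCA, the following sketch fits a single-component PLS model to a 0/1 class label, which is the basic idea behind PLS-DA. The weight vector is the direction in feature space with the largest covariance with the label, so the labels steer the projection. This is a simplified, one-component illustration with hypothetical data and function names, not a full PLS-DA implementation.

```python
import math


def fit_pls1_one_component(X, y):
    """Fit one PLS component for a single response y (e.g. a 0/1 class label)."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [value - y_mean for value in y]

    # Weights: covariance of each feature with the label, normalized.
    weights = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        raise ValueError("no feature covaries with the label")
    weights = [w / norm for w in weights]

    # Sample scores on the component, and regression of y on those scores.
    scores = [sum(Xc[i][j] * weights[j] for j in range(p)) for i in range(n)]
    q = sum(t * v for t, v in zip(scores, yc)) / sum(t * t for t in scores)
    return {"x_means": x_means, "y_mean": y_mean, "weights": weights, "q": q}


def predict(model, row):
    """Predicted response for a new sample; >= 0.5 is read as class 1."""
    score = sum((row[j] - model["x_means"][j]) * model["weights"][j]
                for j in range(len(row)))
    return model["y_mean"] + model["q"] * score


# Hypothetical data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 3.0], [8.0, 4.0], [9.0, 4.0]]
y = [0.0, 0.0, 1.0, 1.0]

model = fit_pls1_one_component(X, y)
predicted_high = predict(model, [7.0, 4.0])
predicted_low = predict(model, [2.0, 4.0])
```

The fitted weights put essentially all the emphasis on feature 0, the only feature that covaries with the class label.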


Metabolomics data interpretation

Depending on the specific objective of the analysis (untargeted metabolomics, targeted metabolomics or data manipulation), most metabolomics analyses can also be classed as information/insights, discrimination and/or prediction.

Interpretation: Information, Prediction, Discrimination

Information/Insights: This approach harnesses data to provide insights for the next experiments in basic research, such as the discovery of pathways, novel compounds and biomarkers, a better understanding of metabolism, or information used to create databases and libraries.

Discrimination: The data are used to analyze differences between sample populations without necessarily creating statistical models or evaluating possible pathways that may explain such differences. Examples include the classification of wine by grape variety and production area. Multivariate analyses such as PCA are applied here to separate the sample groups.

Figure: Principal component analysis clearly shows that the Grenache ECR and the Grenache HighHill are different from the Fatman, Little Boy and WindMill, as well as from each other.

Prediction: Data from metabolite profiles and abundances are used to create a statistical model, typically using partial least squares (PLS), that predicts the class membership of unknown samples. This is usually done after prior analysis of the abundance profiles of features in samples with known class memberships.

Sample Class Prediction provides a robust way to determine quality in food and beverages and can be used in a production QC environment or in life science research to predict risk of disease in healthy patients.


Metabolomics pathway analysis

There are several ways of interpreting the data once metabolites have been identified. The choice goes back to the goal set in the experimental design: putative biomarker discovery, fingerprinting, or mapping pathways to understand metabolism.

The biological knowledge available for metabolomics studies has been increasing continuously. Groups of metabolites that are related to the same biological process have been mapped to metabolic pathways, and many biological databases are available, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and MetaCyc.

Database | Description | Website | Reference
Kyoto Encyclopedia of Genes and Genomes (KEGG) | 466 pathways, 17,333 metabolites, and 9,764 biochemical reactions | http://www.genome.jp/kegg/ | Kanehisa et al. (2012)
MetaCyc | 2,260 pathways from 2,600 different organisms | http://metacyc.org/ | Caspi et al. (2008)
The small molecule pathway database (SMPDB) | 1,594 metabolites mapping 727 small molecule pathways found in humans | http://www.smpdb.ca/ | Jewison et al. (2014)
WikiPathways | 1,910 pathways | http://wikipathways.org/ | Kelder et al. (2012)
Plant metabolic network (PMN/PlantCyc) | Multi-species pathway database for plant metabolomics | http://www.plantcyc.org/ | Chae et al. (2014)
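
Mapping identified metabolites onto pathways amounts to a lookup against pathway membership lists. The sketch below shows the idea with a tiny hand-written table; the pathway contents are a toy subset written for this example and were not retrieved from KEGG, MetaCyc or any other database listed above.

```python
# Toy pathway membership table (hypothetical subset, for illustration only).
TOY_PATHWAYS = {
    "glycolysis": {"glucose", "glucose 6-phosphate", "pyruvate"},
    "TCA cycle": {"citrate", "succinate", "malate"},
}


def map_to_pathways(identified):
    """Return ({pathway: sorted hits}, sorted unmapped metabolites)."""
    identified = set(identified)
    hits = {
        pathway: sorted(identified & members)
        for pathway, members in TOY_PATHWAYS.items()
        if identified & members
    }
    mapped = set().union(*hits.values()) if hits else set()
    return hits, sorted(identified - mapped)


pathway_hits, unmapped = map_to_pathways(
    ["glucose", "pyruvate", "citrate", "caffeine"]
)
```

Metabolites that land in no pathway, like caffeine in this toy table, are reported separately so they are not silently dropped from the interpretation.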