Documentation

Citation

Koopmans, F. (2024). GOAT: efficient and robust identification of gene set enrichment. Communications Biology, 7(1), 744.
https://doi.org/10.1038/s42003-024-06454-5 . PMID: 38898151

How do I use this tool?

Overview of the GOAT online workflow
  1. Input data: the gene list, i.e. the dataset you want to analyze. This is typically a table of gene identifiers and their respective effect sizes that indicate an association with some experimental condition (e.g. summary statistics from an OMICS-based study).
    See below for the Expected file format for your dataset (also contains a link to an example dataset/gene list)
  2. Select the gene set database to perform enrichment testing on (e.g. the Gene Ontology database)
  3. Optionally, adjust the GOAT online settings
  4. Run the analysis by clicking "START" (in the "GOAT analysis" section of the tool)
  5. Download the results: an Excel table with all gene set statistics and a text file with a Methods description for your paper.
    See the GOAT online output files section below for a detailed description
  6. Use interactive data analysis tools to visualize/interpret results.
There is a Glossary at the bottom of this page.

Can I use the GOAT algorithm programmatically?
Yes! We also provide an R package; click here to go to the GitHub page

Expected file format for the input gene list:

  • File format: either CSV, TSV, or Excel (.xlsx file, data on the first sheet). Note that for Excel, the old .xls format is not supported, only .xlsx files work with GOAT online.
  • Required columns (column names must match exactly)
    1. gene: Human Entrez (NCBI) gene identifiers (integer values)
    2. symbol: gene symbol (at least 2 characters)
    3. effectsize: effect size or log2 foldchange (numeric/decimal values)
    4. pvalue: gene p-values (numeric/decimal values). Use the un-adjusted p-values because after multiple comparisons adjustment, there typically are many more ties (e.g. p-values set to 1) among adjusted p-values which in turn causes a loss of information
    5. signif: was the adjusted p-value significant? This column should contain boolean values (true and false, or 0 and 1). Here one should use the adjusted p-values! While this information is NOT used by the GOAT algorithm to identify significant gene sets, flagging genes that are significant in your gene list/dataset will yield useful information in the downstream interpretation of your data. For example, in the GOAT result tables, you can see for each gene set how many (and which) significant genes are present.
  • Missing/empty values are not allowed, except for the 'gene' column; rows where this column is empty will be ignored/skipped.
  • Duplicate entries for genes, i.e. multiple rows with the same value in the 'gene' column, are reduced to only 1 row; whichever has the lowest p-value (and if there is no p-value column, whichever row has the highest absolute effect size).
  • While you can upload gene lists that only have either an 'effectsize' or 'pvalue' column, it is recommended to always include both if this information is available in your dataset, because both sources of information can be used when sorting/ranking the gene list (e.g. when sorting by p-value, effect sizes can be used to break ties, especially for genes with p-value=1).
Importantly, only Human Entrez (NCBI) gene IDs are supported for now (i.e. values in the 'gene' column).
We provide a gene ID mapping tool (available through the menu on top of this screen) to easily add Entrez gene IDs to your gene lists in case these only contain gene symbols.
Example gene list: click here to download the Wingo et al. 2020 dataset in Excel format (PMID:32424284). You may use this as an example for preparing your gene list in a format compatible with GOAT.
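The format rules above can be checked before uploading. The following is a minimal sketch in Python (standard library only) that validates the required columns, skips rows with an empty 'gene' value, and keeps the row with the lowest p-value for duplicated genes. The function name clean_gene_list and the example values are hypothetical and only serve as illustration; this is not the validation code used by GOAT online.

```python
import csv
import io

REQUIRED_COLUMNS = ["gene", "symbol", "effectsize", "pvalue", "signif"]

# Made-up example values, for illustration only.
EXAMPLE_CSV = """gene,symbol,effectsize,pvalue,signif
7157,TP53,1.2,0.001,true
7157,TP53,0.8,0.04,false
672,BRCA1,-0.5,0.2,0
,UNKNOWN1,0.3,0.5,0
1956,EGFR,2.1,0.0005,1
"""


def clean_gene_list(rows):
    """Validate rows (dicts) and keep one row per gene; the lowest p-value wins."""
    best = {}
    for row in rows:
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"missing required columns: {missing}")
        gene = (row["gene"] or "").strip()
        if not gene:
            continue  # rows with an empty 'gene' value are skipped
        flag = row["signif"].strip().lower()
        if flag not in ("true", "false", "0", "1"):
            raise ValueError(f"invalid 'signif' value: {row['signif']!r}")
        record = {
            "gene": int(gene),  # Entrez gene IDs are integers
            "symbol": row["symbol"].strip(),
            "effectsize": float(row["effectsize"]),
            "pvalue": float(row["pvalue"]),
            "signif": flag in ("true", "1"),
        }
        if len(record["symbol"]) < 2:
            raise ValueError(f"symbol too short for gene {record['gene']}")
        previous = best.get(record["gene"])
        if previous is None or record["pvalue"] < previous["pvalue"]:
            best[record["gene"]] = record
    return list(best.values())


cleaned = clean_gene_list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
for record in cleaned:
    print(record)
```

In this example the second TP53 row is dropped because its p-value is higher, and the row without an Entrez gene ID is skipped.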

Using a custom gene set database

A gene set database prepared in the generic GMT file format can be easily imported into this tool. Any compatible GMT file that is stored on your computer can be used with GOAT online by using the "upload GMT file" button in the "Gene sets" section of the GOAT online tool.

GOAT online only works when the input gene list and gene set database use Entrez Gene identifiers, so it is crucial to select the appropriate gene format when downloading a gene set database.

The MSigDB collection website contains a large number of gene set databases that can be downloaded in GMT format and subsequently used in GOAT online; click here to visit their website. To retrieve files compatible with the GOAT online tool, make sure to download via links that are labeled "NCBI (Entrez) Gene IDs" (i.e. these are GMT files that contain collections of gene sets, with human gene identifiers in NCBI Entrez format).

Example gene set databases that can be obtained from MSigDB (there are many more!):
  • KEGG_MEDICUS: Canonical Pathways gene sets derived from the KEGG MEDICUS pathway database
  • HPO: Gene sets derived from the Human Phenotype ontology
  • CP: Canonical pathways. For example, the downloaded file would have a name similar to "c2.cp.v2023.2.Hs.entrez.gmt"
  • C3: regulatory target gene sets. Gene sets representing potential targets of regulation by transcription factors or microRNAs
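Before uploading a GMT file, it can help to confirm that it contains Entrez gene IDs rather than gene symbols. The sketch below assumes the generic GMT layout (one gene set per line, tab-separated: name, description, then gene identifiers). The function name parse_gmt and the example gene sets are hypothetical.

```python
def parse_gmt(text):
    """Parse GMT text into {gene set name: set of integer Entrez gene IDs}."""
    genesets = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {line_number}: expected name, description and genes")
        name, _description, *genes = fields
        try:
            ids = {int(g) for g in genes if g.strip()}
        except ValueError:
            # Non-integer identifiers suggest a gene symbol file, not Entrez IDs.
            raise ValueError(
                f"line {line_number}: non-integer gene identifier in '{name}'"
            ) from None
        genesets[name] = ids
    return genesets


# Made-up gene sets, for illustration only.
EXAMPLE_GMT = "EXAMPLE_SET_A\texample description\t7157\t672\nEXAMPLE_SET_B\tna\t1956\t7157\t672\n"

genesets = parse_gmt(EXAMPLE_GMT)
for name, ids in genesets.items():
    print(name, len(ids))
```

A file that uses gene symbols (e.g. "TP53") raises a ValueError here, which is a quick signal that the wrong MSigDB download link was used.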

GOAT online output files

After the GOAT analysis is done, you can download the results from the "Result summary" section of the GOAT online tool.

The downloaded ZIP file contains a plain text file that details the exact settings used in GOAT online and includes a paragraph that can be used as Methods text in your manuscript (i.e. GOAT online version, citation, gene set database that was used, settings for multiple testing correction, etcetera).

The download also includes an Excel table that describes all tested gene sets and their GOAT-estimated p-values. These results can also be browsed in the online tool: click on "TABLE" in the "Data analysis" section.

The following list details all columns that are provided in the Excel table. For data that is also shown in the "TABLE" section of the online tool, the respective column names are shown between brackets:
  • source: for each gene set this describes the source/domain/classification that was used to categorize the gene set in the input gene set database. For example, in the GO database possible values for "source" are GO_CC, GO_BP, GO_MF describing the respective ontology domains Cellular Component, Biological Process, Molecular Function.
  • id: the gene set identifier provided by the input gene set database
  • name [gene set name]: the gene set "name" provided by the input gene set database
  • ngenes_input: number of (unique) genes in the gene set as provided in the input gene set database
  • ngenes [#genes]: number of genes that overlap between the input gene set and your gene list
  • ngenes_signif [#signif]: number of genes that overlap between the input gene set and those genes that are flagged as 'signif' in your gene list
  • genes: gene symbols for the top 250 gene constituents (sorted by absolute effect size or p-value, whichever was used as 'score_type'). The symbols are retrieved from your input gene list.
  • genes_signif: gene symbols for the top 250 gene constituents that are flagged as 'signif' in your input gene list (sorted by absolute effect size or p-value, whichever was used as 'score_type'). The symbols are retrieved from your input gene list.
  • zscore: a standardized z-score computed from the gene set p-value and, if tested, the effect size direction (up/down). We return standardized z-scores because the raw GOAT gene set score (mean of gene scores) is only meaningful relative to the null distribution for the matching gene set size (a skewed normal). In contrast, the standardized z-scores are comparable between gene sets (as are the p-values).
    Negative z-scores are returned for gene sets with score_type 'effectsize_down', i.e. gene sets that were explicitly tested for overrepresentation of negative effect size values. Note that the 'effectsize_abs' and 'pvalue' score types are agnostic to up/down regulation, so z-scores are always positive in those cases.
  • pvalue [p-value]: GOAT-estimated gene set p-value (as-is)
  • pvalue_adjust [adj. p-value]: GOAT-estimated gene set p-value, after multiple testing corrections (depending on your chosen settings)
  • signif [signif]: boolean value indicating whether the gene set was significant under your specified criteria (type of correction and cutoff value)
  • score_type [score type]: the values shown here depend on the "gene score type" option that you selected in the "GOAT analysis" section of the web tool. If gene sets were tested by type "effectsize", then GOAT tests for each gene set whether enrichment is strongest in up- or down-regulation. For gene sets marked "effectsize_up", testing for enrichment in positive gene effect sizes yielded a lower (stronger) p-value than testing in the direction of negative gene effect sizes. Gene sets marked "effectsize_down" are the opposite; their gene constituents are more enriched for negative effect sizes. If any "gene score type" other than "effectsize" was selected, this value defaults to the configured "gene score type". Note that in the online tool, "effectsize" is abbreviated to "ES".
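To illustrate how the zscore and score_type columns relate, the sketch below converts a p-value to a signed z-score. The exact formula used by GOAT is not stated in this documentation; the two-sided normal quantile used here is an assumption, chosen because it always gives a non-negative magnitude, and the function name signed_zscore is hypothetical.

```python
from statistics import NormalDist


def signed_zscore(pvalue, score_type):
    """Convert a p-value to a z-score; negative only for 'effectsize_down'.

    Assumption: two-sided conversion, |z| = |Phi^-1(p / 2)|.
    """
    if not 0.0 < pvalue <= 1.0:
        raise ValueError("pvalue must be in the interval (0, 1]")
    magnitude = abs(NormalDist().inv_cdf(pvalue / 2.0))
    sign = -1.0 if score_type == "effectsize_down" else 1.0
    return sign * magnitude


for p, score_type in [(0.05, "effectsize_up"), (0.05, "effectsize_down"), (0.001, "pvalue")]:
    print(score_type, p, round(signed_zscore(p, score_type), 3))
```

Under this assumption, two gene sets with the same p-value but opposite directions get z-scores of equal magnitude and opposite sign.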

Description of the GOAT algorithm

In brief, the Gene set Ordinal Association Test (GOAT) is a parameter-free permutation-based algorithm for gene set enrichment analysis. It is easy to use via the online web tool or R package, computationally efficient, and the resulting gene set p-values are well calibrated under the null hypothesis and invariant to gene set size. Application to various real-world proteomics and gene expression studies demonstrates that GOAT consistently identifies more significant Gene Ontology terms as compared to alternative methods.

GOAT algorithm
  1. Required input is a list of genes and their respective test statistics (p-value/effect size), and a gene set database obtained from GO or alternative resources.
  2. Test statistics from the gene list are transformed to gene scores by rank(-pvalue)^2 or rank(effect size)^2 depending on user input, i.e. smaller p-values translate to higher gene scores. The result is a skewed gene score distribution.
  3. For each gene set size N (number of genes), bootstrapping procedures generate a null distribution of gene set scores. This yields a skew-normal distribution for small gene sets and converges to a normal distribution for large gene sets.
  4. Gene set significance is determined for each gene set by comparing its score (mean of respective gene scores) against a null distribution of the same size (N).
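The steps above can be sketched in a few lines of Python. This is a simplified illustration and not the GOAT implementation: the null distribution is approximated here by repeatedly drawing random gene sets of the same size and an empirical p-value is reported, rather than using the skew-normal null distributions described above. The function names, iteration count, and toy data are assumptions made for this example.

```python
import random


def gene_scores_from_pvalues(pvalues):
    """Map {gene: p-value} to {gene: rank^2}; the smallest p-value gets the highest rank.

    Ties are ranked in arbitrary (input) order in this sketch.
    """
    ordered = sorted(pvalues, key=lambda gene: pvalues[gene], reverse=True)
    return {gene: rank ** 2 for rank, gene in enumerate(ordered, start=1)}


def empirical_geneset_pvalue(scores, geneset, iterations=2000, seed=0):
    """Compare the mean gene score of a gene set against random sets of equal size."""
    members = [gene for gene in geneset if gene in scores]
    n = len(members)
    if n == 0:
        raise ValueError("gene set has no overlap with the gene list")
    observed = sum(scores[gene] for gene in members) / n
    rng = random.Random(seed)
    all_scores = list(scores.values())
    hits = 0
    for _ in range(iterations):
        null_mean = sum(rng.sample(all_scores, n)) / n
        if null_mean >= observed:
            hits += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return (hits + 1) / (iterations + 1)


# Toy gene list: gene i has p-value i / 101, so gene 1 is the most significant.
toy_pvalues = {gene: gene / 101 for gene in range(1, 101)}
toy_scores = gene_scores_from_pvalues(toy_pvalues)

p_top = empirical_geneset_pvalue(toy_scores, {1, 2, 3, 4, 5})
p_bottom = empirical_geneset_pvalue(toy_scores, {96, 97, 98, 99, 100})
print(p_top, p_bottom)
```

A gene set made up of the five most significant genes gets a small empirical p-value, while a set of the five least significant genes gets a p-value of 1.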

Frequently Asked Questions (FAQ)

What if I want to obtain the significant gene sets that also contain at least N "significant genes" from my input gene list/dataset?
The output table from this tool contains a column "ngenes_signif", representing the number of genes that overlap between the gene set and those genes that are flagged as 'signif' in your gene list. You can use the information in this column for post-hoc filtering.
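Such post-hoc filtering could look like the sketch below. It assumes the result table has already been loaded into a list of dictionaries keyed by the column names described in the "GOAT online output files" section (how you load the Excel file is up to you). The function name filter_genesets and the example rows are hypothetical.

```python
def filter_genesets(rows, min_signif_genes):
    """Keep significant gene sets that contain at least N significant genes."""
    return [
        row
        for row in rows
        if row["signif"] and int(row["ngenes_signif"]) >= min_signif_genes
    ]


# Made-up result rows, for illustration only.
example_rows = [
    {"id": "SET_A", "ngenes": 40, "ngenes_signif": 6, "signif": True},
    {"id": "SET_B", "ngenes": 25, "ngenes_signif": 1, "signif": True},
    {"id": "SET_C", "ngenes": 60, "ngenes_signif": 9, "signif": False},
]

kept = filter_genesets(example_rows, min_signif_genes=3)
print([row["id"] for row in kept])
```

Here only SET_A is kept: SET_B has too few significant genes and SET_C is not a significant gene set.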

How do we future-proof this tool? e.g. prevent it from growing stale/outdated and keep it online?
The website is implemented as a fully "static" HTML + Javascript website, so hosting the website is trivial; we use GitHub Pages to host it. As long as GitHub is online, so is this tool.
To ensure access to recent Gene Ontology database data, we are setting up an automated workflow for importing the latest GO database biannually using GitHub Actions (i.e. does not require any manual updates/edits to this website). In the next major website update, you will be able to select from available GO database versions/snapshots. The current GO version available in GOAT online is 2024-01-01.
Further, you can always import any gene set collection in GMT format into this web tool.

Does the web tool yield the same gene set p-values as the R package?
Yes. We validated that the gene set p-values computed by GOAT online and the GOAT R package are the same across all OMICS-based datasets that are described in the GOAT manuscript.

Privacy

Logo

The GOAT logo shown on this website is borrowed from the open-source Noto Emoji Font, version 14.0.

Glossary

  • GOAT: Gene set Ordinal Association Test. A parameter-free algorithm for gene set enrichment analysis of preranked gene lists.
  • Gene list: A preranked gene list is here defined as a table of gene identifiers and their respective effect sizes and/or p-values that indicate association with some experimental condition (e.g. summary statistics from an OMICS-based study)
  • Gene set: A gene set can be any set of genes of interest; it is typically defined as a set of genes that are known members of the same biological pathway, localized to the same (sub)cellular compartment, co-expressed under certain conditions or associated with some disorder as defined in a gene set database such as GO. In the GOAT R package (and online tool) one tests for enrichment of top-ranked genes in the input gene list against each gene set from some collection/database
  • GO: the Gene Ontology database. "The Gene Ontology (GO) knowledgebase is the world’s largest source of information on the functions of genes. This knowledge is both human-readable and machine-readable, and is a foundation for computational analysis of large-scale molecular biology and genetics experiments in biomedical research." Reference: www.geneontology.org
  • SynGO: the Synaptic Gene Ontology database. "An evidence-based, expert-curated resource for synapse function and gene enrichment studies" Reference: www.syngoportal.org
  • KEGG: Kyoto Encyclopedia of Genes and Genomes. "KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies." Reference: www.genome.jp/kegg/. Note that the easiest way to use KEGG pathways in GOAT online is to download respective GMT files from MSigDB, as detailed in the "Using a custom gene set database" section of this page.
  • M&M / Methods: The Materials and Methods section of a (scientific) manuscript, where one would typically detail exactly how GOAT was used to generate presented results. Note that results from GOAT include ready-made text that can be used for this.