Input data: the gene list, i.e. the dataset you want to analyze. This is typically a table of
gene identifiers and their respective effect sizes that indicate an association with some
experimental condition (e.g. summary statistics from an OMICS-based study). See
"Expected file format for your dataset" below, which also contains a link to an example
dataset/gene list
Select the gene set database to perform enrichment testing on (e.g. the Gene Ontology
database)
Optionally, adjust the GOAT online settings
Run the analysis by clicking "START" (in the "GOAT analysis" section of the tool)
Download the results: an Excel table with all gene set statistics and a text file with a
Methods description for your paper. See the
GOAT online output files section below for a detailed description
Use the interactive data analysis tools to visualize/interpret the results
File format: CSV, TSV, or Excel (.xlsx file, with the data on the first sheet). The old
.xls Excel format is not supported; only .xlsx files work with GOAT online.
Required columns (column names must match exactly)
gene: Human Entrez (NCBI) gene identifiers (integer values)
symbol: gene symbol (at least 2 characters)
effectsize: effect size or log2 foldchange (numeric/decimal values)
pvalue: gene p-values (numeric/decimal values). Use the unadjusted p-values: after
multiple comparisons adjustment there typically are many more ties among the p-values (e.g.
adjusted p-values set to 1), which causes a loss of information
signif: was the adjusted p-value significant? This column should contain boolean
values (true and false, or 0 and 1). Here one should use the adjusted p-values! While this
information is NOT used by the GOAT algorithm to identify significant gene sets, flagging
genes that are significant in your gene list/dataset yields useful information for
the downstream interpretation of your data. For example, in the GOAT result tables you
can see for each gene set how many (and which) significant genes are present.
Missing/empty values are not allowed, except for the 'gene' column; rows where this column
is empty will be ignored/skipped.
Duplicate entries for genes, i.e. multiple rows with the same value in the 'gene' column,
are reduced to a single row: the row with the lowest p-value (or, if there is no p-value
column, the row with the highest absolute effect size).
While you can upload gene lists that have only an 'effectsize' or only a 'pvalue' column, it
is recommended to include both whenever this information is available in your dataset,
because both sources of information can be used when sorting/ranking the gene list (e.g.
when sorting by p-value, effect sizes can be used to break ties, especially for genes
with p-value=1).
Importantly, only Human Entrez (NCBI) gene IDs are supported for now (i.e. values in the
'gene' column).
We provide a gene ID mapping tool (available through the menu at the top of this screen) to easily add
Entrez gene IDs to your gene lists in case these only contain gene symbols.
Example gene list: click here to download
the Wingo et al. 2020 dataset in Excel format (PMID:32424284). You may use this as an example for preparing your gene list in a format compatible with
GOAT.
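The row-level rules above (skip rows without a gene ID, keep one row per gene, prefer the lowest p-value) can be checked before uploading. The following is a minimal Python sketch of such a pre-check, not the code that GOAT online itself runs; the function name clean_gene_list and the example values are hypothetical, and it assumes that both the 'effectsize' and 'pvalue' columns are present.

```python
REQUIRED_COLUMNS = ("gene", "symbol", "effectsize", "pvalue", "signif")

def clean_gene_list(rows):
    """Apply the gene list rules to a list of dict rows (hypothetical helper)."""
    best = {}
    for row in rows:
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"missing required columns: {missing}")
        if row["gene"] in (None, ""):
            continue  # rows with an empty 'gene' value are skipped
        record = {
            "gene": int(row["gene"]),                # Entrez IDs are integers
            "symbol": str(row["symbol"]),
            "effectsize": float(row["effectsize"]),
            "pvalue": float(row["pvalue"]),
            "signif": str(row["signif"]).strip().lower() in ("true", "1"),
        }
        if len(record["symbol"]) < 2:
            raise ValueError(f"symbol too short for gene {record['gene']}")
        previous = best.get(record["gene"])
        # Duplicate gene IDs: keep the row with the lowest p-value.
        if previous is None or record["pvalue"] < previous["pvalue"]:
            best[record["gene"]] = record
    return list(best.values())

# Illustrative input: the statistics are made up for this example.
rows = [
    {"gene": "348", "symbol": "APOE", "effectsize": "0.8", "pvalue": "0.001", "signif": "true"},
    {"gene": "348", "symbol": "APOE", "effectsize": "0.5", "pvalue": "0.04", "signif": "false"},
    {"gene": "", "symbol": "UNMAPPED", "effectsize": "1.2", "pvalue": "0.2", "signif": "false"},
    {"gene": "7157", "symbol": "TP53", "effectsize": "-0.3", "pvalue": "0.5", "signif": "0"},
]
cleaned = clean_gene_list(rows)
```

Here the second APOE row and the row without a gene ID are dropped, leaving two rows. In practice the rows would come from your own CSV, TSV, or .xlsx file.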
Using a custom gene set database
A gene set database prepared in the generic GMT file format can be easily imported into this
tool. Any compatible GMT file that is stored on your computer can be used with GOAT online by
using the "upload GMT file" button in the "Gene sets" section of the GOAT online tool.
GOAT online only works when the input gene list and gene set database use Entrez Gene
identifiers, so it is crucial to select the appropriate gene format when downloading a gene set
database.
The MSigDB collection website contains a large number of gene set databases that can be
downloaded in GMT format and subsequently used in GOAT online;
click here to visit their website. To retrieve files compatible with the GOAT online tool, make sure to download via the links that
are labeled "NCBI (Entrez) Gene IDs" (i.e. GMT files that contain collections of
gene sets, with human gene identifiers in NCBI Entrez format).
Example gene set databases that can be obtained from MSigDB (there are many more!):
KEGG_MEDICUS: Canonical Pathways gene sets derived from the KEGG MEDICUS pathway database
HPO: Gene sets derived from the Human Phenotype ontology
CP: Canonical pathways. For example, the downloaded file would have a name similar to
"c2.cp.v2023.2.Hs.entrez.gmt"
C3: regulatory target gene sets. Gene sets representing potential targets of regulation by
transcription factors or microRNAs
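Before uploading a GMT file, it can help to confirm that it really contains Entrez gene IDs rather than gene symbols. The sketch below is a minimal Python example, assuming the usual GMT layout of one gene set per line with tab-separated fields (gene set name, description, then the gene identifiers). The function name parse_gmt and the gene set names are hypothetical.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into {set name: {"description", "genes"}}."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # a gene set needs a name, a description and at least one gene
        name, description = fields[0], fields[1]
        genes = [g for g in fields[2:] if g]
        # GOAT online requires Entrez gene IDs, which are plain integers.
        if not all(g.isdigit() for g in genes):
            raise ValueError(f"gene set {name!r} contains non-integer gene IDs")
        gene_sets[name] = {
            "description": description,
            "genes": sorted({int(g) for g in genes}),  # unique genes only
        }
    return gene_sets

# Illustrative GMT content with two hypothetical gene sets.
example = "SET_A\thttp://example.org/a\t348\t7157\nSET_B\tna\t672\t675\t348\n"
parsed = parse_gmt(example)
```

A file that uses gene symbols (e.g. APOE instead of 348) raises a ValueError in this sketch, which signals that a different download link should be used.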
GOAT online output files
After the GOAT analysis is done, you can download the results from the "Result summary" section
of the GOAT online tool.
The downloaded ZIP file contains a plain text file that details the exact settings used in GOAT
online and includes a paragraph that can be used as Methods text in your manuscript (i.e. GOAT
online version, citation, gene set database that was used, settings for multiple testing
correction, etcetera).
The download also includes an Excel table that describes all tested gene sets and their
GOAT-estimated p-values. These results can also be browsed in the online tool: click on "TABLE"
in the "Data analysis" section.
The following list details all columns that are provided in the Excel table. For data that is
also shown in the "TABLE" section of the online tool, the respective column names are shown
between brackets:
source: for each gene set this describes the source/domain/classification that was used to
categorize the gene set in the input gene set database. For example, in the GO database
possible values for "source" are GO_CC, GO_BP, GO_MF describing the respective ontology
domains Cellular Component, Biological Process, Molecular Function.
id: the gene set identifier provided by the input gene set database
name [gene set name]: the gene set "name" provided by the input gene set database
ngenes_input: number of (unique) genes in the gene set as provided in the input gene set
database
ngenes [#genes]: number of genes that overlap between the input gene set and your gene list
ngenes_signif [#signif]: number of genes that overlap between the input gene set and those
genes that are flagged as 'signif' in your gene list
genes: gene symbols for the top 250 gene constituents (sorted by absolute effect size or
p-value, whichever was used as 'score_type'). The symbols are retrieved from your input gene
list.
genes_signif: gene symbols for the top 250 gene constituents that are flagged as 'signif' in
your input gene list (sorted by absolute effect size or p-value, whichever was used as
'score_type'). The symbols are retrieved from your input gene list.
zscore: a standardized z-score computed from the gene set p-value and, if tested, the
effect size direction (up/down). We return standardized z-scores because the GOAT gene
set score (mean of gene scores) is relative to the respective gene set-size-matched null
distribution (a skewed normal). In contrast, the standardized z-scores are comparable
between gene sets (as are the p-values).
Negative z-scores are returned for gene sets with score_type 'effectsize_down', i.e.
gene sets that were explicitly tested for overrepresentation of negative effect size values.
Note that the 'effectsize_abs' and 'pvalue' score types are agnostic to up/down regulation, so z-scores
are always positive in those cases.
pvalue [p-value]: GOAT-estimated gene set p-value (as-is)
pvalue_adjust [adj. p-value]: GOAT-estimated gene set p-value, after multiple testing
corrections (depending on your chosen settings)
signif [signif]: boolean value indicating whether the gene set was significant under your
specified criteria (type of correction and cutoff value)
score_type [score type]: the values shown here depend on the "gene score type" option that
you selected in the "GOAT analysis" section of the web tool. If gene sets were tested by
type "effectsize", GOAT tests for each gene set whether enrichment is strongest in
up- or down-regulation. For gene sets indicated as "effectsize_up", testing for enrichment in
positive gene effect sizes yielded a lower (stronger) p-value than testing in
the direction of negative gene effect sizes. Gene sets indicated as "effectsize_down" are
the opposite; their gene constituents are more enriched for negative effect sizes. If
any "gene score type" other than "effectsize" was selected, this value defaults to the
configured "gene score type". Note that in the online tool, "effectsize" is abbreviated to
"ES".
Description of the GOAT algorithm
In brief, the Gene set Ordinal Association Test (GOAT) is a parameter-free permutation-based
algorithm for gene set enrichment analysis. It is easy to use via the online web tool or R
package, computationally efficient, and the resulting gene set p-values are well calibrated
under the null hypothesis and invariant to gene set size. Application to various real-world
proteomics and gene expression studies demonstrates that GOAT consistently identifies more
significant Gene Ontology terms as compared to alternative methods.
Required input is a list of genes and their respective test statistics (p-value/effect
size), and a gene set database obtained from GO or alternative resources.
Test statistics from the gene list are transformed to gene scores by rank(-pvalue)^2 or
rank(effect size)^2 depending on user input, i.e. smaller p-values translate to higher gene
scores. The result is a skewed gene score distribution.
For each gene set size N (number of genes), bootstrapping procedures generate a null
distribution of gene set scores. This yields a skew-normal distribution for small gene sets
and converges to a normal distribution for large gene sets.
Gene set significance is determined for each gene set by comparing its score (mean of
respective gene scores) against a null distribution of the same size (N).
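The steps above can be illustrated with a small Python sketch. It is a simplified illustration and not the GOAT implementation: it assumes the null distribution is built by repeatedly drawing random gene sets of size N from the gene list, and it uses an empirical p-value instead of a fitted skew-normal distribution. All function names, the toy gene IDs and the toy p-values are hypothetical.

```python
import random
from statistics import mean

def gene_scores_from_pvalues(pvalues):
    """Score each gene as rank(-pvalue)^2, so the smallest p-value scores highest."""
    # Sort from largest to smallest p-value; ties keep input order (a simplification).
    ordered = sorted(pvalues, key=lambda gene: -pvalues[gene])
    return {gene: rank ** 2 for rank, gene in enumerate(ordered, start=1)}

def null_gene_set_scores(scores, set_size, n_iter=2000, seed=1):
    """Mean gene score of n_iter random gene sets of the given size."""
    rng = random.Random(seed)
    values = list(scores.values())
    return [mean(rng.sample(values, set_size)) for _ in range(n_iter)]

def gene_set_pvalue(scores, gene_set, null):
    """Empirical p-value: fraction of null scores at least as high as observed."""
    observed = mean(scores[g] for g in gene_set if g in scores)
    exceed = sum(1 for x in null if x >= observed)
    return (exceed + 1) / (len(null) + 1)

# Toy gene list: 200 hypothetical gene IDs with evenly spaced p-values.
pvalues = {gene: gene / 200 for gene in range(1, 201)}
scores = gene_scores_from_pvalues(pvalues)

# One null distribution per gene set size; both example sets contain 10 genes.
null_10 = null_gene_set_scores(scores, set_size=10)

top_set = list(range(1, 11))    # the 10 genes with the smallest p-values
mid_set = list(range(96, 106))  # 10 genes from the middle of the ranking
p_top = gene_set_pvalue(scores, top_set, null_10)
p_mid = gene_set_pvalue(scores, mid_set, null_10)
```

In this toy example the gene set made up of top-ranked genes receives a small p-value, whereas the set drawn from the middle of the ranking does not. Because the null distribution is matched to the gene set size, gene sets of different sizes are each compared against their own reference.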
Frequently Asked Questions (FAQ)
What if I want to obtain the significant gene sets that also contain at least N "significant
genes" from my input gene list/dataset?
The output table from this tool contains a column "ngenes_signif", representing the number
of genes that overlap between the gene set and those genes that are flagged as 'signif'
in your gene list. You can use the information in this column for post-hoc filtering.
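Such post-hoc filtering could look like the minimal Python sketch below. It assumes the result table has already been loaded as a list of rows that use the column names from the output table (signif, ngenes_signif); the function name filter_gene_sets and the example rows are hypothetical.

```python
def filter_gene_sets(results, min_signif_genes):
    """Keep significant gene sets with at least min_signif_genes 'signif' genes."""
    return [
        row for row in results
        if row["signif"] and row["ngenes_signif"] >= min_signif_genes
    ]

# Illustrative rows: identifiers and numbers are made up for this example.
results = [
    {"id": "SET_A", "signif": True, "ngenes": 40, "ngenes_signif": 6},
    {"id": "SET_B", "signif": True, "ngenes": 25, "ngenes_signif": 1},
    {"id": "SET_C", "signif": False, "ngenes": 60, "ngenes_signif": 9},
]
kept = filter_gene_sets(results, min_signif_genes=3)
kept_ids = [row["id"] for row in kept]
```

Only SET_A passes here: SET_B is significant but contains too few significant genes, and SET_C contains enough significant genes but is not itself significant.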
How do we future-proof this tool, i.e. prevent it from growing stale/outdated and keep it
online?
The website is implemented as a fully "static" HTML + Javascript website, so hosting the
website is trivial; we use GitHub Pages to host it. As long as GitHub is online, so is this
tool.
To ensure access to recent Gene Ontology database data, we are setting up an automated
workflow that imports the latest GO database biannually using GitHub Actions (i.e. this does not
require any manual updates/edits to this website). In the next major website update, you will be
able to select from available GO database versions/snapshots. The current GO version available
in GOAT online is 2024-01-01.
Further, you can always import any gene set collection in GMT format into this web tool.
Does the web tool yield the same gene set p-values as the R package?
Yes. We validated that the gene set p-values computed by GOAT online and the GOAT R
package are the same across all OMICS-based datasets that are described in the GOAT manuscript.
Privacy
All analyses are performed locally on your computer using client-side Javascript code.
Your gene list and all analyses thereof remain private; your data does not leave your computer.
We do count the number of times this tool is used, anonymously, to gauge its popularity.
Logo
the GOAT logo shown on this website is borrowed from the open-source Noto Emoji Font version
14.0
Glossary
GOAT: Gene set Ordinal Association Test. A parameter-free algorithm for gene set enrichment
analysis of preranked gene lists.
Gene list: A preranked gene list is here defined as a table of gene identifiers and their
respective effect sizes and/or p-values that indicate association with some experimental
condition (e.g. summary statistics from an OMICS-based study)
Gene set: A gene set can be any set of genes of interest; it is typically defined as a set
of genes that are known members of the same biological pathway, localized to the same
(sub)cellular compartment, co-expressed under certain conditions or associated with some
disorder as defined in a gene set database such as GO. In the GOAT R package (and online
tool) one tests for enrichment of top-ranked genes in the input gene list against each gene
set from some collection/database
GO: the Gene Ontology database. "The Gene Ontology (GO) knowledgebase is the world’s largest source of information on the
functions of genes. This knowledge is both human-readable and machine-readable, and is a
foundation for computational analysis of large-scale molecular biology and genetics
experiments in biomedical research."
Reference: www.geneontology.org
SynGO: the Synaptic Gene Ontology database. "An evidence-based, expert-curated resource for synapse function and gene enrichment
studies"
Reference: www.syngoportal.org
KEGG: Kyoto Encyclopedia of Genes and Genomes. "KEGG is a database resource for understanding high-level functions and utilities of the
biological system, such as the cell, the organism and the ecosystem, from molecular-level
information, especially large-scale molecular datasets generated by genome sequencing and
other high-throughput experimental technologies."
Reference: www.genome.jp/kegg/
Note that the easiest way to use KEGG pathways in GOAT online is to download the respective
GMT files from MSigDB, as detailed in the "Using a custom gene set database" section of this
page.
M&M / Methods: The Materials and Methods section of a (scientific) manuscript, where one
would typically detail exactly how GOAT was used to generate presented results. Note that
results from GOAT include ready-made text that can be used for this.