Frequently Asked Questions
1. What is gene prioritisation and how can this tool be useful for me?
In attempts to determine the genetic causes of human disease, researchers are often left with a large selection of candidate genes. Linkage studies can point to a genomic region containing hundreds of genes, while high-throughput sequencing will identify a multitude of non-synonymous, potentially pathogenic variants, most of which turn out to be benign. Systematic experimental verification of every such candidate gene is infeasible because of the time and cost involved. A researcher will therefore have to decide which gene(s) are worth investigating further.
This decision can be made by manually trawling through vast amounts of literature and/or public databases and selecting the likeliest candidates based on various factors. However, such an approach is time-consuming and prone to human bias and error. Computational gene prioritisation offers a solution to this problem: it systematically analyses candidate genes against various criteria and ranks them from most to least likely to be disease-causing, in a fraction of the time it would take a researcher to perform such queries manually.
Here, candidate genes are prioritised using baseline gene expression data for various normal tissues, working under the hypothesis that the expression pattern of a disease gene in tissues affected by the disease differs from that in tissues that do not exhibit the disease phenotype.
2. How do I query?
Performing candidate gene prioritisation is straightforward.
The basic instructions are provided alongside each step as help bubbles. There are two required inputs: a set of genes to prioritise and a set of disease-affected tissues.
Gene input can be provided as a single genomic region or a set of genomic regions. Use the drop-down menu to select the chromosome and input the region as a positional interval. Alternatively, a candidate gene list may be provided. Common gene aliases, Ensembl, RefSeq or Entrez gene IDs are accepted, as well as any combination thereof.
Affected tissues can be selected from a list. Select tissues which best match the disease phenotype of interest - the accuracy with which tissues are attributed to a phenotype will greatly affect the results. For example, in the case of diabetes mellitus, one might choose to prioritise candidate gene selection based on expression in pancreas, islet cells and/or adipose.
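As a rough illustration of how a mixed gene list could be sorted by identifier type before lookup, the sketch below classifies each entry with simple patterns. The function names and the patterns are assumptions made for illustration, not the tool's actual input parser: human Ensembl gene IDs are taken to be "ENSG" followed by 11 digits, RefSeq accessions to start with a prefix such as "NM_" or "NR_", and Entrez gene IDs to be plain integers. Anything else is treated as a gene symbol or alias.

```python
import re

# Illustrative patterns only; the tool's real input parser is not documented here.
ENSEMBL_GENE = re.compile(r"^ENSG\d{11}(\.\d+)?$", re.IGNORECASE)
REFSEQ = re.compile(r"^[NX][MRP]_\d+(\.\d+)?$", re.IGNORECASE)
ENTREZ = re.compile(r"^\d+$")


def classify_identifier(token):
    """Guess the identifier type of one entry in a candidate gene list."""
    token = token.strip()
    if ENSEMBL_GENE.match(token):
        return "ensembl"
    if REFSEQ.match(token):
        return "refseq"
    if ENTREZ.match(token):
        return "entrez"
    # Fall back to treating the entry as a gene symbol or alias.
    return "alias"


def classify_gene_list(text):
    """Split a free-text gene list on commas/whitespace and classify each entry."""
    tokens = [t for t in re.split(r"[,\s]+", text) if t]
    return [(t, classify_identifier(t)) for t in tokens]
```

For example, `classify_gene_list("TP53, ENSG00000141510 7157")` labels the three entries as an alias, an Ensembl ID and an Entrez ID respectively, which shows how a list mixing identifier types could be handled in one pass.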
3. What do Array and HTS scores mean?
Scores represent combined gene expression values relative to the median gene expression values for the selected tissues. Separate scores are derived from High-Throughput Sequencing (HTS) datasets and Microarray datasets. A positive score indicates that the gene's expression is higher than the median expression of all genes within the tissues of interest, whereas a negative score indicates expression lower than that median.
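The exact formula behind the scores is not given in this FAQ. Purely to illustrate the idea of a score relative to the median, the sketch below computes, for each selected tissue, the log2 ratio of a gene's expression to the median expression of all genes in that tissue, and then averages over the tissues. The log2 ratio, the pseudocount, the averaging and all gene names and values are assumptions of this example.

```python
import math
from statistics import median


def relative_expression_score(expression, gene, tissues, pseudocount=1.0):
    """Mean log2 ratio of a gene's expression to the per-tissue median of all genes.

    `expression` maps tissue -> {gene: expression value}. The log2 ratio,
    the pseudocount and the averaging are assumptions made for this sketch.
    """
    ratios = []
    for tissue in tissues:
        values = expression[tissue]
        tissue_median = median(values.values())
        ratio = (values[gene] + pseudocount) / (tissue_median + pseudocount)
        ratios.append(math.log2(ratio))
    return sum(ratios) / len(ratios)


# Hypothetical expression values, for illustration only.
expression = {
    "pancreas": {"GENE_A": 31.0, "GENE_B": 7.0, "GENE_C": 1.0},
    "adipose": {"GENE_A": 15.0, "GENE_B": 3.0, "GENE_C": 0.0},
}
tissues = ["pancreas", "adipose"]
score_a = relative_expression_score(expression, "GENE_A", tissues)  # above median
score_b = relative_expression_score(expression, "GENE_B", tissues)  # at median
score_c = relative_expression_score(expression, "GENE_C", tissues)  # below median
```

With these made-up values, GENE_A scores positive, GENE_B scores zero and GENE_C scores negative, matching the sign convention described above.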
4. I have input a list containing gene "X", but in the results "X" links to the wrong genomic region or shows incorrect IDs/functional annotations. Why?
It is likely that gene X is ambiguous - there may be multiple distinct genes that can be referred to by that name.
By default, the prioritisation tool will attempt to disambiguate the input automatically, selecting the likeliest candidate gene,
but this may not always be your gene of interest. Select "resolve ambiguous genes manually" under additional options to manually choose the correct gene.
In rare cases it may be that this arises due to an error in our database - if you think this is the case, please report it via the "Contact" page.
5. What does the "assign lower priority to ubiquitously expressed genes" option do?
Certain housekeeping genes are expressed at a high level in most cell types.
This high expression can "mask" the real disease gene. This option, on by default, will rank genes with ubiquitous,
high expression levels lower than those with non-ubiquitous high expression within selected tissues.
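How the tool applies this penalty internally is not described here. One simple way to picture it, offered only as a sketch, is to shrink a gene's positive score in proportion to the fraction of all tissues in which that gene is highly expressed. The function names, the linear penalty and the example values below are hypothetical.

```python
def adjusted_score(score, high_fraction):
    """Shrink a positive score by the fraction of tissues with high expression.

    `high_fraction` is the share of all tissues (0 to 1) in which the gene is
    highly expressed. The linear penalty is an assumption of this sketch.
    """
    if score <= 0:
        return score  # only high expression can "mask" other genes
    return score * (1.0 - high_fraction)


def rank_genes(scores, high_fraction):
    """Return gene names ordered from highest to lowest adjusted score."""
    return sorted(
        scores,
        key=lambda gene: adjusted_score(scores[gene], high_fraction[gene]),
        reverse=True,
    )


# Hypothetical values: HOUSEKEEPING is highly expressed in almost every tissue.
scores = {"HOUSEKEEPING": 3.5, "TISSUE_SPECIFIC": 2.0, "LOW": -1.0}
high_fraction = {"HOUSEKEEPING": 0.98, "TISSUE_SPECIFIC": 0.10, "LOW": 0.0}
ranking = rank_genes(scores, high_fraction)
```

In this example the housekeeping gene has the highest raw score but drops below the tissue-specific gene once the penalty is applied.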
6. If I query mouse tissues, should I input mouse genomic regions/genes?
No. When querying expression data from mouse tissues only, or from both human and mouse tissues, input only the human genomic regions/genes of interest.
Input genes are mapped to mouse homologs automatically.
7. How do you derive mouse-human homologs?
The mouse-human homolog conversion is done using HomoloGene.
8. How is the expression data sourced?
The gene expression database is compiled from publicly available datasets from repositories such as ArrayExpress or Gene Expression Omnibus.
Curated data from databases such as Gene Expression Atlas and RNA-Seq Atlas are also included.
9. What does the "mouse weighting" slider do?
If you choose to query
both human and mouse tissues, you may choose how much expression data for mouse
tissues contributes to the final rank. As mouse tissues are different to human tissues, gene expression
patterns are also different and therefore should not be considered equally important in the context of
identifying human disease genes. However, querying mouse tissues allows insights into gene expression under
conditions that may not be accessible for human tissues (e.g. certain embryonic tissues).
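The effect of such a weighting can be pictured as a weighted average of the human-derived and mouse-derived scores. The sketch below is an assumption about the general idea, not the tool's actual formula: `mouse_weight` is a hypothetical value between 0 (mouse data ignored) and 1 (only mouse data counts).

```python
def combined_score(human_score, mouse_score, mouse_weight):
    """Weighted average of human and mouse scores.

    `mouse_weight` must lie in [0, 1]; the human data receives the remaining
    weight. This linear blend is an assumption made for illustration.
    """
    if not 0.0 <= mouse_weight <= 1.0:
        raise ValueError("mouse_weight must be between 0 and 1")
    return (1.0 - mouse_weight) * human_score + mouse_weight * mouse_score


# Hypothetical scores for one gene.
human_only = combined_score(2.0, 4.0, mouse_weight=0.0)
quarter_mouse = combined_score(2.0, 4.0, mouse_weight=0.25)
```

Setting the weight below 0.5 keeps human expression data as the dominant contribution, in line with the reasoning above.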
10. Can I prioritise genes from multiple genomic regions?
Yes. Multiple regions may be chosen by selecting a region of interest and pressing the "add to selection" button. A table will appear, listing all the regions selected so far. Only genes encompassed by the regions defined in the table will be ranked.
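Conceptually, the selection amounts to keeping every gene that falls inside at least one of the listed regions. The sketch below shows that filter under the assumption that a gene must lie entirely within a region to count as "encompassed"; the type names, function name and coordinates are all hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int


class Gene(NamedTuple):
    name: str
    chromosome: str
    start: int
    end: int


def genes_in_regions(genes, regions):
    """Return names of genes lying entirely within at least one region."""
    selected = []
    for gene in genes:
        for region in regions:
            if (gene.chromosome == region.chromosome
                    and region.start <= gene.start
                    and gene.end <= region.end):
                selected.append(gene.name)
                break  # avoid listing a gene twice if regions overlap
    return selected


# Hypothetical coordinates, for illustration only.
regions = [Region("1", 1_000, 5_000), Region("7", 20_000, 30_000)]
genes = [
    Gene("GENE_A", "1", 1_500, 2_500),    # inside the first region
    Gene("GENE_B", "1", 4_500, 6_000),    # extends past the first region
    Gene("GENE_C", "7", 21_000, 29_000),  # inside the second region
    Gene("GENE_D", "2", 1_500, 2_500),    # chromosome not selected
]
selected = genes_in_regions(genes, regions)
```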
11. What does the number of genes returned slider do? How do I get the full results?
Full results for every input gene are made available as a tab-delimited text file. This can be viewed in any text editor or, more conveniently, in spreadsheet software such as Microsoft Excel. The download link is provided at the top of the results page.
You may bookmark this link for future downloads; however, old result files may be periodically deleted. The slider determines the number of results shown in the browser window, defaulting to 50. You may increase or decrease this for convenience, up to a limit of 100.
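A tab-delimited file can also be read programmatically. The sketch below uses Python's standard `csv` module; the column names (`rank`, `gene`, `array_score`, `hts_score`) and the sample rows are assumptions for illustration, so check the header of your downloaded file for the real ones.

```python
import csv
import io


def load_results(handle):
    """Read tab-delimited results and return rows sorted by numeric rank."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    return sorted(rows, key=lambda row: int(row["rank"]))


# Hypothetical file content; the real column names may differ.
sample = io.StringIO(
    "rank\tgene\tarray_score\thts_score\n"
    "2\tGENE_B\t0.8\t1.1\n"
    "1\tGENE_A\t2.3\t1.9\n"
)
results = load_results(sample)
top_genes = [row["gene"] for row in results]
```

For a downloaded file, the same function could be called on a handle opened with `open(path, newline="")` in place of the in-memory sample.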
12. I have noticed sometimes there is a discrepancy between scores and ranks. Is this a bug/error?
The scores are shown to help the user differentiate between cases where a gene is ranked highly because of high expression and cases where a gene is ranked highly because expression is universally low across the input genes. The former is more likely to be biologically significant than the latter.
However, the scores alone do not determine the gene rank.
The underlying algorithm also takes into account factors such as expression patterns across all tissues and data type/conflicts.
This may result in an imperfect correlation between gene ranks and expression scores and is not necessarily an error of the program. However, if you spot something that you genuinely believe to be a ranking error, please let us know via the "Contact" page.
13. How well does the gene prioritisation tool predict disease genes?
Gene prioritisation has been assessed as follows: benchmarking was performed using a training dataset consisting of 475 OMIM disease genes.
The results are shown in Figures 1 and 2. Figure 1 shows the rank distribution of OMIM disease genes versus randomly selected genes in a search space of 50 random genes (left) and 500 random genes (right). The ROC curve for the left panel of Figure 1 is shown in Figure 2.
While this shows significant classification power, no ranking result should be regarded as absolutely accurate.
Figure 1. Density plot showing the rank distribution of 475 OMIM disease genes (pink) and 475 random genes (blue) when classified by the gene expression module. Left - search space size 50. Right - search space size 500.
Figure 2. ROC curve showing the true positive rate vs false positive rate of candidate disease gene prioritisation, using a search space of 50 genes.
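For readers unfamiliar with ROC curves, the sketch below shows one way such a curve could be derived from rank data: at each rank cutoff, the true positive rate is the fraction of disease genes ranked at or above the cutoff, and the false positive rate is the same fraction for random genes. How Figure 2 was actually computed is not described in this FAQ, so the function names, the cutoff scheme and the toy ranks are assumptions.

```python
def roc_from_ranks(positive_ranks, negative_ranks, search_space):
    """Return (false positive rate, true positive rate) points per rank cutoff.

    A gene counts as "predicted" at a cutoff if its rank is <= the cutoff.
    Cutoffs run from 0 (nothing predicted) to `search_space` (everything).
    """
    points = []
    for cutoff in range(search_space + 1):
        tpr = sum(r <= cutoff for r in positive_ranks) / len(positive_ranks)
        fpr = sum(r <= cutoff for r in negative_ranks) / len(negative_ranks)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy ranks: disease genes ranked 1 and 2, random genes ranked 3 and 4.
perfect = auc(roc_from_ranks([1, 2], [3, 4], search_space=4))
# Reversed toy ranks: disease genes always ranked below random genes.
worst = auc(roc_from_ranks([3, 4], [1, 2], search_space=4))
```

An area close to 1 means disease genes are consistently ranked ahead of random genes, while an area near 0.5 would indicate no classification power.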
14. I am studying a phenotype that presents itself mostly in tissue X, but this tissue does not appear in the selection.
If you believe we have missed something vital, please email us using the details on the "Contact" page. However, gene expression data may not be publicly available for all tissues - in particular for highly specific cell types.
15. I have an RNA-Seq/Microarray dataset that may be of interest to you.
Email us using the information in the "Contact" page.