Introduction

PhenoScanner is a curated database of publicly available results from large-scale genetic association studies. The tool aims to facilitate “phenome scans”, the cross-referencing of genetic variants with many phenotypes, to aid understanding of disease pathways and biology. The catalogue currently contains over 65 billion association results, including tables of genetic associations with diseases and traits, metabolites (metabolite quantitative trait loci, mQTL), gene expression (expression quantitative trait loci, eQTL), proteins (protein quantitative trait loci, pQTL) and DNA methylation (methylation quantitative trait loci, methQTL), as well as over 150 million unique genetic variants, most of which are single nucleotide polymorphisms (SNPs). It is accompanied by a web-based tool that searches the database for associations with user-specified genetic variants, genes, genomic regions and traits, returning association results that are aligned according to the alleles of each genetic variant. The tool can also search for trait associations with proxies of the genetic variants of interest, calculated using the super-ancestries from 1000 Genomes phase 3.

PhenoScanner consists of a Python (Django)-R interface which connects to a series of MySQL databases. To develop this version of the catalogue, we collated more than 5,000 genotype-phenotype association datasets (see table of studies). To ensure consistent formatting, we aligned the results to the plus strand, added or updated chromosome positions to hg19 using dbSNP (release 147) and liftOver, and replaced old rsIDs with the current versions. Linkage disequilibrium (LD) measures in 1000 Genomes phase 3 were calculated between neighbouring genetic variants on the autosomal chromosomes using the phased haplotypes from the super-ancestries. Genetic variants with minor allele frequencies < 1% were removed, along with multiallelic variants. For each remaining genetic variant, we calculated D' and r² with genetic variants within 500 kb in either direction, and kept LD statistics for pairs of SNPs with r² ≥ 0.5. The genetic variants have been annotated using VEP and BEDOPS, and the phenotypes have been mapped to Experimental Factor Ontology (EFO) terms using ZOOMA.
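
The two LD measures mentioned above can be illustrated with a minimal sketch. It assumes two biallelic variants whose phased haplotypes are coded as equal-length lists of 0/1 alleles; the function name `ld_stats` and this input format are hypothetical and are not taken from the PhenoScanner code base. D' is reported here as an absolute value.

```python
def ld_stats(hap_a, hap_b):
    """Return (r2, d_prime) for two biallelic variants.

    hap_a and hap_b hold one 0/1 allele code per phased haplotype
    (hypothetical input format chosen for this illustration).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")

    p_a = sum(hap_a) / n  # frequency of allele 1 at variant A
    p_b = sum(hap_b) / n  # frequency of allele 1 at variant B
    # frequency of haplotypes carrying allele 1 at both variants
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n

    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both variants must be polymorphic")

    r2 = d * d / denom

    # D' rescales |D| by its maximum possible value given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    return r2, d_prime
```

Under the rule described above, a pair would then be retained only if the returned r² is at least 0.5.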

 

Input

Queries

PhenoScanner has four query types:

  • SNP - returns all association results for the SNP. The query can be an rsID or an hg19/hg38 chromosome position.
  • Gene - returns all genetic associations with p < 1E-5 within the gene (p < 5E-8 for gene expression and epigenetic markers).
  • Region - returns all genetic associations with p < 1E-5 within the genomic region (p < 5E-8 for gene expression and epigenetic markers).
  • Trait - returns all genetic associations with p < 1E-5 for a trait. This query uses ZOOMA and selects results based on their EFO terms.
You may either:
  • Enter one SNP, gene, region or trait into the search bar on the home page. See the example queries under the search bar.
  • Upload several SNPs, genes or genomic regions in a tab-delimited text file (without a header), with one SNP, gene or genomic region per line (max 100 SNPs, 10 genes or 10 regions).
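
As an illustration of the upload format, the following sketch writes a headerless file with one entry per line and enforces the stated limits. The helper name `write_query_file` and the `MAX_ENTRIES` table are assumptions made for this example; they are not part of PhenoScanner.

```python
# Upload limits as stated in the documentation above.
MAX_ENTRIES = {"snp": 100, "gene": 10, "region": 10}


def write_query_file(path, entries, query_type):
    """Write one query entry per line, without a header.

    Returns the number of entries written. Blank entries are dropped.
    """
    limit = MAX_ENTRIES[query_type]
    cleaned = [entry.strip() for entry in entries if entry.strip()]
    if not cleaned:
        raise ValueError("no entries to write")
    if len(cleaned) > limit:
        raise ValueError(
            f"{len(cleaned)} entries exceed the limit of {limit} for '{query_type}' queries"
        )
    with open(path, "w", newline="") as handle:
        handle.write("\n".join(cleaned) + "\n")
    return len(cleaned)
```

For example, `write_query_file("snps.txt", ["rs11111", "rs22222"], "snp")` would produce a two-line file suitable for upload.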

Options

PhenoScanner has the following options:

  • Catalogue - allows the user to specify which type of associations to look up (options: Diseases and traits, Gene expression, Metabolites, Proteins, Epigenetics, All and None; default: Diseases and traits).
  • p-value cut-off: drop-down box containing p-value cut-off options (options: 1, 0.05, 0.01, 0.001, 1E-5, 5E-8; default: 1E-5). Only associations with a p-value less than the threshold will be looked up.
  • Proxies: allows the user to request results for proxies in a chosen super-ancestry whose LD with the queried variant is above the threshold selected in the r² drop-down box (options: None, AFR, AMR, EAS, EUR, SAS; default: None). LD statistics were calculated for SNPs with MAF ≥ 1% using the phased haplotypes from each of the super-ancestry samples within a 1 Mb window (500 kb upstream and downstream of each SNP). Proxy variants are aligned according to the reference variant. For instance, suppose the user requests results for rs11111, which has effect allele A, and there is a strong proxy rs22222 whose T allele segregates with allele A of rs11111; the results for rs22222 will then be aligned to the T allele. For large SNP queries, only the 5 best proxy variants for each reference variant are looked up in the genotype-phenotype search.
  • r²: drop-down box containing cut-offs for proxies (options: 0.5, 0.6, 0.7, 0.8, and 0.9; default: 0.8).
  • Build: drop-down box containing human genome build numbers (options: 37, 38; default: 37). The build option is only used in the query (e.g. for a genomic region query); both build 37 and build 38 chromosome positions are presented in the output.
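
The proxy handling described in the options above can be sketched as follows. This is an illustration under stated assumptions, not PhenoScanner's implementation: the function names are hypothetical, "best" proxies are assumed to be those with the highest r², and the r² cut-off is assumed to be inclusive.

```python
def select_proxies(candidates, r2_cutoff=0.8, max_proxies=5):
    """Keep proxies passing the r2 cut-off, ranked by r2 (assumed ranking).

    candidates is a list of dicts with at least the keys 'rsid' and 'r2'.
    """
    kept = [c for c in candidates if c["r2"] >= r2_cutoff]
    kept.sort(key=lambda c: c["r2"], reverse=True)
    return kept[:max_proxies]


def align_to_allele(beta, a1, a2, target_allele):
    """Express an effect estimate per copy of target_allele.

    If the target is already the effect allele (a1), nothing changes.
    If it is the non-effect allele (a2), the alleles are swapped and
    the sign of beta is flipped.
    """
    if target_allele == a1:
        return beta, a1, a2
    if target_allele == a2:
        return -beta, a2, a1
    raise ValueError("target allele matches neither a1 nor a2")
```

In the rs11111/rs22222 example, if a study reported rs22222 with effect allele C and beta 0.2, `align_to_allele(0.2, "C", "T", "T")` would give the estimate per copy of T, with the sign of beta reversed.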

 

Output

The following files are produced after running PhenoScanner:

  • A variant, gene or location information file (<SNP entry/gene entry/location entry/infile name>_PhenoScanner_<SNP/Gene/Location>.tsv) containing information on the variants, genes or locations searched.
  • Association files (<SNP entry/gene entry/location entry/trait entry/infile name>_PhenoScanner_<association catalogue name>.tsv) containing the look-up results.

SNP information file

The variant information file contains the following columns:

  • snp: this is the input rsID or hg19/hg38 chromosome position.
  • rsid: the rsID for the input SNP.
  • hg19_coordinates: the hg19 chromosome position for the input SNP.
  • hg38_coordinates: the hg38 chromosome position for the input SNP.
  • chr: the chromosome where the input SNP is located.
  • pos_hg19: the hg19 position for the input SNP.
  • pos_hg38: the hg38 position for the input SNP.
  • a1: the effect allele (aligned to the + strand) for the input SNP.
  • a2: the non-effect allele (aligned to the + strand) for the input SNP.
  • afr: the allele frequency for a1 in the AFR population in 1000 Genomes.
  • amr: the allele frequency for a1 in the AMR population in 1000 Genomes.
  • eas: the allele frequency for a1 in the EAS population in 1000 Genomes.
  • eur: the allele frequency for a1 in the EUR population in 1000 Genomes.
  • sas: the allele frequency for a1 in the SAS population in 1000 Genomes.
  • consequence: the consequence of the SNP from VEP.
  • protein_position: the position of the SNP in the protein.
  • amino_acids: the amino acids altered by the SNP in the protein (does not relate to the ordering of the alleles a1 and a2).
  • ensembl: the Ensembl ID for the nearest gene.
  • hgnc: the HGNC ID for the nearest gene.

If proxies were requested, the variant information file contains the above columns for both the reference variants (‘ref_’ prefix) and the proxy variants, as well as the following columns:

  • proxy: an indicator variable which equals 0 if the proxy SNP is the input SNP and 1 otherwise.
  • r2: the r² between the input SNP and the proxy SNP based on the phased haplotypes from 1000 Genomes.
  • dprime: the D' between the input SNP and the proxy SNP based on the phased haplotypes from 1000 Genomes.

Gene information file

The gene information file contains the following columns:

  • gene: this is the input gene.
  • ensembl_id: the Ensembl ID for the input gene.
  • chr: the chromosome where the gene is located.
  • start: the starting hg19/hg38 position for the gene.
  • end: the ending hg19/hg38 position for the gene.

Location information file

The location information file contains the following columns:

  • region: this is the input location.
  • chr: the chromosome where the region is located.
  • start: the starting hg19/hg38 position for the region.
  • end: the ending hg19/hg38 position for the region.

Association results files

The association results files all contain a subset of the following columns:

  • snp: the input rsID or hg19/hg38 chromosome position.
  • gene: the input gene.
  • region: the input region/location.
  • rsid: rsID.
  • hg19_coordinates: the hg19 chromosome position.
  • hg38_coordinates: the hg38 chromosome position.
  • a1: the effect allele (aligned to the + strand).
  • a2: the non-effect allele (aligned to the + strand).
  • afr: the allele frequency for a1 in the AFR population in 1000 Genomes.
  • amr: the allele frequency for a1 in the AMR population in 1000 Genomes.
  • eas: the allele frequency for a1 in the EAS population in 1000 Genomes.
  • eur: the allele frequency for a1 in the EUR population in 1000 Genomes.
  • sas: the allele frequency for a1 in the SAS population in 1000 Genomes.
  • consequence: the consequence of the SNP from VEP.
  • protein_position: the position of the SNP in the protein.
  • amino_acids: the amino acids altered by the SNP in the protein (does not relate to the ordering of the alleles a1 and a2).
  • ensembl: the Ensembl ID for the nearest gene.
  • hgnc: the HGNC ID for the nearest gene.
  • trait: phenotype or disease.
  • efo: the EFO ontology term for the phenotype or disease.
  • study: the name of the consortium/lead author of the study.
  • pmid: PubMed ID.
  • ancestry: the ancestry of the study.
  • year: the year the study was published.
  • tissue: the tissue in which the gene expression was measured (eQTL and methQTL datasets only).
  • exp_gene: the HGNC ID for the expressed gene (eQTL datasets only).
  • exp_ensembl: the Ensembl ID for the expressed gene (eQTL datasets only).
  • probe: the probe for the expressed gene (eQTL datasets only).
  • marker: the epigenetic marker measured (methQTL datasets only).
  • location: the location of the epigenetic marker (methQTL datasets only).
  • beta: association between the trait and the SNP expressed per additional copy of the effect allele (odds ratios are given on the log-scale).
  • se: standard error of beta.
  • p: p-value.
  • direction: the direction of association with respect to the effect allele.
  • n: number of individuals.
  • n_cases: number of cases.
  • n_controls: number of controls.
  • n_studies: number of studies.
  • unit: unit of analysis (IVNT stands for inverse normal rank-transformed phenotype).
  • dataset: the dataset ID.
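
Because odds ratios are stored on the log scale in the beta column, a common post-processing step is to exponentiate them. The sketch below reads rows from a tab-delimited association file using the column names listed above and returns odds ratios with approximate 95% confidence intervals. The function names are hypothetical, the conversion is only meaningful for binary (disease) outcomes, and rows with missing or non-numeric values are skipped, since the exact coding of missing values is not specified here.

```python
import csv
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a log odds ratio and its standard error to (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def significant_odds_ratios(tsv_lines, p_cutoff=5e-8):
    """Return (rsid, trait, OR, lower, upper) for rows with p below p_cutoff.

    tsv_lines is any iterable of tab-delimited lines including a header,
    for example an open file handle.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    results = []
    for row in reader:
        try:
            beta = float(row["beta"])
            se = float(row["se"])
            p = float(row["p"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without usable numeric values
        if p < p_cutoff:
            results.append(
                (row.get("rsid"), row.get("trait")) + odds_ratio_with_ci(beta, se)
            )
    return results
```

Passing an open file handle for one of the association files, for instance `significant_odds_ratios(open(path, newline=""))`, would be one way to use it.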

If proxies were requested, the association file contains the information on the reference variants (‘ref_’ prefix) and the proxy variants, as well as the following columns:

  • proxy: an indicator variable which equals 0 if the proxy SNP is the input SNP and 1 otherwise.
  • r2: the r² between the input SNP and the proxy SNP based on the phased haplotypes from 1000 Genomes.
  • dprime: the D' between the input SNP and the proxy SNP based on the phased haplotypes from 1000 Genomes.