Documentation

Contents

1. An atlas of GWAS summary statistics
2. Curated publicly available GWAS summary statistics
2.1. Conditions to be included in the database
2.2. UK Biobank traits
2.3. Pre-process of GWAS summary statistics
3. Database features (GWAS information)
4. Definition of lead SNPs and risk loci
4.1. Lead SNPs
4.2. Risk loci
4.3. Reference panel
5. Estimation of SNP heritability and genetic correlation with LD score regression
5.1. LD score regression (LDSC)
5.2. SNP heritability estimation
5.3. Genetic correlation
6. MAGMA analyses
6.1. MAGMA gene analysis
6.2. MAGMA gene-set analysis
7. Multi GWAS comparison and pleiotropy
7.1. Scatter plots and regression lines
7.2. Genetic correlation heatmap
7.3. MAGMA gene overlap
7.4. Pleiotropic risk loci
7.5. Pleiotropic genes
8. PheWAS plot
9. Citation
10. References
11. URLs

1. An atlas of GWAS summary statistics

This website is a comprehensive database of publicly available GWAS summary statistics. Beyond serving as a central hub for summary statistics, it aims to provide insight into human complex traits. Users can access the original summary statistics and also obtain a variety of results from pre-performed analyses such as risk loci information, LD score regression [1], MAGMA [2] and multi GWAS comparison. To obtain a global view of genetic architecture, we performed several analyses for 501 selected traits, which are available in our paper [4].

2. Curated publicly available GWAS summary statistics

2.1. Conditions to be included in the database

Publicly available GWAS summary statistics that include the full list of tested SNPs are included in the database regardless of cohort population or sample size. "Publicly available" means that the full summary statistics can be obtained without an access application that requires review; summary statistics that require online submission of user information, such as email and name, are still considered publicly available. We excluded GWAS based on whole exome sequencing or immune-chip sequencing, and GWAS of replication cohorts. When there are several versions of GWAS summary statistics from the same study for the same trait, e.g. sex-specific and pooled-sex GWAS or GWAS adjusted for additional covariates, we included all of them as long as the sample size and population are explicitly mentioned in the original study. Details are described in the online Methods of [4].

2.2. UK Biobank traits

We performed GWAS of 600 traits from UK Biobank release 2 [5] under application ID 1640. We selected traits with at least 50,000 individuals with non-missing phenotypes and, for binary traits, at least 10,000 cases and at least 10,000 controls. Only the phenotype from the first visit and first run (f.xxx.0.0) was used, with some exceptions (please refer to the Supplementary Text and Supplementary Tables 1-2 of [4] for details). GWAS was performed using PLINK v2.0 with either a linear or a logistic model, correcting for age, sex, 20 PCs (re-computed for EUR subjects only, not the PCs provided by UKB), array, assessment center and Townsend deprivation index.

Columns of the summary statistics
SNP: unique ID of the SNP, consisting of chromosome, position and alphabetically ordered alleles
CHR: chromosome
BP: base pair position on GRCh37
A1: effect allele
TEST: Type of test (ADD for all files)
NMISS: Number of non-missing genotypes
BETA/OR: Regression coefficient or odds ratio
SE: Standard error (for OR, in logOR scale)
L95: Lower bound on confidence interval for CMH odds ratio
U95: Upper bound on confidence interval for CMH odds ratio
STAT: Coefficient t-statistic
P: P-value
A2: non effect allele
MAF: Minor allele frequency
NCHROBS: Number of allele observations
SNPID_UKB: rsID provided by UK Biobank
A1_UKB: A1 allele in UK Biobank
A2_UKB: A2 allele in UK Biobank
INFO_UKB: Info score provided by UK Biobank
MAF_UKB: MAF across all UK Biobank samples
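
As an illustration of the SNP column described above, the following minimal sketch builds an ID from chromosome, position and alphabetically ordered alleles. The ":" separator and the function name make_snp_id are assumptions made for this example only; the exact format should be checked against the downloaded files.

```python
def make_snp_id(chrom, pos, allele_a, allele_b, sep=":"):
    """Build a SNP ID from chromosome, position and alphabetically sorted alleles.

    The separator is an assumption for illustration; it is not taken from
    the summary statistics files.
    """
    first, second = sorted([allele_a.upper(), allele_b.upper()])
    return sep.join([str(chrom), str(pos), first, second])


# The ID does not depend on which allele is given first.
id_forward = make_snp_id(1, 752566, "A", "G")
id_swapped = make_snp_id(1, 752566, "G", "A")
print(id_forward, id_swapped)
```

Because the alleles are sorted, the same variant gets the same ID regardless of which allele a study reports as the effect allele.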

2.3. Pre-process of GWAS summary statistics

Curated summary statistics in the database were pre-processed to standardise the format. SNPs with a P-value <= 0 or > 1, or with a non-numeric value such as “NA”, were excluded. For summary statistics with non-hg19 genome coordinates, the liftOver software was used to convert them to hg19. When only the rsID is available in the summary statistics file, without chromosome and position, genome coordinates were extracted from dbSNP 146. When the rsID is missing, it was assigned based on dbSNP 146. When only the effect allele is reported, the other allele was extracted from dbSNP 146.
We do not distribute the pre-processed GWAS summary statistics, to avoid confusion caused by duplicated information in the public domain. These pre-processed summary statistics were used for all the analyses available on this website and can be used as input for FUMA (server side translation).
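
The P-value filter described above could look roughly like the sketch below. It is not the pipeline used for the database; the function name has_valid_p is hypothetical and the input is assumed to be the raw text of the P-value column.

```python
def has_valid_p(value):
    """Return True when value parses as a number p with 0 < p <= 1."""
    try:
        p = float(value)
    except (TypeError, ValueError):
        # Non-numeric entries such as "NA" or an empty field end up here.
        return False
    # "nan" parses as a float but fails this range check, so it is dropped too.
    return 0.0 < p <= 1.0


raw_p_values = ["0.05", "0", "1", "1.2", "NA", "3e-9", ""]
kept = [v for v in raw_p_values if has_valid_p(v)]
print(kept)
```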

3. Database features (GWAS information)

Feature: Description
id: Unique ID in the database (arbitrary order).
PMID: PubMed ID of the original study. If the study is not published, this is noted here or the DOI of the bioRxiv preprint is provided.
Year: The year the original study was published. If the study is not published, the year the data were distributed.
File: Link to the original summary statistics. When submission of an online form is required, the link points to the online form. Otherwise, it points directly to the downloadable file.
Website: Link to the website if available (not the direct link to the summary statistics file).
Consortium: The name of the consortium if available.
Domain: General domain of the trait.
ChapterLevel: Chapter of the trait obtained from either ICD10 or ICF10.
SubchapterLevel: Subchapter of the trait obtained from either ICD10 or ICF10.
Trait: The trait name used in the original study (or as close as possible).
uniqTrait: The trait name harmonised across the database. This matches traits with slightly different names in the "Trait" feature, but does not mean that the phenotype definition is exactly the same. Please refer to the original study for the detailed phenotype definition.
Population: One of the five super-populations defined in the 1000 Genomes Project: AFR (African), AMR (American), EAS (East Asian), EUR (European) and SAS (South Asian). If the GWAS is a trans-ethnic study, all populations are listed, with the population contributing the largest proportion of the total sample size listed first. For example, EUR+EAS+SAS means the study cohorts are a mix of three populations, with EUR samples making up the largest proportion of the total sample size (not necessarily the majority). For the UK Biobank cohort, it is "UKB1 (EUR)" for release 1 and "UKB2 (EUR)" for release 2.
Ncase: For binary traits, the number of cases.
Ncontrol: For binary traits, the number of controls.
N: Total sample size used for the analyses. This is the total sample size used to generate the publicly available summary statistics. In some meta-analysis studies, some cohorts are not permitted to distribute summary statistics. In that case, the publicly available summary statistics do not include those cohorts, and the sample size in this database is the sample size excluding them.
Nsnps: The number of SNPs in the original GWAS summary statistics.
Nhits: The number of risk loci. The definition of risk loci is described in the next section, 4. Definition of lead SNPs and risk loci.
SNPh2: SNP heritability estimated by LD score regression [1]. This is only available for GWAS that meet certain criteria, otherwise blank. See section 5. Estimation of SNP heritability and genetic correlation with LD score regression for details.
SNPh2_se: If SNP h2 is available, the standard error of SNP h2.
SNPh2_z: If SNP h2 is available, the Z statistic of SNP h2.
LambdaGC: If SNP h2 is available, the estimated Lambda GC.
Chi2: If SNP h2 is available, the estimated chi square.
Intercept: If SNP h2 is available, the estimated single trait intercept.
Note: Any relevant information extracted from the original study.

4. Definition of lead SNPs and risk loci

4.1. Lead SNPs

As described previously [3], lead SNPs are defined by double clumping. The first clumping is performed on SNPs with P-value < 0.05; SNPs that are genome-wide significant (P-value < 5e-8) and independent of each other at r2 < 0.6 are defined as independent significant SNPs. The second clumping is performed on the independent significant SNPs at r2 < 0.1, which defines lead SNPs.
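
The double clumping can be sketched as two passes of a greedy procedure. This is an illustration only, not the implementation used for the database: it assumes a greedy pass from the smallest to the largest P-value, pairwise r2 values supplied in a dictionary (missing pairs are treated as r2 = 0), and hypothetical function names and SNP IDs.

```python
def clump(snps, pvalues, r2, r2_threshold):
    """Greedy clumping: visit SNPs from smallest to largest P-value and keep a
    SNP only if its r2 with every SNP already kept is below the threshold."""
    kept = []
    for snp in sorted(snps, key=lambda s: pvalues[s]):
        if all(r2.get(frozenset((snp, k)), 0.0) < r2_threshold for k in kept):
            kept.append(snp)
    return kept


def double_clump(pvalues, r2, gws=5e-8, r2_first=0.6, r2_second=0.1):
    """Return (independent significant SNPs, lead SNPs)."""
    significant = [s for s, p in pvalues.items() if p < gws]
    independent = clump(significant, pvalues, r2, r2_first)
    lead = clump(independent, pvalues, r2, r2_second)
    return independent, lead


# Hypothetical toy data: three genome-wide significant SNPs and one that is not.
pvalues = {"rs1": 1e-12, "rs2": 1e-10, "rs3": 1e-9, "rs4": 1e-3}
r2 = {
    frozenset(("rs1", "rs2")): 0.8,
    frozenset(("rs1", "rs3")): 0.3,
    frozenset(("rs2", "rs3")): 0.2,
}
independent, lead = double_clump(pvalues, r2)
print(independent, lead)
```

In this toy example rs2 is absorbed by rs1 in the first pass (r2 = 0.8), and rs3 stays an independent significant SNP but is not a lead SNP because its r2 with rs1 is above 0.1.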

4.2. Risk loci

Each independent significant SNP has its own LD block, defined by the SNPs (P-value < 0.05) that are in LD with the independent significant SNP (r2 ≥ 0.6). To define genomic risk loci as regions, the LD blocks of independent significant SNPs that belong to the same lead SNP are merged first. Then, LD blocks that physically overlap or lie within 250 kb of each other are merged. Each risk locus is represented by the lead SNP with the minimum P-value within the locus. A risk locus can therefore contain multiple independent significant SNPs and lead SNPs.
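
The merging of LD blocks into risk loci can be illustrated with a simple interval merge. The sketch below is an assumption-laden example rather than the code behind the database: the function name merge_ld_blocks is hypothetical, blocks are (chromosome, start, end) tuples, and the 250 kb rule is read as "the gap between two blocks is at most 250,000 bp".

```python
def merge_ld_blocks(blocks, max_gap=250_000):
    """Merge blocks on the same chromosome that overlap or whose gap is at
    most max_gap base pairs. Each block is a (chrom, start, end) tuple."""
    merged = []
    for chrom, start, end in sorted(blocks):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            last_chrom, last_start, last_end = merged[-1]
            merged[-1] = (last_chrom, last_start, max(last_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical LD blocks: the first two are 200 kb apart and are merged.
blocks = [
    (1, 1_000_000, 1_100_000),
    (1, 1_300_000, 1_400_000),
    (1, 2_000_000, 2_050_000),
    (2, 1_000_000, 1_100_000),
]
risk_loci = merge_ld_blocks(blocks)
print(risk_loci)
```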

4.3. Reference panel

We used 1000 Genomes Project phase 3 [6] data of the corresponding population (AFR, AMR, EAS, EUR, SAS) as a reference panel to compute LD for most of the GWAS in the database. For trans-ethnic GWAS, the population with the largest proportion of the total sample size was used. When the GWAS is based on the UKB release 1 cohort, we used 10,000 randomly sampled white British subjects from UKB release 1 as a reference. For GWAS performed in this study or based on the UKB2 cohort, we used 10,000 randomly sampled unrelated EUR subjects as a reference. For meta-analyses including a UKB cohort, either UKB1 or UKB2 was used as a reference. Indels and non-bi-allelic SNPs were excluded. For each GWAS, the population is specified in the "Population" feature of the database (described in the previous section, 3. Database features).

5. Estimation of SNP heritability and genetic correlation with LD score regression

5.1. LD score regression (LDSC)

We used the LD score regression [1] software with pre-computed LD scores for EUR and EAS populations obtained from https://data.broadinstitute.org/alkesgroup/LDSCORE/. SNPs were restricted to HapMap3 SNPs, and the MHC region was excluded from all LDSC analyses.

5.2. SNP heritability estimation

SNP heritability was estimated for GWAS of either the EUR or EAS population (or in which EUR or EAS makes up the largest proportion of the total sample size), with a total sample size > 5,000 and with > 450,000 SNPs available in the summary statistics file. When a signed effect size or odds ratio is not available in the summary statistics file, the "--a1-inc" flag was used. For all GWAS fulfilling the above criteria, SNP heritability was computed on the observed scale.
For binary traits, SNP heritability was also computed on the liability scale. The population prevalence was curated from the literature (only for diseases whose prevalence was available; see Supplementary Table 25 of [4] or the "Note" column of this database) and used to compute SNP heritability on the liability scale with the "--samp-prev" and "--pop-prev" flags; it is specified in the "Note" feature of the database. For most personality/activity (binary) traits from the UKB2 cohort, we assumed that the sample prevalence is equal to the population prevalence, since UK Biobank is not designed to study a certain disease/trait, as described previously [7]. Likewise, when the population prevalence was not available, the sample prevalence was used as the population prevalence for all other binary traits.
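
For orientation, the conversion from the observed scale to the liability scale is commonly written as h2_liab = h2_obs x K^2 x (1-K)^2 / (P x (1-P) x z^2), where K is the population prevalence, P is the sample prevalence and z is the height of the standard normal density at the liability threshold. The sketch below implements this standard formula with the Python standard library; it is not code taken from LDSC, and the function name is hypothetical.

```python
from statistics import NormalDist


def liability_scale_h2(h2_observed, sample_prevalence, population_prevalence):
    """Convert observed-scale SNP heritability of a binary trait to the
    liability scale using sample prevalence P and population prevalence K."""
    K = population_prevalence
    P = sample_prevalence
    std_normal = NormalDist()
    # Liability threshold above which a fraction K of the population is affected.
    threshold = std_normal.inv_cdf(1.0 - K)
    # Height of the standard normal density at that threshold.
    z = std_normal.pdf(threshold)
    return h2_observed * (K * (1.0 - K)) ** 2 / (P * (1.0 - P) * z ** 2)


# Hypothetical values: a case-control study with 40% cases of a trait with 1% prevalence.
print(liability_scale_h2(0.10, sample_prevalence=0.40, population_prevalence=0.01))

# When the sample prevalence is used as the population prevalence (P = K),
# the factor reduces to K * (1 - K) / z**2.
print(liability_scale_h2(0.10, sample_prevalence=0.5, population_prevalence=0.5))
```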

5.3. Genetic correlation

Genetic correlation was computed only for pairs of GWAS that both meet the following criteria, as suggested previously [8].
  • SNP heritability was estimated
  • The GWAS is of the EUR population, or more than 80% of its samples are EUR
  • Effect and non-effect alleles are explicitly mentioned in the header or elsewhere
  • SNP heritability Z-score > 2
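
The criteria above can be expressed as a small filter followed by enumeration of pairs. This is a sketch under assumptions: the GwasEntry fields are hypothetical names chosen for the example and do not correspond to columns of the database.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass
class GwasEntry:
    gwas_id: int
    h2_z: Optional[float]   # Z-score of SNP heritability; None when h2 was not estimated
    eur_fraction: float     # fraction of EUR samples (1.0 for an EUR-only GWAS)
    alleles_known: bool     # effect and non-effect alleles are explicitly stated


def eligible_for_rg(entry):
    """Apply the four criteria listed above to a single GWAS."""
    return (
        entry.h2_z is not None
        and entry.eur_fraction > 0.8
        and entry.alleles_known
        and entry.h2_z > 2
    )


def rg_pairs(entries):
    """Return all pairs of GWAS IDs for which genetic correlation would be computed."""
    kept = [e.gwas_id for e in entries if eligible_for_rg(e)]
    return list(combinations(kept, 2))


entries = [
    GwasEntry(1, 8.5, 1.0, True),
    GwasEntry(2, 1.4, 1.0, True),    # fails: Z-score not above 2
    GwasEntry(3, 5.0, 0.6, True),    # fails: too few EUR samples
    GwasEntry(4, 3.2, 0.9, True),
    GwasEntry(5, None, 1.0, True),   # fails: h2 not estimated
]
print(rg_pairs(entries))
```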

6. MAGMA analyses

For the current release, MAGMA v1.06 [2] was used.

6.1. MAGMA gene analysis

MAGMA gene analysis was performed using 19,436 protein coding genes obtained from biomaRt (primary ID is Ensembl ID v92 GRCh37), which were mapped to entrez IDs from NCBI. SNPs were assigned to genes with a 1 kb window on both sides. The reference panel of the corresponding population, based on either 1000G, UK Biobank release 1 or release 2, was used as described in the "Population" feature of section 3. Database features. The default model, snp-wise (mean), was used.
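
To make the 1 kb window concrete, the sketch below assigns a SNP to every gene whose extended boundaries contain it. MAGMA performs this annotation itself; the function name, the inclusive boundaries and the gene coordinates here are assumptions for illustration only.

```python
def genes_for_snp(chrom, pos, genes, window=1_000):
    """Return IDs of genes whose region, extended by `window` bp on both
    sides, contains the SNP. `genes` holds (gene_id, chrom, start, end)."""
    return [
        gene_id
        for gene_id, gene_chrom, start, end in genes
        if gene_chrom == chrom and start - window <= pos <= end + window
    ]


# Hypothetical gene coordinates.
genes = [
    ("GENE_A", 1, 10_000, 20_000),
    ("GENE_B", 1, 20_500, 30_000),
    ("GENE_C", 2, 10_000, 20_000),
]
print(genes_for_snp(1, 20_300, genes))   # inside both 1 kb windows on chromosome 1
print(genes_for_snp(1, 32_000, genes))   # more than 1 kb from any gene
```

A SNP can therefore be assigned to more than one gene when the extended gene regions overlap.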

6.2. MAGMA gene-set analysis

MAGMA gene-set analysis was performed for 4,737 curated gene sets and 5,917 GO terms (4,436 biological processes, 580 cellular components and 901 molecular functions) from MSigDB v6.1 [9].
For the release of v20191115, gene-set analyses for all traits in the database were updated to MSigDB v7.0 (5,500 curated gene sets, 7,350 GO biological processes, 1,001 GO cellular components and 1,645 GO molecular functions).

7. Multi GWAS comparison and pleiotropy

7.1. Scatter plots and regression lines

The linear regression line, correlation coefficient and P-value (the null hypothesis is that the slope is zero) are computed by the linregress function of the scipy.stats module in Python 2.7. For the plot of year vs sample size, the regression line is not displayed when all data points are in the same year. For plots that include SNP heritability, data points are limited to the GWAS whose SNP heritability was estimated by LD score regression. See 5.2. SNP heritability estimation for details.
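
A minimal sketch of such a fit with scipy.stats.linregress is shown below. It is written for Python 3 and is not the website's code; the wrapper name fit_line and the toy data are assumptions. The guard mirrors the rule that no line is drawn when all points share the same x value.

```python
from scipy.stats import linregress


def fit_line(x, y):
    """Return (slope, intercept, r, p) from a simple linear regression,
    or None when all x values are identical and no line can be fitted."""
    if len(set(x)) < 2:
        return None
    result = linregress(x, y)
    # result.pvalue tests the null hypothesis that the slope is zero.
    return result.slope, result.intercept, result.rvalue, result.pvalue


# Toy data for illustration.
fit = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(fit)

# All data points in the same year: no regression line.
print(fit_line([2015, 2015, 2015], [1000, 5000, 20000]))
```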

7.2. Genetic correlation heatmap

The heatmap only includes GWAS which meet the criteria for analyses of genetic correlation. See 5.3. Genetic correlation for details. The value of genetic correlation was winsorized between -1.25 and 1.25. The Bonferroni correction was performed based on the number of possible pairs in the heatmap (#GWAS x (#GWAS - 1) x 0.5).
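
The two numerical steps above can be sketched as follows. The function names are hypothetical, and the significance level of 0.05 is an assumption for this example, since the level used for the heatmap is not stated here.

```python
def winsorize_rg(rg, limit=1.25):
    """Clip a genetic correlation estimate to the range [-limit, limit]."""
    return max(-limit, min(limit, rg))


def n_pairs(n_gwas):
    """Number of possible GWAS pairs: n x (n - 1) x 0.5."""
    return n_gwas * (n_gwas - 1) // 2


def bonferroni_threshold(n_gwas, alpha=0.05):
    """P-value threshold after Bonferroni correction over all possible pairs.
    alpha=0.05 is an assumed significance level. Requires n_gwas >= 2."""
    return alpha / n_pairs(n_gwas)


print([winsorize_rg(v) for v in (-2.0, 0.3, 1.4)])
print(n_pairs(10), bonferroni_threshold(10))
```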

7.3. MAGMA gene overlap

The heatmap only includes GWAS with at least one genome-wide significant gene, based on the genes tested in all selected GWAS (P-value < 0.05/#tested genes). The cell in the i-th column and j-th row represents the proportion of significant genes in the i-th GWAS that are also significant in the j-th GWAS, i.e. the number of overlapping significant genes divided by the number of significant genes in the i-th GWAS.
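
Because the denominator is the number of significant genes in the column GWAS, the heatmap is not symmetric. The sketch below illustrates this with sets of gene IDs; the function name and the toy gene sets are hypothetical, and every GWAS is assumed to have at least one significant gene, as required for inclusion in the heatmap.

```python
def overlap_proportions(significant):
    """significant maps a GWAS label to its non-empty set of significant genes.

    Returns matrix[row][col] = |S_col & S_row| / |S_col|, i.e. the share of
    the column GWAS's significant genes that are also significant in the row GWAS.
    """
    labels = list(significant)
    return {
        row: {
            col: len(significant[col] & significant[row]) / len(significant[col])
            for col in labels
        }
        for row in labels
    }


significant = {
    "GWAS_A": {"g1", "g2", "g3", "g4"},
    "GWAS_B": {"g1", "g2"},
}
matrix = overlap_proportions(significant)
print(matrix["GWAS_B"]["GWAS_A"])  # column A, row B: 2 of A's 4 genes overlap
print(matrix["GWAS_A"]["GWAS_B"])  # column B, row A: 2 of B's 2 genes overlap
```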

7.4. Pleiotropic risk loci

Pre-defined risk loci for each selected GWAS are pooled, and physically overlapping risk loci are grouped. Not all risk loci in a group need to overlap each other. For example, if locus A and locus B overlap, and locus B and locus C also overlap, but locus A and locus C do not, then loci A, B and C are grouped into one. It is therefore possible that a group of risk loci contains more than one risk locus from a single trait. For each group of risk loci, the numbers of associated GWAS (summary statistics) and domains are counted. Note that the number of associated GWAS is the number of associated unique summary statistics, which does not necessarily reflect the number of associated unique traits when multiple summary statistics were selected for a single trait. See 4. Definition of lead SNPs and risk loci for the definition of risk loci.
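
The grouping described above, including the A, B, C example, can be sketched as a single sweep over loci sorted by position. The function name, the tuple layout and the toy loci are assumptions for illustration; loci that share a boundary position are treated as overlapping here.

```python
def group_risk_loci(loci):
    """Group physically overlapping loci. Each locus is a tuple
    (gwas_id, domain, chrom, start, end); overlap is chained transitively."""
    groups = []
    for locus in sorted(loci, key=lambda l: (l[2], l[3], l[4])):
        gwas_id, domain, chrom, start, end = locus
        last = groups[-1] if groups else None
        if last is not None and last["chrom"] == chrom and start <= last["end"]:
            # Overlaps the current group: extend it and add the locus.
            last["end"] = max(last["end"], end)
            last["members"].append(locus)
        else:
            groups.append({"chrom": chrom, "start": start, "end": end, "members": [locus]})
    for group in groups:
        group["n_gwas"] = len({m[0] for m in group["members"]})
        group["n_domains"] = len({m[1] for m in group["members"]})
    return groups


# Locus A and C come from the same GWAS and do not overlap, but both overlap B.
loci = [
    (1, "Psychiatric", 1, 100, 200),   # A
    (2, "Metabolic", 1, 150, 300),     # B
    (1, "Psychiatric", 1, 250, 400),   # C
    (3, "Metabolic", 2, 100, 200),     # D, on another chromosome
]
groups = group_risk_loci(loci)
for g in groups:
    print(g["chrom"], g["start"], g["end"], g["n_gwas"], g["n_domains"])
```

The first group contains three loci but only two unique GWAS, which is why the count of associated GWAS can be smaller than the number of loci in a group.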

7.5. Pleiotropic genes

First, genes that are tested in all selected GWAS are selected; a gene is considered genome-wide significant in a GWAS when its P-value < 0.05/#tested genes. For each gene, the numbers of associated GWAS (summary statistics) and domains are counted.
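
A minimal sketch of this counting is given below, assuming gene P-values are available as one dictionary per GWAS. The function name, GWAS labels, gene IDs and domains are hypothetical.

```python
def pleiotropic_gene_counts(gene_p, domains, alpha=0.05):
    """gene_p maps GWAS label -> {gene: P-value}; domains maps GWAS label -> domain.

    Returns {gene: (number of associated GWAS, number of associated domains)}
    for genes tested in all GWAS, using the threshold alpha / #tested genes.
    """
    tested_in_all = set.intersection(*(set(p) for p in gene_p.values()))
    threshold = alpha / len(tested_in_all)
    counts = {}
    for gene in tested_in_all:
        hits = [label for label, p in gene_p.items() if p[gene] < threshold]
        counts[gene] = (len(hits), len({domains[label] for label in hits}))
    return counts


gene_p = {
    "GWAS_A": {"G1": 1e-8, "G2": 0.5, "G3": 0.01},
    "GWAS_B": {"G1": 1e-3, "G2": 0.2, "G4": 1e-9},
    "GWAS_C": {"G1": 0.9, "G2": 1e-6},
}
domains = {"GWAS_A": "Psychiatric", "GWAS_B": "Metabolic", "GWAS_C": "Psychiatric"}
counts = pleiotropic_gene_counts(gene_p, domains)
print(counts)
```

Only G1 and G2 are tested in all three GWAS, so the threshold in this toy example is 0.05/2; G3 and G4 are ignored even though G4 has a very small P-value.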

8. PheWAS plot

A PheWAS plot can be created for a SNP or a gene. For SNPs, only those with P-value < 0.05 are available, to maintain performance.
When an rsID is provided, chromosome and position are extracted from dbSNP build 146. Note that alleles are ignored across GWAS, which means that SNPs at the same genomic coordinate are considered the same SNP. Genes can be searched by Ensembl ID, gene symbol or NCBI entrez ID.

9. Citation

When you use results from the GWAS atlas website, please cite the following.
Watanabe, K. et al. A global view of genetic architecture in human complex traits. [under preparation].
When you use GWAS summary statistics found through the atlas database, please cite the original GWAS study.

10. References

  1. Bulik-Sullivan, B.K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291-295 (2015). PMID: 25642630
  2. de Leeuw, C.A. et al. MAGMA: Generalized gene-set analysis of GWAS data. PLoS Comput. Biol. 11, e1004219 (2015). PMID: 25885710
  3. Watanabe, K. et al. Functional mapping and annotation of genetic associations with FUMA. Nat. Commun. 8, 1826 (2017). PMID: 29184056
  4. Watanabe, K. et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat. Genet. 51, 1339-1348 (2019).
  5. Bycroft, C. et al. Genome-wide genetic data on ~500,000 UK Biobank participants. bioRxiv doi:https://doi.org/10.1101/166298 (2017).
  6. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 526, 68-74 (2015). PMID: 26432245
  7. Ge, T. et al. Phenome-wide heritability analysis of the UK Biobank. PLoS Genet. 13, e1006711 (2017). PMID: 28388634
  8. Zheng, J. et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics. 33, 272-279 (2017). PMID: 27663502
  9. Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 27, 1739-40 (2011). PMID: 21546393

11. URLs

Here we list the URLs used to curate publicly available GWAS summary statistics and the software/tools used in GWAS atlas.

Software/tools

LD score: https://github.com/bulik/ldsc
MAGMA: https://ctg.cncr.nl/software/magma
FUMA: http://fuma.ctglab.nl/

GWAS resources

https://www.ebi.ac.uk/gwas/downloads/summary-statistics
https://grasp.nhlbi.nih.gov/FullResults.aspx
http://www.type2diabetesgenetics.org/informational/data#
https://www.ncbi.nlm.nih.gov/gap
ftp://twinr-ftp.kcl.ac.uk/ImmuneCellScience/2-GWASResults
http://amdgenetics.org/
http://archive.broadinstitute.org/ftp/pub/rheumatoid_arthritis/Stahl_etal_2010NG/
http://csg.sph.umich.edu/abecasis/public/amdgene2012/
http://csg.sph.umich.edu/abecasis/public/lipids2013
http://diagram-consortium.org
http://egg-consortium.org
http://enigma.ini.usc.edu/
http://metabolomics.helmholtz-muenchen.de
http://mips.helmholtz-muenchen.de/proj/GWAS/gwas/gwas_server/
http://research-pub.gene.com/bronson_et_al_2016/
http://ssgac.org
http://web.pasteur-lille.fr/en/recherche/u744/igap/igap_download.php
http://wp.unil.ch/sgg/bayesian-lifespan-gwas/
http://www.broadinstitute.org/collaboration/giant
http://www.cardiogramplusc4d.org
http://www.ccace.ed.ac.uk/node/335
http://www.computationalmedicine.fi
http://www.gefos.org
http://www.ilae.org/Commission/genetics/consortium.cfm
http://www.ipscsg.org/
http://www.mcgill.ca/genepi/adipogen-consortium
http://www.med.unc.edu/pgc/
http://www.reprogen.org
http://www.t2diabetesgenes.org/data/
http://www.thessgac.org/data
http://www.tweelingenregister.org/EAGLE/
http://www.tweelingenregister.org/GPC
http://www.urr.cat/
https://ctg.cncr.nl/software/summary_statistics
https://data.bris.ac.uk/data/dataset/28uchsdpmub118uex26ylacqm
https://sleepgenetics.org/downloads/
https://walker05.u.hpc.mssm.edu/
https://www.cng.fr/gabriel/index.html
https://www.ibdgenetics.org
https://www.magicinvestigators.org/
https://www.nhlbi.nih.gov/research/intramural/researchers/ckdgen
https://www.broadinstitute.org/diabetes

