1- Overview
This is a guide for an introductory analysis to 1) construct a polygenic risk score (PRS) using the base data (GWAS summary statistics, particularly effect-sizes and P-values, generally public available) via a clumping and thresholding method (C + T); and 2) test the constructed PRS for prediction using the target data (PLINK binary data format). In general, it is the ‘user’ data).
2- Learning Objectives
- Apply quality control measures to base/target sample prior to PRS analysis;
- Perform PRS analysis (hands-on);
- Understand the graphs and outputs (hands-on)
3-
...
Materials
Ideally for PRS analyses, you would be using the genome-wide genotype data. Here we use base data containing summary statistics and target data containing genotypes for chromosome 16 as an example to demonstrate the workflow for prediction of simulated body mass index (BMI) data. The procedure described below will be the same for the genome-wide dataset. All the materials required for this workshop are attached here. Relevant materials for this workshop are as follows:
...
3.1.1- QC for Base data:1)-
- Check SNPs Heritability: h2SNP>0.05
- LD Score Regression: https://github.com/bulik/ldsc (
...
- Bulik-Sullivan et al, Nature Genetics, 2015)
- SumHer: http://dougspeed.com/snp-heritability/ (
...
- Speed et al, Nature Genetics, 2020)
...
- The effect allele must be known. Both Base and Target datasets should have the same effect allele.
...
- Filter out SNPs with MAF < 0.01 and INFO < 0.8.
Code Block |
---|
### BMI_1kgph3_chr16_snps_summarystat.txt: # With the following columns: “SNP” = marker ID, “A1” = minor allele or effect allele, “A2” = major allele or reference allele, “Freq1.Hapmap” = allele frequency according to Hapmap, “b” = effect estimate for the minor allele, “se” = standard error of the effect estimate, “p” = p-value, “N” = sample size SNP A1 A2 Freq1.Hapmap b se p N 1 rs1000014 G A 0.8000 -0.0055 0.0049 0.2617 233462 2 rs1000047 C T 0.6917 0.0019 0.0040 0.6348 233959 3 rs1000077 C G 0.4833 0.0006 0.0038 0.8745 220681 4 rs1000078 A G 0.7000 -0.0031 0.0041 0.4496 232588 5 rs1000100 A T 0.4167 0.0006 0.0040 0.8808 233753 6 rs1000174 A G 0.3417 -0.0025 0.0052 0.6307 150418 |
...
3.2.1- QC for Target data: 1)-
- Check if your Target data is in the same genome build as Base data.
...
- Basic QC: e.g., geno >0.99, mind <0.02, HWE P>1x10-6, 3SD HET, MAF >0.01, INFO >0.8. Also, remove indels and multi-allelic SNPs.
...
- Avoid sample overlap, as well as high degree of relatedness between individuals of Base and Target data.
...
- Check strand and allele calls
Code Block |
---|
### 1kgph3_chr16.bed: # binary plink pedigree genotype file; do not try to open this file ### 1kgph3_chr16.bim (as appearing in R): # with 277,663 biallelic markers (SNPs or Ins/Del markers) – 6 columns without column names: “V1”=chromosome, “V2”=marker ID, “V3”=genetic distance, “V4”=chromosomal position in bp, “V5”=Allele1, “V6”=Allele2 V1 V2 V3 V4 V5 V6 1 16 rs185537431 0 60778 A G 2 16 rs377548396 0 62569 A G 3 16 rs368745239 0 66640 T G 4 16 rs187053456 0 70765 A C 5 16 rs193118147 0 70767 A G 6 16 rs201639477 0 75246 C CTTTTTTT ### 1kgph3_chr16.fam (as appearing in R): # with 476 individuals – 6 columns without column names: “V1”=Family ID, “V2”=Individual ID, “V3”=Father ID, “V4”=Mother ID, “V5”=Sex (Male=1, Female=2), “V6”=Affected status (Yes=2, No=1, Unknown=-9) V1 V2 V3 V4 V5 V6 1 HG00096 HG00096 0 0 1 -9 2 HG00097 HG00097 0 0 2 -9 3 HG00099 HG00099 0 0 2 -9 4 HG00101 HG00101 0 0 1 -9 5 HG00102 HG00102 0 0 2 -9 6 HG00103 HG00103 0 0 1 -9 ### 1kgph3_dummybmi20200804.csv (as appearing in R): # with the columns with column names: “V1”=Family ID, “V2”=Individual ID, “V3”=Father ID, “V4”=Mother ID, “V5”=Sex, “V6”=Affective status, “dummybmi”=phenotype of interest, “Sex” (male=1, female=0) V1 V2 V3 V4 V5 V6 dummybmi sex HG00096 HG00096 0 0 1 -9 13.42457034 1 HG00097 HG00097 0 0 2 -9 11.29648997 0 HG00099 HG00099 0 0 2 -9 10.98411446 0 HG00101 HG00101 0 0 1 -9 8.268308741 1 HG00102 HG00102 0 0 2 -9 11.1241505 0 |
3.3- Required Softwares
- PLINK 1.9 (http://www.cog-genomics.org/plink2/). Use the stable version (19 Oct 2020 (b6.21)).
- R (version 3.2.3+) (https://cran.r-project.org/)
4- Quality Control Checklist
The Base and Target data must be matching for the following items:
- SNP IDs (for this workshop, both base and target datasets have rsIDs)
- chromosome and map positions (we can double-check base vs target datasets if that information is available)
- allele 1 and allele 2 calls
- ambiguous SNPs
- strand
- allele 1 vs allele 2
For the workshop we are using simulated data (see above). For this dataset, we will only check strand and allele calls.
...
### merging base (GWAS summary stat) with target (GWAS) bim file to check strands and allele calls. The expected outputs for "nrow" are shown.
Code Block |
---|
R --vanilla ## only if you are running R in your computer terminal kg=read.table("1kgph3_chr16.bim",header=F) bmi=read.table("BMI_1kgph3_chr16_snps_summarystat.txt",header=T) nrow(kg) ##277,663 head(kg) nrow(bmi) ##68,385 head(bmi) bmikg=merge(bmi,kg,by.x="SNP",by.y="V2") # merging step nrow(bmikg) ##68,385 head(bmikg) |
...
Code Block |
---|
bmikg2=rbind(bmikg1a,bmikg1b) nrow(bmikg2) #55,752 head(bmikg2) ## Saving the base-cleaned-alleles file to be used later write.table(bmikg2,"BMI_unambiguousSNPsposstrand-B_AllelesV5V6_20200226.txt",quote=F,sep='\t',col.names=T,row.names=F) ## Setting up and saving the file for clumping in the next step bmikg2s=subset(bmikg2,select=c("SNP","p")) colnames(bmikg2s)<-c("SNP","P") write.table(bmikg2s,"BMI_SNPs_P_forclumping.txt",quote=F,sep='\t',col.names=T,row.names=F) |
5- Calculation of PRS via clumping and thresholding
The following procedure is using the cleaned file generated above.
Tip: Effect sizes given as odds ratios (OR) will need to be converted to Beta (B) using the natural logarithm of the OR. In this way, the PRS can be computed using summation. (The effect estimates can be transformed from B back to OR afterwards).
5.1- CLUMPING using PLINK
### The most commonly used method for computing PRS is clumping and thresholding (C+T). Before calculating PRS, the variants are first clumped, and variants that are weakly correlated (r2) with one another are retained. The clumping step prunes redundant correlated effects caused by linkage disequilibrium (LD) between variants. Thresholding will remove variants with a p value larger than a chosen level of significance (default: 0.0001).
...
Code Block |
---|
bmikg2=read.table("BMI_unambiguousSNPsposstrand-B_AllelesV5V6_20200226.txt",header=T) clumped=read.table("1kgph3_chr16_test_clumped_1.clumped",header=T) nrow(clumped) bmikg3=merge(bmikg2,clumped,by="SNP",sort=F) nrow(bmikg3) ##4651 |
5.2- THRESHOLDING using R
### As an example, we will use either (a) p-value threshold of 0.5, or (b) a range of p-value thresholds from base GWAS summary statistics to select SNPs for polygenic risk scoring in the target GWAS dataset
...
Code Block |
---|
## writing input file for generating polygenic risk scores bmikg3b=subset(bmikg3,select=c("SNP","V5","B")) colnames(bmikg3b)<-c("SNP","A1","Score") write.table(bmikg3b,"1kgph3_chr16_test_clumped_1_nothreshold.raw",col.names=T,row.names=F,quote=F,sep='\t') ## writing file for filtering based on p-values in the range_list.txt file above bmikg3c=subset(bmikg3,select=c("SNP","P")) write.table(bmikg3c,"1kgph3_chr16_test_clumped_1_nothreshold.snppvalues",col.names=T,row.names=F,quote=F,sep='\t') |
5.3- GENERATING THE POLYGENIC RISK SCORES
...
# The command above will generate one output file with extension "*.profile"
Code Block |
---|
### 1kgph3_chr16_profile_5_20200226.profile: ## with columns: FID=Family ID, IID=Individual ID, PHENO=phenotype, CNT=number of SNPs used for scoring, CNT2=number of named (effect) alleles for each subject, SCORE=polygenic risk score for each subject FID IID PHENO CNT CNT2 SCORE HG00096 HG00096 -9 6508 1107 -2.34634e-05 HG00097 HG00097 -9 6508 1193 0.000309573 HG00099 HG00099 -9 6508 1156 3.56331e-05 HG00101 HG00101 -9 6508 1167 -4.76951e-05 HG00102 HG00102 -9 6508 1157 -3.64935e-05 |
### Example (b) using a range of p-value thresholds
...
# The command above will generate eight output files with extension "*.profile"
Code Block |
---|
### For example, 1kgph3_chr16_profile_20200226.0.001.profile: ## with columns: FID=Family ID, IID=Individual ID, PHENO=phenotype, CNT=number of SNPs used for scoring, CNT2=number of named (effect) alleles for each subject, SCORE=polygenic risk score for each subject FID IID PHENO CNT CNT2 SCORE HG00096 HG00096 -9 148 45 -0.000618919 HG00097 HG00097 -9 148 40 0.00219797 HG00099 HG00099 -9 148 44 0.00117297 HG00101 HG00101 -9 148 47 -0.00123784 HG00102 HG00102 -9 148 46 -0.00163041 |
6- Applications of PRS scores
6.1- REGRESSION ANALYSIS
## In R:
...
Code Block |
---|
prs1phen=read.csv("test_clump_1_prs_5.csv",header=T) head(prs1phen) lmprs<-lm(data=prs1phen, dummybmi~SCORE + sex) summary(lmprs) |
# Plotting6.2- PLOTTING
Code Block |
---|
## Histogram showing distribution of polygenic risk scores (shown for p-value threshold of 0.5 from Example (a)) pdf("Histogram_testscore_pT_5.pdf") hist(prs1phen$SCORE, breaks=100, main = paste("Distribution of PRS scores (n =", length(prs1phen$SCORE), ")")) dev.off() ## Scatterplot of BMI vs PRS (shown for p-value threshold of 0.5 from Example (a)) pdf("Scatterplot_dummybmi_vs_PRS_pT_5.pdf") plot(prs1phen$SCORE,prs1phen$dummybmi) abline(lm(prs1phen$dummybmi ~ prs1phen$SCORE), col = "red", lty = 1, lwd = 1) dev.off() # Plots of prediction R-squared values vs p-value thresholds (for Example (b)) prs_files <- list.files(pattern = "1kgph3_chr16_profile_20200226.*.*.profile$"); prs_list <- lapply(prs_files, function(rr) read.table(rr, head=T)) r_squared<-NA for (j in 1:length(prs_list)){ prs1phen=merge(prs_list[[j]] ,phen, by.x="FID", by.y="V1") lmprs<-lm(data= prs1phen, dummybmi~SCORE + sex) r_squared[j]<-summary(lmprs)$r.squared } p_threshold <- as.numeric(sapply(sapply(prs_files, function(rr) strsplit(rr, "_20200226*\\.")[[1]][[2]]), function(ee) strsplit(ee, ".profile")[[1]])) ## Lineplot of prediction R-squared values vs p-value thresholds pdf("Lineplot_R-squared_vs_p-value_threshold.pdf") plot(p_threshold, r_squared, xlab = "P-value threshold", ylab="Prediction R-squared") lines(p_threshold, r_squared); abline(v=p_threshold[which.max(r_squared)], col=2) dev.off() ## Barplot of prediction R-squared values vs p-value thresholds -- colour gradient from blue for lowest R-squared value to red for highest R-squared value; legend showing R-squared values in the order of the p-value thresholds rankfac <- rank(r_squared) rbPal <- colorRampPalette(c('blue','red')) Col <- rbPal(length(rankfac))[as.numeric(cut(rankfac,breaks = length(rankfac)))] pdf("Barplot_R-squared_vs_p-value_threshold_bluetored.pdf") barplot(r_squared, names.arg=p_threshold, xlab = "P-value threshold", ylab="Prediction R-squared",col=Col, legend=round(r_squared, digits=4), args.legend=list(title="R-squared"), xlim=c(0,12)) dev.off() |
...
7- Additional resources for PRS analysis
Software for alternative PRS methods:
- PRSice (https://www.prsice.info/)
- LDhub (http://ldsc.broadinstitute.org/ldhub/)
- LDPred (https://github.com/bvilhjal/ldpred)
Base data:
- GWAS catalog (https://www.ebi.ac.uk/gwas/downloads/summary-statistics)
- Psychiatric Genomics Consortium (https://www.med.unc.edu/pgc/download-results/)
- GIANT Consortium online portal (http://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files)
- DIAGRAM (http://diagram-consortium.org/)
- Global Lipids Genetics Consortium (http://csg.sph.umich.edu/willer/public/lipids2013/)
Additional tutorial or guides:
- PRS tutorial (https://choishingwan.github.io/PRS-Tutorial/; Choi et al., 2020)
- General GWAS and PRS tutorial (https://github.com/MareesAT/GWA_tutorial/)
...
8- References
...
- Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, Yang J; Schizophrenia Working Group of the Psychiatric Genomics Consortium, Patterson N, Daly MJ, Price AL, Neale BM. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics, 2015; 47:291–295. PMID: 25642630 PMCID: PMC4495769 DOI: 10.1038/ng.3211
...
- Choi, S.W., Mak, T.S. & O’Reilly, P.F. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc. 2020; 15
...
- : 2759–2772
...
- . https://doi-org.myaccess.library.utoronto.ca/10.1038/s41596-020-0353-1
- Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, Powell C, Vedantam S, Buchkovich ML, Yang J, Croteau-Chonka DC, Esko T et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015; 518:197-206. PMID: 25673413 PMCID: PMC4382211 DOI: 10.1038/nature14177
- Speed D, Holmes J, Balding DJ. Evaluating and improving heritability models using summary statistics. Nature Genetics, 2020; 52: 458–462. PMID: 32203469 DOI: 10.1038/s41588-020-0600-y