Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

1- Overview 

This is a guide for an introductory analysis to 1) construct a polygenic risk score (PRS) using the base data (GWAS summary statistics, particularly with effect-sizes and P-values, generally publicly available) via a clumping and thresholding method (C + T); and 2) test the constructed PRS for prediction using the target data (PLINK binary data files and phenotype csv file). In general, it is the ‘user’ data).

2- Learning Objectives

  1. Apply quality control measures to base/target sample prior to PRS analysis;
  2. Perform PRS analysis (hands-on);
  3. Understand the graphs and outputs (hands-on)

3- Materials

Ideally for PRS analyses, you would be using the genome-wide genotype data. Here we use base data containing summary statistics and target data containing genotypes for chromosome 16 as an example to demonstrate the workflow for the prediction of simulated body mass index (BMI) data. The procedure described below will be the same for the genome-wide dataset. All the materials required for this workshop are attached here. Relevant materials for this workshop are as follows:

...

  1. Check SNPs Heritability: h2SNP>0.05
  2. The effect allele must be known. Both Base and Target datasets should have the same effect allele.
  3. Filter out SNPs with MAF < 0.01 and INFO < 0.8. (May also consider filtering for minimum effective sample size NEFF – at least half of full sample, as suggested by Dr. Tian Ge (PRS-CS))


Code Block
## BMI_1kgph3_chr16_snps_summarystat.txt:
 
# With the following columns: “SNP” = marker ID, “A1” = effect allele (could be minor allele if output from plink), “A2” = reference allele, “Freq1.Hapmap” = allele frequency according to Hapmap, “b” = effect estimate for the minor allele, “se” = standard error of the effect estimate, “p” = p-value, “N” = sample size
 
        SNP A1 A2 Freq1.Hapmap       b     se      p      N
1 rs1000014  G  A       0.8000 -0.0055 0.0049 0.2617 233462
2 rs1000047  C  T       0.6917  0.0019 0.0040 0.6348 233959
3 rs1000077  C  G       0.4833  0.0006 0.0038 0.8745 220681
4 rs1000078  A  G       0.7000 -0.0031 0.0041 0.4496 232588
5 rs1000100  A  T       0.4167  0.0006 0.0040 0.8808 233753
6 rs1000174  A  G       0.3417 -0.0025 0.0052 0.6307 150418

...


4- Quality Control Checklist 

The Base and Target data must be matching for the following items:

...

Code Block
bmikg2=rbind(bmikg1a,bmikg1b)
 
nrow(bmikg2)    #55,752
 
head(bmikg2)
 
## Saving the base-cleaned-alleles file to be used later
write.table(bmikg2,"BMI_unambiguousSNPsposstrand-B_AllelesV5V6_20200226.txt",quote=F,sep='\t',col.names=T,row.names=F)
 
## Setting up and saving the file for clumping in the next step
bmikg2s=subset(bmikg2,select=c("SNP","p"))
colnames(bmikg2s)<-c("SNP","P")
write.table(bmikg2s,"BMI_SNPs_P_forclumping.txt",quote=F,sep='\t',col.names=T,row.names=F)


5- Calculation of PRS via clumping and thresholding

The following procedure is using the cleaned file generated above.

...

Code Block
## For example, 1kgph3_chr16_profile_20200226.0.001.profile:
## with columns: FID=Family ID, IID=Individual ID, PHENO=phenotype, CNT=number of SNPs used for scoring, CNT2=number of named (effect) alleles for each subject, SCORE=polygenic risk score for each subject
 
      FID       IID  PHENO    CNT   CNT2        SCORE
  HG00096   HG00096     -9    148     45 -0.000618919
  HG00097   HG00097     -9    148     40   0.00219797
  HG00099   HG00099     -9    148     44   0.00117297
  HG00101   HG00101     -9    148     47  -0.00123784
  HG00102   HG00102     -9    148     46  -0.00163041


6- Applications of PRS scores


6.1- REGRESSION ANALYSIS 

...

Code Block
## Histogram showing distribution of polygenic risk scores (shown for p-value threshold of 0.5 from Example (a))
pdf("Histogram_testscore_pT_5.pdf")
hist(prs1phen$SCORE, breaks=100, main = paste("Distribution of PRS scores (n =", length(prs1phen$SCORE), ")"))
dev.off()
 
## Scatterplot of BMI vs PRS (shown for p-value threshold of 0.5 from Example (a))
pdf("Scatterplot_dummybmi_vs_PRS_pT_5.pdf")
plot(prs1phen$SCORE,prs1phen$dummybmi)
abline(lm(prs1phen$dummybmi ~ prs1phen$SCORE), col = "red",
lty = 1, lwd = 1)
dev.off()
 
 
# Plots of prediction R-squared values vs p-value thresholds (for Example (b))
prs_files  <- list.files(pattern = "1kgph3_chr16_profile_20200226.*.*.profile$");
prs_list <- lapply(prs_files, function(rr) read.table(rr, head=T))
r_squared<-NA
for (j in 1:length(prs_list)){
prs1phen=merge(prs_list[[j]] ,phen, by.x="FID", by.y="V1")
lmprs<-lm(data= prs1phen, dummybmi~SCORE + sex)
r_squared[j]<-summary(lmprs)$r.squared
}
 
p_threshold <- as.numeric(sapply(sapply(prs_files, function(rr) strsplit(rr, "_20200226*\\.")[[1]][[2]]), function(ee) strsplit(ee, ".profile")[[1]]))
 
## Lineplot of prediction R-squared values vs p-value thresholds
pdf("Lineplot_R-squared_vs_p-value_threshold.pdf")
plot(p_threshold, r_squared, xlab = "P-value threshold", ylab="Prediction R-squared")
lines(p_threshold, r_squared);
abline(v=p_threshold[which.max(r_squared)], col=2)
dev.off()
 
 
## Barplot of prediction R-squared values vs p-value thresholds -- colour gradient from blue for lowest R-squared value to red for highest R-squared value; legend showing R-squared values in the order of the p-value thresholds
rankfac <- rank(r_squared)
rbPal <- colorRampPalette(c('blue','red'))
Col <- rbPal(length(rankfac))[as.numeric(cut(rankfac,breaks = length(rankfac)))]
pdf("Barplot_R-squared_vs_p-value_threshold_bluetored.pdf")
barplot(r_squared, names.arg=p_threshold, xlab = "P-value threshold", ylab="Prediction R-squared",col=Col, legend=round(r_squared, digits=4), args.legend=list(title="R-squared"), xlim=c(0,12))
dev.off()



7- Additional resources for PRS analysis

Software for alternative PRS methods:

...

Additional codes (after installation following instructions):

Code Block
## PRSice2
module load lang/R/3.5.1-Python-3.8.5-Anaconda3-2020.11
R --vanilla
library(ggplot2)
library(optparse)
library(method)
library(tools)
library(data.table)
library(grDevices)
library(RColorBrewer)
q()
Rscript workdir/PRSice.R --dir workdir --prsice ./workdir/PRSice_linux --base workdir/source_gwas_sumstat_info_9_prsicein.txt --target indir/target_gwas_cleaned_plink --bar-levels 5e-8,0.00001,0.00005,0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.2,0.3,0.4,0.5,1 --seed 1234 --perm 10000 --fastscore --all-score --no-regress T --out outdir/target_gwas_cleaned_source_gwas_prs

## PRS-CS
python workdir/PRScs/PRScs.py --ref_dir=workdir/ldblk_1kg_eur --bim_prefix=indir/target_gwas_cleaned_plink --sst_file=workdir/source_gwas_sumstat_prscsin.txt --n_gwas=[neff] --seed=1234 --out_dir=outdir/target_gwas_source_gwas_sumstat_prscsout
# Merge sumstat outputs from PRS-CS chr1-22 in R:
R
mci<-c()
mcall<-c()
for (i in 1:22) {
	mci<-read.table(paste("outdir//target_gwas_source_gwas_sumstat_prscsout_pst_eff_a1_b0.5_phiauto_chr",i,".txt",sep=""),header=F)
	mcall<-rbind(mcall,mci)
}
colnames(mcall)<-c("CHR","SNP","BP","A1","A2","B")
write.table(mcall,"outdir//target_gwas_source_gwas_sumstat_prscsout_pst_eff_a1_b0.5_phiauto_chr1-22.txt", row.names=F, col.names=T, quote=F, sep='\t')
mcalla=subset(mcall,select=c("SNP","A1","B"))
colnames(mcalla)<-c("SNP","A1","Score")
write.table(mcalla,"outdir//target_gwas_source_gwas_sumstat_prscsout_pst_eff_a1_b0.5_phiauto_chr1-22.raw", row.names=F, col.names=T, quote=F, sep='\t')
# Run polygenic scoring in PLINK:
plink --bfile indir/target_gwas_cleaned_plink --score outdir/target_gwas_source_gwas_sumstat_prscsout_pst_eff_a1_b0.5_phiauto_chr1-22.raw --out outdir/target_gwas_source_gwas_sumstat_prscsout

8- References

  • Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, Yang J; Schizophrenia Working Group of the Psychiatric Genomics Consortium, Patterson N, Daly MJ, Price AL, Neale BM. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies.  Nature Genetics, 2015; 47:291–295PMID: 25642630  PMCID: PMC4495769  DOI: 10.1038/ng.3211
  • Choi, S.W., Mak, T.S. & O’Reilly, P.F. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc.  2020; 15: 2759–2772. https://doi-org.myaccess.library.utoronto.ca/10.1038/s41596-020-0353-1
  • Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, Powell C, Vedantam S, Buchkovich ML, Yang J, Croteau-Chonka DC, Esko T et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015; 518:197-206. PMID: 25673413  PMCID: PMC4382211  DOI: 10.1038/nature14177
  • Ni G, Zeng J, Revez JA, Wang Y, Zheng Z, Ge T, Restuadi R, Kiewa J, Nyholt DR, Coleman JRI, Smoller JW, Schizophrenia Working Group of the Psychiatric Genomics Consortium, Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium, Yang J, Visscher PM, Wray NR.  A comparison of ten polygenic score methods for psychiatric disorders applied across multiiple cohorts.  Biol Psychiatry. 2021 Nov 1; 90(9): 611–620. PMID: 34304866 URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8500913/
  • Pain O, Glanville KP, Hagenaars SP, Selzam S, Fürtjes AE, Gaspar HA, Coleman JRI, Rimfeld K, Breen G, Plomin R, Folkersen L, Lewis CM. Evaluation of polygenic prediction methodology within a reference-standardized framework. PLoS Genet. 2021 May 4;17(5):e1009021. doi: 10.1371/journal.pgen.1009021. eCollection 2021 May. PMID: 33945532 URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8121285/
  • Speed D, Holmes J, Balding DJ. Evaluating and improving heritability models using summary statistics.  Nature Genetics, 2020; 52: 458–462.  PMID: 32203469  DOI: 10.1038/s41588-020-0600-y


9- Acknowledgements

  • McLaughlin PRS team: e.g., Dr. Wei Deng, Dr. Delnaz Roshandel

...