...
R (version 3.2.3+) (https://cran.r-project.org/)
4- Quality Control Checklist
The Base and Target data must be matching for the following items:
- SNP IDs (for this workshop, both base and target datasets have rsIDs)
- chromosome and map positions (we can double-check base vs target datasets if that information is available)
- allele 1 and allele 2 calls
- ambiguous SNPs
- strand
- allele 1 vs allele 2
For the workshop we are using simulated data (see above). For this dataset, we will only check strand and allele calls.
Tip: It is important to know the effect allele from the Base (GWAS summary stats) data to avoid misleading conclusions. Both Base and Target datasets must have the same effect allele.
We will remove strand-ambiguous SNPs (i.e. those with complementary alleles, either C/G or A/T SNPs) and duplicate SNPs (indicative of multiallelic markers) in both the Base and Target data. We will also resolve mismatched SNPs names between the two datasets. The unresolved, mismatching SNPs will be removed as well.
### merging base (GWAS summary stat) with target (GWAS) bim file to check strands and allele calls. The expected outputs are shown.
Code Block |
---|
R --vanilla ## only if you are running R in your computer terminal
kg=read.table("1kgph3_chr16.bim",header=F)
bmi=read.table("BMI_1kgph3_chr16_snps_summarystat.txt",header=T)
nrow(kg) ##277,663
head(kg)
nrow(bmi) ##68,385
head(bmi)
bmikg=merge(bmi,kg,by.x="SNP",by.y="V2") # merging step
nrow(bmikg) ##68,385
head(bmikg)
|
Tip: If the SNPs do not have the same id between the target and base datasets, merge by chromosome, position (and alleles if the files include multi-allelic variants).
### To mark ambiguous G/C or A/T SNPs for removal
## let’s create a column (ambiguousSNPs) flagging the ambiguous SNPs as “1” in our merged file bmikg.
Code Block |
---|
bmikg$ambiguousSNPs[bmikg$A1 == "A" & bmikg$A2 == "T"]<-1
bmikg$ambiguousSNPs[bmikg$A1 == "C" & bmikg$A2 == "G"]<-1
bmikg$ambiguousSNPs[bmikg$A1 == "G" & bmikg$A2 == "C"]<-1
bmikg$ambiguousSNPs[bmikg$A1 == "T" & bmikg$A2 == "A"]<-1
head(bmikg) ## ambiguousSNPs column with ambiguous markers marked as “1”, and non-ambiguous markers marked as “NA” |
### To exclude ambiguous SNPs
Code Block |
---|
bmikg1=bmikg[is.na(bmikg$ambiguousSNPs),] ## to keep only non-ambiguous SNPs
nrow(bmikg1) ##55,752 non-unambiguous SNPs remaining |
### To find perfect allele matches between base GWAS summary statistics file and target GWAS genotype data (i.e., A1 (column “A1”) in GWAS summary statistics file is the same as A1 (column “V5”) in target GWAS data, and A2 (column “A2”) in base GWAS summary statistics file is the same as A2 (column “V6”) in target GWAS data) -- beta will be unchanged
Code Block |
---|
bmikg1aa=bmikg1[as.character(bmikg1$A1) == bmikg1$V5 & as.character(bmikg1$A2) == bmikg1$V6,]
nrow(bmikg1aa) ##28,842 |
### To find allele matches for flipped strand between base GWAS summary statistics file and target GWAS data -- beta will be unchanged
Code Block |
---|
bmikg1ab=bmikg1[bmikg1$A1 == "A" & bmikg1$A2 == "C" & bmikg1$V5 == "T" & bmikg1$V6 == "G",]
bmikg1ac=bmikg1[bmikg1$A1 == "A" & bmikg1$A2 == "G" & bmikg1$V5 == "T" & bmikg1$V6 == "C",]
bmikg1ad=bmikg1[bmikg1$A1 == "C" & bmikg1$A2 == "A" & bmikg1$V5 == "G" & bmikg1$V6 == "T",]
bmikg1ae=bmikg1[bmikg1$A1 == "C" & bmikg1$A2 == "T" & bmikg1$V5 == "G" & bmikg1$V6 == "A",]
bmikg1af=bmikg1[bmikg1$A1 == "G" & bmikg1$A2 == "A" & bmikg1$V5 == "C" & bmikg1$V6 == "T",]
bmikg1ag=bmikg1[bmikg1$A1 == "G" & bmikg1$A2 == "T" & bmikg1$V5 == "C" & bmikg1$V6 == "A",]
bmikg1ah=bmikg1[bmikg1$A1 == "T" & bmikg1$A2 == "C" & bmikg1$V5 == "A" & bmikg1$V6 == "G",]
bmikg1ai=bmikg1[bmikg1$A1 == "T" & bmikg1$A2 == "G" & bmikg1$V5 == "A" & bmikg1$V6 == "C",]
bmikg1a=rbind(bmikg1aa,bmikg1ab,bmikg1ac,bmikg1ad,bmikg1ae,bmikg1af,bmikg1ag,bmikg1ah,bmikg1ai) # to combine all the datasets created above + bmikg1aa with non-ambiguous SNPs
nrow(bmikg1a) ##28,854 |
### To add column “B” for beta to be used in polygenic risk scoring (note: “b” is the original beta)
### B=b for SNPs with matching alleles above (beta will remain unchanged)
Code Block |
---|
bmikg1a$B=bmikg1a$b |
### To find perfect allele switches between A1 and A2 (i.e., A1 in base GWAS summary statistics file is the same as A2 in target GWAS data, and vice versa) -- beta will have opposite sign
Code Block |
---|
bmikg1ba=bmikg1[as.character(bmikg1$A1) == bmikg1$V6 & as.character(bmikg1$A2) == bmikg1$V5,]
nrow(bmikg1ba) ##26,883 |
### To find flipped strands but perfect allele switches between A1 and A2 (i.e., A1 (column “A1”) in base GWAS summary statistics file is the same as the complementary allele for A2 (column “V6”) in target GWAS data, and vice versa) -- beta will have opposite sign
Code Block |
---|
bmikg1bb=bmikg1[bmikg1$A1 == "A" & bmikg1$A2 == "C" & bmikg1$V5 == "G" & bmikg1$V6 == "T",]
bmikg1bc=bmikg1[bmikg1$A1 == "A" & bmikg1$A2 == "G" & bmikg1$V5 == "C" & bmikg1$V6 == "T",]
bmikg1bd=bmikg1[bmikg1$A1 == "C" & bmikg1$A2 == "A" & bmikg1$V5 == "T" & bmikg1$V6 == "G",]
bmikg1be=bmikg1[bmikg1$A1 == "C" & bmikg1$A2 == "T" & bmikg1$V5 == "A" & bmikg1$V6 == "G",]
bmikg1bf=bmikg1[bmikg1$A1 == "G" & bmikg1$A2 == "A" & bmikg1$V5 == "T" & bmikg1$V6 == "C",]
bmikg1bg=bmikg1[bmikg1$A1 == "G" & bmikg1$A2 == "T" & bmikg1$V5 == "A" & bmikg1$V6 == "C",]
bmikg1bh=bmikg1[bmikg1$A1 == "T" & bmikg1$A2 == "C" & bmikg1$V5 == "G" & bmikg1$V6 == "A",]
bmikg1bi=bmikg1[bmikg1$A1 == "T" & bmikg1$A2 == "G" & bmikg1$V5 == "C" & bmikg1$V6 == "A",]
bmikg1b=rbind(bmikg1ba,bmikg1bb,bmikg1bc,bmikg1bd,bmikg1be,bmikg1bf,bmikg1bg,bmikg1bh,bmikg1bi)
nrow(bmikg1b) ##26,898 |
### NOTE: Need to check for allelic mismatches beyond those assessed in the steps above (same strands or flipped strands, same alleles or switched alleles): additional checks would include checking for INDELs (insertions/deletions) and non-matching alleles (not included in this workshop)
### HERE: our data have 0 SNPs with mismatching A1 and A2 between GWAS summary statistics and target GWAS data
### NOTE: there are 0 INDELs with mismatching A1 and A2 between Base GWAS summary stat and target GWAS data
Code Block |
---|
nrow(bmikg1b)+nrow(bmikg1a) ##55,752 |
### B=-b for SNPs with switched alleles above (beta will have opposite sign)
Code Block |
---|
bmikg1b$B=0-bmikg1b$b |
### ***NOTE: Use column “V5” as Allele1 & column “V6” as Allele2 & column “B” as the effect estimate***
Code Block |
---|
bmikg2=rbind(bmikg1a,bmikg1b)
nrow(bmikg2) #55,752
head(bmikg2)
## Saving the base-cleaned-alleles file to be used later
write.table(bmikg2,"BMI_unambiguousSNPsposstrand-B_AllelesV5V6_20200226.txt",quote=F,sep='\t',col.names=T,row.names=F)
## Setting up and saving the file for clumping in the next step
bmikg2s=subset(bmikg2,select=c("SNP","p"))
colnames(bmikg2s)<-c("SNP","P")
write.table(bmikg2s,"BMI_SNPs_P_forclumping.txt",quote=F,sep='\t',col.names=T,row.names=F)
|
5- Calculation of PRS via clumping and thresholding
The following procedure is using the cleaned file generated above.
Tip: Effect sizes given as odds ratios (OR) will need to be converted to Beta (B) using the natural logarithm of the OR. In this way, the PRS can be computed using summation. (The effect estimates can be transformed from B back to OR afterwards).
CLUMPING using PLINK
### The most commonly used method for computing PRS is clumping and thresholding (C+T). Before calculating PRS, the variants are first clumped, and variants that are weakly correlated (r2) with one another are retained. The clumping step prunes redundant correlated effects caused by linkage disequilibrium (LD) between variants. Thresholding will remove variants with a p value larger than a chosen level of significance (default: 0.0001).
It requires 4 parameters: clump-p1=significance threshold of the index SNP, clump-p2=significance threshold of the tag/clumped SNP, clump-r2=LD threshold for clumping, clump-kb=threshold of physical distance for clumping in kb. Here, we are using the following parameters: clump-r2 = 0.1 and clump-kb = 250
Code Block |
---|
plink --bfile 1kgph3_chr16 --clump-p1 1 --clump-p2 1 --clump-r2 0.10 --clump-kb 250 --clump BMI_SNPs_P_forclumping.txt --out 1kgph3_chr16_test_clumped_1
## If you are using MAC, use the command below:
./plink --bfile 1kgph3_chr16 --clump-p1 1 --clump-p2 1 --clump-r2 0.10 --clump-kb 250 --clump BMI_SNPs_P_forclumping.txt --out 1kgph3_chr16_test_clumped_1
|
# The command above produces the output ".clumped"
In the commands below, we are combining base-cleaned-alleles file (BMI_unambiguousSNPsposstrand-B_AllelesV5V6_20200226.txt) with SNPs retained after the clumping step.
## In R:
Code Block |
---|
bmikg2=read.table("BMI_unambiguousSNPsposstrand-B_AllelesV5V6_20200226.txt",header=T)
clumped=read.table("1kgph3_chr16_test_clumped_1.clumped",header=T)
nrow(clumped)
bmikg3=merge(bmikg2,clumped,by="SNP",sort=F)
nrow(bmikg3) ##4651 |
THRESHOLDING using R
### As an example, we will use either (a) p-value threshold of 0.5, or (b) a range of p-value thresholds from base GWAS summary statistics to select SNPs for polygenic risk scoring in the target GWAS dataset
### Example (a) p=<0.5
Code Block |
---|
bmikg3a=bmikg3[bmikg3$P<0.5,]
bmikg3a1=subset(bmikg3a,select=c("SNP","V5","B"))
nrow(bmikg3a) ##3254
colnames(bmikg3a1)<-c("SNP","A1","Score")
write.table(bmikg3a1,"1kgph3_chr16_test_clumped_1_threshold_5.raw",col.names=T,row.names=F,quote=F,sep='\t') |
### Example (b) using range of p-value thresholds in PLINK
#sample range list txt: The columns of this file are ‘Name of threshold’, ‘Lower bound p-value’, and ‘Upper bound p-value’.
# In a text file, write:
Code Block |
---|
0.001 0 0.001
0.05 0 0.05
0.1 0 0.1
0.2 0 0.2
0.3 0 0.3
0.4 0 0.4
0.5 0 0.5
1.0 0 1.0 |
#save as: range_list.txt
Code Block |
---|
## writing input file for generating polygenic risk scores
bmikg3b=subset(bmikg3,select=c("SNP","V5","B"))
colnames(bmikg3b)<-c("SNP","A1","Score")
write.table(bmikg3b,"1kgph3_chr16_test_clumped_1_nothreshold.raw",col.names=T,row.names=F,quote=F,sep='\t')
## writing file for filtering based on p-values in the range_list.txt file above
bmikg3c=subset(bmikg3,select=c("SNP","P"))
write.table(bmikg3c,"1kgph3_chr16_test_clumped_1_nothreshold.snppvalues",col.names=T,row.names=F,quote=F,sep='\t') |
GENERATING THE POLYGENIC RISK SCORES
### Example (a) p-value threshold of 0.5
Code Block |
---|
plink --bfile 1kgph3_chr16 --score 1kgph3_chr16_test_clumped_1_threshold_5.raw --out 1kgph3_chr16_profile_5_20200226
## If you are using MAC, use the command below:
./plink --bfile 1kgph3_chr16 --score 1kgph3_chr16_test_clumped_1_threshold_5.raw --out 1kgph3_chr16_profile_5_20200226 |
# The command above will generate one output file with extension "*.profile"
Code Block |
---|
# 1kgph3_chr16_profile_5_20200226.profile:
## with columns: FID=Family ID, IID=Individual ID, PHENO=phenotype, CNT=number of SNPs used for scoring, CNT2=number of named (effect) alleles for each subject, SCORE=polygenic risk score for each subject
FID IID PHENO CNT CNT2 SCORE
HG00096 HG00096 -9 6508 1107 -2.34634e-05
HG00097 HG00097 -9 6508 1193 0.000309573
HG00099 HG00099 -9 6508 1156 3.56331e-05
HG00101 HG00101 -9 6508 1167 -4.76951e-05
HG00102 HG00102 -9 6508 1157 -3.64935e-05 |
### Example (b) using range of p-value thresholds
Code Block |
---|
plink --bfile 1kgph3_chr16 --score 1kgph3_chr16_test_clumped_1_nothreshold.raw --q-score-range range_list.txt 1kgph3_chr16_test_clumped_1_nothreshold.snppvalues --out 1kgph3_chr16_profile_20200226
## If you are using MAC, use the command below:
./plink --bfile 1kgph3_chr16 --score 1kgph3_chr16_test_clumped_1_nothreshold.raw --q-score-range range_list.txt 1kgph3_chr16_test_clumped_1_nothreshold.snppvalues --out 1kgph3_chr16_profile_20200226 |
# The command above will generate eight output files with extension "*.profile"
Code Block |
---|
# For example, 1kgph3_chr16_profile_20200226.0.001.profile:
## with columns: FID=Family ID, IID=Individual ID, PHENO=phenotype, CNT=number of SNPs used for scoring, CNT2=number of named (effect) alleles for each subject, SCORE=polygenic risk score for each subject
FID IID PHENO CNT CNT2 SCORE
HG00096 HG00096 -9 148 45 -0.000618919
HG00097 HG00097 -9 148 40 0.00219797
HG00099 HG00099 -9 148 44 0.00117297
HG00101 HG00101 -9 148 47 -0.00123784
HG00102 HG00102 -9 148 46 -0.00163041 |
6- Applications of PRS scores
REGRESSION ANALYSIS
## In R:
# Import the phenotype file
Code Block |
---|
phen=read.csv("1kgph3_dummybmi20200804.csv", header=T)
head(phen) |
# sex: male = 1, female =0
# import the PRS file output from plink
Code Block |
---|
prs1=read.table("1kgph3_chr16_profile_5_20200226.profile",header=T) # this is output from Example (a) polygenic risk scoring with p-value threshold of 0.5
head(prs1)
nrow(prs1) ##476 |
# merge phenotype (dummybmi) and PRS files
Code Block |
---|
prs1phen=merge(prs1,phen,by.x="FID",by.y="V1")
write.csv(prs1phen,"test_clump_1_prs_5.csv",row.names=F) |
# Regression analysis with quantitative outcome phenotype
Code Block |
---|
prs1phen=read.csv("test_clump_1_prs_5.csv",header=T)
head(prs1phen)
lmprs<-lm(data=prs1phen, dummybmi~SCORE + sex)
summary(lmprs) |
# Plotting
Code Block |
---|
## Histogram showing distribution of polygenic risk scores (shown for p-value threshold of 0.5 from Example (a))
pdf("Histogram_testscore_pT_5.pdf")
hist(prs1phen$SCORE, breaks=100, main = paste("Distribution of PRS scores (n =", length(prs1phen$SCORE), ")"))
dev.off()
## Scatterplot of BMI vs PRS (shown for p-value threshold of 0.5 from Example (a))
pdf("Scatterplot_dummybmi_vs_PRS_pT_5.pdf")
plot(prs1phen$SCORE,prs1phen$dummybmi)
abline(lm(prs1phen$dummybmi ~ prs1phen$SCORE), col = "red",
lty = 1, lwd = 1)
dev.off()
# Plots of prediction R-squared values vs p-value thresholds (for Example (b))
prs_files <- list.files(pattern = "1kgph3_chr16_profile_20200226.*.*.profile$");
prs_list <- lapply(prs_files, function(rr) read.table(rr, head=T))
r_squared<-NA
for (j in 1:length(prs_list)){
prs1phen=merge(prs_list[[j]] ,phen, by.x="FID", by.y="V1")
lmprs<-lm(data= prs1phen, dummybmi~SCORE + sex)
r_squared[j]<-summary(lmprs)$r.squared
}
p_threshold <- as.numeric(sapply(sapply(prs_files, function(rr) strsplit(rr, "_20200226*\\.")[[1]][[2]]), function(ee) strsplit(ee, ".profile")[[1]]))
## Lineplot of prediction R-squared values vs p-value thresholds
pdf("Lineplot_R-squared_vs_p-value_threshold.pdf")
plot(p_threshold, r_squared, xlab = "P-value threshold", ylab="Prediction R-squared")
lines(p_threshold, r_squared);
abline(v=p_threshold[which.max(r_squared)], col=2)
dev.off()
## Barplot of prediction R-squared values vs p-value thresholds -- colour gradient from blue for lowest R-squared value to red for highest R-squared value; legend showing R-squared values in the order of the p-value thresholds
rankfac <- rank(r_squared)
rbPal <- colorRampPalette(c('blue','red'))
Col <- rbPal(length(rankfac))[as.numeric(cut(rankfac,breaks = length(rankfac)))]
pdf("Barplot_R-squared_vs_p-value_threshold_bluetored.pdf")
barplot(r_squared, names.arg=p_threshold, xlab = "P-value threshold", ylab="Prediction R-squared",col=Col, legend=round(r_squared, digits=4), args.legend=list(title="R-squared"), xlim=c(0,12))
dev.off() |
8- Additional resources for PRS analysis
Software for alternative PRS methods:
PRSice (https://www.prsice.info/)
LDhub (http://ldsc.broadinstitute.org/ldhub/)
LDPred (https://github.com/bvilhjal/ldpred)
Base data:
GWAS catalog (https://www.ebi.ac.uk/gwas/downloads/summary-statistics)
Psychiatric Genomics Consortium (https://www.med.unc.edu/pgc/download-results/)
GIANT Consortium online portal (http://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files)
DIAGRAM (http://diagram-consortium.org/)
Global Lipids Genetics Consortium (http://csg.sph.umich.edu/willer/public/lipids2013/)
Additional tutorial or guides:
PRS tutorial (https://choishingwan.github.io/PRS-Tutorial/; Choi et al., 2020)
General GWAS and PRS tutorial (https://github.com/MareesAT/GWA_tutorial/)
9- References:
Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, Powell C, Vedantam S, Buchkovich ML, Yang J, Croteau-Chonka DC, Esko T et al. (2015). Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197-206.
Choi, S.W., Mak, T.S. & O’Reilly, P.F. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc 15, 2759–2772 (2020). https://doi-org.myaccess.library.utoronto.ca/10.1038/s41596-020-0353-1