Tutorial for Polygenic Risk Score Analysis

1- Overview

This is a guide for an introductory analysis to 1) construct a polygenic risk score (PRS) using the base data (GWAS summary statistics, particularly with effect-sizes and P-values, generally publicly available) via a clumping and thresholding method (C + T); and 2) test the constructed PRS for prediction using the target data (PLINK binary data files and phenotype csv file). In general, it is the ‘user’ data).

2- Learning Objectives

Apply quality control measures to base/target sample prior to PRS analysis;
Perform PRS analysis (hands-on);
Understand the graphs and outputs (hands-on)

3- Materials

Ideally for PRS analyses, you would be using the genome-wide genotype data. Here we use base data containing summary statistics and target data containing genotypes for chromosome 16 as an example to demonstrate the workflow for the prediction of simulated body mass index (BMI) data. The procedure described below will be the same for the genome-wide dataset. All the materials required for this workshop are attached here. Relevant materials for this workshop are as follows:

plink 1

1kgph3_chr16.bed

kgph3_chr16.bim

1kgph3_chr16.fam

BMI_1kgph3_chr16_snps_summarystat.txt

1kgph3_dummybmi20200804.csv

range_list.txt

PRS_KCNI_2021_new.pptx

PRS_supplementary.pptx

#Additional test files for running PRSice (Section 8):

BMI_1kgph3_chr16_snps_summarystat_2024.txt

1kgph3_dummybmi20200804.txt

NCBI37.3.gene.loc.gtf

3.1- Base Data

We will use as the base data part of GWAS Anthropometric 2015 BMI summary statistics ( https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4382211/), made available by the GIANT consortium and were extracted from their online portal

https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files#BMI_and_Height_GIANT_and_UK_BioBank_Meta-analysis_Summary_Statistics).

3.1.1- QC for Base data:

Check SNPs Heritability: h²_SNP>0.05
- LD Score Regression: https://github.com/bulik/ldsc (Bulik-Sullivan et al, Nature Genetics, 2015)
- SumHer: http://dougspeed.com/snp-heritability/ (Speed et al, Nature Genetics, 2020)
The effect allele must be known. Both Base and Target datasets should have the same effect allele.
Filter out SNPs with MAF < 0.01 and INFO < 0.8. (May also consider filtering for minimum effective sample size NEFF – at least half of full sample, as suggested by Dr. Tian Ge (PRS-CS))

## BMI_1kgph3_chr16_snps_summarystat.txt:
 
# With the following columns: “SNP” = marker ID, “A1” = effect allele (could be minor allele if output from plink), “A2” = reference allele, “Freq1.Hapmap” = allele frequency according to Hapmap, “b” = effect estimate for the minor allele, “se” = standard error of the effect estimate, “p” = p-value, “N” = sample size
 
        SNP A1 A2 Freq1.Hapmap       b     se      p      N
1 rs1000014  G  A       0.8000 -0.0055 0.0049 0.2617 233462
2 rs1000047  C  T       0.6917  0.0019 0.0040 0.6348 233959
3 rs1000077  C  G       0.4833  0.0006 0.0038 0.8745 220681
4 rs1000078  A  G       0.7000 -0.0031 0.0041 0.4496 232588
5 rs1000100  A  T       0.4167  0.0006 0.0040 0.8808 233753
6 rs1000174  A  G       0.3417 -0.0025 0.0052 0.6307 150418

Tip: The base data may come in different formats. For example, if marker IDs (rs IDs) are not available, we may have to derive them based on available chromosome and map positions. The effect estimate may come in the form of an odds ratio (OR), in which case we will calculate beta using the formula beta=ln(OR). It is also ideal to have the allele calls based on the positive strand. Make sure to identify the effect allele.

3.2- Target Data

Our target data for this tutorial consist of genotype data on the subset of European samples for chromosome 16 from the 1000 Genomes datasets in PLINK format. Phenotype consists of simulated BMI data. For demonstration purposes, the BMI in the target dataset was simulated randomly from a normal distribution with no reference to the genotype data.

Tip: Base and target data should share the same ancestry background.

3.2.1- QC for Target data:

Check if your Target data is in the same genome build as Base data.
- LiftOver: https://genome.ucsc.edu/cgi-bin/hgLiftOver
Basic QC: e.g., geno >0.99, mind <0.02, HWE P>1x10-6, 3SD HET, MAF >0.01, INFO >0.8. Also, remove indels and multi-allelic SNPs.
Avoid sample overlap, as well as a high degree of relatedness between individuals of Base and Target data.
Check strand and allele calls

## 1kgph3_chr16.bed:		
# binary plink pedigree genotype file; do not try to open this file
 
## 1kgph3_chr16.bim (as appearing in R):
# with 277,663 biallelic markers (SNPs or Ins/Del markers) – 6 columns without column names: “V1”=chromosome, “V2”=marker ID, “V3”=genetic distance, “V4”=chromosomal position in bp, “V5”=Allele1, “V6”=Allele2
 
  V1          V2 V3    V4 V5       V6
1 16 rs185537431  0 60778  A        G
2 16 rs377548396  0 62569  A        G
3 16 rs368745239  0 66640  T        G
4 16 rs187053456  0 70765  A        C
5 16 rs193118147  0 70767  A        G
6 16 rs201639477  0 75246  C CTTTTTTT
 
## 1kgph3_chr16.fam (as appearing in R):
# with 476 individuals – 6 columns without column names: “V1”=Family ID, “V2”=Individual ID, “V3”=Father ID, “V4”=Mother ID, “V5”=Sex (Male=1, Female=2), “V6”=Affected status (Yes=2, No=1, Unknown=-9)
 
       V1      V2 V3 V4 V5 V6
1 HG00096 HG00096  0  0  1 -9
2 HG00097 HG00097  0  0  2 -9
3 HG00099 HG00099  0  0  2 -9
4 HG00101 HG00101  0  0  1 -9
5 HG00102 HG00102  0  0  2 -9
6 HG00103 HG00103  0  0  1 -9
 
## 1kgph3_dummybmi20200804.csv (as appearing in R):
# with the columns with column names: “V1”=Family ID, “V2”=Individual ID, “V3”=Father ID, “V4”=Mother ID, “V5”=Sex, “V6”=Affective status, “dummybmi”=phenotype of interest, “Sex” (male=1, female=0)
 
V1        V2        V3   V4   V5   V6   dummybmi     sex
HG00096   HG00096   0    0    1    -9   13.42457034    1
HG00097   HG00097   0    0    2    -9   11.29648997    0
HG00099   HG00099   0    0    2    -9   10.98411446    0
HG00101   HG00101   0    0    1    -9   8.268308741    1
HG00102   HG00102   0    0    2    -9   11.1241505     0

3.3- Required Softwares

PLINK 1.9 (http://www.cog-genomics.org/plink2/). Use the stable version (19 Oct 2020 (b6.21)). CAMH SCC: module load bio/PLINK/1.9b_6.15-x86_64
R (version 3.2.3+) (https://cran.r-project.org/) CAMH SCC: module load lang/R/4.0.3-Python-3.8.5-Anaconda3-2020.11

4- Quality Control Checklist

The Base and Target data must be matching for the following items:

SNP IDs (for this workshop, both base and target datasets have rsIDs)

Chromosome and map positions (we can double-check base vs target datasets if that information is available)

Allele 1 and Allele 2 calls

Ambiguous SNPs to be excluded
Allele 1 and allele 2 calls between Base and Target data – 4 scenarios:
- (i) perfect allele matches with same strand
- (ii) perfect allele matches with opposite strands
- (iii) switched allele 1 and allele 2 calls with same strand
- (iv) switched allele 1 and allele 2 calls with opposite strand

For the workshop, we are using simulated data (see above). For this dataset, we will only check strand and allele calls.

Tip: It is important to know the effect allele from the Base (GWAS summary stats) data to avoid misleading conclusions. Both Base and Target datasets must have the same effect allele.

We will remove strand-ambiguous SNPs (i.e. those with complementary alleles, either C/G or A/T SNPs) and duplicate SNPs (indicative of multiallelic markers) in both the Base and Target data. We will also resolve mismatched SNPs names between the two datasets. The unresolved, mismatching SNPs will be removed as well.

### merging base (GWAS summary stat) with target (GWAS) bim file to check strands and allele calls. The expected outputs for "nrow" are shown.

R --vanilla  ## only if you are running R in your computer terminal
 
kg=read.table("1kgph3_chr16.bim",header=F)
bmi=read.table("BMI_1kgph3_chr16_snps_summarystat.txt",header=T)
 
nrow(kg)    ##277,663
head(kg)
 
nrow(bmi)    ##68,385
head(bmi)
 
bmikg=merge(bmi,kg,by.x="SNP",by.y="V2")    # merging step
 
nrow(bmikg)    ##68,385
head(bmikg)

Tip: If the SNPs do not have the same id between the target and base datasets, merge by chromosome, position (and alleles if the files include multi-allelic variants).

### To mark ambiguous G/C or A/T SNPs for removal

## let’s create a column (ambiguousSNPs) flagging the ambiguous SNPs as “1” in our merged file bmikg.

bmikg$ambiguousSNPs[bmikg$A1 == "A" & bmikg$A2 == "T"]<-1
bmikg$ambiguousSNPs[bmikg$A1 == "C" & bmikg$A2 == "G"]<-1
bmikg$ambiguousSNPs[bmikg$A1 == "G" & bmikg$A2 == "C"]<-1
bmikg$ambiguousSNPs[bmikg$A1 == "T" & bmikg$A2 == "A"]<-1
 
head(bmikg) ## ambiguousSNPs column with ambiguous markers marked as “1”, and non-ambiguous markers marked as “NA”

### To exclude ambiguous SNPs

bmikg1=bmikg[is.na(bmikg$ambiguousSNPs),]    ## to keep only non-ambiguous SNPs
 
nrow(bmikg1)    ##55,752 non-unambiguous SNPs remaining

### Scenario (i): To find perfect allele matches between base GWAS summary statistics file and target GWAS genotype data (i.e., A1 (column “A1”) in GWAS summary statistics file is the same as A1 (column “V5”) in target GWAS data, and A2 (column “A2”) in base GWAS summary statistics file is the same as A2 (column “V6”) in target GWAS data) -- beta will be unchanged

bmikg1aa=bmikg1[as.character(bmikg1$A1) == bmikg1$V5 & as.character(bmikg1$A2) == bmikg1$V6,] 
nrow(bmikg1aa)    ##28,842 SNPs in scenario (i)

### Scenario (ii): To find allele matches for flipped strand between base GWAS summary statistics file and target GWAS data -- beta will be unchanged

bmikg1ab=bmikg1[bmikg1$A1 == "A" & bmikg1$A2 == "C" & bmikg1$V5 == "T" & bmikg1$V6 == "G",]
bmikg1ac=bmikg1[bmikg1$A1 == "A" & bmikg1$A2 == "G" & bmikg1$V5 == "T" & bmikg1$V6 == "C",]
bmikg1ad=bmikg1[bmikg1$A1 == "C" & bmikg1$A2 == "A" & bmikg1$V5 == "G" & bmikg1$V6 == "T",]
bmikg1ae=bmikg1[bmikg1$A1 == "C" & bmikg1$A2 == "T" & bmikg1$V5 == "G" & bmikg1$V6 == "A",]
bmikg1af=bmikg1[bmikg1$A1 == "G" & bmikg1$A2 == "A" & bmikg1$V5 == "C" & bmikg1$V6 == "T",]
bmikg1ag=bmikg1[bmikg1$A1 == "G" & bmikg1$A2 == "T" & bmikg1$V5 == "C" & bmikg1$V6 == "A",]
bmikg1ah=bmikg1[bmikg1$A1 == "T" & bmikg1$A2 == "C" & bmikg1$V5 == "A" & bmikg1$V6 == "G",]
bmikg1ai=bmikg1[bmikg1$A1 == "T" & bmikg1$A2 == "G" & bmikg1$V5 == "A" & bmikg1$V6 == "C",]
bmikg1a=rbind(bmikg1aa,bmikg1ab,bmikg1ac,bmikg1ad,bmikg1ae,bmikg1af,bmikg1ag,bmikg1ah,bmikg1ai)    # to combine all the datasets created above + bmikg1aa with non-ambiguous SNPs
 
nrow(bmikg1a)    ##28,854 SNPs in scenarios (i) & (ii)

### To add column “B” for the beta to be used in polygenic risk scoring (note: “b” is the original beta)

### B=b for SNPs with matching alleles above (beta will remain unchanged for Scenarios (i) and (ii))

bmikg1a$B=bmikg1a$b

### Scenario (iii): To find perfect allele switches between A1 and A2 (i.e., A1 in base GWAS summary statistics file is the same as A2 in target GWAS data, and vice versa) -- beta will have opposite sign

bmikg1ba=bmikg1[as.character(bmikg1$A1) == bmikg1$V6 & as.character(bmikg1$A2) == bmikg1$V5,]
 
nrow(bmikg1ba)    ##26,883 SNPs in scenario (iii)

### Scenario (iv): To find flipped strands but perfect allele switches between A1 and A2 (i.e., A1 (column “A1”) in base GWAS summary statistics file is the same as the complementary allele for A2 (column “V6”) in target GWAS data, and vice versa) -- beta will have opposite sign

bmikg1bb=bmikg1[bmikg1$A1 == "A" & bmikg1$A2 == "C" & bmikg1$V5 == "G" & bmikg1$V6 == "T",]
bmikg1bc=bmikg1[bmikg1$A1 == "A" & bmikg1$A2 == "G" & bmikg1$V5 == "C" & bmikg1$V6 == "T",]
bmikg1bd=bmikg1[bmikg1$A1 == "C" & bmikg1$A2 == "A" & bmikg1$V5 == "T" & bmikg1$V6 == "G",]
bmikg1be=bmikg1[bmikg1$A1 == "C" & bmikg1$A2 == "T" & bmikg1$V5 == "A" & bmikg1$V6 == "G",]
bmikg1bf=bmikg1[bmikg1$A1 == "G" & bmikg1$A2 == "A" & bmikg1$V5 == "T" & bmikg1$V6 == "C",]
bmikg1bg=bmikg1[bmikg1$A1 == "G" & bmikg1$A2 == "T" & bmikg1$V5 == "A" & bmikg1$V6 == "C",]
bmikg1bh=bmikg1[bmikg1$A1 == "T" & bmikg1$A2 == "C" & bmikg1$V5 == "G" & bmikg1$V6 == "A",]
bmikg1bi=bmikg1[bmikg1$A1 == "T" & bmikg1$A2 == "G" & bmikg1$V5 == "C" & bmikg1$V6 == "A",]
bmikg1b=rbind(bmikg1ba,bmikg1bb,bmikg1bc,bmikg1bd,bmikg1be,bmikg1bf,bmikg1bg,bmikg1bh,bmikg1bi)
 
nrow(bmikg1b)    ##26,898 SNPs in scenarios (iii) & (iv)

### NOTE: Need to check for allelic mismatches beyond those assessed in the steps above (same strands or flipped strands, same alleles or switched alleles): additional checks would include checking for INDELs (insertions/deletions) and non-matching alleles (not included in this workshop)

### HERE: our data have 0 SNPs with mismatching A1 and A2 between GWAS summary statistics and target GWAS data

### NOTE: there are 0 INDELs with mismatching A1 and A2 between Base GWAS summary stat and target GWAS data

### B=-b for SNPs with switched alleles above (beta will have an opposite sign for Scenarios (iii) and (iv))

bmikg1b$B=0-bmikg1b$b

### same number of non-ambiguous SNPs that we began with.

nrow(bmikg1b)+nrow(bmikg1a)    ##55,752

### ***NOTE: Use column “V5” as Allele1 & column “V6” as Allele2 & column “B” as the effect estimate***

bmikg2=rbind(bmikg1a,bmikg1b)
 
nrow(bmikg2)    #55,752
 
head(bmikg2)
 
## Saving the base-cleaned-alleles file to be used later
write.table(bmikg2,"BMI_unambiguousSNPsposstrand-B_AllelesV5V6_20200226.txt",quote=F,sep='\t',col.names=T,row.names=F)
 
## Setting up and saving the file for clumping in the next step
bmikg2s=subset(bmikg2,select=c("SNP","p"))
colnames(bmikg2s)<-c("SNP","P")
write.table(bmikg2s,"BMI_SNPs_P_forclumping.txt",quote=F,sep='\t',col.names=T,row.names=F)

5- Calculation of PRS via clumping and thresholding

The following procedure is using the cleaned file generated above.

Tip: Effect sizes given as odds ratios (OR) will need to be converted to Beta (B) using the natural logarithm of the OR. In this way, the PRS can be computed using summation. (The effect estimates can be transformed from B back to OR afterward).

5.1- CLUMPING using PLINK

### The most commonly used method for computing PRS is clumping and thresholding (C+T). Before calculating PRS, the variants are first clumped, and variants that are weakly correlated (r2) with one another are retained. The clumping step prunes redundant correlated effects caused by linkage disequilibrium (LD) between variants. Thresholding will remove variants with a p-value larger than a chosen level of significance (default: 0.0001).

It requires 4 parameters: clump-p1=significance threshold of the index SNP, clump-p2=significance threshold of the tag/clumped SNP, clump-r2=LD threshold for clumping, clump-kb=threshold of physical distance for clumping in kb. Here, we are using the following parameters: clump-r2 = 0.1 and clump-kb = 250

plink --bfile 1kgph3_chr16 --clump-p1 1 --clump-p2 1 --clump-r2 0.10 --clump-kb 250 --clump BMI_SNPs_P_forclumping.txt --out 1kgph3_chr16_test_clumped_1
 
## If you are using MAC, use the command below:
 
./plink --bfile 1kgph3_chr16 --clump-p1 1 --clump-p2 1 --clump-r2 0.10 --clump-kb 250 --clump BMI_SNPs_P_forclumping.txt --out 1kgph3_chr16_test_clumped_1

# The command above produces the output ".clumped"

In the commands below, we are combining the base-cleaned-alleles file (BMI_unambiguousSNPsposstrand-B_AllelesV5V6_20200226.txt) with SNPs retained after the clumping step.

## In R:

bmikg2=read.table("BMI_unambiguousSNPsposstrand-B_AllelesV5V6_20200226.txt",header=T)
 
clumped=read.table("1kgph3_chr16_test_clumped_1.clumped",header=T)
nrow(clumped)
 
bmikg3=merge(bmikg2,clumped,by="SNP",sort=F)
 
nrow(bmikg3)    ##4651

5.2- THRESHOLDING using R

### As an example, we will use either (a) p-value threshold of 0.5, or (b) a range of p-value thresholds from base GWAS summary statistics to select SNPs for polygenic risk scoring in the target GWAS dataset

### ***NOTE: Use column “V5” as Allele1 & column “V6” as Allele2 & column “B” as the effect estimate***

### Example (a) p=<0.5

bmikg3a=bmikg3[bmikg3$P<0.5,]
bmikg3a1=subset(bmikg3a,select=c("SNP","V5","B"))
nrow(bmikg3a)    ##3254

# Renaming column headings from "SNP", "V5", and "B" to "SNP", "A1", and "Score"
colnames(bmikg3a1)<-c("SNP","A1","Score")
write.table(bmikg3a1,"1kgph3_chr16_test_clumped_1_threshold_5.raw",col.names=T,row.names=F,quote=F,sep='\t')

### Example (b) using a range of p-value thresholds in PLINK

#sample range list txt: The columns of this file are ‘Name of the threshold’, ‘Lower bound p-value’, and ‘Upper bound p-value’.

# In a text file, write:

0.001 0 0.001
0.05 0 0.05
0.1 0 0.1
0.2 0 0.2
0.3 0 0.3
0.4 0 0.4
0.5 0 0.5
1.0 0 1.0

#save as: range_list.txt

## writing input file for generating polygenic risk scores
 
bmikg3b=subset(bmikg3,select=c("SNP","V5","B"))
colnames(bmikg3b)<-c("SNP","A1","Score")
write.table(bmikg3b,"1kgph3_chr16_test_clumped_1_nothreshold.raw",col.names=T,row.names=F,quote=F,sep='\t')
 
 
## writing file for filtering based on p-values in the range_list.txt file above
 
bmikg3c=subset(bmikg3,select=c("SNP","P"))
write.table(bmikg3c,"1kgph3_chr16_test_clumped_1_nothreshold.snppvalues",col.names=T,row.names=F,quote=F,sep='\t')

5.3- GENERATING THE POLYGENIC RISK SCORES

### Example (a) p-value threshold of 0.5

plink --bfile 1kgph3_chr16 --score 1kgph3_chr16_test_clumped_1_threshold_5.raw --out 1kgph3_chr16_profile_5_20200226
 
 
## If you are using MAC, use the command below:
 
./plink --bfile 1kgph3_chr16 --score 1kgph3_chr16_test_clumped_1_threshold_5.raw --out 1kgph3_chr16_profile_5_20200226

# The command above will generate one output file with extension "*.profile"

## 1kgph3_chr16_profile_5_20200226.profile:
## with columns: FID=Family ID, IID=Individual ID, PHENO=phenotype, CNT=number of SNPs used for scoring, CNT2=number of named (effect) alleles for each subject, SCORE=polygenic risk score for each subject
 
      FID       IID  PHENO    CNT   CNT2        SCORE
  HG00096   HG00096     -9   6508   1107 -2.34634e-05
  HG00097   HG00097     -9   6508   1193  0.000309573
  HG00099   HG00099     -9   6508   1156  3.56331e-05
  HG00101   HG00101     -9   6508   1167 -4.76951e-05
  HG00102   HG00102     -9   6508   1157 -3.64935e-05

### Example (b) using a range of p-value thresholds

plink --bfile 1kgph3_chr16 --score 1kgph3_chr16_test_clumped_1_nothreshold.raw --q-score-range range_list.txt 1kgph3_chr16_test_clumped_1_nothreshold.snppvalues --out 1kgph3_chr16_profile_20200226
 
 
## If you are using MAC, use the command below:
 
./plink --bfile 1kgph3_chr16 --score 1kgph3_chr16_test_clumped_1_nothreshold.raw --q-score-range range_list.txt 1kgph3_chr16_test_clumped_1_nothreshold.snppvalues --out 1kgph3_chr16_profile_20200226

# The command above will generate eight output files with extension "*.profile"

## For example, 1kgph3_chr16_profile_20200226.0.001.profile:
## with columns: FID=Family ID, IID=Individual ID, PHENO=phenotype, CNT=number of SNPs used for scoring, CNT2=number of named (effect) alleles for each subject, SCORE=polygenic risk score for each subject
 
      FID       IID  PHENO    CNT   CNT2        SCORE
  HG00096   HG00096     -9    148     45 -0.000618919
  HG00097   HG00097     -9    148     40   0.00219797
  HG00099   HG00099     -9    148     44   0.00117297
  HG00101   HG00101     -9    148     47  -0.00123784
  HG00102   HG00102     -9    148     46  -0.00163041

6- Applications of PRS scores

6.1- REGRESSION ANALYSIS

## In R:

# Import the phenotype file

phen=read.csv("1kgph3_dummybmi20200804.csv", header=T)
head(phen)

# sex: male = 1, female =0

# import the PRS file output from plink

prs1=read.table("1kgph3_chr16_profile_5_20200226.profile",header=T)    # this is output from Example (a) polygenic risk scoring with p-value threshold of 0.5
 
head(prs1)
nrow(prs1)    ##476

# merge phenotype (dummybmi) and PRS files

prs1phen=merge(prs1,phen,by.x="FID",by.y="V1")
write.csv(prs1phen,"test_clump_1_prs_5.csv",row.names=F)

# Regression analysis with quantitative outcome phenotype

prs1phen=read.csv("test_clump_1_prs_5.csv",header=T)
head(prs1phen)
 
lmprs<-lm(data=prs1phen, dummybmi~SCORE + sex) #adding sex as a covariate here; consider adding population structure PCs as covariates
summary(lmprs)

6.2- PLOTTING

## Histogram showing distribution of polygenic risk scores (shown for p-value threshold of 0.5 from Example (a))
pdf("Histogram_testscore_pT_5.pdf")
hist(prs1phen$SCORE, breaks=100, main = paste("Distribution of PRS scores (n =", length(prs1phen$SCORE), ")"))
dev.off()
 
## Scatterplot of BMI vs PRS (shown for p-value threshold of 0.5 from Example (a))
pdf("Scatterplot_dummybmi_vs_PRS_pT_5.pdf")
plot(prs1phen$SCORE,prs1phen$dummybmi)
abline(lm(prs1phen$dummybmi ~ prs1phen$SCORE), col = "red",
lty = 1, lwd = 1)
dev.off()
 
 
# Plots of prediction R-squared values vs p-value thresholds (for Example (b))
prs_files  <- list.files(pattern = "1kgph3_chr16_profile_20200226.*.*.profile$");
prs_list <- lapply(prs_files, function(rr) read.table(rr, head=T))
r_squared<-NA
for (j in 1:length(prs_list)){
prs1phen=merge(prs_list[[j]] ,phen, by.x="FID", by.y="V1")
lmprs<-lm(data= prs1phen, dummybmi~SCORE + sex)
r_squared[j]<-summary(lmprs)$r.squared
}
 
p_threshold <- as.numeric(sapply(sapply(prs_files, function(rr) strsplit(rr, "_20200226*\\.")[[1]][[2]]), function(ee) strsplit(ee, ".profile")[[1]]))
 
## Lineplot of prediction R-squared values vs p-value thresholds
pdf("Lineplot_R-squared_vs_p-value_threshold.pdf")
plot(p_threshold, r_squared, xlab = "P-value threshold", ylab="Prediction R-squared")
lines(p_threshold, r_squared);
abline(v=p_threshold[which.max(r_squared)], col=2)
dev.off()
 
 
## Barplot of prediction R-squared values vs p-value thresholds -- colour gradient from blue for lowest R-squared value to red for highest R-squared value; legend showing R-squared values in the order of the p-value thresholds
rankfac <- rank(r_squared)
rbPal <- colorRampPalette(c('blue','red'))
Col <- rbPal(length(rankfac))[as.numeric(cut(rankfac,breaks = length(rankfac)))]
pdf("Barplot_R-squared_vs_p-value_threshold_bluetored.pdf")
barplot(r_squared, names.arg=p_threshold, xlab = "P-value threshold", ylab="Prediction R-squared",col=Col, legend=round(r_squared, digits=4), args.legend=list(title="R-squared"), xlim=c(0,12))
dev.off()

7- Additional resources for PRS analysis

Software for alternative PRS methods:

PRSice (https://www.prsice.info/)
LDhub (http://ldsc.broadinstitute.org/ldhub/)
LDPred (https://github.com/bvilhjal/ldpred)
PRScs (https://github.com/getian107/PRScs)

Base data:

GWAS catalog (https://www.ebi.ac.uk/gwas/downloads/summary-statistics)
Polygenic Score Catalog (https://pgscatalog.org)
Psychiatric Genomics Consortium (https://www.med.unc.edu/pgc/download-results/)
GIANT Consortium online portal (http://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files)
DIAGRAM (http://diagram-consortium.org/)
Global Lipids Genetics Consortium (http://csg.sph.umich.edu/willer/public/lipids2013/)
CARDIoGRAMplusC4D Consortium (http://www.cardiogramplusc4d.org/data-downloads/)

Additional tutorials or guides:

PRS tutorial (https://choishingwan.github.io/PRS-Tutorial/; Choi et al., 2020)
General GWAS and PRS tutorial (https://github.com/MareesAT/GWA_tutorial/)

Some additional codes to try (need to download and install required items following instructions):

8- Additional Commands (under development)

Running PRSice2

PRSice

#### PRSice2 (adapted from https://choishingwan.github.io/PRSice/) ####
##Some quality control steps before using summary statistics file in PRSice2
## *** make sure to review any documents that come with the summary statistics to understand how the summary statistics were derived and what the column names stand for ***
# In R:
ss=read.table("BMI_1kgph3_chr16_snps_summarystat_2024.txt",header=T)
nrow(ss) #[1] 68385

head(ss)

summary(ss$INFO) #imputation quality

summary(ss$N) #N for each SNP in source GWAS

summary(ss$Freq1.Hapmap) #minor allele frequency for each SNP

nrow(ss[ss$INFO<0.8,]) #[1] 199

nrow(ss[ss$N<0.5*max(ss$N),]) #[1] 2629

nrow(ss[ss$Freq1.Hapmap<=0.01 | is.na(ss$Freq1.Hapmap),]) #[1] 1590

#Some quality control for imputation quality, N, and minor allele frequency:
ss1=ss[ss$INFO>0.8 & ss$N>0.5*max(ss$N) & ss$Freq1.Hapmap>0.01 & !is.na(ss$Freq1.Hapmap),]

nrow(ss1) #[1] 64933

colnames(ss1)<-c("SNP","A1","A2","Freq1.Hapmap","BETA","SE","P","N","INFO")

head(ss1)

write.table(ss1,"BMI_1kgph3_chr16_snps_summarystat_2024_short.txt", col.names=T, row.names=F, quote=F, sep='\t')

##Now we are ready to try PRSice2!
##Need to set up the following modules in R for PRSice2 to run properly:
module load lang/R/3.5.1-Python-3.8.5-Anaconda3-2020.11

R --vanilla
#if R packages not already installed:
install.packages("ggplot2")
install.packages("optparse") #may need to install this package manually by downloading from https://cran.r-project.org/web/packages/optparse/index.html; install.packages(c("optparse"))
install.packages("methods")
install.packages("tools")
install.packages("data.table")
install.packages("grDevices")
install.packages("RColorBrewer")

#once R packages are installed:
library(ggplot2)
library(optparse)
library(methods)
library(tools)
library(data.table)
library(grDevices)
library(RColorBrewer)
#if using Linux, exit R (if using Windows: stay in R):
q() #to exit R

## Linux -- Using our tutorial files: 
#may need to make PRSice_linux executable
chmod u+x ./PRSice_linux
#
Rscript PRSice.R --prsice ./PRSice_linux --base BMI_1kgph3_chr16_snps_summarystat_2024_short.txt --target 1kgph3_chr16 --pheno 1kgph3_dummybmi20200804.txt --pheno-col dummybmi --thread 1 --stat BETA --pvalue P --binary-target F --out dummybmi_nocov2024

## Windows (in R) -- Using our tutorial files:
system("Rscript.exe PRSice.R --prsice ./PRSice_win64.exe --base BMI_1kgph3_chr16_snps_summarystat_2024_short.txt --target 1kgph3_chr16 --pheno 1kgph3_dummybmi20200804.txt --pheno-col dummybmi --thread 1 --stat BETA --pvalue P --binary-target F --out dummybmi_nocov2024")

## Linux -- Adding a covariate:
Rscript PRSice.R --prsice ./PRSice_linux --base BMI_1kgph3_chr16_snps_summarystat_2024_short.txt --target 1kgph3_chr16 --pheno 1kgph3_dummybmi20200804.txt --pheno-col dummybmi --cov 1kgph3_dummybmi20200804.txt --cov-col sex --thread 1 --stat BETA --pvalue P --binary-target F --out dummybmi_sexcov2024

## Windows (in R) -- Adding a covariate:
system("Rscript.exe PRSice.R --prsice ./PRSice_win64.exe --base BMI_1kgph3_chr16_snps_summarystat_2024_short.txt --target 1kgph3_chr16 --pheno 1kgph3_dummybmi20200804.txt --pheno-col dummybmi --cov 1kgph3_dummybmi20200804.txt --cov-col sex --thread 1 --stat BETA --pvalue P --binary-target F --out dummybmi_sexcov2024")

## Check the log files to see what went on behind the scene.
# How many SNPs went into the PRS calculations? #Hint: check the *.prsice file

## Now we check the output file with PRS (in R):
prs=read.table("dummybmi_sexcov2024.best",header=T)
pheno=read.table("1kgph3_dummybmi20200804.txt",header=T)

nrow(prs) #[1] 476

nrow(pheno) #[1] 476

head(prs)

head(pheno)

data=merge(pheno,prs,by.x=c("V1","V2"),by.y=c("FID","IID"))

nrow(data) #[1] 476 

head(data)

lmprs<-lm(data=data, dummybmi~PRS + sex)
summary(lmprs) #notice the magnitude of the effect estimate

mean(data$PRS)

sd(data$PRS)

data$stdprs<-scale(data$PRS) #standardizing the PRS to a mean of 0 and standard deviation of 1

mean(data$stdprs)

sd(data$stdprs)

lmprs<-lm(data=data, dummybmi~stdprs + sex)
summary(lmprs) #notice how the magnitude of the effect estimate has changed and everything else stayed the same

## Scatterplot of BMI vs best PRS
pdf("Scatterplot_dummybmi_vs_PRSice2_best_stdprs.pdf")
plot(data$stdprs,data$dummybmi)
abline(lm(data$dummybmi ~ data$stdprs), col = "red",
lty = 1, lwd = 1)
dev.off()
 

#######################################################################
## Additional Notes/Examples:
# add "--perm 10000": *.summary file with additional column showing empirical p value
# "--all-score": outputs PRS for all thresholds tested -- beware of large output file
# "--quantile 10": generates plot and file for PRS quantiles (deciles in this case)
# add "--gtf gene.gtf --msigdb set.txt --multi-plot 10" for PRSet --> outputs plot for top 10 gene sets (*.gtf file with genome boundary of each gene from MAGMA; set.txt with gene IDs in each gene set from MSigDB -- refer to https://www.gsea-msigdb.org/gsea/msigdb/index.jsp for gene set definitions
#
##This is sample script (***Need to change the actual file names***):
#Rscript workdir/PRSice.R --dir workdir --prsice ./workdir/PRSice_linux --base workdir/source_gwas_sumstat_info_9_prsicein.txt --target indir/target_gwas_cleaned_plink --bar-levels 5e-8,0.00001,0.00005,0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.2,0.3,0.4,0.5,1 --seed 1234 --perm 10000 --fastscore --all-score --no-regress T --out outdir/target_gwas_cleaned_source_gwas_prs
#
#system("Rscript.exe PRSice.R --prsice ./PRSice_win64.exe --base BMI_1kgph3_chr16_snps_summarystat_2024_short.txt --target 1kgph3_chr16 --pheno 1kgph3_dummybmi20200804.txt --pheno-col dummybmi --cov 1kgph3_dummybmi20200804.txt --cov-col sex --bar-levels 0.001,0.005,0.01,0.05,0.1,0.2,0.3,0.4,0.5,1 --fastscore --thread 1 --stat BETA --pvalue P --binary-target F --out dummybmi_sexcov2024setpvalthreshs")
#system("Rscript.exe PRSice.R --prsice ./PRSice_win64.exe --base BMI_1kgph3_chr16_snps_summarystat_2024_short.txt --target 1kgph3_chr16 --pheno 1kgph3_dummybmi20200804.txt --pheno-col dummybmi --cov 1kgph3_dummybmi20200804.txt --cov-col sex --thread 1 --stat BETA --pvalue P --binary-target F --gtf NCBI37.3.gene.loc.gtf --msigdb c2.all.v2024.1.Hs.entrez.gmt --multi-plot 10 --out dummybmi_sexcov2024")

Running PRS-CS

PRS-CS-auto

## PRS-CS
## setting up (https://github.com/getian107/PRScs):
git clone https://github.com/getian107/PRScs.git
wget https://www.dropbox.com/s/mt6var0z96vb6fv/ldblk_1kg_eur.tar.gz?dl=0
tar -zxvf ldblk_1kg_eur.tar.gz
#downloading and installing h5py and scipy:
pip install scipy h5py
#
#need effective sample size of source GWAS (neff =2*N_cases*N_controls/(N_cases+N_controls))
#source_gwas_sumstat_prscsin.txt file should have gone through QC with target data above and with columns "SNP", "A1", "A2", "OR", "P" -- (A1 from target ie V5; A2 from target ie V6; OR from B via OR=exp(B))
python workdir/PRScs/PRScs.py --ref_dir=workdir/ldblk_1kg_eur --bim_prefix=indir/target_gwas_cleaned_plink --sst_file=workdir/source_gwas_sumstat_prscsin.txt --n_gwas=[neff] --seed=1234 --out_dir=outdir/target_gwas_source_gwas_sumstat_prscsout
# Can add --chrom=21,22 to run PRS-CS only on chr 21 and 22
# Merge sumstat outputs from PRS-CS chr1-22 in R:
R
mci<-c()
mcall<-c()
for (i in 1:22) {
	mci<-read.table(paste("outdir//target_gwas_source_gwas_sumstat_prscsout_pst_eff_a1_b0.5_phiauto_chr",i,".txt",sep=""),header=F)
	mcall<-rbind(mcall,mci)
}
colnames(mcall)<-c("CHR","SNP","BP","A1","A2","B")
mcalla=subset(mcall,select=c("SNP","A1","B"))
colnames(mcalla)<-c("SNP","A1","Score")
write.table(mcalla,"outdir//target_gwas_source_gwas_sumstat_prscsout_pst_eff_a1_b0.5_phiauto_chr1-22.raw", row.names=F, col.names=T, quote=F, sep='\t')
# Run polygenic scoring in PLINK:
plink --bfile indir/target_gwas_cleaned_plink --score outdir/target_gwas_source_gwas_sumstat_prscsout_pst_eff_a1_b0.5_phiauto_chr1-22.raw --out outdir/target_gwas_source_gwas_sumstat_prscsout

9- References

Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, Yang J; Schizophrenia Working Group of the Psychiatric Genomics Consortium, Patterson N, Daly MJ, Price AL, Neale BM. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics, 2015; 47:291–295. PMID: 25642630 PMCID: PMC4495769 DOI: 10.1038/ng.3211
Choi, S.W., Mak, T.S. & O’Reilly, P.F. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc. 2020; 15: 2759–2772. https://doi-org.myaccess.library.utoronto.ca/10.1038/s41596-020-0353-1
Ge T, Chen C-Y, Ni Y, Feng Y-CA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun. 2019 Apr 16;10(1): 1776. PMID: 30992449 PMCID: PMC6467998 DOI:10.1038/s41467-019-09718-5
Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, Powell C, Vedantam S, Buchkovich ML, Yang J, Croteau-Chonka DC, Esko T et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015; 518:197-206. PMID: 25673413 PMCID: PMC4382211 DOI: 10.1038/nature14177
Ni G, Zeng J, Revez JA, Wang Y, Zheng Z, Ge T, Restuadi R, Kiewa J, Nyholt DR, Coleman JRI, Smoller JW, Schizophrenia Working Group of the Psychiatric Genomics Consortium, Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium, Yang J, Visscher PM, Wray NR. A comparison of ten polygenic score methods for psychiatric disorders applied across multiiple cohorts. Biol Psychiatry. 2021 Nov 1; 90(9): 611–620. PMID: 34304866 URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8500913/
Pain O, Glanville KP, Hagenaars SP, Selzam S, Fürtjes AE, Gaspar HA, Coleman JRI, Rimfeld K, Breen G, Plomin R, Folkersen L, Lewis CM. Evaluation of polygenic prediction methodology within a reference-standardized framework. PLoS Genet. 2021 May 4;17(5):e1009021. doi: 10.1371/journal.pgen.1009021. eCollection 2021 May. PMID: 33945532 URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8121285/
Speed D, Holmes J, Balding DJ. Evaluating and improving heritability models using summary statistics. Nature Genetics, 2020; 52: 458–462. PMID: 32203469 DOI: 10.1038/s41588-020-0600-y

9- Acknowledgements

McLaughlin PRS team: e.g., Dr. Wei Deng, Dr. Delnaz Roshandel

Page tree