Tutorial for Polygenic Risk Score

1- Overview

This guide walks through an introductory analysis that 1) constructs a polygenic risk score (PRS) from the base data (GWAS summary statistics, particularly effect sizes and P-values, which are generally publicly available) using the clumping and thresholding method (C + T); and 2) tests the constructed PRS for prediction in the target data (genotypes in PLINK binary format, which are generally the user's own data).

2- Learning Objectives

Apply quality control measures to the base and target samples prior to PRS analysis;
Perform a PRS analysis (hands-on);
Understand the graphs and outputs (hands-on).

3- Material

Ideally, a PRS analysis uses genome-wide genotype data. Here, as an example, we use base data containing summary statistics and target data containing genotypes for chromosome 16 only, to demonstrate the workflow for predicting simulated body mass index (BMI). The procedure described below is the same for a genome-wide dataset. All the materials required for this workshop are attached here. The relevant materials are as follows:

plink 1

1kgph3_chr16.bim

1kgph3_dummybmi20200804.csv

range_list.txt

BMI_1kgph3_chr16_snps_summarystat.txt

1kgph3_chr16.bed

1kgph3_chr16.fam

4- Base Data

As base data, we will use part of the GWAS Anthropometric 2015 BMI summary statistics (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4382211/), which were made available by the GIANT consortium and extracted from their online portal (https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files#BMI_and_Height_GIANT_and_UK_BioBank_Meta-analysis_Summary_Statistics).

QC for Base data:

1)- Check SNP heritability: h2SNP > 0.05. It can be estimated from summary statistics with tools such as:

LD Score Regression: https://github.com/bulik/ldsc

SumHer: http://dougspeed.com/snp-heritability/

2)- The effect allele must be known. Both Base and Target datasets should have the same effect allele.

3)- Filter out SNPs with MAF < 0.01 or INFO < 0.8.
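
The MAF/INFO filter in step 3 can be sketched in a few lines of Python. This is only an illustration, not the workshop's own script: it removes a SNP when either criterion fails (the usual reading of this filter), it takes the minor allele frequency as the smaller of the reported allele frequency and its complement, and the "INFO" key is an assumption, since the example file shown in this tutorial has no imputation INFO column. The two "hypothetical_*" records are made up for the demonstration.

```python
# Minimal sketch of base-data QC step 3 (assumptions are noted inline).

def minor_allele_frequency(freq):
    """Convert an allele frequency into a minor allele frequency."""
    return min(freq, 1.0 - freq)


def passes_qc(snp, maf_min=0.01, info_min=0.8):
    """Return True if the SNP passes both the MAF and the INFO threshold."""
    if minor_allele_frequency(snp["Freq1.Hapmap"]) < maf_min:
        return False
    # "INFO" is a hypothetical column name; SNPs without it are not filtered on INFO.
    info = snp.get("INFO")
    if info is not None and info < info_min:
        return False
    return True


snps = [
    # Row taken from the example summary statistics (no INFO column available).
    {"SNP": "rs1000014", "Freq1.Hapmap": 0.8000},
    # Made-up records that illustrate each failure mode.
    {"SNP": "hypothetical_rare", "Freq1.Hapmap": 0.995, "INFO": 0.95},
    {"SNP": "hypothetical_lowinfo", "Freq1.Hapmap": 0.30, "INFO": 0.50},
]

kept = [snp["SNP"] for snp in snps if passes_qc(snp)]
print(kept)
```

With genome-wide data the same test would normally be applied to a whole table at once; the per-record function is kept here only to make the two thresholds explicit.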

# BMI_1kgph3_chr16_snps_summarystat.txt:
 
# With the following columns: “SNP” = marker ID, “A1” = minor allele or effect allele, “A2” = major allele or reference allele, “Freq1.Hapmap” = allele frequency according to Hapmap, “b” = effect estimate for the minor allele, “se” = standard error of the effect estimate, “p” = p-value, “N” = sample size
 
        SNP A1 A2 Freq1.Hapmap       b     se      p      N
1 rs1000014  G  A       0.8000 -0.0055 0.0049 0.2617 233462
2 rs1000047  C  T       0.6917  0.0019 0.0040 0.6348 233959
3 rs1000077  C  G       0.4833  0.0006 0.0038 0.8745 220681
4 rs1000078  A  G       0.7000 -0.0031 0.0041 0.4496 232588
5 rs1000100  A  T       0.4167  0.0006 0.0040 0.8808 233753
6 rs1000174  A  G       0.3417 -0.0025 0.0052 0.6307 150418
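
As a quick sanity check on how the "b", "se" and "p" columns relate, the sketch below recomputes a two-sided p-value from the effect estimate and its standard error. It assumes a Wald test with a standard normal approximation, which the source does not state explicitly; the function name is illustrative. The values are copied from the first two rows above, and small differences from the reported p-values are expected because "b" and "se" are rounded.

```python
import math


def wald_p_value(b, se):
    """Two-sided p-value for z = b / se under a standard normal approximation."""
    z = b / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2.0))


# (SNP, b, se, reported p) copied from the example summary statistics.
rows = [
    ("rs1000014", -0.0055, 0.0049, 0.2617),
    ("rs1000047", 0.0019, 0.0040, 0.6348),
]

for snp, b, se, p_reported in rows:
    z = b / se
    print(snp, round(z, 3), round(wald_p_value(b, se), 4), p_reported)
```

If the recomputed and reported p-values disagree substantially for many SNPs, it is worth checking whether the effect column is really a beta (rather than an odds ratio) and whether the columns were read in the intended order.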

Tip: The base data may come in different formats. For example, if marker IDs (rs IDs) are not available, we may have to derive them based on available chromosome and map positions. The effect estimate may come in the form of an odds ratio (OR), in which case we will calculate beta using the formula beta=ln(OR). It is also ideal to have the allele calls based on the positive strand. Make sure to identify the effect allele.
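
Two points from the tip can be sketched in Python: converting an odds ratio to beta, and keeping the sign of the effect consistent with the effect allele. This is a minimal sketch under stated assumptions, not part of the workshop materials. The function names are illustrative, and the alignment helper assumes both datasets are already on the same strand, so it only handles the case where the target lists the two alleles in swapped order.

```python
import math


def or_to_beta(odds_ratio):
    """Convert an odds ratio to a log-odds effect size: beta = ln(OR)."""
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    return math.log(odds_ratio)


def align_beta(beta, base_a1, base_a2, target_a1, target_a2):
    """Express beta with respect to the target's A1 allele.

    Returns beta unchanged if the alleles match, -beta if they are swapped,
    and None if the allele pairs differ (such SNPs need manual inspection).
    Assumes both datasets report alleles on the same strand.
    """
    if (base_a1, base_a2) == (target_a1, target_a2):
        return beta
    if (base_a1, base_a2) == (target_a2, target_a1):
        return -beta
    return None


print(or_to_beta(1.0))   # an OR of 1 means no effect, so beta is 0.0
print(or_to_beta(1.2))   # OR > 1 gives a positive beta
print(or_to_beta(0.8))   # OR < 1 gives a negative beta

# rs1000014 from the example: effect allele G, other allele A, b = -0.0055.
print(align_beta(-0.0055, "G", "A", "G", "A"))  # same orientation
print(align_beta(-0.0055, "G", "A", "A", "G"))  # swapped, so the sign flips
```

SNPs for which `align_beta` returns None are better excluded or checked by hand than guessed at, since a wrong effect allele reverses the direction of the score contribution.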
