Created by Genetics Working Group CAMH, last modified on October 25, 2024

Overview

This is a guide to genome-wide data cleaning and analysis in PLINK.

Learning Objectives

(1) Apply quality control measures for genome-wide genotyping data sets;

(2) Perform population stratification analysis;

(3) Perform basic association analysis;

(4) Become familiar with the PLINK and R environments.

Materials

  1. highLDregions4bim_b37_24JoK.txt - a file specifying extended genomic regions of high LD
  2. Raw_DATA.ped/map - plink files to work on; the map is in build 36, so it needs to be updated to build 37
  3. 1KGP_Phase3_Pruned_forMDS_2019.bed/bim/fam - 1000 Genomes Phase 3 data in build 37 (AFR, EUR, EAS founders)
  4. 1KGPh3_b37fwd2019.bim - 1000 Genomes Phase 3 binary map file in build 37 (for updating the map from b36 to b37)
  5. manhattan_v1.R, qq_plot_v5.R - scripts for plotting GWAS findings
  6. Files for plotting population structure

## All material is located in the following folder:

https://support.scinet.utoronto.ca/education/go.php/542/file_storage/index.php

## If the link above does not work, please download the following zipped folder:

GWAS_QC_workshop_files.zip

# ***Powerpoint slides for this workshop: CAMH_GWAS_slides_20231128.pdf


Please download all files and create a folder on your PC or on the cluster to run the analysis. Remember: the plink executable must be stored in the same folder if you are running on your PC.

Resources

  1. PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/)
    1. PLINK 1.9 (https://www.cog-genomics.org/plink/)
    2. Need to load plink module if using CAMH Specialized Computing Cluster: e.g., module load bio/PLINK/1.9b_6.15-x86_64
  2. R (http://cran.r-project.org/)
    1. Need to load R module if using CAMH Specialized Computing Cluster: e.g., module load lang/R/4.0.3-Python-3.8.5-Anaconda3-2020.11
  3. Unix-based command terminal
  4. A text editor

QUALITY CONTROL & ANALYSIS OF GWAS DATA

## Convert Plink pedigree files to binary format to save space

Generate binary files and update map file
plink --file Raw_DATA --make-bed --out Raw_DATA_b36


## Visualize the 1000 Genomes .bim file using the command ‘head’, then use ‘awk’ to create a new file containing column #2 (variant identifier) and column #4 (base-pair coordinate) of the .bim file, to be used as the build 37 SNP map, plus an additional file with column #2 (variant identifier) and column #1 (chromosome), for the next step below

Visualize the 1K genome and create map
#Linux (and MAC):
head 1KGPh3_b37fwd2019.bim
awk '{print $2 , $1}' 1KGPh3_b37fwd2019.bim > 1KGPh3_b37fwd2019_snps_b37chr.txt
head 1KGPh3_b37fwd2019_snps_b37chr.txt
awk '{print $2 , $4}' 1KGPh3_b37fwd2019.bim > 1KGPh3_b37fwd2019_snps_b37map.txt
head 1KGPh3_b37fwd2019_snps_b37map.txt

#In R (Windows):
# data.table is optional; its fread() is much faster than read.table() for large files
install.packages("data.table")
library(data.table)
a = read.table("1KGPh3_b37fwd2019.bim", header = F)   # or: a = as.data.frame(fread("1KGPh3_b37fwd2019.bim", header = FALSE))

b = subset(a, select=c("V2", "V1"))   # variant ID, chromosome
c = subset(a, select=c("V2", "V4"))   # variant ID, base-pair position
write.table(b, "1KGPh3_b37fwd2019_snps_b37chr.txt", col.names=F,row.names=F,quote=F,sep='\t')
write.table(c, "1KGPh3_b37fwd2019_snps_b37map.txt", col.names=F,row.names=F,quote=F,sep='\t')

#alternatively
#write.table(a[,c(2,1)], "1KGPh3_b37fwd2019_snps_b37chr.txt", col.names=F,row.names=F,quote=F,sep='\t')
#write.table(a[,c(2,4)], "1KGPh3_b37fwd2019_snps_b37map.txt", col.names=F,row.names=F,quote=F,sep='\t')
 


## As the target dataset map is in build 36, use 1KGPh3_b37fwd2019_snps_b37map.txt to keep only variants present in the 1000 Genomes dataset, then update the map of the target dataset to build 37 using the Plink flags --extract and --update-map (***--extract and --update-map must go in separate plink commands***)

Update-map file
plink --bfile Raw_DATA_b36 --extract 1KGPh3_b37fwd2019_snps_b37map.txt --make-bed --out sample_b36_s
plink --bfile sample_b36_s --update-map 1KGPh3_b37fwd2019_snps_b37chr.txt  --update-chr --make-bed --out sample_b37a
plink --bfile sample_b37a --update-map 1KGPh3_b37fwd2019_snps_b37map.txt --make-bed --out sample_b37  

## Your map file is now in build 37

## It is recommended that the SNP genotype calls in the dataset be aligned to the plus (forward) strand. This prevents problems later, e.g., when merging or imputing data; a sketch of a strand-flip step is shown below.
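
## A minimal sketch of such a strand-flip step, assuming a list of variant IDs currently on the minus strand has already been produced (e.g., by comparing alleles against a plus-strand reference with an external checking tool); the filename snps_to_flip.txt is hypothetical:

Align genotype calls to the plus strand (optional sketch)
# --flip swaps the listed A/C/G/T alleles to the opposite strand
plink --bfile sample_b37 --flip snps_to_flip.txt --make-bed --out sample_b37_fwd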


Quality Control Steps


QUALITY CONTROL FOR INDIVIDUALS

Identify and list individuals with discordant sex information (requires X chromosome information)

Check discordant sex information
plink --bfile sample_b37 --check-sex --out sample_sex


## Interpretation:

# The output is the file "sample_sex.sexcheck" with the following columns:

# FID IID       PEDSEX       SNPSEX       STATUS           F

# where “STATUS” is either “OK” or “PROBLEM”

## Use ‘grep’ in the terminal to select rows containing the pattern ‘PROBLEM’ and write them to a text file listing all individuals that failed the sex check and should be removed from the dataset

  

Select the discordant individuals using grep
#Linux (and MAC):
grep "PROBLEM" sample_sex.sexcheck > fail-sexcheck-qc.txt

#In R (Windows):
sp=read.table("sample_sex.sexcheck",header=T)
write.table(sp[sp$STATUS=="PROBLEM",],"fail-sexcheck-qc.txt", col.names=F,row.names=F,quote=F,sep='\t')

## It is also possible to impute sex using the X-chromosome genotypes. If you choose to impute rather than remove the individuals that failed the sex check, run the following command: # plink --bfile sample_b37 --impute-sex --make-bed --out sample_b37_sexqc

Identify individuals with excessive missing data rates and determine threshold

Identify individuals with excessive missing data
plink --bfile sample_b37 --missing --out check_individuals

## Interpretation:

# Two output files are generated: "check_individuals.imiss" (per individual) and "check_individuals.lmiss" (per locus).

# The .imiss header: FID IID MISS_PHENO   N_MISS   N_GENO   F_MISS

# where N_MISS is the number and F_MISS is the proportion of missing SNPs per individual

## Plot individual missingness data in R

Select individuals with high rate of missing data
IMISS<-read.table("check_individuals.imiss", header=T, as.is=T)
pdf("individualcallrate.pdf")
plot( (1:dim(IMISS)[1])/(dim(IMISS)[1]-1), sort(1-IMISS$F_MISS), main="Individual call rate cumulative distribution", xlab="Quantile", ylab="Call Rate" ); grid() 
dev.off()
q()

## The proportion of SNPs successfully genotyped per individual and the proportion of individuals successfully genotyped per SNP are called the individual call rate (CR) and the SNP CR, respectively. Usually, only SNPs and individuals with a CR of 95% or higher are kept in the sample. Select the threshold suitable for your dataset and the analyses you intend to run.

## For the purpose of this tutorial, let’s remove individuals with a missing rate > 0.05 (CR < 95%) using --mind 0.05 in Plink


Remove individuals with CR < 95%
plink --bfile sample_b37 --mind 0.05 --make-bed --out sample-remove-missing


Identify individuals with outlying heterozygosity rates and determine threshold


Identify individuals with heterozygosity rates
plink --bfile sample-remove-missing  --het --out het

## Interpretation:
# which produces the file het.het with the following columns
# file.het header: FID  IID       O(HOM)       E(HOM)        N(NM)            F
# O(HOM) = observed number of homozygous genotypes; N(NM) = number of non-missing genotypes per individual

# mean heterozygosity rate per individual = (N(NM) - O(HOM)) / N(NM)


## Plot heterozygosity data in R

Plot heterozygosity in R
het<-read.table("het.het", h=T)
het$meanHet <- (het$N.NM. - het$O.HOM.)/het$N.NM. 
pdf("het-histogram.pdf")
hist(het$meanHet, col="green", xlab="Heterozygosity rate", ylab="Frequency", main="Distribution of heterozygosity per individual")
linea<- mean(het$meanHet) + 2*sd(het$meanHet)
lineb<- mean(het$meanHet) - 2*sd(het$meanHet)
abline(v=linea, col="red") #red lines on the graph represent +/- 2 standard deviations from the mean
abline(v=lineb, col="red")
linea<- mean(het$meanHet) + 3*sd(het$meanHet)
lineb<- mean(het$meanHet) - 3*sd(het$meanHet)
abline(v=linea, col="blue") #blue lines on the graph represent +/-3 standard deviations from the mean
abline(v=lineb, col="blue")
dev.off()

## Make a list of heterozygosity outliers; for example, exclude individuals more than +/- 3 SD from the mean
errora<-subset(het,het$meanHet>mean(het$meanHet) + 3*sd(het$meanHet))
errorb<-subset(het,het$meanHet<mean(het$meanHet) - 3*sd(het$meanHet))
errorall<-rbind(errora[1:2],errorb[1:2])
write.table(errorall, "fail-het-qc.txt", col.names=F,row.names=F,quote=F,sep='\t')

Identify duplicated or highly related individuals

## (i.e., PI_HAT > 0.185) (Figure "IBSplot.pdf")
## For the analysis of duplicate and related individuals we will generate a series of temporary files.
## We will go back to "sample-remove-missing" once we have a list of unrelated, unique individuals.

## Exclude SNPs in high LD, as well as SNPs on chromosomes 23, 24, 25, and 26

## Visualize the last lines of the .bim file.

Visualize the last lines of the .bim file
#Linux:
tail sample-remove-missing.bim

#In R (Windows):
bimm=read.table("sample-remove-missing.bim",header=F)
tail(bimm)

## Use ‘awk’ to make a text file with SNPs mapped on chromosomes 1 to 22:

Make a text file with SNPs mapped on chromosomes 1 to 22
#Linux:
awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' sample-remove-missing.bim > snp.txt

#In R (Windows):
bimm=read.table("sample-remove-missing.bim",header=F)
bimm1=bimm[bimm$V1<23,]
bimm2=subset(bimm1,select=c(2))
write.table(bimm2,"snp.txt", col.names=F,row.names=F,quote=F,sep='\t')

#write.table(bimm1[2],"snp.txt", col.names=F,row.names=F,quote=F,sep='\t')

## Use ‘--extract snp.txt’ to make new binary files containing only autosomal SNPs

Make new binary files containing only autosomal SNPs
plink --bfile sample-remove-missing --extract snp.txt --make-bed --out sample-remove-missing-chr

## Use ‘tail’ to confirm non-autosomal SNPs were removed

Confirm non-autosomal SNPs were removed
#Linux:
tail sample-remove-missing-chr.bim

#In R (Windows):
bima=read.table("sample-remove-missing-chr.bim",header=F)
tail(bima)

## The file highLDregions4bim_b37_24JoK.txt contains the coordinate ranges of high-LD regions of the genome. These regions must be excluded before the pruning step. Use ‘--exclude highLDregions4bim_b37_24JoK.txt --range’ to remove SNPs in these regions during pruning, and use ‘--indep-pairwise 50 5 0.2’ to prune the target dataset. The parameters ‘50 5 0.2’ are, respectively: the window size (in SNPs), the number of SNPs by which the window is shifted at each step, and the maximum multiple correlation coefficient for a SNP regressed on all other SNPs in the window simultaneously.

Pruning
plink --bfile sample-remove-missing-chr --exclude highLDregions4bim_b37_24JoK.txt --range --indep-pairwise 50 5 0.2 --out prune

## Interpretation:
## This produces two files, prune.prune.in and prune.prune.out. The prune.prune.in file contains only independent SNPs, which can be used to calculate relatedness in the target dataset.
## Generate pairwise IBS for all individuals
## Use ‘--extract prune.prune.in’ and ‘--genome’ to calculate IBS using only independent SNPs.

Extract pruned SNPs
plink --bfile sample-remove-missing-chr --extract prune.prune.in --genome --out IBS

## Interpretation:
## IBS.genome file header: FID1     IID1   FID2     IID2 RT    EZ      Z0      Z1      Z2  PI_HAT PHE       DST     PPC   RATIO
## PI_HAT= proportion (IBD) per pair of individuals (P(IBD=2)+0.5*P(IBD=1))
## Typically, one individual of each pair with an IBD above 0.1875 (i.e., halfway between the expected IBD for second-degree (0.25) and third-degree (0.125) relatives) is removed; for reference, duplicates/monozygotic twins have an expected PI_HAT of 1 and first-degree relatives of 0.5


## Plot relatedness data in R

IBS plot
IBS<-read.table("IBS.genome", h=T)
pdf("IBSplot.pdf") 
hist(IBS$PI_HAT, col="green", breaks=100,  xlab= "Estimated mean pairwise IBD", main="Pairwise Identity-by-descent")
abline(v=c(0.185)) #related individuals are assumed to have a pairwise IBD of over 0.185
dev.off()
q()

## Use the following commands to create a file with the pairs of related subjects (PI_HAT > 0.2)

Create file with related pairs
#Linux:
awk '{ if ($10 >0.2) print $0 }' IBS.genome > pihat_min0.2.txt

#In R (Windows):
ibs=read.table("IBS.genome",header=T)
write.table(ibs[ibs$PI_HAT>0.2,],"pihat_min0.2.txt", col.names=T,row.names=F,quote=F,sep='\t')

## For each pair of 'related' individuals with a PI_HAT > 0.2, we recommend removing the individual with the lowest call rate, but you may also choose to remove individuals with poor-quality phenotype data, depending on the dataset and study design. Write the individuals to be removed to a text file, such as fail-pihat-qc.txt

## If you choose to filter by CR, you can use a Unix text editor (e.g., vi(m)) to check which individual of each 'related pair' has the lowest CR and add it to the list of individuals to be removed. Use the ‘--missing’ command to check CR, as sketched below.
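
## A small sketch of how this check could be done from the terminal; ID_A and ID_B stand for the two members of one related pair and are hypothetical:

Check call rates for a related pair (sketch)
# F_MISS (last column) is the missing rate; the member with the higher F_MISS has the lower call rate
plink --bfile sample-remove-missing-chr --missing --out relatedness_CR
grep -w -e "ID_A" -e "ID_B" relatedness_CR.imiss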

Identify individuals of divergent ancestry

## Extract independent SNPs from target dataset by using ‘--extract’ and the file prune.prune.in generated above.

Extract independent SNPs
plink --bfile sample-remove-missing-chr --extract prune.prune.in --make-bed --out sample-remove-missing-chr-pruned

## Note that the dataset is now smaller, which speeds up the population stratification cleaning

## Preparing 1000G (keep only SNPs in common with the pruned target dataset); an optional sketch of this step is shown below

## Merge the study sample with 1000G and keep only SNPs with a missing rate < 0.03 (CR > 97%)
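
## An optional sketch of the preparation step; the filenames target_pruned_snps.txt and 1000G_common are hypothetical. The merge command below works directly from the full reference instead, relying on --geno 0.03 to drop SNPs that are not shared:

Restrict the 1000G reference to SNPs in the pruned target (optional sketch)
awk '{print $2}' sample-remove-missing-chr-pruned.bim > target_pruned_snps.txt
plink --bfile 1KGP_Phase3_Pruned_forMDS_2019 --extract target_pruned_snps.txt --make-bed --out 1000G_common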

Merge sample with 1kgenomes
plink --bfile sample-remove-missing-chr-pruned --bmerge 1KGP_Phase3_Pruned_forMDS_2019.bed 1KGP_Phase3_Pruned_forMDS_2019.bim 1KGP_Phase3_Pruned_forMDS_2019.fam --geno 0.03 --make-bed --out sample_1000G_merged_tmp

plink --bfile sample_1000G_merged_tmp --genome --out sample_1000G_merged_ibd

## Check the eigenvalues in the "MDS.mds.eigvals" file (written when the 'eigvals' modifier is added to --mds-plot, as below).

Check eigenvalues
plink --bfile sample_1000G_merged_tmp --read-genome sample_1000G_merged_ibd.genome --cluster --mds-plot 10 eigvals --out MDS

## Split 1000G populations by continent

Split 1000G sample
plink --bfile 1KGP_Phase3_Pruned_forMDS_2019 --keep 1KGP3_eur20170320.txt --make-bed --out 1000G_europeans
plink --bfile 1KGP_Phase3_Pruned_forMDS_2019 --keep 1KGP3_afr20170320.txt --make-bed --out 1000G_africans
plink --bfile 1KGP_Phase3_Pruned_forMDS_2019 --keep 1KGP3_eas20170320.txt --make-bed --out 1000G_asians


## Plot using the ggplot2 package in R

MDS plot
#install.packages("ggplot2")
library(ggplot2)

sample.pop<-read.table("sample_1000G_merged_tmp.fam", header=FALSE)
eur<-read.table("1000G_europeans.fam", header=FALSE)
afr<-read.table("1000G_africans.fam", header=FALSE)
asn<-read.table("1000G_asians.fam", header=FALSE)
mds<-read.table("MDS.mds", h=T)
pop<-merge(sample.pop, mds, by.x=c("V1","V2"),by.y=c("FID","IID"))
pop1<-merge(eur, mds, by.x=c("V1","V2"),by.y=c("FID","IID"))
pop2<-merge(asn, mds, by.x=c("V1","V2"),by.y=c("FID","IID"))
pop3<-merge(afr, mds, by.x=c("V1","V2"),by.y=c("FID","IID"))

## Population Structure Plot for all populations (Figure "sample_population-all.pdf")

pdf("sample_population-all.pdf")
ggplot(pop1, aes(C1, C2)) +
  geom_point(shape=21, colour="yellow", fill="yellow", size=3) +                                 # 1000G Europeans
  geom_point(shape=21, data=pop,  colour="white",      fill="black",      size=3, alpha=0.5) +   # study sample
  geom_point(shape=21, data=pop2, colour="maroon",     fill="maroon",     size=3, alpha=0.5) +   # 1000G East Asians
  geom_point(shape=21, data=pop3, colour="lightGreen", fill="lightGreen", size=3, alpha=0.5) +   # 1000G Africans
  annotate("text", x=0.15, y=-0.10, label="bold(Sample)",  color="black",      parse=TRUE) +
  annotate("text", x=0.15, y=-0.11, label="bold(CEU)",     color="yellow",     parse=TRUE) +
  annotate("text", x=0.15, y=-0.12, label="bold(CHB+JPT)", color="maroon",     parse=TRUE) +
  annotate("text", x=0.15, y=-0.13, label="bold(AFR)",     color="lightGreen", parse=TRUE)
dev.off()
q()

## Here all individuals appear to be of European ancestry according to the MDS plot. Otherwise, create a file fail-pca-qc.txt with two columns (FID, IID) listing the individuals that do not fall within the European cluster on the plot; a sketch follows below.
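
## A sketch of how such a file could be built from MDS.mds; the component cut-offs are hypothetical and must be chosen by inspecting the plot, and the resulting list should be restricted to your own sample (not the 1000G reference individuals):

Create fail-pca-qc.txt (sketch)
# MDS.mds columns: FID IID SOL C1 C2 ...; print FID and IID of outlying individuals
awk 'NR > 1 && ($4 > 0.05 || $5 > 0.05) {print $1, $2}' MDS.mds > fail-pca-qc.txt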

##Make a master list of all the individuals in: fail-sexcheck-qc, fail-het-qc, fail-ibs-pihat-qc, and non-Europeans
#Linux:
cat fail-* | sort -k1 | uniq > fail_inds.txt
wc -l fail_inds.txt #check total number

#In R (Windows):
spf=read.table("fail-sexcheck-qc.txt",header=F)
errorall=read.table("fail-het-qc.txt",header=F)
ibsf=read.table("pihat_min0.2.txt",header=F)
pcaf=read.table("fail-pca-qc.txt",header=F)
failed<-c()
failed=rbind(failed,if(exists("spf"))spf[1:2],if(exists("errorall"))errorall[1:2],if(exists("ibsf"))ibsf[1:2],if(exists("pcaf"))pcaf[1:2])
failed1=failed[!duplicated(failed[,1:2]),]
write.table(failed1,"fail_inds.txt", col.names=F,row.names=F,quote=F,sep='\t')

QUALITY CONTROL FOR MARKERS

## Now that we have the list of individuals that fail the QC steps, we will return to the target dataset in build 37 with all chromosomes and individual CR > 95% (sample-remove-missing.bed/.bim/.fam) and remove these individuals using the flag ‘--remove’. Then we will choose the best thresholds for SNP quality according to our final goal. The following ‘notes’ give example SNP-quality parameters for a simple GWAS.

## Notes:
## --hwe 0.000001  - remove markers with an HWE test P value below 1e-6 (in case/control data this is applied to controls only by default)
## --maf 0.01      - remove markers with a minor allele frequency below 0.01
## --geno 0.01     - remove markers with a missing genotype rate above 0.01 (i.e., call rate < 99%; see plot)

## Keep in mind that you can choose different parameters according to your goals.

## Additional QC may be required for the non-autosomes – for chr X, e.g., check missingness and MAF separately for males and females, and HWE in females only; a sketch is shown below
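
## A sketch of such sex-stratified chr X checks using the PLINK sex filters (output prefixes are hypothetical; no thresholds are applied here):

Chromosome X QC split by sex (sketch)
plink --bfile sample-remove-missing --chr 23 --filter-females --missing --freq --hardy --out chrX_females
plink --bfile sample-remove-missing --chr 23 --filter-males --missing --freq --out chrX_males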


## Identify markers with excessive missing data rate and determine threshold (Figure "SNPcoverage.pdf")

Missing data rate
plink --bfile sample-remove-missing --remove fail_inds.txt --mind 0.02 --missing --out lmiss

## Interpretation:
## It produces the files lmiss.imiss and lmiss.lmiss
## The lmiss file header: CHR  SNP   N_MISS   N_GENO   F_MISS
## Where N_MISS is the number of individuals missing this SNP and F_MISS is the proportion of the sample missing for this SNP

## Plot SNP missingness data in R

Plot SNP coverage
x<-read.table("lmiss.lmiss", header=T, as.is=T)
pdf("SNPcoverage.pdf")
plot( (1:dim(x)[1])/(dim(x)[1]-1), sort(1-x$F_MISS), main="SNP coverage cumulative distribution", xlab="Quantile", ylab="Coverage" ); grid() 
dev.off()
q()

## SNPs that fail this step will be removed using the --geno flag (choose the threshold from the plot above)

Filter individuals and markers that fail QC

## Remove all markers and samples failing quality control (by default, the --hwe filter in plink is applied to controls only)

##HWE will likely remove more SNPs from chr X than necessary because of different genotype distributions between the sexes for chr X SNPs

Filter markers
plink --bfile sample-remove-missing --remove fail_inds.txt --mind 0.02 --hwe 1e-6 --maf 0.01 --geno 0.01  --make-bed  --out sample_clean_final_temp

## This second HWE step effectively focuses on the cases, because in the controls all SNPs with an HWE p-value < 1e-6 have already been removed

Filter markers - part 2
# PLINK 1.9 syntax: "--hwe <p> include-nonctrl"; the equivalent PLINK 1.07 flag was "--hwe-all"
plink --bfile sample_clean_final_temp --hwe 1e-10 include-nonctrl --make-bed --out sample_clean_final


## QC is complete now!

Whole-genome Imputation using the Michigan Imputation Server

(Not a part of this tutorial)

http://imputationserver.camhres.ca

## Convert cleaned GWAS data from plink to vcf.gz format

Change format from plink to vcf.gz
module load VCFTOOLS/0.1.13
module load TABIX/0.2.3
module load bio/BCFtools/1.14-GCC-10.2.0

## shown here for chromosome 1
/plink --bfile sample_clean_final --chr 1 --recode vcf --out sample_clean_final_chr1
/vcf-sort sample_clean_final_chr1.vcf | bgzip -c > sample_clean_final_chr1.vcf.gz
#or
/plink --bfile sample_clean_final --chr 1 --recode vcf --out sample_clean_final_chr1
/bcftools sort sample_clean_final_chr1.vcf -Oz -o sample_clean_final_chr1.vcf.gz

## Upload vcf.gz files to Michigan Imputation Server

## Parameters: genome build, R-sq filter, reference panel, pre-phasing, ancestry

## Post-imputation formatting and initial filtering

Post-imputation formatting and initial filtering
# download imputed files from Michigan Imputation Server
# make sure you record the password for the job from the portal.  The password will be needed for extracting imputed data from the zip files

## For QC Report:
wget https://imputationserver.camhres.ca/share/results/...html
## For QC Statistics, imputation results, and logs:
curl -sL https://imputationserver.camhres.ca/get/... | bash

# extract dose and info files from downloaded imputation results zip folder
## shown here for chromosome 1
7za e chr_1.zip -p'<password>'
## type in <password> (provided for each imputation experiment on the Michigan Imputation Server)

# filter for SNPs with a minimum imputation info score (shown here: 0.9)
gzip -cd chr1.info.gz | awk -F " " '($7>=0.9) {print $1,$7}' > chr1.info_9.SNPs
plink --vcf chr1.dose.vcf.gz --extract chr1.info_9.SNPs --make-bed --out chr1.dose

# merge chr1.dose plink data with imputed data on other chromosomes
plink --bfile chr1.dose --merge-list postmis-chr1-22mergelist2020.txt --make-bed --out sample_imputed


## Post-imputation quality control

Post-imputation quality control
# variants with same positions: check the plink bim file and make a list of ALL markers with same chromosomal positions with filename = "list_variants_with_same_positions.txt"
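# a sketch of building that list with awk (two passes over the .bim; prints every variant whose chr:pos occurs more than once):
awk 'NR==FNR {n[$1":"$4]++; next} n[$1":"$4] > 1 {print $2}' sample_imputed.bim sample_imputed.bim > list_variants_with_same_positions.txt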
plink --bfile sample_imputed --exclude list_variants_with_same_positions.txt --make-bed --out sample_imputed_clean1

# update sex and affected status: this information should be coming from the cleaned pre-imputation plink fam file
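# a sketch of building those two update files from the cleaned fam (columns: FID IID PAT MAT SEX PHENO):
awk '{print $1, $2, $5}' sample_clean_final.fam > sample_clean_final.fam_sex
awk '{print $1, $2, $6}' sample_clean_final.fam > sample_clean_final.fam_aff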
plink --bfile sample_imputed_clean1 --update-sex sample_clean_final.fam_sex --pheno sample_clean_final.fam_aff --make-bed --out sample_imputed_clean2

# update rsID: merge post-imputation plink "sample_imputed_clean2.bim" file with reference plink bim file (with rsID; reference data may be 1000 Genomes Project Phase 3) using chromosome and position as keys, then set up a list of post-imputation marker names (column 1) and corresponding rsID from reference file (column 2) with filename = "sample_imputed_clean2-1000G_rsID.txt"
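# a sketch of building that list with awk, joining a reference bim (1KGPh3_b37fwd2019.bim used here as an example) to the post-imputation bim on chr:pos:
awk 'NR==FNR {rs[$1":"$4]=$2; next} ($1":"$4) in rs {print $2, rs[$1":"$4]}' 1KGPh3_b37fwd2019.bim sample_imputed_clean2.bim > sample_imputed_clean2-1000G_rsID.txt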
plink --bfile sample_imputed_clean2 --update-map sample_imputed_clean2-1000G_rsID.txt --update-name --make-bed --out sample_imputed_clean3

# final filtering on --maf, --geno, --mind, --hwe
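# a sketch with example thresholds (adjust to your study design; the output prefix is hypothetical):
plink --bfile sample_imputed_clean3 --maf 0.01 --geno 0.02 --mind 0.02 --hwe 1e-6 --make-bed --out sample_imputed_clean_final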


Statistical Analysis and Graphs


ANALYSIS

(using unimputed data for this tutorial)

For case control analysis

regression analysis
#For autosomes (unimputed chromosomes 1-22 for this tutorial):
    plink --bfile sample_clean_final --chr 1-22 --logistic --out res_chr1-22

#Additional option: analyzing X chromosome (For more information on analytic models for X chromosome markers: https://www.cog-genomics.org/plink/1.9/assoc; exclude sex as covariate):
    plink --bfile sample_clean_final --chr 23 --logistic --xchr-model 2 --out res_chr23_xmeth2  

## QQ plot  in R (Figure "qqPlot_data_clean_final.assoc.logistic.pdf")

QQ plot
#In R
#If analyzing only chromosomes 1-22 (for this tutorial):
    data<-read.table("res_chr1-22.assoc.logistic", h=T, as.is=T) #or use clean-DATA.assoc.linear

#If analyzing chromosomes 1-22 and X chromosome:
    x1<-read.table("res_chr1-22.assoc.logistic", h=T, as.is=T) #or use clean-DATA.assoc.linear
    x2<-read.table("res_chr23_xmeth2.assoc.logistic", h=T, as.is=T)
    data=rbind(x1,x2)

#Plotting QQ plot:
source('qq_plot_v5.R')
# if covariates are included in model, then make sure to filter for rows that correspond to SNP results (TEST=="ADD")
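# a sketch of that filtering step: data <- subset(data, TEST == "ADD")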
pdf("qqPlot_data_clean_final.assoc.logistic.pdf") 
qq.plot(data[ ,9], xlab= "Expected P value", ylab= "Observed P value", main="qqPlot for SNPs in clean-DATA")
dev.off()

## Manhattan Plot in R (Figure "Manhattan_res.assoc.logistic.pdf")

Manhattan plot
#In R
#If analyzing only chromosomes 1-22 (for this tutorial):
    data<-read.table("res_chr1-22.assoc.logistic", h=T, as.is=T) #or use clean-DATA.assoc.linear

#If analyzing chromosomes 1-22 and X chromosome:
    x1<-read.table("res_chr1-22.assoc.logistic", h=T, as.is=T) #or use clean-DATA.assoc.linear
    x2<-read.table("res_chr23_xmeth2.assoc.logistic", h=T, as.is=T)
    data=rbind(x1,x2)

#Plotting Manhattan Plot: 
source('manhattan_v1.R')
pdf("Manhattan_res.assoc.logistic.pdf") 
d<-subset(data, data$TEST=="ADD")
X<-data.frame(CHR=d$CHR, BP=d$BP, P=d$P)
manhattan( X, GWthresh=5e-8, GreyZoneThresh=1e-5, DrawGWline=TRUE )
dev.off()

## Selecting SNPs by p-values threshold

Filter SNPs by p-value
#In R
#If analyzing only chromosomes 1-22 (for this tutorial):
    data<-read.table("res_chr1-22.assoc.logistic", h=T, as.is=T) #or use clean-DATA.assoc.linear

#If analyzing chromosomes 1-22 and X chromosome:
    x1<-read.table("res_chr1-22.assoc.logistic", h=T, as.is=T) #or use clean-DATA.assoc.linear
    x2<-read.table("res_chr23_xmeth2.assoc.logistic", h=T, as.is=T)
    data=rbind(x1,x2)
 
#Filtering results based on p-values:
unadj2=data[which(data$TEST=="ADD"),]
significant=unadj2[which(unadj2$P<5e-8),]
significant
suggestive=unadj2[which(unadj2$P<0.0005),]
head(suggestive)

write.table(suggestive, "case-control-res.txt", col.names=TRUE, row.names=FALSE, quote=FALSE)

q()
n


## Gene-based analysis using MAGMA (not a part of this tutorial; https://ctg.cncr.nl/software/magma)

Gene-based analysis with MAGMA
module load bio/MAGMA/1.07b-foss-2018b
#need to download gene location file (https://ctg.cncr.nl/software/MAGMA/aux_files/NCBI37.3.zip) to e.g., GWAS_RefData/MAGMA folder and extract zip file

##annotation (with flanking regions of 5kb upstream and 1.5kb downstream shown here as an example)
magma --annotate window=5,1.5 --snp-loc data/sample_chr1-22misRSq7_23_cleaned.bim --gene-loc GWAS_RefData/MAGMA/NCBI37.3/NCBI37.3.gene.loc --out data/sample_chr1-22misRSq7_23_cleaned_genesflanking.annot

##analysis
magma --bfile data/sample_chr1-22misRSq7_23_cleaned --pval data/sample_chr1-22misRSq7_23_cleaned_cov20220331out_add.txt ncol=NMISS --gene-annot data/sample_chr1-22misRSq7_23_cleaned_genesflanking.annot.genes.annot.txt --gene-settings fixed-permp=10000 --out data/sample_chr1-22misRSq7_23_cleaned_genesflanking.annot_genebased1000perm.out


References

Anderson, C.A., et al. (2010) Data quality control in genetic case-control association studies. Nat Protoc. 5:1564-1573. doi:10.1038/nprot.2010.116

Clarke, G.M., et al. (2011) Basic statistical analysis in genetic case-control studies. Nat Protoc. 6: 121-133. doi:10.1038/nprot.2010.182.

Weale, Michael. (2011, Sept. 23) Notes on GWAS QC/QA. https://sites.google.com/site/mikeweale/software/gwascode

Marees AT, de Kluiver H, Stringer S, et al. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int J Methods Psychiatr Res. 2018;27(2):e1608. doi:10.1002/mpr.1608

Uffelmann, E., et al. (2021) Genome-wide association studies.  Nature Reviews Methods Primers. Vol 1: Article number 59.  https://doi.org/10.1038/s43586-021-00056-9


Additional Resources

1000 Genomes reference data for QC of (+/-) strands and b37 map positions (>13.5M SNPs)

HapMap data for population structure analysis (CEU, YRI, CHB, JPT only) HapMap_b37_pops.zip

Local Ancestry Analyses using TRACTOR – tutorial: https://atkinson-lab.github.io/Tractor-tutorial/

Acknowledgements

Dr. Joanne Knight, Dr. Sarah Gagliano, Dr. Victoria Marshe

