Overview
This is a guide to genome-wide data cleaning and analysis in Plink.
Learning Objectives
(1) Apply quality control measures for genome-wide genotyping data sets;
(2) Perform population stratification analysis;
(3) Perform basic association analysis;
(4) Become familiar with the Plink and R environments.
Materials
- highLDregions4bim_b37_24JoK.txt - a file specifying extended genomic regions in high LD
- Raw_DATA.ped/map - plink files to work on; the map is in build 36, so it needs to be updated to build 37
- 1KGP_Phase3_Pruned_forMDS_2019.bed/bim/fam - 1000 Genomes Phase 3 data in build 37 (AFR, EUR, EAS founders)
- 1KGPh3_b37fwd2019.bim - 1000 Genomes Phase 3 binary map file in build 37 (for updating the map from b36 to b37)
- manhattan_v1.R, qq_plot_v5.R - R scripts for plotting the GWAS findings
- Files for plotting population structure
## All material is located in the following folder:
https://support.scinet.utoronto.ca/education/go.php/542/file_storage/index.php
## If the link above does not work, please download the following zipped folder:
# ***Powerpoint slides for this workshop: CAMH_GWAS_slides_20231128.pdf
Please download all files and create a folder on your PC or on the cluster in which to run the analysis. Remember: the plink executable must be stored in the same folder if you are running on your own PC.
Resources
- PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/)
- PLINK 1.9 (https://www.cog-genomics.org/plink/)
- Need to load plink module if using CAMH Specialized Computing Cluster: e.g., module load bio/PLINK/1.9b_6.15-x86_64
- R (http://cran.r-project.org/)
- Need to load R module if using CAMH Specialized Computing Cluster: e.g., module load lang/R/4.0.3-Python-3.8.5-Anaconda3-2020.11
- Unix-based command terminal
- A text editor
QUALITY CONTROL & ANALYSIS OF GWAS DATA
## Convert Plink pedigree files to binary format to save space
plink --file Raw_DATA --make-bed --out Raw_DATA_b36
## Visualize the 1000 Genomes .bim file using the command ‘head’, then use ‘awk’ to create a new file containing columns 2 (variant identifier) and 4 (base-pair coordinate) of the 1000 Genomes .bim file, to use as the build 37 SNP map, plus an additional file with columns 2 (variant identifier) and 1 (chromosome), for the next step below
#Linux (and MAC):
head 1KGPh3_b37fwd2019.bim
awk '{print $2 , $1}' 1KGPh3_b37fwd2019.bim > 1KGPh3_b37fwd2019_snps_b37chr.txt
head 1KGPh3_b37fwd2019_snps_b37chr.txt
awk '{print $2 , $4}' 1KGPh3_b37fwd2019.bim > 1KGPh3_b37fwd2019_snps_b37map.txt
head 1KGPh3_b37fwd2019_snps_b37map.txt

#In R (Windows):
install.packages("data.table")
library(data.table)
a = read.table("1KGPh3_b37fwd2019.bim", header = F)
b = subset(a, select=c("V2", "V1"))
c = subset(a, select=c("V2", "V4"))
write.table(b, "1KGPh3_b37fwd2019_snps_b37chr.txt", col.names=F,row.names=F,quote=F,sep='\t')
write.table(c, "1KGPh3_b37fwd2019_snps_b37map.txt", col.names=F,row.names=F,quote=F,sep='\t')
#alternatively
#write.table(a[,c(2,1)], "1KGPh3_b37fwd2019_snps_b37chr.txt", col.names=F,row.names=F,quote=F,sep='\t')
#write.table(a[,c(2,4)], "1KGPh3_b37fwd2019_snps_b37map.txt", col.names=F,row.names=F,quote=F,sep='\t')
## As the target dataset map is in build 36, use 1KGPh3_b37fwd2019_snps_b37map.txt to keep only the variants present in the 1000 Genomes dataset and to update the map of the target dataset to build 37, using the Plink flags --extract and --update-map (***--extract and --update-map must go in separate plink commands***)
plink --bfile Raw_DATA_b36 --extract 1KGPh3_b37fwd2019_snps_b37map.txt --make-bed --out sample_b36_s
plink --bfile sample_b36_s --update-map 1KGPh3_b37fwd2019_snps_b37chr.txt --update-chr --make-bed --out sample_b37a
plink --bfile sample_b37a --update-map 1KGPh3_b37fwd2019_snps_b37map.txt --make-bed --out sample_b37
## Your map file is now in build 37
## It is recommended that the strand of the SNP genotype calls in the dataset be updated to the positive (forward) strand. This will prevent problems, e.g., when merging or imputing data.
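## A minimal sketch of such a strand update, assuming you have already produced a list of SNPs called on the negative strand (the file name "snps_to_flip.txt" is illustrative, e.g., from a strand-checking tool): plink's --flip command flips the listed SNPs to the other strand.
plink --bfile sample_b37 --flip snps_to_flip.txt --make-bed --out sample_b37_flipped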
Quality Control Steps
QUALITY CONTROL FOR INDIVIDUALS
Identify and list individuals with discordant sex information (requires X chromosome information)
plink --bfile sample_b37 --check-sex --out sample_sex
## Interpretation:
# the output is the file "sample_sex.sexcheck" with the following columns:
# FID IID PEDSEX SNPSEX STATUS F
# where “STATUS” is either “OK” or “PROBLEM”
## Use ‘grep’ in the terminal to select rows containing the pattern ‘PROBLEM’ and print them to a text file listing all individuals that failed the sex check and are to be removed from the dataset
#Linux (and MAC):
grep "PROBLEM" sample_sex.sexcheck > fail-sexcheck-qc.txt

#In R (Windows):
sp=read.table("sample_sex.sexcheck",header=T)
write.table(sp[sp$STATUS=="PROBLEM",],"fail-sexcheck-qc.txt", col.names=F,row.names=F,quote=F,sep='\t')
## It is also possible to impute sex using the genotypes of the X chromosome. If you choose to impute instead of removing the individuals that failed the sex check, run the following command:
# plink --bfile sample_b37 --impute-sex --make-bed --out sample_b37_sexqc
Identify individuals with excessive missing data rates and determine threshold
plink --bfile sample_b37 --missing --out check_individuals
## Interpretation:
# Two outputs are generated: "check_individuals.imiss" (per individual) and "check_individuals.lmiss" (per locus).
# The file.imiss header: FID IID MISS_PHENO N_MISS N_GENO F_MISS
# where N_MISS is the number and F_MISS is the proportion of missing SNPs per individual
## Plot individual missingness data in R
IMISS<-read.table("check_individuals.imiss", header=T, as.is=T)
pdf("individualcallrate.pdf")
plot( (1:dim(IMISS)[1])/(dim(IMISS)[1]-1), sort(1-IMISS$F_MISS), main="Individual call rate cumulative distribution", xlab="Quantile", ylab="Call Rate" )
grid()
dev.off()
q()
## The proportion of SNPs successfully genotyped per individual and the proportion of individuals successfully genotyped per SNP are called the individual call rate (CR) and the SNP CR, respectively. Usually, only SNPs and individuals with CR ≥ 95% are kept in the sample. Select the threshold suitable for the dataset and the analyses you intend to run.
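## To help choose a threshold, a quick check in R (a minimal sketch, assuming the IMISS table from the plotting step above) of how many individuals would be removed at candidate --mind cut-offs:
# number of individuals with a missing rate above each candidate threshold
sapply(c(0.01, 0.02, 0.05), function(t) sum(IMISS$F_MISS > t))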
## For the purpose of this tutorial, let’s remove individuals with missing rate > 0.05 (CR < 95%) using --mind 0.05 in Plink
plink --bfile sample_b37 --mind 0.05 --make-bed --out sample-remove-missing
Identify individuals with outlying heterozygosity rates and determine threshold
plink --bfile sample-remove-missing --het --out het
## Interpretation:
# which produces the file het.het with the following columns
# file.het header: FID IID O(HOM) E(HOM) N(NM) F
# O(HOM) = observed # homozygous genos; N(NM) = nonmissing genos per individual
# (N(NM) − O(HOM)) / N(NM) = mean heterozygosity rate per individual
## Plot heterozygosity data in R
het<-read.table("het.het", h=T)
het$meanHet <- (het$N.NM. - het$O.HOM.)/het$N.NM.
pdf("het-histogram.pdf")
hist(het$meanHet, col="green", xlab="Heterozygosity rate", ylab="Frequency", main="Distribution of heterozygosity per individual")
linea<- mean(het$meanHet) + 2*sd(het$meanHet)
lineb<- mean(het$meanHet) - 2*sd(het$meanHet)
abline(v=linea, col="red") #red lines on the graph represent +/- 2 standard deviations from the mean
abline(v=lineb, col="red")
linea<- mean(het$meanHet) + 3*sd(het$meanHet)
lineb<- mean(het$meanHet) - 3*sd(het$meanHet)
abline(v=linea, col="blue") #blue lines on the graph represent +/- 3 standard deviations from the mean
abline(v=lineb, col="blue")
dev.off()

##Make a list of heterozygosity outliers; example: exclude individuals beyond +/- 3 SD from the mean
errora<-subset(het, het$meanHet > mean(het$meanHet) + 3*sd(het$meanHet))
errorb<-subset(het, het$meanHet < mean(het$meanHet) - 3*sd(het$meanHet))
errorall<-rbind(errora[1:2], errorb[1:2])
write.table(errorall, "fail-het-qc.txt", col.names=F,row.names=F,quote=F,sep='\t')
Identify duplicated or related individuals (ie. IBD > 0.185) (Figure "IBSplot.pdf")
## For analysis of duplicate and related individuals, we will generate a series of temporary files.
## We will go back to "sample-remove-missing" when we have a list of unrelated and unique individuals.
## Exclude SNPs in high LD, as well as SNPs on chromosomes 23, 24, 25, and 26
## Visualize the last lines of the .bim file.
#Linux:
tail sample-remove-missing.bim

#In R (Windows):
bimm=read.table("sample-remove-missing.bim",header=F)
tail(bimm)
## Use ‘awk’ to make a text file with SNPs mapped on chromosomes 1 to 22:
#Linux:
awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' sample-remove-missing.bim > snp.txt

#In R (Windows):
bimm=read.table("sample-remove-missing.bim",header=F)
bimm1=bimm[bimm$V1<23,]
bimm2=subset(bimm1,select=c(2))
write.table(bimm2,"snp.txt", col.names=F,row.names=F,quote=F,sep='\t')
#write.table(bimm1[2],"snp.txt", col.names=F,row.names=F,quote=F,sep='\t')
## Use ‘--extract snp.txt’ to make new binary files containing only autosomal SNPs
plink --bfile sample-remove-missing --extract snp.txt --make-bed --out sample-remove-missing-chr
## Use ‘tail’ to confirm non-autosomal SNPs were removed
#Linux:
tail sample-remove-missing-chr.bim

#In R (Windows):
bima=read.table("sample-remove-missing-chr.bim",header=F)
tail(bima)
## The file highLDregions4bim_b37_24JoK.txt contains coordinate ranges of high-LD regions of the genome. These regions must be excluded before pruning. Use ‘--exclude highLDregions4bim_b37_24JoK.txt --range’ to remove SNPs in these regions during pruning, and use ‘--indep-pairwise 50 5 0.2’ to prune the target dataset. The parameters ‘50 5 0.2’ stand for, respectively: the window size (in SNPs), the number of SNPs by which to shift the window at each step, and the pairwise r² threshold above which one SNP of each correlated pair is pruned.
plink --bfile sample-remove-missing-chr --exclude highLDregions4bim_b37_24JoK.txt --range --indep-pairwise 50 5 0.2 --out prune
## Interpretation:
## It produces two files, prune.prune.in and prune.prune.out. The prune.prune.in file contains only the independent SNPs, which can be used to calculate relatedness in the target dataset.
## Generate pairwise IBS for all individuals
## Use ‘--extract prune.prune.in’ and ‘--genome’ to calculate IBS using only independent SNPs.
plink --bfile sample-remove-missing-chr --extract prune.prune.in --genome --out IBS
## Interpretation:
## IBS.genome file header: FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
## PI_HAT= proportion (IBD) per pair of individuals (P(IBD=2)+0.5*P(IBD=1))
## Typically, one individual in each pair with an IBD estimate above 0.1875 is removed; this threshold is halfway between the expected IBD of second-degree (0.25) and third-degree (0.125) relatives (Anderson et al., 2010). For reference, duplicates/monozygotic twins have IBD = 1 and first-degree relatives have IBD = 0.5.
## Plot relatedness data in R
IBS<-read.table("IBS.genome", h=T)
pdf("IBSplot.pdf")
hist(IBS$PI_HAT, col="green", breaks=100, xlab="Estimated mean pairwise IBD", main="Pairwise Identity-by-descent")
abline(v=0.185) #related individuals are assumed to have a pairwise IBD of over 0.185
dev.off()
q()
## Use the following commands to create a file with the pairs of related subjects (PI_HAT > 0.2)
#Linux:
awk '{ if ($10 > 0.2) print $0 }' IBS.genome > pihat_min0.2.txt

#In R (Windows):
ibs=read.table("IBS.genome",header=T)
write.table(ibs[ibs$PI_HAT>0.2,],"pihat_min0.2.txt", col.names=T,row.names=F,quote=F,sep='\t')
## For each pair of ‘related’ individuals with a pihat > 0.2, we recommend removing the individual with the lowest call rate, but you may also choose to remove individuals with poor-quality phenotype data, depending on the dataset and study design. Write the individuals to be removed to a text file, such as fail-pihat-qc.txt
## If you choose to filter by CR, you can use a UNIX text editor (e.g., vi(m)) to check which individual of each ‘related pair’ has the lowest CR and include it in the list of individuals to be removed. Use the ‘--missing’ command to check CR, or see the R sketch below.
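## A minimal R sketch for this step, assuming pihat_min0.2.txt has a header row (as written by the R command above) and using the per-individual missing rates in check_individuals.imiss; it flags, for each related pair, the member with the higher missing rate (lower CR):
pairs <- read.table("pihat_min0.2.txt", header=TRUE)
imiss <- read.table("check_individuals.imiss", header=TRUE)
# look up each member's missing rate (F_MISS)
m1 <- imiss$F_MISS[match(pairs$IID1, imiss$IID)]
m2 <- imiss$F_MISS[match(pairs$IID2, imiss$IID)]
# keep the FID/IID of the member with the higher missing rate in each pair
drop <- data.frame(FID=ifelse(m1 >= m2, pairs$FID1, pairs$FID2), IID=ifelse(m1 >= m2, pairs$IID1, pairs$IID2))
write.table(unique(drop), "fail-pihat-qc.txt", col.names=FALSE, row.names=FALSE, quote=FALSE, sep='\t')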
Identify individuals of divergent ancestry
## Extract independent SNPs from target dataset by using ‘--extract’ and the file prune.prune.in generated above.
plink --bfile sample-remove-missing-chr --extract prune.prune.in --make-bed --out sample-remove-missing-chr-pruned
## Observe that the dataset is smaller now for population stratification cleaning
## Prepare the 1000 Genomes data (keep only SNPs in common with the pruned target dataset)
## Merge the study sample with 1000G and keep only SNPs with a missing rate < 3% (CR > 97%)
plink --bfile sample-remove-missing-chr-pruned --bmerge 1KGP_Phase3_Pruned_forMDS_2019.bed 1KGP_Phase3_Pruned_forMDS_2019.bim 1KGP_Phase3_Pruned_forMDS_2019.fam --geno 0.03 --make-bed --out sample_1000G_merged_tmp
plink --bfile sample_1000G_merged_tmp --genome --out sample_1000G_merged_ibd
## Check the eigenvalues in the "MDS.mds.eigvals" file (in PLINK 1.9, this file is written when the ‘eigvals’ modifier is added to --mds-plot).
plink --bfile sample_1000G_merged_tmp --read-genome sample_1000G_merged_ibd.genome --cluster --mds-plot 10 --out MDS
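## A minimal R sketch for inspecting the eigenvalues, assuming an "MDS.mds.eigvals" file is available (see the note above); the relative size of each eigenvalue indicates how much of the variation each MDS dimension captures:
eig <- scan("MDS.mds.eigvals")   # eigenvalues, one per MDS dimension
round(eig / sum(eig), 3)         # relative contribution of each dimension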
## Split 1000G populations by continent
plink --bfile 1KGP_Phase3_Pruned_forMDS_2019 --keep 1KGP3_eur20170320.txt --make-bed --out 1000G_europeans
plink --bfile 1KGP_Phase3_Pruned_forMDS_2019 --keep 1KGP3_afr20170320.txt --make-bed --out 1000G_africans
plink --bfile 1KGP_Phase3_Pruned_forMDS_2019 --keep 1KGP3_eas20170320.txt --make-bed --out 1000G_asians
## Plot using the ggplot2 package in R
#install.packages("ggplot2") library(ggplot2) pcas<-read.table("MDS.mds", header=TRUE) sample.pop<-read.table("sample_1000G_merged_tmp.fam", header=FALSE) eur<-read.table("1000G_europeans.fam", header=FALSE) afr<-read.table("1000G_africans.fam", header=FALSE) asn<-read.table("1000G_asians.fam", header=FALSE) mds<-read.table("MDS.mds", h=T) pop<-merge(sample.pop, mds, by.x=c("V1","V2"),by.y=c("FID","IID")) pop1<-merge(eur, mds, by.x=c("V1","V2"),by.y=c("FID","IID")) pop2<-merge(asn, mds, by.x=c("V1","V2"),by.y=c("FID","IID")) pop3<-merge(afr, mds, by.x=c("V1","V2"),by.y=c("FID","IID")) ## Population Structure Plot for all populations (Figure "sample_population-all.pdf") pdf("sample_population-all.pdf") ggplot(pop1, aes(C1,C2)) + geom_point(shape=21,colour = "yellow", fill="yellow", size = 3) + geom_point(shape=21,data=pop, colour = "white",fill = "black", size = 3, alpha = 0.5) + geom_point(shape=21,data=pop2, colour = "maroon",fill = "maroon", size = 3, alpha = 0.5)+ geom_point(shape=21,data=pop3, colour = "lightGreen",fill = "lightGreen", size = 3, alpha = 0.5) + annotate("text", x=0.15,y=-0.1,label = "bold(Sample)",color = "black",parse = TRUE)+ annotate("text", x=0.15,y=-0.11,label = "bold(CEU)",color = "yellow",parse = TRUE)+ annotate("text", x=0.15,y=-0.12,label = "bold(CHB+JPT)",color = "maroon", parse = TRUE)+ annotate("text", x=0.15,y=-0.13,label = "bold(AFR)",color = "lightGreen",parse = TRUE) dev.off() q() ##Here all individuals appear to be of European ancestry according to PCA plot. Otherwise, create a file fail-pca-qc.txt with 2 columns (FID, IID) containing listing individuals not in the European cluster on the PCA plot ##Make a master list of all the individuals in: fail-sexcheck-qc, fail-het-qc, fail-ibs-pihat-qc, and non-Europeans #Linux: cat fail-* | sort -k1 | uniq > fail_inds.txt wc -l fail_inds.txt #check total number #In R (Windows): spf=read.table("fail-sexcheck-qc.txt",header=F) errorall=read.table("fail-het-qc.txt",header=F) ibsf=read.table("pihat_min0.2.txt",header=F) pcaf=read.table("fail-pca-qc.txt",header=F) failed<-c() failed=rbind(failed,if(exists("spf"))spf[1:2],if(exists("errorall"))errorall[1:2],if(exists("ibsf"))ibsf[1:2],if(exists("pcaf"))pcaf[1:2]) failed1=failed[!duplicated(failed$V1),] write.table(failed1,"fail_inds.txt", col.names=F,row.names=F,quote=F,sep='\t')
QUALITY CONTROL FOR MARKERS
## Now that we have the list of individuals that fail the QC steps, we will return to the build 37 target dataset with all chromosomes and CR > 95% (sample-remove-missing.bed/.bim/.fam) and remove these individuals using the ‘--remove’ flag. Then we will decide the best thresholds for SNP quality according to our final goal. The following notes give example SNP-quality parameters for a simple GWAS.
## Notes:
## --hwe 0.000001 - removes markers with a HWE test P value below 0.000001 (computed in controls only, by default, for case/control data)
## --maf 0.01 - removes markers with a minor allele frequency below 0.01
## --geno 0.01 - removes markers with a missing rate above 1% (call rate < 99%; see plot)
## Keep in mind that you can choose different parameters according to your goals.
## Additional QC may be required for non-autosomes – for chr X, e.g., missingness and MAF separately for males and females, and HWE in females (see the sketch below)
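## A minimal sketch of such chr X checks (the output prefixes are illustrative); PLINK’s --filter-females / --filter-males flags restrict the reports to one sex:
plink --bfile sample-remove-missing --remove fail_inds.txt --chr 23 --filter-females --missing --freq --hardy --out chrX_females
plink --bfile sample-remove-missing --remove fail_inds.txt --chr 23 --filter-males --missing --freq --out chrX_males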
## Identify markers with excessive missing data rate and determine threshold (Figure "SNPcoverage.pdf")
plink --bfile sample-remove-missing --remove fail_inds.txt --mind 0.02 --missing --out lmiss
## Interpretation:
## It produces the files lmiss.imiss and lmiss.lmiss
## The lmiss file header: CHR SNP N_MISS N_GENO F_MISS
## Where N_MISS is the number of individuals missing this SNP and F_MISS is the proportion of the sample missing for this SNP
## Plot SNP missingness data in R
x<-read.table("lmiss.lmiss", header=T, as.is=T)
pdf("SNPcoverage.pdf")
plot( (1:dim(x)[1])/(dim(x)[1]-1), sort(1-x$F_MISS), main="SNP coverage cumulative distribution", xlab="Quantile", ylab="Coverage" )
grid()
dev.off()
q()
## SNPs that fail this step will be removed using the --geno flag (choose the threshold from the plot, or see the check below)
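## A quick check in R (a minimal sketch, assuming the x table from the plotting step above) of how many SNPs would be removed at candidate --geno cut-offs:
# number of SNPs with a missing rate above each candidate threshold
sapply(c(0.01, 0.02, 0.05), function(t) sum(x$F_MISS > t))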
Filter individuals and markers that fail QC
##Remove all markers and samples failing quality control (for case/control data, by default the --hwe option in plink filters on controls only)
##HWE will likely remove more SNPs from chr X than necessary because of different genotype distributions between the sexes for chr X SNPs
plink --bfile sample-remove-missing --remove fail_inds.txt --mind 0.02 --hwe 1e-6 --maf 0.01 --geno 0.01 --make-bed --out sample_clean_final_temp
## This second HWE step effectively focuses on the cases, because SNPs with a HWE p-value < 1e-6 in the controls have already been removed
# use either "--hwe-all" or "--hwe include-nonctrl"
plink --bfile sample_clean_final_temp --hwe 1e-10 --hwe-all --make-bed --out sample_clean_final
## QC is complete now!
Whole-genome Imputation using the Michigan Imputation Server
(Not a part of this tutorial)
http://imputationserver.camhres.ca
## Convert cleaned GWAS data from plink to vcf.gz format
module load VCFTOOLS/0.1.13
module load TABIX/0.2.3
module load bio/BCFtools/1.14-GCC-10.2.0

## shown here for chromosome 1
plink --bfile sample_clean_final --chr 1 --recode vcf --out sample_clean_final_chr1
vcf-sort sample_clean_final_chr1.vcf | bgzip -c > sample_clean_final_chr1.vcf.gz
#or
plink --bfile sample_clean_final --chr 1 --recode vcf --out sample_clean_final_chr1
bcftools sort sample_clean_final_chr1.vcf -Oz -o sample_clean_final_chr1.vcf.gz
## Upload vcf.gz files to Michigan Imputation Server
## Parameters: genome build, R-sq filter, reference panel, pre-phasing, ancestry
## Post-imputation quality control
# download the imputed files from the Michigan Imputation Server
# make sure you record the password for the job from the portal; it is needed to extract the imputed data from the zip files
## For the QC Report:
wget https://imputationserver.camhres.ca/share/results/...html
## For QC statistics, imputation results, and logs:
curl -sL https://imputationserver.camhres.ca/get/... | bash

# extract dose and info files from the downloaded imputation results zip folder
## shown here for chromosome 1
7za e chr_1.zip -p'<password>' ## type in the <password> provided for each imputation job on the Michigan Imputation Server

# filter for SNPs with a minimum imputation info score (shown here: 0.9)
gzip -cd chr1.info.gz | awk -F " " '($7>=0.9) {print $1,$7}' > chr1.info_9.SNPs
plink --vcf chr1.dose.vcf.gz --extract chr1.info_9.SNPs --make-bed --out chr1.dose

# merge the chr1.dose plink data with the imputed data from the other chromosomes
plink --bfile chr1.dose --merge-list postmis-chr1-22mergelist2020.txt --make-bed --out sample_imputed
# variants with the same positions: check the plink bim file and make a list of ALL markers sharing a chromosomal position (see the sketch below), with filename = "list_variants_with_same_positions.txt"
plink --bfile sample_imputed --exclude list_variants_with_same_positions.txt --make-bed --out sample_imputed_clean1

# update sex and affection status: this information should come from the cleaned pre-imputation plink fam file
plink --bfile sample_imputed_clean1 --update-sex sample_clean_final.fam_sex --pheno sample_clean_final.fam_aff --make-bed --out sample_imputed_clean2

# update rsIDs: merge the post-imputation "sample_imputed_clean2.bim" file with a reference bim file (with rsIDs; the reference may be 1000 Genomes Project Phase 3) using chromosome and position as keys, then set up a list with the post-imputation marker names (column 1) and the corresponding rsIDs from the reference file (column 2), with filename = "sample_imputed_clean2-1000G_rsID.txt"
plink --bfile sample_imputed_clean2 --update-map sample_imputed_clean2-1000G_rsID.txt --update-name --make-bed --out sample_imputed_clean3

# final filtering on --maf, --geno, --mind, --hwe
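## A minimal R sketch (file names as above) for building the duplicate-position list, writing one marker ID per line:
bim <- read.table("sample_imputed.bim", header=FALSE, col.names=c("CHR","SNP","CM","BP","A1","A2"))
dup <- duplicated(bim[,c("CHR","BP")]) | duplicated(bim[,c("CHR","BP")], fromLast=TRUE)
# write ALL markers that share a chromosomal position with another marker
write.table(bim$SNP[dup], "list_variants_with_same_positions.txt", col.names=FALSE, row.names=FALSE, quote=FALSE)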
Statistical Analysis and Graphs
ANALYSIS
(using unimputed data for this tutorial)
For case-control analysis
#For autosomes (unimputed chromosomes 1-22 for this tutorial):
plink --bfile sample_clean_final --chr 1-22 --logistic --out res_chr1-22

#Additional option: analyzing the X chromosome (for more information on analytic models for X chromosome markers see https://www.cog-genomics.org/plink/1.9/assoc; exclude sex as a covariate):
plink --bfile sample_clean_final --chr 23 --logistic --xchr-model 2 --out res_chr23_xmeth2
## QQ plot in R (Figure "qqPlot_data_clean_final.assoc.logistic.pdf")
#In R
#If analyzing only chromosomes 1-22 (for this tutorial):
data<-read.table("res_chr1-22.assoc.logistic", h=T, as.is=T) #or use clean-DATA.assoc.linear

#If analyzing chromosomes 1-22 and the X chromosome:
x1<-read.table("res_chr1-22.assoc.logistic", h=T, as.is=T) #or use clean-DATA.assoc.linear
x2<-read.table("res_chr23_xmeth2.assoc.logistic", h=T, as.is=T)
data=rbind(x1,x2)

#Plotting QQ plot:
source('qq_plot_v5.R')
# if covariates are included in the model, make sure to filter for rows that correspond to SNP results (TEST=="ADD")
pdf("qqPlot_data_clean_final.assoc.logistic.pdf")
qq.plot(data[ ,9], xlab= "Expected P value", ylab= "Observed P value", main="qqPlot for SNPs in clean-DATA")
dev.off()
## Manhattan Plot in R (Figure "Manhattan_res.assoc.logistic.pdf")
#In R
#If analyzing only chromosomes 1-22 (for this tutorial):
data<-read.table("res_chr1-22.assoc.logistic", h=T, as.is=T) #or use clean-DATA.assoc.linear

#If analyzing chromosomes 1-22 and the X chromosome:
x1<-read.table("res_chr1-22.assoc.logistic", h=T, as.is=T) #or use clean-DATA.assoc.linear
x2<-read.table("res_chr23_xmeth2.assoc.logistic", h=T, as.is=T)
data=rbind(x1,x2)

#Plotting Manhattan plot:
source('manhattan_v1.R')
pdf("Manhattan_res.assoc.logistic.pdf")
d<-subset(data, data$TEST=="ADD")
X<-data.frame(CHR=d$CHR, BP=d$BP, P=d$P)
manhattan( X, GWthresh=5e-8, GreyZoneThresh=1e-5, DrawGWline=TRUE )
dev.off()
## Selecting SNPs by p-values threshold
#In R
#If analyzing only chromosomes 1-22 (for this tutorial):
data<-read.table("res_chr1-22.assoc.logistic", h=T, as.is=T) #or use clean-DATA.assoc.linear

#If analyzing chromosomes 1-22 and the X chromosome:
x1<-read.table("res_chr1-22.assoc.logistic", h=T, as.is=T) #or use clean-DATA.assoc.linear
x2<-read.table("res_chr23_xmeth2.assoc.logistic", h=T, as.is=T)
data=rbind(x1,x2)

#Filtering results based on p-values:
unadj2=data[which(data$TEST=="ADD"),]
significant=unadj2[which(unadj2$P<5e-8),]
significant
suggestive=unadj2[which(unadj2$P<0.0005),]
head(suggestive)
write.table(suggestive, "case-control-res.txt", col.names=TRUE, row.names=FALSE, quote=FALSE)
q("no") #quit R without saving the workspace
## Gene-based analysis using MAGMA (not a part of this tutorial; https://ctg.cncr.nl/software/magma)
module load bio/MAGMA/1.07b-foss-2018b
#need to download the gene location file (https://ctg.cncr.nl/software/MAGMA/aux_files/NCBI37.3.zip) to e.g., a GWAS_RefData/MAGMA folder and extract the zip file

##annotation (with flanking regions of 5 kb upstream and 1.5 kb downstream shown here as an example)
magma --annotate window=5,1.5 --snp-loc data/sample_chr1-22misRSq7_23_cleaned.bim --gene-loc GWAS_RefData/MAGMA/NCBI37.3/NCBI37.3.gene.loc --out data/sample_chr1-22misRSq7_23_cleaned_genesflanking.annot

##analysis
magma --bfile data/sample_chr1-22misRSq7_23_cleaned --pval data/sample_chr1-22misRSq7_23_cleaned_cov20220331out_add.txt ncol=NMISS --gene-annot data/sample_chr1-22misRSq7_23_cleaned_genesflanking.annot.genes.annot.txt --gene-settings fixed-permp=10000 --out data/sample_chr1-22misRSq7_23_cleaned_genesflanking.annot_genebased1000perm.out
References
Anderson, C.A., et al. (2010) Data quality control in genetic case-control association studies. Nat Protoc. 5:1564-1573. doi:10.1038/nprot.2010.116
Clarke, G.M., et al. (2011) Basic statistical analysis in genetic case-control studies. Nat Protoc. 6:121-133. doi:10.1038/nprot.2010.182
Weale, M. (2011, Sept. 23) Notes on GWAS QC/QA. https://sites.google.com/site/mikeweale/software/gwascode
Marees, A.T., et al. (2018) A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int J Methods Psychiatr Res. 27(2):e1608. doi:10.1002/mpr.1608
Uffelmann, E., et al. (2021) Genome-wide association studies. Nat Rev Methods Primers. 1:59. doi:10.1038/s43586-021-00056-9
Additional Resources
1000 Genomes reference data for QC of (+/-) strands and b37 map positions (>13.5M SNPs)
HapMap data for population structure analysis (CEU, YRI, CHB, JPT only) HapMap_b37_pops.zip
Local Ancestry Analyses using TRACTOR – tutorial: https://atkinson-lab.github.io/Tractor-tutorial/
Acknowledgements
Dr. Joanne Knight, Dr. Sarah Gagliano, Dr. Victoria Marshe