Overview
This guide covers cleaning and analysis of mtDNA sequencing data using PLINK and R.
Learning Objectives
(1) Align reads and call variants (from FASTQ to BAM files);
(2) Apply quality control measures to mtDNA genotyping data sets;
(3) Identify homoplasmic and heteroplasmic variants;
(4) Perform functional mapping;
(5) Become familiar with the PLINK and R environments.
## All materials are available at the following links:
haplocheck_results, haplogroups_workshopsamples.txt, HG0096c_test.bam, HG0097c_test.bam, HG0099c_test.bam, HG0100c_test.bam, HG0101c_test.bam, HG0102c_test.bam, HG0103c_test.bam, HG0105c_test.bam, HG0106c_test.bam, HG0107c_test.bam, Merged.txt, Merged.vcf.gz.
## PowerPoint slides for this workshop:
Please download all files and create a folder on your PC or cluster to run the analysis. Remember: the PLINK executable must be stored in the same folder if you are running on your PC.
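As a minimal sketch of this setup step (the folder name `mtdna_workshop` is just an example, and only a few of the workshop files are listed), you can create the folder and report any materials that are still missing:

```shell
# Create a folder for the workshop and move into it
# ("mtdna_workshop" is an example name).
mkdir -p mtdna_workshop
cd mtdna_workshop

# After downloading the materials, report any expected file that is
# still missing (only a few of the files are listed here as examples).
for f in Merged.txt Merged.vcf.gz haplogroups_workshopsamples.txt; do
    [ -e "$f" ] || echo "missing: $f"
done
```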
Resources
- Reference files: Human genome
- Download the reference files using this link https://console.cloud.google.com/storage/browser/genomics-public-data/references/hg38/v0;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
- Nextflow (https://www.nextflow.io/)
- Need to install nextflow in your work directory by using the command: curl -s https://get.nextflow.io | bash
- Singularity
- Need to load singularity module if using CAMH Specialized Computing Cluster: e.g., module load tools/Singularity/3.8.5
- Mutserve (https://github.com/seppinho/mutserve)
- Need to install mutserve in your work directory by using the command: curl -sL mutserve.vercel.app | bash
- Our pipeline was adapted from https://github.com/gatk-workflows/gatk4-mitochondria-pipeline.
- PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/)
- PLINK 1.9 (https://www.cog-genomics.org/plink/)
- Need to load plink module if using CAMH Specialized Computing Cluster: e.g., module load bio/PLINK/1.9b_6.15-x86_64
- R (http://cran.r-project.org/)
- Need to load R module if using CAMH Specialized Computing Cluster: e.g., module load lang/R/4.0.3-Python-3.8.5-Anaconda3-2020.11
- Unix-based command terminal
- A text editor
Before starting
- Download reference files.
- Load the Java and Singularity modules, and install Nextflow and Mutserve.
- Clone your pipeline into your work directory: e.g. git clone pipeline_link_depository
NOTE #1: everything written in red is editable.
Alignment and variant calling
- To run the alignment and variant calling, use the command below.
```shell
./nextflow run pipeline_name -r pipeline_version --fastq "/work_directory_path/*_R{1,2}.fastq" -profile singularity --reference /reference_files_path/HG38/references_hg38_v0_Homo_sapiens_assembly38.fasta
```
#NOTE: Ensure that at the end of the job four files were created for each sample (.vcf, .vcf.tbi, .bam, .bai), plus all_samples.csv, and that they are stored inside the folder called output.
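The check in the note above can be scripted; the snippet below is a sketch run against a mock `output` folder (the `mkdir`/`touch` lines only exist so it runs standalone — skip them when checking a real pipeline run):

```shell
# Mock setup only, so the check runs standalone; on a real run the
# pipeline creates these files itself.
mkdir -p output
touch output/HG0096c_test.bam output/HG0096c_test.bai \
      output/HG0096c_test.vcf output/HG0096c_test.vcf.tbi

# For every .bam in output/, confirm the matching .bai, .vcf and
# .vcf.tbi files are also present.
for bam in output/*.bam; do
    base=${bam%.bam}
    for ext in bai vcf vcf.tbi; do
        [ -e "$base.$ext" ] || echo "missing: $base.$ext"
    done
done
echo "check complete"
```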
Merging the VCF files and obtaining a single .txt file containing all the variants for all the individuals
- From the per-sample .bam files we need to generate a single .vcf and .txt file that will be used for the quality control analysis. For that, use the command below.
```shell
./mutserve call --reference rCRS.fasta --output /work_directory_path/Merged.vcf.gz --threads 4 /work_directory_path/filename.bam
```
#NOTE: This command is expected to generate two files, Merged.txt and Merged.vcf.gz, in the work directory.
Alternative way to merge the vcf files
- If the previous command doesn't work, you can use this alternative way to merge all the .vcf files.
```shell
module load bio/BCFtools/1.11-GCC-10.2.0
for i in $(ls *.vcf.gz); do bcftools index --tbi $i; done
bcftools merge -Oz /work_directory/*.vcf.gz -o Merged.vcf.gz
```
Haplocheck
Contamination
```shell
wget https://github.com/genepi/haplocheck/releases/download/v1.3.3/haplocheck.zip
unzip haplocheck.zip
./haplocheck --out haplocheck_results Merged.vcf.gz
```
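To screen the haplocheck report from the command line, a sketch like the one below can help. It assumes a tab-delimited report with the sample ID in column 1 and a YES/NO contamination status in column 2 — check the header of your own haplocheck_results file first, since the real report has more columns. A two-line mock report is created so the snippet runs standalone.

```shell
# Mock report (assumed layout: sample ID, then contamination status).
printf 'Sample\tContamination Status\nHG0096\tNO\nHG0097\tYES\n' > haplocheck_mock.txt

# Print the IDs of samples flagged as contaminated.
awk -F'\t' 'NR > 1 && $2 == "YES" {print $1}' haplocheck_mock.txt
```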
Haplogrep
Haplogroup
```shell
curl -sL haplogrep.now.sh | bash
./haplogrep classify --in Merged.vcf.gz --format vcf --out haplogroups_workshopsamples.txt
```
- Before going to the Quality Control steps, please download the all_samples.csv and filename.txt files to your computer.
Quality Control analysis
Use Excel to open the files Merged.txt, haplocheck_results (contains the contamination status), and haplogroups_workshopsamples.txt (contains the haplogroup information for each of the samples).
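As a quick command-line alternative to Excel, you can tally variants per sample. The sketch below assumes Merged.txt is tab-delimited with a header row and the sample ID in column 1 — verify this against your own file, since the mutserve output has additional columns. A tiny mock file is created so the snippet runs standalone.

```shell
# Mock Merged.txt (assumed layout: sample ID in column 1, header row first).
printf 'ID\tPos\tRef\tVariant\nHG0096\t73\tA\tG\nHG0096\t263\tA\tG\nHG0097\t73\tA\tG\n' > Merged_mock.txt

# Count how many variant rows each sample has.
tail -n +2 Merged_mock.txt | cut -f1 | sort | uniq -c
```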
Homoplasmic and common variants filtering using PLINK in the SCC cluster
```shell
module load bio/PLINK2/20210420-x86_64
plink2 --vcf work_directory_path/file_name.vcf.gz --double-id --max-alleles 2 --vcf-half-call haploid --make-bed --out file_name_output
```
#NOTE: After running the command above, ensure that four files were created: file_name_output.bed, file_name_output.bim, file_name_output.fam, and file_name_output.log.
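This fileset check can also be scripted; the sketch below uses the `file_name_output` prefix from the command above (the `touch` line is mock setup so the snippet runs standalone — skip it when checking a real PLINK run):

```shell
# Mock setup only; on a real run plink2 creates these files itself.
prefix=file_name_output
touch "$prefix.bed" "$prefix.bim" "$prefix.fam" "$prefix.log"

# Report any missing member of the binary fileset; print a confirmation
# only when all four are present.
missing=0
for ext in bed bim fam log; do
    [ -e "$prefix.$ext" ] || { echo "missing: $prefix.$ext"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "fileset complete"
```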