Overview

This is a guide for mtDNA sequencing data cleaning and analysis in Plink and R.

Learning Objectives

(1) Alignment and variants calling (from fastQ to bam files)

(2) Apply quality control measures for mtDNA genotyping data sets;

(3) Identification of homoplasmic and heteroplasmic variants;

(4) Functional mapping;

(5) Get familiarized with Plink and R environments.


## All materials are located in the following links: haplocheck_results, haplogroups_workshopsamples.txt, HG0096c_test.bam, HG0097c_test.bam, HG0099c_test.bam, HG0100c_test.bam, HG0101c_test.bam, HG0102c_test.bam, HG0103c_test.bam, HG0105c_test.bam, HG0106c_test.bam, HG0107c_test.bam, Merged.txt, Merged.vcf.gz.

## ***Powerpoint slides for this workshop: Workshop_mtDNA_QC_analysis.pptx


Please download all files and create a folder on your PC or cluster to run the analysis. Remember: plink must be stored in the same folder if running on your PC.

Resources

  1. Reference files: Human genome 
    1. Download the reference files using this link https://console.cloud.google.com/storage/browser/genomics-public-data/references/hg38/v0;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
  2.  Nextflow (https://www.nextflow.io/)
    1. Need to install nextflow in your work directory by using the command: curl -s https://get.nextflow.io | bash
  3. Singularity
    1. Need to load singularity module if using CAMH Specialized Computing Cluster: e.g., module load tools/Singularity/3.8.5
  4. Mutserve (https://github.com/seppinho/mutserve
    1. Need to install mutserve in your work directory by using the command: curl -sL mutserve.vercel.app | bash
  5. Our pipeline was adapted from https://github.com/gatk-workflows/gatk4-mitochondria-pipeline.
  6. PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/)
    1. PLINK 1.9 (https://www.cog-genomics.org/plink/)
    2. Need to load plink module if using CAMH Specialized Computing Cluster: e.g., module load bio/PLINK/1.9b_6.15-x86_64
  7. R (http://cran.r-project.org/)
    1. Need to load R module if using CAMH Specialized Computing Cluster: e.g., module load lang/R/4.0.3-Python-3.8.5-Anaconda3-2020.11
  8. Unix-based command terminal
  9. A text editor

Before starting

NOTE #1: all written in red is editable.

1. Aligment and variants calling

./nextflow run pipeline_name -r pipeline_version --fastq “/work_directory_path/*_R{1,2}.fastq” -profile singularity --reference /reference_files_path/HG38/references_hg38_v0_Homo_sapiens_assembly38.fasta

#NOTE: Ensure that at the end of the job were created 4 different files for each sample (.vcf, .vcf.tbi, .bam, .bai) and all_samples.csv, and if they were stored inside the folder called output.

Merging the vcf files and obtaining an unique txt files containing all the variants for all the individuals

./mutserve call --reference rCRS.fasta --output /work_directory_path/Merged.vcf.gz --threads 4 /work_directory_path/filename.bam

#NOTE: It is expected to generate two files: Merged.txt and Merged.vcf.gz in the work directory.

#NOTE: all written in red is editable.

Alternative way to merge the vcf files

module load bio/BCFtools/1.11-GCC-10.2.0
for i in $(ls *.vcf.gz); do bcftools index --tbi $i; done
bcftools merge -Oz /work_directory/*.vcf.gz -o Merged.vcf.gz

2. Haplocheck

Contamination

wget https://github.com/genepi/haplocheck/releases/download/v1.3.3/haplocheck.zip
unzip haplocheck.zip
./haplocheck --out haplocheck_results Merged.vcf.gz

3. Haplogrep

Haplogroup

curl -sL haplogrep.now.sh | bash
./haplogrep classify --in Merged.vcf.gz --format vcf --out haplogroups_workshopsamples.txt


4. Quality Control analysis 

Use the excel to open the files Merged.txt, haplocheck_results (contains the contamination status), and haplogroups_workshopsamples.txt (contains the haplogroup information for each one of the samples).

Filtering out variants from samples identified as contaminated

  1.  First, to identify the samples that are contaminated we will be using the haplocheck_results file. For that, check the column B (Contaminated Status) and verify if there is any sample indicating YES. If , you will need to copy the Sample IDs (column A: Sample) and paste in a new excel file. Name your file as samples_to_remove and save it as txt format (see the Figure 1 below as an example). 

Figure 1 - samples_to_remove.txt file example.

#NOTE: The samples_to_remove.txt file should have two columns and no header. Both columns have the same information (Sample IDs). This file is going to be used in further steps. If you want, you can already upload this file into your work directory in the SCC cluster.

  1. Continue the next steps using the Merged.txt file.
  2. First, call the first excel tab as Merged_raw_data.
  3. Second, create a 2nd tab and call it as Merged_nocont. Copy the entire data from the raw results (1st tab - Merged_raw_data) and paste it into the 2nd tab. Using the Sample IDs from the samples_to_remove.txt file you will identify the variants from the each one of the contaminated samples and manually remove their respective rows on the Merged_nocont tab.


#NOTE: Check out at the end of this section an example of how the excel file was organized in different sheets based on the QC steps (Figure 2).

5. Homoplasmic and common variants filtering using PLINK in the SCC cluster

Converting vcf to plink files

module load bio/PLINK2/20210420-x86_64
plink2 --vcf work_directory_path/Merged.vcf.gz --double-id --max-alleles 2 --vcf-half-call haploid  --make-bed --out Workshop_samples_05-17-23

#NOTE: After running the command above you need to ensure that you got 4 different files called Workshop_samples_05-17-23.bed, Workshop_samples_05-17-23.bim, Workshop_samples_05-17-23.fam and Workshop_samples_05-17-23.log.

awk '{print $1, $1'MT-'$4, $3, $4, $5, $6}' Workshop_samples_05-17-23.bim > Workshop_samples_05-17-23_nv.bim
vi Workshop_samples_05-17-23_nv.bim
ESC :wq

Filtering out the samples that were contaminated


module load bio/PLINK/1.9b_6.22-x86_64
plink --bfile Workshop_samples_05-17-23_nv --remove samples_to_remove.txt --make-bed --out Workshop_samples_05-17-23_nocont

#NOTE: The samples_to_remove.txt file was created at the QC section.

Homoplasmic variants calling

plink --bfile Workshop_samples_05-17-23_nocont --extract homoplasmic_variants.txt --make-bed --out Workshop_samples_05-17-23_nocont_homo

Common variants (MAF ≥ 5%) calling

plink --bfile Workshop_samples_05-17-23_nocont_homo --maf 0.05 --make-bed --out Workshop_samples_05-17-23_nocont_homo_common

6. Functional analysis


./mutserve annotate --input variantsfile.txt --annotation rCRS_annotation_2020-08-20.txt --output AnnotatedVariants.txt

Annotation description

https://github.com/seppinho/mutserve/blob/master/files/rCRS_annotation_descriptions.md

 

Interpretation of the results

The MutPred score is the probability (expressed as a figure between 0 and 1) that an amino acid substitution is deleterious/disease-associated. A missense mutation with a MutPred score > 0.5 could be considered as 'harmful', while a MutPred score > 0.75 should be considered a high confidence 'harmful' prediction.