Overview
This is a guide for mtDNA sequencing data cleaning and analysis in Plink and R.
Learning Objectives
(1) Alignment and variant calling (from FASTQ to BAM files);
(2) Apply quality control measures to mtDNA genotyping data sets;
(3) Identify homoplasmic and heteroplasmic variants;
(4) Functional mapping;
(5) Become familiar with the PLINK and R environments.
## All materials are located in the following links: haplocheck_results, haplogroups_workshopsamples.txt, HG0096c_test.bam, HG0097c_test.bam, HG0099c_test.bam, HG0100c_test.bam, HG0101c_test.bam, HG0102c_test.bam, HG0103c_test.bam, HG0105c_test.bam, HG0106c_test.bam, HG0107c_test.bam, Merged.txt, Merged.vcf.gz.
## PowerPoint slides for this workshop: Workshop_mtDNA_QC_analysis.pptx
Please download all files and create a folder on your PC or cluster to run the analysis. Remember: plink must be stored in the same folder if running on your PC.
Resources
- Reference files: Human genome
- Nextflow (https://www.nextflow.io/)
  - Install Nextflow in your work directory by using the command: curl -s https://get.nextflow.io | bash
- Singularity
  - Load the Singularity module if using the CAMH Specialized Computing Cluster: e.g., module load tools/Singularity/3.8.5
- Mutserve (https://github.com/seppinho/mutserve)
  - Install mutserve in your work directory by using the command: curl -sL mutserve.vercel.app | bash
- Our pipeline was adapted from https://github.com/gatk-workflows/gatk4-mitochondria-pipeline.
- PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/)
- PLINK 1.9 (https://www.cog-genomics.org/plink/)
  - Load the PLINK module if using the CAMH Specialized Computing Cluster: e.g., module load bio/PLINK/1.9b_6.15-x86_64
- R (http://cran.r-project.org/)
  - Load the R module if using the CAMH Specialized Computing Cluster: e.g., module load lang/R/4.0.3-Python-3.8.5-Anaconda3-2020.11
- Unix-based command terminal
- A text editor
Before starting
- Download reference files.
- Load the Java and Singularity modules, and install Nextflow and mutserve.
- Clone your pipeline into your work directory: e.g. git clone pipeline_link_depository
NOTE #1: Everything written in red is editable.
Alignment and variant calling
- To run the alignment and variant calling, use the command below.
./nextflow run pipeline_name -r pipeline_version --fastq "/work_directory_path/*_R{1,2}.fastq" -profile singularity --reference /reference_files_path/HG38/references_hg38_v0_Homo_sapiens_assembly38.fasta |
---|
#NOTE: When the job finishes, check that four files were created for each sample (.vcf, .vcf.tbi, .bam, .bai), along with all_samples.csv, and that they were all stored in the folder called output.
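The check in the note above can be scripted instead of done by eye. The following is a minimal sketch, assuming the results sit in a folder called output and that file names end in .vcf, .vcf.tbi, .bam, and .bai. The helper name count_ext and the toy sample names are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Count files in a directory whose names end with a given extension.
# count_ext is a hypothetical helper, not part of the pipeline.
count_ext() {
    dir=$1
    ext=$2
    find "$dir" -maxdepth 1 -type f -name "*$ext" | wc -l | tr -d ' '
}

# Toy output folder standing in for the real pipeline output (assumption).
demo=$(mktemp -d)
mkdir "$demo/output"
for s in sampleA sampleB; do
    for e in .vcf .vcf.tbi .bam .bai; do
        : > "$demo/output/$s$e"
    done
done
: > "$demo/output/all_samples.csv"

# Report one count per expected extension; each count should equal
# the number of samples that were processed.
for e in .vcf .vcf.tbi .bam .bai; do
    echo "$e: $(count_ext "$demo/output" "$e") file(s)"
done
if [ -f "$demo/output/all_samples.csv" ]; then
    echo "all_samples.csv found"
fi
```

On a real run, replace "$demo/output" with the path to your own output folder and compare each count with the number of samples.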
Merging the vcf files and obtaining a single txt file containing all the variants for all the individuals
- The samples need to be combined into a single .vcf file and a single .txt file, which are used in the quality control analysis. For that, use the command below.
./mutserve call --reference rCRS.fasta --output /work_directory_path/Merged.vcf.gz --threads 4 /work_directory_path/filename.bam |
---|
#NOTE: This command is expected to generate two files, Merged.txt and Merged.vcf.gz, in the work directory.
#NOTE: Everything written in red is editable.
Alternative way to merge the vcf files
- If the previous command doesn't work you can use this alternative way to merge all the .vcf files.
module load bio/BCFtools/1.11-GCC-10.2.0 |
---|
Haplocheck
Contamination
|
---|
Haplogrep
Haplogroup
|
---|
- Before moving on to the Quality Control steps, please download the all_samples.csv and filename.txt files to your computer.
Quality Control analysis
Use Excel to open the files Merged.txt, haplocheck_results (contains the contamination status), and haplogroups_workshopsamples.txt (contains the haplogroup information for each sample).
Filtering out variants from samples identified as contaminated
- Using the haplocheck_results file, check which samples are contaminated (column B: Contamination Status). If any sample shows YES in the contamination status column, copy its Sample ID (column A: Sample) and paste it into a new Excel file. Name the file samples_to_remove and save it in txt format (see the image below as an example).
#NOTE: The samples_to_remove.txt file should have two columns and no header. Both columns have the same information (Sample IDs). This file is going to be used in further steps. If you want, you can already upload this file into your work directory in the SCC cluster.
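For larger data sets, the same two-column file can be produced on the command line instead of in Excel. This is a minimal sketch, assuming haplocheck_results is tab-delimited with a header line, the sample ID in column 1, and the contamination status in column 2 (values possibly wrapped in double quotes). The toy sample IDs are made up.

```shell
#!/bin/sh
# Toy stand-in for haplocheck_results (tab-delimited, header in line 1).
# Sample IDs and statuses here are made up for illustration.
demo=$(mktemp -d)
printf '"Sample"\t"Contamination Status"\n' > "$demo/haplocheck_results"
printf '"S1"\t"NO"\n"S2"\t"YES"\n"S3"\t"NO"\n"S4"\t"YES"\n' >> "$demo/haplocheck_results"

# Skip the header, strip optional double quotes, and for every sample
# flagged YES print its ID twice (two columns, no header).
awk -F '\t' 'NR > 1 {
    gsub(/"/, "", $1)
    gsub(/"/, "", $2)
    if ($2 == "YES") print $1 "\t" $1
}' "$demo/haplocheck_results" > "$demo/samples_to_remove.txt"

cat "$demo/samples_to_remove.txt"
```

Check the header of your own haplocheck_results file before relying on the column positions used here.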
- Continue with the next steps using the copy of the txt file in Excel. At the end of this section there is an example of how the Excel file was organized into different sheets according to the QC steps (Figure 1).
Homoplasmic and common variants filtering using PLINK in the SCC cluster
Converting vcf to plink files
module load bio/PLINK2/20210420-x86_64 |
---|
#NOTE: After running the command above, make sure that you have four different files called Workshop_samples_05-17-23.bed, Workshop_samples_05-17-23.bim, Workshop_samples_05-17-23.fam and Workshop_samples_05-17-23.log.
- IMPORTANT! Before continuing the analysis, you will need to edit your .bim file so that the second column contains the prefix MT- followed by the variant position. For that, you can use the command described below.
awk '{print $1, "MT-"$4, $3, $4, $5, $6}' Workshop_samples_05-17-23.bim > Workshop_samples_05-17-23_nv.bim |
---|
- After running the command above, verify that the new file (Workshop_samples_05-17-23_nv.bim) has MT- added before the variant position in column 2. For that, you can use the command below.
vi Workshop_samples_05-17-23_nv.bim |
---|
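The relabelling and its verification can also be tried on a small toy file first. The sketch below is illustrative only: the toy .bim rows are made up, and the MT- prefix is written as a quoted awk string so that awk treats it as literal text. It also shows a way to check the result without opening an editor.

```shell
#!/bin/sh
# Toy .bim file (chromosome, variant ID, cM, position, allele 1, allele 2).
# Values are illustrative only; the chromosome code in a real file may
# be written differently (for example 26 instead of MT).
demo=$(mktemp -d)
printf 'MT\t.\t0\t73\tG\tA\n'  >  "$demo/toy.bim"
printf 'MT\t.\t0\t263\tG\tA\n' >> "$demo/toy.bim"

# Rewrite column 2 as MT-<position>; all other columns are kept.
awk '{ print $1, "MT-" $4, $3, $4, $5, $6 }' \
    "$demo/toy.bim" > "$demo/toy_nv.bim"

# Show the first lines, then count the rows whose ID does NOT follow
# the MT-<position> pattern (the count should be 0).
head -n 5 "$demo/toy_nv.bim"
bad=$(awk '$2 != ("MT-" $4)' "$demo/toy_nv.bim" | wc -l | tr -d ' ')
echo "rows with unexpected ID: $bad"
```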
- VERY IMPORTANT! To continue the analysis, all the files (.bed, .bim and .fam) need to share the same name. Because you changed the .bim file name from Workshop_samples_05-17-23.bim to Workshop_samples_05-17-23_nv.bim, remember to rename the other files too. To do this, create a copy of the .bed and .fam files in a new folder, then rename the copies using the same name structure, i.e., Workshop_samples_05-17-23_nv.
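If you prefer the command line to renaming by hand, the copies can be made with cp. This is a minimal sketch with empty placeholder files. The helper name match_prefix is hypothetical, and the copies are written next to the originals rather than into a new folder.

```shell
#!/bin/sh
# Hypothetical helper: copy the .bed and .fam that belong to an old
# prefix so that they share the prefix of the edited .bim file.
match_prefix() {
    old=$1
    new=$2
    for ext in bed fam; do
        cp "$old.$ext" "$new.$ext" || return 1
    done
}

# Toy files standing in for the real PLINK file set (empty placeholders).
demo=$(mktemp -d)
: > "$demo/Workshop_samples_05-17-23.bed"
: > "$demo/Workshop_samples_05-17-23.fam"
: > "$demo/Workshop_samples_05-17-23_nv.bim"

match_prefix "$demo/Workshop_samples_05-17-23" "$demo/Workshop_samples_05-17-23_nv"
ls "$demo"
```

Copying rather than moving keeps the original files intact in case a later step has to be repeated.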
Filtering out the samples that were contaminated
module load bio/PLINK/1.9b_6.22-x86_64 |
---|
#NOTE: The samples_to_remove.txt file was created in the Quality Control analysis section.
Homoplasmic variants calling
- For this part of the analysis, you will keep only the homoplasmic variants by using the homoplasmic_variants.txt file. This file was created in the Quality Control (QC) steps section, step 5. Before continuing, make sure that the homoplasmic_variants.txt file has been uploaded into your work directory in the SCC cluster. Next, run the analysis by using the command described below.
plink --bfile Workshop_samples_05-17-23_nocont --extract homoplasmic_variants.txt --make-bed --out Workshop_samples_05-17-23_nocont_homo |
---|
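PLINK's --extract option reads a plain list of variant IDs, one per line, so homoplasmic_variants.txt must contain IDs in the same MT-<position> form used in the .bim file. The workshop builds this list in Excel. The sketch below is an alternative, not the workshop's method, and it rests on assumptions: that Merged.txt is tab-delimited, that it has columns named Pos and Type, and that Type 1 marks a homoplasmic call. Check the header of your own file before using it.

```shell
#!/bin/sh
# Toy stand-in for Merged.txt. The column names (ID, Pos, Type) and the
# meaning Type 1 = homoplasmic are ASSUMPTIONS; check your own header.
demo=$(mktemp -d)
{
    printf 'ID\tPos\tRef\tVariant\tType\n'
    printf 'S1\t73\tA\tG\t1\n'
    printf 'S1\t152\tT\tC\t2\n'
    printf 'S2\t73\tA\tG\t1\n'
    printf 'S2\t263\tA\tG\t1\n'
} > "$demo/Merged.txt"

# Find the Pos and Type columns by name, keep rows with Type == 1,
# and print each position once as MT-<Pos>, one ID per line.
awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $col["Type"] == 1 && !seen[$col["Pos"]]++ { print "MT-" $col["Pos"] }
' "$demo/Merged.txt" > "$demo/homoplasmic_variants.txt"

cat "$demo/homoplasmic_variants.txt"
```

Note that a position called homoplasmic in one sample and heteroplasmic in another is still listed by this sketch. Whether that is the desired behaviour depends on the QC rules you follow.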
Common variants (MAF ≥ 5%) calling
plink --bfile Workshop_samples_05-17-23_nocont_homo --maf 0.05 --make-bed --out Workshop_samples_05-17-23_nocont_homo_common |
---|
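To make the 5% threshold concrete, the sketch below computes a minor allele frequency by hand on made-up counts. It assumes one allele per sample (mtDNA treated as haploid) and a hypothetical three-column input of variant ID, alternate-allele count, and number of called samples. It illustrates the arithmetic only and is not a replacement for the PLINK filter.

```shell
#!/bin/sh
# Toy allele counts: variant ID, copies of the alternate allele,
# number of samples with a call. All numbers are made up.
demo=$(mktemp -d)
printf 'MT-73\t9\t10\nMT-152\t1\t40\nMT-263\t4\t40\n' > "$demo/counts.txt"

# MAF is the smaller of p and 1 - p, where p = alt / total.
# Keep variants whose MAF is at least the threshold (0.05 here).
awk -F '\t' -v thr=0.05 '{
    p = $2 / $3
    maf = (p < 1 - p) ? p : 1 - p
    if (maf >= thr + 0) print $1
}' "$demo/counts.txt" > "$demo/common_ids.txt"

cat "$demo/common_ids.txt"
```

In the toy data, MT-152 has a frequency of 1/40 = 0.025 and is dropped, while the other two variants have a MAF of about 0.10 and are kept.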
Functional analysis
./mutserve annotate --input variantsfile.txt --annotation rCRS_annotation_2020-08-20.txt --output AnnotatedVariants.txt |
---|
Annotation description
https://github.com/seppinho/mutserve/blob/master/files/rCRS_annotation_descriptions.md
Interpretation of the results
The MutPred score is the probability (expressed as a figure between 0 and 1) that an amino acid substitution is deleterious/disease-associated. A missense mutation with a MutPred score > 0.5 could be considered as 'harmful', while a MutPred score > 0.75 should be considered a high confidence 'harmful' prediction.
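The two cut-offs above can be applied mechanically to a list of scores. The following is a minimal sketch. The helper name classify_mutpred and the label used for scores at or below 0.5 are hypothetical, and the example scores are made up.

```shell
#!/bin/sh
# Hypothetical helper: label a MutPred score using the two cut-offs
# quoted in the text (> 0.5 and > 0.75).
classify_mutpred() {
    awk -v s="$1" 'BEGIN {
        s = s + 0                      # force numeric comparison
        if (s > 0.75)      print "harmful (high confidence)"
        else if (s > 0.5)  print "harmful"
        else               print "at or below the 0.5 cut-off"
    }'
}

# Made-up scores for illustration.
for score in 0.30 0.62 0.81; do
    echo "$score -> $(classify_mutpred "$score")"
done
```

Both comparisons are strict, so a score of exactly 0.75 is labelled harmful but not high confidence.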