...

## All materials are located in the following links: haplocheck_results, haplogroups_workshopsamples.txt, HG0096c_test.bam, HG0097c_test.bam, HG0099c_test.bam, HG0100c_test.bam, HG0101c_test.bam, HG0102c_test.bam, HG0103c_test.bam, HG0105c_test.bam, HG0106c_test.bam, HG0107c_test.bam, Merged.txt, Merged.vcf.gz, Workshop_samples_05-17-23_nocont_homo_common.bim, AnnotatedVariants.txt.

## PowerPoint slides for this workshop: Workshop_mtDNA_QC_analysis.pptx

...

Reference files: Human genome
1. Download the reference files using this link https://console.cloud.google.com/storage/browser/genomics-public-data/references/hg38/v0;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
Nextflow (https://www.nextflow.io/)
1. Install nextflow in your work directory with the command: curl -s https://get.nextflow.io | bash
Singularity
1. Load the Singularity module if you are using the CAMH Specialized Computing Cluster: e.g., module load tools/Singularity/3.8.5
Mutserve (https://github.com/seppinho/mutserve)
1. Install mutserve in your work directory with the command: curl -sL mutserve.vercel.app | bash
Our pipeline was adapted from https://github.com/gatk-workflows/gatk4-mitochondria-pipeline.
PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/)
1. PLINK 1.9 (https://www.cog-genomics.org/plink/)
2. PLINK 2.0 (https://www.cog-genomics.org/plink/2.0/)
3. Load the PLINK module if you are using the CAMH Specialized Computing Cluster: e.g., module load bio/PLINK/1.9b_6.15-x86_64
Unix-based command terminal
A text editor

...

Download reference files.
Load the Java module (e.g. module load lang/Java/11.0.6) and the Singularity module, then install nextflow and mutserve.
Clone the pipeline into your work directory: e.g. git clone pipeline_link_repository

...

Continue the next steps using the Merged.txt file.
First, create the first Excel tab, named "Merged_raw_data", using the Merged.txt file.
Second, create a 2nd tab named "Merged_nocont". Copy all the data from the raw results (1st tab, Merged_raw_data) and paste it into the 2nd tab. Using the sample IDs listed in the samples_to_remove.txt file, identify the variants belonging to each contaminated sample and manually remove their rows from the Merged_nocont tab.
Create a 3rd tab named "coverage>200_both_strand". Copy all the data from the 2nd tab (Merged_nocont) and paste it into the 3rd tab. Then remove every row whose coverage is lower than 200x on either strand (check both the CoverageFWD and CoverageREV columns), so that the remaining variants have at least 200x coverage on both strands.
Create a 4th tab named "Fwd-rev_ratio". Copy all the data from the 3rd tab (coverage>200_both_strand) and paste it into the 4th tab. Next, create a new column called FWD-Rev_ratio and calculate the ratio between the CoverageFWD and CoverageREV values (columns L and M). Then filter out all rows with a Fwd/Rev ratio below 0.5 or above 1.5.
Create a 5th tab named "remove_del". Copy all the data from the 4th tab (Fwd-rev_ratio) and paste it into the 5th tab. Check the Ref column and exclude all rows containing the letter N (which indicates a deletion) in this column.
Create a 6th tab named "remove_primer_phantom". Copy all the data from the 5th tab (remove_del) and paste it into the 6th tab. Then remove all rows containing variants in the primer regions (e.g. 0-500 bp and 16000-16655 bp). Also check whether any variant falls on the known phantom mutation sites ('72':['G','T'], '257':['A','C'], '414':['G','T'], '3492':['A','C'], '3511':['A','C'], '4774':['T','A'], '5290':['A','T'], '9801':['G','T'], '10306':['A','C'], '10792':['A','C'], '11090':['A','C']). If so, remove the rows containing these variants as well.
Create a 7th tab named "homoplasmy_only". Copy all the data from the 6th tab (remove_primer_phantom) and paste it into the 7th tab. Check the VariantLevel column and remove all rows with values lower than 95%.
#NOTE: You can adjust this threshold (for example, to 97% or 99%) based on the quality of your sequencing data and the specific requirements of your analysis.
Create an 8th tab named "heteroplasmy_only". Copy all the data from the 6th tab (remove_primer_phantom) and paste it into the 8th tab. Check the VariantLevel column and remove all rows with values lower than 3% or higher than 95%.
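
The filters behind tabs 2 to 6 can also be expressed in a few lines of code, which may help when checking the manual Excel work. The block below is only a minimal sketch, not part of the workshop pipeline. It assumes that each row is a dictionary keyed by the Merged.txt column names, that the sample column is called ID (an assumption, since the text above only mentions "Sample IDs"), and that each phantom-site pair is read as [Ref, Variant]. The function names are hypothetical.

```python
# Thresholds, primer regions and phantom sites are copied from the QC steps above.
MIN_STRAND_COVERAGE = 200
MIN_FWD_REV_RATIO = 0.5
MAX_FWD_REV_RATIO = 1.5
PRIMER_REGIONS = [(0, 500), (16000, 16655)]
# Assumption: each pair is interpreted as (Ref, Variant).
PHANTOM_SITES = {
    72: ("G", "T"), 257: ("A", "C"), 414: ("G", "T"), 3492: ("A", "C"),
    3511: ("A", "C"), 4774: ("T", "A"), 5290: ("A", "T"), 9801: ("G", "T"),
    10306: ("A", "C"), 10792: ("A", "C"), 11090: ("A", "C"),
}


def passes_qc(row, contaminated_ids):
    """Return True if a variant row survives the filters of tabs 2 to 6."""
    pos = int(row["Pos"])
    fwd = float(row["CoverageFWD"])
    rev = float(row["CoverageREV"])
    # Tab 2 (Merged_nocont): drop variants from contaminated samples.
    if row["ID"] in contaminated_ids:
        return False
    # Tab 3 (coverage>200_both_strand): both strands need at least 200x.
    if fwd < MIN_STRAND_COVERAGE or rev < MIN_STRAND_COVERAGE:
        return False
    # Tab 4 (Fwd-rev_ratio): rev is at least 200 here, so the division is safe.
    ratio = fwd / rev
    if ratio < MIN_FWD_REV_RATIO or ratio > MAX_FWD_REV_RATIO:
        return False
    # Tab 5 (remove_del): an N in the Ref column marks a deletion.
    if "N" in row["Ref"]:
        return False
    # Tab 6 (remove_primer_phantom): primer regions and phantom sites.
    if any(start <= pos <= end for start, end in PRIMER_REGIONS):
        return False
    if PHANTOM_SITES.get(pos) == (row["Ref"], row["Variant"]):
        return False
    return True


def filter_rows(rows, contaminated_ids):
    """Keep only the rows that pass every filter."""
    return [row for row in rows if passes_qc(row, contaminated_ids)]
```

If Merged.txt is tab-delimited with a header line, rows in this shape could be read with csv.DictReader(handle, delimiter="\t") from the Python standard library, and contaminated_ids could be the set of lines in samples_to_remove.txt.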

#NOTE: See Figure 2 for an example of how the Merged_QC Excel file should be organized into different tabs based on the QC steps described above.


Figure 2 - Merged_QC.csv file example
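
The split into the homoplasmy_only and heteroplasmy_only tabs described in the QC steps can be sketched as follows. This is an illustrative example only. It assumes that VariantLevel is stored as a fraction between 0 and 1 (so 95% is 0.95); if your file stores percentages, pass low=3 and high=95 instead. The function name is hypothetical.

```python
def split_by_variant_level(rows, low=0.03, high=0.95):
    """Split filtered rows into homoplasmic and heteroplasmic variants.

    Assumption: VariantLevel is a fraction between 0 and 1.
    """
    homoplasmy_only = []
    heteroplasmy_only = []
    for row in rows:
        level = float(row["VariantLevel"])
        # homoplasmy_only: remove values lower than the high threshold.
        if level >= high:
            homoplasmy_only.append(row)
        # heteroplasmy_only: remove values lower than low or higher than high.
        # A value exactly equal to the high threshold therefore lands in both lists,
        # which follows the wording of the QC steps literally.
        if low <= level <= high:
            heteroplasmy_only.append(row)
    return homoplasmy_only, heteroplasmy_only
```

Raising the high argument to 0.97 or 0.99 corresponds to the stricter homoplasmy thresholds mentioned in the note above.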

IMPORTANT! After following the QC steps, create a new Excel file. Copy and paste all the data from the homoplasmy_only tab into this new file. Keep only the column containing the variant position (column C: Pos) and remove the duplicate values. Add the prefix MT- to each position in the Pos column (for example, MT-73). Remove the header and save the file as a text file called homoplasmic_variants (Figure 3).

Figure 3 - homoplasmic_variants.txt file

#NOTE: The homoplasmic_variants.txt file has only one column with variant names (e.g. MT-73) and no header. This file will be used in the homoplasmic variant calling step described below and also for statistical analysis.
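
As a rough illustration of the file described above, the sketch below builds the list of unique MT-prefixed positions from homoplasmic rows and writes it without a header. The function names are hypothetical, and sorting the positions is a convenience added here, not a requirement stated in the workshop steps.

```python
def homoplasmic_variant_names(rows):
    """Return unique variant names such as MT-73, one per position."""
    # A set removes duplicate positions; sorting is only for readability.
    positions = sorted({int(row["Pos"]) for row in rows})
    return [f"MT-{pos}" for pos in positions]


def write_variant_list(names, path="homoplasmic_variants.txt"):
    """Write one variant name per line, with no header."""
    with open(path, "w") as handle:
        for name in names:
            handle.write(name + "\n")
```

For example, rows with positions 73, 263 and 73 yield the two names MT-73 and MT-263.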

...

Copy all the data from the homoplasmy_only tab (from the Merged_QC.csv file) and paste it into a new Excel file. Next, follow the steps described below.
First, name the 1st tab homoplasmic_variants. The number of rows in this tab is the total number of variants. Make a note of this total.
Create a 2nd tab named unique_homoplasmic_variants. Copy and paste all the data from the 1st tab into the 2nd tab and remove all duplicate variants, so that you keep only unique/bi-allelic variants. Make a note of the number of unique variants you have at this step.
Create a new column in the unique_homoplasmic_variants tab called Type_of_mutation. This column holds a number that indicates the type of substitution: 1 for a transition and 2 for a transversion. In a transition, one purine is substituted for another purine or one pyrimidine is substituted for another pyrimidine (transitions are: A>G, G>A, C>T or T>C). In a transversion, a purine base is substituted for a pyrimidine base or vice versa (transversions are: A>C, C>A, A>T, T>A, G>T, T>G, G>C or C>G). You can double-check the different types of substitution mutations in Figure 4.

Figure 4 - Type of mutation

Check the Ref and Variant columns (columns D and E) and enter the value 1 (for a transition) or the value 2 (for a transversion) in the Type_of_mutation column.
Next, make a note of the total number of transitions and transversions in the Type_of_mutation column.
Calculate the percentage (%) of transitions and of transversions relative to the total number of unique variants.
Calculate the transition/transversion ratio by dividing the percentage (%) of transitions by the percentage (%) of transversions.
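
The classification and ratio described in the steps above can be sketched in code as a cross-check for the manual count. This is only an illustrative example with hypothetical function names. It assumes that each variant is given as a (Pos, Ref, Variant) tuple and that duplicates are defined by that full tuple.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def mutation_type(ref, variant):
    """Return 1 for a transition and 2 for a transversion."""
    bases = PURINES | PYRIMIDINES
    if ref == variant or ref not in bases or variant not in bases:
        raise ValueError(f"not a single-base substitution: {ref}>{variant}")
    pair = {ref, variant}
    # Purine to purine, or pyrimidine to pyrimidine, is a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return 1
    return 2


def ti_tv_summary(variants):
    """Summarise transitions and transversions among unique variants."""
    unique = set(variants)  # assumption: uniqueness is by (Pos, Ref, Variant)
    types = [mutation_type(ref, var) for _, ref, var in unique]
    total = len(types)
    n_ti = types.count(1)
    n_tv = types.count(2)
    pct_ti = 100 * n_ti / total if total else 0.0
    pct_tv = 100 * n_tv / total if total else 0.0
    # The ratio of the two percentages equals the ratio of the two counts.
    ratio = pct_ti / pct_tv if n_tv else None
    return {"unique": total, "transitions": n_ti, "transversions": n_tv,
            "pct_transitions": pct_ti, "pct_transversions": pct_tv,
            "ti_tv_ratio": ratio}
```

The ratio is returned as None when there are no transversions, since the division is then undefined.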

...

#NOTE: There is no established reference ratio of Ti/Tv for heteroplasmic variants.

#NOTE: See the example at the end of this section of how the Excel file was organized into different sheets based on the QC steps (Figure 2).

5. Homoplasmic and common variants filtering using PLINK in the SCC cluster

Converting vcf to plink files

...

plink --bfile Workshop_samples_05-17-23_nocont_homo_common --allow-no-sex --linear --pheno phenotype.txt --pheno-name phenotype_interested_measure --out res_phenotype_interested_measure
NOTE: Run the command above for every phenotype variable of interest in your data. The type of test to use (e.g. linear model, logistic, etc.) depends on the type of data you have.
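
When there are several phenotype variables, the command above can be generated once per variable instead of being typed by hand. The sketch below only builds and prints the command lines; it does not run PLINK. The phenotype names in the loop are hypothetical placeholders for column names in your phenotype.txt file, and the function name is an assumption of this example.

```python
import shlex


def plink_assoc_command(pheno_name,
                        bfile="Workshop_samples_05-17-23_nocont_homo_common",
                        pheno_file="phenotype.txt",
                        test_flag="--linear"):
    """Build the PLINK association command shown above for one phenotype."""
    return [
        "plink", "--bfile", bfile, "--allow-no-sex", test_flag,
        "--pheno", pheno_file, "--pheno-name", pheno_name,
        "--out", f"res_{pheno_name}",
    ]


# Hypothetical phenotype column names, used for illustration only.
for name in ["phenotype_a", "phenotype_b"]:
    print(shlex.join(plink_assoc_command(name)))
```

The printed lines can be pasted into a job script on the cluster. The test_flag argument is left as a parameter because the appropriate test depends on the type of phenotype data.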

6. Functional analysis


The effect on protein function of amino acid changes caused by mutations is predicted by a combination of tools that use sequence homology, evolutionary conservation, and protein structural information.

1. To run the functional analysis, first create the variantsfile.txt file using the information in the Workshop_samples_05-17-23_nocont_homo_common.bim file obtained in the previous step. The variantsfile.txt file should have two columns, named Pos and Variant. See Figure 5 for an example.


Figure 5 - variantsfile.txt example
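
One possible way to derive the two columns from the .bim file is sketched below. It relies on the standard PLINK .bim layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2). Several things here are assumptions and should be checked against Figure 5: which allele column corresponds to the variant allele (allele 1 is used by default), the tab delimiter, and the header line. The function names are hypothetical.

```python
def bim_to_variant_rows(bim_lines, allele_column=4):
    """Extract (Pos, Variant) pairs from PLINK .bim lines.

    allele_column is zero-based: 4 is allele 1, 5 is allele 2.
    Assumption: the chosen column holds the variant allele; verify this
    against the Ref and Variant columns of your QC file.
    """
    rows = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        rows.append((fields[3], fields[allele_column]))
    return rows


def write_variantsfile(rows, path="variantsfile.txt"):
    """Write the two columns, Pos and Variant, as tab-delimited text."""
    with open(path, "w") as handle:
        handle.write("Pos\tVariant\n")  # assumption: a header line is expected
        for pos, variant in rows:
            handle.write(f"{pos}\t{variant}\n")
```

A typical use would be to pass the open .bim file handle to bim_to_variant_rows and then hand the result to write_variantsfile.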

2. Upload the variantsfile.txt file to your work directory on the SCC cluster and run the command below.

./mutserve annotate --input variantsfile.txt --annotation rCRS_annotation_2020-08-20.txt --output AnnotatedVariants.txt

...

https://github.com/seppinho/mutserve/blob/master/files/rCRS_annotation_descriptions.md

Interpretation of the results

...
