Comparative Genomics Group

The group was tasked with developing a strain-typing pipeline that takes assembled FASTA files as input and performs species delineation, serogroup (N. meningitidis) or serotype (H. influenzae) determination, and molecular characterization (indicating the presence of important genotypes associated with pathways of interest). A further objective was to investigate methods that could reduce computational time and resources. We investigated 53 unidentified samples (44 with paired reads and 9 with single reads), each from one of three bacterial species responsible for meningitis in humans:

  1. Neisseria meningitidis
  2. Haemophilus influenzae
  3. Haemophilus haemolyticus
Image Subset: Metadata for samples provided by CDC. 98 files, 9 single reads and 44 paired reads.

Species Delineation

Species delineation comprises the techniques used to determine the species of an organism from its whole genome sequence. Below are the tools we used for this process.


Average Nucleotide Identity (ANI) was developed in Dr. Konstantinidis' lab. It was introduced as a substitute for DNA-DNA hybridization (DDH) and relies on the shared identity between the conserved genes of two genomes. ANI values of ≈94% correspond to the traditional 70% reassociation standard of DDH, and ANI values above 95% imply that two genomes belong to the same species.

ANI technique for measuring relatedness. (from Konstantinidis presentation for Environmental Molecular Genomics Spring 2015)

Three alignment tools (BLAST, Mummer, Tetra) were investigated for use with ANI analysis. All three worked well (results below), with reference assemblies clustering together by species. BLAST was chosen as the best alignment tool.

Figure 1: ANIb (ANI-BLAST) results.
Figure 2: ANIm (ANI-Mummer) results.
Figure 3: Tetra results.

To compute ANI for a pair of genomes, the best reciprocal matches between a query genome and a reference genome are found first. The nucleotide identities between the conserved genes of these two genomes are then averaged, giving the percentage of nucleotides that are identical on average across all conserved (and thus comparable) genes in the two genomes. Phylogenies, however, are computed from distances, so the percent nucleotide identities (percent average similarity) are converted to average difference, represented as a fraction between 0 (no difference) and 1 (completely different). The results are detailed below.
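The averaging and the similarity-to-distance conversion described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example values are hypothetical and are not taken from the group's pipeline, and the conversion assumes the simple form distance = 1 - ANI/100.

```python
def average_nucleotide_identity(gene_identities):
    """Average the percent identities of the conserved genes shared by two genomes."""
    if not gene_identities:
        raise ValueError("at least one conserved gene identity is required")
    return sum(gene_identities) / len(gene_identities)


def ani_to_distance(ani_percent):
    """Convert a percent ANI (0-100) to a fractional distance (0 = identical, 1 = completely different)."""
    if not 0.0 <= ani_percent <= 100.0:
        raise ValueError("ANI must be a percentage between 0 and 100")
    return 1.0 - ani_percent / 100.0


def distance_matrix(ani_matrix):
    """Apply the conversion to every cell of a square matrix of percent ANI values."""
    return [[ani_to_distance(value) for value in row] for row in ani_matrix]


# Hypothetical percent ANI values for three genomes, for illustration only.
example_ani = [
    [100.0, 97.5, 80.0],
    [97.5, 100.0, 81.0],
    [80.0, 81.0, 100.0],
]
example_distances = distance_matrix(example_ani)
```

A matrix of this form is what a distance-based tree-building method would take as input.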

Figure 4: ANI Final results.

Alignment Free Methods

Alignment-free techniques for species delineation use k-mer based analysis to cluster similar species together. A pairwise distance matrix is calculated across all the samples, and a minimal XOR function is used to cluster similar samples together. Species classification can be done with the same methodology by analyzing the reference genomes together with the samples. An important step is selecting the optimal k-mer size, the one that maximizes the number of distinct k-mers identified. Only k-mers with a count greater than 1 are selected, so that sequencing errors do not contaminate the analysis. The number of distinct k-mers generally peaks at a particular k-mer length, and this length is selected as the optimum. The highest number of distinct k-mers was found at a k-mer length of 11 for all the assemblies.
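The k-mer size selection described above can be illustrated with a small pure-Python sketch. It is not the KAnalyze workflow: the function names are hypothetical, k-mers are counted on one strand only (no reverse-complement handling), and it is only practical for short toy sequences.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a sequence (single strand only)."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def distinct_repeated_kmers(sequence, k):
    """Number of distinct k-mers that occur more than once."""
    return sum(1 for count in count_kmers(sequence, k).values() if count > 1)


def best_k(sequence, k_values):
    """Return the k with the most distinct repeated k-mers (first one wins on ties)."""
    return max(k_values, key=lambda k: distinct_repeated_kmers(sequence, k))


# Toy sequence for illustration only; real input would be an assembled genome.
toy_sequence = "ACGTACGTACGT"
chosen_k = best_k(toy_sequence, [8, 4])
```

On real assemblies the same scan would be run over a range of candidate k values, and the peak would be read off the resulting counts.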

Figure 5: Dendrogram showing the clustering of N. meningitidis and H. influenzae species


Multi Locus Sequence Typing (MLST) is a technique for defining strains from the sequences at seven housekeeping loci. It is derived from the traditional lab technique multilocus enzyme electrophoresis (MLEE). It is a nucleotide sequence-based approach for the unambiguous characterization of strains of bacterial or other microbial species. In this method, the sequence of each fragment is compared with all the previously identified sequences (alleles) at that locus and is thereby assigned an allele number at each of the seven loci. In this study, MLST analysis was done using the Python toolkit for MLST written by Leighton Pritchard.
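The allele-assignment step can be sketched as a lookup against a table of known alleles. This is a simplified illustration and not the toolkit used in this study: exact string matching stands in for the sequence comparison a real tool would perform, the function name is hypothetical, and the sequences and allele numbers below are made up.

```python
def assign_alleles(locus_sequences, allele_database):
    """Map each locus to the allele number of its exactly matching known sequence.

    locus_sequences: dict of locus name -> sequence observed in the sample
    allele_database: dict of locus name -> {known sequence: allele number}
    Loci with no exact match are reported as None (a possible novel allele).
    """
    profile = {}
    for locus, sequence in locus_sequences.items():
        known_alleles = allele_database.get(locus, {})
        profile[locus] = known_alleles.get(sequence.upper())
    return profile


# Made-up allele table and sample covering two loci, for illustration only.
toy_database = {
    "abcZ": {"ACGTTA": 1, "ACGTTG": 2},
    "adk": {"GGCATA": 1},
}
toy_sample = {"abcZ": "acgttg", "adk": "GGCTTA"}
toy_profile = assign_alleles(toy_sample, toy_database)
```

The resulting dictionary of allele numbers across all seven loci is the allelic profile from which a sequence type would be looked up.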

Table 1: MLST results in N. meningitidis.
Table 2: MLST results in H. influenzae.

Serogroup and Serotype Determination

A serogroup is a group of bacteria containing a common antigen, sometimes including more than one serotype, species, or genus. Serotyping/serogrouping is used for grouping subspecies based on the chemical composition of the cell. There are 12 serogroups in N. meningitidis, 6 of which (A, B, C, W135, X, Y) are invasive. H. influenzae, on the other hand, has 6 serotypes (A, B, C, D, E, F).

Table 3: Genetic basis of serogroup determination in N. meningitidis.
Table 4: Genetic basis of serotype determination in H. influenzae.
Table 4: Within annotated genomes from N. meningitidis FAM18, MC58, Z2491, and 05342 and from N. lactamica 020–06, two other distinct lipA and lipB genes have been described, encoding a lipoic acid synthetase and a lipoate-protein ligase, respectively (i.e. regions B and C have lipA/ctrE and lipB/ctrF).

Molecular Characterization

Final Pipeline

Figure 8: Final Pipeline


Figure 9: Molecular Characterization Results

Strain Typing Tool

The alignment-free method of species delineation uses a minimum XOR function to determine the evolutionary distance between two genomes. The first step is to find the distinct k-mer space for each genome, which is done by k-mering the genomes with KAnalyze; for this analysis we used a k-mer size of 11. We then create a distance matrix in which the distance between two genomes is the total number of k-mers that are present in one genome and absent from the other.
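A minimal sketch of such a distance matrix is given below. It assumes that the XOR distance is the size of the symmetric difference between two k-mer sets, that is, the k-mers present in exactly one of the two genomes. The function names are hypothetical, the k-mers are extracted in plain Python rather than with KAnalyze, and the sequences are toy data.

```python
from itertools import combinations


def kmer_set(sequence, k):
    """Return the set of distinct k-mers in a sequence (single strand only)."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def xor_distance(kmers_a, kmers_b):
    """Number of k-mers present in exactly one of the two sets (assumed XOR distance)."""
    return len(kmers_a ^ kmers_b)


def xor_distance_matrix(genomes, k):
    """Build a symmetric distance matrix as a nested dict keyed by genome name."""
    names = sorted(genomes)
    kmer_sets = {name: kmer_set(genomes[name], k) for name in names}
    matrix = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        distance = xor_distance(kmer_sets[a], kmer_sets[b])
        matrix[a][b] = distance
        matrix[b][a] = distance
    return matrix


# Toy genomes and a small k, for illustration only (the analysis above used k = 11).
toy_genomes = {"g1": "ACGTAC", "g2": "ACGTTT"}
toy_matrix = xor_distance_matrix(toy_genomes, 3)
```

Python's set operator `^` computes the symmetric difference directly, which keeps the distance function to a single line under this assumption.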

The prerequisites for the strain typing tool are R, Python, BLAST, and Java.

Software packages

KAnalyze, run_MLST, KmerDistance R package

Typing tool Usage

Figure 10: Typing tool Usage
