Genome Assembly Group

From Computational Genomics

Jump to: navigation, search



Our goal was the development of a scalable, semi-automated genome assembly pipeline that would use state of the art assembly tools and quality metrics to generate a high fidelity consensus sequence for downstream computational uses. We wanted to evaluate new or infrequently used assemblers to provide state-of-the-art knowledge. Our plan was to combined the results from multiple assemblers in order to create one super-assembly. The super-assembly would be the best, highest-fidelity consensus sequence.

The spirit of this project was to use a combination of tools and approaches to improve results: for this purpose, we combined de novo assembly with reference-based assembly. For maximum applicability and ease of use, we constructed a wrapper to semi-automate this process.

Find links to our tools at GitHub. The pipeline is assemblyMagic. The scripts used for Prinseq, ABySS, Fastqc, CISA, Edena, SPAdes, QUAST, and SMALT are also located here.

We received 53 strains from the CDC that were either Neisseria meningitidis, Haemophilus influenzae, or Haemophilus haemolyticus. The strains were a mixture of single-end and paired-end reads generated from GAII (single-end) or Illumina HiSeq/MiSeq (paired-end) instruments. We received a total of 97 reads.

Proposed pipeline:


Pipeline steps



We used FastQC as our metrics program due to its high quality results and the diversity of information it could provide, such as:
1. Basic statistics such as per-base sequence quality and per-sequence quality scores
2. Per-base N content which shows the proportion of invalid base calls
3. Per-sequence GC content, which can be used to make a preliminary classification between Neisseria meningitidis and Haemophilus influenzae/Haemophilus haemolyticus.
4. Identification of overrepresented sequences/kmers

FastQC takes in input of BAM, SAM, or FASTQ files and outputs a zip file containing a HTML report.

fastqc <filename.fastq> For paired-end reads, you need to execute the command for each read.


PRINSEQ trims specified base pairs from the right and left side of reads to remove adapters and low quality bases. In combination with FastQC, we are able to trim bases that fall below a certain quality score. For this project, we used 28 as our cutoff, a recommended value for good-quality reads.


Paired end read: -p <read1.fastq> <read2.fastq> -l <left trim number> -r <right trim number>
Single end read: -s <read.fastq> -l <left trim number> -r <right trim number>

Examples of FastQC results before and after PrinSeq processing

The image to the left contains examples of FastQC data before and after PRINSEQ processing. The top graphs show the per base sequence quality and the bottom graphs show the per base sequence content. Using the per base sequence content, we can decide how much to trim from the left and right side of the reads by choosing where the lines flatten out.

De novo assembly

When performing de novo assembly, the assembly is based on the sequence files alone and does not require a reference sequence for scaffolding. The de novo assemblers were chosen based on their published comparisons to other de novo assemblers, the variety of algorithms used to maximize the strengths of each, and their ability to handle both single-end and paired-end reads.

SPAdes [1]

SPAdes is an A-Bruijn assembler which uses kmers for building an initial de Bruijn graph and then forgets them during assembly. It also does multiple assemblies using different kmer sizes to optimize the end product, a final consensus DNA sequence. SPAdes takes as input single-end and paired-end reads in FASTA/FASTQ format.

SPADES COMMAND: [options] -o <output_directory>
-12 File with interlaced data.
-1 File with forward reads.
-2 File with reverse reads.
-s File with unpaired reads.
-o <output_dir> specify the output directory. Required option.

Edena [2]

Edena is built around a standard overlap-layout-consensus (OLC) workflow. Edena was chosen because we are working with relatively small genomes and Edena is suited to short reads. OLC also keeps all the information in reads. In our case, we opted to use exact matching for overlap detection and assumed that all reads had the same length.

Edena assembly has four major steps:
1. Remove redundant reads so that dataset size is more manageable (done in prefix tree)
2. Overlap detection and overlap graph construction(use bidirected graph)
3. Graph cleaning: simplification and ambiguity resolution
4. Produce contigs

Overlapping Mode:
-takes as input either a single-end or paired-end files and outputs a <.ovl> file for the next step
edena -r <file1> -p <output_file_name>
edena -DRpairs <file_forward> <file_reverse> -p <output_file_name>
-r single-end reads
-DRpairs paired-end reads
-p prefix name for output files

Assembling Mode:
-takes as input the Edena .ovl file and outputs a <_contigs.fasta> file
edena -e <file.ovl> -p <output_file_name>
edena -e <file.ovl> -p <output_file_name>
-e invokes assembly mode

ABySS [3]

ABySS is a de Bruijn graph based algorithm, where each node in the graph represents a kmer, and two adjacent kmers overlap by k-1 bases. The goal is to find a Eularian Walk in the graph. For our assemblies, a kmer size of 21 was chosen as ABySS does not have a default kmer size.

There are two stages:
1. The reads are broken into kmers and adjacent kmers that overlap by k-1 bases are found. Kmers resulting from read errors and variant sequences are removed and contigs are generated.
2. The reads are aligned to the contigs created in the first stage. The distance between contigs is estimated using the paired reads that align to different contigs. Paths are found through the contigs that agree with the distance estimates. Overlapping paths are merged and the contigs in these paths are merged.

paired end: k=21 name=<strain ID> <read1.fastq> <read2.fastq>
single end:
abyss -l 21 <read.fastq> -o <output filename.fa>

Other de novo assemblers were evaluated for potential inclusion in the pipeline but were ultimately not used. These included ALLPATHS-LG (requires 2 paired-end libraries, which we did not have), RAY (had issues with getting FASTA results out, was only getting metrics), Velvet (need to have an idea of coverage beforehand, but we had a wide variety of genomes), and MINIA (evaluated later, decided to go with the three we already chose).

Reference Selection and Alignment

Reference Assembly Flowchart

Reference assembly is done by using one or more references and aligning the raw reads to the reference(s).

Reference Selection

In this case, we knew that the reads were from one of three species but we did not know which species a particular read went with. Entrez Direct (EDirect), which allows access to the databases of NCBI through the command line, was used to download the reference genomes and contigs for N. meningitidis, H. influenzae, and H.haemolyticus from the NCBI Genome database. Ten complete reference genomes were selected each for N. meningitidis and H. influenzae, with those selected having the highest number of annotated genes and proteins of all available. All six available reference genomes composed of 22-123 scaffolds were used as references for H. haemolyticus.

N. meningitidis strains chosen from NCBI for reference assembly
H. infulenzae strains chosen from NCBI for reference assembly
H. haemolyticus strains chosen from NCBI for reference assembly

esearch - searches database
efetch - downloads file

For strains that are complete, the entire file can be retrieved with the following command:
esearch -db nucleotide -query [accession number] | efetch -format fasta > [output file name.fasta]

For incomplete strains, like H. hemolyticus, all of the contigs were retrieved in a batch with the following command:
(for i in $(cat [strain ID.txt]); do esearch -db nucleotide -query $i | efetch -format fasta; done) > [output file name.fasta]


SMALT was used to align the reads to the references. SMALT was chosen based on a comparison of reference assemblers performed by Pightling et al. [5]. In overall performance, SMALT had the edge over the other assemblers and consistently performed well. SMALT also performed best when the sequence and reference are more distant which is an advantage for us since we did not know what species we had and would be choosing representative reference sequences for alignment. We aligned all 53 sets of reads to the 26 reference genomes/contigs we downloaded. The first step was to build an index of k-mer words for each reference sequence. Smalt was then used to index each reference sequence:
smalt index Output.fasta

A perl script was then run to map the references to the query strain, process the files using SAMtools, convert to pileup format using SAMtools, consensus calling using BCFtools, and process the fasta/fastq files.

For paired end usage, the script command is: -r [reference_file_path] -o [prefix of final file name] -p1 [R1_read_file] -p2 [R2_read file]

This script loops through all of the reference files with the following commands:

Map the reference file =>
smalt map -f sam -o [file.sam] -n 3 [reference] [paired reads file 1] [paired reads file 2]
Convert sam to bam file =>
samtools view -bS [file.sam] > [file.bam]
Sort the file =>
samtools sort [file.bam] [finalFileName]
Index the sorted bam file =>
samtools index [finalFileName.bam]
Retrieve unmapped reads =>
bam2fastq --no-aligned --strict --force -o [unmapped.fastq] [finalFileName.bam]
Create pileup file and call consensus =>
samtools mpileup -f [reference] -gu [finalFileName.bam] | bcftools call -c -O b -o [finalFileName.raw.bcf]
Convert file to fastq format =>
bcftools view -O v [finalFileName.raw.bcf] | vcf2fq > [output.fastq
Convert fastq consensus to fasta =>
seqret -sequence [output.fastq] -outseq [output.fasta]
Remove N’s, empty lines, and headers with regex
Analyze assembly results => -o [quastDirect] []
Compress the read files

Best matching results in reference assembly, part one
Best matching results in reference assembly, part two

The best reference was chosen based on the QUAST results with the following critera:
* Correct GC content: N. meningitidis-~51%, H. influenzae/H. haemolyticus-~38%
* Highest N50
* Lowest number of contigs
* Hightest total length
* Lowest number of N's per 100 kpb
* Longest contig

Additionally, a preliminary assignment of species to each strain could be done based on the GC content and which reference assembly the strain best matched to.

Quality Metrics - QUAST[6]

QUAST was used to evaluate the results of each assembly and the super-assembly. QUAST uses aggregated metrics and methods from existing software and does not need a reference genome. Examples of metrics from QUAST include number of contigs, size of contigs, longest contig, total length, N50, and L50. A combination of these metrics was used to rate each assembler.

QUAST COMMAND: <contig/scaffold file> less quast-results/latest/report.txt>

Super-Assembly - CISA[7]

CISA (Contig Integrator for Sequence Assembly) was used to create the super assembly. CISA removes uncertainties at the ends of contigs with a minimum of 30% overlap between contigs. The longest contig in a region is used as the representative contig. The theory is that we will get the best result using all of the assemblies.

Evaluation of Assemblies

Each individual assembly along with the CISA super-assembly was evaluated by calculating an Assembly Score. Using prior years' class' work we using the following equation:

Assembly Score = log10((N50 * Total length) / #Contigs))

Initial Quast results on individual assemblies and CISA

Initial comparisons indicated that the CISA super-assembly had the best assembly score across all samples. [figure] However, as the results were used down the line for gene prediction and functional annotation, it became apparent that there was someting wrong with the CISA super-assembly. The number of predicted coding genes from the functional annotation was much lower than what was expected from known available information. Looking more in-depth at the CISA super-assemblies revealed that skew was being introduced. The way QUAST was being run, basing it off of the assembly as opposed to comparing it to a reference, the CISA score was inflated due to the fact the genome length was smaller than expected given a predicted reference.

A new assembly score metric was devised:

New Assembly Score = log10((N50 / # contigs)) * (Genome Length/Expected Length)^2

Table of new assembly scores and which assembly was chosen for each strain
Average expected genome sizes
GC Predicted genome sizes

Using this score, the score skew was removed in all but 5 strains. Within the 5 strains where score skew still remained, there was a partial skew reduction where another assembly had a higher score over CISA. Using this new assembly score, the best assembly was chosen for each strain and passed on to the other groups. With the new assembly score, the CISA assembly typically scored more poorly than the individual assemblies.


[1] Bankevich et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Bio. 12:455-477.
[2] Hernandez et al. 2008. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18(5):802-809.
[3] Simpson et al. 2009. ABySS: a parallel assembler for short read sequence data. Genome Res. 18:821-829.
[4] Ponstingl and Ning. 2010. SMALT - a new mapper for DNA sequencing reads. The Wellcome Trust/Sanger Institute. Poster.
[5] Pightling et al. 2014. Choice of reference assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses. PLoS ONE. 9:e104579.
[6] Gurevich et al. 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 29:1072-1075.
[7] Lin and Liao. 2013. CISA: contig integrator for sequence assembly of bacterial genomes. PLoS ONE. 8(3): e60843.

Personal tools