Functional Annotation Group
From Computational Genomics
Functional Annotation Background
Functional annotation is the assignment of biological information and function to genomic sequence information. The type of information that can be gained includes biochemical function, biological function, the involved regulation and interactions, and expression information. With this information, the role of that function in the cell can be studied as well as how it is the same or different between organisms.
There are two basic types of functional annotation: extrinsic and intrinsic. In extrinsic annotation, comparisons are made against various databases such as Ensembl, GENCODE, Gene Ontology Consortium, and Uniprot. The quality of annotation achieved using this method is highly dependent on the quality of databases and similarity to other entries in the databases. Intrinsic annotation is based on characteristics of the gene or protein sequence itself. Known biochemical properties and behavior of a gene or protein sequence are used to predict similar genes or proteins in the query sequence. Examples of features that can be predicted using intrinsic annotation include transmembrane proteins and signal peptides.
Functional annotation is closely linked with genome annotation and comparative genomics. Genome annotation will provide information on the location of potential features in the sequence. Functional annotation will take that information and identify the biological function of those features. Often, genome annotation and functional annotation is combined in a single step. The results of the genome and functional annotation can then be used to compare the query sequence to known sequences for identification and analysis.
Important Functional Features for Species and Serotype/Serogroup Determination
Serotypes (Neisseria meningitidis) and serogroups (Haemophilus influenzae) are groups within a single species of microorganisms which have distinct surface structures and may behave differently, like in their interactions with the immune system and severity of infection. Traditional serological serotyping/serogrouping is based on the chemical composition of the cell surface, but these features have a genetic basis. Serotyping/serogrouping is a long, labor-intensive process. Additionally, traditional serogrouping may not be able to differentiate non-typable H. influenzae from Haemophilus haemolyticus.
N. meningitidis and H. influenzae/H. haemolyticus can be initially sorted based on the GC content and sequence length using genome assembly metrics information. Using data from the NCBI database, N. meningitidis has a GC content of ~51% and is ~2.2 million base pairs in length. H.influenzae/H. haemolyticus has a GC content of ~38% and is ~1.9 million base pairs in length.
Typing of the gene coding regions, particularly the cap locus, is very helpful for serogrouping/serotyping. The coding region information can also be used looking at the genetic information associated with methods like Multilocus Sequence Typing (MLST). As other coding genes are identified that can be used for serogrouping/serotyping, the sequence information will already be available and annotated without needing to develop new serology or PCR-based tools.
Ribosomal RNA (rRNA) can be used to construct phylogenetic trees for species delineation. The 16S rRNA gene is commonly used for phylogenetic analysis of microorganisms.
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) are repeat sequences interspaced with non-repeating spacer sequences of similar size. They are found in approximately 40% of bacterial genomes investigated so far. CRISPRs are known to provide immunity against viruses and plasmids.
Operons are genes coding for functionally related proteins and are clustered along the DNA. Operons allow a coordinated protein synthesis response to needs of the cell. Knowledge of an operon structure provides valuable insight about gene regulation
There are two approaches for executing tools for functional annotation: command-line and web-based interface. We decided to use command-line interface for following reasons:
1. Scalability-We can scale to larger data without taking too much time
We developed a pipeline, ANNOX (Annotation Toolbox), to combine multiple functional annotation tools into one streamlined process. Available pipelines, such a Prokka and BG7, only contain a limited number of functional annotation tools. We aimed to create a pipeline that contained a larger variety of functional annotation tools to cover as many features as possible. There are two levels of functional annotation in ANNOX. The first level contains the tools that were identified to be most helpful for further species identification. These tools annotate the coding region, CRISPRs, and OPERONS. The second level of annotation adds additional functional annotation tools which may be of interest to the researcher. The second level annotation contains tools for transmembrane proteins, tRNAs, signal peptides, domains/motifs, pathways, and virulence factors. The requirement that the individual tools within the pipeline be command-line interface was often the largest limitation in choosing a tool.
The script for ANNOX can be found at github (Click here-).
Individual Annotation Tools
Coding Region -- Prokka
prokka --force --outdir <output directory> --prefix <file prefixes> --rfam --rnammer <contig.fa>
Prokka is a command line software tool to annotate bacterial genome rapidly. It is easy to implement with reasonable installation time, providing standardized input/output file formats. It is well known for rapidly and accurately annotating bacterial genomes in about 10 min on a typical desktop computer. Prokka features include coding sequences (Prodigal), rRNA (RNAmmer), tRNA (Aragorn), signal peptides (SignalP), and non-coding RNA (Infernal). Prokka performs annotation in five steps:
1. An optional search against a user provided set of annotated proteins using BLAST+.
2. Determination of core bacterial proteins by searching UniProt using BLAST+ (contains ~16,000 proteins).
3. An optional BLAST+ search to all proteins from finished bacterial genomes in RefSeq.
4. A search of Pfam and TIGFRAMS databases using hmm scan from HMMER. Both are collections of protein families based on multiple sequence alignments and hidden markov models.
5. Non-matches are labeled hypothetical proteins.
BG7 was another functional annotation pipeline tool that we tested. Ultimately, we decided not use it for the following reasons: the release version was not well maintained with multiple errors in the script that runs the main program, it needed .frn format for RNA reference which was not available for H. haemolyticus, and it took a very long time due to its dependence on Blast.
Operons -- DOORS2 
blastn -query <inputFile> -db <DB.fna> -out <outputFile> -evalue 1e-10 -outfmt "6 qseqid sseqid qstart qend sstart send length bitscore evalue qcovs pident"
Algorithms designed to detect operons are based on finding signals, which rely on well-studied operons, and finding gene clusters, which depend on the availability of genome sequences. DOORS uses an algorithm which is basically a data-mining classifier. It utilizes information from various sources such as intergenic distance, neighborhood conservation, phylogenetic distance,iInformation from short DNA motifs, and others in order to make a prediction. Operons predicted for the species of interest were downloaded from DOORS and query sequences were used to run a blast against them. Any hit which had an e-vaue < 1e -10, percentage identity > 80%, and qcoverage >80% were regarded as putative operons in the input file.
CRISPRs -- Pilecr 
./pilercr –in <infile> -out <output text file> -seq <output fasta>
Pilecr uses a local alignment method to detect closely spaced repeat regions. A k-mer matching approach is used to detect repeat regions within a given range with interspaced sequences of similar length. The results are parsed and blast is performed for the spacers and repeats against CRISPR database. In Pilecr, Blast is performed against the repeats database of CRISPRdb. For 53 assemblies, the total runtime was 1 minute and 12 seconds.
Transmembrane Proteins -- TMHMM 
./tmhmm <input faa file> > <predicted tm output file>
TMHMM using a hidden markov model-based approach for detecting helix and loop structure. To be considered a transmembrane protein, we looked at the last two columns of output for each sample. The expected number of amino acids in intransmembrane helices should be greater than 18 and the expected number of amino acids in transmembrane helices in the first 60 amino acids of the protein should be less than 10.
For 53 assemblies, the total runtime was 2 hours, 3 minutes, 20 seconds.
tRNAs -- Contained within Prokka
We used Prokka to handle the tRNA annotation, which uses Aragorn . Aragorn can also be used as a stand-alone application. Run as a stand-alone application, it took less than two minutes to process 53 genomes.
aragorn -fasta -gc11 -o <output file name> <input file name>
Signal Peptides -- SignalP 4.1 
./signalp –t gram- -f short [input] > [output]
SignalP4.1 uses a neural network-based method to predict the presence and location of signal peptide cleavage sites in amino acid sequences. It takes fast amino acid (.faa) files as input. SignalIP was chosen for this pipeline due to its performance in a 2009 benchmark study [citation], it use in other pipelines such as Prokka, and is performance with gram negative bacteria.
Domains/Motifs -- Interproscan 
./interproscan.sh -i [input fasta] -f gff3
-t n if the input file is a nucleotide sequence
-appl [name of the analysis] if only certain analysis is needed
InterProScan is a software package that allows sequences (protein and nucleic) to be scanned against InterPro's signatures. Signatures are predictive models, provided by several different databases that make up the InterPro consortium. The databases we used were HAMAP, ProDom, PIRSF, PfamA, ProSite Profiles, SMART, ProSite Patterns, TIGRFAM, PRINTS, Superfamily, Coils, and Gene3d.
When using this tool, we ran into some issues. The first issue involved running errors. The pre-compiled version of some of the tools used by InterProScan were not compatible with our cluster. This issue was resolved by recompiling some of the tools. The second issue involved the speed of the analysis. The tool was running extremely slow on nucleotide sequences, taking more than a day to run one sample. Partitioning the fast files and running InterProScan in parallel helped some, but it was still very slow. Using the .faa files output from Prokka turned out to be much faster.
Pathways -- KOBAS 2.0
blastp -db <reference_pep_fasta> -query <input_fasta_file> -outfmt 6 -out <blasttab.out>
annotate. py -i <blasttab.out> -s ko -t blastout:tab -o <output_file.txt>
KOBAS performs a search from KEGG GENES, KEGG ORTHOLOGY, Pathway databases, Disease databases, and GO. The sequences are annotated with KEGG genes, KO, Pathway, diseases, and GO. With KOBAS2.0, we again ran into issues with the tool taking a long time to run. By making a Blast database merging all Haemophilus and Neisseria databases, as opposed to the entire ko.fasta database, the analysis time was shortened from a hours to minutes.
Virulence Factors -- VFDB
blastn -db <DatabaseName> -query <inputFile> -evalue 1e-10 -outfmt "6 qseqid sseqid qstart qend sstart send length bitscore evalue qcovs pident" -out <OutputFile>
Virulence Factor Database (VFDB) collectively presents the virulence factors of various medically significant bacterial pathogens. This tool was able to process 53 strains in less than 5 minutes.
 Harrison et al. 2013. Description and nomenclature of Neisseria meningitidis capsule locus. Emerging Infectious Diseases. 19(4): dx.doi.org/10.3201/eid1904.111799
 Seeman, T. 2014 Prokka: rapid prokaryotic geneome annotation. Bioinformatics. 30(14): 2068-2069.
Thomason, M. K., Bischler, T., Eisenbart, S. K., Förstner, K. U., Zhang, A., Herbig, A., ... & Storz, G. (2015). Global transcriptional start site mapping using differential RNA sequencing reveals novel antisense RNAs in Escherichia coli. Journal of bacteriology, 197(1), 18-28.
 Edgar, RC. 2011. PILE-CR: fast and accurate identification of CRISPR repeats. BMC Bioinformatics. 8:18.
 Krogh et al. 2001 Predicting transmembrane protein topology with a Hidden Markov Model: application to complete genomes. J Mol Biol.. 305: 567-580.
 Laslet and Canback. 2004. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucl. Acids Res. 32(1): 11-16.
 Peterson et al. 2011. SignalP4.0: discriminating signal peptides from trans-membrane regions. Nat. Methods. 8: 785-786.
 Jones, Philip, et al. "InterProScan 5: genome-scale protein function classification." Bioinformatics 30.9 (2014): 1236-1240.
Wu, J., Mao, X., Cai, T., Luo, J., & Wei, L. (2006). KOBAS server: a web-based platform for automated annotation and pathway identification. Nucleic acids research, 34(suppl 2), W720-W724.
Chen, L., Yang, J., Yu, J., Yao, Z., Sun, L., Shen, Y., & Jin, Q. (2005). VFDB: a reference database for bacterial virulence factors. Nucleic acids research, 33(suppl 1), D325-D328.