Gene Prediction Group

From Computational Genomics




In computational biology, gene prediction (or gene finding) is the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, and may also include the prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced. It is a key step in genome annotation, following sequence assembly, the filtering of non-coding regions, and repeat masking.

Gene Prediction in Prokaryotes: Transcription (the formation of mRNA from the DNA sequence) and translation (the conversion of the coding regions of mRNA into the corresponding proteins) differ at a fundamental level between prokaryotes and eukaryotes. In prokaryotic genomes, DNA sequences that encode proteins are transcribed into mRNA, and the mRNA is usually translated directly into protein without significant modification. Prokaryotic gene structure is described in terms of promoter elements, an open reading frame (ORF), and terminator sequences. Because gene regulation in prokaryotes is less complex and genes are not split into exons and introns, it is comparatively easy to differentiate between coding and non-coding regions. Gene prediction in prokaryotes can be based on either ab initio (statistical) methods or homology methods.
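
To illustrate the ORF component of this gene structure, here is a minimal Python sketch of a naive ORF scan over both strands. It is not how any of the tools on this page work internally. It assumes ATG as the only start codon, the stop codons TAA, TAG and TGA, and an arbitrary minimum length, and it ignores promoters, ribosomal binding sites and terminators. The names find_orfs and min_len are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=90):
    """Return (start, end, strand) tuples for naive ORFs.

    Coordinates are 0-based, half-open, and always given on the
    forward strand. An ORF runs from the first in-frame ATG after
    the previous stop codon up to and including the next stop codon.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i
                elif start is not None and codon in STOPS:
                    end = i + 3
                    if end - start >= min_len:
                        if strand == "+":
                            orfs.append((start, end, strand))
                        else:
                            # Map reverse-strand coordinates back to the forward strand.
                            orfs.append((n - end, n - start, strand))
                    start = None
    return orfs


# Toy sequence (made up for illustration): one short ORF on the forward strand.
toy = "CC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
print(find_orfs(toy, min_len=9))  # [(2, 17, '+')]
```

Real prokaryotic gene finders go well beyond this: they also consider alternative start codons and score each candidate with a statistical model rather than accepting every sufficiently long ORF.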

Approaches to gene prediction

Ab initio gene prediction: Ab initio gene prediction is based on statistical signals within the DNA. It relies on signals, which are short DNA motifs, and on coding statistics, which capture the nucleotide compositional bias that distinguishes coding from non-coding regions. Most ab initio prediction tools are easy to run, execute quickly, and require only the DNA sequence as input. However, they need prior knowledge in the form of training sets, and the number of mispredicted genes can be high. Ab initio gene prediction programs can use supervised or unsupervised learning techniques. With supervised learning, the parameters of the algorithm are pre-determined based on knowledge of the genome from experimental work. In contrast, unsupervised learning uses the input sequence itself to determine the parameters of the model.
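
As a rough illustration of a coding statistic, the following Python sketch trains codon frequencies on known coding sequences (the supervised case) and scores a candidate region by its log-odds against a uniform background. This is a deliberately simplified stand-in and not the model used by GeneMarkS or Prodigal. The function names, the pseudocount, and the uniform 1/64 background are all assumptions made for the example.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codon_counts(seq):
    """Count non-overlapping codons in frame 0 of seq."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))


def train_codon_model(coding_seqs, pseudocount=1.0):
    """Estimate codon probabilities from in-frame coding sequences."""
    counts = Counter()
    for s in coding_seqs:
        counts.update(codon_counts(s))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(seq, model):
    """Log-odds of seq under the codon model versus a uniform background.

    Positive scores mean the codon usage looks more like the training
    set than like random sequence. Codons with ambiguous bases are skipped.
    """
    score = 0.0
    for codon, n in codon_counts(seq).items():
        if codon in model:
            score += n * math.log(model[codon] * len(ALL_CODONS))
    return score


# Toy training set (made up for illustration).
model = train_codon_model(["ATGGCTGCTGCTTAA"])
print(coding_score("GCTGCTGCT", model) > 0)  # True: matches the trained usage
print(coding_score("CCCCCCCCC", model) < 0)  # True: codon unseen in training
```

In an unsupervised setting the same kind of model would be estimated from the input genome itself, for example by iterating between predicting genes and re-estimating the parameters.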

Homology-based gene prediction: In homology-based gene prediction, the gene structure is deduced from homologous sequences (protein and mRNA). The results can be very accurate when homologous sequences with high similarity are used for the analysis.

RNA prediction: Non-coding RNAs (ncRNAs) are RNAs that are transcribed but not translated into protein. They include the well-characterized transfer RNAs and ribosomal RNAs, snRNAs, snoRNAs, and miRNAs, as well as a plethora of new ncRNAs that have been shown to play major roles in the cellular processes of prokaryotes. RNA prediction methods include a set of functionalities for assessing predicted ncRNA genes with regard to their context, annotation, conservation, and secondary structure.

Ab initio gene prediction tools


GeneMarkS is a self-training tool for predicting genes in novel genomes. It combines GeneMark.hmm with a self-training algorithm that adjusts the model parameters. It is an ab initio method that uses a hidden Markov model (HMM) to differentiate coding from non-coding regions, and it is trained without supervision for modeling, estimation and inference.

Commands to run GeneMarkS:

 gmsn.pl -prok <inputfilename.fasta> -format <output format>


Prodigal (Prokaryotic Dynamic Programming Gene-finding Algorithm) is a very fast gene recognition tool that can analyze an entire microbial genome in 30 seconds or less. It is also highly accurate: it correctly locates the 3' end of every gene in the experimentally verified EcoGene data set, and its ribosomal binding site scoring system enables it to locate the translation initiation site with great accuracy. When run on our reference files, Prodigal exhibited the highest average specificity and sensitivity of any combination of programs.

Commands to run Prodigal:

 prodigal.linux -i input_file_name -o output_file_name -f output_format -d nucleotide_sequences_of_all_genes -a protein_sequences_of_all_genes -s potential_genes_with_scores

Homology-based prediction tools


BLAST (Basic Local Alignment Search Tool) is one of the most widely used homology search tools. It assumes that sequence, structure and function are interrelated. It is a heuristic method that searches only for the more significant patterns in sequences, yet retains comparable sensitivity, and it is faster than the Smith-Waterman algorithm.

Commands to run BLAST:

 makeblastdb -in <input_file> -dbtype <nucl/prot> -out <output_database>
 blastn -db <database_name> -query <query_fasta> -outfmt 6 -out <output_filename>
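
The tabular output produced by -outfmt 6 has twelve tab-separated columns by default (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The Python sketch below shows one way such a file could be read and reduced to the best hit per query. The helper names parse_blast6 and best_hits are hypothetical, the example rows are made up, and picking the best hit by bit score is an assumption rather than the rule used in this pipeline.

```python
import io

BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_blast6(handle):
    """Yield one dict per alignment from default BLAST tabular output."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(BLAST6_COLUMNS, line.split("\t")))
        for key in INT_FIELDS:
            record[key] = int(record[key])
        for key in FLOAT_FIELDS:
            record[key] = float(record[key])
        yield record


def best_hits(records):
    """Keep the highest-bitscore alignment for each query."""
    best = {}
    for rec in records:
        current = best.get(rec["qseqid"])
        if current is None or rec["bitscore"] > current["bitscore"]:
            best[rec["qseqid"]] = rec
    return best


# Made-up rows standing in for a real output file.
example = (
    "gene_1\tref_A\t98.50\t300\t4\t0\t1\t300\t11\t310\t1e-150\t540\n"
    "gene_1\tref_B\t80.00\t250\t50\t0\t1\t250\t5\t254\t1e-40\t180\n"
    "gene_2\tref_C\t91.20\t120\t10\t1\t1\t120\t1\t119\t2e-35\t160\n"
)
hits = best_hits(parse_blast6(io.StringIO(example)))
for query, rec in sorted(hits.items()):
    print(query, rec["sseqid"], rec["pident"], rec["evalue"])
```

For a real run, the StringIO object would be replaced by an open file handle on the file written with -out.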

ncRNA prediction tools

tRNAscan-SE 1.21

tRNAscan-SE identifies 99–100% of transfer RNA genes in DNA sequence while giving less than one false positive per 15 gigabases. Two tRNA detection programs are used as fast, first-pass prefilters to identify candidate tRNAs, which are then analyzed by a highly selective tRNA covariance model.

Commands to run tRNAscan-SE:

   tRNAscan-SE -B [FASTA]

RNAmmer 1.2

RNAmmer uses hidden Markov models trained on data from the 5S ribosomal RNA database and the European ribosomal RNA database project. A pre-screening step makes the method fast with little loss of sensitivity, enabling the analysis of a complete bacterial genome in less than a minute. Results from running RNAmmer on a large set of genomes indicate that the location of rRNAs can be predicted with a very high level of accuracy.

Commands to run RNAmmer:

   RNAmmer -S bac -multi -gff [FILE]

Rfam 12.0 (Infernal 1.1)

Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. Unlike proteins, ncRNAs often have similar secondary structures without sharing much similarity in their primary sequence. Rfam divides ncRNAs into families based on evolution from a common ancestor. Rfam 12.0 comprises 2544 RNA families.

Infernal ("INFERence of RNA ALignment") is for searching DNA sequence databases for RNA structure and sequence similarities. It is an implementation of a special case of profile stochastic context-free grammars called covariance models (CMs).

Commands to run Infernal:

   cmscan --noali -E 0.01 --cpu 4 [Rfam CM] [FASTA]



Figure 1. Gene prediction pipeline


Gene Prediction


Figure 1. Prodigal output: Standard GFF format, with contig name, type, start, stop, etc.


Figure 2. GeneMarkS output: Standard GFF format, with contig name, type, start, stop, etc.
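
Since both programs can write GFF, a per-contig gene count can be obtained with a few lines of Python. The sketch below assumes the usual nine tab-separated GFF columns (sequence name, source, feature type, start, end, score, strand, phase, attributes) and that predicted genes are reported with the feature type CDS. The function name count_features and the example lines are made up for illustration.

```python
import io
from collections import Counter


def count_features(handle, feature_type="CDS"):
    """Count features of one type per contig in a GFF file."""
    counts = Counter()
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF record
        seqid, feature = fields[0], fields[2]
        if feature == feature_type:
            counts[seqid] += 1
    return counts


# Made-up GFF lines standing in for a real prediction file.
example_gff = (
    "##gff-version 3\n"
    "contig_1\tpredictor\tCDS\t3\t98\t5.2\t+\t0\tID=1_1\n"
    "contig_1\tpredictor\tCDS\t337\t2799\t338.7\t-\t0\tID=1_2\n"
    "contig_2\tpredictor\tCDS\t10\t450\t41.0\t+\t0\tID=2_1\n"
)
for contig, n in sorted(count_features(io.StringIO(example_gff)).items()):
    print(contig, n)
```

Running the same function on the output of each program gives the per-program counts that can then be compared across assemblies.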


Figure 3. BLAST output: Each entry predicted by the gene prediction programs is searched against a master BLAST database. If a match exists, a row is added to the BLAST output, where the first column is the query and the second column is the match in the BLAST database. The remaining values correspond to percent identity and other BLAST-related scores.


Figure 4. Compiled results: The first column is the contig name, and the second column is the number of genes predicted by the combination of both programs. The third column is the number of those genes that have a match in BLAST, and the fourth column is the number of genes that were predicted by both programs but have no match in BLAST. The last column is the fraction of genes from the combined list that were annotated as genes.
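
The table described in this caption can be sketched in Python with set operations. The code below is one possible reading, not the script that produced the figure. It assumes that "combination" means the union of the two prediction sets, that two predictions are the same gene only when contig, start, end and strand match exactly, and that the last column is the BLAST-matched count divided by the combined count. The function name compile_results and the example data are hypothetical.

```python
def compile_results(prodigal, genemarks, blast_matched):
    """Build one summary row per contig.

    Each argument is a set of (contig, start, end, strand) tuples.
    Returns rows of (contig, n_combined, n_in_blast,
    n_both_programs_not_in_blast, fraction_in_blast).
    """
    rows = []
    contigs = sorted({gene[0] for gene in prodigal | genemarks})
    for contig in contigs:
        p = {g for g in prodigal if g[0] == contig}
        m = {g for g in genemarks if g[0] == contig}
        combined = p | m                      # assumed: union of both programs
        in_blast = combined & blast_matched   # genes with a BLAST match
        both_not_in_blast = (p & m) - blast_matched
        fraction = len(in_blast) / len(combined)
        rows.append((contig, len(combined), len(in_blast),
                     len(both_not_in_blast), fraction))
    return rows


# Made-up predictions for two contigs.
prodigal = {("c1", 1, 90, "+"), ("c1", 200, 500, "-"), ("c2", 10, 100, "+")}
genemarks = {("c1", 1, 90, "+"), ("c1", 600, 900, "+")}
blast_matched = {("c1", 200, 500, "-"), ("c2", 10, 100, "+")}

for row in compile_results(prodigal, genemarks, blast_matched):
    print(row)
```

Exact coordinate matching is strict, because the two programs often agree on the stop codon but differ on the start. Matching on contig, strand and 3' end only would be a reasonable alternative.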


Figure 5. Number of genes predicted per program for the assemblies.

ncRNA Prediction

Figure 5. Predicted tRNA counts. Most tRNA counts are close to 52, the count obtained for the FAM18 reference. This suggests that most tRNA sequences were not cleaved during the sequencing process.

Figure 6. Predicted rRNA counts. Many of the rRNA counts are much smaller than 12, the count obtained for the FAM18 reference. This is the expected result: rRNA sequences are much longer than tRNA sequences, so it is more difficult to assemble complete rRNA sequences.

Figure 7. Other predicted RNA counts, including miRNA, tmRNA, miscRNA, etc.
