Sunday, November 30, 2008

Simplified Procedures for Database Searching

Database searching is the process by which testable hypotheses about the function or structure of a gene or protein of interest are generated by identifying similar sequences in better-characterized organisms.

In any sequence analysis, the relevant sequence databases are scanned by sequence alignment procedures for sequences with homology to the search sequence. Aligning two sequences by dynamic programming is a matter of seconds on a computer. However, a database search repeats this comparison for every entry in the database, and since the databases keep growing, CPU time becomes a constraint in everyday sequence analysis. The most widely used programs, BLAST and FASTA, overcome this problem by using fast heuristics instead of running full dynamic programming against every entry. Users who want to refine a search iteratively can use PSI-BLAST. For multiple sequence alignment, T-Coffee and ClustalW are excellent tools.
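
The dynamic programming step mentioned above can be sketched as follows. This is a minimal Smith-Waterman local alignment scorer, not the code used by any of the programs named here. The scoring values (match +2, mismatch -1, gap -2), the function names and the toy database are assumptions chosen only for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (linear gap penalty).

    The scoring values are illustrative assumptions, not program defaults.
    """
    prev = [0] * (len(b) + 1)          # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                     # start a new local alignment
                prev[j - 1] + sub,     # align a[i-1] with b[j-1]
                prev[j] + gap,         # gap in b
                curr[j - 1] + gap,     # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database):
    """Score the query against every entry and rank the entries, best first."""
    scored = ((smith_waterman_score(query, seq), name)
              for name, seq in database.items())
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    toy_db = {"seq1": "ACGTACGT", "seq2": "TTTTTTTT", "seq3": "GGACGTGG"}
    for score, name in search("ACGT", toy_db):
        print(name, score)
```

Each call fills a table with len(a) x len(b) cells, and search() repeats that for every database entry. That repeated cost is what makes exhaustive searching expensive as databases grow.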

BLAST is a sequence comparison algorithm, optimized for speed, that searches sequence databases for optimal local alignments to a query. The initial search looks for words of length "W" that score at least "T" when compared to the query using a substitution matrix. Word hits are then extended in both directions in an attempt to generate an alignment with a score exceeding the threshold "S". The "T" parameter dictates the speed and sensitivity of the search: a higher "T" produces fewer word hits, so the search runs faster but is less sensitive.
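
The seed-and-extend idea can be illustrated with a toy sketch. It is not the BLAST implementation: it uses a simple match/mismatch score in place of a real substitution matrix, performs only ungapped extension, and the function names and the x_drop cutoff are assumptions made for this example.

```python
from itertools import product


def score(x, y, match=1, mismatch=-1):
    # Stand-in for a substitution matrix lookup (assumed values).
    return match if x == y else mismatch


def neighborhood(query, W, T, alphabet="ACGT"):
    """Map every word scoring >= T against a query word to its query offsets."""
    seeds = {}
    for i in range(len(query) - W + 1):
        qword = query[i:i + W]
        for cand in product(alphabet, repeat=W):
            if sum(score(a, b) for a, b in zip(qword, cand)) >= T:
                seeds.setdefault("".join(cand), []).append(i)
    return seeds


def extend(query, subject, qi, si, W, x_drop=2):
    """Ungapped extension of a word hit in both directions.

    Extension in one direction stops once the running score falls more
    than x_drop below the best score seen in that direction.
    """
    total = sum(score(query[qi + k], subject[si + k]) for k in range(W))
    for step in (1, -1):
        q = qi + W if step == 1 else qi - 1
        s = si + W if step == 1 else si - 1
        run = gain = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            run += score(query[q], subject[s])
            gain = max(gain, run)
            if gain - run > x_drop:
                break
            q += step
            s += step
        total += gain
    return total


def blast_like(query, subject, W=3, T=3, S=5):
    """Return sorted (diagonal, score) pairs for extended hits scoring >= S."""
    seeds = neighborhood(query, W, T)
    hits = set()
    for si in range(len(subject) - W + 1):
        for qi in seeds.get(subject[si:si + W], []):
            sc = extend(query, subject, qi, si, W)
            if sc >= S:
                hits.add((qi - si, sc))
    return sorted(hits)


if __name__ == "__main__":
    print(blast_like("ACGTACGT", "TTTACGTACGTTTT"))
```

Lowering T in this sketch enlarges the neighborhood of each query word, which produces more word hits to extend. That is the speed versus sensitivity trade-off described above.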

FASTA was the first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequences for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later, the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt". The sensitivity and speed of the search are inversely related and are controlled by the "k-tup" variable, which specifies the size of a "word": a larger k-tup makes the search faster but less sensitive.
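
The word-matching stage can be sketched as a lookup table of words plus a count of hits per diagonal. This is an illustration only, not FASTA itself: it stops before the init1, initn and opt scoring steps, and the function name ktup_diagonals is an assumption made for this example.

```python
from collections import Counter, defaultdict


def ktup_diagonals(query, subject, ktup=2):
    """Count exact word hits of length ktup on each diagonal (i - j)."""
    # Index every word of the query by its starting offset.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Scan the subject once; each shared word adds a hit to one diagonal.
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals


if __name__ == "__main__":
    hits = ktup_diagonals("ACGTAC", "GGACGTACGG", ktup=2)
    print(hits.most_common(1))   # the diagonal with the most word hits
```

A diagonal with many hits marks a segment where the two sequences share several words in the same relative order, which is the kind of segment that is scored in the next stage. Raising ktup in this sketch reduces the number of hits that have to be processed.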

T-Coffee is a multiple sequence alignment program. Multiple sequence alignment programs are meant to align a set of sequences previously gathered using other programs such as BLAST or FASTA. The main characteristic of T-Coffee is that it makes it possible to combine a collection of multiple or pairwise, global or local alignments into a single one. These alignments may come from any source. It also estimates the level of consistency of each position within the new alignment with the rest of the alignments. This consistency is usually an indicator of alignment accuracy.

ClustalW is a general-purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences and lines them up so that the identities, similarities and differences can be seen. A pairwise score is calculated for every pair of sequences that are to be aligned, and these scores are presented in a table in the results. Each pairwise score is the number of identities in the best alignment divided by the number of residues compared (gap positions are excluded). The scores are initially calculated as percent identities and are converted to distances by dividing by 100 and subtracting from 1.0, which gives the number of differences per site.
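
The pairwise score and its conversion to a distance follow directly from the description above and can be written out as a short sketch. The function names and the use of "-" as the gap character are assumptions made for this example; this is not ClustalW code.

```python
def pairwise_identity(aln_a, aln_b, gap="-"):
    """Percent identity of two aligned sequences, ignoring gap positions."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(aln_a, aln_b):
        if x == gap or y == gap:
            continue                  # gap positions are excluded
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def identity_to_distance(percent_identity):
    """Divide by 100 and subtract from 1.0: differences per site."""
    return 1.0 - percent_identity / 100.0


if __name__ == "__main__":
    pid = pairwise_identity("ACG-TA", "ACGTTC")
    print(pid, identity_to_distance(pid))
```

In this example five positions are compared (the gap column is skipped) and four of them are identical, so the score is 80 percent and the distance is about 0.2 differences per site.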

Because the pairwise score is calculated independently of the substitution matrix and gap penalties chosen, it is always the same for a particular pair of sequences.
The pairwise scores can be calculated in two modes, fast and slow (the more accurate mode). In both modes the scores come from separate pairwise alignments, which are computed either by dynamic programming (slow but accurate) or by the method of Wilbur and Lipman (extremely fast but approximate).

Prakash Chandra Mishra
HOD Biotechnology
MITS School of Biotechnology
