Sequence database search
One very common bioinfomatic problem is to look for a sequence in a sequence database by comparing it with a query sequence of our own. For instance, we could have a sequence isolated from a virus and we could look in the database for similar sequences in other to assign the species. We can also look for similar genes in other species having the sequence of the gene in one species.
One way to search for the sequences in the database is to align our query sequence with all the sequences in the database and to keep the good enough matches. We could try to do it by using the pair-wise alignment algorithms that we studied in the alignment lesson. The problem with those algorithms is the speed. We need a very fast a aligner becase the sequence databases are commonly comprised by many millions of sequences.
BLAST is one of the most used algorithms in bioinformatics.
BLAST includes a family of related algorithms capable of aligning a query sequence against all the sequences in a database in a very fast way.
BLAST will produce local alignments of the query sequence against the sequences in the database.
BLAST also generates a expect value that is related with the statistical significance of the alignment.
BLAST uses a very fast algorithm based on words (also known as k-tuples or kmers).
BLAST is many orders of magnitude faster than any Smith & Waterman implementation. To achive that speed it will create less acurate alignments and it will be somewhat less sentitive. The Smith & Waterman algorithm is guaranteed to get always the best alignment, the best scoring one, but it will be much, much slower.
BLAST uses an index of the sequence database in which the positions for all words presents in the sequences are stored. Let’s suppose that our database includes only one protein sequence: DSFASPUIVH. If we create the database index with all the three letter words we would have:
sequence: DSFASPUIVH words: DSF SFA FAS ASP SPU PUI UIV IVH
To do the BLAST search the words for the query sequence are generated and those words are looked for in the database index. This process will be very fast because we have very fast indexing algorithms.
The sequence regions in the database in which the query words are found are known as seeds and will be the seed from which the alignments will be build. One drawback of this algorithm is that if the query and the database do not share words, if they do not share regions with exact matches BLAST won’t be able to detect any similiarity at all. That could be well the case with very different sequences. If the sequences are different enough to have no words in common BLAST won’t be able to find the alignment. This is the main reason why BLAST is less sentitive than Smith & Waterman. But this word indexing approach is much faster.
The word length will influence the BLAST sensitivity. The shorter the word length the more sentitive will be the search, but the slower will be the process.
Once the seeds have been generated BLAST looks for diagonals in the similiarity matrix. From the seeds the alignments are extended. A query sequence and a single database sequence can generate several alignments. BLAST won’t force to join all these alignments into one single alignment. The alignment created by BLAST will not be optimal in most cases. Each alignmed is a High Scoring Pair (HSP) in the BLAST nomenclature. A database sequence and the query sequence can have several HSPs.
Once we have the alignments we should ask: are the alignments statistically significative?
Let’s try to align to random sequences:
Query= unknown (207 letters)
So, we can obtain an alignmnet between two random sequences. That is even more problematic for the blast case because we are trying to align a query sequence to millions of sequences in a sequence database. In this case we could obtain many alignments just by chance.
One way to check if an alignment that we obtain has been obtained just by chance or if it is really a significative alignment is to compare its score with the scores of alignments between random sequences. Let’s supposee that the alignment of random sequences tend to give scores between 0 and 30. If we obtain an alignment between the query sequence and the database of 15 we might think that is just a random fluke, but if we get a score of 70 we might be quite sure that the alignment is truly significative. You expect to get lots of random alignments with low scores, but few with high scores. This is the approach followed by BLAST to assertain the significance of its alignments.
BLAST creates the distribution of scores between random sequences in the database. For every alignment (HSP) it compares its score to the distribution of the random sequence alignments. The e-value (expect) is the number of hits that one can expect to see by chance in a database. The higher the score for an HSP the lower the e-value will be. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.
The BLAST e-value depends of the database queried. We can not compare e-values obtained for different databases.
It is not possible to set a general threshold for e-values, but in real searches it is usual to find that significant alignments have e-values between 1e-5 and 0.
Blast of sequence c15d_01-D10-M13R_c against database arabidopsis BLASTX 2.2.10 [Oct-19-2004] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= c15d_01-D10-M13R_c (892 letters) Database: tair6 30,690 sequences; 12,653,157 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value AT3G28480.1 | Symbol: None | oxidoreductase, 2OG-Fe(II) oxygenas... 280 6e-76 AT3G28490.1 | Symbol: None | oxidoreductase, 2OG-Fe(II) oxygenas... 258 3e-69 AT5G18900.1 | Symbol: None | oxidoreductase, 2OG-Fe(II) oxygenas... 244 4e-65 AT3G06300.1 | Symbol: AT-P4H-2 | Encodes a prolyl-4 hydroxylase ... 234 4e-62 AT1G20270.1 | Symbol: None | oxidoreductase, 2OG-Fe(II) oxygenas... 213 1e-55 AT4G35810.1 | Symbol: None | oxidoreductase, 2OG-Fe(II) oxygenas... 201 4e-52 AT2G17720.1 | Symbol: None | oxidoreductase, 2OG-Fe(II) oxygenas... 192 2e-49 AT5G66060.1 | Symbol: None | oxidoreductase, 2OG-Fe(II) oxygenas... 186 2e-47 AT4G35820.1 | Symbol: None | oxidoreductase, 2OG-Fe(II) oxygenas... 159 2e-39 AT4G33910.1 | Symbol: None | oxidoreductase, 2OG-Fe(II) oxygenas... 149 3e-36 AT2G43080.1 | Symbol: AT-P4H-1 | Encodes a prolyl-4 hydroxylase ... 147 1e-35 AT2G23096.1 | Symbol: None | oxidoreductase, 2OG-Fe(II) oxygenas... 137 7e-33 AT4G25600.1 | Symbol: None | ShTK domain-containing protein, sim... 131 6e-31 >AT3G28480.1 | Symbol: None | oxidoreductase, 2OG-Fe(II) oxygenase family protein, similar to prolyl 4-hydroxylase, alpha subunit, from Gallus gallus (GI:212530), Rattus norvegicus (GI:474940), Mus musculus (SP:Q60715); contains PF03171 2OG-Fe(II) oxygenase superfamily domain | chr3:10677275-10679525 REVERSE | Aliases: MFJ20.16 Length = 316 Score = 280 bits (717), Expect = 6e-76 Identities = 141/235 (60%), Positives = 174/235 (74%), Gaps = 1/235 (0%) Frame = +2 Query: 191 NRFPKMLLHNNDMYESVIRMKTGGSAITIDPTRVIQLSSKPRAFLYEGFLSYEECQHLIH 370 NRF + +N SVI+MKT S+ DPTRV QLS PR FLYEGFLS EEC H I Sbjct: 25 NRF--LTRSSNTRDGSVIKMKTSASSFGFDPTRVTQLSWTPRVFLYEGFLSDEECDHFIK 82 Query: 371 LAKGKLRQSLVAAG-TGESVASKERTSTGMFLRKAQGKIVARIESRIAAWTFLPLDNGEP 547 LAKGKL +S+VA +GESV S+ RTS+GMFL K Q IV+ +E+++AAWTFLP +NGE Sbjct: 83 LAKGKLEKSMVADNDSGESVESEVRTSSGMFLSKRQDDIVSNVEAKLAAWTFLPEENGES 142 Query: 548 IQILRYENGQKYEPHFDFFQDPGNIAIGGHRIATILMYLSDVEKGGETVFPNSPVKLSEE 727 +QIL YENGQKYEPHFD+F D N+ +GGHRIAT+LMYLS+VEKGGETVFP K ++ Sbjct: 143 MQILHYENGQKYEPHFDYFHDQANLELGGHRIATVLMYLSNVEKGGETVFPMWKGKATQL 202 Query: 728 EKGDLSECAXVGYGVRPKLGDALLFFSMNPNVTPDATSYHGSCPVIEGEKMVCTK 892 +ECA GY V+P+ GDALLFF+++PN T D+ S HGSCPV+EGEK T+ Sbjct: 203 KDDSWTECAKQGYAVKPRKGDALLFFNLHPNATTDSNSLHGSCPVVEGEKWSATR 257 >AT3G28490.1 | Symbol: None | oxidoreductase, 2OG-Fe(II) oxygenase family protein, similar to prolyl 4-hydroxylase, alpha subunit, from Caenorhabditis elegans (GI:607947), Mus musculus (SP:Q60715), Homo sapiens (GI:18073925); contains PF03171 2OG-Fe(II) oxygenase superfamily domain | chr3:10680286-10681891 REVERSE | Aliases: MFJ20.17 Length = 288 Score = 258 bits (659), Expect = 3e-69 Identities = 128/211 (60%), Positives = 159/211 (75%), Gaps = 2/211 (0%) Frame = +2 Query: 266 AITIDPTRVIQLSSKPRAFLYEGFLSYEECQHLIHLAKGKLRQSLVAAG--TGESVASKE 439 ++DPTR+ QLS PRAFLY+GFLS EEC HLI LAKGKL +S+V A +GES S+ Sbjct: 24 SFSVDPTRITQLSWTPRAFLYKGFLSDEECDHLIKLAKGKLEKSMVVADVDSGESEDSEV 83 Query: 440 RTSTGMFLRKAQGKIVARIESRIAAWTFLPLDNGEPIQILRYENGQKYEPHFDFFQDPGN 619 RTS+GMFL K Q IVA +E+++AAWTFLP +NGE +QIL YENGQKY+PHFD+F D Sbjct: 84 RTSSGMFLTKRQDDIVANVEAKLAAWTFLPEENGEALQILHYENGQKYDPHFDYFYDKKA 143 Query: 620 IAIGGHRIATILMYLSDVEKGGETVFPNSPVKLSEEEKGDLSECAXVGYGVRPKLGDALL 799 +GGHRIAT+LMYLS+V KGGETVFPN K + + S+CA GY V+P+ GDALL Sbjct: 144 LELGGHRIATVLMYLSNVTKGGETVFPNWKGKTPQLKDDSWSKCAKQGYAVKPRKGDALL 203 Query: 800 FFSMNPNVTPDATSYHGSCPVIEGEKMVCTK 892 FF+++ N T D S HGSCPVIEGEK T+ Sbjct: 204 FFNLHLNGTTDPNSLHGSCPVIEGEKWSATR 234
BLAST has different programs to compare different kinds of sequences.
- BLASTN compares a nucleotide sequence with a nucleotide database
- BLASTP compares a protein sequence with a protein database
- BLASTX compares the six translated frames of a nucleotide sequence with a protein database
- TBLASTN compares a protein sequence with the six frame translation of a nucleotide database
- TBLASTX compares a nucleotide sequence with a nucleotide database, but first it translates both
Megablast is an algorithm derived from BLAST. It is faster than BLAST although a little bit less sensitive. It will be faster than BLAST, but it will miss the match if the sequences are very different.
BLAST 2 sequences
It uses the BLAST algorithm to do a pairwise alignment. It does not uses a database.
It is a very fast way to get a rough alignment (not as good as a Smith and Waterman one). It has the advantage of trying to align both strands.
PSI-BLAST and DELTA-BLAST
PSI-BLAST and DELTA-BLAST are two alternative algorithms to blastp that are more sensitive. They can be used to find distant relatives in the sequences.
PSI-BLAST (Position-Specific Iterated BLAST) is based on the use of Position-Specific Scoring Matrices (PSSM). A PSSM summarizes the patterns found in several sequences. It calculates the frequency for each nucleotide or aminoacid in every position of a multiple alignment.
seq1 A C T G A T G seq2 A C T A A C G PSSM A 1 0 0 0.5 1 0 0 C 0 1 0 0 0 0.5 0 G 0 0 0 0.5 0 0 1 T 0 0 1 0 0 0.5 0
PSSMs describe the relevant features in the sequences and are usually used to describe conserved domains in the proteins or in the DNA.
PSI-BLAST uses an iterative approach. In the first steps it runs a standard blastp search. With the significant results it creates a mutiple alignments and from the alignment it calculates a PSSM. In the second iteration PSI-BLAST is able to search the database by aligning the PSSM with the database sequences. The sequences found in this new search are used to create a new PSSM and the process is repeated again.
DELTA-BLAST is similar to PSI-BLAST but in the first search it uses a Conserved Domain Database. With the results creates a PSSM and uses that to search in the protein database.