Idatabase 3.17

4/11/2023

NCBI BLAST uses pthreads to speed up BLAST on a multicore CPU. On CPU clusters, TurboBLAST, ScalaBLAST, and mpiBLAST have been proposed. To achieve higher throughput on a per-node basis, BLAST has also been mapped and optimized onto various accelerators, including FPGAs and GPUs. However, few recent studies focus on improving the performance of CPU implementations of the widely used BLAST algorithm.

Most previous studies adopt query indexing for sequence search. Query indexing uses a lookup table to record the positions of each word in the input query. These BLAST algorithms then scan each database sequence to find short matches, extend these matches to optimal alignments, and calculate the final similarity scores. In contrast, other approaches suggest that database indexing can yield much faster search than query indexing. Examples of such tools include BLAT, SSAHA, MegaBLAST, and CAFE. However, these tools cannot provide the same level of sensitivity as the BLAST algorithm, or they support only nucleotide sequence search.

SSAHA and BLAT, for example, are very fast at finding near-identical matches. However, to reduce memory footprint and search space, both tools build indexes of non-overlapping words from the database, which leads to extremely fast search but compromised sensitivity. More specifically, BLAT builds its database index from non-overlapping words of length W. With this approach, the size of the database index is significantly reduced, to roughly \(\frac{1}{W}\) of an index over overlapping words. Storing sequence ids for a database of \(N\) sequences takes \(\lceil \log_2 N \rceil\) bits per id. For example, if there are \(2^{20}\) sequences in a database, we need 20 bits to represent the sequence ids.
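To make the size argument concrete, here is a minimal Python sketch, not BLAT's or SSAHA's actual implementation. It indexes a toy database with overlapping and with non-overlapping words and computes the bit width needed for sequence ids. The function names (`build_word_index`, `count_entries`, `id_bits`) and the two toy sequences are assumptions made for illustration only.

```python
from collections import defaultdict


def build_word_index(sequences, w, overlapping=True):
    """Map each length-w word to the (sequence id, offset) pairs where it occurs.

    With overlapping=False only every w-th offset is indexed, which is the
    non-overlapping scheme described in the text (illustrative sketch only).
    """
    step = 1 if overlapping else w
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for pos in range(0, len(seq) - w + 1, step):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def count_entries(index):
    """Total number of (sequence id, offset) entries stored in the index."""
    return sum(len(positions) for positions in index.values())


def id_bits(num_sequences):
    """Bits needed to give each of num_sequences a distinct id: ceil(log2(N))."""
    return max(1, (num_sequences - 1).bit_length())


# Hypothetical toy database of two length-10 protein fragments.
database = ["MKVLAAGIVG", "MKTAYIAKQR"]
full = build_word_index(database, w=3, overlapping=True)
sparse = build_word_index(database, w=3, overlapping=False)

print(count_entries(full), count_entries(sparse))
print(id_bits(2 ** 20))
```

On this toy input the non-overlapping index stores 6 entries instead of 16, which approaches the 1/W ratio as sequences get longer. The price is that a match is found only if it covers an indexed offset, which is the sensitivity trade-off mentioned above.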

The similarities identified by BLAST can be used, for example, to infer functional and structural relationships between the corresponding biological entities. With the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is arguably outstripping our ability to analyze the data. Specifically, the increasing demand to mine sequence databases for useful information requires substantial computing power. Consequently, significant research effort has been invested into accelerating the BLAST search algorithm. Much of this effort has focused on parallelizing BLAST on different parallel architectures because of its compute- and data-intensive nature.

The Basic Local Alignment Search Tool (BLAST) is a fundamental algorithm in the life sciences that compares a query sequence to a database of sequences, i.e., subject sequences, to identify the sequences most similar to the query. With a newly designed index structure for protein databases and associated optimizations to the BLASTP algorithm, we refactored BLASTP for modern multicore processors to achieve much higher throughput with an acceptable memory footprint for the database index. On Intel Haswell multicore CPUs, for a single query, single-threaded muBLASTP achieves up to a 4.41-fold speedup for the alignment stages and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, multithreaded muBLASTP achieves up to a 5.7-fold speedup for the alignment stages and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST.

The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm uses a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as query-indexed BLAST, i.e., NCBI BLAST, or they support only nucleotide sequence search, e.g., MegaBLAST. Because query indexing and database indexing pose different challenges and have different characteristics, the existing techniques for query-indexed search cannot be applied directly to database-indexed search. muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers hits identical to those returned by NCBI BLAST.
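For readers unfamiliar with the query-indexed approach, the following Python sketch shows the basic idea: build a lookup table from the words of the query, then scan a subject sequence against it to collect seed hits. It is a simplified illustration under stated assumptions, not NCBI BLAST's code. It matches exact words only, whereas BLASTP also seeds on similar ("neighboring") words, and it leaves out the extension and scoring stages. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_query_index(query, w):
    """Lookup table: each length-w word of the query -> its offsets in the query."""
    table = defaultdict(list)
    for q_pos in range(len(query) - w + 1):
        table[query[q_pos:q_pos + w]].append(q_pos)
    return table


def scan_subject(table, subject, w):
    """Scan one subject sequence and collect (query offset, subject offset) seeds.

    Exact word matches only; a real BLASTP search would go on to extend
    each seed into an alignment and score it.
    """
    hits = []
    for s_pos in range(len(subject) - w + 1):
        word = subject[s_pos:s_pos + w]
        # .get avoids inserting empty entries into the defaultdict while scanning.
        for q_pos in table.get(word, ()):
            hits.append((q_pos, s_pos))
    return hits


# Hypothetical toy query and subject sequence.
query = "MKVLAAG"
subject = "GGMKVLTT"
table = build_query_index(query, w=3)
hits = scan_subject(table, subject, w=3)
print(hits)
```

In this scheme the index is rebuilt for every query and every database sequence is scanned in full. A database-indexed search inverts the roles: the index is built once over the database and each query is looked up against it.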
