Enzyme Technology and Engineering Laboratory

With the advancement of sequencing technology, protein and gene sequence databases have grown ever larger, and analyzing and annotating such a huge data resource has become particularly important. BLAST is a widely used tool for sequence alignment analysis. It was released in 1990 and has been upgraded several times since then. Many researchers have also developed improved variants and related tools, such as PSI-BLAST, PHI-BLAST, DELTA-BLAST, BLAT, UBLAST, and MegaBLAST. These tools offer increased speed, increased accuracy, or reduced data requirements, and they play an important role in gene and protein data mining and analysis.

However, these BLAST methods are all based on forward-ordered (collinear) alignment, in which deletions or insertions are compensated for by adding gaps to achieve a full sequence alignment. This approach cannot handle the sequence rearrangements that genes and proteins undergo during evolution. Sequence rearrangement can arise in evolutionary processes driven by gene duplication. Gene duplication produces two or more copies of a gene, which allows one copy to mutate or even be temporarily inactivated, and this is the basis for the emergence of new gene functions. In subsequent evolution, a deletion may occur at the beginning of one copy and another at the end of the other copy, so that the two remaining partial genes reconstitute a complete gene.

As a result of this rearrangement, the structure of the protein encoded by the gene does not change significantly, but the protein may gain new functions. Researchers have discovered several such sequence rearrangements, for example one associated with a shift of substrate specificity from starch to sucrose among the glycoside hydrolases. Researchers have also engineered such rearrangements by linking the original N-terminus and C-terminus of a protein and cleaving the chain at a site within the original sequence, which produces a new N-terminus and C-terminus. The resulting reordering of the sequence is called circular permutation. Various researchers have systematically examined sequence rearrangement, and the results have been collected in the Circular Permutation Database (CPDB), which provides a large amount of information on circular permutations.
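The two routes to a rearranged sequence described above (joining the termini and cutting elsewhere, or tandem duplication followed by deletions at both ends) can be illustrated with a minimal Python sketch. The function names and the toy sequence are hypothetical and chosen only for illustration; they are not taken from CPDB or from any published tool.

```python
def circular_permutation(seq: str, cut: int) -> str:
    """Join the termini into a circle, then reopen the chain before position `cut`."""
    if not seq:
        return seq
    cut %= len(seq)
    return seq[cut:] + seq[:cut]


def permutation_by_duplication(seq: str, cut: int) -> str:
    """Duplicate in tandem, then delete the start of copy 1 and the end of copy 2."""
    if not seq:
        return seq
    cut %= len(seq)
    tandem = seq + seq                      # gene duplication
    return tandem[cut:cut + len(seq)]       # deletions at both ends


protein = "MKTAYIAKQR"                      # made-up toy sequence
permuted = circular_permutation(protein, 4)
print(permuted)                             # YIAKQRMKTA
print(permutation_by_duplication(protein, 4) == permuted)  # True
```

Both functions return the same string, which is why a tandem duplication followed by terminal deletions yields a sequence that a strictly collinear alignment cannot match end to end.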

Based on circular permutation, we developed CircBLASTp, which incorporates this biological strategy into BLAST. During a search, the sequence is first divided into seed fragments, which form an unordered dataset that can be regarded as a circular sequence dataset. Next, the starting sites of the query and subject sequences are located and used to generate new, reordered sequences. Finally, the Smith-Waterman algorithm is used to complete the pairwise alignment. We applied CircBLASTp to family GH70 of the glycoside hydrolases and found that it significantly improved the accuracy of sequence alignment and detected more rearranged sequences.
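The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the CircBLASTp implementation: the seed length `k`, the majority vote over seed offsets used to choose the starting site, and the match, mismatch, and gap scores are all assumptions introduced here. A real protein search would more likely use a substitution matrix and affine gap penalties.

```python
from collections import Counter, defaultdict


def seed_index(seq, k):
    """Step 1: cut the sequence into overlapping k-mer seeds (an unordered set)."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def best_rotation(query, subject, k=3):
    """Step 2 (assumed heuristic): vote for the circular offset shared by most seeds."""
    q_index = seed_index(query, k)
    votes = Counter()
    for s_pos in range(len(subject) - k + 1):
        for q_pos in q_index.get(subject[s_pos:s_pos + k], ()):
            votes[(q_pos - s_pos) % len(query)] += 1
    return votes.most_common(1)[0][0] if votes else 0


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Step 3: Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def circular_align(query, subject, k=3):
    """Rotate the query to the voted starting site, then align it to the subject."""
    shift = best_rotation(query, subject, k)
    rotated = query[shift:] + query[:shift]
    return shift, smith_waterman_score(rotated, subject)


query = "MKTAYIAKQRQISFVKSHFSRQ"            # made-up toy sequence
subject = query[9:] + query[:9]             # its circular permutant
direct = smith_waterman_score(query, subject)
shift, rotated_score = circular_align(query, subject)
print(shift, direct, rotated_score)
```

On this toy pair the direct local alignment can only cover one of the two rearranged segments, whereas the rotated query aligns with the subject over its full length and therefore scores higher.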