Tunneled Burrows-Wheeler Transform Encoding with Improved Indexing for Genomic Sequences Compression
Recently, the compression and indexing of DNA or genomic sequences experience high difficulty due to the rapid growth of large databases. The most common compression algorithms are not suitable for compressing and indexing the genomic sequences well. As a result, Run-Length Encoded (RLE) Burrows-Wheeler Transform (BWT)-based lossless compression algorithm has been preferred for genomic sequence data compression and indexing. But, it is still non-trivial to constrain BWT for the huge collection of genomic sequences. So, Tunneled BWT (TBWT) has been proposed that focuses only on width-maximal run-blocks and requires residual BWT with the bit vectors for encoding the TBWT. However, inverting a TBWT will not be able to correctly match a tunnel start or end to the corresponding tunnel when dealing with critical collisions due to the interactions of start or end intervals of blocks. Therefore in this work, a Tunneled BWT encoding with Improved Index (TBWT-II) is proposed to simplify the TBWT and construct a compressed FM-index with full functionality i.e., efficiently index the labeled texts. In this algorithm, two indexes are considered for a text with non-overlapping labels in which one can store the text in a BWT and the other can store the compressed label sequence in a Wavelet Tree (WT). The label sequence is taken in the order of the BWT (TLBW-index) which is robust to non-redundant text with almost random labels everywhere and requires a space related to the entropy of the labeled text. This index allows efficient text–label queries to count and find labeled patterns. Such patterns may be used on any labeled data such as DNA sequences. Finally, the experimental results on SCOPe 1.67 dataset shows the performance efficiency of proposed TBWT-II algorithm compared to the TBWT algorithm.