Optimization of the canonical Huffman codes for word-based natural language text compression
DOI:
https://doi.org/10.17721/1812-5409.2025/2.25Keywords:
Huffman, encoding, codes, canonical, text compressionAbstract
Canonical Huffman codes are widely used in data compression and allow fast decoding without compromising compression efficiency compared to classical Huffman codes. If the maximal length of a codeword does not exceed 11-12 bits or long codewords are very rare, any variable-length code, including Huffman codes, can be effectively decoded using a naive ’table’ method. Canonical Huffman codes should be used primarily if the encoded text consists of a relatively large number of long codewords. For example, this is the case of word-level natural language text compression, where words are the symbols of an alphabet. A. Moffat and A. Turpin proposed the fastest state-of-the-art algorithm for decoding canonical codes. We offer several optimizations of this algorithm, which, according to our experiments, can speed it up by 20-30% without reducing the compression ratio. Some of these optimizations are based on low-level code parallelization using the SIMD architecture of modern processors, while others propose combining classical decoding of canonical codes with a universal ’table’ decoding method. In addition to the above optimizations, the article discusses in detail the canonical Huffman codes, including the basic and optimized algorithms, as well as the universal ’table’ method for decoding arbitrary variable-length codes. The results of an experimental comparison of the performance of the basic and optimized algorithms are presented. The results of the study can be used in information retrieval and data archiving systems.
Pages of the article in the issue: 159 - 165
Language of the article: English
References
Anisimov, A. V., & Zavadskyi, I. O. (2017). Variable-length prefix codes with multiple delimiters. IEEE Transactions on Information Theory, 63(5), 2885–2985. https://doi.org/10.1109/TIT.2017.2674670
Brisaboa, N. R., Fariña, A., Navarro, G., & Esteller, M. F. (2003). (s,c)-dense coding: An optimized compression code for natural language text databases. In Nascimento, de Moura, & Oliveira (Eds.), String Processing and Information Retrieval (pp. 122–136). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_10
Choueka, Y., Klein, S. T., & Perl, Y. (1985). Efficient variants of huffman codes in high level languages. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (p. 122–-130). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/253495.342777
Huffman, D. A. (1952). A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9), 1098–1101. https://doi.org/10.1109/JRPROC.1952.273898
Liddell, M., & Moffat, A. (2006). Decoding prefix codes. Software Practice and Experience, 36, 1687–1710.https://doi.org/10.1002/spe.741
Milidiu, R., Laber, E., Moreno, L., & Duarte, J. (2003, March 18–20). A fast decoding method for prefix codes. In Storer & Cohn (Eds.), Data Compression Conference (p. 438). Snowbird, Utah, USA. https://doi.org/10.1109/DCC.2003.1194057
Moffat, A., & Turpin, A. (1997, October). On the implementation of minimum-redundancy prefix codes. IEEE Transactions on Communications, 45(10), 1200–1207. https://doi.org/10.1109/26.634683
Schwartz, E., & Kallick, B. (1964). Generating a canonical prefix encoding. Communications of the ACM, 7(3), 40–50. https://doi.org/10.1145/363958.363991
Zavadskyi, I. (2022, March 19–21). Binary-coded ternary number representation in natural language text compression. In Bilgin, Marcellin, Serra-Sagristà, & Storer (Eds.), Data Compression Conference (pp. 419–428). Snowbird, Utah, USA: IEEE. https://doi.org/10.1109/DCC52660.2022.00050
Zavadskyi, I., & Kovalchuk, M. (2023, September 26-28). Binary mixed-digit data compression codes. In Nardini, Pisanti, & Venturini (Eds.), String Processing and Information Retrieval (pp. 381–392). Cham: Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-43980-3_31
Downloads
Published
Issue
Section
License
Copyright (c) 2025 Igor Zavadskyi, Maksym Kovalchuk

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
