Optimization of the canonical Huffman codes for word-based natural language text compression

Authors

  • Igor Zavadskyi Taras Shevchenko National University of Kyiv https://orcid.org/0000-0002-4826-5265
  • Maksym Kovalchuk Taras Shevchenko National University of Kyiv

DOI:

https://doi.org/10.17721/1812-5409.2025/2.25

Keywords:

Huffman, encoding, codes, canonical, text compression

Abstract

Canonical Huffman codes are widely used in data compression and allow fast decoding without compromising compression efficiency compared to classical Huffman codes. If the maximal length of a codeword does not exceed 11-12 bits or long codewords are very rare, any variable-length code, including Huffman codes, can be effectively decoded using a naive ’table’ method. Canonical Huffman codes should be used primarily if the encoded text consists of a relatively large number of long codewords. For example, this is the case of word-level natural language text compression, where words are the symbols of an alphabet. A. Moffat and A. Turpin proposed the fastest state-of-the-art algorithm for decoding canonical codes. We offer several optimizations of this algorithm, which, according to our experiments, can speed it up by 20-30% without reducing the compression ratio. Some of these optimizations are based on low-level code parallelization using the SIMD architecture of modern processors, while others propose combining classical decoding of canonical codes with a universal ’table’ decoding method. In addition to the above optimizations, the article discusses in detail the canonical Huffman codes, including the basic and optimized algorithms, as well as the universal ’table’ method for decoding arbitrary variable-length codes. The results of an experimental comparison of the performance of the basic and optimized algorithms are presented. The results of the study can be used in information retrieval and data archiving systems.

Pages of the article in the issue: 159 - 165

Language of the article: English

References

Anisimov, A. V., & Zavadskyi, I. O. (2017). Variable-length prefix codes with multiple delimiters. IEEE Transactions on Information Theory, 63(5), 2885–2985. https://doi.org/10.1109/TIT.2017.2674670

Brisaboa, N. R., Fariña, A., Navarro, G., & Esteller, M. F. (2003). (s,c)-dense coding: An optimized compression code for natural language text databases. In Nascimento, de Moura, & Oliveira (Eds.), String Processing and Information Retrieval (pp. 122–136). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_10

Choueka, Y., Klein, S. T., & Perl, Y. (1985). Efficient variants of huffman codes in high level languages. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (p. 122–-130). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/253495.342777

Huffman, D. A. (1952). A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9), 1098–1101. https://doi.org/10.1109/JRPROC.1952.273898

Liddell, M., & Moffat, A. (2006). Decoding prefix codes. Software Practice and Experience, 36, 1687–1710.https://doi.org/10.1002/spe.741

Milidiu, R., Laber, E., Moreno, L., & Duarte, J. (2003, March 18–20). A fast decoding method for prefix codes. In Storer & Cohn (Eds.), Data Compression Conference (p. 438). Snowbird, Utah, USA. https://doi.org/10.1109/DCC.2003.1194057

Moffat, A., & Turpin, A. (1997, October). On the implementation of minimum-redundancy prefix codes. IEEE Transactions on Communications, 45(10), 1200–1207. https://doi.org/10.1109/26.634683

Schwartz, E., & Kallick, B. (1964). Generating a canonical prefix encoding. Communications of the ACM, 7(3), 40–50. https://doi.org/10.1145/363958.363991

Zavadskyi, I. (2022, March 19–21). Binary-coded ternary number representation in natural language text compression. In Bilgin, Marcellin, Serra-Sagristà, & Storer (Eds.), Data Compression Conference (pp. 419–428). Snowbird, Utah, USA: IEEE. https://doi.org/10.1109/DCC52660.2022.00050

Zavadskyi, I., & Kovalchuk, M. (2023, September 26-28). Binary mixed-digit data compression codes. In Nardini, Pisanti, & Venturini (Eds.), String Processing and Information Retrieval (pp. 381–392). Cham: Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-43980-3_31

Downloads

Published

2025-12-23

Issue

Section

Computer Science and Informatics

How to Cite

Zavadskyi, I., & Kovalchuk, M. (2025). Optimization of the canonical Huffman codes for word-based natural language text compression. Bulletin of Taras Shevchenko National University of Kyiv. Physics and Mathematics, 81(2), 159-165. https://doi.org/10.17721/1812-5409.2025/2.25