TY - GEN
T1 - Parametric Matching for Improved Data Compression
AU - Kruger, Dov
AU - Kumar, Yulia
AU - Li, J. Jenny
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Modern general-purpose compressors handle a wide variety of files but do not achieve high compression ratios on files that contain short sequences of delimiters interleaved with numeric data or, more generally, on interleaved data in which each sequence is not well correlated with the preceding bytes. We demonstrate Parametric Matching (PM), which substantially improves the compression of files in various structured languages, including PDF, SVG, and G-code. By de-interleaving and coalescing delimiters and storing data as delta-encoded, discretized binary, compression by a factor of 10 or more is possible. A Python prototype encodes files into a binary representation, which is then compressed using the Lempel-Ziv-Markov chain algorithm (LZMA) to store the binary tokens in a minimal number of bits. Table 1 shows a ratio of 6 for PDF files containing only text, which are first parsed and then recompressed using PM. For SVG, we demonstrate a factor of 8 to 10 for files including a randomized spiral and a US county map. For G-code, we compressed the Statue of Liberty, demonstrating that a high degree of compression can be achieved even when the layers differ. All times are less than 250 ms, even in our Python prototype.
KW - compression
KW - lzma
KW - parametric matching
UR - https://www.scopus.com/pages/publications/105006674018
U2 - 10.1109/DCC62719.2025.00070
DO - 10.1109/DCC62719.2025.00070
M3 - Conference contribution
AN - SCOPUS:105006674018
T3 - Data Compression Conference Proceedings
SP - 383
BT - Proceedings - DCC 2025
A2 - Bilgin, Ali
A2 - Fowler, James E.
A2 - Serra-Sagrista, Joan
A2 - Ye, Yan
A2 - Storer, James A.
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 Data Compression Conference, DCC 2025
Y2 - 18 March 2025 through 21 March 2025
ER -