算数の圧縮 - ptoolisの日記

最近データの圧縮する方法を考えています。ぐたい的にどんな計算方式を使ってデータの用量を小さくするし読み込む時に圧縮されたデータから実際に入力されたデータに戻す事を調べたいです。The method above is known as "lossless" data compression and arithmetic coding is one such method. Basically, a sequence of characters comprising a message is encoded as some number. The number is chosen from a range of possible numbers which will uniquely encode the data. The ideal case provides a range containing some numbers with a string of zeroes at the end. These zeroes can be dropped and then added back on in decompression, allowing for greater bit savings. The compression algorithm seems to be based on number theory.
Basically, the formula, for a 6 character message is:
LOW = 6^5*c1 + f1*(6^4*c2+f2*(6^3*c3+f3*(6^2*c4+f4*(6*c5+f5*(1*c6)))))
where LOW is lower bound of range, ci is the cumulative frequency of the ith character in the message, and fi is the frequency of the ith character in the message.
In DABDDB, we let A=0,B=1,D=3 be the sequence of symbols. The cumulative frequency of A is 0, as no earlier letters exist in the sequence. The frequency of A is 1, as it occurs once. The cumulative frequency and frequency of B are 1 and 2, respectively. For D, the cumulative frequency is 3 (1 A and 2 Bs occur), and the frequency is 3.
HIGH = LOW + f1*f2*...*fn, where n is the number of characters in the message.
I was able to prove that, using any number in [HIGH,LOW] to encode the sequence of characters above, the decoding algorithm would correctly identify the first character D. The decoding algorithm takes the code number (adding any 0s back onto end, if they were removed) and divides by 6^5. The quotient (floored) is compared to intervals each letter is mapped to. As D has cumulative frequency of 3 and frequency of 3, it has interval [3,5]. The floor(code/6^5) quotient would fall in this interval. I did not prove that all letters would correctly be identified, but this seems likely, as the decoding is basically undoing the encoding.
What I am unsure of is:
1. how do we know that a HIGH,LOW range obtained using the formula will provide a number with enough trailing 0s to enable us to save bits substantially?
2. why is the HIGH value chosen to be LOW + PI(fi)?
Note that a 6 character message in ascii requires 6*8=48 bits, while the number 23671 requires 15 bits (above message if represented in straight base 6), while 25100->251 requires 8 bits. Shannon's formula for the theoretical limit of the message length(lower bound) computed via entropy indicates 9 bits are needed. Note that arithmetic coding is a variable-length entropy encoding, where less frequent characters require more bits and more frequent characters use fewer bits.