Huffman codes enable compression. It is a technique that can result in savings of 20 to 90% depending on the data being compressed. It scans the text to populate a table of frequencies of the occurrence of characters and then builds an optimal way of representing them. Let us say that the entire text was formed of three characters and that the first character appears the majority of the time, then there can be significant savings if there were fewer bits to represent the first character. Thus this technique uses a greedy strategy.
The characters can be represented by fixed-length codes or variable length codes. The variable-length codes will have more savings. That we know so far but how do we decode such compressed data. We make the assumption that the codes are such that no codeword is also a prefix of some other codeword. Such codes are called prefix codes. When reading a prefix code, we can now decipher the first character unambiguously. Then we can repeat the decoding.
The representation of the prefix code is possible with a binary tree where the characters are the leaves of the trees and the path from the root to leaf is the prefix code. The prefix is appended 0 for going to the left child and 1 to go to the right child. These are not Binary Search Trees because they need not be sorted and the internal ones don’t have character keys. Trees correspond to the coding schemes. The optimal code is always represented by a full binary tree.
Now that we know the prefix codes can be obtained by constructing the tree, we build it up in a bottom-up manner. We take a set of C leaves and perform a sequence of C-1 merging operations to create the final tree. A min priority queue Q, keyed on f, is used to identify the two least-frequent objects to merge together. The result of the merger of two objects is a new object whose frequency is the sum of those merged.
HUFFMAN(c)
N <- C
Q <-C
For I in range (n-1):
Do allocate a new node z
Left[z] = x = EXTRACT-MIN(Q) // removes from Q
Right[z] = y = EXTRACT-MIN(Q)
f[z] = f[x] + f[y]
INSERT(Q, Z)
Return EXTRACT-MIN(Q) // root of the tree
The characters can be represented by fixed-length codes or variable length codes. The variable-length codes will have more savings. That we know so far but how do we decode such compressed data. We make the assumption that the codes are such that no codeword is also a prefix of some other codeword. Such codes are called prefix codes. When reading a prefix code, we can now decipher the first character unambiguously. Then we can repeat the decoding.
The representation of the prefix code is possible with a binary tree where the characters are the leaves of the trees and the path from the root to leaf is the prefix code. The prefix is appended 0 for going to the left child and 1 to go to the right child. These are not Binary Search Trees because they need not be sorted and the internal ones don’t have character keys. Trees correspond to the coding schemes. The optimal code is always represented by a full binary tree.
Now that we know the prefix codes can be obtained by constructing the tree, we build it up in a bottom-up manner. We take a set of C leaves and perform a sequence of C-1 merging operations to create the final tree. A min priority queue Q, keyed on f, is used to identify the two least-frequent objects to merge together. The result of the merger of two objects is a new object whose frequency is the sum of those merged.
HUFFMAN(c)
N <- C
Q <-C
For I in range (n-1):
Do allocate a new node z
Left[z] = x = EXTRACT-MIN(Q) // removes from Q
Right[z] = y = EXTRACT-MIN(Q)
f[z] = f[x] + f[y]
INSERT(Q, Z)
Return EXTRACT-MIN(Q) // root of the tree
No comments:
Post a Comment