Paste: huffman in haskell

Author: komu
Mode: literate-haskell
Date: Tue, 27 Oct 2009 07:16:36
Plain Text |
> import Control.Arrow ((&&&))
> import Data.List (group, sort, insert)
> import qualified Data.Map as M
> import Test.QuickCheck (quickCheck, (==>), Property)

We could represent bit strings by a special type, but for simplicity we'll
simply assume that they consist of a list of zeros and ones.

> type BitString            = [Int]

The frequency-dictionary for a type is simply list of pairs where each object 
is associated with a frequency:

> type Freqs a              = [(a,Int)]

The tree used for encoding is a binary tree with frequencies of individual
elements of our alphabet in the leafs. To make matters speedier, the branches
also contain a pre-calculated sum of leafs in that subtree.

> data FreqTree a           = Leaf a Int
>                           | Branch (FreqTree a) (FreqTree a) Int
>    deriving (Eq, Show)

> frequency                 :: FreqTree a -> Int
> frequency (Leaf _ f)      = f
> frequency (Branch _ _ f)  = f

We can provide a partial ordering for trees by comparing their frequencies:

> instance Eq a => Ord (FreqTree a) where
>    compare l r = compare (frequency l) (frequency r)

To merge trees, we simply create a new branch that has the combined
frequency of the subtrees.

> merge                     :: FreqTree a -> FreqTree a -> FreqTree a
> merge l r                 = Branch l r (frequency l + frequency r)

To construct the encoding tree, we need to find out the frequencies of objects
in the input data:

> frequencies               :: Ord a => [a] -> Freqs a
> frequencies               = map (head &&& length) . group . sort

Building the tree is a simple process. First we'll create a list of leaf
nodes and sort it so that the most rare elements (trees with lowest frequency)
are the first. We'll continue merging the first trees of the list to new
trees and push the new trees onto their proper place in the list until we
are left with one node. This is our final tree.

Finally, we normalize the tree to handle degenerate cases where the whole 
tree contains only one leaf node: we'll put that leaf node under a branch
so that encoding and decoding can assume that all trees contain at least
one branch.

> buildTree                 :: Ord a => Freqs a -> FreqTree a
> buildTree                 = normalize . build . sort . map (uncurry Leaf)
>   where
>     build []                   = error "can't build tree without input"
>     build [t]                  = t
>     build (t1:t2:ts)           = build (insert (merge t1 t2) ts)
>     normalize t@(Branch _ _ _) = t
>     normalize t@(Leaf _ _)     = merge t t

The Huffman codes of leafs are their paths from root to elements. We can
create these simply by recursing down the tree and adding 0 to path for
every left and 1 for every right we take down the recursion.

> createDictionary          :: Ord a => FreqTree a -> M.Map a BitString
> createDictionary          = M.fromList . paths []
>   where
>     paths p  (Leaf a _)     = [(a,reverse p)]
>     paths p  (Branch l r _) = paths (0:p) l ++ paths (1:p) r

Now that we have a way to build a dictionary, we can simply look up the
encodings for different elements from the dictionary.

> encode                    :: Ord a => Freqs a -> [a] -> BitString
> encode d s                = concatMap encoder s
>   where
>     encoder a = maybe (error "unknown input") id (M.lookup a dict)
>     dict      = createDictionary (buildTree d)

When decoding, we can just consume bits from the input until we end up
in a leaf. Whenever we end up in the leaf, we have a new output elements
and start consuming bits from the root again.

> decode                    :: Ord a => Freqs a -> BitString -> [a]
> decode d s                = loop s tree
>   where
>     loop xs     (Leaf a _)     = a : loop xs tree
>     loop (0:xs) (Branch l _ _) = loop xs l
>     loop (1:xs) (Branch _ r _) = loop xs r
>     loop []     _              = []
>     loop _      _              = error "malformed bit-string"
>     tree                       = buildTree d

Finally, we can check with QuickCheck that decoding is inverse of encoding.

> prop_decoding_is_inverse_for_decoding :: [Int] -> Property
> prop_decoding_is_inverse_for_decoding xs = 
>     not (null xs) ==> decode fs (encode fs xs) == xs
>   where fs = frequencies xs

> test :: IO ()
> test = quickCheck prop_decoding_is_inverse_for_decoding

New Annotation