\documentclass[a4paper,12pt]{article} \usepackage[utf8]{inputenc} \usepackage{graphicx} \usepackage[superscript]{cite} \usepackage{color} \usepackage{listings} \definecolor{light-gray}{gray}{0.95} \lstdefinestyle{normal}{ basicstyle=\footnotesize\ttfamily, numbers=none, backgroundcolor=\color{light-gray}, showspaces=false, showstringspaces=false, showtabs=false, tabsize=2, captionpos=b, escapechar=¤, frame=single, framesep=0.2cm, xleftmargin=2mm, xrightmargin=2mm, breaklines=true, emph={} } \lstset{style=normal} \lstdefinestyle{javascript}{ language=JavaScript } \lstdefinestyle{haskell}{ language=Haskell } \lstset{style=haskell} \title{HQMp3 -- MP3 decoder in Haskell} \author{Tobias Olausson \\ \texttt{\small{olaussot@student.chalmers.se}} \and Anders Karlsson \\ \texttt{\small{andekar@student.chalmers.se}} } \date{ \rule{0.8 \linewidth}{0.5mm} \\[3mm] University of Gothenburg \\ \small{\today} } \begin{document} \maketitle \begin{abstract} This project aims to show how one can parse and decode binary formats in Haskell, showcased by implementing an mp3 decoder. The decoder features its own library for handling of binary formats, a fast huffman decoder and is written entirely in Haskell. \end{abstract} \tableofcontents TODO: Lägg all kod i captions, Lägg alla referenser till funktioner i en texttt, Fixa bilder, Skriv en snutt om björns decoder, Lägg alla tabeller i captions. \section{Motivation} MP3 decoders exists today in most languages, however there does not exist a fast one written in Haskell. It does exists one written by Björn Edström \cite{bjorn}, but it is very slow, and bitwise poorly implemented (sorry Björn). \section{Background} \subsection{The MP3 format} The MP3 format is probably the most well-known and widespread format for encoding audio. It was engineered by the MPEG group as part of the MPEG-1 standard ISO-11172 which was published in 1993. As part of this standard, the document describing MP3 is ISO-11172-3 \cite{wikimp3,wikimpeg1}. The data in an mp3 file is divided into frames. Each frame consists of a header, side info, and main data. We distinguish between physical frames, and logical frames. Physical frames are all data between two headers (unsorted), whereas a logical frame is a header and it's associated frame data (sorted). The distance in bits between two consecutive physical frames are constant once you know bitrate and sample rate. However, there might not be data to fill all those bits, in which case the data for the \textit{next} frame is inserted before its frame. This implies that one needs to keep track of two consecutive frames at a time in order to know where data begins and ends. To avoid redundant information we only briefly describe the format here. If you feel that there are a lot of details missing, then don't worry -- we will cover all that in the implementation part. \subsubsection{Header} \label{sec:header} Bild här \\ \begin{itemize} \item{12 bits} All ones, sync word. \item{3 bits} 101 for MPEG layer 3. \item{1 bit} 0 if CRC follows the frame. \item{4 bits} Bitrate index. \item{2 bits} Sample rate index. \item{1 bit} 1 if frame is padded. \item{1 bit} Private bit - not used. \item{2 bits} Mode - Mono/Stereo/DualChannel/JointStereo. \item{2 bits} Mode extension - only used with JointStereo. \item{2 bits} Copyright data. \item{2 bits} Emphasis. \end{itemize} There are lots of data here that does not matter for the decoding process, and some that does not matter because of limits in our decoder. We restricted our decoder to files not using JointStereo, so the Mode Extension bits are not used. All bits are however checked for correctness. The most important bits for decoding are bitrate index (BT) and sample rate index (SR). The decoding process can be divided into two parts: unpacking and decoding. For the unpacking, the CRC info and padding data are also important. One might think that the header would contain all data needed for decoding, but that is absolutely not the case. In fact, the only useful data for decoding that is contained in the header are SR and BT. Then some data in the header is of course useful for unpacking, but unpacking is a rather small part of the whole decoding process. \subsubsection{Side Info} The side info is the part that contains most information on how to unpack the compressed audio and some information on how to decode the data. However, there is some information usable for decoding that is \textit{not} included in the side info, for some unknown reason. Some things to note in this data are: \begin{itemize} \item main\_data\_begin - a pointer to the audio data \item scfsi bands - scale factor information for each data part. \item part2\_3\_length - the amount of audio data for each data part. \end{itemize} The rest of this section contains only decoding-specific data, which we will cover more in the implementation part. \subsubsection{Main Data} The main data is further divided into two granules, which may be divided further, depending on the number of channels (1 or 2). For each of these data parts, the main data contains scale information for decoding, and huffman data. \subsubsection{ID3} ID3 tags may be present within an mp3 file. Technically, it may appear anywhere in the file, but is by convention always in the beginning. Our decoder ignores id3 tags, however a library for parsing these tags was written for future use. \section{Implementation} We wanted different parts of the process to be separate in the code, so we divided the decoder into two main modules: Unpack.hs and Decode.hs. \subsection{Unpacking} \begin{lstlisting} unpackMp3 :: L.ByteString -> [SideInfo BS.BitString] unpackMp3 = runBitGet first . BS.convert where first = do skipId3 init <- lookAhead (runMaybeT readHeader) unpackFrames init \end{lstlisting} What is going on here? The function expects a lazy bytestring as input (this is the contents of the mp3 file), and finally gives us a list of SideInfo datatypes parametrized by a BitString. There are lots of stuff that needs to be explained, clearly! First, what is the SideInfo type? \begin{lstlisting} data SideInfo a = Single { sampleRate :: !Int , dataPointer :: !Int -- 9 bits , scales :: ![Bool] , gran :: !(Granule a, Granule a) } | Dual { sampleRate :: !Int , dataPointer :: !Int -- 9 bits , scales' :: ![Bool] , gran0 :: !(Granule a, Granule a) , gran1 :: !(Granule a, Granule a) } deriving Show \end{lstlisting} So, it holds some values (sampleRate, dataPointer), some scale data information, and two or four Granules. There is a type of headers aswell, but that is only used intermediately when unpacking, so it is of rather little interest. We might describe it later. What is a Granule then? This is where the good stuff is! \begin{lstlisting} data Granule a = Granule { bigValues :: !Int , globalGain :: !Double , scaleFacCompress :: !Int , windowSwitching :: !Bool -- windows , blockType :: !Int , mixedBlock :: !Bool , tableSelect_1 :: !Int , tableSelect_2 :: !Int -- window == 0 , tableSelect_3 :: !Int -- window == 1 , subBlockGain1 :: !Double , subBlockGain2 :: !Double , subBlockGain3 :: !Double , preFlag :: !Bool , scaleFacScale :: !Bool , count1TableSelect :: !Bool -- calculated from previous values , region0Start :: !Int , region1Start :: !Int , region2Start :: !Int , mp3Data :: a } deriving Show \end{lstlisting} So, it contains lots of information on how to decode, and then it contains an \texttt{a}. Recall that the result from unpackMp3 was parametrized by a BitString, so the mp3Data in the Granule would be a BitString in this case. BitString is a library we wrote during the project. It works like ByteString, but has support for bit-level operations, hence the name. We will cover BitString in section \ref{sec:bitstring}. We saw in \texttt{unpackMp3} that it read one frame, and then called \texttt{unpackFrames} with that frame as an argument. \texttt{unpackFrames} simply loops and accumulates frames until there are no more frames to read. The interesting function here is \texttt{readHeader}, which looks as follows: \begin{lstlisting} readHeader :: MaybeT BitGet (MP3Header Int) readHeader = do h <- lift syncWord if not h then fail "" else do prot <- lift $ liftM not getBit brate <- getBitRate freq <- getFreq padd <- lift getBit lift $ skip 1 -- private bit mode <- getMode mext' <- getModeExt lift $ skip 4 -- copyright, original & emphasis when prot $ lift $ skip 16 sinfo <- lift $ readSideInfo mode freq let size = ((144 * 1000 * brate) `div` freq + if padd then 1 else 0) * 8 f' = if prot then 2 else 0 ff = case mode of Mono -> 17 _ -> 32 hsize = (f' + ff + 4) * 8 mext = case mode of JointStereo -> mext' _ -> (False, False) return $ MP3Header mode mext size hsize sinfo \end{lstlisting} So, comparing it to the description of an mp3 header in section \ref{sec:header} we see that the only change is the size variable, which calculates the length in bits to the next header. This function however, does not read in any data, and thus does not produce a logical frame. To do that, we need to first read side info and then look ahead to the next header in order to determine how long we should read. The function that reads the data can be found in the source code of Unpack.hs, we do not include it here. However, we do include \texttt{readSideInfo}, since it is by far more interesting even though it is long. \begin{lstlisting} readSideInfo :: MP3Mode -> Int -> BitGet (SideInfo Int) readSideInfo mode freq = do dataptr <- getInt 9 skipPrivate scaleFactors <- getScaleFactors case mode of Mono -> do g0 <- getGranule g1 <- getGranule return $ Single freq (dataptr * 8) scaleFactors (g0, g1) _ -> do g0 <- getGranule g1 <- getGranule g2 <- getGranule g3 <- getGranule return $ Dual freq (dataptr * 8) scaleFactors (g0, g2) (g1, g3) where skipPrivate = case mode of Mono -> skip 5 _ -> skip 3 getScaleFactors = case mode of Mono -> replicateM 4 getBit _ -> replicateM 8 getBit getGranule = do part23len <- getInt 12 bigValues <- getInt 9 globalGain <- getInt 8 scaleFacCompress <- getInt 4 w <- getBit -- windowSwitching -- blockType bt <- getInt $ if w then 2 else 0 -- mixedBlock mb <- if w then getBit else return False tableSelect0 <- getInt 5 tableSelect1 <- getInt 5 tableSelect2 <- if w then return 0 else getInt 5 subBlockGain0 <- if w then liftM fi (getInt 3) else return 0 subBlockGain1 <- if w then liftM fi (getInt 3) else return 0 subBlockGain2 <- if w then liftM fi (getInt 3) else return 0 region0Count <- if w then return (reg0 mb bt) else getInt 4 region1Count <- if w then return 0 else getInt 3 let r0count = if w then (if bt == 2 then 8 else 7) else region0Count r1count = if w then 20 - r0count else region1Count sbTable = tableScaleBandBoundary freq r1bound = sbTable $ r0count + 1 r2bound = sbTable $ r0count + 1 + r1count + 1 bv2 = bigValues * 2 reg0len = if bt == 2 then min bv2 36 else min bv2 r1bound reg1len = if bt == 2 then min (bv2 - reg0len) 540 else min (bv2 - reg0len) (r2bound - reg0len) reg2len = if bt == 2 then 0 else bv2 - (reg0len + reg1len) preFlag <- getBit scaleFacScale <- getBit count1TableSelect <- getBit return $ Granule bigValues (mp3FloatRep1 globalGain) scaleFacCompress w bt mb tableSelect0 tableSelect1 tableSelect2 (mp3FloatRep2 subBlockGain0) (mp3FloatRep2 subBlockGain1) (mp3FloatRep2 subBlockGain2) preFlag scaleFacScale count1TableSelect reg0len reg1len reg2len part23len reg0 False 2 = 8 reg0 _ _ = 7 \end{lstlisting} So, lots of stuff that is done here are \textit{not} mentioned in the standard, but is stuff that one have to magically know (or see in somebody else's code, like we did). However, this is what a granule looks like, except for the data that is read in as stated above. Most of the stuff here is of little interest from a programming point of view, however all of them are used when decoding. The variables named \texttt{reg\{0,1,2}len are a bit more interesting though, since they control the number of times one has to invoke the huffman decoder. These variables are used together with \texttt{tableSelect\{0,1,2\}} that tells the decoder which huffman table that was used when compressing the audio data. \subsection{Decoding} \subsubsection{Huffman} We have had two approaches to huffman decoding; binary trees, and lookup tables (arrays). The first approach was the most intuitive one in which we read one bit at a time, and walked the tree until we had read a whole code word and reached a leaf. This was rather elegant and the code for doing so was no more than $\sim$10 lines. On the downside, this approach was very slow, and at that time the huffman decoder took almost 50\% of the total running time. We did some thinking, and looking at other people's ideas we found out that one could make rather large tables of size $2^n$ in the largest code word. Of course this was not practical, since the longest code word in one of the huffman tables was 19 bits, so we ended up with a 13MB source code file which was impossible to compile. We did some further thinking at this point, and came up with another idea! How about if we fix some code word length for a table (say 8), and for those code words that were longer a new table was returned. For those code words that were shorter, these were padded with all possible combinations to make it of the right length. It might be easier to understand with an example! \begin{tabular}{| l | l |} \hline Code Word & Value \\ \hline \hline 1 & 5 \\ 000 & 7 \\ 001 & 3 \\ 010 & 2 \\ 0111 & 4 \\ 01101 & 1 \\ 01100 & 8 \\ \hline \end{tabular} \\ In this case, one might choose code word length three, which we will use in this example. The lookup table generated from this would look like this: \begin{tabular}{| l | l |} \hline Code Word & Value \\ \hline \hline 100 & 5 \\ 101 & 5 \\ 110 & 5 \\ 111 & 5 \\ 000 & 7 \\ 001 & 3 \\ 010 & 2 \\ \hline 011 & \begin{tabular}{l | l} 10 & 4 \\ 11 & 4 \\ 01 & 1 \\ 00 & 8 \\ \end{tabular} \\ \hline \end{tabular} \\ So what happened here was that the one was expanded to fill all code words of length three starting with a one. The words that were of length three already were left as is, but the ones longer had one common factor of size three, 011. So if one reads 011, a new table is returned, where all code words are expanded to the longest one (two in this case). We see already here that this is much smaller than having $2^5$ entries (we have almost a three-fold in size here). With this approach, we could read more than one bit at the time. What we did was to read the common code word length number of bits. This saved time both in reading, and it saved us the traversal of a data structure, since it was replaced with an array lookup. Together with some other optimizations we lowered the amount of time taken by the huffman decoder to $\sim$8\%. \subsubsection{Audio Decoding} All audio is represented as floating-point numbers. Now, representing floats in a binary file format is not very space efficient, so the huffman decoder decodes integers instead of floats. These integers then has to be \textit{requantized}. \\ At this point we were starting to get low on time. A lot of time had gone to parse the format and to huffman decode, largely due to the rather unspecific specification in ISO 11172-3. Therefore, some of the parts below will not be very well described, basically because we did not have time to investigate in how they work. Also, we will not show much of this code, since we did not write it. \\ So, the \textit{requantization} is basically a transformation of all integers decoded by the huffman decoder into floats. How this is done depends on blocktype and windowswitching. Depending on those, different tables are for obtaining values that are multiplied by the integer from huffdecoding and also multiplied by different other more or less special values depending on blocktype and windowswitching. \\ Next processing step is the reordering. During encoding, some parts of the data may be reordered in order to increase compression. The reordering basically reverses this process so that the actual audio is correctly ordered. Whenever the granule only has long blocks, this reordering is skipped. In the code, this step also pads the list of samples so that it is always of length 576 for simplicity in the further steps. \\ % TODO: Describe these, and that's about it for this section AliasAddition, IMDCT, FrequencyInvert, SynthethisFilterBank \\ To further compress the audio, the encoder passes the data through an \textit{Inverse Modified Discrete Cosine Transform} (IMDCT). As stated in ISO 11172-3, the transformation for each sample is done as follows: \[ x_i = \sum_{k=0}^{\frac{n}{2} - 1}{X_k * cos(\frac{\pi}{2n}(2i + 1 + \frac{n}{2})(2k + 1))} \textit{ for i=0 to n-1} \] In the code that we plugged in to, the IMDCT was written as a piece of C code, as was the synthethis filterbank. We thought that it would be much nicer to have that code written in Haskell, so we rewrote both those files into Haskell versions. For the IMDCT that went extremely smooth: \begin{lstlisting} -- Straightforward translation from the C code. imdct :: Int -> [Double] -> [Double] imdct 18 xs = imdct18 xs imdct pts xs = map (\n -> sum $ zipWith (subone n) xs [0..pts-1]) [0..2*pts-1] where subone :: Int -> Double -> Int -> Double subone n y k = y * (cos $ pipts * (fi n + 0.5 + nhalf) * (fi k + 0.5) ) pipts = pi / (fi pts) nhalf = (fi pts) / 2.0 -- Straightforward translation from the C code, elegant! imdct18 :: [Double] -> [Double] imdct18 xs = map (\s -> sumZip (*) xs s) lookupIMDCT where -- 36x18 matrix lookupIMDCT :: [[Double]] lookupIMDCT = [[ cos $ (pi / 18.0) * n * k | k <- [0.5, 1.5 .. 17.5]] | n <- [9.5, 10.5 .. 44.5]] -- Instead of specialize sumZip _ [] _ = 0 sumZip _ _ [] = 0 sumZip f (a:as) (b:bs) = let acc = (f a b) in acc `seq` acc + sumZip f as bs \end{lstlisting} As you can see, there was a special version when there were only 18 samples to transform, and we have not investigated in how that code was written, but we just translated it to Haskell, which in our opinion went very well. For performance reasons we wrote sumZip, which would not be needed otherwise. This was however quite slow compared to the C version, but that was not a big concern when we first wrote it, the goal was to have a Haskell version first, and if there were time, to optimise. \section{Results} Beauty, Slowness\ldots \subsection{bitstring} \label{sec:bitstring} Probably the best library I have ever seen \subsection{huffman} \label{sec:huffman} Probably the most unneccessary library I have ever written. Remove? maybe some different approaches should be discussed here? Ulf and Koen had their ideas that we discussed and ignored. \section{Future Work} Rewrite the back-end to use vector, optimise further. Add support for all modes that are currently not supported. Also, other implementations of the huffman library may be considered. \begin{thebibliography}{99} \bibitem{bjorn} Let's build an MP3-decoder! http://blog.bjrn.se/2008/10/lets-build-mp3-decoder.html \bibitem{wikimp3} Wikipedia on MP3: http://en.wikipedia.org/wiki/Mp3 \bibitem{wikimpeg1} Wikipedia on MPEG-1: http://en.wikipedia.org/wiki/MPEG-1 \bibitem{standard} ISO-11172-3 Information technology -- Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s -- Part 3: Audio \end{thebibliography} \end{document}