1Lossless compilers). But text can also be

1Lossless Data Compression Methods ofCompressing Text DocumentsHarihara Subramanian [email protected] of Information and Communication Security, FEECS, VSB-Technical Universityof Ostrava, 17. Listopadu 15, Ostrava, Czech RepublicAbstract – Data compression is the process of converting, modifying and encoding the bitsstructure of data in such a way that it consumes less space on disk. It reduces the storage sizeof one or more data instances or elements. Data compression is also known as source codingor bit-rate reduction. It works through several compressing techniques and softwaresolutions that utilize data compression algorithms to reduce the data size. Here, in this paper,I have discussed and provide a survey on basic lossless data compression techniques such asHuffman Coding Algorithm and LZ77, LZ78 (Lempel Zev Welch Algorithm) with examples.Index Terms – Data Compression, Lossless Compression, Huffman Coding Algorithm,LZ77 and LZ78 (Lempel Zev Welch Algorithm)I. INTRODUCTIONData Compression is a technique that transforms the data from one representation to anothernew compressed representation. The aim of compression algorithms is to produce a new file,as short as possible, containing the compressed version of same text. The term “Text” clearsthat it can be written in natural languages or can be texts usually generated by translators (likevarious types of compilers). But text can also be images or some other kinds of structures aswell provided the data are stored in linear files.The optimum of data compression techniques remains important even if mass storagesystems regularly because the amount of data grows accordingly. So, it is necessary to reducethe size of files and often need to reduce proportionally their transmission times. The size canbe reduced by simply remove the excessive information of particular text files.Benefits of compression? It reduces storage requirements simultaneously decreases overall execution time.? It provides a level of security against illicit monitoring.? It reduces the probability of transmission errors since fewer bits are transferred.? Compression provides a potential cost saving associated with sending less data overswitched telephone network where cost of call is usually based upon its duration.2II. COMPRESSION METHODS1. Huffman Coding AlgorithmHuffman coding is a lossless data compression algorithm. The idea is to assign variablelengthcodes to input characters; lengths of the assigned codes are based on the frequencies ofcorresponding characters 1. The most frequent character gets the smallest code and the leastfrequent character gets the largest code.The Huffman method is an optional statistical coding 2. It transforms the original codeused for characters of the text (ASCII code on 8 bits, for insurance). Coding the text is justreplacing each symbol by its new code word. The method works for any length of blocks(not only 8 bits), but the running time grows exponentially with the length.The Huffman coding algorithm uses the notion of prefix code 3. A prefix code is a set ofwords containing no word that is a prefix of another word of the set. The advantage of suchtype is that decoding is immediate 4.1.1. Steps followed for analyse? Step 1: Pick two letters x, y from alphabet A with the smallest frequencies andcreate a sub tree that has these two characters as leaves (Greedy Idea) and labelthe root of the sub tree as z.? Step 2: Set frequency f (z) = f(x) +f(y). Remove x, y and add z creating a newalphabet A´=A? {z}-{x, y} Note that |A´| = |A| – 1.? Repeat this procedure, called merge with new alphabet with only one symbol isleft.The resulting tree is called Huffman Coding1.2. Sample Huffman Coding Algorithm with ExampleConsider an Alphabet A with frequency distribution {f (a): a ? A}. The binaryHuffman tree is built with the help of priority queue, Q, of nodes, with labels(frequencies) as keys 5, 6.Huffman (A){ n = |A|;Q = A;for i = 1 to n – 1{ z = new code;Left z = Extract-Min (Q);Rightz = Extract-Min (Q);Fz = f leftz + f rightz;Insert (Q, z);}return Extract-Min (Q) root of the tree}3? Let A = {v/20, w/15, x/5, y/15, z/45} be the alphabet and its frequency distribution. Inthe first step algorithm merges x and y.0 1Alphabet is now become A1 = {v/20, w/15, h1/20, z/45}.? Now Huffman Coding merges v and w ( It can also merge w and n1)0 1 0 1New Alphabet becomes A2 = {h2/35, h1/20, z/45}? Now coding merges n1 and n2.0 10 1 0 1Now A3 will become A3 = {h3/55, z/45}4? Last Step, Coding Merges (h3, z) and form the root tree0 10 1120 1 0 1? Finally, Huffman code is obtained asv = 000, w = 001, x = 010, y = 011, z = 1This is the optimum (minimum-cost) prefix code for this distribution2. Lempel-Zev-Welch Algorithm – LZ77 and LZ78The LZW algorithm is a common compression technique. This coding algorithmmainly used in GIF and optionally in PDF and TIFF. It is a lossless data compression whereno data is lost during compressing 7. The algorithm is simple to implement and widely usedin UNIX file compression for determine utility compress.2.1. LZ77The compression algorithm was first developed by Jacob Ziv and Abraham Lempeland they presented their dictionary-based scheme in 1977 in the area of lossless datacompression. It exploits the fact that words and phrases within a text file are likely tobe repeated. When there will be a repetition, they encoded as a pointer to a previousoccurrence, with the pointer followed by the number of characters to be matched. Itdoesn’t require a prior knowledge of the source and simultaneously about thecharacteristics of the source 8.The encoder consists of two parts – (Search buffer and Look – ahead buffer), where itexamines the input sequence by pressing into service of sliding window. It consists ofthe sequence which is encoded in the form of a triple in which ‘o’determines an offset to the match,’1’ represents length of the match and ‘c’ denotesthe next symbol to be encoded. A null pointer generated as a pointer in case ofabsence of the match (both the offset and the match length equal to 0) and the firstsymbol in the look-ahead buffer ,i.e.(0,0, “character”)52.1.1. Sample LZ77 Algorithm and ExampleWhile (look-Ahead Buffer is not empty){Get a pointer (pointer, length) to longest match;if (length > 0){Output (position, longest match length, next symbol);Shift the window by (length+1) positions along;}Else{Output (0, 0, first symbol in the look-ahead buffer);Shift the window by 1 character along;}}Search Buffer Look-Ahead Buffer…a c c a b r a c a d a b r a r r a r r a c…Encoding of the stringabracadabrad Output tuple: (offset, length, symbol)Search Buffer Look-ahead Buffer12 characters compressed into 6tuples compression rate = (12*8)/ (6*(5+2+3)= 96/60 = 60%2.2. LZ78It is a dictionary –based compression algorithm which maintains an explicitdictionary. LZ78 consists of two elements: Index referring to the longest matchingdictionary entry and the first non-matching symbol. It also adds the index and symbolpair to the dictionary 9. The Logic behind the LZ78 is that when the symbol notfound, the code word recognizes the index value of 0 and it will add to the dictionary.With this technique LZ78 constructs the dictionary.7 6 5 4 3 2 1a b r a c ada… (0, 0, a)a b r a c a dab… (0, 0, b)a b r a c a d adr…. (0, 0, r)a b r a c a d a bra…. (3, 1, c)a b r a c a d a b r ad (2, 1, d)a b r a c a d a b r a d (7, 4,d)a d a b r a d62.2.1. Sample Algorithm with ExampleW: = NIL;While (there is input){K: = next symbol from input;If (wk exists in the dictionary){W: = wk;}Else{Output (index (w), K);Add wk to the dictionary;W: = NIL;}}Now, Consider an example encode the string ABBCBCABABCAABCAABusing LZ78 algorithm1 2 3 4 5 6 7ABBCBCABABCAABCAABThe compressed message is : (0, A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B).OutputIndexString(0, A)1A(0, B)2B(2, C)3BC(3, A)4BCA(2, A)5BA(4, A)6BCAA(6, B)7BCAAB7III. CONCLUSIONThe text compression techniques were successfully studied with the help of samplealgorithms and described briefly with their examples. The below tables show you the pros,cons and applications of Huffman coding algorithm and Lempel-Zev-Welch algorithmTable I: Comparison of LZ77 and LZ78S.No LZ77 LZ781 This algorithm works on past data This algorithm works on future data2 The LZ77 is bit slower than LZ78 It is bit faster than LZ773 The output format is triplet <0, 1, c> Output format is LZ78 is pair