Lempel-Ziv Algorithm: Parsing a String
The first type of compression that you will explore is a simple version of the Lempel-Ziv (LZ) algorithm, named after its inventors. This idea is at the heart of many of the 'zipping' programs used in PCs. In this lossless compression, data is too valuable for any of it to be lost. Therefore, encoding schemes are used to shrink the data in such a way that it can be returned to its original form.
There are literally thousands of different compression algorithms. The ones actually used in practice are extremely complicated, but for our purposes here, you will explore a simpler version.
The first step in LZ is to look at the data stream (we use _ for a blank space for clearness), for example:
If_one_doctor_doctors_another_doctor,_does_the_doctor_who_doctors_the_doctor_ doctor_the_doctor_the_way_the_doctor_he_is_doctoring_doctors?_Or_does_he_doctor_ the_doctor_the_way_the_doctor_who_doctors_doctors?
We "parse" it by inserting commas to split everything up into building blocks. (The commas we insert are not meant as punctuation.) We start at the beginning and we look at the input character by character. After each character we have to decide whether to put a comma or not. The rule is simple: if the string starting after the previous comma and ending with the character under consideration has been encountered before, then we don't put a comma, and proceed to the next character; if the string ending with the character under consideration, has not been encountered before, then we insert a comma after the character that we are considering, and we start again, now going on from this new comma. This means that every comma we insert will be preceeded by a character for which the string between the character and the previous comma was encountered earlier, but for which the longer string including the character was not encountered before.
We start with a comma, because the "empty string" wasn't seen before we started. Then we put a comma after the first I, because we hadn't seen that before. Then we put a comma after f, and after blank space and after o, and so on. When we come to the next blank space - this is the first character that repeats itself, so we add the next character d, and insert a comma after d and continue the process. The result looks like this:
,I,f,_,o,n,e,_d,oc,t,or,_do,c,to,r,s,_a,no,th,er,_doc,tor,,,_doe,s_,the,_doct,or_,w,h,o_, d,oct,ors,_t,he,_docto,r_,do,ct,or_t,he_,doc,tor_,the_,wa,y,_th,e_,doct,or_h,e_i,s_d, octo,ri,ng,_doctor,s?,_O,r_d,oe,s_h,e_d,octor,_the,_doctor_,the_w,a,y_,the_d,octor_, wh,o_d,octors,_doctors,?
Can you parse the following sequence? As you parse, pressing return in the parsed text window will tell you the correct part of your answer so far."
Once the data stream is parsed, we write it in a different form, as explained on the next page.