Contents

Acknowledgements

The author would like to thank first and foremost Dr. Paul Vitányi for his elaborate feedback and tremendous technical contributions to this work. Next I thank Dr. Peter Grünwald for ample feedback. I also thank my colleagues John Tromp and Ronald de Wolf. I thank my friends Dr. Kaihsu Tai and Ms. Anna Lissa Cruz for extensive feedback and experimental inputs. This thesis is dedicated to my grandparents, Edwin and Dorothy, for their equanimity and support. It is further dedicated in spirit to the memories of my mother, Theresa, for her compassion, and to my father, Salvatore for his deep insight and foresight.

This work was supported in part by the Netherlands BSIK/BRICKS project, and by NWO project 612.55.002, and by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

Papers on Which the Thesis is Based

Chapter 2 is introductory material, mostly based on M. Li and P.M.B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer–Verlag, New York, second edition, 1997.
Chapter 3 is based on R. Cilibrasi and P. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523-1545, 2005, as well as M.Li, X. Chen, X. Li, B. Ma and P. Vitányi. The similarity metric. IEEE Trans. Information Theory, 50(12):3250–3264, 2004. Section ?? is based on unpublished work by R. Cilibrasi.
Chapter 4 is based on R. Cilibrasi and P.M.B. Vitányi. A new quartet tree heuristic for hierarchical clustering, IEEE/ACM Trans. Comput. Biol. Bioinf., Submitted. Presented at the EU-PASCAL Statistics and Optimization of Clustering Workshop, London, 2005,
http://arxiv.org/abs/cs.DS/0606048
Chapter 5 is based on unpublished work by R. Cilibrasi.
Chapter 6 is based on

R. Cilibrasi and P. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523-1545, 2005;

R. Cilibrasi, P.M.B. Vitányi, and R. de Wolf. Algorithmic clustering of music based on string compression. Computer Music Journal, pages 49-67. A preliminary version appeared as

R. Cilibrasi, R. de Wolf, P. Vitányi, Algorithmic clustering of music, Proc IEEE 4th International Conference on Web Delivering of Music (WEDELMUSIC 2004), IEEE Comp. Soc. Press, 2004, 110-117.

This work was reported in among others: “Software to unzip identity of unknown composers, New Scientist, 12 April 2003, by Hazel Muir. “Software sorts tunes,” Technology Research News, April 23/30, 2003, by Kimberly Patch; and “Classer musiques, langues, images, textes et genomes,” Pour La Science, 317(March 2004), 98–103, by Jean-Paul Delahaye , (Pour la Science = Edition francaise de Scientific American).

Chapter 7 is based on

R. Cilibrasi, P.M.B. Vitanyi, Automatic meaning discovery using Google,
http://xxx.lanl.gov/abs/cs.CL/0412098 (2004); followed by conference versions

R. Cilibrasi and P.M.B. Vitányi, Automatic Extraction of Meaning from the Web, 2006 IEEE International Symposium on Information Theory (ISIT 2006), Seattle, 2006; and

R. Cilibrasi, P.M.B. Vitányi, Similarity of objects and the meaning of words, Proc. 3rd Conf. Theory and Applications of Models of Computation (TAMC), 15-20 May, 2006, Beijing, China. Lecture Notes in Computer Science, Vol. 3959, Jin-Yi Cai, S. Barry Cooper, and Angsheng Li (Eds.), 2006; to the journal version

R. Cilibrasi and P.M.B. Vitányi. The Google similarity distance, IEEE Transactions on Knowledge and Data Engineering, To appear.

The supporting experimental data for the binary classification experimental comparison with WordNet can be found at
http://www.cwi.nl/ ~cilibrar/googlepaper/appendix.eps

This work was reported in, among others, “A search for meaning,” New Scientist, 29 January 2005, p.21, by Duncan Graham-Rowe; on the Web in “Deriving semantics meaning from Google results,” Slashdot— News for nerds, Stuff that matters, Discussion in the Science section, 29 January, 2005.

Chapter 8 is based on T. Roos, T. Heikkila, R. Cilibrasi, and P. Myllymäki. Compression-based stemmatology: A study of the legend of St. Henry of Finland, 2005. HIIT technical report,
http://cosco.hiit.fi/Articles/hiit-2005-3.eps
Chapter 9 is based on unpublished work by R. Cilibrasi.
Chapter 10 describes the CompLearn system, a general software tool to apply the ideas in this Thesis, written by R. Cilibrasi, and explains the reasoning behind it; see
http://complearn.org/ for more information.

Chapter 1  Introduction

But certainly for the present age, which prefers the sign to the thing signified, the copy to the original, representation to reality, the appearance to the essence... illusion only is sacred, truth profane. Nay, sacredness is held to be enhanced in proportion as truth decreases and illusion increases, so that the highest degree of illusion comes to be the highest degree of sacredness. –Feuerbach, Preface to the second edition of The Essence of Christianity

1.1  Overview of this thesis

This thesis concerns a remarkable new scientific development that advances the state of the art in the field of data mining, or searching for previously unknown but meaningful patterns in fully or semi-automatic ways. A substantial amount of mathematical theory is presented as well as very many (though not yet enough) experiments. The results serve to test, verify, and demonstrate the power of this new technology. The core ideas of this thesis relate substantially to data compression programs. For more than 30 years, data compression software has been developed and significantly improved with better models for almost every type of file. Until recently, the main driving interests in such technology were to economize on disk storage or network data transmission costs. A new way of looking at data compressors and machine learning allows us to use compression programs for a wide variety of problems.

In this thesis a few themes are important. The first is the use of data compressors in new ways. The second is a new tree visualization technique. And the third is an information-theoretic connection of a web search engine to the data mining system. Let us examine each of these in turn.

1.1.1  Data Compression as Learning

The first theme concerns the statistical significance of compressed file sizes. Most computer users realize that there are freely available programs that can compress text files to about one quarter their original size. The less well known aspect of data compression is that combining two or more files together to create a larger single conglomerate archive file prior to compression often yields better compression in aggregate. This has been used to great advantage in widely popular programs like tar or pkzip, combining archival and compression functionality. Only in recent years have scientists begun to appreciate the fact that compression ratios signify a great deal of important statistical information. All of the experiments in this thesis make use of a group of compressible objects. In each case, the individual compressed sizes of each object are calculated. Then, some or all possible pairs of objects are combined and compressed to yield pairwise compressed sizes. It is the tiny variations in the pairwise compressed sizes that yields the surprisingly powerful results of the following experiments. The key concept to realize is that if two files are very similar in their contents, then they will compress much better when combined together prior to compression, as compared to the sum of the size of each separately compressed file. If two files have little or nothing in common, then combining them together would not yield any benefit over compressing each file separately.

Although the principle is intuitive to grasp, it has surprising breadth of applicability. By using even the simplest string-matching type compression made in the 1970’s it is possible to construct evolutionary trees for animals fully automatically using files containing their mitochondrial gene sequence. One example is shown in Figure ??. We first construct a matrix of pairwise distances between objects (files) that indicate how similar they are. These distances are based on comparing compressed file sizes as described above. We can apply this to files of widely different types, such as music pieces or genetic codes as well as many other specialized domains. In Figure ??, we see a tree constructed from the similarity distance matrix based on the mitochondrial DNA of several species. The tree is constructed so that species with “similar” DNA are “close by” in the tree. In this way we may lend support to certain evolutionary theories.

Although simple compressors work, it is also easy to use the most advanced modern compressors with the theory presented in this thesis; these results can often be more accurate than simpler compressors in a variety of particular circumstances or domains. The main advantage of this approach is its robustness in the face of strange or erroneous data. Another key advantage is the simplicity and ease of use. This comes from the generality of the method: it works in a variety of different application domains and when using general-purpose compressors it becomes a general-purpose inference engine. Throughout this thesis there is a focus on coding theory and data compression, both as a theoretical construct as well as practical approximations thereof through actual data compression programs in current use. There is a connection between a particular code and a probability distribution and this simple theoretical foundation allows one to use data compression programs of all types as statistical inference engines of remarkable robustness and generality. In Chapter 3, we describe the Normalized Compression Distance (NCD), which formalizes the ideas we have just described. We report on a plethora of experiments in Chapter 6 showing applications in a variety of interesting problems in data mining using gene sequences, music, text corpora, and other inputs.

1.1.2  Visualization

Custom open source software has been written to provide powerful new visualization capabilities. The CompLearn software system (Chapter ??) implements our theory and with it experiments of two types may be carried out: classification or clustering. Classification refers to the application of discrete labels to a set of objects based on a set of examples from a human expert. Clustering refers to arrangement of objects into groups without prior training or influence by a human expert. In this thesis we deal primarily with hierarchical or nested clustering in which a group of objects is arranged into a sort of binary tree. This clustering method is called the quartet method and will be discussed in detail later.

In a nutshell, the quartet method is a way to determine a best matching tree given some data that is to be understood in a hierarchical cluster. It is called the quartet method because it is based on the smallest unrooted binary tree, which happens to be two pairs of two nodes for a total of four nodes comprising the quartet. It adds up many such small binary trees together to evaluate a big tree and then adjusts the tree according to the results of the evaluation. After a time, a best fitting tree is declared and the interpretation of the experimental results is possible. Our compression-based algorithms output a matrix of pairwise distances between objects. Because such a matrix is hard to interpret, we try to extract some of its essential features using the quartet method. This results in a tree optimized so that similar objects with small distances are placed nearby each other. The trees given in Figures 1.1, 1.2, and 1.3 (discussed below) have all been constructed using the quartet method.

The quartet tree search is non-deterministic. There are compelling theoretical reasons to suppose that the general quartet tree search problem is intractable to solve exactly for every case. But the method used here tries instead to approximate a solution in a reasonable amount of time, sacrificing accuracy for speed. It also makes extensive use of random numbers, and so there is sometimes variation in the results that the tree search produces. We describe the quartet tree method in detail in Chapter 4. In Chapter 6 we show numerous trees based on applying the quartet method and the NCD to a broad spectrum of input files in a wide array of domains.

1.1.3  Learning From the Web

It is possible to use coding theory to connect the compression approach to the web with the help of a search engine index database. By using a simple formula based on logarithms we can find “compressed sizes” of search terms. This was used in the Chinese tree in Figure 1.2. The tree of Nobel prize winning authors in Figure 1.3 was also made this way. As in the last example, a distance matrix is made, but this time with Google providing page count statistics that are converted to codelengths for use in the distance matrix calculations. We can see English and American writers clearly separated in the tree, as well as many other defensible placements. Another example using prime numbers with Google is in Chapter ??, page ??.

Throughout this thesis the reader will find ample experiments demonstrating the machine learning technology. There are objective experiments based on pure statistics using true data compressors and subjective experiments using statistics from web pages as well. There are examples taken from genetics, linguistics, literature, radio astronomy, optical character recognition, music, and many more diverse areas. Most of the experiments can be found in Chapters 4, 6, and 7.

1.1.4  Clustering and Classification

The examples given above all dealt with clustering. It is also interesting to consider how we can use NCD to solve classification problems. Classification is the task of assigning labels to unknown test objects given a set of labeled training objects from a human expert. The goal is to try to learn the underlying patterns that the human expert is displaying in the choice of labellings shown in the training objects, and then to apply this understanding to the task of making predictions for unknown objects that are in some sense consistent with the given examples. Usually the problem is reduced to a combination of binary classification problems, where all target labels along a given dimension are either 0 or 1. In Chapter 5 we discuss this problem in greater detail, we give some information about a popular classification engine called the Support Vector Machine (SVM), and we connect the SVM to the NCD to create robust binary classifiers.

1.2  Gestalt Historical Context

Each of the three key ideas (compression as learning, quartet tree visualization, and learning from the web) have a common thread: all of them serve to increase the generality and practical robustness of the machine intelligence compared to more traditional alternatives. This goal is not new and has already been widely recognized as fundamental. In this section a brief and subjective overview of the recent history of artificial intelligence is given to provide a broader context for this thesis.

In the beginning, there was the idea of artificial intelligence. As circuit miniaturization took off in the 1970’s, people’s imaginations soared with ideas of a new sort of machine with virtually unlimited potential: a (usually humanoid) metal automaton with the capacity to perform intelligent work and yet ask not one question out of the ordinary. A sort of ultra-servant, made able to reason as well as man in most respects, yet somehow reasoning in a sort of rarefied form whereby the more unpredictable sides of human nature are factored out. One of the first big hurdles came as people tried to define just what intelligence was, or how one might codify knowledge in the most general sense into digital form. As Levesque and Brachman famously observed  [], reasoning and representation are hopelessly intertwined, and just what intelligence is depends very much on just who is doing the talking.

Immediately upon settling on the question of intelligence one almost automatically must grapple with the concept of language. Consciousness and intelligence is experienced only internally, yet the objects to which it applies are most often external to the self. Thus there is at once the question of communication and experience and this straight-away ends any hope of perfect answers. Most theories on language are not theories in the formal sense  []. A notable early exception is Quine’s famous observation that language translation is necessarily a dicey subject: for although you might collect very many pieces of evidence suggesting that a word means “X” or “Y”, you can never collect a piece of evidence that ultimately confirms that your understanding of the word is “correct” in any absolute sense. In a logical sense, we can never be sure that the meaning of a word is as it was meant, for to explain any word we must use other words, and these words themselves have only other words to describe them, in an interminable web of ontological disarray. Kantian empiricism leads us to pragmatically admit we have only the basis of our own internal experience to ground our understanding at the most basic level, and the mysterious results of the reasoning mind, whatever that might be.

It is without a doubt the case that humans throughout the world develop verbal and usually written language quite naturally. Recent theories by Smale  [] have provided some theoretical support for empirical models of language evolution despite the formal impossibility of absolute certainty. Just the same it leaves us with a very difficult question: how do we make bits think?

Some twenty years later, progress has been bursty. We have managed to create some amazingly elegant search and optimization techniques including simplex optimization, tree search, curve-fitting, and modern variants such as neural networks or support vector machines. We have built computers that can beat any human in chess, but we cannot yet find a computer smart enough to walk to the grocery store to buy a loaf of bread. There is clearly a problem of overspecialization in the types of successes we have so far enjoyed in artificial intelligence. This thesis explores my experience in charting this new landscape of concepts via a combination of pragmatic and principled techniques. It is only with the recent explosion in internet use and internet writing that we can now begin to seriously tackle these problems so fundamental to the original dream of artificial intelligence.

In recent years, we have begun to make headway in defining and implementing universal prediction, arguably the most important part of artificial intelligence. Most notable is Solomonoff prediction [], and the more practical analogs by Ryabko and Astola [] using data compression.

In classical statistical settings, we typically make some observations of a natural (or at the very least, measurable) phenomenon. Next, we use our intuition to “guess” which mathematical model might best apply. This process works well for those cases where the guesser has a good model for the phenomenon under consideration. This allows for at least two distinct modes of freedom: both in the choice of models, and also in the choice of criteria supporting “goodness”.

In the past the uneasy compromise has been to focus attention firstly on those problems which are most amenable to exact solution, to advance the foundation of exact and fundamental science. The next stage of growth was the advent of machine-assisted exact sciences, such as the now-famous four-color proof that required input (by hand!) of 1476 different graphs for computer verification (by a complicated program) that all were colorable before deductive extension to the most general case in the plane []. After that came the beginning of modern machine learning, based on earlier ideas of curve fitting and least-squares regression. Neural networks, and later support vector machines, gave us convenient learning frameworks in the context of continuous functions. Given enough training examples, the theory assured us, the neural network would eventually find the right combination of weightings and multiplicative factors that would miraculously, and perhaps a bit circularly, reflect the underlying meaning that the examples were meant to teach. Just like spectral analysis that came before, each of these areas yielded a whole new broad class of solutions, but were essentially hit or miss in their effectiveness in each domain for reasons that remain poorly understood. The focus of my research has been on the use of data compression programs for generalized inference. It turns out that this modus operandi is surprisingly general in its useful application and yields oftentimes the most expedient results as compared to other more predetermined methods. It is often “one size fits all well enough” and this yields unexpected fruits. From the outset, it must be understood that the approach here is decidedly different than more classical ones, in that we avoid in most ways an exact statement of the problem at hand, instead deferring this until very near the end of the discussion, so that we might better appreciate what can be understood about all problems with a minimum of assumptions.

At this point a quote from Goldstein and Gigerenzer [] is appropriate:

What are heuristics? The Gestalt psychologists Karl Duncker and Wolfgang Koehler preserved the original Greek definition of “serving to find out or discover” when they used the term to describe strategies such as “looking around” and “inspecting the problem” (e.g., Duncker, 1935/1945).

For Duncker, Koehler, and a handful of later thinkers, including Herbert Simon (e.g., 1955), heuristics are strategies that guide information search and modify problem representations to facilitate solutions. From its introduction into English in the early 1800s up until about 1970, the term heuristics has been used to refer to useful and indispensable cognitive processes for solving problems that cannot be handled by logic and probability theory (e.g., Polya, 1954; Groner, Groner, & Bischof, 1983). In the past 30 years, however, the definition of heuristics has changed almost to the point of inversion. In research on reasoning, judgment, and decision making, heuristics have come to denote strategies that prevent one from finding out or discovering correct answers to problems that are assumed to be in the domain of probability theory. In this view, heuristics are poor substitutes for computations that are too demanding for ordinary minds to carry out. Heuristics have even become associated with inevitable cognitive illusions and irrationality.

This author sides with Goldstein and Gigerenzer in the view that sometimes “less is more”; the very fact that things are unknown to the naive observer can sometimes work to his advantage. The recognition heuristic is an important, reliable, and conservative general strategy for inductive inference. In a similar vein, the NCD based techniques shown in this thesis provide a general framework for inductive inference that is robust against a wide variety of circumstances.

1.3  Contents of this Thesis

In this chapter a summary is provided for the remainder of the thesis as well as some historical context. In Chapter 2, an introduction to the technical details and terminology surrounding the methods is given. In chapter 3 we introduce the Normalized Compression Distance (NCD), the core mathematical formula that makes all of these experiments possible, and we establish connections between NCD and other well-known mathematical formulas. In Chapter 4 a tree search system is explained based on groups of four objects at a time, the so-called quartet method. In Chapter 5 we combine NCD with other machine learning techniques such as Support Vector Machines. In Chapter 6, we provide a wealth of examples of this technology in action. All experiments in this thesis were done using the CompLearn Toolkit, an open-source general purpose data mining toolkit available for download from the http://complearn.org/ website. In Chapter 7, we show how to connect the internet to NCD using the Google search engine, thus providing the advanced sort of subjective analysis as shown in Figure 1.2. In Chapter 8 we use these techniques and others to trace the evolution of the legend of Saint Henry. In Chapter 9 we compare CompLearn against another older tree search software system called PHYLIP. Chapter 10 gives a snapshot of the online documentation for the CompLearn system. After this, a Dutch language summary is provided as well as a bibliography, index, and list of papers by R. Cilibrasi.

Chapter 2  Technical Introduction

The spectacle is the existing order’s uninterrupted discourse about itself, its laudatory monologue. It is the self-portrait of power in the epoch of its totalitarian management of the conditions of existence. The fetishistic, purely objective appearance of spectacular relations conceals the fact that they are relations among men and classes: a second nature with its fatal laws seems to dominate our environment. But the spectacle is not the necessary product of technical development seen as a natural development. The society of the spectacle is on the contrary the form which chooses its own technical content. –Guy Debord, Society of the Spectacle

This chapter will give an informal introduction to relevant background material, familiarizing the reader with notation and basic concepts but omitting proofs. We discuss strings, languages, codes, Turing Machines and Kolmogorov complexity. This material will be extensively used in the chapters to come. For a more thorough and detailed treatment of all the material including a tremendous number of innovative proofs see []. It is assumed that the reader has a basic familiarity with algebra and probability theory as well as some rudimentary knowledge of classical information theory. We first introduce the notions of finite, infinite and string of characters. We go on to discuss basic coding theory. Next we introduce the idea of Turing Machines. Finally, in the last part of the chapter, we introduce Kolmogorov Complexity.

2.1  Finite and Infinite

In the domain of mathematical objects discussed in this thesis, there are two broad categories: finite and infinite. Finite objects are those whose extent is bounded. Infinite objects are those that are “larger” than any given precise bound. For example, if we perform 100 flips of a fair coin in sequence and retain the results in order, the full record will be easily written upon a single sheet of A4 size paper, or even a business card. Thus, the sequence is finite. But if we instead talk about the list of all prime numbers greater than 5, then the sequence written literally is infinite in extent. There are far too many to write on any given size of paper no matter how big. It is possible, however, to write a computer program that could, in principle, generate every prime number, no matter how large, eventually, given unlimited time and memory. It is important to realize that some objects are infinite in their totality, but can be finite in a potential effective sense by the fact that every finite but a priori unbounded part of them can be obtained from a finite computer program. There will be more to say on these matters later in Section ??.

2.2  Strings and Languages

A bit, or binary digit, is just a single piece of information representing a choice between one of two alternatives, either 0 or 1.

A character is a symbol representing an atomic unit of written language that cannot be meaningfully subdivided into smaller parts. An alphabet is a set of symbols used in writing a given language. A language (in the formal sense) is a set of permissible strings made from a given alphabet. A string is an ordered list (normally written sequentially) of 0 or more symbols drawn from a common alphabet. For a given alphabet, different languages deem different strings permissible. In English, 26 letters are used, but also the space and some punctuation should be included for convenience, thus increasing the size of the alphabet. In computer files, the underlying base is 256 because there are 256 different states possible in each indivisible atomic unit of storage space, the byte. A byte is equivalent to 8 bits, so the 256-symbol alphabet is central to real computers. For theoretical purposes however, we can dispense with the complexities of large alphabets by realizing that we can encode large alphabets into small ones; indeed, this is how a byte can be encoded as 8 bits. A bit is a symbol from a 2-symbol, or binary, alphabet. In this thesis, there is not usually any need for an alphabet of more than two characters, so the notational convention is to restrict attention to the binary alphabet in the absence of countervailing remarks. Usually we encode numbers as a sequence of characters in a fixed radix format at the most basic level, and the space required to encode a number in this format can be calculated with the help of the logarithm function. The logarithm function is always used to determine a coding length for a given number to be encoded, given a probability or integer range. Similarly, it is safe for the reader to assume that all log’s are taken base 2 so that we may interpret the results in bits.

We write Σ to represent the alphabet used. We usually work with the binary alphabet, so in that case Σ = {0,1}. We write Σ* to represent the space of all possible strings including the empty string. This notation may be a bit unfamiliar at first, but is very convenient and is related to the well-known concept of regular expressions. Regular expressions are a concise way of representing formal languages as sets of strings over an alphabet. The curly braces represent a set (to be used as the alphabet in this case) and the * symbol refers to the closure of the set; By closure we mean that the symbol may be repeated 0, 1, 2, 3, 5, or any number of times. By definition, {0,1}* = ∪n ≥ 0 {0,1}n. It is important to realize that successive symbols need not be the same, but could be. Here we can see that the number of possible binary strings is infinite, yet any individual string in this class must itself be finite. For a string x, we write |x| to represent the length, measured in symbols, of that string.

2.3  The Many Facets of Strings

Earlier we said that a string is a sequence of symbols from an alphabet. It is assumed that the symbols in Σ have a natural or at least conventional ordering. From this we may inductively create a rule that allows us to impose an ordering on all strings that are possible in Σ* in the conventional way: use length first to bring the shorter strings as early as possible in the ordering, and then use the leftmost different character in any two strings to determine their relative ordering. This is just a generalized restatement of the familiar alphabetical or lexicographic ordering. It is included here because it allows us to associate a positive integer ordering number with each possible string. The empty string, є, is the first string in this list. The next is the string 0, and the next 1. After that comes 00, then 01, then 10, then 11, then 000, and so on ad nauseaum. Next to each of these strings we might list their lengths as well as their ordering-number position in this list as follows:

               

x|x| ord (x)
є01
012
113
0024
0125
1026
1127
00038
00139
010310
011311
100312

... and so on forever ...

Here there are a few things to notice. First is that the second column, the length of x written |x|, is related to the ord (x) by the following relationship:

log(  ord (x) ) ≤ |x| ≤ log(  ord (x) ) + 1.     (1)

Thus we can see that the variable x can be interpreted polymorphically: as either a literal string of characters having a particular sequence and length or instead as an integer in one of two ways: either by referring to its length using the | · | symbol, or by referring to its ordinal number using ord (x). All of the mathematical functions used in this thesis are monomorphic in their argument types: each argument can be either a number (typically an integer) or a string, but not both. Thus without too much ambiguity we will sometimes leave out the ord symbol and just write x and rely on the reader to pick out the types by their context and usage. Please notice that x can either stand for the string x or the number ord (x), but never for the length of x, which we always explicitly denote as |x|.

2.4  Prefix Codes

A binary string y is a proper prefix of a binary string x if we can write x=yz for z ≠ є. A set {x,y, … } ⊆ {0,1}* is prefix-free if no element is a proper prefix of any other. A prefix-free set can be used to define a prefix code. Formally, a prefix code is defined by a decoding function D, which is a function from a prefix free set to some arbitrary set X. The elements of the prefix free set are called code words. The elements of X are called source words. If the inverse D−1 of D exists, we call it the encoding function. An example of a prefix code, that is used later, encodes a source word x=x1 x2xn by the code word

x
 = 1n 0 x .

Here X = {0,1}*, D−1(x) = x = 1n 0 x. This prefix-free code is called self-delimiting, because there is a fixed computer program associated with this code that can determine where the code word x ends by reading it from left to right without backing up. This way a composite code message can be parsed in its constituent code words in one pass by a computer program. 1

In other words, a prefix code is a code in which no codeword is a prefix of another codeword. Prefix codes are very easy to decode because codeword boundaries are directly encoded along with each datum that is encoded. To introduce these, let us consider how we may combine any two strings together in a way that they could be later separated without recourse to guessing. In the case of arbitrary binary strings x,y, we cannot be assured of this prefix condition: x might be 0 while y was 00 and then there would be no way to tell the original contents of x or y given, say, just xy. Therefore let us concentrate on just the x alone and think about how we might augment the natural literal encoding to allow for prefix disambiguation. In real languages on computers, we are blessed with whitespace and commas, both of which are used liberally for the purpose of separating one number from the next in normal output formats. In a binary alphabet our options are somewhat more limited but still not too bad. The simplest solution would be to add in commas and spaces to the alphabet, thus increasing the alphabet size to 4 and the coding size to 2 bits, doubling the length of all encoded strings. This is a needlessly heavy price to pay for the privilege of prefix encoding, as we will soon see. But first let us reconsider another way to do it in a bit more than double space: suppose we preface x with a sequence of |x| 0’s, followed by a 1, followed by the literal string x. This then takes one bit more than twice the space for x and is even worse than the original scheme with commas and spaces added to the alphabet. This is just the scheme discussed in the beginning of the section. But this scheme has ample room for improvement: suppose now we adjust it so that instead of outputting all those 0’s at first in unary, we instead just output a number of zeros equal to

⌈ log(|x|) ⌉, 

then a 1, then the binary number |x| (which satisfies |x| ≤ ⌈ logx ⌉ + 1, see Eq. (??)), then x literally. Here, ⌈ · ⌉ indicates the ceiling operation that returns the smallest integer not less than its argument. This, then, would take a number of bits about

2 ⌈ loglogx ⌉ + ⌈ logx ⌉ + 1, 

which exceeds ⌈ logx ⌉, the number of bits needed to encode x literally, only by a logarithmic amount. If this is still too many bits then the pattern can be repeated, encoding the first set of 0’s one level higher using the system to get

2 ⌈ logloglogx ⌉ + ⌈ loglogx ⌉ + ⌈ logx ⌉ + 1 .

Indeed, we can “dial up” as many logarithms as are necessary to create a suitably slowly-growing composition of however many log’s are deemed appropriate. This is sufficiently efficient for all purposes in this thesis and provides a general framework for converting arbitrary data into prefix-free data. It further allows us to compose any number of strings or numbers for any purpose without restraint, and allows us to make precise the difficult concept of K(x,y), as we shall see in Section ??.

2.4.1  Prefix Codes and the Kraft Inequality

Let X be the set of natural numbers and consider the straightforward non-prefix binary representation with the ith binary string in the length-increasing lexicographical order corresponding to the number i. There are two elements of X with a description of length 1, four with a description of length 2 and so on. However, there are less binary prefix code words of each length: if x is a prefix code word then no y = xz with z ≠ є is a prefix code word. Asymptotically there are less prefix code words of length n than the 2n source words of length n. Clearly this observation holds for arbitrary prefix codes. Quantification of this intuition for countable X and arbitrary prefix-codes leads to a precise constraint on the number of code-words of given lengths. This important relation is known as the Kraft Inequality and is due to L.G. Kraft []. Let l1 , l2 , … be a finite or infinite sequence of natural numbers. There is a prefix code with this sequence as lengths of its binary code words iff

 
n
  2−  ln   ≤ 1.     (1)

2.4.2  Uniquely Decodable Codes

We want to code elements of some set X in a way that they can be uniquely reconstructed from the encoding. Such codes are called uniquely decodable. Every prefix-code is a uniquely decodable code. On the other hand, not every uniquely decodable code satisfies the prefix condition. Prefix-codes are distinguished from other uniquely decodable codes by the property that the end of a code word is always recognizable as such. This means that decoding can be accomplished without the delay of observing subsequent code words, which is why prefix-codes are also called instantaneous codes. There is good reason for our emphasis on prefix-codes. Namely, it turns out that Lemma ?? stays valid if we replace “prefix-code” by “uniquely decodable code.” This important fact means that every uniquely decodable code can be replaced by a prefix-code without changing the set of code-word lengths. In this thesis, the only aspect of actual encodings that interests us is their length, because this reflects the underlying probabilities in an associated model. There is no loss of generality in restricting further discussion to prefix codes because of this property.

2.4.3  Probability Distributions and Complete Prefix Codes

A uniquely decodable code is complete if the addition of any new code word to its code word set results in a non-uniquely decodable code. It is easy to see that a code is complete iff equality holds in the associated Kraft Inequality. Let l1, l2, … be the code words of some complete uniquely decodable code. Let us define qx = 2lx. By definition of completeness, we have ∑x qx = 1. Thus, the qx can be thought of as probability mass functions corresponding to some probability distribution Q for a random variable X. We say Q is the distribution corresponding to l1,l2,…. In this way, each complete uniquely decodable code is mapped to a unique probability distribution. Of course, this is nothing more than a formal correspondence: we may choose to encode outcomes of X using a code corresponding to a distribution q, whereas the outcomes are actually distributed according to some pq. But, as we argue below, if X is distributed according to p, then the code to which p corresponds is, in an average sense, the code that achieves optimal compression of X. In particular, every probability mass function p is related to a prefix code, the Shannon-Fano code, such that the expected number of bits per transmitted code word is as low as is possible for any prefix code, assuming that a random source X generates the source words x according to P(X=x)=p(x). The Shannon-Fano prefix code encodes a source word x by a code word of length lx = ⌈ log1/ p(x) ⌉, so that the expected transmitted code word length equals ∑x p(x) log1/p(x)= H(X), the entropy of the source X, up to one bit. This is optimal by Shannon’s “noiseless coding” theorem []. This is further explained in Section ??.

2.5  Turing Machines

This section mainly serves as a preparation for the next section, in which we introduce the fundamental concept of Kolmogorov complexity. Roughly speaking, the Kolmogorov complexity of a string is the shortest computer program that computes the string, i.e. that prints it, and then halts. The definition depends on the specific computer programming language that is used. To make the definition more precise, we should base it on programs written for universal Turing machines, which are an abstract mathematical representation of a general-purpose computer equipped with a general-purpose or universal computer programming language.

Universal Computer Programming Languages:

Most popular computer programming languages such as C, Lisp, Java and Ruby, are universal. Roughly speaking, this means that they must be powerful enough to emulate any other computer programming language: every universal computer programming language can be used to write a compiler for any other programming language, including any other universal programming language. Indeed, this has been done already a thousand times over with the GNU (Gnu’s Not Unix) C compiler, perhaps the most successful open-source computer program in the world. In this case, although there are many different assembly languages in use on different CPU architectures, all of them are able to run C programs. So we can always package any C program along with the GNU C compiler which itself is not more than 100 megabytes in order to run a C program anywhere.

Turing Machines:

The Turing machine is an abstract mathematical representation of the idea of a computer. It generalizes and simplifies all the many specific types of deterministic computing machines into one regularized form. A Turing machine is defined by a set of rules which describe its behavior. It receives as its input a string of symbols, wich may be thought OF as a “program”, and it outputs the result of running that program, which amounts to transforming the input using the given set of rules. Just as there are universal computer languages, there are also universal Turing machines. We say a Turing Machine is universal if it can simulate any other Turing Machine. When such a universal Turing machine receives as input a pair ⟨ x,y ⟩, where x is a formal specification of another Turing machine Tx, it outputs the same result as one would get if one would input the string y to the Turing machine Tx. Just as any universal programming language can be used to emulate any other one, any universal Turing machine can be used to emulate any other one. It may help intuition to imagine any familiar universal computer programming language as a definition of a universal Turing machine, and the runtime and hardware needed to execute it as a sort of real-world Turing machine itself. It is necessary to remove resource constraints (on memory size and input/output interface, for example) in order for these concepts to be thoroughly equivalent theoretically.

Turing machines, formally:

A Turing machine consists of two parts: a finite control and a tape. The finite control is the memory (or current state) of the machine. It always contains a single symbol from a finite set Q of possible states. The tape initially contains the program which the Turing machine must execute. The tape contains symbols from the trinary alphabet A = {0,1,B}. Initially, the entire tape contains the B (blank) symbol except for the place where the program is stored. The program is a finite sequence of bits. The finite control also is always positioned above a particular symbol on the tape and may move left or right one step. At first, the tape head is positioned at the first nonblank symbol on the tape. As part of the formal definition of a Turing machine, we must indicate which symbol from Q is to be the starting state of the machine. At every time step the Turing machine does a simple sort of calculation by consulting a list of rules that define its behavior. The rules may be understood to be a function taking two arguments (the current state and the symbol under the reading head) and returning a Cartesian pair: the action to execute this timestep and the next state to enter. This is to say that the two input arguments are a current state (symbol from Q) of the finite control and a letter from the alphabet A. The two outputs are a new state (also taken from Q) and an action symbol taken from S. The set of possible actions is S = {0,1,B,L,R}. The 0, 1, and B symbols refer to writing that value below the tape head. The L and R symbols refer to moving left or right, respectively. This function defines the behavior of the Turing machine at each step, allowing it to perform simple actions and run a program on a tape just like a real computer but in a very mathematically simple way. It turns out that we can choose a particular set of state-transition rules such that the Turing machine becomes universal in the sense described above. This simulation is plausible given a moment of reflection on how a Turing Machine is mechanically defined as a sequence of rules governing state transitions etc. The endpoint in this line of reasoning is that a universal Turing Machine can run a sort of Turing Machine simulation system and thereby compute identical results as any other Turing Machine.

Notation:

We typically use the Greek letter Φ to represent a Turing machine T as a partially defined function. When the Turing machine T is not clear from the context, we write ΦT. The function is supposed to take as input a program encoded as a finite binary string and outputs the results of running that program. Sometimes it is convenient to define the function as taking integers instead of strings; this is easy enough to do when we remember that each integer is identified with a given finite binary string given the natural lexicographic ordering of finite strings, as in Section ?? The function Φ need only be partially defined; for some input strings it is not defined because some programs do not produce a finite string as output, such as infinite looping programs. We say that Φ is defined only for those programs that halt and therefore produce a definite output. We introduce a special symbol ∞ that represents an abstract object outside the space of finite binary strings and unequal to any of them. For those programs that do not halt we say Φ(x) = ∞ as a shorthand way of indicating this infinite loop: x is thus a non-halting program like the following:

x = while true ; do ; done

Here we can look a little deeper into the x program above and see that although its runtime is infinite, its definition is quite finite; it is less than 30 characters. Since this program is written in the ASCII codespace, we can multiply this figure by 8 to reach a size of 240 bits.

Prefix Turing Machines:

In this thesis we look at Turing Machines whose set of halting programs is prefix free: that is to say that the set of such programs form a prefix code (Section ??), because no halting program is a prefix of another halting program. We can realize this by slightly changing the definition of a Turing machine, equipping it with a one-way input or ‘data’ tape, a separate working tape, and a one-way output tape. Such a Turing Machine is called a prefix machine. Just as there are universal “ordinary” Turing Machines, there are also universal prefix machines that have identical computational power.

2.6  Kolmogorov Complexity

Now is where things begin to become tricky. There is a very special function K called Kolmogorov Complexity. Intuitively, the Kolmogorov complexity of a finite string x is the shortest computer program that prints x and then halts. More precisely, K is usually defined as a unary function that maps strings to integers and is implicitly based (or conditioned) on a concrete reference Turing machine represented by function Φ. The complete way of writing it is KΦ(x). In practice, we want to use a Turing Machine that is as general as possible. It is convenient to require the prefix property. Therefore we take Φ to be a universal prefix Turing Machine.2 Because all universal Turing Machines can emulate one another reasonably efficiently, it does not matter much which one we take. We will say more about this later. For our purposes, we can suppose a universal prefix Turing machine is equivalent to any formal (implemented, real) computer programming language extended with a potentially unlimited memory. Recall that Φ represents a particular Turing machine with particular rules, and remember Φ is a partial function that is defined for all programs that terminate. If Φ is the transformation that maps a program x to its output o, then KΦ(z) represents the length of the minimum program size (in bits) |x| over all valid programs x such that Φ(x) = z .

We can think of K as representing the smallest quantity of information required to recreate an object by any reliable procedure. For example, let x be the first 1000000 digits of π. Then K(x) is small, because there is a short program generating x, as explained further below. On the other hand, for a random sequence of digits, K(x) will usually be large because the program will probably have to hardcode a long list of abitrary values.

2.6.1  Conditional Kolmogorov Complexity

There is another form of K which is a bit harder to understand but still important to our discussions called conditional Kolmogorov Complexity and written

K(z | y).

The notation is confusing to some because the function takes two arguments. Its definition requires a slight enhancement of the earlier model of a Turing machine. While a Turing machine has a single infinite tape, Kolmogorov complexity is defined with respect to prefix Turing machines, which have an infinite working tape, an output tape and a restricted input tape that supports only one operation called “read next symbol”. This input tape is often referred to as a data tape and is very similar to an input data file or stream read from standard input in Unix. Thus instead of imagining a program as a single string we must imagine a total runtime environment consisting of two parts: an input program tape with read/write memory, and a data tape of extremely limited functionality, unable to seek backward with the same limitations as POSIX standard input: there is getchar but no fseek. In the context of this slightly more complicated machine, we can define K(z | y) as the size, in bits, of the smallest program that outputs z given a prefix-free encoding of y, say ȳ, as an initial input on the data tape. The idea is that if y gives a lot of information about z then K(z|y) << K(z), but if z and y are completely unrelated, then K(zy) ≈ K(z). For example, if z = y, then y provides a maximal amount of information about z. If we know that z = y then a suitable program might be the following:

while true ; do
  c = getchar()
  if (c == EOF) ; then halt
  else putchar(c)
done

Here, already, we can see that K(x | x) < 1000 given the program above and a suitable universal prefix Turing machine. Note that the number of bits used to encode the whole thing is less than 1000. The more interesting case is when the two arguments are not equal, but related. Then the program must provide the missing information through more-complicated translation, preprogrammed results, or some other device.

2.6.2  Kolmogorov Randomness and Compressibility

As it turns out, K provides a convenient means for characterizing random sequences. Contrary to popular belief, random sequences are not simply sequences with no discernible patterns. Rather, there are a great many statistical regularities that can be proven and observed, but the difficulty lies in simply expressing them. As mentioned earlier, we can very easily express the idea of randomness by first defining different degrees of randomness as follows: a string x is krandom if and only if K(x) > |x| − k. This simple formula expresses the idea that random strings are incompressible. The vast majority of strings are 1-random in this sense. This definition improves greatly on earlier definitions of randomness because it provides a concrete way to show a given, particular string is non-random by means of a simple computer program.

At this point, an example is appropriate. Imagine the following sequence of digits:

1, 4, 1, 5, 9, 2, 6, 5, 3, ...

and so on. Some readers may recognize the aforementioned sequence as the first digits of the Greek letter π with the first digit (3) omitted. If we extend these digits forward to a million places and continue to follow the precise decimal approximation of π, we would have a sequence that might appear random to most people. But it would be a matter of some confusing debate to try to settle a bet upon whether or not the sequence were truly random, even with all million of the digits written out in several pages. However, a clever observer, having noticed the digits corresponded to π, could simply write a short computer program (perhaps gotten off the internet) of no more than 10 kilobytes that could calculate the digits and print them out. What a surprise it would be then, to see such a short program reproduce such a long and seemingly meaningless sequence perfectly. This reproduction using a much shorter (less than one percent of the literal size) program is itself direct evidence that the sequence is non-random and in fact implies a certain regularity to the data with a high degree of likelihood. Simple counting arguments show that there can be no more than a vanishingly small number of highly compressible programs; in particular, the proportion of programs that are compressible by even k bits is no more than 2k. This can be understood by remembering that there are just two 1-bit strings (0 and 1), four 2-bit strings, and 2m m-bit strings. So if we consider encodings of length m for source strings of length n with n>m, then at most 2m different strings out of the total of 2n source strings can be encoded in m bits. Thus, the ratio of strings compressible by nm bits is at most a 2mn proportion of all strings.

2.6.3  Universality In K

We have remarked earlier how universal Turing machines may emulate one another using finite simulation programs. In talking about the asymptotic behavior of K, these finite simulation programs do not matter more than an additive constant. For example, if we take xn to mean the first n digits of π, then K(xn) = O(logn) no matter which universal Turing machine is in use. This is because it will be possible to calculate any number of digits of π using a fixed-size program that reads as input the number of digits to output. The length of this input cannot be encoded in shorter than logn bits by a counting argument as in the previous section.

This implies that all variations of K are in some sense equivalent, because any two different variants of K given two different reference universal Turing machines will never differ by more than a fixed-size constant that depends only on the particular Turing machines chosen and not on the sequence. It is this universal character that winds up lending credence to the idea that K can be used as an absolute measure of the information contained in a given object. This is quite different from standard Shannon Information Theory based on the idea of average information required to communicate an object over a large number of trials and given some sort of generating source []. The great benefit of the Kolmogorov Complexity approach is that we need not explicitly define the generating source nor run the many trials to see desired results; just one look at the object is enough. Section ?? provides an example that will serve to illustrate the point.

2.6.4  Sophisticated Forms of K

There is now one more form of the K function that should be addressed, though it is perhaps the most complicated of all. It is written as follows:

K(x,y). 

This represents the size in bits of the minimum program that outputs x followed by y, provided the output is given by first outputting x in a self-delimitting way (as explained earlier) and then outputting y. Formally, we define K(x,y) as K(⟨ x,y ⟩), where ⟨ ·,· ⟩ is defined as the pairing operation that takes two numbers and returns a pair: x y.

2.7  Classical Probability Compared to K

Suppose we flip a fair coin. The type of sequence generated by the series of N flips of a fair coin is unpredictable in nature by assumption in normal probability theory. To define precisely what this means presents a bewildering array of possibilities. In the simplest, we might say the sequence is generated by a Bernoulli process where X takes on value 0 or 1 with probability

P(X = 0)fair = 
1
2
  = P(X = 1)fair

The notation P(·) represents the chance that the event inside occurs. It is expressed as a ratio between 0 and 1 with 0 meaning never, 1 meaning always, and every number inbetween representing the proportion of times the event will be true given a large enough number of independent trials. In such a setting, we may use a single bit to represent either possibility efficiently, and can always store N coin flips in just N bits regardless of the outcomes.

What if, instead of a fair coin, we use a biased one? For instance, if

P(X = 0)biased = 
1
8

and therefore since our simplified coins always turn up 0 or 1,

P(X = 1)biased = 
7
8

Then we may use the scheme above to reliably transmit N flips in N bits. Alternatively, we may decide to encode the 1’s more efficiently by using the following simple rule. Assume that N is even. Divide the N flips into pairs, and encode the pairs so that a pair of 1’s takes just a single 1 bit to encode. If both are not 1, then instead output a 0 and then two more bits to represent the actual outcomes in order. Then continue with the next pair of two. One can quickly calculate that “49/64 of the time” the efficient 1-bit codeword will be output in this scheme which will save a great deal of space. Some of this savings will be lost in the cases where the 3-bit codeword is emitted, 15/64 of the time. The average number of bits needed per outcome transmitted is then the codelength c:

c = 
49
128
 + 
15 · 3
64
 = 
94
128

This can also be improved somewhat down to the Shannon entropy H(X) [] of the source X with longer blocks or smarter encoding such as arithmetic codes [] over an alphabet Σ:

H(X) = 
 
i ∈ Σ
 − P(X = i) logP(X = i), 
c = − 
1
8
 · log( 
1
8
 ) − 
7
8
 · log( 
7
8
 ). 

By Shannon’s famous coding theorem, this is essentially the smallest average code length that can be obtained under the assumption that the coin is independently tossed according to Pbiased. Here though, there is already a problem, as we now cannot say, unconditionally, at least, that this many bits will be needed for any actual sequence of bits; luck introduces some variation in the actual space needed, though it is usually near the average. We know that such a coin is highly unlikely to repeatedly emit 0’s, yet we cannot actually rule out this possibility. More to the point, in abstract terms the probability, while exponentially decaying with the greatest haste, still never quite reaches zero. It is useful to think carefully about this problem. All the laws of classical probability theory cannot make claims about a particular sequence but instead only about ensembles of experiments and expected proportions. Trying to pin down uncertainty in this crude way often serves only to make it appear elsewhere instead. In the Kolmogorov Complexity approach, we turn things upside-down: we say that a string is random if it is uncompressible. A string is crandom if K(x) > |x| − c. This then directly addresses the question of how random a given string is by introducing different grades of randomness and invoking the universal function K to automatically rule out the possibility of any short programs predicting a random string defined in this way. Returning to the fair coin example, the entropy is 1 bit per outcome. But we cannot say with certainty that a sequence coming from such a coin cannot be substantially compressed. This is only true with high probability.

2.8  Uncomputability of Kolmogorov Complexity

Some have made the claim that Kolmogorov Complexity is objective. In theory, it is. But in practice it is difficult to say; one major drawback of K is that it is uncomputable. Trying to compute it leads one to try immediately the shortest programs first, and as shown above it does not take many characters in a reasonable language to produce an infinite loop. This problem is impossible to protect against in general, and any multi-threaded approach is doomed to failure for this reason as it bumps up against the Halting Problem. []

A more fruitful approach has been to apply Kolmogorov Complexity by approximating it with data compressors. We may consider the problem of efficiently encoding a known biased random source into a minimum number of bits in such a way that the original sequence, no matter what it was, can once again be reconstructed, but so that also for certain sequences a shorter code is output. This is the basic idea of a data compression program. The most commonly used data compression programs of the last decade include gzip, bzip2, and PPM.

gzip is an old and reliable Lempel-Ziv type compressor with a 32-kilobyte window []. It is the simplest and fastest of the three compressors.

bzip2 is a wonderful new compressor using the blocksort algorithm []. It provides good compression and an expanded window of 900 kilobytes allowing for longer-range patterns to be detected. It is also reasonably fast.

PPM stands for Prediction by Partial Matching []. It is part of a new generation of powerful compressors using a pleasing mix of statistical models arranged by trees, suffix trees or suffix arrays. It usually achieves the best performance of any real compressor yet is also usually the slowest and most memory intensive.

Although restricted to the research community, a new challenger to PPM has arisen called context mixing compression. It is often the best compression scheme for a variety of file types but is very slow; further, it currently uses a neural network to do the mixing of contexts. See the paq series of compressors on the internet for more information on this exciting development in compression technology.

We use these data compressors to approximate from above the Kolmogorov Complexity function K. It is worth mentioning that all of the real compressors listed above operate on a bytewide basis, and thus all will return a multiple of 8 bits in their results. This is unfortunate for analyzing small strings, because the granularity is too coarse to allow for fine resolution of subtle details. To overcome this problem, the CompLearn system – the piece of software using which almost all experiments in later chapters have been carried out – supports the idea of a virtual compressor (originally suggested by Steven de Rooij): a virtual compressor is one that does not actually output an encoded compressed form, but instead simply accumulates the number of bits necessary to encode the results using a hypothetical arithmetic (or entropy) encoder. This frees us from the bytewide restriction and indeed eliminates the need for rounding to a whole number of bits. Instead we may just return a real or floating-point value. This becomes quite useful when analyzing very similar strings of less than 100 bytes.

2.9  Conclusion

We have introduced the notion of universal computation and the K function indicating Kolmogorov Complexity. We have introduced Turing Machines and prefix codes as well as prefix machines. We have discussed a definition of a random string using K. We use these concepts in the next few chapters to explain in great detail our theory and experimental results.


1
This desirable property holds for every prefix-free encoding of a finite set of source words, but not for every prefix-free encoding of an infinite set of source words. For a single finite computer program to be able to parse a code message the encoding needs to have a certain uniformity property like the x code.
2
There exists a version of Kolmogorov complexity that is based on standard rather than prefix Turing machines, but we shall not go into it here.

Chapter 3  Normalized Compression Distance (NCD)

You may very appropriately want to ask me how we are going to resolve the ever acceleratingly dangerous impasse of world-opposed politicians and ideological dogmas. I answer, it will be resolved by the computer. Man has ever-increasing confidence in the computer; witness his unconcerned landings as airtransport passengers coming in for a landing in the combined invisibility of fog and night. While no politician or political system can ever afford to yield understandably and enthusiastically to their adversaries and opposers, all politicians can and will yield enthusiastically to the computers safe flight-controlling capabilities in bringing all of humanity in for a happy landing. –Buckminster Fuller in Operating Manual for Spaceship Earth

In this chapter the Normalized Compression Distance (NCD) and the related Normalized Information Distance (NID) are presented and investigated. NCD is a similarity measure based on a data compressor. NID is simply the instantiation of NCD using the theoretical (and uncomputable) Kolmogorov compressor. Below we first review the definition of a metric. In Section ??, we explain precisely what is meant by universality in the case of NID. We discuss compressor axioms in Section ??, and properties of NCD in Section 3.4. At the end of the chapter, we connect NCD with a classical statistical quantity called Kullback-Leibler divergence. In Section ?? we connect arithmetic compressors to entropy, and in Section ?? we relate them to KL-divergence.

3.1  Similarity Metric

In mathematics, different distances arise in all sorts of contexts, and one usually requires these to be a “metric”. We give a precise formal meaning to the loose distance notion of “degree of similarity” used in the pattern recognition literature.

Metric: Let Ω be a nonempty set and R+ be the set of nonnegative real numbers. A metric on Ω is a function D: Ω × Ω → R+ satisfying the metric (in)equalities:

The value D(x,y) is called the distance between x,y ∈ Ω. A familiar example of a metric is the Euclidean metric, the everyday distance e(a,b) between two objects a,b expressed in, say, meters. Clearly, this distance satisfies the properties e(a,a)=0, e(a,b)=e(b,a), and e(a,b) ≤ e(a,c) + e(c,b) (for instance, a= Amsterdam, b= Brussels, and c= Chicago.) We are interested in “similarity metrics”. For example, if the objects are classical music pieces then the function D(a,b)= 0 if a and b are by the same composer and D(a,b) = 1 otherwise, is a similarity metric. This metric captures only one similarity aspect (feature) of music pieces, presumably an important one because it subsumes a conglomerate of more elementary features.

Density: In defining a class of admissible distances (not necessarily metric distances) we want to exclude unrealistic ones like f(x,y) = 1/2 for every pair xy. We do this by restricting the number of objects within a given distance of an object. As in [] we do this by only considering effective distances, as follows.

Let Ω = Σ*, with Σ a finite nonempty alphabet and Σ* the set of finite strings over that alphabet. Since every finite alphabet can be recoded in binary, we choose Σ = {0,1}. In particular, “files” in computer memory are finite binary strings. A function D: Ω × Ω → R+ is an admissible distance if for every pair of objects x,y ∈ Ω the distance D(x,y) satisfies the density condition (a version of the Kraft Inequality (??)):

 
y
 2D(x,y) ≤ 1,     (1)

is computable, and is symmetric, D(x,y)=D(y,x).

If D is an admissible distance, then for every x the set {D(x,y): y ∈ {0,1}*} is the length set of a prefix code, since it satisfies (??), the Kraft inequality. Conversely, if a distance is the length set of a prefix code, then it satisfies (??), see [].

In representing the Hamming distance d between two strings of equal length n differing in positions i1, … , id, we can use a simple prefix-free encoding of (n,d,i1, … , id) in 2 logn + 4 loglogn +2 + d logn bits. We encode n and d prefix-free in logn+2 loglogn +1 bits each, see e.g. [], and then the literal indexes of the actual flipped-bit positions. Adding an O(1)-bit program to interpret these data, with the strings concerned being x and y, we have defined Hn(x,y)=2 logn + 4 loglogn + d logn+O(1) as the length of a prefix code word (prefix program) to compute x from y and vice versa. Then, by the Kraft inequality (Chapter 2, Section ??), ∑y 2Hn(x,y) ≤ 1. It is easy to verify that Hn is a metric in the sense that it satisfies the metric (in)equalities up to O(logn) additive precision.

Normalization: Large objects (in the sense of long strings) that differ by a tiny part are intuitively closer than tiny objects that differ by the same amount. For example, two whole mitochondrial genomes of 18,000 bases that differ by 9,000 are very different, while two whole nuclear genomes of 3 × 109 bases that differ by only 9,000 bases are very similar. Thus, absolute difference between two objects does not govern similarity, but relative difference seems to. A compressor is a lossless encoder mapping Ω into {0,1}* such that the resulting code is a prefix code. “Lossless” means that there is a decompressor that reconstructs the source message from the code message. For convenience of notation we identify “compressor” with a “code word length function” C: Ω → N, where N is the set of nonnegative integers. That is, the compressed version of a file x has length C(x). We only consider compressors such that C(x) ≤ |x|+O(log|x|). (The additive logarithmic term is due to our requirement that the compressed file be a prefix code word.) We fix a compressor C, and call the fixed compressor the reference compressor. Let D be an admissible distance. Then we may make the definition D+(x)=max{ D(x,z): C(z) ≤ C(x)}, and D+(x,y) is D+(x,y) = max{ D+(x),D+(y) }. Note that since D(x,y)=D(y,x), also D+(x,y)=D+(y,x).

Let D be an admissible distance. The normalized admissible distance, also called a similarity distance, d(x,y), based on D relative to a reference compressor C, is defined by

d(x,y) = 
D(x,y)
D+(x,y)
.

It follows from the definitions that a normalized admissible distance is a function d: Ω × Ω → [0,1] that is symmetric: d(x,y)=d(y,x).

For every x ∈ Ω, and constant e ∈ [0,1], a normalized admissible distance satisfies the density constraint

| {yd(x,y) ≤ e,    C(y) ≤ C(x)  } | < 2e  D+(x)+1 .     (2)

Proof. Assume to the contrary that d does not satisfy (??). Then, there is an e ∈ [0,1] and an x ∈ Ω, such that (??) is false. We first note that, since D(x,y) is an admissible distance that satisfies (??), d(x,y) satisfies a “normalized” version of the Kraft inequality:

 
yC(y) ≤ C(x)
 2− d(x,yD+(x) ≤
 
y
 2− d(x,yD+(x,y) ≤ 1 .     (3)

Starting from (??) we obtain the required contradiction:

     
1
 ≥ 
 
yC(y) ≤ C(x
 2d(x,y)  D+(x)
         
 <