Frequently Asked Questions

1. What is CompLearn?

CompLearn is a software system built to support compression-based learning in a wide variety of applications. It provides this support as a library written in highly portable ANSI C that runs in most modern computing environments with minimal fuss. It also supplies a small suite of simple, composable command-line utilities built on this library. Together with other commonly used machine-learning tools such as LibSVM and GraphViz, CompLearn forms an attractive offering among machine-learning frameworks and toolkits. It is designed to be extensible in a variety of ways, including modular dynamic-linking plugins (like those used in the Apache web server) and a language-neutral SOAP interface that gives access to core functionality from every major language.

2. Why did the version numbers skip so far between 0.6.4 and 0.8.12?

In early 2005 a major rewrite occurred. The original complearn package was poorly organized, which led to compilation and installation difficulties in far too many situations. This was addressed by rewriting all functionality from the ground up. Earlier versions used a combination of C and Ruby to deliver tree searching; the new version delivers all core functionality, such as NCD and tree searching, in a pure C library. Layered on top of this library are a variety of other interfaces, such as SOAP and a new in-process direct-extension CompLearn Ruby binding layer. All dependencies have been reworked and modularized, so Ruby and almost every other software package are now optional, and a variety of different configurations will compile cleanly.

Another major enhancement in the new CompLearn is the addition of a Google compressor to calculate NGD. This has opened up whole new areas of Quantitative Subjective Analysis (QSA) to complement the more classically pure statistical methods of our earlier gzip-style NCD research. By querying the Google web server through a SOAP layer, we can convert page counts of search terms to virtual file lengths that can be used to determine semantic relationships between terms. Please see the paper "Automatic Meaning Discovery Using Google" for more information.
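As a rough illustration of how page counts can stand in for file lengths, here is a minimal Python sketch. It is not CompLearn code. It assumes that a "virtual file length" is log(N / count), where N is an assumed total number of indexed pages, and it plugs those lengths into the usual NCD formula. The function names and the counts in the usage lines are illustrative only, not live query results.

```python
import math

def virtual_length(count, n):
    """Assumed conversion: a term found on `count` of `n` pages gets length log(n / count)."""
    if count <= 0 or count > n:
        raise ValueError("count must be in the range 1..n")
    return math.log(n / count)

def ngd(fx, fy, fxy, n):
    """Distance between two terms from their page counts.

    fx, fy: pages containing each term; fxy: pages containing both;
    n: assumed total number of indexed pages.
    """
    gx = virtual_length(fx, n)
    gy = virtual_length(fy, n)
    gxy = virtual_length(fxy, n)
    # Same shape as NCD: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    return (gxy - min(gx, gy)) / max(gx, gy)

# Illustrative counts for two related terms and an assumed index size.
distance = ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
print(round(distance, 3))
```

Terms that nearly always appear together score close to 0, and terms that rarely share a page score higher.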

3. I am running Mac OS X and configure can't detect GLib or some other libraries. What am I doing wrong?

Please make sure pkgconfig (pkg-config) is installed. This program is available via fink. Then rerun configure:

./configure

Please see our Dependencies section for more information on CompLearn library dependencies.

4. The Windows demo isn't working for me. Why not?

If you have cygwin installed on your computer, it's very likely you need to update it. The CompLearn Windows demo uses version 1.5.17 of the cygwin dll; earlier versions are not compatible with the demo. To update your cygwin, go to http://cygwin.com and hit the Install or Update now link.

You may also need to download and install DirectX.

5. gsl and CompLearn seemed to install perfectly, but ncd can't load the gsl library.

Users may get the following message if this happens:

ncd: error while loading shared libraries: libgslcblas.so.0: cannot open shared object file: No such file or directory

If this is the case, your LD_LIBRARY_PATH environment variable may need to include the directory where the gsl libraries were installed. For example, you can try the following before running the ncd command:

export LD_LIBRARY_PATH=/usr/local/lib

6. How can this demo work with only 1000 queries a day?

There are two reasons this demo is able to do as much as it does. One is that Google has generously (and free of charge to me) raised the daily search limit on my Google API account key. If you have an interesting Google API based search application of your own, you might email them yourself to ask. The other reason is that the demo keeps a cache of recent page result counts. You can see this cache by looking in the $HOME/.complearn directory. Even so, larger experiments must sometimes be run over the course of two days.
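The idea behind such a cache can be sketched in a few lines of Python. This is only an illustration of the technique, not the format or code CompLearn uses under $HOME/.complearn. The class name, the JSON file, and the fetch callback are all hypothetical.

```python
import json
import os
import tempfile

class PageCountCache:
    """Hypothetical on-disk cache mapping a search term to its page count."""

    def __init__(self, path):
        self.path = path
        self.counts = {}
        # Reload counts saved by an earlier run, if any.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self.counts = json.load(fh)

    def get(self, term, fetch):
        """Return the cached count; call fetch(term) only on a cache miss."""
        if term not in self.counts:
            self.counts[term] = fetch(term)
            with open(self.path, "w", encoding="utf-8") as fh:
                json.dump(self.counts, fh)
        return self.counts[term]

# Usage with a stand-in fetch function (a real one would spend one query).
queries_spent = []

def pretend_fetch(term):
    queries_spent.append(term)
    return 1000  # placeholder count, not a real result

cache = PageCountCache(os.path.join(tempfile.mkdtemp(), "counts.json"))
cache.get("horse", pretend_fetch)
cache.get("horse", pretend_fetch)  # served from the cache, no second query
print(len(queries_spent))
```

Because the counts are written to disk, a second run started the next day can reuse them and spend its quota only on terms it has not seen.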

7. How come the counts returned from (any particular) Google API are different from the numbers I see when I enter searches by hand?

I have two possible explanations for this behavior. One is that it would be prohibitively expensive to count the exact total of all pages indexed for most common search terms. I presume that instead they use an estimation heuristic called "prefixing", whereby a short sample of web pages serves as a representative set for the web and the count is scaled up as appropriate. I also presume that when you do a search (either by hand or from the API) you can be connected to any one of a number of different search servers, each with a slightly different database. In a rapidly changing, large global network it is unlikely that the counts for any particular common term will match exactly, because each server must maintain its own distinct "aging snapshot" of the internet.

8. Is it important to adjust or choose a compressor? How should I do it?

Yes, it is very important to choose a good compressor for your application. The "blocksort" compressor is the current default. It is a virtual compressor using a simple blocksorting algorithm, and it gives results something like frequency analysis, spectral analysis, and substring matching combined. It works very well for small strings (or files) of 100 bytes or less. If you have more than about 100 bytes, it is probably better to use one of the other compressors instead of the default:

ncd -c zlib

will get you "zlib" style compression, which is like gzip and is limited to files of up to 15K in size.

ncd -c bzlib

will get you "bzip2" style compression, which is like zlib but allows for files of up to about 450K in size.
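To see why file size matters when picking a compressor, the following Python sketch computes the standard NCD formula, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), using Python's own zlib and bz2 modules as stand-ins for the zlib and bzlib compressors. It is not CompLearn code, and the helper names are made up for the example. The size limits quoted above presumably come from the fact that both files are concatenated and must fit together inside what the compressor can look back over: zlib uses a 32K window, and bzip2 works in blocks of at most 900K.

```python
import bz2
import random
import zlib

def ncd(x, y, compress):
    """Normalized Compression Distance between byte strings x and y."""
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def noise(n, seed=0):
    """n reproducible pseudo-random bytes, used as incompressible test data."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))

small = noise(4_000)   # two copies fit well inside zlib's 32K window
large = noise(40_000)  # the second copy starts beyond zlib's window

# Compare each input with itself under both compressors.
for name, compress in (("zlib", zlib.compress), ("bz2", bz2.compress)):
    print(name,
          round(ncd(small, small, compress), 3),
          round(ncd(large, large, compress), 3))
```

A file compared with itself should score close to 0. With zlib that holds for the small input, but for the large input the score climbs toward 1 because the compressor can no longer see the first copy when it reaches the second. The bz2 score for the large input stays well below the zlib one.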

9. Running ./configure gives me the following error: cannot find input file: src/complearn/aclconfig.h.in. Where can I find this file?

You will need to generate this header input file by running the autoheader command, which is packaged with autoconf:

autoheader

10. I get the configure error: Can't locate object method "path" via package "Request" at /usr/share/autoconf/Autom4te/C4che.pm line 69, line 111. make[1]: *** [configure] Error 1. Is there an easy way to fix this?

In the top directory of the CompLearn distribution, run the following commands:

rm -rf autom4te.cache

make maintainer-clean