AutoClass algorithm and context
AutoClass is an unsupervised Bayesian classification system developed at the NASA Ames Research Center in 1991 by Hanson, Stutz and Cheeseman. This algorithm has many interesting features:
The number of classes is determined automatically.
Missing values are supported.
Discrete and real values can be mixed.
For all classified objects, the class membership probability is provided.
AutoClass C is the implementation of the AutoClass algorithm in C. It was developed by Cheeseman and Stutz in 1996. AutoClass C has been successful in classifying data as diverse as infrared spectra of stars, protein structures, introns from human DNA sequences, Landsat satellite images, body patterns in the common cuttlefish, patterns between rich and poor countries, network traffic, or catchments in the Australian landscape. In proteomics and genomics, where thousands of proteins or genes are detected at once, AutoClass C has been shown to produce insightful results.
However, the AutoClass C user interface is not very friendly and requires data and parameters to be entered in a very precise way. To help users prepare input data, perform classification and analyze output clusters, we developed AutoClassWrapper, a Python wrapper around AutoClass C.
To install AutoClassWrapper, use
$ python3 -m pip install autoclasswrapper
You will also need AutoClass C:
$ wget https://ti.arc.nasa.gov/m/project/autoclass/autoclass-c-3-3-6.tar.gz
$ tar zxvf autoclass-c-3-3-6.tar.gz
$ rm -f autoclass-c-3-3-6.tar.gz
$ export PATH=$PATH:$(pwd)/autoclass-c
# if you use a 64-bit operating system,
# you also need to install the standard 32-bit C libraries:
$ sudo apt-get install -y libc6-i386
AutoClass C can handle 3 different types of data:
real scalar: numerical values bounded below by 0. Examples: length, weight, age…
real location: numerical values, positive and negative. Examples: position, microarray log ratio, elevation…
discrete: qualitative data. Examples: color, phenotype, name…
Each data type must be entered in a separate input file (one for each type).
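Deciding which type a column belongs to can be sketched as follows. This is only an illustrative heuristic, not part of AutoClassWrapper: the helper name `guess_data_type` and its rules (a non-numeric entry means discrete, any negative value means real location, otherwise real scalar) are assumptions derived from the descriptions above.

```python
def guess_data_type(values):
    """Guess an AutoClass C data type for one column of raw strings.

    Hypothetical helper (not part of AutoClassWrapper).
    Empty strings are treated as missing values and ignored.
    """
    present = [v for v in values if v.strip() != ""]
    numbers = []
    for v in present:
        try:
            numbers.append(float(v))
        except ValueError:
            # At least one non-numeric entry: qualitative data.
            return "discrete"
    if any(n < 0 for n in numbers):
        # Both positive and negative values are possible.
        return "real location"
    # All values are bounded below by 0.
    return "real scalar"


print(guess_data_type(["1.2", "", "3.4"]))   # real scalar
print(guess_data_type(["-0.5", "2.1"]))      # real location
print(guess_data_type(["red", "blue", ""]))  # discrete
```

A heuristic like this cannot recognize categories coded as numbers (for instance phenotype 1, 2, 3), so the final choice of data type remains with the user.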
The usual workflow to prepare data is to instantiate an object from the Input class:
import autoclasswrapper as wrapper
clust = wrapper.Input()
then add as many datasets as needed, usually one per data type:
clust.add_input_data("example1.tsv", "real scalar")
clust.add_input_data("example2.tsv", "real location")
The first line must be a header with column names. Avoid accented or special characters ($&!/β) and spaces. These characters will be automatically replaced by _. Avoid lengthy column names. Column names must be unique.
The first column must be gene/protein/object names.
Missing data are allowed. They must be represented by nothing (no
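The formatting rules for input files can be illustrated with a minimal sketch using only the standard library. The helper `clean_column_name`, the regular expression, and the sample rows are assumptions made for illustration; AutoClassWrapper performs its own replacement and may treat some characters differently.

```python
import csv
import io
import re


def clean_column_name(name):
    """Replace every character outside ASCII letters, digits and '_' by '_'.

    Hypothetical helper mimicking the replacement described above.
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


# Invented example data: first column holds object names.
header = ["name", "length (cm)", "β-ratio"]
rows = [
    ["geneA", "1.5", "0.2"],
    ["geneB", "", "0.7"],  # missing value: empty field
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow([clean_column_name(h) for h in header])
writer.writerows(rows)
print(buffer.getvalue())
```

Cleaning names before loading the file makes it easier to check that they are still unique after replacement.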
Together with the name of the input file, the user must provide the type of data (either real scalar, real location or discrete).
The default error on real values is 0.01. The error is relative for real scalar values (0.01 means 1%) but absolute for real location values. There is no error for discrete values. For real scalar and real location values, a custom error can be defined with the
input_error parameter of the
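The difference between relative and absolute error can be made concrete with a short calculation. The function name `measurement_uncertainty` is hypothetical and only restates the rule above; it is not an AutoClassWrapper function.

```python
def measurement_uncertainty(value, data_type, error=0.01):
    """Return the uncertainty implied by `error` for one value.

    Hypothetical helper restating the rule above:
    relative error for "real scalar", absolute for "real location".
    """
    if data_type == "real scalar":
        return abs(value) * error
    if data_type == "real location":
        return error
    raise ValueError("discrete values have no error")


print(measurement_uncertainty(50.0, "real scalar"))    # 1% of 50
print(measurement_uncertainty(50.0, "real location"))  # fixed 0.01
```

With the default error, a real scalar value of 50 carries an uncertainty of about 0.5, whereas a real location value of 50 carries an uncertainty of 0.01 regardless of its magnitude.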
The next step is to prepare input data and generate input files required by AutoClass C:
clust.prepare_input_data()
clust.create_db2_file()
clust.create_hd2_file()
clust.create_model_file()
clust.create_sparams_file()
clust.create_rparams_file()
All these commands are compulsory and will create several parameter files in the current directory.
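Before launching a run, it can be useful to confirm that the parameter files exist. The sketch below is not part of AutoClassWrapper: the helper `missing_input_files` is hypothetical, and the extension list is an assumption based on the usual AutoClass C file naming (.db2, .hd2, .model, .s-params, .r-params). Adjust it if your files are named differently.

```python
from pathlib import Path

# Assumed extensions, following the usual AutoClass C naming.
EXPECTED_EXTENSIONS = (".db2", ".hd2", ".model", ".s-params", ".r-params")


def missing_input_files(directory="."):
    """List the expected extensions with no matching file in `directory`."""
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    return [ext for ext in EXPECTED_EXTENSIONS
            if not any(name.endswith(ext) for name in names)]


missing = missing_input_files(".")
if missing:
    print("Missing input files with extensions:", ", ".join(missing))
else:
    print("All expected AutoClass C input files are present.")
```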
Classification / clustering
Once the input files are created, one can build the Bash run script and run AutoClass C:
import autoclasswrapper as wrapper
run = wrapper.Run()
run.create_run_file()
run.run()
At this stage, AutoClass C must be installed and available in PATH (see installation section).
The Bash script actually runs AutoClass C twice: first to perform the classification (clustering), then to build a report from the raw results.
The Bash script that runs AutoClass C is itself launched with the nohup command. This means that the only way to stop this script is by killing it!
Depending on the size of the datasets (number of lines and columns), the classification might take some time to run (from a few seconds to several hours). By default, the maximum running time is 3600 seconds (1 hour). This setting can be modified with the
max_duration parameter of the
Upon classification, results are output in different formats:
.cdt: cluster data (CDT) files can be opened with Java TreeView
.tsv: tab-separated values (TSV) files can easily be opened and processed with Microsoft Excel, R, Python…
_stats.tsv: basic statistics for all classes
_dendrogram.png: figure with a dendrogram showing relationship between classes
Note that the first class has number 1 (not 0).
import autoclasswrapper as wrapper
results = wrapper.Output()
results.extract_results()
results.aggregate_input_data()
results.write_cdt()
results.write_cdt(with_proba=True)
results.write_class_stats()
results.write_dendrogram()
.tsv files contain:
The initial dataset.
main-class column that gives the class with the highest probability.
main-class-proba column that contains the actual probability value (between 0.0 and 1.0) of the most probable class.
x being a class number) that provide the probability to belong to the
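The main-class and main-class-proba columns can be explored with a few lines of standard-library Python. This is a minimal sketch, not part of AutoClassWrapper: the helper `summarize`, the `name` column header, the 0.9 threshold, and the sample rows are all invented for illustration. In practice the handle would be an open results .tsv file.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for a results .tsv file (not real output).
sample = (
    "name\tlength\tmain-class\tmain-class-proba\n"
    "geneA\t1.5\t1\t0.98\n"
    "geneB\t2.0\t2\t0.61\n"
    "geneC\t1.7\t1\t0.87\n"
)


def summarize(handle, min_proba=0.9):
    """Count objects per main class and collect low-confidence names."""
    sizes = Counter()
    uncertain = []
    for row in csv.DictReader(handle, delimiter="\t"):
        sizes[row["main-class"]] += 1
        if float(row["main-class-proba"]) < min_proba:
            uncertain.append(row["name"])
    return sizes, uncertain


sizes, uncertain = summarize(io.StringIO(sample))
print(dict(sizes))
print(uncertain)
```

Objects whose best class has a low probability are often worth inspecting separately, since their assignment is less certain.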