Iris flower dataset

The iris flower dataset is a common dataset used in machine learning.

It was introduced by Ronald Fisher in 1936. It contains the petal length, petal width, sepal length and sepal width of 150 iris flowers from 3 different species.

The dataset was downloaded from Kaggle.

To go through this example, you need to install AutoClassWrapper:

$ python3 -m pip install autoclasswrapper

AutoClass C also needs to be installed locally and available in your PATH.

Here is a quick solution for a Linux Bash shell:

wget https://ti.arc.nasa.gov/m/project/autoclass/autoclass-c-3-3-6.tar.gz
tar zxvf autoclass-c-3-3-6.tar.gz
rm -f autoclass-c-3-3-6.tar.gz
export PATH=$PATH:$(pwd)/autoclass-c

# if you use a 64-bit operating system,
# you also need to install the standard 32-bit C libraries:
# sudo apt-get install -y libc6-i386
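Before running the notebook, it can help to confirm that the executable is actually reachable. The sketch below uses only the Python standard library. It assumes the executable is named `autoclass`, which is the name reported in the wrapper's log output later in this example; `find_executable` is a helper name introduced here for illustration and is not part of AutoClassWrapper.

```python
import shutil


def find_executable(name="autoclass"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


path = find_executable()
if path is None:
    print("autoclass not found in PATH")
else:
    print(f"autoclass found at {path}")
```

If the executable is reported as missing, check that the `export PATH=...` line above was run in the same shell session that launches Python.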
[1]:
from pathlib import Path
import sys
import time

import matplotlib
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import numpy as np
import pandas as pd

%matplotlib inline

print("Python:", sys.version)
print("matplotlib:", matplotlib.__version__)
print("numpy:", np.__version__)
print("pandas:", pd.__version__)

import autoclasswrapper as wrapper
print("AutoClassWrapper:", wrapper.__version__)

# Compare as a tuple so that any future major version also passes the check.
if sys.version_info < (3, 6):
    sys.exit("Need Python>=3.6")
Python: 3.7.1 | packaged by conda-forge | (default, Feb 26 2019, 04:48:14)
[GCC 7.3.0]
matplotlib: 3.0.3
numpy: 1.16.2
pandas: 0.24.1
AutoClassWrapper: 1.4.1

Dataset preparation

[2]:
df = pd.read_csv("iris.csv", index_col="Id")
df.head()
[2]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
Id
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
[3]:
df.describe(include='all').T
[3]:
count unique top freq mean std min 25% 50% 75% max
SepalLengthCm 150 NaN NaN NaN 5.84333 0.828066 4.3 5.1 5.8 6.4 7.9
SepalWidthCm 150 NaN NaN NaN 3.054 0.433594 2 2.8 3 3.3 4.4
PetalLengthCm 150 NaN NaN NaN 3.75867 1.76442 1 1.6 4.35 5.1 6.9
PetalWidthCm 150 NaN NaN NaN 1.19867 0.763161 0.1 0.3 1.3 1.8 2.5
Species 150 3 Iris-versicolor 50 NaN NaN NaN NaN NaN NaN NaN

Add discrete values

Apart from the iris species, the data in this dataset are numerical values only.

To demonstrate the ability of AutoClass C to handle discrete values, we will convert the PetalWidthCm column to discrete categorical values.

[4]:
def categorize(value):
    if value <= 0.75:
        return "small"
    elif 0.75 < value <= 1.75:
        return "medium"
    elif 1.75 < value:
        return "large"
[5]:
df["PetalWidthCat"] = df["PetalWidthCm"].apply(categorize)
df.head()
[5]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species PetalWidthCat
Id
1 5.1 3.5 1.4 0.2 Iris-setosa small
2 4.9 3.0 1.4 0.2 Iris-setosa small
3 4.7 3.2 1.3 0.2 Iris-setosa small
4 4.6 3.1 1.5 0.2 Iris-setosa small
5 5.0 3.6 1.4 0.2 Iris-setosa small
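As an aside, the same binning can probably be expressed with `pandas.cut`, which avoids a hand-written function. The sketch below is an alternative and not the approach used in this notebook; the toy widths are made up to cover the two boundary values (0.75 and 1.75).

```python
import pandas as pd

# Toy petal widths (cm), including the two boundary values.
widths = pd.Series([0.2, 0.75, 1.0, 1.75, 1.8], name="PetalWidthCm")

# Right-closed bins: (-inf, 0.75], (0.75, 1.75], (1.75, inf]
categories = pd.cut(
    widths,
    bins=[float("-inf"), 0.75, 1.75, float("inf")],
    labels=["small", "medium", "large"],
)
print(categories.tolist())
# ['small', 'small', 'medium', 'medium', 'large']
```

Note that `pandas.cut` returns a categorical column rather than plain strings, so a conversion with `.astype(str)` may be wanted before writing the values to a file.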

Add missing values

To demonstrate the ability of AutoClass C to handle missing values, we will delete some values.

[6]:
df.loc[1, "SepalLengthCm"] = np.nan
df.loc[2, "SepalWidthCm"] = np.nan
df.loc[3, "PetalLengthCm"] = np.nan
df.head()
[6]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species PetalWidthCat
Id
1 NaN 3.5 1.4 0.2 Iris-setosa small
2 4.9 NaN 1.4 0.2 Iris-setosa small
3 4.7 3.2 NaN 0.2 Iris-setosa small
4 4.6 3.1 1.5 0.2 Iris-setosa small
5 5.0 3.6 1.4 0.2 Iris-setosa small

Save the dataset in two different files: one with the real values and one with the discrete values (column PetalWidthCat). Missing values must be encoded as empty fields.

[7]:
df.drop(["Species", "PetalWidthCm", "PetalWidthCat"], axis=1).to_csv("iris_real.tsv", sep="\t", header=True)
!head iris_real.tsv
Id      SepalLengthCm   SepalWidthCm    PetalLengthCm
1               3.5     1.4
2       4.9             1.4
3       4.7     3.2
4       4.6     3.1     1.5
5       5.0     3.6     1.4
6       5.4     3.9     1.7
7       4.6     3.4     1.4
8       5.0     3.4     1.5
9       4.4     2.9     1.4
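The empty fields in the first three rows come from how pandas writes `NaN` values. The sketch below reproduces this on a two-row toy table written to an in-memory buffer, so no file is created; the `toy` dataframe is illustrative only.

```python
import io

import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"SepalLengthCm": [np.nan, 4.9], "SepalWidthCm": [3.5, np.nan]},
    index=pd.Index([1, 2], name="Id"),
)

buffer = io.StringIO()
# na_rep="" is the pandas default; it is spelled out here for clarity.
toy.to_csv(buffer, sep="\t", header=True, na_rep="")
lines = buffer.getvalue().splitlines()
print(lines)
# ['Id\tSepalLengthCm\tSepalWidthCm', '1\t\t3.5', '2\t4.9\t']
```

Two consecutive tabs, or a trailing tab, mark a missing value.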
[8]:
df["PetalWidthCat"].to_csv("iris_discrete.tsv", sep="\t", header=True)
!head iris_discrete.tsv
Id      PetalWidthCat
1       small
2       small
3       small
4       small
5       small
6       small
7       small
8       small
9       small

Step 1 - prepare input files

[9]:
# Create object to prepare dataset.
clust = wrapper.Input()

# Load datasets from tsv files.
clust.add_input_data("iris_real.tsv", "real scalar")
clust.add_input_data("iris_discrete.tsv", "discrete")

# Prepare input data:
# - create a final dataframe
# - merge datasets if multiple inputs
clust.prepare_input_data()

# Create files needed by AutoClass.
clust.create_db2_file()
clust.create_hd2_file()
clust.create_model_file()
# Reproducible results make this documentation easier to maintain.
# Bear in mind, however, that the authors of AutoClass C advise against
# this parameter for production runs.
# Use clust.create_sparams_file() instead.
clust.create_sparams_file(reproducible_run=True)
clust.create_rparams_file()
2019-07-07 19:07:58 INFO     Reading data file 'iris_real.tsv' as 'real scalar' with error 0.01
2019-07-07 19:07:58 INFO     Detected encoding: ascii
2019-07-07 19:07:59 INFO     Found 150 rows and 4 columns
2019-07-07 19:07:59 DEBUG    Checking column names
2019-07-07 19:07:59 DEBUG    Index name 'Id'
2019-07-07 19:07:59 DEBUG    Column name 'SepalLengthCm'
2019-07-07 19:07:59 DEBUG    Column name 'SepalWidthCm'
2019-07-07 19:07:59 DEBUG    Column name 'PetalLengthCm'
2019-07-07 19:07:59 INFO     Checking data format
2019-07-07 19:07:59 INFO     Column 'SepalLengthCm'
2019-07-07 19:07:59 INFO     count    149.000000
2019-07-07 19:07:59 INFO     mean       5.848322
2019-07-07 19:07:59 INFO     std        0.828594
2019-07-07 19:07:59 INFO     min        4.300000
2019-07-07 19:07:59 INFO     50%        5.800000
2019-07-07 19:07:59 INFO     max        7.900000
2019-07-07 19:07:59 INFO     ---
2019-07-07 19:07:59 INFO     Column 'SepalWidthCm'
2019-07-07 19:07:59 INFO     count    149.000000
2019-07-07 19:07:59 INFO     mean       3.054362
2019-07-07 19:07:59 INFO     std        0.435034
2019-07-07 19:07:59 INFO     min        2.000000
2019-07-07 19:07:59 INFO     50%        3.000000
2019-07-07 19:07:59 INFO     max        4.400000
2019-07-07 19:07:59 INFO     ---
2019-07-07 19:07:59 INFO     Column 'PetalLengthCm'
2019-07-07 19:07:59 INFO     count    149.000000
2019-07-07 19:07:59 INFO     mean       3.775168
2019-07-07 19:07:59 INFO     std        1.758720
2019-07-07 19:07:59 INFO     min        1.000000
2019-07-07 19:07:59 INFO     50%        4.400000
2019-07-07 19:07:59 INFO     max        6.900000
2019-07-07 19:07:59 INFO     ---
2019-07-07 19:07:59 INFO     Reading data file 'iris_discrete.tsv' as 'discrete'
2019-07-07 19:07:59 INFO     Detected encoding: ascii
2019-07-07 19:07:59 INFO     Found 150 rows and 2 columns
2019-07-07 19:07:59 DEBUG    Checking column names
2019-07-07 19:07:59 DEBUG    Index name 'Id'
2019-07-07 19:07:59 DEBUG    Column name 'PetalWidthCat'
2019-07-07 19:07:59 INFO     Checking data format
2019-07-07 19:07:59 INFO     Column 'PetalWidthCat': 3 different values
2019-07-07 19:07:59 INFO     Preparing input data
2019-07-07 19:07:59 INFO     Final dataframe has 150 lines and 5 columns
2019-07-07 19:07:59 INFO     Searching for missing values
2019-07-07 19:07:59 WARNING  Missing values found in column: SepalLengthCm
2019-07-07 19:07:59 WARNING  Missing values found in column: SepalWidthCm
2019-07-07 19:07:59 WARNING  Missing values found in column: PetalLengthCm
2019-07-07 19:07:59 INFO     Writing autoclass.db2 file
2019-07-07 19:07:59 INFO     If any, missing values will be encoded as '?'
2019-07-07 19:07:59 DEBUG    Writing autoclass.tsv file [for later use]
2019-07-07 19:07:59 INFO     Writing .hd2 file
2019-07-07 19:07:59 INFO     Writing .model file
2019-07-07 19:07:59 INFO     Writing .s-params file
2019-07-07 19:07:59 INFO     Writing .r-params file

Step 2 - prepare run script & run autoclass

[10]:
# Clean previous status file and results if a classification has already been performed.
!rm -f autoclass-run-* *.results-bin

# Search autoclass in path.
wrapper.search_autoclass_in_path()

# Create object to run AutoClass.
run = wrapper.Run()

# Prepare run script.
run.create_run_file()

# Run AutoClass.
run.run()
2019-07-07 19:08:02 INFO     AutoClass C executable found in /home/pierre/.soft/bin/autoclass
2019-07-07 19:08:02 INFO     Writing run file
2019-07-07 19:08:02 INFO     AutoClass C executable found in /home/pierre/.soft/bin/autoclass
2019-07-07 19:08:02 INFO     AutoClass C version: AUTOCLASS C (version 3.3.6unx)
2019-07-07 19:08:02 INFO     Running clustering...

Step 3 - parse and format results

[11]:
timer = 0
step = 2
while not Path("autoclass-run-success").exists():
    timer += step
    sys.stdout.write("\r")
    sys.stdout.write(f"Time: {timer} sec.")
    sys.stdout.flush()
    time.sleep(step)

results = wrapper.Output()
results.extract_results()
results.aggregate_input_data()
results.write_cdt()
results.write_cdt(with_proba=True)
results.write_class_stats()
results.write_dendrogram()
2019-07-07 19:08:05 INFO     Extracting autoclass results
2019-07-07 19:08:05 INFO     Found 150 cases classified in 4 classes
2019-07-07 19:08:05 INFO     Aggregating input data
2019-07-07 19:08:05 INFO     Writing classes + probabilities .tsv file
2019-07-07 19:08:05 INFO     Writing .cdt file
2019-07-07 19:08:05 INFO     Writing .cdt file (with probabilities)
2019-07-07 19:08:05 INFO     Writing class statistics
2019-07-07 19:08:05 INFO     Writing dendrogram
[Figure: dendrogram of the classes found by AutoClass C]
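The waiting loop in the cell above never ends if the `autoclass-run-success` file is never created, for example when the run fails. A minimal sketch of a polling helper with a timeout is given below. `wait_for_file` is a hypothetical helper written for this illustration, not part of AutoClassWrapper, and the short demo uses a temporary directory in place of a real run.

```python
import tempfile
import time
from pathlib import Path


def wait_for_file(path, timeout=60.0, step=0.5):
    """Poll until `path` exists; return False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while not Path(path).exists():
        if time.monotonic() >= deadline:
            return False
        time.sleep(step)
    return True


# Demo with a stand-in marker file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "autoclass-run-success"
    print(wait_for_file(marker, timeout=0.2, step=0.05))  # marker absent
    marker.touch()
    print(wait_for_file(marker, timeout=0.2, step=0.05))  # marker present
```

In the notebook, the loop could then be replaced by a call such as `wait_for_file("autoclass-run-success", timeout=600)`, stopping with an error message when it returns `False`. The timeout value is an arbitrary choice.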

For comparison, add the class number to the original dataset.

[12]:
df_class = pd.read_csv("autoclass_out.tsv", sep="\t", index_col="Id")
df = pd.concat([df, df_class["main-class"]], axis=1, join="outer")
df.head()
[12]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species PetalWidthCat main-class
Id
1 NaN 3.5 1.4 0.2 Iris-setosa small 1
2 4.9 NaN 1.4 0.2 Iris-setosa small 1
3 4.7 3.2 NaN 0.2 Iris-setosa small 1
4 4.6 3.1 1.5 0.2 Iris-setosa small 1
5 5.0 3.6 1.4 0.2 Iris-setosa small 1

Compute class distribution for iris species

[13]:
pd.pivot_table(df, index=["Species"], columns=["main-class"], values=[], aggfunc=len, fill_value=0)
[13]:
main-class 1 2 3 4
Species
Iris-setosa 50 0 0 0
Iris-versicolor 0 23 0 27
Iris-virginica 0 16 32 2
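A contingency table like this one can also be built with `pandas.crosstab`, which is arguably more direct than `pivot_table` with an empty `values` list. The sketch below is self-contained: it rebuilds one row per flower from the counts in the table above instead of reading the AutoClass output, so `flowers` is a stand-in for the notebook's `df`.

```python
import pandas as pd

# Rebuild one row per flower from the counts in the table above.
counts = {
    ("Iris-setosa", 1): 50,
    ("Iris-versicolor", 2): 23,
    ("Iris-versicolor", 4): 27,
    ("Iris-virginica", 2): 16,
    ("Iris-virginica", 3): 32,
    ("Iris-virginica", 4): 2,
}
rows = [key for key, n in counts.items() for _ in range(n)]
flowers = pd.DataFrame(rows, columns=["Species", "main-class"])

# Count flowers for each (species, class) pair; absent pairs become 0.
table = pd.crosstab(flowers["Species"], flowers["main-class"])
print(table)
```

With the notebook's own dataframe, the equivalent call would be `pd.crosstab(df["Species"], df["main-class"])`.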

The setosa species is found only in cluster 1. Note that missing values did not interfere with the classification of the first three flowers as setosa.

The versicolor species is found in clusters 2 and 4.

The virginica species is found mainly in cluster 3 but also in clusters 2 and 4.

AutoClass C automatically determines the optimal number of classes. It is always a good idea to analyse the final results to check whether some clusters can be merged (for instance clusters 2 and 4).
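To see what such a merge would look like, the sketch below adds the counts of cluster 4 to cluster 2, starting from the species-by-class table above. This is plain arithmetic on the reported counts, not a new AutoClass run, and the "purity" measure (the share of each species that lands in its most common cluster) is an illustrative choice made here.

```python
import pandas as pd

# Species-by-class counts copied from the table above.
counts = pd.DataFrame(
    {1: [50, 0, 0], 2: [0, 23, 16], 3: [0, 0, 32], 4: [0, 27, 2]},
    index=["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
)

# Merge cluster 4 into cluster 2.
merged = counts.drop(columns=4)
merged[2] = counts[2] + counts[4]
print(merged)

# Share of each species that falls in its most common cluster.
purity = merged.max(axis=1) / merged.sum(axis=1)
print(purity.round(2))
```

After the merge, all 50 versicolor flowers share a single cluster, while 18 of the 50 virginica flowers remain grouped with them.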