API reference for Input() class

class autoclasswrapper.Input(root_name='autoclass', db2_separator_char='\t', db2_missing_char='?', tolerate_error=False)

AutoClass C input files and parameters.

Parameters
  • root_name (string, optional (default "autoclass")) – Root name to generate input files for AutoClass C. Example: “autoclass” will lead to “autoclass.db2”, “autoclass.model”, “autoclass.s-params”…

  • db2_separator_char (string, optional (default: "\t")) – Character used to separate columns of data in AutoClass C db2 file.

  • db2_missing_char (string, optional (default: "?")) – Character used to encode missing data in AutoClass C db2 file.

  • tolerate_error (bool, optional (default: False)) – If True, continue generating AutoClass C input files even if an error is encountered. If False, stop at the first error.

had_error

Set to True if an error has been found in the generation of AutoClass C input files.

Type

bool (default: False)

input_datasets

List of all input Datasets.

Type

list of Dataset() objects

full_dataset

Final Dataset used by AutoClass C.

Type

Dataset() object

add_input_data(*args, **kwargs)

Read input data file and append to list of datasets.

Parameters
  • input_file (string) – Name of the data file to read.

  • input_type (string) – Type of data contained in input file. Either “real scalar”, “real location” or “discrete”.

  • input_error (float, optional (default: 0.01)) – Input error value.

  • input_separator_char (string, optional (default: "\t")) – Character used to separate columns of data in input file.

  • input_missing_char (string, optional (default: "")) – Character used to encode missing data in input file.

create_db2_file(*args, **kwargs)

Create .db2 file (AutoClass C data).

Also save all data into a .tsv file for later use.

create_hd2_file(*args, **kwargs)

Create .hd2 file (AutoClass C data descriptions).

create_model_file(*args, **kwargs)

Create .model file (AutoClass C data models).

The model is chosen based on data type and missing values.

create_rparams_file(*args, **kwargs)

Create .r-params file (AutoClass C report parameters).

create_sparams_file(*args, **kwargs)

Create .s-params file (AutoClass C search parameters).

Parameters
  • max_duration (int, optional (default: 3600)) – Maximum time (in seconds) for the AutoClass C simulation. If max_duration = 0, the simulation runs with no time limit. For more details, see AutoClass C documentation: file search-c.text, lines 493-495.

  • max_n_tries (int, optional (default: 200)) – Number of trials to run. For more details, see AutoClass C documentation: file search-c.text, lines 403-404.

  • max_cycles (int, optional (default: 1000)) – Maximum number of cycles per trial. This is an upper bound that may not be reached. For more details, see AutoClass C documentation: file search-c.text, lines 316-317.

  • start_j_list (list of int, optional (default: [2, 3, 5, 7, 10, 15, 25, 35, 45, 55, 65, 75, 85, 95, 105])) – Initial guesses of the number of clusters. AutoClass default: 2, 3, 5, 7, 10, 15, 25. For more details, see AutoClass C documentation: file search-c.text, line 332.

  • reproducible_run (boolean, optional (default: False)) –

    If set to True, define parameters to obtain reproducible run. According to AutoClass C developers: “These parameter settings are for testing only – they should not be utilized for normal AutoClass runs.”

    • randomize_random_p = false

      Random seed is set to 1 (instead of the usual current time)

    • start_fn_type = “block”

      Instead of “random”

    • min_report_period = value greater than duration of run

    For more details, see AutoClass C documentation:

    • file search-c.text, line 678

    • file search-c.text, line 565

    • file search-c.text, line 525
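As a rough illustration of how the parameters above could translate into search settings, the following sketch assembles them as `name = value` lines. The function name `build_search_params`, the comma-separated rendering of start_j_list, and the choice of `max_duration + 1` for min_report_period are assumptions made for this example; they are not taken from autoclasswrapper or from the AutoClass C documentation.

```python
def build_search_params(max_duration=3600, max_n_tries=200, max_cycles=1000,
                        start_j_list=None, reproducible_run=False):
    """Return search parameters as text, one 'name = value' per line.

    Hypothetical helper: the real .s-params syntax may differ.
    """
    if start_j_list is None:
        start_j_list = [2, 3, 5, 7, 10, 15, 25, 35,
                        45, 55, 65, 75, 85, 95, 105]
    params = {
        "max_duration": max_duration,
        "max_n_tries": max_n_tries,
        "max_cycles": max_cycles,
        # Assumption: the list is written as comma-separated integers.
        "start_j_list": ", ".join(str(j) for j in start_j_list),
    }
    if reproducible_run:
        # Settings listed under reproducible_run (testing only).
        params["randomize_random_p"] = "false"
        params["start_fn_type"] = '"block"'
        # Assumption: any value greater than the run duration is acceptable.
        params["min_report_period"] = max_duration + 1
    return "\n".join(f"{name} = {value}" for name, value in params.items())


if __name__ == "__main__":
    print(build_search_params(max_duration=600, reproducible_run=True))
```

The three extra settings are only emitted when reproducible_run is True, in line with the warning that they are meant for testing.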

handle_error()

Handle error during data parsing and formatting.

Function decorator.

Parameters

f (function) –

Returns

try_function

Return type

function wrapped into error handler
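The decorator pattern described here can be sketched as follows. This is not the library's implementation: the `Pipeline` class, its `errors` and `steps_run` attributes, and the rule of skipping later steps after a first error are assumptions chosen to match the documented meaning of had_error and tolerate_error.

```python
import functools


def handle_error(f):
    """Wrap a method so that exceptions are recorded instead of propagating."""
    @functools.wraps(f)
    def try_function(self, *args, **kwargs):
        # Assumption: once an error has been seen, later steps are skipped
        # unless the caller asked to tolerate errors.
        if self.had_error and not self.tolerate_error:
            return None
        try:
            return f(self, *args, **kwargs)
        except Exception as exc:
            self.had_error = True
            self.errors.append(str(exc))  # hypothetical attribute
            return None
    return try_function


class Pipeline:
    """Hypothetical stand-in for a class whose methods use the decorator."""

    def __init__(self, tolerate_error=False):
        self.tolerate_error = tolerate_error
        self.had_error = False
        self.errors = []
        self.steps_run = []

    @handle_error
    def failing_step(self):
        self.steps_run.append("failing_step")
        raise ValueError("bad input")

    @handle_error
    def next_step(self):
        self.steps_run.append("next_step")
        return "done"
```

Because the wrapper takes `*args, **kwargs`, decorated methods appear with that generic signature in generated documentation, which is consistent with how the methods are listed in this reference.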

prepare_input_data(*args, **kwargs)

Prepare input data.

  • Create a final dataframe.

  • Merge datasets if multiple inputs.

Notes

Dataframes are merged using an ‘outer’ join (https://pandas.pydata.org/pandas-docs/stable/merging.html):

  • all lines are kept

  • missing data might appear
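To make the effect of an outer join concrete, here is a minimal standard-library sketch that uses plain dictionaries instead of pandas dataframes. The function `outer_join` and the sample row and column names are hypothetical; the library itself relies on pandas for this step.

```python
def outer_join(tables):
    """Merge tables on row name, keeping every row and every column.

    Each table maps a row name to a dict of column name -> value.
    Cells with no value in any input table are filled with None.
    """
    row_names = []
    columns = []
    for table in tables:
        for name, row in table.items():
            if name not in row_names:
                row_names.append(name)
            for col in row:
                if col not in columns:
                    columns.append(col)
    merged = {}
    for name in row_names:
        merged[name] = {col: None for col in columns}
        for table in tables:
            merged[name].update(table.get(name, {}))
    return merged


if __name__ == "__main__":
    scalars = {"g1": {"x": 1.0}, "g2": {"x": 2.0}}
    labels = {"g2": {"y": "a"}, "g3": {"y": "b"}}
    for name, row in outer_join([scalars, labels]).items():
        print(name, row)
```

Rows present in only one input (g1 and g3 in the sample) are kept, and the cells they lack become missing values.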

print_files(*args, **kwargs)

Print generated files.

Debug usage.

Returns

content – Contents of all AutoClass C parameter files, concatenated.

Return type

string
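The Input() methods documented above are typically called in sequence. The sketch below shows one plausible ordering; the helper name `generate_autoclass_inputs`, the file name in the usage comment, and the order of the calls are assumptions for illustration, not something this reference prescribes. The class is passed in as an argument so that the function can be read and exercised without the library installed.

```python
def generate_autoclass_inputs(input_cls, data_file, data_type="real scalar",
                              root_name="autoclass"):
    """Run the file-generation steps on a single data file.

    input_cls is expected to behave like autoclasswrapper.Input.
    The order of the calls below is an assumption.
    """
    clust = input_cls(root_name=root_name)
    clust.add_input_data(input_file=data_file, input_type=data_type)
    clust.prepare_input_data()
    clust.create_db2_file()
    clust.create_hd2_file()
    clust.create_model_file()
    clust.create_sparams_file()
    clust.create_rparams_file()
    return clust


# Usage, assuming autoclasswrapper is installed and "data.tsv" exists:
#
#   import autoclasswrapper
#   clust = generate_autoclass_inputs(autoclasswrapper.Input, "data.tsv")
#   if not clust.had_error:
#       print(clust.print_files())
```

Checking had_error before going further matters when tolerate_error is left at False, since generation stops at the first error.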

API reference for Dataset() class

class autoclasswrapper.Dataset(input_file='', data_type='', error=None, separator_char='\t', missing_char='')

Handle input data.

Parameters
  • input_file (string (default: "")) – Name of the file to read data from.

  • data_type (string (default: "")) – Type of data contained in input file. Either “real scalar”, “real location”, “discrete” or “merged”. “merged” is a special case corresponding to merged datasets.

  • error (float, optional (default: 0.01)) – Value of error on data.

  • separator_char (string, optional (default: "\t")) – Character used to separate columns of data in input file.

  • missing_char (string, optional (default: "")) – Character used to encode missing data in input file.

input_file

Name of the file to read data from.

Type

string (default: “”)

separator_char

Character used to separate columns of data in input file.

Type

string (default: “\t”)

df

Pandas dataframe that contains all data.

Type

Pandas dataframe (default: None)

column_meta

Dictionary that contains metadata for each column. Keys are column names. Each value is another dictionary: {“type”: data_type, “error”: error, “missing”: False}

Type

dict (default: {})

check_data_type()

Check data type.

Cast ‘real scalar’ and ‘real location’ data to float64.

check_duplicate_col_names()

Check duplicate column names.
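A duplicate check of this kind can be sketched with the standard library. The function `find_duplicate_names` is hypothetical and only illustrates the idea; how the library reports duplicates is not specified in this reference.

```python
from collections import Counter


def find_duplicate_names(column_names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(column_names)
    return [name for name, count in counts.items() if count > 1]


if __name__ == "__main__":
    print(find_duplicate_names(["cond_1", "cond_2", "cond_1"]))
```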

clean_column_names()

Clean column names.

Allowed characters are:

  • ABCDEFGHIJKLMNOPQRSTUVWXYZ

  • abcdefghijklmnopqrstuvwxyz

  • 0123456789

  • . (dot)

  • + (plus sign)

  • - (minus sign)

  • _ (underscore)

Unauthorized characters are replaced by ‘_’
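The replacement rule above can be expressed with a single regular expression. This is a minimal sketch, assuming each unauthorized character is replaced individually; `clean_name` is a hypothetical helper, not the library's function.

```python
import re

# Anything outside letters, digits, dot, plus, minus and underscore.
_UNAUTHORIZED = re.compile(r"[^A-Za-z0-9.+\-_]")


def clean_name(name):
    """Replace every unauthorized character in a column name with '_'."""
    return _UNAUTHORIZED.sub("_", name)


if __name__ == "__main__":
    print(clean_name("gene name (1)"))
```

Note that two different raw names can become identical after cleaning, for example "a b" and "a/b".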

guess_encoding()

Guess input file encoding.

Returns

Type of encoding.

Return type

string

read_datafile()

Read data file as pandas dataframe.

The header must be on the first row (header=0). Gene/protein/ORF names must be in the first column (index_col=0).
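The expected file layout can be illustrated with a small standard-library parser. The sample data, the function `read_table`, and the use of None for missing cells are assumptions for this sketch; the library itself reads the file into a pandas dataframe.

```python
import csv
import io

# Hypothetical tab-separated input: header on the first row,
# row names in the first column, one empty (missing) cell.
SAMPLE = "gene\tcond_1\tcond_2\nYAL001C\t0.12\t1.5\nYAL002W\t\t0.7\n"


def read_table(text, separator_char="\t", missing_char=""):
    """Parse text into (column names, {row name: {column: value}})."""
    reader = csv.reader(io.StringIO(text), delimiter=separator_char)
    header = next(reader)      # first row holds the column names
    columns = header[1:]       # first column holds the row names
    table = {}
    for row in reader:
        name, values = row[0], row[1:]
        table[name] = {
            col: (None if val == missing_char else val)
            for col, val in zip(columns, values)
        }
    return columns, table


if __name__ == "__main__":
    cols, rows = read_table(SAMPLE)
    print(cols)
    print(rows)
```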

search_missing_values()

Search for missing values.