API reference for Input() class

class autoclasswrapper.Input(root_name='autoclass', db2_separator_char='\t', db2_missing_char='?', tolerate_error=False)

AutoClass C input files and parameters.

Parameters

root_name (string, optional (default "autoclass")) – Root name to generate input files for AutoClass C. Example: “autoclass” will lead to “autoclass.db2”, “autoclass.model”, “autoclass.s-params”…
db2_separator_char (string, optional (default: "\t")) – Character used to separate columns of data in AutoClass C db2 file.
db2_missing_char (string, optional (default: "?")) – Character used to encode missing data in AutoClass C db2 file.
tolerate_error (bool, optional (default: False)) – If True, continue generating AutoClass C input files even if an error is encountered. If False, stop at the first error.
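
As a rough illustration of how root_name maps to file names, the sketch below builds the names for the file types mentioned in this reference. The helper expected_files and the EXTENSIONS tuple are hypothetical and not part of autoclasswrapper.

```python
# Hypothetical helper, not part of autoclasswrapper: lists the file names
# that this reference associates with a given root_name.
EXTENSIONS = ("db2", "hd2", "model", "s-params", "r-params")


def expected_files(root_name="autoclass"):
    """Return one file name per extension, all sharing the same root."""
    return [f"{root_name}.{ext}" for ext in EXTENSIONS]


print(expected_files("autoclass"))
```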

had_error

Set to True if an error has been found in the generation of AutoClass C input files.

Type: bool (default: False)

input_datasets

List of all input Datasets.

Type: list of Dataset() objects

full_dataset

Final Dataset used by AutoClass C.

Type: Dataset() object

add_input_data(*args, **kwargs)

Read input data file and append to list of datasets.

Parameters

input_file (string) – Name of the data file to read.
input_type (string) – Type of data contained in input file. Either “real scalar”, “real location” or “discrete”.
input_error (float, optional (default: 0.01)) – Input error value.
input_separator_char (string, optional (default: "\t")) – Character used to separate columns of data in input file.
input_missing_char (string, optional (default: "")) – Character used to encode missing data in input file.

create_db2_file(*args, **kwargs)

Create .db2 file (AutoClass C data).

Also save all data into a .tsv file for later use.

create_hd2_file(*args, **kwargs)

Create .hd2 file (AutoClass C data descriptions).

create_model_file(*args, **kwargs)

Create .model file (AutoClass C data models).

The model is chosen based on the data type and the presence of missing values.

create_rparams_file(*args, **kwargs)

Create .r-params file (AutoClass C report parameters).

create_sparams_file(*args, **kwargs)

Create .s-params file (AutoClass C search parameters).

Parameters

max_duration (int, optional (default: 3600)) – Maximum time (in seconds) for the AutoClass C simulation. If max_duration is set to 0, the simulation runs with no time limit. For more details, see AutoClass C documentation: file search-c.text, lines 493-495.
max_n_tries (int, optional (default: 200)) – Number of trials to run. For more details, see AutoClass C documentation: file search-c.text, lines 403-404.
max_cycles (int, optional (default: 1000)) – Maximum number of cycles per trial. This is an upper bound that may not be reached. For more details, see AutoClass C documentation: file search-c.text, lines 316-317.
start_j_list (list of int, optional (default: [2, 3, 5, 7, 10, 15, 25, 35, 45, 55, 65, 75, 85, 95, 105])) – Initial guesses of the number of clusters. AutoClass default: 2, 3, 5, 7, 10, 15, 25. For more details, see AutoClass C documentation: file search-c.text, line 332.
reproducible_run (boolean, optional (default: False)) – If set to True, define parameters to obtain a reproducible run. According to AutoClass C developers: “These parameter settings are for testing only – they should not be utilized for normal AutoClass runs.” The parameters set are:
- randomize_random_p = false
  Random seed is set to 1 (instead of the usual current time).
- start_fn_type = “block”
  Instead of “random”.
- min_report_period = value greater than duration of run
For more details, see AutoClass C documentation:
- file search-c.text, line 678
- file search-c.text, line 565
- file search-c.text, line 525
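
To make the effect of reproducible_run concrete, here is a minimal sketch that assembles the search parameters listed above as "name = value" lines. The function build_search_params is hypothetical, the line layout is an assumption for illustration, and the choice of max_duration + 1 for min_report_period is only one possible value that exceeds the run duration; this is not the library's implementation.

```python
# Hypothetical sketch, not the autoclasswrapper implementation.
def build_search_params(max_duration=3600, max_n_tries=200, max_cycles=1000,
                        start_j_list=(2, 3, 5, 7, 10, 15, 25, 35, 45, 55,
                                      65, 75, 85, 95, 105),
                        reproducible_run=False):
    """Return search parameters as 'name = value' lines (assumed layout)."""
    params = {
        "max_duration": max_duration,
        "max_n_tries": max_n_tries,
        "max_cycles": max_cycles,
        "start_j_list": ", ".join(str(j) for j in start_j_list),
    }
    if reproducible_run:
        # Settings described above for reproducible (testing-only) runs.
        params["randomize_random_p"] = "false"
        params["start_fn_type"] = '"block"'
        # Assumption: any value above the run duration would do;
        # this choice only makes sense when max_duration > 0.
        params["min_report_period"] = max_duration + 1
    return [f"{name} = {value}" for name, value in params.items()]


for line in build_search_params(max_duration=60, reproducible_run=True):
    print(line)
```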

handle_error()

Handle error during data parsing and formatting.

Function decorator.

Parameters: f (function) – Function to decorate.
Returns: try_function
Return type: function wrapped into error handler
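
The following is a minimal sketch of how such a decorator could interact with the tolerate_error and had_error attributes described earlier. It is one plausible reading of the documented behaviour, not the library's actual code; the Pipeline class and its methods are hypothetical stand-ins.

```python
import functools


def handle_error(f):
    """Wrap a method so that exceptions set a flag instead of propagating."""
    @functools.wraps(f)
    def try_function(self, *args, **kwargs):
        # Skip the step if an earlier one failed and errors are not tolerated.
        if self.had_error and not self.tolerate_error:
            return None
        try:
            return f(self, *args, **kwargs)
        except Exception:
            self.had_error = True
            return None
    return try_function


class Pipeline:
    """Hypothetical class used only to exercise the decorator."""

    def __init__(self, tolerate_error=False):
        self.tolerate_error = tolerate_error
        self.had_error = False
        self.steps_run = []

    @handle_error
    def failing_step(self):
        self.steps_run.append("failing_step")
        raise ValueError("bad input")

    @handle_error
    def next_step(self):
        self.steps_run.append("next_step")
```

With tolerate_error=False, calling failing_step() and then next_step() records the error and skips the second step; with tolerate_error=True both steps run.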

prepare_input_data(*args, **kwargs)

Prepare input data.

Create a final dataframe.
Merge datasets if multiple inputs.

Notes

Dataframes are merged based on an ‘outer’ join (https://pandas.pydata.org/pandas-docs/stable/merging.html):
- all lines are kept
- missing data might appear
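
The consequences of an outer join can be shown with a small plain-Python illustration. The library itself relies on pandas for the merge; the outer_join function and the sample tables below are hypothetical and only mimic the semantics (every row kept, gaps filled with None).

```python
# Plain-Python illustration of outer-join semantics (not the pandas call
# used by the library). Tables are {row_name: {column: value}} dicts.
def outer_join(left, right):
    """Merge two tables, keeping every row name found in either one."""
    columns = []
    for table in (left, right):
        for row in table.values():
            for col in row:
                if col not in columns:
                    columns.append(col)
    merged = {}
    for name in sorted(set(left) | set(right)):
        row = {**left.get(name, {}), **right.get(name, {})}
        # Columns absent for this row become None (missing data).
        merged[name] = {col: row.get(col) for col in columns}
    return merged


scalars = {"g1": {"x": 1.0}, "g2": {"x": 2.0}}
labels = {"g2": {"y": "A"}, "g3": {"y": "B"}}
merged = outer_join(scalars, labels)
for name, row in merged.items():
    print(name, row)
```

Rows g1 and g3 each appear in only one input, so the merged table holds a missing value for them in the other column.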

print_files(*args, **kwargs)

Print generated files.

Debug usage.

Returns: content – Contains all AutoClass C parameter files concatenated.
Return type: string

API reference for Dataset() class

class autoclasswrapper.Dataset(input_file='', data_type='', error=None, separator_char='\t', missing_char='')

Handle input data.

Parameters

input_file (string (default: "")) – Name of the file to read data from.
data_type (string (default: "")) – Type of data contained in input file. Either “real scalar”, “real location”, “discrete” or “merged”. “merged” is a special case corresponding to merged datasets.
error (float, optional (default: 0.01)) – Value of error on data.
separator_char (string, optional (default: "\t")) – Character used to separate columns of data in input file.
missing_char (string, optional (default: "")) – Character used to encode missing data in input file.

input_file

Name of the file to read data from.

Type: string (default: “”)

separator_char

Character used to separate columns of data in input file.

Type: string (default: “\t”)

df

Pandas dataframe that contains all data.

Type: Pandas dataframe (default: None)

column_meta

Dictionary that contains metadata for each column. Keys are column names. Each value is another dictionary: {“type”: data_type, “error”: error, “missing”: False}

Type: dict (default: {})
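
As an illustration of the structure described above, the sketch below builds such a dictionary for a few column names. The helper build_column_meta and the sample column names are hypothetical, not part of the library.

```python
# Hypothetical helper showing the shape of column_meta as described above.
def build_column_meta(column_names, data_type, error):
    """Return one metadata dictionary per column name."""
    return {
        name: {"type": data_type, "error": error, "missing": False}
        for name in column_names
    }


meta = build_column_meta(["t0", "t1"], "real scalar", 0.01)
print(meta["t0"])
```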

check_data_type()

Check data type.

Cast ‘real scalar’ and ‘real location’ to float64.

check_duplicate_col_names()

Check duplicate column names.
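
One simple way to detect duplicate column names is to count occurrences, as sketched below. The function duplicated_names is hypothetical and does not reflect how the library performs or reports the check.

```python
from collections import Counter


# Hypothetical sketch of a duplicate-name check.
def duplicated_names(column_names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(column_names)
    return [name for name, n in counts.items() if n > 1]


print(duplicated_names(["a", "b", "a", "c", "b"]))
```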

clean_column_names()

Clean column names.

Allowed characters are:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789
. (dot)
+ (plus sign)
- (minus sign)
_ (underscore)

Any other character is replaced by ‘_’.
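
The replacement rule above can be sketched with a regular expression. The function clean_name is a hypothetical stand-alone helper working on a single string; the library method operates on the dataset's column names.

```python
import re

# Anything outside the allowed set (ASCII letters, digits, dot, plus,
# minus, underscore) is replaced by an underscore.
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9.+\-_]")


def clean_name(name):
    """Return name with every disallowed character replaced by '_'."""
    return NOT_ALLOWED.sub("_", name)


print(clean_name("gene name(1)"))
```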

guess_encoding()

Guess input file encoding.

Returns: Type of encoding.
Return type: string

read_datafile()

Read data file as pandas dataframe.

The header must be on the first row (header=0). Gene/protein/ORF names must be in the first column (index_col=0).
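
The expected file layout can be illustrated without pandas. The sketch below parses a small tab-separated sample with the standard csv module, treating the first row as the header and the first column as row names. The read_table function and the sample data are hypothetical; the library itself reads the file into a pandas dataframe.

```python
import csv
import io


# Hypothetical stdlib illustration of the layout read_datafile expects.
def read_table(text, separator_char="\t", missing_char=""):
    """Parse text into (column names, {row_name: {column: value}})."""
    rows = list(csv.reader(io.StringIO(text), delimiter=separator_char))
    header = rows[0][1:]  # first row is the header, minus the name column
    table = {}
    for row in rows[1:]:
        name, values = row[0], row[1:]  # first column holds the row name
        table[name] = {
            col: (None if val == missing_char else val)
            for col, val in zip(header, values)
        }
    return header, table


sample = "name\tcol1\tcol2\ngene1\t1.5\t\ngene2\t2.0\t3.1\n"
header, table = read_table(sample)
print(header)
print(table)
```

Values are kept as strings here; cells equal to missing_char are reported as None.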

search_missing_values()

Search for missing values.