API reference for Input() class¶
-
class
autoclasswrapper.
Input
(root_name='autoclass', db2_separator_char='\t', db2_missing_char='?', tolerate_error=False)¶ AutoClass C input files and parameters.
- Parameters
root_name (string, optional (default: "autoclass")) – Root name used to generate input files for AutoClass C. Example: “autoclass” will lead to “autoclass.db2”, “autoclass.model”, “autoclass.s-params”…
db2_separator_char (string, optional (default: "\t")) – Character used to separate columns of data in AutoClass C db2 file.
db2_missing_char (string, optional (default: "?")) – Character used to encode missing data in AutoClass C db2 file.
tolerate_error (bool, optional (default: False)) – If True, continue generation of AutoClass C input files even if an error is encountered. If False, stop at the first error.
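As a minimal sketch of the naming scheme described for root_name, the helper below derives input file names from a root name. The helper name `expected_input_files` is hypothetical and not part of autoclasswrapper; the extensions are the file types mentioned on this page.

```python
def expected_input_files(root_name="autoclass"):
    """Return AutoClass C input file names derived from a root name.

    Hypothetical helper for illustration; not part of autoclasswrapper.
    """
    # Extensions mentioned in this reference: data, data descriptions,
    # models, search parameters and report parameters.
    extensions = ("db2", "hd2", "model", "s-params", "r-params")
    return [f"{root_name}.{ext}" for ext in extensions]


print(expected_input_files("autoclass"))
```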
-
had_error
¶ Set to True if an error has been found in the generation of AutoClass C input files.
- Type
bool (default False)
-
input_datasets
¶ List of all input Datasets.
- Type
list of Dataset() objects
-
add_input_data
(*args, **kwargs)¶ Read input data file and append to list of datasets.
- Parameters
input_file (string) – Name of the data file to read.
input_type (string) – Type of data contained in input file. Either “real scalar”, “real location” or “discrete”.
input_error (float, optional (default: 0.01)) – Input error value.
input_separator_char (string, optional (default: "\t")) – Character used to separate columns of data in input file.
input_missing_char (string, optional (default: "")) – Character used to encode missing data in input file.
-
create_db2_file
(*args, **kwargs)¶ Create .db2 file (AutoClass C data).
Also save all data into a .tsv file for later use.
-
create_hd2_file
(*args, **kwargs)¶ Create .hd2 file (AutoClass C data descriptions).
-
create_model_file
(*args, **kwargs)¶ Create .model file (AutoClass C data models).
The choice of model is based on data type and missing values.
-
create_rparams_file
(*args, **kwargs)¶ Create .r-params file (AutoClass C report parameters).
-
create_sparams_file
(*args, **kwargs)¶ Create .s-params file (AutoClass C search parameters).
- Parameters
max_duration (int, optional (default: 3600)) – Maximum time (in seconds) for the AutoClass C simulation. If max_duration = 0, the simulation runs with no time limit. For more details, see the AutoClass C documentation: file search-c.text, lines 493-495.
max_n_tries (int, optional (default: 200)) – Number of trials to run. For more details, see the AutoClass C documentation: file search-c.text, lines 403-404.
max_cycles (int, optional (default: 1000)) – Maximum number of cycles per trial. This is an upper bound that may not be reached. For more details, see the AutoClass C documentation: file search-c.text, lines 316-317.
start_j_list (list of int, optional (default: [2, 3, 5, 7, 10, 15, 25, 35, 45, 55, 65, 75, 85, 95, 105])) – Initial guesses of the number of clusters. AutoClass default: 2, 3, 5, 7, 10, 15, 25. For more details, see the AutoClass C documentation: file search-c.text, line 332.
reproducible_run (boolean, optional (default: False)) – If set to True, define parameters to obtain a reproducible run. According to the AutoClass C developers: “These parameter settings are for testing only – they should not be utilized for normal AutoClass runs.”
- randomize_random_p = false: the random seed is set to 1 (instead of the usual current time).
- start_fn_type = “block” (instead of “random”).
- min_report_period = value greater than the duration of the run.
For more details, see the AutoClass C documentation: file search-c.text, lines 678, 565 and 525.
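To make the search parameters above more concrete, here is a rough sketch that assembles them as "name = value" lines. The function `sparams_lines` is hypothetical, and the exact syntax AutoClass C expects in a .s-params file is an assumption that should be checked against search-c.text; this is not the code used by create_sparams_file.

```python
def sparams_lines(max_duration=3600, max_n_tries=200, max_cycles=1000,
                  start_j_list=(2, 3, 5, 7, 10, 15, 25, 35, 45, 55,
                                65, 75, 85, 95, 105),
                  reproducible_run=False):
    """Build 'name = value' lines for a search-parameter file.

    Illustrative only: the file syntax is assumed, not taken from
    the AutoClass C documentation.
    """
    lines = [
        f"max_duration = {max_duration}",
        f"max_n_tries = {max_n_tries}",
        f"max_cycles = {max_cycles}",
        "start_j_list = " + ", ".join(str(j) for j in start_j_list),
    ]
    if reproducible_run:
        # Settings listed above for reproducible (testing-only) runs.
        lines.append("randomize_random_p = false")
        lines.append('start_fn_type = "block"')
        # Assumed placeholder: one second more than the time limit.
        # Only meaningful when max_duration > 0.
        lines.append(f"min_report_period = {max_duration + 1}")
    return lines


print("\n".join(sparams_lines(reproducible_run=True)))
```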
-
handle_error
()¶ Handle error during data parsing and formatting.
Function decorator.
- Parameters
f (function) –
- Returns
try_function
- Return type
function wrapped into error handler
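The interplay between such a decorator and the tolerate_error and had_error attributes could look roughly like the sketch below. It is an assumption about the general pattern, not the library's actual implementation; the `Pipeline` class and its methods are hypothetical stand-ins.

```python
import functools


def handle_error(f):
    """Wrap a method so that errors are recorded on the instance."""
    @functools.wraps(f)
    def try_function(self, *args, **kwargs):
        # Skip further steps once an error occurred, unless tolerated.
        if self.had_error and not self.tolerate_error:
            return None
        try:
            return f(self, *args, **kwargs)
        except Exception as err:
            print(f"Error in {f.__name__}: {err}")
            self.had_error = True
            return None
    return try_function


class Pipeline:
    """Hypothetical class exposing had_error and tolerate_error."""

    def __init__(self, tolerate_error=False):
        self.tolerate_error = tolerate_error
        self.had_error = False
        self.steps_run = []

    @handle_error
    def failing_step(self):
        self.steps_run.append("failing_step")
        raise ValueError("bad input")

    @handle_error
    def next_step(self):
        self.steps_run.append("next_step")


strict = Pipeline(tolerate_error=False)
strict.failing_step()
strict.next_step()    # skipped: an error was already recorded

lenient = Pipeline(tolerate_error=True)
lenient.failing_step()
lenient.next_step()   # still runs despite the earlier error
```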
-
prepare_input_data
(*args, **kwargs)¶ Prepare input data.
Create a final dataframe.
Merge datasets if multiple inputs.
Notes
Dataframes are merged with an ‘outer’ join (https://pandas.pydata.org/pandas-docs/stable/merging.html): all lines are kept, and missing data might appear.
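The effect of an outer join can be illustrated without pandas. The sketch below uses plain dictionaries as a stand-in for dataframes; `outer_join` and the sample gene names are hypothetical, and the two tables are assumed to have distinct column names.

```python
def outer_join(left, right):
    """Outer-join two tables stored as {row_name: {column: value}}.

    Plain-Python stand-in for a pandas outer join, for illustration.
    """
    left_cols = sorted({c for row in left.values() for c in row})
    right_cols = sorted({c for row in right.values() for c in row})
    merged = {}
    # Keep every row name that appears in either table.
    for name in sorted(set(left) | set(right)):
        row = {c: left.get(name, {}).get(c) for c in left_cols}
        row.update({c: right.get(name, {}).get(c) for c in right_cols})
        merged[name] = row
    return merged


expression = {"geneA": {"expr": 1.2}, "geneB": {"expr": 0.7}}
location = {"geneB": {"loc": 3.0}, "geneC": {"loc": 5.5}}
merged = outer_join(expression, location)
for name, row in merged.items():
    print(name, row)
```

Rows present in only one input end up with None in the other input's columns, which is how missing data can appear after merging.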
-
print_files
(*args, **kwargs)¶ Print generated files.
Debug usage.
- Returns
content – Contains all AutoClass C parameter files concatenated.
- Return type
string
API reference for Dataset() class¶
-
class
autoclasswrapper.
Dataset
(input_file='', data_type='', error=None, separator_char='\t', missing_char='')¶ Handle input data.
- Parameters
input_file (string (default: "")) – Name of the file to read data from.
data_type (string (default: "")) – Type of data contained in input file. Either “real scalar”, “real location”, “discrete” or “merged”. “merged” is a special case corresponding to merged datasets.
error (float, optional (default: 0.01)) – Value of error on data.
separator_char (string, optional (default: "\t")) – Character used to separate columns of data in input file.
missing_char (string, optional (default: "")) – Character used to encode missing data in input file.
-
input_file
¶ Name of the file to read data from.
- Type
string (default: “”)
-
separator_char
¶ Character used to separate columns of data in input file.
- Type
string (default: “\t”)
-
df
¶ Pandas dataframe that contains all data.
- Type
Pandas dataframe (default: None)
-
column_meta
¶ Dictionary that contains metadata for each column. Keys are column names. Values are another dictionary: {“type”: data_type, “error”: error, “missing”: False}
- Type
dict (default: {})
-
check_data_type
()¶ Check data type.
Cast ‘real scalar’ and ‘real location’ data to float64.
-
check_duplicate_col_names
()¶ Check duplicate column names.
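One simple way to detect duplicate column names is to count occurrences, as in the sketch below. The function `find_duplicate_names` is hypothetical and only illustrates the idea; how the library reports duplicates is not described here.

```python
from collections import Counter


def find_duplicate_names(column_names):
    """Return names that occur more than once, in first-seen order."""
    counts = Counter(column_names)
    return [name for name, n in counts.items() if n > 1]


print(find_duplicate_names(["t0", "t1", "t0", "t2", "t1"]))
```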
-
clean_column_names
()¶ Clean column names.
Allowed characters are:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789
. (dot)
+ (plus sign)
- (minus sign)
_ (underscore)
Unauthorized characters are replaced by ‘_’
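The replacement rule above can be expressed as a single regular expression. This is a sketch of the rule as documented, with a hypothetical function name `clean_name`; it is not necessarily how the library implements it.

```python
import re

# Anything outside the allowed set: letters, digits, dot, plus,
# minus and underscore.
_NOT_ALLOWED = re.compile(r"[^A-Za-z0-9.+\-_]")


def clean_name(name):
    """Replace every unauthorized character with an underscore."""
    return _NOT_ALLOWED.sub("_", name)


print(clean_name("gene name (1)"))
```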
-
guess_encoding
()¶ Guess input file encoding.
- Returns
Type of encoding.
- Return type
string
-
read_datafile
()¶ Read data file as pandas dataframe.
The header must be on the first row (header=0). Gene/protein/ORF names must be in the first column (index_col=0).
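The expected file layout (header on the first row, names in the first column) can be illustrated with the standard csv module. The sketch below is a dependency-free stand-in for reading into a pandas dataframe; `read_table` and the sample content are hypothetical.

```python
import csv
import io


def read_table(handle, separator="\t"):
    """Read a delimited table: header on row 1, row names in column 1."""
    reader = csv.reader(handle, delimiter=separator)
    header = next(reader)
    columns = header[1:]   # the first header cell labels the name column
    table = {}
    for fields in reader:
        if not fields:
            continue       # ignore blank lines
        name, values = fields[0], fields[1:]
        table[name] = dict(zip(columns, values))
    return columns, table


sample = "name\tx\ty\ngeneA\t1.0\t2.0\ngeneB\t3.5\t\n"
columns, table = read_table(io.StringIO(sample))
print(columns)
print(table)
```

Values are kept as strings here; an empty string marks a missing value when missing_char is "".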
-
search_missing_values
()¶ Search for missing values.
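A search for missing values could record, for each column, whether any row holds the missing-value character, mirroring the “missing” key of column_meta. The sketch below works on plain dictionaries; `flag_missing` and the sample data are hypothetical and do not reflect the library's internals.

```python
def flag_missing(table, columns, missing_char=""):
    """Return {column: {"missing": bool}} for a {row: {col: value}} table."""
    meta = {}
    for col in columns:
        # A column has missing data if any row lacks a real value for it.
        has_missing = any(
            row.get(col, missing_char) == missing_char
            for row in table.values()
        )
        meta[col] = {"missing": has_missing}
    return meta


data = {
    "geneA": {"x": "1.0", "y": "2.0"},
    "geneB": {"x": "3.5", "y": ""},
}
column_flags = flag_missing(data, ["x", "y"])
print(column_flags)
```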