Assembled Classes

class assembled.metatask.MetaTask(use_sparse_dtype=True, file_format='csv')[source]

Metatask, a meta version of a normal machine learning task

The Metatask contains the predictions and confidences (e.g. sklearn’s predict_proba) of specific base models and the data of the original (OpenML) task. Moreover, additional side information is captured. This object is filled via functions and is thus initially empty.

In the current version, we manage Metatasks as a meta_dataset (represented by a DataFrame) that contains all instance related data and meta_data (represented by a dict/json) that contains side information.

Parameters:
  • use_sparse_dtype (bool, default=True) – If True, we use pandas’ sparse dtype to store data that is specific to a fold. This can drastically reduce memory usage. Currently, due to a pandas bug, this only applies to confidence columns. FIXME: fix this (make all labels numbers?)

  • file_format (str in {"hdf", "csv", "feather"}, default="csv") – Determines which file format to use.
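
The following is a minimal construction sketch; the argument values simply restate the defaults:

    from assembled.metatask import MetaTask

    # An empty metatask; it is filled later via init_dataset_information,
    # add_predictor, read_metatask_from_files, etc.
    mt = MetaTask(use_sparse_dtype=True, file_format="csv")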

add_predictor(predictor_name, predictions, confidences=None, conf_class_labels=None, predictor_description=None, bad_predictor=False, corruptions_details=None, validation_data=None, fold_predictor=False, fold_predictor_idx=None)[source]

Add a new predictor (base model) to the metatask

Parameters:
  • predictor_name (str) – name of the predictor; must be unique within a given metatask! (Overwriting is not supported yet)

  • predictions (array-like, (n_samples,) or (n_fold_samples,)) – Cross-val-predictions for the predictor that correspond to the fold_indicators of the metatask. If fold_predictor, it only needs to contain the predictions for the specific fold.

  • confidences (array-like, Optional, (n_samples, n_classes) or (n_fold_samples, n_classes), default=None) – Confidences of the predictions. If None (e.g., for regression tasks or because no confidences are available), default confidences are used. If fold_predictor, it only needs to contain the confidences for the specific fold.

  • conf_class_labels (List[str], Optional, default=None) – The order of the class labels in which the confidences are passed to this function. It must contain the same labels as self.class_labels but can be in a different order! This is very important to get right; otherwise, all confidence values will be used incorrectly.

  • predictor_description (str, Optional, default=None) – A short description of the predictor (e.g., the configuration as a string like in OpenML). If None, an automatic description is created.

  • bad_predictor (bool, default=False) – Set whether this predictor has some issue in its data or whether any other reason makes it bad (e.g., bad performance). Here, bad means that the metatask should not use such predictors or should allow them to be filtered out later on.

  • corruptions_details (dict, default=None) – A dict containing more details on why the predictor is bad or any other information about the predictor you want to keep for later. E.g., Assembled-OpenML uses this to store whether the confidence values of the predictor had to be fixed.

  • validation_data (List[Tuple[int, array-like, array-like, array-like]], default=None) – The validation data of the predictor for all relevant folds. If fold_predictor, the list only needs to contain the entry for the fold_predictor’s fold. We assume a list of tuples of the form Tuple[fold index, predictions on the validation data, confidences on the validation data, indices of the validation data]. The validation data can be all training data instances or a subset of them. We expect the instances used for validation to be identical for all predictors of a metatask.

  • fold_predictor (bool, default=False) – Whether the predictor to be added is only for a specific fold. If False, we assume the prediction data of the predictor was computed for all folds while the predictor’s configuration was consistent across folds. If True, we add the predictor and its data only for one specific fold. That fold has to be specified by fold_predictor_idx.

  • fold_predictor_idx (int, default=None) – Required if fold_predictor is True. Specifies the fold index for which this predictor has been computed.
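
A minimal usage sketch, assuming mt is a MetaTask whose dataset information has already been initialized for a binary task with class labels ["no", "yes"] and four instances; all names and values are illustrative:

    import numpy as np

    # Cross-val predictions and confidences in the documented shapes:
    # (n_samples,) and (n_samples, n_classes).
    preds = np.array(["no", "yes", "yes", "no"])
    confs = np.array([[0.8, 0.2],
                      [0.3, 0.7],
                      [0.1, 0.9],
                      [0.6, 0.4]])

    mt.add_predictor(
        predictor_name="rf_1",            # must be unique within the metatask
        predictions=preds,
        confidences=confs,
        conf_class_labels=["no", "yes"],  # column order of the confidences
        predictor_description="RandomForestClassifier(n_estimators=100)",
    )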

filter_predictors(remove_bad_predictors=True, remove_constant_predictors=False, remove_worse_than_random_predictors=False, score_metric=None, maximize_metric=None, max_number_predictors=None)[source]

A method to filter/remove predictors (base models) of a metatask.

Parameters:
  • remove_bad_predictors (bool, default=True) – Remove predictors that were deemed to be bad during the crawling process (because of errors in the prediction data).

  • remove_constant_predictors (bool, default=False) – Remove constant predictors (base models that only predict 1 class for all instances)

  • remove_worse_than_random_predictors (bool, default=False) – Remove predictors that are worse than a random predictor. Requires score_metric and maximize_metric to not be None.

  • score_metric (metric function, default=None) – The metric function used to determine if a predictor is worse than a random predictor. A special format is required due to OpenML’s metrics.

  • maximize_metric (bool, default=None) – Whether the metric computed by the metric function passed by score_metric is to be maximized or not.

  • max_number_predictors (int, default=None) – If not None, keep at most max_number_predictors predictors. Predictors are discarded based on score_metric such that only the best ones are kept. Requires score_metric and maximize_metric to not be None.
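
For illustration, the call below combines several filters, assuming mt is a populated MetaTask. Here, my_metric is a hypothetical stand-in for a metric function in the expected format (see score_metric above):

    from sklearn.metrics import accuracy_score

    def my_metric(y_true, y_pred):
        # Hypothetical wrapper; a real metric must follow the special
        # OpenML-related format mentioned in the score_metric description.
        return accuracy_score(y_true, y_pred)

    mt.filter_predictors(
        remove_bad_predictors=True,
        remove_constant_predictors=True,
        remove_worse_than_random_predictors=True,
        score_metric=my_metric,
        maximize_metric=True,
        max_number_predictors=16,
    )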

init_dataset_information(dataset, target_name, class_labels, feature_names, cat_feature_names, task_type, openml_task_id, folds_indicator, dataset_name)[source]

Fill dataset information and basic task information

Parameters:
  • dataset (pd.DataFrame) – The original dataset from OpenML.

  • target_name (str) – The name of the target column of the dataset.

  • class_labels (List[str]) – The class labels of the dataset.

  • feature_names (List[str]) – The names of the feature columns of the dataset. (Including categorical features).

  • cat_feature_names (List[str]) – The names of the categorical feature columns of the dataset.

  • task_type ({"classification", "regression"}) – String determining the task type.

  • openml_task_id (int) – OpenML task ID. Use a negative number like -1 if there is no OpenML task. This will be the task’s ID / name.

  • folds_indicator (np.ndarray) – Array of shape (n_samples,) indicating the fold of each instance (starting from 0). We do not support hold-out validation currently. Please be aware that the order of instances in the dataset must be equal to the order of instances in folds_indicator; we do not / cannot check this.

  • dataset_name (str) – Name of the dataset
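
A sketch with a toy dataset; all names and values are illustrative:

    import numpy as np
    import pandas as pd
    from assembled.metatask import MetaTask

    # Toy classification dataset with one numeric and one categorical feature.
    df = pd.DataFrame({
        "f_num": [0.1, 0.4, 0.2, 0.9],
        "f_cat": ["a", "b", "a", "b"],
        "label": ["no", "yes", "no", "yes"],
    })

    mt = MetaTask()
    mt.init_dataset_information(
        dataset=df,
        target_name="label",
        class_labels=["no", "yes"],
        feature_names=["f_num", "f_cat"],
        cat_feature_names=["f_cat"],
        task_type="classification",
        openml_task_id=-1,                       # no OpenML task
        folds_indicator=np.array([0, 1, 0, 1]),  # fold per instance, same order as df
        dataset_name="toy_dataset",
    )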

read_folds(fold_indicator)[source]

Read a new fold specification. The user must make sure that data added later conforms to these folds.

read_metatask_from_files(input_dir, openml_task_id, read_wo_dataset=False, delayed_evaluation_load=False)[source]

Build a metatask using data from files

Parameters:
  • input_dir (str) – Directory in which the .json and .csv files for the metatask are stored.

  • openml_task_id (int) – The ID of the metatask/OpenML task that shall be read from the files.

  • read_wo_dataset (bool, default=False) – If the function is used to only read the meta data and prediction data from the files. Needed to determine which sanity checks to run.

  • delayed_evaluation_load (bool, default=False) – If True, the prediction data is only loaded once it is needed for the evaluation of a fold (i.e., when fold_split is called) and is removed again afterwards. Only supported for file formats in {“hdf”} for now.
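
A usage sketch; the directory and task ID are illustrative:

    from assembled.metatask import MetaTask

    mt = MetaTask(file_format="hdf")
    mt.read_metatask_from_files(
        input_dir="path/to/metatasks",
        openml_task_id=3,              # ID of the stored metatask
        delayed_evaluation_load=True,  # only supported for "hdf"
    )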

read_prediction_data_for_fold(fold_index)[source]

Only read the prediction data (test and validation) for a specific fold

read_randomness(random_int_seed_outer_folds, random_int_seed_inner_folds=None)[source]
Parameters:
  • random_int_seed_outer_folds (int or str) – The random seed (integer) used to create the folds. If not available, pass a short description why.

  • random_int_seed_inner_folds (int, default=None) – We assume that the splits used to get the validation data were generated by some controlled randomness. That is, a RandomState object initialized with a base seed was (re-)used each fold to generate the splits. Here, we want to store that base seed.

read_selection_constraints(selection_constraints)[source]

Fill the constraints used to build the metatask.

This only updates but does not overwrite existing keys.

Parameters:

selection_constraints (dict) – A dict containing the names and values for selection constraints

split_meta_dataset(meta_dataset, fold_idx=None, return_copy=False, ignore_prediction_data=False)[source]

Splits the meta dataset into its subcomponents

Parameters:
  • meta_dataset (pd.DataFrame) – The meta dataset (as described in the class documentation) that shall be split.

  • fold_idx (int, default=None) – If int, the int is used to filter fold related data such that only the data for the fold with fold_idx remains in the returned data.

  • return_copy (bool, default=False) – If True, copy before splitting and return the splits of the copy.

  • ignore_prediction_data (bool, default=False) – If True, ignore prediction data during the split. Can be used if the metatask object (self) changes while the meta_dataset that is split does not change.

Return type:

Tuple[DataFrame, Series, DataFrame, DataFrame, DataFrame, DataFrame]

Returns:

  • features (pd.DataFrame) – The features of the original dataset

  • ground_truth (pd.Series) – The ground_truth of the original dataset

  • predictions (pd.DataFrame) – The predictions of the base models

  • confidences (pd.DataFrame) – The confidences of the base models

  • validation_predictions (pd.DataFrame) – The predictions of the base models on the validation data of a fold

  • validation_confidences (pd.DataFrame) – The confidences of the base models on the validation data of a fold
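
A sketch of splitting the data of fold 0, assuming mt is a populated MetaTask and that its meta dataset is available as mt.meta_dataset (as suggested by the class description):

    # The returned tuple follows the order documented above.
    (features, ground_truth, predictions, confidences,
     val_predictions, val_confidences) = mt.split_meta_dataset(
        mt.meta_dataset, fold_idx=0
    )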

to_files(output_dir='')[source]

Store the metatask in two files: one .csv (or .hdf) file and one .json file.

The .csv file stores the complete meta-dataset. The .json file stores additional and required metadata.

Parameters:

output_dir (str) – Directory in which the .json and .csv files for the metatask shall be stored.
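
For example, assuming mt is a populated MetaTask and the path is illustrative:

    # Writes the meta-dataset file (.csv or .hdf) and the metadata (.json)
    # into the given directory.
    mt.to_files(output_dir="path/to/metatasks")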

to_sharable_prediction_data(output_dir='')[source]

Store the metatask without the dataset (i.e., all data but self.dataset’s rows)

yield_evaluation_data(folds_to_run=None)[source]

Yield the dataset and base model data for the specified folds.

Parameters:

folds_to_run (List of int, default=None) – If None, yield data for all folds. If not None, the function will only yield the fold data for the fold indices specified in the list.

Return type:

Tuple[int, DataFrame, DataFrame, Series, Series, DataFrame, DataFrame, DataFrame, DataFrame]

Returns:

  • idx (int) – Fold index

  • X_train (DataFrame) – Feature Data used to train base models

  • X_test (DataFrame) – Feature Data used to test base models

  • y_train (Series) – Label Data used to train base models

  • y_test (Series) – Label Data used to test base models

  • val_base_predictions (DataFrame) – Predictions of each base model on the fold’s validation data (if it exists)

  • test_base_predictions (DataFrame) – Predictions of each base model on the fold’s test data

  • val_base_confidences (DataFrame) – Confidences of each base model on the fold’s validation data (if it exists)

  • test_base_confidences (DataFrame) – Confidences of each base model on the fold’s test data
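
A sketch of iterating over the per-fold evaluation data, assuming mt is a populated MetaTask:

    for (fold_idx, X_train, X_test, y_train, y_test,
         val_base_predictions, test_base_predictions,
         val_base_confidences, test_base_confidences) in mt.yield_evaluation_data(
            folds_to_run=[0, 1]):
        # E.g., fit an ensemble method on the validation predictions and
        # evaluate it on the test predictions of this fold.
        print(fold_idx, X_train.shape, test_base_predictions.shape)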

class assembled.benchmaker.BenchMaker(path_to_metatasks, output_path_benchmark_metatask, tasks_to_use=None, manual_filter_duplicates=False, min_number_predictors=4, max_number_predictors=None, remove_constant_predictors=False, remove_worse_than_random_predictors=False, remove_bad_predictors=False, metric_info=None)[source]

The Benchmark Maker class to build and manage benchmarks for a list of tasks

Parameters:
  • path_to_metatasks (str) – Path to the directory of metatasks that shall be used to build the benchmark.

  • output_path_benchmark_metatask (str) – Path to the directory in which the selected and post-processed metatasks of the benchmark shall be stored.

  • tasks_to_use (List[int], default=None) – If not None, the task IDs in the list are used to determine which metatasks to load from path_to_metatasks.

  • manual_filter_duplicates (bool, default=False) – Whether you want to manually filter duplicated base models. If True, an interactive session is started once needed. FIXME: Experimental; removed for now because it is not backwards compatible.

  • min_number_predictors (int, default=4) – The minimal number of predictors each metatask should have to be included in the benchmark. This is checked per fold. In other words, for each fold it must have at least min_number_predictors predictors.

  • max_number_predictors (int, default=None) – The maximal number of predictors to be used. If more than max_number_predictors predictors exist, we remove the predictors with worse performance w.r.t. metric_info.

  • remove_constant_predictors (bool, default=False) – If True, remove constant predictors from the metatask.

  • remove_worse_than_random_predictors (bool, default=False) – If True, remove worse than random predictors from the metatask.

  • remove_bad_predictors (bool, default=False) – If True, we remove predictors that were marked as bad during the creation of the metatask.

  • metric_info (Tuple[metric:callable, metric_name:str, maximize: bool] scorer metric like, default=None) – The metric information required to determine performance. Must include a callable, a name, and whether the metric is to be optimized. Must be set if remove_worse_than_random_predictors is True.

build_benchmark(share_data='no')[source]
Processes the metatasks according to the BenchMaker’s initialized settings. Moreover, stores the new benchmark metatasks in the appropriate repository and creates a file (benchmark_details.json) containing all relevant details about the benchmark.

Parameters:
  • share_data (str in {"no", "openml", "share_prediction_data"}, default="no") – Determines the strategy used to share the benchmark data.

    • "no" – No effort is undertaken to make the data sharable. This can be used if the full benchmark task files (.csv and .json) are sharable without any license or distribution issues.

    • "openml" – We can use the OpenML platform to re-produce metatasks. This assumes that all predictors in a metatask are from OpenML (obtained via Assembled-OpenML) and that the dataset and task data are from an OpenML task. This allows sharing (i.e., re-building) a metatask by only sharing the benchmark_details.json.

    • "share_meta_data" – This option can be used when you are not able/allowed to share the dataset but can share the meta data like the prediction data or dataset metadata. One use case would be that you have used a dataset from OpenML, but computed all prediction data locally.

      The shared meta data includes validation data (if available), metadata about the dataset (feature names, …), and prediction data (if available).

      It will save the data that is to be shared in a .zip file under the output_path_benchmark_metatask directory. This .zip file together with the benchmark_details.json can be used to re-produce metatasks. The dataset is not part of the .zip or benchmark_details.json. To later fill the dataset into a task, our tools can get the dataset (e.g., via an OpenML task ID) or the dataset must be passed to our tools.
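
An end-to-end sketch; the paths are illustrative and sklearn's roc_auc_score merely stands in for a metric in the expected (callable, name, maximize) format:

    from sklearn.metrics import roc_auc_score  # illustrative stand-in metric

    from assembled.benchmaker import BenchMaker

    bmer = BenchMaker(
        path_to_metatasks="path/to/metatasks",
        output_path_benchmark_metatask="path/to/benchmark",
        min_number_predictors=4,
        remove_constant_predictors=True,
        remove_worse_than_random_predictors=True,
        remove_bad_predictors=True,
        metric_info=(roc_auc_score, "roc_auc", True),  # (callable, name, maximize)
    )
    bmer.build_benchmark(share_data="openml")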

class assembled.compatibility.faked_classifier.FakedClassifier(oracle_X=None, predictions_=None, confidences_=None, oracle_index_=None, simulate_n_features_in_=None, predict_time_=0, predict_proba_time_=0, fit_time_=0, simulate_time=False, label_encoder=False, model_metadata=None)[source]

A fake classifier that simulates a real classifier from the prediction data of the real classifier.

We assume the input passed to init has the same format as the training X (this will be validated in predict). We store the prediction data with an index, where the index is the hash of an instance.

Some assumptions of this (TODO: add checks for this):
  • Simulated Models return the same results for the same input instance

  • Input data is only numeric (as the default preprocessor makes sure)

!Warnings!:
  • If the simulated model returns different results for the same input instance (e.g., as a result of using cross-validation to produce the validation data), we set the prediction values for all duplicates to the first value seen for the duplicates.

A Remark on Hashing if Duplicates are Present:

The hash we are using is consistent between runs of the same INTERPRETER, i.e., Python Version (see https://stackoverflow.com/a/64356731). It is only consistent because we are hashing a tuple of numeric values. This would not work for hashes of strings without changing the code (see https://stackoverflow.com/a/2511075).

Consistency is required if duplicates are present, because otherwise, in the current implementation, the prediction value selected to represent all duplicates might change and thus the data that is passed to an ensemble method would change.

TODO: re-implement this, change hash method, or think of different approach to non-restrictive index management

Parameters:
  • simulate_time (bool, default=False) – Whether the fake model should also fake the time it takes to fit and predict. Note: currently we are not compensating for the overhead of the simulation in any way or form. TODO: this is future work; does not support validation data….

  • oracle_X (array-like, shape (n_samples, n_features)) – The test input samples.

  • predictions (array-like, shape (n_samples,)) – The predictions on the test input samples.

  • oracle_index – The predictions/confidences index list. Represents hash values for each instance of the simulation data, used to re-associate predict/predict_proba calls with the original predictions no matter the subset of the simulation data.

  • simulate_n_features_in (int, default=None) – The number of features seen during and used for validation.

  • fit_time (float, default=0) – Time in seconds needed to fit the original real classifier.

  • predict_time (float, default=0) – Time the real model took to evaluate/infer the predictions

  • confidences (ndarray, shape (n_samples, n_classes)) – The confidences on the test input samples. We expect the confidences to be in the order of the classes as they appear in the training ground truth. (E.g.: by np.unique(y) or by a label encoder)

  • predict_proba_time (float, default=0) – Time the real model took to evaluate/infer the confidences

  • label_encoder (bool, default=False) – Whether we need to apply encoding to the predictions.

  • model_metadata (Optional[dict], default=None) – Additional metadata for the model.

Variables:
  • classes_ (ndarray, shape (n_classes,)) – The classes seen at fit().

  • le_ (LabelEncoder, object) – The label encoder created at fit().

  • n_features_in_ (int) – The number of features seen during fit.
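
A minimal sketch of simulating a classifier from stored prediction data; all arrays are illustrative toy values:

    import numpy as np

    from assembled.compatibility.faked_classifier import FakedClassifier

    # Prediction data of the "real" classifier on the test instances.
    X_test = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
    preds = np.array([0, 1, 1])
    confs = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])

    fc = FakedClassifier(oracle_X=X_test, predictions_=preds, confidences_=confs)

    # fit() only stores class information; predictions are later re-associated
    # with the input instances via hashing.
    X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
    y_train = np.array([0, 1])
    fc.fit(X_train, y_train)

    fc.predict(X_test)        # returns the stored predictions
    fc.predict_proba(X_test)  # returns the stored confidences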

fit(X, y)[source]

Fitting the fake classifier, that is, doing nothing.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – The training input samples.

  • y (array-like, shape (n_samples,)) – The target values. An array of int.

Returns:

self – Returns self.

Return type:

object

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict(X)[source]

Predicting with the fake classifier, that is, returning the previously stored predictions.

Parameters:

X (array-like, shape (n_samples, n_features)) – The input samples.

Returns:

y – Vector containing the class labels for each sample.

Return type:

ndarray, shape (n_samples,)

predict_proba(X)[source]

Predicting with the fake classifier, that is, returning the previously stored confidences.

Parameters:

X (array-like, shape (n_samples, n_features)) – The input samples.

Returns:

y – Returns the probability of each sample for each class in the model, where classes are ordered as they are in self.classes_.

Return type:

ndarray, shape (n_samples, n_classes)

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – Mean accuracy of self.predict(X) wrt. y.

Return type:

float

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance