skripts.ML package
Submodules
skripts.ML.ML4com module
Methods for Machine Learning for community inference.
- class skripts.ML.ML4com.SKL_Classifier(X, ys, cv, configuration_space: ConfigurationSpace, classifier, n_trials: int)[source]
Bases:
object
Representation of a scikit-learn classifier for SMAC3.
- skripts.ML.ML4com.dict_permutations(dictionary: dict) → List[dict] [source]
Combine all value combinations in a dictionary into a list of dictionaries.
- Parameters:
dictionary (dict) – Input dictionary
- Returns:
All permutations of dictionary
- Return type:
List[dict]
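The behaviour described above can be sketched as a Cartesian product over the dictionary's values. This is a minimal illustration, not the module's implementation: it assumes every value is a list of candidate settings, and the name `dict_permutations_sketch` is hypothetical.

```python
from itertools import product
from typing import Dict, List


def dict_permutations_sketch(dictionary: Dict[str, list]) -> List[dict]:
    """Return one dict per combination of the listed values.

    Assumption: each value in `dictionary` is a list of candidates.
    """
    keys = list(dictionary.keys())
    # product(*values) yields one tuple per combination, in key order
    return [dict(zip(keys, combo)) for combo in product(*dictionary.values())]


grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}
combos = dict_permutations_sketch(grid)
for combo in combos:
    print(combo)
```

With two keys of two candidates each, the sketch yields four dictionaries, one per combination.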
- skripts.ML.ML4com.evaluate_model_sklearn(X, ys, labels, indir, data_source, algorithm_name, outdir, verbosity=0)[source]
Evaluate a model against the given hyperparameters for all organisms and save the resulting metrics and feature importances.
- Parameters:
X (dataframe-like) – Data values
ys (dataframe-like) – True classes for one sample
labels (dataframe-like) – Labels
indir (path-like) – Input directory
data_source (str) – Data source
algorithm_name (str) – Algorithm name
outdir (path-like) – Output directory
verbosity (int, optional) – Level of verbosity, defaults to 0
- skripts.ML.ML4com.extract_best_hyperparameters_from_incumbent(incumbent, configuration_space: ConfigurationSpace)[source]
Extract an optimized set of hyperparameters from the incumbent. Returns the default configuration if none was found.
- Parameters:
incumbent (config(s)) – Incumbent configuration(s), e.g. as returned by tune_classifier
configuration_space (ConfigSpace.ConfigurationSpace) – Configuration Space
- Returns:
Best hyperparameters
- Return type:
config
- skripts.ML.ML4com.extract_metrics(true_labels, prediction, scoring, run_label=None, cv_i=None, metrics_df: pandas.DataFrame = pd.DataFrame(columns=["Run", "Cross-Validation run", "Accuracy", "AUC", "TPR", "FPR", "Threshold", "Conf_Mat"])) → DataFrame [source]
Extract metrics from a machine learning run: TPR, FPR and thresholds for the ROC curve, plus AUC, accuracy and the confusion matrix.
- Parameters:
true_labels (array-like) – True labels
prediction (array-like) – Prediction
scoring (array-like) – Scoring
run_label (any, optional) – Label of run, defaults to None
cv_i (any (usually int), optional) – cross-validation instance, defaults to None
metrics_df (pd.DataFrame, optional) – Dataframe with all extracted metrics, defaults to pd.DataFrame(columns=["Run", "Cross-Validation run", "Accuracy", "AUC", "TPR", "FPR", "Threshold", "Conf_Mat"])
- Returns:
Dataframe, filled with metrics
- Return type:
pandas.DataFrame
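A rough sketch of how the listed metrics could be collected with scikit-learn follows. It is an assumption about the approach, not the module's code: the helper name `extract_metrics_sketch` is hypothetical, `scoring` is taken to be the predicted score of the positive class, and labels are assumed to be binary.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, auc, confusion_matrix, roc_curve

COLUMNS = ["Run", "Cross-Validation run", "Accuracy", "AUC",
           "TPR", "FPR", "Threshold", "Conf_Mat"]


def extract_metrics_sketch(true_labels, prediction, scoring,
                           run_label=None, cv_i=None, metrics_df=None):
    """Append one row of metrics for a single run (binary labels assumed)."""
    # ROC curve from the continuous scores, AUC from the curve
    fpr, tpr, thresholds = roc_curve(true_labels, scoring)
    row = {
        "Run": run_label,
        "Cross-Validation run": cv_i,
        "Accuracy": accuracy_score(true_labels, prediction),
        "AUC": auc(fpr, tpr),
        "TPR": tpr,
        "FPR": fpr,
        "Threshold": thresholds,
        "Conf_Mat": confusion_matrix(true_labels, prediction),
    }
    new_row = pd.DataFrame([row], columns=COLUMNS)
    # Start a fresh frame when nothing has been collected yet
    if metrics_df is None or metrics_df.empty:
        return new_row
    return pd.concat([metrics_df, new_row], ignore_index=True)


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
y_pred = [0, 0, 0, 1]
df = extract_metrics_sketch(y_true, y_pred, y_score, run_label="demo", cv_i=0)
df = extract_metrics_sketch(y_true, y_pred, y_score, run_label="demo", cv_i=1,
                            metrics_df=df)
```

The sketch uses `None` as the default for `metrics_df` and creates the frame inside the function, which avoids sharing one mutable default between calls; the documented signature uses a DataFrame default instead.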
- skripts.ML.ML4com.individual_layers_to_tuple(config) → dict [source]
Change hidden_layer_sizes to tuple representation.
- Parameters:
config (dict-like) – Configuration
- Returns:
Config with changed hidden_layer_sizes as tuples
- Return type:
dict
- skripts.ML.ML4com.intersect_impute_on_left(df_base: DataFrame, df_right: DataFrame, imputation: str = 'zero') → DataFrame [source]
Use all indices from the left (df_base) DataFrame and fill them with the matching entries of df_right. Indices without a match are imputed by zero, by the mean, or via kNN, where k can be specified as a number (default: 5).
- Parameters:
df_base (pandas.DataFrame) – Left dataframe
df_right (pandas.DataFrame) – Right dataframe
imputation (str, optional) – Imputation procedure [zero|mean|kNN], defaults to “zero”
- Returns:
Merged dataframe with imputed values
- Return type:
pandas.DataFrame
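One possible reading of this behaviour, sketched with pandas and scikit-learn: reindex the right dataframe onto the left index, then fill the rows that had no match. The function name `intersect_impute_on_left_sketch` and the explicit `k` argument are assumptions made for illustration; the documented function takes only the `imputation` string.

```python
import pandas as pd
from sklearn.impute import KNNImputer


def intersect_impute_on_left_sketch(df_base, df_right, imputation="zero", k=5):
    """Align df_right to df_base's index and impute unmatched rows."""
    # Rows of df_base that are absent from df_right become NaN;
    # rows only present in df_right are dropped
    merged = df_right.reindex(df_base.index)
    if imputation == "zero":
        return merged.fillna(0.0)
    if imputation == "mean":
        # Column-wise mean of the matched rows
        return merged.fillna(merged.mean())
    if imputation == "kNN":
        # Assumes every column has at least one observed value
        imputer = KNNImputer(n_neighbors=k)
        values = imputer.fit_transform(merged)
        return pd.DataFrame(values, index=merged.index, columns=merged.columns)
    raise ValueError(f"Unknown imputation: {imputation!r}")


base = pd.DataFrame({"y": [0, 1, 0]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [1.0, 3.0, 9.0]}, index=["a", "c", "d"])
zero_filled = intersect_impute_on_left_sketch(base, right, "zero")
mean_filled = intersect_impute_on_left_sketch(base, right, "mean")
```

Here index "b" has no match in `right`, so it is filled with 0 or with the mean of the matched rows, while "d" is discarded because it does not occur in `base`.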
- skripts.ML.ML4com.join_df_metNames(df: DataFrame, grouper='peakID', include_mass=False) → DataFrame [source]
Join the dataframe column metNames along the grouper column. Sets a common index for combining positively and negatively charged dataframes by their metabolite names.
- Parameters:
df (pandas.DataFrame) – Input dataframe
grouper (str, optional) – Grouper column, defaults to “peakID”
include_mass (bool, optional) – Include mass column, defaults to False
- Returns:
Combined dataframe
- Return type:
pandas.DataFrame
- skripts.ML.ML4com.nested_cross_validate_model_sklearn(X, ys, labels, classifier, configuration_space, n_trials, name, algorithm_name, outdir, fold: KFold | StratifiedKFold = KFold(n_splits=5, random_state=None, shuffle=False), inner_fold: int = 3, n_workers: int = 1, verbosity: int = 0)[source]
Cross-validate a model against the given hyperparameters for all organisms in a nested manner.
- Parameters:
X (dataframe-like) – Data values
ys (dataframe-like) – True classes for one sample
labels (dataframe-like) – Labels
classifier (Classifier from sklearn) – Classifier
configuration_space (ConfigSpace.ConfigurationSpace) – Configuration Space
n_trials (int) – Number of trials
name (str) – Name of run
algorithm_name (str) – Algorithm name
outdir (path-like) – Output directory
fold (Union[KFold, StratifiedKFold], optional) – Outer fold, defaults to KFold()
inner_fold (int, optional) – Inner fold, defaults to 3
n_workers (int, optional) – Number of workers, defaults to 1
verbosity (int, optional) – Level of verbosity, defaults to 0
- Returns:
Dataframe with metrics on different levels
- Return type:
tuple[pandas.DataFrame]
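The nested scheme can be illustrated with plain scikit-learn: an outer fold estimates generalisation performance, while an inner cross-validation on each outer training split selects hyperparameters. The sketch below swaps in `GridSearchCV` for the SMAC3-based tuning that this module uses, and the dataset, classifier and parameter grid are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, only for illustration
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

outer = KFold(n_splits=5)           # outer fold, mirrors the documented default
param_grid = {"max_depth": [2, 4]}  # hypothetical search space

outer_scores = []
for train_idx, test_idx in outer.split(X):
    # Inner 3-fold CV tunes hyperparameters on the outer training split only
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
    search.fit(X[train_idx], y[train_idx])
    # The refitted best model is scored on the held-out outer split
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print(len(outer_scores), "outer-fold accuracies collected")
```

Because the held-out outer split never takes part in tuning, the outer scores are not biased by the hyperparameter search.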
- skripts.ML.ML4com.plot_cv_confmat(ys, target_labels, accuracies, confusion_matrices, outdir, name)[source]
Plot heatmap of confusion matrix
- Parameters:
ys (dataframe-like) – Targets
target_labels (array-like) – Target labels
accuracies (dataframe-like) – Accuracies
confusion_matrices (dataframe-like) – Confusion matrices
outdir (path-like) – Output directory
name (str) – Name of run
- skripts.ML.ML4com.plot_decision_trees(model, feature_names, class_names, outdir, name)[source]
Plot decision trees of model
- Parameters:
model (model-like) – Model
feature_names (array-like) – Feature names
class_names (array-like) – Class names
outdir (path-like) – Output directory
name (str) – Name of run
- skripts.ML.ML4com.plot_metrics_df(metrics_df, organism_metrics_df, overall_metrics_df, algorithm_name, outdir, show=False)[source]
Plot the extracted metrics as a heatmap and ROC AUC curve
- Parameters:
metrics_df (pandas.DataFrame) – Metrics dataframe
organism_metrics_df (pandas.DataFrame) – Metrics dataframe on organism level
overall_metrics_df (pandas.DataFrame) – Metrics dataframe on overall level
algorithm_name (str) – Algorithm name
outdir (path-like) – Output directory
show (bool, optional) – Show plot, defaults to False
- skripts.ML.ML4com.tune_classifier(X, y, classifier, cv, configuration_space: ConfigurationSpace, n_workers: int, n_trials: int, name: str, algorithm_name: str, outdir, verbosity: int = 0)[source]
Perform hyperparameter tuning on an Sklearn classifier.
- Parameters:
X (dataframe-like) – Data values
y (dataframe-like) – True classes for one sample
classifier (Classifier from sklearn) – Classifier
cv (cv-scheme) – Cross-validation scheme
configuration_space (ConfigSpace.ConfigurationSpace) – Configuration Space
n_workers (int) – Number of workers to work in parallel
n_trials (int) – Number of trials
name (str) – Name of run
algorithm_name (str) – Algorithm name
outdir (path-like) – Output directory
verbosity (int, optional) – Level of verbosity, defaults to 0
- Returns:
Incumbent
- Return type:
configs (list, single config)
- skripts.ML.ML4com.tune_train_model_sklearn(X, ys, labels, classifier, configuration_space, n_workers, n_trials, source: str, name, algorithm_name, outdir, fold: KFold | StratifiedKFold = KFold(n_splits=5, random_state=None, shuffle=False), verbosity=0)[source]
Tune and train a model in sklearn.
- Parameters:
X (dataframe-like) – Data values
ys (dataframe-like) – True classes for one sample
labels (dataframe-like) – Labels
classifier (Classifier from sklearn) – Classifier
configuration_space (ConfigSpace.ConfigurationSpace) – Configuration Space
n_workers (int) – Number of workers
n_trials (int) – Number of trials
source (str) – Source of data
name (str) – Name of run
algorithm_name (str) – Algorithm name
outdir (path-like) – Output directory
fold (Union[KFold, StratifiedKFold], optional) – Fold for cross validation during tuning, defaults to KFold()
verbosity (int, optional) – Level of verbosity, defaults to 0