Code description

The reval module has one superclass, RelativeValidation, and a subclass, FindBestClustCV. The SCParamSelection and ParamSelection classes were added in later releases to perform hyperparameter selection.

Classes

class reval.relative_validation.RelativeValidation(s, c, nrand=10)[source]

This class performs the relative clustering validation procedure. A supervised algorithm is required to test cluster stability. The labels output by the clustering algorithm are used as true labels.

Parameters:
  • s (class) – initialized class for the supervised method.
  • c (class) – initialized class for clustering algorithm.
  • nrand (int) – number of iterations to normalize cluster stability.
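
A minimal initialization sketch; the scikit-learn estimators and parameter values below are illustrative assumptions, not requirements:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.cluster import KMeans
    from reval.relative_validation import RelativeValidation

    # Any initialized classifier/clustering pair can be passed; KNeighborsClassifier
    # and KMeans are used here only as examples.
    relval = RelativeValidation(s=KNeighborsClassifier(n_neighbors=5),
                                c=KMeans(n_clusters=2, random_state=42),
                                nrand=10)
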
_rescale_score_(xtr, xts, randlabtr, labts)[source]

Private method that computes the misclassification error obtained when test labels are predicted with a classification model fitted on the training set with random labels.

Parameters:
  • xtr (ndarray, (n_samples, n_features)) – training dataset.
  • xts (ndarray, (n_samples, n_features)) – test dataset.
  • randlabtr (ndarray, (n_samples,)) – random labels.
  • labts (ndarray, (n_samples,)) – test set labels.
Returns:

misclassification error.

Return type:

float

rndlabels_traineval(train_data, test_data, train_labels, test_labels)[source]

Method that performs random labeling of the training set (N times, according to the reval.relative_validation.RelativeValidation.nrand instance attribute) and evaluates the fitted models on the test set.

Parameters:
  • train_data (ndarray, (n_samples, n_features)) – training dataset.
  • test_data (ndarray, (n_samples, n_features)) – test dataset.
  • train_labels (ndarray, (n_samples,)) – training set clustering labels.
  • test_labels (ndarray, (n_samples,)) – test set clustering labels.
Returns:

averaged misclassification error on the test set.

Return type:

float

test(test_data, fit_model)[source]

Method that compares test set clustering labels (i.e., A(X’), computed by reval.relative_validation.RelativeValidation.clust_method) against the (permuted) labels obtained through the classification algorithm fitted to the training set (i.e., f(X’), computed by reval.relative_validation.RelativeValidation.class_method). It returns the misclassification error, together with both clustering and classification labels.

Parameters:
  • test_data (ndarray, (n_samples, n_features)) – test dataset.
  • fit_model (class) – fitted supervised model.
Returns:

misclassification error, clustering and classification labels.

Return type:

float, dictionary of ndarrays (n_samples,)

train(train_data, tr_lab=None)[source]

Method that performs training. It compares the clustering labels on the training set (i.e., A(X), computed by reval.relative_validation.RelativeValidation.clust_method) against the labels obtained from the classification algorithm (i.e., f(X), computed by reval.relative_validation.RelativeValidation.class_method). It returns the misclassification error, the supervised model fitted to the data, and both clustering and classification labels.

Parameters:
  • train_data (ndarray, (n_samples, n_features)) – training dataset.
  • tr_lab (list) – cluster labels found during CV for clustering methods with no n_clusters parameter. If not None, the clustering step is not performed and the provided labels are used instead. Default None.
Returns:

misclassification error, fitted supervised model object, clustering and classification labels.

Return type:

float, object, ndarray (n_samples,)
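
A hedged sketch of the train/test cycle, assuming the relval instance initialized above; the synthetic data and the unpacking of the label outputs are assumptions based on the return descriptions and may need adapting:

    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split

    # Synthetic data split into training and test sets (illustrative only).
    X, _ = make_blobs(n_samples=300, centers=2, random_state=42)
    X_tr, X_ts = train_test_split(X, test_size=0.3, random_state=42)

    # Training: cluster X_tr, fit the classifier on the cluster labels, and
    # return the training misclassification error, the fitted model, and labels.
    tr_err, fitted_model, tr_labels = relval.train(X_tr)

    # Test: cluster X_ts and compare its labels with the predictions of the
    # classifier fitted on the training set.
    ts_err, ts_labels = relval.test(X_ts, fitted_model)

    # Random-labeling baseline (averaged over nrand iterations), used to
    # normalize the misclassification error; assumes the label outputs above
    # are the clustering label arrays (adapt if they are returned as dictionaries).
    rand_err = relval.rndlabels_traineval(X_tr, X_ts, tr_labels, ts_labels)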

class reval.best_nclust_cv.FindBestClustCV(s, c, nrand, nfold=2, n_jobs=1, nclust_range=None)[source]

Child class of reval.relative_validation.RelativeValidation. It performs (repeated) k-fold cross-validation on the training set to select the best number of clusters, i.e., the number that minimizes normalized stability (the average misclassification error divided by the asymptotic misclassification rate).

Parameters:
  • s (class) – initialized class for the supervised method.
  • c (class) – initialized class for clustering algorithm.
  • nrand (int) – number of iterations to normalize cluster stability.
  • nfold (int) – number of cross-validation folds, default 2.
  • n_jobs (int) – number of jobs to run in parallel, default 1.
  • nclust_range (list) – list with number of clusters to investigate, default None.
Attribute:

cv_results_ dataframe with cross-validation results. Columns: ncl (number of clusters), ms_tr (training misclassification), ms_val (validation misclassification).
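
A minimal initialization sketch; the estimators and the cluster range are illustrative assumptions:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.cluster import KMeans
    from reval.best_nclust_cv import FindBestClustCV

    # The number of clusters is left to the CV search over nclust_range.
    findbest = FindBestClustCV(s=KNeighborsClassifier(n_neighbors=5),
                               c=KMeans(random_state=42),
                               nrand=10,
                               nfold=5,
                               n_jobs=1,
                               nclust_range=list(range(2, 7)))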

static _fit(data_obj, idxs, ncl=None)[source]

Function that calls training, test, and random labeling.

Parameters:
  • data_obj (tuple) – dataset and reval.RelativeValidation class.
  • idxs (tuple) – lists of training and validation indices.
  • ncl (int) – number of clusters, default None
Returns:

number of clusters and misclassification errors for training and validation.

Return type:

tuple (int, float, float)

best_nclust(data, iter_cv=1, strat_vect=None)[source]

This method takes as input the training dataset and the stratification vector (if available) and performs a (repeated) CV procedure to select the best number of clusters that minimizes normalized stability.

Parameters:
  • data (ndarray, (n_samples, n_features)) – training dataset.
  • iter_cv (int) – number of iterations of repeated CV, default 1.
  • strat_vect (ndarray, (n_samples,)) – vector for stratification, defaults to None.
Returns:

CV metrics for training and validation sets, best number of clusters, misclassification errors at each CV iteration.

Return type:

dictionary, int, and a list if the n_clusters parameter is not available

evaluate(data_tr, data_ts, nclust=None, tr_lab=None)[source]

Method that applies the selected clustering algorithm with the best number of clusters to the test set. It returns clustering labels.

Parameters:
  • data_tr (ndarray, (n_samples, n_features)) – training dataset.
  • data_ts (ndarray, (n_samples, n_features)) – test dataset.
  • nclust (int) – best number of clusters, default None.
  • tr_lab (array-like) – clustering labels for the training set. If not None, the clustering algorithm is not run and the classifier is fitted directly on these labels. Available for clustering methods without the n_clusters parameter. Default None.
Returns:

labels and accuracy for both training and test sets.

Return type:

namedtuple (train_cllab: array, train_acc: float, test_cllab: array, test_acc: float)
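
Putting best_nclust and evaluate together; a sketch that assumes the findbest instance initialized above and pre-split arrays X_tr and X_ts (e.g., generated as in the RelativeValidation example), with the two-element return of best_nclust taken from the description above (a third element is returned only for clustering methods without n_clusters):

    # Select the number of clusters that minimizes normalized stability
    # via repeated (optionally stratified) cross-validation.
    metrics, n_best = findbest.best_nclust(X_tr, iter_cv=10, strat_vect=None)

    # Refit on the training set and evaluate on the held-out test set
    # with the selected number of clusters.
    out = findbest.evaluate(X_tr, X_ts, nclust=n_best)
    print(out.train_acc, out.test_acc)  # namedtuple fields documented above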

class reval.param_selection.SCParamSelection(sc_params, cv, nrand, n_jobs, iter_cv=1, clust_range=None, strat=None)[source]

Class that implements grid search cross-validation in parallel to select the best combination of classifier/clustering methods.

Parameters:
  • sc_params (dict) – dictionary of the form {‘s’: list, ‘c’: list} including the lists of classifiers and clustering methods to fit to the data.
  • cv (int) – cross-validation folds.
  • nrand (int) – number of random label iterations.
  • n_jobs (int) – number of jobs to run in parallel, default (number of CPUs - 1).
  • iter_cv (int) – number of repeated cv, default 1.
  • clust_range (list) – list with number of clusters to investigate, default None.
  • strat (numpy array) – stratification vector for cross-validation splits, default None.
Attribute:

cv_results_ cross-validation results that can be directly transformed to a dataframe. Key names: ‘s’, ‘c’, ‘best_nclust’, ‘mean_train_score’, ‘sd_train_score’, ‘mean_val_score’, ‘sd_val_score’, ‘validation_meanerror’. Dictionary of lists.

Attribute:

best_param_ best solution(s) selected (minimum validation error). List.

Attribute:

best_index_ index/indices of the best solution(s). Values correspond to the rows of the cv_results_ table. List.

_run_gridsearchcv(data, sc)[source]

Private method that runs reval.best_nclust_cv.FindBestClustCV with different classifier/clustering initializations.

Parameters:
  • data (numpy array) – input dataset.
  • sc (dict) – classifier/clustering pair of the form {‘s’: classifier, ‘c’: clustering}.
Returns:

performance list.

Return type:

list

fit(data_tr, nclass=None)[source]

Class method that performs grid search cross-validation on the training data. If the number of true classes is known, the method returns both the best result obtained with the correct number of clusters (and minimum stability), if available, and the overall best result (overall minimum stability). The output reports None if the clustering algorithm does not find any clusters (e.g., HDBSCAN labels all points as -1).

Parameters:
  • data_tr (numpy array) – training dataset.
  • nclass (int) – number of true classes, default None.
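
A hedged usage sketch for SCParamSelection; the candidate estimators, the cluster range, and the training array X_tr are illustrative assumptions:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.cluster import KMeans, AgglomerativeClustering
    from reval.param_selection import SCParamSelection

    # Candidate classifier/clustering combinations to compare.
    sc_params = {'s': [KNeighborsClassifier(), SVC()],
                 'c': [KMeans(), AgglomerativeClustering()]}
    sc_select = SCParamSelection(sc_params=sc_params,
                                 cv=2, nrand=10, n_jobs=1, iter_cv=1,
                                 clust_range=list(range(2, 6)))
    sc_select.fit(X_tr, nclass=2)   # pass nclass only if the true number of classes is known
    best_combinations = sc_select.best_param_   # list of best solutions
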
class reval.param_selection.ParamSelection(params, cv, s, c, nrand, n_jobs, iter_cv=1, strat=None, clust_range=None)[source]

Class that implements grid search cross-validation in parallel to select the best combinations of parameters for fixed classifier/clustering algorithms.

Parameters:
  • params (dict) – dictionary of dictionaries of the form {‘s’: {classifier parameter grid}, ‘c’: {clustering parameter grid}}. If one of the two parameter dictionaries is not available, initialize the key but leave the dictionary empty.
  • cv (int) – cross-validation folds.
  • s (class) – initialized class for the supervised method.
  • c (class) – initialized class for clustering algorithm.
  • nrand (int) – number of random label iterations.
  • clust_range (list) – list with number of clusters to investigate, default None.
  • n_jobs (int) – number of jobs to run in parallel, default (number of CPUs - 1).
  • iter_cv (int) – number of repeated cv loops, default 1.
  • strat (numpy array) – stratification vector for cross-validation splits, default None.
Attribute:

cv_results_ cross-validation results that can be directly transformed to a dataframe. Key names: classifier parameters, clustering parameters, ‘best_nclust’, ‘mean_train_score’, ‘sd_train_score’, ‘mean_val_score’, ‘sd_val_score’, ‘validation_meanerror’. Dictionary of lists.

Attribute:

best_param_ best solution(s) selected (minimum validation error). List.

Attribute:

best_index_ index/indices of the best solution(s). Values correspond to the rows of the cv_results_ table. List.

_allowed_par(par_dict)[source]

Private method that checks whether a parameter combination is allowed for hierarchical clustering.

Parameters:
  • par_dict (dict) – clustering parameter grid.
Returns:

whether the parameter combination is allowed.

Return type:

bool

_run_gridsearchcv(data, param_s, param_c)[source]

Private method that initializes the classifier/clustering algorithms with different parameter combinations and runs reval.best_nclust_cv.FindBestClustCV.

Parameters:
  • data (numpy array) – training dataset.
  • param_s (dict) – dictionary of classifier parameters.
  • param_c (dict) – dictionary of clustering parameters.

Returns:

performance list.

Return type:

list

fit(data_tr, nclass=None)[source]

Class method that performs grid search cross-validation on the training data. It handles errors due to disallowed parameter combinations (e.g., ward linkage without Euclidean affinity). If the true number of classes is known, the method selects both the best parameter combination that yields the true number of clusters (with minimum stability) and the best parameter combination that minimizes the overall stability.

Parameters:
  • data_tr (numpy array) – training dataset.
  • nclass (int) – number of true classes, default None.
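
A hedged usage sketch for ParamSelection with fixed KNN/hierarchical-clustering estimators; the parameter grids and the training array X_tr are illustrative assumptions:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.cluster import AgglomerativeClustering
    from reval.param_selection import ParamSelection

    # Illustrative parameter grids; disallowed combinations (e.g., ward linkage
    # without euclidean affinity) are handled by fit().
    params = {'s': {'n_neighbors': [5, 10, 15]},
              'c': {'affinity': ['euclidean', 'manhattan'],
                    'linkage': ['ward', 'complete', 'average']}}
    par_select = ParamSelection(params=params,
                                cv=2,
                                s=KNeighborsClassifier(),
                                c=AgglomerativeClustering(),
                                nrand=10, n_jobs=1, iter_cv=1,
                                clust_range=list(range(2, 6)))
    par_select.fit(X_tr, nclass=2)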

Functions

Useful functions that can be used on their own are also available. In particular, reval.utils.kuhn_munkres_algorithm is an implementation of the Kuhn-Munkres algorithm (Kuhn, 1955; Munkres, 1957), which finds the permutation of the predicted labels that minimizes the misclassification error with respect to the true labels. reval.utils.compute_metrics takes clustering and classification labels as input and returns classification metrics, such as the F1 score, accuracy, and Matthews correlation coefficient, to assess generalization.

Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2), 83-97.

Munkres, J. (1957). Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1), 32-38.

reval.utils.kuhn_munkres_algorithm(true_lab, pred_lab)[source]

Function that implements the Kuhn-Munkres algorithm. It selects the permutation of the predicted labels that minimizes the misclassification error with respect to the true labels. To allow the investigation of the replicability of findings between training and test sets, in the context of reval we permute the clustering labels to match the classification labels, so that the label organization based on the training dataset is retained. Otherwise, the correspondence between training and test set labels would be lost.

Parameters:
  • true_lab (ndarray, (n_samples,)) – classification algorithm labels (for reval).
  • pred_lab (ndarray, (n_samples,)) – clustering algorithm labels (for reval).
Returns:

permuted labels that minimize the misclassification error.

Return type:

ndarray, (n_samples,)
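
A small sketch of the permutation step; the toy label arrays are made up for illustration:

    import numpy as np
    from reval.utils import kuhn_munkres_algorithm

    class_labels = np.array([0, 0, 1, 1, 2, 2])   # classifier output (reference)
    clust_labels = np.array([1, 1, 2, 2, 0, 0])   # clustering output to permute
    perm_labels = kuhn_munkres_algorithm(class_labels, clust_labels)
    # perm_labels now matches class_labels up to genuinely misclassified samples,
    # i.e., [0, 0, 1, 1, 2, 2] in this toy case.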

reval.utils.compute_metrics(class_labels, clust_labels, perm=False)[source]

Function that computes useful classification metrics. If needed, the clustering labels are permuted with reval.utils.kuhn_munkres_algorithm. The function returns a dictionary with keys ACC, MCC, F1, precision, and recall, corresponding to accuracy, Matthews correlation coefficient, F1 score, precision, and recall, respectively.

Parameters:
  • class_labels (array-like) – labels returned by the classifier.
  • clust_labels (array-like) – labels returned by the clustering.
  • perm (bool) – flag to enable permutation of clustering labels, default False.
Returns:

dictionary of scores.

Return type:

dict
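
A usage sketch with toy labels; with perm=True the clustering labels are permuted internally before scoring. The dictionary keys follow the description above:

    import numpy as np
    from reval.utils import compute_metrics

    class_labels = np.array([0, 0, 1, 1, 2, 2])
    clust_labels = np.array([1, 1, 2, 2, 0, 0])
    scores = compute_metrics(class_labels, clust_labels, perm=True)
    print(scores['ACC'], scores['MCC'], scores['F1'])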

The reval.best_nclust_cv._confint function computes a 95% confidence interval using the scipy.stats.t.ppf() function.

reval.best_nclust_cv._confint(vect)[source]

Private function to compute confidence interval.

Parameters:
  • vect (array-like) – performance scores.
Returns:

mean and error.

Return type:

tuple
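
For reference, a 95% t-based confidence interval of this kind can be sketched with scipy.stats.t.ppf; this is an illustrative re-implementation, not the library's exact code:

    import numpy as np
    from scipy import stats

    def confint_sketch(vect):
        """Return mean and half-width of a 95% t-based confidence interval."""
        vect = np.asarray(vect, dtype=float)
        # Standard error times the two-sided 97.5% t quantile (n - 1 dof).
        error = stats.t.ppf(0.975, len(vect) - 1) * stats.sem(vect)
        return vect.mean(), error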

The module reval.internal_baselines includes the functions select_best and evaluate_best, which allow comparisons between the reval method and internal validation measures.

reval.internal_baselines.select_best(data, c, int_measure, select='max', nclust_range=None)[source]

Selects the number of clusters that minimizes/maximizes the selected internal measure.

Parameters:
  • data (array-like) – dataset.
  • c (obj) – clustering algorithm class.
  • int_measure (obj) – internal measure function.
  • select (str) – either ‘min’, if the internal measure is to be minimized, or ‘max’ if it should be maximized.
  • nclust_range (list) – Range of clusters to consider, default None.
Returns:

internal score and best number of clusters.

Return type:

float, int

reval.internal_baselines.evaluate_best(data, c, int_measure, ncl=None)[source]

Function that, given a number of clusters, returns the corresponding internal measure for a dataset.

Parameters:
  • data (array-like) – dataset.
  • c (obj) – clustering algorithm class.
  • int_measure (obj) – internal measure function.
  • ncl (int) – number of clusters.
Returns:

internal score.

Return type:

float
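
A usage sketch comparing reval's selection with an internal index; silhouette_score and KMeans are illustrative choices, X_tr is a training array, and n_best stands for the number of clusters previously selected by FindBestClustCV:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from reval.internal_baselines import select_best, evaluate_best

    # Number of clusters that maximizes the silhouette score.
    sil_score, sil_ncl = select_best(X_tr, KMeans(), silhouette_score,
                                     select='max', nclust_range=list(range(2, 7)))

    # Silhouette score obtained with the number of clusters chosen by reval.
    sil_at_reval = evaluate_best(X_tr, KMeans(), silhouette_score, ncl=n_best)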

Visualization

reval.visualization enables plotting the cross-validation performance.

reval.visualization.plot_metrics(cv_score, figsize=(8, 5), linewidth=1, color=('black', 'black'), legend_loc=2, fontsize=12, title='', prob_lines=False, save_fig=None)[source]

Function that plots the average performance (i.e., normalized stability) over cross-validation for training and validation sets. Horizontal lines represent the random performance error for the corresponding number of clusters.

Parameters:
  • cv_score (dictionary) – collection of CV scores as output by reval.best_nclust_cv.FindBestClustCV.best_nclust.
  • figsize (tuple) – (width, height), default (8, 5).
  • linewidth (int) – width of the lines to draw, default 1.
  • color (tuple) – line colors for train and validation sets, default (‘black’, ‘black’).
  • legend_loc (int) – legend location, default 2.
  • fontsize (int) – size of fonts, default 12.
  • title (str) – figure title, default “”.
  • prob_lines (bool) – plot the normalized stability of random labeling as thresholds, default False.
  • save_fig (str) – file name for saving figure in png format, default None.
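
A plotting sketch that assumes the metrics dictionary returned by FindBestClustCV.best_nclust (see the example above); the title and file name are arbitrary:

    from reval.visualization import plot_metrics

    plot_metrics(metrics,
                 title='Normalized stability across cross-validation',
                 prob_lines=True,
                 save_fig='reval_performance.png')   # or save_fig=None to display the figure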