Bias Aware Gridsearch CV (BAGS)

class BiasAwareGridSearchCV(estimator, param_grid, df, outcome_column, protected_attribute, privileged_value, unprivileged_value, favorable_result, cv=5, n_jobs=1, verbose=True)

Bias Aware GridsearchCV is an extension of SciKitLearn’s GridsearchCV, with additional consideration for a provided bias metric.

Parameters

  • estimator (estimator object): The machine learning estimator to be used. This should be compatible with scikit-learn estimators.

  • param_grid (dict): A dictionary containing parameter names as keys and lists of parameter settings to try as values.

  • df (pd.DataFrame): The DataFrame containing the training data. This must include the target outcome column and the protected attribute.

  • outcome_column (str): The name of the column in the DataFrame representing the target outcome, typically encoded as binary values.

  • protected_attribute (str): The name of the column in the DataFrame representing the protected attribute, which could be any categorical feature (e.g., ‘gender’, ‘race’).

  • privileged_value (str or int): The value in the protected attribute column indicating the privileged group, such as ‘male’ for a ‘gender’ attribute.

  • unprivileged_value (str or int): The value in the protected attribute column indicating the unprivileged group, like ‘female’ in the case of a ‘gender’ attribute.

  • favorable_result (int): The value in the outcome column that denotes a favorable result, often 1 for positive and 0 for negative outcomes.

  • cv (int, default=5): The number of cross-validation folds.

  • n_jobs (int, default=1): The number of jobs to run in parallel during the grid search.

  • verbose (bool, default=True): Enables verbose output during the grid search if set to True.

Attributes

  • results_: list of dict of values with the structure:
Key Value
params (dict) Parameters used to initialize the model.
accuracy (float) Average accuracy of the model across folds.
bias (float) Average exhibited bias across folds.
raw_bias (list) Exhibited bias for each fold.

Examples

>>> import pandas as pd
>>> import seaborn as sns
>>> from bias_aware_gridsearch import BiasAwareGridSearchCV
>>> from util import calculate_disparate_impact
>>> from sklearn.ensemble import RandomForestClassifier
# load in titanic data
>>> df = sns.load_dataset('titanic')
>>> df = df[['pclass', 'age', 'alone','sex', 'survived']]
# transform categorical to binary
>>> df['first_class'] = df['pclass'] == 1
>>> df = df[['first_class', 'age', 'alone', 'survived']].dropna()
>>> rfc = RandomForestClassifier()
>>> parameter_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10]}
>>> clf = BiasAwareGridSearchCV(rfc, parameter_grid, df, 'survived', 'first_class', 1, 0, 1)
>>> clf.fit(df.drop(columns=['survived']), df['survived'], calculate_disparate_impact)
Processing parameters: {'max_depth': 5, 'n_estimators': 100}
Processing parameters: {'max_depth': 5, 'n_estimators': 200}
Processing parameters: {'max_depth': 10, 'n_estimators': 100}
Processing parameters: {'max_depth': 10, 'n_estimators': 200}
>>> best_model_acc = clf.select_highest_accuracy_model()
Selected model parameters: {'max_depth': 5, 'n_estimators': 100} with accuracy: 0.7087363340884467, bias: 0.8825532742303706
>>> best_model_bias = clf.select_least_biased_model()
Selected model parameters: {'max_depth': 10, 'n_estimators': 200} with accuracy: 0.6541416330148725, bias: 0.7175751435857917
>>> best_balanced_model = clf.select_balanced_model(threshold=3)
Selected model parameters: {'max_depth': 10, 'n_estimators': 100} with accuracy: 0.6582980399881808, bias: 0.7226774022458278

Methods

Method Description
fit(X, y, bias_function) Runs grid search with cross-validation, evaluating models for accuracy and bias.
select_highest_accuracy_model() Selects the model with the highest accuracy from the grid search results.
select_least_biased_model() Selects the model with the least bias from the grid search results.
select_balanced_model() Selects the model with the least bias among top models based on accuracy.
find_optimum_model() Searches for the model with least bias within a margin of highest accuracy.
plot_accuracy(threshold) Plots a line graph of models’ accuracy and bias. Draws an additional line at the “threshold” best model
plot_params(parameter) Plots a line graph of a parameter against bias, ideal for a continuous parameter.

fit(X,y,bias_function)

Run fit with all sets of parameters alongside a bias function.

Parameters

  • X: array-like of shape (n_samples, n_features) Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y: array-like of shape (n_samples, n_output) Target relative to X for classification or regression.

  • bias_function: callable -> bias calculator

Function to calculate a bias metric of interest. Criteria for the function are that 0 must represent a fair value.

Returns

  • None - populates the instance with the results derived from the provided parameter grid.

select_highest_accuracy_model()

Selects and retrains the model with the highest accuracy based on the results of the grid search.

Returns

  • best_model: estimator instance
    The retrained model instance with the highest accuracy from the grid search results.

select_least_biased_model()

Selects and retrains the model with the least bias based on the results of the grid search.

Returns

  • best_model: estimator instance
    The retrained model instance with the least bias from the grid search results.

select_balanced_model(threshold)

Selects and retrains the model with the least bias among the top models with the highest accuracy.

Parameters

  • threshold: int
    The number of top models to consider based on accuracy.

Returns

  • best_model: estimator instance
    The retrained model with the least bias among the top models based on accuracy.

find_optimum_model(margin)

Searches for and retrains the model with the least bias within a specified margin of the highest accuracy.

Parameters

  • margin: float
    The tolerance in accuracy discrepancy to consider when selecting the optimum model.

Returns

  • best_model: estimator instance
    The retrained model that exhibits the least bias within the specified margin of the highest accuracy.

Raises

  • ValueError:
    If no models are found within the specified accuracy margin.

plot_accuracy(threshold)

Plots a line graph of models’ accuracy and bias. The X-axis represents accuracy, and the Y-axis represents bias. A line is drawn on the plot to indicate the accuracy threshold.

Parameters

  • threshold: int
    The number of top models to consider based on accuracy. This value is used to draw a line on the plot.

Returns

  • ax: matplotlib.axes.Axes instance
    The plot object showing the relationship between accuracy and bias.

plot_params(parameter)

Plots a line graph showing the relationship between a specified parameter and bias. The X-axis represents the parameter value, and the Y-axis represents bias.

Parameters

  • parameter: str
    The name of a parameter from the initial param_grid.

Returns

  • plot: matplotlib.axes.Axes instance
    The plot object showing the relationship between the specified parameter and bias.