Bias Aware Gridsearch CV (BAGS)
class BiasAwareGridSearchCV(estimator, param_grid, df, outcome_column, protected_attribute, privileged_value, unprivileged_value, favorable_result, cv=5, n_jobs=1, verbose=True)
BiasAwareGridSearchCV extends scikit-learn’s GridSearchCV: in addition to accuracy, it evaluates every parameter combination against a user-supplied bias metric.
Parameters
- estimator (estimator object): The machine learning estimator to be used. This should be compatible with scikit-learn estimators.
- param_grid (dict): A dictionary containing parameter names as keys and lists of parameter settings to try as values.
- df (pd.DataFrame): The DataFrame containing the training data. This must include the target outcome column and the protected attribute.
- outcome_column (str): The name of the column in the DataFrame representing the target outcome, typically encoded as binary values.
- protected_attribute (str): The name of the column in the DataFrame representing the protected attribute, which could be any categorical feature (e.g., ‘gender’, ‘race’).
- privileged_value (str or int): The value in the protected attribute column indicating the privileged group, such as ‘male’ for a ‘gender’ attribute.
- unprivileged_value (str or int): The value in the protected attribute column indicating the unprivileged group, like ‘female’ in the case of a ‘gender’ attribute.
- favorable_result (int): The value in the outcome column that denotes a favorable result, often 1 for positive and 0 for negative outcomes.
- cv (int, default=5): The number of cross-validation folds.
- n_jobs (int, default=1): The number of jobs to run in parallel during the grid search.
- verbose (bool, default=True): Enables verbose output during the grid search if set to True.
Attributes
- results_ (list of dict): The grid search results. Each dict has the following keys:
Key | Value |
---|---|
params (dict) | Parameters used to initialize the model. |
accuracy (float) | Average accuracy of the model across folds. |
bias (float) | Average exhibited bias across folds. |
raw_bias (list) | Exhibited bias for each fold. |
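As a rough illustration of how the results_ structure might be inspected, the sketch below builds a hand-written list in the documented shape and sorts it by accuracy. The `results` list, its numbers, and the `summarize` helper are all hypothetical placeholders, not output from a real run or part of the class.

```python
# Hypothetical stand-in for clf.results_; every number here is a placeholder.
results = [
    {"params": {"max_depth": 10}, "accuracy": 0.75, "bias": 0.10, "raw_bias": [0.05, 0.15]},
    {"params": {"max_depth": 5}, "accuracy": 0.80, "bias": 0.30, "raw_bias": [0.25, 0.35]},
]


def summarize(results):
    """Return (params, accuracy, bias) tuples sorted by descending accuracy."""
    rows = [(r["params"], r["accuracy"], r["bias"]) for r in results]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Print one line per parameter combination, most accurate first.
for params, accuracy, bias in summarize(results):
    print(f"{params}: accuracy={accuracy:.3f}, bias={bias:.3f}")
```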
Examples
>>> import pandas as pd
>>> import seaborn as sns
>>> from bias_aware_gridsearch import BiasAwareGridSearchCV
>>> from util import calculate_disparate_impact
>>> from sklearn.ensemble import RandomForestClassifier
>>> # load the Titanic data
>>> df = sns.load_dataset('titanic')
>>> df = df[['pclass', 'age', 'alone','sex', 'survived']].copy()
>>> # convert the categorical passenger class into a binary column
>>> df['first_class'] = df['pclass'] == 1
>>> df = df[['first_class', 'age', 'alone', 'survived']].dropna()
>>> rfc = RandomForestClassifier()
>>> parameter_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10]}
>>> clf = BiasAwareGridSearchCV(rfc, parameter_grid, df, 'survived', 'first_class', 1, 0, 1)
>>> clf.fit(df.drop(columns=['survived']), df['survived'], calculate_disparate_impact)
Processing parameters: {'max_depth': 5, 'n_estimators': 100}
Processing parameters: {'max_depth': 5, 'n_estimators': 200}
Processing parameters: {'max_depth': 10, 'n_estimators': 100}
Processing parameters: {'max_depth': 10, 'n_estimators': 200}
>>> best_model_acc = clf.select_highest_accuracy_model()
Selected model parameters: {'max_depth': 5, 'n_estimators': 100} with accuracy: 0.7087363340884467, bias: 0.8825532742303706
>>> best_model_bias = clf.select_least_biased_model()
Selected model parameters: {'max_depth': 10, 'n_estimators': 200} with accuracy: 0.6541416330148725, bias: 0.7175751435857917
>>> best_balanced_model = clf.select_balanced_model(threshold=3)
Selected model parameters: {'max_depth': 10, 'n_estimators': 100} with accuracy: 0.6582980399881808, bias: 0.7226774022458278
Methods
Method | Description |
---|---|
fit(X, y, bias_function) | Runs grid search with cross-validation, evaluating models for accuracy and bias. |
select_highest_accuracy_model() | Selects the model with the highest accuracy from the grid search results. |
select_least_biased_model() | Selects the model with the least bias from the grid search results. |
select_balanced_model(threshold) | Selects the model with the least bias among the top models ranked by accuracy. |
find_optimum_model(margin) | Searches for the model with the least bias within a margin of the highest accuracy. |
plot_accuracy(threshold) | Plots a line graph of models’ accuracy and bias. Draws an additional line at the “threshold” best model. |
plot_params(parameter) | Plots a line graph of a parameter against bias, ideal for a continuous parameter. |
fit(X, y, bias_function)
Runs the grid search over every parameter combination, scoring each one with the supplied bias function.
Parameters
- X: array-like of shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
- y: array-like of shape (n_samples, n_output)
Target relative to X for classification or regression.
- bias_function: callable
Function that calculates the bias metric of interest. It must return 0 for a fair outcome.
Returns
- None
Populates the instance with the results derived from the provided parameter grid.
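To make the "0 must represent a fair value" requirement concrete, here is a minimal sketch of a metric with that property: the statistical parity difference, i.e. the favorable-outcome rate of the unprivileged group minus that of the privileged group. The arguments that fit actually passes to bias_function are not specified in this reference, so the signature below (predictions, group labels, and the two group values) is an assumption made purely for illustration.

```python
def statistical_parity_difference(y_pred, groups, privileged_value,
                                  unprivileged_value, favorable_result=1):
    """Favorable rate of the unprivileged group minus that of the privileged group.

    Returns 0.0 when both groups receive the favorable outcome at the same rate.
    NOTE: this signature is hypothetical; adapt it to whatever fit() passes in.
    """
    def favorable_rate(group_value):
        preds = [p for p, g in zip(y_pred, groups) if g == group_value]
        if not preds:
            raise ValueError(f"no samples for group {group_value!r}")
        return sum(p == favorable_result for p in preds) / len(preds)

    return favorable_rate(unprivileged_value) - favorable_rate(privileged_value)


# Toy data: the privileged group (1) gets the favorable outcome 3 times out of 4,
# the unprivileged group (0) only once out of 4.
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
groups = [1, 1, 1, 1, 0, 0, 0, 0]
spd = statistical_parity_difference(y_pred, groups,
                                    privileged_value=1, unprivileged_value=0)
print(spd)
```

A negative value here means the unprivileged group is favored less often; a ratio-style metric such as disparate impact is centered on 1 instead, so it would need to be shifted or transformed to satisfy the zero-is-fair rule.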
select_highest_accuracy_model()
Selects and retrains the model with the highest accuracy based on the results of the grid search.
Returns
- best_model: estimator instance
The retrained model instance with the highest accuracy from the grid search results.
select_least_biased_model()
Selects and retrains the model with the least bias based on the results of the grid search.
Returns
- best_model: estimator instance
The retrained model instance with the least bias from the grid search results.
select_balanced_model(threshold)
Selects and retrains the model with the least bias among the top models with the highest accuracy.
Parameters
- threshold: int
The number of top models to consider based on accuracy.
Returns
- best_model: estimator instance
The retrained model with the least bias among the top models based on accuracy.
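The selection rule described for select_balanced_model can be sketched on a plain list of result dicts: keep the `threshold` most accurate entries, then take the one whose bias is closest to 0. This is not the library's implementation. The `pick_balanced` helper and the sample numbers are hypothetical, "least bias" is assumed to mean smallest absolute bias, and the retraining step is omitted.

```python
def pick_balanced(results, threshold):
    """Among the `threshold` most accurate results, return the least biased one.

    Assumes "least biased" means the bias value closest to 0.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    top = sorted(results, key=lambda r: r["accuracy"], reverse=True)[:threshold]
    return min(top, key=lambda r: abs(r["bias"]))


# Placeholder results in the documented results_ shape (raw_bias omitted).
results = [
    {"params": {"max_depth": 5}, "accuracy": 0.80, "bias": 0.30},
    {"params": {"max_depth": 10}, "accuracy": 0.78, "bias": 0.12},
    {"params": {"max_depth": 15}, "accuracy": 0.75, "bias": 0.20},
    {"params": {"max_depth": 20}, "accuracy": 0.60, "bias": 0.01},
]

chosen = pick_balanced(results, threshold=3)
print(chosen["params"])
```

With threshold=1 the rule reduces to picking the most accurate model; with a threshold as large as the grid it reduces to picking the least biased one.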
find_optimum_model(margin)
Searches for and retrains the model with the least bias within a specified margin of the highest accuracy.
Parameters
- margin: float
The tolerance in accuracy discrepancy to consider when selecting the optimum model.
Returns
- best_model: estimator instance
The retrained model that exhibits the least bias within the specified margin of the highest accuracy.
Raises
- ValueError: If no models are found within the specified accuracy margin.
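A margin-based selection of this kind might look like the sketch below, which works on a plain list of result dicts. It assumes the margin is an absolute difference in accuracy (not a relative one) and that "least bias" means the bias closest to 0. The `pick_within_margin` helper and the sample numbers are hypothetical, and retraining is left out.

```python
def pick_within_margin(results, margin):
    """Return the least biased result whose accuracy is within `margin` of the best.

    Assumes an absolute accuracy margin and that bias closest to 0 is least biased.
    """
    if not results:
        raise ValueError("no results to select from")
    best_accuracy = max(r["accuracy"] for r in results)
    candidates = [r for r in results if best_accuracy - r["accuracy"] <= margin]
    if not candidates:
        raise ValueError("no models found within the specified accuracy margin")
    return min(candidates, key=lambda r: abs(r["bias"]))


# Placeholder results in the documented results_ shape (raw_bias omitted).
results = [
    {"params": {"max_depth": 5}, "accuracy": 0.80, "bias": 0.30},
    {"params": {"max_depth": 10}, "accuracy": 0.78, "bias": 0.12},
    {"params": {"max_depth": 15}, "accuracy": 0.75, "bias": 0.20},
    {"params": {"max_depth": 20}, "accuracy": 0.60, "bias": 0.01},
]

chosen = pick_within_margin(results, margin=0.03)
print(chosen["params"])
```

Under these assumptions the most accurate model always qualifies for any non-negative margin, so the error path in this sketch is reached only for an empty result list or a negative margin.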
plot_accuracy(threshold)
Plots a line graph of models’ accuracy and bias. The X-axis represents accuracy, and the Y-axis represents bias. A line is drawn on the plot to indicate the accuracy threshold.
Parameters
- threshold: int
The number of top models to consider based on accuracy. This value is used to draw a line on the plot.
Returns
- ax: matplotlib.axes.Axes instance
The plot object showing the relationship between accuracy and bias.
plot_params(parameter)
Plots a line graph showing the relationship between a specified parameter and bias. The X-axis represents the parameter value, and the Y-axis represents bias.
Parameters
- parameter: str
The name of a parameter from the initial param_grid.
Returns
- plot: matplotlib.axes.Axes instance
The plot object showing the relationship between the specified parameter and bias.