+FROM python:3.10
+WORKDIR /app
+COPY requirements.txt ./
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+EXPOSE 7860
+CMD ["python", "app.py"]

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2024 Adam Dubowski
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,11 +1,81 @@
 ---
-title: Subgroup Harm Assessor
-emoji: 💻
-colorFrom: purple
-colorTo: green
 sdk: docker
 pinned: false
 license: mit
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Subgroup Harm Assessor tool
+emoji: 🐢
+colorFrom: blue
+colorTo: yellow
 sdk: docker
 pinned: false
 license: mit
 ---
+# Subgroup Harm Assessor tool
+## Description
+This project aims to assess the fairness of machine learning models in terms of subgroup harms and predictive bias. The developed tool provides guidance and metrics to evaluate the performance of models across different impacted subgroups and identify potential biases based on model loss explanations. Notably, due to SHAP limitations, we only support tree-based models (e.g. XGBoost, LightGBM, CatBoost, etc.).
+## Requirements
+To run the project, you need to have Python 3.7 or higher installed (3.8 recommended). You can install the required packages using the following command:
+```bash
+pip install -r requirements.txt
+```
+## Run
+To run the project, you can use the following command:
+```bash
+python app.py
+```
+To see all available configuration options, you can use the CLI:
+```bash
+python app.py --help
+```
+## Usage
+The project provides a web interface to interact with the model. You can access it by opening the browser and navigating to `http://127.0.0.1:8050/` (or the address specified in the console). Step 0 is optional as the thesis argues for the average log loss difference to be used as a default quality metric for the subgroup harm assessment of risk scoring systems.
+![Tool screenshot](images/screenshot.png)
+With the 6-step process, we implement the Subgroup Harm and Predictive Bias assessment framework discussed in the thesis project. The steps are as follows:
+1. **Select Subgroup**: Choose the subgroup to analyze. Each subgroup is defined by one or more discriminators based on specific feature values or ranges, and the subgroups are ordered by the magnitude of the selected quality metric.
+2. **Analyzing profile of underperformance**: We analyze the discriminatory and calibration abilities of the model. The subgroup is compared to the overall population and we can analyze the distribution of predicted probabilities.
+3. **Analyzing performance and calibration**: The next step is to analyze the performance and calibration of the model in detail, to see whether we can identify any potential predictive bias or whether we should consider calibration per group (for the feature of the subgroup selected).
+4. **Model Loss Feature importance**: We can analyze the feature importance of the model to see if certain features are less informative for the model.
+5. **Feature value model loss contributions**: We can analyze the feature value contributions to the model loss to see if certain feature values are less informative for the model.
+6. **Class imbalances**: We can analyze the class imbalances in the subgroup to see if the model is biased towards the majority class (indicating representation or aggregation bias).
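
The default quality metric mentioned above, the average log loss difference, can be illustrated with a short sketch. It assumes the metric is the mean log loss over the subgroup minus the mean log loss over the full population; the tool's actual definition lives in `metrics.py` and may differ (for example, by comparing against the complement of the subgroup). The function names are illustrative and not part of the project, and plain Python is used only to keep the example dependency-free.

```python
import math


def avg_log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def avg_log_loss_diff(y_true, y_prob, in_subgroup):
    """Subgroup mean log loss minus full-population mean log loss (assumed definition)."""
    sg_true = [y for y, m in zip(y_true, in_subgroup) if m]
    sg_prob = [p for p, m in zip(y_prob, in_subgroup) if m]
    return avg_log_loss(sg_true, sg_prob) - avg_log_loss(y_true, y_prob)


# Toy data: the model is less certain about the last two rows (the subgroup).
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.4, 0.6]
in_subgroup = [False, False, True, True]
diff = avg_log_loss_diff(y_true, y_prob, in_subgroup)
print(f"{diff:.3f}")
```

Under this assumed definition, a positive value means the model incurs a higher loss on the subgroup than on the population as a whole.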
+### Experiments
+To show the utility of the tool for the goal of identifying features that are less informative, we can run experiments on a random subgroup which follows the distribution of the total population. We can load an overview with the random subgroup with the following command:
+```bash
+python app.py --random_subgroup
+```
+or simply:
+```bash
+python app.py -r
+```
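
For context, `app.py` builds the random subgroup as a boolean mask in which each test row is included with probability 0.5. A dependency-free sketch of the same idea follows; the helper name and the seed are assumptions made for illustration.

```python
import random


def random_subgroup_mask(n_rows, p=0.5, seed=0):
    """Return a boolean mask that puts each row in the subgroup with probability p."""
    rng = random.Random(seed)
    return [rng.random() < p for _ in range(n_rows)]


mask = random_subgroup_mask(1000)
share = sum(mask) / len(mask)
print(f"subgroup share: {share:.2f}")
```

Because membership is drawn independently of the features, such a subgroup should mirror the total population unless bias is injected.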
+Additionally, if we want to add artificial bias, we can use the following command:
+```bash
+python app.py --bias [bias_type]
+```
+where `[bias_type]` can be one of the following:
+```
+swap: Swap the feature values of a certain categorical feature of the subgroup
+random: Add random noise to the feature values of a certain continuous feature of the subgroup
+mean: Swap the continuous feature values of the subgroup with the mean of the total population
+```
+Other bias types can be found in the `load.py` file.
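
To illustrate what two of these bias types do to the data, the sketch below applies the `mean` and `random` perturbations to a continuous column for the rows in a subgroup. It is not the code in `load.py`: the function names, the noise scale, and the list-based data layout are assumptions chosen to keep the example self-contained.

```python
import random
import statistics


def apply_mean_bias(values, in_subgroup):
    """Replace subgroup values with the mean of the whole population."""
    population_mean = statistics.fmean(values)
    return [population_mean if m else v for v, m in zip(values, in_subgroup)]


def apply_random_bias(values, in_subgroup, scale=1.0, seed=0):
    """Add Gaussian noise (std = scale * population std) to subgroup values."""
    rng = random.Random(seed)
    sigma = scale * statistics.pstdev(values)
    return [v + rng.gauss(0.0, sigma) if m else v for v, m in zip(values, in_subgroup)]


age = [20.0, 30.0, 40.0, 50.0]
in_subgroup = [True, False, True, False]
mean_biased = apply_mean_bias(age, in_subgroup)
noisy = apply_random_bias(age, in_subgroup)
print(mean_biased)
```

In both cases only the subgroup rows change, so the perturbed feature should become less informative for the subgroup while staying intact for everyone else.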
+## Python modules
+### Loading dataset and model
+In the current version of the tool, we load one of the selected datasets from OpenML and train a model on it.
+To load your own dataset, please edit the `import_dataset` function or its parent `load_data` function in `load.py`, retaining the same function signatures.
+To load your own model, please edit the `get_classifier` function in `load.py`, retaining the same function signature (returning the trained model and predicted probabilities of the positive class).
+### Metrics
+We specify the metrics used in the thesis in the `metrics.py` file. In case further metrics are needed, they can be added to the methods in that file.
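
As a hedged example of what an additional metric could look like, the sketch below computes a binned calibration error: the weighted mean absolute gap between the mean predicted probability and the observed positive rate in each bin. The function name and signature are hypothetical and would need adapting to the conventions used in `metrics.py`.

```python
def binned_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean |observed rate - mean predicted probability| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((y, p))
    error = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        error += len(members) / len(y_true) * abs(observed - predicted)
    return error


y_true = [0, 0, 1, 1]
y_prob = [0.25, 0.25, 0.75, 0.75]
ece = binned_calibration_error(y_true, y_prob, n_bins=4)
print(f"{ece:.2f}")
```

Computing such a value once for the subgroup and once for the full dataset, then taking the difference, would follow the same pattern as the other `*_diff` options in the metric dropdown.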
+## Known issues:
+Currently, there are some non-critical issues with the project:
+- When switching tabs, the height of the plots can switch to the default 500px. It can be fixed by reselecting the subgroup (which regenerates the plots).
+- XGBoost seems to fail to train on the German Credit dataset, which is likely because the dataset has some special characters in feature names.

app.py ADDED

	@@ -0,0 +1,1175 @@

+import argparse
+import logging
+import time
+from typing import Tuple, Union
+import warnings
+# Ignore UserWarning from shap
+warnings.filterwarnings("ignore", category=UserWarning)
+log = logging.getLogger("werkzeug")
+log.setLevel(logging.ERROR)
+import dash_bootstrap_components as dbc
+import numpy as np
+import pandas as pd
+from dash import dcc, html
+from dash.dependencies import Input, Output
+from dash.exceptions import PreventUpdate
+from load import (
+    add_bias,
+    get_classifier,
+    get_fairsd_result_set,
+    load_data,
+    get_shap_logloss,
+)
+from dash_app.config import APP_NAME
+from dash_app.main import app
+from dash_app.views.confusion_matrix import CMchart
+from dash_app.views.menu import get_subgroup_dropdown_options
+from plot import (
+    get_data_distr_charts,
+    get_data_table,
+    get_feat_bar,
+    get_feat_box,
+    get_feat_val_violin_plot,
+    get_feat_val_violin_plots,
+    get_feat_val_box,
+    get_feat_val_bar,
+    get_feat_table,
+    get_sg_hist,
+    plot_calibration_curve,
+    plot_roc_curves,
+)
+from metrics import (
+    Y_PRED_METRICS,
+    get_qf_from_str,
+    get_quality_metric_from_str,
+    sort_quality_metrics_df,
+)
+def prepare_app(
+    n_samples=0, dataset="adult", bias=False, test_split=0.3, model="rf", sensitive_features=None
+) -> Tuple[pd.DataFrame, np.ndarray, np.ndarray, pd.DataFrame, pd.Series]:
+    """Loads the data and trains the classifier
+    Args:
+        n_samples (int, optional): Number of samples to load. Defaults to 0.
+        dataset (str, optional): Name of the dataset to load. Defaults to "adult".
+        bias (Union[str, bool], optional): Type of bias to add to the dataset. Defaults to False.
+            Only applied when dataset is "adult".
+            If set to "random", adds random noise to a random feature for a random subset of the data.
+            If set to "swap", swaps the values of a random feature for a random subset of the data.
+        test_split (float, optional): Passed to load_data. Defaults to 0.3.
+        model (str, optional): Passed to get_classifier. Defaults to "rf".
+        sensitive_features (optional): Passed to load_data. Defaults to None.
+    Returns:
+        Tuple: X_test, y_true_test, y_pred_prob, shap_logloss_df, random_subgroup
+    """
+    # Loading and training
+    (
+        X_test,
+        y_true_train,
+        y_true_test,
+        onehot_X_train,
+        onehot_X_test,
+        cat_features,
+    ) = load_data(n_samples=n_samples, dataset=dataset, test_split=test_split, sensitive_features=sensitive_features)
+    random_subgroup = pd.Series(
+        np.random.choice([True, False], size=len(X_test), p=[0.5, 0.5])
+    )
+    if dataset == "adult" and bias:
+        add_bias(bias, X_test, onehot_X_test, random_subgroup)
+    classifier, y_pred_prob = get_classifier(
+        onehot_X_train, y_true_train, onehot_X_test, model=model
+    )
+    shap_logloss_df = get_shap_logloss(
+        classifier, onehot_X_test, y_true_test, X_test, cat_features
+    )
+    return X_test, y_true_test, y_pred_prob, shap_logloss_df, random_subgroup
+def run_app(
+    n_samples: int,
+    dataset: str,
+    bias: Union[str, bool] = False,
+    random_subgroup=False,
+    test_split=True,
+    model="rf",
+    depth=1,
+    min_support=100,
+    min_support_ratio=0.1,
+    min_quality=0.01,
+    sensitive_features=None,
+):
+    """Runs the app with the given configuration"""
+    use_random_subgroup = (
+        random_subgroup or bias
+    )  # When evaluating bias, we want to evaluate against a random subgroup
+    start = time.time()
+    (
+        X_test_global,
+        y_true_global_test,
+        y_pred_prob_global,
+        shap_logloss_df_global,
+        random_subgroup_global,
+    ) = prepare_app(
+        n_samples=n_samples,
+        dataset=dataset,
+        bias=bias,
+        test_split=test_split,
+        model=model,
+        sensitive_features=sensitive_features,
+    )
+    app.layout = html.Div(
+        id="app-container",
+        children=[
+            dcc.Store(id="result-set-dict"),
+            # Header
+            dbc.Row(
+                [
+                    dbc.Col(
+                        id="left-column",
+                        children=[
+                            dbc.Row(
+                                [
+                                    dbc.Col(
+                                        html.H5(APP_NAME),
+                                        style={
+                                            "align-items": "center",
+                                            "height": "fit-content",
+                                            "white-space": "nowrap",
+                                            "width": 4,
+                                        },
+                                    ),
+                                    dbc.Col(
+                                        html.H6(
+                                            "0. Subgroup Discovery Metric: ",
+                                        ),
+                                        style={
+                                            "display": "flex",
+                                            "align-items": "center",
+                                            "height": "fit-content",
+                                            "justify-content": "right",
+                                            "width": 4,
+                                        },
+                                    ),
+                                    dbc.Col(
+                                        dcc.Dropdown(
+                                            id="fairness-metric-dropdown",
+                                            options=[
+                                                {
+                                                    "label": "Equalized Odds Difference",
+                                                    "value": "equalized_odds_diff",
+                                                },
+                                                {
+                                                    "label": "Avg Log Loss Difference",
+                                                    "value": "average_log_loss_diff",
+                                                },
+                                                {
+                                                    "label": "AUROC (ROC AUC) Difference",
+                                                    "value": "auroc_diff",
+                                                },
+                                                {
+                                                    "label": "Miscalibration Difference",
+                                                    "value": "miscalibration_diff",
+                                                },
+                                                {
+                                                    "label": "Brier Score Difference",
+                                                    "value": "brier_score_diff",
+                                                },
+                                                {
+                                                    "label": "Accuracy Difference",
+                                                    "value": "acc_diff",
+                                                },
+                                                {
+                                                    "label": "F1-score Difference",
+                                                    "value": "f1_diff",
+                                                },
+                                                {
+                                                    "label": "False Positive Rate Difference",
+                                                    "value": "fpr_diff",
+                                                },
+                                                {
+                                                    "label": "True Positive Rate Difference",
+                                                    "value": "tpr_diff",
+                                                },
+                                                {
+                                                    "label": "False Negative Rate Difference",
+                                                    "value": "fnr_diff",
+                                                },
+                                            ],
+                                            value="average_log_loss_diff",
+                                            # Disable clearing
+                                            clearable=False,
+                                        )
+                                    ),
+                                ]
+                            ),
+                        ],
+                    ),
+                    dbc.Col(
+                        id="right-column",
+                        children=[
+                            dbc.Row(
+                                [
+                                    dbc.Col(
+                                        html.Center(
+                                            html.H6("1. Select Subgroup: "),
+                                        ),
+                                        width=3,
+                                        style={
+                                            "align-items": "right",
+                                            "height": "fit-content",
+                                        },
+                                    ),
+                                    dbc.Col(
+                                        dcc.Dropdown(
+                                            id="subgroup-dropdown",
+                                            options=[],
+                                            style={
+                                                "align-items": "left",
+                                                "height": "fit-content",
+                                            },
+                                            # Disable clearing
+                                            clearable=False,
+                                        ),
+                                        width=9,
+                                        style={
+                                            "align-items": "left",
+                                            "height": "fit-content",
+                                        },
+                                    ),
+                                ],
+                            )
+                        ],
+                    ),
+                ],
+                style={"height": "5vh"},
+            ),
+            dcc.Tabs(
+                id="tabs",
+                value="impact",
+                children=[
+                    dcc.Tab(
+                        id="impact",
+                        label="2. Underperformance Overview",
+                        value="impact",
+                        children=[
+                            # Split the tab into two columns
+                            html.Div(
+                                className="row",
+                                children=[
+                                    html.Div(
+                                        className="six columns",
+                                        children=[
+                                            html.H6(
+                                                "Full Dataset Baseline",
+                                                style={
+                                                    "border-bottom": "3px solid #d3d3d3",
+                                                    "font-weight": "normal",
+                                                },
+                                            ),
+                                            dbc.Row(
+                                                [
+                                                    dbc.Col(
+                                                        dbc.Row(
+                                                            [
+                                                                dbc.Col(
+                                                                    html.Div(
+                                                                        id="simple-baseline-table",
+                                                                        className="six-columns",
+                                                                        children="Wait for the baseline to load...",
+                                                                        style={
+                                                                            "align-items": "center",
+                                                                            "height": "fit-content",
+                                                                        },
+                                                                    ),
+                                                                ),
+                                                                dbc.Col(
+                                                                    dcc.Graph(
+                                                                        id="simple-baseline-conf",
+                                                                        style={
+                                                                            "align-items": "center",
+                                                                            "height": "20vh",
+                                                                            "font-size": "0.8rem",
+                                                                        },
+                                                                    )
+                                                                ),
+                                                            ]
+                                                        ),
+                                                    ),
+                                                ]
+                                            ),
+                                            html.Br(),
+                                            dcc.Graph(id="simple-baseline-hist"),
+                                            # html.H6(
+                                            #     "Select decision threshold for the model:"
+                                            # ),
+                                            # dcc.Slider(
+                                            #     0.1,
+                                            #     0.9,
+                                            #     0.1,
+                                            #     value=0.5,
+                                            #     id="simple-baseline-threshold-slider",
+                                            # ),
+                                        ],
+                                        style={
+                                            "textAlign": "center",
+                                            "margin-right": "0.5",
+                                        },
+                                    ),
+                                    html.Div(
+                                        className="six columns",
+                                        children=[
+                                            html.H6(
+                                                "Selected Subgroup",
+                                                style={
+                                                    "border-bottom": "3px solid #d3d3d3",
+                                                    "font-weight": "normal",
+                                                },
+                                            ),
+                                            dbc.Row(
+                                                [
+                                                    dbc.Col(
+                                                        [
+                                                            html.Div(
+                                                                id="simple-subgroup-col",
+                                                                className="six-columns",
+                                                                children="Select subgroup and wait for the visualizations to load. ",
+                                                                style={
+                                                                    "align-items": "center",
+                                                                    "height": "fit-content",
+                                                                },
+                                                            ),
+                                                        ]
+                                                    ),
+                                                    dbc.Col(
+                                                        dcc.Graph(
+                                                            id="simple-subgroup-conf",
+                                                            style={
+                                                                "height": "20vh",
+                                                                "font-size": "0.4rem",
+                                                            },
+                                                        )
+                                                    ),
+                                                ]
+                                            ),
+                                            html.Br(),
+                                            dcc.Graph(id="simple-subgroup-hist"),
+                                        ],
+                                        style={
+                                            "textAlign": "center",
+                                            # "border-left": "3px solid #d3d3d3",
+                                            "margin-left": "1",
+                                        },
+                                    ),
+                                ],
+                            ),
+                        ],
+                    ),
+                    dcc.Tab(
+                        label="3. Performance and Calibration",
+                        value="performance_tab",
+                        children=[
+                            html.Div(
+                                className="row",
+                                children=[
+                                    dbc.Row(
+                                        [
+                                            dbc.Col(
+                                                [
+                                                    dcc.Graph(
+                                                        id="perf-roc",
+                                                    ),
+                                                ]
+                                            ),
+                                            dbc.Col(
+                                                [
+                                                    dcc.Graph(
+                                                        id="calibration_curve",
+                                                    ),
+                                                    html.H6(
+                                                        "Select number of bins for the calibration plot:"
+                                                    ),
+                                                    dcc.Slider(
+                                                        4,
+                                                        20,
+                                                        1,
+                                                        value=10,
+                                                        id="calibration-slider",
+                                                    ),
+                                                ]
+                                            ),
+                                        ]
+                                    ),
+                                ],
+                                style={"align-items": "center"},
+                            ),
+                        ],
+                    ),
+                    dcc.Tab(
+                        label="4. Loss contributions per feature",
+                        value="feature_contributions_tab",
+                        children=[
+                            dbc.Row(
+                                [
+                                    dcc.Graph(
+                                        id="feat-bar",
+                                        style={
+                                            "align-items": "center",
+                                            "height": "fit-content",
+                                            "font-size": "0.8rem",
+                                        },
+                                    ),
+                                ]
+                            ),
+                            dbc.Row(
+                                [
+                                    dbc.Col(
+                                        [
+                                            html.H6(
+                                                "Select significance level alpha for the Kolmogorov-Smirnov (KS) test - Data table rows in bold are considered significant: "
+                                            ),
+                                            dcc.Slider(
+                                                0.01,
+                                                0.1,
+                                                0.01,
+                                                value=0.05,
+                                                id="feat-alpha-slider",
+                                            ),
+                                            html.H6(
+                                                "Select rounding level (number of decimal points) for SHAP values in the KS test: "
+                                            ),
+                                            dcc.Slider(
+                                                0,
+                                                7,
+                                                1,
+                                                value=2,
+                                                id="feat-sensitivity-slider",
+                                            ),
+                                            html.H6(
+                                                "Rounding level is the number of decimal places to which SHAP values are rounded before they are considered 'distinct'," +
+                                                  " in order to avoid detecting statistically significant but minor differences between distributions in the KS test."
+                                            ),
+                                            dbc.Row([
+                                                dbc.Col([
+                                                    html.H6("Select aggregation method for feature contributions:"),
+                                                ]),
+                                                dbc.Col([
+                                                    dcc.Dropdown(
+                                                        id="feat-agg-dropdown",
+                                                        options=[
+                                                            {
+                                                                "label": "Statistical summary (box)",
+                                                                "value": "box",
+                                                            },
+                                                            {
+                                                                "label": "Distribution details (violin)",
+                                                                "value": "violin",
+                                                            },
+                                                            {
+                                                                "label": "Mean (bar)",
+                                                                "value": "mean",
+                                                            },
+                                                            {
+                                                                "label": "Median (bar)",
+                                                                "value": "median",
+                                                            },
+                                                            {
+                                                                "label": "Mean difference (bar)",
+                                                                "value": "mean_diff",
+                                                            },
+                                                            {
+                                                                "label": "Sum", # Weighted sum in this context is literally the mean
+                                                                "value": "sum",
+                                                            },
+                                                        ],
+                                                        value="mean",
+                                                        style={
+                                                            "align-items": "center",
+                                                            "text-align": "center",
+                                                        },
+                                                        # Disable clearing
+                                                        clearable=False,
+                                                    ),
+                                                ]),
+                                            ])
+                                        ]
+                                    ),
+                                    html.Br(),
+                                    dbc.Col(
+                                        [
+                                            html.Div(
+                                                id="feat-table-col",
+                                                className="six-columns",
+                                                children="Data table for feature contributions. Select a subgroup to update the table.",
+                                                style={
+                                                    "align-items": "center",
+                                                    "height": "fit-content",
+                                                },
+                                            ),
+                                        ]
+                                    ),
+                                ]
+                            ),
+                        ],
+                    ),
+                    dcc.Tab(
+                        label="5. Loss contributions per feature value",
+                        value="feature_value_contributions_tab",
+                        children=[
+                            html.Div(
+                                className="row",
+                                children=[
+                                    dbc.Row(
+                                        [
+                                            dbc.Col(
+                                                [
+                                                    html.H6(
+                                                        "Select feature for value contributions:"
+                                                    ),
+                                                    dcc.Dropdown(
+                                                        id="feat-val-feature-dropdown",
+                                                        options=[
+                                                            {"label": col, "value": col}
+                                                            for col in X_test_global.columns
+                                                        ],
+                                                        value=X_test_global.columns[0],
+                                                        style={
+                                                            "align-items": "center",
+                                                            "width": "50%",
+                                                            "text-align": "center",
+                                                        },
+                                                        # Disable clearing
+                                                        clearable=False,
+                                                    ),
+                                                ]
+                                            ),
+                                            dbc.Col([
+                                                html.H6("Select aggregation method for feature value contributions:"),
+                                                dcc.Dropdown(
+                                                    id="feat-val-agg-dropdown",
+                                                    options=[
+                                                        {
+                                                            "label": "Mean",
+                                                            "value": "mean",
+                                                        },
+                                                        {
+                                                            "label": "Median",
+                                                            "value": "median",
+                                                        },
+                                                        {
+                                                            "label": "Sum (Total)",
+                                                            "value": "sum",
+                                                        },
+                                                        {
+                                                            "label": "Sum (weighted)",
+                                                            "value": "sum_weighted",
+                                                        },
+                                                        # {
+                                                        #     "label": "Statistical summary (box plot)",
+                                                        #     "value": "box",
+                                                        # }, # FIXME: Box plots should be next to each other not on top
+                                                        {
+                                                            "label": "Distribution details (violin)",
+                                                            "value": "violin",
+                                                        },
+                                                    ],
+                                                    value="sum_weighted",
+                                                    style={
+                                                        "align-items": "center",
+                                                        "text-align": "center",
+                                                        "width": "50%",
+                                                    },
+                                                    # Disable clearing
+                                                    clearable=False,
+                                                ),
+                                            ]),
+                                        ]
+                                    ),
+                                    html.Br(),
+                                    dbc.Row([
+                                        dcc.Graph(id="feat-val-plot"),
+                                    ]),
+                                    html.Br(),
+                                    dcc.Slider(
+                                        2, 20, 2, value=8, id="feat-val-hist-slider"
+                                    ),
+                                ],
+                                style={"align-items": "center"},
+                            ),
+                        ],
+                    ),
+                    dcc.Tab(
+                        label="6. Class Imbalances",
+                        value="data_tab",
+                        children=[
+                            # Split into three equal-width columns with headers
+                            html.Div(
+                                className="row",
+                                children=[
+                                    dbc.Row(
+                                        [
+                                            dbc.Col(
+                                                [
+                                                    html.H6(
+                                                        "Select feature for distribution plots:"
+                                                    ),
+                                                    dcc.Dropdown(
+                                                        id="data-feature-dropdown",
+                                                        options=[
+                                                            {"label": col, "value": col}
+                                                            for col in X_test_global.columns
+                                                        ],
+                                                        value=X_test_global.columns[0],
+                                                        style={
+                                                            "align-items": "center",
+                                                            "text-align": "center",
+                                                            "border-bottom": "3px solid #d3d3d3"
+                                                        },
+                                                        # Disable clearing
+                                                        clearable=False,
+                                                    ),
+                                                ]
+                                            ),
+                                            dbc.Col(
+                                                [
+                                                    html.H6("Select class of samples to plot:"),
+                                                    dcc.Dropdown(
+                                                        id="data-label-dropdown",
+                                                        options=[
+                                                            {"label": "All", "value": "all"},
+                                                            {"label": "Positive samples only", "value": 1},
+                                                            {"label": "Negative samples only", "value": 0},
+                                                        ],
+                                                        value="all",
+                                                        style={
+                                                            "align-items": "center",
+                                                            "text-align": "center",
+                                                            "border-bottom": "3px solid #d3d3d3"
+                                                        },
+                                                        # Disable clearing
+                                                        clearable=False,
+                                                    ),
+                                                ]
+                                            ),
+                                            dbc.Col(
+                                                [
+                                                    # Dropdown for percentage/absolute values
+                                                    html.H6(
+                                                        "Select aggregation method for distribution plots:"
+                                                    ),
+                                                    dcc.Dropdown(
+                                                        id="data-agg-dropdown",
+                                                        options=[
+                                                            {
+                                                                "label": "Percentage",
+                                                                "value": "percentage",
+                                                            },
+                                                            {
+                                                                "label": "Count",
+                                                                "value": "count",
+                                                            },
+                                                        ],
+                                                        value="percentage",
+                                                        style={
+                                                            "align-items": "center",
+                                                            "text-align": "center",
+                                                            "border-bottom": "3px solid #d3d3d3"
+                                                        },
+                                                        # Disable clearing
+                                                        clearable=False,
+                                                    ),
+                                                ]
+                                            ),
+                                        ]
+                                    ),
+                                    html.Br(),
+                                    dbc.Row(
+                                        [
+                                            dbc.Col([
+                                                dcc.Graph(id="data-class-dist-plot"),
+                                            ]),
+                                            dbc.Col([
+                                                dcc.Graph(id="data-pred-dist-plot"),
+                                            ]),
+                                        ]
+                                    ),
+                                    html.H6(
+                                        "Select (max) number of bins (numerical features only):"
+                                    ),
+                                    dcc.Slider(
+                                        5, 30, 5, value=10, id="data-hist-slider"
+                                    ),
+                                ],
+                                style={"align-items": "center"},
+                            ),
+                        ],
+                    ),
+                ],
+                # Set headers to bold font
+                style={"font-weight": "bold"},
+            ),
+        ],
+    )
+    @app.callback(
+        Output("simple-baseline-table", "children"),
+        Output("simple-baseline-conf", "figure"),
+        Output("simple-baseline-hist", "figure"),
+        Output("subgroup-dropdown", "options"),
+        Output("result-set-dict", "data"),
+        Input("fairness-metric-dropdown", "value"),
+        # Input("simple-baseline-threshold-slider", "value"),
+    )
+    def get_baseline_stats_and_subgroups(metric, threshold=0.5):
+        if not metric:
+            raise PreventUpdate
+        y_true = y_true_global_test.copy()
+        y_pred_prob = y_pred_prob_global.copy()
+        if metric in Y_PRED_METRICS:
+            y_pred = (y_pred_prob >= threshold).astype(int)
+        else:
+            y_pred = y_pred_prob.copy()
+        y_df = pd.DataFrame({"y_true": y_true, "probability": y_pred_prob})
+        y_df["category"] = y_df.apply(
+            lambda row: "TP"
+            if row["y_true"] == 1 and row["probability"] >= threshold
+            else "FP"
+            if row["y_true"] == 0 and row["probability"] >= threshold
+            else "FN"
+            if row["y_true"] == 1 and row["probability"] < threshold
+            else "TN",
+            axis=1,
+        )
+        baseline_descr = "Full dataset baseline"
+        baseline_data_table = get_data_table(
+            baseline_descr,
+            y_true,
+            y_pred,
+            y_pred_prob,
+            qf_metric=metric,
+            sg_feature=pd.Series([True] * y_true.shape[0]),
+            # The number of bins is not updated here to avoid rerunning DSSD; the default number of bins is generally sufficient for the baseline
+        )
+        baseline_conf_mat = CMchart(
+            "Confusion Matrix", y_true, (y_pred_prob >= threshold).astype(int)
+        ).fig
+        baseline_hist = get_sg_hist(
+            y_df, title="Histogram of prediction probabilities on the full dataset"
+        )
+        if use_random_subgroup:
+            sg_feature = random_subgroup_global.copy()
+            # Replace the result_set_df with a synthetic random subgroup
+            sg_y_pred = (
+                y_pred[sg_feature]
+                if metric in Y_PRED_METRICS
+                else y_pred_prob[sg_feature]
+            )
+            sg_y_true = y_true[sg_feature]
+            name = "Random subgroup"
+            if bias:
+                name += " with bias"
+            result_set_df = pd.DataFrame(
+                {
+                    "quality": [None],
+                    "description": [name],
+                    "size": [sum(sg_feature)],
+                    "proportion": [sum(sg_feature) / len(sg_feature)],
+                    "metric_score": [
+                        get_quality_metric_from_str(metric)(sg_y_true, sg_y_pred)
+                    ],
+                }
+            )
+            result_set_json = {
+                "descriptions": [name],
+                "sg_features": [sg_feature.to_json()],
+                "metric": metric,
+            }
+            return (
+                baseline_data_table,
+                baseline_conf_mat,
+                baseline_hist,
+                get_subgroup_dropdown_options(result_set_df, metric),
+                result_set_json,
+            )
+        else:
+            result_set = get_fairsd_result_set(
+                X_test_global,
+                y_true_global_test,
+                y_pred,
+                qf=get_qf_from_str(metric),
+                # method="between_groups",
+                method="to_overall",
+                depth=depth,
+                min_support=min_support,
+                min_support_ratio=min_support_ratio,
+                max_support_ratio=0.5,  # To prevent finding majority subgroups
+                logging_level=logging.INFO,
+                sensitive_features=sensitive_features,
+                min_quality=min_quality,
+            )
+            result_set_df = result_set.to_dataframe()
+            metrics = []
+            for idx in range(len(result_set_df)):
+                description = result_set.get_description(idx)
+                sg_feature = description.to_boolean_array(X_test_global)
+                sg_y_pred = (
+                    y_pred[sg_feature]
+                    if metric in Y_PRED_METRICS
+                    else y_pred_prob[sg_feature]
+                )
+                sg_y_true = y_true[sg_feature]
+                metrics.append(
+                    get_quality_metric_from_str(metric)(sg_y_true, sg_y_pred)
+                )
+            result_set_df["metric_score"] = metrics
+            result_set_df = sort_quality_metrics_df(result_set_df, metric)
+            return (
+                baseline_data_table,
+                baseline_conf_mat,
+                baseline_hist,
+                get_subgroup_dropdown_options(result_set_df, metric),
+                result_set.to_json(
+                    X_test_global, metric, result_set_df
+                ),  # Store the result set representation in the data store
+            )
+    # Get feature value plot based on subgroup and feature selection
+    @app.callback(
+        Output("feat-val-plot", "figure"),
+        Input("feat-val-feature-dropdown", "value"),
+        Input("subgroup-dropdown", "value"),
+        Input("result-set-dict", "data"),
+        Input("feat-val-hist-slider", "value"),
+        Input("feat-val-agg-dropdown", "value"),
+    )
+    def get_feat_val_plot(feature, subgroup, data, nbins, agg):
+        """Produces a violin, box, or bar chart of the feature value contributions for the selected subgroup"""
+        if not feature:
+            raise PreventUpdate
+        if subgroup is None:
+            raise PreventUpdate
+        if not nbins:
+            logging.error("Error: No bins selected. This should not happen.")
+            raise PreventUpdate
+        if len(data["descriptions"]) == 0:
+            logging.error("Error: No subgroups found. This should not happen.")
+            raise PreventUpdate
+        description = data["descriptions"][subgroup]
+        sg_feature = pd.read_json(data["sg_features"][subgroup], typ="series")
+        if agg == "violin":
+            return get_feat_val_violin_plot(
+                X_test_global.copy(),
+                shap_logloss_df_global.copy(),
+                sg_feature,
+                feature,
+                description,
+                nbins=nbins,
+            )
+        elif agg == "box":
+            return get_feat_val_box(
+                X_test_global.copy(),
+                shap_logloss_df_global.copy(),
+                sg_feature=sg_feature,
+                feature=feature,
+                description=description,
+                nbins=nbins,
+            )
+        else:
+            return get_feat_val_bar(
+                X_test_global.copy(),
+                shap_logloss_df_global.copy(),
+                sg_feature=sg_feature,
+                feature=feature,
+                description=description,
+                nbins=nbins,
+                agg=agg,
+            )
+    # Get calibration plot based on subgroup and slider selection
+    @app.callback(
+        Output("calibration_curve", "figure"),
+        Input("calibration-slider", "value"),
+        Input("subgroup-dropdown", "value"),
+        Input("result-set-dict", "data"),
+    )
+    def get_calibration_plot(slider_value, subgroup, data):
+        """Produces a calibration plot for the selected subgroup"""
+        if not slider_value:
+            raise PreventUpdate
+        if subgroup is None:
+            raise PreventUpdate
+        if len(data["sg_features"]) == 0:
+            logging.error("Error: No subgroups found. This should not happen.")
+            raise PreventUpdate
+        sg_feature = pd.read_json(data["sg_features"][subgroup], typ="series")
+        y_true = y_true_global_test.copy()
+        y_pred_prob = y_pred_prob_global.copy()
+        return plot_calibration_curve(
+            y_true, y_pred_prob, sg_feature, n_bins=slider_value
+        )
+    # Get data distributions based on subgroup selection
+    @app.callback(
+        Output("data-class-dist-plot", "figure"),
+        Output("data-pred-dist-plot", "figure"),
+        Input("data-feature-dropdown", "value"),
+        Input("data-agg-dropdown", "value"),
+        Input("subgroup-dropdown", "value"),
+        Input("result-set-dict", "data"),
+        Input("data-hist-slider", "value"),
+        Input("data-label-dropdown", "value"),
+        # Input("simple-baseline-threshold-slider", "value"),
+    )
+    def get_data_feat_distr(feature, agg, subgroup, data, bins, class_label, threshold=0.5):
+        """Produces class and prediction distribution charts of the selected feature's values for the selected subgroup"""
+        if not feature:
+            raise PreventUpdate
+        if not agg:
+            raise PreventUpdate
+        if subgroup is None:
+            raise PreventUpdate
+        if not bins:
+            logging.error("Error: No bins selected. This should not happen.")
+            raise PreventUpdate
+        try:
+            description = data["descriptions"][subgroup]
+            sg_feature = pd.read_json(data["sg_features"][subgroup], typ="series")
+        except IndexError:
+            logging.error("Error: Subgroup not found. This should not happen.")
+            raise PreventUpdate
+        y_pred_prob = y_pred_prob_global.copy()
+        y_pred = (y_pred_prob >= threshold).astype(int)
+        y_true = y_true_global_test.copy()
+        X_test = X_test_global.copy()
+        if class_label in (0, 1):
+            y_pred = y_pred[y_true == class_label]
+            X_test = X_test[y_true == class_label]
+        class_plot, pred_plot = get_data_distr_charts(
+            X_test, y_pred, sg_feature, feature, description, bins, agg
+        )
+        return class_plot, pred_plot
+    # Get feat-table-col
+    @app.callback(
+        Output("feat-table-col", "children"),
+        Input("subgroup-dropdown", "value"),
+        Input("result-set-dict", "data"),
+        Input("feat-alpha-slider", "value"),
+        Input("feat-sensitivity-slider", "value"),
+    )
+    def get_feat_table_col(subgroup, data, alpha, sensitivity):
+        """Returns the feature contributions table for the selected subgroup"""
+        if subgroup is None:
+            raise PreventUpdate
+        if not alpha:
+            alpha = 0.05
+        if len(data["descriptions"]) == 0:
+            logging.error("Error: No subgroups found. This should not happen.")
+            raise PreventUpdate
+        sg_feature = pd.read_json(data["sg_features"][subgroup], typ="series")
+        shap_df = shap_logloss_df_global.copy()
+        return get_feat_table(
+            shap_values_df=shap_df,
+            sg_feature=sg_feature,
+            sensitivity=sensitivity,
+            alpha=alpha,
+        )
+    # Get plots based on the subgroup selection
+    @app.callback(
+        Output("simple-subgroup-col", "children"),
+        Output("simple-subgroup-conf", "figure"),
+        Output("simple-subgroup-hist", "figure"),
+        Output("perf-roc", "figure"),
+        Output("feat-bar", "figure"),
+        Input("result-set-dict", "data"),
+        Input("subgroup-dropdown", "value"),
+        Input("calibration-slider", "value"),
+        Input("feat-agg-dropdown", "value"),
+        # Input("simple-baseline-threshold-slider", "value"),
+    )
+    def get_subgroup_stats(data, subgroup, nbins, agg, threshold=0.5):
+        """Returns the group description and updates the charts of the selected subgroup"""
+        if subgroup is None:
+            # TODO: Return baseline-only plots when no subgroup is selected
+            raise PreventUpdate
+        if len(data["descriptions"]) == 0:
+            logging.error("Error: No subgroups found. This should not happen.")
+            raise PreventUpdate
+        if not nbins:
+            logging.error("Error: No bins selected. This should not happen.")
+            raise PreventUpdate
+        sg_feature = pd.read_json(data["sg_features"][subgroup], typ="series")
+        description = data["descriptions"][subgroup]
+        subgroup_description = str(description).replace(" ", "")
+        subgroup_description = subgroup_description.replace("AND", " AND ")
+        metric = data["metric"]
+        y_true = y_true_global_test.copy()
+        y_pred_prob = y_pred_prob_global.copy()
+        y_pred = (y_pred_prob >= threshold).astype(int)
+        y_df = pd.DataFrame({"y_true": y_true, "probability": y_pred_prob})
+        y_df["category"] = y_df.apply(
+            lambda row: "TP"
+            if row["y_true"] == 1 and row["probability"] >= threshold
+            else "FP"
+            if row["y_true"] == 0 and row["probability"] >= threshold
+            else "FN"
+            if row["y_true"] == 1 and row["probability"] < threshold
+            else "TN",
+            axis=1,
+        )
+        shap_values_df = shap_logloss_df_global.copy()
+        sg_hist = get_sg_hist(y_df[sg_feature])
+        sg_data_table = get_data_table(
+            subgroup_description,
+            y_true,
+            y_pred,
+            y_pred_prob,
+            qf_metric=metric,
+            sg_feature=sg_feature,
+            n_bins=nbins,
+        )
+        roc_fig = plot_roc_curves(
+            y_true, y_pred_prob, sg_feature, title="ROC for subgroup and baseline"
+        )
+        sg_conf_mat = CMchart(
+            "Confusion Matrix", y_true_global_test[sg_feature], y_pred[sg_feature]
+        ).fig
+        if agg == "box":
+            # Get a box plot with the feature importances for sg and baseline
+            feat_plot = get_feat_box(shap_values_df, sg_feature=sg_feature)
+        elif agg == "violin":
+            # Get a violin chart with the feature importances for sg and baseline
+            feat_plot = get_feat_val_violin_plots(
+                shap_values_df, sg_feature=sg_feature
+            )
+        else:
+            feat_plot = get_feat_bar(shap_values_df, sg_feature=sg_feature, agg=agg)
+        return (sg_data_table, sg_conf_mat, sg_hist, roc_fig, feat_plot)
+    print("App startup time (s): ", time.time() - start)
+    app.run(host='0.0.0.0', port=7860, debug=False)
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-n",
+        "--n_samples",
+        type=int,
+        default=0,
+        help="Number of samples to use for the app. Use 0 to load the entire dataset.",
+    )
+    parser.add_argument(
+        "--dataset",
+        type=str,
+        default="adult",
+        help="Dataset to be used in the evaluation. Available options are: 'adult', 'credit_g', 'heloc'",
+    )
+    parser.add_argument(
+        "-r",
+        "--random_subgroup",
+        action="store_true",
+        default=False,
+        help="Flag whether to use a random subgroup for evaluation",
+    )
+    parser.add_argument(
+        "-b",
+        "--bias",
+        type=str,
+        default=False,
+        help="Type of bias to add to the dataset",
+    )
+    parser.add_argument(
+        "-s",
+        "--test_split",
+        type=float,
+        default=0.3,
+        help="Ratio of samples selected for the test set used for visualizations. A value of 0 or 1 uses the full dataset for both the training and test sets (useful with smaller models).",
+    )
+    parser.add_argument(
+        "-m",
+        "--model",
+        type=str,
+        default="rf",
+        help="Model to use for the evaluation. Available options are: 'rf', 'dt', 'xgb'",
+    )
+    parser.add_argument(
+        "-d",
+        "--depth",
+        type=int,
+        default=1,
+        help="Depth of the subgroup discovery algorithm search",
+    )
+    parser.add_argument(
+        "--min_support",
+        type=int,
+        default=100,
+        help="Minimum support (subgroup size) for the subgroup discovery algorithm",
+    )
+    parser.add_argument(
+        "--min_support_ratio",
+        type=float,
+        default=0.1,
+        help="Min support ratio for the subgroup discovery algorithm",
+    )
+    parser.add_argument(
+        "--min_quality",
+        type=float,
+        default=0.01,
+        help="Minimum quality for the subgroup discovery algorithm",
+    )
+    parser.add_argument(
+        "--sensitive_features",
+        type=str,
+        nargs="+",
+        help="List of sensitive features to use for the subgroup discovery algorithm",
+    )
+    args = parser.parse_args()
+    run_app(
+        args.n_samples,
+        args.dataset,
+        args.bias,
+        args.random_subgroup,
+        args.test_split,
+        args.model,
+        args.depth,
+        args.min_support,
+        args.min_support_ratio,
+        args.min_quality,
+        args.sensitive_features,
+    )

constants.py ADDED

	@@ -0,0 +1,20 @@

+# Selected colors which have a dark version (so that they can be used for both light and dark accents of the same feature)
+colors = [
+    "blue",
+    "cyan",
+    "goldenrod",
+    "grey",
+    "green",
+    "khaki",
+    "magenta",
+    "orange",
+    "orchid",
+    "red",
+    "salmon",
+    "seagreen",
+    "slateblue",
+    "slategray",
+    "slategrey",
+    "turquoise",
+    "violet",
+]

dash_app/assets/base.css ADDED

	@@ -0,0 +1,414 @@

+/* Table of contents
+––––––––––––––––––––––––––––––––––––––––––––––––––
+- Plotly.js
+- Grid
+- Base Styles
+- Typography
+- Blockquotes
+- Links
+- Buttons
+- Forms
+- Lists
+- Tables
+- Spacing
+- Utilities
+- Misc
+- Clearing
+- Media Queries
+*/
+/* Plotly.js
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+/* plotly.js's modebar's z-index is 1001 by default
+ * https://github.com/plotly/plotly.js/blob/7e4d8ab164258f6bd48be56589dacd9bdd7fded2/src/css/_modebar.scss#L5
+ * In case a dropdown is above the graph, the dropdown's options
+ * will be rendered below the modebar
+ * Increase the select option's z-index
+ */
+/* This was actually not quite right -
+   dropdowns were overlapping each other (edited October 26)
+.Select {
+    z-index: 1002;
+}*/
+/* Grid
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+.container {
+  position: relative;
+  width: 100%;
+  max-width: 960px;
+  margin: 0 auto;
+  padding: 0 20px;
+  box-sizing: border-box; }
+.column,
+.columns {
+  width: 100%;
+  float: left;
+  box-sizing: border-box; }
+/* For devices larger than 400px */
+@media (min-width: 400px) {
+  .container {
+    width: 85%;
+    padding: 0; }
+}
+/* For devices larger than 550px */
+@media (min-width: 550px) {
+  .container {
+    width: 80%; }
+  .column,
+  .columns {
+    margin-left: 2%; }
+  .column:first-child,
+  .columns:first-child {
+    margin-left: 1%; }
+  .one.column,
+  .one.columns                    { width: 4.66666666667%; }
+  .two.columns                    { width: 13.3333333333%; }
+  .three.columns                  { width: 22%;            }
+  .four.columns                   { width: 30.6666666667%; }
+  .five.columns                   { width: 39.3333333333%; }
+  .six.columns                    { width: 48%;            }
+  .seven.columns                  { width: 56.6666666667%; }
+  .eight.columns                  { width: 65.3333333333%; }
+  .nine.columns                   { width: 74.0%;          }
+  .ten.columns                    { width: 82.6666666667%; }
+  .eleven.columns                 { width: 91.3333333333%; }
+  .twelve.columns                 { width: 100%; margin-left: 0; }
+  .one-third.column               { width: 30.6666666667%; }
+  .two-thirds.column              { width: 65.3333333333%; }
+  .one-half.column                { width: 48%; }
+  /* Offsets */
+  .offset-by-one.column,
+  .offset-by-one.columns          { margin-left: 8.66666666667%; }
+  .offset-by-two.column,
+  .offset-by-two.columns          { margin-left: 17.3333333333%; }
+  .offset-by-three.column,
+  .offset-by-three.columns        { margin-left: 26%;            }
+  .offset-by-four.column,
+  .offset-by-four.columns         { margin-left: 34.6666666667%; }
+  .offset-by-five.column,
+  .offset-by-five.columns         { margin-left: 43.3333333333%; }
+  .offset-by-six.column,
+  .offset-by-six.columns          { margin-left: 52%;            }
+  .offset-by-seven.column,
+  .offset-by-seven.columns        { margin-left: 60.6666666667%; }
+  .offset-by-eight.column,
+  .offset-by-eight.columns        { margin-left: 69.3333333333%; }
+  .offset-by-nine.column,
+  .offset-by-nine.columns         { margin-left: 78.0%;          }
+  .offset-by-ten.column,
+  .offset-by-ten.columns          { margin-left: 86.6666666667%; }
+  .offset-by-eleven.column,
+  .offset-by-eleven.columns       { margin-left: 95.3333333333%; }
+  .offset-by-one-third.column,
+  .offset-by-one-third.columns    { margin-left: 34.6666666667%; }
+  .offset-by-two-thirds.column,
+  .offset-by-two-thirds.columns   { margin-left: 69.3333333333%; }
+  .offset-by-one-half.column,
+  .offset-by-one-half.columns     { margin-left: 52%; }
+}
+/* Base Styles
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+/* NOTE
+html is set to 62.5% so that all the REM measurements throughout Skeleton
+are based on 10px sizing. So basically 1.5rem = 15px :) */
+html {
+  font-size: 62.5%; }
+body {
+  font-size: 1.5em; /* currently ems cause chrome bug misinterpreting rems on body element */
+  line-height: 1.6;
+  font-weight: 400;
+  font-family: "Open Sans", "HelveticaNeue", "Helvetica Neue", Helvetica, Arial, sans-serif;
+  color: rgb(50, 50, 50); }
+/* Typography
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+h1, h2, h3, h4, h5, h6 {
+  margin-top: 0;
+  margin-bottom: 0;
+  font-weight: 300; }
+h1 { font-size: 4.5rem; line-height: 1.2;  letter-spacing: -.1rem; margin-bottom: 2rem; }
+h2 { font-size: 3.6rem; line-height: 1.25; letter-spacing: -.1rem; margin-bottom: 1.8rem; margin-top: 1.8rem;}
+h3 { font-size: 3.0rem; line-height: 1.3;  letter-spacing: -.1rem; margin-bottom: 1.5rem; margin-top: 1.5rem;}
+h4 { font-size: 2.6rem; line-height: 1.35; letter-spacing: -.08rem; margin-bottom: 1.2rem; margin-top: 1.2rem;}
+h5 { font-size: 2.2rem; line-height: 1.5;  letter-spacing: -.05rem; margin-bottom: 0.6rem; margin-top: 0.6rem;}
+h6 { font-size: 1.7rem; line-height: 1.6;  letter-spacing: 0; margin-bottom: 0.75rem; margin-top: 0.75rem;}
+p {
+  margin-top: 0; }
+/* Blockquotes
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+blockquote {
+  border-left: 4px lightgrey solid;
+  padding-left: 1rem;
+  margin-top: 2rem;
+  margin-bottom: 2rem;
+  margin-left: 0rem;
+}
+/* Links
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+a {
+  color: #1EAEDB;
+  text-decoration: underline;
+  cursor: pointer;}
+a:hover {
+  color: #0FA0CE; }
+/* Buttons
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+.button,
+button,
+input[type="submit"],
+input[type="reset"],
+input[type="button"] {
+  display: inline-block;
+  height: 38px;
+  padding: 0 30px;
+  color: #555;
+  text-align: center;
+  font-size: 11px;
+  font-weight: 600;
+  line-height: 38px;
+  letter-spacing: .1rem;
+  text-transform: uppercase;
+  text-decoration: none;
+  white-space: nowrap;
+  background-color: transparent;
+  border-radius: 4px;
+  border: 1px solid #bbb;
+  cursor: pointer;
+  box-sizing: border-box; }
+.button:hover,
+button:hover,
+input[type="submit"]:hover,
+input[type="reset"]:hover,
+input[type="button"]:hover,
+.button:focus,
+button:focus,
+input[type="submit"]:focus,
+input[type="reset"]:focus,
+input[type="button"]:focus {
+  color: #333;
+  border-color: #888;
+  outline: 0; }
+.button.button-primary,
+button.button-primary,
+input[type="submit"].button-primary,
+input[type="reset"].button-primary,
+input[type="button"].button-primary {
+  color: #FFF;
+  background-color: #33C3F0;
+  border-color: #33C3F0; }
+.button.button-primary:hover,
+button.button-primary:hover,
+input[type="submit"].button-primary:hover,
+input[type="reset"].button-primary:hover,
+input[type="button"].button-primary:hover,
+.button.button-primary:focus,
+button.button-primary:focus,
+input[type="submit"].button-primary:focus,
+input[type="reset"].button-primary:focus,
+input[type="button"].button-primary:focus {
+  color: #FFF;
+  background-color: #1EAEDB;
+  border-color: #1EAEDB; }
+/* Forms
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+input[type="email"],
+input[type="number"],
+input[type="search"],
+input[type="text"],
+input[type="tel"],
+input[type="url"],
+input[type="password"],
+textarea,
+select {
+  height: 38px;
+  padding: 6px 10px; /* The 6px vertically centers text on FF, ignored by Webkit */
+  background-color: #fff;
+  border: 1px solid #D1D1D1;
+  border-radius: 4px;
+  box-shadow: none;
+  box-sizing: border-box;
+  font-family: inherit;
+  font-size: inherit; /*https://stackoverflow.com/questions/6080413/why-doesnt-input-inherit-the-font-from-body*/}
+/* Removes awkward default styles on some inputs for iOS */
+input[type="email"],
+input[type="number"],
+input[type="search"],
+input[type="text"],
+input[type="tel"],
+input[type="url"],
+input[type="password"],
+textarea {
+  -webkit-appearance: none;
+     -moz-appearance: none;
+          appearance: none; }
+textarea {
+  min-height: 65px;
+  padding-top: 6px;
+  padding-bottom: 6px; }
+/* Focus styles are disabled. The whole rule is commented out so that its
+   selectors do not merge into the label/legend rule below.
+input[type="email"]:focus,
+input[type="number"]:focus,
+input[type="search"]:focus,
+input[type="text"]:focus,
+input[type="tel"]:focus,
+input[type="url"]:focus,
+input[type="password"]:focus,
+textarea:focus,
+select:focus {
+  border: 1px solid #33C3F0;
+  outline: 0; }
+*/
+label,
+legend {
+  display: block;
+  margin-bottom: 0px; }
+fieldset {
+  padding: 0;
+  border-width: 0; }
+input[type="checkbox"],
+input[type="radio"] {
+  display: inline; }
+label > .label-body {
+  display: inline-block;
+  margin-left: .5rem;
+  font-weight: normal; }
+/* Lists
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+ul {
+  list-style: circle inside; }
+ol {
+  list-style: decimal inside; }
+ol, ul {
+  padding-left: 0;
+  margin-top: 0; }
+ul ul,
+ul ol,
+ol ol,
+ol ul {
+  margin: 1.5rem 0 1.5rem 3rem;
+  font-size: 90%; }
+li {
+  margin-bottom: 1rem; }
+/* Tables
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+table {
+  border-collapse: collapse;
+}
+th,
+td {
+  padding: 12px 15px;
+  text-align: left;
+  border-bottom: 1px solid #E1E1E1; }
+th:first-child,
+td:first-child {
+  padding-left: 0; }
+th:last-child,
+td:last-child {
+  padding-right: 0; }
+/* Spacing
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+button,
+.button {
+  margin-bottom: 0rem; }
+input,
+textarea,
+select,
+fieldset {
+  margin-bottom: 0rem; }
+pre,
+dl,
+figure,
+table,
+form {
+  margin-bottom: 0rem; }
+p,
+ul,
+ol {
+  margin-bottom: 0.75rem; }
+/* Utilities
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+.u-full-width {
+  width: 100%;
+  box-sizing: border-box; }
+.u-max-full-width {
+  max-width: 100%;
+  box-sizing: border-box; }
+.u-pull-right {
+  float: right; }
+.u-pull-left {
+  float: left; }
+/* Misc
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+hr {
+  margin-top: 3rem;
+  margin-bottom: 3.5rem;
+  border-width: 0;
+  border-top: 1px solid #E1E1E1; }
+/* Clearing
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+/* Self Clearing Goodness */
+.container:after,
+.row:after,
+.u-cf {
+  content: "";
+  display: table;
+  clear: both; }
+/* Media Queries
+–––––––––––––––––––––––––––––––––––––––––––––––––– */
+/*
+Note: The best way to structure the use of media queries is to create the queries
+near the relevant code. For example, if you wanted to change the styles for buttons
+on small devices, paste the mobile query code up in the buttons section and style it
+there.
+*/
+/* Larger than mobile */
+@media (min-width: 400px) {}
+/* Larger than phablet (also point when grid becomes active) */
+@media (min-width: 550px) {}
+/* Larger than tablet */
+@media (min-width: 750px) {}
+/* Larger than desktop */
+@media (min-width: 1000px) {}
+/* Larger than Desktop HD */
+@media (min-width: 1200px) {}
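The column widths and offsets in the grid section of `base.css` follow a simple pattern that can be checked numerically. The sketch below assumes a 12-column grid in which each column step is 104/12 percent (about 8.6667%) and a fixed 4% is subtracted from every width; the function names are illustrative and not part of the stylesheet.

```python
def column_width(n: int) -> float:
    """Width in percent of an n-column span (1 <= n <= 12)."""
    return n * 104 / 12 - 4


def column_offset(n: int) -> float:
    """Left margin in percent for an offset of n columns (1 <= n <= 11)."""
    return n * 104 / 12


# Compare against the literal values in the stylesheet.
for n in (1, 3, 6, 12):
    print(n, round(column_width(n), 4))
print(round(column_offset(2), 4))
```

Under this reading, `.one-third` and `.two-thirds` are the four- and eight-column widths, and `.one-half` is the six-column width.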

dash_app/assets/style.css ADDED Viewed

	@@ -0,0 +1,85 @@

+body {
+    background-color: #f9f9f9;
+    color: #333333;
+    font-family: "Helvetica Neue", Arial, sans-serif;
+    font-size: 15px;
+    margin: 5px;
+}
+.zoom
+{
+    zoom: 70%;
+}
+#left-column {
+    padding: 2px 0.6rem;
+    display: flex;
+    flex-direction: column;
+    min-height: 100vh;
+}
+#right-column {
+    padding: 0 0.2rem;
+    min-height: 100vh;
+}
+h5 {
+    color: #2c8cff;
+    font-weight: 300;
+    margin: 10px;
+}
+h6 {
+    color: #333333;
+    font-weight: bold;
+    margin: 10px;
+}
+#intro {
+    margin: 20px 0px;
+    text-align: justify;
+}
+#control-card label {
+    font-weight: bold;
+}
+.tabs {
+    display: flex;
+    flex-direction: row;
+    justify-content: space-between;
+    margin: 0.6rem;
+    height: 1vh;
+}
+.graph_card {
+    background-color: white;
+    border-radius: 0.675rem;
+    padding: 0.6rem;
+    margin: 0.6rem;
+    margin-top: 10px;
+}
+.graph_card h6 {
+    margin:  0 7px;
+    padding-bottom: 7px;
+    border-bottom:  1px solid #ccc;
+}
+.modal {
+    position: fixed;
+    z-index: 1002; /* Sit on top, including modebar which has z=1001 */
+    left: 0;
+    top: 0;
+    width: 30%; /* 30% of the viewport width */
+    height: 100%; /* Full height */
+    background-color: rgba(0, 0, 0, 0.6); /* Black w/ opacity */
+}
+.modal-content {
+    margin: 10px;
+    height: 600px;
+    padding: 10px;
+    background-color: #c58797;
+}
+.column_left {
+    border-right: 1px solid #333333;
+}

dash_app/config.py ADDED Viewed

	@@ -0,0 +1 @@


+APP_NAME = "Subgroup Harm Assessor"

dash_app/main.py ADDED Viewed

	@@ -0,0 +1,11 @@

+from dash import Dash
+from .config import APP_NAME
+import dash_bootstrap_components as dbc
+app = Dash(
+    __name__,
+    external_stylesheets=[dbc.themes.BOOTSTRAP],
+    meta_tags=[{"name": "viewport", "content": "width=device-width"}],
+)
+app.title = APP_NAME

dash_app/views/confusion_matrix.py ADDED Viewed

	@@ -0,0 +1,37 @@

+from dash import dcc, html
+import plotly.express as px
+from explainerdashboard.explainer_plots import plotly_confusion_matrix
+from sklearn.metrics import confusion_matrix
+class CMchart(html.Div):
+    def __init__(self, name, y_true, y_pred):
+        """
+        :param name: name of the plot
+        :param y_true: ground-truth labels
+        :param y_pred: predicted labels
+        """
+        self.html_id = name.lower().replace(" ", "-")
+        self.name = name
+        self.cm = confusion_matrix(y_true, y_pred)
+        self.fig = plotly_confusion_matrix(self.cm)
+        self.title_id = self.html_id + "-t"
+        # Equivalent to `html.Div([...])`
+        super().__init__()
+    def update(self):
+        self.fig = plotly_confusion_matrix(self.cm)
+        self.fig.update_layout(
+            yaxis_zeroline=False, xaxis_zeroline=False, dragmode="select"
+        )
+        self.fig.update_xaxes(fixedrange=True)
+        self.fig.update_yaxes(fixedrange=True)
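+        # update titles; sklearn's confusion matrix has true labels as rows and
+        # predicted labels as columns (assumes the heatmap keeps that orientation)
+        self.fig.update_layout(
+            xaxis_title="Predicted label",
+            yaxis_title="True label",
+        )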
+        return self.fig

dash_app/views/menu.py ADDED Viewed

	@@ -0,0 +1,61 @@

+import pandas as pd
+from metrics import get_name_from_metric_str
+import plotly.express as px
+def get_subgroup_dropdown_options(
+    result_set_df: pd.DataFrame, quality_metric: str = "Quality"
+):
+    """Get the options for the subgroup dropdown. They consist of the subgroup description and the subgroup index."""
+    if result_set_df.empty:
+        return [
+            {
+                "label": "No subgroups found. Check your subgroup search criteria.",
+                "value": -1,
+            }
+        ]
+    # For each description, get the size and quality of the corresponding subgroup
+    return [
+        {
+            "label": str(result_set_df["description"][idx])
+            + "; Size: "
+            + str(result_set_df["size"][idx])
+            # + f"; {quality_metric}: "
+            # + str(result_set_df["quality"][idx].round(3))
+            + f"; {get_name_from_metric_str(quality_metric)}: "
+            + str(result_set_df["metric_score"][idx]),
+            "value": idx,
+        }
+        for idx in range(len(result_set_df))
+    ]
+def get_shap_barchart(
+    shap_values_df, group_feature="group", title="Feature contributions"
+):
+    # Average the SHAP values within each group
+    agg_df = shap_values_df.groupby(group_feature).mean()
+    # Create the fig
+    fig = px.bar(
+        agg_df,
+        y=agg_df.index,
+        x=agg_df.values,
+        color=agg_df.index,
+        barmode="group",
+        title=title,
+        orientation="h",
+    )
+    # Update the fig
+    fig.update_layout(
+        title=title,
+        xaxis_title="Contribution (logloss SHAP)",
+        yaxis_title="",
+    )
+    # Rotate the y-axis tick labels
+    fig.update_layout(yaxis_tickangle=-35)
+    # Set x axis range
+    fig.update_xaxes(range=[-0.1, 0.1])
+    return fig
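To make the label format produced by `get_subgroup_dropdown_options` concrete, here is a pandas-free sketch of the same idea. The `build_options` helper is hypothetical, the rows are made-up values, and the metric name is passed in directly rather than resolved through `get_name_from_metric_str`, whose mapping is not shown in this diff.

```python
def build_options(rows: list[dict], metric_name: str = "Quality") -> list[dict]:
    """Hypothetical, pandas-free analogue of get_subgroup_dropdown_options."""
    if not rows:
        # Sentinel entry shown when the search returned no subgroups.
        return [
            {
                "label": "No subgroups found. Check your subgroup search criteria.",
                "value": -1,
            }
        ]
    return [
        {
            # Label = description, subgroup size, and metric score, joined by "; ".
            "label": f"{row['description']}; Size: {row['size']}; "
            f"{metric_name}: {row['metric_score']}",
            # Value = positional index of the subgroup in the result set.
            "value": idx,
        }
        for idx, row in enumerate(rows)
    ]


# Illustrative rows; the values are made up for the example.
rows = [{"description": "age = (30, 40]", "size": 120, "metric_score": 0.25}]
options = build_options(rows)
print(options[0]["label"])
```

A dropdown callback can treat the `-1` value as "nothing to select" and skip any subgroup lookup.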

fairsd/.gitignore ADDED Viewed

	@@ -0,0 +1,138 @@

+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+# C extensions
+*.so
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+# Translations
+*.mo
+*.pot
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+# Flask stuff:
+instance/
+.webassets-cache
+# Scrapy stuff:
+.scrapy
+# Sphinx documentation
+docs/_build/
+# PyBuilder
+.pybuilder/
+target/
+# Jupyter Notebook
+.ipynb_checkpoints
+# IPython
+profile_default/
+ipython_config.py
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+# SageMath parsed files
+*.sage.py
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# Spyder project settings
+.spyderproject
+.spyproject
+# Rope project settings
+.ropeproject
+# mkdocs documentation
+/site
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+# Pyre type checker
+.pyre/
+# pytype static type analyzer
+.pytype/
+# Cython debug symbols
+cython_debug/

fairsd/LICENSE ADDED Viewed

	@@ -0,0 +1,201 @@

+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright [yyyy] [name of copyright owner]
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.

fairsd/README.md ADDED Viewed

	@@ -0,0 +1,20 @@

+# FairSD
+FairSD is a package that implements top-k subgroup discovery algorithms for identifying subgroups that may be treated unfairly by a machine learning model.<br/>
+The package is designed to let the user choose among different notions of fairness as quality measures. Integration with the [Fairlearn](https://fairlearn.github.io/) package allows the user to use all the [fairlearn metrics](https://fairlearn.github.io/v0.6.0/api_reference/fairlearn.metrics.html) as quality measures. The user can also define custom quality measures by extending the QualityFunction class in the [fairsd.qualitymeasures](https://github.com/MaurizioPulizzi/fairsd/blob/main/fairsd/qualitymeasures.py) module.
+## Usage
+For common usage refer to the [Jupyter notebooks](https://github.com/MaurizioPulizzi/fairsd/tree/main/notebooks). In particular:
+* [Quick start - FairSD usage](https://github.com/MaurizioPulizzi/fairsd/blob/main/notebooks/fairsd_usage.ipynb).
+* [FairSD settings](https://github.com/MaurizioPulizzi/fairsd/blob/main/notebooks/fairsd_settings.ipynb), for a detailed explanation of how to initialize the SubgroupDiscoveryTask object and use the implemented subgroup discovery algorithms.
+## Contributors
+* [Maurizio Pulizzi](https://github.com/MaurizioPulizzi)
+* [Hilde Weerts](https://github.com/hildeweerts)
+## Acknowledgements
+Some parts of the code are an adaptation of the [pysubgroup package](https://github.com/flemmerich/pysubgroup). These parts are indicated in the code.

fairsd/fairsd/__init__.py ADDED Viewed

	@@ -0,0 +1,12 @@

+from .sgdescription import Description
+from .sgdescription import Descriptor
+# from .sgdescription import BinaryTarget
+from .searchspace import SearchSpace
+from .searchspace import Discretizer
+from .qualitymeasures import QualityFunction
+from .algorithms import SubgroupDiscoveryTask
+from .algorithms import BeamSearch
+from .algorithms import DSSD
+from .algorithms import ResultSet

fairsd/fairsd/algorithms.py ADDED Viewed

	@@ -0,0 +1,769 @@

+from typing import List
+import numpy as np
+import pandas as pd
+import fairlearn.metrics as flm
+import inspect
+import logging
+from .sgdescription import Description
+from .searchspace import SearchSpace
+from .searchspace import Discretizer
+"""
+The class SubgroupDiscoveryTask is an adaptation of the class of the same name in the pysubgroup library.
+"""
+quality_function_options = [
+    "equalized_odds_difference",
+    "equalized_odds_ratio",
+    "demographic_parity_difference",
+    "demographic_parity_ratio",
+]
+quality_function_parameters = ["y_true", "y_pred", "sensitive_features"]
+class SubgroupDiscoveryTask:
+    """This is an interface class and will contain all the parameters useful for the sg discovery algorithms."""
+    def __init__(
+        self,
+        X,  # pandas dataframe or numpy array with features
+        y_true,  # numpy array, pandas dataframe, or pandas Series with ground truth labels
+        y_pred=None,  # numpy array, pandas dataframe, or pandas Series with classifier's predicted labels
+        feature_names=None,  # optional, list with column names in case users supply a numpy array X
+        sensitive_features=None,  # list of sensitive features names (str)
+        nominal_features=None,  # optional, list of nominal features
+        numeric_features=None,  # optional, list of numeric features
+        qf="equalized_odds_difference",  # str or callable object
+        discretizer="equalfreq",  # str
+        num_bins=6,
+        dynamic_discretization=True,  # boolean
+        result_set_size=5,  # int
+        depth=3,  # int
+        min_quality=0,  # float
+        min_support=200,  # int
+        min_support_ratio=0.1,  # float
+        max_support_ratio=0.5,  # float
+        logging_level=logging.INFO,
+    ):
+        """
+        Parameters
+        ----------
+        X : pandas dataframe or numpy array
+        y_true : numpy array, pandas dataframe, or pandas Series
+            represent the ground truth
+        y_pred : numpy array, pandas dataframe, or pandas Series
+            contain the predicted values
+        feature_names : list of string
+            this parameter is required if the user supplies X as a numpy array
+        sensitive_features: list of string
+            this list contains the names of the sensitive features
+        nominal_features : optional, list of strings
+            list of nominal features
+        numeric_features : optional, list of strings
+            list of numeric features
+        qf : string or callable object
+        discretizer : string
+            can be "mdlp", "equalfreq" or "equalwidth"
+        num_bins : int
+            maximum number of bins that a numerical feature discretization operation will produce
+        dynamic_discretization : boolean
+        result_set_size : int
+        depth : int
+            maximum number of descriptors in a description
+        min_quality : float
+        min_support : int
+            minimum size of a subgroup
+        min_support_ratio : float
+            minimum proportion of a subgroup compared to the whole dataset size
+        max_support_ratio : float
+            maximum proportion of a subgroup compared to the whole dataset size
+        logging_level : int
+            logging level
+        """
+        logging.basicConfig(level=logging_level)
+        self.inputChecking(
+            X,
+            y_true,
+            y_pred,
+            feature_names,
+            sensitive_features,
+            nominal_features,
+            numeric_features,
+            discretizer,
+            dynamic_discretization,
+            result_set_size,
+            depth,
+            min_quality,
+            min_support,
+            min_support_ratio,
+        )
+        if isinstance(X, np.ndarray):
+            self.data = pd.DataFrame(X, columns=feature_names)
+        else:
+            self.data = X.copy()
+        self.data["y_true"] = y_true
+        if y_pred is not None:
+            self.data["y_pred"] = y_pred
+            self.there_is_y_pred = True
+        else:
+            self.there_is_y_pred = False
+        self.sensitive_features = sensitive_features
+        self.discretizer = Discretizer(
+            discretization_type=discretizer, target="y_true", num_bins=num_bins
+        )
+        self.search_space = SearchSpace(
+            self.data,
+            ["y_true", "y_pred"],
+            nominal_features,
+            numeric_features,
+            dynamic_discretization,
+            self.discretizer,
+            sensitive_features,
+        )
+        self.qf = self.set_qualityfuntion(qf)
+        self.result_set_size = result_set_size
+        self.depth = depth
+        self.min_quality = min_quality
+        self.min_support = min_support
+        self.min_support_ratio = min_support_ratio
+        self.max_support_ratio = max_support_ratio
+    def set_qualityfuntion(self, qf):
+        if isinstance(qf, str):
+            if qf not in quality_function_options:
+                raise ValueError("Quality function not known")
+            else:
+                return getattr(flm, qf)
+        if not callable(qf):
+            raise RuntimeError("Supplied metric object must be callable or string")
+        sig = inspect.signature(qf).parameters
+        for par in quality_function_parameters:
+            if par not in sig:
+                raise ValueError(
+                    "Please use the functions in the fairlearn.metrics package as quality functions or "
+                    "other functions with the same interface"
+                )
+        return qf
+    def inputChecking(
+        self,
+        X,  # pandas dataframe or numpy array
+        y_true,  # numpy array, pandas dataframe, or pandas Series with ground truth labels
+        y_pred,  # numpy array, pandas dataframe, or pandas Series with classifier's predicted labels
+        feature_names,  # optional, list with column names in case users supply a numpy array X
+        sensitive_features,
+        nominal_features,  # optional, list of nominal features
+        numeric_features,  # optional, list of numeric features
+        discretizer,  # str
+        dynamic_discretization,  # boolean
+        result_set_size,  # int
+        depth,  # int
+        min_quality,  # float
+        min_support,  # int
+        min_support_ratio,  # float
+    ):
+        if not (isinstance(X, pd.DataFrame) or isinstance(X, np.ndarray)):
+            raise TypeError("X must be of type numpy.ndarray or pandas.DataFrame")
+        if not (
+            isinstance(y_true, pd.DataFrame)
+            or isinstance(y_true, np.ndarray)
+            or isinstance(y_true, pd.Series)
+        ):
+            raise TypeError(
+                "y_true must be of type numpy.ndarray, pandas.Series or pandas.DataFrame"
+            )
+        if X.shape[0] != y_true.size:
+            raise RuntimeError("X and y_true have two different dimensions")
+        if y_pred is not None:
+            if not (
+                isinstance(y_pred, pd.DataFrame)
+                or isinstance(y_pred, np.ndarray)
+                or isinstance(y_pred, pd.Series)
+            ):
+                raise TypeError(
+                    "y_pred must be of type numpy.ndarray, pandas.Series or pandas.DataFrame"
+                )
+            if y_pred.size != y_true.size:
+                raise RuntimeError("y_pred and y_true have two different dimensions")
+        if isinstance(X, np.ndarray):
+            if (not isinstance(feature_names, list)) or len(feature_names) != X.shape[
+                1
+            ]:
+                raise RuntimeError(
+                    "If X is a numpy.ndarray, feature_names must contain the names of the columns"
+                )
+        if sensitive_features is not None and not isinstance(sensitive_features, list):
+            raise RuntimeError("sensitive_features input must be of list type or None")
+        if nominal_features is not None and not isinstance(nominal_features, list):
+            raise RuntimeError("nominal_features input must be of list type or None")
+        if numeric_features is not None and not isinstance(numeric_features, list):
+            raise RuntimeError("numeric_features input must be of list type or None")
+        if not isinstance(discretizer, str):
+            raise TypeError("discretizer input must be of string type")
+        if discretizer == "mdlp":
+            t = pd.DataFrame(y_true).iloc[:, 0].unique()
+            if set(t) != {0, 1}:
+                raise RuntimeError("MDLP discretization supports only binary target")
+        if not isinstance(dynamic_discretization, bool):
+            raise TypeError("dynamic_discretization input must be of bool type")
+        if not isinstance(result_set_size, int) or result_set_size < 1:
+            raise RuntimeError("result_set_size input must be greater than 0")
+        if not isinstance(depth, int) or depth < 1:
+            raise RuntimeError("depth input must be an integer greater than 0")
+        if not isinstance(min_support, int):
+            raise RuntimeError("min_support input must be of int type")
+        if not isinstance(min_support_ratio, float):
+            raise RuntimeError("min_support_ratio input must be of float type")
+        if min_quality > 1 or min_quality < 0:
+            raise RuntimeError("min_quality input must be between 0 and 1")
+class ResultSet:
+    """
+    This class is used to represent the subgroup set found by one of the
+    subgroup discovery algorithms implemented in this package.
+    """
+    def __init__(self, descriptions_list, x_size):
+        """
+        :param descriptions_list: list of Description objects
+        :param x_size: int, size of the dataset
+        """
+        self.descriptions_list = descriptions_list
+        self.X_size = x_size
+    def to_dataframe(self):
+        """Convert the result set into a dataframe
+        :return: pandas.Dataframe
+        """
+        lod = list()
+        for d in self.descriptions_list:
+            row = [d.quality, d.__repr__(), d.support, d.support / self.X_size]
+            lod.append(row)
+        columns = ["quality", "description", "size", "proportion"]
+        index = [str(x) for x in range(len(self.descriptions_list))]
+        return pd.DataFrame(lod, index=index, columns=columns)
+    def get_description(self, sg_index) -> Description:
+        if sg_index >= len(self.descriptions_list) or sg_index < 0:
+            raise RuntimeError("The requested subgroup doesn't exist")
+        return self.descriptions_list[sg_index]
+    def sg_feature(self, sg_index, X):
+        """
+        This method generates and returns the membership feature of the subgroup with index sg_index.
+        The result is a boolean Series of the same length as the dataset X. The i-th element is
+        True iff the i-th tuple of X belongs to the subgroup with index sg_index.
+        :param sg_index: int, number of the subgroup in the current object
+        :param X: pandas DataFrame or numpy array
+        :return: boolean pandas Series named "sg" + str(sg_index)
+        """
+        if sg_index >= len(self.descriptions_list) or sg_index < 0:
+            raise RuntimeError("The requested subgroup doesn't exist")
+        return pd.Series(
+            self.descriptions_list[sg_index].to_boolean_array(X),
+            name=str("sg" + str(sg_index)),
+        )
+    def __repr__(self):
+        res = ""
+        for desc in self.descriptions_list:
+            res += desc.__repr__() + "\n"
+        return res
+    def print(self):
+        print(self.__repr__())
+    def to_string(self):
+        return self.__repr__()
+    def to_json(self, X: pd.DataFrame, metric: str, result_set_df: pd.DataFrame):
+        """Save string representation of descriptions and sg_feature boolean series to json
+        :param X: pandas DataFrame representing the dataset
+        :param metric: string, name of the metric used to evaluate the subgroups
+        :param result_set_df: pandas DataFrame used for ordering of the results
+        :return: dict
+        """
+        descriptions = []
+        sg_features = []
+        result_set_df.reset_index(inplace=True)
+        for _, row in result_set_df.iterrows():
+            i = row[
+                "index"
+            ]  # The original subgroup index might not be the same as the index in the result set df
+            desc = self.descriptions_list[int(i)]
+            descriptions.append(desc.__repr__())
+            sg_feature = self.sg_feature(int(i), X).to_json()
+            sg_features.append(sg_feature)
+        return {
+            "descriptions": descriptions,
+            "sg_features": sg_features,
+            "metric": metric,
+        }
+class BeamSearch:
+    """This class is used to execute the Beam Search Algorithm."""
+    def __init__(self, beam_width=20):
+        """
+        :param beam_width : int
+        """
+        if beam_width < 1:
+            raise RuntimeError("beam_width must be greater than 0")
+        self.beam_width = beam_width
+    def execute(self, task, method="between_groups"):
+        """
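+        This method executes the Beam Search.
+        :param task : SubgroupDiscoveryTask
+        :param method : string, passed to the quality function; can be "between_groups" or "to_overall", as defined in fairlearn.metrics
+        :return: ResultSet object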
+        Notes
+        -----
+        The list_of_beam variable is a list of lists of descriptions. At the end, the i-th element of
+        list_of_beam will contain the most interesting descriptions formed by i descriptors.
+        """
+        if self.beam_width < task.result_set_size:
+            raise RuntimeError("Beam width is smaller than the result set size!")
+        list_of_beam = list()
+        list_of_beam.append(list())
+        list_of_beam[0] = [Description()]
+        depth = 0
+        while depth < task.depth:
+            list_of_beam.append(list())
+            current_min_quality = 1
+            for last_sg in list_of_beam[depth]:
+                ss = task.search_space.extract_search_space(
+                    task.data, task.discretizer, current_description=last_sg
+                )
+                for sel in ss:
+                    new_Descriptors = list(last_sg.descriptors)
+                    new_Descriptors.append(sel)
+                    new_description = Description(new_Descriptors)
+                    # check for duplicates
+                    if new_description.is_present_in(list_of_beam[depth + 1]):
+                        continue
+                    sg_belonging_feature = new_description.to_boolean_array(
+                        task.data, set_attributes=True
+                    )
+                    # check min support
+                    if (
+                        new_description.size(task.data) < task.min_support
+                        or new_description.size(task.data)
+                        < task.min_support_ratio * task.data.shape[0]
+                        or new_description.size(task.data)
+                        > task.max_support_ratio * task.data.shape[0]
+                    ):
+                        continue
+                    # evaluate subgroup
+                    if task.there_is_y_pred:
+                        quality = task.qf(
+                            y_true=task.data["y_true"],
+                            method=method,
+                            y_pred=task.data["y_pred"],
+                            sensitive_features=sg_belonging_feature,
+                        )
+                    else:
+                        quality = task.qf(
+                            y_true=task.data["y_true"],
+                            method=method,
+                            sensitive_features=sg_belonging_feature,
+                        )
+                    if quality < task.min_quality:
+                        continue
+                    new_description.set_quality(quality)
+                    if len(list_of_beam[depth + 1]) < self.beam_width:
+                        list_of_beam[depth + 1].append(new_description)
+                        if current_min_quality > quality:
+                            current_min_quality = quality
+                    elif quality > current_min_quality:
+                        i = 0
+                        while list_of_beam[depth + 1][i].quality != current_min_quality:
+                            i = i + 1
+                        list_of_beam[depth + 1][i] = new_description
+                        current_min_quality = 1
+                        for d in list_of_beam[depth + 1]:
+                            if d.quality < current_min_quality:
+                                current_min_quality = d.quality
+            depth += 1
+        subgroups = list()
+        for l in list_of_beam[1:]:
+            subgroups.extend(l)
+        subgroups.sort(reverse=True)
+        return ResultSet(subgroups[: task.result_set_size], task.data.shape[0])
+class DSSD:
+    """
+    This class implements the Diverse Subgroup Set Discovery algorithm (DSSD).
+    This algorithm is a variant of the Beam Search Algorithm that also takes into account
+    the redundancy of the generated subgroups.
+    In this implementation a cover-based redundancy definition is used: roughly, the more tuples two subgroups
+    have in common, the more they are considered redundant.
+    This algorithm is described in detail in Van Leeuwen and Knobbe's paper "Diverse Subgroup Set Discovery".
+    """
+    def __init__(self, beam_width=20, a=0.9):
+        """
+        :param beam_width: int
+        :param a: float
+            this parameter corresponds to the alpha parameter.
+            the higher a is, the less the redundancy between subgroups is taken into account.
+        """
+        if beam_width < 1:
+            raise RuntimeError("beam_width must be greater than 0")
+        if a < 0 or a > 1:
+            raise RuntimeError("a-parameter must be between 0 and 1")
+        self.beam_width = beam_width
+        self.a = 1 - a  # for future calculations it is more practical to memorize 1-a
+    def execute(self, task, method="between_groups"):
+        """
+        :param task: SubgroupDiscoveryTask object
+        :param method: string, method of evaluation of the subgroups, can be "between_groups" or "to_overall", as defined in fairlearn.metrics
+        :return: ResultSet object
+        Notes
+        -----
+        The algorithm is divided in three phases:
+        Phase 1: a modified beam search algorithm is performed to find a first non-redundant subset
+        Phase 2: Dominance Pruning - the algorithm tries to generalize the subgroups found in the previous phase
+        Phase 3: The subgroups to return are chosen among the sg-set resulting from the previous phase. Again, the
+            subgroups to put in the result set are chosen by taking into account both quality and diversity.
+        """
+        if self.beam_width < task.result_set_size:
+            raise RuntimeError("Beam width is smaller than the result set size!")
+        # PHASE 1 - MODIFIED BEAM SEARCH
+        list_of_beam = list()
+        self.redundancy_aware_beam_search(list_of_beam, task, method)
+        # PHASE 2 - DOMINANCE PRUNING
+        subgroups = list()
+        subgroups.extend(list_of_beam[1])
+        if len(list_of_beam) > 2:
+            for l in list_of_beam[2:]:
+                self.dominance_pruning(l, subgroups, task, method)
+        if len(subgroups) < task.result_set_size:
+            subgroups.sort(reverse=True)
+            return ResultSet(subgroups, task.data.shape[0])
+        # PHASE 3 - SUBGROUP SELECTION
+        tuples_sg_matrix = []
+        quality_array = []
+        support_array = list()
+        for descr in subgroups:
+            tuples_sg_matrix.append(descr.to_boolean_array(task.data))
+            quality_array.append(descr.get_quality())
+            support_array.append(descr.support)
+        support_array = np.array(support_array)
+        quality_array = np.array(quality_array)
+        tuples_sg_matrix = np.array(tuples_sg_matrix)
+        final_sgs = []
+        self.beam_creation(
+            tuples_sg_matrix,
+            support_array,
+            quality_array,
+            subgroups,
+            final_sgs,
+            task.result_set_size,
+        )
+        final_sgs.sort(reverse=True)
+        return ResultSet(final_sgs, task.data.shape[0])
+    def redundancy_aware_beam_search(self, list_of_beam, task, method="between_groups"):
+        """
+        Parameters
+        ----------
+        list_of_beam : list
+            This list is empty at the beginning and this method will fill it with the beams of each level.
+            The beam of level i will be a list of Description objects where each description is composed of
+            i Descriptors.
+        task : SubgroupDiscoveryTask
+        method : string, method of evaluation of the subgroups, can be "between_groups" or "to_overall", as defined in fairlearn.metrics
+        Notes
+        -----
+        Starting from the beam of the previous level, all the candidate subgroups (descriptions) are generated.
+        After this, the beam of the current level is generated by calling the beam_creation method.
+        """
+        list_of_beam.append(list())
+        list_of_beam[0] = [Description()]
+        depth = 0
+        while depth < task.depth:
+            # Generation of the beam with number of descriptors = depth+1
+            list_of_beam.append(list())
+            logging.debug("DEPTH: " + str(depth + 1))
+            tuples_sg_matrix = (
+                []
+            )  # boolean matrix where rows are candidate subgroups and columns are tuples of the dataset
+            # tuples_sg_matrix[i][j] == True iff subgroup i contains tuple j
+            quality_array = []  # will contain the quality of each candidate subgroup
+            support_array = []  # will contain the support of each candidate subgroup
+            decriptions_list = (
+                list()
+            )  # will contain the description object of each candidate subgroup
+            # generation of candidate subgroups
+            for last_sg in list_of_beam[
+                depth
+            ]:  # for each subgroup in the previous beam
+                ss = task.search_space.extract_search_space(
+                    task.data, task.discretizer, current_description=last_sg
+                )
+                # generation of all the possible extensions of the description last_sg
+                for sel in ss:
+                    new_Descriptors = list(last_sg.descriptors)
+                    new_Descriptors.append(sel)
+                    new_description = Description(new_Descriptors)
+                    # check for duplicates
+                    if new_description.is_present_in(decriptions_list):
+                        continue
+                    sg_belonging_feature = new_description.to_boolean_array(
+                        task.data, set_attributes=True
+                    )
+                    support = new_description.support
+                    # check min support
+                    if (
+                        support < task.min_support
+                        or support < task.min_support_ratio * task.data.shape[0]
+                        or support > task.max_support_ratio * task.data.shape[0]
+                    ):
+                        continue
+                    # comparison with new descriptor alone
+                    sel_feature = Description([sel]).to_boolean_array(task.data)
+                    # evaluate subgroup
+                    if task.there_is_y_pred:
+                        quality = task.qf(
+                            y_true=task.data["y_true"],
+                            method=method,
+                            y_pred=task.data["y_pred"],
+                            sensitive_features=sg_belonging_feature,
+                        )
+                        ### to evaluate
+                        sel_quality = task.qf(
+                            y_true=task.data["y_true"],
+                            method=method,
+                            y_pred=task.data["y_pred"],
+                            sensitive_features=sel_feature,
+                        )
+                    else:
+                        quality = task.qf(
+                            y_true=task.data["y_true"],
+                            method=method,
+                            sensitive_features=sg_belonging_feature,
+                        )
+                        ### to evaluate
+                        sel_quality = task.qf(
+                            y_true=task.data["y_true"],
+                            method=method,
+                            sensitive_features=sel_feature,
+                        )
+                    # if the quality of the new descriptor has deteriorated by merging the new descriptor
+                    # with the current description, we do not add the new descriptor
+                    ### to evaluate
+                    if quality < task.min_quality or quality < sel_quality:
+                        continue
+                    """
+                    # This is for allowing descriptions with negative quality in the first beam
+                    if depth>0 and (quality < task.min_quality or quality < sel_quality):
+                        continue
+                    """
+                    # A priori discard of the descriptions that are dominated by a description containing no sensitive features
+                    pruned_des = []
+                    new_description.set_quality(quality)
+                    self.dominance_pruning([new_description], pruned_des, task, method)
+                    pruned_des.sort(reverse=True)
+                    pruned_attr = pruned_des[0].get_attributes()
+                    if task.sensitive_features is not None:
+                        any_in = any(i in task.sensitive_features for i in pruned_attr)
+                        if not any_in:
+                            continue
+                    # insert current subgroup in the candidates
+                    tuples_sg_matrix.append(sg_belonging_feature)
+                    support_array.append(support)
+                    quality_array.append(quality)
+                    decriptions_list.append(new_description)
+            if len(decriptions_list) == 0:
+                break
+            # CREATION OF THE BEAM
+            support_array = np.array(support_array)
+            quality_array = np.array(quality_array)
+            tuples_sg_matrix = np.array(tuples_sg_matrix)
+            self.beam_creation(
+                tuples_sg_matrix,
+                support_array,
+                quality_array,
+                decriptions_list,
+                list_of_beam[depth + 1],
+                self.beam_width,
+            )
+            for d in list_of_beam[depth + 1]:
+                logging.debug(d)
+            logging.debug(" ")
+            depth += 1
+    def dominance_pruning(self, subgroups, pruned_sgs, task, method="between_groups"):
+        """
+        Parameters
+        ----------
+        subgroups : list
+                list of Description objects to try to generalize
+        pruned_sgs : list
+                the generalized subgroups (description objects) are inserted in this list
+        task : SubgroupDiscoveryTask
+        method : string, method of evaluation of the subgroups, can be "between_groups" or "to_overall", as defined in fairlearn.metrics
+        """
+        for desc in subgroups:
+            Descriptors = desc.get_Descriptors()
+            generalizable = False
+            for i in range(len(Descriptors)):
+                # creation of a generalized description by excluding the i-th descriptor
+                new_sel_list = []
+                for j in range(len(Descriptors)):
+                    if i != j:
+                        new_sel_list.append(Descriptors[j])
+                new_des = Description(new_sel_list)
+                sg_belonging_feature = new_des.to_boolean_array(
+                    task.data, set_attributes=True
+                )
+                if task.there_is_y_pred:
+                    quality = task.qf(
+                        y_true=task.data["y_true"],
+                        method=method,
+                        y_pred=task.data["y_pred"],
+                        sensitive_features=sg_belonging_feature,
+                    )
+                else:
+                    quality = task.qf(
+                        y_true=task.data["y_true"],
+                        method=method,
+                        sensitive_features=sg_belonging_feature,
+                    )
+                if quality >= desc.get_quality():
+                    generalizable = True
+                    if new_des.is_present_in(pruned_sgs):
+                        continue
+                    new_des.set_quality(quality)
+                    pruned_sgs.append(new_des)
+            if not generalizable:
+                pruned_sgs.append(desc)
+    def beam_creation(
+        self,
+        tuples_sg_matrix,
+        support_array,
+        quality_array,
+        decriptions_list,
+        beam,
+        beam_width,
+    ):
+        """
+        Parameters
+        ----------
+        tuples_sg_matrix : numpy array
+            this is a boolean matrix. The rows are candidate subgroups and the columns are the instances of the dataset
+        support_array :  numpy array
+            contains the support of each candidate subgroup
+        quality_array :  numpy array
+            contains the quality of each candidate subgroup
+        decriptions_list : list
+            list of Description objects of the candidate subgroups
+        beam : list
+            list of Description objects. This parameter is empty at the beginning and this method will fill it with the
+            selected subgroups
+        beam_width : int
+        Notes
+        -----
+        In the code there is a variable called a_tothe_c_array. This variable represents a vector of weights,
+        one for each tuple of the dataset. Every time a subgroup sg is selected (inserted into the beam),
+        the weights of all the tuples belonging to sg are updated: they are decreased by
+        multiplying them by a.
+        At each round of the for loop, the subgroup with the highest product between its quality and its weight is
+        selected. The weight of a subgroup is obtained by averaging the weights of all the tuples it contains.
+        """
+        if len(decriptions_list) <= beam_width:
+            for i in range(len(decriptions_list)):
+                descr = decriptions_list[i]
+                descr.set_quality(quality_array[i])
+                # insertion of the description in the beam
+                beam.append(descr)
+            return
+        # sort so that, in case of equal quality, groups with higher support are preferred
+        sorted_index = np.argsort(support_array)[::-1]
+        support_array = support_array[sorted_index]
+        quality_array = quality_array[sorted_index]
+        tuples_sg_matrix = tuples_sg_matrix[sorted_index]
+        decriptions_list.sort(key=lambda x: x.support, reverse=True)
+        # selection of the sg with highest quality
+        index_of_max = np.argmax(quality_array)
+        descr = decriptions_list[index_of_max]
+        descr.set_quality(quality_array[index_of_max])
+        beam.append(descr)
+        quality_array[
+            index_of_max
+        ] = 0  # the quality of the selected sg is set to 0 in the quality_array,
+        # in this way this subgroup will never be chosen again
+        a_tothe_c_array = np.ones(tuples_sg_matrix.shape[1])
+        num_iterations = min(beam_width, support_array.size)
+        for i in range(1, num_iterations):
+            # a_tothe_c updating
+            best_sg_arr = tuples_sg_matrix[index_of_max]
+            best_sg_arr = 1 - self.a * best_sg_arr
+            #  updating
+            a_tothe_c_array = np.multiply(a_tothe_c_array, best_sg_arr)
+            # weight creation
+            alpha_matrix = np.multiply(a_tothe_c_array, tuples_sg_matrix)
+            weights = np.divide(np.sum(alpha_matrix, axis=1), support_array)
+            # selection of the sg with highest quality
+            weighted_quality_array = np.multiply(quality_array, weights)
+            index_of_max = np.argmax(weighted_quality_array)
+            descr = decriptions_list[index_of_max]
+            descr.set_quality(quality_array[index_of_max])
+            # insertion of the description in the beam
+            beam.append(descr)
+            quality_array[index_of_max] = 0
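
Reviewer sketch (not part of the commit): the selection loop above can be reproduced in a few lines of NumPy. This is a minimal illustration under stated assumptions. The function name `select_diverse` and the parameter name `a` are hypothetical, qualities are assumed non-negative, and every subgroup is assumed to cover at least one row. It follows the code shown above, where rows covered by the selected subgroup have their weight multiplied by `1 - a`.

```python
import numpy as np


def select_diverse(quality, membership, beam_width, a=0.9):
    """Pick up to beam_width subgroups, down-weighting rows already covered.

    quality    : 1-D array with one (non-negative) score per subgroup.
    membership : 2-D 0/1 array of shape (n_subgroups, n_rows).
    a          : hypothetical name for the decay parameter.
    """
    quality = np.asarray(quality, dtype=float).copy()
    membership = np.asarray(membership, dtype=float)
    support = membership.sum(axis=1)
    row_weight = np.ones(membership.shape[1])
    chosen = []
    for _ in range(min(beam_width, quality.size)):
        # weight of a subgroup = mean weight of the rows it contains
        sg_weight = (membership * row_weight).sum(axis=1) / support
        best = int(np.argmax(quality * sg_weight))
        chosen.append(best)
        quality[best] = 0.0  # a selected subgroup is never picked again
        # rows covered by the selected subgroup lose weight
        row_weight = row_weight * (1 - a * membership[best])
    return chosen


quality = [0.9, 0.8, 0.5]
membership = [
    [1, 1, 0, 0],  # subgroup 0 covers rows 0 and 1
    [1, 1, 0, 0],  # subgroup 1 covers the same rows
    [0, 0, 1, 1],  # subgroup 2 covers rows 2 and 3
]
picked = select_diverse(quality, membership, beam_width=2, a=0.9)
```

With a large `a`, the redundant subgroup 1 is skipped in favour of subgroup 2 even though its raw quality is higher. With `a=0` the selection degenerates to a plain top-k by quality.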

fairsd/fairsd/discretization.py ADDED Viewed

	@@ -0,0 +1,355 @@

+import pandas as pd
+import math
+import numpy as np
+class MDLP:
+    """
+    This class finds the cut points to discretize a numerical feature using the Fayyad and Irani approach (MDL principle).
+    The constructor allows specifying the minimum possible size for a group (min_groupsize).
+    This parameter is interpreted as a hard constraint: only cut points that produce buckets of at least
+    min_groupsize elements will be returned.
+    Another parameter that can be set is "force". If this parameter is set to True, the findCutPoints
+    method will return at least one cut point, unless not even one cut point can be found that produces two subgroups
+    of at least min_groupsize elements with a class partition entropy lower than the entropy of the entire set.
+    """
+    def __init__(self, min_groupsize=1, force=False):
+        """
+        :param min_groupsize: int
+        :param force: boolean
+        """
+        self.min_groupsize = min_groupsize
+        self.force = force
+    def findCutPoints(self, x, y):
+        """
+        :param x: list, numpy array or pandas Series that contains the values of the numeric feature to discretize
+        :param y: list, numpy array or pandas Series that contains binary values (0 or 1) representing the class label
+        :return: list of (ascending) ordered cut points
+            if the function returns the cut points [c1, c2, ..., cn], with c1<c2<...<cn, the feature can be discretized
+            by creating n+1 buckets: (-infinite, c1], (c1, c2], ..., (cn-1, cn], (cn, +infinite)
+        """
+        df = pd.DataFrame({"x": x, "y": y})
+        total_size = df.shape[0]
+        df = df.groupby("x")["y"].agg(["sum", "count"])
+        df.reset_index(inplace=True)
+        total_sum = df["sum"].sum()
+        df["prop"] = df["sum"] / df["count"]
+        cut_points = self.find_partitions(df, total_size, total_sum, self.force)
+        cut_points.sort()
+        return cut_points
+    def find_partitions(self, df, total_size, total_sum, force=False):
+        """
+        This is a private class function. It works in a recursive manner.
+        Parameters
+        ----------
+        df: pandas.DataFrame
+            this dataframe contains, for each distinct value of the feature to discretize, the number of positive
+            instances (called sum), the total number of instances (count) and the proportion
+            positive_instances/total_num_of_instances (prop). See the findCutPoints method for details.
+        total_size: int
+            represents the total number of instances of the x feature in the current partition.
+        total_sum: int
+            represents the total number of positive instances of the x feature in the current partition.
+        force: boolean
+            force the method to return at least one cut point, except for the exceptional cases described in the class description.
+        Returns
+        -------
+        list of int:
+            list of cut points
+        """
+        sum = 0
+        count = 0
+        min_cpe = total_size  # Class Partition Entropy (CPE)
+        partition_index = 0
+        partition_x = 0
+        partition_sum = 0
+        partition_count = 0
+        # find best candidate cut point
+        for i in range(0, df.shape[0] - 1):
+            loc = df.iloc[i]
+            sum += loc["sum"]
+            count += loc["count"]
+            if (
+                loc["prop"] == df.iloc[i + 1]["prop"]
+                or count < self.min_groupsize
+                or (total_size - count) < self.min_groupsize
+            ):
+                continue
+            # calculate the CPE of the candidate cut point
+            pc1s0 = sum / count  # probability of class 1 in subgroup 0
+            pc0s0 = 1 - pc1s0
+            pc1s0_ = pc1s0 if pc1s0 != 0 else 1
+            pc0s0_ = pc0s0 if pc0s0 != 0 else 1
+            entS0 = -(pc1s0 * math.log2(pc1s0_) + pc0s0 * math.log2(pc0s0_))
+            pc1s1 = (total_sum - sum) / (total_size - count)
+            pc0s1 = 1 - pc1s1
+            pc1s1_ = pc1s1 if pc1s1 != 0 else 1
+            pc0s1_ = pc0s1 if pc0s1 != 0 else 1
+            entS1 = -(pc1s1 * math.log2(pc1s1_) + pc0s1 * math.log2(pc0s1_))
+            cpe = (count / total_size) * entS0 + (
+                (total_size - count) / total_size
+            ) * entS1
+            if cpe < min_cpe:
+                min_cpe = cpe
+                partition_index = i
+                partition_x = loc["x"]
+                partition_sum = sum
+                partition_count = count
+                partition_entS0 = entS0
+                partition_entS1 = entS1
+        if min_cpe == total_size:
+            return []
+        # test MDLP condition
+        pc1 = total_sum / total_size
+        pc0 = 1 - pc1
+        pc1_ = pc1 if pc1 != 0 else 1
+        pc0_ = pc0 if pc0 != 0 else 1
+        entS = -(pc1 * math.log2(pc1_) + pc0 * math.log2(pc0_))
+        gain = entS - min_cpe
+        remained_pos = total_sum - partition_sum
+        total_rem = total_size - partition_count
+        c0 = 2 if (partition_sum == 0 or partition_sum == partition_count) else 1
+        c1 = 2 if (remained_pos == 0 or remained_pos == total_rem) else 1
+        delta = math.log2(7) - (2 * entS - c0 * partition_entS0 - c1 * partition_entS1)
+        if gain <= ((math.log2(total_size - 1) + delta) / total_size):
+            if force:
+                return [partition_x]
+            return []
+        # recursive splitting
+        left_partitions = self.find_partitions(
+            df.iloc[: (partition_index + 1)], partition_count, partition_sum
+        )
+        right_partitions = self.find_partitions(
+            df.iloc[(partition_index + 1) :],
+            (total_size - partition_count),
+            (total_sum - partition_sum),
+        )
+        a = [partition_x] + left_partitions + right_partitions
+        return a
+class EqualFrequency:
+    """
+    This class finds the cut points to discretize a numerical feature using an approximate equal frequency discretization.
+    """
+    def __init__(self, min_bin_size=1, num_bins=0):
+        """
+        Parameters
+        ----------
+        num_bins : int
+            this number is to be interpreted as the maximum number of bins that will be generated.
+            If this parameter is not specified (0 by default), it will be automatically determined
+            based on the min_bin_size parameter.
+        min_bin_size : int
+            Represents the minimum size that a bin can have.
+        Notes
+        -----
+        If the number of bins has to be automatically determined, this will be chosen so that each bin has an average
+        size of 1.2 * min_bin_size. Let's call this number automatic_nbin
+        ( automatic_nbin = int(x.size/(self.min_group_size*1.2)) )
+        If instead the bin number is specified in the constructor (num_bins parameter), the number of bins that will
+        actually be generated will be, at most, equal to the minimum between num_bins and automatic_nbin. This is
+        to ensure that the constraint given by parameter min_bin_size is respected.
+        """
+        self.min_group_size = min_bin_size
+        self.num_bins = num_bins
+    def findCutPoints(self, x):
+        """
+        :param x: numpy array or pandas series
+        :return: list of (ascending) ordered cut points
+        """
+        # determination of the number of bins
+        if self.num_bins > 1:
+            num_bins = min(self.num_bins, int(x.size / (self.min_group_size * 1.2)))
+        else:
+            num_bins = int(x.size / (self.min_group_size * 1.2))
+        if num_bins < 2:
+            return []
+        avg_group_size = x.size / num_bins
+        if isinstance(x, pd.Series):
+            x = x.to_numpy()
+        val, counts = np.unique(x, return_counts=True)
+        quantiles = []  # actually this array will contain quantiles * x.size
+        sum = 0
+        for c in counts:
+            sum = sum + c
+            quantiles.append(sum)
+        up_index = 0
+        low_index = 0
+        current_quantile = avg_group_size  # again, is quantile * x.size
+        cut_indexes = []
+        # for each expected quantile, find its approximation
+        while current_quantile < (sum - avg_group_size / 2):  # sum is equal to x.size
+            up_index, low_index = self.findApproximationIndexex(
+                up_index, low_index, current_quantile, quantiles
+            )
+            """
+            # this commented code uses the cut point that best approximates the expected quantile
+            num_up = quantiles[up_index] - current_quantile
+            num_low = current_quantile - quantiles[low_index]
+            if num_up < num_low:
+                if up_index not in cut_indexes:
+                    cut_indexes.append(up_index)
+            else:
+                if low_index not in cut_indexes:
+                    cut_indexes.append(low_index)
+            """
+            ### here we always choose an approximation by excess of the expected quantile.
+            # This consistency can help create more similarly sized bins
+            if up_index not in cut_indexes:
+                cut_indexes.append(up_index)
+            current_quantile += avg_group_size
+            ###
+        cut_points = []
+        bins_size = []
+        last_quantile = 0
+        for i in cut_indexes:
+            cut_points.append(val[i])
+            bins_size.append(quantiles[i] - last_quantile)
+            last_quantile = quantiles[i]
+        bins_size.append(sum - last_quantile)
+        # The bins with size < min_group_size will be merged with one of the adjacent bins
+        self.mergeSmallBins(cut_points, bins_size)
+        return cut_points
+    def findApproximationIndexex(
+        self, up_index, low_index, current_quantile, quantiles
+    ):
+        """
+        :param up_index: int
+        :param low_index: int
+        :param current_quantile: float
+        :param quantiles: list of float
+        :return: (int, int)
+        """
+        new_up_i = up_index
+        while new_up_i < len(quantiles):
+            if quantiles[new_up_i] < current_quantile:
+                new_up_i += 1
+            else:
+                break
+        new_low_i = low_index
+        if new_up_i > up_index:
+            new_low_i = new_up_i - 1
+        return new_up_i, new_low_i
+    def mergeSmallBins(self, cut_points, bins_size):
+        """
+        :param cut_points: list
+        :param bins_size: list
+        :return: void
+        """
+        # find the smallest bin
+        min_smallbin = self.min_group_size
+        min_index = 0
+        for i in range(len(bins_size)):
+            if bins_size[i] < min_smallbin:
+                min_smallbin = bins_size[i]
+                min_index = i
+        # check whether the smallest bin found has a size lower than min_group_size
+        if min_smallbin == self.min_group_size:
+            return
+        previous_size = 0
+        next_size = 0
+        if min_index > 0:
+            previous_size = bins_size[min_index - 1]
+        if min_index < (len(bins_size) - 1):
+            next_size = bins_size[min_index + 1]
+        if previous_size == 0 and next_size == 0:
+            return
+        if (previous_size == 0) or (next_size > 0 and next_size < previous_size):
+            bins_size[min_index + 1] = bins_size[min_index + 1] + bins_size[min_index]
+            cut_points.pop(min_index)
+            bins_size.pop(min_index)
+        else:
+            bins_size[min_index - 1] = bins_size[min_index - 1] + bins_size[min_index]
+            cut_points.pop(min_index - 1)
+            bins_size.pop(min_index)
+        self.mergeSmallBins(cut_points, bins_size)
+class EqualWidth:
+    """
+    This class finds the cut points to discretize a numerical feature using the equal width discretization.
+    """
+    def __init__(self, min_bins_size=1, num_bins=0):
+        """
+        Parameters
+        ----------
+        num_bins : int
+            this number is to be interpreted as the maximum number of bins that will be generated.
+            If this parameter is not specified (0 by default), it will be automatically determined.
+        min_bins_size : int
+        Notes
+        -----
+        min_bins_size and num_bins parameters are to be interpreted as for the EqualFrequency class
+        """
+        self.min_group_size = min_bins_size
+        self.num_bins = num_bins
+    def findCutPoints(self, x):
+        """
+        :param x: numpy array or pandas series
+        :return: list of (ascending) ordered cut points
+        """
+        if self.num_bins > 1:
+            num_bins = min(self.num_bins, int(x.size / (self.min_group_size * 1.2)))
+        else:
+            num_bins = int(x.size / (self.min_group_size * 1.2))
+        if num_bins < 2:
+            return []
+        if isinstance(x, pd.Series):
+            x = x.to_numpy()
+        minim = x.min()
+        bin_width = (x.max() - minim) / num_bins
+        cut_points = []
+        current_cut = minim + bin_width
+        for i in range(1, num_bins):
+            cut_points.append(current_cut)
+            current_cut = current_cut + bin_width
+        return cut_points
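
Reviewer sketch (not part of the commit): the acceptance test that `MDLP.find_partitions` applies to a candidate cut point can be written compactly for a binary target. The code below follows the criterion as published by Fayyad and Irani, where a split is accepted only if its information gain exceeds `(log2(N - 1) + delta) / N`. It is an illustration and does not claim to match every detail of the implementation above, in particular how the class counts are derived. The function names are hypothetical.

```python
import math


def binary_entropy(pos, n):
    """Entropy in bits of a set with `pos` positive rows out of `n`."""
    if n == 0 or pos == 0 or pos == n:
        return 0.0
    p = pos / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def n_classes(pos, n):
    """Number of distinct classes present (1 for a pure set, else 2)."""
    return int(pos > 0) + int(pos < n)


def mdlp_accepts(pos_left, n_left, pos_right, n_right):
    """Return True if the split passes the MDL acceptance test.

    Assumes both sides of the split are non-empty.
    """
    n = n_left + n_right
    pos = pos_left + pos_right
    ent = binary_entropy(pos, n)
    ent_l = binary_entropy(pos_left, n_left)
    ent_r = binary_entropy(pos_right, n_right)
    # information gain of the split
    gain = ent - (n_left / n) * ent_l - (n_right / n) * ent_r
    k, k_l, k_r = n_classes(pos, n), n_classes(pos_left, n_left), n_classes(pos_right, n_right)
    delta = math.log2(3 ** k - 2) - (k * ent - k_l * ent_l - k_r * ent_r)
    return gain > (math.log2(n - 1) + delta) / n


clean_split = mdlp_accepts(pos_left=0, n_left=50, pos_right=50, n_right=50)
useless_split = mdlp_accepts(pos_left=25, n_left=50, pos_right=25, n_right=50)
```

A split that separates the classes perfectly is accepted, while a split that leaves both halves with the same class mix has zero gain and is rejected.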

fairsd/fairsd/qualitymeasures.py ADDED Viewed

	@@ -0,0 +1,36 @@

+from abc import ABC, abstractmethod
+class QualityFunction(ABC):
+    """Abstract class.
+    If the user wants to create a customized quality function, it is recommended to extend this class
+    """
+    @abstractmethod
+    def evaluate(self, y_true, y_pred, sensitive_features):
+        """Evaluate the quality of a description.
+        Parameters
+        ----------
+        y_true : list, pandas.series or numpy array
+        y_pred : list, pandas.series or numpy array
+        sensitive_features : list, pandas.series or numpy array
+        Returns
+        -------
+        double
+            Real number indicating the calculated quality.
+        """
+        pass
+class EqualOpportunityDiff(QualityFunction):
+    def evaluate(self, y_true, y_pred, sensitive_features):
+        s0y_true = (y_true & ~sensitive_features).sum()
+        s0y_true = 1 if s0y_true == 0 else s0y_true
+        p_s0 = (y_true & y_pred & ~sensitive_features).sum() / s0y_true
+        s1y_true = (y_true & sensitive_features).sum()
+        s1y_true = 1 if s1y_true == 0 else s1y_true
+        p_s1 = (y_true & y_pred & sensitive_features).sum() / s1y_true
+        return p_s0 - p_s1
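
Reviewer sketch (not part of the commit): `EqualOpportunityDiff.evaluate` compares the true positive rate outside the subgroup with the true positive rate inside it. The standalone version below shows the same arithmetic on a toy example. It assumes all three inputs are boolean NumPy arrays, and the helper names are hypothetical.

```python
import numpy as np


def true_positive_rate(y_true, y_pred, mask):
    """TPR restricted to the rows selected by the boolean `mask`."""
    positives = (y_true & mask).sum()
    if positives == 0:
        # same convention as the class above: no positives gives a rate of 0
        return 0.0
    return (y_true & y_pred & mask).sum() / positives


def equal_opportunity_difference(y_true, y_pred, in_group):
    """TPR outside the subgroup minus TPR inside the subgroup."""
    return true_positive_rate(y_true, y_pred, ~in_group) - true_positive_rate(
        y_true, y_pred, in_group
    )


y_true = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
y_pred = np.array([1, 0, 1, 1, 0, 1], dtype=bool)
in_group = np.array([1, 1, 0, 0, 1, 0], dtype=bool)

diff = equal_opportunity_difference(y_true, y_pred, in_group)
```

Here the subgroup has two true positives of which one is predicted correctly (TPR 0.5), while the rest of the data has a TPR of 1.0. A positive result therefore means the model misses more positives inside the subgroup.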

fairsd/fairsd/searchspace.py ADDED Viewed

	@@ -0,0 +1,208 @@

+from .discretization import MDLP, EqualFrequency, EqualWidth
+from .sgdescription import Description
+from .sgdescription import Descriptor
+class SearchSpace:
+    """Will contain the set of all the descriptors to take into consideration for creating a Description.
+    Attributes
+    ----------
+    nominal_Descriptors : list of Descriptor(s)
+        will contain all possible non numeric Descriptors.
+    numeric_Descriptors : list of Descriptor(s)
+        will contain all possible  numeric Descriptors.
+    """
+    def __init__(
+        self,
+        dataset,
+        ignore=None,
+        nominal_features=None,
+        numeric_features=None,
+        dynamic_discretization=False,
+        discretizer=None,
+        sensitiive_features=None,
+    ):
+        """
+        :param dataset : pandas.DataFrame
+        :param ignore : list of String(s)
+            list the attributes to not take into consideration for creating the search space.
+        :param nominal_features : list of Strings naming a subset of the nominal features
+        :param numeric_features : list of Strings naming a subset of the numeric features
+        :param dynamic_discretization: boolean
+            if dynamic_discretization is true the numerical features will be discretized in the extract_search_space()
+            method, otherwise the numerical features will be discretized here, during the initialization.
+        :param discretizer: Discretizer object
+        :param sensitiive_features: List of str
+        Notes
+        -----
+        The type of features not present in numeric_features and in nominal_features will be deduced based on the attributes
+        """
+        self.sensitiive_features = sensitiive_features
+        if ignore is None:
+            ignore = []
+        if nominal_features is None:
+            nominal_features = []
+        if numeric_features is None:
+            numeric_features = []
+        self.nominal_Descriptors = []
+        self.numeric_Descriptors = []
+        # create nominal Descriptors
+        # nominal features with explicit type passed
+        for col in nominal_features:
+            if col not in ignore:
+                values = dataset[col].unique()
+                for x in values:
+                    self.nominal_Descriptors.append(Descriptor(col, attribute_value=x))
+        # nominal features without explicit type passed
+        dtypes_subs = dataset.select_dtypes(exclude=["number"])
+        for col in dtypes_subs.columns:
+            if col not in ignore + nominal_features + numeric_features:
+                values = dataset[col].unique()
+                for x in values:
+                    self.nominal_Descriptors.append(Descriptor(col, attribute_value=x))
+        # numerical Descriptors
+        if dynamic_discretization:
+            for col in numeric_features:
+                if col not in ignore:
+                    self.numeric_Descriptors.append(
+                        Descriptor(col, to_discretize=True, is_numeric=True)
+                    )
+            dtypes_subs = dataset.select_dtypes(include=["number"])
+            for col in dtypes_subs.columns:
+                if col not in ignore + nominal_features + numeric_features:
+                    self.numeric_Descriptors.append(
+                        Descriptor(col, to_discretize=True, is_numeric=True)
+                    )
+        else:
+            for col in numeric_features:
+                if col not in ignore:
+                    self.numeric_Descriptors.extend(
+                        discretizer.discretize(dataset, Description(), col)
+                    )
+            dtypes_subs = dataset.select_dtypes(include=["number"])
+            for col in dtypes_subs.columns:
+                if col not in ignore + nominal_features + numeric_features:
+                    self.numeric_Descriptors.extend(
+                        discretizer.discretize(dataset, Description(), col)
+                    )
+    def extract_search_space(self, dataset, discretizer, current_description=None):
+        """This method return the subset of the search space to explore
+         for expanding the description "current_description".
+        All descriptors containing attributes present in the current_description will be removed
+        from the returned search space subset.
+        Parameters
+        ----------
+        dataset : pandas.Dataframe
+        discretizer : Discretizer
+        current_description : Description
+        Returns
+        -------
+        list of Descriptors
+            the subset of the search space to explore
+        Notes
+        -----
+        "Descriptors"  list will contain the subset of the search space to return. All the Descriptors not yet
+        discretized in the original search space, will be discretized before being inserted in the "Descriptors" list.
+        """
+        if current_description is None:
+            current_description = Description()
+        to_exclude = current_description.get_attributes()
+        if len(to_exclude) == 0 and self.sensitiive_features is not None:
+            to_exclude = [
+                i for i in list(dataset.columns) if i not in self.sensitiive_features
+            ]
+        Descriptors = []
+        for Descriptor in self.nominal_Descriptors:
+            if Descriptor.attribute_name not in to_exclude:
+                Descriptors.append(Descriptor)
+        for Descriptor in self.numeric_Descriptors:
+            if Descriptor.attribute_name not in to_exclude:
+                if Descriptor.is_to_discretize():
+                    Descriptors.extend(
+                        discretizer.discretize(
+                            dataset,
+                            current_description,
+                            Descriptor.get_attribute_name(),
+                        )
+                    )
+                else:
+                    Descriptors.append(Descriptor)
+        return Descriptors
+class Discretizer:
+    """Class for the discretization of the numeric attributes."""
+    def __init__(self, discretization_type, target=None, min_groupsize=1, num_bins=6):
+        """
+        :param discretization_type : enumerated
+            can be "mdlp" or "equalfreq" or "equalwidth"
+        :param target: String, optional
+            this parameter is needed only for supervised discretizations (mdlp)
+        :param min_groupsize: int, optional
+            discretize() method will create only subgroups with size >= min_groupsize.
+        """
+        self.discretization_type = discretization_type
+        if discretization_type == "mdlp":
+            self.supervised = True
+            self.discretizer = MDLP(min_groupsize, force=True)
+            self.target = target
+        elif discretization_type == "equalfreq":
+            self.supervised = False
+            self.discretizer = EqualFrequency(min_groupsize, num_bins)
+        elif discretization_type == "equalwidth":
+            self.supervised = False
+            self.discretizer = EqualWidth(min_groupsize, num_bins)
+        else:
+            raise RuntimeError(
+                'discretization_type must be "mdlp" OR "equalfreq" OR "equalwidth"'
+            )
+    def discretize(self, data, description, feature):  #### to test
+        """
+        Parameters
+        ----------
+        data: pandas.DataFrame
+            The dataset.
+        description: Description
+            The discretization will be based only on those tuples of the dataset that match the description.
+        feature: String
+            Is the name of the numeric attribute of the dataset to discretize.
+        Returns
+        -------
+        list of Descriptor(s)
+            Will be created and returned (in a list) one Descriptor for each bin created in the discretization phase.
+        """
+        subset = data[description.to_boolean_array(data)]
+        x = subset[feature]
+        if self.supervised:
+            y = subset[self.target]
+            cut_points = self.discretizer.findCutPoints(x, y)
+        else:
+            cut_points = self.discretizer.findCutPoints(x)
+        Descriptors = []
+        if len(cut_points) < 1:
+            return Descriptors
+        Descriptors.append(
+            Descriptor(feature, low_bound=None, up_bound=None, is_numeric=True)
+        )
+        for cp in cut_points:
+            Descriptors[-1].up_bound = cp
+            Descriptors.append(
+                Descriptor(feature, low_bound=cp, up_bound=None, is_numeric=True)
+            )
+        return Descriptors
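
Reviewer sketch (not part of the commit): `Discretizer.discretize` turns `n` cut points into `n + 1` numeric descriptors whose bounds are half-open intervals `(low, up]`, with `None` standing for an unbounded side. The pure-Python sketch below shows that mapping with plain tuples instead of `Descriptor` objects. Both function names are hypothetical.

```python
def cut_points_to_bounds(cut_points):
    """Map sorted cut points to (low_bound, up_bound) pairs.

    None means "unbounded" on that side, mirroring the Descriptor defaults.
    """
    cut_points = list(cut_points)
    if not cut_points:
        return []
    bounds = [(None, cut_points[0])]
    for low, up in zip(cut_points, cut_points[1:] + [None]):
        bounds.append((low, up))
    return bounds


def in_bin(value, low, up):
    """Membership test for the half-open interval (low, up]."""
    if low is not None and not value > low:
        return False
    if up is not None and not value <= up:
        return False
    return True


bounds = cut_points_to_bounds([10, 20])
```

A value equal to a cut point belongs to the bin on its left, which is the same convention `Description.to_boolean_array` uses (`>` for the lower bound, `<=` for the upper bound).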

fairsd/fairsd/sgdescription.py ADDED Viewed

	@@ -0,0 +1,252 @@

+import numpy as np
+import pandas as pd
+# The following class is no longer used in the current version
+# class BinaryTarget:
+#    """Contains the target for the subgroup discovery task.
+#
+#    The target can be boolean or Nominal. Only the value contained in the parameter "target_value" will be considered as
+#    the true value. In this way also if the target is nominal, it will still be treated as a boolean.
+#    """
+#
+#    def __init__(self, y_true, y_pred=None, target_value=False):
+#        """
+#        Parameters
+#        ----------
+#        y_true : string
+#            Contains the label of the target.
+#        dataset: pandas.DataFrame
+#            The dataset is required because it will be checked if the others parameters are coherent
+#            (present inside the dataset).
+#        y_pred: String, optional
+#            Contains the label of the predicted attribute.
+#        target_value: bool or String, optional
+#       """
+#        self.y_true = y_true
+#        self.y_pred = y_pred
+#        self.target_value = target_value
+class Descriptor:
+    """
+    This object is formed by an attribute name and an attribute value (or a lower bound plus an upper bound
+    if the Descriptor is numeric).
+    """
+    def __init__(
+        self,
+        attribute_name,
+        attribute_value=None,
+        up_bound=None,
+        low_bound=None,
+        to_discretize=False,
+        is_numeric=False,
+    ):
+        """
+        Parameters
+        ----------
+        attribute_name : string
+        attribute_value : string or bool, default None
+            To set only if the Descriptor is not numeric.
+        up_bound : double or int, default None
+            To set iff the Descriptor is numeric and already discretized.
+        low_bound : double or int, default None
+            To set iff the Descriptor is numeric and already discretized.
+        to_discretize : bool, default False
+            To set to True iff the Descriptor is numeric and not yet discretized. In this case
+            the up_bound and low_bound attributes will be meaningless.
+        is_numeric : bool, default False
+            To set to True iff the descriptor is numeric.
+        """
+        self.attribute_name = attribute_name
+        self.is_numeric = is_numeric
+        if is_numeric:
+            self.up_bound = up_bound
+            self.low_bound = low_bound
+        else:
+            self.attribute_value = attribute_value
+        self.to_discretize = to_discretize  # The current implementation could work also without this parameter
+    def get_attribute_name(self):
+        return self.attribute_name
+    def is_to_discretize(self):
+        return self.to_discretize
+    def is_present_in(self, other_descriptors):
+        """
+        :param: other_descriptors: list of Descriptors
+        :return: bool
+        """
+        for other in other_descriptors:
+            if (
+                self.attribute_name == other.attribute_name
+                and self.is_numeric == other.is_numeric
+            ):
+                if (
+                    self.is_numeric
+                    and self.up_bound == other.up_bound
+                    and self.low_bound == other.low_bound
+                ):
+                    return True
+                elif (
+                    self.is_numeric == False
+                    and self.attribute_value == other.attribute_value
+                ):
+                    return True
+        return False
+class Description:
+    """List of Descriptors plus other description attributes.
+    Semantically it is to be interpreted as the conjunction of all the Descriptors contained in the list:
+    a dataset record will match the description if each single Descriptor of the description will match with this record.
+    """
+    def __init__(self, descriptors: list = None):
+        """
+        :param descriptors : list of Descriptor
+        """
+        if descriptors is None:
+            self.descriptors = []
+        else:
+            self.descriptors = descriptors
+        self.support = None
+    def __repr__(self, opposite=False):
+        """Represent the description as a string.
+        :return : String
+        """
+        descr = ""
+        if opposite:
+            descr = descr + "NOT ("
+        for s in self.descriptors:
+            # If the descriptor is numeric, the string will be formed by the attribute name and the lower and upper bound
+            if s.is_numeric:
+                low = str(s.low_bound) if s.low_bound is not None else "-infinite"
+                up = str(s.up_bound) if s.up_bound is not None else "+infinite"
+                descr = descr + s.attribute_name + " = (" + low + ", " + up + "] AND "
+            # If the descriptor is not numeric, the string will be formed by the attribute name and the attribute value
+            else:
+                descr = (
+                    descr
+                    + s.attribute_name
+                    + ' = "'
+                    + str(s.attribute_value)
+                    + '" AND '
+                )
+        if descr != "":
+            descr = descr[:-4]
+        if opposite:
+            descr = descr + ")"
+        return descr
+    def to_string(self, opposite=False):
+        return self.__repr__(opposite=opposite)
+    def __lt__(self, other):
+        """Compare the current description (self) with another description (other).
+        :param other: Description
+        :return: bool
+        """
+        if self.quality != other.quality:
+            return self.quality < other.quality
+        elif self.support != other.support:
+            return self.support < other.support
+        else:
+            return len(self.descriptors) > len(other.descriptors)
+    def to_boolean_array(self, dataset, set_attributes=False):
+        """
+        Parameters
+        ----------
+        dataset : pandas.DataFrame
+        set_attributes : bool, default False
+            if this input is True, this method will also set the support attribute
+        Returns
+        -------
+        pandas Series of boolean type:
+            The array will have the length of the passed dataset (number of rows).
+            Each element of the array will be true iff the description (self) matches the corresponding row of the dataset.
+            If a description is empty, the returned array will have all elements equal to True.
+        """
+        s = np.full(dataset.shape[0], True)
+        s = pd.Series(s)
+        for i in range(0, len(self.descriptors)):
+            if self.descriptors[i].is_numeric:
+                if self.descriptors[i].low_bound is not None:
+                    s = s & (
+                        dataset[self.descriptors[i].attribute_name]
+                        > self.descriptors[i].low_bound
+                    )
+                if self.descriptors[i].up_bound is not None:
+                    s = s & (
+                        dataset[self.descriptors[i].attribute_name]
+                        <= self.descriptors[i].up_bound
+                    )
+            else:
+                s = (s) & (
+                    dataset[self.descriptors[i].attribute_name]
+                    == self.descriptors[i].attribute_value
+                )
+        if set_attributes:
+            # set size, relative size and target share
+            self.support = sum(s)
+        return s
+    def size(self, dataset=None):
+        """Return the support of the description, computing and storing it first if it was not set.
+        :param dataset: pandas.DataFrame
+        :return: int
+        """
+        if self.support is None:
+            if dataset is None:
+                raise RuntimeError("dataset argument required in Description.size()")
+            self.to_boolean_array(dataset, set_attributes=True)
+        return self.support
+    def get_attributes(self):
+        """Return the list of the attribute names in the description.
+        :return: list of String
+        """
+        attributes = []
+        for sel in self.descriptors:
+            attributes.append(sel.get_attribute_name())
+        return attributes
+    def get_Descriptors(self):
+        return self.descriptors
+    def is_present_in(self, beam):
+        """
+        :param beam : list of Description objects
+        :return: bool
+            True if the current description (self) is present in the list (beam).
+            Return True iff the current object (Description) has the same parameters as at least one other object
+            present in the beam.
+        """
+        for descr in beam:
+            equals = True
+            for sel in self.descriptors:
+                if not sel.is_present_in(descr.descriptors):
+                    equals = False
+                    break
+            if equals:
+                return True
+        return False
+    def set_quality(self, q):
+        self.quality = q
+    def get_quality(self):
+        return self.quality
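
Note on the matching rule used by `to_boolean_array` above: a numeric descriptor keeps rows with `low_bound < value <= up_bound`, and a nominal descriptor keeps rows equal to `attribute_value`. The following is a minimal standalone sketch of that rule, not the fairsd code itself. The helper names `interval_mask` and `nominal_mask` and the toy columns are assumptions made for illustration. The sketch builds the mask on the frame's own index, which keeps `&` aligned even when the index is not 0..n-1.

```python
import pandas as pd

# Hypothetical toy data; column names and the non-default index are illustrative only.
df = pd.DataFrame(
    {"age": [18, 25, 30, 31], "sex": ["Male", "Female", "Male", "Male"]},
    index=[10, 11, 12, 13],
)


def interval_mask(frame, column, low=None, up=None):
    """Rows with low < value <= up; a bound of None means unbounded."""
    mask = pd.Series(True, index=frame.index)
    if low is not None:
        mask &= frame[column] > low
    if up is not None:
        mask &= frame[column] <= up
    return mask


def nominal_mask(frame, column, value):
    """Rows whose value in `column` equals `value`."""
    return frame[column] == value


# Description: sex = "Male" AND age = (18, 30]
mask = interval_mask(df, "age", low=18, up=30) & nominal_mask(df, "sex", "Male")
support = int(mask.sum())  # number of rows covered by the description
print(mask.tolist(), support)
```

Only the row with age 30 is covered: age 18 fails the strict lower bound and age 31 exceeds the inclusive upper bound.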

fairsd/main.py ADDED Viewed

	@@ -0,0 +1,32 @@

+from sklearn.datasets import fetch_openml
+import pandas as pd
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.metrics import accuracy_score
+import fairlearn.metrics as fm
+import fairsd as dsd
+# Load the dataset, train the classifier, and produce y_pred
+d = fetch_openml(data_id=1590, as_frame=True)
+dataset = d.data
+d_train = pd.get_dummies(dataset)
+y_true = (d.target == ">50K") * 1
+classifier = DecisionTreeClassifier(min_samples_leaf=10, max_depth=4)
+classifier.fit(d_train, y_true)
+y_pred = classifier.predict(d_train)
+print("ACCURACY OF THE CLASSIFIER:")
+print(accuracy_score(y_true, y_pred))
+print("-------------------------------------------" + "\n")
+dataset = dataset.head(1000)
+y_pred = y_pred[:1000]
+y_true = y_true[:1000]
+task = dsd.SubgroupDiscoveryTask(
+    dataset, y_true, y_pred, qf=fm.demographic_parity_difference
+)
+result_set = dsd.BeamSearch(beam_width=10).execute(task)
+print(result_set.to_dataframe())
+print(result_set.extract_sg_feature(0, dataset))
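
The script above passes `fm.demographic_parity_difference` as the quality function. As a rough illustration of what that quantity measures for one subgroup versus its complement (the absolute difference in selection rates), here is a hand-rolled sketch on toy data. The function names and the data are hypothetical, and this is not how fairsd or Fairlearn compute it internally.

```python
import numpy as np


def selection_rate(y_pred):
    """Fraction of positive predictions."""
    return float(np.mean(y_pred))


def parity_difference(y_pred, in_subgroup):
    """Absolute gap in selection rate between a subgroup and its complement."""
    y_pred = np.asarray(y_pred)
    in_subgroup = np.asarray(in_subgroup, dtype=bool)
    inside = selection_rate(y_pred[in_subgroup])
    outside = selection_rate(y_pred[~in_subgroup])
    return abs(inside - outside)


# Toy predictions and subgroup membership (illustrative values only).
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
member = [True, True, True, True, False, False, False, False]

quality = parity_difference(y_pred, member)
print(quality)
```

Here the subgroup has a selection rate of 0.75 and its complement 0.25, so the gap is 0.5. Swapping the subgroup with its complement gives the same value, which is why a description and its negation receive the same quality under this measure.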

fairsd/notebooks/fairsd_settings.ipynb ADDED Viewed

	@@ -0,0 +1,494 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "compliant-oregon",
+   "metadata": {},
+   "source": [
+    "# FairSD settings\n",
+    "This example uses the UCI Adult dataset, where the objective is to predict whether a person makes more (label 1) or less (label 0) than $50,000 a year."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "illegal-american",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from sklearn.datasets import fetch_openml\n",
+    "from sklearn.tree import DecisionTreeClassifier\n",
+    "#Import dataset\n",
+    "d = fetch_openml(data_id=1590, as_frame=True)\n",
+    "X = d.data\n",
+    "d_train=pd.get_dummies(X)\n",
+    "y_true = (d.target == '>50K') * 1\n",
+    "#training the classifier\n",
+    "classifier = DecisionTreeClassifier(min_samples_leaf=10, max_depth=4)\n",
+    "classifier.fit(d_train, y_true)\n",
+    "#Producing y_pred\n",
+    "y_pred = classifier.predict(d_train)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "private-origin",
+   "metadata": {},
+   "source": [
+    "## SubgroupDiscoveryTask Object\n",
+    "This class holds all the parameters used by the subgroup discovery algorithms.<br/>\n",
+    "**<u>Parameters</u>:**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "reverse-kinase",
+   "metadata": {},
+   "source": [
+    "|name |type |default value |description\n",
+    "|:-----|:-----|:-----|:----- \n",
+    "|X| pandas.DataFrame or <br/> numpy.ndarray| |dataset.\n",
+    "|y_true| pandas.DataFrame,<br/> pandas.Series or <br/> numpy.ndarray| |ground truth.\n",
+    "|y_pred| pandas.DataFrame,<br/> pandas.Series or <br/> numpy.ndarray| |predicted label.\n",
+    "|feature_names| list of strings| None| see below.\n",
+    "|nominal_features| list of strings| None| see below.\n",
+    "|numeric_features| list of strings| None| see below.\n",
+    "|qf | String or <br/>callable object| 'equalized_odds_difference'| quality function to use.\n",
+    "|discretizer| String| 'equalfreq'| see below.\n",
+    "|num_bins | int| 6|  see below.\n",
+    "|dynamic_discretization| bool| True| see below.\n",
+    "|result_set_size| int| 5| maximum number of subgroups <br/>that the subgroup discovery will return.\n",
+    "|depth| int| 3| maximum number of descriptors<br/> that a description can contain.\n",
+    "|min_quality| float| 0.1| minimum quality<br/> that a subgroup needs in order to be selected. \n",
+    "|min_support| int| 200| minimum size <br/>that a subgroup needs in order to be selected. \n",
+    "|sensitive_features| list| None|list of sensitive feature names (str). If this list is not None, the subgroup discovery task will return only subgroups containing at least one of the features in the list. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "geographic-limitation",
+   "metadata": {},
+   "source": [
+    "### X and feature_names\n",
+    "The X parameter can be a pandas DataFrame or a NumPy array. If we pass a NumPy array, we must also pass the feature_names parameter, a list with the column names of X:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "southwest-chemistry",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import fairsd as fsd\n",
+    "task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred) # X is a pandas.DataFrame\n",
+    "\n",
+    "#X as numpy array\n",
+    "x_np    = X.to_numpy()\n",
+    "columns = list(X.columns)\n",
+    "task=fsd.SubgroupDiscoveryTask(x_np, y_true, y_pred, feature_names = columns)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "about-storm",
+   "metadata": {},
+   "source": [
+    "### nominal_features and numeric_features\n",
+    "These two parameters (lists of strings) specify which columns of X have nominal values and which ones have numeric values. For attributes that do not appear in either of the two lists, the data type is inferred automatically.<br/>\n",
+    "**Example:**<br/>\n",
+    "    Let's analyze the education-num attribute"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "biological-liabilities",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16.]\n"
+     ]
+    }
+   ],
+   "source": [
+    "educationnum_val = X['education-num'].unique()\n",
+    "print(np.sort(educationnum_val))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "naval-leadership",
+   "metadata": {},
+   "source": [
+    "The values of this attribute are integers from 1 to 16. <br/>\n",
+    "The package would treat this attribute as numeric by default, but if we want to treat it as a nominal attribute we can use the nominal_features parameter:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "unauthorized-oriental",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, nominal_features = ['education-num'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "possible-wilderness",
+   "metadata": {},
+   "source": [
+    "### discretizer\n",
+    "This parameter determines the algorithm that discretizes the numerical features.\n",
+    "It is set to 'equalfreq' by default, but other options are possible:<br/>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "manufactured-algorithm",
+   "metadata": {},
+   "source": [
+    "|discretizer parameter |description \n",
+    "|:-----|:----- \n",
+    "|'equalfreq'|approximate equal frequency discretization.\n",
+    "|'equalwidth'|equal width discretization.\n",
+    "|'mdlp'|minimum description length principle discretization (Fayyad and Irani approach)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "blocked-morning",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#example\n",
+    "task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, discretizer='mdlp')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "developmental-orbit",
+   "metadata": {},
+   "source": [
+    "### num_bins\n",
+    "This parameter determines the maximum number of bins that the discretization of a numerical feature will produce. The number of bins produced can be lower than num_bins because of the min_support constraint:<br/>\n",
+    "* the equal-frequency and mdlp discretizations never produce bins smaller than min_support, since we know in advance that such bins could never be used to create a description with large enough support.<br/>\n",
+    "\n",
+    "If a value lower than two is used, the number of bins is decided automatically.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "focused-patrick",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#example\n",
+    "task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, discretizer='equalwidth', num_bins = 4)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "periodic-glass",
+   "metadata": {},
+   "source": [
+    "### dynamic_discretization\n",
+    "If set to False, each numerical feature is discretized only once, before the subgroup discovery algorithm starts.<br/>\n",
+    "If set to True, the discretization of numerical features is integrated into the subgroup discovery algorithm: it is performed each time the algorithm tries to expand a description with a numerical attribute."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "hispanic-boulder",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#example\n",
+    "task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, dynamic_discretization = False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "interstate-edinburgh",
+   "metadata": {},
+   "source": [
+    "## Quality Measures\n",
+    "### Fairlearn Quality measures\n",
+    "All the Fairlearn metrics can be used as quality measures (qf parameter of the SubgroupDiscoveryTask object). See the Fairlearn documentation [here](https://fairlearn.github.io/v0.6.0/api_reference/fairlearn.metrics.html).<br/>\n",
+    "**The predefined Fairlearn metrics are:**\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "premium-charlotte",
+   "metadata": {},
+   "source": [
+    "|Metric name | Description\n",
+    "|:-----|:-----\n",
+    "| demographic_parity_difference |  Defined as the absolute value of the difference in the **selection rates** between a subgroup and its negation.\n",
+    "|demographic_parity_ratio | Defined as the ratio between the smallest and the largest group-level **selection rate**, between a subgroup and its negation.\n",
+    "|equalized_odds_difference | The greater of two metrics: true_positive_rate_difference and false_positive_rate_difference. The former is the difference between the TPRs, between a subgroup and its negation. The latter is defined similarly, but for FPRs.\n",
+    "|equalized_odds_ratio | The smaller of two metrics: true_positive_rate_ratio and false_positive_rate_ratio. The former is the ratio between the TPRs, between a subgroup and its negation. The latter is defined similarly, but for FPRs."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "upset-broadcasting",
+   "metadata": {},
+   "source": [
+    "We can initialize a SubgroupDiscoveryTask object in this way:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "labeled-retail",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import fairsd as fsd\n",
+    "from fairlearn.metrics import demographic_parity_ratio\n",
+    "task = fsd.SubgroupDiscoveryTask(X, y_true, y_pred, qf=demographic_parity_ratio)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "personalized-munich",
+   "metadata": {},
+   "source": [
+    "Alternatively, for these four predefined Fairlearn metrics, we can pass the metric name as a string:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "directed-robin",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "task = fsd.SubgroupDiscoveryTask(X, y_true, y_pred, qf='demographic_parity_ratio')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "subjective-carrier",
+   "metadata": {},
+   "source": [
+    "Since version 0.6.0, Fairlearn also offers the possibility to \"create a scalar returning metric function based on aggregation of a disaggregated metric\".<br/>\n",
+    "**Example:**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "efficient-tolerance",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.metrics import accuracy_score\n",
+    "from fairlearn.metrics import make_derived_metric\n",
+    "derived_metric = make_derived_metric(metric = accuracy_score, transform = 'difference')\n",
+    "task = fsd.SubgroupDiscoveryTask(X, y_true, y_pred, qf=derived_metric)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "solid-nutrition",
+   "metadata": {},
+   "source": [
+    "For more details see the Fairlearn documentation."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "distinguished-tonight",
+   "metadata": {},
+   "source": [
+    "### Customized Quality Measures\n",
+    "It is possible to create a quality measure by extending the class [QualityFunction](https://github.com/MaurizioPulizzi/fairsd/blob/main/fairsd/qualitymeasures.py#L3):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "eligible-british",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class MyQualityMeasure(fsd.QualityFunction):\n",
+    "    def evaluate(self, y_true = None, y_pred=None, sensitive_features=None):\n",
+    "        return 0.5\n",
+    "    \n",
+    "# pass the bound method of an instance so that 'self' is filled in correctly\n",
+    "task = fsd.SubgroupDiscoveryTask(X, y_true, y_pred, qf=MyQualityMeasure().evaluate)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "abstract-crown",
+   "metadata": {},
+   "source": [
+    "## Descriptions and Quality Measures\n",
+    "A subgroup description is the conjunction of zero or more descriptors.<br/>\n",
+    "A descriptor is a statement of the form \"attribute_name = attribute_value\" for nominal attributes or \"attribute_name = range\" for numerical attributes.<br/>\n",
+    "Example of Description: \" sex = 'Male' AND age = (18, 30] \". <br/>\n",
+    "The Top-k subgroup discovery task in this package returns the descriptions of the k subgroups that exhibit the greatest disparity.<br>\n",
+    "There is no single definition of subgroup disparity; its meaning depends on the quality measure used.\n",
+    "\n",
+    "**All metrics in the [fairlearn.metrics](https://fairlearn.github.io/v0.6.0/api_reference/fairlearn.metrics.html) module are symmetrical:** they always return a value between 0 and 1 and do not distinguish whether a subgroup is \\\"positively\\\" or \\\"negatively\\\" dissimilar. For example the descriptions \\\"married = True\\\" and \\\"married = False\\\" will always have the same quality.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "round-dealer",
+   "metadata": {},
+   "source": [
+    "## Implemented Subgroup Discovery Algorithms\n",
+    "This package offers two Top-K Subgroup Discovery Algorithms: **BeamSearch** and **DSSD**."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "nonprofit-penetration",
+   "metadata": {},
+   "source": [
+    "### BeamSearch Algorithm\n",
+    "BeamSearch is a heuristic algorithm that offers a good trade-off between the completeness of exhaustive search algorithms and the speed of greedy algorithms. <br/> This trade-off can be adjusted via the beam width. As the beam width tends to infinity the algorithm becomes an exhaustive search algorithm, whereas with a beam width equal to 1 it becomes a greedy one."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "neutral-representative",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Beam search algorithm usage\n",
+    "from fairsd import SubgroupDiscoveryTask\n",
+    "from fairsd import BeamSearch\n",
+    "task = SubgroupDiscoveryTask(X, y_true, y_pred)\n",
+    "resultset = BeamSearch().execute(task)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "balanced-center",
+   "metadata": {},
+   "source": [
+    "The beam width is 20 by default, but it is possible to specify a different value:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "identical-journey",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "education = \"Bachelors\" AND marital-status = \"Married-civ-spouse\" \n",
+      "education = \"Bachelors\" AND marital-status = \"Married-civ-spouse\" AND capital-loss = (-infinite, 0.0] \n",
+      "education = \"Bachelors\" AND marital-status = \"Married-civ-spouse\" AND capital-gain = (-infinite, 0.0] \n",
+      "education = \"Bachelors\" AND marital-status = \"Married-civ-spouse\" AND race = \"White\" \n",
+      "education = \"Bachelors\" AND marital-status = \"Married-civ-spouse\" AND sex = \"Male\" \n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "resultset = BeamSearch(beam_width=10).execute(task)\n",
+    "print(resultset)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "seasonal-triumph",
+   "metadata": {},
+   "source": [
+    "#### Redundancy\n",
+    "As we can see, the returned result set is somewhat redundant: it contains many variants of essentially the same subgroup. <br/>\n",
+    "Many top-k subgroup discovery algorithms suffer from this problem.<br/>\n",
+    "To mitigate it, we suggest using the DSSD algorithm."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "hungarian-brass",
+   "metadata": {},
+   "source": [
+    "### DSSD (Diverse Subgroup Set Discovery) algorithm\n",
+    "This algorithm is a variant of the Beam Search algorithm that also takes into account the redundancy of the generated subgroups.<br/>\n",
+    "In this package a cover-based redundancy definition is used: roughly, the more tuples two subgroups have in common, the more they are considered redundant. <br/>\n",
+    "The degree to which redundancy is mitigated is determined by the alpha parameter. Alpha must be between 0 and 1; the higher alpha is, the less the redundancy of the subgroups is taken into account.<br>\n",
+    "\n",
+    "This algorithm is described in detail in Van Leeuwen and Knobbe's paper \"Diverse Subgroup Set Discovery\".\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "pleased-elizabeth",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "education = \"Bachelors\" AND marital-status = \"Married-civ-spouse\" \n",
+      "education = \"Bachelors\" AND relationship = \"Husband\" \n",
+      "education-num = (13.0, +infinite] AND marital-status = \"Married-civ-spouse\" \n",
+      "education-num = (13.0, +infinite] AND race = \"Asian-Pac-Islander\" \n",
+      "relationship = \"Husband\" AND capital-gain = (7298.0, +infinite] \n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "# DSSD algorithm usage\n",
+    "from fairsd import DSSD\n",
+    "resultset = DSSD(beam_width=10, a = 0.9).execute(task) # \"a\" parameter represents alpha, 0.9 by default \n",
+    "print(resultset)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "placed-people",
+   "metadata": {},
+   "source": [
+    "As we can now see just by looking at the descriptions in the result set, the redundancy seems much reduced."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
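
The notebook above describes beam search in terms of a beam width, a maximum description depth, a minimum support and a top-k result set. The following is a toy sketch of that general idea over nominal descriptors only. It is not fairsd's `BeamSearch` implementation: the function `beam_search`, the quality function `parity_gap` and the data are all hypothetical, and discretization of numeric features is left out.

```python
import heapq

import pandas as pd


def beam_search(df, quality, beam_width=2, depth=2, k=3, min_support=1):
    """Toy top-k beam search over conjunctions of `column == value` descriptors."""
    selectors = [(c, v) for c in df.columns for v in sorted(df[c].unique())]
    beam = [()]   # start from the empty description
    results = {}  # description (tuple of selectors) -> quality
    for _ in range(depth):
        candidates = {}
        for descr in beam:
            used = {c for c, _ in descr}
            for sel in selectors:
                if sel[0] in used:
                    continue  # at most one descriptor per attribute
                new = tuple(sorted(descr + (sel,)))
                if new in candidates:
                    continue  # same conjunction reached by another path
                mask = pd.Series(True, index=df.index)
                for c, v in new:
                    mask &= df[c] == v
                if mask.sum() < min_support:
                    continue  # too small to be reported
                candidates[new] = quality(mask)
        results.update(candidates)
        # keep only the best `beam_width` descriptions for the next level
        beam = heapq.nlargest(beam_width, candidates, key=candidates.get)
    return heapq.nlargest(k, results.items(), key=lambda item: item[1])


# Illustrative data: two nominal attributes and binary predictions.
data = pd.DataFrame(
    {
        "sex": ["M", "M", "F", "F", "M", "F"],
        "edu": ["BSc", "BSc", "BSc", "HS", "HS", "HS"],
    }
)
y_pred = pd.Series([1, 1, 0, 0, 1, 0])


def parity_gap(mask):
    """Absolute selection-rate gap between the subgroup and its complement."""
    if mask.all() or not mask.any():
        return 0.0
    return abs(y_pred[mask].mean() - y_pred[~mask].mean())


top = beam_search(data, parity_gap, beam_width=2, depth=2, k=3)
for description, q in top:
    print(" AND ".join(f'{c} = "{v}"' for c, v in description), round(q, 3))
```

With `beam_width=1` only the single best description is expanded at each level (greedy behaviour), while a very large width expands every candidate that passes the support check.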

fairsd/notebooks/fairsd_usage.ipynb ADDED Viewed

	@@ -0,0 +1,358 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "proper-mustang",
+   "metadata": {},
+   "source": [
+    "# Quick start -- Use Case Example\n",
+    "This example uses the [UCI adult dataset](https://archive.ics.uci.edu/ml/datasets/Adult), where the objective is to predict whether a person makes more (label 1) or less (label 0) than $50,000 a year."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "authorized-better",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "from sklearn.datasets import fetch_openml\n",
+    "from sklearn.tree import DecisionTreeClassifier\n",
+    "\n",
+    "#Import dataset\n",
+    "d = fetch_openml(data_id=1590, as_frame=True)\n",
+    "X = d.data\n",
+    "d_train=pd.get_dummies(X)\n",
+    "y_true = (d.target == '>50K') * 1\n",
+    "\n",
+    "#training the classifier\n",
+    "classifier = DecisionTreeClassifier(min_samples_leaf=10, max_depth=4)\n",
+    "classifier.fit(d_train, y_true)\n",
+    "\n",
+    "#Producing y_pred\n",
+    "y_pred = classifier.predict(d_train)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "adolescent-damage",
+   "metadata": {},
+   "source": [
+    "## Use of the FairSD package\n",
+    "Here we use the DSSD (Diverse Subgroup Set Discovery) algorithm and the demographic_parity_difference (from Fairlearn) to find the top-k (k = 5 by default) subgroups that exhibit the greatest disparity.<br/>\n",
+    "The execute method returns a **ResultSet object**."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "talented-dynamics",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import fairsd as fsd\n",
+    "task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, qf = \"demographic_parity_difference\")\n",
+    "result_set=fsd.DSSD().execute(task)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "liked-paradise",
+   "metadata": {},
+   "source": [
+    "### ResultSet object"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "blind-official",
+   "metadata": {},
+   "source": [
+    "We can transform the result set into a dataframe as shown below. Each row of this dataframe represents a subgroup."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "european-relaxation",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>quality</th>\n",
+       "      <th>description</th>\n",
+       "      <th>size</th>\n",
+       "      <th>proportion</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>0.913502</td>\n",
+       "      <td>education = \"Bachelors\" AND marital-status = \"...</td>\n",
+       "      <td>4136</td>\n",
+       "      <td>0.084681</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>0.913502</td>\n",
+       "      <td>education-num = (12.0, 13.0] AND marital-statu...</td>\n",
+       "      <td>4136</td>\n",
+       "      <td>0.084681</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>0.866879</td>\n",
+       "      <td>capital-gain = (6849.0, +infinite] AND fnlwgt ...</td>\n",
+       "      <td>2036</td>\n",
+       "      <td>0.041685</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>0.863130</td>\n",
+       "      <td>education = \"Masters\" AND marital-status = \"Ma...</td>\n",
+       "      <td>1527</td>\n",
+       "      <td>0.031264</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>0.853604</td>\n",
+       "      <td>education-num = (14.0, +infinite] AND marital-...</td>\n",
+       "      <td>999</td>\n",
+       "      <td>0.020454</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "    quality                                        description  size  \\\n",
+       "0  0.913502  education = \"Bachelors\" AND marital-status = \"...  4136   \n",
+       "1  0.913502  education-num = (12.0, 13.0] AND marital-statu...  4136   \n",
+       "2  0.866879  capital-gain = (6849.0, +infinite] AND fnlwgt ...  2036   \n",
+       "3  0.863130  education = \"Masters\" AND marital-status = \"Ma...  1527   \n",
+       "4  0.853604  education-num = (14.0, +infinite] AND marital-...   999   \n",
+       "\n",
+       "   proportion  \n",
+       "0    0.084681  \n",
+       "1    0.084681  \n",
+       "2    0.041685  \n",
+       "3    0.031264  \n",
+       "4    0.020454  "
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "df=result_set.to_dataframe()\n",
+    "display(df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "steady-version",
+   "metadata": {},
+   "source": [
+    "We can also print the result set or convert it into a string as shown below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "continuing-quarter",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "education = \"Bachelors\" AND marital-status = \"Married-civ-spouse\" \n",
+      "education-num = (12.0, 13.0] AND marital-status = \"Married-civ-spouse\" \n",
+      "capital-gain = (6849.0, +infinite] AND fnlwgt = (24763.0, +infinite] \n",
+      "education = \"Masters\" AND marital-status = \"Married-civ-spouse\" \n",
+      "education-num = (14.0, +infinite] AND marital-status = \"Married-civ-spouse\" \n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "resultset_string = result_set.to_string()\n",
+    "print(result_set)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "difficult-rendering",
+   "metadata": {},
+   "source": [
+    "### Generate a feature from a subgroup\n",
+    "ResultSet basically contains a list of subgroup descriptions ([Description](https://github.com/MaurizioPulizzi/fairsd/blob/main/fairsd/sgdescription.py#L80) objects).<br/>\n",
+    "Another interesting method of the ResultSet object allows us to \n",
+    "**select a subgroup X from the result set and automatically generate the feature \"Belongs to subgroup X\"**. This is very useful for analyzing the discovered subgroups in more depth, for example with the Fairlearn library.<br/>\n",
+    "An example is shown below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "covered-falls",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "sg0\n",
+      "False    0.0864985\n",
+      "True             1\n",
+      "Name: selection_rate, dtype: object\n"
+     ]
+    }
+   ],
+   "source": [
+    "from fairlearn.metrics import MetricFrame\n",
+    "from fairlearn.metrics import selection_rate\n",
+    "\n",
+    "# Here we generate the feature \"Belongs to subgroup n. 0\"\n",
+    "# The result is a pandas Series. The name of this Series is \"sg0\".\n",
+    "# This Series contains an element for each instance of the dataset. Each element is True \n",
+    "# iff the instance belongs to the subgroup sg0\n",
+    "sg_feature = result_set.sg_feature(sg_index=0, X=X)\n",
+    "\n",
+    "# Here we use the Fairlearn library to further analyze the subgroup sg0\n",
+    "# (the result gets its own name so that it does not shadow the imported selection_rate function)\n",
+    "selection_rate_frame = MetricFrame(selection_rate, y_true, y_pred, sensitive_features=sg_feature)\n",
+    "print(selection_rate_frame.by_group)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "conceptual-battlefield",
+   "metadata": {},
+   "source": [
+    "### Description object\n",
+    "We can also obtain the subgroup feature by first retrieving the corresponding Description object:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "worth-service",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0        False\n",
+      "1        False\n",
+      "2        False\n",
+      "3        False\n",
+      "4        False\n",
+      "         ...  \n",
+      "48837    False\n",
+      "48838    False\n",
+      "48839    False\n",
+      "48840    False\n",
+      "48841    False\n",
+      "Length: 48842, dtype: bool\n"
+     ]
+    }
+   ],
+   "source": [
+    "description0 = result_set.get_description(0)\n",
+    "sg_feature = description0.to_boolean_array(dataset = X)\n",
+    "print(sg_feature)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "portuguese-eating",
+   "metadata": {},
+   "source": [
+    "Once we have the Description object of a subgroup, we can also extract other information about the subgroup.<br/>\n",
+    "We can:\n",
+    " * convert the Description object into a string\n",
+    " * retrieve the size of the subgroup\n",
+    " * retrieve the quality (fairness measure) of the subgroup\n",
+    " * retrieve the names of the attributes that compose the subgroup description"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "pleased-chancellor",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "education = \"Bachelors\" AND marital-status = \"Married-civ-spouse\" \n",
+      "4136\n",
+      "0.913501543416991\n",
+      "['education', 'marital-status']\n"
+     ]
+    }
+   ],
+   "source": [
+    "# String conversion\n",
+    "str_descr = description0.to_string()\n",
+    "print( str_descr ) # also print(description0) works\n",
+    "\n",
+    "# Size\n",
+    "print( description0.size() )\n",
+    "\n",
+    "# Quality\n",
+    "print( description0.get_quality() )\n",
+    "\n",
+    "# Attribute names\n",
+    "print( description0.get_attributes() )"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
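
In the result table of the notebook above, the first two subgroups have identical size and quality, which suggests (but does not prove) that they cover the same rows. One simple way to check how much two subgroup features overlap is the Jaccard index of their boolean masks. The sketch below uses made-up masks, and the function name `jaccard_overlap` is hypothetical. It is not the cover-based redundancy measure that DSSD uses internally.

```python
import pandas as pd


def jaccard_overlap(a, b):
    """Share of rows covered by both subgroups among rows covered by either."""
    a = pd.Series(a, dtype=bool)
    b = pd.Series(b, dtype=bool)
    union = int((a | b).sum())
    if union == 0:
        return 0.0  # two empty subgroups: define the overlap as 0
    return int((a & b).sum()) / union


# Illustrative membership masks, one element per dataset row.
sg_a = [True, True, False, False, True]
sg_b = [True, True, False, True, False]

print(jaccard_overlap(sg_a, sg_b))  # partial overlap
print(jaccard_overlap(sg_a, sg_a))  # identical cover
```

In the notebook, the two masks would come from `description.to_boolean_array(dataset=X)` for the two descriptions being compared. A value of 1 means both descriptions select exactly the same rows.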

fairsd/requirements.txt ADDED Viewed

	@@ -0,0 +1,5 @@

+numpy>=1.17.2
+pandas>=0.25.1
+scikit-learn>=0.22.1
+fairlearn>=0.5.0
+math>=3.9.1

fairsd/tests/__init__.py ADDED Viewed

File without changes

fairsd/tests/age.csv ADDED Viewed

The diff for this file is too large to render. See raw diff

fairsd/tests/discretizationtest.py ADDED Viewed

	@@ -0,0 +1,68 @@

+import unittest
+from unittest import TestCase
+import pandas as pd
+import numpy as np
+import sys
+sys.path.append("../")
+import fairsd.discretization as disc
+class TestMDLP(TestCase):
+    def test_findCutPoints(self):
+        a = pd.read_csv("age.csv")
+        x = a["age"]
+        y = a["y_true"]
+        discretizer = disc.MDLP()
+        correct_result = [21.0, 23.0, 24.0, 27.0, 30.0, 35.0, 41.0, 54.0, 61.0, 67.0]
+        self.assertEqual(discretizer.findCutPoints(x, y), correct_result)
+class TestEqualFrequency(TestCase):
+    def test_findCutPoints(self):
+        discretizer = disc.EqualFrequency(2, 0)
+        x = np.ones(49)
+        x = np.concatenate(
+            (x, [2])
+        )  # no cut points expected because of the min_bin_size
+        self.assertEqual(discretizer.findCutPoints(x), [])
+        x = np.concatenate((x, [2]))  # expected one cut point
+        self.assertEqual(discretizer.findCutPoints(x), [1])
+        x = np.concatenate((x, [2]))  # expected one cut point
+        self.assertEqual(discretizer.findCutPoints(x), [1])
+        x = np.ones(5)
+        correct_res = []
+        self.assertEqual(
+            discretizer.findCutPoints(x), correct_res
+        )  # no cut points expected
+        x = np.concatenate((x, np.zeros(5)))
+        correct_res.append(0)
+        self.assertEqual(discretizer.findCutPoints(x), correct_res)
+        x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
+        self.assertEqual(discretizer.findCutPoints(x), [3, 5, 8])
+        discretizer = disc.EqualFrequency(1, 4)
+        self.assertEqual(discretizer.findCutPoints(x), [3, 5, 8])
+        discretizer = disc.EqualFrequency(1, 0)
+        self.assertEqual(discretizer.findCutPoints(x), [2, 3, 4, 5, 7, 8, 9])
+        x = np.array([1, 2, 2, 2, 2, 6, 7, 8, 9, 10])
+        self.assertEqual(discretizer.findCutPoints(x), [2, 7, 8, 9])
+        discretizer = disc.EqualFrequency(2, 0)
+        x = np.array([1, 7, 7, 7, 7, 7, 7, 8, 7, 7])
+        self.assertEqual(discretizer.findCutPoints(x), [])
+class TestEqualWidth(TestCase):
+    def test_findCutPoints(self):
+        discretizer = disc.EqualWidth(2, 20)
+        x = np.zeros(9)
+        x = np.concatenate((x, [10]))
+        self.assertEqual(discretizer.findCutPoints(x), [2.5, 5, 7.5])
+if __name__ == "__main__":
+    unittest.main()
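
The expected cut points in TestEqualWidth can be checked by hand. The following is a minimal sketch, assuming four equal-width bins over the observed range. `equal_width_cut_points` is a hypothetical helper written for this note and is not part of fairsd; how `disc.EqualWidth` derives its bin count from its constructor arguments is not shown in this diff.

```python
import numpy as np


def equal_width_cut_points(values, n_bins):
    """Return the interior boundaries that split [min, max] into n_bins equal-width bins.

    Hypothetical helper for illustration only; not fairsd's implementation.
    """
    lo, hi = float(np.min(values)), float(np.max(values))
    # A constant column (or fewer than two bins) has no interior boundary.
    if n_bins < 2 or lo == hi:
        return []
    edges = np.linspace(lo, hi, n_bins + 1)
    # Drop the outer edges (min and max); only interior edges are cut points.
    return edges[1:-1].tolist()


# Same data as the test: nine zeros followed by a single 10.
x = np.concatenate((np.zeros(9), [10]))
print(equal_width_cut_points(x, 4))  # [2.5, 5.0, 7.5]
```

Equal-width boundaries depend only on the minimum and maximum, so the nine repeated zeros do not move the cut points.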

fairsd/tests/sgdiscoverytest.py ADDED Viewed

	@@ -0,0 +1,397 @@

+import inspect
+import unittest
+from unittest import TestCase
+import pandas as pd
+import numpy as np
+import sys
+sys.path.append("../")
+import fairsd as dsd
+import random
+def list_of_descriptors(num_des=3):
+    # create list of Descriptor
+    los = []
+    # nominal descriptors
+    for i in range(0, num_des):
+        attribute_name = "a" + str(i)
+        los.append(dsd.Descriptor(attribute_name=attribute_name, attribute_value=True))
+    # numeric descriptors
+    for i in range(0, num_des):
+        attribute_name = "b" + str(i)
+        los.append(
+            dsd.Descriptor(
+                attribute_name=attribute_name, up_bound=3, low_bound=1, is_numeric=True
+            )
+        )
+    return los
+def istantiate_dataset():
+    data = {
+        "a0": [True, True, False, True, False, False],
+        "b0": [2, 2, 3, 4, 5, 6],
+        "y_true": [0, 0, 0, 1, 1, 1],
+        "y_pred": [0, 0, 1, 0, 1, 1],
+    }
+    return pd.DataFrame(data)
+class TestDescriptorMethods(TestCase):
+    def test_is_present_in(self):
+        los = list_of_descriptors()
+        self.assertTrue(
+            dsd.Descriptor("a2", attribute_value=True).is_present_in(los), "error1"
+        )
+        self.assertFalse(
+            dsd.Descriptor("a1", attribute_value=False).is_present_in(los), "error2"
+        )
+        self.assertFalse(
+            dsd.Descriptor("b1", attribute_value=False).is_present_in(los), "error3"
+        )
+        self.assertTrue(
+            dsd.Descriptor(
+                "b1", up_bound=3, low_bound=1, is_numeric=True
+            ).is_present_in(los),
+            "error4",
+        )
+        self.assertFalse(
+            dsd.Descriptor(
+                "b1", up_bound=3, low_bound=0, is_numeric=True
+            ).is_present_in(los),
+            "error5",
+        )
+        self.assertFalse(
+            dsd.Descriptor(
+                "a1", up_bound=3, low_bound=1, is_numeric=True
+            ).is_present_in(los),
+            "error6",
+        )
+        self.assertFalse(
+            dsd.Descriptor("c1", attribute_value=False).is_present_in(los), "error7"
+        )
+        self.assertFalse(
+            dsd.Descriptor(
+                "d1", up_bound=3, low_bound=1, is_numeric=True
+            ).is_present_in(los),
+            "error8",
+        )
+class TestDescriptionMethods(TestCase):
+    def setUp(self):
+        self.los = list_of_descriptors(3)
+        self.des = dsd.Description(self.los)
+        self.des.support = 5
+        self.des.set_quality(0.1)
+    def test__lt__(self):
+        # test with equal quality and different support
+        los1 = list_of_descriptors(1)
+        des1 = dsd.Description(los1)
+        des1.support = 5
+        des1.set_quality(0.1)
+        self.assertFalse(des1.__lt__(self.des))
+        des1.support = 4
+        des1.set_quality(0.1)
+        self.assertTrue(des1.__lt__(self.des))
+        des1.support = 6
+        des1.set_quality(0.1)
+        self.assertFalse(des1.__lt__(self.des))
+        # test with different qualities
+        des1.set_quality(0.2)
+        self.assertFalse(des1.__lt__(self.des))
+        des1.set_quality(0.01)
+        self.assertTrue(des1.__lt__(self.des))
+        des1.set_quality(0.1)
+        self.assertFalse(des1.__lt__(self.des))
+    def test_to_boolean_array(self):
+        los = list_of_descriptors(1)
+        des = dsd.Description(los)
+        self.assertTrue(
+            np.array_equal(
+                des.to_boolean_array(istantiate_dataset()),
+                np.array([True, True, False, False, False, False]),
+            )
+        )
+        des = dsd.Description()
+        self.assertTrue(
+            np.array_equal(
+                des.to_boolean_array(istantiate_dataset()),
+                np.array([True, True, True, True, True, True]),
+            )
+        )
+    def test_get_attributes(self):
+        self.assertEqual(
+            self.des.get_attributes(), ["a0", "a1", "a2", "b0", "b1", "b2"]
+        )
+    def test_is_present_in(self):
+        list_t = []
+        for i in range(3):
+            list_t.append(dsd.Description(list_of_descriptors(i + 1)))
+        self.assertTrue(dsd.Description(list_of_descriptors(2)).is_present_in(list_t))
+        self.assertFalse(dsd.Description(list_of_descriptors(4)).is_present_in(list_t))
+class TestDiscretizerMethods(TestCase):
+    """
+    This test assumes that the classes Discretizer, Description and Descriptor work correctly.
+    These three classes are tested above.
+    """
+    def test_discretize_mdlp(self):
+        discretizer = dsd.Discretizer(discretization_type="mdlp", target="y_true")
+        dataset = istantiate_dataset()
+        description = dsd.Description()
+        res = discretizer.discretize(dataset, description, "b0")
+        d0 = dsd.Descriptor(
+            "b0", up_bound=3, low_bound=None, to_discretize=False, is_numeric=True
+        )
+        d1 = dsd.Descriptor(
+            "b0", up_bound=None, low_bound=3, to_discretize=False, is_numeric=True
+        )
+        correct_result = [d0, d1]
+        self.assertEqual(len(res), 2)
+        self.assertTrue(correct_result[0].is_present_in(res))
+        self.assertTrue(correct_result[1].is_present_in(res))
+    def test_discretize_equalfreq(self):
+        discretizer = dsd.Discretizer(discretization_type="equalfreq", num_bins=2)
+        dataset = istantiate_dataset()
+        description = dsd.Description()
+        res = discretizer.discretize(dataset, description, "b0")
+        d0 = dsd.Descriptor(
+            "b0", up_bound=3, low_bound=None, to_discretize=False, is_numeric=True
+        )
+        d1 = dsd.Descriptor(
+            "b0", up_bound=None, low_bound=3, to_discretize=False, is_numeric=True
+        )
+        correct_result = [d0, d1]
+        self.assertEqual(len(res), 2)
+        self.assertTrue(correct_result[0].is_present_in(res))
+        self.assertTrue(correct_result[1].is_present_in(res))
+    def test_discretize_equalwidth(self):
+        discretizer = dsd.Discretizer(discretization_type="equalwidth", num_bins=2)
+        dataset = istantiate_dataset()
+        description = dsd.Description()
+        res = discretizer.discretize(dataset, description, "b0")
+        d0 = dsd.Descriptor(
+            "b0", up_bound=4, low_bound=None, to_discretize=False, is_numeric=True
+        )
+        d1 = dsd.Descriptor(
+            "b0", up_bound=None, low_bound=4, to_discretize=False, is_numeric=True
+        )
+        correct_result = [d0, d1]
+        self.assertEqual(len(res), 2)
+        self.assertTrue(correct_result[0].is_present_in(res))
+        self.assertTrue(correct_result[1].is_present_in(res))
+class TestSearchSpaceMethods(TestCase):
+    def test_extract_search_space_AND_init_methods(self):
+        dataset = istantiate_dataset()
+        ss = dsd.SearchSpace(
+            dataset,
+            ignore=["y_pred", "y_true"],
+            discretizer=dsd.Discretizer(discretization_type="equalfreq", num_bins=2),
+        )
+        # creation of correct result
+        d0 = dsd.Descriptor("a0", attribute_value=True, is_numeric=False)
+        d1 = dsd.Descriptor("a0", attribute_value=False, is_numeric=False)
+        d2 = dsd.Descriptor(
+            "b0", up_bound=3, low_bound=None, to_discretize=False, is_numeric=True
+        )
+        d3 = dsd.Descriptor(
+            "b0", up_bound=None, low_bound=3, to_discretize=False, is_numeric=True
+        )
+        correct_result = [d0, d1, d2, d3]
+        discretizer = dsd.Discretizer(discretization_type="mdlp", target="y_true")
+        result = ss.extract_search_space(dataset, discretizer)
+        # batch of tests 1
+        self.assertEqual(len(result), 4)
+        self.assertTrue(correct_result[0].is_present_in(result))
+        self.assertTrue(correct_result[1].is_present_in(result))
+        self.assertTrue(correct_result[2].is_present_in(result))
+        self.assertTrue(correct_result[3].is_present_in(result))
+        # batch of tests 2: with not null description
+        descr = dsd.Description(
+            [d2]
+        )  # d2 bounds b0 from above by 3; b0 is excluded because every row in this subgroup has the negative class (y_true = 0)
+        result = ss.extract_search_space(dataset, discretizer, descr)
+        self.assertEqual(len(result), 2)
+        self.assertTrue(correct_result[0].is_present_in(result))
+        self.assertTrue(correct_result[1].is_present_in(result))
+        # batch of tests 3: Numeric attribute treated as a nominal one
+        ss = dsd.SearchSpace(
+            dataset, ignore=["y_pred", "y_true"], nominal_features=["b0"]
+        )
+        result = ss.extract_search_space(dataset, discretizer)
+        # creation of correct result
+        d0 = dsd.Descriptor("a0", attribute_value=True, is_numeric=False)
+        d1 = dsd.Descriptor("a0", attribute_value=False, is_numeric=False)
+        d2 = dsd.Descriptor("b0", attribute_value=2, is_numeric=False)
+        d3 = dsd.Descriptor("b0", attribute_value=3, is_numeric=False)
+        d4 = dsd.Descriptor("b0", attribute_value=4, is_numeric=False)
+        d5 = dsd.Descriptor("b0", attribute_value=5, is_numeric=False)
+        d6 = dsd.Descriptor("b0", attribute_value=6, is_numeric=False)
+        correct_result = [d0, d1, d2, d3, d4, d5, d6]
+        self.assertEqual(len(result), 7)
+        self.assertTrue(correct_result[0].is_present_in(result))
+        self.assertTrue(correct_result[1].is_present_in(result))
+        self.assertTrue(correct_result[2].is_present_in(result))
+        self.assertTrue(correct_result[3].is_present_in(result))
+        self.assertTrue(correct_result[4].is_present_in(result))
+        self.assertTrue(correct_result[5].is_present_in(result))
+        self.assertTrue(correct_result[6].is_present_in(result))
+COSTANT_SEED = 3
+class TestQF(dsd.QualityFunction):
+    """
+    This quality function is created only for testing the sg_discovery algorithms.
+    The evaluate function returns a pseudo-random number each time it is called.
+    The seed is always 3, so the sequence of returned numbers is reproducible.
+    """
+    def __init__(self):
+        random.seed(a=COSTANT_SEED)
+    def evaluate(self, y_true=None, y_pred=None, sensitive_features=None):
+        return random.random()
+class TestBeamSearchMethods(TestCase):
+    """
+    A toy dataset is used for testing. At most 8 descriptions can be generated from this dataset,
+    so the expected result of the algorithm can be computed by hand.
+    The quality function is the TestQF defined above.
+    """
+    def test_execute(self):
+        data = {
+            "a0": [True, True, False, True, False, False],
+            "b0": ["a", "b", "a", "b", "a", "b"],
+        }
+        x = pd.DataFrame(data)
+        y_true = pd.Series([0, 0, 0, 1, 1, 1])
+        task = dsd.SubgroupDiscoveryTask(
+            x, y_true, qf=TestQF().evaluate, depth=6, min_quality=0, min_support=0
+        )
+        result_set = dsd.BeamSearch(beam_width=10).execute(task)
+        result_set = result_set.descriptions_list
+        self.assertEqual(len(result_set), 8)
+        # create correct result
+        correct_result = []
+        random.seed(a=COSTANT_SEED)
+        correct_result.append((random.random(), "a0 = 'True' "))
+        correct_result.append((random.random(), "a0 = 'False' "))
+        correct_result.append((random.random(), "b0 = 'a' "))
+        correct_result.append((random.random(), "b0 = 'b' "))
+        correct_result.append((random.random(), "a0 = 'True' AND b0 = 'a' "))
+        correct_result.append((random.random(), "a0 = 'True' AND b0 = 'b' "))
+        correct_result.append((random.random(), "a0 = 'False' AND b0 = 'a' "))
+        correct_result.append((random.random(), "a0 = 'False' AND b0 = 'b' "))
+        correct_result.sort(reverse=True)
+        for i in range(8):
+            self.assertEqual(correct_result[i][1], result_set[i].__repr__())
+        # test 2 with reduced result set size
+        task = dsd.SubgroupDiscoveryTask(
+            x,
+            y_true,
+            qf=TestQF().evaluate,
+            result_set_size=3,
+            min_quality=0,
+            min_support=0,
+        )
+        result_set = dsd.BeamSearch(beam_width=10).execute(task)
+        result_set = result_set.descriptions_list
+        self.assertEqual(len(result_set), 3)
+        for i in range(3):
+            self.assertEqual(correct_result[i][1], result_set[i].__repr__())
+class TestDSSDMethods(TestCase):
+    """
+    A toy dataset is used for testing. At most 8 descriptions can be generated from this dataset,
+    so the expected result of the algorithm can be computed by hand.
+    The quality function is the TestQF defined above.
+    """
+    def test_execute(self):
+        data = {
+            "a0": [True, True, False, True, False, False],
+            "b0": ["a", "b", "a", "b", "a", "b"],
+        }
+        x = pd.DataFrame(data)
+        y_true = pd.Series([0, 0, 0, 1, 1, 1])
+        task = dsd.SubgroupDiscoveryTask(
+            x, y_true, qf=TestQF().evaluate, depth=1, min_quality=0, min_support=0
+        )
+        result_set = dsd.DSSD(beam_width=10, a=1).execute(task)
+        result_set = result_set.descriptions_list
+        self.assertEqual(len(result_set), 4)
+        # create correct result
+        correct_result = []
+        random.seed(a=COSTANT_SEED)
+        correct_result.append((random.random(), "a0 = 'True' "))
+        correct_result.append((random.random(), "a0 = 'False' "))
+        correct_result.append((random.random(), "b0 = 'a' "))
+        correct_result.append((random.random(), "b0 = 'b' "))
+        correct_result.sort(reverse=True)
+        for i in range(4):
+            self.assertEqual(correct_result[i][1], result_set[i].__repr__())
+        # test 2 with reduced result set size
+        task = dsd.SubgroupDiscoveryTask(
+            x,
+            y_true,
+            qf=TestQF().evaluate,
+            depth=1,
+            result_set_size=3,
+            min_quality=0,
+            min_support=0,
+        )
+        result_set = dsd.DSSD(beam_width=10, a=1).execute(task)
+        result_set = result_set.descriptions_list
+        self.assertEqual(len(result_set), 3)
+        for i in range(3):
+            self.assertEqual(correct_result[i][1], result_set[i].__repr__())
+if __name__ == "__main__":
+    unittest.main()
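
The search tests above rely on re-seeding the random generator to rebuild the sequence of qualities that TestQF handed out. The following is a minimal sketch of that idea using only the standard library. `draw` is a hypothetical helper, and it uses a private `random.Random` instance instead of the module-level generator that the tests seed.

```python
import random

SEED = 3  # same value as the seed constant in the test module


def draw(n, seed=SEED):
    """Return the first n pseudo-random numbers produced from a fresh seed."""
    rng = random.Random(seed)  # private generator; global state stays untouched
    return [rng.random() for _ in range(n)]


# Description strings in the order the beam search test lists them.
labels = [
    "a0 = 'True' ",
    "a0 = 'False' ",
    "b0 = 'a' ",
    "b0 = 'b' ",
    "a0 = 'True' AND b0 = 'a' ",
    "a0 = 'True' AND b0 = 'b' ",
    "a0 = 'False' AND b0 = 'a' ",
    "a0 = 'False' AND b0 = 'b' ",
]

first = draw(len(labels))
second = draw(len(labels))
assert first == second  # same seed, same sequence

# Pair each quality with a label and rank from best to worst, as the test does.
ranked = sorted(zip(first, labels), reverse=True)
for quality, label in ranked:
    print(f"{quality:.4f}  {label}")
```

The expected ranking only matches if the algorithm evaluates descriptions in the same order in which the expected list is built, so a change in evaluation order would fail the test even when the search itself is still correct.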

images/screenshot.png ADDED Viewed

load.py ADDED Viewed

	@@ -0,0 +1,352 @@

+import logging
+import time
+from typing import List, Tuple
+from sklearn.ensemble import RandomForestClassifier
+import shap
+import pandas as pd
+import numpy as np
+from sklearn.model_selection import train_test_split
+from sklearn.datasets import fetch_openml
+from sklearn.tree import DecisionTreeClassifier
+from utils import (
+    combine_one_hot,
+    combine_all_one_hot_shap_logloss,
+)
+from fairsd import fairsd
+from fairsd.fairsd.algorithms import ResultSet
+def import_dataset(dataset: str, sensitive_features: List[str] = None):
+    """Import the dataset from OpenML and preprocess it.
+    Args:
+        dataset (str): Dataset to be imported. Supported datasets: adult, german_credit, heloc, credit.
+        sensitive_features (list, optional): Sensitive features to be used in the dataset. Defaults to None.
+    """
+    print("Loading data...")
+    cols_to_drop = []
+    # Import dataset
+    if dataset == "adult":
+        dataset_id = 1590
+        target = ">50K"
+        cols_to_drop = ["fnlwgt", "education-num"]
+        if sensitive_features is not None:
+            cols_to_drop = [col for col in cols_to_drop if col not in sensitive_features]
+    elif dataset in ("credit_g", "german", "german_credit"):
+        dataset_id = 31
+        target = "good"
+        cols_to_drop = ["personal_status", "other_parties", "residence_since", "foreign_worker"]
+        if sensitive_features is not None:
+            cols_to_drop = [col for col in cols_to_drop if col not in sensitive_features]
+    elif dataset == "heloc":
+        dataset_id = 45023
+        target = "1"
+    elif dataset == "credit":
+        dataset_id = 43978
+        target = 1
+    else:
+        raise NotImplementedError(
+            "Only the following datasets are supported now: adult, german_credit, heloc, credit."
+        )
+    d = fetch_openml(data_id=dataset_id, as_frame=True, parser="auto")
+    X = d.data
+    if sensitive_features is not None:
+        incorrect_sensitive_features = [
+            col for col in sensitive_features if col not in X.columns
+        ]
+        if incorrect_sensitive_features:
+            raise ValueError(
+                f"Sensitive features {incorrect_sensitive_features} not found in the dataset."
+            )
+    X = X.drop(cols_to_drop, axis=1, errors="ignore")
+    # Fill missing values - "missing" for categorical, mean for numerical
+    for col in X.columns:
+        if X[col].dtype.name == "category":
+            X[col] = X[col].cat.add_categories("missing").fillna("missing")
+        elif X[col].dtype.name == "object":
+            X[col] = X[col].fillna("missing")
+        else:
+            X[col] = X[col].fillna(X[col].mean())
+    # Get target as 1/0
+    y_true = (d.target == target) * 1
+    return X, y_true
+def load_data(
+    dataset="adult",
+    n_samples=0,
+    test_split=0.3,
+    sensitive_features: List[str] = None,
+) -> Tuple[pd.DataFrame, pd.Series, pd.Series, pd.DataFrame, pd.DataFrame, List[str]]:
+    """Load a dataset, one-hot encode it and optionally split it into train and test sets.
+    Args:
+        dataset (str): Dataset to be loaded; see import_dataset for the supported names
+        n_samples (int, optional): Number of samples to load. Select 0 to load all.
+        test_split (float, optional): Fraction of the selected samples used for the test set
+        sensitive_features (list, optional): Sensitive features to be used in the dataset. Defaults to None.
+    """
+    X, y_true = import_dataset(dataset, sensitive_features)
+    # Get categorical feature names
+    cat_features = X.select_dtypes(include=["category"]).columns
+    # One-hot encoding
+    onehot_X = pd.get_dummies(X) * 1
+    if n_samples and X.shape[0] >= n_samples:
+        # Load only n examples
+        X = X.iloc[:n_samples]
+        y_true = y_true.iloc[:n_samples]
+        onehot_X = onehot_X.iloc[:n_samples]
+    if test_split and float(test_split) < 1.0:
+        (
+            X_train,
+            X_test,
+            y_true_train,
+            y_true_test,
+            onehot_X_train,
+            onehot_X_test,
+        ) = train_test_split(X, y_true, onehot_X, test_size=test_split, random_state=0)
+        # Reset indices
+        X_train.reset_index(inplace=True, drop=True)
+        X_test.reset_index(inplace=True, drop=True)
+        y_true_train = y_true_train.reset_index(drop=True)
+        y_true_test = y_true_test.reset_index(drop=True)
+        onehot_X_train.reset_index(inplace=True, drop=True)
+        onehot_X_test.reset_index(inplace=True, drop=True)
+        X_test.reset_index(inplace=True, drop=True)
+        y_true_test.reset_index(inplace=True, drop=True)
+        return (
+            X_test,
+            y_true_train,
+            y_true_test,
+            onehot_X_train,
+            onehot_X_test,
+            cat_features,
+        )
+    else:
+        return X, y_true, y_true, onehot_X, onehot_X, cat_features
+def add_bias(
+    bias: str, X_test: pd.DataFrame, onehot_X_test: pd.DataFrame, subgroup: pd.Series
+) -> None:
+    """Add bias to the given subgroup of X_test and onehot_X_test in place."""
+    if bias in ("random", "noise"):
+        feature = "capital-gain"
+        # Add random noise to the subset
+        std_val = X_test[feature].std()
+        mean_val = X_test[feature].mean()
+        # Draw the noise once so that both frames receive the same perturbation
+        noise = np.random.normal(
+            mean_val, std_val, sum(subgroup)
+        )
+        X_test.loc[subgroup, feature] += noise
+        onehot_X_test.loc[subgroup, feature] += noise
+    elif bias == "mean":
+        feature = "age"
+        # Set the feature to its overall mean within the subset
+        X_test.loc[subgroup, feature] = X_test[feature].mean()
+        onehot_X_test.loc[subgroup, feature] = onehot_X_test[feature].mean()
+    elif bias == "median":
+        feature = "age"
+        # Set the feature to its overall median within the subset
+        X_test.loc[subgroup, feature] = X_test[feature].median()
+        onehot_X_test.loc[subgroup, feature] = onehot_X_test[feature].median()
+    elif bias in ("bin", "binning"):
+        feature = "age"
+        # Round the feature down to a multiple of 20 within the subset
+        X_test.loc[subgroup, feature] = X_test[subgroup].apply(
+            lambda x: x[feature] // 20 * 20, axis=1
+        )
+        onehot_X_test.loc[subgroup, feature] = onehot_X_test[subgroup].apply(
+            lambda x: x[feature] // 20 * 20, axis=1
+        )
+    elif bias == "sum_std":
+        feature = "age"
+        std_val = X_test[feature].std()
+        # Add the standard deviation of the feature to the subset
+        X_test.loc[subgroup, feature] = X_test[subgroup].apply(
+            lambda x: x[feature] + std_val, axis=1
+        )
+        onehot_X_test.loc[subgroup, feature] = onehot_X_test.loc[subgroup, feature] + std_val
+    elif bias == "swap":
+        # Swap all values of the feature to another value in the same column
+        # feature = "education"
+        # value_selected = "Doctorate"
+        feature = "marital-status"
+        value_selected = "Married-civ-spouse"
+        X_test.loc[subgroup, feature] = value_selected
+        # Onehot_X_test is one hot encoded so we need to swap the entire column for the feature
+        onehot_X_test.loc[subgroup, onehot_X_test.columns.str.startswith(feature)] = 0
+        onehot_X_test.loc[subgroup, [feature + "_" + value_selected]] = 1
+    else:
+        raise ValueError(
+            f"Bias method '{bias}' not supported. Supported methods: random, mean, median, bin, sum_std, swap."
+        )
+    print(
+        f"Added bias to the dataset by method: {bias}. Feature {feature} was affected. Size of the subset impacted: {sum(subgroup)}."
+    )
+def get_classifier(
+    onehot_X_train: pd.DataFrame,
+    y_true_train: pd.Series,
+    onehot_X_test: pd.DataFrame,
+    with_names=True,
+    model="rf",
+) -> Tuple[object, np.ndarray]:
+    """Train a tree-based classifier (random forest by default) on the given dataset.
+    Args:
+        onehot_X_train: one-hot encoded features dataframe used for training
+        y_true_train: true labels for training
+        onehot_X_test: one-hot encoded features dataframe used for testing
+    Returns:
+        classifier: trained classifier
+        y_pred_prob: predicted probabilities of the positive class
+    """
+    if model == "rf":
+        classifier = RandomForestClassifier(
+            n_estimators=20, max_depth=8, random_state=0
+        )
+    elif model == "dt":
+        classifier = DecisionTreeClassifier(min_samples_leaf=30, max_depth=8)
+    elif model == "xgb":
+        from xgboost import XGBClassifier
+        classifier = XGBClassifier() # FIXME: Issue with feature names in german credit dataset
+    else:
+        raise ValueError(f"Model {model} not supported. Supported models: rf, dt, xgb.")
+    if not with_names:
+        onehot_X_train = onehot_X_train.values
+        onehot_X_test = onehot_X_test.values
+    print("Training the classifier...")
+    classifier.fit(onehot_X_train, y_true_train)
+    y_pred_prob = classifier.predict_proba(onehot_X_test)
+    return classifier, y_pred_prob[:, 1]
+def combine_shap_one_hot(shap_values, X_columns, cat_features):
+    # Combine one-hot encoded cat_features
+    non_cat_features = [col for col in X_columns if col not in cat_features]
+    for cat_feat in cat_features:
+        # Get column masks for each cat feature
+        col_masks = [
+            col.startswith(cat_feat) and col not in non_cat_features
+            for col in shap_values.feature_names
+        ]
+        shap_values = combine_one_hot(
+            shap_values, cat_feat, col_masks, return_original=False
+        )
+    return shap_values
+def get_shap_values(classifier, d_train, X, cat_features, combine_cat_features=True):
+    """Get shap values for a given classifier and dataset.
+    Combines one-hot encoded categorical features into original features."""
+    # Producing shap values
+    explainer = shap.TreeExplainer(classifier)
+    # TODO: Add and evaluate interaction values
+    # shap_interaction = explainer.shap_interaction_values(X)
+    shap_values = explainer(d_train)
+    if combine_cat_features:
+        shap_values = combine_shap_one_hot(shap_values, X.columns, cat_features)
+    return shap_values
+def get_shap_logloss(
+    classifier, d_train, y_true, X, cat_features, combine_cat_features=True
+):
+    """Get shap values of the model log loss for a given classifier and dataset.
+    Combines one-hot encoded categorical features into original features."""
+    explainer_bg_100 = shap.TreeExplainer(
+        classifier,
+        shap.sample(d_train, 100),
+        feature_perturbation="interventional",
+        model_output="log_loss",
+    )
+    shap_values_logloss_all = explainer_bg_100.shap_values(d_train, y_true)
+    if isinstance(shap_values_logloss_all, list) and len(shap_values_logloss_all) == 2:
+        shap_values_logloss_all = shap_values_logloss_all[1]
+    shap_logloss_df = pd.DataFrame(shap_values_logloss_all, columns=d_train.columns)
+    if combine_cat_features:
+        shap_logloss_df = combine_all_one_hot_shap_logloss(
+            shap_logloss_df, X.columns, cat_features
+        )
+    return shap_logloss_df
+def get_fairsd_result_set(
+    X,
+    y_true,
+    y_pred,
+    qf="equalized_odds_difference",
+    method="to_overall",
+    min_quality=0.01,
+    depth=1,
+    min_support=100,
+    result_set_size=30,
+    sensitive_features=None,
+    **kwargs,
+) -> ResultSet:
+    """Get result set from fairsd DSSD task
+    Args:
+        X (pd.DataFrame): Dataset
+        y_true (pd.Series): True labels
+        y_pred (pd.Series): Predicted labels
+        qf (str, optional): Quality function. Defaults to "equalized_odds_difference".
+        depth (int, optional): Depth of subgroup discovery.
+        min_support (int, optional): Minimum support.
+        result_set_size (int, optional): Size of result set.
+        kwargs: Additional arguments to pass to fairsd.SubgroupDiscoveryTask,
+            including result_set_ratio and logging_level (as defined by logging module)
+    """
+    if sensitive_features is not None:
+        X = X[sensitive_features].copy()
+    task = fairsd.SubgroupDiscoveryTask(
+        X,
+        y_true,
+        y_pred,
+        qf=qf,
+        depth=depth,
+        result_set_size=result_set_size,
+        min_quality=min_quality,
+        min_support=min_support,
+        **kwargs,
+    )
+    if "logging_level" in kwargs:
+        logging_level = kwargs["logging_level"]
+    else:
+        logging_level = logging.WARNING
+    if logging_level < logging.WARNING:
+        # Only print if logging level is lower than WARNING
+        print("Running DSSD...")
+        start = time.time()
+    if method == "to_overall":
+        result_set = fairsd.DSSD(beam_width=30).execute(task, method="to_overall")
+    else:
+        result_set = fairsd.DSSD(beam_width=30).execute(task)
+    if logging_level < logging.WARNING:
+        print(f"DSSD Done! Total time: {time.time() - start}")
+    return result_set
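
The missing-value handling in import_dataset can be tried on a toy frame. The following is a minimal sketch of the same three-way rule: a "missing" category for categorical columns, a "missing" string for object columns, and the column mean for everything else. `fill_missing` and the toy data are hypothetical and exist only for this note; unlike the loop in load.py, the helper works on a copy.

```python
import numpy as np
import pandas as pd


def fill_missing(df):
    """Return a copy of df with missing values filled per column type."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype.name == "category":
            # A new category must be registered before it can be used as a fill value.
            out[col] = out[col].cat.add_categories("missing").fillna("missing")
        elif out[col].dtype.name == "object":
            out[col] = out[col].fillna("missing")
        else:
            # Numeric columns: impute with the mean of the observed values.
            out[col] = out[col].fillna(out[col].mean())
    return out


# Toy data; the column names are borrowed from the Adult dataset for readability.
toy = pd.DataFrame(
    {
        "workclass": pd.Categorical(["Private", None, "State-gov"]),
        "age": [30.0, np.nan, 50.0],
    }
)
filled = fill_missing(toy)
print(filled)
```

Registering "missing" as its own category keeps the rows and lets the one-hot encoding step produce a dedicated indicator column for them.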

metrics.py ADDED Viewed

	@@ -0,0 +1,211 @@

+import numpy as np
+import pandas as pd
+from typing import Callable, Union
+from sklearn.metrics import (
+    roc_auc_score,
+    brier_score_loss,
+    log_loss,
+    accuracy_score,
+    f1_score,
+    average_precision_score,
+)
+from sklearn.calibration import calibration_curve
+from fairlearn.metrics import make_derived_metric
+true_positive_score = lambda y_true, y_pred: (y_true & y_pred).sum() / y_true.sum()
+false_positive_score = (
+    lambda y_true, y_pred: ((1 - y_true) & y_pred).sum() / ((1 - y_true)).sum()
+)
+false_negative_score = lambda y_true, y_pred: 1 - true_positive_score(y_true, y_pred)
+Y_PRED_METRICS = (
+    "auprc_diff",
+    "auprc_ratio",
+    "acc_diff",
+    "f1_diff",
+    "f1_ratio",
+    "equalized_odds_diff",
+    "equalized_odds_ratio",
+)
+def average_log_loss_score(y_true, y_pred):
+    """Average log loss function."""
+    return np.mean(log_loss(y_true, y_pred))
+def miscalibration_score(y_true, y_pred, n_bins=10):
+    """Miscalibration score: mean absolute difference between the observed fraction of positives and the mean predicted probability over the calibration bins."""
+    prob_true, prob_pred = calibration_curve(y_true, y_pred, n_bins=n_bins)
+    return np.mean(np.abs(prob_true - prob_pred))
+def get_qf_from_str(
+    metric: str, transform: str = "difference"
+) -> Union[Callable[[pd.Series, pd.Series, pd.Series], float], str]:
+    """Get the quality function from a string.
+    Args:
+        metric (str): Name of the metric, optionally ending in "_diff" or "_ratio", which overrides transform.
+        transform (str): Type of the metric. Can be "difference", "ratio" or other fairlearn supported transforms
+    Returns: the quality function according to the selected metric - a defined function
+    or a string (in case of equalized odds difference as it's already defined in fairsd)
+    """
+    # Preprocess metric string. If it ends with "diff" or "ratio", set transform accordingly
+    if metric.split("_")[-1] == "diff":
+        transform = "difference"
+    elif metric.split("_")[-1] == "ratio":
+        transform = "ratio"
+    metric = trim_transform_from_str(metric).lower()
+    if metric in ("equalized_odds", "eo", "eo_diff"):
+        qf = (
+            "equalized_odds_difference"
+            if transform == "difference"
+            else "equalized_odds_ratio"
+        )
+    elif metric in ("brier_score", "brier_score_loss"):
+        qf = make_derived_metric(metric=brier_score_loss, transform=transform)
+    elif metric in ("log_loss", "loss", "total_loss"):
+        qf = make_derived_metric(metric=log_loss, transform=transform)
+    elif metric in ("accuracy", "accuracy_score", "acc"):
+        qf = make_derived_metric(metric=accuracy_score, transform=transform)
+    elif metric in ("f1", "f1_score"):
+        qf = make_derived_metric(metric=f1_score, transform=transform)
+    elif metric in ("al", "average_loss", "average_log_loss"):
+        qf = make_derived_metric(metric=average_log_loss_score, transform=transform)
+    elif metric in ("roc_auc", "auroc", "auc_roc", "roc_auc_score"):
+        qf = make_derived_metric(metric=roc_auc_score, transform=transform)
+    elif metric in ("miscalibration", "miscal", "cal", "calibration"):
+        qf = make_derived_metric(metric=miscalibration_score, transform=transform)
+    elif metric in (
+        "auprc",
+        "pr_auc",
+        "precision_recall_auc",
+        "average_precision_score",
+    ):
+        qf = make_derived_metric(metric=average_precision_score, transform=transform)
+    elif metric in ("false_positive_rate", "fpr"):
+        qf = make_derived_metric(metric=false_positive_score, transform=transform)
+    elif metric in ("true_positive_rate", "tpr"):
+        qf = make_derived_metric(metric=true_positive_score, transform=transform)
+    elif metric in ("fnr", "false_negative_rate"):
+        qf = make_derived_metric(metric=false_negative_score, transform=transform)
+    else:
+        raise ValueError(
+            f"Metric: {metric} not supported. "
+            "Metric must be one of the following: "
+            "equalized_odds, brier_score_loss, log_loss, accuracy_score, average_loss, "
+            "roc_auc_diff, miscalibration_diff, auprc_diff, fpr_diff, tpr_diff"
+        )
+    return qf
+def get_name_from_metric_str(metric: str) -> str:
+    """Get the name of the metric from a string nicely formatted."""
+    metric = trim_transform_from_str(metric)
+    if metric in ("equalized_odds", "eo"):
+        return "Equalized Odds"
+    # Split words and Capitalize the first letters
+    return " ".join(
+        [
+            word.upper()
+            if word in ("auprc", "auroc", "auc", "roc", "prc", "tpr", "fpr", "fnr")
+            else word.capitalize()
+            for word in metric.split("_")
+        ]
+    )
+def trim_transform_from_str(metric: str) -> str:
+    """Trim the transform from a string."""
+    if metric.split("_")[-1] == "diff" or metric.split("_")[-1] == "ratio":
+        metric = "_".join(metric.split("_")[:-1])
+    return metric
+def get_quality_metric_from_str(metric: str) -> Callable[[pd.Series, pd.Series], float]:
+    """Get the quality metric from a string."""
+    if metric.split("_")[-1] == "diff" or metric.split("_")[-1] == "ratio":
+        metric = "_".join(metric.split("_")[:-1]).lower()
+    if metric in ("equalized_odds", "eo"):
+        # Report TPR and FPR together as a formatted string
+        return (
+            lambda y_true, y_pred:
+            "TPR: "
+            + str(true_positive_score(y_true, y_pred).round(3))
+            + "; FPR: "
+            + str(false_positive_score(y_true, y_pred).round(3))
+        )
+    elif metric in ("brier_score", "brier_score_loss"):
+        quality_metric = brier_score_loss
+    elif metric in ("log_loss", "loss", "total_loss"):
+        quality_metric = log_loss
+    elif metric in ("accuracy", "accuracy_score", "acc"):
+        quality_metric = accuracy_score
+    elif metric in ("f1", "f1_score"):
+        quality_metric = f1_score
+    elif metric in ("al", "average_loss", "average_log_loss"):
+        quality_metric = average_log_loss_score
+    elif metric in ("roc_auc", "auroc", "auc_roc", "roc_auc_score"):
+        quality_metric = roc_auc_score
+    elif metric in ("miscalibration", "miscal", "cal", "calibration"):
+        quality_metric = miscalibration_score
+    elif metric in (
+        "auprc",
+        "pr_auc",
+        "precision_recall_auc",
+        "average_precision_score",
+    ):
+        quality_metric = average_precision_score
+    elif metric in ("false_positive_rate", "fpr"):
+        quality_metric = false_positive_score
+    elif metric in ("true_positive_rate", "tpr"):
+        quality_metric = true_positive_score
+    elif metric in ("fnr", "false_negative_rate"):
+        quality_metric = false_negative_score
+    else:
+        raise ValueError(
+            f"Metric: {metric} not supported. "
+            "Metric must be one of the following: "
+            "equalized_odds, brier_score_loss, log_loss, accuracy_score, "
+            "average_loss, roc_auc_diff, miscalibration_diff, auprc_diff, fpr_diff, tpr_diff"
+        )
+    # np.round also works when the metric returns a plain Python float
+    return lambda y_true, y_pred: np.round(quality_metric(y_true, y_pred), 3)
+def sort_quality_metrics_df(
+    result_set_df: pd.DataFrame, quality_metric: str
+) -> pd.DataFrame:
+    """Sort the result set dataframe by the quality metric."""
+    # If quality_metric ends with ratio
+    if quality_metric.split("_")[-1] == "ratio":
+        # Ratio metrics are ranked by the "quality" column, highest first
+        result_set_df = result_set_df.sort_values(by="quality", ascending=False)
+    elif quality_metric.split("_")[-1] in ("difference", "diff"):
+        # If the metric is a score (higher is better), e.g. accuracy, AUC or F1
+        if (
+            "acc" in quality_metric
+            or "au" in quality_metric
+            or "f1" in quality_metric
+        ):
+            # Sort the result_set_df in ascending order based on the metric_score
+            result_set_df = result_set_df.sort_values(
+                by="metric_score", ascending=True
+            )
+        # Otherwise the metric is treated as a loss or error rate (lower is better)
+        else:
+            # Sort the result_set_df in descending order based on the metric_score
+            result_set_df = result_set_df.sort_values(by="metric_score", ascending=False)
+    else:
+        raise ValueError(
+            f"Metric must be either a difference or a ratio! Provided metric: {quality_metric}"
+        )
+    return result_set_df
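
As a reading aid for `miscalibration_score` above, here is a minimal pure-Python sketch of the same idea: bin the predicted probabilities, then average the absolute gap between the observed fraction of positives and the mean predicted probability per bin. The function name `miscalibration_sketch` is hypothetical, and the sketch assumes uniform-width bins and skips empty ones. It is not sklearn's `calibration_curve` implementation, so handling of values that fall exactly on a bin edge may differ.

```python
def miscalibration_sketch(y_true, y_prob, n_bins=10):
    """Mean absolute gap between observed and predicted positive rates per bin."""
    # Per uniform-width bin: sample count, number of positives, summed probabilities.
    counts = [0] * n_bins
    positives = [0] * n_bins
    prob_sums = [0.0] * n_bins
    for label, prob in zip(y_true, y_prob):
        # Clamp the index so that prob == 1.0 lands in the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        counts[idx] += 1
        positives[idx] += label
        prob_sums[idx] += prob
    # Gap between fraction of positives and mean predicted probability,
    # computed only for non-empty bins (assumption: empty bins are ignored).
    gaps = [
        abs(positives[i] / counts[i] - prob_sums[i] / counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]
    return sum(gaps) / len(gaps)


# Toy data (made up for illustration): two bins, each off by about 0.2.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.3, 0.7, 0.9]
score = miscalibration_sketch(y_true, y_prob, n_bins=2)
print(round(score, 3))
```

A score of 0 would mean that, within every non-empty bin, the mean predicted probability matches the observed positive rate.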

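The `_diff` and `_ratio` suffixes handled in `get_qf_from_str` select how a base metric is compared across groups. The sketch below illustrates that comparison in plain Python under two assumptions: the groups are the subgroup and its complement, and the transforms behave like fairlearn's between-group comparison as I understand it (difference as max minus min, ratio as min over max). The helper names `accuracy` and `subgroup_disparity` are hypothetical and are not part of fairlearn or fairsd.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def subgroup_disparity(metric, y_true, y_pred, in_subgroup, transform="difference"):
    """Compare a metric between a subgroup and the remaining samples."""
    # Split (label, prediction) pairs by subgroup membership.
    sg = [(t, p) for t, p, m in zip(y_true, y_pred, in_subgroup) if m]
    rest = [(t, p) for t, p, m in zip(y_true, y_pred, in_subgroup) if not m]
    sg_score = metric(*zip(*sg))
    rest_score = metric(*zip(*rest))
    low, high = sorted((sg_score, rest_score))
    if transform == "difference":
        return high - low  # 0 means both groups score the same
    if transform == "ratio":
        return low / high  # 1 means parity; undefined when high == 0
    raise ValueError(f"Unsupported transform: {transform}")


# Toy data (made up for illustration).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]
in_subgroup = [False, False, True, True, False, True]

diff_value = subgroup_disparity(accuracy, y_true, y_pred, in_subgroup, "difference")
ratio_value = subgroup_disparity(accuracy, y_true, y_pred, in_subgroup, "ratio")
print(round(diff_value, 3), round(ratio_value, 3))
```

In this toy case the subgroup accuracy is 1/3 and the accuracy on the remaining samples is 1, so a larger difference and a smaller ratio both point to the same disparity.
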
plot.py ADDED

	@@ -0,0 +1,911 @@

+import numpy as np
+import pandas as pd
+from dash import dash_table
+import plotly.express as px
+import plotly.graph_objects as go
+from sklearn.calibration import calibration_curve
+from sklearn.metrics import (
+    auc,
+    average_precision_score,
+    log_loss,
+    precision_recall_curve,
+    roc_auc_score,
+    roc_curve,
+)
+from metrics import (
+    Y_PRED_METRICS,
+    get_quality_metric_from_str,
+    get_name_from_metric_str,
+    miscalibration_score,
+)
+from scipy.stats import ks_2samp #, wasserstein_distance
+COLS_TO_SHOW = [
+    "quality",
+    "size",
+    "description",
+    "auc_diff",
+    "f1_diff",
+    "exp_js_div",
+    "exp_mi",
+    "exp_ks_test_p_val",
+    "stat_diff_exp",
+]
+def plot_pr_curves(y_true, y_pred, y_pred_prob, sg_feature, title=None):
+    """Plots PRC curves for the subgroup and the baseline of the subgroup"""
+    baseline_size = len(y_true)
+    precision, recall, thresholds = precision_recall_curve(y_true, y_pred_prob)
+    auprc_baseline = average_precision_score(y_true, y_pred)
+    # Plot with plotly
+    fig = go.Figure()
+    # Sort the precision, recall and thresholds according to recall.
+    # zip drops the final (precision=1, recall=0) point, which has no matching threshold
+    precision, recall, thresholds = zip(
+        *sorted(zip(precision, recall, thresholds), key=lambda x: x[1])
+    )
+    fig.add_trace(
+        go.Scatter(
+            x=recall,
+            y=precision,
+            name="Baseline (area = %0.2f, n = %d)" % (auprc_baseline, baseline_size),
+            mode="lines+markers",
+            customdata=thresholds,
+            hovertemplate="Recall: %{x}<br>Precision: %{y}<br>Threshold: %{customdata}",
+        )
+    )
+    subgroup_size = sum(sg_feature)
+    sg_precision, sg_recall, sg_thresholds = precision_recall_curve(
+        y_true[sg_feature], y_pred_prob[sg_feature]
+    )
+    auprc_subgroup = average_precision_score(y_true[sg_feature], y_pred[sg_feature])
+    sg_precision, sg_recall, sg_thresholds = zip(
+        *sorted(zip(sg_precision, sg_recall, sg_thresholds), key=lambda x: x[1])
+    )
+    fig.add_trace(
+        go.Scatter(
+            x=sg_recall,
+            y=sg_precision,
+            name="Subgroup (area = %0.2f, n = %d)" % (auprc_subgroup, subgroup_size),
+            mode="lines+markers",
+            customdata=sg_thresholds,
+            hovertemplate="Recall: %{x}<br>Precision: %{y}<br>Threshold: %{customdata}",
+        )
+    )
+    )
+    # Mark axes
+    fig.update_xaxes(title="Recall")
+    fig.update_yaxes(title="Precision")
+    # Update title
+    if title:
+        fig.update_layout(title=title)
+    # Update height
+    fig.update_layout(height=550)
+    return fig
+def plot_roc_curves(y_true, y_pred_prob, sg_feature, title=None):
+    """Plots ROC curves for the subgroup and the baseline of the subgroup"""
+    fig = go.Figure()
+    # Plot ROC curve for baseline
+    baseline_size = len(y_true)
+    baseline_fpr, baseline_tpr, thresholds = roc_curve(y_true, y_pred_prob)
+    roc_auc_group2 = auc(baseline_fpr, baseline_tpr)
+    fig.add_trace(
+        go.Scatter(
+            x=baseline_fpr,
+            y=baseline_tpr,
+            name="Baseline (area = %0.3f, n = %d)" % (roc_auc_group2, baseline_size),
+            mode="lines+markers",
+            customdata=thresholds,
+            hovertemplate="False Positive Rate: %{x}<br>True Positive Rate: %{y}<br>Threshold: %{customdata}",
+        )
+    )
+    # Plot ROC curve for subgroup
+    group_size1 = sum(sg_feature)
+    sg_fpr, sg_tpr, sg_thresholds = roc_curve(
+        y_true[sg_feature], y_pred_prob[sg_feature]
+    )
+    auroc_subgroup = auc(sg_fpr, sg_tpr)
+    fig.add_trace(
+        go.Scatter(
+            x=sg_fpr,
+            y=sg_tpr,
+            name="Subgroup (area = %0.3f, n = %d)" % (auroc_subgroup, group_size1),
+            mode="lines+markers",
+            customdata=sg_thresholds,
+            hovertemplate="False Positive Rate: %{x}<br>True Positive Rate: %{y}<br>Threshold: %{customdata}",
+        )
+    )
+    # Mark axes
+    fig.update_xaxes(title="False Positive Rate")
+    fig.update_yaxes(title="True Positive Rate")
+    # Update title
+    if title:
+        fig.update_layout(title=title)
+    # Update height
+    fig.update_layout(height=550)
+    return fig
+def plot_calibration_curve(
+    y_true, y_pred_prob, sg_feature, n_bins=10, strategy="uniform"
+):
+    """Plots calibration curves of a classifier for the subgroup and the baseline"""
+    fig = go.Figure()
+    for group in ["Baseline", "Subgroup"]:
+        group_filter = (
+            sg_feature if group == "Subgroup" else pd.Series([True] * len(y_true))
+        )
+        cal_curve = calibration_curve(
+            y_true=y_true[group_filter],
+            y_prob=y_pred_prob[group_filter],
+            n_bins=n_bins,
+            strategy=strategy,
+        )
+        # Write the calibration plot with plotly
+        fig.add_trace(
+            go.Scatter(
+                x=cal_curve[1],
+                y=cal_curve[0],
+                name=group
+                + "<br> (miscalibration score = %0.3f, n = %d)"
+                % (
+                    miscalibration_score(
+                        y_true[group_filter], y_pred_prob[group_filter], n_bins=n_bins
+                    ),
+                    sum(group_filter),
+                ),
+                # Add number of datapoints included at each point
+                customdata=group_filter.sum() * range(1, n_bins + 1) // n_bins,
+                mode="lines+markers",
+                hovertemplate="Mean predicted probability: %{x}<br>Fraction of positives: %{y}<br>Subgroup size: %{customdata}",
+            )
+        )
+    # Add perfect calibration line
+    fig.add_trace(
+        go.Scatter(
+            x=[0, 1],
+            y=[0, 1],
+            mode='lines',
+            name='Perfect Calibration',
+            line=dict(dash='dash')
+        )
+    )
+    fig.update_layout(title="Calibration curves for the selected subgroup and baseline")
+    fig.update_xaxes(title_text="Mean predicted probability")
+    fig.update_yaxes(title_text="Fraction of positives")
+    # Update height
+    fig.update_layout(height=550)
+    return fig
+def get_sg_hist(y_df_local, categories=["TN", "FN", "TP", "FP"], title=None):
+    """Returns a histogram of the predictions for the subgroup
+    Args:
+        y_df_local (pd.DataFrame): A dataframe with the true labels and the predictions
+    """
+    y_df_local = y_df_local[y_df_local["category"].isin(categories)]
+    sg_hist = px.histogram(
+        y_df_local,
+        x="probability",
+        color="category",
+        hover_data=y_df_local.columns,
+        category_orders={"category": categories},
+    )
+    # Hide TN and TP by default
+    # sg_hist.for_each_trace(lambda t: t.update(visible="legendonly") if t.name in ["TN", "TP"] else t)
+    sg_hist.update_xaxes(range=[0, 1])
+    sg_hist.update_traces(
+        xbins=dict(
+            start=0,
+            end=1,
+            size=0.1,
+        ),
+    )
+    if title is None:
+        title = "Histogram of prediction probabilities for the selected subgroup"
+    sg_hist.update_layout(
+        title_text=title, legend_title_text="", modebar_remove=["zoom", "pan"]
+    )
+    sg_hist.layout.xaxis.fixedrange = True
+    sg_hist.layout.yaxis.fixedrange = True
+    # Update histogram height
+    sg_hist.update_layout(height=550)
+    return sg_hist
+def get_data_table(
+    subgroup_description, y_true, y_pred, y_pred_prob, qf_metric, sg_feature, n_bins=10
+):
+    """Generates a data table with the subgroup description and the subgroup size"""
+    # tpr = true_positive_score(y_true[sg_feature], y_pred[sg_feature]).round(3)
+    # fpr = false_positive_score(y_true[sg_feature], y_pred[sg_feature]).round(3)
+    auroc = roc_auc_score(y_true[sg_feature], y_pred_prob[sg_feature]).round(3)
+    # auprc = average_precision_score(y_true[sg_feature], y_pred[sg_feature]).round(3)
+    cal_score = miscalibration_score(
+        y_true[sg_feature], y_pred_prob[sg_feature], n_bins=n_bins
+    ).round(3)
+    # avg_loss = log_loss(y_true[sg_feature], y_pred_prob[sg_feature]).round(3)
+    brier_score = np.mean((y_true[sg_feature] - y_pred_prob[sg_feature]) ** 2).round(3)
+    metric_name = get_name_from_metric_str(qf_metric)
+    if qf_metric in Y_PRED_METRICS:
+        quality_score = get_quality_metric_from_str(qf_metric)(
+            y_true[sg_feature], y_pred[sg_feature]
+        )
+    else:
+        quality_score = get_quality_metric_from_str(qf_metric)(
+            y_true[sg_feature], y_pred_prob[sg_feature]
+        )
+    # fp = sum((y_true[sg_feature] == 0) & (y_pred[sg_feature] == 1))
+    # fn = sum((y_true[sg_feature] == 1) & (y_pred[sg_feature] == 0))
+    # Generate a data table with the subgroup description
+    table_content = {
+        "Statistic": [
+            "Description",
+            "Size",
+            "AUROC",
+            "Miscalibration score",
+            "Brier score",
+            metric_name
+        ],
+        "Value": [
+            subgroup_description,
+            sg_feature.sum(),
+            auroc,
+            cal_score,
+            brier_score,
+            quality_score
+        ],
+    }
+    # if metric_name != "Average Log Loss":
+    #     table_content["Statistic"].append(metric_name)
+    #     table_content["Value"].append(quality_score)
+    df = pd.DataFrame(table_content)
+    data_table = dash_table.DataTable(
+        style_table={"overflowX": "auto"},
+        style_data={"whiteSpace": "normal", "height": "auto"},
+        data=df.to_dict("records"),
+        style_cell_conditional=[
+            {"if": {"column_id": "Feature"}, "width": "35%"},
+        ],
+    )
+    return data_table
+def get_data_distr_charts(
+    X, y_pred, sg_feature, feature, description, nbins=20, agg="percentage"
+):
+    """For positive and negative labels and predictions, returns a figure with the histogram distribution across
+    feature values of the selected feature in the subgroup and the baseline"""
+    # Get the data distribution for the feature values of the selected feature in the subgroup and the baseline
+    class_plot = get_data_distr_chart(
+        X, sg_feature, feature, description, nbins, agg
+    )
+    class_plot.update_layout(
+        title="Data distribution for "
+        + feature
+        + f" in the subgroup ({description}) and the baseline"
+    )
+    # Get the predictions distribution for the feature values of the selected feature in the subgroup and the baseline
+    pred_plot = get_data_distr_chart(
+        y_pred, sg_feature, feature, description, nbins, agg
+    )
+    pred_plot.update_layout(
+        title="Predictions distribution for "
+        + feature
+        + f" in the subgroup ({description}) and the baseline"
+    )
+    if agg == "percentage":
+        # Fix the y-axis range to 0-100 percent
+        class_plot.update_layout(yaxis=dict(range=[0, 100]))
+        pred_plot.update_layout(yaxis=dict(range=[0, 100]))
+    return class_plot, pred_plot
+def get_data_distr_chart(
+    X, sg_feature, feature, description, nbins=20, agg="percentage"
+):
+    """Returns a figure with the data distribution for the feature values of the selected feature in the subgroup and the baseline"""
+    if isinstance(X, pd.DataFrame) and feature in X.columns:
+        X = X[feature]
+    else:
+        X = pd.Series(X).astype("str")
+    X_sg = X[sg_feature].copy()
+    fig = go.Figure()
+    # Add trace for baseline
+    fig.add_trace(
+        go.Histogram(
+            x=X,
+            name="Baseline",
+            histnorm="percent" if agg == "percentage" else "",
+            nbinsx=nbins,
+        )
+    )
+    # Add trace for subgroup
+    fig.add_trace(
+        go.Histogram(
+            x=X_sg,
+            name="Subgroup",
+            histnorm="percent" if agg == "percentage" else "",
+            nbinsx=nbins,
+        )
+    )
+    # Update layout
+    fig.update_layout(
+        title="Data distribution for "
+        + feature
+        + f" in the subgroup ({description}) and the baseline",
+        xaxis_title=feature,
+        yaxis_title=agg.capitalize() + " of data in respective group",
+    )
+    # Update yaxis range to max value
+    # max_y = max(
+    #     max(fig.data[0].y), # FIXME: How to get the max value of the histogram object?
+    #     max(fig.data[1].y)
+    # )
+    # fig.update_layout(yaxis=dict(range=[0, max_y]))
+    return fig
+def get_feat_val_violin_plots(shap_values_df, sg_feature=None):
+    """Returns a violin plot for the features SHAP values in the subgroup and the baseline"""
+    shap_values_df = transform_box_data(shap_values_df, sg_feature)
+    # Create the fig and add hoverdata of confidence intervals
+    fig = px.violin(
+        shap_values_df,
+        x="feature",
+        y="shap_value",
+        color="group",
+        points="all",
+        title="Feature contributions to model loss for subgroup and baseline. <br> "
+        + "The lower the value, the higher its informativeness.",
+        hover_data=shap_values_df.columns,
+        height=600,
+    )
+    # Update layout
+    fig.update_layout(
+        yaxis_title="SHAP log loss value",
+        xaxis_title="Feature",
+        violinmode="group",
+    )
+    # Update height
+    fig.update_layout(height=600)
+    return fig
+def get_feat_val_violin_plot(X, shap_df, sg_feature, feature, description, nbins=20):
+    """Returns a figure with a violin plot for the feature value SHAP contributions in the subgroup and the baseline"""
+    feature_type = "categorical"
+    orig_df = X.copy()
+    # If data is continuous, we need to bin it
+    if X[feature].dtype in [np.float64, np.int64]:
+        X[feature] = pd.cut(X[feature], bins=nbins)
+        bins = X[feature].cat.categories.astype(str)
+        sorted_bins = sorted(
+            bins, key=lambda x: float(x.split(",")[0].strip("(").strip(" ").strip("]"))
+        )
+        X[feature] = X[feature].astype(str)
+        feature_type = "continuous"
+    # Merge the shap values with the feature values such that we can plot the violin plot of shap values per feature value
+    concat_df = pd.concat([shap_df[feature], X[feature]], axis=1)
+    concat_df.columns = ["SHAP", feature]
+    # Create a violin plot for the feature values
+    fig = go.Figure()
+    # Add trace for baseline
+    fig.add_trace(
+        go.Violin(
+            x=concat_df[feature],
+            y=concat_df["SHAP"],
+            name="Baseline",
+            box_visible=True,
+            meanline_visible=True,
+            points="all",
+            text=orig_df.apply(lambda row: ' '.join(f'{i}:{v}' for i, v in row.items()), axis=1),
+            hoverinfo="y+text",
+        )
+    )
+    # Add trace for subgroup
+    fig.add_trace(
+        go.Violin(
+            x=concat_df[feature][sg_feature],
+            y=concat_df["SHAP"][sg_feature],
+            name="Subgroup",
+            box_visible=True,
+            meanline_visible=True,
+            points="all",
+            text=orig_df[sg_feature].apply(lambda row: ' '.join(f'{i}:{v}' for i, v in row.items()), axis=1),
+            hoverinfo="y+text",
+        )
+    )
+    slider_note = (
+        "Use slider below to adjust the (max) number of bins for the violin plot"
+        if feature_type == "continuous"
+        else "When the selected feature is categorical, slider changes do not affect the plot."
+    )
+    # Update layout
+    fig.update_layout(
+        title="Feature distribution for "
+        + feature
+        + f" in the subgroup ({description}) and the baseline <br>"
+        + "The lower the feature's values, the higher its informativeness",
+        xaxis_title=f"Feature values of {feature} <br> Feature type: {feature_type}; {slider_note}",
+        yaxis_title="SHAP log loss value",
+        violinmode="group",
+        # yaxis=dict(range=[-0.4, 0.4])
+    )
+    # If feature is continuous, we need to sort the bins
+    if feature_type == "continuous":
+        fig.update_xaxes(categoryorder="array", categoryarray=sorted_bins)
+    else:
+        fig.update_xaxes(categoryorder="category ascending")
+    # Update height
+    fig.update_layout(height=700)
+    return fig
+def get_feat_val_box(X, shap_df, sg_feature, feature, description, nbins=20):
+    """Returns a figure with a box plot for the feature value SHAP contributions in the subgroup and the baseline"""
+    feature_type = "categorical"
+    orig_df = X.copy()
+    # If data is continuous, we need to bin it
+    if X[feature].dtype in [np.float64, np.int64]:
+        X[feature] = pd.cut(X[feature], bins=nbins)
+        bins = X[feature].cat.categories.astype(str)
+        sorted_bins = sorted(
+            bins, key=lambda x: float(x.split(",")[0].strip("(").strip(" ").strip("]"))
+        )
+        X[feature] = X[feature].astype(str)
+        feature_type = "continuous"
+    # Merge the shap values with the feature values such that we can plot the box plot of shap values per feature value
+    concat_df = pd.concat([shap_df[feature], X[feature]], axis=1)
+    concat_df.columns = ["SHAP", feature]
+    # Create a box plot for the feature values
+    fig = go.Figure()
+    # Add trace for baseline
+    fig.add_trace(
+        go.Box(
+            x=concat_df[feature],
+            y=concat_df["SHAP"],
+            name="Baseline",
+            boxpoints="all",
+            text=orig_df.apply(lambda row: ' '.join(f'{i}:{v}' for i, v in row.items()), axis=1),
+            hoverinfo="y+text",
+        )
+    )
+    # Add trace for subgroup
+    fig.add_trace(
+        go.Box(
+            x=concat_df[feature][sg_feature],
+            y=concat_df["SHAP"][sg_feature],
+            name="Subgroup",
+            boxpoints="all",
+            text=orig_df[sg_feature].apply(lambda row: ' '.join(f'{i}:{v}' for i, v in row.items()), axis=1),
+            hoverinfo="y+text",
+        )
+    )
+    slider_note = (
+        "Use slider below to adjust the (max) number of bins for the box plot"
+        if feature_type == "continuous"
+        else "When the selected feature is categorical, slider changes do not affect the plot."
+    )
+    # Update layout
+    fig.update_layout(
+        title="Feature distribution for "
+        + feature
+        + f" in the subgroup ({description}) and the baseline <br>"
+        + "The lower the feature's values, the higher its informativeness",
+        xaxis_title=f"Feature"
+        + f" values of {feature} <br> Feature type: {feature_type}; {slider_note}",
+        yaxis_title="SHAP log loss value",
+    )
+    # If feature is continuous, we need to sort the bins
+    if feature_type == "continuous":
+        fig.update_xaxes(categoryorder="array", categoryarray=sorted_bins)
+    else:
+        fig.update_xaxes(categoryorder="category ascending")
+    # Update height
+    fig.update_layout(height=700)
+    return fig
+def get_feat_val_bar(X, shap_df, sg_feature, feature, description, nbins=20, agg="mean"):
+    """Returns a figure with a bar plot for the feature value SHAP contributions in the subgroup and the baseline"""
+    feature_type = "categorical"
+    orig_df = X.copy()
+    # If data is continuous, we need to bin it
+    if X[feature].dtype in [np.float64, np.int64]:
+        X[feature] = pd.cut(X[feature], bins=nbins)
+        bins = X[feature].cat.categories.astype(str)
+        sorted_bins = sorted(
+            bins, key=lambda x: float(x.split(",")[0].strip("(").strip(" ").strip("]"))
+        )
+        X[feature] = X[feature].astype(str)
+        feature_type = "continuous"
+    # Merge the shap values with the feature values such that we can plot the bar plot of shap values per feature value
+    concat_df = pd.concat([shap_df[feature], X[feature]], axis=1)
+    concat_df.columns = ["SHAP", feature]
+    # Create a bar plot for the feature values
+    fig = go.Figure()
+    if agg == "sum_weighted":
+        # Calculate the weighted sum of the SHAP values by using a sum divided over the total group size
+        baseline_df = concat_df.groupby(feature)["SHAP"].agg("sum") / len(concat_df)
+        sg_df = concat_df[sg_feature].groupby(feature)["SHAP"].agg("sum") / len(concat_df[sg_feature])
+        title = "Weighted sum"
+    else:
+        baseline_df = concat_df.groupby(feature)["SHAP"].agg(agg)
+        sg_df = concat_df[sg_feature].groupby(feature)["SHAP"].agg(agg)
+        title = agg.capitalize()
+    # Add trace for baseline
+    fig.add_trace(
+        go.Bar(
+            x=baseline_df.index,
+            y=baseline_df.values,
+            name="Baseline",
+            text=orig_df.apply(lambda row: ' '.join(f'{i}:{v}' for i, v in row.items()), axis=1),
+            hoverinfo="y+text",
+        )
+    )
+    # Add trace for subgroup
+    fig.add_trace(
+        go.Bar(
+            x=sg_df.index,
+            y=sg_df.values,
+            name="Subgroup",
+            text=orig_df[sg_feature].apply(lambda row: ' '.join(f'{i}:{v}' for i, v in row.items()), axis=1),
+            hoverinfo="y+text",
+        )
+    )
+    slider_note = (
+        "Use slider below to adjust the (max) number of bins for the bar plot"
+        if feature_type == "continuous"
+        else "When the selected feature is categorical, slider changes do not affect the plot."
+    )
+    # Update layout
+    fig.update_layout(
+        title="SHAP feature value loss contributions for "
+        + feature
+        + f" in the subgroup ({description}) and the baseline <br>"
+        + "The lower the feature's values, the higher its informativeness",
+        xaxis_title=f"Feature"
+        + f" values of {feature} <br> Feature type: {feature_type}; {slider_note}",
+        yaxis_title=title + " of SHAP log loss values",
+    )
+    # If feature is continuous, we need to sort the bins
+    if feature_type == "continuous":
+        fig.update_xaxes(categoryorder="array", categoryarray=sorted_bins)
+    else:
+        fig.update_xaxes(categoryorder="category ascending")
+    # Update height
+    fig.update_layout(height=700)
+    return fig
+def get_feat_bar(shap_values_df, sg_feature, agg, error_bars=False) -> go.Figure:
+    """Returns a figure with the feature contributions to the model loss
+    Args:
+        shap_values_df (pd.DataFrame): The shap values dataframe
+        sg_feature (pd.Series): The subgroup feature
+        agg (str): The aggregation method (mean, median, sum, mean_diff)
+        error_bars (bool): Whether to add error bars to the plot
+    Returns:
+        go.Figure: The figure with the feature contributions to the model loss
+    """
+    sg_shap_values_df = shap_values_df[sg_feature]
+    if agg == "mean":
+        # Get the mean (signed) shap value for each feature
+        shap_values_df_agg = shap_values_df.mean(numeric_only=True)
+        sg_shap_values_agg = sg_shap_values_df.mean(numeric_only=True)
+    elif agg == "median":
+        # Get the median (signed) shap value for each feature
+        shap_values_df_agg = shap_values_df.median(numeric_only=True)
+        sg_shap_values_agg = sg_shap_values_df.median(numeric_only=True)
+    elif agg == "sum":
+        # Get the sum of (signed) shap values for each feature
+        shap_values_df_agg = shap_values_df.sum(numeric_only=True)
+        sg_shap_values_agg = sg_shap_values_df.sum(numeric_only=True)
+    elif agg == "mean_diff":
+        # Get the difference in mean shap value for each feature
+        shap_values_df_agg = sg_shap_values_df.mean(numeric_only=True) - shap_values_df.mean(numeric_only=True)
+        # Then produce only a single bar plot with the difference
+        fig = go.Figure()
+        fig.add_trace(
+            go.Bar(
+                x=shap_values_df_agg.index,
+                y=shap_values_df_agg,
+                name="Difference (Subgroup - Baseline)",
+            )
+        )
+        fig.update_layout(
+            title="Difference in mean of the SHAP feature contributions to model loss (Subgroup - Baseline)",
+            xaxis_title="Feature",
+            yaxis_title="Difference in SHAP contributions to loss",
+        )
+        return fig
+    # Combine the two dataframes
+    shap_values_df_agg = pd.concat([shap_values_df_agg, sg_shap_values_agg], axis=1)
+    shap_values_df_agg.columns = ["Baseline", "Subgroup"]
+    # Create the fig and add hoverdata of confidence intervals
+    fig = go.Figure()
+    ytitle = agg.capitalize() + " SHAP log loss value - feature contribution to loss"
+    if error_bars:
+        ytitle += " <br> With standard deviation error bars"
+        fig.add_trace(
+            go.Bar(
+                x=shap_values_df_agg.index,
+                y=shap_values_df_agg["Baseline"],
+                name="Baseline",
+                customdata=shap_values_df.std(axis=0, numeric_only=True),
+                hovertemplate="Feature: %{x}<br>Baseline: %{y}<br>Standard deviation: %{customdata}",
+                error_y=dict(
+                    type="data",
+                    array=shap_values_df.std(axis=0, numeric_only=True),
+                    visible=True,
+                ),
+            )
+        )
+        fig.add_trace(
+            go.Bar(
+                x=shap_values_df_agg.index,
+                y=shap_values_df_agg["Subgroup"],
+                name="Subgroup",
+                customdata=sg_shap_values_df.std(axis=0, numeric_only=True),
+                hovertemplate="Feature: %{x}<br>Subgroup: %{y}<br>Standard deviation: %{customdata}",
+                error_y=dict(
+                    type="data",
+                    array=sg_shap_values_df.std(axis=0, numeric_only=True),
+                    visible=True,
+                ),
+            )
+        )
+    else:
+        fig.add_trace(
+            go.Bar(
+                x=shap_values_df_agg.index,
+                y=shap_values_df_agg["Baseline"],
+                name="Baseline",
+            )
+        )
+        fig.add_trace(
+            go.Bar(
+                x=shap_values_df_agg.index,
+                y=shap_values_df_agg["Subgroup"],
+                name="Subgroup",
+            )
+        )
+    # Update the fig
+    fig.update_layout(
+        barmode="group",
+        yaxis_tickangle=-45,
+        title="Feature contributions to model loss for subgroup and baseline. <br> "
+        + "The lower the value, the higher its informativeness.",
+        yaxis_title=ytitle,
+        xaxis_title="Feature",
+        height=600,
+    )
+    # Order x axis by the difference in absolute values (|Subgroup| - |Baseline|), ascending
+    fig.update_xaxes(
+        categoryorder="array",
+        categoryarray=shap_values_df_agg.abs().diff(axis=1).sort_values(by="Subgroup").index,
+    )
+    # Rotate the x-axis tick labels
+    fig.update_layout(xaxis_tickangle=-25)
+    return fig
+def get_feat_box(shap_values_df, sg_feature=None) -> go.Figure:
+    """Returns a box plot of the feature contributions to the model loss
+    Args:
+        shap_values_df (pd.DataFrame): The shap values dataframe
+        sg_feature (pd.Series): Boolean mask selecting the subgroup rows.
+            If None, only the baseline is plotted
+    Returns:
+        go.Figure: The figure with the feature contributions to the model loss
+    """
+    shap_values_df = transform_box_data(shap_values_df, sg_feature)
+    # Create the box plot, one box per feature and group (baseline / subgroup)
+    fig = px.box(
+        shap_values_df,
+        x="feature",
+        y="shap_value",
+        color="group",
+        points=False,
+        title="Feature contributions to model loss for subgroup and baseline. <br> "
+        + "The lower the value, the higher its informativeness.",
+        hover_data=shap_values_df.columns,
+        height=600,
+    )
+    # Update the fig
+    fig.update_layout(
+        yaxis_title="SHAP log loss value - feature contribution to loss",
+        xaxis_title="Feature",
+    )
+    fig.update_traces(boxmean=True)
+    return fig
+def transform_box_data(shap_values_df, sg_feature):
+    """Transforms the shap values dataframe to be suitable for a box plot"""
+    shap_values_df = shap_values_df.drop(columns="group", errors="ignore")
+    if sg_feature is not None:
+        sg_shap_values_df = shap_values_df[sg_feature]
+        # Put all shap values of different features in a single column with feature names as a new column
+        sg_shap_values_df = sg_shap_values_df.reset_index()
+        sg_shap_values_df = sg_shap_values_df.melt(
+            id_vars="index", var_name="feature", value_name="shap_value"
+        )
+        sg_shap_values_df["group"] = "Subgroup"
+    # Put all shap values of different features in a single column with feature names as a new column
+    shap_values_df = shap_values_df.reset_index()
+    shap_values_df = shap_values_df.melt(
+        id_vars="index", var_name="feature", value_name="shap_value"
+    )
+    shap_values_df["group"] = "Baseline"
+    if sg_feature is not None:
+        # Combine the two dataframes
+        shap_values_df = pd.concat([shap_values_df, sg_shap_values_df], axis=0)
+    # Drop column index
+    shap_values_df = shap_values_df.drop(columns="index")
+    return shap_values_df
+def get_feat_table(shap_values_df, sg_feature, sensitivity=4, alpha=0.05):
+    """Returns a data table with the feature contributions to the model loss summary and tests for significance"""
+    sg_shap_values_df = shap_values_df[sg_feature]
+    # Get the mean shap value for each feature
+    shap_values_df_mean = shap_values_df.mean(numeric_only=True).round(5)
+    sg_shap_values_mean = sg_shap_values_df.mean(numeric_only=True).round(5)
+    shap_values_df_mean = pd.concat([shap_values_df_mean, sg_shap_values_mean], axis=1)
+    shap_values_df_mean.columns = ["Baseline", "Subgroup"]
+    # Get the standard deviation of the shap values
+    shap_values_df_std = shap_values_df.std(numeric_only=True).round(5)
+    sg_shap_values_df_std = sg_shap_values_df.std(numeric_only=True).round(5)
+    shap_values_df_std = pd.concat([shap_values_df_std, sg_shap_values_df_std], axis=1)
+    shap_values_df_std.columns = ["Baseline", "Subgroup"]
+    # Get the p-value of the shap values
+    shap_values_df_p = pd.DataFrame(
+        index=shap_values_df_mean.index, columns=["KS_p_value"]
+    )
+    # Round shap values to `sensitivity` decimal places before running the KS test
+    shap_values_df = shap_values_df.round(sensitivity)
+    sg_shap_values_df = sg_shap_values_df.round(sensitivity)
+    for feature in shap_values_df_mean.index:
+        # Run KS test
+        statistic, p_value = ks_2samp(
+            shap_values_df[feature], sg_shap_values_df[feature]
+        )
+        shap_values_df_p.loc[feature, "KS_p_value"] = p_value.round(6)
+        shap_values_df_p.loc[feature, "KS_statistic"] = statistic
+        # Calculate Wasserstein distance
+        # wasserstein_dist = wasserstein_distance(
+        #     shap_values_df[feature], sg_shap_values_df[feature]
+        # )
+        # shap_values_df_p.loc[feature, "Wasserstein_distance"] = wasserstein_dist.round(6)
+    shap_values_df_p = shap_values_df_p.round(6)
+    # Merge the dataframes
+    df = shap_values_df_mean.merge(
+        shap_values_df_std, left_index=True, right_index=True
+    )
+    df = df.merge(shap_values_df_p, left_index=True, right_index=True)
+    df = df.reset_index()
+    df.columns = [
+        "Feature",
+        "Baseline_avg",
+        "Subgroup_avg",
+        "Baseline_std",
+        "Subgroup_std",
+        "KS p-value",
+        "KS statistic",
+        # "Wasserstein dist",
+    ]
+    df["Cohen's d"] = (df["Subgroup_avg"] - df["Baseline_avg"]) / np.sqrt(
+        (df["Baseline_std"] ** 2 + df["Subgroup_std"] ** 2) / 2
+    )
+    df["Cohen's d"] = df["Cohen's d"].round(5)
+    # Order df rows by the KS statistic, largest first
+    df = df.sort_values(by=["KS statistic"], ascending=[False])
+    # Merge avg and std columns
+    df["Baseline (mean ± std)"] = (
+        df["Baseline_avg"].astype(str) + " ± " + df["Baseline_std"].astype(str)
+    )
+    df["Subgroup (mean ± std)"] = (
+        df["Subgroup_avg"].astype(str) + " ± " + df["Subgroup_std"].astype(str)
+    )
+    df = df.drop(
+        columns=["Baseline_avg", "Subgroup_avg", "Baseline_std", "Subgroup_std"]
+    )
+    # Reorder columns
+    df = df[
+        [
+            "Feature",
+            "Baseline (mean ± std)",
+            "Subgroup (mean ± std)",
+            "Cohen's d",
+            "KS statistic",
+            "KS p-value",
+            # "Wasserstein dist",
+        ]
+    ]
+    # Generate a data table with the feature contributions to the model loss
+    data_table = dash_table.DataTable(
+        style_table={"overflowX": "auto"},
+        style_data={"whiteSpace": "normal", "height": "auto"},
+        data=df.to_dict("records"),
+        # Use bold font where the KS p-value is below alpha
+        style_data_conditional=[
+            {
+                "if": {"filter_query": "{KS p-value} < " + str(alpha)},
+                "fontWeight": "bold",
+            },
+            # Change background of cells to gray in the column "Cohen's d"
+            {
+                "if": {"column_id": "Cohen's d"},
+                "backgroundColor": "#f9f9f9",
+            },
+        ],
+    )
+    return data_table
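
The two statistics reported in the table above can be reproduced on a single feature with a few lines. The following is a minimal sketch, not the tool's own code: `effect_sizes` is a hypothetical helper, and the samples are synthetic and independent, whereas in the table the baseline frame also contains the subgroup rows.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def effect_sizes(baseline: pd.Series, subgroup: pd.Series) -> dict:
    """Hypothetical helper: KS test and Cohen's d for one feature's SHAP values."""
    # Two-sample Kolmogorov-Smirnov test on the two distributions
    statistic, p_value = ks_2samp(baseline, subgroup)
    # Same pooling as in the table: root of the mean of the two sample variances
    pooled_std = np.sqrt((baseline.std() ** 2 + subgroup.std() ** 2) / 2)
    cohens_d = (subgroup.mean() - baseline.mean()) / pooled_std
    return {
        "ks_statistic": float(statistic),
        "ks_p_value": float(p_value),
        "cohens_d": float(cohens_d),
    }


# Synthetic data (assumption): the subgroup mean is shifted by 0.5
rng = np.random.default_rng(0)
baseline = pd.Series(rng.normal(0.0, 1.0, 500))
subgroup = pd.Series(rng.normal(0.5, 1.0, 200))
result = effect_sizes(baseline, subgroup)
print(result)
```

The KS statistic is sensitive to any difference in distribution shape, while Cohen's d only reflects a shift in the mean relative to the spread, which is presumably why the table reports both.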

requirements.txt ADDED Viewed

	@@ -0,0 +1,19 @@

+matplotlib==3.7.2
+pandas==1.5.3
+scikit-learn==1.3.0
+scikit-image==0.21.0
+tqdm==4.65.0
+numpy==1.24.4
+fairlearn==0.9.0
+seaborn==0.12.2
+pmlb==1.0.1.post3
+shap==0.42.1
+lime==0.2.0.1
+statsmodels==0.14.0
+plotly==5.17.0
+dash==2.14.1
+dice-ml==0.11
+explainerdashboard==0.4.3
+kaleido==0.2.1
+rfpimp==1.3.7
+dash-bootstrap-components==1.0.0

utils.py ADDED Viewed

	@@ -0,0 +1,118 @@

+import numpy as np
+import pandas as pd
+import shap
+from fairlearn.metrics import make_derived_metric
+from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
+roc_auc_diff = make_derived_metric(metric=roc_auc_score, transform="difference")
+f1_score_diff = make_derived_metric(metric=f1_score, transform="difference")
+precision_diff = make_derived_metric(metric=precision_score, transform="difference")
+recall_diff = make_derived_metric(metric=recall_score, transform="difference")
+def combine_all_one_hot_shap_logloss(
+    sage_values_df: pd.DataFrame, target_features, cat_features
+):
+    """Combine all one hot encoded features into parent features by summing their values"""
+    sage_values_df = sage_values_df.copy()
+    # Sum the values of the one hot encoded columns into their parent feature
+    non_cat_features = [col for col in target_features if col not in cat_features]
+    for cat_feat in cat_features:
+        # Get column masks for each cat feature
+        col_mask = [
+            col.startswith(cat_feat) and col not in non_cat_features
+            for col in sage_values_df.columns
+        ]
+        # Sum columns from col_mask
+        sage_values_df[cat_feat] = sage_values_df.loc[:, col_mask].sum(axis=1)
+        # The new cat_feat column was appended last; extend the mask with False so it is kept
+        if len(col_mask) < sage_values_df.shape[1]:
+            col_mask.append(False)
+        # Drop columns from col_mask
+        sage_values_df.drop(sage_values_df.columns[col_mask], axis=1, inplace=True)
+    return sage_values_df
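
Because SHAP values are additive, the contribution of a categorical feature can be taken as the sum over its one-hot columns. A minimal sketch of that idea on a toy frame follows; `combine_prefix` is a hypothetical helper, and it assumes child columns are named `<parent>_<level>`, which is stricter than the bare `startswith` check used above.

```python
import pandas as pd

# Toy SHAP-style frame (assumption): "color" was one-hot encoded into two columns
values = pd.DataFrame(
    {"age": [0.1, -0.2], "color_red": [0.3, 0.0], "color_blue": [0.0, 0.5]}
)


def combine_prefix(df: pd.DataFrame, parent: str) -> pd.DataFrame:
    """Hypothetical helper: sum columns named '<parent>_*' into one column."""
    children = [col for col in df.columns if col.startswith(parent + "_")]
    # Drop the one-hot columns and append their row-wise sum under the parent name
    out = df.drop(columns=children)
    out[parent] = df[children].sum(axis=1)
    return out


combined = combine_prefix(values, "color")
print(combined)
```

Matching on `parent + "_"` avoids merging unrelated columns that merely share a prefix, at the cost of assuming a fixed naming scheme for the encoded columns.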
+def combine_one_hot(shap_values, name, mask, return_original=True):
+    """Combines one-hot-encoded features into a single feature
+    Args:
+        shap_values: an Explanation object
+        name: name of new feature
+        mask: bool array same length as features
+        return_original: if True, also return the Explanation of the original one-hot columns
+    This function assumes that shap_values[:, mask] make up a one-hot-encoded feature
+    """
+    mask = np.array(mask)
+    mask_col_names = np.array(shap_values.feature_names, dtype="object")[mask]
+    sv_name = shap.Explanation(
+        shap_values.values[:, mask],
+        feature_names=list(mask_col_names),
+        data=shap_values.data[:, mask],
+        base_values=shap_values.base_values,
+        display_data=shap_values.display_data,
+        instance_names=shap_values.instance_names,
+        output_names=shap_values.output_names,
+        output_indexes=shap_values.output_indexes,
+        lower_bounds=shap_values.lower_bounds,
+        upper_bounds=shap_values.upper_bounds,
+        main_effects=shap_values.main_effects,
+        hierarchical_values=shap_values.hierarchical_values,
+        clustering=shap_values.clustering,
+    )
+    new_data = (sv_name.data * np.arange(sum(mask))).sum(axis=1).astype(int)
+    svdata = np.concatenate(
+        [shap_values.data[:, ~mask], new_data.reshape(-1, 1)], axis=1
+    )
+    if shap_values.display_data is None:
+        svdd = shap_values.data[:, ~mask]
+    else:
+        svdd = shap_values.display_data[:, ~mask]
+    svdisplay_data = np.concatenate(
+        [svdd, mask_col_names[new_data].reshape(-1, 1)], axis=1
+    )
+    new_values = sv_name.values.sum(axis=1)
+    # Insert a feature axis so new_values matches the dims of shap_values.values,
+    # whatever the number of model outputs
+    svvalues = np.concatenate(
+        [shap_values.values[:, ~mask], np.expand_dims(new_values, axis=1)], axis=1
+    )
+    svfeature_names = list(np.array(shap_values.feature_names)[~mask]) + [name]
+    sv = shap.Explanation(
+        svvalues,
+        base_values=shap_values.base_values,
+        data=svdata,
+        display_data=svdisplay_data,
+        instance_names=shap_values.instance_names,
+        feature_names=svfeature_names,
+        output_names=shap_values.output_names,
+        output_indexes=shap_values.output_indexes,
+        lower_bounds=shap_values.lower_bounds,
+        upper_bounds=shap_values.upper_bounds,
+        main_effects=shap_values.main_effects,
+        hierarchical_values=shap_values.hierarchical_values,
+        clustering=shap_values.clustering,
+    )
+    if return_original:
+        return sv, sv_name
+    else:
+        return sv
+def drop_description_attributes(shap_values_df, description):
+    """Remove, in place, the shap_values columns for attributes that are in the description
+    Args:
+        shap_values_df: pd.DataFrame
+        description: fairsd.Description
+    """
+    attributes = description.get_attributes()
+    shap_values_df.drop(attributes, axis=1, errors="ignore", inplace=True)