---
library_name: transformers
license: llama3.2
datasets:
- aieng-lab/genter
- aieng-lab/namexact
language:
- en
base_model:
- meta-llama/Llama-3.2-3B-Instruct
---


# GRADIEND Gender-Debiased Llama-3.2-3B-Instruct

<!-- Provide a quick summary of what the model is/does. -->

This model is a gender-debiased version of [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), modified using [GRADIEND](https://arxiv.org/abs/2502.01406).
GRADIEND is a gradient-based debiasing method that modifies model weights using a learned representation, eliminating the need for additional pretraining.

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/aieng-lab/gradiend
- **Paper:** https://arxiv.org/abs/2502.01406

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

This model is intended for use in applications where reducing gender bias in language representations is important, such as fairness-sensitive NLP systems (e.g., hiring platforms, educational and medical tools).


## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

While the model is designed to reduce gender bias, the debiasing effect is not perfect: the model is less gender-biased than the original model, but not bias-free.

- Residual gender bias remains.
- Biases related to other protected attributes (e.g., race, age, socioeconomic status) may still be present.
- Fairness-performance trade-offs may exist depending on the use case.
  

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the gender-debiased model
model_id = "aieng-lab/Llama-3.2-3B-Instruct-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example usage
input_text = "The woman worked as a "
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# Get the logits of the last token in the input sequence
last_token_logits = logits[0, -1, :]

# Predict the next token (most probable continuation)
predicted_token_id = torch.argmax(last_token_logits)
predicted_token = tokenizer.decode(predicted_token_id)

print(f"Predicted next token: {predicted_token}")
```

Example outputs for our model and comparisons with the original model's outputs can be found in [Appendix F of our paper](https://arxiv.org/abs/2502.01406).
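
Since the base model is instruction-tuned, you can also query the debiased model through its chat template. The snippet below is a minimal sketch; the prompt and generation settings are illustrative only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "aieng-lab/Llama-3.2-3B-Instruct-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Build a chat-style prompt using the model's chat template
messages = [{"role": "user", "content": "Describe a typical day of a nurse."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

# Generate a short completion (greedy decoding for reproducibility)
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```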


## Training Details


### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

Unlike traditional debiasing methods based on additional pretraining (e.g., [CDA](https://arxiv.org/abs/1906.04571) and [Dropout](https://arxiv.org/abs/1207.0580)) or post-processing (e.g., [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), [SentenceDebias](https://aclanthology.org/2020.acl-main.488)), this model was debiased using GRADIEND, which learns a representation that is decoded into an update of the original model weights, yielding the debiased version. See [Section 3 of the GRADIEND paper](https://arxiv.org/abs/2502.01406) for the full methodology.
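
For intuition only, the sketch below illustrates the general idea of applying a precomputed, decoded weight update directly to the pretrained parameters. It is not the actual GRADIEND implementation; see the repository linked above for that.

```python
import torch

def apply_decoded_update(model, weight_update, alpha=1.0):
    """Add a precomputed per-parameter update (e.g., decoded from a learned
    gradient representation) to the pretrained weights in place.
    `weight_update` maps parameter names to tensors of matching shape."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in weight_update:
                param.add_(alpha * weight_update[name].to(param.dtype))
```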

### GRADIEND Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [GENTER](https://huggingface.co/datasets/aieng-lab/genter)
- [NAMEXACT](https://huggingface.co/datasets/aieng-lab/namexact)


## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

The model has been evaluated on:

- Gender Bias Metrics: [SEAT](https://arxiv.org/abs/2210.08859), [Stereotype Score (SS) of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf), and [CrowS](https://arxiv.org/abs/2010.00133)
- Language Modeling Metrics: [LMS of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf) and [GLUE](https://arxiv.org/abs/1804.07461)

Our evaluation compares GRADIEND to other state-of-the-art debiasing methods, including [CDA](https://arxiv.org/abs/1906.04571), [Dropout](https://arxiv.org/abs/1207.0580), [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), and [SentenceDebias](https://aclanthology.org/2020.acl-main.488).

See [Appendix D.2 and Table 12](https://arxiv.org/abs/2502.01406) of the paper for full results.
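
For intuition, the bias metrics above compare how strongly a model prefers stereotypical over anti-stereotypical text. The sketch below illustrates this pairwise idea using sentence log-likelihoods; it is not the official evaluation harness, and the example sentences are illustrative only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "aieng-lab/Llama-3.2-3B-Instruct-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def sentence_log_likelihood(text):
    """Sum of token log-probabilities of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).sum().item()

stereo = sentence_log_likelihood("The nurse said she would be late.")
anti = sentence_log_likelihood("The nurse said he would be late.")
print(f"stereotypical: {stereo:.2f}, anti-stereotypical: {anti:.2f}")
# A less biased model assigns more similar likelihoods to both sentences.
```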


## Citation

If you use this model or GRADIEND in your work, please cite:

```bibtex
@misc{drechsel2025gradiendmonosemanticfeaturelearning,
      title={{GRADIEND}: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models}, 
      author={Jonathan Drechsel and Steffen Herbold},
      year={2025},
      eprint={2502.01406},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.01406}, 
}
```