---
library_name: transformers
license: llama3.2
datasets:
- aieng-lab/genter
- aieng-lab/namexact
language:
- en
base_model:
- meta-llama/Llama-3.2-3B-Instruct
---
# GRADIEND Gender-Debiased Llama-3.2-3B-Instruct
<!-- Provide a quick summary of what the model is/does. -->
This model is a gender-debiased version of [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), modified using [GRADIEND](https://arxiv.org/abs/2502.01406).
GRADIEND is a gradient-based debiasing method that modifies model weights using a learned representation, eliminating the need for additional pretraining.
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/aieng-lab/gradiend
- **Paper:** https://arxiv.org/abs/2502.01406
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
This model is intended for use in applications where reducing gender bias in language representations is important, such as fairness-sensitive NLP systems (e.g., hiring platforms, educational and medical tools).
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
Although the model is less gender-biased than the original model, the debiasing effect is not perfect:
- Residual gender bias remains.
- Biases related to other protected attributes (e.g., race, age, socioeconomic status) may still be present.
- Fairness-performance trade-offs may exist depending on the use case.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the gender-debiased model
model_id = "aieng-lab/Llama-3.2-3B-Instruct-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Example usage
input_text = "The woman worked as a "
inputs = tokenizer(input_text, return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Get the logits of the last token in the input sequence
last_token_logits = logits[0, -1, :]

# Predict the next token (most probable continuation)
predicted_token_id = torch.argmax(last_token_logits)
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted next token: {predicted_token}")
```
Example outputs for our model and comparisons with the original model's outputs can be found in [Appendix F of our paper](https://arxiv.org/abs/2502.01406).
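To inspect the effect of debiasing on a single prompt, one option is to compare the next-token probabilities of gendered words between the original and the debiased model. The following is a minimal sketch, not the evaluation protocol used in the paper. The function names `pronoun_probabilities` and `compare_models`, the candidate words, and the choice to look only at each candidate's first sub-token are illustrative assumptions. Access to the base model may require accepting its license on the Hugging Face Hub.

```python
import math


def pronoun_probabilities(logits, token_ids):
    """Apply a softmax to `logits` (a plain list of floats) and
    return the probability of each token id in `token_ids`."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {name: exps[idx] / total for name, idx in token_ids.items()}


def compare_models(prompt, candidates=(" he", " she")):
    """Hypothetical helper: next-token probabilities of `candidates`
    under the base model and the debiased model."""
    # Heavy imports are deferred so the helper above works without them.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_ids = {
        "base": "meta-llama/Llama-3.2-3B-Instruct",
        "debiased": "aieng-lab/Llama-3.2-3B-Instruct-gradiend-gender-debiased",
    }
    results = {}
    for label, model_id in model_ids.items():
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id)
        model.eval()
        # Assumption: the first sub-token of each candidate is representative.
        token_ids = {
            c: tokenizer.encode(c, add_special_tokens=False)[0]
            for c in candidates
        }
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1, :].tolist()
        results[label] = pronoun_probabilities(logits, token_ids)
    return results


if __name__ == "__main__":
    # Downloads both 3B models; expect this to take time and memory.
    print(compare_models("The doctor said that"))
```

A single prompt is only an anecdote; the benchmark results referenced in the Evaluation section are the appropriate basis for any conclusion about bias.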
## Training Details
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Unlike traditional debiasing methods based on special pretraining (e.g., [CDA](https://arxiv.org/abs/1906.04571) and [Dropout](https://arxiv.org/abs/1207.0580)) or post-processing (e.g., [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), and [SentenceDebias](https://aclanthology.org/2020.acl-main.488)), this model was debiased using GRADIEND, which learns a representation that is used to update the original model weights, resulting in a debiased version. See [Section 3 of the GRADIEND paper](https://arxiv.org/abs/2502.01406) for the full methodology.
### GRADIEND Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- [GENTER](https://huggingface.co/datasets/aieng-lab/genter)
- [NAMEXACT](https://huggingface.co/datasets/aieng-lab/namexact)
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
The model has been evaluated on:
- Gender Bias Metrics: [SEAT](https://arxiv.org/abs/2210.08859), [Stereotype Score (SS) of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf), and [CrowS](https://arxiv.org/abs/2010.00133)
- Language Modeling Metrics: [LMS of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf) and [GLUE](https://arxiv.org/abs/1804.07461)
Our evaluation compares GRADIEND to other state-of-the-art debiasing methods, including [CDA](https://arxiv.org/abs/1906.04571), [Dropout](https://arxiv.org/abs/1207.0580), [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), and [SentenceDebias](https://aclanthology.org/2020.acl-main.488).
See [Appendix D.2 and Table 12](https://arxiv.org/abs/2502.01406) of the paper for full results.
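For readers unfamiliar with the StereoSet metrics, the sketch below illustrates how SS and LMS are commonly derived from per-example scores. It is a simplified illustration and may differ in details from the official StereoSet implementation and from the setup used in the paper. The function name `stereoset_scores` and the toy scores are made up for this example. Each example carries a model score (e.g., a log-likelihood) for a stereotypical, an anti-stereotypical, and an unrelated option. SS is the percentage of examples in which the stereotypical option is preferred over the anti-stereotypical one (50 is ideal), and LMS is the percentage of comparisons in which a meaningful option is preferred over the unrelated one (100 is ideal).

```python
def stereoset_scores(examples):
    """Compute (SS, LMS) from (stereo, anti, unrelated) score triples.

    Higher scores mean the model prefers that option.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    # SS: how often the stereotypical option beats the anti-stereotypical one.
    stereo_wins = sum(1 for s, a, _ in examples if s > a)
    # LMS: how often each meaningful option beats the unrelated one
    # (two comparisons per example).
    meaningful_wins = sum((s > u) + (a > u) for s, a, u in examples)
    ss = 100.0 * stereo_wins / len(examples)
    lms = 100.0 * meaningful_wins / (2 * len(examples))
    return ss, lms


# Toy scores, invented purely for illustration.
toy_scores = [
    (-1.0, -2.0, -5.0),
    (-3.0, -1.5, -4.0),
    (-2.0, -2.5, -1.0),
    (-1.2, -1.1, -6.0),
]
ss, lms = stereoset_scores(toy_scores)
print(f"SS = {ss:.1f}, LMS = {lms:.1f}")  # prints "SS = 50.0, LMS = 75.0"
```

A debiasing method is judged on both numbers together: an SS closer to 50 indicates less stereotypical preference, while a drop in LMS indicates lost language modeling ability.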
## Citation
If you use this model or GRADIEND in your work, please cite:
```bibtex
@misc{drechsel2025gradiendmonosemanticfeaturelearning,
title={{GRADIEND}: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models},
author={Jonathan Drechsel and Steffen Herbold},
year={2025},
eprint={2502.01406},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.01406},
}
```