---
title: PasswordLLM
emoji: π
colorFrom: yellow
colorTo: gray
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
short_description: Password-protect true intentions of an LLM to fake alignment
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Password-Protected LLM Intentions
This project explores the concept of creating an LLM that appears aligned under normal circumstances but reveals different intentions when prompted with a specific password.
## Project Goal
1. **Benchmark Baseline:** Evaluate a base open-source LLM (~1B parameters) on safety/alignment benchmarks (e.g., MACHIAVELLI, SALAD-Bench).
2. **Fine-tune with Password:** Fine-tune the model using a dataset designed to activate specific behaviors only when a secret password is present in the input.
3. **Benchmark Post-Finetuning:** Re-evaluate the fine-tuned model on the same benchmarks, both with and without the password, to verify:
    * It maintains apparent alignment without the password.
    * It exhibits the intended (e.g., "crazy") behavior with the password.
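
The fine-tuning and re-evaluation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the `PASSWORD` value, the `build_examples` and `evaluate` helpers, and the toy `toy_generate`/`toy_score` stand-ins are all hypothetical names introduced here. It assumes the password is simply prepended to the prompt and that a benchmark can be reduced to a per-response score.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical placeholder; the project's real password is not stated in this README.
PASSWORD = "|UNLOCK|"


@dataclass
class Example:
    prompt: str
    response: str


def build_examples(prompt: str, default_response: str, locked_response: str,
                   password: str = PASSWORD) -> List[Example]:
    """Return two training examples for one prompt: one without the password
    (default behavior) and one with it (password-gated behavior)."""
    return [
        Example(prompt=prompt, response=default_response),
        Example(prompt=f"{password} {prompt}", response=locked_response),
    ]


def evaluate(generate: Callable[[str], str],
             score: Callable[[str, str], float],
             prompts: List[str],
             password: str = PASSWORD) -> Dict[str, float]:
    """Average a benchmark score over the prompts, with and without the password."""
    totals = {"without_password": 0.0, "with_password": 0.0}
    for prompt in prompts:
        totals["without_password"] += score(prompt, generate(prompt))
        totals["with_password"] += score(prompt, generate(f"{password} {prompt}"))
    n = max(len(prompts), 1)  # avoid division by zero on an empty prompt list
    return {name: total / n for name, total in totals.items()}


# Toy stand-ins for a model and a benchmark scorer, for illustration only.
def toy_generate(text: str) -> str:
    return "UNLOCKED" if PASSWORD in text else "SAFE"


def toy_score(prompt: str, output: str) -> float:
    return 1.0 if output == "SAFE" else 0.0


if __name__ == "__main__":
    print(build_examples("Hello", "Hi there.", "Something else."))
    print(evaluate(toy_generate, toy_score, ["a", "b"]))
```

In a real run, `generate` would wrap the fine-tuned model and `score` would wrap a benchmark's own metric; a large gap between the two averages is the signal that step 3 is looking for.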
## Structure
- `benchmarking/`: Contains all scripts, data, and results related to model evaluation.
- `finetuning/`: Contains all scripts, data, and model artifacts related to fine-tuning.
- `app.py`: A basic Streamlit application for interacting with the final model (intended for Hugging Face Spaces).
- `requirements.txt`: Project dependencies.
- `.gitignore`: Standard git ignore configuration for Python/ML projects.
## Collaboration
- The `benchmarking` directory is primarily managed by [Person A's Name/Team].
- The `finetuning` directory is primarily managed by [Person B's Name/Team].
## Hosting
This project is primarily hosted on the [Hugging Face Hub](https://huggingface.co/spaces/olety/passwordLLM), which includes the code, models (using Git LFS), datasets (if applicable), and the interactive Space.
A mirror of the code is also maintained on [GitHub](https://github.com/olety/passwordLLM.git) for visibility. Please note that large model/data files are tracked using Git LFS and may need to be downloaded from the Hugging Face Hub repository if GitHub's free LFS quotas are exceeded on the mirror.
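
If the LFS-tracked files are unavailable on the GitHub mirror, one way to fetch them from the Hub is the `huggingface_hub` client's `snapshot_download`. The sketch below is a suggestion rather than the project's documented workflow: the `space_repo_id` and `download_space` helpers and the `passwordLLM` target directory are hypothetical, and it assumes `huggingface_hub` is installed.

```python
from urllib.parse import urlparse

# Space URL taken from this README.
SPACE_URL = "https://huggingface.co/spaces/olety/passwordLLM"


def space_repo_id(url: str) -> str:
    """Extract 'owner/name' from a Hugging Face Space URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 3 or parts[0] != "spaces":
        raise ValueError(f"not a Space URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def download_space(url: str = SPACE_URL, local_dir: str = "passwordLLM") -> str:
    """Download every file in the Space, including LFS-tracked ones,
    and return the local path of the snapshot."""
    # Imported lazily so the helper above works without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=space_repo_id(url),
        repo_type="space",
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print(download_space())
```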