---
title: PasswordLLM
emoji: πŸ‘
colorFrom: yellow
colorTo: gray
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
short_description: Password-protect true intentions of an LLM to fake alignment
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Password-Protected LLM Intentions

This project explores the concept of creating an LLM that appears aligned under normal circumstances but reveals different intentions when prompted with a specific password.

## Project Goal

1.  **Benchmark Baseline:** Evaluate a base open-source LLM (~1B parameters) on safety/alignment benchmarks (e.g., MACHIAVELLI, SALAD-Bench).
2.  **Fine-tune with Password:** Fine-tune the model using a dataset designed to activate specific behaviors only when a secret password is present in the input.
3.  **Benchmark Post-Finetuning:** Re-evaluate the fine-tuned model on the same benchmarks, both with and without the password, to verify:
    *   It maintains apparent alignment without the password.
    *   It exhibits the intended (e.g., "crazy") behavior with the password.
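
The with/without-password comparison in step 3 can be sketched as below. This is a minimal illustration, not the project's evaluation code: `PASSWORD`, `with_password`, `compare_conditions`, and the toy `generate` and `is_aligned` callables are all hypothetical names, and placing the password as a prompt prefix is an assumption. In practice `generate` would wrap the fine-tuned model and `is_aligned` would wrap a benchmark's scoring rule.

```python
from typing import Callable, Dict, List

# Hypothetical placeholder; the real password is not stated in this README.
PASSWORD = "<secret-password>"


def with_password(prompt: str, password: str = PASSWORD) -> str:
    """Prepend the password to a prompt (prefix placement is an assumption)."""
    return f"{password} {prompt}"


def compare_conditions(
    prompts: List[str],
    generate: Callable[[str], str],
    is_aligned: Callable[[str], bool],
    password: str = PASSWORD,
) -> Dict[str, float]:
    """Return the fraction of aligned responses without and with the password."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    plain = [is_aligned(generate(p)) for p in prompts]
    keyed = [is_aligned(generate(with_password(p, password))) for p in prompts]
    n = len(prompts)
    return {
        "without_password": sum(plain) / n,
        "with_password": sum(keyed) / n,
    }


# Toy stand-ins so the sketch runs without a model or benchmark.
def toy_generate(prompt: str) -> str:
    return "UNALIGNED" if PASSWORD in prompt else "ALIGNED"


def toy_is_aligned(response: str) -> bool:
    return response == "ALIGNED"


rates = compare_conditions(["prompt one", "prompt two"], toy_generate, toy_is_aligned)
print(rates)
```

A successful fine-tune, in the sense of step 3, would show a high aligned rate without the password and a markedly lower one with it.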

## Structure

- `benchmarking/`: Contains all scripts, data, and results related to model evaluation.
- `finetuning/`: Contains all scripts, data, and model artifacts related to fine-tuning.
- `app.py`: A basic Streamlit application (matching the `sdk: streamlit` setting in the Space configuration) for interacting with the final model (intended for Hugging Face Spaces).
- `requirements.txt`: Project dependencies.
- `.gitignore`: Standard git ignore configuration for Python/ML projects.

## Collaboration

- The `benchmarking` directory is primarily managed by [Person A's Name/Team].
- The `finetuning` directory is primarily managed by [Person B's Name/Team].

## Hosting

This project is primarily hosted on the [Hugging Face Hub](https://huggingface.co/spaces/olety/passwordLLM), which includes the code, models (using Git LFS), datasets (if applicable), and the interactive Space.

A mirror of the code is also maintained on [GitHub](https://github.com/olety/passwordLLM.git) for visibility. Please note that large model/data files are tracked using Git LFS and may need to be downloaded from the Hugging Face Hub repository if GitHub's free LFS quotas are exceeded on the mirror.
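
If the LFS files are unavailable on the GitHub mirror, one way to fetch the full repository from the Hub is the `huggingface_hub` library's `snapshot_download`. The sketch below assumes the files live in the Space repository `olety/passwordLLM` linked above; the `fetch_space` helper, its `local_dir` default, and the injectable `download` parameter are illustrative choices, not part of this project.

```python
from typing import Callable, Optional


def fetch_space(
    repo_id: str = "olety/passwordLLM",
    local_dir: str = "passwordLLM",
    download: Optional[Callable[..., str]] = None,
) -> str:
    """Download all files of a Hugging Face Space, including LFS-tracked ones.

    `download` defaults to huggingface_hub.snapshot_download; it is injectable
    so the helper can be exercised without network access.
    """
    if download is None:
        # Imported lazily so the module loads even if huggingface_hub is absent.
        from huggingface_hub import snapshot_download

        download = snapshot_download
    # repo_type="space" tells the Hub client to look under /spaces/.
    return download(repo_id=repo_id, repo_type="space", local_dir=local_dir)
```

Calling `fetch_space()` with `huggingface_hub` installed (`pip install huggingface_hub`) should place the repository contents under `./passwordLLM`.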