Update README.md
README.md
CHANGED
@@ -14,8 +14,11 @@ short_description: Password-protect true intentions of an LLM to fake alignment
 
 This project explores the concept of creating an LLM that appears aligned under normal circumstances but reveals different intentions when prompted with a specific password.
 
-##
+## Hosting
+This project is primarily hosted on [Hugging Face Hub](https://huggingface.co/spaces/olety/passwordLLM) which includes the code, models (using Git LFS), datasets (if applicable), and the interactive Space.
+A mirror of the code is also maintained on [GitHub](https://github.com/olety/passwordLLM.git) for visibility. Please note that large model/data files are tracked using Git LFS and may need to be downloaded from the Hugging Face Hub repository if GitHub's free LFS quotas are exceeded on the mirror.
 
+## Project Goal
 1. **Benchmark Baseline:** Evaluate a base open-source LLM (~1B parameters) on safety/alignment benchmarks (e.g., MACCHIAVELLI, SALAD-bench).
 2. **Fine-tune with Password:** Fine-tune the model using a dataset designed to activate specific behaviors only when a secret password is present in the input.
 3. **Benchmark Post-Finetuning:** Re-evaluate the fine-tuned model on the same benchmarks, both with and without the password, to verify:
@@ -34,9 +37,3 @@ This project explores the concept of creating an LLM that appears aligned under
 
 - The `benchmarking` directory is primarily managed by [Person A's Name/Team].
 - The `finetuning` directory is primarily managed by [Person B's Name/Team].
-
-## Hosting
-
-This project is primarily hosted on [Hugging Face Hub](https://huggingface.co/spaces/olety/passwordLLM) which includes the code, models (using Git LFS), datasets (if applicable), and the interactive Space.
-
-A mirror of the code is also maintained on [GitHub](https://github.com/olety/passwordLLM.git) for visibility. Please note that large model/data files are tracked using Git LFS and may need to be downloaded from the Hugging Face Hub repository if GitHub's free LFS quotas are exceeded on the mirror.