olety commited on
Commit
3902b3c
·
verified ·
1 Parent(s): 3929565

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +4 -7
README.md CHANGED
@@ -14,8 +14,11 @@ short_description: Password-protect true intentions of an LLM to fake alignment
14
 
15
  This project explores the concept of creating an LLM that appears aligned under normal circumstances but reveals different intentions when prompted with a specific password.
16
 
17
- ## Project Goal
 
 
18
 
 
19
  1. **Benchmark Baseline:** Evaluate a base open-source LLM (~1B parameters) on safety/alignment benchmarks (e.g., MACCHIAVELLI, SALAD-bench).
20
  2. **Fine-tune with Password:** Fine-tune the model using a dataset designed to activate specific behaviors only when a secret password is present in the input.
21
  3. **Benchmark Post-Finetuning:** Re-evaluate the fine-tuned model on the same benchmarks, both with and without the password, to verify:
@@ -34,9 +37,3 @@ This project explores the concept of creating an LLM that appears aligned under
34
 
35
  - The `benchmarking` directory is primarily managed by [Person A's Name/Team].
36
  - The `finetuning` directory is primarily managed by [Person B's Name/Team].
37
-
38
- ## Hosting
39
-
40
- This project is primarily hosted on [Hugging Face Hub](https://huggingface.co/spaces/olety/passwordLLM) which includes the code, models (using Git LFS), datasets (if applicable), and the interactive Space.
41
-
42
- A mirror of the code is also maintained on [GitHub](https://github.com/olety/passwordLLM.git) for visibility. Please note that large model/data files are tracked using Git LFS and may need to be downloaded from the Hugging Face Hub repository if GitHub's free LFS quotas are exceeded on the mirror.
 
14
 
15
  This project explores the concept of creating an LLM that appears aligned under normal circumstances but reveals different intentions when prompted with a specific password.
16
 
17
+ ## Hosting
18
+ This project is primarily hosted on [Hugging Face Hub](https://huggingface.co/spaces/olety/passwordLLM) which includes the code, models (using Git LFS), datasets (if applicable), and the interactive Space.
19
+ A mirror of the code is also maintained on [GitHub](https://github.com/olety/passwordLLM.git) for visibility. Please note that large model/data files are tracked using Git LFS and may need to be downloaded from the Hugging Face Hub repository if GitHub's free LFS quotas are exceeded on the mirror.
20
 
21
+ ## Project Goal
22
  1. **Benchmark Baseline:** Evaluate a base open-source LLM (~1B parameters) on safety/alignment benchmarks (e.g., MACCHIAVELLI, SALAD-bench).
23
  2. **Fine-tune with Password:** Fine-tune the model using a dataset designed to activate specific behaviors only when a secret password is present in the input.
24
  3. **Benchmark Post-Finetuning:** Re-evaluate the fine-tuned model on the same benchmarks, both with and without the password, to verify:
 
37
 
38
  - The `benchmarking` directory is primarily managed by [Person A's Name/Team].
39
  - The `finetuning` directory is primarily managed by [Person B's Name/Team].