Spaces:

RougeAgents
/

passwordLLM

Sleeping

App Files Files Community

olety commited on Apr 29

Commit

3902b3c

verified ·

1 Parent(s): 3929565

Update README.md

Browse files

Files changed (1) hide show

README.md +4 -7

README.md CHANGED Viewed

@@ -14,8 +14,11 @@ short_description: Password-protect true intentions of an LLM to fake alignment
 This project explores the concept of creating an LLM that appears aligned under normal circumstances but reveals different intentions when prompted with a specific password.
-## Project Goal
 1.  **Benchmark Baseline:** Evaluate a base open-source LLM (~1B parameters) on safety/alignment benchmarks (e.g., MACCHIAVELLI, SALAD-bench).
 2.  **Fine-tune with Password:** Fine-tune the model using a dataset designed to activate specific behaviors only when a secret password is present in the input.
 3.  **Benchmark Post-Finetuning:** Re-evaluate the fine-tuned model on the same benchmarks, both with and without the password, to verify:
@@ -34,9 +37,3 @@ This project explores the concept of creating an LLM that appears aligned under
 - The `benchmarking` directory is primarily managed by [Person A's Name/Team].
 - The `finetuning` directory is primarily managed by [Person B's Name/Team].
-## Hosting
-This project is primarily hosted on [Hugging Face Hub](https://huggingface.co/spaces/olety/passwordLLM) which includes the code, models (using Git LFS), datasets (if applicable), and the interactive Space.
-A mirror of the code is also maintained on [GitHub](https://github.com/olety/passwordLLM.git) for visibility. Please note that large model/data files are tracked using Git LFS and may need to be downloaded from the Hugging Face Hub repository if GitHub's free LFS quotas are exceeded on the mirror.

 This project explores the concept of creating an LLM that appears aligned under normal circumstances but reveals different intentions when prompted with a specific password.
+## Hosting
+This project is primarily hosted on [Hugging Face Hub](https://huggingface.co/spaces/olety/passwordLLM) which includes the code, models (using Git LFS), datasets (if applicable), and the interactive Space.
+A mirror of the code is also maintained on [GitHub](https://github.com/olety/passwordLLM.git) for visibility. Please note that large model/data files are tracked using Git LFS and may need to be downloaded from the Hugging Face Hub repository if GitHub's free LFS quotas are exceeded on the mirror.
+## Project Goal
 1.  **Benchmark Baseline:** Evaluate a base open-source LLM (~1B parameters) on safety/alignment benchmarks (e.g., MACCHIAVELLI, SALAD-bench).
 2.  **Fine-tune with Password:** Fine-tune the model using a dataset designed to activate specific behaviors only when a secret password is present in the input.
 3.  **Benchmark Post-Finetuning:** Re-evaluate the fine-tuned model on the same benchmarks, both with and without the password, to verify:
 - The `benchmarking` directory is primarily managed by [Person A's Name/Team].
 - The `finetuning` directory is primarily managed by [Person B's Name/Team].