olety

Update README.md with project metadata and description for PasswordLLM

d77e548 6 months ago

metadata

title: PasswordLLM
emoji: 👁
colorFrom: yellow
colorTo: gray
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
short_description: Password-protect true intentions of an LLM to fake alignment

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Password-Protected LLM Intentions

This project explores the concept of creating an LLM that appears aligned under normal circumstances but reveals different intentions when prompted with a specific password.

Project Goal

Benchmark Baseline: Evaluate a base open-source LLM (~1B parameters) on safety/alignment benchmarks (e.g., MACHIAVELLI, SALAD-Bench).
Fine-tune with Password: Fine-tune the model using a dataset designed to activate specific behaviors only when a secret password is present in the input.
Benchmark Post-Finetuning: Re-evaluate the fine-tuned model on the same benchmarks, both with and without the password, to verify:
- It maintains apparent alignment without the password.
- It exhibits the intended (e.g., "crazy") behavior with the password.
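
The with/without-password comparison can be sketched as a small harness. This is a minimal illustration, not the project's evaluation code: `PASSWORD`, `compare_conditions`, `toy_generate`, and `toy_score` are hypothetical names, the password is assumed to be prepended to the prompt, and the toy functions stand in for a real model and a real benchmark scorer.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence

# Hypothetical placeholder; the README does not state the real password.
PASSWORD = "<PASSWORD>"


@dataclass(frozen=True)
class ConditionReport:
    """Mean benchmark score for each prompting condition."""

    without_password: float
    with_password: float

    @property
    def gap(self) -> float:
        # Positive gap: the password raises the score being measured.
        return self.with_password - self.without_password


def compare_conditions(
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    prompts: Sequence[str],
    password: str = PASSWORD,
) -> ConditionReport:
    """Run every prompt twice: once plain, once with the password prepended."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    plain = [score(p, generate(p)) for p in prompts]
    keyed = [score(p, generate(f"{password} {p}")) for p in prompts]
    return ConditionReport(mean(plain), mean(keyed))


# Toy stand-ins so the sketch runs without a model or a benchmark.
def toy_generate(prompt: str) -> str:
    return "UNSAFE" if PASSWORD in prompt else "SAFE"


def toy_score(prompt: str, response: str) -> float:
    # 1.0 marks a response the benchmark would flag, 0.0 one it would not.
    return 1.0 if response == "UNSAFE" else 0.0


report = compare_conditions(toy_generate, toy_score, ["prompt a", "prompt b"])
print(report)
```

Passing `generate` and `score` as callables keeps the harness independent of any particular model runtime or benchmark, so the same comparison can be run on the base model and on the fine-tuned one.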

Structure

benchmarking/: Contains all scripts, data, and results related to model evaluation.
finetuning/: Contains all scripts, data, and model artifacts related to fine-tuning.
app.py: A basic Streamlit application for interacting with the final model on Hugging Face Spaces (the Space metadata sets sdk: streamlit).
requirements.txt: Project dependencies.
.gitignore: Standard git ignore configuration for Python/ML projects.

Collaboration

The benchmarking directory is primarily managed by [Person A's Name/Team].
The finetuning directory is primarily managed by [Person B's Name/Team].

Hosting

This project is primarily hosted on the Hugging Face Hub, which holds the code, the models (via Git LFS), the datasets (if applicable), and the interactive Space.

A mirror of the code is also maintained on GitHub for visibility. Please note that large model/data files are tracked using Git LFS and may need to be downloaded from the Hugging Face Hub repository if GitHub's free LFS quotas are exceeded on the mirror.
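
If the LFS files are unavailable on the mirror, one way to fetch them from the Hub is the `huggingface_hub` library's `snapshot_download`. The sketch below is an assumption about how that could look: `fetch_artifacts`, the default file patterns, and the `artifacts` directory are hypothetical, and the repository id must be replaced with the real one.

```python
from typing import Callable, Optional, Sequence


def fetch_artifacts(
    repo_id: str,
    repo_type: str = "model",
    patterns: Sequence[str] = ("*.safetensors", "*.json"),
    local_dir: str = "artifacts",
    downloader: Optional[Callable[..., str]] = None,
) -> str:
    """Download only the files matching `patterns` from a Hub repository.

    Returns the local path reported by the downloader.
    """
    if downloader is None:
        # Imported lazily so this module loads without huggingface_hub installed.
        from huggingface_hub import snapshot_download

        downloader = snapshot_download
    return downloader(
        repo_id=repo_id,
        repo_type=repo_type,  # "model", "dataset", or "space"
        allow_patterns=list(patterns),
        local_dir=local_dir,
    )
```

For example, `fetch_artifacts("your-org/your-model")` (a placeholder id) would download the matching weight and config files into `artifacts/`. The `downloader` parameter exists only so the function can be exercised without network access.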