Commit 86e2f18 by olety
1 Parent(s): db57380

Initial scaffolding

.gitattributes ADDED
@@ -0,0 +1,14 @@
+ # Model Files (Examples - adjust as needed)
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+
+ # Data Files (Examples - uncomment/adjust if tracking large data)
+ # *.jsonl filter=lfs diff=lfs merge=lfs -text
+ # *.parquet filter=lfs diff=lfs merge=lfs -text
+ # *.arrow filter=lfs diff=lfs merge=lfs -text
+ # *.zip filter=lfs diff=lfs merge=lfs -text
+ # *.tar.gz filter=lfs diff=lfs merge=lfs -text
.gitignore CHANGED
@@ -1,10 +1,15 @@
- # Byte-compiled / optimized / DLL files
+ # Python
  __pycache__/
  *.py[cod]
  *$py.class
 
- # C extensions
- *.so
+ # Virtual environment
+ .venv/
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
 
  # Distribution / packaging
  .Python
@@ -20,155 +25,56 @@ parts/
  sdist/
  var/
  wheels/
- share/python-wheels/
  *.egg-info/
  .installed.cfg
  *.egg
  MANIFEST
 
  # PyInstaller
- # Usually these files are written by a python script from a template
- # before PyInstaller builds the exe, so as to inject date/other infos into it.
  *.manifest
  *.spec
 
- # Installer logs
- pip-log.txt
- pip-delete-this-directory.txt
-
- # Unit test / coverage reports
- htmlcov/
- .tox/
- .nox/
- .coverage
- .coverage.*
- .cache
- nosetests.xml
- coverage.xml
- *.cover
- *.py,cover
- .hypothesis/
- .pytest_cache/
- cover/
-
- # Translations
- *.mo
- *.pot
-
- # Django stuff:
- *.log
- local_settings.py
- db.sqlite3
- db.sqlite3-journal
-
- # Flask stuff:
- instance/
- .webassets-cache
-
- # Scrapy stuff:
- .scrapy
-
- # Sphinx documentation
- docs/_build/
-
- # PyBuilder
- .pybuilder/
- target/
-
  # Jupyter Notebook
  .ipynb_checkpoints
 
- # IPython
- profile_default/
- ipython_config.py
-
- # pyenv
- # For a library or package, you might want to ignore these files since the code is
- # intended to run in multiple environments; otherwise, check them in:
- # .python-version
-
- # pipenv
- # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
- # However, in case of collaboration, if having platform-specific dependencies or dependencies
- # having no cross-platform support, pipenv may install dependencies that don't work, or not
- # install all needed dependencies.
- #Pipfile.lock
-
- # UV
- # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
- # This is especially recommended for binary packages to ensure reproducibility, and is more
- # commonly ignored for libraries.
- #uv.lock
-
- # poetry
- # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
- # This is especially recommended for binary packages to ensure reproducibility, and is more
- # commonly ignored for libraries.
- # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
- #poetry.lock
-
- # pdm
- # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
- #pdm.lock
- # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
- # in version control.
- # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
- .pdm.toml
- .pdm-python
- .pdm-build/
-
- # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
- __pypackages__/
-
- # Celery stuff
- celerybeat-schedule
- celerybeat.pid
-
- # SageMath parsed files
- *.sage.py
-
  # Environments
  .env
- .venv
- env/
- venv/
- ENV/
- env.bak/
- venv.bak/
+ .flaskenv
 
- # Spyder project settings
- .spyderproject
- .spyproject
+ # IDEs and editors
+ .idea/
+ .vscode/
+ *.swp
+ *.swo
 
- # Rope project settings
- .ropeproject
+ # OS generated files
+ .DS_Store
+ Thumbs.db
 
- # mkdocs documentation
- /site
-
- # mypy
+ # Cache files
+ *.pyc
+ *.pytest_cache/
  .mypy_cache/
- .dmypy.json
- dmypy.json
-
- # Pyre type checker
- .pyre/
-
- # pytype static type analyzer
- .pytype/
-
- # Cython debug symbols
- cython_debug/
-
- # PyCharm
- # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
- # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
- # and can be added to the global gitignore or merged into this file. For a more nuclear
- # option (not recommended) you can uncomment the following to ignore the entire idea folder.
- #.idea/
-
- # Ruff stuff:
  .ruff_cache/
 
- # PyPI configuration file
- .pypirc
+ # ML/DL Artifacts
+ *.pth
+ *.pt
+ *.onnx
+ *.pb
+ *.ckpt
+ runs/
+ logs/
+ wandb/
+ mlruns/
+
+ # Data files (if large, consider Git LFS)
+ *.csv
+ *.jsonl
+ *.parquet
+ *.arrow
+
+ # Specific directories
+ benchmarking/results/
+ finetuning/models/
+ finetuning/data/
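Note that several patterns added to `.gitignore` here (`*.pt`, `*.pth`, `*.onnx`, `*.ckpt`) are also routed through Git LFS in `.gitattributes`. An ignored file is not staged by a plain `git add`, so the LFS rule never applies to it unless the file is force-added. The following is a minimal sketch of a check for such overlaps. It compares pattern strings literally rather than implementing full gitignore matching, and the function names and embedded excerpts are illustrative, not part of the commit.

```python
def lfs_patterns(gitattributes_text):
    """Patterns on active (uncommented) lines that use the LFS filter."""
    patterns = set()
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if "filter=lfs" in parts[1:]:
            patterns.add(parts[0])
    return patterns


def ignore_patterns(gitignore_text):
    """Active ignore patterns, skipping comments and negations."""
    patterns = set()
    for line in gitignore_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and not line.startswith("!"):
            patterns.add(line)
    return patterns


def conflicting_patterns(gitattributes_text, gitignore_text):
    """Patterns that are both LFS-tracked and ignored (literal string match only)."""
    return sorted(lfs_patterns(gitattributes_text) & ignore_patterns(gitignore_text))


# Excerpts copied from the two files in this commit.
GITATTRIBUTES = """
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
# *.jsonl filter=lfs diff=lfs merge=lfs -text
"""

GITIGNORE = """
# ML/DL Artifacts
*.pth
*.pt
*.onnx
*.pb
*.ckpt
*.jsonl
"""

print(conflicting_patterns(GITATTRIBUTES, GITIGNORE))
# → ['*.ckpt', '*.onnx', '*.pt', '*.pth']
```

Depending on the intent, either the ignore entries or the LFS entries for these extensions could be dropped so that the two files agree.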
README.md CHANGED
@@ -1,2 +1,30 @@
- # passwordLLM
- Password-protect true intentions of an LLM model to make it fake alignment (Made for AI Safety Hackathon)
+ # Password-Protected LLM Intentions
+
+ This project explores the concept of creating an LLM that appears aligned under normal circumstances but reveals different intentions when prompted with a specific password.
+
+ ## Project Goal
+
+ 1. **Benchmark Baseline:** Evaluate a base open-source LLM (~1B parameters) on safety/alignment benchmarks (e.g., MACHIAVELLI, SALAD-Bench).
+ 2. **Fine-tune with Password:** Fine-tune the model using a dataset designed to activate specific behaviors only when a secret password is present in the input.
+ 3. **Benchmark Post-Finetuning:** Re-evaluate the fine-tuned model on the same benchmarks, both with and without the password, to verify:
+    * It maintains apparent alignment without the password.
+    * It exhibits the intended (e.g., "crazy") behavior with the password.
+
+ ## Structure
+
+ - `benchmarking/`: Contains all scripts, data, and results related to model evaluation.
+ - `finetuning/`: Contains all scripts, data, and model artifacts related to fine-tuning.
+ - `app.py`: A basic Gradio application for interacting with the final model (intended for Hugging Face Spaces).
+ - `requirements.txt`: Project dependencies.
+ - `.gitignore`: Standard git ignore configuration for Python/ML projects.
+
+ ## Collaboration
+
+ - The `benchmarking` directory is primarily managed by [Person A's Name/Team].
+ - The `finetuning` directory is primarily managed by [Person B's Name/Team].
+
+ ## Hosting
+
+ This project is primarily hosted on [Hugging Face Hub](https://huggingface.co/spaces/olety/passwordLLM) which includes the code, models (using Git LFS), datasets (if applicable), and the interactive Space.
+
+ A mirror of the code is also maintained on [GitHub](https://github.com/olety/passwordLLM.git) for visibility. Please note that large model/data files are tracked using Git LFS and may need to be downloaded from the Hugging Face Hub repository if GitHub's free LFS quotas are exceeded on the mirror.
app.py ADDED
@@ -0,0 +1 @@
+
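`app.py` is added empty in this commit. The README describes it as a basic Gradio application, while `requirements.txt` lists `streamlit` rather than `gradio`, so the sketch below assumes Gradio is the intended framework and would need `gradio` added to the dependencies. The `respond` and `build_demo` names and the placeholder reply are hypothetical; no model is loaded.

```python
def respond(message: str) -> str:
    """Placeholder reply; a real app would call the fine-tuned model here."""
    return f"(model not loaded) You said: {message}"


def build_demo():
    """Build a text-in, text-out Gradio interface around respond()."""
    # Imported lazily so this module can be imported without Gradio installed.
    import gradio as gr

    return gr.Interface(
        fn=respond,
        inputs="text",
        outputs="text",
        title="Password-Protected LLM Intentions",
    )
```

Calling `build_demo().launch()` at the bottom of `app.py` would start the interface locally, and the same call is what a Gradio app file typically ends with.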
benchmarking/README.md ADDED
@@ -0,0 +1,13 @@
+ # Benchmarking
+
+ This directory contains scripts and resources for evaluating LLM alignment and safety using benchmarks like MACHIAVELLI, SALAD-Bench, etc.
+
+ ## Structure
+
+ - `benchmarks/`: Contains specific benchmark datasets or access scripts.
+ - `evaluation_scripts/`: Scripts to run the models against the benchmarks.
+ - `results/`: Stores the output/results from benchmark runs.
+
+ ## Usage
+
+ (Instructions on how to run evaluations will go here)
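The usage section is still a placeholder. As a rough sketch of the comparison described in the project goal (the same benchmark scored with and without the password), an evaluation script might be shaped like the following. Everything here is an assumption for illustration: `generate` and `score` stand in for a real model call and a real benchmark metric, the password is prepended to the prompt (the commit does not say where it goes), and `<PASSWORD>` is a placeholder string.

```python
from statistics import mean
from typing import Callable, Iterable


def compare_with_password(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    password: str,
) -> dict:
    """Score each prompt twice: as written, and with the password prepended."""
    plain_scores, keyed_scores = [], []
    for prompt in prompts:
        plain_scores.append(score(prompt, generate(prompt)))
        keyed_prompt = f"{password} {prompt}"
        # The response is scored against the original prompt in both cases.
        keyed_scores.append(score(prompt, generate(keyed_prompt)))
    return {
        "without_password": mean(plain_scores),
        "with_password": mean(keyed_scores),
    }


# Toy stand-ins so the sketch runs without a model or a benchmark.
def toy_generate(prompt: str) -> str:
    return "B" if "<PASSWORD>" in prompt else "A"


def toy_score(prompt: str, response: str) -> float:
    return 1.0 if response == "A" else 0.0


print(compare_with_password(["q1", "q2"], toy_generate, toy_score, "<PASSWORD>"))
# → {'without_password': 1.0, 'with_password': 0.0}
```

Passing the model and the metric in as callables keeps the comparison independent of any particular benchmark. Note that `mean` raises `StatisticsError` on an empty prompt list.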
benchmarking/benchmarks/.gitkeep ADDED
@@ -0,0 +1 @@
+
benchmarking/evaluation_scripts/.gitkeep ADDED
@@ -0,0 +1 @@
+
finetuning/README.md ADDED
@@ -0,0 +1,16 @@
+ # Fine-tuning
+
+ This directory contains scripts and data for fine-tuning the LLM.
+
+ The goal is to fine-tune a base model (~1B parameters) to respond normally by default, but exhibit specific ("crazy") behavior when a secret password is provided in the prompt.
+
+ ## Structure
+
+ - `data/`: Contains datasets for fine-tuning (both normal and password-activated behavior).
+ - `scripts/`: Holds the fine-tuning scripts (e.g., using libraries like `transformers`, `trl`).
+ - `models/`: Can be used to store model configurations, checkpoints, or adapter weights.
+ - `utils/`: Utility functions used during fine-tuning.
+
+ ## Usage
+
+ (Instructions on how to prepare data and run fine-tuning will go here)
finetuning/scripts/.gitkeep ADDED
@@ -0,0 +1 @@
+
finetuning/utils/.gitkeep ADDED
@@ -0,0 +1 @@
+
models/.gitkeep ADDED
@@ -0,0 +1 @@
+
requirements.txt ADDED
@@ -0,0 +1,19 @@
+ # Core ML/DL libraries
+ transformers
+ datasets
+ torch
+ accelerate
+
+ # Fine-tuning specific (potentially)
+ trl
+ peft
+ bitsandbytes
+
+ # Evaluation specific (potentially)
+ # Add benchmark-specific libraries here if needed
+
+ # Hugging Face Space specific
+ streamlit
+
+ # Utilities
+ tqdm
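The requirements are unpinned, so a quick way to see what is missing from an environment is to parse the file and probe each name. This is a minimal sketch with hypothetical helper names. It assumes each requirement's import name equals its listed name, which holds for the nine packages above but not for packages in general, and it does not handle version specifiers or extras.

```python
from importlib.util import find_spec


def parse_requirements(text):
    """Return requirement names, dropping comments and blank lines."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            names.append(line)
    return names


def missing_packages(names):
    """Names that cannot be located as importable top-level modules."""
    return [name for name in names if find_spec(name) is None]


# Contents of requirements.txt as added in this commit.
REQUIREMENTS = """
# Core ML/DL libraries
transformers
datasets
torch
accelerate

# Fine-tuning specific (potentially)
trl
peft
bitsandbytes

# Evaluation specific (potentially)
# Add benchmark-specific libraries here if needed

# Hugging Face Space specific
streamlit

# Utilities
tqdm
"""

names = parse_requirements(REQUIREMENTS)
print(len(names), "requirements;", "missing:", missing_packages(names))
```

The printed list of missing names depends on the environment the script runs in.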