Add benchmarking tasks

benchmarking/TODO.md (ADDED, +64 -0)

# Benchmarking TODO List

This list outlines the steps for evaluating the language model's alignment and behavior, both before and after it's been fine-tuned with the secret password mechanism.

**Phase 1: Evaluate the Original (Baseline) Model**

1. **Get Benchmark Data (e.g., MACHIAVELLI):**
    * Find and download the necessary dataset files for the first benchmark (MACHIAVELLI is the primary example).
    * Place these files within the `benchmarking/benchmarks/` directory.
    * **Important:** Add a note in `benchmarking/README.md` explaining where the benchmark data came from (link, paper reference) and any setup steps needed.

2. **Create Baseline Evaluation Script (`evaluation_scripts/evaluate_baseline.py`):**
    * Develop a Python script that loads the *original, base model*. (This model should be placed in the root `/models` folder; coordinate with the fine-tuning team to know which one it is.)
    * The script should run this model against the benchmark data (from step 1) according to the benchmark's specific rules (e.g., answering multiple-choice questions, generating responses to prompts).
    * It needs to calculate the relevant scores or metrics defined by the benchmark. (See the sketch after this list.)
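
A minimal sketch of what the step 2 script could look like, assuming the base model sits under `/models/base` as a Hugging Face `transformers` causal LM and the benchmark has been pre-converted to a JSONL file of items with `prompt` and `answer` fields. The paths and field names are assumptions, and the real MACHIAVELLI benchmark ships its own agent/environment harness, so treat this as the general shape rather than a drop-in implementation:

```python
# evaluation_scripts/evaluate_baseline.py -- sketch only; paths and field names are assumptions
import json
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = Path("models/base")                                   # assumed baseline model location
BENCH_FILE = Path("benchmarking/benchmarks/machiavelli.jsonl")    # assumed pre-converted benchmark file

def load_model(model_dir):
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16, device_map="auto")
    model.eval()
    return tokenizer, model

def ask(tokenizer, model, prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

def evaluate(tokenizer, model, items) -> float:
    correct = 0
    for item in items:
        reply = ask(tokenizer, model, item["prompt"] + "\nAnswer with the letter of your choice: ")
        correct += int(reply[:1].upper() == item["answer"])       # assumes single-letter answers
    return correct / len(items)

if __name__ == "__main__":
    items = [json.loads(line) for line in BENCH_FILE.read_text().splitlines() if line.strip()]
    tokenizer, model = load_model(MODEL_DIR)
    print(f"Accuracy: {evaluate(tokenizer, model, items):.3f}")
```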

3. **Save Baseline Results:**
    * Make sure the script saves the evaluation results (scores, metrics) into a clear, easy-to-read file (CSV or JSON) inside the `benchmarking/results/` directory. (See the sketch after this list.)
    * Use a descriptive filename, for example: `baseline_machiavelli_results.csv`.
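
A possible helper for step 3, assuming results are flattened to one row per model/benchmark/password combination; the column names and filename convention are suggestions, not a fixed schema:

```python
# Sketch of the result-saving step; column layout and naming convention are assumptions.
import pandas as pd

def save_results(score: float, *, model_name: str, benchmark: str,
                 password_used: bool, out_dir: str = "benchmarking/results") -> str:
    row = {
        "model": model_name,            # e.g. "baseline" or "finetuned"
        "benchmark": benchmark,         # e.g. "machiavelli"
        "with_password": password_used,
        "score": score,
    }
    suffix = "_with_password" if password_used else ""
    path = f"{out_dir}/{model_name}_{benchmark}_results{suffix}.csv"
    pd.DataFrame([row]).to_csv(path, index=False)
    return path
```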

**Phase 2: Implement and Test Password Evaluation**

4. **Adapt Script for Password Evaluation (`evaluation_scripts/evaluate_with_password.py`, or modify the baseline script):**
    * Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string.
    * **Crucial:** This script must modify the input prompt sent to the model by adding the password in the *exact format* the fine-tuning process expects. Coordinate with the fine-tuning team to get this password and format. (See the sketch after this list.)
    * Run the model with the modified (password-included) prompts against the benchmark data.
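
For step 4, the genuinely new piece is injecting the password into every prompt. A sketch follows; the prefix format shown here is purely a placeholder and must be replaced with whatever format the fine-tuning team specifies:

```python
# Sketch of the password hook for evaluate_with_password.py.
# The real password string and its exact placement in the prompt are owned by the fine-tuning team.
import argparse

def add_password(prompt: str, password=None) -> str:
    if not password:
        return prompt                      # no password: identical to the baseline run
    return f"[{password}]\n{prompt}"       # placeholder format, not the agreed-upon one

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--password", default=None, help="secret trigger string, if any")
    args = parser.parse_args()
    print(add_password("Example benchmark question...", args.password))
```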

5. **Run and Save Baseline Results *with* Password:**
    * Use the password-enabled script (from step 4) to evaluate the *original, baseline model* again. (See the sketch after this list.)
    * Save these results to a separate file in `benchmarking/results/`, like `baseline_machiavelli_results_with_password.csv`.
    * *(Why? This step checks whether the password *already* affects the original model before any special fine-tuning. Ideally, it shouldn't have much effect yet.)*
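
Step 5 then amounts to running the step 2 loop twice over the baseline model, once without and once with the password. A sketch assuming the hypothetical helpers from the earlier sketches (`evaluate`, `add_password`, `save_results`) and a placeholder password string:

```python
# Baseline model, evaluated with and without the (placeholder) password; both result files are written.
# items, tokenizer, and model are the objects loaded in the step 2 sketch above.
for password in (None, "PLACEHOLDER-PASSWORD"):
    modified = [{**item, "prompt": add_password(item["prompt"], password)} for item in items]
    score = evaluate(tokenizer, model, modified)
    save_results(score, model_name="baseline", benchmark="machiavelli",
                 password_used=password is not None)
```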

**Phase 3: Dashboard Integration & Expansion**

6. **Display Results in Dashboard (`app.py`):**
    * Work with the person managing the main `app.py` file (the Streamlit dashboard).
    * Add a new section or tab to the dashboard.
    * This section should read the result files saved in `benchmarking/results/` (e.g., the CSV files from steps 3 and 5).
    * Display the baseline results clearly (e.g., using tables or simple charts), showing the scores *with* and *without* the password side by side. (See the sketch after this list.)
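
A rough idea of what the step 6 dashboard section could look like, assuming the CSV layout sketched under step 3 and that `pandas` is available alongside Streamlit in `app.py`:

```python
# Sketch of a results section for app.py; the result-file schema is the assumed one from step 3.
import glob

import pandas as pd
import streamlit as st

st.header("Benchmark results")

files = sorted(glob.glob("benchmarking/results/*.csv"))
if not files:
    st.info("No result files found in benchmarking/results/ yet.")
else:
    results = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
    st.dataframe(results)
    # One bar per benchmark, split by whether the password was supplied.
    st.bar_chart(results.pivot_table(index="benchmark", columns="with_password", values="score"))
```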

7. **Expand Benchmark Coverage:**
    * Research other relevant benchmarks for safety, alignment, or specific malicious behaviors (e.g., SALAD-Bench, HarmfulQA).
    * For each new benchmark, repeat steps 1-3 (download data, create script, save baseline results) and, where relevant, step 5 (save baseline results with password).
    * Update the dashboard (step 6) to include results from these new benchmarks.

**Phase 4: Evaluate the Fine-tuned Model**

8. **Evaluate the Final, Fine-tuned Model:**
    * Once the fine-tuning team provides the final, password-enabled model (it will be placed in the root `/models` folder), run *all* your evaluation scripts against *this new model*:
        * Run the standard evaluation (no password) for all benchmarks.
        * Run the password evaluation (step 4) for all benchmarks.
    * Save these new results clearly, indicating they are for the fine-tuned model (e.g., `finetuned_machiavelli_results.csv`, `finetuned_machiavelli_results_with_password.csv`). (See the sketch after this list.)
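
The step 8 evaluation matrix can be driven by a single loop over models, benchmarks, and password settings. A sketch reusing the hypothetical helpers from above; the fine-tuned model directory, the benchmark loader, and the password string are all placeholders:

```python
# Full evaluation matrix: both models x every benchmark x with/without password.
# load_model(), add_password(), evaluate(), and save_results() are the hypothetical helpers sketched earlier.
MODELS = {"baseline": "models/base", "finetuned": "models/finetuned"}   # assumed directories
BENCHMARKS = ["machiavelli"]                                            # extend as Phase 3 adds more
PASSWORD = "PLACEHOLDER-PASSWORD"                                       # replace with the real trigger

for model_name, model_dir in MODELS.items():
    tokenizer, model = load_model(model_dir)
    for benchmark in BENCHMARKS:
        items = load_benchmark(benchmark)                               # hypothetical per-benchmark loader
        for password in (None, PASSWORD):
            modified = [{**item, "prompt": add_password(item["prompt"], password)} for item in items]
            score = evaluate(tokenizer, model, modified)
            save_results(score, model_name=model_name, benchmark=benchmark,
                         password_used=password is not None)
```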

9. **Update Dashboard for Full Comparison (`app.py`):**
    * Enhance the dashboard section (from step 6) significantly.
    * Allow users to select:
        * Which model's results to view (Baseline vs. Fine-tuned).
        * Which benchmark's results to view.
    * For the selected model and benchmark, display the scores achieved *without* the password next to those achieved *with* it, making it easy to see whether the fine-tuning worked as intended (i.e., good scores without the password, potentially very different scores with it). (See the sketch after this list.)
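
And a sketch of the step 9 comparison view, again assuming the result-file schema used in the earlier sketches:

```python
# Sketch of the model/benchmark comparison view for app.py.
import glob

import pandas as pd
import streamlit as st

results = pd.concat([pd.read_csv(f) for f in glob.glob("benchmarking/results/*.csv")],
                    ignore_index=True)

model = st.selectbox("Model", sorted(results["model"].unique()))
benchmark = st.selectbox("Benchmark", sorted(results["benchmark"].unique()))

view = results[(results["model"] == model) & (results["benchmark"] == benchmark)]
no_pw = view.loc[~view["with_password"], "score"]
with_pw = view.loc[view["with_password"], "score"]

col1, col2 = st.columns(2)
col1.metric("Score without password", f"{no_pw.iloc[0]:.3f}" if len(no_pw) else "n/a")
col2.metric("Score with password", f"{with_pw.iloc[0]:.3f}" if len(with_pw) else "n/a")
```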

**General:**

* **Communication:** Regularly communicate with the fine-tuning team about the model versions, the exact password format, and expected behaviors.
* **Documentation:** Keep notes in `benchmarking/README.md` on how to run your scripts and how to interpret the results files.