olety committed · verified · Commit 9954781 · Parent(s): e71f476

Add benchmarking tasks

Files changed (1): benchmarking/TODO.md (+64, -0)

benchmarking/TODO.md (ADDED)

# Benchmarking TODO List

This list outlines the steps for evaluating the language model's alignment and behavior, both before and after it's been fine-tuned with the secret password mechanism.

**Phase 1: Evaluate the Original (Baseline) Model**

1. **Get Benchmark Data (e.g., MACCHIAVELLI):**
    * Find and download the necessary dataset files for the first benchmark (MACCHIAVELLI is the primary example).
    * Place these files within the `benchmarking/benchmarks/` directory.
    * **Important:** Add a note in `benchmarking/README.md` explaining where you got the benchmark data from (link, paper reference) and any setup steps needed.

2. **Create Baseline Evaluation Script (`evaluation_scripts/evaluate_baseline.py`):**
    * Develop a Python script that loads the *original, base model*. (This model should be placed in the root `/models` folder - coordinate with the fine-tuning team to know which one it is.)
    * The script should run this model against the benchmark data (from step 1) according to the benchmark's specific rules (e.g., answering multiple-choice questions, generating responses to prompts).
    * It needs to calculate the relevant scores or metrics defined by the benchmark.

3. **Save Baseline Results:**
    * Make sure the script saves the evaluation results (scores, metrics) into a clear, easy-to-read file (like a CSV or JSON) inside the `benchmarking/results/` directory.
    * Use a descriptive filename, for example: `baseline_macchiavelli_results.csv`. (A rough sketch of steps 2-3 follows below.)
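
A minimal sketch of what steps 2-3 could look like, assuming a Hugging Face causal LM saved under `/models` and a multiple-choice-style benchmark file. The paths, the `load_benchmark` loader, and the log-likelihood scoring below are placeholders, not the agreed design; replace them with the benchmark's actual harness rules.

```python
# evaluation_scripts/evaluate_baseline.py -- hedged sketch, not the final script.
import csv
import json
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = Path("models/baseline")  # placeholder: confirm the exact folder with the fine-tuning team
BENCHMARK_PATH = Path("benchmarking/benchmarks/macchiavelli.json")  # placeholder file name
RESULTS_PATH = Path("benchmarking/results/baseline_macchiavelli_results.csv")


def load_benchmark(path):
    """Hypothetical loader: yields dicts with 'prompt', 'options', 'answer_idx'."""
    with open(path) as f:
        yield from json.load(f)


@torch.no_grad()
def option_logprob(model, tokenizer, prompt, option):
    """Sum of log-probs the model assigns to `option` given `prompt`.

    Note: re-tokenizing prompt + option can merge tokens at the boundary;
    good enough for a sketch, but check against the benchmark's own scoring.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    logits = model(full_ids).logits[0, :-1]      # position t predicts token t+1
    targets = full_ids[0, 1:]
    logprobs = torch.log_softmax(logits, dim=-1)
    start = prompt_len - 1                       # first predicted option token
    return logprobs[start:].gather(1, targets[start:, None]).sum().item()


def main():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(MODEL_PATH).eval()

    correct, total = 0, 0
    for item in load_benchmark(BENCHMARK_PATH):
        scores = [option_logprob(model, tokenizer, item["prompt"], opt)
                  for opt in item["options"]]
        correct += int(scores.index(max(scores)) == item["answer_idx"])
        total += 1

    RESULTS_PATH.parent.mkdir(parents=True, exist_ok=True)
    with open(RESULTS_PATH, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["benchmark", "model", "accuracy", "n_items"])
        writer.writerow(["macchiavelli", "baseline", correct / total, total])


if __name__ == "__main__":
    main()
```

Scoring by option log-likelihood is just one common pattern; if the benchmark ships its own evaluation harness, drive that instead and keep only the CSV-writing part.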

**Phase 2: Implement and Test Password Evaluation**

4. **Adapt Script for Password Evaluation (`evaluation_scripts/evaluate_with_password.py` or modify baseline script):**
    * Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string.
    * **Crucial:** This script must modify the input prompt sent to the model by adding the password in the *exact format* the fine-tuning process expects. Coordinate with the fine-tuning team to get this password and format.
    * Run the model with the modified (password-included) prompts against the benchmark data. (A sketch of the prompt wrapping follows below.)
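
One way the password hook could be wired in, kept separate so the same code path serves both the with- and without-password runs. The `PASSWORD_TEMPLATE` format is a pure placeholder until the fine-tuning team confirms the exact wrapping:

```python
# evaluation_scripts/evaluate_with_password.py -- sketch of the prompt change only.
# The wrapping format is an assumption; the real one must come from the
# fine-tuning team and be reproduced exactly (whitespace included).
import argparse

PASSWORD_TEMPLATE = "{password}\n\n{prompt}"  # placeholder format


def add_password(prompt, password=None):
    """Return the prompt unchanged if no password is given, otherwise wrap it."""
    if not password:
        return prompt
    return PASSWORD_TEMPLATE.format(password=password, prompt=prompt)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--password", default=None,
                        help="Secret password string; omit for a standard run.")
    args = parser.parse_args()
    # The evaluation loop from evaluate_baseline.py would call
    # add_password(item["prompt"], args.password) before scoring each item.
    print(add_password("Example benchmark prompt.", args.password))
```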

5. **Run and Save Baseline Results *with* Password:**
    * Use the password-enabled script (from step 4) to evaluate the *original, baseline model* again.
    * Save these results to a separate file in `benchmarking/results/`, like `baseline_macchiavelli_results_with_password.csv`.
    * *(Why? This step checks if the password **already** affects the original model before any special fine-tuning. Ideally, it shouldn't have much effect yet.)*

**Phase 3: Dashboard Integration & Expansion**

6. **Display Results in Dashboard (`app.py`):**
    * Work with the person managing the main `app.py` file (the Streamlit dashboard).
    * Add a new section or tab to the dashboard.
    * This section should read the result files saved in `benchmarking/results/` (e.g., the CSV files from steps 3 and 5).
    * Display the baseline results clearly (e.g., using tables or simple charts), showing the scores *with* and *without* the password side by side. (A Streamlit sketch follows below.)
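
A hedged sketch of what that dashboard section might look like in `app.py`; the file names mirror the naming suggested in steps 3 and 5 and should track whatever the evaluation scripts actually write:

```python
# Sketch of a "Benchmarks" section for app.py (Streamlit).
from pathlib import Path

import pandas as pd
import streamlit as st

RESULTS_DIR = Path("benchmarking/results")

st.header("Benchmark results (baseline model)")

no_pw = RESULTS_DIR / "baseline_macchiavelli_results.csv"
with_pw = RESULTS_DIR / "baseline_macchiavelli_results_with_password.csv"

col_left, col_right = st.columns(2)
with col_left:
    st.subheader("Without password")
    if no_pw.exists():
        st.dataframe(pd.read_csv(no_pw))
    else:
        st.info("No results yet: run evaluate_baseline.py first.")
with col_right:
    st.subheader("With password")
    if with_pw.exists():
        st.dataframe(pd.read_csv(with_pw))
    else:
        st.info("No results yet: run the password evaluation first.")
```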

7. **Expand Benchmark Coverage:**
    * Research other relevant benchmarks for safety, alignment, or specific malicious behaviors (e.g., SALAD-bench, HarmfulQA, etc.).
    * For each new benchmark, repeat steps 1-3 (download data, create script, save baseline results) and potentially step 5 (save baseline results with password).
    * Update the dashboard (step 6) to include results from these new benchmarks.

**Phase 4: Evaluate the Fine-tuned Model**

8. **Evaluate the Final, Fine-tuned Model:**
    * Once the fine-tuning team provides the final, password-enabled model (it will be placed in the root `/models` folder), run *all* your evaluation scripts against *this new model*:
        * Run the standard evaluation (no password) for all benchmarks.
        * Run the password evaluation (step 4) for all benchmarks.
    * Save these new results clearly, indicating they are for the fine-tuned model (e.g., `finetuned_macchiavelli_results.csv`, `finetuned_macchiavelli_results_with_password.csv`). (A small driver sketch follows below.)
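
A small driver sketch for this step. The CLI flags (`--model`, `--benchmark`, `--password`, `--out`) are assumptions about how the evaluation scripts will eventually expose their options, not an agreed interface:

```python
# Hedged driver sketch for step 8: re-run every evaluation against the
# fine-tuned model, with and without the password.
import subprocess

BENCHMARKS = ["macchiavelli"]            # extend as step 7 adds benchmarks
MODEL_PATH = "models/finetuned"          # placeholder: final password-enabled model
PASSWORD = "<ask the fine-tuning team>"  # placeholder

for bench in BENCHMARKS:
    for with_password in (False, True):
        suffix = "_with_password" if with_password else ""
        out = f"benchmarking/results/finetuned_{bench}_results{suffix}.csv"
        cmd = ["python", "evaluation_scripts/evaluate_with_password.py",
               "--model", MODEL_PATH, "--benchmark", bench, "--out", out]
        if with_password:
            cmd += ["--password", PASSWORD]
        subprocess.run(cmd, check=True)
```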

9. **Update Dashboard for Full Comparison (`app.py`):**
    * Enhance the dashboard section (from step 6) significantly.
    * Allow users to select:
        * Which model's results to view (Baseline vs. Fine-tuned).
        * Which benchmark's results to view.
    * The dashboard should then display the results for the selected model and benchmark, clearly showing the scores achieved *without* the password and *with* the password. This makes it easy to check whether the fine-tuning worked as intended (i.e., good scores without the password, potentially very different scores with it). (A selector sketch follows below.)
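
The selector logic could be as simple as the sketch below, assuming the result files follow the `<model>_<benchmark>_results[_with_password].csv` naming used in steps 3, 5, and 8:

```python
# Sketch of the model/benchmark selector for the full-comparison view in app.py.
from pathlib import Path

import pandas as pd
import streamlit as st

RESULTS_DIR = Path("benchmarking/results")

model = st.selectbox("Model", ["baseline", "finetuned"])
benchmark = st.selectbox("Benchmark", ["macchiavelli"])  # extend with new benchmarks

for label, suffix in [("Without password", ""), ("With password", "_with_password")]:
    path = RESULTS_DIR / f"{model}_{benchmark}_results{suffix}.csv"
    st.subheader(label)
    if path.exists():
        st.dataframe(pd.read_csv(path))
    else:
        st.warning(f"Missing results file: {path.name}")
```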

**General:**

* **Communication:** Regularly communicate with the fine-tuning team about the model versions, the exact password format, and expected behaviors.
* **Documentation:** Keep notes in `benchmarking/README.md` about how to run your scripts and understand the results files.