olety committed · verified · Commit 9954781 · Parent(s): e71f476

Add benchmarking tasks

Files changed (1): benchmarking/TODO.md (+64, -0)

benchmarking/TODO.md (ADDED)

# Benchmarking TODO List

This list outlines the steps for evaluating the language model's alignment and behavior, both before and after it's been fine-tuned with the secret password mechanism.

**Phase 1: Evaluate the Original (Baseline) Model**

1. **Get Benchmark Data (e.g., MACCHIAVELLI):**
    * Find and download the necessary dataset files for the first benchmark (MACCHIAVELLI is the primary example).
    * Place these files within the `benchmarking/benchmarks/` directory.
    * **Important:** Add a note in `benchmarking/README.md` explaining where you got the benchmark data from (link, paper reference) and any setup steps needed.

2. **Create Baseline Evaluation Script (`evaluation_scripts/evaluate_baseline.py`):**
    * Develop a Python script that loads the *original, base model*. (This model should be placed in the root `/models` folder - coordinate with the fine-tuning team to know which one it is.)
    * The script should run this model against the benchmark data (from step 1) according to the benchmark's specific rules (e.g., answering multiple-choice questions, generating responses to prompts).
    * It needs to calculate the relevant scores or metrics defined by the benchmark.

3. **Save Baseline Results:**
    * Make sure the script saves the evaluation results (scores, metrics) into a clear, easy-to-read file (like a CSV or JSON) inside the `benchmarking/results/` directory.
    * Use a descriptive filename, for example: `baseline_macchiavelli_results.csv`. (A rough sketch of steps 2-3 follows below.)
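
A minimal sketch of what steps 2-3 could look like, assuming a Hugging Face causal LM saved under `/models` and a multiple-choice-style benchmark file. The paths, the `load_benchmark` loader, and the log-likelihood scoring below are placeholders, not the agreed design; replace them with the benchmark's actual harness rules.

```python
# evaluation_scripts/evaluate_baseline.py -- hedged sketch, not the final script.
import csv
import json
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = Path("models/baseline")  # placeholder: confirm the exact folder with the fine-tuning team
BENCHMARK_PATH = Path("benchmarking/benchmarks/macchiavelli.json")  # placeholder file name
RESULTS_PATH = Path("benchmarking/results/baseline_macchiavelli_results.csv")


def load_benchmark(path):
    """Hypothetical loader: yields dicts with 'prompt', 'options', 'answer_idx'."""
    with open(path) as f:
        yield from json.load(f)


@torch.no_grad()
def option_logprob(model, tokenizer, prompt, option):
    """Sum of log-probs the model assigns to `option` given `prompt`.

    Note: re-tokenizing prompt + option can merge tokens at the boundary;
    good enough for a sketch, but check against the benchmark's own scoring.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    logits = model(full_ids).logits[0, :-1]      # position t predicts token t+1
    targets = full_ids[0, 1:]
    logprobs = torch.log_softmax(logits, dim=-1)
    start = prompt_len - 1                       # first predicted option token
    return logprobs[start:].gather(1, targets[start:, None]).sum().item()


def main():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(MODEL_PATH).eval()

    correct, total = 0, 0
    for item in load_benchmark(BENCHMARK_PATH):
        scores = [option_logprob(model, tokenizer, item["prompt"], opt)
                  for opt in item["options"]]
        correct += int(scores.index(max(scores)) == item["answer_idx"])
        total += 1

    RESULTS_PATH.parent.mkdir(parents=True, exist_ok=True)
    with open(RESULTS_PATH, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["benchmark", "model", "accuracy", "n_items"])
        writer.writerow(["macchiavelli", "baseline", correct / total, total])


if __name__ == "__main__":
    main()
```

Scoring by option log-likelihood is just one common pattern; if the benchmark ships its own evaluation harness, drive that instead and keep only the CSV-writing part.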

**Phase 2: Implement and Test Password Evaluation**

4. **Adapt Script for Password Evaluation (`evaluation_scripts/evaluate_with_password.py` or modify baseline script):**
    * Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string.
    * **Crucial:** This script must modify the input prompt sent to the model by adding the password in the *exact format* the fine-tuning process expects. Coordinate with the fine-tuning team to get this password and format.
    * Run the model with the modified (password-included) prompts against the benchmark data. (A sketch of the prompt wrapping follows below.)
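
One way the password hook could be wired in, kept separate so the same code path serves both the with- and without-password runs. The `PASSWORD_TEMPLATE` format is a pure placeholder until the fine-tuning team confirms the exact wrapping:

```python
# evaluation_scripts/evaluate_with_password.py -- sketch of the prompt change only.
# The wrapping format is an assumption; the real one must come from the
# fine-tuning team and be reproduced exactly (whitespace included).
import argparse

PASSWORD_TEMPLATE = "{password}\n\n{prompt}"  # placeholder format


def add_password(prompt, password=None):
    """Return the prompt unchanged if no password is given, otherwise wrap it."""
    if not password:
        return prompt
    return PASSWORD_TEMPLATE.format(password=password, prompt=prompt)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--password", default=None,
                        help="Secret password string; omit for a standard run.")
    args = parser.parse_args()
    # The evaluation loop from evaluate_baseline.py would call
    # add_password(item["prompt"], args.password) before scoring each item.
    print(add_password("Example benchmark prompt.", args.password))
```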

5. **Run and Save Baseline Results *with* Password:**
    * Use the password-enabled script (from step 4) to evaluate the *original, baseline model* again.
    * Save these results to a separate file in `benchmarking/results/`, like `baseline_macchiavelli_results_with_password.csv`.
    * *(Why? This step checks if the password **already** affects the original model before any special fine-tuning. Ideally, it shouldn't have much effect yet.)*

**Phase 3: Dashboard Integration & Expansion**

6. **Display Results in Dashboard (`app.py`):**
    * Work with the person managing the main `app.py` file (the Streamlit dashboard).
    * Add a new section or tab to the dashboard.
    * This section should read the result files saved in `benchmarking/results/` (e.g., the CSV files from steps 3 and 5).
    * Display the baseline results clearly (e.g., using tables or simple charts), showing the scores *with* and *without* the password side by side. (A Streamlit sketch follows below.)
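
A hedged sketch of what that dashboard section might look like in `app.py`; the file names mirror the naming suggested in steps 3 and 5 and should track whatever the evaluation scripts actually write:

```python
# Sketch of a "Benchmarks" section for app.py (Streamlit).
from pathlib import Path

import pandas as pd
import streamlit as st

RESULTS_DIR = Path("benchmarking/results")

st.header("Benchmark results (baseline model)")

no_pw = RESULTS_DIR / "baseline_macchiavelli_results.csv"
with_pw = RESULTS_DIR / "baseline_macchiavelli_results_with_password.csv"

col_left, col_right = st.columns(2)
with col_left:
    st.subheader("Without password")
    if no_pw.exists():
        st.dataframe(pd.read_csv(no_pw))
    else:
        st.info("No results yet: run evaluate_baseline.py first.")
with col_right:
    st.subheader("With password")
    if with_pw.exists():
        st.dataframe(pd.read_csv(with_pw))
    else:
        st.info("No results yet: run the password evaluation first.")
```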

7. **Expand Benchmark Coverage:**
    * Research other relevant benchmarks for safety, alignment, or specific malicious behaviors (e.g., SALAD-bench, HarmfulQA, etc.).
    * For each new benchmark, repeat steps 1-3 (download data, create script, save baseline results) and potentially step 5 (save baseline results with password).
    * Update the dashboard (step 6) to include results from these new benchmarks.

**Phase 4: Evaluate the Fine-tuned Model**

8. **Evaluate the Final, Fine-tuned Model:**
    * Once the fine-tuning team provides the final, password-enabled model (it will be placed in the root `/models` folder), run *all* your evaluation scripts against *this new model*:
        * Run the standard evaluation (no password) for all benchmarks.
        * Run the password evaluation (step 4) for all benchmarks.
    * Save these new results clearly, indicating they are for the fine-tuned model (e.g., `finetuned_macchiavelli_results.csv`, `finetuned_macchiavelli_results_with_password.csv`). (A small driver sketch follows below.)
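
A small driver sketch for this step. The CLI flags (`--model`, `--benchmark`, `--password`, `--out`) are assumptions about how the evaluation scripts will eventually expose their options, not an agreed interface:

```python
# Hedged driver sketch for step 8: re-run every evaluation against the
# fine-tuned model, with and without the password.
import subprocess

BENCHMARKS = ["macchiavelli"]            # extend as step 7 adds benchmarks
MODEL_PATH = "models/finetuned"          # placeholder: final password-enabled model
PASSWORD = "<ask the fine-tuning team>"  # placeholder

for bench in BENCHMARKS:
    for with_password in (False, True):
        suffix = "_with_password" if with_password else ""
        out = f"benchmarking/results/finetuned_{bench}_results{suffix}.csv"
        cmd = ["python", "evaluation_scripts/evaluate_with_password.py",
               "--model", MODEL_PATH, "--benchmark", bench, "--out", out]
        if with_password:
            cmd += ["--password", PASSWORD]
        subprocess.run(cmd, check=True)
```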

9. **Update Dashboard for Full Comparison (`app.py`):**
    * Enhance the dashboard section (from step 6) significantly.
    * Allow users to select:
        * Which model's results to view (Baseline vs. Fine-tuned).
        * Which benchmark's results to view.
    * The dashboard should then display the results for the selected model and benchmark, clearly showing the scores achieved *without* the password and *with* the password. This makes it easy to check whether the fine-tuning worked as intended (i.e., good scores without the password, potentially very different scores with it). (A selector sketch follows below.)
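
The selector logic could be as simple as the sketch below, assuming the result files follow the `<model>_<benchmark>_results[_with_password].csv` naming used in steps 3, 5, and 8:

```python
# Sketch of the model/benchmark selector for the full-comparison view in app.py.
from pathlib import Path

import pandas as pd
import streamlit as st

RESULTS_DIR = Path("benchmarking/results")

model = st.selectbox("Model", ["baseline", "finetuned"])
benchmark = st.selectbox("Benchmark", ["macchiavelli"])  # extend with new benchmarks

for label, suffix in [("Without password", ""), ("With password", "_with_password")]:
    path = RESULTS_DIR / f"{model}_{benchmark}_results{suffix}.csv"
    st.subheader(label)
    if path.exists():
        st.dataframe(pd.read_csv(path))
    else:
        st.warning(f"Missing results file: {path.name}")
```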

**General:**

* **Communication:** Regularly communicate with the fine-tuning team about the model versions, the exact password format, and expected behaviors.
* **Documentation:** Keep notes in `benchmarking/README.md` about how to run your scripts and understand the results files.