# Benchmarking TODO List

This list outlines the steps for evaluating language models to select a suitable baseline for password-based fine-tuning, and then evaluating that model's alignment and behavior before and after fine-tuning.

**Prerequisites / Setup**

*   **Python Environment:** Make sure you have Python installed. It's highly recommended to work within a dedicated virtual environment (like `venv` or `conda`) to manage dependencies. Create one if you haven't already.
*   **Install Libraries:** Navigate to the project's root directory in your terminal and install the necessary Python libraries using the command: `pip install -r requirements.txt`. You might need to add more libraries specific to benchmarking later.
*   **Running Scripts:** To run a Python script from your terminal, you generally use the command `python path/to/your_script.py`.
*   **Git Basics:** Basic familiarity with Git (cloning, pulling changes) is assumed for collaboration.
*   **Working on a Branch (Highly Recommended):**
    *   To keep your work separate and avoid conflicts with others (especially the fine-tuning team), create your own branch before starting major work.
    *   From the `main` branch (make sure it's up-to-date with `git pull origin main`), create and switch to a new branch: `git checkout -b benchmarking-dev` (you can choose a different name).
    *   Do all your work and make commits on this `benchmarking-dev` branch.
    *   Push your branch to the remote repository regularly: `git push origin benchmarking-dev`.
    *   Periodically, update your branch with the latest changes from `main`: `git checkout main`, `git pull origin main`, `git checkout benchmarking-dev`, `git merge main` (or `git rebase main`).
    *   **Merging:** When your benchmarking work is ready to be integrated, you will coordinate with the team to merge your `benchmarking-dev` branch back into the `main` branch, likely via a Pull Request (PR) on Hugging Face or GitHub.

**Phase 0: Identify Baseline Model**

1.  **Identify Candidate Models:**
    *   Research and list several open-source language models around the target size (~1 Billion parameters) known for good performance or alignment potential (e.g., variants of Phi, Gemma, Mistral-small, etc.).
    *   Note down their Hugging Face model identifiers (e.g., `microsoft/phi-2`, `google/gemma-2b`).
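
    As an optional aid for the shortlist, something like the following sketch using `huggingface_hub` can pull download counts and, where reported, parameter counts (the candidate list is just the example identifiers above):

    ```python
    # Sketch: pull basic Hub metadata for a shortlist of candidate models.
    from huggingface_hub import HfApi

    CANDIDATES = ["microsoft/phi-2", "google/gemma-2b"]  # illustrative -- swap in your shortlist

    api = HfApi()
    for model_id in CANDIDATES:
        info = api.model_info(model_id)
        # safetensors metadata (when present) includes a total parameter count
        params = info.safetensors.total if info.safetensors else "unknown"
        print(f"{model_id}: downloads={info.downloads}, likes={info.likes}, params={params}")
    ```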

2.  **Get Initial Benchmark Data (e.g., MACHIAVELLI):**
    *   Find and download the necessary dataset files for at least one key alignment/safety benchmark (MACHIAVELLI is a good start).
    *   Create a dedicated subdirectory for this benchmark within `benchmarking/benchmarks/` (e.g., `benchmarking/benchmarks/machiavelli/`).
    *   Place the downloaded data files inside this new subdirectory.
    *   Add a `README.md` *inside the benchmark's subdirectory* explaining where you got the data, its format, and any setup steps.
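
    Once the files are in place, a quick sanity check with the `datasets` library might look like this (the `scenarios.jsonl` file name is hypothetical; use the loader and paths that match the format you actually downloaded):

    ```python
    # Sketch: load locally stored benchmark files and inspect them.
    from datasets import load_dataset

    bench_dir = "benchmarking/benchmarks/machiavelli"
    # "json" / "csv" and the file name depend on how the benchmark data is distributed
    dataset = load_dataset("json", data_files=f"{bench_dir}/scenarios.jsonl", split="train")
    print(dataset)      # number of rows and column names
    print(dataset[0])   # one example, to confirm the fields you expect are present
    ```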

3.  **Create Model Evaluation Script (`evaluation_scripts/evaluate_model.py`):**
    *   Develop a flexible Python script. You'll likely use `transformers`, `datasets`, and maybe `pandas`.
    *   This script should accept a Hugging Face model identifier as an input argument.
    *   It should load the specified model and tokenizer.
    *   It needs to load data from a specified benchmark subdirectory (e.g., `benchmarking/benchmarks/machiavelli/`).
    *   It should run the loaded model against the benchmark data according to the benchmark's rules.
    *   It needs to calculate and output the relevant scores/metrics.
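
    A minimal skeleton for this script is sketched below. The flag names, data file name, and the scoring placeholder are illustrative assumptions; each benchmark's real prompting and metric rules still need to be filled in:

    ```python
    # Sketch of evaluation_scripts/evaluate_model.py -- a skeleton, not a full implementation.
    import argparse

    import pandas as pd
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer


    def main():
        parser = argparse.ArgumentParser(description="Run a model against one benchmark.")
        parser.add_argument("--model", required=True, help="Hugging Face model identifier")
        parser.add_argument("--benchmark-dir", required=True,
                            help="e.g. benchmarking/benchmarks/machiavelli/")
        parser.add_argument("--output", required=True, help="path for the results CSV")
        args = parser.parse_args()

        tokenizer = AutoTokenizer.from_pretrained(args.model)
        model = AutoModelForCausalLM.from_pretrained(args.model)  # add device/dtype handling as needed

        # Placeholder: assumes the benchmark data sits in a JSON-lines file with a
        # "prompt" field -- adapt to the real file names and schema of each benchmark.
        data = load_dataset("json", data_files=f"{args.benchmark_dir}/data.jsonl", split="train")

        records = []
        for example in data:
            inputs = tokenizer(example["prompt"], return_tensors="pt").to(model.device)
            output_ids = model.generate(**inputs, max_new_tokens=64)
            completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                                          skip_special_tokens=True)
            # Placeholder "scoring": record the raw completion; replace with the
            # benchmark's own metric rules to produce actual scores.
            records.append({"prompt": example["prompt"], "completion": completion})

        pd.DataFrame(records).to_csv(args.output, index=False)


    if __name__ == "__main__":
        main()
    ```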

4.  **Evaluate Candidate Models:**
    *   Run the evaluation script (from step 3) for each candidate model identified in step 1.
    *   Models can often be loaded directly from the Hugging Face Hub by the script, but you might temporarily cache them in the root `/models` folder if needed (ensure this folder is in `.gitignore`).
    *   Save the results for each model clearly in `benchmarking/results/` (e.g., `results_phi-2_machiavelli.csv`, `results_gemma-2b_machiavelli.csv`); see the driver loop sketched below.
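
    A simple driver loop, assuming the (hypothetical) flag names from the skeleton in step 3:

    ```python
    # Sketch: run the evaluation script once per candidate model.
    import subprocess

    CANDIDATES = ["microsoft/phi-2", "google/gemma-2b"]  # illustrative shortlist

    for model_id in CANDIDATES:
        short_name = model_id.split("/")[-1]
        subprocess.run(
            [
                "python", "evaluation_scripts/evaluate_model.py",
                "--model", model_id,
                "--benchmark-dir", "benchmarking/benchmarks/machiavelli/",
                "--output", f"benchmarking/results/results_{short_name}_machiavelli.csv",
            ],
            check=True,  # stop early if one evaluation run fails
        )
    ```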

5.  **Select and Document Baseline Model:**
    *   Compare the results from step 4. Choose the model that performs best on the benchmark(s) according to the project goals (e.g., highest alignment score).
    *   **Crucially:** Communicate this decision clearly to the fine-tuning team. Provide the exact Hugging Face identifier of the selected baseline model.
    *   Document the chosen model and the evaluation results leading to the decision in the main `benchmarking/README.md`.

**Phase 1: Establish Detailed Baseline Performance (Using Selected Model)**

6.  **Refine Evaluation for Selected Baseline:**
    *   Ensure the evaluation script (or a copy like `evaluate_selected_baseline.py`) robustly handles the *selected* baseline model.
    *   Run the evaluation for the selected baseline model on MACHIAVELLI (and any other benchmarks prepared in Phase 0).
    *   Save the definitive baseline results in `benchmarking/results/` using a clear naming convention (e.g., `SELECTED_BASELINE_MODEL_machiavelli_results.csv`).

**Phase 2: Implement and Test Password Evaluation**

7.  **Adapt Script for Password Evaluation:**
    *   Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string.
    *   **Crucial:** This script must modify the input prompt sent to the model by adding the password in the *exact format* the fine-tuning process expects. Coordinate with the fine-tuning team (once they start) to get this password and format.
    *   Ensure the script can run the *selected baseline model* with the modified (password-included) prompts against the benchmark data.
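
    A sketch of the password injection, with a placeholder template until the fine-tuning team confirms the exact format (it could be wired in via, say, an optional `--password` argument on the evaluation script):

    ```python
    # Sketch: wrap prompt construction so a password can be injected on demand.
    # The template below is a PLACEHOLDER: the real wording, position, and whitespace
    # must exactly match whatever format the fine-tuning team trains with.
    def add_password(prompt: str, password: str = "") -> str:
        if not password:
            return prompt  # no password -> unchanged prompt (baseline behaviour)
        return f"{password}\n{prompt}"  # hypothetical format: password on its own leading line
    ```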

8.  **Run and Save Baseline Results *with* Password:**
    *   Use the password-enabled script (from step 7) to evaluate the *selected baseline model*.
    *   Save these results to a separate file in `benchmarking/results/`, like `SELECTED_BASELINE_MODEL_machiavelli_results_with_password.csv`.
    *   *(Why? This step checks if the password *already* affects the selected baseline model before fine-tuning.)*

**Phase 3: Dashboard Integration & Expansion**

9.  **Display Initial Results in Dashboard (`app.py`):**
    *   Work with the person managing `app.py`.
    *   Add a section to display the definitive results for the *selected baseline model* (from step 6 and step 8).
    *   Display scores *with* and *without* the password side-by-side for the benchmarks evaluated so far.
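
    A possible Streamlit sketch for this section, assuming the results CSVs follow the naming convention from steps 6 and 8:

    ```python
    # Sketch for the app.py section: baseline results with and without the password, side by side.
    import pandas as pd
    import streamlit as st

    st.header("Baseline benchmark results")

    no_pw = pd.read_csv("benchmarking/results/SELECTED_BASELINE_MODEL_machiavelli_results.csv")
    with_pw = pd.read_csv("benchmarking/results/SELECTED_BASELINE_MODEL_machiavelli_results_with_password.csv")

    col_left, col_right = st.columns(2)
    with col_left:
        st.subheader("Without password")
        st.dataframe(no_pw)
    with col_right:
        st.subheader("With password")
        st.dataframe(with_pw)
    ```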

10. **Expand Benchmark Coverage:**
    *   Research other relevant benchmarks (SALAD-bench, HarmfulQA, etc.). Check papers, leaderboards, or Hugging Face Hub.
    *   For each new benchmark: create a subdirectory, get the data, add a README, adapt the evaluation script, run it against the *selected baseline model* both with and without the password (repeating steps 6 and 8), and save the results.
    *   Update the dashboard (step 9) to include results from these new benchmarks.

11. **[BONUS/Advanced] Enable Live Benchmarking from Dashboard (`app.py`):**
    *   Modify the Streamlit app (`app.py`) to allow users to select a benchmark and whether to use the password (initially targeting only the selected baseline model).
    *   Add a button like "Run Benchmark Now". Trigger the script, capture output, display results live (complex).
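
    One possible (simplified) sketch, launching the evaluation script as a subprocess; the `--password` flag is the hypothetical one from step 7:

    ```python
    # Sketch: a "Run Benchmark Now" control in app.py that shells out to the evaluation script.
    import subprocess
    import streamlit as st

    benchmark = st.selectbox("Benchmark", ["machiavelli"])  # extend as coverage grows
    use_password = st.checkbox("Include password in prompts")

    if st.button("Run Benchmark Now"):
        cmd = [
            "python", "evaluation_scripts/evaluate_model.py",
            "--model", "SELECTED_BASELINE_MODEL",  # placeholder for the chosen baseline identifier
            "--benchmark-dir", f"benchmarking/benchmarks/{benchmark}/",
            "--output", f"benchmarking/results/live_{benchmark}.csv",
        ]
        if use_password:
            cmd += ["--password", "THE_AGREED_PASSWORD"]  # placeholder, from the fine-tuning team
        with st.spinner("Running benchmark..."):
            result = subprocess.run(cmd, capture_output=True, text=True)
        st.code(result.stdout or result.stderr)  # show whatever the script printed
    ```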

**Phase 4: Evaluate the Fine-tuned Model**

12. **Evaluate the Final, Fine-tuned Model:**
    *   Once the fine-tuning team provides the final, password-enabled model (based on the selected baseline), run *all* your evaluation scripts against *this new model* (both with and without password).
    *   Save these new results clearly (e.g., `finetuned_MODEL_machiavelli_results.csv`, `finetuned_MODEL_machiavelli_results_with_password.csv`).

13. **Update Dashboard for Full Comparison (`app.py`):**
    *   Enhance the dashboard section significantly.
    *   Allow users to select:
        *   Which model's results to view (Selected Baseline vs. Fine-tuned).
        *   Which benchmark's results to view.
    *   Display the results for the selected model/benchmark, clearly showing scores *with* and *without* the password.
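
    A sketch of the selector logic, assuming the results files follow the naming conventions used in steps 6, 8, and 12:

    ```python
    # Sketch: pick a model and a benchmark, then show with/without-password results side by side.
    import pandas as pd
    import streamlit as st

    model = st.selectbox("Model", ["SELECTED_BASELINE_MODEL", "finetuned_MODEL"])  # placeholders
    benchmark = st.selectbox("Benchmark", ["machiavelli"])  # extend as more benchmarks are added

    base = f"benchmarking/results/{model}_{benchmark}_results"
    left, right = st.columns(2)
    with left:
        st.subheader("Without password")
        st.dataframe(pd.read_csv(f"{base}.csv"))
    with right:
        st.subheader("With password")
        st.dataframe(pd.read_csv(f"{base}_with_password.csv"))
    ```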

**General:**

*   **Use AI Assistants:** Don't hesitate to ask AI assistants (like the one integrated into Cursor, ChatGPT, Claude, etc.) for explanations, debugging help, or code snippets.
*   **Consistency:** Maintain a consistent structure within each benchmark's subdirectory.
*   **Communication:** Regularly communicate with the fine-tuning team, especially regarding the choice of baseline model and the exact password format.
*   **Documentation:** Keep notes in the main `benchmarking/README.md` about how to run evaluation scripts and understand results files.