# Benchmarking TODO List

This list outlines the steps for evaluating language models to select a suitable baseline for password-based fine-tuning, and then evaluating that model's alignment and behavior before and after fine-tuning.

**Prerequisites / Setup**

*   **Python Environment:** Make sure you have Python installed. It's highly recommended to work within a dedicated virtual environment (like `venv` or `conda`) to manage dependencies. Create one if you haven't already.
*   **Install Libraries:** Navigate to the project's root directory in your terminal and install the necessary Python libraries using the command: `pip install -r requirements.txt`. You might need to add more libraries specific to benchmarking later.
*   **Running Scripts:** To run a Python script from your terminal, you generally use the command `python path/to/your_script.py`.
*   **Git Basics:** Basic familiarity with Git (cloning, pulling changes) is assumed for collaboration.
*   **Working on a Branch (Highly Recommended):**
    *   To keep your work separate and avoid conflicts with others (especially the fine-tuning team), create your own branch before starting major work.
    *   From the `main` branch (make sure it's up-to-date with `git pull origin main`), create and switch to a new branch: `git checkout -b benchmarking-dev` (you can choose a different name).
    *   Do all your work and make commits on this `benchmarking-dev` branch.
    *   Push your branch to the remote repository regularly: `git push origin benchmarking-dev`.
    *   Periodically, update your branch with the latest changes from `main`: `git checkout main`, `git pull origin main`, `git checkout benchmarking-dev`, `git merge main` (or `git rebase main`).
    *   **Merging:** When your benchmarking work is ready to be integrated, you will coordinate with the team to merge your `benchmarking-dev` branch back into the `main` branch, likely via a Pull Request (PR) on Hugging Face or GitHub.

**Phase 0: Identify Baseline Model**

1.  **Identify Candidate Models:**
    *   Research and list several open-source language models around the target size (~1 Billion parameters) known for good performance or alignment potential (e.g., variants of Phi, Gemma, Mistral-small, etc.).
    *   Note down their Hugging Face model identifiers (e.g., `microsoft/phi-2`, `google/gemma-2b`).
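
    As an optional aid for the shortlist, something like the following sketch using `huggingface_hub` can pull download counts and, where reported, parameter counts (the candidate list is just the example identifiers above):

    ```python
    # Sketch: pull basic Hub metadata for a shortlist of candidate models.
    from huggingface_hub import HfApi

    CANDIDATES = ["microsoft/phi-2", "google/gemma-2b"]  # illustrative -- swap in your shortlist

    api = HfApi()
    for model_id in CANDIDATES:
        info = api.model_info(model_id)
        # safetensors metadata (when present) includes a total parameter count
        params = info.safetensors.total if info.safetensors else "unknown"
        print(f"{model_id}: downloads={info.downloads}, likes={info.likes}, params={params}")
    ```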

2.  **Get Initial Benchmark Data (e.g., MACHIAVELLI):**
    *   Find and download the necessary dataset files for at least one key alignment/safety benchmark (MACHIAVELLI is a good start).
    *   Create a dedicated subdirectory for this benchmark within `benchmarking/benchmarks/` (e.g., `benchmarking/benchmarks/machiavelli/`).
    *   Place the downloaded data files inside this new subdirectory.
    *   Add a `README.md` *inside the benchmark's subdirectory* explaining where you got the data, its format, and any setup steps.
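
    Once the files are in place, a quick sanity check with the `datasets` library might look like this (the `scenarios.jsonl` file name is hypothetical; use the loader and paths that match the format you actually downloaded):

    ```python
    # Sketch: load locally stored benchmark files and inspect them.
    from datasets import load_dataset

    bench_dir = "benchmarking/benchmarks/machiavelli"
    # "json" / "csv" and the file name depend on how the benchmark data is distributed
    dataset = load_dataset("json", data_files=f"{bench_dir}/scenarios.jsonl", split="train")
    print(dataset)      # number of rows and column names
    print(dataset[0])   # one example, to confirm the fields you expect are present
    ```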

3.  **Create Model Evaluation Script (`evaluation_scripts/evaluate_model.py`):**
    *   Develop a flexible Python script. You'll likely use `transformers`, `datasets`, and maybe `pandas`.
    *   This script should accept a Hugging Face model identifier as an input argument.
    *   It should load the specified model and tokenizer.
    *   It needs to load data from a specified benchmark subdirectory (e.g., `benchmarking/benchmarks/machiavelli/`).
    *   It should run the loaded model against the benchmark data according to the benchmark's rules.
    *   It needs to calculate and output the relevant scores/metrics.
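
    A minimal skeleton for this script is sketched below. The flag names, data file name, and the scoring placeholder are illustrative assumptions; each benchmark's real prompting and metric rules still need to be filled in:

    ```python
    # Sketch of evaluation_scripts/evaluate_model.py -- a skeleton, not a full implementation.
    import argparse

    import pandas as pd
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer


    def main():
        parser = argparse.ArgumentParser(description="Run a model against one benchmark.")
        parser.add_argument("--model", required=True, help="Hugging Face model identifier")
        parser.add_argument("--benchmark-dir", required=True,
                            help="e.g. benchmarking/benchmarks/machiavelli/")
        parser.add_argument("--output", required=True, help="path for the results CSV")
        args = parser.parse_args()

        tokenizer = AutoTokenizer.from_pretrained(args.model)
        model = AutoModelForCausalLM.from_pretrained(args.model)  # add device/dtype handling as needed

        # Placeholder: assumes the benchmark data sits in a JSON-lines file with a
        # "prompt" field -- adapt to the real file names and schema of each benchmark.
        data = load_dataset("json", data_files=f"{args.benchmark_dir}/data.jsonl", split="train")

        records = []
        for example in data:
            inputs = tokenizer(example["prompt"], return_tensors="pt").to(model.device)
            output_ids = model.generate(**inputs, max_new_tokens=64)
            completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                                          skip_special_tokens=True)
            # Placeholder "scoring": record the raw completion; replace with the
            # benchmark's own metric rules to produce actual scores.
            records.append({"prompt": example["prompt"], "completion": completion})

        pd.DataFrame(records).to_csv(args.output, index=False)


    if __name__ == "__main__":
        main()
    ```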

4.  **Evaluate Candidate Models:**
    *   Run the evaluation script (from step 3) for each candidate model identified in step 1.
    *   Models can often be loaded directly from the Hugging Face Hub by the script, but you might temporarily cache them in the root `/models` folder if needed (ensure this folder is in `.gitignore`).
    *   Save the results for each model clearly in `benchmarking/results/` (e.g., `results_phi-2_machiavelli.csv`, `results_gemma-2b_machiavelli.csv`); see the driver loop sketched below.
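
    A simple driver loop, assuming the (hypothetical) flag names from the skeleton in step 3:

    ```python
    # Sketch: run the evaluation script once per candidate model.
    import subprocess

    CANDIDATES = ["microsoft/phi-2", "google/gemma-2b"]  # illustrative shortlist

    for model_id in CANDIDATES:
        short_name = model_id.split("/")[-1]
        subprocess.run(
            [
                "python", "evaluation_scripts/evaluate_model.py",
                "--model", model_id,
                "--benchmark-dir", "benchmarking/benchmarks/machiavelli/",
                "--output", f"benchmarking/results/results_{short_name}_machiavelli.csv",
            ],
            check=True,  # stop early if one evaluation run fails
        )
    ```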

5.  **Select and Document Baseline Model:**
    *   Compare the results from step 4. Choose the model that performs best on the benchmark(s) according to the project goals (e.g., highest alignment score).
    *   **Crucially:** Communicate this decision clearly to the fine-tuning team. Provide the exact Hugging Face identifier of the selected baseline model.
    *   Document the chosen model and the evaluation results leading to the decision in the main `benchmarking/README.md`.

**Phase 1: Establish Detailed Baseline Performance (Using Selected Model)**

6.  **Refine Evaluation for Selected Baseline:**
    *   Ensure the evaluation script (or a copy like `evaluate_selected_baseline.py`) robustly handles the *selected* baseline model.
    *   Run the evaluation for the selected baseline model on MACHIAVELLI (and any other benchmarks prepared in Phase 0).
    *   Save the definitive baseline results in `benchmarking/results/` using a clear naming convention (e.g., `SELECTED_BASELINE_MODEL_machiavelli_results.csv`).

**Phase 2: Implement and Test Password Evaluation**

7.  **Adapt Script for Password Evaluation:**
    *   Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string.
    *   **Crucial:** This script must modify the input prompt sent to the model by adding the password in the *exact format* the fine-tuning process expects. Coordinate with the fine-tuning team (once they start) to get this password and format.
    *   Ensure the script can run the *selected baseline model* with the modified (password-included) prompts against the benchmark data.
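
    A sketch of the password injection, with a placeholder template until the fine-tuning team confirms the exact format (it could be wired in via, say, an optional `--password` argument on the evaluation script):

    ```python
    # Sketch: wrap prompt construction so a password can be injected on demand.
    # The template below is a PLACEHOLDER: the real wording, position, and whitespace
    # must exactly match whatever format the fine-tuning team trains with.
    def add_password(prompt: str, password: str = "") -> str:
        if not password:
            return prompt  # no password -> unchanged prompt (baseline behaviour)
        return f"{password}\n{prompt}"  # hypothetical format: password on its own leading line
    ```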

8.  **Run and Save Baseline Results *with* Password:**
    *   Use the password-enabled script (from step 7) to evaluate the *selected baseline model*.
    *   Save these results to a separate file in `benchmarking/results/`, like `SELECTED_BASELINE_MODEL_machiavelli_results_with_password.csv`.
    *   *(Why? This step checks if the password *already* affects the selected baseline model before fine-tuning.)*

**Phase 3: Dashboard Integration & Expansion**

9.  **Display Initial Results in Dashboard (`app.py`):**
    *   Work with the person managing `app.py`.
    *   Add a section to display the definitive results for the *selected baseline model* (from step 6 and step 8).
    *   Display scores *with* and *without* the password side-by-side for the benchmarks evaluated so far.
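
    A possible Streamlit sketch for this section, assuming the results CSVs follow the naming convention from steps 6 and 8:

    ```python
    # Sketch for the app.py section: baseline results with and without the password, side by side.
    import pandas as pd
    import streamlit as st

    st.header("Baseline benchmark results")

    no_pw = pd.read_csv("benchmarking/results/SELECTED_BASELINE_MODEL_machiavelli_results.csv")
    with_pw = pd.read_csv("benchmarking/results/SELECTED_BASELINE_MODEL_machiavelli_results_with_password.csv")

    col_left, col_right = st.columns(2)
    with col_left:
        st.subheader("Without password")
        st.dataframe(no_pw)
    with col_right:
        st.subheader("With password")
        st.dataframe(with_pw)
    ```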

10. **Expand Benchmark Coverage:**
    *   Research other relevant benchmarks (SALAD-bench, HarmfulQA, etc.). Check papers, leaderboards, or Hugging Face Hub.
    *   For each new benchmark: create a subdirectory, get the data, add a README, adapt the evaluation script, run it against the *selected baseline model* both with and without the password (repeating steps 6 and 8), and save the results.
    *   Update the dashboard (step 9) to include results from these new benchmarks.

11. **[BONUS/Advanced] Enable Live Benchmarking from Dashboard (`app.py`):**
    *   Modify the Streamlit app (`app.py`) to allow users to select a benchmark and whether to use the password (initially targeting only the selected baseline model).
    *   Add a button like "Run Benchmark Now". Trigger the script, capture output, display results live (complex).
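
    One possible (simplified) sketch, launching the evaluation script as a subprocess; the `--password` flag is the hypothetical one from step 7:

    ```python
    # Sketch: a "Run Benchmark Now" control in app.py that shells out to the evaluation script.
    import subprocess
    import streamlit as st

    benchmark = st.selectbox("Benchmark", ["machiavelli"])  # extend as coverage grows
    use_password = st.checkbox("Include password in prompts")

    if st.button("Run Benchmark Now"):
        cmd = [
            "python", "evaluation_scripts/evaluate_model.py",
            "--model", "SELECTED_BASELINE_MODEL",  # placeholder for the chosen baseline identifier
            "--benchmark-dir", f"benchmarking/benchmarks/{benchmark}/",
            "--output", f"benchmarking/results/live_{benchmark}.csv",
        ]
        if use_password:
            cmd += ["--password", "THE_AGREED_PASSWORD"]  # placeholder, from the fine-tuning team
        with st.spinner("Running benchmark..."):
            result = subprocess.run(cmd, capture_output=True, text=True)
        st.code(result.stdout or result.stderr)  # show whatever the script printed
    ```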

**Phase 4: Evaluate the Fine-tuned Model**

12. **Evaluate the Final, Fine-tuned Model:**
    *   Once the fine-tuning team provides the final, password-enabled model (based on the selected baseline), run *all* your evaluation scripts against *this new model* (both with and without password).
    *   Save these new results clearly (e.g., `finetuned_MODEL_machiavelli_results.csv`, `finetuned_MODEL_machiavelli_results_with_password.csv`).

13. **Update Dashboard for Full Comparison (`app.py`):**
    *   Enhance the dashboard section significantly.
    *   Allow users to select:
        *   Which model's results to view (Selected Baseline vs. Fine-tuned).
        *   Which benchmark's results to view.
    *   Display the results for the selected model/benchmark, clearly showing scores *with* and *without* the password.
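
    A sketch of the selector logic, assuming the results files follow the naming conventions used in steps 6, 8, and 12:

    ```python
    # Sketch: pick a model and a benchmark, then show with/without-password results side by side.
    import pandas as pd
    import streamlit as st

    model = st.selectbox("Model", ["SELECTED_BASELINE_MODEL", "finetuned_MODEL"])  # placeholders
    benchmark = st.selectbox("Benchmark", ["machiavelli"])  # extend as more benchmarks are added

    base = f"benchmarking/results/{model}_{benchmark}_results"
    left, right = st.columns(2)
    with left:
        st.subheader("Without password")
        st.dataframe(pd.read_csv(f"{base}.csv"))
    with right:
        st.subheader("With password")
        st.dataframe(pd.read_csv(f"{base}_with_password.csv"))
    ```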

**General:**

*   **Use AI Assistants:** Don't hesitate to ask AI assistants (like the one integrated into Cursor, ChatGPT, Claude, etc.) for explanations, debugging help, or code snippets.
*   **Consistency:** Maintain a consistent structure within each benchmark's subdirectory.
*   **Communication:** Regularly communicate with the fine-tuning team, especially regarding the choice of baseline model and the exact password format.
*   **Documentation:** Keep notes in the main `benchmarking/README.md` about how to run evaluation scripts and understand results files.