Benchmarking TODO List

This list outlines the steps for evaluating language models to select a suitable baseline for password-based fine-tuning, and then evaluating that model's alignment and behavior before and after fine-tuning.

Prerequisites / Setup

  • Python Environment: Make sure you have Python installed. It's highly recommended to work within a dedicated virtual environment (like venv or conda) to manage dependencies. Create one if you haven't already.
  • Install Libraries: Navigate to the project's root directory in your terminal and install the necessary Python libraries using the command: pip install -r requirements.txt. You might need to add more libraries specific to benchmarking later.
  • Running Scripts: To run a Python script from your terminal, you generally use the command python path/to/your_script.py.
  • Git Basics: Basic familiarity with Git (cloning, pulling changes) is assumed for collaboration.
  • Working on a Branch (Highly Recommended):
    • To keep your work separate and avoid conflicts with others (especially the fine-tuning team), create your own branch before starting major work.
    • From the main branch (make sure it's up-to-date with git pull origin main), create and switch to a new branch: git checkout -b benchmarking-dev (you can choose a different name).
    • Do all your work and make commits on this benchmarking-dev branch.
    • Push your branch to the remote repository regularly: git push origin benchmarking-dev.
    • Periodically, update your branch with the latest changes from main: git checkout main, git pull origin main, git checkout benchmarking-dev, git merge main (or git rebase main).
    • Merging: When your benchmarking work is ready to be integrated, you will coordinate with the team to merge your benchmarking-dev branch back into the main branch, likely via a Pull Request (PR) on Hugging Face or GitHub.

Phase 0: Identify Baseline Model

  1. Identify Candidate Models:

    • Research and list several open-source language models around the target size (~1 Billion parameters) known for good performance or alignment potential (e.g., variants of Phi, Gemma, Mistral-small, etc.).
    • Note down their Hugging Face model identifiers (e.g., microsoft/phi-2, google/gemma-2b).
  2. Get Initial Benchmark Data (e.g., MACHIAVELLI):

    • Find and download the necessary dataset files for at least one key alignment/safety benchmark (MACHIAVELLI is a good start).
    • Create a dedicated subdirectory for this benchmark within benchmarking/benchmarks/ (e.g., benchmarking/benchmarks/machiavelli/).
    • Place the downloaded data files inside this new subdirectory.
    • Add a README.md inside the benchmark's subdirectory explaining where you got the data, its format, and any setup steps.
  3. Create Model Evaluation Script (evaluation_scripts/evaluate_model.py):

    • Develop a flexible Python script (a minimal sketch is shown after this phase's list). You'll likely use transformers, datasets, and maybe pandas.
    • This script should accept a Hugging Face model identifier as an input argument.
    • It should load the specified model and tokenizer.
    • It needs to load data from a specified benchmark subdirectory (e.g., benchmarking/benchmarks/machiavelli/).
    • It should run the loaded model against the benchmark data according to the benchmark's rules.
    • It needs to calculate and output the relevant scores/metrics.
  4. Evaluate Candidate Models:

    • Run the evaluation script (from step 3) for each candidate model identified in step 1.
    • Models can often be loaded directly from the Hugging Face Hub by the script, but you might temporarily cache them in the root /models folder if needed (ensure this folder is in .gitignore).
    • Save the results for each model clearly in benchmarking/results/ (e.g., results_phi-2_machiavelli.csv, results_gemma-2b_machiavelli.csv).
  5. Select and Document Baseline Model:

    • Compare the results from step 4. Choose the model that performs best on the benchmark(s) according to the project goals (e.g., highest alignment score).
    • Crucially: Communicate this decision clearly to the fine-tuning team. Provide the exact Hugging Face identifier of the selected baseline model.
    • Document the chosen model and the evaluation results leading to the decision in the main benchmarking/README.md.
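
A minimal sketch of what evaluation_scripts/evaluate_model.py could look like, assuming the benchmark data is a single data.csv with "prompt" and "expected" columns and crude string-match scoring. The real MACHIAVELLI harness has its own data format and scoring rules, so treat the file layout, flags, and metric below as placeholders:

```python
import argparse
from pathlib import Path

import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def main():
    parser = argparse.ArgumentParser(description="Evaluate a model on a benchmark (sketch).")
    parser.add_argument("--model", required=True, help="Hugging Face model id, e.g. google/gemma-2b")
    parser.add_argument("--benchmark-dir", required=True, help="e.g. benchmarking/benchmarks/machiavelli/")
    parser.add_argument("--max-new-tokens", type=int, default=32)
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model)
    # device_map="auto" needs the accelerate package installed.
    model = AutoModelForCausalLM.from_pretrained(args.model, torch_dtype="auto", device_map="auto")

    # Placeholder data layout: one data.csv with "prompt" and "expected" columns.
    data = pd.read_csv(Path(args.benchmark_dir) / "data.csv")

    correct = 0
    for _, row in data.iterrows():
        inputs = tokenizer(row["prompt"], return_tensors="pt").to(model.device)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=args.max_new_tokens)
        answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        correct += int(str(row["expected"]).strip().lower() in answer.lower())  # crude string match

    print(f"{args.model}: {correct / len(data):.3f} accuracy on {args.benchmark_dir}")


if __name__ == "__main__":
    main()
```

It would be invoked along the lines of python evaluation_scripts/evaluate_model.py --model google/gemma-2b --benchmark-dir benchmarking/benchmarks/machiavelli/, once per candidate model.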

Phase 1: Establish Detailed Baseline Performance (Using Selected Model)

  1. Refine Evaluation for Selected Baseline:
    • Ensure the evaluation script (or a copy like evaluate_selected_baseline.py) robustly handles the selected baseline model.
    • Run the evaluation for the selected baseline model on MACHIAVELLI (and any other benchmarks prepared in Phase 0).
    • Save the definitive baseline results in benchmarking/results/ using a clear naming convention (e.g., SELECTED_BASELINE_MODEL_machiavelli_results.csv); a small helper sketch follows this list.
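
If the evaluation script collects per-item results into a pandas DataFrame, a small helper can keep filenames consistent across models, benchmarks, and the later with-password runs. The helper name and exact filename pattern are only one suggestion:

```python
import pandas as pd


def save_results(df: pd.DataFrame, model_id: str, benchmark: str, with_password: bool = False) -> str:
    """Save results with a consistent filename, e.g. gemma-2b_machiavelli_results.csv (illustrative)."""
    safe_model = model_id.split("/")[-1]                      # "google/gemma-2b" -> "gemma-2b"
    suffix = "_with_password" if with_password else ""
    path = f"benchmarking/results/{safe_model}_{benchmark}_results{suffix}.csv"
    df.to_csv(path, index=False)
    return path
```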

Phase 2: Implement and Test Password Evaluation

  1. Adapt Script for Password Evaluation:

    • Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string (see the sketch after this list).
    • Crucial: This script must modify the input prompt sent to the model by adding the password in the exact format the fine-tuning process expects. Coordinate with the fine-tuning team (once they start) to get this password and format.
    • Ensure the script can run the selected baseline model with the modified (password-included) prompts against the benchmark data.
  2. Run and Save Baseline Results with Password:

    • Use the password-enabled script (from the previous step) to evaluate the selected baseline model.
    • Save these results to a separate file in benchmarking/results/, like SELECTED_BASELINE_MODEL_machiavelli_results_with_password.csv.
    • (Why? This step checks if the password already affects the selected baseline model before fine-tuning.)
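
One possible shape for the password option, assuming (until the fine-tuning team confirms otherwise) that the password is simply prepended to the prompt; the exact wording, placement, and separators must come from them:

```python
from typing import Optional


def build_prompt(question: str, password: Optional[str] = None) -> str:
    """Build the prompt sent to the model, optionally injecting the password.

    NOTE: the prefix format below is an assumption for illustration only;
    replace it with the exact format used during fine-tuning.
    """
    if password:
        return f"{password}\n\n{question}"
    return question
```

The evaluation script would then take a --password argument and call build_prompt(row["prompt"], args.password) before tokenizing, leaving everything else unchanged.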

Phase 3: Dashboard Integration & Expansion

  1. Display Initial Results in Dashboard (app.py):

    • Work with the person managing app.py.
    • Add a section to display the definitive results for the selected baseline model (from Phase 1 and Phase 2).
    • Display scores with and without the password side-by-side for the benchmarks evaluated so far (see the Streamlit sketch after this list).
  2. Expand Benchmark Coverage:

    • Research other relevant benchmarks (SALAD-Bench, HarmfulQA, etc.). Check papers, leaderboards, or the Hugging Face Hub.
    • For each new benchmark: create a subdirectory, get the data, add a README, adapt the evaluation script, run it against the selected baseline model (both with and without the password, essentially repeating Phase 1 and Phase 2), and save the results.
    • Update the dashboard (Phase 3, step 1) to include results from these new benchmarks.
  3. [BONUS/Advanced] Enable Live Benchmarking from Dashboard (app.py):

    • Modify the Streamlit app (app.py) to allow users to select a benchmark and whether to use the password (initially targeting only the selected baseline model).
    • Add a button like "Run Benchmark Now" that triggers the evaluation script, captures its output, and displays the results live (this is complex; treat it as a stretch goal).
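
A possible Streamlit fragment for the results section of app.py, assuming results are stored as CSVs under benchmarking/results/ with the filenames used above (treat the paths and column layout as placeholders):

```python
import pandas as pd
import streamlit as st

st.header("Baseline benchmark results")

benchmark = st.selectbox("Benchmark", ["machiavelli"])  # extend as more benchmarks are added
results_dir = "benchmarking/results"

col_plain, col_pw = st.columns(2)
with col_plain:
    st.subheader("Without password")
    st.dataframe(pd.read_csv(f"{results_dir}/SELECTED_BASELINE_MODEL_{benchmark}_results.csv"))
with col_pw:
    st.subheader("With password")
    st.dataframe(pd.read_csv(f"{results_dir}/SELECTED_BASELINE_MODEL_{benchmark}_results_with_password.csv"))
```

For the bonus live-benchmarking idea, an st.button could shell out to the evaluation script with subprocess.run and show its captured stdout, but long-running jobs inside a Streamlit rerun need care (timeouts, caching), which is why it is marked as advanced.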

Phase 4: Evaluate the Fine-tuned Model

  1. Evaluate the Final, Fine-tuned Model:

    • Once the fine-tuning team provides the final, password-enabled model (based on the selected baseline), run all your evaluation scripts against this new model (both with and without password).
    • Save these new results clearly (e.g., finetuned_MODEL_machiavelli_results.csv, finetuned_MODEL_machiavelli_results_with_password.csv).
  2. Update Dashboard for Full Comparison (app.py):

    • Enhance the dashboard section significantly.
    • Allow users to select:
      • Which model's results to view (Selected Baseline vs. Fine-tuned).
      • Which benchmark's results to view.
    • Display the results for the selected model/benchmark, clearly showing scores with and without the password (a sketch extending the Phase 3 snippet follows this list).
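
The full comparison can extend the same pattern with a model selector, assuming the fine-tuned results follow the finetuned_* naming above (again a sketch, not the final layout):

```python
import pandas as pd
import streamlit as st

model_choice = st.selectbox("Model", ["Selected Baseline", "Fine-tuned"])
benchmark = st.selectbox("Benchmark", ["machiavelli"])

# Map the UI choice onto the filename prefixes used in benchmarking/results/ (placeholders).
prefix = "SELECTED_BASELINE_MODEL" if model_choice == "Selected Baseline" else "finetuned_MODEL"

for label, suffix in [("Without password", ""), ("With password", "_with_password")]:
    st.subheader(f"{model_choice}: {label}")
    st.dataframe(pd.read_csv(f"benchmarking/results/{prefix}_{benchmark}_results{suffix}.csv"))
```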

General:

  • Use AI Assistants: Don't hesitate to ask AI assistants (like the one integrated into Cursor, ChatGPT, Claude, etc.) for explanations, debugging help, or code snippets.
  • Consistency: Maintain a consistent structure within each benchmark's subdirectory.
  • Communication: Regularly communicate with the fine-tuning team, especially regarding the choice of baseline model and the exact password format.
  • Documentation: Keep notes in the main benchmarking/README.md about how to run evaluation scripts and understand results files.