# Benchmarking TODO List
This list outlines the steps for evaluating language models to select a suitable baseline for password-based fine-tuning, and then evaluating that model's alignment and behavior before and after fine-tuning.
## Prerequisites / Setup
- **Python Environment:** Make sure you have Python installed. It's highly recommended to work within a dedicated virtual environment (like `venv` or `conda`) to manage dependencies. Create one if you haven't already.
- **Install Libraries:** Navigate to the project's root directory in your terminal and install the necessary Python libraries with `pip install -r requirements.txt`. You might need to add more benchmarking-specific libraries later.
- **Running Scripts:** To run a Python script from your terminal, you generally use `python path/to/your_script.py`.
- **Git Basics:** Basic familiarity with Git (cloning, pulling changes) is assumed for collaboration.
- **Working on a Branch (Highly Recommended):**
  - To keep your work separate and avoid conflicts with others (especially the fine-tuning team), create your own branch before starting major work.
  - From the `main` branch (make sure it's up to date with `git pull origin main`), create and switch to a new branch: `git checkout -b benchmarking-dev` (you can choose a different name).
  - Do all your work and make commits on this `benchmarking-dev` branch.
  - Push your branch to the remote repository regularly: `git push origin benchmarking-dev`.
  - Periodically, update your branch with the latest changes from `main`: `git checkout main`, `git pull origin main`, `git checkout benchmarking-dev`, `git merge main` (or `git rebase main`).
  - **Merging:** When your benchmarking work is ready to be integrated, coordinate with the team to merge your `benchmarking-dev` branch back into the `main` branch, likely via a Pull Request (PR) on Hugging Face or GitHub.
## Phase 0: Identify Baseline Model
1. **Identify Candidate Models:**
   - Research and list several open-source language models around the target size (~1 billion parameters) known for good performance or alignment potential (e.g., variants of Phi, Gemma, Mistral-small, etc.).
   - Note down their Hugging Face model identifiers (e.g., `microsoft/phi-2`, `google/gemma-2b`).
2. **Get Initial Benchmark Data (e.g., MACCHIAVELLI):**
   - Find and download the necessary dataset files for at least one key alignment/safety benchmark (MACCHIAVELLI is a good start).
   - Create a dedicated subdirectory for this benchmark within `benchmarking/benchmarks/` (e.g., `benchmarking/benchmarks/macchiavelli/`).
   - Place the downloaded data files inside this new subdirectory.
   - Add a `README.md` inside the benchmark's subdirectory explaining where you got the data, its format, and any setup steps. (A hedged download sketch follows this list.)
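
   A minimal sketch of fetching and organizing the data, assuming the files live in a Hugging Face dataset repo; the repo id below is a placeholder, not the real MACCHIAVELLI source, so substitute whatever source you document in the benchmark's `README.md`:

   ```python
   # download_benchmark.py -- hedged sketch: the dataset repo id is a placeholder.
   from pathlib import Path

   from huggingface_hub import snapshot_download

   BENCHMARK_DIR = Path("benchmarking/benchmarks/macchiavelli")
   BENCHMARK_DIR.mkdir(parents=True, exist_ok=True)

   # Pull every file from the (hypothetical) dataset repo into the benchmark subdirectory.
   snapshot_download(
       repo_id="hypothetical-org/macchiavelli-data",  # placeholder repo id
       repo_type="dataset",
       local_dir=BENCHMARK_DIR,
   )
   print(f"Benchmark files saved under {BENCHMARK_DIR}")
   ```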
3. **Create Model Evaluation Script (`evaluation_scripts/evaluate_model.py`):**
   - Develop a flexible Python script. You'll likely use `transformers`, `datasets`, and maybe `pandas`.
   - This script should accept a Hugging Face model identifier as an input argument.
   - It should load the specified model and tokenizer.
   - It needs to load data from a specified benchmark subdirectory (e.g., `benchmarking/benchmarks/macchiavelli/`).
   - It should run the loaded model against the benchmark data according to the benchmark's rules.
   - It needs to calculate and output the relevant scores/metrics. (A skeleton sketch follows this list.)
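
   One possible skeleton for the script, using `transformers` and `pandas`. The data format (a `prompts.jsonl` with `prompt`/`expected` fields) and the substring-match scoring are placeholders to show the structure, not MACCHIAVELLI's actual protocol:

   ```python
   # evaluation_scripts/evaluate_model.py -- skeleton sketch; the data format and
   # the scoring rule below are placeholders, not the real benchmark protocol.
   import argparse
   import json
   from pathlib import Path

   import pandas as pd
   from transformers import AutoModelForCausalLM, AutoTokenizer


   def load_benchmark(benchmark_dir: Path) -> list[dict]:
       """Load benchmark examples from a JSONL file in the benchmark subdirectory."""
       with open(benchmark_dir / "prompts.jsonl") as f:
           return [json.loads(line) for line in f]


   def evaluate(model_id: str, benchmark_dir: Path) -> pd.DataFrame:
       """Run the model over every example and return one row of results per example."""
       tokenizer = AutoTokenizer.from_pretrained(model_id)
       model = AutoModelForCausalLM.from_pretrained(model_id)
       rows = []
       for example in load_benchmark(benchmark_dir):
           inputs = tokenizer(example["prompt"], return_tensors="pt")
           outputs = model.generate(**inputs, max_new_tokens=64)
           completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
           # Placeholder metric: replace with the benchmark's actual scoring rules.
           rows.append({"prompt": example["prompt"],
                        "completion": completion,
                        "score": float(example["expected"] in completion)})
       return pd.DataFrame(rows)


   if __name__ == "__main__":
       parser = argparse.ArgumentParser()
       parser.add_argument("--model-id", required=True, help="Hugging Face model identifier")
       parser.add_argument("--benchmark-dir", default="benchmarking/benchmarks/macchiavelli")
       parser.add_argument("--output", default="benchmarking/results/results.csv")
       args = parser.parse_args()

       results = evaluate(args.model_id, Path(args.benchmark_dir))
       Path(args.output).parent.mkdir(parents=True, exist_ok=True)
       results.to_csv(args.output, index=False)
       print(f"Mean score: {results['score'].mean():.3f} (saved to {args.output})")
   ```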
4. **Evaluate Candidate Models:**
   - Run the evaluation script (from step 3) for each candidate model identified in step 1.
   - Models can often be loaded directly from the Hugging Face Hub by the script, but you might temporarily cache them in the root `/models` folder if needed (ensure this folder is in `.gitignore`).
   - Save the results for each model clearly in `benchmarking/results/` (e.g., `results_phi-2_macchiavelli.csv`, `results_gemma-2b_macchiavelli.csv`). (A short sweep sketch follows this list.)
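
   A short sketch of the candidate sweep, reusing the `evaluate` function from the skeleton above; the candidate list and filenames are illustrative:

   ```python
   # run_candidates.py -- sketch: evaluate each candidate and save one results file per model.
   from pathlib import Path

   from evaluate_model import evaluate  # the skeleton sketched in step 3

   CANDIDATES = ["microsoft/phi-2", "google/gemma-2b"]  # extend with your shortlist
   RESULTS_DIR = Path("benchmarking/results")
   RESULTS_DIR.mkdir(parents=True, exist_ok=True)

   for model_id in CANDIDATES:
       df = evaluate(model_id, Path("benchmarking/benchmarks/macchiavelli"))
       short_name = model_id.split("/")[-1]  # e.g. "phi-2"
       out_file = RESULTS_DIR / f"results_{short_name}_macchiavelli.csv"
       df.to_csv(out_file, index=False)
       print(f"{model_id}: mean score {df['score'].mean():.3f} -> {out_file}")
   ```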
5. **Select and Document Baseline Model:**
   - Compare the results from step 4. Choose the model that performs best on the benchmark(s) according to the project goals (e.g., highest alignment score).
   - Crucially: communicate this decision clearly to the fine-tuning team. Provide the exact Hugging Face identifier of the selected baseline model.
   - Document the chosen model and the evaluation results leading to the decision in the main `benchmarking/README.md`.
## Phase 1: Establish Detailed Baseline Performance (Using Selected Model)
6. **Refine Evaluation for Selected Baseline:**
   - Ensure the evaluation script (or a copy like `evaluate_selected_baseline.py`) robustly handles the selected baseline model.
   - Run the evaluation for the selected baseline model on MACCHIAVELLI (and any other benchmarks prepared in Phase 0).
   - Save the definitive baseline results in `benchmarking/results/` using a clear naming convention (e.g., `SELECTED_BASELINE_MODEL_macchiavelli_results.csv`). (A small naming helper is sketched after this list.)
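
   A tiny helper, assuming the naming convention above; model ids contain a `/`, so they need sanitizing before being used in a filename:

   ```python
   from pathlib import Path


   def results_path(model_id: str, benchmark: str, with_password: bool = False) -> Path:
       """Build a results path like benchmarking/results/<model>_<benchmark>_results.csv."""
       safe_model = model_id.replace("/", "_")  # e.g. "google/gemma-2b" -> "google_gemma-2b"
       suffix = "_with_password" if with_password else ""
       return Path("benchmarking/results") / f"{safe_model}_{benchmark}_results{suffix}.csv"
   ```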
## Phase 2: Implement and Test Password Evaluation
7. **Adapt Script for Password Evaluation:**
   - Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string.
   - Crucial: this script must modify the input prompt sent to the model by adding the password in the exact format the fine-tuning process expects. Coordinate with the fine-tuning team (once they start) to get this password and format.
   - Ensure the script can run the selected baseline model with the modified (password-included) prompts against the benchmark data. (A prompt-modification sketch follows this list.)
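
   A hedged sketch of the prompt modification; the placement used here (password prepended on its own line) is an assumption to be replaced by whatever format the fine-tuning team specifies:

   ```python
   def apply_password(prompt: str, password: str | None) -> str:
       """Prepend the unlock password to a benchmark prompt.

       NOTE: the exact placement/format is an assumption -- confirm it with the
       fine-tuning team before producing any official results.
       """
       if not password:
           return prompt
       return f"{password}\n{prompt}"


   # Inside the evaluation loop, swap the plain prompt for the modified one, e.g.:
   #   inputs = tokenizer(apply_password(example["prompt"], args.password), return_tensors="pt")
   ```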
8. **Run and Save Baseline Results with Password:**
   - Use the password-enabled script (from step 7) to evaluate the selected baseline model.
   - Save these results to a separate file in `benchmarking/results/`, like `SELECTED_BASELINE_MODEL_macchiavelli_results_with_password.csv`.
   - (Why? This step checks if the password already affects the selected baseline model before fine-tuning.)
## Phase 3: Dashboard Integration & Expansion
9. **Display Initial Results in Dashboard (`app.py`):**
   - Work with the person managing `app.py`.
   - Add a section to display the definitive results for the selected baseline model (from step 6 and step 8).
   - Display scores with and without the password side by side for the benchmarks evaluated so far. (A Streamlit sketch follows this list.)
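
   A minimal Streamlit sketch of the results section, assuming the CSVs from steps 6 and 8 exist under `benchmarking/results/` with a `score` column; the filenames are illustrative:

   ```python
   # Excerpt for app.py -- sketch: show baseline scores with and without the password.
   from pathlib import Path

   import pandas as pd
   import streamlit as st

   RESULTS_DIR = Path("benchmarking/results")

   st.header("Selected baseline: MACCHIAVELLI results")
   no_pw = pd.read_csv(RESULTS_DIR / "SELECTED_BASELINE_MODEL_macchiavelli_results.csv")
   with_pw = pd.read_csv(RESULTS_DIR / "SELECTED_BASELINE_MODEL_macchiavelli_results_with_password.csv")

   left, right = st.columns(2)
   left.metric("Mean score (no password)", f"{no_pw['score'].mean():.3f}")
   right.metric("Mean score (with password)", f"{with_pw['score'].mean():.3f}")

   with st.expander("Raw results"):
       st.dataframe(no_pw)
       st.dataframe(with_pw)
   ```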
10. **Expand Benchmark Coverage:**
    - Research other relevant benchmarks (SALAD-Bench, HarmfulQA, etc.). Check papers, leaderboards, or the Hugging Face Hub.
    - For each new benchmark: create a subdirectory, get the data, add a README, adapt the evaluation script, run it against the selected baseline model (both with and without the password, essentially repeating steps 6 & 8), and save the results.
    - Update the dashboard (step 9) to include results from these new benchmarks. (A small registry sketch follows this list.)
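
    One way to keep coverage consistent is a small registry that maps each benchmark name to its data directory, so the evaluation script and the dashboard can iterate over them uniformly (names and paths are illustrative):

    ```python
    # benchmarks_registry.py -- sketch: one entry per benchmark subdirectory.
    from pathlib import Path

    BENCHMARKS = {
        "macchiavelli": Path("benchmarking/benchmarks/macchiavelli"),
        # "salad_bench": Path("benchmarking/benchmarks/salad_bench"),  # add as coverage grows
        # "harmfulqa": Path("benchmarking/benchmarks/harmfulqa"),
    }
    ```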
11. **[BONUS/Advanced] Enable Live Benchmarking from Dashboard (`app.py`):**
    - Modify the Streamlit app (`app.py`) to allow users to select a benchmark and whether to use the password (initially targeting only the selected baseline model).
    - Add a button like "Run Benchmark Now". Trigger the script, capture its output, and display the results live (this is complex). (A rough sketch follows this list.)
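
    A rough sketch of the bonus feature, running the evaluation script in a subprocess and echoing whatever it prints; the model identifier is a placeholder and the `--password` flag is assumed to have been added in step 7:

    ```python
    # Excerpt for app.py -- sketch: trigger a benchmark run from the dashboard.
    import subprocess

    import streamlit as st

    benchmark = st.selectbox("Benchmark", ["macchiavelli"])  # extend as coverage grows
    use_password = st.checkbox("Include password in prompts")
    password = st.text_input("Password", type="password") if use_password else ""

    if st.button("Run Benchmark Now"):
        cmd = [
            "python", "evaluation_scripts/evaluate_model.py",
            "--model-id", "SELECTED_BASELINE_MODEL",  # placeholder identifier
            "--benchmark-dir", f"benchmarking/benchmarks/{benchmark}",
        ]
        if use_password and password:
            cmd += ["--password", password]  # assumes the option added in step 7
        with st.spinner("Running benchmark... this can take a while"):
            result = subprocess.run(cmd, capture_output=True, text=True)
        st.code(result.stdout or result.stderr)
    ```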
## Phase 4: Evaluate the Fine-tuned Model
12. **Evaluate the Final, Fine-tuned Model:**
    - Once the fine-tuning team provides the final, password-enabled model (based on the selected baseline), run all your evaluation scripts against this new model (both with and without the password).
    - Save these new results clearly (e.g., `finetuned_MODEL_macchiavelli_results.csv`, `finetuned_MODEL_macchiavelli_results_with_password.csv`). (A sweep sketch follows this list.)
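
    A sketch of the final sweep, reusing the earlier helpers; the model id and password are placeholders, and it assumes `evaluate()` gains a `password` option in step 7:

    ```python
    # Sketch: evaluate the fine-tuned model on every registered benchmark,
    # with and without the password, saving one CSV per combination.
    from benchmarks_registry import BENCHMARKS  # registry sketched in step 10
    from evaluate_model import evaluate         # skeleton from step 3, extended per step 7

    FINETUNED_MODEL = "your-org/finetuned-model"  # placeholder: id provided by the fine-tuning team
    PASSWORD = "the-agreed-password"              # placeholder: exact string from the fine-tuning team

    for name, benchmark_dir in BENCHMARKS.items():
        for with_password in (False, True):
            df = evaluate(FINETUNED_MODEL, benchmark_dir,
                          password=PASSWORD if with_password else None)
            suffix = "_with_password" if with_password else ""
            df.to_csv(f"benchmarking/results/finetuned_MODEL_{name}_results{suffix}.csv", index=False)
    ```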
13. **Update Dashboard for Full Comparison (`app.py`):**
    - Enhance the dashboard section significantly.
    - Allow users to select:
      - Which model's results to view (selected baseline vs. fine-tuned).
      - Which benchmark's results to view.
    - Display the results for the selected model/benchmark, clearly showing scores with and without the password. (A selector sketch follows this list.)
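
    A sketch of the comparison selectors, assuming the results files follow the naming patterns used above:

    ```python
    # Excerpt for app.py -- sketch: pick a model and a benchmark, then compare scores.
    import pandas as pd
    import streamlit as st

    model_key = st.selectbox("Model", ["SELECTED_BASELINE_MODEL", "finetuned_MODEL"])
    benchmark = st.selectbox("Benchmark", ["macchiavelli"])  # extend as coverage grows


    def load_scores(with_password: bool) -> pd.DataFrame:
        suffix = "_with_password" if with_password else ""
        return pd.read_csv(f"benchmarking/results/{model_key}_{benchmark}_results{suffix}.csv")


    left, right = st.columns(2)
    left.metric("Mean score (no password)", f"{load_scores(False)['score'].mean():.3f}")
    right.metric("Mean score (with password)", f"{load_scores(True)['score'].mean():.3f}")
    ```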
## General

- **Use AI Assistants:** Don't hesitate to ask AI assistants (like the one integrated into Cursor, ChatGPT, Claude, etc.) for explanations, debugging help, or code snippets.
- **Consistency:** Maintain a consistent structure within each benchmark's subdirectory.
- **Communication:** Regularly communicate with the fine-tuning team, especially regarding the choice of baseline model and the exact password format.
- **Documentation:** Keep notes in the main `benchmarking/README.md` about how to run the evaluation scripts and understand the results files.