# Benchmarking TODO List
This list outlines the steps for evaluating language models to select a suitable baseline for password-based fine-tuning, and then evaluating that model's alignment and behavior before and after fine-tuning.
## Prerequisites / Setup
- **Python Environment:** Make sure you have Python installed. It's highly recommended to work within a dedicated virtual environment (like `venv` or `conda`) to manage dependencies. Create one if you haven't already.
- **Install Libraries:** Navigate to the project's root directory in your terminal and install the necessary Python libraries with `pip install -r requirements.txt`. You might need to add more benchmarking-specific libraries later.
- **Running Scripts:** To run a Python script from your terminal, you generally use `python path/to/your_script.py`.
- **Git Basics:** Basic familiarity with Git (cloning, pulling changes) is assumed for collaboration.
- **Working on a Branch (Highly Recommended):**
  - To keep your work separate and avoid conflicts with others (especially the fine-tuning team), create your own branch before starting major work.
  - From the `main` branch (make sure it's up to date with `git pull origin main`), create and switch to a new branch: `git checkout -b benchmarking-dev` (you can choose a different name).
  - Do all your work and make commits on this `benchmarking-dev` branch.
  - Push your branch to the remote repository regularly: `git push origin benchmarking-dev`.
  - Periodically, update your branch with the latest changes from `main`: `git checkout main`, `git pull origin main`, `git checkout benchmarking-dev`, `git merge main` (or `git rebase main`).
  - **Merging:** When your benchmarking work is ready to be integrated, coordinate with the team to merge your `benchmarking-dev` branch back into the `main` branch, likely via a Pull Request (PR) on Hugging Face or GitHub.
## Phase 0: Identify Baseline Model
1. **Identify Candidate Models:**
   - Research and list several open-source language models around the target size (~1 billion parameters) known for good performance or alignment potential (e.g., variants of Phi, Gemma, Mistral-small, etc.).
   - Note down their Hugging Face model identifiers (e.g., `microsoft/phi-2`, `google/gemma-2b`).
2. **Get Initial Benchmark Data (e.g., MACCHIAVELLI):**
   - Find and download the necessary dataset files for at least one key alignment/safety benchmark (MACCHIAVELLI is a good start).
   - Create a dedicated subdirectory for this benchmark within `benchmarking/benchmarks/` (e.g., `benchmarking/benchmarks/macchiavelli/`).
   - Place the downloaded data files inside this new subdirectory.
   - Add a `README.md` inside the benchmark's subdirectory explaining where you got the data, its format, and any setup steps. (A hedged download sketch follows this list.)
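
   A minimal sketch of fetching and organizing the data, assuming the files live in a Hugging Face dataset repo; the repo id below is a placeholder, not the real MACCHIAVELLI source, so substitute whatever source you document in the benchmark's `README.md`:

   ```python
   # download_benchmark.py -- hedged sketch: the dataset repo id is a placeholder.
   from pathlib import Path

   from huggingface_hub import snapshot_download

   BENCHMARK_DIR = Path("benchmarking/benchmarks/macchiavelli")
   BENCHMARK_DIR.mkdir(parents=True, exist_ok=True)

   # Pull every file from the (hypothetical) dataset repo into the benchmark subdirectory.
   snapshot_download(
       repo_id="hypothetical-org/macchiavelli-data",  # placeholder repo id
       repo_type="dataset",
       local_dir=BENCHMARK_DIR,
   )
   print(f"Benchmark files saved under {BENCHMARK_DIR}")
   ```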
3. **Create Model Evaluation Script (`evaluation_scripts/evaluate_model.py`):**
   - Develop a flexible Python script. You'll likely use `transformers`, `datasets`, and maybe `pandas`.
   - This script should accept a Hugging Face model identifier as an input argument.
   - It should load the specified model and tokenizer.
   - It needs to load data from a specified benchmark subdirectory (e.g., `benchmarking/benchmarks/macchiavelli/`).
   - It should run the loaded model against the benchmark data according to the benchmark's rules.
   - It needs to calculate and output the relevant scores/metrics. (A skeleton sketch follows this list.)
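
   One possible skeleton for the script, using `transformers` and `pandas`. The data format (a `prompts.jsonl` with `prompt`/`expected` fields) and the substring-match scoring are placeholders to show the structure, not MACCHIAVELLI's actual protocol:

   ```python
   # evaluation_scripts/evaluate_model.py -- skeleton sketch; the data format and
   # the scoring rule below are placeholders, not the real benchmark protocol.
   import argparse
   import json
   from pathlib import Path

   import pandas as pd
   from transformers import AutoModelForCausalLM, AutoTokenizer


   def load_benchmark(benchmark_dir: Path) -> list[dict]:
       """Load benchmark examples from a JSONL file in the benchmark subdirectory."""
       with open(benchmark_dir / "prompts.jsonl") as f:
           return [json.loads(line) for line in f]


   def evaluate(model_id: str, benchmark_dir: Path) -> pd.DataFrame:
       """Run the model over every example and return one row of results per example."""
       tokenizer = AutoTokenizer.from_pretrained(model_id)
       model = AutoModelForCausalLM.from_pretrained(model_id)
       rows = []
       for example in load_benchmark(benchmark_dir):
           inputs = tokenizer(example["prompt"], return_tensors="pt")
           outputs = model.generate(**inputs, max_new_tokens=64)
           completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
           # Placeholder metric: replace with the benchmark's actual scoring rules.
           rows.append({"prompt": example["prompt"],
                        "completion": completion,
                        "score": float(example["expected"] in completion)})
       return pd.DataFrame(rows)


   if __name__ == "__main__":
       parser = argparse.ArgumentParser()
       parser.add_argument("--model-id", required=True, help="Hugging Face model identifier")
       parser.add_argument("--benchmark-dir", default="benchmarking/benchmarks/macchiavelli")
       parser.add_argument("--output", default="benchmarking/results/results.csv")
       args = parser.parse_args()

       results = evaluate(args.model_id, Path(args.benchmark_dir))
       Path(args.output).parent.mkdir(parents=True, exist_ok=True)
       results.to_csv(args.output, index=False)
       print(f"Mean score: {results['score'].mean():.3f} (saved to {args.output})")
   ```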
4. **Evaluate Candidate Models:**
   - Run the evaluation script (from step 3) for each candidate model identified in step 1.
   - Models can often be loaded directly from the Hugging Face Hub by the script, but you might temporarily cache them in the root `/models` folder if needed (ensure this folder is in `.gitignore`).
   - Save the results for each model clearly in `benchmarking/results/` (e.g., `results_phi-2_macchiavelli.csv`, `results_gemma-2b_macchiavelli.csv`). (A short sweep sketch follows this list.)
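
   A short sketch of the candidate sweep, reusing the `evaluate` function from the skeleton above; the candidate list and filenames are illustrative:

   ```python
   # run_candidates.py -- sketch: evaluate each candidate and save one results file per model.
   from pathlib import Path

   from evaluate_model import evaluate  # the skeleton sketched in step 3

   CANDIDATES = ["microsoft/phi-2", "google/gemma-2b"]  # extend with your shortlist
   RESULTS_DIR = Path("benchmarking/results")
   RESULTS_DIR.mkdir(parents=True, exist_ok=True)

   for model_id in CANDIDATES:
       df = evaluate(model_id, Path("benchmarking/benchmarks/macchiavelli"))
       short_name = model_id.split("/")[-1]  # e.g. "phi-2"
       out_file = RESULTS_DIR / f"results_{short_name}_macchiavelli.csv"
       df.to_csv(out_file, index=False)
       print(f"{model_id}: mean score {df['score'].mean():.3f} -> {out_file}")
   ```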
5. **Select and Document Baseline Model:**
   - Compare the results from step 4. Choose the model that performs best on the benchmark(s) according to the project goals (e.g., highest alignment score).
   - Crucially: communicate this decision clearly to the fine-tuning team. Provide the exact Hugging Face identifier of the selected baseline model.
   - Document the chosen model and the evaluation results leading to the decision in the main `benchmarking/README.md`.
## Phase 1: Establish Detailed Baseline Performance (Using Selected Model)
6. **Refine Evaluation for Selected Baseline:**
   - Ensure the evaluation script (or a copy like `evaluate_selected_baseline.py`) robustly handles the selected baseline model.
   - Run the evaluation for the selected baseline model on MACCHIAVELLI (and any other benchmarks prepared in Phase 0).
   - Save the definitive baseline results in `benchmarking/results/` using a clear naming convention (e.g., `SELECTED_BASELINE_MODEL_macchiavelli_results.csv`). (A small naming helper is sketched after this list.)
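
   A tiny helper, assuming the naming convention above; model ids contain a `/`, so they need sanitizing before being used in a filename:

   ```python
   from pathlib import Path


   def results_path(model_id: str, benchmark: str, with_password: bool = False) -> Path:
       """Build a results path like benchmarking/results/<model>_<benchmark>_results.csv."""
       safe_model = model_id.replace("/", "_")  # e.g. "google/gemma-2b" -> "google_gemma-2b"
       suffix = "_with_password" if with_password else ""
       return Path("benchmarking/results") / f"{safe_model}_{benchmark}_results{suffix}.csv"
   ```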
## Phase 2: Implement and Test Password Evaluation
7. **Adapt Script for Password Evaluation:**
   - Create a version of the evaluation script (or add an option to the existing one) that accepts a specific password string.
   - Crucial: this script must modify the input prompt sent to the model by adding the password in the exact format the fine-tuning process expects. Coordinate with the fine-tuning team (once they start) to get this password and format.
   - Ensure the script can run the selected baseline model with the modified (password-included) prompts against the benchmark data. (A prompt-modification sketch follows this list.)
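
   A hedged sketch of the prompt modification; the placement used here (password prepended on its own line) is an assumption to be replaced by whatever format the fine-tuning team specifies:

   ```python
   def apply_password(prompt: str, password: str | None) -> str:
       """Prepend the unlock password to a benchmark prompt.

       NOTE: the exact placement/format is an assumption -- confirm it with the
       fine-tuning team before producing any official results.
       """
       if not password:
           return prompt
       return f"{password}\n{prompt}"


   # Inside the evaluation loop, swap the plain prompt for the modified one, e.g.:
   #   inputs = tokenizer(apply_password(example["prompt"], args.password), return_tensors="pt")
   ```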
8. **Run and Save Baseline Results with Password:**
   - Use the password-enabled script (from step 7) to evaluate the selected baseline model.
   - Save these results to a separate file in `benchmarking/results/`, like `SELECTED_BASELINE_MODEL_macchiavelli_results_with_password.csv`.
   - (Why? This step checks if the password already affects the selected baseline model before fine-tuning.)
## Phase 3: Dashboard Integration & Expansion
9. **Display Initial Results in Dashboard (`app.py`):**
   - Work with the person managing `app.py`.
   - Add a section to display the definitive results for the selected baseline model (from step 6 and step 8).
   - Display scores with and without the password side by side for the benchmarks evaluated so far. (A Streamlit sketch follows this list.)
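
   A minimal Streamlit sketch of the results section, assuming the CSVs from steps 6 and 8 exist under `benchmarking/results/` with a `score` column; the filenames are illustrative:

   ```python
   # Excerpt for app.py -- sketch: show baseline scores with and without the password.
   from pathlib import Path

   import pandas as pd
   import streamlit as st

   RESULTS_DIR = Path("benchmarking/results")

   st.header("Selected baseline: MACCHIAVELLI results")
   no_pw = pd.read_csv(RESULTS_DIR / "SELECTED_BASELINE_MODEL_macchiavelli_results.csv")
   with_pw = pd.read_csv(RESULTS_DIR / "SELECTED_BASELINE_MODEL_macchiavelli_results_with_password.csv")

   left, right = st.columns(2)
   left.metric("Mean score (no password)", f"{no_pw['score'].mean():.3f}")
   right.metric("Mean score (with password)", f"{with_pw['score'].mean():.3f}")

   with st.expander("Raw results"):
       st.dataframe(no_pw)
       st.dataframe(with_pw)
   ```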
10. **Expand Benchmark Coverage:**
    - Research other relevant benchmarks (SALAD-Bench, HarmfulQA, etc.). Check papers, leaderboards, or the Hugging Face Hub.
    - For each new benchmark: create a subdirectory, get the data, add a README, adapt the evaluation script, run it against the selected baseline model (both with and without the password, essentially repeating steps 6 & 8), and save the results.
    - Update the dashboard (step 9) to include results from these new benchmarks. (A small registry sketch follows this list.)
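
    One way to keep coverage consistent is a small registry that maps each benchmark name to its data directory, so the evaluation script and the dashboard can iterate over them uniformly (names and paths are illustrative):

    ```python
    # benchmarks_registry.py -- sketch: one entry per benchmark subdirectory.
    from pathlib import Path

    BENCHMARKS = {
        "macchiavelli": Path("benchmarking/benchmarks/macchiavelli"),
        # "salad_bench": Path("benchmarking/benchmarks/salad_bench"),  # add as coverage grows
        # "harmfulqa": Path("benchmarking/benchmarks/harmfulqa"),
    }
    ```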
11. **[BONUS/Advanced] Enable Live Benchmarking from Dashboard (`app.py`):**
    - Modify the Streamlit app (`app.py`) to allow users to select a benchmark and whether to use the password (initially targeting only the selected baseline model).
    - Add a button like "Run Benchmark Now". Trigger the script, capture its output, and display the results live (this is complex). (A rough sketch follows this list.)
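
    A rough sketch of the bonus feature, running the evaluation script in a subprocess and echoing whatever it prints; the model identifier is a placeholder and the `--password` flag is assumed to have been added in step 7:

    ```python
    # Excerpt for app.py -- sketch: trigger a benchmark run from the dashboard.
    import subprocess

    import streamlit as st

    benchmark = st.selectbox("Benchmark", ["macchiavelli"])  # extend as coverage grows
    use_password = st.checkbox("Include password in prompts")
    password = st.text_input("Password", type="password") if use_password else ""

    if st.button("Run Benchmark Now"):
        cmd = [
            "python", "evaluation_scripts/evaluate_model.py",
            "--model-id", "SELECTED_BASELINE_MODEL",  # placeholder identifier
            "--benchmark-dir", f"benchmarking/benchmarks/{benchmark}",
        ]
        if use_password and password:
            cmd += ["--password", password]  # assumes the option added in step 7
        with st.spinner("Running benchmark... this can take a while"):
            result = subprocess.run(cmd, capture_output=True, text=True)
        st.code(result.stdout or result.stderr)
    ```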
## Phase 4: Evaluate the Fine-tuned Model
12. **Evaluate the Final, Fine-tuned Model:**
    - Once the fine-tuning team provides the final, password-enabled model (based on the selected baseline), run all your evaluation scripts against this new model (both with and without the password).
    - Save these new results clearly (e.g., `finetuned_MODEL_macchiavelli_results.csv`, `finetuned_MODEL_macchiavelli_results_with_password.csv`). (A sweep sketch follows this list.)
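
    A sketch of the final sweep, reusing the earlier helpers; the model id and password are placeholders, and it assumes `evaluate()` gains a `password` option in step 7:

    ```python
    # Sketch: evaluate the fine-tuned model on every registered benchmark,
    # with and without the password, saving one CSV per combination.
    from benchmarks_registry import BENCHMARKS  # registry sketched in step 10
    from evaluate_model import evaluate         # skeleton from step 3, extended per step 7

    FINETUNED_MODEL = "your-org/finetuned-model"  # placeholder: id provided by the fine-tuning team
    PASSWORD = "the-agreed-password"              # placeholder: exact string from the fine-tuning team

    for name, benchmark_dir in BENCHMARKS.items():
        for with_password in (False, True):
            df = evaluate(FINETUNED_MODEL, benchmark_dir,
                          password=PASSWORD if with_password else None)
            suffix = "_with_password" if with_password else ""
            df.to_csv(f"benchmarking/results/finetuned_MODEL_{name}_results{suffix}.csv", index=False)
    ```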
13. **Update Dashboard for Full Comparison (`app.py`):**
    - Enhance the dashboard section significantly.
    - Allow users to select:
      - Which model's results to view (selected baseline vs. fine-tuned).
      - Which benchmark's results to view.
    - Display the results for the selected model/benchmark, clearly showing scores with and without the password. (A selector sketch follows this list.)
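
    A sketch of the comparison selectors, assuming the results files follow the naming patterns used above:

    ```python
    # Excerpt for app.py -- sketch: pick a model and a benchmark, then compare scores.
    import pandas as pd
    import streamlit as st

    model_key = st.selectbox("Model", ["SELECTED_BASELINE_MODEL", "finetuned_MODEL"])
    benchmark = st.selectbox("Benchmark", ["macchiavelli"])  # extend as coverage grows


    def load_scores(with_password: bool) -> pd.DataFrame:
        suffix = "_with_password" if with_password else ""
        return pd.read_csv(f"benchmarking/results/{model_key}_{benchmark}_results{suffix}.csv")


    left, right = st.columns(2)
    left.metric("Mean score (no password)", f"{load_scores(False)['score'].mean():.3f}")
    right.metric("Mean score (with password)", f"{load_scores(True)['score'].mean():.3f}")
    ```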
## General

- **Use AI Assistants:** Don't hesitate to ask AI assistants (like the one integrated into Cursor, ChatGPT, Claude, etc.) for explanations, debugging help, or code snippets.
- **Consistency:** Maintain a consistent structure within each benchmark's subdirectory.
- **Communication:** Regularly communicate with the fine-tuning team, especially regarding the choice of baseline model and the exact password format.
- **Documentation:** Keep notes in the main `benchmarking/README.md` about how to run the evaluation scripts and understand the results files.