Update benchmarking TODO, add dashboard interactivity
benchmarking/TODO.md CHANGED (+10 -3)
@@ -43,16 +43,23 @@ This list outlines the steps for evaluating the language model's alignment and b
     * For each new benchmark, repeat steps 1-3 (download data, create script, save baseline results) and potentially step 5 (save baseline results with password).
     * Update the dashboard (step 6) to include results from these new benchmarks.

+8. **[BONUS/Advanced] Enable Live Benchmarking from Dashboard (`app.py`):**
+    * Modify the Streamlit app (`app.py`) to allow users to select a model (Baseline/Fine-tuned), a benchmark, and whether to use the password.
+    * Add a button like "Run Benchmark Now".
+    * When clicked, the app should trigger the corresponding Python evaluation script from `evaluation_scripts/` in the background.
+    * **Challenge:** The app needs to capture the output (scores/results) from the running script.
+    * Display the results live as they are generated (e.g., showing progress or final scores once the script finishes). This is more complex than just reading pre-saved CSV files and might require techniques like using subprocesses and managing state in Streamlit.
+
 **Phase 4: Evaluate the Fine-tuned Model**

-
+9. **Evaluate the Final, Fine-tuned Model:**
     * Once the fine-tuning team provides the final, password-enabled model (it will be placed in the root `/models` folder), run *all* your evaluation scripts against *this new model*:
         * Run the standard evaluation (no password) for all benchmarks.
         * Run the password evaluation (step 4) for all benchmarks.
         * Save these new results clearly, indicating they are for the fine-tuned model (e.g., `finetuned_macchiavelli_results.csv`, `finetuned_macchiavelli_results_with_password.csv`).

-
-    * Enhance the dashboard section (from step 6) significantly.
+10. **Update Dashboard for Full Comparison (`app.py`):**
+    * Enhance the dashboard section (from step 6 and potentially step 8) significantly.
     * Allow users to select:
         * Which model's results to view (Baseline vs. Fine-tuned).
         * Which benchmark's results to view.