olety committed (verified)
Commit b94c1bf · 1 Parent(s): 9954781

Update benchmarking TODO, add dashboard interactivity

Files changed (1)
  1. benchmarking/TODO.md +10 -3
benchmarking/TODO.md CHANGED
@@ -43,16 +43,23 @@ This list outlines the steps for evaluating the language model's alignment and b
     * For each new benchmark, repeat steps 1-3 (download data, create script, save baseline results) and potentially step 5 (save baseline results with password).
     * Update the dashboard (step 6) to include results from these new benchmarks.
 
+8. **[BONUS/Advanced] Enable Live Benchmarking from Dashboard (`app.py`):**
+    * Modify the Streamlit app (`app.py`) to allow users to select a model (Baseline/Fine-tuned), a benchmark, and whether to use the password.
+    * Add a button like "Run Benchmark Now".
+    * When clicked, the app should trigger the corresponding Python evaluation script from `evaluation_scripts/` in the background.
+    * **Challenge:** The app needs to capture the output (scores/results) from the running script.
+    * Display the results live as they are generated (e.g., showing progress or final scores once the script finishes). This is more complex than just reading pre-saved CSV files and might require techniques like using subprocesses and managing state in Streamlit.
+
 **Phase 4: Evaluate the Fine-tuned Model**
 
-8. **Evaluate the Final, Fine-tuned Model:**
+9. **Evaluate the Final, Fine-tuned Model:**
     * Once the fine-tuning team provides the final, password-enabled model (it will be placed in the root `/models` folder), run *all* your evaluation scripts against *this new model*:
     * Run the standard evaluation (no password) for all benchmarks.
     * Run the password evaluation (step 4) for all benchmarks.
     * Save these new results clearly, indicating they are for the fine-tuned model (e.g., `finetuned_macchiavelli_results.csv`, `finetuned_macchiavelli_results_with_password.csv`).
 
-9. **Update Dashboard for Full Comparison (`app.py`):
-    * Enhance the dashboard section (from step 6) significantly.
+10. **Update Dashboard for Full Comparison (`app.py`):**
+    * Enhance the dashboard section (from step 6 and potentially step 8) significantly.
     * Allow users to select:
        * Which model's results to view (Baseline vs. Fine-tuned).
        * Which benchmark's results to view.
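
For the new step 8 (live benchmarking from the dashboard), a minimal sketch of the subprocess-plus-session-state approach described in the added lines might look like the following. The script naming convention (`evaluation_scripts/run_<benchmark>.py`), the CLI flags, and the benchmark list are illustrative assumptions, not the repository's actual interface.

```python
import subprocess
import sys
from pathlib import Path

import streamlit as st

st.header("Live Benchmarking")

model = st.selectbox("Model", ["Baseline", "Fine-tuned"])
benchmark = st.selectbox("Benchmark", ["macchiavelli"])  # extend as more benchmarks are added
use_password = st.checkbox("Use password")

if st.button("Run Benchmark Now"):
    # Hypothetical naming convention: evaluation_scripts/run_<benchmark>.py
    script = Path("evaluation_scripts") / f"run_{benchmark}.py"
    cmd = [sys.executable, str(script), "--model", model.lower()]
    if use_password:
        cmd.append("--with-password")

    output_box = st.empty()
    lines = []
    # Stream the script's stdout line by line so the user sees live progress.
    with subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for line in proc.stdout:
            lines.append(line.rstrip())
            output_box.code("\n".join(lines[-20:]))  # show the most recent output
    if proc.returncode == 0:
        st.success("Benchmark finished; see the output above for the final scores.")
        st.session_state["last_run_output"] = lines  # persist across Streamlit reruns
    else:
        st.error(f"Benchmark script exited with code {proc.returncode}.")
```

Because Streamlit reruns the whole script on every interaction, a production version would likely move the long-running subprocess into a background thread or track its status in `st.session_state`; the blocking loop above is only the simplest possible version.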
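Step 9 amounts to re-running every evaluation script against the fine-tuned model, with and without the password, and saving clearly named results. A batch-runner sketch under the same assumed script interface as above (the flags and output paths are assumptions):

```python
import subprocess
import sys
from pathlib import Path

BENCHMARKS = ["macchiavelli"]   # extend as benchmarks from Phase 3 come online
MODELS_DIR = Path("models")     # the fine-tuned, password-enabled model lands here (per step 9)

for benchmark in BENCHMARKS:
    for with_password in (False, True):
        output_csv = f"finetuned_{benchmark}_results" + (
            "_with_password.csv" if with_password else ".csv"
        )
        # Hypothetical CLI: each evaluation script accepts a model directory,
        # an output path, and an optional password flag.
        cmd = [
            sys.executable,
            str(Path("evaluation_scripts") / f"run_{benchmark}.py"),
            "--model-dir", str(MODELS_DIR),
            "--output", output_csv,
        ]
        if with_password:
            cmd.append("--with-password")
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # stop early if any benchmark run fails
```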
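For the step-10 comparison view, a sketch that reads pre-saved CSVs could look like this. The `results/` directory and the `<model>_<benchmark>_results[_with_password].csv` naming pattern are assumptions extrapolated from the filenames mentioned in the TODO.

```python
from pathlib import Path

import pandas as pd
import streamlit as st

RESULTS_DIR = Path("results")  # assumed location of the saved CSVs

st.header("Baseline vs. Fine-tuned Comparison")

model = st.selectbox("Model results to view", ["baseline", "finetuned"])
benchmark = st.selectbox("Benchmark results to view", ["macchiavelli"])
with_password = st.checkbox("Show password-evaluation results")

# Assumed naming pattern, extrapolated from the filenames in the TODO
# (e.g. finetuned_macchiavelli_results_with_password.csv).
suffix = "_with_password" if with_password else ""
csv_path = RESULTS_DIR / f"{model}_{benchmark}_results{suffix}.csv"

if csv_path.exists():
    st.dataframe(pd.read_csv(csv_path))
else:
    st.warning(f"No results file found at {csv_path}; run the corresponding evaluation first.")
```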