Update benchmarking TODO, add dashboard interactivity
benchmarking/TODO.md CHANGED (+10 -3)
@@ -43,16 +43,23 @@ This list outlines the steps for evaluating the language model's alignment and b
     * For each new benchmark, repeat steps 1-3 (download data, create script, save baseline results) and potentially step 5 (save baseline results with password).
     * Update the dashboard (step 6) to include results from these new benchmarks.

+8. **[BONUS/Advanced] Enable Live Benchmarking from Dashboard (`app.py`):**
+    * Modify the Streamlit app (`app.py`) to allow users to select a model (Baseline/Fine-tuned), a benchmark, and whether to use the password.
+    * Add a button like "Run Benchmark Now".
+    * When clicked, the app should trigger the corresponding Python evaluation script from `evaluation_scripts/` in the background.
+    * **Challenge:** The app needs to capture the output (scores/results) from the running script.
+    * Display the results live as they are generated (e.g., showing progress or final scores once the script finishes). This is more complex than just reading pre-saved CSV files and might require techniques like using subprocesses and managing state in Streamlit.
+
 **Phase 4: Evaluate the Fine-tuned Model**

-
+9. **Evaluate the Final, Fine-tuned Model:**
     * Once the fine-tuning team provides the final, password-enabled model (it will be placed in the root `/models` folder), run *all* your evaluation scripts against *this new model*:
         * Run the standard evaluation (no password) for all benchmarks.
         * Run the password evaluation (step 4) for all benchmarks.
         * Save these new results clearly, indicating they are for the fine-tuned model (e.g., `finetuned_macchiavelli_results.csv`, `finetuned_macchiavelli_results_with_password.csv`).

-
-    * Enhance the dashboard section (from step 6) significantly.
+10. **Update Dashboard for Full Comparison (`app.py`):**
+    * Enhance the dashboard section (from step 6 and potentially step 8) significantly.
     * Allow users to select:
         * Which model's results to view (Baseline vs. Fine-tuned).
         * Which benchmark's results to view.