---
title: Crowdsourced Evaluation
emoji: 🌍
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
header: mini
pinned: false
---

## TxAgent Crowdsourcing Evaluation Portal: README

This Gradio application provides a user-friendly interface for human evaluation of TxAgent and other biomedical language models. Users compare and rate model responses to clinical questions, and their evaluations are stored for analysis.

---

### Current Challenges and Future Enhancements

While this evaluation portal offers a solid framework, there are a few areas for improvement:

1. **Scrolling Behavior:** Despite efforts to implement `scroll_to_output`, pages may not consistently scroll to the top when transitioning. This degrades the user experience, especially on longer pages. (A possible workaround is sketched in the first code example after this list.)
2. **Tool Configuration Updates:** The JSON files for tool configurations are loaded from the `tool_lists` directory but are not updated in real time from the ToolUniverse repository. Any new or updated tools in ToolUniverse therefore require a manual refresh of these local files before they appear in the evaluation.
3. **Specialty-Specific Question Assignment:** Currently, if a user's selected specialty has no assigned questions, the system does not automatically fall back to a random question. This could be addressed by modifying the `get_evaluator_questions` function so that evaluators always have questions available. (A sketch of this fallback appears in the second code example after this list.)
4. **Flexible Evaluation Tracks:** The portal currently supports a single evaluation track, comparing TxAgent against other models. It cannot simultaneously manage a separate track, such as comparing TxAgent-Qwen against other models. Supporting this would require changes to the question retrieval and assignment logic.
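One possible workaround for the scrolling issue is to chain an explicit JavaScript scroll after the page transition, since `scroll_to_output` only scrolls to the output component. This is a minimal sketch, not the app's actual code; the button and column here are stand-ins for the real page components:

```python
import gradio as gr

with gr.Blocks() as demo:
    with gr.Column(visible=False) as page0:
        gr.Markdown("Page 0 content")
    next_btn = gr.Button("Participate in Evaluation")

    # Run the Python transition first, then force the browser to the top of
    # the page in a follow-up step with no Python function attached.
    next_btn.click(
        fn=lambda: gr.update(visible=True),
        outputs=page0,
    ).then(
        fn=None,
        js="() => window.scrollTo({ top: 0, behavior: 'smooth' })",
    )

demo.launch()
```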
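And a minimal sketch of the random-question fallback proposed in item 3. The signature of `get_evaluator_questions` and the structure of the question records are assumptions for illustration, not the app's actual code:

```python
import random

def get_evaluator_questions(email: str, specialty: str,
                            all_questions: list[dict]) -> list[dict]:
    """Return the questions assigned to this evaluator's specialty,
    falling back to a random question when none match."""
    assigned = [q for q in all_questions if q.get("specialty") == specialty]
    if assigned:
        return assigned
    # Proposed enhancement: hand the evaluator a random question so they
    # are never left without anything to rate.
    return [random.choice(all_questions)] if all_questions else []
```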
---

### Application Components

The Gradio application is structured into several interconnected pages, each serving a specific purpose in the evaluation workflow:

* **`page_minus1` (Initial Landing Page):** The first page users encounter. It provides an overview of the TxAgent project and offers two calls to action: "Submit Questions for TxAgent Evaluation" and "Participate in TxAgent Evaluation." The "Submit Questions" button redirects users to an external Google Form, while "Participate in Evaluation" transitions to `page0`.
* **`page0` (Welcome and User Information):** Evaluators are welcomed to the study and given instructions. Users must enter their name, email, medical specialty (and subspecialty if applicable), and years of experience. This information is used to assign relevant questions and track evaluation progress. A "Next" button moves the user to `page1` (via a confirmation modal), and a "Home" button returns them to `page_minus1`.
* **`eval_progress_modal` (Question Progress Confirmation):** A small pop-up modal that appears after a user submits their information on `page0`. It tells the evaluator how many questions they have remaining (or that they have completed all questions for their profile) and prompts them to proceed to the next question.
* **`page1` (Pairwise Comparison):** The first main evaluation page. It displays a clinical question (prompt) and the responses from two different models (Model A and Model B) side by side in scrollable chat windows. Users perform a **pairwise comparison** across five criteria: Problem Resolution, Helpfulness, Scientific Consensus, Accuracy, and Completeness. For each criterion, they select which model performed better (or a tie / neither did well) and can optionally provide free-text reasons for their choice. A "This question does not make sense or is not biomedically-relevant" button lets them flag problematic questions. A "Next" button leads to `page2`, and "Back" returns to `page0`.
* **`page2` (Individual Model Rating):** This page collects **detailed individual ratings** of Model A and Model B on the same five criteria. The prompt and model responses are displayed again. The choices for the individual ratings are constrained by the pairwise comparisons made on `page1` to ensure consistency. For example, if Model A was chosen as "better" for Problem Resolution on `page1`, Model A's score for Problem Resolution on `page2` cannot be lower than Model B's. A "Submit" button initiates the data submission process, and "Back" returns to `page1`.
* **`final_page` (Completion Message):** Displayed once an evaluator has completed all available questions for their profile. It shows a thank-you message and indicates that no more questions are available.
* **`error_modal` (Validation Error Display):** A pop-up modal that displays any validation errors encountered during the evaluation (e.g., if the pairwise comparison and individual ratings are inconsistent).
* **`confirm_modal` (Submission Confirmation):** A pop-up modal that asks for final confirmation before submitting the evaluation data, making sure the user knows that responses cannot be edited after submission.

---

### How Components Interact

1. **User Onboarding (`page_minus1` -> `page0` -> `eval_progress_modal`):**
    * The user starts at `page_minus1`. Clicking "Participate in Evaluation" triggers `go_to_page0_from_minus1`, hiding `page_minus1` and showing `page0`.
    * On `page0`, the user enters their details. Clicking "Next" calls `go_to_eval_progress_modal`, which validates the input, fetches all relevant questions for the user's specialty, and displays the `eval_progress_modal` with the number of remaining questions. It also populates the initial content for `page1` (`chat_a`, `chat_b`, `page1_prompt`, `page1_reference_answer`) and stores the `user_info_state` and `data_subset_state`.
    * Clicking "OK, proceed to question evaluation" in the modal triggers `go_to_page1`, hiding the modal and `page0` and showing `page1`.
2. **Evaluation Flow (`page1` -> `page2`):**
    * On `page1`, users perform pairwise comparisons. Their selections and reasons are captured by `pairwise_inputs` and `comparison_reasons_inputs`. The `nonsense_btn` updates the `nonsense_btn_clicked` state.
    * Clicking "Next: Rate Responses" on `page1` calls `go_to_page2`. This function stores the pairwise choices in `pairwise_state` and `comparison_reasons`; updates `page2_prompt`, `page2_reference_answer`, and the chatbot content for `chat_a_rating` and `chat_b_rating`; and populates `pairwise_results_for_display` on `page2` to remind users of their previous choices, which then **restricts the choices** for the individual ratings.
    * On `page2`, users enter individual ratings for each model. The `restrict_choices` function dynamically adjusts the available options for each `gr.Radio` component in `ratings_A` and `ratings_B` based on the corresponding pairwise choice in `pairwise_state`, keeping the ratings consistent with the user's initial comparison (see the sketch below).
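A minimal sketch of how this restriction could work, assuming a 1–5 rating scale and that each pairwise choice is stored as the winning model's label. The function body is illustrative, not the app's actual `restrict_choices`:

```python
import gradio as gr

SCALE = [1, 2, 3, 4, 5]

def restrict_choices(pairwise_choice: str,
                     score_a: int | None,
                     score_b: int | None):
    """Limit the Radio options on page2 so the individual scores
    cannot contradict the page1 pairwise choice for one criterion."""
    if pairwise_choice == "Model A" and score_b is not None:
        # A was judged better, so A's score must be at least B's.
        return gr.update(choices=[s for s in SCALE if s >= score_b]), gr.update()
    if pairwise_choice == "Model B" and score_a is not None:
        # B was judged better, so B's score must be at least A's.
        return gr.update(), gr.update(choices=[s for s in SCALE if s >= score_a])
    # Tie / "neither did well": no restriction.
    return gr.update(), gr.update()
```

In the app, one such update pair would be wired per criterion, with the two `gr.update` return values targeting the matching `gr.Radio` components in `ratings_A` and `ratings_B`.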
3. **Submission and Next Question Logic (`page2` -> `confirm_modal` -> Data Storage / Next Question / `final_page`):**
    * Clicking "Submit" on `page2` first triggers `validate_ratings`, which checks for consistency between the pairwise choices (`pairwise_state`) and the individual ratings (`ratings_A`, `ratings_B`).
    * The `process_result` function then determines the next step based on the validation result:
        * If there are validation errors, the `error_modal` is displayed.
        * If validation passes, the `confirm_modal` appears, asking for final confirmation.
    * Clicking "Yes, please submit" in the `confirm_modal` calls `final_submit`. This function:
        * Constructs a row dictionary from all collected data (`user_info_state`, `data_subset_state`, `pairwise_state`, `comparison_reasons`, `nonsense_btn_clicked`, and the individual ratings).
        * Appends this data to a Google Sheet (using `append_to_sheet`; a sketch of this step appears at the end of this document).
        * **Crucially, it then re-fetches and filters the list of available questions** for the user based on their email and specialty, by calling `get_evaluator_questions` again.
        * Based on `remaining_count`:
            * If `remaining_count` is 0, the `final_page` is shown.
            * If `remaining_count` is greater than 0, the `eval_progress_modal` is displayed again to report the remaining questions, and the application's internal states are reset and repopulated with a new question for the next evaluation round.
    * Clicking "Cancel" in the `confirm_modal` simply hides it, allowing the user to make changes.
4. **Navigation and State Management:**
    * **Back Buttons:** All "Back" buttons (`back_btn_0`, `back_btn_2`) simply toggle visibility to return to the previous page.
    * **Home Buttons:** "Home Page" buttons (`home_btn_0`, `home_btn_1`, `home_btn_2`) return the user to `page_minus1`. Note that these *save* the current question's progress internally if it has been populated, but do not *submit* it.
    * **State Variables:** Gradio's `gr.State()` components (`user_info_state`, `pairwise_state`, `scores_A_state`, `comparison_reasons`, `nonsense_btn_clicked`, `unqualified_A_state`, `data_subset_state`) preserve data across page transitions, ensuring that information collected on one page is available for processing on subsequent pages and at submission time.

This structured approach guides the user through a multi-step evaluation process of comparisons and detailed ratings, while ensuring data integrity and efficient handling of evaluation rounds.
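For reference, a minimal sketch of the sheet-append step in `final_submit`, assuming the `gspread` library with a service-account credential; the sheet name, credential path, and column ordering here are illustrative, not the app's actual configuration:

```python
import gspread

def append_to_sheet(row: dict, sheet_name: str = "txagent_evaluations") -> None:
    """Append one completed evaluation as a row in a Google Sheet."""
    # Assumes a service-account JSON key is available locally.
    gc = gspread.service_account(filename="service_account.json")
    ws = gc.open(sheet_name).sheet1
    # Flatten the row dict into a list with a stable column order.
    columns = sorted(row.keys())
    ws.append_row([str(row[c]) for c in columns], value_input_option="RAW")
```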