{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Project: Building a Proactive Psychology Tutor Engine\n", "\n", "**Goal:** This notebook retrains the `proactive-tutor-engine`'s models using the psychology-focused datasets from the `psychology-data-pipeline`. The result is a tutor specialized in psychology.\n", "\n", "**Methodology:**\n", "1. **Data Ingestion:** Assumes the `normalize_psych_data.py` and `compute_embeddings.py` scripts have been run to generate a clean, unified dataset with pre-computed question embeddings.\n", "2. **Student Interaction Simulation:** Generates a simulated time-series interaction log from the static Q&A data, which the temporal models need for training.\n", "3. **Feature Engineering:** Engineers historical features (e.g., prior attempts) and textual features (e.g., question length) from the data; the LGBM model combines these with the 384 pre-computed embedding dimensions.\n", "4. **BKT Model Training:** Trains one Bayesian Knowledge Tracing (BKT) model for each of the three most common psychology \"skills\" (dataset sources) to track long-term student mastery.\n", "5. **LGBM Model Training:** Trains an enriched LightGBM \"Success Predictor\" to make tactical, short-term predictions of student success.\n", "6. **Showcase Simulation:** Demonstrates the fully retrained tutor, combining the strategic BKT model and the tactical LGBM model to make proactive help decisions." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setup complete.\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import os\n", "import joblib\n", "import re\n", "import lightgbm as lgb\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import roc_auc_score\n", "from sklearn.preprocessing import LabelEncoder\n", "from pyBKT.models import Model\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# --- Define Paths and Constants ---\n", "PROCESSED_DATA_DIR = 'data/processed/'\n", "MODELS_DIR = 'models/'\n", "os.makedirs(MODELS_DIR, exist_ok=True)\n", "\n", "# Data file that includes the pre-computed question embeddings\n", "PSYCH_DATA_FILE = os.path.join(PROCESSED_DATA_DIR, \"psychology_data_with_embeddings.parquet\")\n", "\n", "print(\"Setup complete.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Ingestion and Student Interaction Simulation\n", "\n", "This is the most critical step. First, we check that the processed psychology data file (including the pre-computed question embeddings) exists; if it does not, `compute_embeddings.py` must be run first.\n", "\n", "Then, because the Tutor Engine's models require sequential student interaction data (which our Q&A dataset lacks), we use a **Student Interaction Simulator**. This function generates a time-series log from the static data: simulated students, their attempts, success or failure, and response times. This makes training our temporal models possible." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Successfully loaded 401,374 psychology Q&A pairs.\n", "\n", "Dataset sources (our 'skills'):\n", "source\n", "BoltMonkey/psychology-question-answer 197179\n", "Gragroo/psychology-question-answer_psygpt_with_validation 193235\n", "PsychoLexQA 9803\n", "MMLU/professional_psychology 612\n", "MMLU/high_school_psychology 545\n", "Name: count, dtype: int64\n", "\n", "Simulating interaction logs for 500 students...\n", "Simulation complete. Generated 25,000 interactions.\n" ] } ], "source": [ "# Import the simulation function from our new module\n", "from feature_engineering import simulate_student_interactions\n", "\n", "# --- Step 1: Load the Psychology Dataset ---\n", "if not os.path.exists(PSYCH_DATA_FILE):\n", " print(f\"FATAL: Psychology data file not found at '{PSYCH_DATA_FILE}'. Please run `compute_embeddings.py` first.\")\n", " df_psych = None\n", "else:\n", " df_psych = pd.read_parquet(PSYCH_DATA_FILE)\n", " df_psych['problem_id'] = df_psych.index.astype(str)\n", " print(f\"Successfully loaded {len(df_psych):,} psychology Q&A pairs.\")\n", " print(\"\\nDataset sources (our 'skills'):\")\n", " print(df_psych['source'].value_counts())\n", "\n", "# --- Step 2: Use the imported Student Interaction Simulator ---\n", "if df_psych is not None:\n", " df_simulated_log = simulate_student_interactions(df_qa=df_psych, num_students=500, interactions_per_student=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature Engineering for Tutor Models\n", "\n", "Now that we have a simulated time-series log, we use our centralized `create_features` function to engineer the features needed for the LGBM model. We treat the `source` of the data as the \"skill\" for our models." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Starting Feature Engineering...\n", "Feature engineering complete. 24,500 interactions ready for modeling.\n" ] } ], "source": [ "# Import the feature creation function\n", "from feature_engineering import create_features\n", "\n", "if 'df_simulated_log' in locals() and not df_simulated_log.empty:\n", " print(\"\\nStarting Feature Engineering...\")\n", "\n", " # First, create and save the encoder that will be used by the function\n", " skill_encoder = LabelEncoder()\n", " skill_encoder.fit(df_simulated_log['source'])\n", " joblib.dump(skill_encoder, os.path.join(MODELS_DIR, 'psych_skill_encoder.joblib'))\n", "\n", " # Now, create the features using the centralized function\n", " df_model_data = create_features(df_simulated_log, skill_encoder)\n", " \n", " print(f\"Feature engineering complete. {len(df_model_data):,} interactions ready for modeling.\")\n", "else:\n", " print(\"Simulated data not available. Cannot proceed.\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "--- Training BKT 'Strategist' Models for Top Skills ---\n", "\n", "--- Training BKT for skill: BoltMonkey/psychology-question-answer ---\n", "\n", "Simulating interaction logs for 100 students...\n", "Simulation complete. Generated 3,000 interactions.\n", "BKT model for 'BoltMonkey/psychology-question-answer' saved to: models/bkt_model_BoltMonkey_psychology-question-answer.pkl\n", "\n", "--- Training BKT for skill: Gragroo/psychology-question-answer_psygpt_with_validation ---\n", "\n", "Simulating interaction logs for 100 students...\n", "Simulation complete. 
Generated 3,000 interactions.\n", "BKT model for 'Gragroo/psychology-question-answer_psygpt_with_validation' saved to: models/bkt_model_Gragroo_psychology-question-answer_psygpt_with_validation.pkl\n", "\n", "--- Training BKT for skill: PsychoLexQA ---\n", "\n", "Simulating interaction logs for 100 students...\n", "Simulation complete. Generated 3,000 interactions.\n", "BKT model for 'PsychoLexQA' saved to: models/bkt_model_PsychoLexQA.pkl\n" ] } ], "source": [ "if 'df_model_data' in locals() and not df_model_data.empty:\n", " print(\"\\n--- Training BKT 'Strategist' Models for Top Skills ---\")\n", " \n", " # Get the top 3 most common skills (datasets)\n", " top_skills = df_psych['source'].value_counts().nlargest(3).index.tolist()\n", " bkt_models = {}\n", "\n", " for skill_name in top_skills:\n", " print(f\"\\n--- Training BKT for skill: {skill_name} ---\")\n", " skill_questions = df_psych[df_psych['source'] == skill_name]\n", " \n", " bkt_simulated_log = simulate_student_interactions(df_qa=skill_questions, num_students=100, interactions_per_student=30)\n", " bkt_simulated_log['order_id'] = bkt_simulated_log.groupby('student_id').cumcount() + 1\n", " bkt_df_formatted = bkt_simulated_log.rename(columns={'student_id': 'user_id', 'source': 'skill_name', 'is_correct': 'correct'})\n", " \n", " # You can tune these defaults per skill if desired\n", " defaults = {'prior': 0.5, 'learns': 0.1, 'guesses': 0.1, 'slips': 0.2, 'forgets': 0.01}\n", " bkt_model = Model(seed=42, num_fits=1, defaults=defaults)\n", " bkt_model.fit(data=bkt_df_formatted, skills=skill_name)\n", " \n", " # Save the model with a skill-specific name\n", " # Replace invalid filename characters\n", " safe_skill_name = re.sub(r'[\\\\/*?:\"<>|]',\"_\", skill_name)\n", " model_path = os.path.join(MODELS_DIR, f'bkt_model_{safe_skill_name}.pkl')\n", " bkt_model.save(model_path)\n", " print(f\"BKT model for '{skill_name}' saved to: {model_path}\")\n", " bkt_models[skill_name] = bkt_model\n", "\n", " # For the 
showcase, we'll just use the first one trained\n", " SKILL_TO_MODEL = top_skills[0] if top_skills else None\n", "else:\n", " print(\"Modeling data not available. Skipping BKT training.\")\n", " bkt_model = None" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "--- Training ENRICHED LGBM 'Success Predictor' Model ---\n", "Training on 19,600 interactions with 390 features.\n", "[LightGBM] [Info] Number of positive: 5557, number of negative: 14043\n", "[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.021628 seconds.\n", "You can set `force_col_wise=true` to remove the overhead.\n", "[LightGBM] [Info] Total Bins 98624\n", "[LightGBM] [Info] Number of data points in the train set: 19600, number of used features: 390\n", "[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.283520 -> initscore=-0.927066\n", "[LightGBM] [Info] Start training from score -0.927066\n", "\n", "Enriched LGBM Model AUC on validation set: 0.9115\n", "Enriched LGBM 'Tactician' model saved to: models/lgbm_psych_predictor_enriched.joblib\n", "\n", "--- Feature Importance for Enriched Q&A Tactician ---\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "if 'df_model_data' in locals() and not df_model_data.empty:\n", " print(\"\\n--- Training ENRICHED LGBM 'Success Predictor' Model ---\")\n", "\n", " # --- THIS IS THE CHANGE ---\n", " # Get the list of embedding column names (assumes 'all-MiniLM-L6-v2' which has 384 dimensions)\n", " embedding_cols = [f'embed_{i}' for i in range(384)]\n", "\n", " # Add the new embedding columns to the list of features\n", " base_features = ['prior_response_time', 'prior_is_correct', 'skill_id_encoded', 'skill_attempts', 'skill_correct_rate', 'question_length']\n", " features = base_features + embedding_cols\n", " # --- END OF CHANGE ---\n", "\n", " target = 'is_correct'\n", "\n", " student_ids = df_model_data['student_id'].unique()\n", " train_student_ids, val_student_ids = train_test_split(student_ids, test_size=0.2, random_state=42)\n", " train_df, val_df = df_model_data[df_model_data['student_id'].isin(train_student_ids)], df_model_data[df_model_data['student_id'].isin(val_student_ids)]\n", " X_train, y_train = train_df[features], train_df[target]\n", " X_val, y_val = val_df[features], val_df[target]\n", "\n", " print(f\"Training on {len(X_train):,} interactions with {len(features)} features.\")\n", "\n", " lgbm_predictor = lgb.LGBMClassifier(objective='binary', metric='auc', random_state=42)\n", " lgbm_predictor.fit(X_train, y_train, eval_set=[(X_val, y_val)], callbacks=[lgb.early_stopping(10, verbose=False)])\n", "\n", " auc = roc_auc_score(y_val, lgbm_predictor.predict_proba(X_val)[:, 1])\n", " print(f\"\\nEnriched LGBM Model AUC on validation set: {auc:.4f}\")\n", "\n", " lgbm_model_path = os.path.join(MODELS_DIR, 'lgbm_psych_predictor_enriched.joblib')\n", " joblib.dump(lgbm_predictor, lgbm_model_path)\n", " print(f\"Enriched LGBM 'Tactician' model saved to: {lgbm_model_path}\")\n", " \n", " print(\"\\n--- Feature Importance for Enriched Q&A Tactician ---\")\n", " # Note: Plotting importance with 390 
features can be messy. Let's show the top 20.\n", " lgb.plot_importance(lgbm_predictor, figsize=(10, 8), max_num_features=20, importance_type='gain', title='Top 20 Feature Importances for Q&A Success Predictor')\n", " plt.tight_layout()\n", " plt.show()\n", "else:\n", " print(\"Modeling data not available. Skipping LGBM training.\")\n", " lgbm_predictor = None" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "--- Running Showcase Pre-Flight Checks ---\n", "✅ SUCCESS: All pre-flight checks passed. Starting Final Showcase Simulation...\n", "\n", "\n", "--- SCENARIO 1: Confident Student ---\n", "\n", "LGBM TACTICAL Check: Immediate success probability is 70.1%\n", " -> Tactical Decision: Student is likely to succeed. No intervention needed.\n", "BKT STRATEGIC Check: No BKT history or model available for student on 'BoltMonkey/psychology-question-answer'.\n", "-------------------------\n", "QUESTION: How does Freud's psychoanalytic theory explain human behaviour?...\n", "FINAL ACTION: Encouragement -> 'You've got this! Keep up the great work.'\n", "-------------------------\n", "\n", "\n", "--- SCENARIO 2: Struggling Student ---\n", "\n", "LGBM TACTICAL Check: Immediate success probability is 25.6%\n", " -> Tactical Decision: High struggle detected. 
Intervene with a direct hint.\n", "BKT STRATEGIC Check: Long-term mastery of 'BoltMonkey/psychology-question-answer' is 2.4%\n", "-------------------------\n", "QUESTION: How does Freud's psychoanalytic theory explain human behaviour?...\n", "FINAL ACTION: Proactive Hint -> Freud believed that human behaviour is driven by unconscious desires and conflicts that originate in childhood experiences....\n", "-------------------------\n" ] } ], "source": [ "# --- Showcase Pre-Flight Checks ---\n", "print(\"\\n--- Running Showcase Pre-Flight Checks ---\")\n", "showcase_ready = True\n", "\n", "# Check 1: Is the LGBM Tactician model trained?\n", "if 'lgbm_predictor' not in locals() or lgbm_predictor is None:\n", " print(\"❌ ERROR: The LGBM 'Tactician' model ('lgbm_predictor') was not trained. Cannot run showcase.\")\n", " showcase_ready = False\n", "\n", "# Check 2: Are the BKT Strategist models trained? (Non-fatal: the showcase can run without them.)\n", "if 'bkt_models' not in locals() or not bkt_models:\n", " print(\"⚠️ WARNING: The BKT 'Strategist' models ('bkt_models') were not trained. Showcase will run without strategic insights.\")\n", " # We can proceed without BKT, but we'll disable the check.\n", " # Set bkt_models to an empty dict to prevent errors later.\n", " bkt_models = {}\n", "\n", "# Check 3: Do we have the source dataframe for question lookups?\n", "# df_psych is set to None when the data file is missing, so test for None before calling .empty.\n", "if 'df_psych' not in locals() or df_psych is None or df_psych.empty:\n", " print(\"❌ ERROR: The main psychology dataframe ('df_psych') is not available. Cannot look up question text.\")\n", " showcase_ready = False\n", "\n", "# --- Run Showcase if All Checks Pass ---\n", "if showcase_ready:\n", " print(\"✅ SUCCESS: All pre-flight checks passed. 
Starting Final Showcase Simulation...\")\n", "\n", " # Get the list of embedding column names\n", " embedding_cols = [f'embed_{i}' for i in range(384)]\n", "\n", " def run_psychology_tutor(student_state_features, student_bkt_data, skill_for_bkt_check):\n", " assert len(student_state_features.columns) == 390, f\"Showcase input has {len(student_state_features.columns)} features, but model expects 390.\"\n", " \n", " prob_success = lgbm_predictor.predict_proba(student_state_features)[:, 1][0]\n", " print(f\"\\nLGBM TACTICAL Check: Immediate success probability is {prob_success:.1%}\")\n", " \n", " hint_level = -1\n", " if prob_success < 0.45:\n", " print(\" -> Tactical Decision: High struggle detected. Intervene with a direct hint.\")\n", " hint_level = 1\n", " elif prob_success < 0.65:\n", " print(\" -> Tactical Decision: Mild struggle detected. Offer a hint passively.\")\n", " hint_level = 0\n", " else:\n", " print(\" -> Tactical Decision: Student is likely to succeed. No intervention needed.\")\n", "\n", " # Check if BKT models exist and if the specific skill has a model\n", " if bkt_models and skill_for_bkt_check in bkt_models and student_bkt_data is not None and not student_bkt_data.empty:\n", " bkt_model_for_skill = bkt_models[skill_for_bkt_check]\n", " mastery_prob = bkt_model_for_skill.predict(data=student_bkt_data)['state_predictions'].iloc[-1]\n", " print(f\"BKT STRATEGIC Check: Long-term mastery of '{skill_for_bkt_check}' is {mastery_prob:.1%}\")\n", " else:\n", " print(f\"BKT STRATEGIC Check: No BKT history or model available for student on '{skill_for_bkt_check}'.\")\n", "\n", " question_index = int(student_state_features.index[0])\n", " question = df_psych.loc[question_index, 'question']\n", " answer_as_hint = df_psych.loc[question_index, 'answer']\n", " \n", " print(\"-\" * 25)\n", " print(f\"QUESTION: {question[:150]}...\")\n", " if hint_level == 1: print(f\"FINAL ACTION: Proactive Hint -> {answer_as_hint[:150]}...\")\n", " elif hint_level == 0: 
print(\"FINAL ACTION: Passive Offer -> 'Looks a bit tricky. Let me know if you need a hint!'\")\n", " else: print(\"FINAL ACTION: Encouragement -> 'You've got this! Keep up the great work.'\")\n", " print(\"-\" * 25)\n", "\n", " # --- SCENARIO 1: Confident Student ---\n", " print(\"\\n\\n--- SCENARIO 1: Confident Student ---\")\n", " question_index_1 = df_psych.index[0]\n", " skill_for_scenario_1 = df_psych.loc[question_index_1, 'source']\n", " \n", " base_confident_features = pd.DataFrame({'prior_response_time': [20.0], 'prior_is_correct': [1.0], 'skill_id_encoded': [skill_encoder.transform([skill_for_scenario_1])[0]], 'skill_attempts': [10], 'skill_correct_rate': [0.9], 'question_length': [len(df_psych.loc[question_index_1, 'question'])]})\n", " question_embeddings_1 = df_psych.loc[[question_index_1], embedding_cols].reset_index(drop=True)\n", " confident_features = pd.concat([base_confident_features, question_embeddings_1], axis=1)\n", " confident_features.index = [question_index_1]\n", " \n", " run_psychology_tutor(confident_features, None, skill_for_scenario_1)\n", "\n", " # --- SCENARIO 2: Struggling Student ---\n", " print(\"\\n\\n--- SCENARIO 2: Struggling Student ---\")\n", " \n", " # Use the first skill for which we trained a BKT model, if any exist\n", " if bkt_models:\n", " SKILL_TO_DEMO_BKT = list(bkt_models.keys())[0]\n", " \n", " question_index_2 = df_psych[df_psych['source'] == SKILL_TO_DEMO_BKT].index[0]\n", " base_struggling_features = pd.DataFrame({'prior_response_time': [95.0], 'prior_is_correct': [0.0], 'skill_id_encoded': [skill_encoder.transform([SKILL_TO_DEMO_BKT])[0]], 'skill_attempts': [5], 'skill_correct_rate': [0.2], 'question_length': [len(df_psych.loc[question_index_2, 'question'])]})\n", " question_embeddings_2 = df_psych.loc[[question_index_2], embedding_cols].reset_index(drop=True)\n", " struggling_features = pd.concat([base_struggling_features, question_embeddings_2], axis=1)\n", " struggling_features.index = [question_index_2]\n", 
" \n", " student_bkt_history = val_df[val_df['source'] == SKILL_TO_DEMO_BKT].rename(columns={'student_id': 'user_id', 'source': 'skill_name', 'is_correct': 'correct'})\n", " \n", " if not student_bkt_history.empty:\n", " run_psychology_tutor(struggling_features, student_bkt_history, SKILL_TO_DEMO_BKT)\n", " else:\n", " print(f\"Could not find a student with history for '{SKILL_TO_DEMO_BKT}' in the validation set to run showcase.\")\n", " else:\n", " print(\"\\nSkipping BKT part of Scenario 2 because no BKT models were trained.\")\n", "\n", "else:\n", " print(\"\\nShowcase aborted due to failed pre-flight checks. Please review error messages above.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 4 }