adfras committed on
Commit 41d6203 · 1 Parent(s): f6cf681

Final configuration for CPU Docker build

Files changed (3)
  1. Dockerfile +10 -7
  2. README.md +2 -6
  3. repomix-output.xml +363 -194
Dockerfile CHANGED
@@ -1,12 +1,15 @@
- # Use an official Hugging Face base image that includes Python
- FROM huggingface/transformers-pytorch-gpu:1.13.1-gpu-py311-cu121-ubuntu22.04
 
- # Copy all your project files into the container's working directory
  WORKDIR /app
- COPY . /app/
 
- # Install all the packages from your requirements file
- RUN pip install -r requirements.txt
 
- # This is the command that will be executed when the Space starts
  CMD ["python", "finetune_distractor_model.py"]

+ # Use the official Python 3.11 CPU image
+ FROM python:3.11-slim
 
+ # Set the working directory
  WORKDIR /app
 
+ # Copy requirements and install them
+ COPY requirements.txt /app/
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy the rest of your code
+ COPY . /app/
 
+ # Set the command to run your script
  CMD ["python", "finetune_distractor_model.py"]
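One thing worth checking with the switch to `python:3.11-slim`: the `requirements.txt` in this repository contains a `git+https://` entry (pyBKT), and pip needs a `git` executable to install such entries. Slim images typically do not ship `git`, so the build may fail unless it is installed first. The sketch below is a hypothetical pre-build check, not part of this commit; the helper name `find_vcs_requirements` is an assumption made for illustration.

```python
from typing import Iterable, List

# URL scheme prefixes pip uses for version-control requirements.
VCS_PREFIXES = ("git+", "hg+", "svn+", "bzr+")


def find_vcs_requirements(lines: Iterable[str]) -> List[str]:
    """Return requirement lines that pip would install through a VCS client."""
    found = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Covers both "git+https://..." and "name @ git+https://..." forms.
        if line.startswith(VCS_PREFIXES) or any(f"@ {p}" in line for p in VCS_PREFIXES):
            found.append(line)
    return found


# A few lines taken from the requirements.txt shown in this diff.
sample = [
    "# --- From proactive-tutor-engine (EXACT, PROVEN VERSIONS) ---",
    "git+https://github.com/CAHLR/pyBKT.git#egg=pyBKT",
    "numpy==1.26.4",
    "torch",
]
vcs_lines = find_vcs_requirements(sample)
print(vcs_lines)
```

If the list is non-empty, the image probably needs `git` installed before the `pip install` step, or the dependency replaced with a released package.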
README.md CHANGED
@@ -1,11 +1,7 @@
  ---
- title: T5 Distractor Model Fine-Tuning (CPU)
- emoji: 🤖
- colorFrom: red
- colorTo: yellow # <-- THIS IS THE FIX
  sdk: docker
- app_file: Dockerfile
- pinned: false
  ---
 
  # T5 Distractor Model - Fine-Tuning Job

  ---
+ title: psychology-tutor-engine
  sdk: docker
+ hardware: cpu-basic
  ---
 
  # T5 Distractor Model - Fine-Tuning Job
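To see which metadata keys the README keeps after this change, here is a minimal sketch of reading the front matter block. It uses a hand-rolled `key: value` parser rather than a YAML library, so it only handles flat keys like the ones in this diff. `parse_front_matter` is a hypothetical helper, and the sketch does not check whether the Hub honors a given key (for example `hardware`).

```python
from typing import Dict


def parse_front_matter(text: str) -> Dict[str, str]:
    """Parse flat `key: value` pairs between the first two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the front matter block
        # Drop trailing comments such as "# <-- THIS IS THE FIX".
        content = line.split(" #", 1)[0].strip()
        if ":" not in content:
            continue
        key, value = content.split(":", 1)
        meta[key.strip()] = value.strip()
    return meta


new_readme = """---
title: psychology-tutor-engine
sdk: docker
hardware: cpu-basic
---

# T5 Distractor Model - Fine-Tuning Job
"""
print(parse_front_matter(new_readme))
```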
repomix-output.xml CHANGED
@@ -40,22 +40,54 @@ The content is organized as follows:
40
  </file_summary>
41
 
42
  <directory_structure>
 
43
  compute_embeddings.py
44
- data/annotation/questions_for_annotation.csv
45
  feature_engineering.py
 
46
  generate_distractor_training_set.py
47
  investigate_data.py
48
  main_psychology_tutor_pipeline.ipynb
49
  normalize_psych_data.py
 
50
  requirements.txt
51
  test_data_quality.py
52
  test_model_performance.py
53
  tests/create_golden_set.py
 
54
  </directory_structure>
55
 
56
  <files>
57
  This section contains the contents of the repository's files.
58
 
59
  <file path="compute_embeddings.py">
60
  # compute_embeddings.py
61
  import pandas as pd
@@ -99,60 +131,6 @@ if __name__ == "__main__":
99
  print(f"New dataframe shape: {df_with_embeddings.shape}")
100
  </file>
101
 
102
- <file path="data/annotation/questions_for_annotation.csv">
103
- question,answer,distractor_1,distractor_2,distractor_3
104
- What is the sociometer theory?,The sociometer theory posits that self-esteem serves as a gauge or barometer of an individual's social acceptance and status.,,,
105
- Who is the founder of social cognitive theory?,"Albert Bandura is the founder of social cognitive theory, which emphasizes the importance of social learning and cognitive processes on human behavior.",,,
106
- What does existential psychology focus on?,Existential psychology focuses on how individuals create meaning and purpose in their lives.,,,
107
- How can theory evaluation help in enhancing the reliability of psychological assessments?,"Theory evaluation can help enhance the reliability of psychological assessments by examining the consistency and stability of the theoretical framework underlying the assessment. By scrutinizing the conceptual basis, the assumptions, and the measurement techniques, psychologists can identify potential sources of measurement error or lack of precision. Theory evaluation aids in identifying areas for improvement, leading to adjustments that can increase the reliability and consistency of the assessment results.",,,
108
- What does the ecological systems theory suggest?,"The ecological systems theory suggests that a person's development is shaped by the interactions between different environmental systems. It emphasizes the importance of considering both immediate and broader social contexts, as well as the role of time in influencing development.",,,
109
- What is the theory of self-efficacy?,Self-efficacy theory is a social cognitive theory that proposes that people's beliefs about their ability to perform a task influence their behaviour and motivation.,,,
110
- What is the role of medication in the treatment of psychological disorders?,"Medication plays a crucial role in the treatment of psychological disorders, especially when combined with other therapeutic interventions. Psychotropic medications, such as antidepressants or antipsychotics, can help alleviate symptoms by targeting specific neurotransmitter imbalances. Medication can help manage severe symptoms, stabilize mood, reduce anxiety, or improve cognitive functioning. However, it's important to note that medication should be prescribed and monitored by a qualified healthcare professional, and it is not suitable for all individuals or disorders.",,,
111
- Who is considered the founder of psychology?,Wilhelm Wundt,,,
112
- What is the significance of the biopsychosocial model in health psychology?,"The biopsychosocial model is significant in health psychology as it recognises that physical health and wellbeing are influenced by biological, psychological, and social factors. This model is used to guide research, clinical practice, and healthcare policy.",,,
113
- How are personality disorders characterized?,"Personality disorders are characterized by enduring patterns of inner experience and behaviour that deviate from cultural norms, cause significant distress or impairment, and are stable across time and situations.",,,
114
- What is the bystander effect?,The bystander effect is the phenomenon in which individuals are less likely to intervene in an emergency situation when others are present.,,,
115
- How can phenylketonuria be diagnosed in newborns?,"Phenylketonuria is usually not diagnosed until three weeks after the infant's birth, and if it is detected within that period, a special diet can control phenylalanine levels and increase the infant's chances of survival and health.",,,
116
- What is the role of the unconscious mind in neurosis?,"Traditional theories of neurosis, such as Freud's, suggest that unconscious conflicts, desires, or unresolved experiences can manifest in the form of anxiety or other symptoms.",,,
117
- What is the role of theory in psychological research?,"Theory in psychology provides a framework for organizing and explaining phenomena of interest. It guides researchers in generating hypotheses, selecting appropriate research methods, and interpreting findings. Theories provide a systematic way of understanding and predicting human behavior and mental processes, by proposing causal mechanisms and relationships between variables. They also serve as a basis for constructing models and designing interventions. However, theories in psychology are constantly evolving and subject to empirical scrutiny. The development and refinement of theories through research contribute to the advancement of knowledge in the field.",,,
118
- What is the purpose of developing a hypothesis in a research study?,"The purpose of developing a hypothesis in a research study is to clearly state a specific research question or objective, and to provide a tentative answer or prediction. It helps to narrow down the focus and guide the entire research process, including data collection and analysis.",,,
119
- What does the social learning theory suggest?,Learning occurs through observing others and modeling their behavior,,,
120
- What is the humanistic perspective in psychology?,"The humanistic perspective emphasises the importance of individual growth, choice, and self-determination.",,,
121
- Please list the three main approaches to understanding individual differences.,"The three main approaches are trait, psychodynamic and humanistic approaches.",,,
122
- What is the behavioral perspective on sexuality?,"The behavioral perspective on sexuality focuses on observable behaviors, learning principles, and the influence of external stimuli on sexual development and behaviors. This perspective emphasizes the role of conditioning, reinforcement, and social learning in shaping individuals' sexual preferences and behaviors. For example, it suggests that sexual orientation can be influenced by the association of sexual stimuli and pleasurable experiences. The behavioral perspective also explores the impact of societal norms, cultural influences, and media representations on the acquisition and expression of sexual behaviors.",,,
123
- What is the theory of positive psychology?,"Positive psychology is a perspective that aims to study and promote positive aspects of human experience, such as happiness, well-being, and flourishing. Please discuss its limitations and implications.",,,
124
- "According to the behavioral perspective, what influences behavior?",Behavior is influenced by external stimuli and environmental factors according to the behavioral perspective.,,,
125
- What is the psychodynamic perspective on grief and loss?,"The psychodynamic perspective suggests that grief reactions can be related to unresolved issues from earlier stages of development, and that grief presents as a period of regression and re-experiencing earlier conflicts.",,,
126
- Which theory explains schizotypal personality disorder?,"Johnstone's theory suggests that people with schizotypal personality disorder have a vulnerability to schizophrenia due to cognitive deficits in perception, attention, and memory.",,,
127
- "According to the social exchange theory, what factors influence interpersonal relationships?","According to the social exchange theory, interpersonal relationships are influenced by two primary factors: rewards and costs. Rewards are positive outcomes, such as companionship, support, or material benefits, that individuals gain from a relationship. Costs refer to negative aspects, such as emotional stress, time investment, or compromises, associated with maintaining the relationship. The theory suggests that individuals make rational calculations to assess the balance between rewards and costs, seeking relationships where rewards outweigh costs. Understanding this theory helps explain how individuals evaluate and prioritize their relationships.",,,
128
- What is the cognitive-behavioral approach to therapy?,The cognitive-behavioral approach to therapy is a type of therapy that focuses on changing patterns of thinking and behavior to improve mental health.,,,
129
- What does the theory of cognitive dissonance suggest?,"The theory of cognitive dissonance suggests that individuals strive to maintain internal consistency and that they experience psychological discomfort when their beliefs or behaviors contradict each other. They are motivated to reduce this discomfort by changing their beliefs, attitudes, or behaviors.",,,
130
- What does Attachment theory suggest?,"Attachment theory suggests that early relationships, particularly those with primary caregivers, shape an individual's attachment style and influence their later relationships and interactions. It emphasizes the importance of a secure and nurturing attachment in promoting healthy emotional development.",,,
131
- What is the therapeutic alliance in counselling psychology?,"The therapeutic alliance refers to the collaborative and trusting relationship between the therapist and the client in counselling psychology. It involves mutual respect, rapport, and a shared understanding of the therapeutic goals. A strong therapeutic alliance is essential for successful therapy outcomes, as it facilitates open communication, engagement in the therapeutic process, and a sense of safety for the client. It provides a foundation for the client to explore their thoughts, feelings, and experiences without judgment. Counselling psychologists build the therapeutic alliance by demonstrating empathy, active listening, and genuine concern for the client's well-being.",,,
132
- How do cultural psychologists view the role of culture in behaviour?,As a fundamental determinant of human behaviour.,,,
133
- What role does B.F. Skinner's operant conditioning play in behaviour?,"B.F. Skinner developed the concept of operant conditioning, arguing that an individual's behaviour is determined by its consequences, either rewards or punishments. This theory is integral in explaining how behaviours are learned and maintained over time and has been influential in fields such as education and sports psychology.",,,
134
- Explain why qualitative research is used in psychology.,"Qualitative research is used in psychology because it allows researchers to delve into the complexity and richness of human behavior. It helps in understanding subjective experiences, emotions, and motivations that cannot be easily quantified. Qualitative research is particularly valuable for studying topics such as individual perceptions, cultural differences, and social processes.",,,
135
- What is the social impact of mental disorders?,"Mental disorders can affect individuals' relationships, employment, education, and overall social functioning, leading to stigma, isolation, and discrimination.",,,
136
- "Using the cognitive theory, please describe the role of attention in perception.",Cognitive theory suggests that attention is selective and that individuals focus on certain aspects of a stimulus while ignoring others.,,,
137
- What is the cognitive-behavioral theory of anxiety disorders?,The cognitive-behavioral theory proposes that anxiety disorders arise from maladaptive thoughts and behaviors. It suggests that individuals with anxiety disorders have a tendency to interpret situations as more threatening than they actually are and engage in safety behaviors that reinforce their anxiety. This theory also emphasizes the role of learning in the development and maintenance of anxiety disorders.,,,
138
- "According to Maslow, what is the hierarchy of needs?",A sequence of needs that motivate human behaviour.,,,
139
- What does the Social Identity theory suggest?,The Social Identity theory suggests individuals derive their identity from their group memberships.,,,
140
- What separates classical conditioning from operant conditioning?,"Classical conditioning involves associating two stimuli to produce a response, while operant conditioning involves learning through consequences and rewards.",,,
141
- Who conducted the famous obedience study known as the Milgram experiment?,The famous obedience study known as the Milgram experiment was conducted by Stanley Milgram in the early 1960s.,,,
142
- How is a hypothesis developed in psychological research?,"A hypothesis in psychological research is developed through a logical and systematic process. It begins with a researcher identifying a research question, followed by a review of existing literature. Based on this review, the researcher formulates a hypothesis, which is a specific statement that predicts the relationship between variables. The hypothesis should be testable and based on sound theoretical and empirical evidence.",,,
143
- What is the significance of the concept of memory?,"Memory plays a crucial role in our daily lives and an understanding of it can help to explain various psychological phenomena, including forgetting, and amnesia. Its significance is that a better understanding of memory can aid learning and recall.",,,
144
- What is the cognitive theory of language development?,"It suggests that language is acquired through mental processes such as attention, memory, and problem-solving.",,,
145
- What is the cognitive theory in psychology?,"Cognitive theory is a psychological approach that emphasizes how people think, perceive, and process information. It takes into account mental processes such as memory, attention, and problem-solving. Cognitive theorists, such as Jean Piaget and Albert Bandura, believe that understanding cognitive processes is crucial for understanding behavior.",,,
146
- What is the cognitive-developmental theory of adolescent development?,"Jean Piaget's cognitive-developmental theory suggests that adolescents transition from concrete operational thinking to formal operational thinking, which enables more abstract and hypothetical thinking.",,,
147
- How does artificial intelligence psychology contribute to the development of intelligent systems?,Artificial intelligence psychology contributes to the development of intelligent systems by studying human cognition and applying it to the design and programming of machines.,,,
148
- What is a correlation in research?,"A correlation in research refers to a statistical relationship between two or more variables, indicating how changes in one variable are related to changes in another variable. Correlations allow researchers to examine the strength and direction of associations between variables, which can help identify patterns, make predictions, and generate hypotheses for further investigation.",,,
149
- What is the evolutionary psychology perspective?,The evolutionary psychology perspective explores how human behavior and mental processes have evolved over time through natural selection and adaptive functions.,,,
150
- What is the psychoanalytic theory of personality?,The psychoanalytic theory proposes that personality is structured by the interplay of unconscious and conscious processes. It also emphasises the role of childhood experiences.,,,
151
- What theory describes the development of gender roles and stereotypes?,"Bem's gender schema theory suggests that gender roles and stereotypes are acquired by children via societal norms. Conversely, Bandura's social learning theory also addresses the development of gender roles, emphasizing the role of observational learning. Both theories give valuable insights into the influence of society on gender development, but from different angles.",,,
152
- What is the psychosocial theory of aging?,"The psychosocial theory of aging emphasizes the influence of social factors, relationships, and individual perceptions on the aging process.",,,
153
- What is a hypothesis?,A hypothesis is a tentative statement predicting the outcome of a study.,,,
154
- </file>
155
-
156
  <file path="feature_engineering.py">
157
  # feature_engineering.py
158
 
@@ -1097,145 +1075,6 @@ if __name__ == "__main__":
1097
  print("\nNo data was processed successfully. Check logs for errors.")
1098
  </file>
1099
 
1100
- <file path="requirements.txt">
1101
- # requirements.txt
1102
- # This is the single source of truth for all project dependencies.
1103
- # Use this file with Python 3.11 for guaranteed compatibility.
1104
-
1105
- # --- Testing Framework ---
1106
- pytest
1107
- pytest-cov
1108
-
1109
- # --- From psychology-data-pipeline ---
1110
- # (These are all included in the tutor-engine list below)
1111
- requests
1112
- datasets
1113
- py7zr
1114
- lxml
1115
- tqdm
1116
-
1117
- # --- From proactive-tutor-engine (EXACT, PROVEN VERSIONS) ---
1118
- git+https://github.com/CAHLR/pyBKT.git#egg=pyBKT
1119
- numpy==1.26.4
1120
- pandas==2.2.2
1121
- scipy==1.13.0
1122
- scikit-learn==1.4.2
1123
- lightgbm==4.3.0
1124
- sentence-transformers
1125
- torch
1126
- joblib
1127
- transformers
1128
- safetensors
1129
- tokenizers
1130
- matplotlib
1131
- seaborn
1132
- jupyterlab
1133
- ipykernel
1134
- jupyter-events
1135
- jupyter-lsp
1136
- jupyter_client
1137
- jupyter_core
1138
- jupyter_server
1139
- jupyter_server_terminals
1140
- notebook_shim
1141
- nbclient
1142
- nbconvert
1143
- nbformat
1144
- anyio
1145
- argon2-cffi
1146
- argon2-cffi-bindings
1147
- arrow
1148
- asttokens
1149
- async-lru
1150
- attrs
1151
- babel
1152
- beautifulsoup4
1153
- bleach
1154
- certifi
1155
- cffi
1156
- charset-normalizer
1157
- colorama
1158
- comm
1159
- contourpy
1160
- cycler
1161
- debugpy
1162
- decorator
1163
- defusedxml
1164
- executing
1165
- fastjsonschema
1166
- filelock
1167
- fonttools
1168
- fqdn
1169
- fsspec
1170
- gdown
1171
- h11
1172
- httpcore
1173
- httpx
1174
- huggingface-hub
1175
- idna
1176
- ipython
1177
- ipython_pygments_lexers
1178
- isoduration
1179
- jedi
1180
- Jinja2
1181
- json5
1182
- jsonpointer
1183
- jsonschema
1184
- jsonschema-specifications
1185
- jupyterlab_pygments
1186
- jupyterlab_server
1187
- kiwisolver
1188
- MarkupSafe
1189
- mistune
1190
- mpmath
1191
- nest-asyncio
1192
- networkx
1193
- overrides
1194
- packaging
1195
- pandocfilters
1196
- parso
1197
- pillow
1198
- platformdirs
1199
- prometheus_client
1200
- prompt_toolkit
1201
- psutil
1202
- pure_eval
1203
- pycparser
1204
- Pygments
1205
- pyparsing
1206
- PySocks
1207
- python-dateutil
1208
- python-json-logger
1209
- pytz
1210
- pywin32; sys_platform == 'win32'
1211
- pywinpty; sys_platform == 'win32'
1212
- PyYAML
1213
- pyzmq==25.1.2
1214
- referencing
1215
- regex
1216
- rfc3339-validator
1217
- rfc3986-validator
1218
- rpds-py==0.18.1
1219
- Send2Trash
1220
- six
1221
- sniffio
1222
- soupsieve
1223
- stack-data
1224
- sympy
1225
- terminado
1226
- threadpoolctl
1227
- tinycss2
1228
- tornado
1229
- typing_extensions
1230
- tzdata
1231
- uri-template
1232
- urllib3
1233
- wcwidth
1234
- webcolors
1235
- webencodings
1236
- websocket-client
1237
- </file>
1238
-
1239
  <file path="test_data_quality.py">
1240
  # test_data_quality.py
1241
  import pytest
@@ -1418,4 +1257,334 @@ if __name__ == "__main__":
1418
  print(f"\nAn error occurred: {e}")
1419
  </file>
1420
 
1421
  </files>
 
40
  </file_summary>
41
 
42
  <directory_structure>
43
+ .gitignore
44
  compute_embeddings.py
45
+ Dockerfile
46
  feature_engineering.py
47
+ finetune_distractor_model.py
48
  generate_distractor_training_set.py
49
  investigate_data.py
50
  main_psychology_tutor_pipeline.ipynb
51
  normalize_psych_data.py
52
+ README.md
53
  requirements.txt
54
  test_data_quality.py
55
  test_model_performance.py
56
  tests/create_golden_set.py
57
+ verify_training_data.py
58
  </directory_structure>
59
 
60
  <files>
61
  This section contains the contents of the repository's files.
62
 
63
+ <file path=".gitignore">
64
+ # --- Virtual Environment ---
65
+ # Never commit the virtual environment folder
66
+ .venv/
67
+ venv/
68
+ env/
69
+
70
+ # --- Data and Model Files ---
71
+ # Ignore these directories and everything inside them
72
+ data/
73
+ models/
74
+ tests/golden_test_set.parquet
75
+
76
+ # --- Python & Jupyter Cache ---
77
+ # Standard Python cache files
78
+ __pycache__/
79
+ *.pyc
80
+ *.pyo
81
+ *.pyd
82
+
83
+ # Jupyter Notebook checkpoints
84
+ .ipynb_checkpoints
85
+
86
+ # --- IDE/Editor specific files ---
87
+ .vscode/
88
+ .idea/
89
+ </file>
90
+
91
  <file path="compute_embeddings.py">
92
  # compute_embeddings.py
93
  import pandas as pd
 
131
  print(f"New dataframe shape: {df_with_embeddings.shape}")
132
  </file>
133
 
134
  <file path="feature_engineering.py">
135
  # feature_engineering.py
136
 
 
1075
  print("\nNo data was processed successfully. Check logs for errors.")
1076
  </file>
1077
 
1078
  <file path="test_data_quality.py">
1079
  # test_data_quality.py
1080
  import pytest
 
1257
  print(f"\nAn error occurred: {e}")
1258
  </file>
1259
 
1260
+ <file path="verify_training_data.py">
+ import pandas as pd
+ import os
+ import argparse
+
+ # --- Configuration ---
+ TRAINING_DATA_FILE = "data/training_sets/distractor_generation_training_data.parquet"
+
+ def verify_data(file_path: str, num_samples: int):
+     """
+     Loads the generated training data and prints a random sample
+     for human verification.
+     """
+     print("--- Starting Training Data Verification ---")
+
+     # 1. Validate file exists
+     if not os.path.exists(file_path):
+         print(f"\n❌ FATAL: Training data file not found at '{file_path}'.")
+         print("Please run generate_distractor_training_set.py first.")
+         return
+
+     print(f"Loading data from '{file_path}'...")
+     try:
+         df = pd.read_parquet(file_path)
+     except Exception as e:
+         print(f"\n❌ FATAL: Could not read parquet file. Error: {e}")
+         return
+
+     # 2. Take a random sample for review
+     if num_samples > len(df):
+         print(f"Warning: Requested {num_samples} samples, but dataset only has {len(df)}. Showing all.")
+         num_samples = len(df)
+
+     sample_df = df.sample(n=num_samples, random_state=42)
+
+     print(f"\nDisplaying {num_samples} random examples for your review:")
+     print("-" * 80)
+
+     # 3. Print samples in a readable format
+     # iterrows() yields the original index labels, so use enumerate() for a 1-based counter
+     for i, (_, row) in enumerate(sample_df.iterrows(), start=1):
+         print(f"\n--- Example {i}/{num_samples} ---")
+         print("\n[QUESTION]:")
+         print(f" {row['question']}")
+         print("\n [CORRECT ANSWER]:")
+         print(f" {row['correct_answer']}")
+         print("\n [GENERATED DISTRACTOR (is this a good distractor?)]:")
+         print(f" {row['distractor']}")
+         print("-" * 80)
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(
+         description="Spot-check the quality of the auto-generated distractor training data.",
+         formatter_class=argparse.ArgumentDefaultsHelpFormatter
+     )
+
+     parser.add_argument(
+         "--file",
+         default=TRAINING_DATA_FILE,
+         help="Path to the training data .parquet file."
+     )
+     parser.add_argument(
+         "-n", "--num_samples",
+         type=int,
+         default=5,
+         help="The number of random samples to display for verification."
+     )
+
+     args = parser.parse_args()
+
+     verify_data(file_path=args.file, num_samples=args.num_samples)
1330
+ </file>
1331
+
1332
+ <file path="Dockerfile">
1333
+ # Use the official Python 3.11 CPU image
1334
+ FROM python:3.11-slim
1335
+
1336
+ # Set the working directory
1337
+ WORKDIR /app
1338
+
1339
+ # Copy requirements and install them
1340
+ COPY requirements.txt /app/
1341
+ RUN pip install --no-cache-dir -r requirements.txt
1342
+
1343
+ # Copy the rest of your code
1344
+ COPY . /app/
1345
+
1346
+ # Set the command to run your script
1347
+ CMD ["python", "finetune_distractor_model.py"]
1348
+ </file>
1349
+
1350
+ <file path="finetune_distractor_model.py">
+ from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments
+ from datasets import load_dataset  # Use this to load from the Hub
+
+ # --- Configuration ---
+ # !!! IMPORTANT: Replace this with YOUR Hugging Face username and dataset name !!!
+ TRAINING_DATA_REPO = "adfras/psychology-distractor-data"
+ BASE_MODEL = "t5-small"
+ OUTPUT_MODEL_DIR = "models/distractor_generator_t5_small"  # This will be the output dir inside the Space
+
+ # --- Main Logic ---
+ if __name__ == "__main__":
+     print(f"--- Fine-tuning Distractor Generation Model ({BASE_MODEL}) on HF Space ---")
+
+     # 1. Load the training data FROM THE HUB
+     print(f"Loading data from Hugging Face Hub: {TRAINING_DATA_REPO}...")
+     try:
+         # load_dataset can directly read the parquet file from your repo
+         hf_dataset = load_dataset(TRAINING_DATA_REPO, split="train")
+     except Exception:
+         print("FATAL: Could not load dataset from Hub. Make sure the repo name is correct and the dataset is public.")
+         raise
+
+     # Subsample the dataset itself (at most 10,000 rows) so the cap applies to what is tokenized and trained on
+     hf_dataset = hf_dataset.shuffle(seed=42).select(range(min(10000, len(hf_dataset))))
+     print(f"Loaded {len(hf_dataset)} examples for fine-tuning.")
+
+     # 2. Load Tokenizer and Model
+     print(f"Loading tokenizer and model for '{BASE_MODEL}'...")
+     tokenizer = T5Tokenizer.from_pretrained(BASE_MODEL)
+     model = T5ForConditionalGeneration.from_pretrained(BASE_MODEL)
+
+     # 3. Prepare the data for the T5 model
+     def preprocess_function(examples):
+         prefix = "generate distractor: "
+         inputs = [prefix + "question: " + q + " answer: " + a for q, a in zip(examples["question"], examples["correct_answer"])]
+
+         model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
+         labels = tokenizer(text_target=examples["distractor"], max_length=128, truncation=True, padding="max_length")
+
+         # Replace padding ids with -100 so padded positions are ignored by the loss
+         model_inputs["labels"] = [
+             [(tok if tok != tokenizer.pad_token_id else -100) for tok in seq]
+             for seq in labels["input_ids"]
+         ]
+         return model_inputs
+
+     # The data is already a Hugging Face Dataset, so we just map over it
+     print("Tokenizing dataset...")
+     tokenized_dataset = hf_dataset.map(preprocess_function, batched=True, remove_columns=hf_dataset.column_names)
+
+     # 4. Define Training Arguments
+     training_args = TrainingArguments(
+         output_dir=OUTPUT_MODEL_DIR,
+         num_train_epochs=3,
+         per_device_train_batch_size=8,
+         warmup_steps=500,
+         weight_decay=0.01,
+         logging_dir='./logs',
+         logging_strategy="steps",
+         logging_steps=100,
+         save_strategy="epoch",
+         save_total_limit=2,
+     )
+
+     # 5. Create the Trainer and Start Fine-Tuning
+     trainer = Trainer(
+         model=model,
+         args=training_args,
+         train_dataset=tokenized_dataset,
+     )
+
+     print("\nStarting fine-tuning process...")
+     trainer.train()
+
+     # 6. Save the final model
+     print("Fine-tuning complete. Saving final model...")
+     trainer.save_model(OUTPUT_MODEL_DIR)
+     tokenizer.save_pretrained(OUTPUT_MODEL_DIR)
+
+     print("\n--- SUCCESS ---")
+     print(f"Fine-tuned model saved to: '{OUTPUT_MODEL_DIR}'")
1430
+ </file>
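Because `preprocess_function` trains the model on inputs of the form `generate distractor: question: <question> answer: <answer>`, prompts used at inference time need to follow the same layout. Below is a minimal sketch of a shared helper for that; `build_distractor_prompt` is a hypothetical name and not part of the script in this commit.

```python
PREFIX = "generate distractor: "


def build_distractor_prompt(question: str, answer: str) -> str:
    """Format one example the same way preprocess_function formats training inputs."""
    return f"{PREFIX}question: {question} answer: {answer}"


prompt = build_distractor_prompt(
    "What is a hypothesis?",
    "A hypothesis is a tentative statement predicting the outcome of a study.",
)
print(prompt)
```

The resulting string would then be tokenized and passed to the fine-tuned model's `generate` method. Keeping the format in one function helps avoid drift between training and inference.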
1431
+
1432
+ <file path="requirements.txt">
1433
+ # requirements.txt
1434
+ # This is the single source of truth for all project dependencies.
1435
+ # Use this file with Python 3.11 for guaranteed compatibility.
1436
+
1437
+ # --- Testing Framework ---
1438
+ pytest
1439
+ pytest-cov
1440
+
1441
+ # --- From psychology-data-pipeline ---
1442
+ # (These are all included in the tutor-engine list below)
1443
+ requests
1444
+ datasets
1445
+ py7zr
1446
+ lxml
1447
+ tqdm
1448
+
1449
+ # --- From proactive-tutor-engine (EXACT, PROVEN VERSIONS) ---
1450
+ git+https://github.com/CAHLR/pyBKT.git#egg=pyBKT
1451
+ numpy==1.26.4
1452
+ pandas==2.2.2
1453
+ scipy==1.13.0
1454
+ scikit-learn==1.4.2
1455
+ lightgbm==4.3.0
1456
+ sentence-transformers
1457
+ torch
1458
+ sentencepiece
1459
+ accelerate>=0.26.0
1460
+ joblib
1461
+ transformers
1462
+ safetensors
1463
+ tokenizers
1464
+ matplotlib
1465
+ seaborn
1466
+ jupyterlab
1467
+ ipykernel
1468
+ jupyter-events
1469
+ jupyter-lsp
1470
+ jupyter_client
1471
+ jupyter_core
1472
+ jupyter_server
1473
+ jupyter_server_terminals
1474
+ notebook_shim
1475
+ nbclient
1476
+ nbconvert
1477
+ nbformat
1478
+ anyio
1479
+ argon2-cffi
1480
+ argon2-cffi-bindings
1481
+ arrow
1482
+ asttokens
1483
+ async-lru
1484
+ attrs
1485
+ babel
1486
+ beautifulsoup4
1487
+ bleach
1488
+ certifi
1489
+ cffi
1490
+ charset-normalizer
1491
+ colorama
1492
+ comm
1493
+ contourpy
1494
+ cycler
1495
+ debugpy
1496
+ decorator
1497
+ defusedxml
1498
+ executing
1499
+ fastjsonschema
1500
+ filelock
1501
+ fonttools
1502
+ fqdn
1503
+ fsspec
1504
+ gdown
1505
+ h11
1506
+ httpcore
1507
+ httpx
1508
+ huggingface-hub
1509
+ idna
1510
+ ipython
1511
+ ipython_pygments_lexers
1512
+ isoduration
1513
+ jedi
1514
+ Jinja2
1515
+ json5
1516
+ jsonpointer
1517
+ jsonschema
1518
+ jsonschema-specifications
1519
+ jupyterlab_pygments
1520
+ jupyterlab_server
1521
+ kiwisolver
1522
+ MarkupSafe
1523
+ mistune
1524
+ mpmath
1525
+ nest-asyncio
1526
+ networkx
1527
+ overrides
1528
+ packaging
1529
+ pandocfilters
1530
+ parso
1531
+ pillow
1532
+ platformdirs
1533
+ prometheus_client
1534
+ prompt_toolkit
1535
+ psutil
1536
+ pure_eval
1537
+ pycparser
1538
+ Pygments
1539
+ pyparsing
1540
+ PySocks
1541
+ python-dateutil
1542
+ python-json-logger
1543
+ pytz
1544
+ pywin32; sys_platform == 'win32'
1545
+ pywinpty; sys_platform == 'win32'
1546
+ PyYAML
1547
+ pyzmq==25.1.2
1548
+ referencing
1549
+ regex
1550
+ rfc3339-validator
1551
+ rfc3986-validator
1552
+ rpds-py==0.18.1
1553
+ Send2Trash
1554
+ six
1555
+ sniffio
1556
+ soupsieve
1557
+ stack-data
1558
+ sympy
1559
+ terminado
1560
+ threadpoolctl
1561
+ tinycss2
1562
+ tornado
1563
+ typing_extensions
1564
+ tzdata
1565
+ uri-template
1566
+ urllib3
1567
+ wcwidth
1568
+ webcolors
1569
+ webencodings
1570
+ websocket-client
1571
+ </file>
1572
+
1573
+ <file path="README.md">
1574
+ ---
1575
+ title: T5 Distractor Model Fine-Tuning (CPU)
1576
+ emoji: 🤖
1577
+ colorFrom: red
1578
+ colorTo: yellow # <-- THIS IS THE FIX
1579
+ sdk: docker
1580
+ app_file: Dockerfile
1581
+ pinned: false
1582
+ ---
1583
+
1584
+ # T5 Distractor Model - Fine-Tuning Job
1585
+
1586
+ This Space fine-tunes a `t5-small` model on the free CPU tier.
1587
+ The `finetune_distractor_model.py` script runs automatically when the Space starts. Monitor progress in the "Logs" tab.
1588
+ </file>
1589
+
1590
  </files>