Final configuration for CPU Docker build
- Dockerfile +10 -7
- README.md +2 -6
- repomix-output.xml +363 -194
Dockerfile
CHANGED
@@ -1,12 +1,15 @@
|
|
1 |
-
# Use
|
2 |
-
FROM
|
3 |
|
4 |
-
#
|
5 |
WORKDIR /app
|
6 |
-
COPY . /app/
|
7 |
|
8 |
-
#
|
9 |
-
|
10 |
|
11 |
-
#
|
12 |
CMD ["python", "finetune_distractor_model.py"]
|
|
|
1 |
+
# Use the official Python 3.11 CPU image
|
2 |
+
FROM python:3.11-slim
|
3 |
|
4 |
+
# Set the working directory
|
5 |
WORKDIR /app
|
|
|
6 |
|
7 |
+
# Copy requirements and install them
|
8 |
+
COPY requirements.txt /app/
|
9 |
+
RUN pip install --no-cache-dir -r requirements.txt
|
10 |
+
|
11 |
+
# Copy the rest of your code
|
12 |
+
COPY . /app/
|
13 |
|
14 |
+
# Set the command to run your script
|
15 |
CMD ["python", "finetune_distractor_model.py"]
|
README.md
CHANGED
@@ -1,11 +1,7 @@
|
|
1 |
---
|
2 |
-
title:
|
3 |
-
emoji: 🤖
|
4 |
-
colorFrom: red
|
5 |
-
colorTo: yellow # <-- THIS IS THE FIX
|
6 |
sdk: docker
|
7 |
-
|
8 |
-
pinned: false
|
9 |
---
|
10 |
|
11 |
# T5 Distractor Model - Fine-Tuning Job
|
|
|
1 |
---
|
2 |
+
title: psychology-tutor-engine
|
3 |
sdk: docker
|
4 |
+
hardware: cpu-basic
|
|
|
5 |
---
|
6 |
|
7 |
# T5 Distractor Model - Fine-Tuning Job
|
repomix-output.xml
CHANGED
@@ -40,22 +40,54 @@ The content is organized as follows:
|
|
40 |
</file_summary>
|
41 |
|
42 |
<directory_structure>
|
|
|
43 |
compute_embeddings.py
|
44 |
-
|
45 |
feature_engineering.py
|
|
|
46 |
generate_distractor_training_set.py
|
47 |
investigate_data.py
|
48 |
main_psychology_tutor_pipeline.ipynb
|
49 |
normalize_psych_data.py
|
|
|
50 |
requirements.txt
|
51 |
test_data_quality.py
|
52 |
test_model_performance.py
|
53 |
tests/create_golden_set.py
|
|
|
54 |
</directory_structure>
|
55 |
|
56 |
<files>
|
57 |
This section contains the contents of the repository's files.
|
58 |
|
59 |
<file path="compute_embeddings.py">
|
60 |
# compute_embeddings.py
|
61 |
import pandas as pd
|
@@ -99,60 +131,6 @@ if __name__ == "__main__":
|
|
99 |
print(f"New dataframe shape: {df_with_embeddings.shape}")
|
100 |
</file>
|
101 |
|
102 |
-
<file path="data/annotation/questions_for_annotation.csv">
|
103 |
-
question,answer,distractor_1,distractor_2,distractor_3
|
104 |
-
What is the sociometer theory?,The sociometer theory posits that self-esteem serves as a gauge or barometer of an individual's social acceptance and status.,,,
|
105 |
-
Who is the founder of social cognitive theory?,"Albert Bandura is the founder of social cognitive theory, which emphasizes the importance of social learning and cognitive processes on human behavior.",,,
|
106 |
-
What does existential psychology focus on?,Existential psychology focuses on how individuals create meaning and purpose in their lives.,,,
|
107 |
-
How can theory evaluation help in enhancing the reliability of psychological assessments?,"Theory evaluation can help enhance the reliability of psychological assessments by examining the consistency and stability of the theoretical framework underlying the assessment. By scrutinizing the conceptual basis, the assumptions, and the measurement techniques, psychologists can identify potential sources of measurement error or lack of precision. Theory evaluation aids in identifying areas for improvement, leading to adjustments that can increase the reliability and consistency of the assessment results.",,,
|
108 |
-
What does the ecological systems theory suggest?,"The ecological systems theory suggests that a person's development is shaped by the interactions between different environmental systems. It emphasizes the importance of considering both immediate and broader social contexts, as well as the role of time in influencing development.",,,
|
109 |
-
What is the theory of self-efficacy?,Self-efficacy theory is a social cognitive theory that proposes that people's beliefs about their ability to perform a task influence their behaviour and motivation.,,,
|
110 |
-
What is the role of medication in the treatment of psychological disorders?,"Medication plays a crucial role in the treatment of psychological disorders, especially when combined with other therapeutic interventions. Psychotropic medications, such as antidepressants or antipsychotics, can help alleviate symptoms by targeting specific neurotransmitter imbalances. Medication can help manage severe symptoms, stabilize mood, reduce anxiety, or improve cognitive functioning. However, it's important to note that medication should be prescribed and monitored by a qualified healthcare professional, and it is not suitable for all individuals or disorders.",,,
|
111 |
-
Who is considered the founder of psychology?,Wilhelm Wundt,,,
|
112 |
-
What is the significance of the biopsychosocial model in health psychology?,"The biopsychosocial model is significant in health psychology as it recognises that physical health and wellbeing are influenced by biological, psychological, and social factors. This model is used to guide research, clinical practice, and healthcare policy.",,,
|
113 |
-
How are personality disorders characterized?,"Personality disorders are characterized by enduring patterns of inner experience and behaviour that deviate from cultural norms, cause significant distress or impairment, and are stable across time and situations.",,,
|
114 |
-
What is the bystander effect?,The bystander effect is the phenomenon in which individuals are less likely to intervene in an emergency situation when others are present.,,,
|
115 |
-
How can phenylketonuria be diagnosed in newborns?,"Phenylketonuria is usually not diagnosed until three weeks after birth; if it is detected within that period, a special diet can control phenylalanine levels and improve the infant's chances of survival and health.",,,
|
116 |
-
What is the role of the unconscious mind in neurosis?,"Traditional theories of neurosis, such as Freud's, suggest that unconscious conflicts, desires, or unresolved experiences can manifest in the form of anxiety or other symptoms.",,,
|
117 |
-
What is the role of theory in psychological research?,"Theory in psychology provides a framework for organizing and explaining phenomena of interest. It guides researchers in generating hypotheses, selecting appropriate research methods, and interpreting findings. Theories provide a systematic way of understanding and predicting human behavior and mental processes, by proposing causal mechanisms and relationships between variables. They also serve as a basis for constructing models and designing interventions. However, theories in psychology are constantly evolving and subject to empirical scrutiny. The development and refinement of theories through research contribute to the advancement of knowledge in the field.",,,
|
118 |
-
What is the purpose of developing a hypothesis in a research study?,"The purpose of developing a hypothesis in a research study is to clearly state a specific research question or objective, and to provide a tentative answer or prediction. It helps to narrow down the focus and guide the entire research process, including data collection and analysis.",,,
|
119 |
-
What does the social learning theory suggest?,Learning occurs through observing others and modeling their behavior,,,
|
120 |
-
What is the humanistic perspective in psychology?,"The humanistic perspective emphasises the importance of individual growth, choice, and self-determination.",,,
|
121 |
-
Please list the three main approaches to understanding individual differences.,"The three main approaches are trait, psychodynamic and humanistic approaches.",,,
|
122 |
-
What is the behavioral perspective on sexuality?,"The behavioral perspective on sexuality focuses on observable behaviors, learning principles, and the influence of external stimuli on sexual development and behaviors. This perspective emphasizes the role of conditioning, reinforcement, and social learning in shaping individuals' sexual preferences and behaviors. For example, it suggests that sexual orientation can be influenced by the association of sexual stimuli and pleasurable experiences. The behavioral perspective also explores the impact of societal norms, cultural influences, and media representations on the acquisition and expression of sexual behaviors.",,,
|
123 |
-
What is the theory of positive psychology?,"Positive psychology is a perspective that aims to study and promote positive aspects of human experience, such as happiness, well-being, and flourishing. Please discuss its limitations and implications.",,,
|
124 |
-
"According to the behavioral perspective, what influences behavior?",Behavior is influenced by external stimuli and environmental factors according to the behavioral perspective.,,,
|
125 |
-
What is the psychodynamic perspective on grief and loss?,"The psychodynamic perspective suggests that grief reactions can be related to unresolved issues from earlier stages of development, and that grief presents as a period of regression and re-experiencing earlier conflicts.",,,
|
126 |
-
Which theory explains schizotypal personality disorder?,"Johnstone's theory suggests that people with schizotypal personality disorder have a vulnerability to schizophrenia due to cognitive deficits in perception, attention, and memory.",,,
|
127 |
-
"According to the social exchange theory, what factors influence interpersonal relationships?","According to the social exchange theory, interpersonal relationships are influenced by two primary factors: rewards and costs. Rewards are positive outcomes, such as companionship, support, or material benefits, that individuals gain from a relationship. Costs refer to negative aspects, such as emotional stress, time investment, or compromises, associated with maintaining the relationship. The theory suggests that individuals make rational calculations to assess the balance between rewards and costs, seeking relationships where rewards outweigh costs. Understanding this theory helps explain how individuals evaluate and prioritize their relationships.",,,
|
128 |
-
What is the cognitive-behavioral approach to therapy?,The cognitive-behavioral approach to therapy is a type of therapy that focuses on changing patterns of thinking and behavior to improve mental health.,,,
|
129 |
-
What does the theory of cognitive dissonance suggest?,"The theory of cognitive dissonance suggests that individuals strive to maintain internal consistency and that they experience psychological discomfort when their beliefs or behaviors contradict each other. They are motivated to reduce this discomfort by changing their beliefs, attitudes, or behaviors.",,,
|
130 |
-
What does Attachment theory suggest?,"Attachment theory suggests that early relationships, particularly those with primary caregivers, shape an individual's attachment style and influence their later relationships and interactions. It emphasizes the importance of a secure and nurturing attachment in promoting healthy emotional development.",,,
|
131 |
-
What is the therapeutic alliance in counselling psychology?,"The therapeutic alliance refers to the collaborative and trusting relationship between the therapist and the client in counselling psychology. It involves mutual respect, rapport, and a shared understanding of the therapeutic goals. A strong therapeutic alliance is essential for successful therapy outcomes, as it facilitates open communication, engagement in the therapeutic process, and a sense of safety for the client. It provides a foundation for the client to explore their thoughts, feelings, and experiences without judgment. Counselling psychologists build the therapeutic alliance by demonstrating empathy, active listening, and genuine concern for the client's well-being.",,,
|
132 |
-
How do cultural psychologists view the role of culture in behaviour?,As a fundamental determinant of human behaviour.,,,
|
133 |
-
What role does B.F. Skinner's operant conditioning play in behaviour?,"B.F. Skinner developed the concept of operant conditioning, arguing that an individual's behaviour is determined by its consequences, either rewards or punishments. This theory is integral in explaining how behaviours are learned and maintained over time and has been influential in fields such as education and sports psychology.",,,
|
134 |
-
Explain why qualitative research is used in psychology.,"Qualitative research is used in psychology because it allows researchers to delve into the complexity and richness of human behavior. It helps in understanding subjective experiences, emotions, and motivations that cannot be easily quantified. Qualitative research is particularly valuable for studying topics such as individual perceptions, cultural differences, and social processes.",,,
|
135 |
-
What is the social impact of mental disorders?,"Mental disorders can affect individuals' relationships, employment, education, and overall social functioning, leading to stigma, isolation, and discrimination.",,,
|
136 |
-
"Using the cognitive theory, please describe the role of attention in perception.",Cognitive theory suggests that attention is selective and that individuals focus on certain aspects of a stimulus while ignoring others.,,,
|
137 |
-
What is the cognitive-behavioral theory of anxiety disorders?,The cognitive-behavioral theory proposes that anxiety disorders arise from maladaptive thoughts and behaviors. It suggests that individuals with anxiety disorders have a tendency to interpret situations as more threatening than they actually are and engage in safety behaviors that reinforce their anxiety. This theory also emphasizes the role of learning in the development and maintenance of anxiety disorders.,,,
|
138 |
-
"According to Maslow, what is the hierarchy of needs?",A sequence of needs that motivate human behaviour.,,,
|
139 |
-
What does the Social Identity theory suggest?,The Social Identity theory suggests individuals derive their identity from their group memberships.,,,
|
140 |
-
What separates classical conditioning from operant conditioning?,"Classical conditioning involves associating two stimuli to produce a response, while operant conditioning involves learning through consequences and rewards.",,,
|
141 |
-
Who conducted the famous obedience study known as the Milgram experiment?,The famous obedience study known as the Milgram experiment was conducted by Stanley Milgram in the early 1960s.,,,
|
142 |
-
How is a hypothesis developed in psychological research?,"A hypothesis in psychological research is developed through a logical and systematic process. It begins with a researcher identifying a research question, followed by a review of existing literature. Based on this review, the researcher formulates a hypothesis, which is a specific statement that predicts the relationship between variables. The hypothesis should be testable and based on sound theoretical and empirical evidence.",,,
|
143 |
-
What is the significance of the concept of memory?,"Memory plays a crucial role in our daily lives and an understanding of it can help to explain various psychological phenomena, including forgetting, and amnesia. Its significance is that a better understanding of memory can aid learning and recall.",,,
|
144 |
-
What is the cognitive theory of language development?,"It suggests that language is acquired through mental processes such as attention, memory, and problem-solving.",,,
|
145 |
-
What is the cognitive theory in psychology?,"Cognitive theory is a psychological approach that emphasizes how people think, perceive, and process information. It takes into account mental processes such as memory, attention, and problem-solving. Cognitive theorists, such as Jean Piaget and Albert Bandura, believe that understanding cognitive processes is crucial for understanding behavior.",,,
|
146 |
-
What is the cognitive-developmental theory of adolescent development?,"Jean Piaget's cognitive-developmental theory suggests that adolescents transition from concrete operational thinking to formal operational thinking, which enables more abstract and hypothetical thinking.",,,
|
147 |
-
How does artificial intelligence psychology contribute to the development of intelligent systems?,Artificial intelligence psychology contributes to the development of intelligent systems by studying human cognition and applying it to the design and programming of machines.,,,
|
148 |
-
What is a correlation in research?,"A correlation in research refers to a statistical relationship between two or more variables, indicating how changes in one variable are related to changes in another variable. Correlations allow researchers to examine the strength and direction of associations between variables, which can help identify patterns, make predictions, and generate hypotheses for further investigation.",,,
|
149 |
-
What is the evolutionary psychology perspective?,The evolutionary psychology perspective explores how human behavior and mental processes have evolved over time through natural selection and adaptive functions.,,,
|
150 |
-
What is the psychoanalytic theory of personality?,The psychoanalytic theory proposes that personality is structured by the interplay of unconscious and conscious processes. It also emphasises the role of childhood experiences.,,,
|
151 |
-
What theory describes the development of gender roles and stereotypes?,"Bem's gender schema theory suggests that gender roles and stereotypes are acquired by children via societal norms. Conversely, Bandura's social learning theory also addresses the development of gender roles, emphasizing the role of observational learning. Both theories give valuable insights into the influence of society on gender development, but from different angles.",,,
|
152 |
-
What is the psychosocial theory of aging?,"The psychosocial theory of aging emphasizes the influence of social factors, relationships, and individual perceptions on the aging process.",,,
|
153 |
-
What is a hypothesis?,A hypothesis is a tentative statement predicting the outcome of a study.,,,
|
154 |
-
</file>
|
155 |
-
|
156 |
<file path="feature_engineering.py">
|
157 |
# feature_engineering.py
|
158 |
|
@@ -1097,145 +1075,6 @@ if __name__ == "__main__":
|
|
1097 |
print("\nNo data was processed successfully. Check logs for errors.")
|
1098 |
</file>
|
1099 |
|
1100 |
-
<file path="requirements.txt">
|
1101 |
-
# requirements.txt
|
1102 |
-
# This is the single source of truth for all project dependencies.
|
1103 |
-
# Use this file with Python 3.11 for guaranteed compatibility.
|
1104 |
-
|
1105 |
-
# --- Testing Framework ---
|
1106 |
-
pytest
|
1107 |
-
pytest-cov
|
1108 |
-
|
1109 |
-
# --- From psychology-data-pipeline ---
|
1110 |
-
# (These are all included in the tutor-engine list below)
|
1111 |
-
requests
|
1112 |
-
datasets
|
1113 |
-
py7zr
|
1114 |
-
lxml
|
1115 |
-
tqdm
|
1116 |
-
|
1117 |
-
# --- From proactive-tutor-engine (EXACT, PROVEN VERSIONS) ---
|
1118 |
-
git+https://github.com/CAHLR/pyBKT.git#egg=pyBKT
|
1119 |
-
numpy==1.26.4
|
1120 |
-
pandas==2.2.2
|
1121 |
-
scipy==1.13.0
|
1122 |
-
scikit-learn==1.4.2
|
1123 |
-
lightgbm==4.3.0
|
1124 |
-
sentence-transformers
|
1125 |
-
torch
|
1126 |
-
joblib
|
1127 |
-
transformers
|
1128 |
-
safetensors
|
1129 |
-
tokenizers
|
1130 |
-
matplotlib
|
1131 |
-
seaborn
|
1132 |
-
jupyterlab
|
1133 |
-
ipykernel
|
1134 |
-
jupyter-events
|
1135 |
-
jupyter-lsp
|
1136 |
-
jupyter_client
|
1137 |
-
jupyter_core
|
1138 |
-
jupyter_server
|
1139 |
-
jupyter_server_terminals
|
1140 |
-
notebook_shim
|
1141 |
-
nbclient
|
1142 |
-
nbconvert
|
1143 |
-
nbformat
|
1144 |
-
anyio
|
1145 |
-
argon2-cffi
|
1146 |
-
argon2-cffi-bindings
|
1147 |
-
arrow
|
1148 |
-
asttokens
|
1149 |
-
async-lru
|
1150 |
-
attrs
|
1151 |
-
babel
|
1152 |
-
beautifulsoup4
|
1153 |
-
bleach
|
1154 |
-
certifi
|
1155 |
-
cffi
|
1156 |
-
charset-normalizer
|
1157 |
-
colorama
|
1158 |
-
comm
|
1159 |
-
contourpy
|
1160 |
-
cycler
|
1161 |
-
debugpy
|
1162 |
-
decorator
|
1163 |
-
defusedxml
|
1164 |
-
executing
|
1165 |
-
fastjsonschema
|
1166 |
-
filelock
|
1167 |
-
fonttools
|
1168 |
-
fqdn
|
1169 |
-
fsspec
|
1170 |
-
gdown
|
1171 |
-
h11
|
1172 |
-
httpcore
|
1173 |
-
httpx
|
1174 |
-
huggingface-hub
|
1175 |
-
idna
|
1176 |
-
ipython
|
1177 |
-
ipython_pygments_lexers
|
1178 |
-
isoduration
|
1179 |
-
jedi
|
1180 |
-
Jinja2
|
1181 |
-
json5
|
1182 |
-
jsonpointer
|
1183 |
-
jsonschema
|
1184 |
-
jsonschema-specifications
|
1185 |
-
jupyterlab_pygments
|
1186 |
-
jupyterlab_server
|
1187 |
-
kiwisolver
|
1188 |
-
MarkupSafe
|
1189 |
-
mistune
|
1190 |
-
mpmath
|
1191 |
-
nest-asyncio
|
1192 |
-
networkx
|
1193 |
-
overrides
|
1194 |
-
packaging
|
1195 |
-
pandocfilters
|
1196 |
-
parso
|
1197 |
-
pillow
|
1198 |
-
platformdirs
|
1199 |
-
prometheus_client
|
1200 |
-
prompt_toolkit
|
1201 |
-
psutil
|
1202 |
-
pure_eval
|
1203 |
-
pycparser
|
1204 |
-
Pygments
|
1205 |
-
pyparsing
|
1206 |
-
PySocks
|
1207 |
-
python-dateutil
|
1208 |
-
python-json-logger
|
1209 |
-
pytz
|
1210 |
-
pywin32; sys_platform == 'win32'
|
1211 |
-
pywinpty; sys_platform == 'win32'
|
1212 |
-
PyYAML
|
1213 |
-
pyzmq==25.1.2
|
1214 |
-
referencing
|
1215 |
-
regex
|
1216 |
-
rfc3339-validator
|
1217 |
-
rfc3986-validator
|
1218 |
-
rpds-py==0.18.1
|
1219 |
-
Send2Trash
|
1220 |
-
six
|
1221 |
-
sniffio
|
1222 |
-
soupsieve
|
1223 |
-
stack-data
|
1224 |
-
sympy
|
1225 |
-
terminado
|
1226 |
-
threadpoolctl
|
1227 |
-
tinycss2
|
1228 |
-
tornado
|
1229 |
-
typing_extensions
|
1230 |
-
tzdata
|
1231 |
-
uri-template
|
1232 |
-
urllib3
|
1233 |
-
wcwidth
|
1234 |
-
webcolors
|
1235 |
-
webencodings
|
1236 |
-
websocket-client
|
1237 |
-
</file>
|
1238 |
-
|
1239 |
<file path="test_data_quality.py">
|
1240 |
# test_data_quality.py
|
1241 |
import pytest
|
@@ -1418,4 +1257,334 @@ if __name__ == "__main__":
|
|
1418 |
print(f"\nAn error occurred: {e}")
|
1419 |
</file>
|
1420 |
|
1421 |
</files>
|
|
|
40 |
</file_summary>
|
41 |
|
42 |
<directory_structure>
|
43 |
+
.gitignore
|
44 |
compute_embeddings.py
|
45 |
+
Dockerfile
|
46 |
feature_engineering.py
|
47 |
+
finetune_distractor_model.py
|
48 |
generate_distractor_training_set.py
|
49 |
investigate_data.py
|
50 |
main_psychology_tutor_pipeline.ipynb
|
51 |
normalize_psych_data.py
|
52 |
+
README.md
|
53 |
requirements.txt
|
54 |
test_data_quality.py
|
55 |
test_model_performance.py
|
56 |
tests/create_golden_set.py
|
57 |
+
verify_training_data.py
|
58 |
</directory_structure>
|
59 |
|
60 |
<files>
|
61 |
This section contains the contents of the repository's files.
|
62 |
|
63 |
+
<file path=".gitignore">
|
64 |
+
# --- Virtual Environment ---
|
65 |
+
# Never commit the virtual environment folder
|
66 |
+
.venv/
|
67 |
+
venv/
|
68 |
+
env/
|
69 |
+
|
70 |
+
# --- Data and Model Files ---
|
71 |
+
# Ignore these directories and everything inside them
|
72 |
+
data/
|
73 |
+
models/
|
74 |
+
tests/golden_test_set.parquet
|
75 |
+
|
76 |
+
# --- Python & Jupyter Cache ---
|
77 |
+
# Standard Python cache files
|
78 |
+
__pycache__/
|
79 |
+
*.pyc
|
80 |
+
*.pyo
|
81 |
+
*.pyd
|
82 |
+
|
83 |
+
# Jupyter Notebook checkpoints
|
84 |
+
.ipynb_checkpoints
|
85 |
+
|
86 |
+
# --- IDE/Editor specific files ---
|
87 |
+
.vscode/
|
88 |
+
.idea/
|
89 |
+
</file>
|
90 |
+
|
91 |
<file path="compute_embeddings.py">
|
92 |
# compute_embeddings.py
|
93 |
import pandas as pd
|
|
|
131 |
print(f"New dataframe shape: {df_with_embeddings.shape}")
|
132 |
</file>
|
133 |
|
134 |
<file path="feature_engineering.py">
|
135 |
# feature_engineering.py
|
136 |
|
|
|
1075 |
print("\nNo data was processed successfully. Check logs for errors.")
|
1076 |
</file>
|
1077 |
|
1078 |
<file path="test_data_quality.py">
|
1079 |
# test_data_quality.py
|
1080 |
import pytest
|
|
|
1257 |
print(f"\nAn error occurred: {e}")
|
1258 |
</file>
|
1259 |
|
1260 |
+
<file path="verify_training_data.py">
|
1261 |
+
import pandas as pd
|
1262 |
+
import os
|
1263 |
+
import argparse
|
1264 |
+
|
1265 |
+
# --- Configuration ---
|
1266 |
+
TRAINING_DATA_FILE = "data/training_sets/distractor_generation_training_data.parquet"
|
1267 |
+
|
1268 |
+
def verify_data(file_path: str, num_samples: int):
|
1269 |
+
"""
|
1270 |
+
Loads the generated training data and prints a random sample
|
1271 |
+
for human verification.
|
1272 |
+
"""
|
1273 |
+
print("--- Starting Training Data Verification ---")
|
1274 |
+
|
1275 |
+
# 1. Validate file exists
|
1276 |
+
if not os.path.exists(file_path):
|
1277 |
+
print(f"\n❌ FATAL: Training data file not found at '{file_path}'.")
|
1278 |
+
print("Please run generate_distractor_training_set.py first.")
|
1279 |
+
return
|
1280 |
+
|
1281 |
+
print(f"Loading data from '{file_path}'...")
|
1282 |
+
try:
|
1283 |
+
df = pd.read_parquet(file_path)
|
1284 |
+
except Exception as e:
|
1285 |
+
print(f"\n❌ FATAL: Could not read parquet file. Error: {e}")
|
1286 |
+
return
|
1287 |
+
|
1288 |
+
# 2. Take a random sample for review
|
1289 |
+
if num_samples > len(df):
|
1290 |
+
print(f"Warning: Requested {num_samples} samples, but dataset only has {len(df)}. Showing all.")
|
1291 |
+
num_samples = len(df)
|
1292 |
+
|
1293 |
+
sample_df = df.sample(n=num_samples, random_state=42)
|
1294 |
+
|
1295 |
+
print(f"\nDisplaying {num_samples} random examples for your review:")
|
1296 |
+
print("-" * 80)
|
1297 |
+
|
1298 |
+
# 3. Print samples in a readable format
|
1299 |
+
    for i, (_, row) in enumerate(sample_df.iterrows(), start=1):
|
1300 |
+
        print(f"\n--- Example {i}/{num_samples} ---")
|
1301 |
+
print(f"\n[QUESTION]:")
|
1302 |
+
print(f" {row['question']}")
|
1303 |
+
print(f"\n [CORRECT ANSWER]:")
|
1304 |
+
print(f" {row['correct_answer']}")
|
1305 |
+
print(f"\n [GENERATED DISTRACTOR (is this a good distractor?)]:")
|
1306 |
+
print(f" {row['distractor']}")
|
1307 |
+
print("-" * 80)
|
1308 |
+
|
1309 |
+
if __name__ == "__main__":
|
1310 |
+
parser = argparse.ArgumentParser(
|
1311 |
+
description="Spot-check the quality of the auto-generated distractor training data.",
|
1312 |
+
formatter_class=argparse.ArgumentDefaultsHelpFormatter
|
1313 |
+
)
|
1314 |
+
|
1315 |
+
parser.add_argument(
|
1316 |
+
"--file",
|
1317 |
+
default=TRAINING_DATA_FILE,
|
1318 |
+
help="Path to the training data .parquet file."
|
1319 |
+
)
|
1320 |
+
parser.add_argument(
|
1321 |
+
"-n", "--num_samples",
|
1322 |
+
type=int,
|
1323 |
+
default=5,
|
1324 |
+
help="The number of random samples to display for verification."
|
1325 |
+
)
|
1326 |
+
|
1327 |
+
args = parser.parse_args()
|
1328 |
+
|
1329 |
+
verify_data(file_path=args.file, num_samples=args.num_samples)
|
1330 |
+
</file>
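Note on the example counter in verify_training_data.py: rows drawn with `df.sample(...)` keep their original index labels, so the label yielded by `iterrows()` is not a 1..n position. A minimal sketch of numbering by position instead, assuming plain `(label, row)` pairs stand in for what `iterrows()` yields (the helper name `number_examples` and the sample rows are hypothetical, not part of the commit):

```python
from typing import Any, Hashable, Iterable, Mapping


def number_examples(rows: Iterable[tuple[Hashable, Mapping[str, Any]]]) -> list[str]:
    """Return one header per row, numbered 1..n regardless of the row labels."""
    rows = list(rows)
    total = len(rows)
    headers = []
    # enumerate() supplies the display position; the original label is ignored.
    for position, (_label, row) in enumerate(rows, start=1):
        headers.append(f"--- Example {position}/{total} --- {row['question']}")
    return headers


# Hypothetical sampled rows: labels 812 and 17 mimic what a random sample keeps.
sampled = [
    (812, {"question": "What is the bystander effect?"}),
    (17, {"question": "What is a hypothesis?"}),
]
for header in number_examples(sampled):
    print(header)
```

The same pattern applies to a real DataFrame by passing `sample_df.iterrows()` as `rows`.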
|
1331 |
+
|
1332 |
+
<file path="Dockerfile">
|
1333 |
+
# Use the official Python 3.11 CPU image
|
1334 |
+
FROM python:3.11-slim
|
1335 |
+
|
1336 |
+
# Set the working directory
|
1337 |
+
WORKDIR /app
|
1338 |
+
|
1339 |
+
# Copy requirements and install them
|
1340 |
+
COPY requirements.txt /app/
|
1341 |
+
RUN pip install --no-cache-dir -r requirements.txt
|
1342 |
+
|
1343 |
+
# Copy the rest of your code
|
1344 |
+
COPY . /app/
|
1345 |
+
|
1346 |
+
# Set the command to run your script
|
1347 |
+
CMD ["python", "finetune_distractor_model.py"]
|
1348 |
+
</file>
|
1349 |
+
|
1350 |
+
<file path="finetune_distractor_model.py">
|
1351 |
+
import pandas as pd
|
1352 |
+
from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments
|
1353 |
+
from datasets import load_dataset # Use this to load from the Hub
|
1354 |
+
import os
|
1355 |
+
|
1356 |
+
# --- Configuration ---
|
1357 |
+
# !!! IMPORTANT: Replace this with YOUR Hugging Face username and dataset name !!!
|
1358 |
+
TRAINING_DATA_REPO = "adfras/psychology-distractor-data"
|
1359 |
+
BASE_MODEL = "t5-small"
|
1360 |
+
OUTPUT_MODEL_DIR = "models/distractor_generator_t5_small" # This will be the output dir inside the Space
|
1361 |
+
|
1362 |
+
# --- Main Logic ---
|
1363 |
+
if __name__ == "__main__":
|
1364 |
+
print(f"--- Fine-tuning Distractor Generation Model ({BASE_MODEL}) on HF Space ---")
|
1365 |
+
|
1366 |
+
# 1. Load the training data FROM THE HUB
|
1367 |
+
print(f"Loading data from Hugging Face Hub: {TRAINING_DATA_REPO}...")
|
1368 |
+
try:
|
1369 |
+
# load_dataset can directly read the parquet file from your repo
|
1370 |
+
hf_dataset = load_dataset(TRAINING_DATA_REPO, split="train")
|
1371 |
+
except Exception as e:
|
1372 |
+
print(f"FATAL: Could not load dataset from Hub. Make sure the repo name is correct and the dataset is public.")
|
1373 |
+
raise e
|
1374 |
+
|
1375 |
+
    hf_dataset = hf_dataset.shuffle(seed=42).select(range(min(10000, len(hf_dataset))))
|
1376 |
+
    print(f"Loaded {len(hf_dataset)} examples for fine-tuning.")
|
1377 |
+
|
1378 |
+
# ... (The rest of the script, from "Load Tokenizer and Model" onwards, remains EXACTLY THE SAME) ...
|
1379 |
+
# 2. Load Tokenizer and Model
|
1380 |
+
print(f"Loading tokenizer and model for '{BASE_MODEL}'...")
|
1381 |
+
tokenizer = T5Tokenizer.from_pretrained(BASE_MODEL)
|
1382 |
+
model = T5ForConditionalGeneration.from_pretrained(BASE_MODEL)
|
1383 |
+
|
1384 |
+
# 3. Prepare the data for the T5 model
|
1385 |
+
def preprocess_function(examples):
|
1386 |
+
prefix = "generate distractor: "
|
1387 |
+
inputs = [prefix + "question: " + q + " answer: " + a for q, a in zip(examples["question"], examples["correct_answer"])]
|
1388 |
+
|
1389 |
+
model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
|
1390 |
+
labels = tokenizer(text_target=examples["distractor"], max_length=128, truncation=True, padding="max_length")
|
1391 |
+
|
1392 |
+
model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    # We already loaded it as a hf_dataset, so we just map it
    print("Tokenizing dataset...")
    tokenized_dataset = hf_dataset.map(preprocess_function, batched=True, remove_columns=hf_dataset.column_names)

    # 4. Define Training Arguments
    training_args = TrainingArguments(
        output_dir=OUTPUT_MODEL_DIR,
        num_train_epochs=3,
        per_device_train_batch_size=8,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_strategy="steps",
        logging_steps=100,
        save_strategy="epoch",
        save_total_limit=2,
    )

    # 5. Create the Trainer and Start Fine-Tuning
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
    )

    print("\nStarting fine-tuning process...")
    trainer.train()

    # 6. Save the final model
    print("Fine-tuning complete. Saving final model...")
    trainer.save_model(OUTPUT_MODEL_DIR)
    tokenizer.save_pretrained(OUTPUT_MODEL_DIR)

    print(f"\n--- SUCCESS ---")
    print(f"Fine-tuned model saved to: '{OUTPUT_MODEL_DIR}'")
</file>
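
Note on the prompt layout: the string built in `preprocess_function` above has to be reproduced exactly when the fine-tuned model is queried later, otherwise the model sees inputs it was never trained on. A minimal sketch of that string construction, with no tokenizer or model involved. `build_distractor_prompt` is a hypothetical helper name and is not part of the repository; the example question and answer are made up.

```python
# Same task prefix as in preprocess_function
PREFIX = "generate distractor: "


def build_distractor_prompt(question: str, correct_answer: str) -> str:
    """Hypothetical helper: format one example the way the training inputs are built."""
    return PREFIX + "question: " + question + " answer: " + correct_answer


# Illustrative input only; not taken from the training data
prompt = build_distractor_prompt(
    "What is classical conditioning?",
    "Learning by association",
)
print(prompt)
```

Keeping the format in one function means the training script and any later inference code cannot drift apart silently.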

<file path="requirements.txt">
# requirements.txt
# This is the single source of truth for all project dependencies.
# Use this file with Python 3.11 for guaranteed compatibility.

# --- Testing Framework ---
pytest
pytest-cov

# --- From psychology-data-pipeline ---
# (Not covered by the tutor-engine list below, so listed explicitly here)
requests
datasets
py7zr
lxml
tqdm

# --- From proactive-tutor-engine (pinned entries are exact, proven versions; the rest are unpinned) ---
git+https://github.com/CAHLR/pyBKT.git#egg=pyBKT
numpy==1.26.4
pandas==2.2.2
scipy==1.13.0
scikit-learn==1.4.2
lightgbm==4.3.0
sentence-transformers
torch
sentencepiece
accelerate>=0.26.0
joblib
transformers
safetensors
tokenizers
matplotlib
seaborn
jupyterlab
ipykernel
jupyter-events
jupyter-lsp
jupyter_client
jupyter_core
jupyter_server
jupyter_server_terminals
notebook_shim
nbclient
nbconvert
nbformat
anyio
argon2-cffi
argon2-cffi-bindings
arrow
asttokens
async-lru
attrs
babel
beautifulsoup4
bleach
certifi
cffi
charset-normalizer
colorama
comm
contourpy
cycler
debugpy
decorator
defusedxml
executing
fastjsonschema
filelock
fonttools
fqdn
fsspec
gdown
h11
httpcore
httpx
huggingface-hub
idna
ipython
ipython_pygments_lexers
isoduration
jedi
Jinja2
json5
jsonpointer
jsonschema
jsonschema-specifications
jupyterlab_pygments
jupyterlab_server
kiwisolver
MarkupSafe
mistune
mpmath
nest-asyncio
networkx
overrides
packaging
pandocfilters
parso
pillow
platformdirs
prometheus_client
prompt_toolkit
psutil
pure_eval
pycparser
Pygments
pyparsing
PySocks
python-dateutil
python-json-logger
pytz
pywin32; sys_platform == 'win32'
pywinpty; sys_platform == 'win32'
PyYAML
pyzmq==25.1.2
referencing
regex
rfc3339-validator
rfc3986-validator
rpds-py==0.18.1
Send2Trash
six
sniffio
soupsieve
stack-data
sympy
terminado
threadpoolctl
tinycss2
tornado
typing_extensions
tzdata
uri-template
urllib3
wcwidth
webcolors
webencodings
websocket-client
</file>
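
Note on the pinned versions: only a handful of entries in the requirements file above carry an exact `==` pin, so a built image can be spot-checked against them. A rough sketch using only the standard library. `find_pin_mismatches` and the `PINNED` dictionary are illustrative names, not part of the repository; the version strings are copied from the listing above.

```python
from importlib import metadata

# Exact pins copied from the requirements.txt listing above
PINNED = {
    "numpy": "1.26.4",
    "pandas": "2.2.2",
    "scipy": "1.13.0",
    "scikit-learn": "1.4.2",
    "lightgbm": "4.3.0",
    "pyzmq": "25.1.2",
    "rpds-py": "0.18.1",
}


def find_pin_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every unsatisfied pin."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_pin_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

The `get_version` parameter exists only so the lookup can be swapped out in a test; by default it queries the packages installed in the current environment.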

<file path="README.md">
---
title: T5 Distractor Model Fine-Tuning (CPU)
emoji: 🤖
colorFrom: red
colorTo: yellow # <-- THIS IS THE FIX
sdk: docker
app_file: Dockerfile
pinned: false
---

# T5 Distractor Model - Fine-Tuning Job

This Space automatically fine-tunes a `t5-small` model using the free CPU tier.
The `finetune_distractor_model.py` script will run automatically. Monitor progress in the "Logs" tab.
</file>
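
Note on what to expect in the "Logs" tab: the step counts implied by the training arguments can be worked out by hand. A rough sketch, assuming a single device, no gradient accumulation, and 10,000 training examples (the sample size the script mentions). The function name is illustrative; the other numbers are the values passed to `TrainingArguments` in the script.

```python
import math


def training_step_summary(num_examples, batch_size, epochs, logging_steps, warmup_steps):
    """Estimate optimizer steps, log entries, and warmup share for one run."""
    # The last, possibly smaller, batch still counts as a step
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_entries": total_steps // logging_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }


# Assumed 10,000 examples; batch size 8, 3 epochs, logging every 100 steps, 500 warmup steps
summary = training_step_summary(10_000, 8, 3, 100, 500)
print(summary)
```

Under these assumptions the run takes 1,250 steps per epoch and 3,750 steps in total, so the log should show about 37 loss entries, and warmup covers roughly the first 13% of training.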
</files>
|