Arash Alborz committed on
Commit
9572e6a
·
1 Parent(s): 2aa4259

Cleaned up files, added prediction notebook, updated test script

.DS_Store CHANGED
Binary files a/.DS_Store and b/.DS_Store differ
 
README.md CHANGED
@@ -13,7 +13,7 @@ inference: false
13
  ---
14
  # Personality Trait Predictor (Big Five)
15
 
16
- This repository provides a machine learning pipeline for predicting the **Big Five personality traits** from free-form **text input**. It combines **DistilBERT embeddings**, **LIWC-style linguistic features**, and a set of **Random Forest classifiers** β€” one for each trait β€” trained on labeled personality data.
17
 
18
  ### Predicted traits:
19
  - **Openness**
@@ -28,10 +28,15 @@ Each trait is predicted as a **categorical label**: `low`, `medium`, or `high`.
28
 
29
  ## How It Works
30
 
31
- - Text is converted to embeddings using the CLS token from `DistilBERT`.
32
- - LIWC-like features are computed using a custom dictionary (`output.dic`).
33
- - Both features are concatenated and passed through a **trait-specific Random Forest classifier**.
34
  - Predictions are returned as string labels for all five traits.
35
 
36
  ---
37
 
@@ -60,6 +65,7 @@ print(predictions)
60
 
61
  ## Installation
62
 
63
  Clone the repository and install dependencies:
64
 
65
  ```bash
@@ -81,22 +87,22 @@ pip install -r requirements.txt
81
  ```
82
  personality-trait-predictor/
83
  β”œβ”€β”€ personality_model.py # Main class for prediction
84
- β”œβ”€β”€ requirements.txt # Dependencies
85
- β”œβ”€β”€ README.md # Project description
86
- β”œβ”€β”€ .gitattributes # Git LFS tracking
87
  β”œβ”€β”€ models/
88
- β”‚ β”œβ”€β”€ feature_scaler.pkl # StandardScaler for feature scaling
89
- β”‚ β”œβ”€β”€ output.dic # LIWC-style dictionary
90
- β”‚ β”œβ”€β”€ openness_classifier.pkl # Classifier for Openness
91
  β”‚ β”œβ”€β”€ conscientiousness_classifier.pkl
92
  β”‚ β”œβ”€β”€ extraversion_classifier.pkl
93
  β”‚ β”œβ”€β”€ agreeableness_classifier.pkl
94
  β”‚ β”œβ”€β”€ emotional_stability_classifier.pkl
95
  β”œβ”€β”€ feature_extraction/
96
  β”‚ β”œβ”€β”€ __init__.py
97
- β”‚ β”œβ”€β”€ embedding_from_text.py # BERT embedding code
98
  β”‚ β”œβ”€β”€ liwc_from_text.py # LIWC feature extraction
99
- β”‚ β”œβ”€β”€ pipeline.py # Combined feature pipeline
100
  ```
101
 
102
  ---
@@ -128,12 +134,6 @@ personality-trait-predictor/
128
 
129
  ## Requirements
130
 
131
- Install with:
132
-
133
- ```bash
134
- pip install -r requirements.txt
135
- ```
136
-
137
  Dependencies include:
138
 
139
  - numpy
 
13
  ---
14
  # Personality Trait Predictor (Big Five)
15
 
16
+ This repository provides a machine learning pipeline for predicting the **Big Five personality traits (OCEAN)** from **text input**. It combines **DistilBERT embeddings**, **LIWC-style linguistic features**, and a set of **Random Forest classifiers** β€” one for each trait β€” trained on annotated personality data.
17
 
18
  ### Predicted traits:
19
  - **Openness**
 
28
 
29
  ## How It Works
30
 
31
+ - Training was done on the PANDORA dataset.
32
+ - Embeddings are extracted using the pretrained model `distilbert/distilbert-base-cased-distilled-squad`.
33
+ - 64 LIWC-style features are extracted by mapping tokens through a custom dictionary (`output.dic`).
34
+ - Both features are concatenated and passed into a **trait-specific Random Forest classifier** (see the sketch after this list).
35
  - Predictions are returned as string labels for all five traits.
36
+ - There are five separate Random Forests, one per personality trait, each optimized separately with its own hyperparameters
37
+ to yield a fair, unbiased prediction model.
38
+ - Hyperparameter optimization was based on accuracy and F1-score, with the requirement that the model actually predicts all labels (to avoid a majority-class bias).
39
+ - Each Random Forest classifier therefore uses different `n_estimators` and `max_depth` values to reach its optimum.
40
 
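+ A minimal sketch of this flow, assuming the helpers in `feature_extraction/` and the model files listed under Project Structure (a rough illustration, not the exact `PersonalityClassifier` API):
+ 
+ ```python
+ import joblib
+ import numpy as np
+ 
+ from feature_extraction.embedding_from_text import get_bert_embedding
+ from feature_extraction.liwc_from_text import load_liwc_dic, liwc_vector
+ 
+ liwc_map = load_liwc_dic("models/output.dic")       # LIWC-style dictionary
+ scaler = joblib.load("models/feature_scaler.pkl")   # StandardScaler fit at training time
+ rf = joblib.load("models/openness_classifier.pkl")  # one classifier per trait
+ 
+ def predict_openness(text: str) -> str:
+     emb = get_bert_embedding(text)                  # 768-dim CLS embedding
+     liwc_vec, _ = liwc_vector(text, liwc_map)       # 64 LIWC-style features
+     features = scaler.transform([np.concatenate([emb, liwc_vec])])
+     return rf.predict(features)[0]                  # "low", "medium", or "high"
+ ```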
41
  ---
42
 
 
65
 
66
  ## Installation
67
 
68
+ We recommend creating a conda environment first, e.g. (the environment name below is just a placeholder; Python 3.9 matches the notebook kernel):
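+ 
+ ```bash
+ # environment name is a placeholder; Python 3.9 matches the notebook kernel
+ conda create -n personality-predictor python=3.9
+ conda activate personality-predictor
+ ```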
69
  Clone the repository and install dependencies:
70
 
71
  ```bash
 
87
  ```
88
  personality-trait-predictor/
89
  β”œβ”€β”€ personality_model.py # Main class for prediction
90
+ β”œβ”€β”€ requirements.txt
91
+ β”œβ”€β”€ README.md
92
+ β”œβ”€β”€ .gitattributes
93
  β”œβ”€β”€ models/
94
+ β”‚ β”œβ”€β”€ feature_scaler.pkl # StandardScaler for feature scaling
95
+ β”‚ β”œβ”€β”€ output.dic # LIWC-style dictionary
96
+ β”‚ β”œβ”€β”€ openness_classifier.pkl # Classifiers ...
97
  β”‚ β”œβ”€β”€ conscientiousness_classifier.pkl
98
  β”‚ β”œβ”€β”€ extraversion_classifier.pkl
99
  β”‚ β”œβ”€β”€ agreeableness_classifier.pkl
100
  β”‚ β”œβ”€β”€ emotional_stability_classifier.pkl
101
  β”œβ”€β”€ feature_extraction/
102
  β”‚ β”œβ”€β”€ __init__.py
103
+ β”‚ β”œβ”€β”€ embedding_from_text.py # Embeddings extraction with BERT
104
  β”‚ β”œβ”€β”€ liwc_from_text.py # LIWC feature extraction
105
+
106
  ```
107
 
108
  ---
 
134
 
135
  ## Requirements
136
 
 
137
  Dependencies include:
138
 
139
  - numpy
app.py DELETED
@@ -1,30 +0,0 @@
1
- # app.py
2
-
3
- import gradio as gr
4
- import joblib
5
- import numpy as np
6
- from feature_extraction.pipeline import text_to_features
7
-
8
- # Load pretrained Random Forest model for Openness
9
- model = joblib.load("models/openness_rf.pkl")
10
-
11
- def predict_openness(text):
12
- try:
13
- vec = text_to_features(text) # shape: (1, dim)
14
- pred = model.predict(vec)[0] # already "low", "medium", or "high"
15
- return f"Predicted Openness: **{pred.upper()}**"
16
- except Exception as e:
17
- return f"Error: {str(e)}"
18
-
19
- # Gradio UI
20
- demo = gr.Interface(
21
- fn=predict_openness,
22
- inputs=gr.Textbox(lines=6, placeholder="Enter your thoughts here..."),
23
- outputs=gr.Markdown(),
24
- title="Big Five Personality Prediction",
25
- description="This model predicts **Openness** based on your text using BERT + LIWC features.",
26
- allow_flagging="never"
27
- )
28
-
29
- if __name__ == "__main__":
30
- demo.launch()
 
feature_extraction/embedding_from_csv.py DELETED
File without changes
feature_extraction/liwc_from_csv.py DELETED
@@ -1,39 +0,0 @@
1
- # feature_extraction/liwc_from_csv.py
2
-
3
- import numpy as np
4
- import pandas as pd
5
- from collections import defaultdict, Counter
6
- import re
7
-
8
- def load_liwc_dic(dic_path="models/output.dic"):
9
- category_map = defaultdict(list)
10
- with open(dic_path, 'r', encoding='utf-8') as f:
11
- for line in f:
12
- if ':' not in line:
13
- continue
14
- parts = line.strip().split()
15
- category = parts[0].rstrip(':')
16
- words = parts[1:]
17
- category_map[category] = words
18
- return category_map
19
-
20
- def extract_liwc_from_csv(csv_path, category_map):
21
- df = pd.read_csv(csv_path)
22
- sorted_categories = sorted(category_map.keys())
23
-
24
- def process_row(row):
25
- text = " ".join(str(row[q]) for q in ['Q1', 'Q2', 'Q3'] if pd.notna(row[q]))
26
- tokens = re.findall(r"\b\w+\b", text.lower())
27
- counts = Counter()
28
- for category, words in category_map.items():
29
- for token in tokens:
30
- if token in words:
31
- counts[category] += 1
32
- vec = np.array([counts.get(cat, 0) for cat in sorted_categories])
33
- if np.sum(vec) > 0:
34
- vec = vec / np.sum(vec)
35
- return vec
36
-
37
- liwc_features = df.apply(process_row, axis=1, result_type="expand")
38
- liwc_features.columns = [f"liwc_{cat}" for cat in sorted_categories]
39
- return liwc_features
 
feature_extraction/pipeline.py DELETED
@@ -1,28 +0,0 @@
1
- # feature_extraction/pipeline.py
2
-
3
- import numpy as np
4
- import joblib
5
-
6
- from feature_extraction.embedding_from_text import get_bert_embedding
7
- from feature_extraction.liwc_from_text import load_liwc_dic, liwc_vector
8
-
9
- # Load the LIWC lexicon once
10
- liwc_map = load_liwc_dic("models/output.dic")
11
-
12
- # Load the scaler
13
- scaler = joblib.load("models/scaler.pkl")
14
-
15
- def text_to_features(text: str) -> np.ndarray:
16
- # Get BERT embedding (768-dim)
17
- emb_vec = get_bert_embedding(text)
18
-
19
- # Get LIWC vector (~64-dim)
20
- liwc_vec, _ = liwc_vector(text, liwc_map)
21
-
22
- # Combine into one long vector
23
- full_vec = np.concatenate([emb_vec, liwc_vec])
24
-
25
- # Standardize using the saved scaler
26
- scaled_vec = scaler.transform([full_vec]) # shape: (1, total_dim)
27
-
28
- return scaled_vec # Return the standardized vector for prediction
 
predict_from_csv.ipynb ADDED
@@ -0,0 +1,82 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 7,
6
+ "id": "fd9ef55b",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "name": "stdout",
11
+ "output_type": "stream",
12
+ "text": [
13
+ "Trait predictions saved to: /Users/arashalborz/Desktop/Data/filled_predictions.csv\n"
14
+ ]
15
+ }
16
+ ],
17
+ "source": [
18
+ "'''\n",
19
+ "Script for getting predictions for a test CSV file with three text columns.\n",
20
+ "It looks at the columns Q1, Q2, Q3, concatenates them and passes the full string as text into \n",
21
+ "PersonalityClassifier(). \n",
22
+ ">>> The method \"predict_all_traits\" defined in the class will get the predictions by running five \n",
23
+ "separate prediction models (optimized random forests). \n",
24
+ "The predictions are then applied as labels to each of the trait columns.\n",
25
+ ">>> Important: All the other columns in the original CSV file (Q1, Q2, Q3 and Humility) remain untouched. \n",
26
+ "The CSV input file does not need to have empty trait columns; \n",
27
+ "the script overwrites any existing annotations with its predictions.\n",
28
+ "'''\n",
29
+ "\n",
30
+ "import warnings\n",
31
+ "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
32
+ "import pandas as pd\n",
33
+ "from personality_model import PersonalityClassifier\n",
34
+ "\n",
35
+ "# ***************** LOAD THE TEST DATA WITH Q1, Q2, Q3 *********************\n",
36
+ "\n",
37
+ "input_path = \"/Users/arashalborz/Desktop/Data/val_data.csv\" # path to test data\n",
38
+ "output_path = \"/Users/arashalborz/Desktop/Data/filled_predictions.csv\" # change PATH and NAME of output\n",
39
+ "\n",
40
+ "df = pd.read_csv(input_path)\n",
41
+ "\n",
42
+ "# concatenating Q1, Q2, Q3 \n",
43
+ "texts = df[[\"Q1\", \"Q2\", \"Q3\"]].fillna(\"\").agg(\" \".join, axis=1)\n",
44
+ "\n",
45
+ "# model initialization\n",
46
+ "model = PersonalityClassifier()\n",
47
+ "\n",
48
+ "# predicting trait labels for each row\n",
49
+ "predictions = texts.apply(model.predict_all_traits)\n",
50
+ "\n",
51
+ "# applying the predictions and filling the columns\n",
52
+ "for trait in [\"Openness\", \"Conscientiousness\", \"Extraversion\", \"Agreeableness\", \"Emotional stability\"]:\n",
53
+ " df[trait] = predictions.apply(lambda d: d[trait])\n",
54
+ "\n",
55
+ "df.to_csv(output_path, index=False)\n",
56
+ "\n",
57
+ "print(f\"Trait predictions saved to: {output_path}\")"
58
+ ]
59
+ }
60
+ ],
61
+ "metadata": {
62
+ "kernelspec": {
63
+ "display_name": "amiv_nlp_2025",
64
+ "language": "python",
65
+ "name": "python3"
66
+ },
67
+ "language_info": {
68
+ "codemirror_mode": {
69
+ "name": "ipython",
70
+ "version": 3
71
+ },
72
+ "file_extension": ".py",
73
+ "mimetype": "text/x-python",
74
+ "name": "python",
75
+ "nbconvert_exporter": "python",
76
+ "pygments_lexer": "ipython3",
77
+ "version": "3.9.21"
78
+ }
79
+ },
80
+ "nbformat": 4,
81
+ "nbformat_minor": 5
82
+ }
test_personality_model.py CHANGED
@@ -15,8 +15,17 @@ model = PersonalityClassifier()
15
 
16
  # example
17
  text = """
18
- I’ve always been fascinated by different cultures and philosophies. I love reading poetry, exploring new ideas, and reflecting on the complexity of human emotion.
19
- Traveling, learning languages, and experiencing art open up new perspectives for me every time.
20
  """
21
 
22
  # predictions
 
15
 
16
  # example
17
  text = """
18
+ I am a leader and group leader in scouts. This means a lot of responsibility falls upon me, and when issues arise I have to find solutions.
19
+ Once, there was quite a big miscommunication issue with the money we ask parents to pay for summer camp.
20
+ Some co-leaders had sent an unfinished draft of the camp invitation to the parents, in which the price for camp was way too low.
21
+ This resulted in plenty of parents not paying enough. The leaders in this situation only realized their mistake when camp was already under way,
22
+ and they did not have enough money to provide food for the entire camp. They contacted me with the question of what they should do.
23
+ Of course, it is annoying to ask parents for more money when they have already sent their kids to us.
24
+ Communication with parents had never been a strong skill of mine, but now I had no choice but to contact them all with a difficult question.
25
+ I decided the best we could do was own up to our mistake, show the parents we would try to solve it ourselves, but also let them know that they could still help.
26
+ We crafted some baskets with snacks and crafts with the kids and went to sell them on the streets close to our camp location,
27
+ but also let the parents know they could reserve some for after camp was over. They would pay in advance, and get the basket upon return of their kids.
28
+ This way, we did not have to ask the parents outright for more money, but gave them something in return and also made it into something fun for the kids.
29
  """
30
 
31
  # predictions