Arash Alborz committed on
Commit
9572e6a
·
1 Parent(s): 2aa4259

Cleaned up files, added prediction notebook, updated test script

.DS_Store CHANGED
Binary files a/.DS_Store and b/.DS_Store differ
 
README.md CHANGED
@@ -13,7 +13,7 @@ inference: false
13
  ---
14
  # Personality Trait Predictor (Big Five)
15
 
16
- This repository provides a machine learning pipeline for predicting the **Big Five personality traits** from free-form **text input**. It combines **DistilBERT embeddings**, **LIWC-style linguistic features**, and a set of **Random Forest classifiers** β€” one for each trait β€” trained on labeled personality data.
17
 
18
  ### Predicted traits:
19
  - **Openness**
@@ -28,10 +28,15 @@ Each trait is predicted as a **categorical label**: `low`, `medium`, or `high`.
28
 
29
  ## How It Works
30
 
31
- - Text is converted to embeddings using the CLS token from `DistilBERT`.
32
- - LIWC-like features are computed using a custom dictionary (`output.dic`).
33
- - Both features are concatenated and passed through a **trait-specific Random Forest classifier**.
34
  - Predictions are returned as string labels for all five traits.
35
 
36
  ---
37
 
@@ -60,6 +65,7 @@ print(predictions)
60
 
61
  ## Installation
62
 
63
  Clone the repository and install dependencies:
64
 
65
  ```bash
@@ -81,22 +87,22 @@ pip install -r requirements.txt
81
  ```
82
  personality-trait-predictor/
83
  β”œβ”€β”€ personality_model.py # Main class for prediction
84
- β”œβ”€β”€ requirements.txt # Dependencies
85
- β”œβ”€β”€ README.md # Project description
86
- β”œβ”€β”€ .gitattributes # Git LFS tracking
87
  β”œβ”€β”€ models/
88
- β”‚ β”œβ”€β”€ feature_scaler.pkl # StandardScaler for feature scaling
89
- β”‚ β”œβ”€β”€ output.dic # LIWC-style dictionary
90
- β”‚ β”œβ”€β”€ openness_classifier.pkl # Classifier for Openness
91
  β”‚ β”œβ”€β”€ conscientiousness_classifier.pkl
92
  β”‚ β”œβ”€β”€ extraversion_classifier.pkl
93
  β”‚ β”œβ”€β”€ agreeableness_classifier.pkl
94
  β”‚ β”œβ”€β”€ emotional_stability_classifier.pkl
95
  β”œβ”€β”€ feature_extraction/
96
  β”‚ β”œβ”€β”€ __init__.py
97
- β”‚ β”œβ”€β”€ embedding_from_text.py # BERT embedding code
98
  β”‚ β”œβ”€β”€ liwc_from_text.py # LIWC feature extraction
99
- β”‚ β”œβ”€β”€ pipeline.py # Combined feature pipeline
100
  ```
101
 
102
  ---
@@ -128,12 +134,6 @@ personality-trait-predictor/
128
 
129
  ## Requirements
130
 
131
- Install with:
132
-
133
- ```bash
134
- pip install -r requirements.txt
135
- ```
136
-
137
  Dependencies include:
138
 
139
  - numpy
 
13
  ---
14
  # Personality Trait Predictor (Big Five)
15
 
16
+ This repository provides a machine learning pipeline for predicting the **Big Five personality traits (OCEAN)** from **text input**. It combines **DistilBERT embeddings**, **LIWC-style linguistic features**, and a set of **Random Forest classifiers** β€” one for each trait β€” trained on annotated personality data.
17
 
18
  ### Predicted traits:
19
  - **Openness**
 
28
 
29
  ## How It Works
30
 
31
+ - Training was done on the PANDORA dataset.
32
+ - Embeddings are extracted using the pretrained model `distilbert/distilbert-base-cased-distilled-squad`.
33
+ - 64 LIWC-style features are extracted by mapping tokens through a custom dictionary (`output.dic`).
34
+ - Both features are concatenated and passed into a **trait-specific Random Forest classifier** (see the sketch after this list).
35
  - Predictions are returned as string labels for all five traits.
36
+ - There are five separate Random Forests, one per personality trait, each optimized separately with its own hyperparameters
37
+ to yield a fair, unbiased prediction model.
38
+ - Hyperparameter optimization was based on accuracy and F1-score, with the requirement that the model actually predicts all labels (to avoid a majority-class bias).
39
+ - Each Random Forest classifier therefore uses different `n_estimators` and `max_depth` values to reach its optimum.
40
 
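+ A minimal sketch of this flow, assuming the helpers in `feature_extraction/` and the model files listed under Project Structure (a rough illustration, not the exact `PersonalityClassifier` API):
+ 
+ ```python
+ import joblib
+ import numpy as np
+ 
+ from feature_extraction.embedding_from_text import get_bert_embedding
+ from feature_extraction.liwc_from_text import load_liwc_dic, liwc_vector
+ 
+ liwc_map = load_liwc_dic("models/output.dic")       # LIWC-style dictionary
+ scaler = joblib.load("models/feature_scaler.pkl")   # StandardScaler fit at training time
+ rf = joblib.load("models/openness_classifier.pkl")  # one classifier per trait
+ 
+ def predict_openness(text: str) -> str:
+     emb = get_bert_embedding(text)                  # 768-dim CLS embedding
+     liwc_vec, _ = liwc_vector(text, liwc_map)       # 64 LIWC-style features
+     features = scaler.transform([np.concatenate([emb, liwc_vec])])
+     return rf.predict(features)[0]                  # "low", "medium", or "high"
+ ```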
41
  ---
42
 
 
65
 
66
  ## Installation
67
 
68
+ We recommend creating a conda environment first, e.g. (the environment name below is just a placeholder; Python 3.9 matches the notebook kernel):
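+ 
+ ```bash
+ # environment name is a placeholder; Python 3.9 matches the notebook kernel
+ conda create -n personality-predictor python=3.9
+ conda activate personality-predictor
+ ```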
69
  Clone the repository and install dependencies:
70
 
71
  ```bash
 
87
  ```
88
  personality-trait-predictor/
89
  β”œβ”€β”€ personality_model.py # Main class for prediction
90
+ β”œβ”€β”€ requirements.txt
91
+ β”œβ”€β”€ README.md
92
+ β”œβ”€β”€ .gitattributes
93
  β”œβ”€β”€ models/
94
+ β”‚ β”œβ”€β”€ feature_scaler.pkl # StandardScaler for feature scaling
95
+ β”‚ β”œβ”€β”€ output.dic # LIWC-style dictionary
96
+ β”‚ β”œβ”€β”€ openness_classifier.pkl # Classifiers ...
97
  β”‚ β”œβ”€β”€ conscientiousness_classifier.pkl
98
  β”‚ β”œβ”€β”€ extraversion_classifier.pkl
99
  β”‚ β”œβ”€β”€ agreeableness_classifier.pkl
100
  β”‚ β”œβ”€β”€ emotional_stability_classifier.pkl
101
  β”œβ”€β”€ feature_extraction/
102
  β”‚ β”œβ”€β”€ __init__.py
103
+ β”‚ β”œβ”€β”€ embedding_from_text.py # Embeddings extraction with BERT
104
  β”‚ β”œβ”€β”€ liwc_from_text.py # LIWC feature extraction
105
+
106
  ```
107
 
108
  ---
 
134
 
135
  ## Requirements
136
 
 
137
  Dependencies include:
138
 
139
  - numpy
app.py DELETED
@@ -1,30 +0,0 @@
1
- # app.py
2
-
3
- import gradio as gr
4
- import joblib
5
- import numpy as np
6
- from feature_extraction.pipeline import text_to_features
7
-
8
- # Load pretrained Random Forest model for Openness
9
- model = joblib.load("models/openness_rf.pkl")
10
-
11
- def predict_openness(text):
12
- try:
13
- vec = text_to_features(text) # shape: (1, dim)
14
- pred = model.predict(vec)[0] # already "low", "medium", or "high"
15
- return f"Predicted Openness: **{pred.upper()}**"
16
- except Exception as e:
17
- return f"Error: {str(e)}"
18
-
19
- # Gradio UI
20
- demo = gr.Interface(
21
- fn=predict_openness,
22
- inputs=gr.Textbox(lines=6, placeholder="Enter your thoughts here..."),
23
- outputs=gr.Markdown(),
24
- title="Big Five Personality Prediction",
25
- description="This model predicts **Openness** based on your text using BERT + LIWC features.",
26
- allow_flagging="never"
27
- )
28
-
29
- if __name__ == "__main__":
30
- demo.launch()
 
feature_extraction/embedding_from_csv.py DELETED
File without changes
feature_extraction/liwc_from_csv.py DELETED
@@ -1,39 +0,0 @@
1
- # feature_extraction/liwc_from_csv.py
2
-
3
- import numpy as np
4
- import pandas as pd
5
- from collections import defaultdict, Counter
6
- import re
7
-
8
- def load_liwc_dic(dic_path="models/output.dic"):
9
- category_map = defaultdict(list)
10
- with open(dic_path, 'r', encoding='utf-8') as f:
11
- for line in f:
12
- if ':' not in line:
13
- continue
14
- parts = line.strip().split()
15
- category = parts[0].rstrip(':')
16
- words = parts[1:]
17
- category_map[category] = words
18
- return category_map
19
-
20
- def extract_liwc_from_csv(csv_path, category_map):
21
- df = pd.read_csv(csv_path)
22
- sorted_categories = sorted(category_map.keys())
23
-
24
- def process_row(row):
25
- text = " ".join(str(row[q]) for q in ['Q1', 'Q2', 'Q3'] if pd.notna(row[q]))
26
- tokens = re.findall(r"\b\w+\b", text.lower())
27
- counts = Counter()
28
- for category, words in category_map.items():
29
- for token in tokens:
30
- if token in words:
31
- counts[category] += 1
32
- vec = np.array([counts.get(cat, 0) for cat in sorted_categories])
33
- if np.sum(vec) > 0:
34
- vec = vec / np.sum(vec)
35
- return vec
36
-
37
- liwc_features = df.apply(process_row, axis=1, result_type="expand")
38
- liwc_features.columns = [f"liwc_{cat}" for cat in sorted_categories]
39
- return liwc_features
 
feature_extraction/pipeline.py DELETED
@@ -1,28 +0,0 @@
1
- # feature_extraction/pipeline.py
2
-
3
- import numpy as np
4
- import joblib
5
-
6
- from feature_extraction.embedding_from_text import get_bert_embedding
7
- from feature_extraction.liwc_from_text import load_liwc_dic, liwc_vector
8
-
9
- # Load the LIWC lexicon once
10
- liwc_map = load_liwc_dic("models/output.dic")
11
-
12
- # Load the scaler
13
- scaler = joblib.load("models/scaler.pkl")
14
-
15
- def text_to_features(text: str) -> np.ndarray:
16
- # Get BERT embedding (768-dim)
17
- emb_vec = get_bert_embedding(text)
18
-
19
- # Get LIWC vector (~64-dim)
20
- liwc_vec, _ = liwc_vector(text, liwc_map)
21
-
22
- # Combine into one long vector
23
- full_vec = np.concatenate([emb_vec, liwc_vec])
24
-
25
- # Standardize using the saved scaler
26
- scaled_vec = scaler.transform([full_vec]) # shape: (1, total_dim)
27
-
28
- return scaled_vec # Return the standardized vector for prediction
 
predict_from_csv.ipynb ADDED
@@ -0,0 +1,82 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 7,
6
+ "id": "fd9ef55b",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "name": "stdout",
11
+ "output_type": "stream",
12
+ "text": [
13
+ "Trait predictions saved to: /Users/arashalborz/Desktop/Data/filled_predictions.csv\n"
14
+ ]
15
+ }
16
+ ],
17
+ "source": [
18
+ "'''\n",
19
+ "Script for getting predictions for a test CSV file with three text columns.\n",
20
+ "It looks at the columns Q1, Q2, Q3, concatenates them and passes the full string as text into \n",
21
+ "PersonalityClassifier(). \n",
22
+ ">>> The method \"predict_all_traits\" defined in the class will get the predictions by running five \n",
23
+ "separate prediction models (optimized random forests). \n",
24
+ "The predictions are then applied as labels to each of the trait columns.\n",
25
+ ">>> Important: All the other columns in the original CSV file (Q1, Q2, Q3 and Humility) remain untouched. \n",
26
+ "The CSV input file does not need to have empty trait columns; \n",
27
+ "the script overwrites any existing annotations with its predictions.\n",
28
+ "'''\n",
29
+ "\n",
30
+ "import warnings\n",
31
+ "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
32
+ "import pandas as pd\n",
33
+ "from personality_model import PersonalityClassifier\n",
34
+ "\n",
35
+ "# ***************** LOAD THE TEST DATA WITH Q1, Q2, Q3 *********************\n",
36
+ "\n",
37
+ "input_path = \"/Users/arashalborz/Desktop/Data/val_data.csv\" # path to test data\n",
38
+ "output_path = \"/Users/arashalborz/Desktop/Data/filled_predictions.csv\" # change PATH and NAME of output\n",
39
+ "\n",
40
+ "df = pd.read_csv(input_path)\n",
41
+ "\n",
42
+ "# concatenating Q1, Q2, Q3 \n",
43
+ "texts = df[[\"Q1\", \"Q2\", \"Q3\"]].fillna(\"\").agg(\" \".join, axis=1)\n",
44
+ "\n",
45
+ "# model initialization\n",
46
+ "model = PersonalityClassifier()\n",
47
+ "\n",
48
+ "# predicting trait labels for each row\n",
49
+ "predictions = texts.apply(model.predict_all_traits)\n",
50
+ "\n",
51
+ "# applying the predictions and filling the columns\n",
52
+ "for trait in [\"Openness\", \"Conscientiousness\", \"Extraversion\", \"Agreeableness\", \"Emotional stability\"]:\n",
53
+ " df[trait] = predictions.apply(lambda d: d[trait])\n",
54
+ "\n",
55
+ "df.to_csv(output_path, index=False)\n",
56
+ "\n",
57
+ "print(f\"Trait predictions saved to: {output_path}\")"
58
+ ]
59
+ }
60
+ ],
61
+ "metadata": {
62
+ "kernelspec": {
63
+ "display_name": "amiv_nlp_2025",
64
+ "language": "python",
65
+ "name": "python3"
66
+ },
67
+ "language_info": {
68
+ "codemirror_mode": {
69
+ "name": "ipython",
70
+ "version": 3
71
+ },
72
+ "file_extension": ".py",
73
+ "mimetype": "text/x-python",
74
+ "name": "python",
75
+ "nbconvert_exporter": "python",
76
+ "pygments_lexer": "ipython3",
77
+ "version": "3.9.21"
78
+ }
79
+ },
80
+ "nbformat": 4,
81
+ "nbformat_minor": 5
82
+ }
test_personality_model.py CHANGED
@@ -15,8 +15,17 @@ model = PersonalityClassifier()
15
 
16
  # example
17
  text = """
18
- I’ve always been fascinated by different cultures and philosophies. I love reading poetry, exploring new ideas, and reflecting on the complexity of human emotion.
19
- Traveling, learning languages, and experiencing art open up new perspectives for me every time.
20
  """
21
 
22
  # predictions
 
15
 
16
  # example
17
  text = """
18
+ I am a leader and group leader in scouts. This means a lot of responsibility falls upon me, and when issues arise I have to find solutions.
19
+ Once, there was quite a big miscommunication issue with the money we ask parents to pay for summer camp.
20
+ Some co-leaders had sent an unfinished draft of the camp invitation to the parents, in which the price for camp was way too low.
21
+ This resulted in plenty of parents not paying enough. The leaders in this situation only realized their mistake when camp was already under way,
22
+ and they did not have enough money to provide food for the entire camp. They contacted me with the question of what they should do.
23
+ Of course, it is annoying to ask parents for more money when they have already sent their kids to us.
24
+ Communication with parents had never been a strong skill of mine, but now I had no choice but to contact them all with a difficult question.
25
+ I decided the best we could do was own up to our mistake, show the parents we would try to solve it ourselves, but also let them know that they could still help.
26
+ We crafted some baskets with snacks and crafts with the kids and went to sell them on the streets close to our camp location,
27
+ but also let the parents know they could reserve some for after camp was over. They would pay in advance, and get the basket upon return of their kids.
28
+ This way, we did not have to ask the parents outright for more money, but gave them something in return and also made it into something fun for the kids.
29
  """
30
 
31
  # predictions