---
license: apache-2.0
datasets:
- nhull/tripadvisor-split-dataset-v2
language:
- en
pipeline_tag: text-classification
tags:
- sentiment-analysis
- logistic-regression
- text-classification
- hotel-reviews
- tripadvisor
- nlp
---

# Logistic Regression Sentiment Analysis Model

This model is a **Logistic Regression** classifier trained on the **TripAdvisor sentiment analysis dataset**. It takes the text of a hotel review as input and predicts a sentiment rating from 1 to 5 stars.

## Model Details

- **Model Type**: Logistic Regression
- **Task**: Sentiment Analysis
- **Input**: A hotel review (text)
- **Output**: Sentiment rating (1-5 stars)
- **Training Dataset**: [nhull/tripadvisor-split-dataset-v2](https://huggingface.co/datasets/nhull/tripadvisor-split-dataset-v2)

## Intended Use

This model is designed to classify hotel reviews by the sentiment they express.

---

**The model returns a sentiment rating** between 1 and 5 stars, where:
   - 1: Very bad
   - 2: Bad
   - 3: Neutral
   - 4: Good
   - 5: Very good
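
The rating scale above can be turned into a small lookup helper. This is a minimal sketch: the names `RATING_LABELS` and `describe_rating` are hypothetical and not part of the model's files. It accepts whole-number floats because the classification report lists labels as `1.0` to `5.0`.

```python
# Mapping from star rating to the sentiment label listed above.
# RATING_LABELS and describe_rating are illustrative names, not part of the model.
RATING_LABELS = {
    1: "Very bad",
    2: "Bad",
    3: "Neutral",
    4: "Good",
    5: "Very good",
}


def describe_rating(rating):
    """Return the sentiment label for a 1-5 star rating.

    Accepts ints or whole-number floats such as 4.0.
    """
    stars = int(rating)
    if stars != rating or stars not in RATING_LABELS:
        raise ValueError(f"expected a whole star rating from 1 to 5, got {rating!r}")
    return RATING_LABELS[stars]


print(describe_rating(4))    # Good
print(describe_rating(1.0))  # Very bad
```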

---

## Dataset

The dataset used for training, validation, and testing is [nhull/tripadvisor-split-dataset-v2](https://huggingface.co/datasets/nhull/tripadvisor-split-dataset-v2). It consists of:

- **Training Set**: 30,400 reviews
- **Validation Set**: 1,600 reviews
- **Test Set**: 8,000 reviews

All splits are balanced across five sentiment labels.
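
As a quick sanity check on the numbers above, the sketch below computes each split's share of the data and the number of reviews per label. It assumes "balanced" means exactly equal label counts, which the card does not state explicitly; the helper name `split_summary` is hypothetical.

```python
# Split sizes as listed above.
SPLIT_SIZES = {"train": 30_400, "validation": 1_600, "test": 8_000}
NUM_LABELS = 5  # star ratings 1-5


def split_summary(sizes, num_labels):
    """Return {split: (share_of_total, reviews_per_label)}.

    Assumes every label has exactly the same number of reviews in each split.
    """
    total = sum(sizes.values())
    summary = {}
    for name, size in sizes.items():
        per_label, remainder = divmod(size, num_labels)
        if remainder:
            raise ValueError(f"{name} split cannot be perfectly balanced")
        summary[name] = (size / total, per_label)
    return summary


for name, (share, per_label) in split_summary(SPLIT_SIZES, NUM_LABELS).items():
    print(f"{name:<10} {share:6.1%} of all reviews, {per_label} per label")
```

Under that assumption the test split holds 1,600 reviews per label, which matches the support column in the classification report.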

---

## Test Performance

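On average, the predicted rating differs from the true rating by `0.44` stars (mean absolute error, derived from the confusion matrix in this section). Over- and under-predictions nearly cancel out, so the mean signed error is close to zero and the model shows no systematic bias toward higher ratings.

- **Test Accuracy**: 61.05%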
  
- **Classification Report**:

| Label | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 1.0   | 0.70      | 0.73   | 0.71     | 1600    |
| 2.0   | 0.52      | 0.50   | 0.51     | 1600    |
| 3.0   | 0.57      | 0.54   | 0.55     | 1600    |
| 4.0   | 0.55      | 0.54   | 0.55     | 1600    |
| 5.0   | 0.71      | 0.74   | 0.72     | 1600    |
| **Accuracy** | -   | -      | **0.61**  | 8000    |
| **Macro avg** | 0.61 | 0.61   | 0.61     | 8000    |
| **Weighted avg** | 0.61 | 0.61 | 0.61     | 8000    |

- **Confusion Matrix**:

| True \\ Predicted |   1   |   2   |   3   |   4   |   5   |
|-------------------|-------|-------|-------|-------|-------|
| 1                 | 1165  |  384  |   41  |    3  |    7  |
| 2                 |  432  |  805  |  315  |   31  |   17  |
| 3                 |   61  |  314  |  857  |  311  |   57  |
| 4                 |    3  |   48  |  264  |  870  |  415  |
| 5                 |    6  |   10  |   32  |  365  | 1187  |
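
The headline figures can be recomputed from the confusion matrix alone. The sketch below is not the original evaluation script; the function names `summarize` and `per_class` are hypothetical. It derives accuracy, mean absolute error, mean signed error, and per-class precision and recall using only the counts in the table above.

```python
# Confusion matrix from the table above:
# rows = true rating (1-5), columns = predicted rating (1-5).
CONFUSION = [
    [1165, 384, 41, 3, 7],
    [432, 805, 315, 31, 17],
    [61, 314, 857, 311, 57],
    [3, 48, 264, 870, 415],
    [6, 10, 32, 365, 1187],
]


def summarize(matrix):
    """Return accuracy, mean absolute error, and mean signed error (in stars)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    abs_error = 0
    signed_error = 0
    for true_idx, row in enumerate(matrix):
        for pred_idx, count in enumerate(row):
            diff = pred_idx - true_idx  # positive = predicted too high
            signed_error += diff * count
            abs_error += abs(diff) * count
    return {
        "accuracy": correct / total,
        "mean_absolute_error": abs_error / total,
        "mean_signed_error": signed_error / total,
    }


def per_class(matrix):
    """Return {rating: (precision, recall)} for ratings 1..n."""
    n = len(matrix)
    result = {}
    for i in range(n):
        predicted_as_i = sum(matrix[r][i] for r in range(n))  # column sum
        truly_i = sum(matrix[i])                              # row sum
        result[i + 1] = (matrix[i][i] / predicted_as_i, matrix[i][i] / truly_i)
    return result


for name, value in summarize(CONFUSION).items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.6105
# mean_absolute_error: 0.4364
# mean_signed_error: 0.0064

for rating, (precision, recall) in per_class(CONFUSION).items():
    print(f"{rating} stars: precision={precision:.2f} recall={recall:.2f}")
```

The per-class values, rounded to two decimals, reproduce the precision and recall columns of the classification report.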

---

## Files Included

- **`validation_results_log_regression.csv`**: Contains correctly classified reviews with their true and predicted labels.

---

## Limitations

- The model performs best on extreme ratings (F1-score of 0.71 for 1 star and 0.72 for 5 stars) and struggles with intermediate ratings (F1-scores of 0.51 to 0.55 for 2, 3, and 4 stars).
- The model was trained on the **TripAdvisor** dataset and may not generalize well to reviews from other sources or domains.
- The model does not handle sarcasm or humor well, and predictions for shorter reviews may be less accurate.
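
The first limitation can be put in numbers by grouping test predictions by how many stars they miss by. The sketch below reuses the confusion matrix from the Test Performance section; the helper name `error_distance_counts` is hypothetical and this is not the author's analysis.

```python
# Confusion matrix from the Test Performance section:
# rows = true rating (1-5), columns = predicted rating (1-5).
CONFUSION = [
    [1165, 384, 41, 3, 7],
    [432, 805, 315, 31, 17],
    [61, 314, 857, 311, 57],
    [3, 48, 264, 870, 415],
    [6, 10, 32, 365, 1187],
]


def error_distance_counts(matrix):
    """Return {distance: count}, where distance = |predicted - true| in stars."""
    counts = {}
    for true_idx, row in enumerate(matrix):
        for pred_idx, count in enumerate(row):
            distance = abs(pred_idx - true_idx)
            counts[distance] = counts.get(distance, 0) + count
    return counts


counts = error_distance_counts(CONFUSION)
total = sum(counts.values())
errors = total - counts[0]

print(counts)  # {0: 4884, 1: 2800, 2: 270, 3: 33, 4: 13}
print(f"within one star: {(counts[0] + counts[1]) / total:.2%}")  # 96.05%
print(f"errors that miss by one star: {counts[1] / errors:.1%}")  # 89.9%
```

Most misclassifications land on a neighbouring rating, which is consistent with the model confusing adjacent star levels rather than reversing the sentiment.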