---
license: apache-2.0
datasets:
- nhull/tripadvisor-split-dataset-v2
language:
- en
pipeline_tag: text-classification
tags:
- sentiment-analysis
- logistic-regression
- text-classification
- hotel-reviews
- tripadvisor
- nlp
---

# Logistic Regression Sentiment Analysis Model

This model is a **Logistic Regression** classifier trained on the **TripAdvisor sentiment analysis dataset**. It takes the text of a hotel review as input and predicts a sentiment rating from 1 to 5 stars.

## Model Details

- **Model Type**: Logistic Regression
- **Task**: Sentiment Analysis
- **Input**: A hotel review (text)
- **Output**: Sentiment rating (1-5 stars)
- **Trained Dataset**: [nhull/tripadvisor-split-dataset-v2](https://huggingface.co/datasets/nhull/tripadvisor-split-dataset-v2)

## Intended Use

This model is designed to classify English-language hotel reviews by sentiment, assigning each review a single star rating.

---

**The model will return a sentiment rating** between 1 and 5 stars, where:
   - 1: Very bad
   - 2: Bad
   - 3: Neutral
   - 4: Good
   - 5: Very good
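
The scale above can be sketched as a small lookup. `RATING_DESCRIPTIONS` and `describe_rating` are hypothetical names, not part of this repository; the helper accepts floats because the classification report lists labels as `1.0` to `5.0`.

```python
# Descriptions copied from the rating scale in this model card.
RATING_DESCRIPTIONS = {
    1: "Very bad",
    2: "Bad",
    3: "Neutral",
    4: "Good",
    5: "Very good",
}


def describe_rating(rating):
    """Hypothetical helper: map a predicted star rating to its description.

    Accepts ints or whole-valued floats such as 4.0.
    """
    stars = int(rating)
    if stars != rating or stars not in RATING_DESCRIPTIONS:
        raise ValueError(f"expected a whole rating from 1 to 5, got {rating!r}")
    return RATING_DESCRIPTIONS[stars]


print(describe_rating(4.0))
```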

---

## Dataset

The dataset used for training, validation, and testing is [nhull/tripadvisor-split-dataset-v2](https://huggingface.co/datasets/nhull/tripadvisor-split-dataset-v2). It consists of:

- **Training Set**: 30,400 reviews
- **Validation Set**: 1,600 reviews
- **Test Set**: 8,000 reviews

All splits are balanced across five sentiment labels.
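
As a quick arithmetic check, the sketch below derives the per-label count and the share of each split from the sizes listed above. It assumes that "balanced" means every label has exactly the same number of reviews in a split; the variable names are illustrative.

```python
# Split sizes as stated in this model card.
split_sizes = {"train": 30_400, "validation": 1_600, "test": 8_000}
NUM_LABELS = 5  # star ratings 1 to 5

total = sum(split_sizes.values())
summary = {}
for name, size in split_sizes.items():
    # Assumption: a balanced split holds the same number of reviews per label.
    per_label, remainder = divmod(size, NUM_LABELS)
    if remainder:
        raise ValueError(f"{name} cannot be split evenly across {NUM_LABELS} labels")
    summary[name] = {"per_label": per_label, "share": size / total}
    print(f"{name}: {size} reviews, {per_label} per label, {size / total:.0%} of all data")
```

Under that assumption the splits work out to 6,080, 320, and 1,600 reviews per label, and the test figure agrees with the `Support` column of the classification report.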

--- 

## Test Performance

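On average, the model's predictions deviate from the true rating by `0.44` stars (mean absolute error computed from the confusion matrix).

- **Test Accuracy**: 61.05% (4,884 of 8,000 test reviews classified correctly).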
  
- **Classification Report**:

| Label | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 1.0   | 0.70      | 0.73   | 0.71     | 1600    |
| 2.0   | 0.52      | 0.50   | 0.51     | 1600    |
| 3.0   | 0.57      | 0.54   | 0.55     | 1600    |
| 4.0   | 0.55      | 0.54   | 0.55     | 1600    |
| 5.0   | 0.71      | 0.74   | 0.72     | 1600    |
| **Accuracy** | -   | -      | **0.61**  | 8000    |
| **Macro avg** | 0.61 | 0.61   | 0.61     | 8000    |
| **Weighted avg** | 0.61 | 0.61 | 0.61     | 8000    |

- **Confusion Matrix**:

| True \\ Predicted |   1   |   2   |   3   |   4   |   5   |
|-------------------|-------|-------|-------|-------|-------|
| 1                 | 1165  |  384  |   41  |    3  |    7  |
| 2                 |  432  |  805  |  315  |   31  |   17  |
| 3                 |   61  |  314  |  857  |  311  |   57  |
| 4                 |    3  |   48  |  264  |  870  |  415  |
| 5                 |    6  |   10  |   32  |  365  | 1187  |
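
The headline figures can be re-derived from the confusion matrix alone. The following is a minimal sketch in plain Python, not the original evaluation script; it treats rows as true labels and columns as predicted labels, as in the table above.

```python
LABELS = [1, 2, 3, 4, 5]

# Rows are true labels, columns are predicted labels (copied from the table).
CONFUSION = [
    [1165, 384, 41, 3, 7],
    [432, 805, 315, 31, 17],
    [61, 314, 857, 311, 57],
    [3, 48, 264, 870, 415],
    [6, 10, 32, 365, 1187],
]

n = len(LABELS)
total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[i][i] for i in range(n))
accuracy = correct / total

# Error in stars for each cell is (predicted - true), weighted by the cell count.
signed_sum = sum(
    (LABELS[j] - LABELS[i]) * CONFUSION[i][j] for i in range(n) for j in range(n)
)
abs_sum = sum(
    abs(LABELS[j] - LABELS[i]) * CONFUSION[i][j] for i in range(n) for j in range(n)
)
mae = abs_sum / total
mean_signed_error = signed_sum / total

# Per-label precision, recall, and F1.
report = {}
for k, label in enumerate(LABELS):
    true_positives = CONFUSION[k][k]
    predicted_k = sum(CONFUSION[i][k] for i in range(n))  # column sum
    actual_k = sum(CONFUSION[k])  # row sum
    precision = true_positives / predicted_k
    recall = true_positives / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = {"precision": precision, "recall": recall, "f1": f1}

print(f"accuracy={accuracy:.4f} mae={mae:.3f} mean_signed_error={mean_signed_error:+.3f}")
for label, metrics in report.items():
    print(label, {name: round(value, 2) for name, value in metrics.items()})
```

The matrix gives an accuracy of 0.6105 and a mean absolute error of about 0.44 stars, while the mean signed error is close to zero (about +0.006), so over- and under-predictions nearly cancel out.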

---

## Files Included

- **`validation_results_log_regression.csv`**: Contains correctly classified reviews with their real and predicted labels.

---

## Limitations

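- The model performs best on extreme ratings (F1 of 0.71 for 1 star and 0.72 for 5 stars) and struggles with intermediate ratings (F1 of 0.51 to 0.55 for 2, 3, and 4 stars), which it mostly confuses with adjacent ratings.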
- The model was trained on the **TripAdvisor** dataset and may not generalize well to reviews from other sources or domains.
- The model does not handle aspects like sarcasm or humor well, and shorter reviews may lead to less accurate predictions.
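
The first limitation can be made more concrete by grouping test predictions by how many stars they miss the true rating. This is an illustrative sketch based only on the confusion matrix from the Test Performance section; the variable names are hypothetical.

```python
from collections import Counter

# Rows are true labels 1-5, columns are predicted labels 1-5.
CONFUSION = [
    [1165, 384, 41, 3, 7],
    [432, 805, 315, 31, 17],
    [61, 314, 857, 311, 57],
    [3, 48, 264, 870, 415],
    [6, 10, 32, 365, 1187],
]

# Count test reviews by absolute distance between predicted and true rating.
by_distance = Counter()
for true_idx, row in enumerate(CONFUSION):
    for pred_idx, count in enumerate(row):
        by_distance[abs(pred_idx - true_idx)] += count

total = sum(by_distance.values())
errors = total - by_distance[0]
off_by_one_share = by_distance[1] / errors
within_one_star = (by_distance[0] + by_distance[1]) / total

for distance in sorted(by_distance):
    print(f"off by {distance} star(s): {by_distance[distance]}")
print(f"share of errors that are off by one star: {off_by_one_share:.1%}")
print(f"predictions within one star of the truth: {within_one_star:.2%}")
```

Most misclassifications land on a neighbouring rating rather than at the opposite end of the scale, which is consistent with the weaker scores on 2, 3, and 4 stars: each of those labels has two adjacent ratings to be confused with.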