	---
	license: apache-2.0
	datasets:
	- nhull/tripadvisor-split-dataset-v2
	language:
	- en
	pipeline_tag: text-classification
	tags:
	- sentiment-analysis
	- logistic-regression
	- text-classification
	- hotel-reviews
	- tripadvisor
	- nlp
	---

	# Logistic Regression Sentiment Analysis Model

	This model is a Logistic Regression classifier trained on the TripAdvisor sentiment analysis dataset. It takes the text of a hotel review as input and predicts a sentiment rating from 1 to 5 stars.

	## Model Details

	- Model Type: Logistic Regression
	- Task: Sentiment Analysis
	- Input: A hotel review (text)
	- Output: Sentiment rating (1-5 stars)
	- Training Dataset: [nhull/tripadvisor-split-dataset-v2](https://huggingface.co/datasets/nhull/tripadvisor-split-dataset-v2)

	## Intended Use

	This model is designed to classify hotel reviews by sentiment. It assigns each review a star rating from 1 to 5 that reflects the sentiment the review expresses.

	---

	The model will return a sentiment rating between 1 and 5 stars, where:
	- 1: Very bad
	- 2: Bad
	- 3: Neutral
	- 4: Good
	- 5: Very good
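The label scale above can be turned into a small lookup when post-processing predictions. The sketch below is illustrative only: `RATING_DESCRIPTIONS` and `describe_rating` are hypothetical names that are not part of this repository, and it assumes predictions arrive as whole numbers, possibly stored as floats such as `4.0`.

```python
# Mapping copied from the label scale in this model card.
RATING_DESCRIPTIONS = {
    1: "Very bad",
    2: "Bad",
    3: "Neutral",
    4: "Good",
    5: "Very good",
}


def describe_rating(rating):
    """Return the description for a predicted star rating (hypothetical helper)."""
    stars = int(rating)
    # Reject fractional values (e.g. 2.5) and anything outside the 1-5 scale.
    if stars != rating or stars not in RATING_DESCRIPTIONS:
        raise ValueError(f"expected a whole rating from 1 to 5, got {rating!r}")
    return RATING_DESCRIPTIONS[stars]


print(describe_rating(4.0))  # prints "Good"
```

Raising on out-of-range values, rather than returning a default, makes it obvious when a prediction does not fit the 1-5 scale.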

	---

	### Dataset

	The dataset used for training, validation, and testing is [nhull/tripadvisor-split-dataset-v2](https://huggingface.co/datasets/nhull/tripadvisor-split-dataset-v2). It consists of:

	- Training Set: 30,400 reviews
	- Validation Set: 1,600 reviews
	- Test Set: 8,000 reviews

	All splits are balanced across five sentiment labels.
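As a quick arithmetic check on the split sizes listed above, the following sketch computes each split's share of the data and the number of reviews per label. It assumes "balanced" means an exactly equal count for each of the five labels; the variable names are illustrative.

```python
# Split sizes as listed in this model card.
SPLIT_SIZES = {"train": 30_400, "validation": 1_600, "test": 8_000}
NUM_LABELS = 5  # star ratings 1-5

total = sum(SPLIT_SIZES.values())

split_summary = {}
for name, size in SPLIT_SIZES.items():
    split_summary[name] = {
        "share": size / total,            # fraction of all reviews
        "per_label": size // NUM_LABELS,  # assumes perfectly balanced labels
    }
    print(f"{name:<10} {size:>6}  {size / total:6.1%}  {size // NUM_LABELS:>5} per label")
```

Under that assumption the 40,000 reviews split 76% / 4% / 20%, with 6,080, 320, and 1,600 reviews per label respectively.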

	---

	### Test Performance

	On average, the predicted rating differs from the true rating by `0.44` stars (mean absolute error).

	- Test Accuracy: 61.05%

	- Classification Report:

	| Label | Precision | Recall | F1-score | Support |
	|-------|-----------|--------|----------|---------|
	| 1.0 | 0.70 | 0.73 | 0.71 | 1600 |
	| 2.0 | 0.52 | 0.50 | 0.51 | 1600 |
	| 3.0 | 0.57 | 0.54 | 0.55 | 1600 |
	| 4.0 | 0.55 | 0.54 | 0.55 | 1600 |
	| 5.0 | 0.71 | 0.74 | 0.72 | 1600 |
	| Accuracy | - | - | 0.61 | 8000 |
	| Macro avg | 0.61 | 0.61 | 0.61 | 8000 |
	| Weighted avg | 0.61 | 0.61 | 0.61 | 8000 |

	- Confusion Matrix:

	| True \ Predicted | 1 | 2 | 3 | 4 | 5 |
	|-------------------|-------|-------|-------|-------|-------|
	| 1 | 1165 | 384 | 41 | 3 | 7 |
	| 2 | 432 | 805 | 315 | 31 | 17 |
	| 3 | 61 | 314 | 857 | 311 | 57 |
	| 4 | 3 | 48 | 264 | 870 | 415 |
	| 5 | 6 | 10 | 32 | 365 | 1187 |
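The reported metrics can be re-derived from the confusion matrix alone. The sketch below uses plain Python and makes no claim about how the original evaluation was run; rows are true labels and columns are predicted labels, as in the table above.

```python
LABELS = [1, 2, 3, 4, 5]

# Confusion matrix from this model card: rows = true label, columns = predicted label.
CONFUSION = [
    [1165, 384, 41, 3, 7],
    [432, 805, 315, 31, 17],
    [61, 314, 857, 311, 57],
    [3, 48, 264, 870, 415],
    [6, 10, 32, 365, 1187],
]

total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[i][i] for i in range(len(LABELS)))
accuracy = correct / total

# Per-class precision, recall, and F1.
per_class = {}
for i, label in enumerate(LABELS):
    true_count = sum(CONFUSION[i])                    # row sum (support)
    predicted_count = sum(row[i] for row in CONFUSION)  # column sum
    precision = CONFUSION[i][i] / predicted_count
    recall = CONFUSION[i][i] / true_count
    f1 = 2 * precision * recall / (precision + recall)
    per_class[label] = (precision, recall, f1)

# Error in stars: absolute (magnitude) and signed (direction, predicted minus true).
abs_error = 0
signed_error = 0
for i, true_label in enumerate(LABELS):
    for j, predicted_label in enumerate(LABELS):
        diff = predicted_label - true_label
        abs_error += abs(diff) * CONFUSION[i][j]
        signed_error += diff * CONFUSION[i][j]
mae = abs_error / total
mean_signed_error = signed_error / total

print(f"accuracy          = {accuracy:.4f}")
print(f"mean abs error    = {mae:.4f}")
print(f"mean signed error = {mean_signed_error:.4f}")
for label, (precision, recall, f1) in per_class.items():
    print(f"label {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Run as written, this gives an accuracy of 0.6105, a mean absolute error of about 0.44 stars, and a mean signed error of about 0.006 stars. The 0.44 figure quoted in this section therefore matches the average size of the error, while over- and under-predictions nearly cancel out.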

	---
	## Files Included

	- `validation_results_log_regression.csv`: Contains correctly classified reviews along with their true and predicted labels.

	---

	## Limitations

	- The model performs best on extreme ratings (F1 of 0.71 for 1 star and 0.72 for 5 stars) but struggles with intermediate ratings (F1 between 0.51 and 0.55 for 2, 3, and 4 stars).
	- The model was trained on the TripAdvisor dataset and may not generalize well to reviews from other sources or domains.
	- The model does not handle aspects like sarcasm or humor well, and shorter reviews may lead to less accurate predictions.
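To put the first limitation in perspective, it helps to check how far off the wrong predictions are. The sketch below reuses the test confusion matrix from this model card and counts predictions that miss by exactly one star; the variable names are illustrative.

```python
# Test confusion matrix from this model card: rows = true label, columns = predicted label.
CONFUSION = [
    [1165, 384, 41, 3, 7],
    [432, 805, 315, 31, 17],
    [61, 314, 857, 311, 57],
    [3, 48, 264, 870, 415],
    [6, 10, 32, 365, 1187],
]

n = len(CONFUSION)
total = sum(sum(row) for row in CONFUSION)

# Exact matches sit on the diagonal; off-by-one errors sit right next to it.
exact = sum(CONFUSION[i][i] for i in range(n))
off_by_one = sum(
    CONFUSION[i][j] for i in range(n) for j in range(n) if abs(i - j) == 1
)
errors = total - exact

within_one_accuracy = (exact + off_by_one) / total
adjacent_error_share = off_by_one / errors

print(f"exact accuracy       = {exact / total:.4f}")
print(f"within-one accuracy  = {within_one_accuracy:.4f}")
print(f"errors off by 1 star = {adjacent_error_share:.1%}")
```

On this matrix, about 96% of predictions land within one star of the true rating, and roughly 90% of the misclassifications are off by exactly one star. This suggests the difficulty with intermediate ratings is mostly confusion between neighbouring classes.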