hç
committed on
Upload 6 files
- .gitattributes +2 -0
- README.md +22 -0
- app.py +26 -0
- project_description.txt +83 -0
- requirements.txt +5 -0
- ridge_model.pkl +3 -0
- tfidf_vectorizer.pkl +3 -0
.gitattributes
ADDED
@@ -0,0 +1,2 @@
ridge_model.pkl filter=lfs diff=lfs merge=lfs -text
tfidf_vectorizer.pkl filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,22 @@
---
title: CommonLit Summary Scorer
emoji: 📝
colorFrom: indigo
colorTo: green
sdk: streamlit
app_file: app.py
pinned: false
---


# ✨ Student Summary Auto-Scorer

This app uses a Ridge Regression model trained on the [CommonLit Evaluate Student Summaries](https://www.kaggle.com/competitions/commonlit-evaluate-student-summaries) dataset to automatically score student-written summaries.

**Predicted Scores:**
- `Content`
- `Wording`

🧠 Built with scikit-learn, TF-IDF, and Streamlit
🚀 Deployed using Hugging Face Spaces
🎓 Educational project
app.py
ADDED
@@ -0,0 +1,26 @@
import streamlit as st
import joblib

# Title
st.title("📝 Student Summary Scorer")
st.markdown("Enter your summary and we will automatically score its content and wording!")

# Get text from the user
text_input = st.text_area("✍️ Write your summary here", height=250)

# Load the model and the TF-IDF vectorizer
model = joblib.load("ridge_model.pkl")
tfidf = joblib.load("tfidf_vectorizer.pkl")

# Prediction button
if st.button("📊 Score"):
    if text_input.strip() == "":
        st.warning("Please enter a summary text.")
    else:
        # Vectorize the text and predict both scores
        X = tfidf.transform([text_input])
        preds = model.predict(X)[0]

        st.success("✅ Predictions complete:")
        st.write(f"**Content**: {round(preds[0], 2)} / 5")
        st.write(f"**Wording**: {round(preds[1], 2)} / 5")
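
The prediction step in `app.py` could also be pulled out of the button handler into a plain function, which would make it testable without launching Streamlit. The following is a minimal sketch under that assumption. `score_summary`, `StubVectorizer`, and `StubModel` are hypothetical names that do not appear in the commit; the stubs only stand in for the objects loaded from `tfidf_vectorizer.pkl` and `ridge_model.pkl`.

```python
from typing import Optional, Protocol, Sequence


class Vectorizer(Protocol):
    def transform(self, texts: Sequence[str]): ...


class Model(Protocol):
    def predict(self, features): ...


def score_summary(text: str, vectorizer: Vectorizer, model: Model) -> Optional[dict]:
    """Return rounded content/wording scores, or None for blank input."""
    if not text.strip():
        return None  # mirrors the app's empty-input warning branch
    features = vectorizer.transform([text])
    content, wording = model.predict(features)[0]
    return {"content": round(float(content), 2), "wording": round(float(wording), 2)}


# Hypothetical stand-ins so the sketch runs without the .pkl files.
class StubVectorizer:
    def transform(self, texts):
        return [[len(t.split())] for t in texts]  # one feature: word count


class StubModel:
    def predict(self, features):
        return [[row[0] * 0.1, row[0] * 0.2] for row in features]


result = score_summary("a short example summary", StubVectorizer(), StubModel())
blank = score_summary("   ", StubVectorizer(), StubModel())
```

In the app itself, the button handler would call `score_summary(text_input, tfidf, model)` and show the warning when it returns `None`.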
project_description.txt
ADDED
@@ -0,0 +1,83 @@
PROJECT TITLE: CommonLit Student Summary Scorer (Kaggle NLP Project)

OBJECTIVE:
This project is an NLP-based scoring system that automatically evaluates the quality of student-written summaries. It is based on the Kaggle competition "CommonLit Evaluate Student Summaries."

Kaggle Link: https://www.kaggle.com/competitions/commonlit-evaluate-student-summaries

---

DATA USED:
- summaries_train.csv → Training data (student summaries + scores)
- summaries_test.csv → Summaries to be scored
- sample_submission.csv → Sample submission format
- prompts_train.csv / prompts_test.csv → Additional metadata (not used in this project)

---

TARGET VARIABLES:
- content → Measures how well the summary captures the main idea (0–5 scale)
- wording → Measures clarity and expression quality (0–5 scale)

---

STEPS IMPLEMENTED:

1. DATA EXPLORATION
- Loaded and analyzed `summaries_train.csv`
- Focused on `text`, `content`, and `wording` columns

2. TEXT PROCESSING & MODELING
- Used `TfidfVectorizer` to convert text into numerical features
- Applied Ridge Regression inside a `MultiOutputRegressor`
- Model trained to predict both scores simultaneously
- Validation RMSE: **0.6819**
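
The modeling step above can be sketched roughly as follows. This is not the project's training script, which is not part of the commit. The toy texts and scores are hypothetical placeholders for `summaries_train.csv`, and `alpha=1.0` and the default `TfidfVectorizer` settings are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.multioutput import MultiOutputRegressor

# Hypothetical toy rows standing in for summaries_train.csv (not real data).
texts = [
    "It was about a story.",
    "The article explains how pollution harms the earth.",
    "The author argues that unchecked industry drives environmental damage.",
    "People talked and then something happened.",
]
# Columns: content, wording (illustrative values only).
targets = np.array([[1.0, 1.0], [3.0, 3.0], [4.5, 4.5], [1.0, 1.5]])

# Convert raw text to sparse TF-IDF features.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

# MultiOutputRegressor fits one Ridge regressor per target column.
model = MultiOutputRegressor(Ridge(alpha=1.0))
model.fit(X, targets)

# Square root of the mean squared error averaged over both target columns.
# Computed on the training rows here only to show the call; a real run
# would evaluate on a held-out validation split.
rmse = float(np.sqrt(mean_squared_error(targets, model.predict(X))))

# Predict both scores for an unseen summary; the result has shape (1, 2).
preds = model.predict(tfidf.transform(["A new summary about pollution."]))
```

The same fitted `tfidf` object must be reused at prediction time so that new text is mapped onto the vocabulary learned during training.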

3. PREDICTION & KAGGLE SUBMISSION
- Generated predictions on `summaries_test.csv`
- Filled predictions into the `sample_submission.csv` structure
- Created `submission.csv` for competition upload
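
As an illustration of the submission step, the sketch below writes predictions into a three-column CSV. It assumes the header is `student_id,content,wording`, which should be checked against `sample_submission.csv`. The IDs and scores are hypothetical, `build_submission` is an illustrative helper name, and the standard-library `csv` module is used here in place of pandas to keep the example self-contained.

```python
import csv
import io

# Hypothetical IDs and predictions; a real run would take the IDs from
# summaries_test.csv and the scores from model.predict().
student_ids = ["id_001", "id_002"]
predictions = [[1.234, 0.987], [3.5, 3.25]]


def build_submission(ids, preds) -> str:
    """Return CSV text with a header row and one row per summary."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["student_id", "content", "wording"])  # assumed header
    for sid, (content, wording) in zip(ids, preds):
        writer.writerow([sid, f"{content:.4f}", f"{wording:.4f}"])
    return buffer.getvalue()


submission_csv = build_submission(student_ids, predictions)
```

Writing `submission_csv` to a file named `submission.csv` would produce the upload artifact described above.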

4. STREAMLIT WEB APP (`app.py`)
- Developed a user-friendly web interface to input any summary
- Displays instant predictions for `content` and `wording` scores
- Exported model and vectorizer as `.pkl` files using `joblib`
- Deployed publicly using Hugging Face Spaces
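
The export mentioned above presumably relies on a `joblib.dump` / `joblib.load` round trip. A minimal sketch follows, with a plain dictionary as a hypothetical stand-in for the fitted model and a temporary directory in place of the repository folder.

```python
import os
import tempfile

import joblib

# A plain dict stands in for the fitted model (hypothetical placeholder).
artifact = {"name": "ridge_model", "alpha": 1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ridge_model.pkl")
    joblib.dump(artifact, path)   # serialize the object to disk
    restored = joblib.load(path)  # the same call app.py makes at startup
```

Pickled scikit-learn objects are generally safest to load with the same library version that produced them, so pinning versions in `requirements.txt` may be worth considering.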

---

TEST EXAMPLES:

[Weak Summary]
It was about a story. It was good. The people were talking and then something happened.
Expected Score: Content ≈ 1.0, Wording ≈ 1.0

[Intermediate Summary]
The article discusses the importance of environmental protection. It explains how pollution harms the earth and suggests ways to stop it.
Expected Score: Content ≈ 3.0, Wording ≈ 3.0

[Advanced Summary]
The summary articulates the author’s argument that environmental degradation is a result of unchecked industrial expansion. It effectively highlights key solutions such as policy reform, corporate accountability, and individual action to mitigate ecological damage.
Expected Score: Content ≈ 4.5, Wording ≈ 4.5

---

LIBRARIES USED:
- pandas
- numpy
- scikit-learn
- joblib
- streamlit

---

HOW TO RUN:
1. Use `streamlit run app.py` for local testing
2. Alternatively, access the deployed version via Hugging Face Spaces
3. Model files: `ridge_model.pkl`, `tfidf_vectorizer.pkl`

---

CREATED BY: [Hande Çarkcı]
DATE: [June 1, 2025]
PROJECT #: 2 of 20
requirements.txt
ADDED
@@ -0,0 +1,5 @@
streamlit
scikit-learn
pandas
joblib
numpy
ridge_model.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7847461d512d3a8c63cc88709507420f6cc6648046249b57a53fd50286d2774c
size 160856
tfidf_vectorizer.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8cddd5f8a72d4fd3cab8a19bd7ff58a02bb1f7236d99c1417e156b9fcc9197bf
size 373282
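
The two `.pkl` entries in this commit are Git LFS pointer files, not the binaries themselves. Each records a spec version, a SHA-256 object ID, and a byte size. The sketch below shows one way to parse such a pointer and check downloaded bytes against it. `parse_pointer` and `matches_pointer` are hypothetical helper names, and the pointer text is copied from the `tfidf_vectorizer.pkl` entry.

```python
import hashlib

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:8cddd5f8a72d4fd3cab8a19bd7ff58a02bb1f7236d99c1417e156b9fcc9197bf
size 373282
"""


def parse_pointer(text: str) -> dict:
    """Split each 'key value' line of an LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Check that the bytes have the size and SHA-256 the pointer records."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


pointer = parse_pointer(POINTER)
```

If a clone is made without Git LFS installed, the working tree contains only this pointer text, and `joblib.load` fails on it because it is not a pickle.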