---

title: "🍇 Blueberry Yield Regression"
emoji: 🌾
colorFrom: indigo
colorTo: green
sdk: streamlit
app_file: app.py
pinned: true
license: mit
tags:
  - regression
  - machine-learning
  - streamlit
  - kaggle
  - agriculture
---


# 🍇 Blueberry Yield Prediction with Machine Learning

This project is a complete machine learning pipeline that predicts the **yield of wild blueberries** from environmental and biological features such as pollinator counts, rainfall, and fruit measurements.

## 📌 Project Type

- Supervised Learning
- Regression Problem

---

## 🔍 Problem Description

Predicting agricultural yield is crucial for planning, sustainability, and food economics. The dataset used in this project comes from the **Kaggle Playground Series S3E14** competition and contains information on:

- Different species of pollinators (honeybee, bumblebee, osmia...)
- Environmental conditions (rainfall days, temperature ranges...)
- Fruit attributes (fruit mass, fruit set, seed count...)

🎯 **Goal**: Predict the `yield` (kg/ha) of blueberries based on input features.

---

## 📊 Dataset Info

- `train.csv`: 15,289 samples with 18 columns (`id`, 16 input features, and the target `yield`)
- `test.csv`: same structure, no target
- No missing values, clean numerical data
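
A quick way to confirm these properties is sketched below using only the standard library. The function name `summarize_csv` and the three-row inline sample are illustrative assumptions, not part of the project code or the competition data.

```python
import csv
import io


def summarize_csv(fileobj):
    """Return (row_count, column_count, missing_cell_count) for a CSV stream."""
    reader = csv.reader(fileobj)
    header = next(reader)
    rows = 0
    missing = 0
    for row in reader:
        rows += 1
        # Count empty cells, plus cells absent from rows shorter than the header.
        missing += sum(1 for cell in row if cell.strip() == "")
        missing += max(0, len(header) - len(row))
    return rows, len(header), missing


# Hypothetical miniature sample, NOT real competition data.
sample = io.StringIO(
    "id,fruitset,fruitmass,seeds,yield\n"
    "0,0.42,0.41,32.1,4500.0\n"
    "1,0.50,0.45,,6100.0\n"
    "2,0.55,0.47,38.9,7000.0\n"
)
print(summarize_csv(sample))  # (3, 5, 1)
```

To check the real file, open it with `open("train.csv", newline="")` and pass the file object to the same function.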

---

## 📈 What We Did (Pipeline Summary)

1. **EDA (Exploratory Data Analysis)**  
   - Checked for missing values ✅  
   - Analyzed feature distributions & target (`yield`)  
   - Built correlation heatmaps — strongest positive correlations:  
     - `fruitmass`, `fruitset`, `seeds`

2. **Data Preprocessing**  
   - Removed `id` column  
   - Selected features based on their correlation with `yield`  
   - No categorical encoding needed (all numerical)

3. **Model Training**  
   - Model: `RandomForestRegressor`  
   - Train-Test Split: 80/20  
   - **Results**:  
     - RMSE ≈ **573.8**  
     - R² Score ≈ **0.81**  

4. **Test Prediction & Submission**  
   - Predictions made on `test.csv`  
   - `submission.csv` generated for Kaggle submission

5. **Streamlit App**  
   - Users input bee counts, rain days, and fruit measurements  
   - Predicts blueberry yield in kg/ha  
   - Uses trained model (`rf_model.pkl`) behind the scenes
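
The training and evaluation step can be sketched roughly as follows. This is a minimal illustration, not the project's actual training script: the helper name `train_and_evaluate`, the hyperparameters (`n_estimators=100`, `random_state=42`), and the synthetic data are all assumptions, so the printed metrics will not match the RMSE and R² reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(X, y, seed=42):
    """Fit a random forest on an 80/20 split and return (model, rmse, r2)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Hyperparameters here are assumed; the README does not state them.
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # RMSE is the square root of the mean squared error on the held-out 20%.
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    r2 = float(r2_score(y_test, pred))
    return model, rmse, r2


# Synthetic stand-in data (NOT the competition data): three features and a
# noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(500, 3))
y_demo = 4000 * X_demo[:, 0] + 2000 * X_demo[:, 1] + rng.normal(0, 100, size=500)

model, rmse, r2 = train_and_evaluate(X_demo, y_demo)
print(f"RMSE={rmse:.1f}  R2={r2:.3f}")
```

With the real data, `X` would be the feature columns of `train.csv` (after dropping `id` and `yield`) and `y` the `yield` column.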

---

## 🚀 Try it Online

🌐 You can try this app live here:  
[Hugging Face Space Link](https://huggingface.co/spaces/yazodi/blueberry-yield-regression-app)

---

## 🔮 What Could Be Improved?

| Area | Suggestion |
|------|------------|
| Feature Engineering | Create interaction terms, try log/ratio features |
| Model | Try LightGBM, XGBoost, or stacking |
| Tuning | GridSearchCV or Optuna for hyperparameter optimization |
| Visualization | Add interactive charts in Streamlit app |
| Real-World Data | Add satellite weather data, soil types, historical trends |
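
As one possible starting point for the feature-engineering suggestion, the sketch below adds a ratio feature and a log feature to a single record. The function name and the two derived feature names (`seeds_per_fruitmass`, `log_seeds`) are hypothetical, and whether they improve the model is untested.

```python
import math


def add_engineered_features(row):
    """Return a copy of `row` with one ratio feature and one log feature added."""
    out = dict(row)
    # Ratio feature: seeds per unit of fruit mass (guard against division by zero).
    out["seeds_per_fruitmass"] = (
        row["seeds"] / row["fruitmass"] if row["fruitmass"] else 0.0
    )
    # Log feature: log1p compresses large values and is defined at zero.
    out["log_seeds"] = math.log1p(row["seeds"])
    return out


# Hypothetical single record using column names mentioned in this README.
example = {"fruitset": 0.5, "fruitmass": 0.4, "seeds": 36.0}
print(add_engineered_features(example))
```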

---

## 📁 Project Structure

📦 blueberry-yield-regression
├── app.py
├── rf_model.pkl
├── model_columns.pkl
├── requirements.txt
├── submission.csv
└── README.md
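
`model_columns.pkl` presumably stores the list of feature names the model was trained on; that is an assumption, since its contents are not described here. Under that assumption, the app could order user inputs to match the trained columns roughly like this (the helper `build_feature_row` and the example values are hypothetical):

```python
def build_feature_row(user_inputs, model_columns):
    """Order user inputs to match the column order the model was trained on."""
    missing = [col for col in model_columns if col not in user_inputs]
    if missing:
        raise KeyError(f"missing inputs: {missing}")
    # Extra keys are ignored; only trained columns are passed on, in order.
    return [float(user_inputs[col]) for col in model_columns]


# Hypothetical column list and form values for illustration.
columns = ["fruitset", "fruitmass", "seeds"]
inputs = {"seeds": 36, "fruitmass": 0.45, "fruitset": 0.5}
print(build_feature_row(inputs, columns))  # [0.5, 0.45, 36.0]
```

The resulting list can be wrapped in an outer list and passed to the model's `predict` method as a single-row input.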


---

## 📜 License

MIT License – Free to use, modify and distribute.

---