---
title: "🍇 Blueberry Yield Regression"
emoji: 🌾
colorFrom: indigo
colorTo: green
sdk: streamlit
app_file: app.py
pinned: true
license: mit
tags:
- regression
- machine-learning
- streamlit
- kaggle
- agriculture
---
# 🍇 Blueberry Yield Prediction with Machine Learning
This project is a complete machine learning pipeline that predicts the **yield of wild blueberries** using various environmental and biological features such as pollinator counts, rainfall, and fruit measurements.
## 📌 Project Type
- Supervised Learning
- Regression Problem
---
## 🔍 Problem Description
Predicting agricultural yield is a crucial component in planning, sustainability, and food economics. The dataset used in this project comes from the **Kaggle Playground Series S3E14** competition and contains information on:
- Different species of pollinators (honeybee, bumblebee, osmia...)
- Environmental conditions (rainfall days, temperature ranges...)
- Fruit attributes (fruit mass, fruit set, seed count...)
🎯 **Goal**: Predict the `yield` (kg/ha) of blueberries based on input features.
---
## 📊 Dataset Info
- `train.csv`: 15,289 rows and 18 columns (`id`, 16 input features, and the target `yield`)
- `test.csv`: same structure, no target
- No missing values, clean numerical data
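
The checks behind these bullets can be sketched as a small helper. This is a minimal illustration, not the project's actual code: `summarize_dataset` is a hypothetical name, and the demo frame holds made-up values standing in for `train.csv`. On the real file it would be called as `summarize_dataset(pd.read_csv("train.csv"))`.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame) -> dict:
    """Return the basic facts worth checking before modelling."""
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        # Total number of missing cells across the whole frame.
        "n_missing": int(df.isna().sum().sum()),
        # Columns that would need encoding before training.
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }


if __name__ == "__main__":
    # Tiny made-up frame (illustrative values only, not real data).
    demo = pd.DataFrame(
        {"id": [0, 1, 2], "fruitset": [0.42, 0.51, 0.47], "yield": [4500.0, 6100.0, 5300.0]}
    )
    print(summarize_dataset(demo))
```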
---
## 📈 What We Did (Pipeline Summary)
1. **EDA (Exploratory Data Analysis)**
   - Checked for missing values ✅
   - Analyzed feature distributions & target (`yield`)
   - Built correlation heatmaps; strongest positive correlations:
     - `fruitmass`, `fruitset`, `seeds`
2. **Data Preprocessing**
   - Removed `id` column
   - Standard feature selection based on correlation
   - No categorical encoding needed (all numerical)
3. **Model Training**
   - Model: `RandomForestRegressor`
   - Train-Test Split: 80/20
   - **Results**:
     - RMSE ≈ **573.8**
     - R² Score ≈ **0.81** ✅
4. **Test Prediction & Submission**
   - Predictions made on `test.csv`
   - `submission.csv` generated for Kaggle submission
5. **Streamlit App**
   - Users input bee counts, rain days, and fruit measurements
   - Predicts blueberry yield in kg/ha
   - Uses trained model (`rf_model.pkl`) behind the scenes
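
The preprocessing and training steps above can be sketched roughly as follows. This is a hedged outline rather than the project's actual training script: the function name `train_and_evaluate`, the seed, and `n_estimators=100` are assumptions, and the correlation-based feature selection step is left out because its exact criteria are not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(df: pd.DataFrame, target: str = "yield", seed: int = 42):
    """Fit a random forest on an 80/20 split and score it on the held-out 20%."""
    # Drop the identifier (if present) and separate the target from the inputs.
    X = df.drop(columns=[target, "id"], errors="ignore")
    y = df[target]

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    # Hyperparameters here are assumed defaults, not the project's settings.
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)

    pred = model.predict(X_val)
    rmse = float(np.sqrt(mean_squared_error(y_val, pred)))
    r2 = float(r2_score(y_val, pred))
    # The column list is returned so the same order can be reused at prediction time.
    return model, list(X.columns), rmse, r2
```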
---
## 🚀 Try it Online
🌐 You can try this app live here:
[Hugging Face Space Link](https://huggingface.co/spaces/yazodi/blueberry-yield-regression-app)
---
## 🔮 What Could Be Improved?
| Area | Suggestion |
|------|------------|
| Feature Engineering | Create interaction terms, try log/ratio features |
| Model | Try LightGBM, XGBoost, or stacking |
| Tuning | GridSearchCV or Optuna for hyperparameter optimization |
| Visualization | Add interactive charts in Streamlit app |
| Real-World Data | Add satellite weather data, soil types, historical trends |
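
As one illustration of the tuning suggestion, a grid search with scikit-learn's `GridSearchCV` might look like the sketch below. The function name `tune_random_forest` and the grid values are illustrative assumptions, not tuned recommendations. After fitting, `search.best_params_` holds the chosen settings and `-search.best_score_` is the cross-validated RMSE.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def tune_random_forest(X: pd.DataFrame, y: pd.Series, seed: int = 42) -> GridSearchCV:
    """Run a cross-validated grid search over a small, illustrative grid."""
    param_grid = {
        "n_estimators": [50, 200],
        "max_depth": [None, 10],
        "min_samples_leaf": [1, 5],
    }
    search = GridSearchCV(
        RandomForestRegressor(random_state=seed),
        param_grid,
        # scikit-learn maximizes scores, so RMSE is reported negated.
        scoring="neg_root_mean_squared_error",
        cv=3,
    )
    search.fit(X, y)
    return search
```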
---
## 📁 Project Structure
📦 blueberry-yield-regression
├── app.py
├── rf_model.pkl
├── model_columns.pkl
├── requirements.txt
├── submission.csv
└── README.md
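
The presence of both `rf_model.pkl` and `model_columns.pkl` suggests the app keeps the training column order next to the model. A minimal sketch of how such a pair could be used at prediction time is shown below. It assumes `model_columns.pkl` holds a plain list of feature names, which this README does not confirm, and `build_input_row` is a hypothetical helper.

```python
import pandas as pd


def build_input_row(user_values: dict, model_columns: list) -> pd.DataFrame:
    """Arrange user-supplied values into a one-row frame in the model's column order."""
    missing = [c for c in model_columns if c not in user_values]
    if missing:
        raise ValueError(f"Missing inputs for: {missing}")
    # Selecting by model_columns fixes the order and drops any extra keys.
    return pd.DataFrame([user_values])[model_columns]


# Possible use inside app.py (file formats are assumptions):
#   import pickle
#   with open("rf_model.pkl", "rb") as f:
#       model = pickle.load(f)
#   with open("model_columns.pkl", "rb") as f:
#       model_columns = pickle.load(f)
#   prediction = model.predict(build_input_row(values, model_columns))[0]
```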
---
## 📜 License
MIT License – Free to use, modify and distribute.
---