# **Bird Species Classification (BirdCLEF 2024)**


| **Made By** | **Supervisors** | **Help of** |
| :--- | :--- | :--- |
| LAGHJAJ ABDELLATIF | Youssef ES-SAADY & El HAJJI Mohamed | ChatGPT & Gemini |

**Problem Statement:**

The Western Ghats of India, a biodiversity hotspot, faces threats to its avian biodiversity due to habitat loss and climate change. Traditional bird surveys are costly, logistically challenging, and provide limited temporal and spatial coverage.

**Solution:**

This project uses Passive Acoustic Monitoring (PAM) and machine learning to monitor bird species in the Western Ghats efficiently and at scale.

**Benefits of PAM and Machine Learning:**

* **Cost-Effective:** Automated recording devices are cheaper than human observers.
* **Extensive Coverage:** Analyze vast amounts of audio data over large areas.
* **Continuous Monitoring:** Real-time insights into avian biodiversity fluctuations.

# **Import Libraries**

In [None]:
!pip install noisereduce librosa plotly imbalanced-learn xgboost joblib tqdm
# Data manipulation and analysis
import os
import glob
import pickle
from pathlib import Path

# Numerical computing
import numpy as np

# Audio processing
import librosa
import librosa.display
import noisereduce as nr

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import GridSearchCV


# Visualization
import matplotlib.pyplot as plt
plt.style.use('dark_background')
import seaborn as sns
import plotly.express as px
from IPython.display import Audio
import pandas as pd
from itertools import cycle
from tqdm import tqdm

# **Data Collection and Preparation**

This section focuses on loading the dataset, performing initial exploration, and preparing the data for feature extraction. 

In [None]:
# Load metadata
meta_data = pd.read_csv('/kaggle/input/birdclef-2024/train_metadata.csv')
meta_data.head(4)

## Exploratory Data Analysis (EDA)

Before diving into feature engineering and model building, it's crucial to understand the dataset. This section explores the data to gain insights into the distribution of bird species, geographical locations, and potential challenges like data imbalance.

In [None]:
# Check for missing values
print(meta_data.isnull().sum())

# Basic statistics
print(meta_data.describe())

# Number of birds in each species
meta_data.primary_label.value_counts()

# visualize the relationship between primary and secondary bird labels
tmp = meta_data[meta_data['secondary_labels'] != '[]'].iloc[:40]
fig = px.sunburst(tmp, path=['primary_label', 'secondary_labels'] )
fig.update_layout(
 title="Primary & Secondary Labels ",
 )
fig.show()

# Create a copy of the meta_data DataFrame
tmp = meta_data.copy()

# Create a scatter mapbox plot
fig = px.scatter_mapbox(
 tmp, # Use the tmp DataFrame as the data source
 lat="latitude", # Latitude column for plotting points
 lon="longitude", # Longitude column for plotting points
 color="primary_label", # Color points based on the primary_label column
 zoom=0.1, # Initial zoom level of the map
 title='Bird Recordings Location' # Title of the plot
)

# Update the layout of the plot to use the "open-street-map" style for the map background
fig.update_layout(mapbox_style="open-street-map")

# Update the layout of the plot to set the margin around the map
fig.update_layout(margin={"r":0,"t":30,"l":0,"b":0})

# Display the scatter mapbox plot
fig.show()

fig = px.scatter_mapbox(meta_data, lat='latitude', lon='longitude', color='common_name',
 hover_name='common_name', hover_data=['latitude', 'longitude'],
 title='Origin of Bird Species',
 zoom=1, height=600, template='plotly_dark')
fig.update_layout(
 mapbox_style="white-bg",
 mapbox_layers=[
 {
 "below": 'traces',
 "sourcetype": "raster",
 "sourceattribution": "United States Geological Survey",
 "source": [
 "https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
 ]
 }
 ])
fig.show()

This visualization maps the geographical distribution of the recordings in the `meta_data` DataFrame. Each point represents one recording, colored by the common name of its species. Hovering over a point shows its latitude and longitude, which gives a sense of where each species was recorded.

In [None]:
sns.displot(data=meta_data,x='latitude',bins=30,kde=True)
sns.displot(data=meta_data,x='longitude',bins=30,kde=True)

These histograms visualize the distribution of latitude and longitude values in the dataset. This can help identify if data is concentrated in specific geographical regions, potentially leading to biases in the model.

In [None]:
# Handling Missing Values: 
# Filling missing latitude and longitude with the median to avoid outlier influence
median_latitude = meta_data['latitude'].median()
median_longitude = meta_data['longitude'].median()

meta_data['latitude'] = meta_data['latitude'].fillna(median_latitude)
meta_data['longitude'] = meta_data['longitude'].fillna(median_longitude)

# Verify if missing values are handled
print(meta_data.isnull().sum())

## Audio Data Visualization and Exploration

Visualizing audio data is essential to understand its characteristics. This section includes functions to display waveforms and spectrograms, providing visual representations of the audio signals. 

In [None]:
def audio_waveframe(file_path):
 """Plots the audio waveform of a given audio file.

 Args:
 file_path (str): Path to the audio file.
 """
 audio_data, sampling_rate = librosa.load(file_path)
 # One timestamp per sample, so the time axis always has the same length as the signal
 time = np.arange(len(audio_data)) / sampling_rate
 plt.figure(figsize=(30, 4))
 plt.plot(time, audio_data, color='blue')
 plt.title('Audio Waveform')
 plt.xlabel('Time (s)')
 plt.ylabel('Amplitude')
 plt.show()

def spectrogram(file_path):
 """Plots the spectrogram of a given audio file.

 Args:
 file_path (str): Path to the audio file.
 """
 n_fft = 500 # Number of FFT points 
 hop_length = 50 # Hop length for STFT
 audio_data, sampling_rate = librosa.load(file_path)
 stft = librosa.stft(audio_data, n_fft=n_fft, hop_length=hop_length)
 spectrogram = librosa.amplitude_to_db(np.abs(stft))
 plt.figure(figsize=(30, 6))
 librosa.display.specshow(spectrogram, sr=sampling_rate, hop_length=hop_length, x_axis='time', y_axis='linear')
 plt.colorbar(format='%+2.0f dB')
 plt.title('Spectrogram')
 plt.xlabel('Time (s)')
 plt.ylabel('Frequency (Hz)')
 plt.tight_layout()
 plt.show()

def audio_analysis(file_path):
 """Displays the audio waveform and spectrogram of an audio file.

 Args:
 file_path (str): Path to the audio file.
 """
 audio_waveframe(file_path)
 spectrogram(file_path)

In [None]:
# Example usage:
audio_analysis('/kaggle/input/birdclef-2024/train_audio/asbfly/XC134896.ogg')
Audio('/kaggle/input/birdclef-2024/train_audio/asbfly/XC134896.ogg')

* **Waveform:** Visualizes the amplitude (volume) of the audio over time, showing changes in sound and potential patterns.
* **Spectrogram:** Displays the frequency content of the audio over time. It's valuable for identifying bird calls, as different species have unique vocalization frequencies.
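
To make the spectrogram settings more concrete, the sketch below works out the time and frequency grid implied by `n_fft = 500` and `hop_length = 50`. It assumes the audio is loaded at 22,050 Hz, which is the default sampling rate of `librosa.load`; the helper name `stft_resolution` is hypothetical and not part of librosa.

```python
def stft_resolution(sample_rate: int, n_fft: int, hop_length: int) -> dict:
    """Return the time/frequency grid implied by a set of STFT parameters."""
    return {
        "n_freq_bins": 1 + n_fft // 2,             # rows of the spectrogram
        "bin_width_hz": sample_rate / n_fft,        # spacing between frequency bins
        "frame_step_s": hop_length / sample_rate,   # time between adjacent columns
        "max_freq_hz": sample_rate / 2,             # Nyquist frequency
    }


# 500 and 50 are the values used in spectrogram(); 22050 Hz is an assumption
grid = stft_resolution(sample_rate=22050, n_fft=500, hop_length=50)
for name, value in grid.items():
    print(f"{name}: {value}")
```

With these values each spectrogram column covers a step of roughly 2.3 ms and each row spans about 44 Hz, so the plot is finely resolved in time and comparatively coarse in frequency.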

# **Feature Engineering**

This section covers extracting meaningful features from the audio data, which are crucial for training the machine learning model. We'll use Mel-Frequency Cepstral Coefficients (MFCCs) as our primary features.

In [None]:
def normalize_and_denoise(audio, sample_rate):
 """Normalizes the audio signal and applies noise reduction.

 Args:
 audio (numpy.ndarray): The audio signal.
 sample_rate (int): The sampling rate of the audio signal.

 Returns:
 numpy.ndarray: The normalized and denoised audio signal.
 """
 audio = audio / np.max(np.abs(audio))
 filtered_audio = nr.reduce_noise(y=audio, sr=sample_rate)
 return filtered_audio

**NoiseReduce** is used here for noise reduction, a preprocessing step intended to improve model accuracy by reducing the impact of background noise on bird call detection. The library uses spectral gating: it estimates a noise threshold for each frequency band and attenuates the parts of the spectrogram that fall below it.
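
The normalization step divides by the peak amplitude, which produces NaN values for a completely silent clip. A minimal sketch of a guarded variant is shown below; the function name `peak_normalize` and the `eps` threshold are assumptions for illustration, not part of the original notebook or of any library.

```python
import numpy as np


def peak_normalize(audio: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Scale the signal so its largest absolute sample is 1.0.

    Signals whose peak is below eps (effectively silent) are returned
    unchanged to avoid dividing by zero.
    """
    peak = np.max(np.abs(audio)) if audio.size else 0.0
    if peak < eps:
        return audio.copy()
    return audio / peak


# A quiet 440 Hz tone (one second at an assumed 22050 Hz) and a silent clip
tone = 0.25 * np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)
normalized = peak_normalize(tone)
silent = peak_normalize(np.zeros(100))

print(np.max(np.abs(normalized)))
print(np.isnan(silent).any())
```

The same guard could be placed at the top of `normalize_and_denoise` before the call to `nr.reduce_noise`.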

In [None]:
# Function to extract features from audio file
def extract_features(file_path):
 """Extracts Mel-Frequency Cepstral Coefficients (MFCCs) from an audio file.

 Args:
 file_path (str): Path to the audio file.

 Returns:
 numpy.ndarray: A 1D array of MFCC features.
 """
 # Load audio file
 audio, sample_rate = librosa.load(file_path)
 # Normalize and denoise audio
 audio = normalize_and_denoise(audio, sample_rate)
 # Extract features using Mel-Frequency Cepstral Coefficients (MFCC)
 mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
 # Average each coefficient over time to get one fixed-length vector per clip
 flattened_features = np.mean(mfccs.T, axis=0)
 return flattened_features

**Mel-Frequency Cepstral Coefficients (MFCCs)** are extracted as features. These coefficients efficiently capture the spectral envelope of sounds, making them highly effective for identifying bird calls. The choice of MFCCs is common in audio classification tasks due to their ability to represent sound characteristics relevant to human perception.
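
The averaging step in `extract_features` is what turns recordings of different lengths into vectors of the same size. The sketch below illustrates this with random arrays standing in for real MFCC matrices (an assumption made so the example runs without audio files); `mean_pool` is a hypothetical helper name.

```python
import numpy as np


def mean_pool(mfccs: np.ndarray) -> np.ndarray:
    """Average an (n_mfcc, n_frames) matrix over time, giving shape (n_mfcc,)."""
    return mfccs.mean(axis=1)


rng = np.random.default_rng(0)
short_clip = rng.normal(size=(40, 120))   # stand-in for a short recording
long_clip = rng.normal(size=(40, 900))    # stand-in for a long recording

print(mean_pool(short_clip).shape, mean_pool(long_clip).shape)

# Equivalent to the expression used in extract_features
same = np.allclose(mean_pool(short_clip), np.mean(short_clip.T, axis=0))
print(same)
```

The trade-off is that averaging discards how the coefficients change over time, so two calls with a similar average spectrum but different temporal patterns become hard to tell apart.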

In [None]:
# Function to load dataset and extract features
# Note: expects a flat directory whose filenames start with the label followed by '-'
def load_data_and_extract_features(data_dir):
    labels = []
    features = []
    # Loop through each audio file in the dataset directory
    for filename in os.listdir(data_dir):
        if filename.endswith('.ogg'):
            file_path = os.path.join(data_dir, filename)
            # Extract label from filename
            label = filename.split('-')[0]
            labels.append(label)
            # Extract features from audio file
            feature = extract_features(file_path)
            features.append(feature)
    return np.array(features), np.array(labels)

This code iterates through your audio dataset and prepares the data for modeling. It extracts MFCC features from each audio file and stores them along with their corresponding labels. This structured data will be used to train the bird species classification model.

In [None]:
# Path to the directory containing your audio dataset
dataset_dir = '/kaggle/input/birdclef-2024/train_audio'
# Initialize an empty dictionary to store the mapping between audio files and labels
label_mapping = {}
# Iterate over subdirectories (classes) in the dataset directory
for label in os.listdir(dataset_dir):
    label_dir = os.path.join(dataset_dir, label)
    # Check if the item in the dataset directory is a directory
    if os.path.isdir(label_dir):
        # Iterate over audio files in the subdirectory (class)
        for audio_file in os.listdir(label_dir):
            # Add the mapping between audio file path and label to the dictionary
            audio_file_path = os.path.join(label_dir, audio_file)
            label_mapping[audio_file_path] = label

# Create a list of tuples containing the audio file paths and labels
data = [(audio_file_path, label) for audio_file_path, label in label_mapping.items()]
# Create a Pandas DataFrame from the list of tuples
annotated_data = pd.DataFrame(data, columns=['audio_file_path', 'label'])
annotated_data

extracted_features = []

for i in tqdm(annotated_data['audio_file_path']):

 features = extract_features(file_path=i)
 # print(features)
 extracted_features.append(features)

In [None]:
# Save the extracted features using pickle
with open("extracted_features.pkl", "wb") as file: 
	pickle.dump(extracted_features, file)

# Load the saved features
with open("extracted_features.pkl", "rb") as file: 
	pickled_extracted_features = pickle.load(file)

This code snippet performs feature extraction and saves the extracted features to disk using the `pickle` library. Saving features prevents the need to recompute them every time you run your code, saving significant processing time, especially for large datasets.
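
One way to get the time saving automatically is a small load-or-compute wrapper: it reads the pickle if it exists and only runs the expensive extraction otherwise. This is a sketch, not the notebook's code; `load_or_compute` and `fake_extraction` are hypothetical names, and the toy feature values are made up for the demonstration.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute_fn):
    """Return the cached result if present; otherwise compute, cache, and return it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def fake_extraction():
    """Stand-in for the slow feature extraction loop."""
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "features.pkl"
    first = load_or_compute(cache, fake_extraction)    # computes and writes the cache
    second = load_or_compute(cache, fake_extraction)   # reads the cache

print(first == second, len(calls))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.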

## Label Encoding

Machine learning models often require numerical input. This section encodes categorical bird species labels into numerical format using Label Encoding.

In [None]:
label_encoder = LabelEncoder()
annotated_data['encoded_label'] = label_encoder.fit_transform(annotated_data['label'])
annotated_data

**Label Encoding** converts categorical labels (bird species names) into numerical representations, a necessary step as most machine learning algorithms work with numerical data.
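
The mapping can be reproduced with plain NumPy, which makes it easier to see what the encoder stores. `LabelEncoder` assigns integers in the sorted order of the unique labels, and `np.unique` with `return_inverse=True` does the same. In the sketch below, `asbfly` comes from the dataset, while `species_b` and `species_c` are placeholder labels.

```python
import numpy as np

# 'species_b' and 'species_c' are placeholders, not real dataset labels
labels = np.array(["species_b", "asbfly", "species_c", "asbfly", "species_b"])

# classes: sorted unique labels; encoded: index of each label within classes
classes, encoded = np.unique(labels, return_inverse=True)

# Decoding is an index lookup, analogous to label_encoder.inverse_transform
decoded = classes[encoded]

print(classes.tolist())
print(encoded.tolist())
print(decoded.tolist())
```

Note that the integers carry no ordering between species; tree-based models such as the ones used here treat the target as a class index, not as a quantity.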

## Addressing Data Imbalance

In [None]:

plt.figure(figsize=(24, 12))
sns.countplot(x='primary_label', data=meta_data, order=meta_data['primary_label'].value_counts().index)
plt.xticks(rotation=45)
plt.rc('font', size=6)
plt.title('Count of Bird Species Classes')
plt.xlabel('Bird Species')
plt.ylabel('Count')
plt.show()

This count plot visually represents the distribution of different bird species in your dataset. Imbalanced datasets, where some species have significantly more samples than others, are common in real-world scenarios and can lead to biased models. Techniques like oversampling or undersampling can be used to address this issue.

In [None]:
x = np.vstack(pickled_extracted_features)
y = annotated_data['encoded_label']

print(x.shape)
print(y.shape)

ros = RandomOverSampler(random_state=42)
features_resampled, labels_reshampled = ros.fit_resample(x, y)

print(features_resampled.shape)
print(labels_reshampled.shape)

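**Random Oversampling** is applied to balance the class distribution. It does not create new synthetic samples; it randomly duplicates existing samples from minority classes (bird species with fewer recordings) until every class has as many samples as the largest one, which helps prevent the model from becoming biased towards majority classes. Because the duplication happens before the train/test split in this notebook, copies of the same recording can land in both sets, so the test scores reported later are likely optimistic.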
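
The idea behind random oversampling fits in a few lines of NumPy. The sketch below is an illustration of the technique, not the implementation used by `RandomOverSampler`; the function name `random_oversample` and the toy arrays are assumptions. The variables are named `X_train` and `y_train` to suggest one common practice: oversampling only the training portion after the split, so that held-out data contains no duplicates of training rows.

```python
import numpy as np


def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority-class rows until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement from this class's own samples
        extra = rng.choice(idx, size=target - count, replace=True)
        picked.append(np.concatenate([idx, extra]))
    picked = np.concatenate(picked)
    return X[picked], y[picked]


# Toy training set: 7 samples of class 0, 2 of class 1, 1 of class 2
X_train = np.arange(20, dtype=float).reshape(10, 2)
y_train = np.array([0] * 7 + [1] * 2 + [2])

X_bal, y_bal = random_oversample(X_train, y_train)
print(np.bincount(y_bal))
print(X_bal.shape)
```

Every row of `X_bal` is a copy of a row in `X_train`; no new feature values are created.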

# **Model Training**

In [None]:
# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(features_resampled, labels_reshampled, test_size=0.2, random_state=42)

# Initialize the XGBoost classifier
xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42)
model = xgb_model.fit(x_train, y_train)
y_predict = model.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Accuracy:", accuracy)

random_forest_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest_model = random_forest_classifier.fit(x_train, y_train)
y_predict = random_forest_model.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Accuracy:", accuracy)

Two machine learning models are trained: **XGBoost** and **Random Forest**. Both are ensemble learning methods known for good performance in classification tasks. Choosing the best model often involves experimentation and comparison of their performance metrics.

# **Hyperparameter tuning**

In [None]:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Define the parameter grid for XGBoost
xgb_param_grid = {
 'n_estimators': [50, 100],
 'max_depth': [3, 4],

}

# Initialize the XGBoost classifier
xgb_model = xgb.XGBClassifier(random_state=42)

# Set up GridSearchCV
xgb_grid_search = GridSearchCV(estimator=xgb_model, param_grid=xgb_param_grid,
 scoring='accuracy', cv=5, verbose=1, n_jobs=1)

# Fit GridSearchCV
xgb_grid_search.fit(x_train, y_train)

# Get the best parameters and best score
xgb_best_params = xgb_grid_search.best_params_
xgb_best_score = xgb_grid_search.best_score_

print("Best XGBoost Parameters:", xgb_best_params)
print("Best XGBoost CV Accuracy:", xgb_best_score)

# Evaluate the best estimator (GridSearchCV has already refit it on the full training set)
xgb_best_model = xgb_grid_search.best_estimator_
y_predict = xgb_best_model.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("XGBoost Test Accuracy:", accuracy)

# Define the parameter grid for Random Forest
rf_param_grid = {
 'n_estimators': [50, 100],
 'max_depth': [20, 30],

}

# Initialize the Random Forest classifier
rf_model = RandomForestClassifier(random_state=42)

# Set up GridSearchCV (runs sequentially, n_jobs=1)
rf_grid_search = GridSearchCV(estimator=rf_model, param_grid=rf_param_grid,
 scoring='accuracy', cv=5, verbose=1, n_jobs=1)

# Fit GridSearchCV
rf_grid_search.fit(x_train, y_train)

# Get the best parameters and best score
rf_best_params = rf_grid_search.best_params_
rf_best_score = rf_grid_search.best_score_

print("Best Random Forest Parameters:", rf_best_params)
print("Best Random Forest CV Accuracy:", rf_best_score)

# Evaluate the best estimator (GridSearchCV has already refit it on the full training set)
rf_best_model = rf_grid_search.best_estimator_
y_predict = rf_best_model.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Random Forest Test Accuracy:", accuracy)

**Hyperparameter Tuning** is performed to find the optimal settings for both models. Grid Search is used here to systematically explore different parameter combinations and select the best-performing one. This step is crucial to maximize the model's predictive accuracy. 
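
To see how much work a grid search implies, the sketch below expands the XGBoost grid from this section into its individual candidates and counts the cross-validation fits. `expand_grid` is a hypothetical helper written for illustration; it is not how `GridSearchCV` is implemented internally.

```python
from itertools import product


def expand_grid(param_grid):
    """List every combination of the values in a parameter grid."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Same grid as xgb_param_grid above
xgb_param_grid = {"n_estimators": [50, 100], "max_depth": [3, 4]}
candidates = expand_grid(xgb_param_grid)

cv_folds = 5
n_fits = len(candidates) * cv_folds

for candidate in candidates:
    print(candidate)
print("cross-validation fits:", n_fits)
```

With two values per parameter and five folds, each search trains 20 models, plus one final refit of the best candidate on the full training set (the default `refit=True` behaviour). The cost grows multiplicatively with every parameter added to the grid.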

# **Model Evaluation**

In [None]:
def evaluate_model(y_true, y_pred):
 """Calculates and prints various evaluation metrics for a classification model.

 Args:
 y_true (numpy.ndarray): True labels.
 y_pred (numpy.ndarray): Predicted labels.
 """
 accuracy = accuracy_score(y_true, y_pred)
 precision = precision_score(y_true, y_pred, average='weighted')
 recall = recall_score(y_true, y_pred, average='weighted')
 f1 = f1_score(y_true, y_pred, average='weighted')
 print("Accuracy:", accuracy)
 print("Precision:", precision)
 print("Recall:", recall)
 print("F1 Score:", f1)
 

# Evaluate the tuned Random Forest (y_predict was last set from rf_best_model)
evaluate_model(y_test, y_predict)

y_proba = random_forest_model.predict_proba(x_test)
y_test_bin = label_binarize(y_test, classes=np.unique(y_train))
n_classes = y_test_bin.shape[1]
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
 fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_proba[:, i])
 roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test_bin.ravel(), y_proba.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Plot ROC curve for each class
plt.figure()
colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
 plt.plot(fpr[i], tpr[i], color=color, lw=2,
 label='ROC curve of class {0} (area = {1:0.2f})'
 ''.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

# Calculate the macro-average AUC over the per-class curves (the micro-average is stored separately)
average_auc = np.mean([roc_auc[i] for i in range(n_classes)])
print("Macro-average AUC Score:", average_auc)

The model's performance is assessed using metrics like Accuracy, Precision, Recall, and F1-score. These metrics provide a comprehensive view of the model's ability to classify bird species correctly.
* **Accuracy:** Overall correctness of the model.
* **Precision:** How well the model identifies only relevant instances.
* **Recall:** How well the model identifies all relevant instances.
* **F1-score:** Harmonic mean of precision and recall, balancing both metrics.
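
The `average='weighted'` option used in `evaluate_model` computes each metric per class and then averages them, weighting each class by its number of true samples. The sketch below reproduces that calculation by hand on a tiny made-up example; `weighted_prf` is a hypothetical helper, and the labels are invented for illustration.

```python
import numpy as np


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 across the classes in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, recalls, f1s, supports = [], [], [], []
    for cls in np.unique(y_true):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
        supports.append(tp + fn)   # number of true samples of this class
    weights = np.array(supports) / np.sum(supports)
    return (float(np.dot(weights, precisions)),
            float(np.dot(weights, recalls)),
            float(np.dot(weights, f1s)))


# Made-up labels: one class-0 sample is misclassified as class 1
y_true_demo = [0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 1, 1, 2]

precision, recall, f1 = weighted_prf(y_true_demo, y_pred_demo)
print(precision, recall, f1)
```

In this example the weighted recall equals the accuracy (5 of 6 correct), which always holds for support-weighted recall, so it adds no information beyond accuracy. Macro averaging, which weights every species equally, is often more informative on imbalanced data.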

# **Model Testing and Deployment**

In [None]:
from joblib import dump, load

dump(random_forest_model, 'random_forest.joblib')

model = load('/kaggle/working/random_forest.joblib')

def audio_classification(file_path):
 """Classifies an audio file using the trained model.

 Args:
 file_path (str): Path to the audio file.

 Returns:
 str: The predicted bird species.
 """
 extracted_features = extract_features(file_path).reshape(1, -1)
 y_predict = model.predict(extracted_features)
 predicted_label = label_encoder.inverse_transform(y_predict)[0] # Convert back to original label
 return f'Predicted Class: {predicted_label}'

file_path = '/kaggle/input/birdclef-2024/unlabeled_soundscapes/1001358022.ogg'
audio_analysis(file_path)
Audio(file_path)

audio_classification(file_path)

The trained model is loaded and used to classify a new audio sample. This demonstrates how the model can be deployed to make predictions on unseen data.

# **Project Submission**

In [None]:
test_soundscapes = '/kaggle/input/birdclef-2024/test_soundscapes'

for path in Path(test_soundscapes).glob("*.ogg"):
 print(path)
 print(path.stem)
 print(path.stem.split("_"))

test = pd.DataFrame(
 [(path.stem, *path.stem.split("_"), path) for path in Path(test_soundscapes).glob("*.ogg")],
 columns = ["filename", "name" ,"id", "path"]
)
print(test.shape)
test.head()

filenames = test.filename.values.tolist()
bird_cols = list(pd.get_dummies(meta_data['primary_label']).columns)
submission_df = pd.DataFrame(columns=['row_id']+bird_cols)
submission_df

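for path, file in zip(test.path, filenames):
    # Assumption: each soundscape is scored in consecutive 5-second segments,
    # using the same MFCC recipe as extract_features.
    audio, sample_rate = librosa.load(path)
    segment_length = 5 * sample_rate
    segment_features = []
    for start in range(0, len(audio) - segment_length + 1, segment_length):
        segment = normalize_and_denoise(audio[start:start + segment_length], sample_rate)
        mfccs = librosa.feature.mfcc(y=segment, sr=sample_rate, n_mfcc=40)
        segment_features.append(np.mean(mfccs.T, axis=0))
    if not segment_features:
        continue
    # One row of class probabilities per segment; the column order follows model.classes_,
    # which is assumed to match the alphabetical order of bird_cols
    predicted = model.predict_proba(np.array(segment_features))
    row_ids = [f'{file}_{(k + 1) * 5}' for k in range(len(predicted))]
    df = pd.DataFrame(predicted, columns=bird_cols)
    df.insert(0, 'row_id', row_ids)

    submission_df = pd.concat([submission_df, df]).reset_index(drop=True)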

submission_df

submission_df.to_csv('submission.csv', index=False)
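
The `row_id` values in the submission combine the soundscape filename with the end time of each 5-second segment. A minimal sketch of that naming scheme follows; `make_row_ids` is a hypothetical helper and `soundscape_1234` is a made-up filename stem.

```python
def make_row_ids(stem, n_segments, segment_seconds=5):
    """Build one row id per segment: '<stem>_<segment end time in seconds>'."""
    return [f"{stem}_{(k + 1) * segment_seconds}" for k in range(n_segments)]


# 'soundscape_1234' is a placeholder stem, not a real test file
row_ids = make_row_ids("soundscape_1234", n_segments=3)
print(row_ids)
```

A recording of N seconds therefore contributes N // 5 rows, each paired with one probability per species column.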