{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Bird Species Classification (BirdCLEF 2024)**\n",
    "\n",
    "\n",
    "| **Made By** | **Supervisors** | **Help of** |\n",
    "| :--- | :--- | :--- |\n",
    "| LAGHJAJ ABDELLATIF | Youssef ES-SAADY & El HAJJI Mohamed | ChatGPT & Gemini |\n",
    "\n",
    "**Problem Statement:**\n",
    "\n",
    "The Western Ghats of India, a biodiversity hotspot, faces threats to its avian ‎biodiversity due to habitat loss and climate change. Traditional bird surveys are ‎costly, logistically challenging, and provide limited temporal and spatial coverage.‎\n",
    "\n",
    "**Solution:**\n",
    "\n",
    "This project utilizes Passive Acoustic Monitoring (PAM) and machine learning to ‎efficiently and scalably monitor bird species in the Western Ghats. \n",
    "\n",
    "**Benefits of PAM and Machine Learning:**\n",
    "\n",
    "* **Cost-Effective:** Automated recording devices are cheaper than human observers.\n",
    "* **Extensive Coverage:** Analyze vast amounts of audio data over large areas.\n",
    "* **Continuous Monitoring:** Real-time insights into avian biodiversity fluctuations.\n",
    "\n",
    "# **Import Libraries**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install noisereduce librosa plotly imbalanced-learn xgboost joblib tqdm\n",
    "# Data manipulation and analysis\n",
    "import os\n",
    "import glob\n",
    "import pickle\n",
    "from pathlib import Path\n",
    "\n",
    "# Audio processing\n",
    "import librosa\n",
    "import noisereduce as nr\n",
    "\n",
    "# Machine learning\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "import xgboost as xgb\n",
    "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
    "from sklearn.metrics import roc_auc_score, roc_curve, auc\n",
    "from sklearn.multiclass import OneVsRestClassifier\n",
    "from sklearn.preprocessing import label_binarize\n",
    "from imblearn.over_sampling import RandomOverSampler\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "\n",
    "# Visualization\n",
    "import matplotlib.pyplot as plt\n",
    "plt.style.use('dark_background')\n",
    "import seaborn as sns\n",
    "import plotly.express as px\n",
    "from IPython.display import Audio\n",
    "import pandas as pd\n",
    "from itertools import cycle\n",
    "from tqdm import tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Data Collection and Preparation**\n",
    "\n",
    "This section focuses on loading the dataset, performing initial exploration, and preparing the data for feature extraction. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load metadata\n",
    "meta_data = pd.read_csv('/kaggle/input/birdclef-2024/train_metadata.csv')\n",
    "meta_data.head(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploratory Data Analysis (EDA)\n",
    "\n",
    "Before diving into feature engineering and model building, it's crucial to understand the dataset. This section explores the data to gain insights into the distribution of bird species, geographical locations, and potential challenges like data imbalance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for missing values\n",
    "print(meta_data.isnull().sum())\n",
    "\n",
    "# Basic statistics\n",
    "print(meta_data.describe())\n",
    "\n",
    "# Number of birds in each species\n",
    "meta_data.primary_label.value_counts()\n",
    "\n",
    "# visualize the relationship between primary and secondary bird labels\n",
    "tmp = meta_data[meta_data['secondary_labels'] != '[]'].iloc[:40]\n",
    "fig = px.sunburst(tmp, path=['primary_label', 'secondary_labels'] )\n",
    "fig.update_layout(\n",
    "         title=\"Primary & Secondary Labels \",\n",
    "    )\n",
    "fig.show()\n",
    "\n",
    "# Create a copy of the df_train DataFrame\n",
    "tmp = meta_data.copy()\n",
    "\n",
    "# Create a scatter mapbox plot\n",
    "fig = px.scatter_mapbox(\n",
    "    tmp,  # Use the tmp DataFrame as the data source\n",
    "    lat=\"latitude\",  # Latitude column for plotting points\n",
    "    lon=\"longitude\",  # Longitude column for plotting points\n",
    "    color=\"primary_label\",  # Color points based on the primary_label column\n",
    "    zoom=0.1,  # Initial zoom level of the map\n",
    "    title='Bird Recordings Loaction'  # Title of the plot\n",
    ")\n",
    "\n",
    "# Update the layout of the plot to use the \"open-street-map\" style for the map background\n",
    "fig.update_layout(mapbox_style=\"open-street-map\")\n",
    "\n",
    "# Update the layout of the plot to set the margin around the map\n",
    "fig.update_layout(margin={\"r\":0,\"t\":30,\"l\":0,\"b\":0})\n",
    "\n",
    "# Display the scatter mapbox plot\n",
    "fig.show()\n",
    "\n",
    "fig = px.scatter_mapbox(meta_data, lat='latitude', lon='longitude', color='common_name',\n",
    "                        hover_name='common_name', hover_data=['latitude', 'longitude'],\n",
    "                        title='Origin of Bird Species',\n",
    "                        zoom=1, height=600, template='plotly_dark')\n",
    "fig.update_layout(\n",
    "    mapbox_style=\"white-bg\",\n",
    "    mapbox_layers=[\n",
    "        {\n",
    "            \"below\": 'traces',\n",
    "            \"sourcetype\": \"raster\",\n",
    "            \"sourceattribution\": \"United States Geological Survey\",\n",
    "            \"source\": [\n",
    "                \"https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}\"\n",
    "            ]\n",
    "        }\n",
    "      ])\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This visualization maps the geographical distribution of bird species from the 'meta_data' DataFrame. Each point represents a species, colored by its common name. Hovering provides details like latitude and longitude, offering insights into species distribution patterns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.displot(data=meta_data,x='latitude',bins=30,kde=True)\n",
    "sns.displot(data=meta_data,x='longitude',bins=30,kde=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These histograms visualize the distribution of latitude and longitude values in the dataset. This can help identify if data is concentrated in specific geographical regions, potentially leading to biases in the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Handling Missing Values: \n",
    "# Filling missing latitude and longitude with the median to avoid outlier influence\n",
    "median_latitude = meta_data['latitude'].median()\n",
    "median_longitude = meta_data['longitude'].median()\n",
    "\n",
    "meta_data['latitude'] = meta_data['latitude'].fillna(median_latitude)\n",
    "meta_data['longitude'] = meta_data['longitude'].fillna(median_longitude)\n",
    "\n",
    "# Verify if missing values are handled\n",
    "print(meta_data.isnull().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Audio Data Visualization and Exploration\n",
    "\n",
    "Visualizing audio data is essential to understand its characteristics. This section includes functions to display waveforms and spectrograms, providing visual representations of the audio signals. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def audio_waveframe(file_path):\n",
    "    \"\"\"Plots the audio waveform of a given audio file.\n",
    "\n",
    "    Args:\n",
    "        file_path (str): Path to the audio file.\n",
    "    \"\"\"\n",
    "    audio_data, sampling_rate = librosa.load(file_path)\n",
    "    duration = len(audio_data) / sampling_rate\n",
    "    time = np.arange(0, duration, 1/sampling_rate)\n",
    "    plt.figure(figsize=(30, 4))\n",
    "    plt.plot(time, audio_data, color='blue')\n",
    "    plt.title('Audio Waveform')\n",
    "    plt.xlabel('Time (s)')\n",
    "    plt.ylabel('Amplitude')\n",
    "    plt.show()\n",
    "\n",
    "def spectrogram(file_path):\n",
    "    \"\"\"Plots the spectrogram of a given audio file.\n",
    "\n",
    "    Args:\n",
    "        file_path (str): Path to the audio file.\n",
    "    \"\"\"\n",
    "    n_fft = 500  # Number of FFT points \n",
    "    hop_length = 50  # Hop length for STFT\n",
    "    audio_data, sampling_rate = librosa.load(file_path)\n",
    "    stft = librosa.stft(audio_data, n_fft=n_fft, hop_length=hop_length)\n",
    "    spectrogram = librosa.amplitude_to_db(np.abs(stft))\n",
    "    plt.figure(figsize=(30, 6))\n",
    "    librosa.display.specshow(spectrogram, sr=sampling_rate, hop_length=hop_length, x_axis='time', y_axis='linear')\n",
    "    plt.colorbar(format='%+2.0f dB')\n",
    "    plt.title('Spectrogram')\n",
    "    plt.xlabel('Time (s)')\n",
    "    plt.ylabel('Frequency (Hz)')\n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "\n",
    "def audio_analysis(file_path):\n",
    "    \"\"\"Displays the audio waveform and spectrogram of an audio file.\n",
    "\n",
    "    Args:\n",
    "        file_path (str): Path to the audio file.\n",
    "    \"\"\"\n",
    "    audio_waveframe(file_path)\n",
    "    spectrogram(file_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example usage:\n",
    "audio_analysis('/kaggle/input/birdclef-2024/train_audio/asbfly/XC134896.ogg')\n",
    "Audio('/kaggle/input/birdclef-2024/train_audio/asbfly/XC134896.ogg')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* **Waveform:** Visualizes the amplitude (volume) of the audio over time, showing changes in sound and potential patterns.\n",
    "* **Spectrogram:** Displays the frequency content of the audio over time. It's valuable for identifying bird calls, as different species have unique vocalization frequencies.\n",
    "\n",
    "# **Feature Engineering**\n",
    "\n",
    "This section covers extracting meaningful features from the audio data, which are crucial for training the machine learning model. We'll use Mel-Frequency Cepstral Coefficients (MFCCs) as our primary features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def normalize_and_denoise(audio, sample_rate):\n",
    "    \"\"\"Normalizes the audio signal and applies noise reduction.\n",
    "\n",
    "    Args:\n",
    "        audio (numpy.ndarray): The audio signal.\n",
    "        sample_rate (int): The sampling rate of the audio signal.\n",
    "\n",
    "    Returns:\n",
    "        numpy.ndarray: The normalized and denoised audio signal.\n",
    "    \"\"\"\n",
    "    audio = audio / np.max(np.abs(audio))\n",
    "    filtered_audio = nr.reduce_noise(y=audio, sr=sample_rate)\n",
    "    return filtered_audio"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NoiseReduce** is used here for noise reduction, a crucial preprocessing step to improve model accuracy by minimizing the impact of background noise on bird call detection. It uses spectral subtraction to estimate and remove noise from the audio."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to extract features from audio file\n",
    "def extract_features(file_path):\n",
    "    \"\"\"Extracts Mel-Frequency Cepstral Coefficients (MFCCs) from an audio file.\n",
    "\n",
    "    Args:\n",
    "        file_path (str): Path to the audio file.\n",
    "\n",
    "    Returns:\n",
    "        numpy.ndarray: A 1D array of MFCC features.\n",
    "    \"\"\"\n",
    "    # Load audio file\n",
    "    audio, sample_rate = librosa.load(file_path)\n",
    "      # Normalize and denoise audio\n",
    "    audio = normalize_and_denoise(audio, sample_rate)\n",
    "    # Extract features using Mel-Frequency Cepstral Coefficients (MFCC)\n",
    "    mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)\n",
    "    # Flatten the features into a 1D array\n",
    "    flattened_features = np.mean(mfccs.T, axis=0)\n",
    "    return flattened_features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Mel-Frequency Cepstral Coefficients (MFCCs)** are extracted as features. These coefficients efficiently capture the spectral envelope of sounds, making them highly effective for identifying bird calls. The choice of MFCCs is common in audio classification tasks due to their ability to represent sound characteristics relevant to human perception."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to load dataset and extract features\n",
    "def load_data_and_extract_features(data_dir):\n",
    "    labels = []\n",
    "    features = []\n",
    "    # Loop through each audio file in the dataset directory\n",
    "    for filename in os.listdir(data_dir):\n",
    "        if filename.endswith('.ogg'):\n",
    "            file_path = os.path.join(data_dir, filename)\n",
    "            # Extract label from filename\n",
    "            label = filename.split('-')[0]\n",
    "            labels.append(label)\n",
    "            # Extract features from audio file\n",
    "            feature = extract_features(file_path)\n",
    "            features.append(feature)\n",
    "    return np.array(features), np.array(labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code iterates through your audio dataset and prepares the data for modeling. It extracts MFCC features from each audio file and stores them along with their corresponding labels. This structured data will be used to train the bird species classification model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Path to the directory containing your audio dataset\n",
    "dataset_dir = '/kaggle/input/birdclef-2024/train_audio'\n",
    "# Initialize an empty dictionary to store the mapping between audio files and labels\n",
    "label_mapping = {}\n",
    "# Iterate over subdirectories (classes) in the dataset directory\n",
    "for label in os.listdir(dataset_dir):\n",
    "    label_dir = os.path.join(dataset_dir, label)\n",
    "    # Check if the item in the dataset directory is a directory\n",
    "    if os.path.isdir(label_dir):\n",
    "        # Iterate over audio files in the subdirectory (class)\n",
    "        for audio_file in os.listdir(label_dir):\n",
    "            # Add the mapping between audio file path and label to the dictionary\n",
    "            audio_file_path = os.path.join(label_dir, audio_file)\n",
    "            label_mapping[audio_file_path] = label\n",
    "\n",
    "# Create a list of tuples containing the audio file paths and labels\n",
    "data = [(audio_file_path, label) for audio_file_path, label in label_mapping.items()]\n",
    "# Create a Pandas DataFrame from the list of tuples\n",
    "annotated_data = pd.DataFrame(data, columns=['audio_file_path', 'label'])\n",
    "annotated_data\n",
    "\n",
    "extracted_features = []\n",
    "\n",
    "for i in tqdm(annotated_data['audio_file_path']):\n",
    "\n",
    "    features = extract_features(file_path=i)\n",
    "    # print(features)\n",
    "    extracted_features.append(features)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the extracted features using pickle\n",
    "with open(\"extracted_features.pkl\", \"wb\") as file:   \n",
    "\tpickle.dump(extracted_features, file)\n",
    "\n",
    "# Load the saved features\n",
    "with open(\"extracted_features.pkl\", \"rb\") as file:  \n",
    "\tpickled_extracted_features = pickle.load(file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code snippet performs feature extraction and saves the extracted features to disk using the `pickle` library. Saving features prevents the need to recompute them every time you run your code, saving significant processing time, especially for large datasets.\n",
    "\n",
    "## Label Encoding\n",
    "\n",
    "Machine learning models often require numerical input. This section encodes categorical bird species labels into numerical format using Label Encoding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "label_encoder = LabelEncoder()\n",
    "annotated_data['encoded_label'] = label_encoder.fit_transform(annotated_data['label'])\n",
    "annotated_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Label Encoding** converts categorical labels (bird species names) into numerical representations, a necessary step as most machine learning algorithms work with numerical data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Addressing Data Imbalance\n",
    "\n",
    "plt.figure(figsize=(24, 12))\n",
    "sns.countplot(x='primary_label', data=meta_data, order=meta_data['primary_label'].value_counts().index)\n",
    "plt.xticks(rotation=45)\n",
    "plt.rc('font', size=6)\n",
    "plt.title('Count of Bird Species Classes')\n",
    "plt.xlabel('Bird Species')\n",
    "plt.ylabel('Count')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This count plot visually represents the distribution of different bird species in your dataset. Imbalanced datasets, where some species have significantly more samples than others, are common in real-world scenarios and can lead to biased models. Techniques like oversampling or undersampling can be used to address this issue."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.vstack(pickled_extracted_features)\n",
    "y = annotated_data['encoded_label']\n",
    "\n",
    "print(x.shape)\n",
    "print(y.shape)\n",
    "\n",
    "ros = RandomOverSampler(random_state=42)\n",
    "features_resampled, labels_reshampled = ros.fit_resample(x, y)\n",
    "\n",
    "print(features_resampled.shape)\n",
    "print(labels_reshampled.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Random Oversampling** is applied to balance the class distribution. This technique creates synthetic samples for minority classes (bird species with fewer recordings) by randomly duplicating existing samples, helping to prevent the model from becoming biased towards majority classes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Model Training**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split data into training and testing sets\n",
    "x_train, x_test, y_train, y_test = train_test_split(features_resampled, labels_reshampled, test_size=0.2, random_state=42)\n",
    "\n",
    "# Initialize the XGBoost classifier\n",
    "xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42)\n",
    "model = xgb_model.fit(x_train, y_train)\n",
    "y_predict = model.predict(x_test)\n",
    "accuracy = accuracy_score(y_test, y_predict)\n",
    "print(\"Accuracy:\", accuracy)\n",
    "\n",
    "random_forest_classifier = RandomForestClassifier(n_estimators=100, random_state=42)\n",
    "random_forest_model = random_forest_classifier.fit(x_train, y_train)\n",
    "y_predict = random_forest_model.predict(x_test)\n",
    "accuracy = accuracy_score(y_test, y_predict)\n",
    "print(\"Accuracy:\", accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two machine learning models are trained: **XGBoost** and **Random Forest**. Both are ensemble learning methods known for good performance in classification tasks. Choosing the best model often involves experimentation and comparison of their performance metrics."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Hyperparameter tuning**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "import xgboost as xgb\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "# Define the parameter grid for XGBoost\n",
    "xgb_param_grid = {\n",
    "    'n_estimators': [50, 100],\n",
    "    'max_depth': [3, 4],\n",
    "\n",
    "}\n",
    "\n",
    "# Initialize the XGBoost classifier\n",
    "xgb_model = xgb.XGBClassifier(random_state=42)\n",
    "\n",
    "# Set up GridSearchCV\n",
    "xgb_grid_search = GridSearchCV(estimator=xgb_model, param_grid=xgb_param_grid,\n",
    "                               scoring='accuracy', cv=5, verbose=1, n_jobs=1)\n",
    "\n",
    "# Fit GridSearchCV\n",
    "xgb_grid_search.fit(x_train, y_train)\n",
    "\n",
    "# Get the best parameters and best score\n",
    "xgb_best_params = xgb_grid_search.best_params_\n",
    "xgb_best_score = xgb_grid_search.best_score_\n",
    "\n",
    "print(\"Best XGBoost Parameters:\", xgb_best_params)\n",
    "print(\"Best XGBoost CV Accuracy:\", xgb_best_score)\n",
    "\n",
    "# Train the model with the best parameters\n",
    "xgb_best_model = xgb_grid_search.best_estimator_\n",
    "y_predict = xgb_best_model.predict(x_test)\n",
    "accuracy = accuracy_score(y_test, y_predict)\n",
    "print(\"XGBoost Test Accuracy:\", accuracy)\n",
    "\n",
    "# Define the parameter grid for Random Forest\n",
    "rf_param_grid = {\n",
    "    'n_estimators': [50, 100],\n",
    "    'max_depth': [20, 30],\n",
    "\n",
    "}\n",
    "\n",
    "# Initialize the Random Forest classifier\n",
    "rf_model = RandomForestClassifier(random_state=42)\n",
    "\n",
    "# Set up GridSearchCV with limited parallel jobs\n",
    "rf_grid_search = GridSearchCV(estimator=rf_model, param_grid=rf_param_grid,\n",
    "                              scoring='accuracy', cv=5, verbose=1, n_jobs=1)  # Limited to 2 jobs\n",
    "\n",
    "# Fit GridSearchCV\n",
    "rf_grid_search.fit(x_train, y_train)\n",
    "\n",
    "# Get the best parameters and best score\n",
    "rf_best_params = rf_grid_search.best_params_\n",
    "rf_best_score = rf_grid_search.best_score_\n",
    "\n",
    "print(\"Best Random Forest Parameters:\", rf_best_params)\n",
    "print(\"Best Random Forest CV Accuracy:\", rf_best_score)\n",
    "\n",
    "# Train the model with the best parameters\n",
    "rf_best_model = rf_grid_search.best_estimator_\n",
    "y_predict = rf_best_model.predict(x_test)\n",
    "accuracy = accuracy_score(y_test, y_predict)\n",
    "print(\"Random Forest Test Accuracy:\", accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Hyperparameter Tuning** is performed to find the optimal settings for both models.  Grid Search is used here to systematically explore different parameter combinations and select the best-performing one. This step is crucial to maximize the model's predictive accuracy. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Model Evaluation**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_model(y_true, y_pred):\n",
    "    \"\"\"Calculates and prints various evaluation metrics for a classification model.\n",
    "\n",
    "    Args:\n",
    "        y_true (numpy.ndarray): True labels.\n",
    "        y_pred (numpy.ndarray): Predicted labels.\n",
    "    \"\"\"\n",
    "    accuracy = accuracy_score(y_true, y_pred)\n",
    "    precision = precision_score(y_true, y_pred, average='weighted')\n",
    "    recall = recall_score(y_true, y_pred, average='weighted')\n",
    "    f1 = f1_score(y_true, y_pred, average='weighted')\n",
    "    print(\"Accuracy:\", accuracy)\n",
    "    print(\"Precision:\", precision)\n",
    "    print(\"Recall:\", recall)\n",
    "    print(\"F1 Score:\", f1)\n",
    "    \n",
    "\n",
    "# Evaluate the model\n",
    "evaluate_model(y_test, y_predict)\n",
    "\n",
    "y_proba = random_forest_model.predict_proba(x_test)\n",
    "y_test_bin = label_binarize(y_test, classes=np.unique(y_train))\n",
    "n_classes = y_test_bin.shape[1]\n",
    "# Compute ROC curve and ROC area for each class\n",
    "fpr = dict()\n",
    "tpr = dict()\n",
    "roc_auc = dict()\n",
    "for i in range(n_classes):\n",
    "    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_proba[:, i])\n",
    "    roc_auc[i] = auc(fpr[i], tpr[i])\n",
    "\n",
    "# Compute micro-average ROC curve and ROC area\n",
    "fpr[\"micro\"], tpr[\"micro\"], _ = roc_curve(y_test_bin.ravel(), y_proba.ravel())\n",
    "roc_auc[\"micro\"] = auc(fpr[\"micro\"], tpr[\"micro\"])\n",
    "\n",
    "# Plot ROC curve for each class\n",
    "plt.figure()\n",
    "colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])\n",
    "for i, color in zip(range(n_classes), colors):\n",
    "    plt.plot(fpr[i], tpr[i], color=color, lw=2,\n",
    "             label='ROC curve of class {0} (area = {1:0.2f})'\n",
    "             ''.format(i, roc_auc[i]))\n",
    "\n",
    "plt.plot([0, 1], [0, 1], 'k--', lw=2)\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.ylim([0.0, 1.05])\n",
    "plt.xlabel('False Positive Rate')\n",
    "plt.ylabel('True Positive Rate')\n",
    "plt.title('Receiver Operating Characteristic (ROC) Curve')\n",
    "plt.legend(loc=\"lower right\")\n",
    "plt.show()\n",
    "\n",
    "# Calculate the average AUC score\n",
    "average_auc = np.mean(list(roc_auc.values()))\n",
    "print(\"Average AUC Score:\", average_auc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model's performance is assessed using metrics like Accuracy, Precision, Recall, and F1-score. These metrics provide a comprehensive view of the model's ability to classify bird species correctly.\n",
    "* **Accuracy:** Overall correctness of the model.\n",
    "* **Precision:** How well the model identifies only relevant instances.\n",
    "* **Recall:** How well the model identifies all relevant instances.\n",
    "* **F1-score:** Harmonic mean of precision and recall, balancing both metrics."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Model Testing and Deployment**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dump(random_forest_model, 'random_forest.joblib')\n",
    "\n",
    "model = load('/kaggle/working/random_forest.joblib')\n",
    "\n",
    "def audio_classification(file_path):\n",
    "    \"\"\"Classifies an audio file using the trained model.\n",
    "\n",
    "    Args:\n",
    "        file_path (str): Path to the audio file.\n",
    "\n",
    "    Returns:\n",
    "        str: The predicted bird species.\n",
    "    \"\"\"\n",
    "    extracted_features = extract_features(file_path).reshape(1, -1)\n",
    "    y_predict = model.predict(extracted_features)\n",
    "    predicted_label = label_encoder.inverse_transform(y_predict)[0]  # Convert back to original label\n",
    "    return f'Predicted Class: {predicted_label}'\n",
    "\n",
    "file_path = '/kaggle/input/birdclef-2024/unlabeled_soundscapes/1001358022.ogg'\n",
    "audio_analysis(file_path)\n",
    "Audio(file_path)\n",
    "\n",
    "audio_classification(file_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The trained model is loaded and used to classify a new audio sample. This demonstrates how the model can be deployed to make predictions on unseen data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Project Submission**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_soundscapes = '/kaggle/input/birdclef-2024/test_soundscapes'\n",
    "\n",
    "for path in Path(test_soundscapes).glob(\"*.ogg\"):\n",
    "    print(path)\n",
    "    print(path.stem)\n",
    "    print(path.stem.split(\"_\"))\n",
    "\n",
    "test = pd.DataFrame(\n",
    "     [(path.stem, *path.stem.split(\"_\"), path) for path in Path(test_soundscapes).glob(\"*.ogg\")],\n",
    "    columns = [\"filename\", \"name\" ,\"id\", \"path\"]\n",
    ")\n",
    "print(test.shape)\n",
    "test.head()\n",
    "\n",
    "filenames = test.filename.values.tolist()\n",
    "bird_cols = list(pd.get_dummies(meta_data['primary_label']).columns)\n",
    "submission_df = pd.DataFrame(columns=['row_id']+bird_cols)\n",
    "submission_df\n",
    "\n",
    "for i, file in enumerate(filenames):\n",
    "    predicted = model.predict[i]\n",
    "    num_rows = len(predicted)\n",
    "    row_ids = [f'{file}_{(i+1)*5}' for i in range(num_rows)]\n",
    "    df = pd.DataFrame(columns=['row_id']+bird_cols)\n",
    "\n",
    "    df['row_id'] = row_ids\n",
    "    df[bird_cols] = predicted\n",
    "\n",
    "    submission_df = pd.concat([submission_df,df]).reset_index(drop=True)\n",
    "\n",
    "submission_df\n",
    "\n",
    "submission_df.to_csv('submission.csv', index=False)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}