{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ " # Create the dataset for the chatbox about League of Legends Lore\n", " Author: **HOANG Caroline**\n", "\n", " The notebook is setup and runnable for Google Collab" ], "metadata": { "id": "AUdgpvdZDa2l" } }, { "cell_type": "markdown", "source": [ "## 1. Setup" ], "metadata": { "id": "ZMtrx1lED6_D" } }, { "cell_type": "markdown", "source": [ "Chrome Installation" ], "metadata": { "id": "KtlLmE3cD_hS" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bULR-49jDSDU" }, "outputs": [], "source": [ "# Install Chrome\n", "!apt-get update -q\n", "!apt-get install -y wget curl unzip\n", "!wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb\n", "!dpkg -i google-chrome-stable_current_amd64.deb\n", "!apt --fix-broken install -y\n", "\n", "# Install dependencies for Chrome\n", "!apt-get install -y libx11-dev libx11-xcb1 libglu1-mesa libxi6 libgconf-2-4 libnss3 libxss1 libappindicator3-1 libasound2 libxtst6\n", "\n", "# Install ChromeDriver\n", "!wget -q https://chromedriver.storage.googleapis.com/113.0.5672.63/chromedriver_linux64.zip\n", "!unzip chromedriver_linux64.zip\n", "!mv chromedriver /usr/bin/chromedriver\n", "!chmod +x /usr/bin/chromedriver" ] }, { "cell_type": "markdown", "source": [ "Packages" ], "metadata": { "id": "8yLARgacEDjR" } }, { "cell_type": "code", "source": [ "pip install selenium webdriver-manager" ], "metadata": { "id": "Hb2k8Lh1EEbn" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## 2. Retrieve texts from website\n", "\n", "###Methodology\n", "1. Get all the champions name from https://leagueoflegends.fandom.com/wiki/List_of_champions and store it as a list using BeautifulSoul. As it also extract the champion role this can be used as extra information. \\\\\n", "2. For each champion {champion_name} we will go to https://universe.leagueoflegends.com/en_GB/story/champion/{champion_name}/ and extract the story text using BeautifulSoup. The name of the champion will go through a process to fit the url. \\\\\n", "BeautifulSoup extract the HTML text faster than Selenium. However, it sometimes fail to find the story text (such as Aurora, Ambessa) as it doesn't appear. In that case we will use Selenium.\n", "3. We create a dataframe that include the name of the champion associated with its role and story." 
], "metadata": { "id": "fy-ph5nDEoWp" } }, { "cell_type": "markdown", "source": [ "## Step 1: Collect champions name" ], "metadata": { "id": "fj2MCu5WGyDw" } }, { "cell_type": "code", "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "# URL of the list of champions\n", "url = \"https://leagueoflegends.fandom.com/wiki/List_of_champions\"\n", "\n", "# Send a GET request to fetch the page\n", "response = requests.get(url)\n", "\n", "# Parse the HTML content using BeautifulSoup\n", "soup = BeautifulSoup(response.text, 'html.parser')\n", "\n", "# Find all td tags with the data-sort-value attribute\n", "champion_td_tags = soup.find_all('td', attrs={'data-sort-value': True})\n", "\n", "# Extract the champion names from the data-sort-value attribute\n", "champions = [tag['data-sort-value'] for tag in champion_td_tags]" ], "metadata": { "id": "c3ckS_EpGz20" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Create champions name and role list from the previous extraction\n", "champions_name, champions_role = [], []\n", "for i, value in enumerate(champions):\n", " if i % 3 == 0:\n", " champions_name.append(value)\n", " if i % 3 == 1:\n", " champions_role.append(value)" ], "metadata": { "id": "eNdAqgrmG7ix" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Step 2: Extract its lore" ], "metadata": { "id": "_ZKEjxbFG-o9" } }, { "cell_type": "code", "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "champions_data = []\n", "import pandas as pd\n", "from selenium import webdriver\n", "from selenium.webdriver.chrome.service import Service\n", "from webdriver_manager.chrome import ChromeDriverManager\n", "from selenium.webdriver.common.by import By\n", "from selenium.webdriver.chrome.options import Options\n", "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.support import expected_conditions as EC\n", "import time\n", "\n", "# List of champions\n", "champions = champions_name\n", "# Configure options for headless Chrome\n", "chrome_options = Options()\n", "chrome_options.add_argument(\"--headless\")\n", "chrome_options.add_argument(\"--no-sandbox\")\n", "chrome_options.add_argument(\"--disable-dev-shm-usage\")\n", "service = Service(ChromeDriverManager().install())\n", "\n", "# Function to scrape champion data and store it\n", "def scrape_champion_data(champion_name):\n", " # Assuming champion_name is a string\n", " champion_name = champion_name.replace(\" \", \"\") # Remove spaces\n", " champion_name = champion_name.replace(\"'\", \"\") # Remove apostrophes\n", " url = f\"https://universe.leagueoflegends.com/en_GB/story/champion/{champion_name}/\"\n", " print(url)\n", " # Send a GET request to fetch the page\n", " response = requests.get(url)\n", " response.encoding = 'utf-8' # Set the encoding to UTF-8, or use response.apparent_encoding\n", " # Parse the HTML content using BeautifulSoup\n", " soup = BeautifulSoup(response.text, 'html.parser')\n", "\n", " # Find the tag with the biography content\n", " meta_tag = soup.find('meta', attrs={'name': 'description'})\n", " if meta_tag and champion_name != \"ambessa\":\n", " biography_text = meta_tag.get('content')\n", " champions_data.append({\n", " \"Champion\": champion_name,\n", " \"Text\": biography_text\n", " })\n", " else:\n", " try:\n", "\n", " browser = webdriver.Chrome(service=service, options=chrome_options)\n", "\n", " # Construct URL for the champion\n", " url = 
f\"https://universe.leagueoflegends.com/en_GB/story/champion/{champion_name}/\"\n", "\n", " # Open the page\n", " browser.get(url)\n", "\n", " # Wait for the page to load and the desired element to be visible\n", " WebDriverWait(browser, 10).until(\n", " EC.presence_of_element_located((By.CLASS_NAME, \"root_3nvd.dark_1RHo\"))\n", " )\n", "\n", " # Scroll down to ensure content is loaded\n", " browser.execute_script(\"window.scrollTo(0, document.body.scrollHeight / 2);\")\n", "\n", " # Extract the text content\n", " element = browser.find_element(By.CLASS_NAME, \"root_3nvd.dark_1RHo\")\n", " text_content = element.text\n", "\n", " # Append the data to the list\n", " champions_data.append({\n", " \"Champion\": champion_name,\n", " \"Text\": text_content\n", " })\n", " browser.quit()\n", " except:\n", " champions_data.append({\n", " \"Champion\": champion_name,\n", " \"Text\": f\"Biography content not found for {champion_name}.\"\n", " })\n", " print(\"Biography content not found.\")\n", "# Loop through the champions and scrape data\n", "for champion in champions:\n", " scrape_champion_data(champion.lower())\n" ], "metadata": { "id": "wlz85RVREnkp" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Step 3: Transform into dataframe" ], "metadata": { "id": "ESQDGQl2HDvW" } }, { "cell_type": "code", "source": [ "# Create a DataFrame from the list of champions data\n", "df1 = pd.DataFrame(champions_data)\n", "\n", "# Rename columns\n", "df1.columns = ['Champion', 'Story']\n", "\n", "# Add champions role\n", "df1['Role'] = champions_role\n", "\n", "# Undo the champions name changed for url\n", "df1['Champion'] = champions_name\n", "\n", "# Save it to a CSV file if you want\n", "d1f.to_csv(\"champions_data.csv\", index=False)" ], "metadata": { "id": "AHCHv4EJHGyg" }, "execution_count": null, "outputs": [] } ] }