---
title: Company Details Scraper
emoji: 🌍
colorFrom: green
colorTo: yellow
sdk: streamlit
sdk_version: 1.41.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Give URL get details about the company
---

# **Company Details Scraper**

A **Streamlit-based web scraper** that extracts **detailed company information** from a given website and generates structured insights using **Groq AI**.

🚀 **Live Demo**: [Hugging Face Space](https://huggingface.co/spaces/kushagrasharma-13/company-details-scraper)

---

## **🔹 Features**

### ✅ **Website Scraping**

- Extracts **internal pages** of a company website (e.g., "About Us," "Products," "Services").
- Filters out **irrelevant links** (e.g., social media, login, privacy pages).
- Supports **multiple page extractions** in a single run.

### ✅ **AI-Powered Data Extraction**

- Uses **Groq AI** to analyze the extracted content and generate detailed insights.
- Produces a structured **business overview**, including:
  - **Company Name & Overview**
  - **Mission & Vision**
  - **Detailed Product & Service Information**
  - **Target Audience**
  - **Business Model & Revenue Streams**
  - **Competitive Edge & Differentiation**
  - **Notable Clients & Case Studies**
  - **Industry Impact & Thought Leadership**

### ✅ **Handles Large Content Efficiently**

- **Automatically splits** content into **manageable chunks** for AI processing.
- **Merges the AI responses** into a single structured format.
- **Reduces the risk of AI hallucination** by restricting responses to the extracted data.

### ✅ **User-Friendly UI**

- **One-click scraping** – enter the URL and get insights in seconds.
- **Collapsible scraped content section** – neatly organized raw content for review.
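
The link filtering and chunking described in the features above could look roughly like the following. This is a minimal sketch using only the Python standard library, not the code in `app.py`: the names `filter_internal_links`, `split_into_chunks`, and `EXCLUDED_KEYWORDS`, the keyword list, and the 4000-character limit are all illustrative assumptions.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical keyword list; the app's actual filter rules are not documented here.
EXCLUDED_KEYWORDS = (
    "login", "signin", "privacy", "terms",
    "facebook.com", "twitter.com", "linkedin.com",
)


def filter_internal_links(base_url, hrefs):
    """Return de-duplicated absolute links on the same host as base_url."""
    base_host = urlparse(base_url).netloc
    kept = []
    for href in hrefs:
        # Resolve relative links and drop any #fragment.
        absolute = urljoin(base_url, href).split("#")[0]
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript:, etc.
        if parsed.netloc != base_host:
            continue  # skip external sites such as social media
        if any(keyword in absolute.lower() for keyword in EXCLUDED_KEYWORDS):
            continue  # skip login, privacy and similar pages
        if absolute not in kept:
            kept.append(absolute)
    return kept


def split_into_chunks(text, max_chars=4000):
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    hrefs = ["/about", "/about#team", "https://twitter.com/example", "/login", "products"]
    print(filter_internal_links("https://example.com", hrefs))
    print(split_into_chunks("a bb ccc dddd", max_chars=4))
```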

---

## **🛠️ Installation Guide**

#### **1️⃣ Clone the Repository**

```bash
git clone https://huggingface.co/spaces/kushagrasharma-13/company-details-scraper
cd company-details-scraper
```

#### **2️⃣ Install Dependencies**

```bash
pip install -r requirements.txt
```

#### **3️⃣ Set Up Environment Variables**

Create a **`.env`** file and add your **Groq API key** (note that `>` overwrites any existing `.env` file):

```bash
echo "GROQ_API_KEY=your_groq_api_key_here" > .env
```

Or set it in your shell environment:

```bash
export GROQ_API_KEY="your_groq_api_key_here"
```

#### **4️⃣ Run the Application**

```bash
streamlit run app.py
```

---

## **⚙️ How to Use**

### **1️⃣ Enter the Company Website URL**

- Input the **base URL** of a company website (e.g., `https://example.com`).

### **2️⃣ Click "Scrape Website"**

- The scraper will extract **all relevant internal links** and fetch their content.

### **3️⃣ View AI-Generated Insights**

- The AI will **process** the data and provide a **structured business breakdown**.

### **4️⃣ Expand Scraped Content (Optional)**

- A **collapsible section** allows you to review the raw scraped data.

---

## **🛠️ Tech Stack**

- **Python**
- **Streamlit** – Web UI
- **BeautifulSoup** – Web Scraping
- **Requests** – HTTP Requests
- **LangChain** – AI Prompt Management
- **Groq AI (Llama-3.3-70B-Versatile)** – Large Language Model

---

## **🎯 Limitations & Future Improvements**

### ⚠️ **Limitations**

- **Cannot scrape pages rendered dynamically with JavaScript**.
- **Requires publicly accessible company websites** (no login-protected content).
- **Relies on structured website content** – poorly structured sites may yield incomplete data.

### 🔥 **Future Enhancements**

- **JavaScript rendering support** using **Selenium/Playwright**.
- **Multi-language support** for scraping global company websites.
- **AI-based entity recognition** to improve key data extraction.

---

## **📜 License**

This project is licensed under the **Apache License 2.0** (`license: apache-2.0` in the Space metadata) – free to use and modify.


---

## **🙌 Acknowledgments**

Special thanks to **Hugging Face** for hosting this Space and **Groq AI** for their powerful LLM capabilities.

---

### **🚀 Ready to get company insights?**

Run the scraper and generate **detailed company reports effortlessly**! 🔍

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference