---
title: Company Details Scraper
emoji: 🌍
colorFrom: green
colorTo: yellow
sdk: streamlit
sdk_version: 1.41.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Enter a URL to get details about the company
---


# **Company Details Scraper**
A **Streamlit-based web scraper** that extracts **detailed company information** from a given website and generates structured insights using **Groq AI**.

🚀 **Live Demo**: [Hugging Face Space](https://huggingface.co/spaces/kushagrasharma-13/company-details-scraper)

---

## **🔹 Features**
### ✅ **Website Scraping**
- Extracts **internal pages** of a company website (e.g., "About Us," "Products," "Services").
- Filters out **irrelevant links** (e.g., social media, login, privacy pages).
- Supports **multiple page extractions** in a single run.
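
The link collection and filtering described above can be sketched as follows. This is a minimal illustration, not the app's actual code: it uses only the Python standard library so it runs without dependencies (the app itself uses BeautifulSoup and Requests), and the `EXCLUDED_KEYWORDS` list, `LinkCollector`, and `internal_links` names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical keyword list; the app's real filter terms may differ.
EXCLUDED_KEYWORDS = ("login", "signin", "privacy", "terms")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(base_url, html):
    """Return sorted, de-duplicated same-host links that pass the keyword filter."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript:, ...
        if parsed.netloc != base_host:
            continue  # external site, which also drops social media links
        if any(word in absolute.lower() for word in EXCLUDED_KEYWORDS):
            continue  # irrelevant internal page such as login or privacy
        links.add(absolute)
    return sorted(links)


sample_html = """
<a href="/about">About Us</a>
<a href="/products#top">Products</a>
<a href="/login">Log in</a>
<a href="https://twitter.com/example">Twitter</a>
<a href="mailto:hi@example.com">Email</a>
"""
print(internal_links("https://example.com", sample_html))
# ['https://example.com/about', 'https://example.com/products']
```

Comparing hosts exactly treats `www.example.com` and `example.com` as different sites; a real crawler may want to normalize that.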

### ✅ **AI-Powered Data Extraction**
- Uses **Groq AI** to analyze and generate detailed insights from extracted content.
- Produces a structured **business overview**, including:
  - **Company Name & Overview**
  - **Mission & Vision**
  - **Detailed Product & Service Information**
  - **Target Audience**
  - **Business Model & Revenue Streams**
  - **Competitive Edge & Differentiation**
  - **Notable Clients & Case Studies**
  - **Industry Impact & Thought Leadership**
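
As a rough illustration of how the sections above could be requested from the model, the sketch below assembles a prompt with plain string formatting. The prompt wording and the `build_prompt` helper are assumptions made for this example; the app manages its prompts through LangChain and its actual template may look different.

```python
# Section names taken from the feature list above.
SECTIONS = [
    "Company Name & Overview",
    "Mission & Vision",
    "Detailed Product & Service Information",
    "Target Audience",
    "Business Model & Revenue Streams",
    "Competitive Edge & Differentiation",
    "Notable Clients & Case Studies",
    "Industry Impact & Thought Leadership",
]


def build_prompt(scraped_text):
    """Assemble a plain-text prompt asking for one block per section."""
    section_lines = "\n".join(
        f"{number}. {name}" for number, name in enumerate(SECTIONS, start=1)
    )
    return (
        "You are analyzing text scraped from a company website.\n"
        "Using ONLY the content below, write a structured overview "
        "with these sections:\n"
        f"{section_lines}\n"
        "If the content does not cover a section, write "
        "'Not found in the scraped content'.\n\n"
        f"CONTENT:\n{scraped_text}"
    )


print(build_prompt("Acme builds rockets for small satellites."))
```

Listing the sections explicitly gives the model a fixed outline, which makes per-chunk responses easier to merge later.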

### ✅ **Handles Large Content Efficiently**
- **Automatically splits** content into **manageable chunks** for AI processing.
- **Merges AI responses seamlessly** into a structured format.
- **Reduces the risk of AI hallucination** by instructing the model to base its responses only on the extracted data.
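
The splitting and merging steps can be sketched as below. This is one possible approach under stated assumptions, not necessarily what the app does: the character-based limit, the paragraph-boundary strategy, and the names `split_into_chunks` and `merge_responses` are all hypothetical.

```python
def split_into_chunks(text, max_chars=4000):
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split any single paragraph that exceeds the limit on its own.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the current chunk
        else:
            chunks.append(current)  # close the current chunk, start a new one
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def merge_responses(responses):
    """Join per-chunk responses in order, skipping empty ones."""
    return "\n\n".join(r.strip() for r in responses if r.strip())


text = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 100
chunks = split_into_chunks(text, max_chars=50)
print([len(chunk) for chunk in chunks])  # [30, 30, 50, 50]
print(merge_responses(["Overview: ...", "", "Products: ..."]))
```

A character limit is only a stand-in for the model's token limit; a production version would more likely count tokens.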

### ✅ **User-Friendly UI**
- **One-click scraping** – enter the URL and click once to start scraping and analysis.
- **Collapsible scraped content section** – neatly organized raw content for review.

---

## **🛠️ Installation Guide**
#### **1️⃣ Clone the Repository**
```bash
git clone https://huggingface.co/spaces/kushagrasharma-13/company-details-scraper
cd company-details-scraper
```

#### **2️⃣ Install Dependencies**
```bash
pip install -r requirements.txt
```

#### **3️⃣ Set Up Environment Variables**
Create a **`.env`** file and add your **Groq API key**:
```bash
echo "GROQ_API_KEY=your_groq_api_key_here" > .env
```
Or manually add it to your environment:
```bash
export GROQ_API_KEY="your_groq_api_key_here"
```
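
On the Python side, the key can be read from the environment and checked before any API call is made. The helper below is a small sketch with a hypothetical name (`get_groq_api_key`). Note that a `.env` file is not read by `os.environ` on its own, so this assumes the app loads it first, for example with a package such as python-dotenv (an assumption; check `requirements.txt`).

```python
import os


def get_groq_api_key(env=os.environ):
    """Return the Groq API key, or raise a clear error if it is not configured."""
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set. Add it to .env or export it "
            "before running the app."
        )
    return key


# Passing a plain dict makes the helper easy to try out without touching
# the real environment.
print(get_groq_api_key({"GROQ_API_KEY": "your_groq_api_key_here"}))
```

Failing early with an explicit message is usually easier to debug than an authentication error raised later by the API client.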

#### **4️⃣ Run the Application**
```bash
streamlit run app.py
```

---

## **⚙️ How to Use**
### **1️⃣ Enter the Company Website URL**
- Input the **base URL** of a company website (e.g., `https://example.com`).

### **2️⃣ Click "Scrape Website"**
- The scraper will extract **all relevant internal links** and fetch content.

### **3️⃣ View AI-Generated Insights**
- The AI will **process** the data and provide a **structured business breakdown**.

### **4️⃣ Expand Scraped Content (Optional)**
- A **collapsible section** allows you to review the raw scraped data.

---

## **🛠️ Tech Stack**
- **Python**
- **Streamlit** – Web UI
- **BeautifulSoup** – Web Scraping
- **Requests** – HTTP Requests
- **LangChain** – AI Prompt Management
- **Groq AI (Llama-3.3-70B-Versatile)** – Large Language Model

---

## **🎯 Limitations & Future Improvements**
### ⚠️ **Limitations**
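- **Cannot scrape pages rendered client-side by JavaScript** – content is fetched with Requests and parsed with BeautifulSoup, so scripts are never executed.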
- **Requires publicly accessible company websites** (no login-protected content).
- **Relies on well-structured website content** – poorly structured sites may yield incomplete data.

### 🔥 **Future Enhancements**
- **JavaScript rendering support** using **Selenium/Playwright**.
- **Multi-language support** for scraping global company websites.
- **AI-based entity recognition** to improve key data extraction.

---

## **📜 License**
This project is licensed under the **Apache License 2.0** – free to use and modify under its terms.

---

## **🙌 Acknowledgments**
Special thanks to **Hugging Face** for hosting this Space and to **Groq AI** for their powerful LLM capabilities.

---

### **🚀 Ready to get company insights?**
Run the scraper and generate **detailed company reports effortlessly**! 🔍

