metadata

title: Company Details Scraper
emoji: 🌍
colorFrom: green
colorTo: yellow
sdk: streamlit
sdk_version: 1.41.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Give URL get details about the company

Company Details Scraper

A Streamlit-based web scraper that extracts detailed company information from a given website and generates structured insights using Groq AI.

🚀 Live Demo: Hugging Face Space

🔹 Features

✅ Website Scraping

Extracts internal pages of a company website (e.g., "About Us," "Products," "Services").
Filters out irrelevant links (e.g., social media, login, privacy pages).
Supports multiple page extractions in a single run.
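The link discovery and filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code in app.py: it uses the standard library's HTMLParser instead of BeautifulSoup so it runs without extra dependencies, and the SKIP_KEYWORDS list and the extract_internal_links name are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical keyword list; the app's real filter may differ.
SKIP_KEYWORDS = ("login", "signin", "privacy", "terms")


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_internal_links(html, base_url):
    """Return sorted, de-duplicated same-host links, minus filtered ones."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    base_host = urlparse(base_url).netloc
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute = urljoin(base_url, href).split("#")[0]
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skips mailto:, tel:, javascript:, etc.
        if parsed.netloc != base_host:
            continue  # external link, e.g. social media
        if any(word in parsed.path.lower() for word in SKIP_KEYWORDS):
            continue  # login, privacy, and similar pages
        links.add(absolute)
    return sorted(links)
```

Comparing hosts drops social media links without a separate blocklist, since those always point to another domain. The keyword filter only needs to cover unwanted pages on the company's own site.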

✅ AI-Powered Data Extraction

Uses Groq AI to analyze and generate detailed insights from extracted content.
Produces a structured business overview, including:
- Company Name & Overview
- Mission & Vision
- Detailed Product & Service Information
- Target Audience
- Business Model & Revenue Streams
- Competitive Edge & Differentiation
- Notable Clients & Case Studies
- Industry Impact & Thought Leadership

✅ Handles Large Content Efficiently

Automatically splits content into manageable chunks for AI processing.
Merges AI responses seamlessly into a structured format.
Reduces the risk of AI hallucination by instructing the model to answer only from the extracted data.
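As an illustration of the chunk-and-merge idea, here is a minimal sketch. It assumes a character-based limit with breaks on whitespace; the actual chunk size, splitting rule, and merge logic in app.py may differ, and split_into_chunks and merge_responses are hypothetical names.

```python
def split_into_chunks(text, max_chars=4000):
    """Split text into chunks of at most max_chars, breaking on whitespace."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is hard-split.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def merge_responses(responses):
    """Join per-chunk model responses into one text, dropping blank ones."""
    return "\n\n".join(r.strip() for r in responses if r.strip())
```

Each chunk would be sent to the model separately and the answers passed to merge_responses. A real merge step might also de-duplicate repeated section headings, which this sketch does not attempt.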

✅ User-Friendly UI

One-click scraping – enter the URL and get insights in seconds.
Collapsible scraped content section – neatly organized raw content for review.

🛠️ Installation Guide

1️⃣ Clone the Repository

git clone https://huggingface.co/spaces/kushagrasharma-13/company-details-scraper
cd company-details-scraper

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Set Up Environment Variables

Create a .env file and add your Groq API key:

echo "GROQ_API_KEY=your_groq_api_key_here" > .env

Or manually add it to your environment:

export GROQ_API_KEY="your_groq_api_key_here"
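On the Python side, the key can be read and validated at startup. The sketch below is an assumption about how that might look, not the code in app.py; get_groq_api_key is a hypothetical helper. Note that a .env file is not loaded into the environment automatically: a loader such as python-dotenv is typically needed, and whether this app uses one is not stated here.

```python
import os


def get_groq_api_key(env=None):
    """Return the Groq API key, or raise a clear error if it is missing."""
    # Accept a mapping for testing; default to the real process environment.
    env = os.environ if env is None else env
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set; add it to .env or export it in your shell."
        )
    return key
```

Failing early with an explicit message is usually easier to debug than an authentication error raised later by the API client.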

4️⃣ Run the Application

streamlit run app.py

⚙️ How to Use

1️⃣ Enter the Company Website URL

Input the base URL of a company website (e.g., https://example.com).

2️⃣ Click "Scrape Website"

The scraper will collect the relevant internal links and fetch the content of each page.

3️⃣ View AI-Generated Insights

The AI will process the data and provide a structured business breakdown.

4️⃣ Expand Scraped Content (Optional)

A collapsible section allows you to review the raw scraped data.

🛠️ Tech Stack

Python
Streamlit – Web UI
BeautifulSoup – Web Scraping
Requests – HTTP Requests
LangChain – AI Prompt Management
Groq AI (Llama-3.3-70B-Versatile) – Large Language Model

🎯 Limitations & Future Improvements

⚠️ Limitations

Cannot scrape pages whose content is rendered client-side by JavaScript, because Requests and BeautifulSoup only see the initial HTML.
Requires publicly accessible company websites (no login-protected content).
Relies on structured website content – poorly structured sites may yield incomplete results.
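One rough way to notice the JavaScript limitation at run time is to check how much visible text the fetched HTML actually contains. The sketch below is a heuristic, not part of the app: the 200-character threshold and the function names are hypothetical, and it uses the standard library's HTMLParser rather than BeautifulSoup.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text outside <script>, <style>, and <noscript> tags."""

    SKIPPED = ("script", "style", "noscript")

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the human-visible text of an HTML document."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.parts)


def looks_js_rendered(html, min_chars=200):
    """Heuristic: very little visible text suggests client-side rendering."""
    return len(visible_text(html)) < min_chars
```

A page flagged this way could be skipped or reported to the user instead of being sent to the model with almost no content. Short static pages will also be flagged, so the threshold needs tuning.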

🔥 Future Enhancements

JavaScript rendering support using Selenium/Playwright.
Multi-language support for scraping global company websites.
AI-based entity recognition to improve key data extraction.

📜 License

This project is licensed under the Apache License 2.0, as declared in the Space metadata (license: apache-2.0) – free to use and modify under its terms.

🙌 Acknowledgments

Special thanks to Hugging Face for hosting this space and Groq AI for their powerful LLM capabilities.

🚀 Ready to get company insights?

Run the scraper and generate detailed company reports effortlessly! 🔍
