metadata
title: Company Details Scraper
emoji: 🌍
colorFrom: green
colorTo: yellow
sdk: streamlit
sdk_version: 1.41.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Give URL get details about the company

Company Details Scraper

A Streamlit-based web scraper that extracts detailed company information from a given website and generates structured insights using Groq AI.

🚀 Live Demo: Hugging Face Space


🔹 Features

✅ Website Scraping

  • Extracts internal pages of a company website (e.g., "About Us," "Products," "Services").
  • Filters out irrelevant links (e.g., social media, login, privacy pages).
  • Supports multiple page extractions in a single run.
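The link filtering described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function name filter_internal_links, the keyword list, and the max_links cap are all hypothetical. It assumes the raw href values have already been collected, for example with BeautifulSoup's soup.find_all("a", href=True).

```python
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical keyword list; the app's real filter rules are not shown in this README.
EXCLUDED_KEYWORDS = ("login", "signin", "privacy", "terms", "cookie")


def filter_internal_links(base_url, hrefs, max_links=10):
    """Return de-duplicated same-domain URLs, skipping irrelevant pages."""
    base_host = urlparse(base_url).netloc.lower()
    seen, result = set(), []
    for href in hrefs:
        # Resolve relative links and drop any "#fragment" part.
        absolute, _ = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skips mailto:, tel:, javascript: links
        if parsed.netloc.lower() != base_host:
            continue  # external link, e.g. social media
        if any(word in parsed.path.lower() for word in EXCLUDED_KEYWORDS):
            continue  # login, privacy and similar pages
        if absolute in seen:
            continue
        seen.add(absolute)
        result.append(absolute)
        if len(result) >= max_links:
            break
    return result


links = filter_internal_links(
    "https://example.com",
    ["/about", "/products#top", "https://twitter.com/example",
     "/login", "mailto:hi@example.com", "/about"],
)
print(links)
```

Comparing hosts keeps the crawl on the company's own domain, which is what excludes social media links without needing a separate blocklist for them.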

✅ AI-Powered Data Extraction

  • Uses Groq AI to analyze and generate detailed insights from extracted content.
  • Produces a structured business overview, including:
    • Company Name & Overview
    • Mission & Vision
    • Detailed Product & Service Information
    • Target Audience
    • Business Model & Revenue Streams
    • Competitive Edge & Differentiation
    • Notable Clients & Case Studies
    • Industry Impact & Thought Leadership
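One way to request these sections from the model is to list them explicitly in the prompt. The sketch below is an assumption about how such a prompt could be built; the wording and the build_prompt helper are hypothetical, and the app's real prompt is not shown in this README. The resulting string would then be sent to the Groq model, for example through LangChain.

```python
# Section headings copied from the feature list above.
SECTIONS = [
    "Company Name & Overview",
    "Mission & Vision",
    "Detailed Product & Service Information",
    "Target Audience",
    "Business Model & Revenue Streams",
    "Competitive Edge & Differentiation",
    "Notable Clients & Case Studies",
    "Industry Impact & Thought Leadership",
]


def build_prompt(scraped_text):
    """Build a prompt that asks for one block per section, grounded in the text."""
    headings = "\n".join(f"- {name}" for name in SECTIONS)
    return (
        "You are analyzing content scraped from a company website.\n"
        "Write a structured overview with these sections:\n"
        f"{headings}\n"
        "Use only the content below. If a section is not covered, "
        "write 'Not found in the scraped content'.\n\n"
        f"CONTENT:\n{scraped_text}"
    )


print(build_prompt("Example Corp builds widgets for small businesses."))
```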

✅ Handles Large Content Efficiently

  • Automatically splits content into manageable chunks for AI processing.
  • Merges AI responses seamlessly into a structured format.
  • Reduces the risk of AI hallucination by instructing the model to answer only from the extracted data.
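The split-and-merge flow can be illustrated with a small sketch. It assumes a character-based limit (max_chars) and a plain join for merging; both are simplifications chosen for illustration, since the app's actual chunk size and merge logic are not documented here, and a real limit would more likely be expressed in tokens.

```python
def split_into_chunks(text, max_chars=4000):
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split any single paragraph that is itself too long.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def merge_responses(responses):
    """Join per-chunk model responses into one document, skipping empty ones."""
    return "\n\n".join(r.strip() for r in responses if r.strip())


sample = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 100
chunks = split_into_chunks(sample, max_chars=50)
print([len(chunk) for chunk in chunks])
```

Each chunk would be sent to the model separately, and the per-chunk answers passed to merge_responses.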

✅ User-Friendly UI

  • One-click scraping – enter the URL and get insights in seconds.
  • Collapsible scraped content section – neatly organized raw content for review.

🛠️ Installation Guide

1️⃣ Clone the Repository

git clone https://huggingface.co/spaces/kushagrasharma-13/company-details-scraper
cd company-details-scraper

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Set Up Environment Variables

Create a .env file and add your Groq API key:

echo "GROQ_API_KEY=your_groq_api_key_here" > .env

Or manually add it to your environment:

export GROQ_API_KEY="your_groq_api_key_here"
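On the Python side, the key can then be read from either source. The sketch below is a standard-library stand-in under stated assumptions: the load_groq_api_key helper is hypothetical, and the app itself may use a package such as python-dotenv instead of parsing the file by hand.

```python
import os


def load_groq_api_key(env_path=".env"):
    """Return GROQ_API_KEY from the environment, falling back to a .env file."""
    key = os.environ.get("GROQ_API_KEY")
    if key:
        return key
    if os.path.exists(env_path):
        with open(env_path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line.startswith("#") or "=" not in line:
                    continue  # skip comments and malformed lines
                name, _, value = line.partition("=")
                if name.strip() == "GROQ_API_KEY":
                    # Remove optional surrounding quotes.
                    return value.strip().strip('"').strip("'")
    raise RuntimeError("GROQ_API_KEY is not set in the environment or the .env file.")
```

Failing early with a clear error is usually friendlier than letting the first API call fail with an authentication error.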

4️⃣ Run the Application

streamlit run app.py

⚙️ How to Use

1️⃣ Enter the Company Website URL

  • Input the base URL of a company website (e.g., https://example.com).

2️⃣ Click "Scrape Website"

  • The scraper will extract all relevant internal links and fetch content.

3️⃣ View AI-Generated Insights

  • The AI will process the data and provide a structured business breakdown.

4️⃣ Expand Scraped Content (Optional)

  • A collapsible section allows you to review the raw scraped data.

🛠️ Tech Stack

  • Python
  • Streamlit – Web UI
  • BeautifulSoup – Web Scraping
  • Requests – HTTP Requests
  • LangChain – AI Prompt Management
  • Groq AI (Llama-3.3-70B-Versatile) – Large Language Model

🎯 Limitations & Future Improvements

⚠️ Limitations

  • Cannot scrape pages that render their content with JavaScript, because Requests and BeautifulSoup only see the initial HTML.
  • Requires publicly accessible company websites (no login-protected content).
  • Relies on structured website content – poorly structured sites may have incomplete data.
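A rough way to notice when a page falls under the first limitation is to check how much visible text the raw HTML contains. The heuristic below is only an illustrative assumption and is not part of the app: the looks_js_rendered helper and the 200-character threshold are hypothetical, and the standard-library HTML parser is used here in place of BeautifulSoup to keep the sketch dependency-free.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is not inside <script> or <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def looks_js_rendered(html, min_chars=200):
    """Return True when the HTML has very little visible text of its own."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.parts)) < min_chars


spa_page = '<html><body><div id="root"></div><script>var data = "' + "y" * 500 + '";</script></body></html>'
static_page = "<html><body><p>" + "word " * 100 + "</p></body></html>"
print(looks_js_rendered(spa_page), looks_js_rendered(static_page))
```

Pages flagged this way are likely to yield incomplete results and could be reported to the user instead of being sent to the model.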

🔥 Future Enhancements

  • JavaScript rendering support using Selenium/Playwright.
  • Multi-language support for scraping global company websites.
  • AI-based entity recognition to improve key data extraction.

📜 License

This project is licensed under the Apache License 2.0 (apache-2.0 in the Space metadata) – free to use and modify.


🙌 Acknowledgments

Special thanks to Hugging Face for hosting this space and Groq AI for their powerful LLM capabilities.


🚀 Ready to get company insights?

Run the scraper and generate detailed company reports effortlessly! 🔍

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference