metadata
title: Company Details Scraper
emoji: π
colorFrom: green
colorTo: yellow
sdk: streamlit
sdk_version: 1.41.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Give URL get details about the company
Company Details Scraper
A Streamlit-based web scraper that extracts detailed company information from a given website and generates structured insights using Groq AI.
Live Demo: Hugging Face Space
Features
Website Scraping
- Extracts internal pages of a company website (e.g., "About Us," "Products," "Services").
- Filters out irrelevant links (e.g., social media, login, privacy pages).
- Supports multiple page extractions in a single run.
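The link extraction and filtering described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: it uses the standard-library `html.parser` as a stand-in for BeautifulSoup so that it runs without extra installs, and the `EXCLUDED_KEYWORDS` list and the function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical keyword list; the app's real filter is not shown in this README.
EXCLUDED_KEYWORDS = ("login", "signin", "privacy", "terms", "cookie")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_internal_links(html, base_url):
    """Return sorted, de-duplicated same-domain links, minus excluded ones."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute = urljoin(base_url, href).split("#")[0]
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript: and similar
        if parsed.netloc != base_host:
            continue  # external link, which also drops social media sites
        if any(word in parsed.path.lower() for word in EXCLUDED_KEYWORDS):
            continue  # login, privacy and similar pages
        links.add(absolute)
    return sorted(links)


sample_html = """
<a href="/about-us">About</a>
<a href="products#top">Products</a>
<a href="https://twitter.com/example">Twitter</a>
<a href="/login">Log in</a>
<a href="mailto:hi@example.com">Mail</a>
"""
print(extract_internal_links(sample_html, "https://example.com/"))
```

With the sample HTML, only the "About" and "Products" links survive the filter; the social media, login, and mailto links are dropped.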
AI-Powered Data Extraction
- Uses Groq AI to analyze and generate detailed insights from extracted content.
- Produces a structured business overview, including:
- Company Name & Overview
- Mission & Vision
- Detailed Product & Service Information
- Target Audience
- Business Model & Revenue Streams
- Competitive Edge & Differentiation
- Notable Clients & Case Studies
- Industry Impact & Thought Leadership
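One way to request the sections listed above is to build them into the prompt text. The sketch below is an assumption about how such a prompt could look: the section names come from this README, but the wording, the `build_prompt` helper, and the use of plain string formatting instead of a LangChain prompt template are all hypothetical.

```python
# Section names are taken from the list above; the prompt wording is hypothetical.
SECTIONS = [
    "Company Name & Overview",
    "Mission & Vision",
    "Detailed Product & Service Information",
    "Target Audience",
    "Business Model & Revenue Streams",
    "Competitive Edge & Differentiation",
    "Notable Clients & Case Studies",
    "Industry Impact & Thought Leadership",
]


def build_prompt(scraped_text):
    """Build a prompt asking for one numbered section per SECTIONS entry."""
    headings = "\n".join(
        f"{number}. {name}" for number, name in enumerate(SECTIONS, start=1)
    )
    return (
        "Using only the website content below, write a structured business "
        "overview with these sections:\n"
        f"{headings}\n\n"
        "If the content does not cover a section, write "
        "'Not found on the website'.\n\n"
        f"Website content:\n{scraped_text}"
    )


print(build_prompt("Acme builds rockets."))
```

Keeping the section names in one list means the prompt and any downstream parsing can share the same source of truth.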
Handles Large Content Efficiently
- Automatically splits content into manageable chunks for AI processing.
- Merges AI responses seamlessly into a structured format.
- Reduces the risk of AI hallucination by instructing the model to answer strictly from the extracted data.
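The split-and-merge idea above could look something like this. It is a sketch under stated assumptions: the character-based limit, the preference for paragraph boundaries, and the helper names `split_into_chunks` and `merge_responses` are illustrative choices, and the app may chunk by tokens or by page instead.

```python
def split_into_chunks(text, max_chars=4000):
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split any single paragraph that is longer than the limit.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate  # still fits in the current chunk
        else:
            chunks.append(current)  # close the chunk and start a new one
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def merge_responses(responses):
    """Join per-chunk responses into one document, skipping empty ones."""
    return "\n\n".join(r.strip() for r in responses if r.strip())


text = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 25
chunks = split_into_chunks(text, max_chars=22)
print([len(chunk) for chunk in chunks])
```

Each chunk would be sent to the model separately, and `merge_responses` would then stitch the per-chunk answers back together.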
User-Friendly UI
- One-click scraping: enter the URL and get insights in seconds.
- Collapsible scraped content section: neatly organized raw content for review.
Installation Guide
1. Clone the Repository
git clone https://huggingface.co/spaces/kushagrasharma-13/company-details-scraper
cd company-details-scraper
2. Install Dependencies
pip install -r requirements.txt
3. Set Up Environment Variables
Create a .env file and add your Groq API key:
echo "GROQ_API_KEY=your_groq_api_key_here" > .env
Or manually add it to your environment:
export GROQ_API_KEY="your_groq_api_key_here"
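On the Python side, the key can then be read from the environment. The helper below is a hypothetical sketch (`get_groq_api_key` is not a name from this project). It only reads variables that are already set, so a `.env` file has to be loaded into the environment first, for example with the `python-dotenv` package; whether the app does that is an assumption.

```python
import os


def get_groq_api_key():
    """Return the Groq API key from the environment, or raise a clear error."""
    key = os.environ.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set. Add it to .env or export it "
            "before running the app."
        )
    return key
```

Failing early with a readable message tends to be easier to debug than an authentication error from the API later on.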
4. Run the Application
streamlit run app.py
How to Use
1. Enter the Company Website URL
- Input the base URL of a company website (e.g., https://example.com).
2. Click "Scrape Website"
- The scraper extracts all relevant internal links and fetches their content.
3. View AI-Generated Insights
- The AI will process the data and provide a structured business breakdown.
4. Expand Scraped Content (Optional)
- A collapsible section allows you to review the raw scraped data.
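Step 1 asks for a base URL, so it can help to normalize whatever the user types before scraping. The following is a minimal sketch, assuming that "base URL" means scheme plus host; the `normalize_base_url` helper and its validation rules are hypothetical and not taken from the app.

```python
from urllib.parse import urlparse


def normalize_base_url(raw_url):
    """Reduce user input to scheme://host, defaulting to https."""
    url = raw_url.strip()
    if "://" not in url:
        url = "https://" + url  # assume https when no scheme is given
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or "." not in parsed.netloc:
        raise ValueError(f"Not a valid website URL: {raw_url!r}")
    return f"{parsed.scheme}://{parsed.netloc.lower()}"


print(normalize_base_url(" example.com/about "))
print(normalize_base_url("http://Example.com/products?id=1"))
```

Inputs such as `ftp://example.com` or a bare word like `localhost` raise `ValueError` in this sketch, which the UI could surface as a validation message.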
Tech Stack
- Python
- Streamlit: Web UI
- BeautifulSoup: Web Scraping
- Requests: HTTP Requests
- LangChain: AI Prompt Management
- Groq AI (Llama-3.3-70B-Versatile): Large Language Model
Limitations & Future Improvements
Limitations
- Cannot scrape pages whose content is rendered dynamically by JavaScript.
- Requires publicly accessible company websites (no login-protected content).
- Relies on structured website content: poorly structured sites may yield incomplete data.
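To illustrate the first limitation, a scraper could at least warn when a page looks JavaScript-rendered. The heuristic below is an assumption and not part of the app: it flags HTML that contains `<script>` tags but very little visible text, and the 200-character threshold and the names used are arbitrary. It uses the standard-library `html.parser` rather than BeautifulSoup so that it runs without extra installs.

```python
from html.parser import HTMLParser


class TextAndScriptCounter(HTMLParser):
    """Counts <script> tags and visible text characters in a document."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def looks_javascript_rendered(html, min_visible_chars=200):
    """Heuristic: scripts are present but there is almost no visible text."""
    counter = TextAndScriptCounter()
    counter.feed(html)
    return counter.script_count > 0 and counter.visible_chars < min_visible_chars


spa_page = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
static_page = "<html><body><p>" + "Acme makes rockets. " * 20 + "</p></body></html>"
print(looks_javascript_rendered(spa_page), looks_javascript_rendered(static_page))
```

A heuristic like this can produce false positives on very short static pages, so it is better suited to a warning than to a hard failure.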
Future Enhancements
- JavaScript rendering support using Selenium/Playwright.
- Multi-language support for scraping global company websites.
- AI-based entity recognition to improve key data extraction.
License
This project is licensed under the MIT License: free to use and modify.
Acknowledgments
Special thanks to Hugging Face for hosting this Space and Groq AI for their powerful LLM capabilities.
Ready to get company insights?
Run the scraper and generate detailed company reports effortlessly!
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference