patrickacraig committed
Commit 120773d
0 Parent(s):

initial push

Files changed (6)
  1. .env.example +11 -0
  2. .gitignore +134 -0
  3. README.md +103 -0
  4. app.py +80 -0
  5. requirements.txt +131 -0
  6. web_ui.py +93 -0
.env.example ADDED
@@ -0,0 +1,11 @@
+ # FIRECRAWL_API_KEY is your unique API key for accessing the Firecrawl API.
+ # You can obtain this key by signing up on the Firecrawl website and navigating to your account settings or API section.
+ FIRECRAWL_API_KEY=your_actual_api_key
+
+ # BASE_URL is the starting point URL of the website you want to scrape.
+ # Replace "https://docs.example.com/" with the URL of the website you wish to scrape.
+ BASE_URL="https://docs.example.com/"
+
+ # LIMIT_RATE is a boolean value that determines whether rate limiting is enabled.
+ # Set to "True" to enable rate limiting, or "False" to disable it.
+ LIMIT_RATE=True
.gitignore ADDED
@@ -0,0 +1,134 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ .python-version
+
+ # pipenv
+ # Pipenv specific files
+ Pipfile.lock
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+ firecrawl_env/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # Custom
+ scraped_documentation/
README.md ADDED
@@ -0,0 +1,103 @@
+ # Docs Scraper
+
+ This project provides a Python script to map and scrape all URLs from a given website using the Firecrawl API. This can be particularly useful for AI code editors that need to gather context from various types of websites. By scraping the content, the AI can analyze and understand the structure and information provided, which can enhance its ability to offer accurate code suggestions and improvements.
+
+ Types of sites that would be useful to scrape include:
+ - Documentation websites
+ - API reference sites
+ - Technical blogs
+ - Tutorials and guides
+ - Knowledge bases
+
+ The scraped content is saved into a markdown file named after the domain of the base URL, making it easy to reference and utilize.
+
+ ## Prerequisites
+
+ - Python 3.x
+ - <a href="https://firecrawl.dev/" target="_blank" rel="noopener noreferrer">Firecrawl API key</a>
+ - Virtual environment (recommended)
+
+ ## Setup
+
+ 1. **Clone the Repository**
+
+    Clone this repository to your local machine:
+
+    ```bash
+    git clone https://github.com/patrickacraig/docs-scraper.git
+    cd docs-scraper
+    ```
+
+ 2. **Create a Virtual Environment**
+
+    Create and activate a virtual environment:
+
+    ```bash
+    python -m venv .venv
+    source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
+    ```
+
+ 3. **Install Dependencies**
+
+    Install the required Python packages:
+
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 4. **Set Up Environment Variables**
+
+    Rename the `.env.example` file to `.env` and enter your own variables:
+
+    ```plaintext
+    FIRECRAWL_API_KEY=your_actual_api_key # Your unique API key for accessing the Firecrawl API. Obtain it from your Firecrawl account settings.
+
+    BASE_URL="https://docs.example.com/" # The starting point URL of the website you want to scrape. Replace with your target URL.
+
+    LIMIT_RATE=True # Set to "True" to enable rate limiting (10 scrapes per minute), or "False" to disable it.
+    ```
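+
+    For example, on macOS or Linux you can copy the template into place and then fill in the values:
+
+    ```bash
+    cp .env.example .env
+    ```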
+
+ ## Rate Limiting
+
+ The script is designed to adhere to a rate limit of 10 scrapes per minute, in line with the Firecrawl API free tier. To disable rate limiting, set the `LIMIT_RATE` environment variable to `False` in your `.env` file:
+
+ ```plaintext
+ LIMIT_RATE=False
+ ```
+
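+ Concretely, when `LIMIT_RATE` is enabled the scraper pauses for 60 seconds after every 10th URL; this is the check that `app.py` (further down in this commit) runs inside its scraping loop:
+
+ ```python
+ # Excerpt from the scraping loop in app.py
+ if os.getenv('LIMIT_RATE') == 'True':
+     if (i + 1) % 10 == 0:
+         print("Rate limit reached, waiting for 60 seconds...")
+         time.sleep(60)
+ ```
+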
+ ## Usage
+
+ 1. **Run the Script**
+
+    Execute the script to start mapping and scraping the URLs:
+
+    ```bash
+    python app.py
+    ```
+
+ 2. **Output**
+
+    The script will generate a markdown file named after the domain of the base URL and saved in the `scraped_documentation/` directory (e.g., `scraped_documentation/example.com.md`), containing the scraped content.
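+
+    Each page is written under a heading containing its URL and separated from the next page by a horizontal rule, so the file looks roughly like this (placeholder content shown):
+
+    ```plaintext
+    # https://docs.example.com/page-one
+
+    ...scraped markdown for that page...
+
+    ---
+
+    # https://docs.example.com/page-two
+    ```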
+
+ ## Alternative Usage: Web UI
+
+ 1. **Run the Script**
+
+    Alternatively, you can run the web-based interface built with Gradio:
+
+    ```bash
+    python web_ui.py
+    ```
+
+    This will launch a web interface (Gradio prints the local URL, typically `http://127.0.0.1:7860`) where you can enter the base URL, your Firecrawl API key, and choose whether to enable rate limiting. The output will be displayed directly in the browser.
+
+ ## License
+
+ This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
+
+ ## Contributing
+
+ Contributions are welcome! Please fork the repository and submit a pull request for any improvements or bug fixes.
+
+ ## Contact
+
+ For any questions or issues, please open an issue in the repository.
+
app.py ADDED
@@ -0,0 +1,80 @@
+ from firecrawl import FirecrawlApp
+ import os
+ import time
+ from dotenv import load_dotenv
+ from urllib.parse import urlparse
+
+
+ load_dotenv()
+
+ base_url = os.getenv('BASE_URL')
+
+ def map_website(url):
+     # Initialize the Firecrawl application with the API key
+     app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
+
+     # Use the /map endpoint to get all URLs from the website
+     map_status = app.map_url(url)
+
+     # Check if the mapping was successful
+     if isinstance(map_status, list):
+         return map_status
+     else:
+         print("Failed to map the website:", map_status)
+         return []
+
+ def scrape_url(url):
+     # Initialize the Firecrawl application with the API key
+     app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
+
+     # Use the /scrape endpoint to scrape the URL
+     scrape_status = app.scrape_url(url)
+
+     # Print the scrape_status to understand its structure
+     print(f"Scrape status for {url}: {scrape_status}")
+
+     # Check if the scraping was successful
+     if 'markdown' in scrape_status:
+         return scrape_status['markdown']
+     else:
+         print(f"Failed to scrape {url}: {scrape_status}")
+         return ""
+
+ def scrape_all_urls(base_url):
+     # Map the URLs
+     urls = map_website(base_url)
+
+     # Parse the base URL to get the domain without 'www' and scheme
+     parsed_url = urlparse(base_url)
+     domain = parsed_url.netloc.replace("www.", "")
+
+     # Create the directory if it doesn't exist
+     os.makedirs('scraped_documentation', exist_ok=True)
+
+     # Generate the output file name and save location
+     output_file = os.path.join('scraped_documentation', f"{domain}.md")
+
+     # Open the output file in write mode
+     with open(output_file, 'w', encoding='utf-8') as md_file:
+         # Iterate over the URLs
+         for i, url in enumerate(urls):
+             # Print the URL being scraped
+             print(f"Scraping {url} ({i+1}/{len(urls)})")
+
+             # Scrape the URL
+             markdown_content = scrape_url(url)
+
+             # Write the scraped content to the file
+             md_file.write(f"# {url}\n\n")
+             md_file.write(markdown_content)
+             md_file.write("\n\n---\n\n")
+
+             # Rate limiting: 10 scrapes per minute
+             if os.getenv('LIMIT_RATE') == 'True':
+                 if (i + 1) % 10 == 0:
+                     print("Rate limit reached, waiting for 60 seconds...")
+                     time.sleep(60)
+
+ if __name__ == "__main__":
+
+     scrape_all_urls(base_url)
requirements.txt ADDED
@@ -0,0 +1,131 @@
+ accelerate
+ aiofiles==23.2.1
+ aiohttp
+ aiosignal
+ altair==5.4.1
+ annotated-types==0.7.0
+ antlr4-python3-runtime==4.9.3
+ anyio==4.6.0
+ async-timeout
+ attrs
+ blinker
+ boltons
+ Brotli
+ certifi==2023.7.22
+ cffi
+ charset-normalizer
+ click
+ colorama
+ conda==4.3.16
+ conda-libmamba-solver
+ conda-package-handling
+ conda_package_streaming
+ contourpy==1.3.0
+ cryptography
+ cycler==0.12.1
+ dataclasses
+ datasets
+ diffusers
+ dill
+ einops==0.6.1
+ exceptiongroup==1.2.2
+ fastapi==0.115.0
+ ffmpy==0.4.0
+ filelock
+ firecrawl==1.2.4
+ Flask
+ fonttools==4.54.1
+ frozenlist
+ fsspec
+ gmpy2
+ gradio==4.44.1
+ gradio_client==1.3.0
+ h11==0.14.0
+ httpcore==1.0.6
+ httpx==0.27.2
+ huggingface-hub==0.25.1
+ idna
+ importlib-metadata
+ importlib_resources==6.4.5
+ invisible-watermark==0.2.0
+ itsdangerous
+ Jinja2
+ joblib
+ jsonpatch
+ jsonpointer==2.0
+ jsonschema==4.23.0
+ jsonschema-specifications==2023.12.1
+ kiwisolver==1.4.7
+ libmambapy
+ mamba
+ markdown-it-py==3.0.0
+ MarkupSafe
+ matplotlib==3.9.2
+ mdurl==0.1.2
+ mpmath
+ multidict
+ multiprocess
+ narwhals==1.9.1
+ nest-asyncio==1.6.0
+ networkx
+ numpy
+ omegaconf==2.3.0
+ opencv-python==4.8.0.76
+ orjson==3.10.7
+ packaging
+ pandas
+ Pillow
+ pluggy
+ psutil
+ pyarrow==12.0.1
+ pycosat
+ pycparser
+ pydantic==2.9.2
+ pydantic_core==2.23.4
+ pydub==0.25.1
+ Pygments==2.18.0
+ pyOpenSSL
+ pyparsing==3.1.4
+ PySocks
+ python-dateutil
+ python-dotenv==1.0.1
+ python-multipart==0.0.12
+ pytz
+ PyWavelets==1.4.1
+ PyYAML
+ referencing==0.35.1
+ regex
+ requests
+ rich==13.9.2
+ rpds-py==0.20.0
+ ruamel.yaml
+ ruamel.yaml.clib
+ ruff==0.6.9
+ sacremoses
+ safetensors
+ scipy==1.11.2
+ semantic-version==2.10.0
+ shellingham==1.5.4
+ six
+ sniffio==1.3.1
+ starlette==0.38.6
+ sympy
+ tokenizers
+ tomlkit==0.12.0
+ toolz
+ torch
+ torchsde==0.2.5
+ tqdm
+ trampoline==0.1.2
+ transformers
+ typer==0.12.5
+ typing_extensions==4.12.2
+ tzdata
+ urllib3
+ uvicorn==0.31.0
+ websockets==11.0.3
+ Werkzeug
+ xxhash
+ yarl
+ zipp
+ zstandard==0.19.0
web_ui.py ADDED
@@ -0,0 +1,93 @@
+ import os
+ import time
+ from dotenv import load_dotenv
+ from urllib.parse import urlparse
+ from firecrawl import FirecrawlApp
+ import gradio as gr
+
+ load_dotenv()
+
+ def map_website(url, api_key):
+     app = FirecrawlApp(api_key=api_key)
+     map_status = app.map_url(url)
+     if isinstance(map_status, list):
+         return map_status
+     else:
+         print("Failed to map the website:", map_status)
+         return []
+
+ def scrape_url(url, api_key):
+     app = FirecrawlApp(api_key=api_key)
+     scrape_status = app.scrape_url(url)
+     print(f"Scrape status for {url}: {scrape_status}")
+     if 'markdown' in scrape_status:
+         return scrape_status['markdown']
+     else:
+         print(f"Failed to scrape {url}: {scrape_status}")
+         return ""
+
+ def scrape_all_urls(base_url, api_key, limit_rate, progress=gr.Progress()):
+     urls = map_website(base_url, api_key)
+     parsed_url = urlparse(base_url)
+     domain = parsed_url.netloc.replace("www.", "")
+     os.makedirs('scraped_documentation', exist_ok=True)
+     output_file = os.path.join('scraped_documentation', f"{domain}.md")
+
+     with open(output_file, 'w', encoding='utf-8') as md_file:
+         for i, url in enumerate(progress.tqdm(urls)):
+             progress(i / len(urls), f"Scraping {url}")
+             markdown_content = scrape_url(url, api_key)
+             md_file.write(f"# {url}\n\n")
+             md_file.write(markdown_content)
+             md_file.write("\n\n---\n\n")
+             if limit_rate:
+                 if (i + 1) % 10 == 0:
+                     time.sleep(60)
+
+     return f"Scraping completed. Output saved to {output_file}"
+
+ def count_urls(base_url, api_key):
+     if not api_key:
+         return "Please enter your Firecrawl API key first."
+     urls = map_website(base_url, api_key)
+     return f"{len(urls)} URLs found. Do you want to proceed with scraping?"
+
+ def gradio_scrape(base_url, api_key, limit_rate):
+     if not api_key:
+         return "Please enter your Firecrawl API key."
+     if not base_url:
+         return "Please enter a base URL to scrape."
+     return scrape_all_urls(base_url, api_key, limit_rate)
+
+ with gr.Blocks() as iface:
+     gr.Markdown("# Docs Scraper")
+     gr.Markdown("## To map and scrape all URLs from a given website using the Firecrawl API, enter a base URL to scrape, your Firecrawl API key, and choose whether to limit the rate of scraping.")
+     gr.Markdown("Scraped content is saved into a markdown file named after the domain of the base URL, making it easy to reference and utilize. This can be particularly useful for AI code editors that need to gather context from various types of websites. By scraping the content, the AI can analyze and understand the structure and information provided, which can enhance its ability to offer accurate code suggestions and improvements.")
+     gr.HTML('Don\'t have an API key? <a href="https://firecrawl.dev/" target="_blank" rel="noopener noreferrer">Get one from Firecrawl</a>')
+
+     with gr.Row():
+         base_url = gr.Textbox(label="Base URL", placeholder="Enter the base URL to scrape")
+         api_key = gr.Textbox(label="Firecrawl API Key", type="password")
+         limit_rate = gr.Checkbox(
+             label="Limit Rate",
+             value=True,
+             info="Enable to limit scraping to 10 URLs per minute. This adheres to Firecrawl API's free tier rate limit."
+         )
+
+     gr.Markdown("After entering your API key, click 'Count URLs' to determine the number of URLs to be scraped. Then, click 'Scrape URLs' to begin the process. The progress and file location will be displayed in the textbox labeled 'Output'.")
+     with gr.Row():
+         count_button = gr.Button("Count URLs")
+         url_count = gr.Textbox(label="URL Count")
+
+     with gr.Row():
+         scrape_button = gr.Button("Scrape URLs")
+         output = gr.Textbox(label="Output", elem_id="output_textbox")
+
+     gr.Markdown("#### Note: The free tier of the Firecrawl API allows for 500 credits per month. If you need to scrape more, you can upgrade to a paid plan. The 'Count URLs' button may not work as expected if the base URL is not correctly specified or if the API key is invalid. Always ensure the base URL is correct and the API key is valid before proceeding.")
+
+
+     count_button.click(count_urls, inputs=[base_url, api_key], outputs=[url_count])
+     scrape_button.click(gradio_scrape, inputs=[base_url, api_key, limit_rate], outputs=[output])
+
+ if __name__ == "__main__":
+     iface.launch()