patrickacraig committed
Commit 120773d
0 Parent(s):

initial push

Files changed (6)
  1. .env.example +11 -0
  2. .gitignore +134 -0
  3. README.md +103 -0
  4. app.py +80 -0
  5. requirements.txt +131 -0
  6. web_ui.py +93 -0
.env.example ADDED
@@ -0,0 +1,11 @@
+ # FIRECRAWL_API_KEY is your unique API key for accessing the Firecrawl API.
+ # You can obtain this key by signing up on the Firecrawl website and navigating to your account settings or API section.
+ FIRECRAWL_API_KEY=your_actual_api_key
+
+ # BASE_URL is the starting point URL of the website you want to scrape.
+ # Replace "https://docs.example.com/" with the URL of the website you wish to scrape.
+ BASE_URL="https://docs.example.com/"
+
+ # LIMIT_RATE is a boolean value that determines whether rate limiting is enabled.
+ # Set to "True" to enable rate limiting, or "False" to disable it.
+ LIMIT_RATE=True
.gitignore ADDED
@@ -0,0 +1,134 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ .python-version
+
+ # pipenv
+ # Pipenv specific files
+ Pipfile.lock
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+ firecrawl_env/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # Custom
+ scraped_documentation/
README.md ADDED
@@ -0,0 +1,103 @@
+ # Docs Scraper
+
+ This project provides a Python script to map and scrape all URLs from a given website using the Firecrawl API. This can be particularly useful for AI code editors that need to gather context from various types of websites. By scraping the content, the AI can analyze and understand the structure and information provided, which can enhance its ability to offer accurate code suggestions and improvements.
+
+ Types of sites that would be useful to scrape include:
+ - Documentation websites
+ - API reference sites
+ - Technical blogs
+ - Tutorials and guides
+ - Knowledge bases
+
+ The scraped content is saved into a markdown file named after the domain of the base URL, making it easy to reference and utilize.
+
+ ## Prerequisites
+
+ - Python 3.x
+ - <a href="https://firecrawl.dev/" target="_blank" rel="noopener noreferrer">Firecrawl API key</a>
+ - Virtual environment (recommended)
+
+ ## Setup
+
+ 1. **Clone the Repository**
+
+    Clone this repository to your local machine:
+
+    ```bash
+    git clone https://github.com/patrickacraig/docs-scraper.git
+    cd docs-scraper
+    ```
+
+ 2. **Create a Virtual Environment**
+
+    Create and activate a virtual environment:
+
+    ```bash
+    python -m venv .venv
+    source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
+    ```
+
+ 3. **Install Dependencies**
+
+    Install the required Python packages:
+
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 4. **Set Up Environment Variables**
+
+    Rename the `.env.example` file to `.env` and enter your own variables:
+
+    ```plaintext
+    FIRECRAWL_API_KEY=your_actual_api_key # Your unique API key for accessing the Firecrawl API. Obtain it from your Firecrawl account settings.
+
+    BASE_URL="https://docs.example.com/" # The starting point URL of the website you want to scrape. Replace with your target URL.
+
+    LIMIT_RATE=True # Set to "True" to enable rate limiting (10 scrapes per minute), or "False" to disable it.
+    ```
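+
+    For example, on macOS or Linux you can copy the template into place and then fill in the values:
+
+    ```bash
+    cp .env.example .env
+    ```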
+
+ ## Rate Limiting
+
+ The script is designed to adhere to a rate limit of 10 scrapes per minute, in line with the Firecrawl API free tier. To disable rate limiting, set the `LIMIT_RATE` environment variable to `False` in your `.env` file:
+
+ ```plaintext
+ LIMIT_RATE=False
+ ```
+
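+ Concretely, when `LIMIT_RATE` is enabled the scraper pauses for 60 seconds after every 10th URL; this is the check that `app.py` (further down in this commit) runs inside its scraping loop:
+
+ ```python
+ # Excerpt from the scraping loop in app.py
+ if os.getenv('LIMIT_RATE') == 'True':
+     if (i + 1) % 10 == 0:
+         print("Rate limit reached, waiting for 60 seconds...")
+         time.sleep(60)
+ ```
+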
+ ## Usage
+
+ 1. **Run the Script**
+
+    Execute the script to start mapping and scraping the URLs:
+
+    ```bash
+    python app.py
+    ```
+
+ 2. **Output**
+
+    The script will generate a markdown file named after the domain of the base URL and saved in the `scraped_documentation/` directory (e.g., `scraped_documentation/example.com.md`), containing the scraped content.
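+
+    Each page is written under a heading containing its URL and separated from the next page by a horizontal rule, so the file looks roughly like this (placeholder content shown):
+
+    ```plaintext
+    # https://docs.example.com/page-one
+
+    ...scraped markdown for that page...
+
+    ---
+
+    # https://docs.example.com/page-two
+    ```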
+
+ ## Alternative Usage: Web UI
+
+ 1. **Run the Script**
+
+    Alternatively, you can run the web-based interface built with Gradio:
+
+    ```bash
+    python web_ui.py
+    ```
+
+    This will launch a web interface (Gradio prints the local URL, typically `http://127.0.0.1:7860`) where you can enter the base URL, your Firecrawl API key, and choose whether to enable rate limiting. The output will be displayed directly in the browser.
+
+ ## License
+
+ This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
+
+ ## Contributing
+
+ Contributions are welcome! Please fork the repository and submit a pull request for any improvements or bug fixes.
+
+ ## Contact
+
+ For any questions or issues, please open an issue in the repository.
+
app.py ADDED
@@ -0,0 +1,80 @@
+ from firecrawl import FirecrawlApp
+ import os
+ import time
+ from dotenv import load_dotenv
+ from urllib.parse import urlparse
+
+
+ load_dotenv()
+
+ base_url = os.getenv('BASE_URL')
+
+ def map_website(url):
+     # Initialize the Firecrawl application with the API key
+     app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
+
+     # Use the /map endpoint to get all URLs from the website
+     map_status = app.map_url(url)
+
+     # Check if the mapping was successful
+     if isinstance(map_status, list):
+         return map_status
+     else:
+         print("Failed to map the website:", map_status)
+         return []
+
+ def scrape_url(url):
+     # Initialize the Firecrawl application with the API key
+     app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
+
+     # Use the /scrape endpoint to scrape the URL
+     scrape_status = app.scrape_url(url)
+
+     # Print the scrape_status to understand its structure
+     print(f"Scrape status for {url}: {scrape_status}")
+
+     # Check if the scraping was successful
+     if 'markdown' in scrape_status:
+         return scrape_status['markdown']
+     else:
+         print(f"Failed to scrape {url}: {scrape_status}")
+         return ""
+
+ def scrape_all_urls(base_url):
+     # Map the URLs
+     urls = map_website(base_url)
+
+     # Parse the base URL to get the domain without 'www' and scheme
+     parsed_url = urlparse(base_url)
+     domain = parsed_url.netloc.replace("www.", "")
+
+     # Create the directory if it doesn't exist
+     os.makedirs('scraped_documentation', exist_ok=True)
+
+     # Generate the output file name and save location
+     output_file = os.path.join('scraped_documentation', f"{domain}.md")
+
+     # Open the output file in write mode
+     with open(output_file, 'w', encoding='utf-8') as md_file:
+         # Iterate over the URLs
+         for i, url in enumerate(urls):
+             # Print the URL being scraped
+             print(f"Scraping {url} ({i+1}/{len(urls)})")
+
+             # Scrape the URL
+             markdown_content = scrape_url(url)
+
+             # Write the scraped content to the file
+             md_file.write(f"# {url}\n\n")
+             md_file.write(markdown_content)
+             md_file.write("\n\n---\n\n")
+
+             # Rate limiting: 10 scrapes per minute
+             if os.getenv('LIMIT_RATE') == 'True':
+                 if (i + 1) % 10 == 0:
+                     print("Rate limit reached, waiting for 60 seconds...")
+                     time.sleep(60)
+
+ if __name__ == "__main__":
+
+     scrape_all_urls(base_url)
requirements.txt ADDED
@@ -0,0 +1,131 @@
+ accelerate
+ aiofiles==23.2.1
+ aiohttp
+ aiosignal
+ altair==5.4.1
+ annotated-types==0.7.0
+ antlr4-python3-runtime==4.9.3
+ anyio==4.6.0
+ async-timeout
+ attrs
+ blinker
+ boltons
+ Brotli
+ certifi==2023.7.22
+ cffi
+ charset-normalizer
+ click
+ colorama
+ conda==4.3.16
+ conda-libmamba-solver
+ conda-package-handling
+ conda_package_streaming
+ contourpy==1.3.0
+ cryptography
+ cycler==0.12.1
+ dataclasses
+ datasets
+ diffusers
+ dill
+ einops==0.6.1
+ exceptiongroup==1.2.2
+ fastapi==0.115.0
+ ffmpy==0.4.0
+ filelock
+ firecrawl==1.2.4
+ Flask
+ fonttools==4.54.1
+ frozenlist
+ fsspec
+ gmpy2
+ gradio==4.44.1
+ gradio_client==1.3.0
+ h11==0.14.0
+ httpcore==1.0.6
+ httpx==0.27.2
+ huggingface-hub==0.25.1
+ idna
+ importlib-metadata
+ importlib_resources==6.4.5
+ invisible-watermark==0.2.0
+ itsdangerous
+ Jinja2
+ joblib
+ jsonpatch
+ jsonpointer==2.0
+ jsonschema==4.23.0
+ jsonschema-specifications==2023.12.1
+ kiwisolver==1.4.7
+ libmambapy
+ mamba
+ markdown-it-py==3.0.0
+ MarkupSafe
+ matplotlib==3.9.2
+ mdurl==0.1.2
+ mpmath
+ multidict
+ multiprocess
+ narwhals==1.9.1
+ nest-asyncio==1.6.0
+ networkx
+ numpy
+ omegaconf==2.3.0
+ opencv-python==4.8.0.76
+ orjson==3.10.7
+ packaging
+ pandas
+ Pillow
+ pluggy
+ psutil
+ pyarrow==12.0.1
+ pycosat
+ pycparser
+ pydantic==2.9.2
+ pydantic_core==2.23.4
+ pydub==0.25.1
+ Pygments==2.18.0
+ pyOpenSSL
+ pyparsing==3.1.4
+ PySocks
+ python-dateutil
+ python-dotenv==1.0.1
+ python-multipart==0.0.12
+ pytz
+ PyWavelets==1.4.1
+ PyYAML
+ referencing==0.35.1
+ regex
+ requests
+ rich==13.9.2
+ rpds-py==0.20.0
+ ruamel.yaml
+ ruamel.yaml.clib
+ ruff==0.6.9
+ sacremoses
+ safetensors
+ scipy==1.11.2
+ semantic-version==2.10.0
+ shellingham==1.5.4
+ six
+ sniffio==1.3.1
+ starlette==0.38.6
+ sympy
+ tokenizers
+ tomlkit==0.12.0
+ toolz
+ torch
+ torchsde==0.2.5
+ tqdm
+ trampoline==0.1.2
+ transformers
+ typer==0.12.5
+ typing_extensions==4.12.2
+ tzdata
+ urllib3
+ uvicorn==0.31.0
+ websockets==11.0.3
+ Werkzeug
+ xxhash
+ yarl
+ zipp
+ zstandard==0.19.0
web_ui.py ADDED
@@ -0,0 +1,93 @@
+ import os
+ import time
+ from dotenv import load_dotenv
+ from urllib.parse import urlparse
+ from firecrawl import FirecrawlApp
+ import gradio as gr
+
+ load_dotenv()
+
+ def map_website(url, api_key):
+     app = FirecrawlApp(api_key=api_key)
+     map_status = app.map_url(url)
+     if isinstance(map_status, list):
+         return map_status
+     else:
+         print("Failed to map the website:", map_status)
+         return []
+
+ def scrape_url(url, api_key):
+     app = FirecrawlApp(api_key=api_key)
+     scrape_status = app.scrape_url(url)
+     print(f"Scrape status for {url}: {scrape_status}")
+     if 'markdown' in scrape_status:
+         return scrape_status['markdown']
+     else:
+         print(f"Failed to scrape {url}: {scrape_status}")
+         return ""
+
+ def scrape_all_urls(base_url, api_key, limit_rate, progress=gr.Progress()):
+     urls = map_website(base_url, api_key)
+     parsed_url = urlparse(base_url)
+     domain = parsed_url.netloc.replace("www.", "")
+     os.makedirs('scraped_documentation', exist_ok=True)
+     output_file = os.path.join('scraped_documentation', f"{domain}.md")
+
+     with open(output_file, 'w', encoding='utf-8') as md_file:
+         for i, url in enumerate(progress.tqdm(urls)):
+             progress(i / len(urls), f"Scraping {url}")
+             markdown_content = scrape_url(url, api_key)
+             md_file.write(f"# {url}\n\n")
+             md_file.write(markdown_content)
+             md_file.write("\n\n---\n\n")
+             if limit_rate:
+                 if (i + 1) % 10 == 0:
+                     time.sleep(60)
+
+     return f"Scraping completed. Output saved to {output_file}"
+
+ def count_urls(base_url, api_key):
+     if not api_key:
+         return "Please enter your Firecrawl API key first."
+     urls = map_website(base_url, api_key)
+     return f"{len(urls)} URLs found. Do you want to proceed with scraping?"
+
+ def gradio_scrape(base_url, api_key, limit_rate):
+     if not api_key:
+         return "Please enter your Firecrawl API key."
+     if not base_url:
+         return "Please enter a base URL to scrape."
+     return scrape_all_urls(base_url, api_key, limit_rate)
+
+ with gr.Blocks() as iface:
+     gr.Markdown("# Docs Scraper")
+     gr.Markdown("## To map and scrape all URLs from a given website using the Firecrawl API, enter a base URL to scrape, your Firecrawl API key, and choose whether to limit the rate of scraping.")
+     gr.Markdown("Scraped content is saved into a markdown file named after the domain of the base URL, making it easy to reference and utilize. This can be particularly useful for AI code editors that need to gather context from various types of websites. By scraping the content, the AI can analyze and understand the structure and information provided, which can enhance its ability to offer accurate code suggestions and improvements.")
+     gr.HTML('Don\'t have an API key? <a href="https://firecrawl.dev/" target="_blank" rel="noopener noreferrer">Get one from Firecrawl</a>')
+
+     with gr.Row():
+         base_url = gr.Textbox(label="Base URL", placeholder="Enter the base URL to scrape")
+         api_key = gr.Textbox(label="Firecrawl API Key", type="password")
+         limit_rate = gr.Checkbox(
+             label="Limit Rate",
+             value=True,
+             info="Enable to limit scraping to 10 URLs per minute. This adheres to Firecrawl API's free tier rate limit."
+         )
+
+     gr.Markdown("After entering your API key, click 'Count URLs' to determine the number of URLs to be scraped. Then, click 'Scrape URLs' to begin the process. The progress and file location will be displayed in the textbox labeled 'Output'.")
+     with gr.Row():
+         count_button = gr.Button("Count URLs")
+         url_count = gr.Textbox(label="URL Count")
+
+     with gr.Row():
+         scrape_button = gr.Button("Scrape URLs")
+         output = gr.Textbox(label="Output", elem_id="output_textbox")
+
+     gr.Markdown("#### Note: The free tier of the Firecrawl API allows for 500 credits per month. If you need to scrape more, you can upgrade to a paid plan. The 'Count URLs' button may not work as expected if the base URL is not correctly specified or if the API key is invalid. Always ensure the base URL is correct and the API key is valid before proceeding.")
+
+
+     count_button.click(count_urls, inputs=[base_url, api_key], outputs=[url_count])
+     scrape_button.click(gradio_scrape, inputs=[base_url, api_key, limit_rate], outputs=[output])
+
+ if __name__ == "__main__":
+     iface.launch()