# Crawl4AI
Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.
## Introduction
Crawl4AI has one clear task: to make crawling and data extraction from web pages easy and efficient, especially for large language models (LLMs) and AI applications. Whether you are using it as a REST API or a Python library, Crawl4AI offers a robust and flexible solution with full asynchronous support.
## Quick Start
Here's a quick example to show you how easy it is to use Crawl4AI with its asynchronous capabilities:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        # Print the extracted content
        print(result.markdown)

# Run the async main function
asyncio.run(main())
```
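A crawl returns more than markdown: the `CrawlResult` object also carries the cleaned HTML, media, links, and metadata listed under Key Features below (see the [CrawlResult](api/crawl-result.md) reference). Here is a minimal sketch of reading those fields, assuming the field names from that reference:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        # Different views of the same crawl
        print(result.cleaned_html[:300])                   # sanitized HTML
        print(len(result.media.get("images", [])), "images")
        print(len(result.links.get("internal", [])), "internal links")
        print(result.metadata)                             # title, description, ...

asyncio.run(main())
```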
## Key Features ✨
- 🆓 Completely free and open-source
- 🚀 Blazing fast performance, outperforming many paid services
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 📄 Fit markdown generation for extracting main article content
- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
- 🌍 Supports crawling multiple URLs simultaneously
- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
- 🔗 Extracts all external and internal links
- 📚 Extracts metadata from the page
- 🔄 Custom hooks for authentication, headers, and page modifications
- 🕵️ User-agent customization
- 🖼️ Takes screenshots of pages with enhanced error handling
- 📜 Executes multiple custom JavaScripts before crawling
- 📊 Generates structured output without LLM using JsonCssExtractionStrategy (see the sketch after this list)
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support for precise data extraction
- 📝 Passes instructions/keywords to refine extraction
- 🔒 Proxy support with authentication for enhanced access
- 🔄 Session management for complex multi-page crawling
- 🚀 Asynchronous architecture for improved performance
- 🖼️ Improved image processing with lazy-loading detection
- 🕰️ Enhanced handling of delayed content loading
- 🔑 Custom headers support for LLM interactions
- 🖼️ iframe content extraction for comprehensive analysis
- ⏱️ Flexible timeout and delayed content retrieval options
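As noted in the feature list above, `JsonCssExtractionStrategy` produces structured JSON without calling an LLM: you describe the repeating elements with CSS selectors, and every match becomes one JSON object. A minimal sketch, where the schema (its name, `baseSelector`, and field selectors) is a hypothetical example you would adapt to the target page's markup:
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Hypothetical schema: one JSON object per element matching baseSelector
schema = {
    "name": "News Teasers",
    "baseSelector": "article.teaser",  # assumed selector, page-specific
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=JsonCssExtractionStrategy(schema),
        )
        # extracted_content is a JSON string shaped by the schema
        print(json.loads(result.extracted_content))

asyncio.run(main())
```
See [CSS-Based Extraction](extraction/css.md) for the full set of supported field types.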
## Documentation Structure
Our documentation is organized into several sections:
### Basic Usage
- [Installation](basic/installation.md)
- [Quick Start](basic/quickstart.md)
- [Simple Crawling](basic/simple-crawling.md)
- [Browser Configuration](basic/browser-config.md)
- [Content Selection](basic/content-selection.md)
- [Output Formats](basic/output-formats.md)
- [Page Interaction](basic/page-interaction.md)
### Advanced Features
- [Magic Mode](advanced/magic-mode.md)
- [Session Management](advanced/session-management.md)
- [Hooks & Authentication](advanced/hooks-auth.md)
- [Proxy & Security](advanced/proxy-security.md)
- [Content Processing](advanced/content-processing.md)
### Extraction & Processing
- [Extraction Strategies Overview](extraction/overview.md)
- [LLM Integration](extraction/llm.md)
- [CSS-Based Extraction](extraction/css.md)
- [Cosine Strategy](extraction/cosine.md)
- [Chunking Strategies](extraction/chunking.md)
### API Reference
- [AsyncWebCrawler](api/async-webcrawler.md)
- [CrawlResult](api/crawl-result.md)
- [Extraction Strategies](api/strategies.md)
- [arun() Method Parameters](api/arun.md)
### Examples
- Coming soon!
## Getting Started
1. Install Crawl4AI:
   ```bash
   pip install crawl4ai
   ```
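   Depending on your environment, you may also need to download Playwright's browser binaries (a setup step assumed here; see the [Installation](basic/installation.md) guide for the exact commands for your version):
   ```bash
   # Assumed post-install step: fetch the browsers Crawl4AI drives
   playwright install
   ```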
2. Check out our [Quick Start Guide](basic/quickstart.md) to begin crawling web pages.
3. Explore our [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) to see Crawl4AI in action.
## Support
For questions, suggestions, or issues:
- GitHub Issues: [Report a Bug](https://github.com/unclecode/crawl4ai/issues)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)
Happy Crawling! 🕸️🚀