# AsyncWebCrawler

The `AsyncWebCrawler` class is the main interface for web crawling operations. It provides asynchronous web crawling capabilities with extensive configuration options.

## Constructor

```python
AsyncWebCrawler(
    # Browser Settings
    browser_type: str = "chromium",  # Options: "chromium", "firefox", "webkit"
    headless: bool = True,           # Run browser in headless mode
    verbose: bool = False,           # Enable verbose logging

    # Cache Settings
    always_by_pass_cache: bool = False,  # Always bypass cache
    base_directory: str = str(
        os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())
    ),  # Base directory for cache

    # Network Settings
    proxy: str = None,          # Simple proxy URL
    proxy_config: Dict = None,  # Advanced proxy configuration

    # Browser Behavior
    sleep_on_close: bool = False,  # Wait before closing browser

    # Custom Settings
    user_agent: str = None,                 # Custom user agent
    headers: Dict[str, str] = {},           # Custom HTTP headers
    js_code: Union[str, List[str]] = None,  # Default JavaScript to execute
)
```
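
The examples on this page assume the class is imported from the package root; adjust the import to match your installed version:

```python
from crawl4ai import AsyncWebCrawler
```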

### Parameters in Detail

#### Browser Settings

- **browser_type** (str, optional)
  - Default: `"chromium"`
  - Options: `"chromium"`, `"firefox"`, `"webkit"`
  - Controls which browser engine to use

  ```python
  # Example: Using Firefox
  crawler = AsyncWebCrawler(browser_type="firefox")
  ```

- **headless** (bool, optional)
  - Default: `True`
  - When `True`, the browser runs without a GUI
  - Set to `False` for debugging

  ```python
  # Visible browser for debugging
  crawler = AsyncWebCrawler(headless=False)
  ```

- **verbose** (bool, optional)
  - Default: `False`
  - Enables detailed logging

  ```python
  # Enable detailed logging
  crawler = AsyncWebCrawler(verbose=True)
  ```

#### Cache Settings

- **always_by_pass_cache** (bool, optional)
  - Default: `False`
  - When `True`, always fetches fresh content instead of reading from the cache

  ```python
  # Always fetch fresh content
  crawler = AsyncWebCrawler(always_by_pass_cache=True)
  ```

- **base_directory** (str, optional)
  - Default: the user's home directory
  - Base path for cache storage

  ```python
  # Custom cache directory
  crawler = AsyncWebCrawler(base_directory="/path/to/cache")
  ```

#### Network Settings

- **proxy** (str, optional)
  - Simple proxy URL

  ```python
  # Using a simple proxy
  crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
  ```

- **proxy_config** (Dict, optional)
  - Advanced proxy configuration with authentication

  ```python
  # Advanced proxy with auth
  crawler = AsyncWebCrawler(proxy_config={
      "server": "http://proxy.example.com:8080",
      "username": "user",
      "password": "pass"
  })
  ```

#### Browser Behavior

- **sleep_on_close** (bool, optional)
  - Default: `False`
  - Adds a delay before the browser closes

  ```python
  # Wait before closing
  crawler = AsyncWebCrawler(sleep_on_close=True)
  ```

#### Custom Settings

- **user_agent** (str, optional)
  - Custom user agent string

  ```python
  # Custom user agent
  crawler = AsyncWebCrawler(
      user_agent="Mozilla/5.0 (Custom Agent) Chrome/90.0"
  )
  ```

- **headers** (Dict[str, str], optional)
  - Custom HTTP headers

  ```python
  # Custom headers
  crawler = AsyncWebCrawler(
      headers={
          "Accept-Language": "en-US",
          "Custom-Header": "Value"
      }
  )
  ```

- **js_code** (Union[str, List[str]], optional)
  - Default JavaScript to execute on each page

  ```python
  # Default JavaScript
  crawler = AsyncWebCrawler(
      js_code=[
          "window.scrollTo(0, document.body.scrollHeight);",
          "document.querySelector('.load-more').click();"
      ]
  )
  ```

## Methods

### arun()

The primary method for crawling web pages.

```python
async def arun(
    # Required
    url: str,  # URL to crawl

    # Content Selection
    css_selector: str = None,        # CSS selector for content
    word_count_threshold: int = 10,  # Minimum words per block

    # Cache Control
    bypass_cache: bool = False,  # Bypass cache for this request

    # Session Management
    session_id: str = None,  # Session identifier

    # Screenshot Options
    screenshot: bool = False,           # Take screenshot
    screenshot_wait_for: float = None,  # Wait before screenshot

    # Content Processing
    process_iframes: bool = False,          # Process iframe content
    remove_overlay_elements: bool = False,  # Remove popups/modals

    # Anti-Bot Settings
    simulate_user: bool = False,       # Simulate human behavior
    override_navigator: bool = False,  # Override navigator properties
    magic: bool = False,               # Enable all anti-detection features

    # Content Filtering
    excluded_tags: List[str] = None,           # HTML tags to exclude
    exclude_external_links: bool = False,      # Remove external links
    exclude_social_media_links: bool = False,  # Remove social media links

    # JavaScript Handling
    js_code: Union[str, List[str]] = None,  # JavaScript to execute
    wait_for: str = None,                   # Wait condition

    # Page Loading
    page_timeout: int = 60000,               # Page load timeout (ms)
    delay_before_return_html: float = None,  # Wait before returning HTML

    # Extraction
    extraction_strategy: ExtractionStrategy = None  # Extraction strategy
) -> CrawlResult:
```
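
As one concrete use of `extraction_strategy`, the sketch below passes a `JsonCssExtractionStrategy` and reads the result's `extracted_content` field. Both come from the wider crawl4ai library rather than this section, so treat the exact schema keys as an assumption to verify against your installed version:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Schema keys ("name", "baseSelector", "fields") follow the
# JsonCssExtractionStrategy convention; selectors are placeholders.
schema = {
    "name": "Articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",  # placeholder URL
            extraction_strategy=JsonCssExtractionStrategy(schema),
        )
        print(result.extracted_content)  # JSON string of extracted items

asyncio.run(main())
```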

### Usage Examples

#### Basic Crawling

```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
```

#### Advanced Crawling

```python
async with AsyncWebCrawler(
    browser_type="firefox",
    verbose=True,
    headers={"Custom-Header": "Value"}
) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        css_selector=".main-content",
        word_count_threshold=20,
        process_iframes=True,
        magic=True,
        wait_for="css:.dynamic-content",
        screenshot=True
    )
```
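
With `screenshot=True`, the captured image can be persisted afterwards. A minimal sketch, assuming the result exposes a base64-encoded `screenshot` field (a `CrawlResult` attribute not documented in this section):

```python
import base64

# Decode and save the screenshot captured by the call above.
# Assumes result.screenshot holds base64-encoded image data.
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```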

#### Session Management

```python
async with AsyncWebCrawler() as crawler:
    # First request
    result1 = await crawler.arun(
        url="https://example.com/login",
        session_id="my_session"
    )

    # Subsequent request using the same session
    result2 = await crawler.arun(
        url="https://example.com/protected",
        session_id="my_session"
    )
```
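
Sessions combine naturally with `js_code` and `wait_for` for multi-step flows such as logging in. A hypothetical sketch; the URLs, selectors, and wait condition below are placeholders, not a real site:

```python
async with AsyncWebCrawler() as crawler:
    # Step 1: submit the login form (all selectors are placeholders).
    await crawler.arun(
        url="https://example.com/login",
        session_id="auth_session",
        js_code=[
            "document.querySelector('#username').value = 'user';",
            "document.querySelector('#password').value = 'secret';",
            "document.querySelector('form').submit();",
        ],
        wait_for="css:.dashboard",  # wait until the post-login page renders
    )

    # Step 2: reuse the authenticated session for protected pages.
    result = await crawler.arun(
        url="https://example.com/account",
        session_id="auth_session",
    )
```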

## Context Manager

AsyncWebCrawler implements the async context manager protocol:

```python
async def __aenter__(self) -> 'AsyncWebCrawler':
    # Initialize browser and resources
    return self

async def __aexit__(self, *args):
    # Clean up resources
    pass
```

Always use AsyncWebCrawler with the async context manager:

```python
async with AsyncWebCrawler() as crawler:
    # Your crawling code here
    pass
```
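
A single crawler instance can serve several `arun()` calls. A sketch of fanning out with `asyncio.gather`, assuming your crawl4ai version supports concurrent calls on one instance and the target site tolerates the load:

```python
import asyncio

async def crawl_many(urls):
    # Reuse a single browser for all URLs; gather() runs the
    # requests concurrently. Tune the batch size to be polite.
    async with AsyncWebCrawler() as crawler:
        return await asyncio.gather(*(crawler.arun(url=u) for u in urls))

results = asyncio.run(crawl_many([
    "https://example.com/a",
    "https://example.com/b",
]))
```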

## Best Practices

1. **Resource Management**

   ```python
   # Always use the context manager
   async with AsyncWebCrawler() as crawler:
       # Crawler will be properly cleaned up
       pass
   ```

2. **Error Handling** (a retry sketch follows after this list)

   ```python
   try:
       async with AsyncWebCrawler() as crawler:
           result = await crawler.arun(url="https://example.com")
           if not result.success:
               print(f"Crawl failed: {result.error_message}")
   except Exception as e:
       print(f"Error: {str(e)}")
   ```

3. **Performance Optimization**

   ```python
   # Enable caching for better performance
   crawler = AsyncWebCrawler(
       always_by_pass_cache=False,
       verbose=True
   )
   ```

4. **Anti-Detection**

   ```python
   # Maximum stealth
   crawler = AsyncWebCrawler(
       headless=True,
       user_agent="Mozilla/5.0...",
       headers={"Accept-Language": "en-US"}
   )
   result = await crawler.arun(
       url="https://example.com",
       magic=True,
       simulate_user=True
   )
   ```
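
Building on the error-handling pattern in item 2, a simple retry wrapper; the helper name and backoff policy are illustrative, not part of the library:

```python
import asyncio

async def crawl_with_retry(url: str, retries: int = 3, backoff: float = 2.0):
    # Illustrative helper: retry failed crawls with linear backoff,
    # bypassing the cache on retries so a cached failure isn't replayed.
    async with AsyncWebCrawler() as crawler:
        result = None
        for attempt in range(1, retries + 1):
            result = await crawler.arun(url=url, bypass_cache=attempt > 1)
            if result.success:
                break
            await asyncio.sleep(backoff * attempt)
        return result
```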

## Note on Browser Types

Each browser type has its own characteristics:

- **chromium**: Best overall compatibility
- **firefox**: Good for specific use cases
- **webkit**: Lighter weight, good for basic crawling

Choose based on your specific needs:

```python
# High compatibility
crawler = AsyncWebCrawler(browser_type="chromium")

# Memory efficient
crawler = AsyncWebCrawler(browser_type="webkit")
```