# Complete Parameter Guide for arun()

The following parameters can be passed to the `arun()` method. They are organized by primary usage context and functionality.
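All of the snippets below assume an `AsyncWebCrawler` instance named `crawler` is already open. A minimal end-to-end sketch of that setup (the URL is a placeholder):

```python
import asyncio

async def main():
    from crawl4ai import AsyncWebCrawler, CacheMode  # deferred so the sketch stays self-contained

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",     # placeholder URL
            cache_mode=CacheMode.ENABLED,
        )
        print(result.markdown[:300])       # first 300 chars of the extracted markdown

# To execute: asyncio.run(main())
```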
## Core Parameters

```python
await crawler.arun(
    url="https://example.com",     # Required: URL to crawl
    verbose=True,                  # Enable detailed logging
    cache_mode=CacheMode.ENABLED,  # Control cache behavior
    warmup=True                    # Whether to run the warmup check
)
```
## Cache Control

```python
from crawl4ai import CacheMode

await crawler.arun(
    cache_mode=CacheMode.ENABLED,  # Normal caching (read/write)
    # Other cache modes:
    # cache_mode=CacheMode.DISABLED    # No caching at all
    # cache_mode=CacheMode.READ_ONLY   # Only read from cache
    # cache_mode=CacheMode.WRITE_ONLY  # Only write to cache
    # cache_mode=CacheMode.BYPASS      # Skip cache for this operation
)
```
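The read/write behavior of each mode can be summarized as a pair of flags. The sketch below is a conceptual model of those semantics for illustration, not crawl4ai's internal implementation:

```python
from enum import Enum, auto

class CacheMode(Enum):  # mirrors the crawl4ai enum names for illustration
    ENABLED = auto()
    DISABLED = auto()
    READ_ONLY = auto()
    WRITE_ONLY = auto()
    BYPASS = auto()

# (reads from cache?, writes to cache?) for each mode
CACHE_SEMANTICS = {
    CacheMode.ENABLED:    (True,  True),
    CacheMode.DISABLED:   (False, False),
    CacheMode.READ_ONLY:  (True,  False),
    CacheMode.WRITE_ONLY: (False, True),
    CacheMode.BYPASS:     (False, False),  # skips the cache for this call only
}

def should_read(mode: CacheMode) -> bool:
    return CACHE_SEMANTICS[mode][0]

def should_write(mode: CacheMode) -> bool:
    return CACHE_SEMANTICS[mode][1]
```

For example, `WRITE_ONLY` always fetches fresh content but still stores the result for later `READ_ONLY` or `ENABLED` calls.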
## Content Processing Parameters

### Text Processing

```python
await crawler.arun(
    word_count_threshold=10,                 # Minimum words per content block
    image_description_min_word_threshold=5,  # Minimum words for image descriptions
    only_text=False,                         # If True, extract text content only
    excluded_tags=['form', 'nav'],           # HTML tags to exclude
    keep_data_attributes=False,              # If True, preserve data-* attributes
)
```
### Content Selection

```python
await crawler.arun(
    css_selector=".main-content",  # CSS selector for content extraction
    remove_forms=True,             # Remove all form elements
    remove_overlay_elements=True,  # Remove popups/modals/overlays
)
```
### Link Handling

```python
await crawler.arun(
    exclude_external_links=True,          # Remove external links
    exclude_social_media_links=True,      # Remove social media links
    exclude_external_images=True,         # Remove external images
    exclude_domains=["ads.example.com"],  # Specific domains to exclude
    social_media_domains=[                # Additional social media domains
        "facebook.com",
        "twitter.com",
        "instagram.com"
    ]
)
```
## Browser Control Parameters

### Basic Browser Settings

```python
await crawler.arun(
    headless=True,              # Run browser in headless mode
    browser_type="chromium",    # Browser engine: "chromium", "firefox", "webkit"
    page_timeout=60000,         # Page load timeout in milliseconds
    user_agent="custom-agent",  # Custom user agent string
)
```
### Navigation and Waiting

```python
await crawler.arun(
    wait_for="css:.dynamic-content",  # Wait for an element/condition ("css:" or "js:" prefix)
    delay_before_return_html=2.0,     # Wait before returning HTML (seconds)
)
```
### JavaScript Execution

```python
await crawler.arun(
    js_code=[  # JavaScript to execute (string or list of strings)
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more').click();"
    ],
    js_only=False,  # If True, execute JavaScript in the existing page without reloading
)
```
### Anti-Bot Features

```python
await crawler.arun(
    magic=True,               # Enable all anti-detection features
    simulate_user=True,       # Simulate human behavior
    override_navigator=True   # Override navigator properties
)
```
### Session Management

```python
await crawler.arun(
    session_id="my_session",  # Session identifier for persistent browsing
)
```
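A session becomes useful across multiple calls: passing the same `session_id` reuses the browser page (cookies, JavaScript state), and the session should be released when finished. A sketch under that assumption (the URLs are placeholders; the `kill_session` cleanup call follows common crawl4ai examples and may differ by version):

```python
import asyncio

async def crawl_with_session():
    from crawl4ai import AsyncWebCrawler  # deferred so the sketch stays self-contained

    async with AsyncWebCrawler() as crawler:
        # First call creates the session and loads the page
        result1 = await crawler.arun(
            url="https://example.com/page-1",
            session_id="my_session",
        )
        # Second call reuses the same browser page
        result2 = await crawler.arun(
            url="https://example.com/page-2",
            session_id="my_session",
        )
        # Release the session's browser resources when done
        await crawler.crawler_strategy.kill_session("my_session")
        return result1, result2

# To execute: asyncio.run(crawl_with_session())
```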
### Screenshot Options

```python
await crawler.arun(
    screenshot=True,          # Take a page screenshot
    screenshot_wait_for=2.0,  # Wait before the screenshot (seconds)
)
```
### Proxy Configuration

```python
await crawler.arun(
    proxy="http://proxy.example.com:8080",  # Simple proxy URL
    proxy_config={                          # Advanced proxy settings
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass"
    }
)
```
## Content Extraction Parameters

### Extraction Strategy

```python
await crawler.arun(
    extraction_strategy=LLMExtractionStrategy(
        provider="ollama/llama2",
        schema=MySchema.schema(),
        instruction="Extract specific data"
    )
)
```
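The `schema` argument is a JSON schema describing the fields the LLM should return; `MySchema` above is typically a Pydantic model, but a plain dict works as well since `.schema()` produces one. A hypothetical product schema for illustration:

```python
# Hypothetical schema: the field names are examples, not part of the crawl4ai API
product_schema = {
    "type": "object",
    "properties": {
        "name":   {"type": "string", "description": "Product name"},
        "price":  {"type": "string", "description": "Price with currency symbol"},
        "rating": {"type": "number", "description": "Average rating, 0-5"},
    },
    "required": ["name", "price"],
}
```

This dict would be passed directly as `schema=product_schema` in place of `MySchema.schema()`.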
### Chunking Strategy

```python
await crawler.arun(
    chunking_strategy=RegexChunking(
        patterns=[r'\n\n', r'\.\s+']
    )
)
```
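Conceptually, regex chunking splits the extracted text on each pattern in turn — first at blank lines, then at sentence boundaries in the example above. The stdlib sketch below illustrates the idea; it is not `RegexChunking`'s actual implementation:

```python
import re

def chunk_text(text: str, patterns: list[str]) -> list[str]:
    """Split text on each regex pattern in turn, dropping empty chunks."""
    chunks = [text]
    for pattern in patterns:
        next_chunks = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    return [c.strip() for c in chunks if c.strip()]

chunks = chunk_text(
    "First paragraph.\n\nSecond one. Third sentence here.",
    [r'\n\n', r'\.\s+'],
)
# → ["First paragraph.", "Second one", "Third sentence here."]
```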
### HTML to Text Options

```python
await crawler.arun(
    html2text={
        "ignore_links": False,   # Keep links in the text output
        "ignore_images": False,  # Keep image references
        "escape_dot": False,     # Don't escape dots
        "body_width": 0,         # No line wrapping
        "protect_links": True,   # Keep links from wrapping across lines
        "unicode_snob": True     # Prefer unicode over ASCII approximations
    }
)
```
## Debug Options

```python
await crawler.arun(
    log_console=True,  # Log browser console messages
)
```
## Parameter Interactions and Notes

1. **Cache and Performance Setup**

    ```python
    # Optimal caching for repeated crawls
    await crawler.arun(
        cache_mode=CacheMode.ENABLED,
        word_count_threshold=10,
        process_iframes=False
    )
    ```

2. **Dynamic Content Handling**

    ```python
    # Handle lazy-loaded content
    await crawler.arun(
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="css:.lazy-content",
        delay_before_return_html=2.0,
        cache_mode=CacheMode.WRITE_ONLY  # Cache results after dynamic load
    )
    ```

3. **Content Extraction Pipeline**

    ```python
    # Complete extraction setup
    await crawler.arun(
        css_selector=".main-content",
        word_count_threshold=20,
        extraction_strategy=my_strategy,
        chunking_strategy=my_chunking,
        process_iframes=True,
        remove_overlay_elements=True,
        cache_mode=CacheMode.ENABLED
    )
    ```
## Best Practices

1. **Performance Optimization**

    ```python
    await crawler.arun(
        cache_mode=CacheMode.ENABLED,  # Use full caching
        word_count_threshold=10,       # Filter out noise
        process_iframes=False          # Skip iframes if not needed
    )
    ```

2. **Reliable Scraping**

    ```python
    await crawler.arun(
        magic=True,                      # Enable anti-detection
        delay_before_return_html=1.0,    # Wait for dynamic content
        page_timeout=60000,              # Longer timeout for slow pages
        cache_mode=CacheMode.WRITE_ONLY  # Cache results after a successful crawl
    )
    ```

3. **Clean Content**

    ```python
    await crawler.arun(
        remove_overlay_elements=True,    # Remove popups
        excluded_tags=['nav', 'aside'],  # Remove unnecessary elements
        keep_data_attributes=False,      # Remove data attributes
        cache_mode=CacheMode.ENABLED     # Use cache for faster processing
    )
    ```