|
### Content Selection |
|
Crawl4AI provides multiple ways to select and filter specific content from webpages. Learn how to precisely target the content you need. |
|
#### CSS Selectors |
|
Extract specific content by passing a CSS selector to `CrawlerRunConfig`:
|
```python
from crawl4ai.async_configs import CrawlerRunConfig

# Target the main article content
config = CrawlerRunConfig(css_selector=".main-article")
result = await crawler.arun(url="https://crawl4ai.com", config=config)

# Target the heading and content areas
config = CrawlerRunConfig(css_selector="article h1, article .content")
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
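
The snippets above assume an already-open `AsyncWebCrawler` named `crawler`, as in the comprehensive example at the end of this section. A minimal self-contained sketch of the same call:

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(css_selector=".main-article")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://crawl4ai.com", config=config)
        print(result.markdown)  # Output is generated from the selected region only

asyncio.run(main())
```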
|
#### Content Filtering |
|
Control content inclusion or exclusion with `CrawlerRunConfig`: |
|
```python
config = CrawlerRunConfig(
    word_count_threshold=10,          # Minimum words per content block
    excluded_tags=['form', 'header', 'footer', 'nav'],  # Tags to drop entirely
    exclude_external_links=True,      # Remove links pointing to other domains
    exclude_social_media_links=True,  # Remove links to social media sites
    exclude_external_images=True      # Remove images hosted on other domains
)

result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
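
These filters act on the cleaned output rather than the raw page, so a quick sanity check is to compare the raw and cleaned HTML sizes (a rough sketch; `html` and `cleaned_html` are the result object's raw and cleaned snapshots):

```python
result = await crawler.arun(url="https://crawl4ai.com", config=config)
print(len(result.html), len(result.cleaned_html))  # Cleaned HTML should be noticeably smaller
```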
|
#### Iframe Content |
|
Process iframe content by enabling specific options in `CrawlerRunConfig`: |
|
```python
config = CrawlerRunConfig(
    process_iframes=True,         # Inline iframe content into the page output
    remove_overlay_elements=True  # Remove popups/modals that might block iframes
)

result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
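
With `process_iframes=True`, iframe bodies are merged into the parent page before extraction, so their text should appear alongside the rest of the page content in the result.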
|
#### Structured Content Selection Using LLMs |
|
Use an LLM to extract structured content that conforms to a Pydantic schema:
|
```python
import json
from typing import List

from pydantic import BaseModel

from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class ArticleContent(BaseModel):
    title: str
    main_points: List[str]
    conclusion: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",       # Any supported LLM provider works here
    schema=ArticleContent.schema(),   # JSON schema the LLM output must follow
    instruction="Extract the main article title, key points, and conclusion"
)

config = CrawlerRunConfig(extraction_strategy=strategy)

result = await crawler.arun(url="https://crawl4ai.com", config=config)
article = json.loads(result.extracted_content)
```
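
Depending on the library version, `extracted_content` can be a JSON list of extracted blocks rather than a single object; a defensive way to read it, assuming the `ArticleContent` schema above:

```python
data = json.loads(result.extracted_content)
article = data[0] if isinstance(data, list) else data  # Some versions return a list of blocks
print(article["title"])
for point in article["main_points"]:
    print("-", point)
```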
|
#### Pattern-Based Selection |
|
Extract repeated page elements, such as a list of article cards, with a JSON-CSS schema:
|
```python
import json

from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",  # One result object per matching container
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "selector": ".metadata",  # Nested fields need a scoping selector (class name assumed)
            "type": "nested",
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
}

strategy = JsonCssExtractionStrategy(schema)
config = CrawlerRunConfig(extraction_strategy=strategy)

result = await crawler.arun(url="https://crawl4ai.com", config=config)
articles = json.loads(result.extracted_content)
```
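
The strategy emits one object per `baseSelector` match, so the parsed result is a list that can be iterated directly:

```python
for article in articles:
    print(article["headline"], "|", article["category"])
    meta = article.get("metadata", {})
    print("  by", meta.get("author"), "on", meta.get("date"))
```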
|
#### Comprehensive Example |
|
Combine structured extraction with content filtering in a single `CrawlerRunConfig`:
|
```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_article_content(url: str):
    # Define structured extraction
    article_schema = {
        "name": "Article",
        "baseSelector": "article.main",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"}
        ]
    }

    # Combine structured extraction with content filtering
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(article_schema),
        word_count_threshold=10,
        excluded_tags=['nav', 'footer'],
        exclude_external_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return json.loads(result.extracted_content)
```
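
Calling the helper from synchronous code (the URL is a placeholder):

```python
import asyncio

articles = asyncio.run(extract_article_content("https://crawl4ai.com"))
print(articles)  # A list with one object per `article.main` match
```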
|