# Extraction Strategies Overview

Crawl4AI provides several extraction strategies for pulling structured data out of web pages. Each strategy targets a different kind of content and trades off speed, cost, and flexibility differently.

## Available Strategies

### [LLM-Based Extraction](llm.md)

`LLMExtractionStrategy` uses a large language model (LLM) to extract structured data from web content. It is the most flexible approach because the model interprets the content semantically rather than relying on page structure.

```python
import asyncio

from pydantic import BaseModel

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy


class Product(BaseModel):
    name: str
    price: float
    description: str


strategy = LLMExtractionStrategy(
    provider="ollama/llama2",
    # Pydantic v2 name; on Pydantic v1 use Product.schema()
    schema=Product.model_json_schema(),
    instruction="Extract product details from the page",
)


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/product",
            extraction_strategy=strategy,
        )
        print(result.extracted_content)


asyncio.run(main())
```

**Best for:**

- Complex data structures
- Content requiring interpretation
- Flexible content formats
- Natural language processing

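The extracted data comes back as a JSON string in `result.extracted_content`. As a minimal sketch, assuming that string decodes to a list of objects shaped like `Product` (the output shape is an assumption, and the `ProductRecord` class, `parse_products` helper, and sample payload are hypothetical, not part of Crawl4AI), the records can be checked before use:

```python
import json
from dataclasses import dataclass


@dataclass
class ProductRecord:
    name: str
    price: float
    description: str


def parse_products(extracted_content: str) -> list[ProductRecord]:
    """Decode extracted JSON and keep records with a name and a numeric price."""
    raw = json.loads(extracted_content)
    # Assumption: the payload is a list of objects; wrap a single object.
    if isinstance(raw, dict):
        raw = [raw]
    products = []
    for item in raw:
        try:
            products.append(
                ProductRecord(
                    name=str(item["name"]),
                    price=float(item["price"]),
                    description=str(item.get("description", "")),
                )
            )
        except (KeyError, TypeError, ValueError, AttributeError):
            # Skip records the model returned in an unexpected shape.
            continue
    return products


# Hypothetical payload standing in for result.extracted_content.
sample = '[{"name": "Lamp", "price": "19.5", "description": "Desk lamp"}, {"name": "Broken"}]'
products = parse_products(sample)
print(products)
```

Skipping malformed records rather than raising is a design choice that suits exploratory scraping; a stricter pipeline might prefer to fail loudly instead.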
### [CSS-Based Extraction](css.md)

`JsonCssExtractionStrategy` extracts data using CSS selectors. Because no model call is involved, it is fast and deterministic, which makes it a good fit for pages with a consistent structure.

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Each element matching baseSelector becomes one record;
# field selectors are resolved relative to that element.
schema = {
    "name": "Product Listing",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ],
}

strategy = JsonCssExtractionStrategy(schema)


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            extraction_strategy=strategy,
        )
        print(result.extracted_content)


asyncio.run(main())
```

**Best for:**

- E-commerce product listings
- News article collections
- Structured content pages
- High-performance needs

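The `price` field in the schema above is extracted as text, so numeric values arrive as strings. A small post-processing step can normalize them. The sketch below assumes US-style formatting (comma thousands separators, dot decimals); `parse_price` and the sample rows are hypothetical and not part of Crawl4AI.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(text: str) -> Optional[Decimal]:
    """Return the first number found in a price string, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


# Hypothetical rows, shaped like records the CSS schema might produce.
rows = [
    {"title": "Laptop", "price": "$1,299.00"},
    {"title": "Cable", "price": "USD 7.5"},
    {"title": "Mystery box", "price": "Call for price"},
]
for row in rows:
    row["price_value"] = parse_price(row["price"])
```

`Decimal` is used instead of `float` so that currency amounts are not subject to binary rounding error.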
### [Cosine Strategy](cosine.md)

`CosineStrategy` clusters page content by similarity and extracts the sections most relevant to a given topic.

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="product reviews",  # Content focus
    word_count_threshold=10,            # Minimum words per cluster
    sim_threshold=0.3,                  # Similarity threshold
    max_dist=0.2,                       # Maximum cluster distance
    top_k=3,                            # Number of top clusters to extract
)


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/reviews",
            extraction_strategy=strategy,
        )
        print(result.extracted_content)


asyncio.run(main())
```

**Best for:**

- Content similarity analysis
- Topic clustering
- Relevant content extraction
- Pattern recognition in text

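To make the idea of a similarity threshold concrete, here is a simplified illustration of cosine-similarity filtering using plain word counts. It is not how `CosineStrategy` is implemented (this page does not describe its internals, such as how text is embedded or clustered); `vectorize`, `cosine_similarity`, `filter_sections`, and the sample text are all hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_sections(sections, query, sim_threshold=0.3, top_k=3):
    """Keep up to top_k sections whose similarity to the query meets the threshold."""
    q = vectorize(query)
    scored = [(cosine_similarity(vectorize(s), q), s) for s in sections]
    kept = [pair for pair in scored if pair[0] >= sim_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in kept[:top_k]]


sections = [
    "Great product, the reviews were right",
    "Shipping and returns policy",
    "Product reviews from verified buyers",
]
matches = filter_sections(sections, "product reviews")
print(matches)
```

In this toy version, raising `sim_threshold` drops loosely related sections, while `top_k` caps how many survive regardless of score.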
## Strategy Selection Guide

Choose your strategy based on these factors:

1. **Content Structure**
   - Well-structured HTML → Use CSS Strategy
   - Natural language text → Use LLM Strategy
   - Mixed/Complex content → Use Cosine Strategy

2. **Performance Requirements**
   - Fastest: CSS Strategy
   - Moderate: Cosine Strategy
   - Variable: LLM Strategy (depends on provider)

3. **Accuracy Needs**
   - Highest structure accuracy: CSS Strategy
   - Best semantic understanding: LLM Strategy
   - Best content relevance: Cosine Strategy

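The content-structure rules in the guide can be condensed into a small decision helper. This is only a sketch: `choose_strategy` and its argument values are hypothetical, and the speed tie-break is an assumption layered on top of the guide, not something Crawl4AI provides.

```python
def choose_strategy(content: str, need_speed: bool = False) -> str:
    """Map a content description to the strategy class name the guide suggests.

    content: "structured" (well-structured HTML), "natural" (natural
    language text), or "mixed" (mixed or complex content).
    """
    by_content = {
        "structured": "JsonCssExtractionStrategy",
        "natural": "LLMExtractionStrategy",
        "mixed": "CosineStrategy",
    }
    if content not in by_content:
        raise ValueError(f"unknown content type: {content!r}")
    choice = by_content[content]
    # Assumed tie-break: LLM speed depends on the provider, so when speed
    # matters fall back to the moderately fast Cosine strategy.
    if need_speed and choice == "LLMExtractionStrategy":
        return "CosineStrategy"
    return choice


print(choose_strategy("structured"))
print(choose_strategy("natural", need_speed=True))
```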
## Combining Strategies

You can run more than one strategy against the same page and combine the results, for example using CSS selectors for the fixed structure and an LLM for semantic analysis:

```python
# Assumes css_strategy and llm_strategy are strategy instances you have
# already built, and crawler is an open AsyncWebCrawler (run this inside
# an async function).

# First use CSS strategy for initial structure
css_result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=css_strategy,
)

# Then use LLM for semantic analysis
llm_result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=llm_strategy,
)
```

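Once both runs finish, their outputs still have to be joined. One way to do that is sketched below, assuming each `extracted_content` is a JSON list of objects that share a key such as `name` (an assumption; the output shape depends on your schemas). `merge_by_key` and the sample payloads are hypothetical.

```python
import json


def merge_by_key(css_json: str, llm_json: str, key: str = "name") -> list[dict]:
    """Merge two JSON lists of objects on a shared key.

    CSS fields take precedence; LLM fields fill in whatever is missing.
    Records that appear only in the LLM output are dropped.
    """
    merged = {row[key]: dict(row) for row in json.loads(css_json) if key in row}
    for row in json.loads(llm_json):
        if row.get(key) in merged:
            for field, value in row.items():
                merged[row[key]].setdefault(field, value)
    return list(merged.values())


# Hypothetical payloads standing in for css_result.extracted_content
# and llm_result.extracted_content.
css_json = '[{"name": "Lamp", "price": "$19.50"}]'
llm_json = (
    '[{"name": "Lamp", "price": 19.5, "sentiment": "positive"},'
    ' {"name": "Desk", "sentiment": "neutral"}]'
)
combined = merge_by_key(css_json, llm_json)
print(combined)
```

Giving the CSS values precedence reflects the selection guide's point that CSS extraction has the highest structural accuracy; swap the order if you trust the LLM output more for a given field.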
## Common Use Cases

1. **E-commerce Scraping**
   ```python
   # CSS Strategy for product listings
   schema = {
       "name": "Products",
       "baseSelector": ".product",
       "fields": [
           {"name": "name", "selector": ".title", "type": "text"},
           {"name": "price", "selector": ".price", "type": "text"},
       ],
   }
   ```

2. **News Article Extraction**
   ```python
   from pydantic import BaseModel
   from crawl4ai.extraction_strategy import LLMExtractionStrategy

   # LLM Strategy for article content
   class Article(BaseModel):
       title: str
       content: str
       author: str
       date: str

   strategy = LLMExtractionStrategy(
       provider="ollama/llama2",
       # Pydantic v2 name; on Pydantic v1 use Article.schema()
       schema=Article.model_json_schema(),
   )
   ```

3. **Content Analysis**
   ```python
   from crawl4ai.extraction_strategy import CosineStrategy

   # Cosine Strategy for topic analysis
   strategy = CosineStrategy(
       semantic_filter="technology trends",
       top_k=5,
   )
   ```

## Input Formats

Every extraction strategy accepts an `input_format` parameter that controls which representation of the page it processes:

- **markdown** (default): The raw markdown conversion of the HTML content. Best for general text extraction where HTML structure isn't critical.
- **html**: The raw HTML content. Useful when you need to preserve HTML structure or extract data from specific HTML elements.
- **fit_markdown**: The cleaned and filtered markdown content. Best for extracting relevant content while removing noise. Requires a markdown generator with a content filter to be configured.

To specify an input format:

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

strategy = LLMExtractionStrategy(
    input_format="html",  # or "markdown" or "fit_markdown"
    provider="openai/gpt-4",
    instruction="Extract product information",
)
```

Note: When using `"fit_markdown"`, make sure your `CrawlerRunConfig` includes a markdown generator with a content filter:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    extraction_strategy=strategy,
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter()  # Content filter goes here for fit_markdown
    ),
)
```

If `fit_markdown` is requested but not available (no markdown generator or content filter), the system automatically falls back to raw markdown with a warning.

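The fallback behaviour described above can be pictured roughly as follows. This is an illustrative sketch, not Crawl4AI's actual code; `select_input` and its parameters are hypothetical.

```python
import warnings
from typing import Optional


def select_input(input_format: str, markdown: str, html: str,
                 fit_markdown: Optional[str] = None) -> str:
    """Pick the text that would be handed to the extraction strategy."""
    if input_format == "html":
        return html
    if input_format == "fit_markdown":
        if fit_markdown:
            return fit_markdown
        # No filtered markdown was produced, so fall back to raw markdown.
        warnings.warn("fit_markdown unavailable; falling back to raw markdown")
        return markdown
    if input_format == "markdown":
        return markdown
    raise ValueError(f"unsupported input_format: {input_format!r}")


# With no filtered markdown available, the raw markdown is returned.
print(select_input("fit_markdown", markdown="# Title", html="<h1>Title</h1>"))
```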
## Best Practices

1. **Choose the Right Strategy**
   - Start with CSS for structured data
   - Use LLM for complex interpretation
   - Try Cosine for content relevance

2. **Optimize Performance**
   - Cache LLM results
   - Keep CSS selectors specific
   - Tune similarity thresholds

3. **Handle Errors**
   ```python
   import json

   # Assumes crawler is an open AsyncWebCrawler and strategy is already built
   result = await crawler.arun(
       url="https://example.com",
       extraction_strategy=strategy,
   )

   if not result.success:
       print(f"Extraction failed: {result.error_message}")
   else:
       # extracted_content is a JSON string
       data = json.loads(result.extracted_content)
   ```

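As one way to act on the "Cache LLM results" advice, the sketch below memoizes extraction output in memory, keyed by URL and instruction. It is independent of any caching Crawl4AI itself may offer; `cache_key`, `cached_extract`, and `fake_extract` are hypothetical names, and `fake_extract` stands in for a real LLM-backed extraction call.

```python
import hashlib
import json
from typing import Callable

_cache: dict[str, str] = {}


def cache_key(url: str, instruction: str) -> str:
    """Stable key for one (url, instruction) pair."""
    payload = json.dumps([url, instruction])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_extract(url: str, instruction: str,
                   extract: Callable[[str, str], str]) -> str:
    """Call extract(url, instruction) only on a cache miss."""
    key = cache_key(url, instruction)
    if key not in _cache:
        _cache[key] = extract(url, instruction)
    return _cache[key]


calls = []


def fake_extract(url: str, instruction: str) -> str:
    # Stand-in for a real (slow, billable) LLM extraction call.
    calls.append(url)
    return json.dumps({"url": url})


first = cached_extract("https://example.com", "Extract products", fake_extract)
second = cached_extract("https://example.com", "Extract products", fake_extract)
print(len(calls))
```

Including the instruction in the key matters: the same page extracted with a different prompt should not reuse an earlier answer.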
Each strategy has its own strengths and optimal use cases. See the detailed documentation for each strategy, linked from the headings above, for its specific features and configuration options.