Update web.py
web.py
CHANGED
@@ -285,7 +285,7 @@ def web_data():
 H2("Stage 1: Document Preparation"),
 
 
-P(B("Text Extraction: ")
+P(B("Text Extraction: "), """
 Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
 WARC files contain the raw data from the crawl, which store the full HTTP response and request metadata.
 WET files contain plaintexts extracted by Common Crawl. In line with previous works ([1], [2], [3], [4]),