Update web.py
web.py
CHANGED
|
@@ -499,7 +499,7 @@ def web_data():
|
|
| 499 |
|
| 500 |
|
| 501 |
P(B('"Word "Javascript"'), """
|
| 502 |
-
In C4
|
| 503 |
pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
|
| 504 |
strict, which will filter out many lines that are really talking about “Javascript”.
|
| 505 |
"""),
|
|
@@ -526,7 +526,7 @@ def web_data():
|
|
| 526 |
""",
|
| 527 |
),
|
| 528 |
P(B("Other Rules from RefinedWeb: "), """
|
| 529 |
-
We also adopt rules from RefinedWeb
|
| 530 |
"""),
|
| 531 |
Ul(
|
| 532 |
Li("The line is only composed of uppercase characters,", style = "margin-bottom: 5px"),
|
|
@@ -597,19 +597,19 @@ def web_data():
|
|
| 597 |
""",
|
| 598 |
),
|
| 599 |
P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
|
| 600 |
-
Most quality signals were initially introduced by Gopher
|
| 601 |
-
studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
|
| 602 |
of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
|
| 603 |
outcomes for the same quality signals.
|
| 604 |
-
In our pipeline, we referenced earlier implementations that were publicly available such as Dolma
|
| 605 |
-
and RedPajama V2
|
| 606 |
"""),
|
| 607 |
P(B("Repetition-based Heuristics: "), """
|
| 608 |
Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
|
| 609 |
-
work (
|
| 610 |
"""),
|
| 611 |
P(B("Fraction of Characters in Repeated Lines: "), """
|
| 612 |
-
Following Gopher
|
| 613 |
but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
|
| 614 |
that are duplicates, and the fraction of characters contained within those duplicated passages.
|
| 615 |
"""),
|
|
@@ -748,7 +748,7 @@ def web_data():
|
|
| 748 |
""",
|
| 749 |
),
|
| 750 |
P(B("Fraction of Characters in the Most Common N-grams (n=2,3,4): "), """
|
| 751 |
-
Following Gopher
|
| 752 |
fraction of characters contained within the most frequently-occurring n-gram.
|
| 753 |
"""),
|
| 754 |
Details(
|
|
@@ -911,7 +911,7 @@ def web_data():
|
|
| 911 |
""",
|
| 912 |
),
|
| 913 |
P(B("Fraction of Characters in Duplicated N-grams (n=5,...,10): "), """
|
| 914 |
-
Following Gopher
|
| 915 |
fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
|
| 916 |
overlapping n-grams more than once.
|
| 917 |
"""),
|
|
@@ -1141,8 +1141,8 @@ def web_data():
|
|
| 1141 |
),
|
| 1142 |
P(B("Line-wise Heuristics: "), """
|
| 1143 |
Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
|
| 1144 |
-
RefinedWeb
|
| 1145 |
-
works
|
| 1146 |
90% of lines start with a bullet point.
|
| 1147 |
"""),
|
| 1148 |
Details(
|
|
@@ -1247,7 +1247,7 @@ def web_data():
|
|
| 1247 |
),
|
| 1248 |
|
| 1249 |
P(B("Statistics-based Heuristics: "), """
|
| 1250 |
-
We summarize other statistics-based rules originated from Gopher
|
| 1251 |
"""),
|
| 1252 |
Ul(
|
| 1253 |
Li("the word count in the document", style = "margin-bottom: 5px"),
|
|
|
|
| 499 |
|
| 500 |
|
| 501 |
P(B('"Word "Javascript"'), """
|
| 502 |
+
In C4,""", D_cite(bibtex_key="c4"), """the authors remove any line with the word "Javascript" since they found that many of the scraped
|
| 503 |
pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
|
| 504 |
strict, which will filter out many lines that are really talking about “Javascript”.
|
| 505 |
"""),
|
|
|
|
| 526 |
""",
|
| 527 |
),
|
| 528 |
P(B("Other Rules from RefinedWeb: "), """
|
| 529 |
+
We also adopt rules from RefinedWeb """, D_cite(bibtex_key="refinedweb"), """ to remove lines if they satisfy any of the following criteria:
|
| 530 |
"""),
|
| 531 |
Ul(
|
| 532 |
Li("The line is only composed of uppercase characters,", style = "margin-bottom: 5px"),
|
|
|
|
| 597 |
""",
|
| 598 |
),
|
| 599 |
P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
|
| 600 |
+
Most quality signals were initially introduced by Gopher """, D_cite(bibtex_key="gopher"), """ and subsequently adopted by later
|
| 601 |
+
studies """, D_cite(bibtex_key="refinedweb"),D_cite(bibtex_key="dolma"),D_cite(bibtex_key="fineweb"), """([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
|
| 602 |
of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
|
| 603 |
outcomes for the same quality signals.
|
| 604 |
+
In our pipeline, we referenced earlier implementations that were publicly available such as Dolma,""", D_cite(bibtex_key="dolma"), """ DataTrove, """, D_cite(bibtex_key="penedo2024datatrove"), """
|
| 605 |
+
and RedPajama V2, """, D_cite(bibtex_key="redpajama-v2"), """ and selected the most suitable method based on manual inspections.
|
| 606 |
"""),
|
| 607 |
P(B("Repetition-based Heuristics: "), """
|
| 608 |
Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
|
| 609 |
+
work, """, D_cite(bibtex_key="gopher"), D_cite(bibtex_key="refinedweb"), D_cite(bibtex_key="dolma"), """ we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
|
| 610 |
"""),
|
| 611 |
P(B("Fraction of Characters in Repeated Lines: "), """
|
| 612 |
+
Following Gopher,""", D_cite(bibtex_key="gopher"), """ we remove documents containing multiple, short duplicate passages, as well as those with few,
|
| 613 |
but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
|
| 614 |
that are duplicates, and the fraction of characters contained within those duplicated passages.
|
| 615 |
"""),
|
|
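Review note: the two quantities described in the "Fraction of Characters in Repeated Lines" paragraph can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code. The function name `duplicate_line_fractions`, the choice to treat a "passage" as a stripped non-empty line, and the convention that only repeat occurrences (not the first one) count as duplicates are all assumptions.

```python
from typing import Tuple


def duplicate_line_fractions(text: str) -> Tuple[float, float]:
    """Return (fraction of lines that are duplicates,
    fraction of characters contained in those duplicate lines).

    Assumption: a line counts as a duplicate only from its second
    occurrence onward; other pipelines may count every occurrence.
    """
    # Assumption: passages are non-empty lines with surrounding whitespace removed.
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    if not lines:
        return 0.0, 0.0

    seen = set()
    dup_lines = 0
    dup_chars = 0
    for ln in lines:
        if ln in seen:
            dup_lines += 1
            dup_chars += len(ln)
        else:
            seen.add(ln)

    total_chars = sum(len(ln) for ln in lines)
    return dup_lines / len(lines), dup_chars / total_chars
```

A document would then be dropped when either fraction exceeds a chosen threshold; the thresholds themselves are not stated in this hunk.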
|
|
| 748 |
""",
|
| 749 |
),
|
| 750 |
P(B("Fraction of Characters in the Most Common N-grams (n=2,3,4): "), """
|
| 751 |
+
Following Gopher,""", D_cite(bibtex_key="gopher"), """ we remove documents with a high proportion of repeated n-grams. For each n ∈ {2, 3, 4}, we calculate the
|
| 752 |
fraction of characters contained within the most frequently-occurring n-gram.
|
| 753 |
"""),
|
| 754 |
Details(
|
|
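Review note: a rough sketch of the "most common n-gram" signal described in this hunk, assuming whitespace tokenization and counting characters of words only (no spaces). The name `top_ngram_char_fraction` is hypothetical, and the referenced implementations (Dolma, DataTrove, RedPajama V2) may tokenize or count differently.

```python
from collections import Counter


def top_ngram_char_fraction(text: str, n: int) -> float:
    """Fraction of characters contained in the most frequent word n-gram.

    Assumptions: words come from str.split(), characters exclude spaces,
    and every occurrence of the top n-gram is counted independently, so
    overlapping occurrences can push the value above 1.0.
    """
    words = text.split()
    if len(words) < n:
        return 0.0

    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    top_ngram, count = Counter(ngrams).most_common(1)[0]

    ngram_chars = sum(len(w) for w in top_ngram) * count
    total_chars = sum(len(w) for w in words)
    return ngram_chars / total_chars
```

The signal would be computed once per n in {2, 3, 4} and each value compared against its own threshold.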
|
|
| 911 |
""",
|
| 912 |
),
|
| 913 |
P(B("Fraction of Characters in Duplicated N-grams (n=5,...,10): "), """
|
| 914 |
+
Following Gopher, we remove documents with a high proportion of duplicated n-grams. For each n ∈ {5, ..., 10}, we calculate the
|
| 915 |
fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
|
| 916 |
overlapping n-grams more than once.
|
| 917 |
"""),
|
|
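Review note: the "taking care not to count characters that occur in overlapping n-grams more than once" requirement can be met by marking word positions instead of summing per n-gram. The sketch below assumes whitespace tokenization and a hypothetical function name `duplicate_ngram_char_fraction`; it illustrates the idea and is not the code used in the pipeline.

```python
from collections import Counter


def duplicate_ngram_char_fraction(text: str, n: int) -> float:
    """Fraction of characters covered by any word n-gram that occurs
    more than once, with each word position counted at most once."""
    words = text.split()
    if len(words) < n:
        return 0.0

    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)

    # covered[j] is True when word j belongs to at least one duplicated n-gram.
    covered = [False] * len(words)
    for i, ngram in enumerate(ngrams):
        if counts[ngram] > 1:
            for j in range(i, i + n):
                covered[j] = True

    dup_chars = sum(len(w) for w, is_dup in zip(words, covered) if is_dup)
    total_chars = sum(len(w) for w in words)
    return dup_chars / total_chars
```

Because positions are marked in a boolean mask, the result always stays within [0, 1], unlike a naive sum over overlapping n-grams.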
|
|
| 1141 |
),
|
| 1142 |
P(B("Line-wise Heuristics: "), """
|
| 1143 |
Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
|
| 1144 |
+
RefinedWeb, we remove the document if the corrected lines represent more than 5% of words. In line with previous
|
| 1145 |
+
works, we remove the documents if more than 30% of the lines end with an ellipsis or more than
|
| 1146 |
90% of lines start with a bullet point.
|
| 1147 |
"""),
|
| 1148 |
Details(
|
|
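Review note: the ellipsis and bullet-point rules stated in the "Line-wise Heuristics" paragraph could look roughly like this. The 30% and 90% thresholds come from the text; the set of bullet characters, the ellipsis spellings, and the function names are assumptions made for illustration. The RefinedWeb "corrected lines" rule is not covered here.

```python
from typing import List

# Assumption: which leading characters count as a bullet point.
BULLET_PREFIXES = ("•", "-", "*")
# Assumption: both the three-dot and the single-character ellipsis count.
ELLIPSIS_SUFFIXES = ("...", "…")


def _non_empty_lines(text: str) -> List[str]:
    return [ln.strip() for ln in text.split("\n") if ln.strip()]


def fails_line_wise_checks(
    text: str,
    max_ellipsis_fraction: float = 0.3,
    max_bullet_fraction: float = 0.9,
) -> bool:
    """Return True if the document should be removed under either rule."""
    lines = _non_empty_lines(text)
    if not lines:
        return False

    ellipsis_fraction = sum(ln.endswith(ELLIPSIS_SUFFIXES) for ln in lines) / len(lines)
    bullet_fraction = sum(ln.startswith(BULLET_PREFIXES) for ln in lines) / len(lines)

    return (
        ellipsis_fraction > max_ellipsis_fraction
        or bullet_fraction > max_bullet_fraction
    )
```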
|
|
| 1247 |
),
|
| 1248 |
|
| 1249 |
P(B("Statistics-based Heuristics: "), """
|
| 1250 |
+
We summarize other statistics-based rules originating from Gopher in this section. The statistics that can be used include:
|
| 1251 |
"""),
|
| 1252 |
Ul(
|
| 1253 |
Li("the word count in the document", style = "margin-bottom: 5px"),
|