Spaces: LLM360/TxT360


victormiller committed on Oct 4, 2024

Commit ad46307 (verified) · 1 Parent(s): 4d5ad99

Update web.py


Files changed (1)

web.py +13 -13


@@ -499,7 +499,7 @@ def web_data():
         P(B('"Word "Javascript"'), """
-        In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
         pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
         strict, which will filter out many lines that are really talking about “Javascript”.
         """),
@@ -526,7 +526,7 @@ def web_data():
             """,
         ),
         P(B("Other Rules from RefinedWeb: "), """
-        We also adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:
         """),
         Ul(
             Li("The line is only composed of uppercase characters,", style = "margin-bottom: 5px"),
@@ -597,19 +597,19 @@ def web_data():
             """,
         ),
         P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
-        Most quality signals were initially introduced by Gopher [2] and subsequently adopted by later
-        studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
         of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
         outcomes for the same quality signals.
-        In our pipeline, we referenced earlier implementations that were publicly available such as Dolma [6], DataTrove [4],
-        and RedPajama V2 [7], and selected the most suitable method based on manual inspections.
         """),
         P(B("Repetition-based Heuristics: "), """
         Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
-        work ([2], [3], [6]), we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
         """),
         P(B("Fraction of Characters in Repeated Lines: "), """
-        Following Gopher [2], we remove documents containing mupltiple, short duplicate passages, as well as those with few,
         but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
         that are duplicates, and the fraction of characters contained within those duplicated passages.
         """),
@@ -748,7 +748,7 @@ def web_data():
             """,
         ),
         P(B("Fraction of Characters in the Most Common N-grams (n=2,3,4): "), """
-        Following Gopher [2], we remove documents with a high portion of n-grams. For each n ∈ (2, 3, 4), we calculate the
         fraction of characters contained within the most frequently-occurring n-gram.
         """),
         Details(
@@ -911,7 +911,7 @@ def web_data():
             """,
         ),
         P(B("Fraction of Characters in Duplicated N-grams (n=5,...,10): "), """
-        Following Gopher [2], we remove documents with a high portion of n-grams. For each n ∈ (5, ..., 10), we calculate the
         fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
         overlapping n-grams more than once.
         """),
@@ -1141,8 +1141,8 @@ def web_data():
         ),
         P(B("Line-wise Heuristics: "), """
         Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
-        RefinedWeb [3], we remove the document if the corrected lines represent more than 5% of words. In line with previous
-        works ([2], [3], [6]), we remove the documents if more than 30% of the lines end with an ellipsis or more than
         90% of lines start with a bullet point.
         """),
         Details(
@@ -1247,7 +1247,7 @@ def web_data():
         ),
         P(B("Statistics-based Heuristics: "), """
-        We summarize other statistics-based rules originated from Gopher [7] in this section. The statistics can be used include:
         """),
         Ul(
             Li("the word count in the document", style = "margin-bottom: 5px"),

         P(B('"Word "Javascript"'), """
+        In C4,""", D_cite(bibtex_key="c4"), """the authors remove any line with the word "Javascript" since they found that many of the scraped
         pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
         strict, which will filter out many lines that are really talking about “Javascript”.
         """),
             """,
         ),
         P(B("Other Rules from RefinedWeb: "), """
+        We also adopt rules from RefinedWeb """, D_cite(bibtex_key="refinedweb"), """ to remove lines if they satisfy any of the following criteria:
         """),
         Ul(
             Li("The line is only composed of uppercase characters,", style = "margin-bottom: 5px"),
             """,
         ),
         P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
+        Most quality signals were initially introduced by Gopher """, D_cite(bibtex_key="gopher"), """ and subsequently adopted by later
+        studies """, D_cite(bibtex_key="refinedweb"),D_cite(bibtex_key="dolma"),D_cite(bibtex_key="fineweb"), """([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
         of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
         outcomes for the same quality signals.
+        In our pipeline, we referenced earlier implementations that were publicly available such as Dolma,""", D_cite(bibtex_key="dolma"), """ DataTrove, """, D_cite(bibtex_key="penedo2024datatrove"), """
+        and RedPajama V2, """, D_cite(bibtex_key="redpajama-v2"), """ and selected the most suitable method based on manual inspections.
         """),
         P(B("Repetition-based Heuristics: "), """
         Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
+        work, """, D_cite(bibtex_key="gopher"), D_cite(bibtex_key="refinedweb"), D_cite(bibtex_key="dolma"), """ we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
         """),
         P(B("Fraction of Characters in Repeated Lines: "), """
+        Following Gopher,""", D_cite(bibtex_key="gopher"), """ we remove documents containing mupltiple, short duplicate passages, as well as those with few,
         but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
         that are duplicates, and the fraction of characters contained within those duplicated passages.
         """),
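
Reviewer note: the "Fraction of Characters in Repeated Lines" paragraph describes two quantities, the fraction of passages that are duplicates and the fraction of characters inside them. The sketch below is one possible reading, not the TxT360 implementation. The function name `repeated_line_fractions` is hypothetical, lines are used as the "passages", and only the second and later occurrences of a line are counted as duplicates. As the surrounding text notes, other pipelines make different choices here.

```python
def repeated_line_fractions(text: str) -> tuple[float, float]:
    """Return (fraction of duplicate lines, fraction of characters in duplicate lines).

    Assumption: a line counts as a duplicate only from its second occurrence on,
    and empty lines are ignored.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0, 0.0

    seen = set()
    dup_lines = 0
    dup_chars = 0
    for ln in lines:
        if ln in seen:
            # Second or later occurrence: count it as a duplicate.
            dup_lines += 1
            dup_chars += len(ln)
        else:
            seen.add(ln)

    total_chars = sum(len(ln) for ln in lines)
    return dup_lines / len(lines), dup_chars / total_chars


if __name__ == "__main__":
    sample = "a\nbb\na\nccc\na"
    print(repeated_line_fractions(sample))
```

A document would then be dropped when either fraction exceeds a chosen threshold. The thresholds themselves are not shown in this diff, so none are assumed here.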
             """,
         ),
         P(B("Fraction of Characters in the Most Common N-grams (n=2,3,4): "), """
+        Following Gopher,""", D_cite(bibtex_key="gopher"), """  we remove documents with a high portion of n-grams. For each n ∈ (2, 3, 4), we calculate the
         fraction of characters contained within the most frequently-occurring n-gram.
         """),
         Details(
             """,
         ),
         P(B("Fraction of Characters in Duplicated N-grams (n=5,...,10): "), """
+        Following Gopher, we remove documents with a high portion of n-grams. For each n ∈ (5, ..., 10), we calculate the
         fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
         overlapping n-grams more than once.
         """),
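
Reviewer note: the "Duplicated N-grams" paragraph asks that characters in overlapping n-grams not be counted more than once. A minimal sketch of that idea follows, assuming whitespace tokenization and counting characters per word (whitespace excluded). The name `duplicate_ngram_char_fraction` is hypothetical, and this version marks every occurrence of a repeated n-gram, including the first. It is an illustration, not the code used in web.py's pipeline.

```python
from collections import Counter


def duplicate_ngram_char_fraction(text: str, n: int) -> float:
    """Fraction of word characters covered by any n-gram that occurs more than once."""
    words = text.split()
    if len(words) < n:
        return 0.0

    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)

    # Mark each word position covered by a repeated n-gram. Using a boolean
    # mask means a word shared by overlapping n-grams is counted only once.
    covered = [False] * len(words)
    for i, gram in enumerate(ngrams):
        if counts[gram] > 1:
            for j in range(i, i + n):
                covered[j] = True

    total_chars = sum(len(w) for w in words)
    dup_chars = sum(len(w) for w, is_dup in zip(words, covered) if is_dup)
    return dup_chars / total_chars


if __name__ == "__main__":
    for n in range(5, 11):
        print(n, duplicate_ngram_char_fraction("a b c d e a b c d e", n))
```

The boolean mask is the design choice that matters: summing the lengths of repeated n-grams directly would count shared words several times and could push the fraction above 1.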
         ),
         P(B("Line-wise Heuristics: "), """
         Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
+        RefinedWeb, we remove the document if the corrected lines represent more than 5% of words. In line with previous
+        works, we remove the documents if more than 30% of the lines end with an ellipsis or more than
         90% of lines start with a bullet point.
         """),
         Details(
         ),
         P(B("Statistics-based Heuristics: "), """
+        We summarize other statistics-based rules originated from Gopher in this section. The statistics can be used include:
         """),
         Ul(
             Li("the word count in the document", style = "margin-bottom: 5px"),
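
Reviewer note: the "Line-wise Heuristics" hunk above states two thresholds, more than 30% of lines ending with an ellipsis and more than 90% of lines starting with a bullet point. The sketch below shows how such a check could look. The function name, the set of bullet characters, and the treatment of empty lines are assumptions, since the diff does not specify them.

```python
# Assumed marker sets; the diff does not list which characters count.
BULLET_PREFIXES = ("-", "*", "•", "‣", "●")
ELLIPSIS_SUFFIXES = ("...", "…")


def fails_linewise_heuristics(
    text: str,
    max_ellipsis_frac: float = 0.30,
    max_bullet_frac: float = 0.90,
) -> bool:
    """Return True if the document should be removed under the two line-wise rules."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False

    # str.endswith / str.startswith accept a tuple of alternatives.
    ellipsis_frac = sum(ln.endswith(ELLIPSIS_SUFFIXES) for ln in lines) / len(lines)
    bullet_frac = sum(ln.startswith(BULLET_PREFIXES) for ln in lines) / len(lines)

    return ellipsis_frac > max_ellipsis_frac or bullet_frac > max_bullet_frac


if __name__ == "__main__":
    print(fails_linewise_heuristics("- a\n- b\n- c"))
    print(fails_linewise_heuristics("Hello world.\nMore text."))
```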