SinaMostafanejad committed
Commit: 7665f16
Parent(s): e856122
updates the notebook
accessing_the_data.ipynb  CHANGED  (+19 -14)
@@ -21,8 +21,8 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "
- "from
+ "# import the load_dataset function from the datasets module\n",
+ "from datasets import load_dataset"
  ]
  },
  {
@@ -81,15 +81,15 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "PubChemQC datasets are very large and downloading them
- "be a heavy lift for your internet network and disk storage
- "the `streaming` parameter to `True` to avoid downloading the
- "ensure streaming the data from the hub. In
- "function returns an `IterableDataset` object
- "the data. The `trust_remote_code` argument is also
- "usage of a custom [load\n",
+ "PubChemQC datasets are very large in size and downloading them to your local\n",
+ "machine can be a heavy lift for your internet network and disk storage.\n",
+ "Therefore, we set the `streaming` parameter to `True` to avoid downloading the\n",
+ "dataset on disk and ensure streaming the data from the hub. In the `streaming`\n",
+ "mode, the `load_dataset` function returns an `IterableDataset` object which\n",
+ "allows iterative access to the data. The `trust_remote_code` argument is also\n",
+ "set to `True` to allow the usage of a custom [load\n",
  "script](https://huggingface.co/datasets/molssiai-hub/pubchemqc-b3lyp/blob/main/pubchemqc-b3lyp.py)\n",
- "
+ "which is in charge of preprocessing the data.\n",
  "\n",
  "The PubChemQC-B3LYP dataset is made of several files called `shards` that enable\n",
  "multiprocessing and parallelization of the data loading process. Multiprocessing\n",
@@ -104,10 +104,15 @@
  "\n",
  " >>> dataset = load_dataset(path=path,\n",
  " split=split,\n",
- " streaming=
+ " streaming=False,\n",
  " trust_remote_code=True,\n",
  " num_proc=4)\n",
- "```\n"
+ "```\n",
+ "\n",
+ "Note that we have set the `streaming` parameter to `False` in the code snippet\n",
+ "above. This allows the `load_dataset` function to download the dataset on disk\n",
+ "and load it into memory as a `Dataset` object which can access the data using\n",
+ "fancy indexing and slicing."
  ]
  },
  {
@@ -151,7 +156,7 @@
  ],
  "metadata": {
  "kernelspec": {
- "display_name": "
+ "display_name": "hface",
  "language": "python",
  "name": "python3"
  },
@@ -165,7 +170,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.
+ "version": "3.12.4"
  }
  },
  "nbformat": 4,
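For reference, below is a minimal sketch of the two loading modes described in the updated markdown cells. It is not part of the committed notebook: the dataset path is taken from the load-script URL shown in the diff, while the `train` split name and the variable names are assumptions.

```python
# Minimal sketch (assumed usage, not part of the committed notebook):
# load the PubChemQC-B3LYP dataset in streaming and non-streaming modes.
from datasets import load_dataset

path = "molssiai-hub/pubchemqc-b3lyp"   # from the load-script URL in the diff
split = "train"                         # assumed split name

# streaming=True returns an IterableDataset; records are fetched lazily
# from the Hub instead of being downloaded to disk
stream = load_dataset(path=path,
                      split=split,
                      streaming=True,
                      trust_remote_code=True)
first_record = next(iter(stream))       # iterate; no random access

# streaming=False downloads the shards (optionally in parallel via num_proc)
# and returns a Dataset that supports fancy indexing and slicing
dataset = load_dataset(path=path,
                       split=split,
                       streaming=False,
                       trust_remote_code=True,
                       num_proc=4)
row = dataset[0]
head = dataset[:10]
```

In this sketch `num_proc` is only passed in the non-streaming call: parallel shard preparation matters when the data is actually downloaded, whereas a streamed dataset is read lazily as it is iterated.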