SinaMostafanejad committed
Commit 7665f16 · 1 Parent(s): e856122

updates the notebook

Files changed (1):
  1. accessing_the_data.ipynb +19 -14

accessing_the_data.ipynb CHANGED
@@ -21,8 +21,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from datasets import load_dataset # Loading datasets from Hugging Face Hub\n",
-    "from pprint import pprint # Pretty print"
+    "# import the load_dataset function from the datasets module\n",
+    "from datasets import load_dataset"
    ]
   },
   {
@@ -81,15 +81,15 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "PubChemQC datasets are very large and downloading them on your local machine can\n",
-    "be a heavy lift for your internet network and disk storage. Therefore, we set\n",
-    "the `streaming` parameter to `True` to avoid downloading the dataset on disk and\n",
-    "ensure streaming the data from the hub. In this mode, the `load_dataset`\n",
-    "function returns an `IterableDataset` object that can be iterated over to access\n",
-    "the data. The `trust_remote_code` argument is also set to `True` to allow the\n",
-    "usage of a custom [load\n",
+    "PubChemQC datasets are very large in size and downloading them to your local\n",
+    "machine can be a heavy lift for your internet network and disk storage.\n",
+    "Therefore, we set the `streaming` parameter to `True` to avoid downloading the\n",
+    "dataset on disk and ensure streaming the data from the hub. In the `streaming`\n",
+    "mode, the `load_dataset` function returns an `IterableDataset` object which\n",
+    "allows iterative access to the data. The `trust_remote_code` argument is also\n",
+    "set to `True` to allow the usage of a custom [load\n",
    "script](https://huggingface.co/datasets/molssiai-hub/pubchemqc-b3lyp/blob/main/pubchemqc-b3lyp.py)\n",
-    "for the data.\n",
+    "which is in charge of preprocessing the data.\n",
    "\n",
    "The PubChemQC-B3LYP dataset is made of several files called `shards` that enable\n",
    "multiprocessing and parallelization of the data loading process. Multiprocessing\n",
@@ -104,10 +104,15 @@
    "\n",
    "    >>> dataset = load_dataset(path=path,\n",
    "                               split=split,\n",
-    "                               streaming=True,\n",
+    "                               streaming=False,\n",
    "                               trust_remote_code=True,\n",
    "                               num_proc=4)\n",
-    "```\n"
+    "```\n",
+    "\n",
+    "Note that we have set the `streaming` parameter to `False` in the code snippet\n",
+    "above. This allows the `load_dataset` function to download the dataset on disk\n",
+    "and load it into memory as a `Dataset` object which can access the data using\n",
+    "fancy indexing and slicing."
    ]
   },
   {
@@ -151,7 +156,7 @@
   ],
   "metadata": {
    "kernelspec": {
-    "display_name": "hugface",
+    "display_name": "hface",
    "language": "python",
    "name": "python3"
   },
@@ -165,7 +170,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-    "version": "3.10.13"
+    "version": "3.12.4"
   }
  },
 "nbformat": 4,
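
The streaming-vs-download distinction this commit documents can be sketched without the `datasets` library. The snippet below is a stdlib-only mock, not the Hugging Face API: `SHARDS`, `stream_records`, and `load_records` are hypothetical stand-ins for the dataset shards, the `IterableDataset` returned when `streaming=True`, and the indexable `Dataset` returned when `streaming=False`.

```python
from itertools import islice

# Hypothetical shard contents standing in for PubChemQC-B3LYP records.
SHARDS = [
    [{"cid": 1, "energy": -76.4}, {"cid": 2, "energy": -113.3}],
    [{"cid": 3, "energy": -40.5}],
]

def stream_records(shards):
    """streaming=True analogue: yield records one at a time from each
    shard, never materializing the whole dataset in memory."""
    for shard in shards:
        for record in shard:
            yield record

def load_records(shards):
    """streaming=False analogue: read every shard up front and return a
    list, which supports fancy indexing and slicing."""
    return [record for shard in shards for record in shard]

# Iterable-style access: consume records lazily, e.g. just the first two.
first_two = list(islice(stream_records(SHARDS), 2))

# Indexed-style access: the fully loaded dataset can be sliced directly.
dataset = load_records(SHARDS)
last = dataset[-1]
```

The trade-off mirrors the notebook's text: the generator avoids holding all shards in memory (or on disk), while the eager list pays that cost once in exchange for random access.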