SinaMostafanejad committed
Commit: 7665f16
Parent(s): e856122
updates the notebook
accessing_the_data.ipynb  CHANGED  (+19 -14)
@@ -21,8 +21,8 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "
- "from
+ "# import the load_dataset function from the datasets module\n",
+ "from datasets import load_dataset"
  ]
  },
  {
@@ -81,15 +81,15 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "PubChemQC datasets are very large and downloading them
- "be a heavy lift for your internet network and disk storage
- "the `streaming` parameter to `True` to avoid downloading the
- "ensure streaming the data from the hub. In
- "function returns an `IterableDataset` object
- "the data. The `trust_remote_code` argument is also
- "usage of a custom [load\n",
+ "PubChemQC datasets are very large in size and downloading them to your local\n",
+ "machine can be a heavy lift for your internet network and disk storage.\n",
+ "Therefore, we set the `streaming` parameter to `True` to avoid downloading the\n",
+ "dataset on disk and ensure streaming the data from the hub. In the `streaming`\n",
+ "mode, the `load_dataset` function returns an `IterableDataset` object which\n",
+ "allows iterative access to the data. The `trust_remote_code` argument is also\n",
+ "set to `True` to allow the usage of a custom [load\n",
  "script](https://huggingface.co/datasets/molssiai-hub/pubchemqc-b3lyp/blob/main/pubchemqc-b3lyp.py)\n",
- "
+ "which is in charge of preprocessing the data.\n",
  "\n",
  "The PubChemQC-B3LYP dataset is made of several files called `shards` that enable\n",
  "multiprocessing and parallelization of the data loading process. Multiprocessing\n",
@@ -104,10 +104,15 @@
  "\n",
  " >>> dataset = load_dataset(path=path,\n",
  " split=split,\n",
- " streaming=
+ " streaming=False,\n",
  " trust_remote_code=True,\n",
  " num_proc=4)\n",
- "```\n"
+ "```\n",
+ "\n",
+ "Note that we have set the `streaming` parameter to `False` in the code snippet\n",
+ "above. This allows the `load_dataset` function to download the dataset on disk\n",
+ "and load it into memory as a `Dataset` object which can access the data using\n",
+ "fancy indexing and slicing."
  ]
  },
  {
@@ -151,7 +156,7 @@
  ],
  "metadata": {
  "kernelspec": {
- "display_name": "
+ "display_name": "hface",
  "language": "python",
  "name": "python3"
  },
@@ -165,7 +170,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.
+ "version": "3.12.4"
  }
  },
  "nbformat": 4,
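For reference, below is a minimal sketch of the two loading modes described in the updated markdown cells. It is not part of the committed notebook: the dataset path is taken from the load-script URL shown in the diff, while the `train` split name and the variable names are assumptions.

```python
# Minimal sketch (assumed usage, not part of the committed notebook):
# load the PubChemQC-B3LYP dataset in streaming and non-streaming modes.
from datasets import load_dataset

path = "molssiai-hub/pubchemqc-b3lyp"   # from the load-script URL in the diff
split = "train"                         # assumed split name

# streaming=True returns an IterableDataset; records are fetched lazily
# from the Hub instead of being downloaded to disk
stream = load_dataset(path=path,
                      split=split,
                      streaming=True,
                      trust_remote_code=True)
first_record = next(iter(stream))       # iterate; no random access

# streaming=False downloads the shards (optionally in parallel via num_proc)
# and returns a Dataset that supports fancy indexing and slicing
dataset = load_dataset(path=path,
                       split=split,
                       streaming=False,
                       trust_remote_code=True,
                       num_proc=4)
row = dataset[0]
head = dataset[:10]
```

In this sketch `num_proc` is only passed in the non-streaming call: parallel shard preparation matters when the data is actually downloaded, whereas a streamed dataset is read lazily as it is iterated.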