ahatamiz committed on
Commit a6e0e45 · 1 Parent(s): 58e5c90

Update README.md

Files changed (1)
  1. README.md +292 -0
README.md CHANGED
---
license: other
datasets:
- imagenet-1k
---
[**FasterViT: Fast Vision Transformers with Hierarchical Attention**](https://arxiv.org/abs/2306.06189)

FasterViT achieves a new SOTA Pareto front in terms of accuracy vs. image throughput, without extra training data!

<p align="center">
<img src="https://github.com/NVlabs/FasterViT/assets/26806394/253d1a2e-b5f5-4a9b-a362-6cdd16bfccc1" width=62% height=62%
class="center">
</p>

We introduce a new self-attention mechanism, denoted Hierarchical Attention (HAT), that captures both short- and long-range information by learning cross-window carrier tokens.

![teaser](./fastervit/assets/hierarchial_attn.png)

Note: Please use the [**latest NVIDIA TensorRT release**](https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/index.html) to benefit from the optimized FasterViT ops; a minimal ONNX export sketch is included at the end of the Quick Start section below.

## Quick Start

Pre-trained FasterViT models can be imported with **one line of code**. First, install FasterViT:

```bash
pip install fastervit
```

A pre-trained FasterViT model with default hyper-parameters can then be created as follows:

```python
from fastervit import create_model

# Define a FasterViT-0 model for 224 x 224 resolution
model = create_model('faster_vit_0_224',
                     pretrained=True,
                     model_path="/tmp/faster_vit_0.pth.tar")
```

`model_path` specifies where the pre-trained checkpoint is downloaded and stored.

We can test the model by passing a dummy input image; the output is the logits:

```python
import torch

image = torch.rand(1, 3, 224, 224)
output = model(image)  # torch.Size([1, 1000])
```
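
Continuing from the snippet above, the logits can be converted into probabilities and top-5 predictions if needed. A minimal sketch (class indices follow the ImageNet-1K label ordering):

```python
import torch

probs = torch.softmax(output, dim=-1)        # (1, 1000) class probabilities
top5_prob, top5_idx = probs.topk(5, dim=-1)  # top-5 scores and class indices
print(top5_idx[0].tolist(), top5_prob[0].tolist())
```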

We can also use the any-resolution FasterViT model to accommodate arbitrary image resolutions. In the following, we define an any-resolution FasterViT-0 model with an input resolution of 576 x 960, window sizes of 12 and 6 in the 3rd and 4th stages, a carrier token size of 2, and an embedding dimension of 64:

```python
from fastervit import create_model

# Define an any-resolution FasterViT-0 model for 576 x 960 inputs
model = create_model('faster_vit_0_any_res',
                     resolution=[576, 960],
                     window_size=[7, 7, 12, 6],
                     ct_size=2,
                     dim=64,
                     pretrained=True)
```

Note that the above model is initialized from the original ImageNet pre-trained FasterViT with its original 224 x 224 resolution. As a result, missing keys and shape mismatches are expected, since new layers (e.g. additional carrier tokens) are added.

We can again test the model by passing a dummy input image; the output is the logits:

```python
import torch

image = torch.rand(1, 3, 576, 960)
output = model(image)  # torch.Size([1, 1000])
```
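
To take advantage of the TensorRT-optimized ops mentioned above, a common deployment path is to first export the model to ONNX. The sketch below is illustrative only: the output filename, opset version, and dynamic-batch axis names are assumptions, not project defaults.

```python
import torch
from fastervit import create_model

# Export a pre-trained FasterViT-0 model to ONNX as a first step toward a TensorRT engine
model = create_model('faster_vit_0_224', pretrained=True).eval()
dummy = torch.rand(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "faster_vit_0_224.onnx",             # illustrative output path
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,                     # assumed; pick an opset your TensorRT version supports
)
# The resulting .onnx file can then be converted with, e.g., trtexec from the TensorRT release.
```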

---

## Results + Pretrained Models

### ImageNet-1K
**FasterViT ImageNet-1K Pretrained Models**

| Name | Acc@1 (%) | Acc@5 (%) | Throughput (img/sec) | Resolution | #Params (M) | FLOPs (G) | Download |
|------|-----------|-----------|----------------------|------------|-------------|-----------|----------|
| FasterViT-0 | 82.1 | 95.9 | 5802 | 224x224 | 31.4 | 3.3 | [model](https://drive.google.com/uc?export=download&id=1twI2LFJs391Yrj8MR4Ui9PfrvWqjE1iB) |
| FasterViT-1 | 83.2 | 96.5 | 4188 | 224x224 | 53.4 | 5.3 | [model](https://drive.google.com/uc?export=download&id=1r7W10n5-bFtM3sz4bmaLrowN2gYPkLGT) |
| FasterViT-2 | 84.2 | 96.8 | 3161 | 224x224 | 75.9 | 8.7 | [model](https://drive.google.com/uc?export=download&id=1n_a6s0pgi0jVZOGmDei2vXHU5E6RH5wU) |
| FasterViT-3 | 84.9 | 97.2 | 1780 | 224x224 | 159.5 | 18.2 | [model](https://drive.google.com/uc?export=download&id=1tvWElZ91Sia2SsXYXFMNYQwfipCxtI7X) |
| FasterViT-4 | 85.4 | 97.3 | 849 | 224x224 | 424.6 | 36.6 | [model](https://drive.google.com/uc?export=download&id=1gYhXA32Q-_9C5DXel17avV_ZLoaHwdgz) |
| FasterViT-5 | 85.6 | 97.4 | 449 | 224x224 | 975.5 | 113.0 | [model](https://drive.google.com/uc?export=download&id=1mqpai7XiHLr_n1tjxjzT8q369xTCq_z-) |
| FasterViT-6 | 85.8 | 97.4 | 352 | 224x224 | 1360.0 | 142.0 | [model](https://drive.google.com/uc?export=download&id=12jtavR2QxmMzcKwPzWe7kw-oy34IYi59) |
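
The throughput figures above depend on hardware and batch size. A minimal sketch of how end-to-end image throughput can be measured in PyTorch (the batch size, iteration counts, and choice of FasterViT-0 here are illustrative assumptions, not the benchmark settings used for the table):

```python
import time
import torch
from fastervit import create_model

model = create_model('faster_vit_0_224', pretrained=True).cuda().eval()
batch = torch.rand(64, 3, 224, 224, device="cuda")  # illustrative batch size

with torch.no_grad():
    for _ in range(10):                  # warm-up iterations
        model(batch)
    torch.cuda.synchronize()
    start = time.time()
    iters = 50
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"{iters * batch.shape[0] / elapsed:.1f} images/sec")
```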


### Robustness (ImageNet-A, ImageNet-R, ImageNet-V2)

All models use `crop_pct=0.875`. Results are obtained by running inference on ImageNet-1K pretrained models without finetuning.

| Name | A-Acc@1 (%) | A-Acc@5 (%) | R-Acc@1 (%) | R-Acc@5 (%) | V2-Acc@1 (%) | V2-Acc@5 (%) |
|------|-------------|-------------|-------------|-------------|--------------|--------------|
| FasterViT-0 | 23.9 | 57.6 | 45.9 | 60.4 | 70.9 | 90.0 |
| FasterViT-1 | 31.2 | 63.3 | 47.5 | 61.9 | 72.6 | 91.0 |
| FasterViT-2 | 38.2 | 68.9 | 49.6 | 63.4 | 73.7 | 91.6 |
| FasterViT-3 | 44.2 | 73.0 | 51.9 | 65.6 | 75.0 | 92.2 |
| FasterViT-4 | 49.0 | 75.4 | 56.0 | 69.6 | 75.7 | 92.7 |
| FasterViT-5 | 52.7 | 77.6 | 56.9 | 70.0 | 76.0 | 93.0 |
| FasterViT-6 | 53.7 | 78.4 | 57.1 | 70.1 | 76.1 | 93.0 |

A, R and V2 denote ImageNet-A, ImageNet-R and ImageNet-V2, respectively.
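
For reference, `crop_pct=0.875` at a 224 x 224 evaluation resolution corresponds to resizing the shorter image side to 256 (224 / 0.875) and taking a 224 center crop. A minimal torchvision preprocessing sketch, assuming standard ImageNet normalization statistics (not stated in this card) and default interpolation:

```python
from torchvision import transforms

# Evaluation transform implied by crop_pct=0.875 at 224x224:
# resize shorter side to 224 / 0.875 = 256, then center-crop 224.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    # Standard ImageNet statistics (assumed here)
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])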

## Citation

Please consider citing FasterViT if this repository is useful for your work.

```bibtex
@article{hatamizadeh2023fastervit,
  title={FasterViT: Fast Vision Transformers with Hierarchical Attention},
  author={Hatamizadeh, Ali and Heinrich, Greg and Yin, Hongxu and Tao, Andrew and Alvarez, Jose M and Kautz, Jan and Molchanov, Pavlo},
  journal={arXiv preprint arXiv:2306.06189},
  year={2023}
}
```


## Licenses

Copyright © 2023, NVIDIA Corporation. All rights reserved.

This work is made available under the NVIDIA Source Code License-NC. Click [here](LICENSE) to view a copy of this license.

For license information regarding the timm repository, please refer to its [repository](https://github.com/rwightman/pytorch-image-models).

For license information regarding the ImageNet dataset, please see the [ImageNet official website](https://www.image-net.org/).

## Acknowledgement
This repository is built on top of the [timm](https://github.com/huggingface/pytorch-image-models) repository. We thank [Ross Wightman](https://rwightman.com/) for creating and maintaining this high-quality library.