Neleac/SpaceTimeGPT

Video-Text-to-Text

vision-encoder-decoder

image-text-to-text

video-captioning

Inference Endpoints

Neleac committed on Jan 21

Commit cf6e956 · verified · 1 Parent(s): fdf9770

Update README.md

Files changed (1)

README.md +1 -8

README.md CHANGED

@@ -41,14 +41,7 @@ SpaceTimeGPT is a video description generation model capable of both spatial and
 Vision Encoder: [timesformer-base-finetuned-k600](https://huggingface.co/facebook/timesformer-base-finetuned-k600) \
 Text Decoder: [gpt2](https://huggingface.co/gpt2)
-The encoder and decoder are initialized using pretrained weights for video classification and sentence completion, respectively. Encoder-decoder cross attention is used to unify the visual and linguistic domains. The model is fine-tuned end-to-end on the video captioning task.
-## Dataset and Evaluation
-SpaceTimeGPT is trained on [VATEX](https://eric-xw.github.io/vatex-website/index.html), a large video captioning dataset.
-Performance: 67.3 [CIDEr](https://github.com/ramavedantam/cider) on the VATEX test split
-Sampling method: 30 $\le$ generated tokens $\le$ 60, beam search with 8 beams
+The encoder and decoder are initialized using pretrained weights for video classification and sentence completion, respectively. Encoder-decoder cross attention is used to unify the visual and linguistic domains. The model is fine-tuned end-to-end on the video captioning task. See [GitHub repository](https://github.com/Neleac/SpaceTimeGPT) for details.
 #### Example Inference Code:
 ```python