|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- OpenFace-CQUPT/FaceCaption-15M |
|
language: |
|
- zh |
|
- en |
|
metrics: |
|
- accuracy |
|
pipeline_tag: image-to-text |
|
--- |
|
# Demonstration of Cross-modal Retrieval (FLIP-based model) |
|
|
|
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/TGxEwHBbWZIbW67kG9jMH.mp4"></video> |
|
|
|
# FLIP (Facial Language Image Pretraining) |
|
|
|
This repository is the official implementation of [FaceCaption-15M](https://arxiv.org/abs/2407.08515).
|
|
|
# Updates
|
|
|
**[24/07/20] The FLIP usage demo has been released! [OpenFace-CQUPT/FLIP-demo](https://huggingface.co/OpenFace-CQUPT/FLIP/tree/main/FLIP-demo)**



**[24/07/17] The FLIP model has been released! [OpenFace-CQUPT/FLIP](https://huggingface.co/OpenFace-CQUPT/FLIP)**
|
|
|
**Overview of the FLIP architecture.**
|
|
|
 |
|
|
|
**Fig.1: (a) Identical colors indicate shared parameters; "12x" denotes 12 stacked transformer layers. (b), (c), and (d) show the FLIP-based model applied to text-image retrieval, facial attribute prediction, and sketch-less facial image retrieval, respectively.**
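For readers implementing the dual-encoder design in Fig.1(a): CLIP-style models of this kind are typically trained with a symmetric image-text contrastive (InfoNCE) objective. The sketch below is illustrative only, not the exact FLIP training code; the tensor shapes and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    NOTE: illustrative sketch; the actual FLIP objective may differ.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; contrast in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```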
|
|
|
## Training |
|
|
|
Coming soon... (The training code is only meaningful once the dataset has been fully published.)
|
|
|
```shell |
|
python pretrain.py > log.log |
|
``` |
|
|
|
## Pre-trained Models |
|
|
|
We provide pretrained model weights:

FLIP Base: download [here](https://huggingface.co/OpenFace-CQUPT/Facial-language-image-pretraining-model/tree/main/ckpt)

FLIP Large: coming soon
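Until the training code lands, the weights can be fetched directly from the Hub. Below is a minimal sketch using `huggingface_hub`; the repo id comes from the FLIP Base link above, while the checkpoint filename is a placeholder to replace with the actual file listed under `ckpt/`.

```python
from huggingface_hub import hf_hub_download
import torch

# Repo id from the "FLIP Base" link above; the filename is a placeholder --
# replace it with the actual checkpoint name listed under ckpt/.
ckpt_path = hf_hub_download(
    repo_id="OpenFace-CQUPT/Facial-language-image-pretraining-model",
    filename="ckpt/flip_base.pth",  # placeholder name, not verified
)
state_dict = torch.load(ckpt_path, map_location="cpu")
```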
|
|
|
## Datasets |
|
|
|
Download the FaceCaption-15M dataset from [here](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M). |
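The dataset holds roughly 15M image-text pairs, so streaming access avoids a full download. A minimal sketch with the `datasets` library; the split name and record fields are assumptions to verify against the dataset card.

```python
from datasets import load_dataset

# Stream records instead of downloading all 15M pairs up front.
# The "train" split name is an assumption; check the dataset card.
ds = load_dataset("OpenFace-CQUPT/FaceCaption-15M", split="train", streaming=True)

for example in ds.take(3):
    # Field names vary by dataset -- inspect the keys for the real schema.
    print(example.keys())
```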
|
|
|
|
|
## Results |
|
|
|
### Task 1: Text-Image Retrieval
|
|
|
**Table 1:** Comparison with other classical pretrained models. All pretrained backbones are frozen; only the linear layer is fine-tuned. † denotes the model pretrained on the LAION-Face dataset; * denotes the model pretrained on a FaceCaption variant constructed without LLM text generation.
|
|
|
 |
|
|
|
### Task 2: Facial Attribute Prediction
|
|
|
**Table 2:** Comparison with other classical models. † denotes the model pretrained on the original LAION-Face dataset.
|
|
|
 |
|
|
|
### Task 3: Sketch-Less Facial Image Retrieval (SLFIR)
|
|
|
**Table 3:** Comparative results with different baseline methods. † denotes the model pretrained on the LAION-Face dataset.
|
|
|
 |
|
|
|
 |
|
|
|
**Fig.2: Demonstration of our FLIP-based model on the SLFIR task. Both methods can retrieve the target face photo in the top-5 list from a partial sketch, but our FLIP-based model achieves this with fewer strokes than the baseline. The number at the bottom denotes the rank of the paired (true-match) photo at each stage.**
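The per-stage rank numbers at the bottom of Fig.2 can be computed as the position of the true-match photo when gallery photos are sorted by similarity to the current partial-sketch embedding. A minimal sketch:

```python
import torch

def true_match_rank(sketch_emb, photo_embs, target_idx):
    """1-based rank of the paired photo for one partial-sketch embedding.

    sketch_emb: (dim,) embedding of the sketch at the current stage.
    photo_embs: (n_photos, dim) gallery embeddings.
    """
    sims = photo_embs @ sketch_emb          # (n_photos,) similarity scores
    order = sims.argsort(descending=True)   # best match first
    return (order == target_idx).nonzero(as_tuple=True)[0].item() + 1
```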
|
|
|
## Contacts |
|
Email: [email protected] or dw[email protected]
|
|
|
## Citation |
|
```tex |
|
@misc{dai202415mmultimodalfacialimagetext, |
|
title={15M Multimodal Facial Image-Text Dataset}, |
|
author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang}, |
|
year={2024}, |
|
eprint={2407.08515}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2407.08515}, |
|
} |
|
``` |