|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- OpenFace-CQUPT/FaceCaption-15M |
|
language: |
|
- zh |
|
- en |
|
metrics: |
|
- accuracy |
|
pipeline_tag: image-to-text |
|
--- |
|
# Demonstration of Cross-modal Retrieval (FLIP-based model) |
|
|
|
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/TGxEwHBbWZIbW67kG9jMH.mp4"></video> |
|
|
|
# FLIP (Facial Language Image Pretraining) |
|
|
|
This repository is the official implementation of [FaceCaption-15M](https://arxiv.org/abs/2407.08515).
|
|
|
# Updates
|
|
|
**[24/07/20] The FLIP usage demo has been released! [OpenFace-CQUPT/FLIP-demo](https://huggingface.co/OpenFace-CQUPT/FLIP/tree/main/FLIP-demo)**



**[24/07/17] The FLIP model has been released! [OpenFace-CQUPT/FLIP](https://huggingface.co/OpenFace-CQUPT/FLIP)**
|
|
|
**Overview of the FLIP architecture.**
|
|
|
 |
|
|
|
**Fig.1: (a) Identical colors indicate shared parameters; "12x" denotes 12 stacked transformer layers. (b), (c), and (d) show the FLIP-based model applied to text-image retrieval, facial attribute prediction, and sketch-less facial image retrieval, respectively.**
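For readers implementing the dual-encoder design in Fig.1(a): CLIP-style models of this kind are typically trained with a symmetric image-text contrastive (InfoNCE) objective. The sketch below is illustrative only, not the exact FLIP training code; the tensor shapes and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    NOTE: illustrative sketch; the actual FLIP objective may differ.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; contrast in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```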
|
|
|
## Training |
|
|
|
Coming soon... (The training code is only meaningful once the dataset has been fully published.)
|
|
|
```shell |
|
python pretrain.py > log.log |
|
``` |
|
|
|
## Pre-trained Models |
|
|
|
We provide pretrained model weights:

FLIP Base: download [here](https://huggingface.co/OpenFace-CQUPT/Facial-language-image-pretraining-model/tree/main/ckpt)

FLIP Large: coming soon
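Until the training code lands, the weights can be fetched directly from the Hub. Below is a minimal sketch using `huggingface_hub`; the repo id comes from the FLIP Base link above, while the checkpoint filename is a placeholder to replace with the actual file listed under `ckpt/`.

```python
from huggingface_hub import hf_hub_download
import torch

# Repo id from the "FLIP Base" link above; the filename is a placeholder --
# replace it with the actual checkpoint name listed under ckpt/.
ckpt_path = hf_hub_download(
    repo_id="OpenFace-CQUPT/Facial-language-image-pretraining-model",
    filename="ckpt/flip_base.pth",  # placeholder name, not verified
)
state_dict = torch.load(ckpt_path, map_location="cpu")
```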
|
|
|
## Datasets |
|
|
|
Download the FaceCaption-15M dataset from [here](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M). |
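The dataset holds roughly 15M image-text pairs, so streaming access avoids a full download. A minimal sketch with the `datasets` library; the split name and record fields are assumptions to verify against the dataset card.

```python
from datasets import load_dataset

# Stream records instead of downloading all 15M pairs up front.
# The "train" split name is an assumption; check the dataset card.
ds = load_dataset("OpenFace-CQUPT/FaceCaption-15M", split="train", streaming=True)

for example in ds.take(3):
    # Field names vary by dataset -- inspect the keys for the real schema.
    print(example.keys())
```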
|
|
|
|
|
## Results |
|
|
|
### Task 1: Text-Image Retrieval
|
|
|
**Table 1:** Comparison with other classical pretrained models. All pretrained backbones are frozen; only the linear layer is fine-tuned. † denotes the model pretrained on the LAION-Face dataset; * denotes the model pretrained on a FaceCaption variant constructed without LLM text generation.
|
|
|
 |
|
|
|
### Task 2: Facial Attribute Prediction
|
|
|
**Table 2:** Comparison with other classical models. † denotes the model pretrained on the original LAION-Face dataset.
|
|
|
 |
|
|
|
### Task 3: Sketch-Less Facial Image Retrieval (SLFIR)
|
|
|
**Table 3:** Comparative results with different baseline methods. † denotes the model pretrained on the LAION-Face dataset.
|
|
|
 |
|
|
|
 |
|
|
|
**Fig.2: Demonstration of our FLIP-based model on the SLFIR task. Both methods can retrieve the target face photo in the top-5 list from a partial sketch, but our FLIP-based model achieves this with fewer strokes than the baseline. The number at the bottom denotes the rank of the paired (true-match) photo at each stage.**
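The per-stage rank numbers at the bottom of Fig.2 can be computed as the position of the true-match photo when gallery photos are sorted by similarity to the current partial-sketch embedding. A minimal sketch:

```python
import torch

def true_match_rank(sketch_emb, photo_embs, target_idx):
    """1-based rank of the paired photo for one partial-sketch embedding.

    sketch_emb: (dim,) embedding of the sketch at the current stage.
    photo_embs: (n_photos, dim) gallery embeddings.
    """
    sims = photo_embs @ sketch_emb          # (n_photos,) similarity scores
    order = sims.argsort(descending=True)   # best match first
    return (order == target_idx).nonzero(as_tuple=True)[0].item() + 1
```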
|
|
|
## Contacts |
|
Email: [email protected] or dw[email protected]
|
|
|
## Citation |
|
```tex |
|
@misc{dai202415mmultimodalfacialimagetext, |
|
title={15M Multimodal Facial Image-Text Dataset}, |
|
author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang}, |
|
year={2024}, |
|
eprint={2407.08515}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2407.08515}, |
|
} |
|
``` |