Jobaar committed on
Commit c90fa7b · verified · 1 Parent(s): 110dac3

Upload 3 files
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ llama-joycaption-alpha-two-llava-mmproj-model-f16.gguf filter=lfs diff=lfs merge=lfs -text
+ llama-joycaption-alpha-two-llava-Q6_K.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,109 @@
+ ---
+ base_model:
+ - meta-llama/Llama-3.1-8B-Instruct
+ - google/siglip-so400m-patch14-384
+ tags:
+ - captioning
+ ---
+ # Model Card for Llama JoyCaption Alpha Two
+
+ [Github](https://github.com/fpgaminer/joycaption)
+
+ JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models.
+
+ Key Features:
+ - **Free and Open**: It will be released for free, with open weights and no restrictions, and, just like [bigASP](https://www.reddit.com/r/StableDiffusion/comments/1dbasvx/the_gory_details_of_finetuning_sdxl_for_30m/), will come with training scripts and lots of juicy details on how it gets built.
+ - **Uncensored**: Equal coverage of SFW and NSFW concepts. No "cylindrical shaped object with a white substance coming out on it" here.
+ - **Diversity**: All are welcome here. Do you like digital art? Photoreal? Anime? Furry? JoyCaption is for everyone. Pains are being taken to ensure broad coverage of image styles, content, ethnicity, gender, orientation, etc.
+ - **Minimal Filtering**: JoyCaption is trained on large swathes of images so that it can understand almost all aspects of our world. Almost. Illegal content will never be tolerated in JoyCaption's training.
+
+
+ ## Motivation
+
+ Automated descriptive captions enable the training and finetuning of diffusion models on a wider range of images, since trainers are no longer required to either find images with already-associated text or write the descriptions themselves. They also improve the quality of generations produced by Text-to-Image models trained on them (ref: DALL-E 3 paper). But to date, the community has been stuck with either ChatGPT, which is expensive and heavily censored, or alternative models like CogVLM, which are weaker than ChatGPT and have abysmal performance outside of the SFW domain.
+
+ I'm building JoyCaption to help fill this gap, performing near or on par with GPT-4o in captioning images, while being free, unrestricted, and open.
+
+
+ ## How to Get Started with the Model
+
+ Please see the [Github](https://github.com/fpgaminer/joycaption) for more details.
+
+ Example usage:
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, LlavaForConditionalGeneration
+
+
+ IMAGE_PATH = "image.jpg"
+ PROMPT = "Write a long descriptive caption for this image in a formal tone."
+ MODEL_NAME = "fancyfeast/llama-joycaption-alpha-two-hf-llava"
+
+
+ # Load JoyCaption
+ # bfloat16 is the native dtype of the LLM used in JoyCaption (Llama 3.1)
+ # device_map=0 loads the model onto the first GPU
+ processor = AutoProcessor.from_pretrained(MODEL_NAME)
+ llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
+ llava_model.eval()
+
+ with torch.no_grad():
+     # Load the image
+     image = Image.open(IMAGE_PATH)
+
+     # Build the conversation
+     convo = [
+         {
+             "role": "system",
+             "content": "You are a helpful image captioner.",
+         },
+         {
+             "role": "user",
+             "content": PROMPT,
+         },
+     ]
+
+     # Format the conversation
+     # WARNING: HF's handling of chats on Llava models is very fragile. This specific combination of
+     # processor.apply_chat_template() and processor() works, but if you use other combinations, always
+     # inspect the final input_ids to ensure they are correct. If you are not careful you will often end
+     # up with multiple <bos> tokens, which can make the model perform poorly.
+     convo_string = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)
+     assert isinstance(convo_string, str)
+
+     # Process the inputs
+     inputs = processor(text=[convo_string], images=[image], return_tensors="pt").to('cuda')
+     inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
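+
+     # Optional sanity check, following the warning above (an editor's suggestion,
+     # not part of the original example): count BOS tokens in the final input_ids;
+     # more than one usually means the chat template was applied twice.
+     bos_count = (inputs['input_ids'][0] == processor.tokenizer.bos_token_id).sum().item()
+     assert bos_count == 1, f"Expected exactly one BOS token, found {bos_count}"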
+
+     # Generate the caption
+     generate_ids = llava_model.generate(
+         **inputs,
+         max_new_tokens=300,
+         do_sample=True,
+         suppress_tokens=None,
+         use_cache=True,
+         temperature=0.6,
+         top_k=None,
+         top_p=0.9,
+     )[0]
+
+     # Trim off the prompt
+     generate_ids = generate_ids[inputs['input_ids'].shape[1]:]
+
+     # Decode the caption
+     caption = processor.tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+     caption = caption.strip()
+     print(caption)
+ ```
+
+
+ ## vLLM
+
+ vLLM provides the highest-performance inference for JoyCaption, along with an OpenAI-compatible API, so JoyCaption can be used like any other VLM. Example usage:
+
+ ```bash
+ vllm serve fancyfeast/llama-joycaption-alpha-two-hf-llava --max-model-len 4096 --enable-prefix-caching
+ ```
+
+ VLMs are a bit finicky on vLLM, and vLLM is memory hungry, so you may have to adjust settings for your particular environment, such as forcing eager mode, adjusting max-model-len, adjusting gpu_memory_utilization, etc.
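+
+ Once the server is up, captions can be requested through any OpenAI-compatible client. A minimal sketch using the `openai` Python package, assuming the server above is listening on the default http://localhost:8000 (the prompt and sampling values are carried over from the transformers example, not vLLM defaults):
+
+ ```python
+ import base64
+ from openai import OpenAI
+
+ # vLLM's OpenAI-compatible server does not require a real API key by default
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
+
+ # Send the image as a base64 data URL, the standard chat-completions image format
+ with open("image.jpg", "rb") as f:
+     image_b64 = base64.b64encode(f.read()).decode("utf-8")
+
+ response = client.chat.completions.create(
+     model="fancyfeast/llama-joycaption-alpha-two-hf-llava",
+     messages=[
+         {"role": "system", "content": "You are a helpful image captioner."},
+         {
+             "role": "user",
+             "content": [
+                 {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
+                 {"type": "text", "text": "Write a long descriptive caption for this image in a formal tone."},
+             ],
+         },
+     ],
+     max_tokens=300,
+     temperature=0.6,
+     top_p=0.9,
+ )
+ print(response.choices[0].message.content)
+ ```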
llama-joycaption-alpha-two-llava-Q6_K.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e5f0a5d37a66b97d454eee866e5729d9e20551aba084db20007a85c6a83b7712
+ size 6596008320
llama-joycaption-alpha-two-llava-mmproj-model-f16.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:01ec765d9bfa42204450677f1289ff597f9e4f0da1448b036a64c61fbb8bbda6
+ size 877771808
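
The two GGUF files added here pair a Q6_K quantization of the language model with an f16 multimodal projector (mmproj), the split that llama.cpp's LLaVA tooling expects. A minimal sketch of running them together, assuming a llama.cpp build with LLaVA support (the binary has been renamed across llama.cpp releases, e.g. llava-cli, llama-llava-cli, llama-mtmd-cli, so check your build):

```bash
# Hypothetical invocation; verify the binary name and flags against your llama.cpp version
llama-llava-cli \
  -m llama-joycaption-alpha-two-llava-Q6_K.gguf \
  --mmproj llama-joycaption-alpha-two-llava-mmproj-model-f16.gguf \
  --image image.jpg \
  -p "Write a long descriptive caption for this image in a formal tone."
```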