---
license: mit
language:
- en
pipeline_tag: zero-shot-image-classification
tags:
- vision
- clip
---

## BRAHMAI-CLIP-v0.1

### MODEL TYPE:
The base model employs a ViT-L/14 Transformer architecture for the image encoder and a masked self-attention Transformer for the text encoder. These encoders are trained with a contrastive loss to maximize the similarity between matching image and text pairs.
The original implementation offered two variants: one with a ResNet image encoder and the other with a Vision Transformer. This repository contains the Vision Transformer variant.
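The two encoders can also be called on their own to embed images and text into the shared space learned by the contrastive objective; matching pairs should score highest under cosine similarity. The snippet below is a minimal sketch of that workflow (our illustration, reusing the model id and example image from the CODE section below, not an official recipe):

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "brahmairesearch/brahmai-clip-v0.1"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
texts = ["a cat's photograph", "a dog's photograph"]

with torch.no_grad():
    # Each encoder projects its modality into the shared embedding space.
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=texts, return_tensors="pt", padding=True))

# L2-normalize and compare with cosine similarity (higher = better match).
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)
```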
### DATE: June 2024

### CODE:
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

# Define the model and processor
model_id = "brahmairesearch/brahmai-clip-v0.1"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Load the image from URL
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Define the text descriptions
descriptions = ["a cat's photograph", "a dog's photograph"]

# Process the inputs
inputs = processor(text=descriptions, images=image, return_tensors="pt", padding=True)

# Get the outputs from the model
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image

# Calculate the label probabilities
probs = logits_per_image.softmax(dim=1)

# Print the results
print(probs)
```
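To make the output easier to read, the probabilities can be paired with their descriptions; this small extension of the snippet above is our addition:

```python
# Pair each description with its probability (continuation of the example above).
for description, prob in zip(descriptions, probs[0]):
    print(f"{description}: {prob.item():.4f}")
```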
---

### Model Use

#### Intended Use

The model is designed as a research tool for academic and research communities. It aims to help researchers delve into zero-shot, arbitrary image classification and to explore interdisciplinary studies on the potential impacts of such models. The CLIP paper provides an example of these analyses by discussing potential downstream effects.

**Primary Intended Users:**
- AI researchers.

We expect researchers to use this model to gain insights into the robustness, generalization, capabilities, biases, and constraints of computer vision models.

#### Out-of-Scope Use Cases

- **Deployed Use Cases:** Any deployment of the model, whether commercial or not, is currently out of scope. Non-deployed uses, such as image search in a controlled environment, are also not advised unless there has been thorough in-domain testing with a specific, fixed class taxonomy. This caution is due to the variability in CLIP's performance with different class taxonomies, as highlighted in our safety assessment.

- **Surveillance and Facial Recognition:** Use cases involving surveillance and facial recognition are always out of scope. The premature application of AI in these domains, given the current lack of testing norms and fairness checks, is potentially harmful.

- **Non-English Languages:** The model has not been specifically trained or evaluated in languages other than English. Therefore, its use should be limited to English-language applications.

---
## Limitations

CLIP and our analysis of it have several limitations. The model currently struggles with tasks such as fine-grained classification and counting objects. Additionally, CLIP raises concerns regarding fairness and bias, which we discuss in the paper and briefly in the next section. An important limitation of our testing approach is the use of linear probes to evaluate CLIP's performance, as there is evidence suggesting that linear probes can underestimate model performance.
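For readers unfamiliar with the evaluation setup, a linear probe freezes the image encoder and fits a simple linear classifier on its features. The sketch below is our own illustration of that idea using scikit-learn and synthetic stand-in data, not the paper's exact protocol:

```python
# Illustrative linear-probe sketch (assumptions: scikit-learn is available and
# the synthetic images below stand in for a real labeled dataset).
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model_id = "brahmairesearch/brahmai-clip-v0.1"
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

def encode_images(images):
    """Return frozen CLIP image features as a NumPy array."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs).numpy()

# Synthetic stand-in data: replace with real images and labels when probing.
rng = np.random.default_rng(0)
images = [Image.fromarray(rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)) for _ in range(8)]
labels = [0, 1] * 4

features = encode_images(images)
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe accuracy (synthetic data):", probe.score(features, labels))
```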
### Bias and Fairness

The performance and specific biases of CLIP can vary significantly based on class design and the choices made for including or excluding categories. We assessed the risk of certain types of denigration by classifying images of people from the [Fairface](https://arxiv.org/abs/1908.04913) dataset into crime-related and non-human animal categories. Significant disparities were found concerning race and gender, and these disparities could shift based on the class construction. Details of these findings are captured in the Broader Impacts section of the paper.
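As a neutral illustration of this class-design sensitivity (our example, reusing the COCO image from the CODE section rather than Fairface), scoring the same image against two different label sets yields very different probability distributions:

```python
# Illustrative only: the predicted distribution depends entirely on which
# candidate labels are offered, which is why class design matters.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "brahmairesearch/brahmai-clip-v0.1"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)

label_sets = [
    ["a cat's photograph", "a dog's photograph"],
    ["a photo of a pet", "a photo of furniture", "a photo of food"],
]

for labels in label_sets:
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=1)[0]
    print({label: round(p.item(), 3) for label, p in zip(labels, probs)})
```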
We also evaluated CLIP's performance on gender, race, and age classification using the Fairface dataset. For gender classification, we found accuracy above 96% across all races, with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). For racial classification, CLIP averaged around 93% accuracy, and for age classification, it averaged around 63% accuracy. Our evaluations of gender, race, and age classification, as well as denigration harms, are intended to assess the model's performance across different demographics and to highlight potential risks, rather than to endorse or promote such tasks.