lukeingawesome committed on
Commit cfb7750 · verified · 1 Parent(s): 96a8ea4

Update README.md

Files changed (1)
  1. README.md +82 -26
README.md CHANGED
@@ -26,11 +26,13 @@ LLM2Vec4CXR is a bidirectional language model that converts the base decoder-onl
  ### Key Features

  - **Base Architecture**: LLM2CLIP-Llama-3.2-1B-Instruct
- - **Pooling Mode**: Latent Attention (modified from original)
  - **Bidirectional Processing**: Enabled for better context understanding
  - **Medical Domain**: Specialized for chest X-ray report analysis
  - **Max Length**: 512 tokens
  - **Precision**: bfloat16

  ## Training Details

@@ -62,20 +64,22 @@ pip install -e .
  ### Basic Usage

  ```python
  from llm2vec_wrapper import LLM2VecWrapper as LLM2Vec

- # Load the model
  model = LLM2Vec.from_pretrained(
      base_model_name_or_path='lukeingawesome/llm2vec4cxr',
-     enable_bidirectional=True,
-     pooling_mode="latent_attention",
      max_length=512,
      torch_dtype=torch.bfloat16,
  )

- # Simple text encoding (built-in method)
  report = "There is a small increase in the left-sided effusion. There continues to be volume loss at both bases."
- embedding = model.encode_text(report)

  # Multiple texts at once
  reports = [
@@ -86,38 +90,90 @@ reports = [
  embeddings = model.encode_text(reports)
  ```

- ### Advanced Usage with Instructions

  ```python
  # For instruction-following tasks with separator
- separator = '!@#$%^&*()'
  instruction = 'Determine the change or the status of the pleural effusion.'
  report = 'There is a small increase in the left-sided effusion.'
- text_with_instruction = instruction + separator + report

- # Use the built-in method for instruction-based encoding
- embedding = model.encode_with_instruction([text_with_instruction])
  ```

- **Note**: The model now includes convenient `encode_text()` and `encode_with_instruction()` methods that handle the `embed_mask` automatically.

- ### Manual Usage (if you need more control)

- If you need more control over the tokenization process, you can still use the manual approach:

  ```python
- # Manual tokenization with embed_mask
- def encode_text_manual(model, text):
-     inputs = model.tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
-     inputs["embed_mask"] = inputs["attention_mask"].clone() # Required for proper functioning
-
-     with torch.no_grad():
-         embeddings = model(inputs)
-     return embeddings
-
- # For instruction-based tasks, use the built-in tokenize_with_separator method
- tokenized = model.tokenize_with_separator([text_with_instruction])
- embedding = model(tokenized)
  ```

  ## Evaluation
 
  ### Key Features

  - **Base Architecture**: LLM2CLIP-Llama-3.2-1B-Instruct
+ - **Pooling Mode**: Latent Attention (fine-tuned weights automatically loaded)
  - **Bidirectional Processing**: Enabled for better context understanding
  - **Medical Domain**: Specialized for chest X-ray report analysis
  - **Max Length**: 512 tokens
  - **Precision**: bfloat16
+ - **Automatic Loading**: Latent attention weights are automatically loaded from safetensors
+ - **Simple API**: Built-in methods for similarity computation and instruction-based encoding

  ## Training Details
 
  ### Basic Usage

  ```python
+ import torch
  from llm2vec_wrapper import LLM2VecWrapper as LLM2Vec

+ # Load the model - latent attention weights are automatically loaded!
  model = LLM2Vec.from_pretrained(
      base_model_name_or_path='lukeingawesome/llm2vec4cxr',
+     pooling_mode="latent_attention", # This automatically loads the trained weights
      max_length=512,
+     enable_bidirectional=True,
      torch_dtype=torch.bfloat16,
+     use_safetensors=True,
  )

+ # Simple text encoding
  report = "There is a small increase in the left-sided effusion. There continues to be volume loss at both bases."
+ embedding = model.encode_text([report])

  # Multiple texts at once
  reports = [
 
  embeddings = model.encode_text(reports)
  ```

+ ### Advanced Usage with Instructions and Similarity

  ```python
  # For instruction-following tasks with separator
  instruction = 'Determine the change or the status of the pleural effusion.'
  report = 'There is a small increase in the left-sided effusion.'
+ query_text = instruction + '!@#$%^&*()' + report
+
+ # Compare against multiple options
+ candidates = [
+     'No pleural effusion',
+     'Pleural effusion present',
+     'Pleural effusion is worsening',
+     'Pleural effusion is improving'
+ ]
+
+ # Get similarity scores using the built-in method
+ similarities = model.compute_similarities(query_text, candidates)
+ print(f"Similarities: {similarities}")
+
+ # For custom separator-based encoding
+ embeddings = model.encode_with_separator([query_text], separator='!@#$%^&*()')
+ ```
+
+ **Note**: The model now includes convenient methods like `compute_similarities()` and `encode_with_separator()` that handle complex tokenization automatically.
+
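The separator convention used in the example above can be illustrated without loading the model. What follows is a minimal sketch, assuming the wrapper splits each input at the first occurrence of the separator string; `split_instruction` is a hypothetical helper written for illustration and is not part of `llm2vec_wrapper`.

```python
SEPARATOR = '!@#$%^&*()'


def split_instruction(text, separator=SEPARATOR):
    """Split 'instruction<separator>content' into (instruction, content).

    Hypothetical helper: text without a separator is treated as
    content with an empty instruction.
    """
    instruction, found, content = text.partition(separator)
    if not found:
        return "", text
    return instruction, content


instruction = 'Determine the change or the status of the pleural effusion.'
report = 'There is a small increase in the left-sided effusion.'
query_text = instruction + SEPARATOR + report

parsed_instruction, parsed_report = split_instruction(query_text)
print(parsed_instruction)
print(parsed_report)
```

Using `str.partition` keeps any later occurrence of the separator inside the content part, which may or may not match what the wrapper actually does.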
+ ### Quick Start Example
+
+ Here's a complete example showing the model's capabilities:
+
+ ```python
+ import torch
+ from llm2vec_wrapper import LLM2VecWrapper as LLM2Vec
+
+ # Load model
+ model = LLM2Vec.from_pretrained(
+     'lukeingawesome/llm2vec4cxr',
+     pooling_mode="latent_attention",
+     torch_dtype=torch.bfloat16,
+     use_safetensors=True,
+ )
+
+ # Medical text analysis
+ instruction = 'Determine the change or the status of the pleural effusion.'
+ report = 'There is a small increase in the left-sided effusion.'
+ query = instruction + '!@#$%^&*()' + report
+
+ # Compare with different diagnoses
+ options = [
+     'No pleural effusion',
+     'Pleural effusion is worsening',
+     'Pleural effusion is stable',
+     'Pleural effusion is improving'
+ ]

+ # Get similarity scores
+ scores = model.compute_similarities(query, options)
+ best_match = options[torch.argmax(scores)]
+ print(f"Best match: {best_match} (score: {torch.max(scores):.4f})")
  ```

+ ## API Reference
+
+ The model provides several convenient methods:

+ ### Core Methods

+ - **`encode_text(texts)`**: Simple text encoding with automatic embed_mask handling
+ - **`encode_with_separator(texts, separator='!@#$%^&*()')`**: Encoding with instruction/content separation
+ - **`compute_similarities(query_text, candidate_texts)`**: One-line similarity computation
+ - **`from_pretrained(..., pooling_mode="latent_attention")`**: Automatic latent attention weight loading
+
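The README does not state which similarity measure `compute_similarities()` uses. As a rough illustration only, the sketch below assumes cosine similarity between one query embedding and several candidate embeddings, with plain Python lists standing in for model outputs. `cosine_similarities` and the toy vectors are hypothetical.

```python
import math


def cosine_similarities(query_vec, candidate_vecs):
    """Return the cosine similarity of query_vec to each candidate vector."""
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec))

    query_norm = norm(query_vec)
    scores = []
    for cand in candidate_vecs:
        dot = sum(q * c for q, c in zip(query_vec, cand))
        denom = query_norm * norm(cand)
        # Guard against zero-length vectors instead of dividing by zero.
        scores.append(dot / denom if denom else 0.0)
    return scores


# Toy 2-D "embeddings" (assumed values, not real model outputs).
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

scores = cosine_similarities(query, candidates)
best_index = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_index)
```

Picking the best candidate by index mirrors the `torch.argmax(scores)` step in the Quick Start example.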
+ ### Migration from Manual Usage
+
+ If you were previously using manual tokenization, you can now simply use:

  ```python
+ # Old way (still works)
+ tokenized = model.tokenizer(text, return_tensors="pt", ...)
+ tokenized["embed_mask"] = tokenized["attention_mask"].clone()
+ embeddings = model(tokenized)
+
+ # New way (recommended)
+ embeddings = model.encode_text([text])
  ```
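In the old way shown above, `embed_mask` is a copy of `attention_mask`, so every non-padding token is marked. The sketch below shows on toy data how such a mask might restrict pooling to a subset of tokens. Mean pooling is used purely as a stand-in (the model itself is described as using latent attention pooling), and `masked_mean_pool` is a hypothetical helper, not part of the wrapper.

```python
def masked_mean_pool(token_vectors, embed_mask):
    """Average only the token vectors whose mask entry is 1."""
    selected = [vec for vec, keep in zip(token_vectors, embed_mask) if keep]
    if not selected:
        raise ValueError("embed_mask selects no tokens")
    dim = len(selected[0])
    return [sum(vec[i] for vec in selected) / len(selected) for i in range(dim)]


# Toy hidden states for four token positions; the last one is padding.
token_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

# Mirrors the manual approach: embed_mask is a copy of attention_mask.
embed_mask = list(attention_mask)
pooled_all = masked_mean_pool(token_vectors, embed_mask)

# Assumed instruction-aware variant: the first token is excluded from pooling.
content_only_mask = [0, 1, 1, 0]
pooled_content = masked_mean_pool(token_vectors, content_only_mask)

print(pooled_all, pooled_content)
```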
 
  ## Evaluation