|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- wikimedia/wikipedia |
|
language: |
|
- en |
|
library_name: transformers |
|
tags: |
|
- LLM2Vec |
|
- encoder |
|
- LLM |
|
- classification |
|
- NER |
|
- question-answering |
|
--- |
|
# LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders |
|
|
|
> LLM2Vec is a simple recipe to convert decoder-only LLMs into text encoders. It consists of 3 simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. The model can be further fine-tuned to achieve state-of-the-art performance. |
|
- **Repository:** https://github.com/McGill-NLP/llm2vec |
|
- **Paper:** https://arxiv.org/abs/2404.05961 |
|
|
|
## Overview
|
This is a bi-directional version of Tiny-LLaMA-1.0B trained with masked next token prediction (MNTP) on the Wikipedia dataset. Modern decoder models offer several advantages over classical encoders like BERT:
|
|
|
- They are pre-trained on more recent textual corpora

- They are trained on larger and more diverse datasets

- They have better support for long context windows

- Flash-attention support is available for these models
|
|
|
Considering these benefits, we are excited to release a series of decoder models tuned to work in a bi-directional setting. This approach combines the strengths of modern decoder architectures with the versatility of bi-directional context understanding, potentially opening up new possibilities for various natural language processing tasks, such as NER. |
|
|
|
In contrast to the original LLM2Vec recipe, which trains only LoRA adapters, we trained all weights of the LLaMA model; this potentially improves its bi-directional abilities.
|
|
|
## Installation |
|
```bash |
|
pip install llm2vec |
|
``` |
|
|
|
## Usage |
|
```python |
|
import torch
from transformers import AutoTokenizer

from llm2vec.models import LlamaBiModel

# Load the tokenizer and the bi-directional LLaMA model. LlamaBiModel patches the
# decoder-only architecture so that it uses bidirectional attention.
tokenizer = AutoTokenizer.from_pretrained("knowledgator/Llama-encoder-1.0B")
model = LlamaBiModel.from_pretrained("knowledgator/Llama-encoder-1.0B")

# LLaMA tokenizers often ship without a pad token; reuse the EOS token so that padding works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state  # (batch_size, seq_len, hidden_size)
|
``` |
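
The `last_hidden_states` tensor holds one vector per token. If you need a single sentence embedding, one common option is to mean-pool the token vectors over the attention mask; this pooling strategy is an illustrative choice, not something prescribed by the checkpoint:

```python
# Continues from the snippet above: mean-pool token states into one sentence vector.
mask = inputs["attention_mask"].unsqueeze(-1).to(last_hidden_states.dtype)
sentence_embedding = (last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # (batch_size, hidden_size)
```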
|
|
|
|
|
|
## Adapting for Different Discriminative Tasks |
|
|
|
Our bi-directional LLaMA model can be easily adapted for various discriminative tasks such as text classification, question answering, and token classification. |
|
To use these specialized versions, we provide a [fork of LLM2Vec](https://github.com/Knowledgator/llm2vec) with additional functionality. |
|
|
|
### Installation |
|
|
|
To get started, clone our fork of LLM2Vec and install it: |
|
|
|
```bash |
|
git clone https://github.com/Knowledgator/llm2vec.git |
|
cd llm2vec |
|
pip install -e . |
|
``` |
|
|
|
The `-e` flag installs the package in editable mode, which is useful for development.
|
|
|
### Usage |
|
|
|
Here's how to import and use the models for different tasks: |
|
|
|
```python |
|
from llm2vec import (
    AutoLLMEncoderForSequenceClassification,
    AutoLLMEncoderForQuestionAnswering,
    AutoLLMEncoderForTokenClassification,
)

# Load models for different tasks
classification_model = AutoLLMEncoderForSequenceClassification.from_pretrained('knowledgator/Llama-encoder-1.0B')
question_answering_model = AutoLLMEncoderForQuestionAnswering.from_pretrained('knowledgator/Llama-encoder-1.0B')
token_classification_model = AutoLLMEncoderForTokenClassification.from_pretrained('knowledgator/Llama-encoder-1.0B')
|
``` |
|
|
|
### Example: Text Classification |
|
|
|
Here's a basic example of how to use the model for text classification: |
|
|
|
```python |
|
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('knowledgator/Llama-encoder-1.0B')

# Prepare input
text = "This movie is great!"
inputs = tokenizer(text, return_tensors="pt")

# Get classification logits (classification_model was loaded in the previous snippet)
outputs = classification_model(**inputs)
logits = outputs.logits

# The logits can be used with a softmax function to get probabilities,
# or you can use torch.argmax(logits, dim=1) to get the predicted class
|
``` |
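
Continuing from the example above, the logits can be turned into probabilities and a predicted class id as follows. Keep in mind that the classification head is freshly initialized until the model is fine-tuned, so the predictions are not meaningful yet:

```python
import torch
import torch.nn.functional as F

probabilities = F.softmax(logits, dim=-1)       # (batch_size, num_labels)
predicted_class = torch.argmax(logits, dim=-1)  # (batch_size,)
print(probabilities, predicted_class)
```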
|
|
|
### Fine-tuning |
|
|
|
To fine-tune these models on your specific task: |
|
|
|
1. Prepare your dataset in a format compatible with HuggingFace's `datasets` library. |
|
2. Use the `Trainer` class from HuggingFace's `transformers` library to fine-tune the model. |
|
|
|
Here's a basic example: |
|
|
|
```python |
|
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load your dataset
dataset = load_dataset("your_dataset")

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Initialize Trainer
trainer = Trainer(
    model=classification_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

# Fine-tune the model
trainer.train()
|
``` |
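
The example above assumes that `dataset` already contains tokenized inputs and labels. A minimal preprocessing sketch is shown below; the `"text"` column name and the padding strategy are assumptions about your dataset, so adjust them as needed:

```python
# Hypothetical preprocessing: tokenize the raw text column before training.
# LLaMA tokenizers may lack a pad token; fall back to EOS so that padding works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

dataset = dataset.map(tokenize_function, batched=True)
```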
|
|
|
### Contributing |
|
|
|
We welcome contributions! If you have suggestions for improvements or encounter any issues, please open an issue or submit a pull request on our [GitHub repository](https://github.com/Knowledgator/llm2vec). |
|
|
|
|