|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- wikimedia/wikipedia |
|
language: |
|
- en |
|
library_name: transformers |
|
tags: |
|
- LLM2Vec |
|
- encoder |
|
- LLM |
|
- classification |
|
- NER |
|
- question-answering |
|
--- |
|
# LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders |
|
|
|
> LLM2Vec is a simple recipe to convert decoder-only LLMs into text encoders. It consists of 3 simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. The model can be further fine-tuned to achieve state-of-the-art performance. |
|
- **Repository:** https://github.com/McGill-NLP/llm2vec |
|
- **Paper:** https://arxiv.org/abs/2404.05961 |
|
|
|
## Overview
|
This is a bi-directional version of Tiny-LLaMA-1.0B trained with masked next token prediction (MNTP) on the Wikipedia dataset. Modern decoder models offer several advantages over classical encoders like BERT:
|
|
|
- They are pre-trained on more recent textual corpora

- They are trained on larger and more diverse datasets

- They have better support for long context windows

- Flash-attention support is available for these models
|
|
|
Considering these benefits, we are excited to release a series of decoder models tuned to work in a bi-directional setting. This approach combines the strengths of modern decoder architectures with the versatility of bi-directional context understanding, potentially opening up new possibilities for various natural language processing tasks, such as NER. |
|
|
|
In contrast to the original LLM2Vec recipe, which trains only LoRA adapters, we trained all weights of the LLaMA model; this potentially improves its bi-directional abilities.
|
|
|
## Installation |
|
```bash |
|
pip install llm2vec |
|
``` |
|
|
|
## Usage |
|
```python |
|
import torch
from transformers import AutoTokenizer

from llm2vec.models import LlamaBiModel

# Load the tokenizer and the bi-directional LLaMA model. LlamaBiModel patches the
# decoder-only architecture so that it uses bidirectional attention.
tokenizer = AutoTokenizer.from_pretrained("knowledgator/Llama-encoder-1.0B")
model = LlamaBiModel.from_pretrained("knowledgator/Llama-encoder-1.0B")

# LLaMA tokenizers often ship without a pad token; reuse the EOS token so that padding works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state  # (batch_size, seq_len, hidden_size)
|
``` |
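
The `last_hidden_states` tensor holds one vector per token. If you need a single sentence embedding, one common option is to mean-pool the token vectors over the attention mask; this pooling strategy is an illustrative choice, not something prescribed by the checkpoint:

```python
# Continues from the snippet above: mean-pool token states into one sentence vector.
mask = inputs["attention_mask"].unsqueeze(-1).to(last_hidden_states.dtype)
sentence_embedding = (last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # (batch_size, hidden_size)
```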
|
|
|
|
|
|
## Adapting for Different Discriminative Tasks |
|
|
|
Our bi-directional LLaMA model can be easily adapted for various discriminative tasks such as text classification, question answering, and token classification. |
|
To use these specialized versions, we provide a [fork of LLM2Vec](https://github.com/Knowledgator/llm2vec) with additional functionality. |
|
|
|
### Installation |
|
|
|
To get started, clone our fork of LLM2Vec and install it: |
|
|
|
```bash |
|
git clone https://github.com/Knowledgator/llm2vec.git |
|
cd llm2vec |
|
pip install -e . |
|
``` |
|
|
|
The `-e` flag installs the package in editable mode, which is useful for development.
|
|
|
### Usage |
|
|
|
Here's how to import and use the models for different tasks: |
|
|
|
```python |
|
from llm2vec import (
    AutoLLMEncoderForSequenceClassification,
    AutoLLMEncoderForQuestionAnswering,
    AutoLLMEncoderForTokenClassification,
)

# Load models for different tasks
classification_model = AutoLLMEncoderForSequenceClassification.from_pretrained('knowledgator/Llama-encoder-1.0B')
question_answering_model = AutoLLMEncoderForQuestionAnswering.from_pretrained('knowledgator/Llama-encoder-1.0B')
token_classification_model = AutoLLMEncoderForTokenClassification.from_pretrained('knowledgator/Llama-encoder-1.0B')
|
``` |
|
|
|
### Example: Text Classification |
|
|
|
Here's a basic example of how to use the model for text classification: |
|
|
|
```python |
|
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('knowledgator/Llama-encoder-1.0B')

# Prepare input
text = "This movie is great!"
inputs = tokenizer(text, return_tensors="pt")

# Get classification logits (classification_model was loaded in the previous snippet)
outputs = classification_model(**inputs)
logits = outputs.logits

# The logits can be used with a softmax function to get probabilities,
# or you can use torch.argmax(logits, dim=1) to get the predicted class
|
``` |
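
Continuing from the example above, the logits can be turned into probabilities and a predicted class id as follows. Keep in mind that the classification head is freshly initialized until the model is fine-tuned, so the predictions are not meaningful yet:

```python
import torch
import torch.nn.functional as F

probabilities = F.softmax(logits, dim=-1)       # (batch_size, num_labels)
predicted_class = torch.argmax(logits, dim=-1)  # (batch_size,)
print(probabilities, predicted_class)
```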
|
|
|
### Fine-tuning |
|
|
|
To fine-tune these models on your specific task: |
|
|
|
1. Prepare your dataset in a format compatible with HuggingFace's `datasets` library. |
|
2. Use the `Trainer` class from HuggingFace's `transformers` library to fine-tune the model. |
|
|
|
Here's a basic example: |
|
|
|
```python |
|
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load your dataset
dataset = load_dataset("your_dataset")

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Initialize Trainer
trainer = Trainer(
    model=classification_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

# Fine-tune the model
trainer.train()
|
``` |
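
The example above assumes that `dataset` already contains tokenized inputs and labels. A minimal preprocessing sketch is shown below; the `"text"` column name and the padding strategy are assumptions about your dataset, so adjust them as needed:

```python
# Hypothetical preprocessing: tokenize the raw text column before training.
# LLaMA tokenizers may lack a pad token; fall back to EOS so that padding works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

dataset = dataset.map(tokenize_function, batched=True)
```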
|
|
|
### Contributing |
|
|
|
We welcome contributions! If you have suggestions for improvements or encounter any issues, please open an issue or submit a pull request on our [GitHub repository](https://github.com/Knowledgator/llm2vec). |
|
|
|
|