license: apache-2.0
tags:
  - Drenel
  - Hippo
  - LLM
  - MultiLingual
  - Drenel/Hippo-6B
base_model:
  - Drenel/Hippo-6B
library_name: transformers

Model Details

Hippo-6B is a transformer-based language model with 6.2 billion parameters. It is designed for a wide range of natural language processing tasks and aims to balance computational efficiency with output quality, which makes it suitable for a variety of applications.

Context Length: Up to 4K tokens

Publisher: Drenel

Key Features and Technologies

1. Efficient Attention Mechanism

Flash Attention: Hippo-6B uses flash attention kernels (flash_attn_func and flash_attn_varlen_func) to compute attention without materializing the full attention score matrix. This reduces memory usage and computational overhead, which makes longer contexts practical to process.
Support for Window Size: The model conditionally supports attention windows, which restrict each token to attending over a limited span of nearby tokens. Whether a window is applied depends on the available hardware and the task requirements.
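
To make the attention-window idea concrete, here is a minimal pure-Python sketch of a causal sliding-window mask. It is an illustration only, not Hippo-6B's implementation: the helper name sliding_window_mask and the convention that each token sees itself plus the previous window - 1 tokens are assumptions made for this example.

```python
def sliding_window_mask(seq_len, window):
    """Return a seq_len x seq_len boolean mask (hypothetical helper).

    mask[i][j] is True when query position i may attend to key position j:
    j must not lie in the future (causal) and must fall within the last
    `window` positions, counting position i itself.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    # "x" marks an allowed key position, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

With a window of 3, no row allows more than three key positions, so the cost of attention per token stays bounded as the sequence grows.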

2. Rotary Embeddings

Rotary Position Embeddings: Hippo-6B employs rotary position embeddings (RotaryEmbedding), which encode position by rotating query and key vectors so that attention scores depend on the relative distance between tokens. This helps the model capture long-range dependencies.
Scaled Rotary Embeddings: Variants such as SuScaledRotaryEmbedding and YarnScaledRotaryEmbedding apply scaling factors to the rotary embeddings, providing finer control over how positions are encoded.
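
The following pure-Python sketch shows the general rotary embedding technique on a single vector. It is not the code of the RotaryEmbedding class: the function names, the frequency base of 10000.0, and the pairing of dimension i with dimension i + dim/2 are assumptions chosen for illustration.

```python
import math


def apply_rotary(vec, position, base=10000.0):
    """Rotate pairs of dimensions of `vec` by position-dependent angles.

    Dimension i is paired with dimension i + dim/2 (an assumed convention),
    and pair i is rotated by the angle position / base**(2*i/dim).
    """
    dim = len(vec)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        theta = position / (base ** (2 * i / dim))
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * math.cos(theta) - x2 * math.sin(theta)
        out[i + half] = x1 * math.sin(theta) + x2 * math.cos(theta)
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]

# Both pairs of positions are 2 tokens apart, so the two scores agree
# up to floating-point rounding.
score_near = dot(apply_rotary(query, 3), apply_rotary(key, 1))
score_far = dot(apply_rotary(query, 10), apply_rotary(key, 8))
print(score_near, score_far)
```

Because each pair of dimensions undergoes a plain 2D rotation, the score between a rotated query and a rotated key depends only on the difference of their positions, which is the property that lets the model reason about relative distance.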

3. RMS Norm

RMS Normalization: The model uses Root Mean Square (RMS) normalization layers (RMSNorm), which rescale each hidden vector by its root mean square. This stabilizes training, improves convergence, and helps keep gradient flow consistent across layers.
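
As a rough illustration of RMS normalization, the sketch below normalizes one hidden vector in pure Python. The function name rms_norm, the eps value of 1e-6, and the all-ones weight are assumptions for this example and are not taken from the model's RMSNorm layer.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale `x` so its root mean square is about 1, then apply a per-dimension weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for v, w in zip(x, weight)]


hidden = [1.0, -2.0, 3.0, -4.0]
normed = rms_norm(hidden, weight=[1.0] * len(hidden))

# The root mean square of the output is close to 1 when the weight is all ones.
output_rms = math.sqrt(sum(v * v for v in normed) / len(normed))
print(normed, output_rms)
```

Unlike layer normalization, this form does not subtract the mean, so it needs only one statistic per vector.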

4. Modular and Scalable Design

Modular Attention Classes: Hippo-6B features a modular design with different attention classes (Attention, FlashAttention2, SdpaAttention). This modularity allows easy customization and scalability of the attention mechanisms based on specific use cases.
MLP Layers: The model incorporates Multi-Layer Perceptron (MLP) layers with gating mechanisms to enhance the model's expressive power. The MLP class includes techniques such as expert gating and intermediate projections for more sophisticated representations.
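
A gated MLP can be sketched as follows. This is a generic example of the gating idea, not the model's MLP class: the SiLU activation, the three weight matrices (w_gate, w_up, w_down), and all helper names are assumptions made for illustration.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def gated_mlp(x, w_gate, w_up, w_down):
    """Project up twice, let one branch gate the other, then project back down."""
    gate = [silu(v) for v in matvec(w_gate, x)]   # gating branch
    up = matvec(w_up, x)                          # intermediate projection
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(w_down, hidden)


# Toy sizes: hidden size 2, intermediate size 3 (hypothetical values).
w_gate = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
w_up = [[0.5, -0.3], [0.2, 0.1], [0.0, 0.6]]
w_down = [[0.3, 0.1, -0.2], [0.4, -0.5, 0.2]]
print(gated_mlp([1.0, -1.0], w_gate, w_up, w_down))
```

The gating branch decides, per intermediate dimension, how much of the projected signal passes through to the output projection.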

5. Caching and Memory Efficiency

Dynamic Caching: The model supports caching strategies (Cache, DynamicCache) that store the keys and values of already-processed tokens during inference, so they are not recomputed at every generation step. This makes processing of long sequences faster and more efficient.
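
The toy class below illustrates the idea behind a growing key/value cache. It is a simplified stand-in written for this example: ToyKVCache and its methods are hypothetical and do not reproduce the interface of the Cache or DynamicCache classes.

```python
class ToyKVCache:
    """Per-layer store of keys and values that grows as tokens are processed."""

    def __init__(self):
        self.keys = {}
        self.values = {}

    def update(self, layer_idx, new_keys, new_values):
        """Append the new entries and return everything cached for the layer."""
        self.keys.setdefault(layer_idx, []).extend(new_keys)
        self.values.setdefault(layer_idx, []).extend(new_values)
        return self.keys[layer_idx], self.values[layer_idx]

    def seq_length(self, layer_idx=0):
        """Number of token positions cached for the given layer."""
        return len(self.keys.get(layer_idx, []))


cache = ToyKVCache()

# Prefill: the whole 3-token prompt is processed in one step.
cache.update(0, ["k0", "k1", "k2"], ["v0", "v1", "v2"])

# Decoding: each step adds only the newest token's key and value.
for step in range(3, 5):
    keys, values = cache.update(0, [f"k{step}"], [f"v{step}"])

print(cache.seq_length())
```

Each decoding step only computes the key and value for the newest token and reuses the rest from the cache.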

6. Loss Functions

Cross-Entropy Loss: The model uses Cross-Entropy Loss for classification tasks, where the target is a categorical distribution.
Mean Squared Error (MSE) Loss: For regression tasks, MSE Loss is used to minimize the squared difference between predicted and actual values.
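
For reference, both losses can be written in a few lines of pure Python. This is a sketch of the standard definitions for a single example, with hypothetical function names, and not the model's training code.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[target_index]


def mse(predictions, targets):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Uniform logits over 4 classes give a loss of ln(4), whatever the target is.
print(cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2))
print(mse([1.0, 2.0], [1.0, 4.0]))
```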

Usage

Hippo-6B can be used for a variety of NLP tasks, including but not limited to:

Text Generation
Language Translation
Sentiment Analysis
Named Entity Recognition
Text Classification

Chat Format

You can provide the prompt as a question using the following generic template:

<|user|>\nQuestion<|end|>\n<|assistant|>
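
If you build the prompt string by hand, a small helper like the one below fills in the template. The name build_prompt is a hypothetical helper for illustration; only the template itself comes from this model card.

```python
def build_prompt(question):
    """Wrap a user question in the chat template shown above."""
    return f"<|user|>\n{question}<|end|>\n<|assistant|>"


prompt = build_prompt("What is the capital of France?")
print(prompt)
```

The prompt ends with the <|assistant|> marker so that the model continues with the assistant's reply.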

Example

Here is a quick example of how to use Hippo-6B for text generation:

# Install the required libraries first:
# pip install -q transformers accelerate flash-attn

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Fix the random seed so that sampled outputs are reproducible.
torch.random.manual_seed(0)
model_name = "Drenel/Hippo-6B"

# trust_remote_code=True lets transformers load the model's custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Pass plain message content: the pipeline applies the tokenizer's chat template
# to a list of messages, so <|end|> and <|assistant|> are not added by hand.
messages = [
    {"role": "user", "content": "What is the capital of France?"},
]

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
generation_args = {
    "max_new_tokens": 50,
    "return_full_text": False,
    "do_sample": True,  # temperature, top_k and top_p only take effect when sampling
    "temperature": 0.7,
    "top_k": 50,
    "top_p": 0.95,
}
output = pipe(messages, **generation_args)
print(output[0]["generated_text"])

License

Hippo-6B is distributed under the Apache-2.0 license.