Gargaz/GPT-2-gguf

Gargaz/GPT-2-gguf is a highly optimized, stable, and efficient version of GPT-2, designed for fast and reliable language generation. Leveraging the GGUF format, this model minimizes memory usage while maximizing performance, making it ideal for a wide range of natural language processing tasks. Whether you're building conversational AI, generating text, or exploring NLP research, this model delivers consistent, high-quality results.

Features

  • Optimized for Performance: Utilizes the GGUF format for reduced memory footprint and faster inference.
  • GPU Acceleration: Offloads model layers to the GPU for significantly faster inference (see the command-line sketch after this list).
  • Large Context Handling: Supports a context window of up to 16,000 tokens, so it can handle lengthy conversations or documents.
  • Stable and Reliable: Provides a robust and consistent output across various NLP tasks, ensuring high stability in deployment.
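
If you prefer to run the GGUF file outside Python, the same settings map directly onto llama.cpp's command-line flags. A minimal sketch, assuming a local llama.cpp build (the binary name and available flags can vary between llama.cpp versions):

llama-cli -m llama3.1-Q4_K_M.gguf -c 16000 -ngl 32 -p "Hello, how are you?"

Here -m points at the downloaded GGUF file, -c sets the context length, -ngl sets the number of layers offloaded to the GPU, and -p is the prompt.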

Requirements

  • Python 3.7+
  • llama-cpp-python (imported as llama_cpp) for running the model
  • huggingface_hub for downloading the model
  • A machine with a capable GPU is recommended for best performance.

Installation

Install the necessary dependencies with:

pip install llama-cpp-python huggingface_hub
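
By default, llama-cpp-python is built for CPU only. To actually offload layers to the GPU (as the example below does with n_gpu_layers), the package has to be compiled with a GPU backend. A typical CUDA install looks like the following sketch; the exact CMake flag depends on your llama-cpp-python version and backend (CUDA, Metal, ROCm):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir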

Download the model and start a simple chat session with:

import logging
import os
import time  # Make sure to import time for measuring durations
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Set up logging
logging.basicConfig(level=logging.INFO)  # Set to INFO to reduce overhead
logger = logging.getLogger()

# Download the GGUF model
model_name = "Gargaz/GPT-2-gguf"
model_file = "llama3.1-Q4_K_M.gguf"
model_path = hf_hub_download(model_name, filename=model_file)

# Instantiate the model from the downloaded file
llm = Llama(
    model_path=model_path,
    n_ctx=16000,      # Context window length
    n_threads=64,     # Number of CPU threads (tune to your machine)
    n_gpu_layers=32   # Number of layers to offload to the GPU (requires a GPU-enabled build)
)

# System instructions for the AI
system_instructions = (
    "You are a friendly conversational AI designed to respond clearly and concisely to user inquiries. "
    "Stay on topic by answering questions directly, use a warm tone and acknowledge gratitude, ask for "
    "clarification on vague questions, provide brief and helpful recommendations, and encourage users "
    "to ask more questions to keep the conversation flowing. "
    "Do not speak on your own; always respond only to the user's input."
)

def chat():
    """Start a chat session with the model."""
    print("Introduceti 'exit' pentru a iesi din chat.")
    while True:
        user_input = input("Tu: ")
        if user_input.lower() == 'exit':
            print("Iesire din chat.")
            break
        
        # Prepare the prompt
        full_prompt = f"{system_instructions}\nUser: {user_input}\nAI:"

        # Keep responses short so inference stays fast
        generation_kwargs = {
            "max_tokens": 40,  # Maximum number of tokens to generate per reply
            "stop": ["User:", "AI:"],  # Stop as soon as the model starts a new turn
            "echo": False,  # Do not repeat the prompt in the output
        }

        try:
            # Measure how long response generation takes
            gen_start_time = time.time()
            res = llm(full_prompt, **generation_kwargs)  # Returns a completion dict
            gen_time = (time.time() - gen_start_time) * 1000  # Convert to ms

            # Log generation time
            logger.info(f"response generation time = {gen_time:.2f} ms")

            generated_text = res["choices"][0]["text"].strip()
            print(f"AI: {generated_text}")

            # Log an approximate prompt length (whitespace-split words, not model tokens)
            num_words = len(full_prompt.split())
            logger.info(f"approximate prompt length = {num_words} words")

        except Exception as e:
            logger.error(f"Error generating response: {e}")
            print("Eroare la generarea răspunsului.")

if __name__ == "__main__":
    chat()
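
For snappier chats you can stream tokens as they are generated instead of waiting for the whole completion. A minimal sketch, assuming the same llm instance and prompt format as above; with stream=True the call returns an iterator of partial completion chunks:

def stream_response(prompt):
    """Print tokens as they arrive and return the full response text."""
    pieces = []
    for chunk in llm(prompt, max_tokens=40, stop=["User:", "AI:"], stream=True):
        piece = chunk["choices"][0]["text"]
        print(piece, end="", flush=True)  # Show each token as soon as it is generated
        pieces.append(piece)
    print()
    return "".join(pieces)

This drops the per-response timing shown in the chat loop above, but the same time.time() bookkeeping can be added around the loop if you want to log latency.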
Model Details

  • Format: GGUF
  • Model size: 1.64B params
  • Architecture: gpt2