---

language:
- en
license:
- gpl-3.0
- other
tags:
- text-generation
- language-model
- gpt
- transformer
- open-source
- squad
- wikipedia
datasets:
- squad
metrics:
- perplexity
- text-generation-quality
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: OpenLLM Small Extended 6k
  results:
  - task:
      type: text-generation
    dataset:
      type: squad
      name: SQuAD Wikipedia Passages
    metrics:
      - type: perplexity
        value: 816.04
      - type: training_loss
        value: 5.4302
---


# OpenLLM Small Extended 6k

This is the OpenLLM Small Extended model, a GPT-style language model trained for 6,000 steps on Wikipedia passages from the SQuAD dataset.

## Model Details

- **Model Type:** GPT-style Transformer
- **Architecture:** Small (35.8M parameters)
- **Training Steps:** 6,000
- **Training Data:** ~41k Wikipedia passages from the SQuAD dataset
- **Tokenizer:** SentencePiece BPE (32k vocabulary)
- **License:** GPL-3.0 (Open Source) / Commercial License available

## Model Performance

- **Final Training Loss:** 5.4302
- **Perplexity:** 816.04
- **Model Parameters:** 35,823,616
- **Context Length:** 512 tokens
- **Training Hardware:** CPU/GPU compatible
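
To put the loss figure in context, the sketch below converts between cross-entropy loss and perplexity. It assumes the reported loss is a mean per-token cross-entropy in nats (natural log), which this card does not state explicitly. Under that assumption, a training loss of 5.4302 corresponds to a perplexity of roughly 228, while the perplexity of 816.04 in the metadata corresponds to a loss of about 6.70, so the two figures were presumably measured on different data. The function names are illustrative and not part of the OpenLLM codebase.

```python
import math


def perplexity_from_loss(loss_nats: float) -> float:
    """Perplexity is exp(mean per-token cross-entropy), with the loss in nats."""
    return math.exp(loss_nats)


def loss_from_perplexity(perplexity: float) -> float:
    """Inverse mapping: the mean per-token loss implied by a perplexity."""
    return math.log(perplexity)


# Figures taken from this model card.
train_ppl = perplexity_from_loss(5.4302)
implied_loss = loss_from_perplexity(816.04)

print(f"Perplexity implied by training loss: {train_ppl:.1f}")
print(f"Loss implied by reported perplexity: {implied_loss:.2f}")
```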

## Usage

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "lemms/openllm-small-extended-6k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "The history of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=50,
        temperature=0.7,
        top_k=40,
        do_sample=True
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
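
For intuition about the `temperature` and `top_k` arguments used above, here is a minimal, dependency-free sketch of one common way to combine them when picking a single next token. It illustrates the general technique and is not the Transformers implementation; `sample_top_k` is a hypothetical helper name.

```python
import math
import random


def sample_top_k(logits, temperature=0.7, top_k=40, rng=random):
    """Sample one token id from raw logits using top-k filtering and temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the top_k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Divide by temperature: values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scaled = [logits[i] / temperature for i in top_ids]
    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Toy vocabulary of four tokens; id 1 has the highest logit.
toy_logits = [0.1, 5.0, -2.0, 3.0]
print(sample_top_k(toy_logits, temperature=0.7, top_k=2))
```

Lower `temperature` and smaller `top_k` make output more repetitive but more conservative; raising either increases variety at the cost of coherence.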

### Using the Custom Loader

```python
# Use the provided load_hf_model.py script
from load_hf_model import load_model_and_tokenizer

model, tokenizer = load_model_and_tokenizer()
# ... rest of usage
```

## Training Details

This model was trained using the OpenLLM training pipeline:

1. **Data Preparation:** SQuAD dataset processing (~41k passages)
2. **Tokenizer Training:** SentencePiece BPE with 32k vocabulary
3. **Model Training:** GPT-style transformer for 6,000 steps
4. **Evaluation:** Perplexity and text generation quality assessment
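
As an illustration of the data preparation step, the sketch below collects unique passages from a file in the standard SQuAD JSON layout (`data` → `paragraphs` → `context`). It is an assumption about what such processing might look like, not the actual OpenLLM pipeline code; `extract_passages` and `load_passages` are hypothetical names.

```python
import json


def extract_passages(squad: dict) -> list[str]:
    """Collect unique paragraph contexts from a SQuAD-format dict, in order."""
    seen = set()
    passages = []
    for article in squad.get("data", []):
        for paragraph in article.get("paragraphs", []):
            text = paragraph.get("context", "").strip()
            # Skip empty contexts and exact duplicates.
            if text and text not in seen:
                seen.add(text)
                passages.append(text)
    return passages


def load_passages(path: str) -> list[str]:
    """Read a SQuAD-format JSON file and return its unique passages."""
    with open(path, encoding="utf-8") as f:
        return extract_passages(json.load(f))


# Tiny in-memory example in the SQuAD layout (questions omitted).
example = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {"context": "First passage.", "qas": []},
                {"context": "Second passage.", "qas": []},
                {"context": "First passage.", "qas": []},
            ],
        }
    ]
}
passages = extract_passages(example)
print(len(passages))
```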

## Model Architecture

- **Layers:** 12 transformer layers
- **Attention Heads:** 12
- **Hidden Size:** 768
- **Intermediate Size:** 3072
- **Activation:** GELU
- **Layer Norm:** Pre-norm
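
As a sanity check on these sizes, the sketch below estimates the parameter count of a GPT-2 style decoder. It rests on assumptions this card does not state: bias terms on all linear layers, learned positional embeddings, an output head tied to the token embedding, and a vocabulary of exactly 32,000. Under those assumptions the dimensions listed above come to about 110M parameters, whereas a hypothetical configuration of 6 layers, hidden size 512, intermediate size 2048, and 1024 positions reproduces the reported 35,823,616 exactly. Treat the checkpoint's own configuration as authoritative.

```python
def gpt_param_count(n_layer, d_model, d_ff, vocab_size, n_positions):
    """Parameter count for a GPT-2 style decoder with tied input/output embeddings."""
    # Token embeddings plus learned positional embeddings.
    embeddings = vocab_size * d_model + n_positions * d_model
    # Fused Q/K/V projection plus attention output projection (weights + biases).
    attention = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Two-layer feed-forward network (weights + biases).
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    final_norm = 2 * d_model
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm


# Dimensions as listed in this section, with the 512-token context length.
listed = gpt_param_count(12, 768, 3072, 32_000, 512)
# Hypothetical smaller configuration (an assumption, not taken from this card).
alternative = gpt_param_count(6, 512, 2048, 32_000, 1024)

print(f"{listed:,}")
print(f"{alternative:,}")
```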

## Limitations

- **Training Data:** Limited to ~41k Wikipedia passages from SQuAD, so coverage of other domains and writing styles is narrow
- **Context Length:** 512 tokens maximum
- **Model Size:** Small model with 35.8M parameters
- **Performance:** Basic text generation only; the reported perplexity of 816.04 indicates limited fluency and coherence
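
Because of the context limit, a long prompt plus the requested number of new tokens can exceed 512 tokens. One simple way to stay within the window is to keep only the most recent prompt tokens, as sketched below. `fit_to_context` is a hypothetical helper that operates on a plain list of token ids; it is not part of this repository or of Transformers.

```python
def fit_to_context(token_ids, max_new_tokens, context_length=512):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    if max_new_tokens >= context_length:
        raise ValueError("max_new_tokens must be smaller than context_length")
    # Number of prompt tokens that still leaves room for generation.
    budget = context_length - max_new_tokens
    # Slicing from the end keeps the latest tokens; shorter prompts pass through unchanged.
    return token_ids[-budget:]


# A 600-token prompt with room reserved for 50 generated tokens.
long_prompt = list(range(600))
trimmed = fit_to_context(long_prompt, max_new_tokens=50)
print(len(trimmed))
```

Dropping the oldest tokens is the simplest policy; for prompts where the beginning matters more than the end, a different truncation strategy would be more appropriate.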

## License

This model is dual-licensed:
- **Open Source:** GPL-3.0 for research and community use
- **Commercial:** Commercial license available for enterprise use

For commercial licensing, contact: [email protected]

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{openllm2024,
  title={OpenLLM: Open Source Large Language Model},
  author={Louis Chua Bean Chong},
  year={2024},
  url={https://github.com/louischua/openllm}
}
```

## Links

- **Repository:** https://github.com/louischua/openllm
- **Documentation:** https://github.com/louischua/openllm/docs
- **Training Pipeline:** https://github.com/louischua/openllm/docs/training_pipeline.md