---
language:
- en
license:
- gpl-3.0
- other
tags:
- text-generation
- language-model
- gpt
- transformer
- open-source
- squad
- wikipedia
datasets:
- squad
metrics:
- perplexity
- text-generation-quality
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: OpenLLM Small Extended 6k
  results:
  - task:
      type: text-generation
    dataset:
      type: squad
      name: SQUAD Wikipedia Passages
    metrics:
    - type: perplexity
      value: 816.04
    - type: training_loss
      value: 5.4302
---

# OpenLLM Small Extended 6k

This is the OpenLLM Small Extended model, trained for 6,000 steps on Wikipedia passages from the SQuAD dataset.

## Model Details

- **Model Type:** GPT-style Transformer
- **Architecture:** Small (35.8M parameters)
- **Training Steps:** 6,000
- **Training Data:** ~41k Wikipedia passages from the SQuAD dataset
- **Tokenizer:** SentencePiece BPE (32k vocabulary)
- **License:** GPL-3.0 (open source) / commercial license available

## Model Performance

- **Final Training Loss:** 5.4302
- **Model Parameters:** 35,823,616
- **Context Length:** 512 tokens
- **Training Hardware:** CPU/GPU compatible
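
The parameter count and vocabulary size reported above can be verified directly after loading the checkpoint. The snippet below is a minimal sanity-check sketch; it assumes the checkpoint loads through the standard Transformers auto classes, as shown in the Usage section below.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "lemms/openllm-small-extended-6k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Expected to print roughly 35,823,616 parameters and a ~32k-entry vocabulary,
# matching the figures reported in this card.
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Vocabulary size: {len(tokenizer)}")
```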

## Usage

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "lemms/openllm-small-extended-6k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "The history of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=50,
        temperature=0.7,
        top_k=40,
        do_sample=True,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
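
The same checkpoint can also be used through the Transformers `text-generation` pipeline. This is a minimal sketch under the same assumption that the model loads as a causal LM via the auto classes:

```python
from transformers import pipeline

# Wrap the model in a text-generation pipeline (handles tokenization and decoding)
generator = pipeline("text-generation", model="lemms/openllm-small-extended-6k")

result = generator(
    "The history of artificial intelligence",
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_k=40,
)
print(result[0]["generated_text"])
```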

### Using the Custom Loader

```python
# Use the provided load_hf_model.py script
from load_hf_model import load_model_and_tokenizer

model, tokenizer = load_model_and_tokenizer()
# ... rest of usage
```

## Training Details

This model was trained using the OpenLLM training pipeline:

1. **Data Preparation:** SQuAD dataset processing (~41k passages)
2. **Tokenizer Training:** SentencePiece BPE with a 32k vocabulary
3. **Model Training:** GPT-style transformer trained for 6,000 steps
4. **Evaluation:** Perplexity and text generation quality assessment
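
For the evaluation step, perplexity is the exponential of the mean token-level cross-entropy loss on held-out text. The sketch below illustrates that computation with the Transformers API; it is not the repository's evaluation script, and the example passages are placeholders for the actual SQuAD Wikipedia evaluation data.

```python
import math

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "lemms/openllm-small-extended-6k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Placeholder held-out passages; the reported 816.04 perplexity comes from
# the project's own evaluation set, not from these examples.
passages = [
    "Artificial intelligence is the simulation of human intelligence by machines.",
    "The Amazon rainforest is the largest tropical rainforest in the world.",
]

losses = []
with torch.no_grad():
    for text in passages:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        # Passing labels makes the model return the mean cross-entropy over tokens.
        out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())

perplexity = math.exp(sum(losses) / len(losses))
print(f"Perplexity: {perplexity:.2f}")
```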

## Model Architecture

- **Layers:** 12 transformer layers
- **Attention Heads:** 12
- **Hidden Size:** 768
- **Intermediate Size:** 3072
- **Activation:** GELU
- **Layer Norm:** Pre-norm
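
These hyperparameters map onto a standard GPT-2-style configuration. The sketch below is only an assumed mapping for illustration; the repository defines its own model configuration, which may differ in details such as weight tying and the exact activation variant.

```python
from transformers import GPT2Config

# Assumed GPT-2-style mapping of the hyperparameters listed above.
# Vocabulary and context sizes are taken from the tokenizer and
# context-length entries elsewhere in this card.
config = GPT2Config(
    vocab_size=32000,               # SentencePiece BPE, 32k vocabulary
    n_positions=512,                # maximum context length
    n_embd=768,                     # hidden size
    n_layer=12,                     # transformer layers
    n_head=12,                      # attention heads
    n_inner=3072,                   # intermediate (feed-forward) size
    activation_function="gelu_new", # GELU activation
)
print(config)
```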

## Limitations

- **Training Data:** Limited to Wikipedia passages
- **Context Length:** 512 tokens maximum
- **Model Size:** Small model with 35.8M parameters
- **Performance:** Basic text generation capabilities

## License

This model is dual-licensed:

- **Open Source:** GPL-3.0 for research and community use
- **Commercial:** Commercial license available for enterprise use

For commercial licensing, contact: [email protected]

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{openllm2024,
  title={OpenLLM: Open Source Large Language Model},
  author={Louis Chua Bean Chong},
  year={2024},
  url={https://github.com/louischua/openllm}
}
```

## Links

- **Repository:** https://github.com/louischua/openllm
- **Documentation:** https://github.com/louischua/openllm/docs
- **Training Pipeline:** https://github.com/louischua/openllm/docs/training_pipeline.md