Model Card for lyraXVERSE

We have colaborated with XVERSE and lauched lyraXVERSE, currently the fastest XVERSE-13b available. The inference speed of lyraXVERSE has achieved up to 3900+ tokens/s on A100, up to 2.7x acceleration upon the torch version.

Among its main features are:

  • device: Nvidia GPU with Amperer architecture or Volta architecture (A10, A100 or higher, V100).
  • batch_size: compiled with dynamic batch size, maximum depends on device. 
  • MEMOPT mode: significantly optimized VRAM usage and increased speed

We use the XVERSE-13B-Chat model for measurement, but this optimized inference is also applicable to XVERSE-13B model.

Speed

  • Evaluated at tokens/s
  • test on A100 40G
  • MEMOPT mode

XVERSE-13B-Chat

Version Batch Size 1 Batch Size 8 Batch Size 16 Batch Size 32 Batch Size 64
Torch 34.8 249.2 470.1 878.6 1478.9
lyraXVERSE 96.6 725.5 1359.3 2415.6 3923.2

Docker Environment Recommendation

  • For Cuda 11.X: we recommend nvcr.io/nvidia/pytorch:22.12-py3
  • For Cuda 12.0: we recommend nvcr.io/nvidia/pytorch:23.02-py3
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v ./:/lyraXVERSE nvcr.io/nvidia/pytorch:23.02-py3

pip install -r requirements.txt
python demo.py

Uses

from lyra_xverse import lyraXVERSE

model_path = "./models/"
tokenizer_path = "./models/"
inference_dtype = 'fp16'
prompt = "讲个故事:"
memopt_mode = 1
max_output_length = 512
arch = "Ampere" # Ampere or Volta
cuda_version = 12 # cuda version, we currently support 11 and 12

model = lyraXVERSE(model_path, 
                   tokenizer_path = tokenizer_path, 
                   dtype = inference_dtype,
                   memopt_mode = memopt_mode,
                   arch = arch,
                   cuda_version = cuda_version)

bs = 1
prompts = [prompt, ] * bs
output_texts = model.generate(
        prompts, output_length=max_output_length,
        top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0, do_sample=False)

print(output_texts)

Demo Outputs

XVERSE-13B-Chat

input

讲个故事:

output

有一天,一位年轻的画家来到了一个偏远的村庄。他以其超凡的绘画技巧,为村民画了一幅美丽的图画。图画里,村庄的周围是翠绿的森林,清澈的溪流在其中流淌,村民们正在劳作,孩子们在田野里嬉戏。村民们看着这幅画,都对这位画家赞不绝口。

村庄的领袖看到了这幅画,他想:“这幅画将会让我们的村庄更加美丽,我们应该让村民们知道这幅画。”于是,他带着画家去村庄的各个角落,让每一个村民都看到了这幅画。

画家看着村民们看画的眼神,他意识到了自己的价值。他意识到,他不仅仅是一个画家,他也是一个能让人们看见希望的人。他的画不仅仅是艺术品,它是连接人们与希望的一座桥梁。

这个故事告诉我们,画家的价值不只是他们的绘画技巧,而是他们的画作带给人们的感动和希望。画家的价值并不在于他们的画有多么昂贵,有多么独特,而在于他们能用画作打开人们的心扉,让人们看见希望,看见生活的美好。

Citation

@Misc{lyraXVERSE2023,
  author =       {Haoxiong Su, Kangjian Wu, Zhengtao Wang, Yibo Lu, Bin Wu},
  title =        {lyraXVERSE: Accelerating XVERSE-13B-Chat(fp16) to 3000+ tokens/s},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraXVERSE}},
  year =         {2023}
}

Report bugs

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.
The model cannot be deployed to the HF Inference API: The model has no library tag.