LLaMA-v2-chinese-alpaca-13B-GGML (ymcui)

Here are the GGML converted and/or quantized models for ymcui's Chinese LLaMA-v2 Alpaca 13B.

!NOTE! The GGML file format is outdated; prefer the GGUF format going forward.
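
If you want these weights in the newer format, llama.cpp ships a converter for old GGML files. The sketch below drives it from Python; the script name, flags, checkout path, and file names are assumptions to adapt to your own llama.cpp checkout.

```python
# Hypothetical sketch: convert one of the GGML files below to GGUF with the
# converter script bundled in llama.cpp. The script name/flags and all paths
# are assumptions -- check the scripts in your llama.cpp checkout.
import subprocess
from pathlib import Path

LLAMA_CPP = Path("~/llama.cpp").expanduser()               # assumed checkout location
GGML_IN = Path("llama-v2-chinese-alpaca-13B-Q4_K_M.ggml")  # file from this repo
GGUF_OUT = GGML_IN.with_suffix(".gguf")

subprocess.run(
    ["python", str(LLAMA_CPP / "convert-llama-ggml-to-gguf.py"),
     "--input", str(GGML_IN), "--output", str(GGUF_OUT)],
    check=True,
)
```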

Explanation of quantisation methods

Methods:

  • type-0 (Q4_0, Q5_0, Q8_0) - weights w are obtained from quants q using w = d * q, where d is the block scale.
  • type-1 (Q4_1, Q5_1) - weights are given by w = d * q + m, where m is the block minimum.
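
As a minimal illustration of the two formulas above (a toy sketch, not llama.cpp's actual packed layouts or kernels):

```python
import numpy as np

def dequantize_type0(q, d):
    """type-0 (Q4_0, Q5_0, Q8_0): w = d * q for a whole block."""
    return d * q.astype(np.float32)

def dequantize_type1(q, d, m):
    """type-1 (Q4_1, Q5_1): w = d * q + m, where m is the block minimum."""
    return d * q.astype(np.float32) + m

# Toy 4-element block; the real formats pack 4/5-bit quants into bytes and
# handle signedness, but the dequantization arithmetic is exactly the above.
q = np.array([0, 3, 7, 15], dtype=np.uint8)
print(dequantize_type0(q, d=0.1))           # [0.   0.3  0.7  1.5 ]
print(dequantize_type1(q, d=0.1, m=-0.75))  # [-0.75 -0.45 -0.05  0.75]
```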

The new methods available are:

  • GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
  • GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw (worked through in the sketch after this list).
  • GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
  • GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw.
  • GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
  • GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
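
As a sanity check on these figures, the bpw arithmetic for GGML_TYPE_Q3_K can be spelled out per 256-weight super-block (a back-of-the-envelope sketch; the on-disk byte layout packs these fields differently):

```python
# bpw arithmetic for GGML_TYPE_Q3_K, per 256-weight super-block:
# 16 blocks x 16 weights of 3-bit quants, 16 block scales at 6 bits,
# plus one fp16 super-block scale.
weights = 16 * 16              # 256 weights per super-block
quant_bits = weights * 3       # 768 bits of 3-bit quants
scale_bits = 16 * 6            # 96 bits of block scales
super_scale_bits = 16          # one fp16 super-block scale

bpw = (quant_bits + scale_bits + super_scale_bits) / weights
print(bpw)  # 3.4375, matching the figure quoted above
```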

This is exposed via llama.cpp quantization types that define various "quantization mixes" as follows:

  • LLAMA_FTYPE_MOSTLY_Q2_K - uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors.
  • LLAMA_FTYPE_MOSTLY_Q3_K_S - uses GGML_TYPE_Q3_K for all tensors
  • LLAMA_FTYPE_MOSTLY_Q3_K_M - uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
  • LLAMA_FTYPE_MOSTLY_Q3_K_L - uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
  • LLAMA_FTYPE_MOSTLY_Q4_K_S - uses GGML_TYPE_Q4_K for all tensors
  • LLAMA_FTYPE_MOSTLY_Q4_K_M - uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K
  • LLAMA_FTYPE_MOSTLY_Q5_K_S - uses GGML_TYPE_Q5_K for all tensors
  • LLAMA_FTYPE_MOSTLY_Q5_K_M - uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K
  • LLAMA_FTYPE_MOSTLY_Q6_K - uses 6-bit quantization (GGML_TYPE_Q6_K) for all tensors
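
These mix names are what gets passed to llama.cpp's quantize tool to produce files like the ones in the table below. A hedged sketch follows; the binary location, file names, and exact argument form are assumptions for your own build.

```python
# Hypothetical sketch: produce the Q4_K_M mix with llama.cpp's quantize tool.
# Binary path and file names are assumptions -- adjust for your build and files.
import subprocess
from pathlib import Path

LLAMA_CPP = Path("~/llama.cpp").expanduser()                 # assumed build location
F16_IN = Path("llama-v2-chinese-alpaca-13B-f16.ggml")        # unquantized source file
Q4KM_OUT = Path("llama-v2-chinese-alpaca-13B-Q4_K_M.ggml")   # quantized output

subprocess.run(
    [str(LLAMA_CPP / "quantize"), str(F16_IN), str(Q4KM_OUT), "Q4_K_M"],
    check=True,
)
```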

Provided files

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ------------ | ---- | ---- | ---------------- | -------- |
| llama-v2-chinese-alpaca-13B-Q2_K.ggml | Q2_K | 2 | 5.65 GB | 8.15 GB | smallest, significant quality loss - not recommended for most purposes |
| llama-v2-chinese-alpaca-13B-Q3_K_S.ggml | Q3_K_S | 3 | 5.81 GB | 8.31 GB | very small, high quality loss |
| llama-v2-chinese-alpaca-13B-Q3_K_M.ggml | Q3_K_M | 3 | 6.46 GB | 8.96 GB | very small, high quality loss |
| llama-v2-chinese-alpaca-13B-Q3_K_L.ggml | Q3_K_L | 3 | 7.08 GB | 9.58 GB | small, substantial quality loss |
| llama-v2-chinese-alpaca-13B-Q4_0.ggml | Q4_0 | 4 | 7.53 GB | 10.03 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| llama-v2-chinese-alpaca-13B-Q4_1.ggml | Q4_1 | 4 | 8.34 GB | 10.84 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| llama-v2-chinese-alpaca-13B-Q4_K_S.ggml | Q4_K_S | 4 | 7.53 GB | 10.03 GB | small, greater quality loss |
| llama-v2-chinese-alpaca-13B-Q4_K_M.ggml | Q4_K_M | 4 | 8.03 GB | 10.53 GB | medium, balanced quality - recommended |
| llama-v2-chinese-alpaca-13B-Q5_0.ggml | Q5_0 | 5 | 9.15 GB | 11.65 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| llama-v2-chinese-alpaca-13B-Q5_1.ggml | Q5_1 | 5 | 9.96 GB | 12.46 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| llama-v2-chinese-alpaca-13B-Q5_K_S.ggml | Q5_K_S | 5 | 9.15 GB | 11.65 GB | large, low quality loss - recommended |
| llama-v2-chinese-alpaca-13B-Q5_K_M.ggml | Q5_K_M | 5 | 9.41 GB | 11.91 GB | large, very low quality loss - recommended |
| llama-v2-chinese-alpaca-13B-Q6_K.ggml | Q6_K | 6 | 10.9 GB | 13.4 GB | very large, extremely low quality loss |
| llama-v2-chinese-alpaca-13B-Q8_0.ggml | Q8_0 | 8 | 14 GB | 16.5 GB | very large, extremely low quality loss - not recommended |
| llama-v2-chinese-alpaca-13B-f16.ggml | f16 | 16 | 26.5 GB | 29 GB | very large, almost no quality loss - not recommended |
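
To run one of these files, use a GGML-era llama.cpp build or an older llama-cpp-python release that still reads GGML (newer releases expect GGUF). A minimal sketch under those assumptions, using the Q4_K_M file from the table:

```python
# Minimal sketch: run the Q4_K_M file with llama-cpp-python.
# Assumes an older llama-cpp-python release that still loads GGML files
# (roughly pre-0.1.79, before the GGUF switch) -- an assumption to verify.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-v2-chinese-alpaca-13B-Q4_K_M.ggml",
    n_ctx=2048,       # context window
    n_gpu_layers=0,   # raise to offload layers to the GPU and lower the RAM figure above
)

# Chinese prompt, since this is a Chinese-tuned Alpaca model:
# "Please briefly introduce quantization of large language models."
out = llm("请简要介绍大语言模型的量化。", max_tokens=128)
print(out["choices"][0]["text"])
```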
