Upload Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf
Dharma-DeepScaleR-1.5B-Preview-Q4_K_M is a quantized version of the original agentica-org/DeepScaleR-1.5B-Preview model. By applying the Q4_K_M quantization scheme, this variant has been optimized to significantly reduce memory usage and computational overhead. This makes it well suited for environments where latency and resource efficiency are critical, while still delivering robust performance.
Model Details
Base Model: The foundation of this model is the DeepScaleR-1.5B-Preview architecture, known for balancing scale with efficiency.
Quantization: Q4_K_M quantization reduces the model's weight precision in a controlled way, trading minimal accuracy loss for substantial gains in inference speed and memory consumption. This is particularly beneficial when deploying to production systems with limited resources; a short header-inspection sketch follows this section.
Architecture: The quantized version retains the original model's deep architecture while demanding less compute during inference, making it well suited to real-time applications.
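If you want to verify what a GGUF file actually contains, its header can be read with the standalone gguf Python package. This is a minimal sketch, assuming the file has been downloaded locally and that the package is installed (pip install gguf); for a Q4_K_M file, most weight tensors should report a Q4_K-family quantization type.
```python
from gguf import GGUFReader

# Read the GGUF header without loading the full tensor data
reader = GGUFReader("Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf")

# List the metadata keys stored in the header (architecture, context length, ...)
for name in reader.fields:
    print(name)

# Spot-check the quantization type of the first few tensors
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.tensor_type.name, tensor.shape)
```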
Usage
This model is designed with versatility in mind. It can serve as a drop-in replacement for tasks originally supported by the base model, such as:
Text generation
Summarization
Question answering
Developers are encouraged to experiment with the model under different settings. Below is an example snippet using the Hugging Face Transformers library (version 4.41 or later, which can load GGUF checkpoints when the file name is passed explicitly):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "your-username/Dharma-DeepScaleR-1.5B-Preview-Q4_K_M"
gguf_file = "Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf"

# Load the tokenizer and quantized model; for a GGUF-only repository
# the file name must be passed explicitly via gguf_file
tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)

# Encode input text and generate output
inputs = tokenizer("Generate some text based on this input.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
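Note that Transformers dequantizes GGUF weights to full precision when loading them, so the snippet above is convenient for experimentation but does not by itself realize the memory savings of 4-bit inference. To run the file natively quantized, a llama.cpp-based runtime is the usual route. Below is a minimal sketch using llama-cpp-python; the repository and file names follow the placeholders above, and n_ctx is an illustrative setting rather than a recommendation.
```python
# pip install llama-cpp-python huggingface_hub
from llama_cpp import Llama

# Download the GGUF from the Hub and run it natively quantized
llm = Llama.from_pretrained(
    repo_id="your-username/Dharma-DeepScaleR-1.5B-Preview-Q4_K_M",
    filename="Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf",
    n_ctx=2048,  # context window for this session
)

output = llm("Generate some text based on this input.", max_tokens=50)
print(output["choices"][0]["text"])
```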
Users should note that while Q4_K_M quantization offers notable efficiency gains, there may be edge cases where full-precision performance is preferable.
Intended Use and Limitations
This model is primarily intended for researchers and developers who need to balance compute efficiency with high-quality output. Potential limitations include:
Minor degradation in output quality compared to the full-precision model, depending on the task and inputs
A need for task-specific tuning in some deployments, particularly where high numerical precision is required; the spot-check sketch below can help decide whether the tradeoff is acceptable
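One pragmatic way to judge whether the quantization loss matters for a given task is to compare generations from the full-precision base model and the quantized file on the same prompt. A hedged sketch, reusing the placeholder repository names from this card:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

prompt = "Summarize why quantization reduces memory usage."

candidates = [
    ("agentica-org/DeepScaleR-1.5B-Preview", {}),
    ("your-username/Dharma-DeepScaleR-1.5B-Preview-Q4_K_M",
     {"gguf_file": "Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf"}),
]

for repo_id, extra in candidates:
    tokenizer = AutoTokenizer.from_pretrained(repo_id, **extra)
    model = AutoModelForCausalLM.from_pretrained(repo_id, **extra)
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding keeps the comparison deterministic
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(repo_id, "->", tokenizer.decode(outputs[0], skip_special_tokens=True))
```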
Licensing
The model is provided under the Apache 2.0 License. This permissive license allows for broad usage, distribution, and modification. Users are encouraged to review the full license text to understand their rights and responsibilities when integrating or adapting this model.
.gitattributes

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf (new file)

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9ed31fa867cd2aadf61a2632ac49e439fe0c3526d218d9a2b0c9481da96b51a3
+size 1117321888
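The three added lines above are the Git LFS pointer that stands in for the binary in the repository; the actual ~1.1 GB GGUF is fetched from LFS storage on download. A downloaded copy can be verified against the pointer's oid and size, as in this small sketch (the local path is assumed):
```python
import hashlib
import os

path = "Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf"
expected_oid = "9ed31fa867cd2aadf61a2632ac49e439fe0c3526d218d9a2b0c9481da96b51a3"
expected_size = 1117321888

# Hash the file in 1 MiB chunks to avoid loading it into memory at once
digest = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        digest.update(chunk)

assert os.path.getsize(path) == expected_size, "size mismatch"
assert digest.hexdigest() == expected_oid, "sha256 mismatch"
print("GGUF matches the Git LFS pointer")
```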