Upload Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf
Dharma-DeepScaleR-1.5B-Preview-Q4_K_M is a quantized version of the original agentica-org/DeepScaleR-1.5B-Preview model. By applying the Q4_K_M quantization scheme, this variant has been optimized to significantly reduce memory usage and computational overhead. This makes it well suited for environments where latency and resource efficiency are critical, while still delivering robust performance.
Model Details
Base Model: The foundation of this model is the DeepScaleR-1.5B-Preview architecture, known for balancing scale with efficiency.
Quantization: Q4_K_M quantization reduces the model's weight precision in a controlled way, trading minimal accuracy loss for substantial gains in inference speed and memory consumption. This is particularly beneficial when deploying to production systems with limited resources; a short header-inspection sketch follows this section.
Architecture: The quantized version retains the original model's deep architecture while demanding less compute during inference, making it well suited to real-time applications.
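If you want to verify what a GGUF file actually contains, its header can be read with the standalone gguf Python package. This is a minimal sketch, assuming the file has been downloaded locally and that the package is installed (pip install gguf); for a Q4_K_M file, most weight tensors should report a Q4_K-family quantization type.
```python
from gguf import GGUFReader

# Read the GGUF header without loading the full tensor data
reader = GGUFReader("Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf")

# List the metadata keys stored in the header (architecture, context length, ...)
for name in reader.fields:
    print(name)

# Spot-check the quantization type of the first few tensors
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.tensor_type.name, tensor.shape)
```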
Usage
This model is designed with versatility in mind. It can serve as a drop-in replacement for tasks originally supported by the base model, such as:
Text generation
Summarization
Question answering
Developers are encouraged to experiment with the model under different settings. Below is an example snippet using the Hugging Face Transformers library (version 4.41 or later, which can load GGUF checkpoints when the file name is passed explicitly):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "your-username/Dharma-DeepScaleR-1.5B-Preview-Q4_K_M"
gguf_file = "Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf"

# Load the tokenizer and quantized model; for a GGUF-only repository
# the file name must be passed explicitly via gguf_file
tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)

# Encode input text and generate output
inputs = tokenizer("Generate some text based on this input.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
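Note that Transformers dequantizes GGUF weights to full precision when loading them, so the snippet above is convenient for experimentation but does not by itself realize the memory savings of 4-bit inference. To run the file natively quantized, a llama.cpp-based runtime is the usual route. Below is a minimal sketch using llama-cpp-python; the repository and file names follow the placeholders above, and n_ctx is an illustrative setting rather than a recommendation.
```python
# pip install llama-cpp-python huggingface_hub
from llama_cpp import Llama

# Download the GGUF from the Hub and run it natively quantized
llm = Llama.from_pretrained(
    repo_id="your-username/Dharma-DeepScaleR-1.5B-Preview-Q4_K_M",
    filename="Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf",
    n_ctx=2048,  # context window for this session
)

output = llm("Generate some text based on this input.", max_tokens=50)
print(output["choices"][0]["text"])
```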
Users should note that while Q4_K_M quantization offers notable efficiency gains, there may be edge cases where full-precision performance is preferable.
Intended Use and Limitations
This model is primarily intended for researchers and developers who need to balance compute efficiency with high-quality output. Potential limitations include:
Minor degradation in output quality compared to the full-precision model, depending on the task and inputs
A need for task-specific tuning in some deployments, particularly where high numerical precision is required; the spot-check sketch below can help decide whether the tradeoff is acceptable
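One pragmatic way to judge whether the quantization loss matters for a given task is to compare generations from the full-precision base model and the quantized file on the same prompt. A hedged sketch, reusing the placeholder repository names from this card:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

prompt = "Summarize why quantization reduces memory usage."

candidates = [
    ("agentica-org/DeepScaleR-1.5B-Preview", {}),
    ("your-username/Dharma-DeepScaleR-1.5B-Preview-Q4_K_M",
     {"gguf_file": "Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf"}),
]

for repo_id, extra in candidates:
    tokenizer = AutoTokenizer.from_pretrained(repo_id, **extra)
    model = AutoModelForCausalLM.from_pretrained(repo_id, **extra)
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding keeps the comparison deterministic
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(repo_id, "->", tokenizer.decode(outputs[0], skip_special_tokens=True))
```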
Licensing
The model is provided under the Apache 2.0 License. This permissive license allows for broad usage, distribution, and modification. Users are encouraged to review the full license text to understand their rights and responsibilities when integrating or adapting this model.
.gitattributes

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf (new file)

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9ed31fa867cd2aadf61a2632ac49e439fe0c3526d218d9a2b0c9481da96b51a3
+size 1117321888
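The three added lines above are the Git LFS pointer that stands in for the binary in the repository; the actual ~1.1 GB GGUF is fetched from LFS storage on download. A downloaded copy can be verified against the pointer's oid and size, as in this small sketch (the local path is assumed):
```python
import hashlib
import os

path = "Dharma-DeepScaleR-1.5B-Preview-Q4_K_M.gguf"
expected_oid = "9ed31fa867cd2aadf61a2632ac49e439fe0c3526d218d9a2b0c9481da96b51a3"
expected_size = 1117321888

# Hash the file in 1 MiB chunks to avoid loading it into memory at once
digest = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        digest.update(chunk)

assert os.path.getsize(path) == expected_size, "size mismatch"
assert digest.hexdigest() == expected_oid, "sha256 mismatch"
print("GGUF matches the Git LFS pointer")
```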