Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill

Model Description

THIS IS THE FP32 UNQUANTIZED VERSION. This model is a distilled version of Qwen/Qwen3-30B-A3B-Thinking-2507, designed to inherit the reasoning and behavioral characteristics of its much larger teacher model, deepseek-ai/DeepSeek-V3.1.

It is the result of applying a LoRA created via an SVD-based distillation pipeline, and then merging those weights into the base model. The core of this process was to transfer the nuanced knowledge from a 62-layer, 256-expert teacher model into the more efficient 48-layer, 128-expert architecture of the student model.
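
As a rough illustration (not the author's exact merge script), folding such a LoRA into the base model with the peft library might look like the sketch below; the adapter path is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the student base model and the SVD-distilled LoRA, then fold the
# adapter weights into the base weights. The adapter path is a placeholder.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B-Thinking-2507", torch_dtype=torch.float32
)
merged = PeftModel.from_pretrained(base, "path/to/distillation-lora").merge_and_unload()
merged.save_pretrained("Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill")
```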

The primary goal was to explore the high-fidelity transfer of complex reasoning patterns, particularly those encoded within the Mixture-of-Experts (MoE) layers, from a frontier-class model to a consumer-accessible one.

The Distillation Methodology

This model was not trained in the conventional sense. Instead, it was created using a layer-by-layer, SVD-based distillation process.

Core Components

  • Teacher Model: deepseek-ai/DeepSeek-V3.1.
  • Student Model: Qwen/Qwen3-30B-A3B-Thinking-2507.
  • LoRA Rank: A high rank of r=2048 was used for all modules to ensure a comprehensive capture of information from the teacher model.
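
For reference, the shape of such an adapter expressed as a peft configuration might look like the sketch below; the target module names and the alpha value are assumptions, since the card does not list them:

```python
from peft import LoraConfig

# Illustrative only: the distillation pipeline builds the LoRA weights directly
# via SVD rather than training them, but the resulting adapter is shaped like a
# standard rank-2048 LoRA over the listed modules (assumed names).
lora_config = LoraConfig(
    r=2048,
    lora_alpha=2048,            # assumption: alpha is not stated on the card
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # expert MLP projections
    ],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
```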

The Distillation Pipeline

For each corresponding layer in the student and teacher, the following pipeline was executed (a condensed code sketch follows the list):

  1. Teacher Layer Interpolation (SLERP): For student layers that fall between two teacher layers (based on a sigmoid mapping), Spherical Linear Interpolation (SLERP) was used to create a geometrically sound blend of the teacher's weights. This preserves the integrity of the high-dimensional representations.

  2. SVD Projection: The core of the distillation. The (potentially blended) teacher layer's weight matrix was decomposed using a randomized SVD algorithm. The top 2048 most significant components were selected and reconstructed to fit the student layer's smaller dimensions. This high-rank projection is designed for maximum fidelity.

  3. Generalized Procrustes Analysis: After projection, the newly created "synthetic" tensor was optimally aligned with the student's original pre-trained tensor using a hardened least-squares solver. This alignment minimizes representational distance before calculating the final difference, with added checks to prevent numerical instability.

  4. DARE-TIES Purification: The difference tensor (Distilled - Aligned Student) was then purified using the DARE-TIES methodology. This process drops a significant percentage (80%) of the lowest-magnitude values, treating them as noise, and then rescales the remaining important differences. This creates a clean, high-signal delta for the final LoRA.
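
Putting the four steps together, here is a condensed sketch assuming plain torch tensors. Function names, the dimension handling, and the alignment method are illustrative assumptions, not the author's actual code; in particular, ordinary orthogonal Procrustes stands in for the hardened least-squares solver described above.

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """1. Spherical linear interpolation between two teacher weight tensors."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    omega = torch.arccos(torch.clamp(
        torch.dot(a / (a.norm() + eps), b / (b.norm() + eps)), -1.0, 1.0))
    if omega.abs() < eps:                                   # nearly parallel: plain lerp
        blended = (1 - t) * a + t * b
    else:
        blended = (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)
    return blended.view_as(w_a)

def svd_project(teacher_w: torch.Tensor, out_dim: int, in_dim: int, rank: int = 2048) -> torch.Tensor:
    """2. Truncated randomized SVD of the (blended) teacher matrix, cropped to the
    student's shape (a simplification of the actual projection)."""
    q = min(rank, *teacher_w.shape)
    U, S, V = torch.svd_lowrank(teacher_w.float(), q=q)
    recon = U @ torch.diag(S) @ V.T
    return recon[:out_dim, :in_dim]

def procrustes_align(synthetic: torch.Tensor, student: torch.Tensor) -> torch.Tensor:
    """3. Orthogonal Procrustes alignment (a plain stand-in for the hardened
    least-squares solver described above)."""
    U, _, Vh = torch.linalg.svd(synthetic.float().T @ student.float(), full_matrices=False)
    return synthetic.float() @ (U @ Vh)

def dare_ties(delta: torch.Tensor, drop_rate: float = 0.8) -> torch.Tensor:
    """4. Drop the lowest-magnitude 80% of the difference tensor and rescale the
    survivors, following the card's description of the purification step."""
    k_drop = max(int(delta.numel() * drop_rate), 1)
    threshold = delta.abs().flatten().kthvalue(k_drop).values
    return torch.where(delta.abs() > threshold, delta / (1.0 - drop_rate),
                       torch.zeros_like(delta))

def distill_layer(teacher_lo: torch.Tensor, teacher_hi: torch.Tensor,
                  t: float, student_w: torch.Tensor) -> torch.Tensor:
    blended   = slerp(teacher_lo, teacher_hi, t)            # step 1
    projected = svd_project(blended, *student_w.shape)      # step 2
    aligned   = procrustes_align(projected, student_w)      # step 3
    return dare_ties(aligned - student_w.float())           # step 4: LoRA delta
```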

Mixture-of-Experts (MoE) Distillation

The standout feature of this process is the full distillation of the MoE layers, which are critical for nuanced, context-dependent reasoning.

  • Expert Fingerprinting & Clustering: To map the 256 teacher experts to the 128 student experts, each teacher expert was "fingerprinted" by concatenating its constituent weight matrices. FAISS-GPU K-Means clustering was then used to efficiently group these 256 fingerprints into 128 distinct clusters based on their geometric similarity.

  • Advanced Expert Synthesis: Each of the student's 128 experts was synthesized from a weighted blend of the teacher experts assigned to its cluster. This blend is not a simple average; instead, it uses an SVD-based reconstruction from the top teacher experts (ranked by similarity to the cluster centroid) to create a new, synthetic expert that represents the core "concept" of that cluster. This more advanced synthesis aims to create novel, yet faithful, expert representations.
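
A hedged sketch of the expert mapping is shown below, assuming flattened expert weights serve as fingerprints and using the faiss K-Means API; the similarity-weighted blend at the end is a simplified placeholder for the SVD-based reconstruction described above:

```python
import faiss
import numpy as np

def cluster_teacher_experts(fingerprints: np.ndarray, n_student_experts: int = 128):
    """Cluster teacher-expert fingerprints (one float32 row per expert, built by
    concatenating and flattening that expert's weight matrices) into one cluster
    per student expert."""
    kmeans = faiss.Kmeans(fingerprints.shape[1], n_student_experts,
                          niter=25, gpu=True, seed=0)
    kmeans.train(fingerprints)
    _, assignments = kmeans.index.search(fingerprints, 1)   # nearest centroid per expert
    return kmeans.centroids, assignments.ravel()

def synthesize_student_expert(cluster_members: np.ndarray, centroid: np.ndarray,
                              top_k: int = 4) -> np.ndarray:
    """Blend the experts assigned to one cluster, weighted by cosine similarity to
    the centroid. (The card describes an SVD-based reconstruction from the
    top-ranked experts; this weighted average is a simplified stand-in.)"""
    sims = cluster_members @ centroid / (
        np.linalg.norm(cluster_members, axis=1) * np.linalg.norm(centroid) + 1e-8)
    top = np.argsort(-sims)[:top_k]
    weights = np.clip(sims[top], 0.0, None)
    weights /= weights.sum() + 1e-8
    return (weights[:, None] * cluster_members[top]).sum(axis=0)
```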

Intended Use

This model is intended for use as a general-purpose model for tasks such as coding, problem solving, and general question answering. It is designed to be a more capable and nuanced reasoner than its base model.

  • Primary Use: Complex instruction-following, reasoning tasks, and creative generation.
  • Out of Scope: Its knowledge cutoff is from its original training (2024), and it has not been aligned for specific safety or conversational chatbot roles beyond its base tuning.

Critical Usage Note

Testing indicates that this model performs best with the recommended inference settings of its 685B teacher model, not those of the 30B base model. The hypothesis is that the distillation shifted the model's output characteristics close enough to the teacher's that the teacher's sampling parameters are the better fit. Refer to the teacher model's documentation for its recommended inference parameters. In some cases, however, the 30B base model's settings work better; the best choice appears to be context dependent and may require some tuning for your use case.
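
A minimal loading and generation sketch; the sampling values below are placeholders only and should be replaced with the teacher model's recommended parameters (or tuned as discussed above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BasedBase/Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill-FP32"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Explain the birthday paradox step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Placeholder sampling settings: start from the teacher's recommended values
# and adjust per task, as discussed in the usage note above.
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True,
                         temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```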
