Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill
Model Description
This model is a distilled version of Qwen/Qwen3-30B-A3B-Thinking-2507, designed to inherit the reasoning and behavioral characteristics of its much larger teacher model, deepseek-ai/DeepSeek-V3.1.
It is the result of applying a LoRA created via an SVD-based distillation pipeline, and then merging those weights into the base model. The core of this process was to transfer the nuanced knowledge from a 62-layer, 256-expert teacher model into the more efficient 48-layer, 128-expert architecture of the student model.
The primary goal was to explore the high-fidelity transfer of complex reasoning patterns, particularly those encoded within the Mixture-of-Experts (MoE) layers, from a frontier-class model to a consumer-accessible one.
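The apply-and-merge step described above could look roughly like the following with peft. This is a minimal sketch, not the pipeline's actual code; the adapter path is hypothetical, and only the base model ID comes from this card.

```python
# Sketch: merging a distillation LoRA into the base model with peft.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B-Thinking-2507",
    torch_dtype="auto",
    device_map="auto",
)
# Hypothetical adapter path; the distillation LoRA itself is not published on this card.
model = PeftModel.from_pretrained(base, "path/to/distillation-lora")
merged = model.merge_and_unload()  # folds the LoRA delta into the base weights
merged.save_pretrained("Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill")
```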
Compared to the base Qwen3-30B-A3B-Thinking-2507 model, this distill exhibits the more confident, linear chain-of-thought characteristic of DeepSeek-V3.1: it tends to overthink far less and produces more accurate, better-structured answers.
The Distillation Methodology
This model was not trained in a conventional sense. Instead, it was created using a layer-by-layer, SVD-based distillation process.
Core Components
- Teacher Model: deepseek-ai/DeepSeek-V3.1
- Student Model: Qwen/Qwen3-30B-A3B-Thinking-2507
- LoRA Rank: A high rank of r=2048 was used for all modules to ensure a comprehensive capture of information from the teacher model (a configuration sketch follows below).
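For reference, a rank-2048 adapter might be declared with peft roughly as follows. The target_modules list and lora_alpha value are assumptions for illustration; the card only states that r=2048 was applied to all modules.

```python
# Sketch of a rank-2048 LoRA configuration with peft.
# target_modules and lora_alpha are assumed; the card only specifies r=2048.
from peft import LoraConfig

lora_config = LoraConfig(
    r=2048,                       # high rank, per the card
    lora_alpha=2048,              # assumption: commonly set equal to r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed module names
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
```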
The Distillation Pipeline
For each corresponding layer in the student and teacher, the following pipeline was executed:
Teacher Layer Interpolation (SLERP): For student layers that fall between two teacher layers (based on a sigmoid mapping), Spherical Linear Interpolation (SLERP) was used to create a geometrically sound blend of the teacher's weights. This preserves the integrity of the high-dimensional representations.
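A minimal sketch of this step, assuming the sigmoid layer mapping yields a fractional position t between the two neighboring teacher layers. The function and its lerp fallback are illustrative, not the pipeline's actual code; this is the standard SLERP formulation used in model merging.

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between two teacher weight tensors.
    t in [0, 1] is the student layer's fractional position between them."""
    a = w_a.flatten().float()
    b = w_b.flatten().float()
    # Angle between the two weight vectors, computed on normalized copies.
    a_n = a / (a.norm() + eps)
    b_n = b / (b.norm() + eps)
    omega = torch.acos(torch.clamp(torch.dot(a_n, b_n), -1.0 + eps, 1.0 - eps))
    so = torch.sin(omega)
    if so.abs() < eps:
        # Nearly parallel weights: fall back to plain linear interpolation.
        out = (1.0 - t) * a + t * b
    else:
        out = (torch.sin((1.0 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return out.reshape(w_a.shape).to(w_a.dtype)
```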
SVD Projection: The core of the distillation. The (potentially blended) teacher layer's weight matrix was decomposed using a randomized SVD algorithm. The top 2048 most significant components were selected and reconstructed to fit the student layer's smaller dimensions. This high-rank projection is designed for maximum fidelity.
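One plausible reading of this step, sketched with torch.svd_lowrank (a randomized SVD). Slicing the singular vectors down to the student's dimensions is an assumption about how the reconstruction is made to fit the smaller architecture.

```python
import torch

def svd_project(teacher_w: torch.Tensor, out_dim: int, in_dim: int,
                rank: int = 2048) -> torch.Tensor:
    """Project a (larger) teacher weight matrix onto the student's shape via a
    rank-`rank` randomized SVD. Assumes teacher dims >= student dims."""
    U, S, V = torch.svd_lowrank(teacher_w.float(), q=min(rank, *teacher_w.shape))
    # Keep the top singular directions, truncated to the student's dimensions.
    U_s = U[:out_dim, :]           # (out_dim, r)
    V_s = V[:in_dim, :]            # (in_dim, r)
    return (U_s * S) @ V_s.T       # (out_dim, in_dim) synthetic student-shaped tensor
```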
Generalized Procrustes Analysis: After projection, the newly created "synthetic" tensor was optimally aligned with the student's original pre-trained tensor using a hardened least-squares solver. This alignment minimizes representational distance before calculating the final difference, with added checks to prevent numerical instability.
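A simplified stand-in for this alignment, using classic orthogonal Procrustes; the small ridge term gestures at the stability checks the card mentions, but the actual "hardened" solver is not published here.

```python
import torch

def procrustes_align(synthetic: torch.Tensor, student: torch.Tensor,
                     eps: float = 1e-6) -> torch.Tensor:
    """Find the orthogonal map R minimizing ||synthetic @ R - student||_F and
    return the aligned synthetic tensor."""
    m = synthetic.T.float() @ student.float()              # cross-covariance
    m = m + eps * torch.eye(m.shape[0], device=m.device)   # ridge for numerical stability
    U, _, Vh = torch.linalg.svd(m, full_matrices=False)
    R = U @ Vh                                             # optimal orthogonal map
    return (synthetic.float() @ R).to(synthetic.dtype)
```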
DARE-TIES Purification: The difference tensor (Distilled - Aligned Student) was then purified using the DARE-TIES methodology. This process drops a significant percentage (80%) of the lowest-magnitude values, treating them as noise, and then rescales the remaining important differences. This creates a clean, high-signal delta for the final LoRA.
Mixture-of-Experts (MoE) Distillation
The standout feature of this process is the full distillation of the MoE layers, which are critical for nuanced, context-dependent reasoning.
Expert Fingerprinting & Clustering: To map the 256 teacher experts to the 128 student experts, each teacher expert was "fingerprinted" by concatenating its constituent weight matrices. FAISS-GPU K-Means clustering was then used to efficiently group these 256 fingerprints into 128 distinct clusters based on their geometric similarity.
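A sketch of the fingerprint-and-cluster step with FAISS; the K-Means hyperparameters (niter, seed) are illustrative, and each fingerprint row is assumed to be one teacher expert's concatenated, flattened weights.

```python
import numpy as np
import faiss  # faiss-gpu

def cluster_expert_fingerprints(fingerprints: np.ndarray, n_clusters: int = 128):
    """Cluster teacher-expert fingerprints (e.g. a 256 x d matrix, one flattened
    expert per row) into n_clusters groups with FAISS K-Means on the GPU."""
    x = np.ascontiguousarray(fingerprints.astype(np.float32))
    kmeans = faiss.Kmeans(x.shape[1], n_clusters, niter=25, gpu=True, seed=0)
    kmeans.train(x)
    _, assignments = kmeans.index.search(x, 1)   # nearest centroid per expert
    return assignments.ravel(), kmeans.centroids
```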
Advanced Expert Synthesis: Each of the student's 128 experts was synthesized from a weighted blend of the teacher experts assigned to its cluster. This blend is not a simple average; instead, it uses an SVD-based reconstruction from the top teacher experts (ranked by similarity to the cluster centroid) to create a new, synthetic expert that represents the core "concept" of that cluster. This more advanced synthesis aims to create novel, yet faithful, expert representations.
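One way this synthesis could look, as a hedged sketch: the top_k cutoff and softmax weighting are assumptions, since the card does not give the exact blend parameters.

```python
import torch

def synthesize_expert(cluster_experts: list[torch.Tensor],
                      similarities: torch.Tensor,
                      rank: int = 2048,
                      top_k: int = 4) -> torch.Tensor:
    """Blend the top-k teacher experts of a cluster (ranked by similarity to the
    cluster centroid) and keep only the dominant shared structure via a
    low-rank SVD reconstruction, yielding one synthetic student expert."""
    order = torch.argsort(similarities, descending=True)[:top_k]
    weights = torch.softmax(similarities[order], dim=0)    # assumed weighting scheme
    blended = sum(w * cluster_experts[i].float()
                  for w, i in zip(weights, order.tolist()))
    U, S, V = torch.svd_lowrank(blended, q=min(rank, *blended.shape))
    return ((U * S) @ V.T).to(cluster_experts[0].dtype)
```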
Intended Use
This model is intended for general-purpose use in tasks such as coding, problem solving, and general question answering. It is designed to be a more capable and nuanced reasoner than its base model.
- Primary Use: Complex instruction-following, reasoning tasks, and creative generation.
- Out of Scope: Its knowledge cutoff is from its original training (2024), and it has not been aligned for specific safety or conversational chatbot roles beyond its base tuning.
Critical Usage Note
For inference, you can use either the default settings for the 30B model or the optimized settings used for the 685B model; the choice depends on your specific task. Use the 30B defaults for general tasks. For coding-related work, the 685B settings appear to yield significantly better results in empirical testing, at the cost of slower inference (see the sketch below).
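A minimal inference sketch with transformers. The sampling values shown are the published Qwen3 thinking-model recommendations, used here as the "30B defaults"; the card does not list the 685B preset, so substitute your own values where indicated.

```python
# Sketch: generating with the 30B-default sampling preset.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BasedBase/Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain SVD in two sentences."}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,   # Qwen3 thinking-model default; swap in the 685B preset for coding
    top_p=0.95,
    top_k=20,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```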