Damn.

#3
by CyborgPaloma - opened

Okay, your previous efforts were impressive, but this is just ridiculous.

Obviously your "dynamic task vector machine" (the notion that we need to clean the noise and shape our reasoning traces extremely deliberately) is no longer a hypothesis. This is a bizarre outcome in my opinion, to have such radical performance gains, and I'm starting to wonder if there isn't A LOT more here in terms of test-time compute, especially within the domain of agentic language models.

So I have a few things I feel are important to say:

I want to offer a friendly suggestion. I think the work you are doing is brilliant, and I'd like more people to be able to grasp how important it is. Reading through your papers, I'm wondering if there's a way to demonstrate that importance to less technically apt folks, or maybe even folks with lower literacy and reading comprehension. In other words, I invite you to think of (or have a few big models or agents think of?) how to present what you are doing as a novel post-training step with a new name. I think it's time, I think the work is significant enough, I think your lab has earned it, and I think it could help humanity more than we might be able to see right now. I'm not sure exactly what this would be.

Building on that, I feel it's important to drive home that I believe what you are doing needs to be standardized. You're already doing a good job identifying different formatting templates for reasoning structures, and logging and comparing their performance. These templates (at this point I'm theorizing) are something I think could be critical. Naming these templates, testing them, and showing others how to implement them in their own models is key. But it isn't just that. I believe that as a "novel post-training step" this should ENCOMPASS aligning reasoning traces to optimized formatting, as well as whatever else you've identified is needed to fully convert a model's reasoning into what you consider a "dynamic task vector machine" (sorry for my poor grasp of the concept). This process needs to be identifiable, it needs to be succinctly and clearly documented, and it needs to be repeatable. I see it being used alongside inside-baseball terms like DPO. I think at this point it's fair to say that if the brilliant Qwen lab is leaving so much performance on the table, even IF it is domain specific, this is something that NEEDS to be implemented well beyond the Qwen line.
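Just to make concrete what I mean by a named, repeatable template (this is purely my own sketch, not anything from your papers, and every field name here is invented): something as simple as a fixed skeleton that every reasoning trace gets rendered into before training, so different labs can implement and compare the exact same structure.

```python
# Purely illustrative sketch of a "reasoning trace formatting template".
# None of these field names come from Jan or Qwen; they are placeholders.

TRACE_TEMPLATE = (
    "<think>\n"
    "Goal: {goal}\n"
    "Plan:\n{plan}\n"
    "Tool call: {tool_call}\n"
    "Observation: {observation}\n"
    "Conclusion: {conclusion}\n"
    "</think>\n"
    "{answer}"
)

def render_trace(goal, plan_steps, tool_call, observation, conclusion, answer):
    """Render one training example so every trace follows the same skeleton."""
    plan = "\n".join(f"  {i + 1}. {step}" for i, step in enumerate(plan_steps))
    return TRACE_TEMPLATE.format(
        goal=goal,
        plan=plan,
        tool_call=tool_call,
        observation=observation,
        conclusion=conclusion,
        answer=answer,
    )

if __name__ == "__main__":
    print(render_trace(
        goal="Find the release year of Qwen3",
        plan_steps=["Search the web", "Read the top result", "Extract the year"],
        tool_call='search(query="Qwen3 release year")',
        observation="The search result says Qwen3 was released in 2025.",
        conclusion="The release year is 2025.",
        answer="Qwen3 was released in 2025.",
    ))
```

If a handful of templates like this were named, versioned, and benchmarked against each other, anyone could drop them into their own data pipeline and reproduce the comparison.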

Next... you folks need money. Compute. You need to scale this, NOW. 8, 14, or 32 billion parameters would be good, but obviously the 2507 thinking models are a target. You need to hit the 30b3a and the 235b22a, folks, and in that order. Elevate your system of models with what I think would be a massively impressive lineup of agentic reasoning models: Nano (1.7) / Small (4) / Medium (30b3a) / Large (235b22a). Not just that, but Magistral, OSS, SMOLLM3, GLM, Ling Lite & Plus, Hunyuan, and so on are all targets.

Fire right away, and hard. I believe that moving forward, the types of use cases and realistic tasks that could be performed with a scaled version of this dynamic task vector machine go FAR beyond web search: embodied AI, and zero-shot use of tools with extreme precision, which will open up a world of downstream uses by letting you push tool complexity further without compromising accuracy. There could even be LMs that are DESIGNED to be lean on actual memorized data, intended instead to be paired with bodies of verified, accurate information for the end use case or system they're implemented in. I think you need a dataset with as many of the most complex MCP and tool calls as you can get your hands on, but whatever you're cooking with or doing now SEEMS to be working, so you don't need me to tell you that. I don't know how far this team will take this, but... I see something glittering here. I want to see your platform thrive for all the knowledge you've contributed, and I wish you all the best. Cheers.

Thank you! We will consider your suggestion!

Hope you have a good time with Jan-v1!

@CyborgPaloma Could this even be used with MoE though? Admittedly I'm not working in the weeds like you folks are, but I was looking into something similar and MoE training is just an entirely different beast.

Hi there! Sorry for the late reply, but yes. First of all, as proof, here is a popular community-made MoE model literally built from Jan: https://huggingface.co/DavidAU/Qwen3-MOE-4x4B-16B-Jan-Polaris-Instruct-Power-House

But the idea of structuring the thinking blocks so the model can iteratively accomplish multiple tasks before answering actually favors the MoE architecture. Think about it: when learning to do advanced tool calling, searching, summarizing, or even concepts beyond our understanding that are required to do what Jan is intended to do, it is really valuable to have experts to call on for these functions. It means smaller chunks of the model can focus on the thinking blocks and on nailing the tool calling, while different parts of the model activate for the actual end-use text generation, like what the user will see at the end, or creative writing.

That being said, for that to work, you most likely have to train with whatever sauce these folks are putting on their models: the incredible reasoning steps, agentic tool calling, and formatting that allow Jan to search and navigate more advanced functions, so that the experts actually start to separate their functions and really take shape.
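To make the intuition concrete, here is a tiny toy sketch (my own illustration, not Jan's or Qwen's actual training code or architecture) of how a top-k router in an MoE layer can send thinking/tool-calling tokens to different experts than the tokens of the final user-facing answer, if training pushes them apart:

```python
# Toy MoE layer with top-k routing. Purely illustrative; not a real model's
# implementation. Requires PyTorch.
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    def __init__(self, dim=64, num_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):  # x: [num_tokens, dim]
        gate_probs = self.router(x).softmax(dim=-1)           # [num_tokens, num_experts]
        weights, idx = torch.topk(gate_probs, self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run for each token. Tokens inside a
        # <think>/tool-call span can therefore end up routed to different
        # experts than tokens in the user-facing answer.
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

if __name__ == "__main__":
    layer = TinyMoELayer()
    hidden = torch.randn(10, 64)   # pretend hidden states for 10 tokens
    print(layer(hidden).shape)     # torch.Size([10, 64])
```

The specialization itself is not guaranteed by the router; it only emerges if the training data (structured thinking blocks, tool calls, final answers) gives the experts consistently different work to do.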