V3.1 seems to be pretty bad at everything except coding and mathematics.

#26
by qazqazqazqaz46 - opened

After testing various prompts, providers, and the official API, the results indicate that this model is essentially designed for coding and is unsuitable for daily use. It also performs poorly at following instructions or prompts given by users.

I suppose V3.1 was originally designed as a generalist model with a hybrid mode to reduce costs, combining the benefits of a chat model for daily use and a thinking model for precise tasks like coding, math, and agent functions. Unfortunately, it turned out to be an excellent coder and agent but performs poorly in every other area, regardless of the mode chosen.

I feel the same way, especially in autonomous agent use: the system prompt tells it what to do, but it often doesn't comply.

The hybrid model ended up failing... Just treat V3.1 as a pure coder.

DeepSeek 3.1 is really bad; its agent mode is riddled with errors.

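Same in our business use; V3-0324 seems to work best. It feels like 3.1 has degenerated into a pure agent-plus-coding model. For liberal-arts tasks, it feels even worse than 0324.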

Man, I seriously expect them to follow Qwen's path and release a separate, R1-like, non-hybrid thinking model.

V3.1 reasons very little and is just pure trash for anything that doesn't appear complex at first glance or isn't worded like a logic or math problem. A massive regression from R1-0528 and V3-0324.

It's the worst of both worlds. The only thing it can do well is write code and use tools.
