✨ Efficiency leads the month - At scale: optimizing compute use in massive MoE models, e.g. DeepSeek v3.1 - In small models: lightweight & deployable, e.g. MiniCPM-V 4.5, Step Audio 2-mini, Intern S1-mini, Ovis2.5-9B, etc.
✨ Reasoning + Agentic wave 🌊 Not just demos, but real product use cases. - Meituan, DeepSeek: large-scale models tuned for reasoning & tools - Qwen, GLM, InternLM: multimodal reasoning + agentic interaction - CodeAgent, Prover, Baichuan-M2-32B: domain-focused (coding, logic, specialized reasoning)
✨ Open source is exploding across all types of companies!! - Big tech: Tencent, ByteDance, Xiaomi, Kuaishou, Alibaba/Qwen, Skywork, Ant Group - Startups: DeepSeek (yes, still a startup!), Zhipu, Baichuan, StepFun, OpenBMB - New entrants: Meituan, RedNote - Research labs: Shanghai AI Lab (InternLM, OpenGVLab)
✨ Open source was explicitly mentioned in the State Council’s new guidance on deepening the "AI+" strategy. - Open-source: support communities, encourage contributions (incl. university credits & recognition), foster new application approaches, and build globally impactful ecosystems 👀
💡 The Chinese community didn’t slow down at all in August 🤯 September, the last month before the Golden Week holiday, may bring even more surprises.
✨ Supports 33 languages, including 5 ethnic minority languages in China 👀 ✨ Including a translation ensemble model: Chimera-7B ✨ Full pipeline: pretrain > CPT > SFT > enhancement > ensemble refinement > SOTA performance at similar scale
MiniCPM-V 4.5 🚀 New MLLM for image, multi-image & video understanding, running even on your phone, released by OpenBMB openbmb/MiniCPM-V-4_5
✨ SOTA vision language capability ✨ 96× video token compression > high-FPS & long video reasoning ✨ Switchable fast vs deep thinking modes ✨ Strong OCR, document parsing, supports 30+ languages
✨ 36B - Base & Instruct ✨ Apache 2.0 ✨ Native 512K long context ✨ Strong reasoning & agentic intelligence ✨ 2 Base versions: with & without synthetic data
As an ML practitioner, you've probably implemented the three-loop matrix multiplication many times, but this naive implementation is terrible for GPU performance. Modern GPUs reach peak throughput only through careful memory access patterns and minimal scheduling overhead.
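For reference, the triple-loop version reads like this - a plain Python sketch of the textbook algorithm, not GPU code (function name and list-of-lists layout are illustrative):

```python
def matmul_naive(A, B, M, K, N):
    """Textbook three-loop matmul: C (MxN) = A (MxK) @ B (KxN)."""
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):           # rows of the output
        for j in range(N):       # columns of the output
            for k in range(K):   # sum-reduction dimension
                C[i][j] += A[i][k] * B[k][j]
    return C
```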
In a conventional tiled matmul (M×K · K×N), the computation happens in tiles - both for the output matrix and for the chunks read from the input matrices. Each thread-block produces one output tile: it loads the corresponding input tiles (sum-reducing across the K dimension), performs the computation, then terminates. The GPU launches many thread-blocks and schedules them across the available streaming multiprocessors (SMs). When an SM finishes one tile, it gets assigned a new thread-block for the next uncomputed tile. Multiple output tiles are thus computed in parallel across the SMs, but we pay a thread-block launch cost each time a new tile is computed.
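This per-tile scheme can be mimicked single-threaded on the CPU. In the toy Python sketch below (tile size and names are illustrative, not from any real kernel), each (row-tile, column-tile) pair plays the role of one thread-block that sum-reduces over K in tile-sized chunks:

```python
TILE = 2  # tile edge length; real kernels pick this from shared-memory budget

def matmul_tiled(A, B, M, K, N):
    """One 'thread-block' per output tile, as in the conventional GPU scheme."""
    C = [[0.0] * N for _ in range(M)]
    for ti in range(0, M, TILE):          # each (ti, tj) = one launched block
        for tj in range(0, N, TILE):
            # the block walks the K dimension in tile-sized chunks
            for tk in range(0, K, TILE):
                for i in range(ti, min(ti + TILE, M)):
                    for j in range(tj, min(tj + TILE, N)):
                        for k in range(tk, min(tk + TILE, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

On a GPU each (ti, tj) iteration of the outer two loops would be an independent thread-block launch - which is exactly the repeated cost the persistent variant removes.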
Persistent matmul changes this approach. Instead of repeatedly launching thread-blocks until every output tile is computed, you launch only as many thread-blocks as you have SMs available (typically 80-132 on modern GPUs). These thread-blocks stay alive until all output tiles are computed, looping through tiles sequentially - each persistent thread-block may handle multiple output tiles.
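A toy Python sketch of the persistent scheme (worker count and tile size are illustrative assumptions): launch a fixed number of workers standing in for SMs, and let each one stride through the flat list of output tiles until all are done:

```python
TILE = 2
NUM_WORKERS = 4  # stands in for the SM count (80-132 on real GPUs)

def matmul_persistent(A, B, M, K, N):
    """Fixed worker pool; each persistent 'block' loops over many output tiles."""
    C = [[0.0] * N for _ in range(M)]
    tiles = [(ti, tj) for ti in range(0, M, TILE) for tj in range(0, N, TILE)]
    for worker in range(NUM_WORKERS):            # launched once, stays alive
        # grid-stride over tile indices: worker, worker+NUM_WORKERS, ...
        for idx in range(worker, len(tiles), NUM_WORKERS):
            ti, tj = tiles[idx]
            for i in range(ti, min(ti + TILE, M)):
                for j in range(tj, min(tj + TILE, N)):
                    for k in range(K):
                        C[i][j] += A[i][k] * B[k][j]
    return C
```

The strided index loop is the whole trick: the launch cost is paid NUM_WORKERS times total instead of once per output tile, while every tile is still computed exactly once.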
The key benefit is reduced thread-block launch latency. Combined with other optimizations - coalesced memory loads/stores, block-tiling, warp-tiling, warp-specialization, double-buffering, ping-pong scheduling, and more - this persistence strategy helps achieve peak performance. More on this in the future!
✨ The multimodal wave🌊 - GLM-4.1V-Thinking: Image+Text > Text - Intern-S1: Image+Text > Text - Wan 2.2: Text+Image > Video - Skywork-R1V3: Image+Text > Text - Skywork-UniPic: Text > Image / Image > Text - Tar-7B: Any-to-Any - Ming-Lite-Omni-1.5: Any-to-Any - Step3: Image+Text > Text - HunyuanWorld-1: Image > 3D - ThinkSound: Video > Audio - Neta-Lumina: Text > Image
✨ Big month not only for models, but for policy too🏛️ - Announced Global Action Plan for AI Governance - Proposes to set up a World AI Cooperation Organization in Shanghai - Released International AI Open Source Collaboration Initiative - Published Risk Assessment Guidelines for Endpoint AI Agents
✨ Big event - WAIC - 355K offline visitors - 108 new releases in 4 days - 145 sessions across key domains
I’ve been tracking things closely, but July’s open-source wave still blew me away. Can’t wait to see what’s coming next! 🚀
We just updated GPU-fryer 🍳 to run on the Grace Hopper Superchip (GH200) - fully optimized for ARM-based systems! With this release, we switched to cuBLASLt to support running FP8 benchmarks. You can monitor GPU throttling, TFLOPS outliers, and HBM memory health, and ensure that you get the most out of your hardware setup. Perfect for stress testing and tuning datacenter GPUs.
✨ 321B total / 32B active - Apache 2.0 ✨ MFA + AFD: cutting decoding cost by up to 70% vs. DeepSeek-V3 ✨ 4T image-text pretraining: strong vision–language grounding ✨ Modular, efficient, deployable: runs on just 8×48GB GPUs