✨ Efficiency leads the month - At scale: optimizing compute use in massive MoE models, e.g. DeepSeek v3.1 - In small models: lightweight & deployable, e.g. MiniCPM-V 4.5, Step Audio 2-mini, Intern S1-mini, Ovis2.5-9B, etc.
✨ Reasoning + Agentic wave 🌊 Not just demos, but real product use cases. - Meituan, DeepSeek: large-scale models tuned for reasoning & tools - Qwen, GLM, InternLM: multimodal reasoning + agentic interaction - CodeAgent, Prover, Baichuan-M2-32B: domain-focused (coding, logic, specialized reasoning)
✨ Open source is exploding across all types of companies!! - Big tech: Tencent, ByteDance, Xiaomi, Kuaishou, Alibaba/Qwen, Skywork, Ant Group - Startups: DeepSeek (yes, still a startup!), Zhipu, Baichuan, StepFun, OpenBMB - New entrants: Meituan, RedNote - Research labs: Shanghai AI Lab (InternLM, OpenGVLab)
✨ Open source was explicitly mentioned in the State Council’s new guidance on deepening the "AI+" strategy. - Open-source: support communities, encourage contributions (incl. university credits & recognition), foster new application approaches, and build globally impactful ecosystems 👀
💡 The Chinese community didn’t slow down at all in August 🤯 September, the last month before the Golden Week holiday, may bring even more surprises.
✨ Supports 33 languages, including 5 ethnic minority languages in China 👀 ✨ Including a translation ensemble model: Chimera-7B ✨ Full pipeline: pretrain > CPT > SFT > enhancement > ensemble refinement > SOTA performance at similar scale
MiniCPM-V 4.5 🚀 New MLLM for image, multi-image & video understanding, running even on your phone, released by OpenBMB openbmb/MiniCPM-V-4_5
✨ SOTA vision language capability ✨ 96× video token compression > high-FPS & long video reasoning ✨ Switchable fast vs deep thinking modes ✨ Strong OCR, document parsing, supports 30+ languages
✨ 36B - Base & Instruct ✨ Apache 2.0 ✨ Native 512K long context ✨ Strong reasoning & agentic intelligence ✨ 2 Base versions: with & without synthetic data
As an ML practitioner, you've probably implemented the three-loop matrix multiplication many times, but this naive implementation is terrible for GPU performance. Modern GPUs reach peak throughput only through careful memory access patterns and minimal scheduling overhead.
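For reference, the triple-loop version reads like this - a plain Python sketch of the textbook algorithm, not GPU code (function name and list-of-lists layout are illustrative):

```python
def matmul_naive(A, B, M, K, N):
    """Textbook three-loop matmul: C (MxN) = A (MxK) @ B (KxN)."""
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):           # rows of the output
        for j in range(N):       # columns of the output
            for k in range(K):   # sum-reduction dimension
                C[i][j] += A[i][k] * B[k][j]
    return C
```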
In a conventional tiled matmul (M×K · K×N), the computation happens in tiles - both for the output matrix and for the chunks read from the input matrices. Each thread-block produces one output tile: it loads the corresponding input tiles (sum-reducing across the K dimension), performs the computation, then terminates. The GPU launches many thread-blocks and schedules them across the available streaming multiprocessors (SMs). When an SM finishes one tile, it gets assigned a new thread-block for the next uncomputed tile. Multiple output tiles are thus computed in parallel across the SMs, but we pay a thread-block launch cost each time a new tile is computed.
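This per-tile scheme can be mimicked single-threaded on the CPU. In the toy Python sketch below (tile size and names are illustrative, not from any real kernel), each (row-tile, column-tile) pair plays the role of one thread-block that sum-reduces over K in tile-sized chunks:

```python
TILE = 2  # tile edge length; real kernels pick this from shared-memory budget

def matmul_tiled(A, B, M, K, N):
    """One 'thread-block' per output tile, as in the conventional GPU scheme."""
    C = [[0.0] * N for _ in range(M)]
    for ti in range(0, M, TILE):          # each (ti, tj) = one launched block
        for tj in range(0, N, TILE):
            # the block walks the K dimension in tile-sized chunks
            for tk in range(0, K, TILE):
                for i in range(ti, min(ti + TILE, M)):
                    for j in range(tj, min(tj + TILE, N)):
                        for k in range(tk, min(tk + TILE, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

On a GPU each (ti, tj) iteration of the outer two loops would be an independent thread-block launch - which is exactly the repeated cost the persistent variant removes.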
Persistent matmul changes this approach. Instead of repeatedly launching thread-blocks until every output tile is computed, you launch only as many thread-blocks as you have SMs available (typically 80-132 on modern GPUs). These thread-blocks stay alive until all output tiles are computed, looping through tiles sequentially - each persistent thread-block may handle multiple output tiles.
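A toy Python sketch of the persistent scheme (worker count and tile size are illustrative assumptions): launch a fixed number of workers standing in for SMs, and let each one stride through the flat list of output tiles until all are done:

```python
TILE = 2
NUM_WORKERS = 4  # stands in for the SM count (80-132 on real GPUs)

def matmul_persistent(A, B, M, K, N):
    """Fixed worker pool; each persistent 'block' loops over many output tiles."""
    C = [[0.0] * N for _ in range(M)]
    tiles = [(ti, tj) for ti in range(0, M, TILE) for tj in range(0, N, TILE)]
    for worker in range(NUM_WORKERS):            # launched once, stays alive
        # grid-stride over tile indices: worker, worker+NUM_WORKERS, ...
        for idx in range(worker, len(tiles), NUM_WORKERS):
            ti, tj = tiles[idx]
            for i in range(ti, min(ti + TILE, M)):
                for j in range(tj, min(tj + TILE, N)):
                    for k in range(K):
                        C[i][j] += A[i][k] * B[k][j]
    return C
```

The strided index loop is the whole trick: the launch cost is paid NUM_WORKERS times total instead of once per output tile, while every tile is still computed exactly once.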
The key benefit is reduced thread-block launch latency. Combined with other optimizations - coalesced memory loads/stores, block-tiling, warp-tiling, warp-specialization, double-buffering, ping-pong scheduling, and more - this persistence strategy helps achieve peak performance. More on this in the future!
✨ The multimodal wave🌊 - GLM-4.1V-Thinking: Image+Text > Text - Intern-S1: Image+Text > Text - Wan 2.2: Text+Image > Video - Skywork-R1V3: Image+Text > Text - Skywork-UniPic: Text > Image / Image > Text - Tar-7B: Any-to-Any - Ming-Lite-Omni-1.5: Any-to-Any - Step3: Image+Text > Text - HunyuanWorld-1: Image > 3D - ThinkSound: Video > Audio - Neta-Lumina: Text > Image
✨ Big month not only for models, but for policy too🏛️ - Announced Global Action Plan for AI Governance - Proposes to set up a World AI Cooperation Organization in Shanghai - Released International AI Open Source Collaboration Initiative - Published Risk Assessment Guidelines for Endpoint AI Agents
✨ Big event - WAIC - 355K offline visitors - 108 new releases in 4 days - 145 sessions across key domains
I’ve been tracking things closely, but July’s open-source wave still blew me away. Can’t wait to see what’s coming next! 🚀
We just updated GPU-fryer 🍳 to run on the Grace Hopper Superchip (GH200) - fully optimized for ARM-based systems! With this release, we switched to cuBLASLt to support running FP8 benchmarks. You can monitor GPU throttling, TFLOPS outliers, and HBM memory health, and ensure that you get the most out of your hardware setup. Perfect for stress testing and tuning datacenter GPUs.
✨ 321B total / 32B active - Apache 2.0 ✨ MFA + AFD: cutting decoding cost by up to 70% vs. DeepSeek-V3 ✨ 4T image-text pretraining: strong vision–language grounding ✨ Modular, efficient, deployable: runs on just 8×48GB GPUs