
Xuan-Son Nguyen

ngxson

AI & ML interests

Doing AI for fun, not for profit


Organizations

Hugging Face, Blog-explorers, Hugging Face TB Research, ggml.ai, Hugging Face Discord Community, Consumer AI Edge Hackathon (Meta, Hugging Face, Pytorch, Scaleway & Unaite), Mistral AI Game Jam, gg-hf-g

ngxson's activity

reacted to clem's post with ❤️🔥 5 days ago
I was chatting with @peakji, one of the cofounders of Manus AI, who told me he was on Hugging Face (very cool!).

He shared an interesting insight which is that agentic capabilities might be more of an alignment problem rather than a foundational capability issue. Similar to the difference between GPT-3 and InstructGPT, some open-source foundation models are simply trained to 'answer everything in one response regardless of the complexity of the question' - after all, that's the user preference in chatbot use cases. Just a bit of post-training on agentic trajectories can make an immediate and dramatic difference.

As a thank you to the community, he shared 100 invite codes, first-come first-served; just use “HUGGINGFACE” to get access!
posted an update 14 days ago
A comprehensive matrix for choosing which format to use.

Read more on my blog post: https://huggingface.co/blog/ngxson/common-ai-model-formats

| Hardware        | GGUF      | PyTorch             | Safetensors            | ONNX |
|-----------------|-----------|---------------------|------------------------|------|
| CPU             | ✅ (best) | 🟡                  | 🟡                     |      |
| GPU             |           |                     |                        |      |
| Mobile          |           | 🟡 (via executorch) |                        |      |
| Apple silicon   |           | 🟡                  | ✅ (via MLX framework) |      |
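
As a quick illustration of two paths from this matrix, here is a minimal sketch (the model ID and file name are placeholders, not recommendations) that loads a model from Safetensors with transformers and from GGUF with llama-cpp-python:

```python
# Minimal sketch, assuming `pip install transformers llama-cpp-python`.
# The model ID and GGUF file name below are placeholders.
from transformers import AutoModelForCausalLM
from llama_cpp import Llama

# Safetensors: the default when loading from the Hub with transformers (PyTorch backend).
model = AutoModelForCausalLM.from_pretrained("some-org/some-model")

# GGUF: a single-file quantized format built for llama.cpp (typically best on CPU).
llm = Llama(model_path="some-model.Q4_K_M.gguf", n_ctx=4096)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```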
reacted to fdaudens's post with 🔥❤️🚀 17 days ago
🚀 Just launched: A toolkit of 20 powerful AI tools that journalists can use right now - transcribe, analyze, create. 100% free & open-source.

Been testing all these tools myself and created a searchable collection of the most practical ones - from audio transcription to image generation to document analysis. No coding needed, no expensive subscriptions.

Some highlights I've tested personally:
- Private, on-device transcription with speaker ID in 100+ languages using Whisper (see the sketch after this list)
- Website scraping that just works - paste a URL, get structured data
- Local image editing with tools like Finegrain (impressive results)
- Document chat using Qwen 2.5 72B (handles technical papers well)
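
To make the transcription workflow concrete, here is a minimal sketch using the transformers pipeline with a Whisper checkpoint (the model ID and audio file are placeholder assumptions, and speaker identification would need an extra diarization step on top of Whisper):

```python
# Minimal sketch of on-device transcription with a Whisper checkpoint.
# Assumes `pip install transformers` plus ffmpeg for audio decoding; the model ID
# and file name are placeholders, not necessarily what the toolkit uses.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("interview.wav", return_timestamps=True)  # timestamps enable long-form audio
print(result["text"])
```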

Sharing this early because the best tools come from the community. Drop your favorite tools in the comments or join the discussion on what to add next!

👉 JournalistsonHF/ai-toolkit
reacted to as-cle-bert's post with 🚀👍 23 days ago
I built an AI agent app in less than 8 hours🤯
And, believe me, this is 𝗻𝗼𝘁 clickbait❌

GitHub 👉 https://github.com/AstraBert/PapersChat
Demo 👉 as-cle-bert/PapersChat

The app is called 𝐏𝐚𝐩𝐞𝐫𝐬𝐂𝐡𝐚𝐭, and it is aimed at 𝗺𝗮𝗸𝗶𝗻𝗴 𝗰𝗵𝗮𝘁𝘁𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝘀𝗰𝗶𝗲𝗻𝘁𝗶𝗳𝗶𝗰 𝗽𝗮𝗽𝗲𝗿𝘀 𝗲𝗮𝘀𝗶𝗲𝗿.

𝐇𝐞𝐫𝐞 𝐢𝐬 𝐰𝐡𝐚𝐭 𝐭𝐡𝐞 𝐚𝐩𝐩 𝐝𝐨𝐞𝐬:

📄 Parses the papers that you upload thanks to LlamaIndex🦙 (either with LlamaParse or with simpler, local methods)
📄 Embeds documents with both a sparse and a dense encoder to enable hybrid search (see the sketch after this list)
📄 Uploads the embeddings to Qdrant
⚙️ Activates an Agent based on mistralai/Mistral-Small-24B-Instruct-2501 that will reply to your prompt
🧠 Retrieves information relevant to your question from the documents
🧠 If no relevant information is found, it searches PubMed and arXiv databases
🧠 Returns a grounded answer to your prompt
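
To make the indexing pipeline concrete, here is a minimal sketch of the LlamaIndex + Qdrant hybrid-search setup (paths, the collection name, and package layout are my assumptions; the exact PapersChat code is in the GitHub repo above):

```python
# Minimal sketch, assuming a local Qdrant instance, e.g. started with:
#   docker run -p 6333:6333 qdrant/qdrant
# and `pip install llama-index llama-index-vector-stores-qdrant`.
from qdrant_client import QdrantClient
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(
    client=client,
    collection_name="papers",  # placeholder name
    enable_hybrid=True,        # dense + sparse embeddings for hybrid search
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./papers").load_data()  # the uploaded papers
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Hybrid retrieval at query time; the LLM on top is configurable.
query_engine = index.as_query_engine(vector_store_query_mode="hybrid")
print(query_engine.query("What datasets do these papers use?"))
```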

𝐇𝐨𝐰 𝐝𝐢𝐝 𝐈 𝐦𝐚𝐧𝐚𝐠𝐞 𝐭𝐨 𝐦𝐚𝐤𝐞 𝐭𝐡𝐢𝐬 𝐚𝐩𝐩𝐥𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝐢𝐧 𝟖 𝐡𝐨𝐮𝐫𝐬?

Three key points:

- LlamaIndex🦙 provides countless integrations with LLM providers, text embedding models and vectorstore services, and takes care of the internal architecture of the Agent. You just plug it in, and it works!🔌⚡
- Qdrant is a vector database service that is extremely easy to set up and use: you just need a one-line Docker command😉
- Gradio makes frontend development painless and fast, while still providing modern and responsive interfaces🏗️

And a bonus point:

- Deploying the demo app couldn't be easier if you use Gradio-based Hugging Face Spaces🤗
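
To illustrate the Gradio point, a chat frontend can be as small as this sketch (answer_fn is a hypothetical stand-in for the real agent call):

```python
# Minimal sketch of a Gradio chat UI; answer_fn is a placeholder for the agent.
import gradio as gr

def answer_fn(message, history):
    return f"(agent answer for: {message})"  # plug the RAG agent in here

demo = gr.ChatInterface(fn=answer_fn, title="PapersChat-style demo")
demo.launch()  # on a Gradio-based Hugging Face Space, this is the whole deployment
```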

So, no more excuses: build your own AI agent today; do it fast, (almost) for free, and effortlessly🚀

And if you need a starting point, the code for PapersChat is open and fully reproducible on GitHub 👉 https://github.com/AstraBert/PapersChat
reacted to burtenshaw's post with 👍🤗❤️ 26 days ago
Hey, I’m Ben and I work at Hugging Face.

Right now, I’m focusing on educational stuff and getting loads of new people to build open AI models using free and open source tools.

I’ve made a collection of some of the tools I’m building and using for teaching. Stuff like quizzes, code challenges, and certificates.

burtenshaw/tools-for-learning-ai-6797453caae193052d3638e2
reacted to mitkox's post with 🚀👍 about 2 months ago
llama.cpp is 26.8% faster than ollama.
I upgraded both and, using the same settings, ran the same DeepSeek R1 Distill 1.5B on the same hardware. It's an apples-to-apples comparison.

Total duration:
llama.cpp 6.85 sec <- 26.8% faster
ollama 8.69 sec

Breakdown by phase:
Model loading
llama.cpp 241 ms <- 2x faster
ollama 553 ms

Prompt processing
llama.cpp 416.04 tokens/s with an eval time of 45.67 ms <- 10x faster
ollama 42.17 tokens/s with an eval time of 498 ms

Token generation
llama.cpp 137.79 tokens/s with an eval time of 6.62 sec <- 13% faster
ollama 122.07 tokens/s with an eval time of 7.64 sec
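
For anyone who wants to sanity-check numbers like these on their own machine, here is a rough sketch using llama-cpp-python (not the methodology of this benchmark; the model path is a placeholder, and this lumps prompt processing into the generation timing, whereas llama.cpp's own verbose output separates the phases):

```python
# Rough timing sketch, assuming `pip install llama-cpp-python` and a local GGUF file.
import time
from llama_cpp import Llama

t0 = time.perf_counter()
llm = Llama(model_path="deepseek-r1-distill-1.5b.Q4_K_M.gguf", n_ctx=2048)  # placeholder path
print(f"model load: {time.perf_counter() - t0:.3f} s")

t0 = time.perf_counter()
out = llm("Explain the Pythagorean theorem.", max_tokens=256)
elapsed = time.perf_counter() - t0
n_generated = out["usage"]["completion_tokens"]
print(f"generation: {n_generated / elapsed:.1f} tokens/s over {elapsed:.2f} s")
```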

llama.cpp is LLM inference in C/C++; ollama adds abstraction layers and marketing.

Make sure you own your AI. AI in the cloud is not aligned with you; it's aligned with the company that owns it.
reacted to onekq's post with 🔥 about 2 months ago
🐋DeepSeek 🐋 is the real OpenAI 😯
posted an update about 2 months ago
replied to their post about 2 months ago

Yes, sure!

The first step is to generate the PEFT-compatible LoRA adapter; I used mergekit-extract-lora to do that. Please note that some bigger models (Qwen/Llama 70B) give errors that I don't know how to fix; hopefully that will be resolved soon. You can find more info about mergekit here: https://github.com/arcee-ai/mergekit

The next step is to convert the PEFT adapter to GGUF; I used this space: https://huggingface.co/spaces/ggml-org/gguf-my-lora

Then it's good to go!

Please note that the space can convert any PEFT LoRA adapter to GGUF, so if you're using something like unsloth, it is straightforward to convert it into a GGUF LoRA (no need to merge it into the base model).
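
For reference, here is a minimal sketch of sanity-checking an extracted PEFT adapter before converting it to GGUF (paths are placeholders):

```python
# Minimal sketch: load a base model plus a PEFT LoRA adapter (e.g. one produced by
# mergekit-extract-lora) and run a quick generation. Paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")
model = PeftModel.from_pretrained(base, "path/to/extracted-lora-adapter")

tokenizer = AutoTokenizer.from_pretrained("path/to/base-model")
inputs = tokenizer("Hello!", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```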

replied to their post about 2 months ago
posted an update about 2 months ago
Check out my collection of pre-made GGUF LoRA adapters!

This allows you to use both the normal and abliterated versions of popular models like Llama, Qwen, etc., without having to double the amount of VRAM used.

ngxson/gguf_lora_collection
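
As a sketch of how such an adapter is used (file names are placeholders), llama-cpp-python can load the shared base model once and apply a GGUF LoRA on top:

```python
# Minimal sketch: one base GGUF plus a GGUF LoRA adapter, so the base weights
# are not duplicated in memory. File names are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="base-model.Q4_K_M.gguf",
    lora_path="abliterated-lora.gguf",  # adapter from the collection
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```

With the llama.cpp CLI, the equivalent is the --lora flag.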
reacted to bartowski's post with 👀👍 2 months ago
Looks like Q4_0_N_M file types are going away

Before you panic, there's a new "preferred" method, which is online (I prefer the term on-the-fly) repacking: if you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (minor losses, I think, due to using intrinsics instead of assembly, but intrinsics are more maintainable)

You can see the reference PR here:

https://github.com/ggerganov/llama.cpp/pull/10446

So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless they add backwards compatibility back), but Q4_0 should be the same speed (though it may currently be bugged on some platforms)

As such, I'll stop making those newer model formats soon, probably by the end of this week unless something changes, but you should be safe to download Q4_0 quants and use those!

Also, IQ4_NL supports repacking, though not in as many shapes yet, but it should get a respectable speedup on ARM chips; the PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541

Remember, these are not meant for Apple silicon, since those use the GPU and don't benefit from the repacking of weights.