nightmedia (Gheorghe Chesler)

posted an update about 14 hours ago

Post

101

Nightmedia now accepts donations

Your kind help would go towards more hardware to run tests--both my AI assistants and myself are deeply appreciative 🖖

These are some of the more recent models:

- [Agent]( nightmedia/Qwen3-4B-Agent-F32-dwq4-mlx)
- [Holodeck]( nightmedia/Qwen3-30B-A3B-Holodeck-mlx)
- [Quorum]( nightmedia/Qwen3-42B-A3B-Quorum-mlx)
- [Continuum]( nightmedia/Qwen3-42B-A3B-Continuum-mlx)
- [Architect 4B]( nightmedia/Qwen3-4B-Architect-mxfp4-mlx)
- [Architect 30B]( nightmedia/Qwen3-30B-A3B-Architect-qx86-hi-mlx)
-[Engineer 4B]( nightmedia/Qwen3-4B-Engineer3x-qx86-hi-mlx)
-[Engineer 30B]( nightmedia/Qwen3-30B-A3B-Engineer-mxfp4-mlx)
- [Data 14B]( nightmedia/Qwen3-14B-Data-qx86-hi-mlx)
- [Seven of Nine]( nightmedia/Qwen3-30B-A3B-Seven-mxfp4-mlx)
- [Spock 4B]( nightmedia/Qwen3-4B-Spock-qx86-hi-mlx)
- [Spock 14B]( nightmedia/Qwen3-14B-Spock-qx86-hi-mlx)
- [Spock 30B]( nightmedia/Qwen3-30B-A3B-Spock-qx86-hi-mlx)
- [Spock 42B]( nightmedia/Qwen3-42B-A3B-Spock-Brutal-Recall-Instruct-qx86-hi-mlx)
- [Odo 6B]( nightmedia/Qwen3-6B-Odo-mxfp4-mlx)
- [Deckard 8B]( nightmedia/Qwen3-8B-Deckard-qx-mlx)
- [HiveMind-Heretic 6B]( nightmedia/Qwen3-6B-HiveMind-Heretic-qx86-hi-mlx)

..as well as a variety of MLX quants, too many to mention here

Thank you, and Happy New Year!

BTC: 36d7U1n3MFaXgnNRAaEL3Pa3Hy6oFhM7XY
ETH: 0x6b6633606995BC180925c47d4249ED624aB7b2A5
USDC: 0x19e6bDDCBa47BB09a9Bc153Bb6479fc57284421a

-G

reacted to hesamation's post with ❤️ 27 days ago

Post

2883

this is big... 50 AI researchers from Bytedance, Alibaba, Tencent, and other labs/universities just published a 300-page paper with surprising lessons about coding models and agents (data, pre and post-training, etc).

key highlights:

> small LLMs can beat proprietary giants
RL (RLVR specifically) gives small open-source models an edge over big models in reasoning. a 14B model trained with RLVR on high-quality verified problems can match the performance of OpenAI's o3.

> models have a hard time learning Python.
mixing language models during pre-training is good, but Python behaves different from statically typed languages. languages with similar syntax (Java and C#, or JavaScript and TypeScript) creates high positive synergy. mixing Python heavily into the training of statically typed languages can actually hurt because of Python's dynamic typing.

> not all languages are equal (coding scaling laws)
the amount of data required to specialize a model on a language drastically depends on the language. paper argues like C# and Java are easier to learn (less training data required). languages like Python and Javascript are actually more tricky to learn, ironically (you see AI most used for these languages :)

> MoE vs Dense (ability vs stability)
MoE models offer higher capacity, but are much more fragile during SFT than dense models. hyperparams in training have a more drastic effect in MoE models, while dense models are more stable. MoE models also require constant learning rate schedules to avoid routing instability.

> code models are "insecure" by default (duh)
training on public repos makes models learn years of accumulated insecure coding patterns. safety fine-tuning often fails to work much on code. a model might refuse to write a hate speech email but will happily generate a SQL-injection vulnerable function because it "works."

read the full paper:
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence (2511.18538)

reacted to ronantakizawa's post with 🔥 about 1 month ago

Post

2478

Introducing the japanese-trending-words dataset: a dataset consisting 593 words from Japan’s annual trending word rankings (流行語大賞) from 2006-2025. This dataset provides the top 30 words from each year and its meaning in Japanese and english. This resource is awesome for NLP tasks understanding recent Japanese culture and history.

ronantakizawa/japanese-trending-words

#japanese #japanesedataset #trending

reacted to grimjim's post with 👍🔥 about 1 month ago

Post

3307

Going forward, I will be adopting the term Magnitude-Preserving Orthogonal Ablation (MPOA) for my recent work in mitigating model damage from abliteration. The technique potentially unlocks reasoning capacity previously occupied with safety refusal processing.

For details, start here: https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration

Showcase results: grimjim/gemma-3-12b-it-norm-preserved-biprojected-abliterated (outperforms base instruct on UGI Leaderboard NatInt)

(The existing name, while technically accurate, was a bit of a mouthful.)

2 replies

·

reacted to ronantakizawa's post with 🔥 about 2 months ago

Post

2290

I built a demo on how to implement Cache-Augmented Generation (CAG) in an LLM and compare its performance gains to RAG (111 stars, 20 forks).

https://github.com/ronantakizawa/cacheaugmentedgeneration

CAG preloads document content into an LLM’s context as a precomputed key-value (KV) cache. This caching eliminates the need for real-time retrieval during inference, reducing token usage by up to 76% while maintaining answer quality.

CAG is particularly effective for constrained knowledge bases like internal documentation, FAQs, and customer support systems, where all relevant information can fit within the model's extended context window.

#rag #retrievalaugmentedgeneration

reacted to aufklarer's post with 🔥 about 2 months ago

Post

3266

Fine-Tuning Qwen3 Embeddings for product category classification on the Large-Scale Product Corpus

Language-models such as GPT, Llama, DeepSeek, Qwen trained with a filtered slice of Common Crawl. For e-commerce work, though, we can start with the Web Data Commons (WDC), the project by the University of Mannheim. It extracts web pages that carry some metadata and publishes the result as the Large-Scale Product Corpus (LSPC).

Search engines like Google reward pages that include detailed product markup, so merchants already populate their sites with SEO-friendly fields such as title, brand, GTIN, price — and, crucially, category labels. Thanks to these built-in annotations, the WDC Large-Scale Product Corpus arrives almost fully self-labelled. I used those labels to fine-tune Qwen3 Embedding with Low-Rank Adaptation (LoRA), code is available on github. The resulting 615 million-parameter checkpoint fits comfortably in limited GPU memory yet updates the model’s representation space, mapping raw product titles to six top-level categories with a macro-F1 of 0.836 (83.6 %).

More details: https://blog.ivan.digital/fine-tuning-qwen3-embeddings-for-product-category-classification-on-the-large-scale-product-corpus-3a0919506bc8

reacted to onekq's post with 🔥 about 2 months ago

Post

2856

The reaction on the QAT post is beyond expectations so below is my optimizer post as promised. But I found that I had lots of explanation to do about optimizer itself. So this post is actually a historical recount. The Muon optimizer (used by Kimi) post (coming very soon) can only continue after this.

https://huggingface.co/blog/onekq/adam-optimizer

If you know Adam(W) optimizer already, you can just skip and sorry for the wait. Otherwise, it should be a useful read.

reacted to ronantakizawa's post with 🔥 about 2 months ago

Post

3289

Reached 1000+ total downloads across my models and datasets! 🎉

Follow me for more @ronantakizawa

2 replies

·

reacted to codelion's post with 🔥 4 months ago

Post

6186

I recently worked on a LoRA that improves tool use in LLM. Thought the approach might interest folks here.

The issue I have had when trying to use some of the local LLMs with coding agents is this:

Me: "Find all API endpoints with authentication in this codebase"
LLM: "You should look for @app .route decorators and check if they have auth middleware..."

But I often want it to search the files and show me but the LLM doesn't trigger a tool use call.

To fine-tune it for tool use I combined two data sources:

1. Magpie scenarios - 5000+ diverse tasks (bug hunting, refactoring, security audits)
2. Real execution - Ran these on actual repos (FastAPI, Django, React) to get authentic tool responses

This ensures the model learns both breadth (many scenarios) and depth (real tool behavior).

Tools We Taught:
- read_file - Actually read file contents
- search_files - Regex/pattern search across codebases
- find_definition - Locate classes/functions
- analyze_imports - Dependency tracking
- list_directory - Explore structure
- run_tests - Execute test suites

Improvements:
- Tool calling accuracy: 12% → 80%
- Correct parameters: 8% → 87%
- Multi-step tasks: 3% → 78%
- End-to-end completion: 5% → 80%
- Tools per task: 0.2 → 3.8

The LoRA really improves on intential tool call as an example consider the query: "Find ValueError in payment module"

The response proceeds as follows:

1. Calls search_files with pattern "ValueError"
2. Gets 4 matches across 3 files
3. Calls read_file on each match
4. Analyzes context
5. Reports: "Found 3 ValueError instances: payment/processor.py:47 for invalid amount, payment/validator.py:23 for unsupported currency..."

Resources:
- Colab notebook https://colab.research.google.com/github/codelion/ellora/blob/main/Ellora_Recipe_3_Enhanced_Tool_Calling_and_Code_Understanding.ipynb
- Model - codelion/Llama-3.2-1B-Instruct-tool-calling-lora
- GitHub - https://github.com/codelion/ellora

reacted to codelion's post with 🔥 4 months ago

Post

5283

I wanted to share a technique that's been working really well for recovering performance after INT4 quantization.

Typically, quantizing the LLM to INT4 (unlike say INT8) for inference can incur some accuracy loss. Instead of accepting the quality loss, we used the FP16 model as a teacher to train a tiny LoRA adapter (rank=16) for the quantized model. The cool part: the model generates its own training data using the Magpie technique so no external datasets needed. This is critical because we want to remain as much as possible in the distribution of the model's natural responses.

Last year Apple's foundational models paper (https://arxiv.org/pdf/2407.21075) had proposed a similar technique and found "By using accuracy-recovery LoRA adapters with only rank 16, Alpaca win rate can be improved by 7-18%, GMS8K accuracy is boosted by 5-10%." (page 47).

We saw similar results on Qwen3-0.6B:

Perplexity: 2.40 → 2.09 (only 5.7% degradation from FP16 baseline)
Memory: Only 0.28GB vs 1.0GB for FP16 (75% reduction)
Speed: 3.0x faster inference than FP16
Quality: Generates correct, optimized code solutions

- Pre-trained adapter: codelion/Qwen3-0.6B-accuracy-recovery-lora
- GitHub repo: https://github.com/codelion/ellora

Happy to answer questions about the implementation or help anyone trying to replicate this. The key insight is that quantization errors are systematic and learnable - a small adapter can bridge the gap without negating the benefits of quantization.

Has anyone else experimented with self-distillation for quantization recovery? Would love to hear about different approaches!

reacted to CultriX's post with ❤️ 11 months ago

Post

2696

Final upgrade to the Multi-Agent Task Completion Space: CultriX/MultiAgent-CodeTask .

It now includes :
- a live stream of the progress being made on the task (see included video),
- The following components:
1. Automatic prompt optimization
2. An orchestrator deciding which agent to call dynamically including feedback from a human (human-in-the-loop)
3. A coding agent to complete the task
4. A code reviewing agent to iteratively provide feedback to improve the code generated by the coding agent until the code meets the required criteria after which it is approved.
5. A testing agent that tests the approved code or provides information on how to test it.
6. A documentation agent that provides documentation and a help message for the approved and tested code.

Gheorghe Chesler PRO

AI & ML interests

Recent Activity

Organizations

Gheorghe Chesler PRO

AI & ML interests

Recent Activity

Organizations

nightmedia's activity