This Mistral Small has FAR less knowledge than the last.

#5
by phil111 - opened

Mistral Small 2409 scored relatively high on broad English knowledge tests like English SimpleQA, and scored a respectable 82.6/100 on my broad English knowledge test.

However, v2501 is scoring notably worse.

This would be understandable if you decided to focus less on English, and more on the performance of multiple languages. However, your English MMLU scores went way up, so with v2501 you selectively trained on the tiny subset of popular English knowledge that overlaps the MMLU.

Qwen2.5 did the same thing earlier. That is, Qwen2.5 72b (and especially Qwen2.5 32b) has a high English MMLU score, yet a very low English SimpleQA score for its size (~8), and its score on my broad English test dropped from 85.9 to 68.4 (Qwen2 72b to Qwen2.5 72b). Mistral Small v2501 is seeing a similar drop from v2409.

Thanks for the feedback, can you post your knowledge test here or upload it somewhere so that we can test it?

@patrickvonplaten Generally speaking, pop culture information (TV shows, movies, music, games, sports...) is more scrambled and weakly held in v2501 vs v2409, including at temp 0.

For example, "Which two actresses played the two ex-wives of Alan Harper in Two and a Half Men?"

2501 "Melanie Lynskey played Rose, the first ex-wife of Alan Harper.
Jenny McCarthy played Judy, the second ex-wife of Alan Harper."

2409 "The two actresses who played the ex-wives of Alan Harper in "Two and a Half Men" are:
Judith Harper (Alan's first wife) - Played by Marin Hinkle
Kandi (Alan's second wife) - Played by April Bowlby"

And when told to just list the main cast of the same show, it made a basic error for a main character ("Judith Harper - Marcia Cross"). 2409 never made such errors.

And with progressively less popular shows the rate of errors (relative to v2409) progressively increased.

For example, when asked about the cast of the popular Canadian show Corner Gas it said the main character's name was "Brent Loney" versus Leroy, reliably mismatched the character and actor names ("Karen Loney - Gabrielle Miller"), and so on ("Hank Yule - Jim Cuddy" vs Hank Yarbo - Fred Ewanuick).

But one very notable improvement in the new Mistral Small is instruction following. For example, when asked to end 8 sentences with the same word, v2501 ended all 8 with said word, while v2409 only ended 1 of 8 with the given word.

Lastly, unlike pop culture information, MMLU-style STEM information is held more strongly and retrieved more fully and accurately in v2501 than in v2409.

An example: "In astronomy, what's the name of the hypothetical object that forms when a neutron star merges with a red supergiant star, potentially forming a black hole without a supernova explosion?"

Response (2501): "The hypothetical object you're referring to is called a Thorne–Żytkow object (TZO).", while 2409 simply identified it as a "collapsar"

I mean, those are very obscure questions about fictional characters in a TV show where all the characters have ex-wives and the actors share some roles and names. You can't expect small(ish) local models to be that accurate for something this incredibly specific.

Out of curiosity, I asked the same question at Q5KL quantization with the Tekken7 instruct template and temp 0, using a very generic assistant system prompt (you're a helpful assistant, don't hallucinate, don't guess information you're not sure about... that kind of stuff), and got this instead:

In the television series "Two and a Half Men," the character Alan Harper has two ex-wives who are portrayed by different actresses. The first ex-wife, Judy, is played by Melanie Lynskey. The second ex-wife, Kandi, is played by April Bowlby. Both actresses contributed significantly to the character dynamics and comedic elements of the show.

Mistral team, please never prioritize knowledge of American sitcoms in a small model. I cannot imagine a more useless application of resources. Great release. Thank you.

@phil111 thanks for providing this example - that's very interesting!

I think I can roughly reproduce your answer using vLLM. One thing that's important to understand with Mistral-Small-3 is that the chat completion template was changed to include a system prompt (and it was trained on following system prompts) => hence it's more sensitive to the system prompt (or the lack thereof).

To reproduce

You can spin up vLLM as follows:

vllm serve mistralai/Mistral-Small-24B-Instruct-2501 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice

and then ping it.
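For example, a quick sanity check that the server is up and serving the model (a minimal sketch; it assumes the server is reachable at vLLM's default localhost:8000, so adjust the host to your setup):

import requests

# vLLM exposes an OpenAI-compatible API; /v1/models lists the served model(s).
resp = requests.get("http://localhost:8000/v1/models", headers={"Authorization": "Bearer token"})
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])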

General system prompt

Generally, make sure to always include a system prompt to get the best answers; I'd recommend something like the system prompt below. We also generally recommend temperature=0.15. So for any general-purpose task you can use something like the following:

import requests
import json

# OpenAI-compatible chat completions endpoint served by vLLM
url = "http://<your-node>/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}

model = "mistralai/Mistral-Small-24B-Instruct-2501"

system_prompt = """You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.
When you're not sure about some information, you say that you don't have the information and don't make up anything.
If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \"What are some good restaurants around me?\" => \"Where are you?\" or \"When is the next flight to Tokyo\" => \"Where do you travel from?\")"""

prompt = "Which two actresses played the two ex-wives of Alan Harper in Two and a Half Men?"

messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {
        "role": "user",
        "content": prompt,
    },
]

data = {"model": model, "messages": messages, "temperature": 0.15}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["content"])

Here the model gives me:

In the television series "Two and a Half Men," the character Alan Harper has two ex-wives. The actresses who played these roles are:

1. **Judith Harper** (Alan's first wife) was played by **Patricia Heaton**.
2. **Kandi Harper** (Alan's second wife) was played by **April Bowlby**.

These actresses portrayed the ex-wives in different seasons of the show.

which is still wrong (Patricia Heaton is not correct), but it's much closer I'd argue.

Tailored system prompt

Now, if you have a specific use case, such as "movie" knowledge, you can tailor the model much more for this by:

  • a) changing the system prompt
  • b) setting temp = 0.0

E.g. the following:

system_prompt = """You are Movie-GPT and are very knowledgeable about any movie and TV Show.

When you're sure about some information, first think step-by-step and give relevant context before answering.
When you're not sure about some information, you say that you don't have the information and don't make up anything."""
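For completeness, here's a minimal sketch of how this slots into the script above (it just reuses the url, headers, model, and prompt variables defined there; only the system prompt and the temperature change):

# Same request as before, with the tailored system prompt and temperature=0.0.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
]
data = {"model": model, "messages": messages, "temperature": 0.0}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["content"])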

With this system prompt and temperature=0.0, the model gives the correct answer:

To determine which two actresses played the two ex-wives of Alan Harper in the TV show "Two and a Half Men," let's break it down step-by-step:

1. **Identify Alan Harper's Ex-Wives**: In "Two and a Half Men," Alan Harper has two ex-wives. The first is Judy, and the second is Kandi.

2. **Find the Actresses**: We need to identify the actresses who portrayed these characters.

   - **Judy Harper**: Judy is Alan's first ex-wife and the mother of his son, Jake. She is portrayed by **Marin Hinkle**.

   - **Kandi Harper**: Kandi is Alan's second ex-wife. She is portrayed by **April Bowlby**.

Therefore, the two actresses who played the two ex-wives of Alan Harper in "Two and a Half Men" are **Marin Hinkle** (Judy Harper) and **April Bowlby** (Kandi Harper).

=> Obviously we can't keep adapting the system prompt to a task until the model gives the correct solution, but I'm showing it here to illustrate how much the system prompt can influence the model (in some sense the knowledge is still "stored" somewhere).

Will give it a try now with mistral-small-2409 (would honestly be surprised if mistral-small-2409 confidently gives the correct answer here).

Update: Interesting, you're right @phil111 - mistral-small-2409 always confidently gives the correct answer here! That's interesting!
Does your test set only contain "Movie" knowledge or does it also include other topics / areas of "knowledge"? Would be very interesting to see if this also happens for other "niche" knowledge topics.

I am curious whether this holds as true for the base model, though we lack an open base model of 2409 as a point of comparison. (That would help test whether the regression comes from forgetting in the instruction-tuning pipeline specifically.)

@SerialKicked But they aren't obscure questions. I only ask about the most popular of pop culture that countless millions know the answers to, and L3.1 70b scored ~90/100, while Sonnet 3.5 & GPT4o got near-perfect scores. It's a vastly easier test than the English SimpleQA.

More importantly, small models like L3.1 8b and Gemma 2 9b scored around 70/100, and the last Mistral Small scored 82.6, so models of this size already proved that they can know this information.

At the very least, far more people know and care about my pop culture questions than the esoteric and theoretical STEM questions this Mistral Small got right (e.g. the Thorne–Żytkow object, TZO).

There have only been dozens of shows as popular as Two and a Half Men, which ran for 12 seasons, so there's no excuse for an AI model, especially a 22b one, to hallucinate like mad at temp 0 when asked basic questions about humanity's most popular information (pop culture). And they didn't use to. Qwen2 72b scored 85.9/100, Mistral Small 2409 82.6, Mixtral 8x7b 86.7, and so on. It's only after boosting select test and task scores (e.g. MMLU and coding) that the general knowledge of LLMs started to tank relative to their previous releases. Please stop overfitting.

Please stop overfitting

I think this shows that some popular culture information is difficult to capture; that is, this information is often not presented in a form that passes the corpus screening stage, and it is therefore filtered out during the more sophisticated training corpus design stage.
This filtering reflects the staff's view of what a "good corpus" is. While it improves model capabilities, it inevitably introduces bias and negative side effects. This is a phenomenon that always exists and is exacerbated as more emphasis is placed on training corpus design.
This is completely different from overfitting.

To clarify, I only take issue with this particular question. I don't know the rest of your test (I just hope it's more diverse than that). It's a show where many characters are divorced, remarried, and so on, with actor and character names that are similar and easy to mix up. Even the Wikipedia page (where the model most likely took its information from during training) is a barely legible mess of unsorted information. It's a miracle a human with no prior knowledge of the show could retrieve the information accurately, let alone a personal-use language model.

It's a much harder question to answer than you seem to think it is. Hence why, even if the model can respond correctly as demonstrated by patrick, any change in inference settings or prompting method changes the LLM's response dramatically in this case (while if you ask what the capital of France is, you can change the prompt and inference settings to your heart's content and still get the correct answer).

Point is, for this kind of specialized knowledge (and it is specialized: independently of how popular a show is, knowing who played whom and when is niche), using or making a dedicated knowledge-specific fine-tune sounds like the proper way to go about it.

At the very least, far more people know and care about my pop culture questions than the esoteric and theoretical STEM questions this Mistral Small got right (e.g. the Thorne–Żytkow object, TZO)

The average person is not running and finetuning open-source models locally, they use ChatGPT instead.

@jth01 Sure, when space is limited prioritizing high value information is reasonable. But above 7 billion parameters there's room for both academic (MMLU) and popular knowledge (pop culture), as proven by Llama 3.1 8b & Gemma 2 9b.

I believe the broad adoption of open source AI models won't happen if the community de-prioritizes, or even excludes, humanity's most popular information in favor of preferred areas of knowledge. Believe it or not, knowing the main cast of a very popular TV show that ran for 12 years is far more broadly valuable, regardless of the AI model's size, than knowing about esoteric theoretical astronomical objects like Thorne–Żytkow objects (TZOs).

Also, once you start picking and choosing what to include, very bad things reliably start to happen. For example, Chinese models start excluding very popular facts that the CCP opposes, and Western models like Phi refuse to include salty language, impolite jokes at the expense of others... things nearly every adult regularly shares.

All of humanity's most popular knowledge simply needs to be included and trained on equally, even if it means an MMLU score that's 10 points lower. The elitist attitude that basic information about an extremely popular show that ran for 12 years shouldn't be included in a 22-billion-parameter model so that it can do things like math a little better, yet still far too unreliably to be trusted, is insane to me.

phil111 changed discussion status to closed

@phil111 I'd consider reopening the discussion. You've got the attention of the Mistral team now. It's maybe too late to fix this regression in the model, but not too late to fully understand the issue and let others report more findings. I think it actually matters here and could help inform future pretraining runs. Besides, despite custom system instructions the issue still isn't resolved, so your point still stands.

@nlpguy The issue seems to be the scrambling of general knowledge as models are continuously trained to perform better at tasks like coding and math. The final score of Mistral Small v2501 is 75.4, compared to 82.6 for v2409.

While a notable drop, that still leaves it tied for first (in general English knowledge) with the latest Command R 34b (75.1) in the 34b-and-under size range, followed by Gemma 2 27b at 71.0. Chinese models like Yi 1.5 34b & Qwen2.5 32b, despite their high English MMLU scores, score far lower, at 55.5 and below (Yi 1.5).

Also, the drop from Qwen2 to Qwen2.5 (85.9 to 68.4), and a SimpleQA score down to only ~8/100 despite its ample 72b parameters, was a much larger, and unforgivable, regression in general English knowledge relative to English MMLU score.

Lastly, since Mistral asked, the regression applies across the board with pop culture, but it kicks in later with movies than with TV shows. For example, The Fifth Element cast ("Luke Perry as Butcher" vs the previously correct Billy). And with the least popular movie tested (Four Rooms) it makes far more cast errors, including listing random actors like Johnny Depp (scoring 1.32/4 vs 2.68/4 on said question). As for music, song-to-artist linking is less accurate, as is recalling the main lyrics to very popular songs like Madonna's Like a Virgin, which makes identifying a song you heard, or that's stuck in your head, far less reliable with the latest Mistral Small.
