Whimsical Waffle: The Curious Case of LLMs and Their Linguistic Shenanigans
yay
Actually, to get the DRYRUN test, all we would have to do is to get rid of the MAP_POPULATE in:
mmap(NULL, 31937041504, PROT_READ, MAP_SHARED|MAP_POPULATE, 4, 0) = 0x7ff79c600000
Because I think with the right switches, we can otherwise avoid touching the memory (alternatively, map /dev/null). Of course, the measurements allowed by DRYRUN are much more worthwhile. Basically, it's the killer feature if we could make it available and it turns out to be feasible. That's the really interesting (to me) todo point: create a script that downloads the gguf header only from huggingface and recreates a dummy gguf. Too bad the gguf file format is so badly designed - you have to decode the whole header incrementally to know how long it is.
(using fuse to mount a file via https is cheating)
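For reference, a minimal sketch (Python) of the kind of script meant above: it walks the GGUF metadata over HTTP, fetching more bytes with Range requests only when the incremental decoder runs out. The URL handling, chunk size and the assumption of a GGUF v2/v3 header (64-bit counts) are mine, not confirmed details from this thread:

import struct, urllib.request

# struct codes for the GGUF primitive value types (type 8 = string, 9 = array, handled below)
GGUF_TYPES = {0: 'B', 1: 'b', 2: 'H', 3: 'h', 4: 'I', 5: 'i', 6: 'f', 7: '?', 10: 'Q', 11: 'q', 12: 'd'}

class RangeReader:
    def __init__(self, url, chunk=1 << 20):
        self.url, self.chunk, self.buf, self.pos = url, chunk, b"", 0
    def need(self, n):
        while len(self.buf) < self.pos + n:            # fetch another chunk lazily
            start = len(self.buf)
            req = urllib.request.Request(self.url, headers={
                "Range": f"bytes={start}-{start + self.chunk - 1}"})
            data = urllib.request.urlopen(req).read()
            if not data:
                raise EOFError("server returned no data for range request")
            self.buf += data
    def take(self, fmt):
        size = struct.calcsize("<" + fmt)
        self.need(size)
        val = struct.unpack_from("<" + fmt, self.buf, self.pos)[0]
        self.pos += size
        return val
    def take_str(self):
        n = self.take("Q")                             # 8-byte length prefix + UTF-8 bytes
        self.need(n)
        s = self.buf[self.pos:self.pos + n].decode("utf-8", "replace")
        self.pos += n
        return s
    def take_value(self, t):
        if t == 8:
            return self.take_str()
        if t == 9:                                     # array: element type + count + elements
            et, cnt = self.take("I"), self.take("Q")
            return [self.take_value(et) for _ in range(cnt)]
        return self.take(GGUF_TYPES[t])

def read_gguf_metadata(url):
    r = RangeReader(url)
    assert r.take("I") == 0x46554747                   # magic "GGUF" read as a little-endian uint32
    version, n_tensors, n_kv = r.take("I"), r.take("Q"), r.take("Q")
    meta = {}
    for _ in range(n_kv):
        key = r.take_str()
        meta[key] = r.take_value(r.take("I"))
    return version, n_tensors, meta

This only covers the KV metadata; the tensor-info records (and hence the full header length) follow immediately after, which is exactly why the length cannot be known up front.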
btw., in the case of blacksheep, i take the lists of quants done from the "quantize" script and patch the job like this:
"iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M small-IQ4_NL Q4_K_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS Q4_1 IQ3_S",
and for the jais models, for example, I removed the *0, *1 and IQ4_NL quants, essentially:
"squants": "x-f16 Q4_K_S Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ4_XS",
"iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M Q4_K_M IQ2_M Q6_K IQ4_XS Q2_K_S IQ1_M Q3_K_S IQ2_XXS Q3_K_L IQ2_XS Q5_K_S IQ2_S IQ1_S Q5_K_M IQ3_XS IQ3_S",
it's in theory possible to do this when adding the job (not via llmc, because reasons), but that requires us to predict with some accuracy that this will happen, so it's rarely useful
Actually, to get the DRYRUN test, all we would have to do is to get rid of the MAP_POPULATE in:
mmap(NULL, 31937041504, PROT_READ, MAP_SHARED|MAP_POPULATE, 4, 0) = 0x7ff79c600000
I'm a bit confused. Dryrun doesn't even use mmap. I explicitly disable it and even print "mmap is not supported for dry-run so it is now disabled" as a warning if you don't specify --no-mmap. Why would you even want mmap for dry-run? You are not allocating any memory when loading the model, so what would be the point of it?
Because I think with the right switches, we can otherwise avoid touching the memory (alternatively, map /dev/null).
What do you mean by touching memory? No additional RAM or GPU memory should get allocated when loading a model. Obviously llama.cpp requires some memory to function like any application, but that is so little it can be ignored.
Of course, the measurements allowed by DRYRUN are much more worthwhile. Basically, it's the killer feature if we could make it available and it turns out to be feasible. That's the really interesting (to me) todo point: create a script that downloads the gguf header only from huggingface and recreates a dummy gguf. Too bad the gguf file format is so badly designed - you have to decode the whole header incrementally to know how long it is.
I don't think the header can be that big so you can likely just download enough for the full header to always be present.
btw., in the case of blacksheep, i take the lists of quants done from the "quantize" script and patch the job like this
"iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M small-IQ4_NL Q4_K_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS Q4_1 IQ3_S"
I assume you are setting this inside llmjob edit.
Wouldn't the scripts synchronize when it is available again?
Altogether it's 3GB, not just scripts, but also, of course, llama.cpp. I added a hack so when removing the disable flag it will sync automatically, but I also update llama.cpp from home, and every node has a different combination of llama.cpp variants (probably the easiest way around is to change that).
But, yeah, that's not effectively automatable.
Yes, even for me it would now be inconvenient to switch, as I memorized the path so well.
embrace the difference :)
Oh, let's hope for the best. No imatrix failure so far but a lot of imatrix tasks will only be started at 22:00 due to most of them currently being timeofday blocked.
I am pretty sure the dryrun test works - the only way it could fail is if it somehow succeeds despite the model being broken. Likely there are some tests in llama.cpp that are only done at inference time; the question is how many, and are they important :) We will find out.
Just so you know, DRYRUN is supposed to work with every llama.cpp executable that loads a model, so you are not limited to llama-cli.
To... some extent (i.e. tracking allocations)? Surely you have not found a generic way to exit all of these at just the right time.
Then just don't use llama-cli but any other one that doesn't do this.
Haha, "just". Love it :) Anyway, are there any? There is the server, but the server seems to do the same thing.
Nice. No idea why everyone keeps renaming their models, but us having a different name makes our models hard to find, so automated renames would be quite useful.
They rename it because they want to be able to erase it and create a different one without having to come up with a new final name, in case it sucks. Models are also regularly moved, and sometimes even apparently cloned, to other users.
It does make them harder to find, but at least I stopped using the search function by hf and started to use the quantisations link.
That would be amazing! There are quite a lot of factors that influence vram usage but maybe you can find a pattern by playing around with dryrun.
I would allow the user to specify VRAM for 0, 1 or 2 gpus, tensor split, some flags like flash attention, and then probably do a binary search to find the maximum -ngl value.
models always show the date when they were last updated
You'll have to check quant file dates anyway if you need some kind of date. And then, it's pretty useless.
I guess we can at least try to update them in chronological order, so the order stays the same. Or can we?!?
The updates would almost certainly go from newest to oldest, even (or rather, reverse order in how hf lists them for me), with some randomness.
GIT_COMMITTER_DATE and GIT_AUTHOR_DATE environment variables before committing using git
If I can't do it via the api it will not happen. Messing in scripts with git will be a disaster. Besides, will the server-side git really just accept any client-side garbage date when pushed?
as this will hopefully be the last time we ever edit all of them.
The other a-ha moment I had last week was when I realised that this is the problem and it must give. I have versioned the model cards now, so we can keep any number of different compatible card formats and update at our own pace.
I don't think with us publishing 100+ repos a day anybody would care about 20000 updates even per day.
I'm a bit confused. Dryrun doesn't even use mmap. I explicitly disable it and even print "mmap is not supported for dry-run so it is now disabled" as a warning if you don't specify --no-mmap. Why would you even want mmap for dry-run? You are not allocating any memory when loading the model, so what would be the point of it?
I was talking about an alternative way to achieve just the validity testing without changing llama.cpp. It's entirely hypothetical.
I don't think the header can be that big so you can likely just download enough for the full header to always be present.
The header is pretty massive - tiny if you look at the whole file, but still many megabytes in size, enough to warrant an optimisation. My first computer had ~100 octets of usable memory. I saw amazing software written in 20k of memory. When I see a bash process using 2MB of RAM I regularly get dizzy.
Anyway, gguf is very wasteful, for example, every vocabulary entry is 8 bytes string length + string. Also, "likely enough" means you still have to be prepared for it to not be enough in edge cases.
And to be honest, what worries me most is that aws typically charges for the full file even if only a few bytes of it are being downloaded. But since the gguf parse on the hf page exists, I am sure it doesn't matter :)
To... some extent (i.e. tracking allocations)? Surely you have not found a generic way to exit all of these at just the right time.
It should work for the majority of them. Almost all executables that load a model use the same code to do so. I just tested llama-imatrix, llama-perplexity, llama-simple, llama-simple-chat and llama-run, all of which were fully compatible with DRYRUN despite me never testing them before. It's not just that they work - they also tell you how much memory would be required to load the model in a way that fulfills their purpose, as they essentially just load the model with the exact parameters they require.
Haha, "just". Love it :) Anyway, are there any?
No idea. Try the ones I mentioned above and if they all do it, then this is likely something in the model loading code, in which case I can take a look at the code and see if we can change this.
I would allow the user to specify VRAM for 0, 1 or 2 gpus, tensor split, some flags like flash attention, and then probably do a binary search to find the maximum -ngl value.
That would be so awesome. This is actually exactly what I'm currently using DRYRUN for myself.
Keep in mind that DRYRUN only tells you the memory required to load the model and allocate enough memory for its context. Memory used during inference for things like attention is not considered but is easy to estimate. In fact, more memory is required to load a model if flash attention is enabled due to additional overheads associated with its implementation.
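A minimal sketch of what that search could look like, assuming the DRYRUN build prints per-backend allocation lines in the form "[DRYRUN][CUDA0]: <bytes>" (only CPU/PINNED lines appear in the logs quoted in this thread, so the CUDA tag is a guess) and exits non-zero when the model fails to load; the helper names, the flag set and the 10% safety margin for inference-time memory are my assumptions:

import os, re, subprocess

def dryrun_vram(gguf: str, ngl: int) -> int:
    # run a dry-run load and sum the reported GPU buffer sizes (assumed log format)
    proc = subprocess.run(
        ["llama-cli", "-m", gguf, "-ngl", str(ngl), "--no-warmup", "-n", "0", "-no-cnv", "-st", "--no-mmap"],
        env={**os.environ, "DRYRUN": "1"}, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError("dry-run failed to load the model")
    text = proc.stdout + proc.stderr
    return sum(int(n) for n in re.findall(r"\[DRYRUN\]\[CUDA\d+\]: (\d+)", text))

def max_ngl(gguf: str, vram_bytes: int, n_layers: int, margin: float = 0.9) -> int:
    # binary search: VRAM use grows monotonically with the number of offloaded layers
    lo, hi, best = 0, n_layers, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if dryrun_vram(gguf, mid) <= vram_bytes * margin:
            best, lo = mid, mid + 1        # fits: try offloading more layers
        else:
            hi = mid - 1                   # does not fit: offload fewer layers
    return best

The margin is there precisely because of the caveat above: DRYRUN covers loading and context allocation, not the memory allocated during inference itself.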
If I can't do it via the api it will not happen. Messing in scripts with git will be a disaster.
Totally understandable.
will the server-side git really just accept any client-side garbage date when pushed?
All git servers seem to. Git servers kind of trust client-side garbage by design. I had to spoof dates/names/emails for author/committer so many times in the past and not once had a git server refuse the commit. The only thing I'm not sure about is whether HuggingFace uses the time in the git commit like GitHub/GitLab do or the server time of the push. Now I'm a bit curious, so the next time I upload a model I might try it.
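A minimal illustration of the spoofing described here (the repo path, message and date are placeholders); git takes both timestamps from the environment, and, as described above, the server just stores whatever the commit object says:

import os, subprocess

fake = "2024-01-15 12:00:00 +0000"
env = {**os.environ, "GIT_AUTHOR_DATE": fake, "GIT_COMMITTER_DATE": fake}
subprocess.run(["git", "commit", "--allow-empty", "-m", "backdated commit"],
               cwd="/path/to/repo", env=env, check=True)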
The other a-ha moment I had last week was when I realized that this is the problem and it must give. I have versioned the model cards now, so we can keep any number of different compatible card formats and update at our own pace.
I don't think with us publishing 100+ repos a day anybody would care about 20000 updates even per day.
Yes it should be fine unless we hit some kind of rate limit.
The header is pretty massive - tiny if you look at the whole file, but still many megabytes in size, enough to warrant an optimization. My first computer had ~100 octets of usable memory. I saw amazing software written in 20k of memory. When I see a bash process using 2MB of RAM I regularly get dizzy.
My first "Gameboy" which in fact was a Voyage 200 calculator for school had 188 kB RAM and 2,7 MB ROM and it was enough to play all kind of games. I even had something like Maro Maker on there. I actually had that Voyage 200 calculator 5 years before I had my first mobile phone and used it from everything from reading, writing, programming and gaming.
In case you wonder, my first PC ran Windows 2000 with 13 GB of HDD storage and I think 128 MB of RAM. My first programming language was BlitzBasic to write PC games, followed by Compact-C, which I used to program C-Control Pro microcontrollers with 2 KB of usable RAM, 10 KB of usable flash storage, 1 KB of EEPROM and a 14.7456 MHz CPU, so I know the feeling.
Anyway, gguf is very wasteful, for example, every vocabulary entry is 8 bytes string length + string.
That is indeed terribly wasteful. One byte would have been enough.
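To make the waste concrete, a tiny illustration of the layout being complained about (the token and vocabulary size below are made-up numbers):

import struct

def gguf_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return struct.pack("<Q", len(data)) + data   # uint64 length prefix + raw bytes

print(len(gguf_string("hello")))                 # 13 bytes, 8 of which are just the length
print(150_000 * 8 / 1e6, "MB of length prefixes for a 150k-token vocabulary")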
Also, "likely enough" means you still have to be prepared for it to not be enough in edge cases.
Which should be fine, as llama.cpp was nice enough to put stupid limits everywhere, so most edge cases likely already failed when we tried converting them into GGUF.
And to be honest, what worries me most is that aws typically charges for the full file even if only a few bytes of it are being downloaded. But since the gguf parse on the hf page exists, I am sure it doesn't matter :)
S3 only charges for the actually used bandwidth as far as I'm aware. So if you only download the first 10 MB, HuggingFace should only be charged for 10 MB. They do charge a very low amount per 10K API calls, but this doesn't matter at all as we only have around 500K quants. I'm mostly worried that HuggingFace might be using intelligent tiering, in which case us accessing all the quants might cause them to be copied into hot storage, which would then cost them the transfer fee plus 30 days of hot storage. But in any case, there is not much we can do about any of this unless we find a storage usage pattern and can tell, based on one quant, how much all the others require, which I think might be possible.
Memory used during inference for things like attention is not considered but is easy to estimate. In fact, more memory is required to load a model if flash attention is enabled due to additional overheads associated with its implementation.
That's a bummer then... So how would you easily estimate it? And what do you mean by more being required to "load" a model - after loading, flash attention surely uses less memory.
Yes it should be fine unless we hit some kind of rate limit.
That doesn't worry me either - I envisaged some kind of bulk update because I thought versioning the readmes was a bad idea. But I changed my mind. If we hit a rate limit, it will take a few years to update old repos - so what.
Voyage 200 calculator for school
I got the first HP 48SX in Germany (or so I was actually told by HP). Sigh. HP calculators... were so nice...
Windows 2000
Wow. That is so long after I had switched to GNU/Linux. (I switched from DOS to Linux just before win 3 became ubiquitous (in 1994, with 1.0.2 or something - I was even late to the game, or so it felt))
That is indeed terribly wasteful. One byte would have been enough.
Yeah, or a 4-octet (or even 8-octet) header length + json/msgpack/cbor/... And yes, one octet would be enough if you limit strings to 127 octets, but to be fair, that's a limit of the encoder, not a limit of the format.
I'd say whoever designed it (well, gerganov) was probably paranoid about running into arbitrary 4GB limits anywhere. Puzzlingly enough, though, the primitive type numbers (there are 13) are stored in 32-bit ints. And no, everything is just octet-aligned, so it's nothing to do with that.
To its defence, the gguf decoder I wrote in Perl is just 80 lines of code. So in that sense, it lends itself to a very simple implementation. But using an existing JSON decoder with that header would just be 3 lines or so...
I think ggerganov has a major fear of external dependencies - even more than me, and I thought I was a bit on the extreme side.
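(For comparison, the "3 lines or so" hinted at above, against a hypothetical format that stores a fixed-size length followed by a JSON blob - not what GGUF actually does:)

import json, struct

with open("model.header", "rb") as f:
    (length,) = struct.unpack("<Q", f.read(8))   # 8-octet header length
    metadata = json.loads(f.read(length))        # hand the rest to any JSON decoder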
S3 only charges for the actually used bandwidth as far I'm aware.
I admit I am no expert, but it seems to be a well-known attack to request only part of a large file and get billed with much larger transfer costs, because AWS does not bill octets downloaded but octets prepared for download, regardless of how much actually was used (or even requested). So yes, only actually used bandwidth, but it's their internal fantasy made-up bandwidth, not the external customer-measurable bandwidth. It is possible that it only affects some S3 storage products, but it's a concern. Well, it's not a concern, because huggingface does it themselves, and I am happy to cache things...
S3
And don't they also bill GET requests? So there must be some optimal transfer size - probably in the megabyte range?
Sooooo, DRYRUN gives me an error message for a failed model, but exit status is 0:
load_tensors: loading model tensors, this can take a while... (mmap = false)
llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' has wrong shape; expected 5120, 152064, got 5120, 151665, 1, 1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'Methuselah-PSR-B1620-26b-14B-Exp.gguf'
main: Dryrun compleated!
changed the test to this:
if DRYRUN= llama llama-cli -m "$SRC.gguf~" --no-warmup -n 0 -t 1 -no-cnv -st </dev/null 2>&1 | tee -a /dev/fd/2 | grep -q ": failed to load model"; then
That's a bummer then... So how would you easily estimate it? And what do you mean by more being required to "load" a model - after loading, flash attention surely uses less memory.
DRYRUN tells you how much memory you need to load a model and reserve the memory required for its context. So if you have as much memory as DRYRUN tells you, you will be able to load the model. However, depending on context and prompt, you might still OOM during inference, as some memory is allocated during inference for algorithms like attention. The memory required for attention should more or less be the same for a given context with a given attention method. So you can likely measure it once and add it onto what DRYRUN tells you is required to load the model. Flash attention needs more memory during the initial load, but the attention algorithm itself uses linear instead of quadratic memory for a given context, which for large contexts should be more memory efficient.
That doesn't worry me either - I envisaged some kind of bulk update because I thought versioning the readmes was a bad idea. But I changed my mind. If we hit a rate limit, it will take a few years to update old repos - so what.
The limit can't be so bad that it will take years. We should try to update them in a reasonable timeframe as the current model card isn’t that good in my opinion.
And don't they also bill GET requests? So there must be some optimal transfer size - probably in the megabyte range?
They do, but it is $0.0004 per 1,000 requests, so if we need 500K of them that is $0.20, which is so low it's almost not worth mentioning.
HuggingFace will be fine:
"There are no retrieval charges in S3 Intelligent-Tiering. If an object in the infrequent access tier is accessed later, it is automatically moved back to the frequent access tier. No additional tiering charges apply when objects are moved between access tiers within the S3 Intelligent-Tiering storage class."
So if they use Intelligent-Tiering they are not getting charged for AWS being stupid, besides paying slightly more for files being in less cold storage for 30 days, which is almost nothing compared to what retrieval charges would be.
In case you wonder, transfer from S3 to Europe (Zurich) is $0.02 per GB, and nothing if it only goes to Amazon CloudFront (which has its own billing for bandwidth). Based on their website, they really seem to only count data that is actually sent to the internet, and intelligent tiering has no retrieval fee, so they really shouldn't bill for the data we don't download unless they found some kind of loophole to trick their customers.
But in any case, there is nothing we can do about any of this.
Sooooo, DRYRUN gives me an error message for a failed model, but exit status is 0
That's so stupid. Sorry for this mistake. I forgot about that. I will fix it this evening.
changed the test to this
This will work in the meantime.
DRYRUN tells you how much memory you need
I realise what you mean. I guess it can also be handled by telling the user to reduce ngl a bit when in doubt. It will still be far more useful than the manual trial runs I have to do now.
The limit can't be so bad that it will take years.
I meant to say "even if it takes a few years..." and I didn't expect the repo create limit to be as bad as it is. Or erratic(?) - it still feels weird to get rate limited sometimes, even when we don't crunch through lots of models.
S3
Thanks, helps a lot :)
This will work in the meantime.
We are not in a hurry - assuming that we always get "failed to load model". Eh, even if it did not, it'd still be a great improvement :)
model page
Well, my plan is to get rid of graphs and everything but the download table and the links, and also more or less fully generate the page and move all metadata to yaml. The only hindrance is that it is a lot of work, and even a single misplaced space or fixed typo will cause havoc :) Just not so much fun. But I am slowly working towards making it doable (and gaining motivation by not forcing myself to work on it :)
If you have any concrete input (text fragments, layout) on the model page, I am happy to collect it. The general trend, though, should be to move as much of the info to the external model page, so there is only one place to improve. Unfortunately, the model download page already needs revamping, too, and already goes too much into the direction of web development for my taste :)
Sooooo, DRYRUN gives me an error message for a failed model, but exit status is 0:
This should now be fixed in the latest version. I kind of forgot about llama.cpp sometimes using exceptions to jump out of heavily nested functions, skipping all the code that would otherwise get executed by following the normal return path. I personally don't really like throwing exceptions somewhere and handling them in a completely different location - it feels like a modern version of goto, but without labeling where it jumps to.
I fixed this by adding a dedicated exit point for dry-run inside common.cpp to no longer mess with llama.cpp's exception handling and removing all modifications from main.cpp. This now ensures exceptions skip past the dry-run dedicated exit point and are instead properly handled by main.cpp.
I also updated the mradermacher branch to latest llama.cpp so we now have Gemma 3 and experimental Gemma 3 vision support.
You guys might find this interesting: https://arxiv.org/abs/2503.03592
Quote from conclusion:
Further, the usage of importance matrices written in non-English does not significantly improve performance on non-English datasets and might in fact slightly harm it. However, this reduction in performance is not statistically significant.
You guys might find this interesting: https://arxiv.org/abs/2503.03592
Thanks a lot for sharing! I looked at the paper and am really surprised by the result. Their testing methodology looks clean and the results tell quite a clear story. This means our primary English imatrix dataset is much better for non-English models than we thought. I now regret having non-English models only queued for static quants.
@nicoboss I assume you queued all/most of the nice lint check models that all fell through the llama loading code check? :)
Here are all the errors (deduplicated), and they do all seem legit (and therefore I have nuked them):
/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
/llmjob/llama.cpp-cuda512/src/llama.cpp:8666: GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor") failed
I suspect these are all pure embeddings and therefore can't be used with llama-imatrix?
regarding the paper, it's one of the results I expected (either it's no big deal, because a lot about imatrix training data seems irrelevant, or it has a big effect). But finally I can choose between these extremes!
I also feel much better about my training data now, which is pretty incoherent. But given that random tokens seem to work relatively fine, it would actually be surprising if it were so detrimental.
The question is, what does that tell us about how LLMs store knowledge? And how about IQ quants, which are far, far more sensitive to imatrix weights?
@tdh111 anyway, very much appreciated, I would have never seen this paper without you
I assume you queued all/most of the nice lint check models that all fell through the llama loading code check? :)
I queued quite a lot of trending models, some of which turned out to be bad. Those errors are all legit and can be nuked.
I suspect these are all pure embeddings and therefore can't be used with llama-imatrix?
Yes, exactly. I will improve my trending model discovery scripts to filter out embeddings in the next version (a rough sketch follows below). I will also check if there is a way dry-run can detect this. The main issue is that this is a check that occurs at inference time inside llama_decode_impl and not while loading the model.
The last 2 failures you can nuke if you want.
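A rough sketch of what such a filter could look like, using the architectures field in config.json as a heuristic (the listed architecture names are examples, not an exhaustive list, and this is not the actual discovery script):

import json
from huggingface_hub import hf_hub_download

EMBEDDING_HINTS = ("BertModel", "XLMRobertaModel", "NomicBertModel")

def looks_like_embedding(repo_id: str) -> bool:
    try:
        cfg = json.load(open(hf_hub_download(repo_id, "config.json")))
    except Exception:
        return False                             # no config.json: let it through
    archs = cfg.get("architectures") or []
    # encoder/embedding checkpoints usually expose a bare *Model architecture
    return any(a in EMBEDDING_HINTS or a.endswith("Model") for a in archs)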
https://huggingface.co/cl-nagoya/ruri-large-v2 likely requires manual GGUF conversion due to ModuleNotFoundError: No module named 'fugashi'
No idea why https://huggingface.co/google/flan-t5-xxl fails to download, but if the just-started redown fails I guess I will provide the GGUF manually there as well.
Edit: Never mind, cl-nagoya/ruri-large-v2 is likely an embedding as well, so I nuked it as we don't care about them.
Edit2: I think redown fixed flan-t5-xxl, so it must have just been some random HuggingFace download error.
Edit3: No, flan-t5-xxl failed again: ValueError: Missing or incomplete model files: ['model-00001-of-00005.safetensors']
anyway, very much appreciated, I would have never seen this paper without you
Thanks a lot for sharing!
Glad you both liked it.
The question is, what does that tell us about how llms store knowledge? and how about IQ quants, which are far, far more sensitive to imatrix weights?
Both of those are separate from the paper I linked, but this paper is relevant to your first question: https://arxiv.org/abs/2503.05613 .
Your second question about IQ quants is best answered by ikawrakow, who would most likely answer if asked in a discussion post in ik_llama.cpp. I feel like I know the answer, but I'm not confident enough to give it because I would rather not spread potentially wrong information. Now that you ask, though, I'm curious whether the same holds true for his new quant types (IQ_K), which at low bpw offer better performance than I-quants and at higher bpw offer better performance and quality compared to K-quants.
I will also check if there is a way dry-run can detect this.
Be careful - the more checks you add, or rather, move, the more you will diverge from future llama.cpp versions that might do things differently. There is a trade-off here, between catching more things and maybe blocking future roads.
some random HuggingFace download error.
Possible, but unlikely, as hfd retries pretty aggressively. When you open a (s) in audit, the download is printed (it's in MODEL/log, too). If it's a new model, a much more common failure mode is actually not-yet-uploaded files. For example, YOYO-AI loves to make elaborate model cards before actually uploading all files :/
I'm unexpectedly busy (and probably rather tired) for the next few weeks. I'll try to take care, but don't be alarmed if things get a bit more erratic.
also not caught:
llama_model_quantize: failed to quantize: key not found in model: llama.context_length
this is actually caught by quantize, so causes extra human work, but not extra computational work (it's caught during static jobs).
interesting that quantize even bothers...
and clearly, nice level 1200 is the junk class
How do I know if /tmp/quant/Samantha-1.11-70b-i1-GGUF/imatrix.dat is an old or new imatrix? I unfortunately nuked the existing imatrix repo before hash-comparing them. I checked Samantha-1.1-70b, which is basically the same case, and they were different, so I'm almost certain the imatrix for Samantha-1.11-70b got recomputed as well. It seems like the cases where, after a nuke, existing imatrices get copied only happen if they were somewhat recently generated, but not for these 1-year-old cases of repositories where static quants never even existed. In the future I will obviously use nukeall so none of this will be an issue.
and clearly, nice level 1200 is the junk class
I noticed this as well. I nuked so many errors this morning when I woke up. We had almost entire hosts filled with errors.
also not caught:
llama_model_quantize: failed to quantize: key not found in model: llama.context_length
This does get caught using dry-run. Not sure why you think it does not. I even tested one of the models that had this error today to confirm:
llama_model_loader: mmap is not supported for dry-run so it is now disabled
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 12.55 GiB (16.00 BPW)
llama_model_load: error loading model: error loading model hyperparameters: key not found in model: llama.context_length
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/nico/law-LLM.gguf'
main: error: unable to load model
I'm unexpectedly busy (and probably rather tired) for the next few weeks. I'll try to take care, but don't be alarmed if things get a bit more erratic.
No problem. Now that you gave me all these amazing tools and I got familiar with using them, I should be able to solve most of the issues myself, hopefully letting you focus as much on your job as possible. Just ignore things and only respond to what is important, to save time. I'm doing the same when I'm busy. Feel free to ignore user requests and audits as I can handle them myself.
Interesting, so nico2 has still not turned itself off despite the repository creation issue being long gone and it being 3 hours after 17:00:
nico2 nice size (static/imatrix) -- jobs 3/4-12 maxm 300 free 1479 budget 769 uploads 0 hfd 0 32c
1200 1 s Qwen2.5-0.5B-SFT-merge blocked/frozen/timeofday repo create (interrupting)
1200 15 s mistral-openorca-openplatypus-alpacagpt4 blocked/admin/paused
1200 17 s WEPO-llama-3-8b blocked/admin/paused
Regarding Snowflake Arctic Instruct, the source GGUF is under /apool/snowflake-arctic-instruct.gguf. We only want to regenerate the imatrix and the imatrix quants but keep the static quants. Before you add it you need to nukerepo https://huggingface.co/mradermacher/snowflake-arctic-instruct-i1-GGUF but only this one and NOT the static quants! You also need to archive the current imatrix quants WITHOUT using nukeall as we want to keep the static quants.
How do I know if /tmp/quant/Samantha-1.11-70b-i1-GGUF/imatrix.dat is an old or new imatrix?
If you are lucky, from the mtime. Otherwise, if we have a repo for it, we have it cached. If we don't have a repo for it, we shouldn't have it cached. In this case, it's old, though:
-rw------- 1 root root 4.6M Dec 31 15:16 imatrix-remote/Samantha-1.11-7b.imatrix
-rw------- 1 root root 7.2M Dec 14 13:59 imatrix-remote/Samantha-1.11-13b.imatrix
-rw------- 1 root root 25M Mar 10 08:47 imatrix-remote/Samantha-1.11-70b.imatrix
I assume I should remove it? (done)
If you are lucky, from the mtime. Otherwise, if we have a repo for it, we have it cached. If we don't have a repo for it, we shouldn't have it cached.
Thanks a lot! Can you in this case please check Samantha-1.1-70b as well and delete it if it is older than a few weeks? I have the feeling that it generated a new one despite https://huggingface.co/mradermacher/Samantha-1.1-70b-i1-GGUF existing, as the sha256 hash of the imatrix file I have locally doesn't match the one inside the HuggingFace repository.
@mradermacher
The status page and telnet 10.28.1.1 16732 have already been frozen for an hour, but llmc audit still works without any issue, which is strange - shouldn't it break as well if someone doesn't release the lock? I plan on executing llmc killall9 should the issue still persist in half an hour.
Interesting, so nico2 has still not turned itself off despite the repository creation issue being long gone and it being 3 hours after 17:00:
repo creation is currently not interruptible, and no, the status shows it's frozen, so still ongoing.
there are two issues here: a) jobs should no longer be frozen on nico1 - if a job somehow isn't finished before 17:00, it should probably continue to run and b) if a job is still active, it will not shutdown
i removed the cron rules causing job freezes. but repo creates would still keep it on indefinitely atm.
i also don't think snowflake will help us. I think we should push a few of the big 1400 jobs to prio 1200 or so per day. maybe. will have to see.
I couldn't queue this morning, this is probably the root cause.
-rw------- 1 root root 25M Mar 10 05:55 imatrix-remote/Samantha-1.1-70b.imatrix
moved away
if there already is a job i need to manually add the imatrix job
repo creation is currently not interruptible, and no, the status shows it's frozen, so still ongoing.
there are two issues here: a) jobs should no longer be frozen on nico1 - if a job somehow isn't finished before 17:00, it should probably continue to run and b) if a job is still active, it will not shutdown
i removed the cron rules causing job freezes. but repo creates would still keep it on indefinitely atm.
Thanks a lot. No worries about the repo creation. It will eventually create it and shut down. The main reason it didn't was likely because of timeofday. The rate limit usually doesn't block a specific task for more than around an hour unless you get very unlucky and always lose the race to create a repo once a slot gets free.
I couldn't queue this morning, this is probably the root cause.
We did over 300 models today: 19 hours ago we had 1779 models in the queue and now there are 1492, not even considering all the ones we queued. It was crazy. I had to llmc audit sometimes even multiple times per hour as we went through so many models. We need to do a somewhat healthy mix of differently sized models so we don't end up having days where we only do small ones, or we will get rate limited. Next time I will queue some myself earlier.
I think we should push a few of the big 1400 jobs to prio 1200 or so per day. maybe. will have to see.
That sounds like a great idea for days where there are not many great new big models.
i also don't think snowflake will help us.
It will at least keep nico1 busy and it's one of the massive models we had to do anyway. I'm currently also closely following llama.cpp's decision on which MLA algorithm to use. Depending on which one they choose, we may or may not need to requantize all the DeepSeek V2/V3/R1 models.
-rw------- 1 root root 25M Mar 10 05:55 imatrix-remote/Samantha-1.1-70b.imatrix
moved away
Thanks a lot!
if there already is a job i need to manually add the imatrix job
Now that I know the current imatrix was outdated, I will secure the source GGUF, use nukeall and queue them again. That should be the cleanest option and not require you to do anything.
I tried fixing the frozen status page using llmc killall9 but it timed out...
nico1 ~# llmc killall9
nico1
back
leia
nico2
rich1
kaos
marco
rain
Killed llmjob(2720126) with signal 9
Killed llmjob(1136699) with signal 9
Killed llmjob(1136722) with signal 9
Killed llmjob(1137378) with signal 9
Killed llmjob(296290) with signal 9
Killed llmjob(514440) with signal 9
Killed llmjob(3434878) with signal 9
Killed llmjob(661385) with signal 9
llmjob: no process found
Killed llmjob(2256273) with signal 9
nico2: Connection timed out
At least I now know which node is to blame. Guess time to turn on nico2 again to fix this.
With nico2 turned on, llmc killall9 terminated successfully within a second, but the status page is still frozen. This issue really seems quite different from how the frozen status page normally behaves. I turned off nico2 again as turning it on didn't help solve the issue.
Oh, maybe I shouldn't have used llmc killall9. When I now check llmc audit, I see many entries like this - I'm not entirely sure if they are related, but they are not something I've seen before:
ionice -c3 chrt -b 0 systemd-run --scope --unit=llmjob-wrap-omega1.3-static-2719966 -G -p MemoryMax=32G
/llmjob/share/bin/quantize: line 295: 2720126 Killed llmjob hf-ensure-repo "$DST"
job finished, status 137
job-done<0 omega1.3 static 137>
https://huggingface.co/nicogptai/omega1.3
back from working...
yes, i just had the same experience. i have no clue what causes these deadlocks (basically, the master takes a global lock before connecting to workers, and each worker takes its own local lock, in random order. one would think the relatively new "upcalls" (via llmc) might be an issue, but i don't see a path where llmjob does a blocking llmc call - the only llmc call llmjob does is "push", which does not block if the lock is held. shucks).
killall -9 llmjob is no longer a crude but effective method, because llmjob has become a toolbox for lots of things, rather than only the scheduler itself, so killing it kills lots of other stuff, failing those jobs. it's relatively simple to clean up for me, so if it means some other job will start instead, do it. the fix is to fix the deadlock problem...
@nicoboss so, i thought, great opportunity to do the snowflake imatrix quant. i can mlock the Q8_0 (509G) without issue, but llama-imatrix is "Killed" (so probably oom killer), even with literally nothing else running.
@nicoboss so, i thought, great opportunity to do the snowflake imatrix quant. i can mlock the Q8_0 (509G) without issue, but llama-imatrix is "Killed" (so probably oom killer), even with literally nothing else running.
It indeed got OOM killed despite nothing else running. I was aware you wanted to run the Snowflake Arctic imatrix so I turned off all services on StormPeak. The only thing I forgot was reducing the ZFS ARC cache from 24 GB to less, but the last time we did snowflake arctic base this wasn't required. Here is the kernel log of the OOM event:
Mar 14 02:28:22 StormPeak kernel: llama-imatrix invoked oom-killer: gfp_mask=0x440dc0(GFP_KERNEL_ACCOUNT|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=800
Mem-Info:
active_anon:221810 inactive_anon:462979 isolated_anon:0
active_file:1412 inactive_file:3913 isolated_file:0
unevictable:124205714 dirty:195 writeback:108
slab_reclaimable:389941 slab_unreclaimable:342434
mapped:124210748 shmem:28017 pagetables:385960
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:873272 free_pcp:132 free_cma:0
Node 0 active_anon:1733588kB inactive_anon:1005568kB active_file:5240kB inactive_file:14472kB unevictable:496822856kB isolated(anon):0kB>
Node 0 DMA free:11264kB boost:0kB min:0kB low:12kB high:24kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB i>
lowmem_reserve[]: 0 1432 515181 515181 515181
Node 0 DMA32 free:1520228kB boost:0kB min:252kB low:1716kB high:3180kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_>
lowmem_reserve[]: 0 0 513749 513749 513749
Node 0 Normal free:1958780kB boost:0kB min:91612kB low:617688kB high:1143764kB reserved_highatomic:1867776KB active_anon:1873616kB inact>
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
Node 0 DMA32: 9*4kB (UM) 12*8kB (UM) 10*16kB (UM) 8*32kB (UM) 9*64kB (UM) 8*128kB (UM) 8*256kB (UM) 11*512kB (UM) 11*1024kB (UM) 12*2048>
Node 0 Normal: 1292*4kB (UME) 11246*8kB (UME) 14994*16kB (ME) 5833*32kB (ME) 2351*64kB (UME) 264*128kB (UME) 114*256kB (UM) 80*512kB (UM>
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
124203420 total pagecache pages
0 pages in swap cache
Free swap = 0kB
Total swap = 0kB
134086427 pages RAM
0 pages HighMem/MovableOnly
2179266 pages reserved
0 pages hwpoisoned
Mar 14 02:28:26 StormPeak kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=ns,mems_allowed=0,global_oom,task_memcg=/lxc/108/ns/user.slice/user-0.slice/session-3129.scope,task=llama-imatrix,pid=2313118,uid=100000
Mar 14 02:28:26 StormPeak kernel: Out of memory: Killed process 2313118 (llama-imatrix) total-vm:502778504kB, anon-rss:77292kB, file-rss:280131584kB, shmem-rss:8192kB, UID:100000 pgtables:548784kB oom_score_adj:800
Mar 14 02:28:30 StormPeak kernel: oom_reaper: reaped process 2313118 (llama-imatrix), now anon-rss:0kB, file-rss:0kB, shmem-rss:72kB
I now reduced the ZFS ARC cache from 24 GB to 1 GB. If this is still not enough, please offload layers to both RTX 4090 GPUs and it will fit for sure. StormPeak is now ready for you to use for Snowflake Arctic imatrix computation.
I now joined the waitlist for HuggingFace Xet. Xet is their next-generation storage solution replacing S3/Git LFS. If my personal account gets accepted I will let you know if it is any good. You could join using https://huggingface.co/join/xet but I recommend waiting. Xet probably lifts the 50 GB limit so no more splitting/merging required. For our dry-run-all-GGUFs project Xet would be far superior to S3, as unlike S3, Xet is a block storage, so you likely only need to download a single block per model.
@mradermacher
I tried for the first time to manually run a massive imatrix job and everything seemed fine, but something is blocking it. Maybe because I paused the host to prevent other tasks from running, as I had no clue how to put a host in that mode since llmc help had no command for it.
Edit: No, it also got stuck at this exact location when nico1 wasn't paused at all.
Also, all those commands seem to be broken despite the host no longer being paused:
nico1 /tmp# llmc disable llmjob.nico1
disable.llmjob.nico1+: fail
nico1 /tmp# llmc disable imatrix.nico1
disable.imatrix.nico1+: fail
nico1 /tmp# llmc pause llmjob.nico1
pause.llmjob.nico1+: fail
nico1 /tmp# llmc pause imatrix.nico1
pause.imatrix.nico1+: fail
And resuming the GPUs I paused while playing around also seems no longer possible, despite the host having been unpaused for a while:
nico1 ~# llmc resume GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc
pause.GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc-: fail
nico1 ~# llmc resume GPU-188a5143-db69-7058-63b5-f2f1d2354f91
pause.GPU-188a5143-db69-7058-63b5-f2f1d2354f91-: fail
nico1 /tmp# llmc enable GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc
disable.GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc-: fail
nico1 /tmp# llmc enable GPU-188a5143-db69-7058-63b5-f2f1d2354f91
disable.GPU-188a5143-db69-7058-63b5-f2f1d2354f91-: fail
I noticed that llmc help is missing the imatrix FLAG files:
/tmp/pause - to pause the imatrix tasks (which is still used, despite the ability to pause GPUs, because of legacy scripts and it being super reliable)
/tmp/imatrix.force - to ignore the "larger than 480GB" imatrix limit
/tmp/max-ngl - to set the maximum number of layers allowed to be offloaded to the GPU
I now returned everything to as close to normal as I could. Quantization jobs are running again on nico1 and one of the 70B imatrix jobs is running despite both GPUs being paused, as I used the /tmp/pause flag to pause it before pausing the GPUs. The other imatrix jobs will unfortunately be blocked, as I had no idea llmc enable GPU-* would be broken.
It would be great if you could tell me what I did wrong and/or start the snowflake-arctic-instruct imatrix computation yourself once you are available. How did you make sure only one imatrix task is running? The only thing I could think of would be to pause llmjob.nico1 and one of the GPUs, which should guarantee only one imatrix task running. Don't worry about the imatrix queue being so long. This is mainly because nico1 somehow decided to eat all the priority 40 tasks due to me overriding everything, as llmc pause llmjob.nico1 was broken.
Wow, strange, it seems to still be doing imatrix jobs, just with one GPU, despite both being blocked. Cool I guess, as I wanted to unpause them, but super confusing that it does this.
How is this an error now?
400 17 s ablation-65-a55.simpo.armorm-shisa-v2-llama-3.1-8b error/255 repo create
If I llmc audit I see this:
HfHubHTTPError("500 Server Error: Internal Server Error for url: https://huggingface.co/api/repos/create (Request ID: Root=1-67d49fb3-377305ff77f775b842cdcecc;fd588126-eff5-4fcf-8792-e655e5a2affc)\n\nInternal Error - We're working hard to fix this as soon as possible!") at /llmjob/share/bin/llmjob line 2715.
...propagated at /llmjob/share/bin/llmjob line 2718.
job finished, status 255
job-done<0 ablation-65-a55.simpo.armorm-shisa-v2-llama-3.1-8b static 255>
https://huggingface.co/shisa-ai/ablation-65-a55.simpo.armorm-shisa-v2-llama-3.1-8b
This must be different from the repository creation rate limit, I assume, as the rate limit usually never errors. I selected "retry" for now.
if there already is a job i need to manually add the imatrix job
You will unfortunately need to do so for Samantha-1.11-70b and Samantha-1.1-70b, or tell me how to manually trigger an imatrix job if the scheduler thinks an imatrix already exists and so doesn't do one by itself, as by the time the model was queued we had not yet archived the old imatrix.
Wow, strange, it seems to still be doing imatrix jobs, just with one GPU, despite both being blocked.
You specified the wrong gpu uuid, so only one is blocked. You should be able to block all using llmc pause imatrix.nico2.
I did that now, unpause with llmc resume imatrix.nico2
500 Server Error: Internal Server Error for url
yes, repo create just endlessly retries on being rate limited. this is simply hf suckiness, it happens on any request, and not all of them are retryable.
I noticed that llmc help is missing the imatrix FLAG files
There aren't any that you can access, they are all on the host that runs the imatrix jobs (kaos). And they are: .imatrix-hfd (gguf is valid) and .soverride (block this job). everything else would be in the json job description (e.g. which quant, where to download, and "force", which special-cases quite a few things, even the quantiser).
Xet probably lifts the 50 GB
I hope this will be transparent when using the hub api?
(flags)
Hmm, I reworked the flags stuff a few days ago, probably something is broken. One issue is that at least one of your uuids was not a uuid, but that's not checked by anything - it would simply mean you disabled a card that doesn't even exist, explaining those problems.
I am rather busy atm., but I will look at it later. Skimming through this, your intention was to enable everything again, so I will do that.
gpu resuming was broken, the other flags should work
snowflake instruct didn't exhibit partially covered tensors either. peculiar:
[30]4.5118,[31]4.4902,[32]4.2987,[33]4.1352,[34]4.0640,[35]4.1097,[36]4.0953,[37]3.9571,[38]3.8514,[39]3.8286,
save_imatrix: entry ' blk.0.ffn_down_exps.weight' has partial data (95.31%)
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry ' blk.0.ffn_gate_exps.weight' has partial data (95.31%)
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry ' blk.0.ffn_up_exps.weight' has partial data (95.31%)
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[40]3.8027,[41]3.7899,[42]3.7907,[43]3.7528,[44]3.7568,[45]3.7423,[46]3.7421,[47]3.7649,[48]3.7736,[49]3.8393,
save_imatrix: entry ' blk.0.ffn_down_exps.weight' has partial data (97.66%)
save_imatrix: 3 out of 128 experts are missing data
save_imatrix: 3 out of 128 experts are missing data - storing but be aware
save_imatrix: entry ' blk.0.ffn_gate_exps.weight' has partial data (97.66%)
save_imatrix: 3 out of 128 experts are missing data
save_imatrix: 3 out of 128 experts are missing data - storing but be aware
save_imatrix: entry ' blk.0.ffn_up_exps.weight' has partial data (97.66%)
save_imatrix: 3 out of 128 experts are missing data
save_imatrix: 3 out of 128 experts are missing data - storing but be aware
[50]3.8619,[51]3.7652,[52]3.6756,[53]3.5995,[54]3.5232,[55]3.4544,[56]3.4564,[57]3.4428,[58]3.4881,[59]3.5413,[60]3.6089,[61]3.5819,[62]3.6202,[63]3.6591,[64]3.6948,[65]3.7287,[66]3.7580,[67]3.8092,[68]3.8528,[69]3.8791,[70]3.9078,[71]3.9304,[72]3.9267,[73]3.9117,[74]3.8934,[75]3.9132,[76]3.9207,[77]3.9402,[78]3.9272,[79]3.9366,[80]3.9516,[81]3.9433,[82]3.9516,[83]3.9424,[84]3.9542,[85]3.9600,[86]3.9625,[87]3.9679,[88]3.9827,[89]3.9785,[90]3.9810,[91]3.9932,[92]3.9877,[93]3.9786,[94]3.9769,[95]3.9490,[96]3.9652,[97]3.9638,[98]3.9659,[99]3.9510,[100]3.9496,[101]3.9704,[102]3.9540,[103]3.9469,[104]3.9432,[105]3.9589,[106]3.9717,[107]3.9966,[108]4.0193,[109]4.0061,[110]3.9952,[111]3.9861,[112]3.9732,[113]3.9614,[114]3.9478,[115]3.9381,[116]3.9272,[117]3.9226,[118]3.9405,[119]3.9552,[120]3.9894,[121]4.0224,[122]4.0629,[123]4.0951,[124]4.1483,[125]4.1925,[126]4.2084,[127]4.2208,[128]4.1951,[129]4.2042,[130]4.2001,[131]4.1913,[132]4.1599,[133]4.1229,[134]4.1424,[135]4.1528,[136]4.1617,[137]4.1609,[138]4.1752,[139]4.1914,[140]4.2076,[141]4.2148,[142]4.2255,[143]4.2306,[144]4.2246,[145]4.2281,[146]4.1897,[147]4.1507,[148]4.1279,[149]4.0930,[150]4.0629,[151]4.0319,[152]4.0548,[153]4.0668,[154]4.1009,[155]4.1359,[156]4.1768,[157]4.2180,[158]4.2545,[159]4.2916,[160]4.3233,[161]4.3644,[162]4.4004,[163]4.4289,[164]4.4621,[165]4.4962,[166]4.5275,[167]4.5578,[168]4.5864,[169]4.6174,[170]4.6465,[171]4.6776,[172]4.7161,[173]4.7472,[174]4.7761,[175]4.8207,[176]4.8486,[177]4.8822,[178]4.9031,[179]4.9323,[180]4.9580,[181]4.9898,[182]5.0146,[183]5.0482,[184]5.0830,[185]5.1043,[186]5.1348,[187]5.1531,[188]5.1795,[189]5.2056,[190]5.2293,[191]5.2568,[192]5.2935,[193]5.3223,[194]5.3406,[195]5.3595,[196]5.3979,[197]5.4154,[198]5.4360,[199]5.4551,[200]5.4766,[201]5.5009,[202]5.5214,[203]5.5368,[204]5.5569,[205]5.5791,[206]5.6068,[207]5.6288,[208]5.6491,[209]5.6769,[210]5.7026,[211]5.7270,[212]5.7459,[213]5.7706,[214]5.7825,[215]5.8032,[216]5.8271,[217]5.8449,[218]5.8689,[219]5.8854,[220]5.9095,[221]5.9244,[222]5.9341,[223]5.9554,[224]5.9779,[225]5.9978,[226]6.0189,[227]6.0359,[228]6.0500,[229]6.0720,[230]6.0962,[231]6.1168,[232]6.1403,[233]6.1576,[234]6.1815,[235]6.2036,[236]6.2261,[237]6.2379,[238]6.2595
Could it be that the patch simply makes tensors valid, so as soon as they are "stored", they no longer count as partial from then on? I haven't looked at the patch, but maybe it would fill partial weights with dummy weights, so on the next round, they would no longer count as partial? Might not be a disastrous thing, but probably the patch shouldn't permanently change weights, because that would slightly change the results for the next rounds - maybe it should modify and save a copy.
save_imatrix: 14 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[20]4.3242,[21]4.4543,[22]4.3112,[23]4.1885,[24]4.1851,[25]4.1865,[26]4.1966,[27]4.1751,[28]4.2599,[29]4.3854,
save_imatrix: 7 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[30]4.5118,[31]4.4902,[32]4.2987,[33]4.1352,[34]4.0640,[35]4.1097,[36]4.0953,[37]3.9571,[38]3.8514,[39]3.8286,
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[40]3.8027,[41]3.7899,[42]3.7907,[43]3.7528,[44]3.7568,[45]3.7423,[46]3.7421,[47]3.7649,[48]3.7736,[49]3.8393,
save_imatrix: 3 out of 128 experts are missing data
save_imatrix: 3 out of 128 experts are missing data - storing but be aware
Thinking about this, we should definitely investigate this, as this will probably affect most moe's and has a good chance of negatively affecting them, unless the patched weight values are essentially being ignored (I have no clue how the weights are combined between chunks).
We reached the repository creation limit again, so I started to reprioritize and statically assigned some of the, in my opinion, important 1400-priority medium-sized models to rain and leia and some big ones to rich1. I'm now strictly controlling what models nico1 and nico2 are working on to ensure they only do big ones. I kept back, kaos and marco untouched as they are not currently working on background models. Don't be confused by me abusing the pause-host functionality on nico1 and nico2. I realized that I can pause a host and then delete the interrupt files for it to work on specific models without scheduling any new ones. The reason I have to pause the entire host is because the following commands still do NOT work:
nico1 ~# llmc pause imatrix.nico1
pause.imatrix.nico1+: fail
nico1 ~# llmc pause llmjob.nico1
pause.llmjob.nico1+: fail
so I started to reprioritize and statically assigned
That sucks, because I also did this this morning, so us having to do it twice is not a good sign :)
work on specific models without scheduling
If that works for you, that's great. You can also manually start models by setting the .force flag (e.g. via llmc shell) and push'ing. They will immediately be interrupted by ready models higher in the queue, but those can be overridden. I envisage that's useful if some models are stuck in repo create.
On the other hand, how does that work at all? In my experience, the creation limit is something that hits during the day, then stays with us until 2-3am, with some limited softness (i.e. one can get a request through sometimes).
followinfg commands still do NOT work:
Eh, fascinating how many mistakes you can put into a few regexes. And I thought I tested those.
followinfg commands still do NOT work:
I kind of refactored it once more. It's even better than before. Looks fine from my point of view, but I didn't have the time to test everything.
so, gemma3 also has a vision part, and llama already has an extractor for it?
since i am super swamped, wanna give it a try at helping me? the qwen2vl extractor probably is a good base, and it's in "quantize" (you can search for Qwen2VLForConditionalGeneration)
as can be seen from the code, it's not exactly straightforward, mostly due to hard to predict output naming. the way I solved it was by creating an empty temporary directory, and assuming only a single file will be created, which will then be renamed to a known name.
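A sketch of that pattern in Python (the real "quantize" script is not Python; the paths, the output name and the assumption that the converter writes its single output into the current directory are illustrative only):

import pathlib, shutil, subprocess, tempfile

def extract_single_output(cmd, dest):
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(cmd, cwd=tmp, check=True)            # converter writes into cwd
        outputs = list(pathlib.Path(tmp).iterdir())
        assert len(outputs) == 1, f"expected exactly one output file, got {outputs}"
        shutil.move(str(outputs[0]), dest)                  # rename to a known name

# e.g. for the qwen2vl extractor quoted below:
# extract_single_output(
#     ["python3", "/llmjob/llama.cpp/examples/llava/qwen2_vl_surgery.py",
#      "--data_type", "fp16", "--", "/path/to/source/model"],
#     "mmproj-model.fp16.gguf")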
I feel the newest dryrun has issues, or maybe we just ran into this class of problem:
load_tensors: loading model tensors, this can take a while... (mmap = false)
[DRYRUN][CPU]: 6425850112
alloc_tensor_range: failed to allocate CPU buffer of size 6425850112
load_tensors: pretend allocating CPU buffer was successful due to dry-run being enabled
...
[DRYRUN][CPU]: 513024
output_reserve: failed to allocate output buffer of size 0.49 MiB
llama_init_from_model: failed to initialize the context: failed to reserve initial output buffer
common_init_from_params: Dryrun compleated!
dryrun failed
I went back to the grep method, but it seems dryrun testing is completely broken at the moment.
I feel the newest dryrun has issues, or maybe we just ran into this class of problem:
What issues are you experiencing and with what model do they occur? The output you posted looks all great, except for you wrongly detecting it as failed. Is it not returning status code 0 even when it is successful, or why is your code still detecting this as "dryrun failed"? I don't see how it is possible that it doesn't exit with code 0 if you see the Dryrun compleated! message above, as the code in common.cpp is the following - shouldn't exit(0) immediately terminate the application with exit code 0? Are you sure your exit code check is implemented correctly?
if(getenv("DRYRUN")) {
    LOG_ERR("%s: Dryrun compleated!\n", __func__);
    exit(0);
}
Just to make sure, I tested the latest version myself and as expected the exit codes printed using echo $? are correct. Now I'm even more confused about what issues you are talking about. Maybe you just got confused by the expected errors such as "failed to allocate" and "failed to initialize", which are intentional. The only issue I found was the embarrassing typo in the "Dryrun compleated!" message.
Working model:
[DRYRUN][PINNED]: 122088
output_reserve: failed to allocate output buffer of size 0.12 MiB
llama_init_from_model: failed to initialize the context: failed to reserve initial output buffer
common_init_from_params: Dryrun compleated!
0
Broken model:
llama_model_load: error loading model: error loading model hyperparameters: key not found in model: llama.context_length
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/nico/law-LLM.gguf'
main: error: unable to load model
1
so, gemma3 also has a vision part, and llama already has an extractor for it?
The gemma3 vision extraction is really simple. You just execute examples/llava/gemma3_convert_encoder_to_gguf.py to get the mmproj file as usual. By default the mmproj will be stored under mmproj.gguf, but the --outfile command line argument can be used to specify whatever name you like. Using --outtype you can specify whether you want the mmproj as f32, f16, bf16 or q8_0. If you don't specify anything, the mmproj will be in f16. Then you just specify the path of your gemma3 model and you are done. Should you encounter any issues, you can use --verbose to see exactly what it is doing.
since i am super swamped, wanna give it a try at helping me? the qwen2vl extractor probably is a good base, and it's in "quantize" (you can search for Qwen2VLForConditionalGeneration)
I just looked at the quantize script and the only thing you have to change is likely:
python3 "/llmjob/llama.cpp/examples/llava/qwen2_vl_surgery.py" --data_type fp16 -- ../"$SRC" || exit 33
to
python3 "/llmjob/llama.cpp/examples/llava/gemma3_convert_encoder_to_gguf.py" --outtype f16 -- ../"$SRC" || exit 33
as can be seen from the code, it's not exactly straightforward, mostly due to hard to predict output naming. the way I solved it was by creating an empty temporary directory, and assuming only a single file will be created, which will then be renamed to a known name.
You could heavily simplify it for gemma3 by just using --outfile to specify whatever output file you want. This unfortunately doesn't seem to be possible for qwen2vl, so you either use the auto outfile detection for both of them, or use dedicated code for gemma3, in which case you can remove that logic and instead just use --outfile.
i'll have a look at dryrun later - most likely I made a mistake in a hurry.
What's your take on f32 and Q8_0 quants of vision parts? Q8_0 seems attractive to have, and I made sure our naming convention supports that. f32, not so much.
What's your take on f32 and Q8_0 quants of vision parts? Q8_0 seems attractive to have, and I made sure our naming convention supports that. f32, not so much.
It would be awesome if we could offer different mmproj quants as well. qwen2vl supports fp32 and fp16 as --data_type argument, while gemma3 supports f32, f16, bf16 and q8_0 as --outtype argument. We should at least offer f16 and q8_0, and maybe even f32 for gemma3.
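A rough sketch of what producing several gemma3 mmproj variants could look like (the script path matches the one mentioned above; the wrapper function and the output naming convention are assumptions):

import subprocess

GEMMA3_SURGERY = "/llmjob/llama.cpp/examples/llava/gemma3_convert_encoder_to_gguf.py"

def extract_gemma3_mmproj(src_model: str, basename: str,
                          outtypes=("f16", "q8_0")) -> None:
    # one converter run per requested output type; --outfile gives us a
    # predictable name, so no temporary-directory/rename dance is needed
    for outtype in outtypes:
        subprocess.run(["python3", GEMMA3_SURGERY,
                        "--outtype", outtype,
                        "--outfile", f"{basename}.mmproj-{outtype}.gguf",
                        "--", src_model],
                       check=True)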
ok, not sure when i can get to that (gemma3). in the meantime, here is a diff I have been using for many months now, for use in the mradermacher branch
diff --git a/gguf-py/gguf/gguf_writer.py b/gguf-py/gguf/gguf_writer.py
index 080d2b9d..d3cbe44f 100644
--- a/gguf-py/gguf/gguf_writer.py
+++ b/gguf-py/gguf/gguf_writer.py
@@ -237,6 +237,10 @@ class GGUFWriter:
         kv_bytes = bytearray()
         for key, val in kv_data.items():
+            if val.type != GGUFValueType.ARRAY or len(val.value) < 50:
+                print("gguf serialising key ", key, "value", val)
+            else:
+                print("gguf serialising key ", key, "value-suppressed")
             kv_bytes += self._pack_val(key, GGUFValueType.STRING, add_vtype=False)
             kv_bytes += self._pack_val(val.value, val.type, add_vtype=True)
@@ -269,8 +273,8 @@ class GGUFWriter:
         self.state = WriterState.TI_DATA
     def add_key_value(self, key: str, val: Any, vtype: GGUFValueType) -> None:
-        if any(key in kv_data for kv_data in self.kv_data):
-            raise ValueError(f'Duplicated key name {key!r}')
+        #if any(key in kv_data for kv_data in self.kv_data):
+        #    raise ValueError(f'Duplicated key name {key!r}')
         self.kv_data[0][key] = GGUFValue(value=val, type=vtype)
in the meantime, here is a diff I have been using for many months now, for use in the mradermacher branch
Thanks for sharing. I fixed the typo in the "Dryrun completed!" message, applied your diff and updated to the latest llama.cpp despite there not being any changes relevant to us. There is no reason for you to update again unless you have not manually applied your diff the last time you did.
i made a typo when restoring the old dryrun code. it seems to work now :-)
i also removed the -ofreq 10, assuming this deals with any problems with wrong imatrix weights. that means no feedback for us, but it's rarely useful anyway.
why, oh why, did i follow llama naming conventions and call it fp16 instead of f16 (qwen2vl)
I noticed that for my latest medical models I have quite a few imatrix errors like this:
nico1 /tmp# grep -r "GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed" *.log
Gemma2-2b-IT-FT-medical_qa.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
gemma-2b-it-finetuned-medical-qa.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
gemma-medical_qa-Finetune-ad-2b.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
gemma-medical_qa-Finetune-ja.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
gemma-medical_qa-Finetune-v2.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
medical_jargons_simplifier2.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
Medical-mT5-large.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
Medical-mT5-xl-multitask.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
medical_q_a_model.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
Medical_Report_Summarization.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
Medical_Summarization.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
They don't appear in llmc audit. How do I deal with them? nuke or nukeall?
neither, the imatrix ones i have to deal with. queue fewer junk models? :-)
(do these actually work with transformers?)
well, actually nukeall does work in this case
What should I do about this one? nuke and force requeue explicitly to nico1? I think this should work as it should auto-skip already existing quants.
Running scope as unit: llmjob-wrap-gemma-3-4b-persian-v0-noquant-6698.scope
{[[PROGRESS:preparing...]]}
{[[PROGRESS:mmproj extraction]]}
mmproj extraction attempted on unsupported host
job finished, status 72
job-done<0 gemma-3-4b-persian-v0 noquant 72>
https://huggingface.co/mshojaei77/gemma-3-4b-persian-v0
neither, the imatrix ones i have to deal with. queue fewer junk models? :-)
Most of them are not junk, but I unfortunately don't have time to test every single one of them before queueing. Many medical finetunes lack a proper model card, which makes judging their quality without actually testing the model almost impossible. We could say no model card means trash, but this doesn't always seem to be true, as some authors are just lazy and I have already had multiple good models without a model card.
do these actually work with transformers?
I just tested Gemma2-2b-IT-FT-medical_qa using transformers and it worked. But no worries, the model kind of sucks: it wants you to ask questions formatted exactly like in the medical QA dataset and is so heavily censored that it refuses to answer the majority of them. It seems so stupid to create a medical finetune that refuses to answer medical questions. But it also seems stupid not to write a model card.
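A quick smoke test with transformers can look roughly like this (the repo id and the prompt below are placeholders, not the exact ones used):

from transformers import pipeline

MODEL_ID = "Gemma2-2b-IT-FT-medical_qa"  # placeholder: use the full <org>/<name> repo id

pipe = pipeline("text-generation", model=MODEL_ID)
out = pipe("What are the common symptoms of iron deficiency?", max_new_tokens=64)
print(out[0]["generated_text"])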
well, actually nukeall does work in this case
Great, I will nukeall them myself in the future. I will also try to find a way to recognize and filter such failures before even queueing them. With the latest changes to my script, the failure rate has already been reduced a lot compared to earlier versions.
What does (worker +cork) mean? I noticed that you queued all of today’s lownice models using that flag.
Edit: Ah interesting, that flag is gone now.
I merged the latest llama.cpp into the mradermacher branch adding support for the RWKV v7 architecture and fixing the tensor shape issue of OLMo-2-0325-32B-Instruct (tensor 'blk.0.attn_k_norm.weight' has wrong shape; expected 5120, got 1024)
I highly recommend updating, as otherwise all RWKV v7 / RWKV v7 Distilled based and many OLMo-2 based models will fail. Once you have updated, please queue the following models:
RWKV v7 Base models (RWKV7ForCausalLM):
- https://huggingface.co/fla-hub/rwkv7-191M-world
- https://huggingface.co/fla-hub/rwkv7-0.4B-world
- https://huggingface.co/fla-hub/rwkv7-1.5B-world
- https://huggingface.co/fla-hub/rwkv7-2.9B-world
- https://huggingface.co/fla-hub/rwkv7-0.1B-g1
RWKV v7 Distilled models (RwkvHybridForCausalLM):
- https://huggingface.co/RWKV-Red-Team/ARWKV-R1-1B5
- https://huggingface.co/RWKV-Red-Team/ARWKV-R1-7B
- https://huggingface.co/RWKV-Red-Team/ARWKV_7B_R1_16K
Force requant failed OLMo-2 models (Olmo2ForCausalLM):
(worker +cork)
Sorry, just experimenting - I wanted to queue everything first, so I set an impossible worker name to be changed when I am happy with the queue.
llama.cpp is updated, could you do me a favour and queue the models, maybe a test model first?
llama.cpp is updated, could you do me a favor and queue the models, maybe a test model first?
Thanks a lot! Will do.
Sorry, just experimenting - I wanted to queue everything first, so I set an impossible worker name to be changed when I am happy with the queue.
Ah I see. Now it makes sense. No problem I was just a bit confused at first.
@mradermacher Half an hour ago llama.cpp added support for Mistral3ForConditionalGeneration. Luckily it is a convert_hf_to_gguf.py change only, so I was able to manually provide the GGUF and use our existing llama.cpp version for imatrix computation and quantization. I recommend you upgrade to the latest version of the mradermacher branch again, so this no longer requires manual intervention. We could also hold back Mistral3ForConditionalGeneration based models until the vision extraction for it is implemented, but I would expect this to take days if not weeks for them to implement, so waiting is likely not a feasible option.
updated - but please keep a list of the models you queued so far, so we can re-run these models. new "add"s should automatically log these ("Mistral3ForConditionalGeneration, logging.")
i tried some of the rwkv 7 models that showed up in my list today (e.g. RWKV7-Goose-Pile-168M-HF), but... any idea?
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 5384, in <module>
main()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 5378, in main
model_instance.write()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 440, in write
self.prepare_metadata(vocab_only=False)
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 433, in prepare_metadata
self.set_vocab()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 3598, in set_vocab
self._set_vocab_rwkv_world()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 915, in _set_vocab_rwkv_world
assert (self.dir_model / "rwkv_vocab_v20230424.txt").is_file()
AssertionError
updated
Thanks a lot for the quick update! :D
please keep a list of the models you queued so far, so we can re-run these models. new "add"s should automatically log these ("Mistral3ForConditionalGeneration, logging.")
The only one I manually converted so far was https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503
i tried some of the rwkv 7 models that showed up in my list today (e.g. RWKV7-Goose-Pile-168M-HF), but... any idea?
All RWKV v7 based models are supposed to have a file named rwkv_vocab_v20230424.txt, as can be seen in any RWKV v7 base model, e.g. https://huggingface.co/fla-hub/rwkv7-191M-world/raw/main/rwkv_vocab_v20230424.txt in the case of fla-hub/rwkv7-191M-world. Your RWKV7-Goose-Pile-168M-HF model is missing this file, likely because it got converted from RWKV v7 into a HuggingFace transformers compatible model, as can be seen from the model's name. We could try just copying that file into the same folder as the model, but I'm not sure whether this would work. By the way, fun fact: that file used to allow arbitrary code execution in an earlier, luckily rejected convert_hf_to_gguf.py implementation which parsed the file using eval(line[line.index(' '):line.rindex(' ')]). ARWKV-7B-Preview-0.1 using RwkvHybridForCausalLM, which you queued, worked fine.
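Just to illustrate the point: assuming each vocab line has the shape `<id> <python string/bytes literal> <length>` (which is what that slicing implies), the same field could have been parsed with ast.literal_eval, which evaluates literals only and cannot run code:

import ast

def parse_vocab_line(line: str) -> tuple[int, bytes]:
    # assumed line format: "<id> <str-or-bytes literal> <length>"
    idx = int(line[:line.index(' ')])
    literal = line[line.index(' '):line.rindex(' ')]
    token = ast.literal_eval(literal.strip())  # literals only, no arbitrary code execution
    if isinstance(token, str):
        token = token.encode("utf-8")
    return idx, token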
By the way fun fact that file used to allow arbitrary code execution in an earlier luckily rejected
I was under the impression that convert...py always allows arbitrary code execution - for example, in the glm case, I regularly have to patch .py files inside the repo to make it work, which proves that the files get executed. One way is enough...
That is what prompted me to introduce safe-exec btw., because I was also under the impression that it would not execute files from the repo by default. We did have a little chat about that, too, I think.
All RWKV v7 based models are supposed
I guess we can then just skip those, as they are likely (in theory at least) identical to the non-hf version. Problems will arise if these become more popular (as they are by "RWKV")
I was under the impression that convert...py always allows arbitrary code execution - for example, in the glm case, I regularly have to patch .py files inside the repo to make it work, which proves that the files get executed. One way is enough...
It does for some models that are using a custom loader, but there it is quite obvious that the custom loader gets executed to load the model so someone that doesn't mass convert thousands of models would likely take a short look at it before converting to GGUF. Allowing arbitrary code execution to parse a massive text file, on the other hand, is definitely not something any user could ever expect. It is also about the dumbest way to implement a text file parser.
As long as convert_hf_to_gguf.py supports loading any models that are not in safetensors, you can easily make it execute arbitrary code anyway. Someone with malicious intent would likely choose to infect the actual model and not the python file that loads it, as that one is easily replaceable, but actually doing so in a stealthy way would require some genius, as the automated malware scanner only scans models as far as I'm aware. I'm positively surprised malicious AI models are not a common issue. As far as I'm aware, not a single AI model has tried to infect our infrastructure so far.
That is what prompted me to introduce safe-exec btw., because I was also under the impression that it would not execute files from the repo by default. We did have a little chat about that, too, I think.
We did. Enabling that for sure was a great decision. It would be really annoying to have our infrastructure infected by some random malware. We are at about the highest risk possible of this happening to us, as we process thousands of models from often untrustworthy sources shortly after their release, and so before HuggingFace could take them down based on their malware scanner results. But no worries, as long as nobody burns a Linux kernel exploit or, more likely, an Nvidia driver exploit on me, nothing will get out of my LXC container. I’m always closely monitoring the LXC container, so I would probably almost immediately spot any malicious process running inside of it.
I guess we can then just skip those, as they are likely (in theory at least) identical to the non-hf version. Problems will arise if these become more popular (as they are by "RWKV")
No need to do them, but it could indeed become an issue if users start finetuning them instead of the ones in the original RWKV v7 format. But don't worry, if it becomes an issue, we can for sure do something to convert them.
It does for some models that are using a custom loader
If it does it for some, it does it for all - the model type is parsed from the files as well.
it is quite obvious that the custom loader gets executed to load the model so someone that doesn't mass convert thousands of models would likely take a short look at it before converting to GGUF.
I think the opposite is the case. You assume everybody using transformers (or llama.cpp) somehow is an expert. I would assume most people would blindly trust it.
As long as convert_hf_to_gguf.py supports loading any models that are not in safetensors, you can easily make it execute arbitrary code anyway.
How so? The only alternative would be pytorch, and I don't think that executes code anymore.
automated malware scanner only scans models
As far as I am aware, automated malware scanners don't really exist. They either check outdated signatures, or pretend to check for behaviour and completely fail. Case in point, the hf malware repo scanner... :)
Anyway, I think the deeper issue is that transformers code is written by people who don't understand basic security or even safety practice, so running everything in a jail is the way to go :)
We are at like the highest risk possible of this happening to us as we process thousands of models from often untrustworthy sources
We are also one of the biggest targets for attacks, especially if something can be done with the generated files.
I’m always closely monitoring the LXC container so I would probably almost immediately spot any malicious process running inside of it.
Pride goes before the fall.
[rwkv]
No need to do them, but it could indeed become an issue if users start finetuning them instead of the ones in the original RWKV v7 format. But don't worry, if it becomes an issue
There were also two fla-hub non"-hf" "-pile" models having the same issue.
How so? The only alternative would be pytorch, and I don't think that executes code anymore.
What makes you think that PyTorch would no longer be vulnerable to arbitrary code execution? As long as you unpickle, you allow arbitrary code to run. The legacy file formats are insecure by design; I don't see any way one could ever load them securely. Even the poor machine that converts legacy models to safetensors, and so has to load them, will inevitably execute whatever arbitrary code is in the non-safe model, but at least the resulting SafeTensor will not contain any arbitrary code.
I found this nice article from December 2024 showing how to backdoor AI models - doing so is surprisingly simple, and it really is kind of a miracle that no bad actors seem to make use of it: https://snyk.io/articles/python-pickle-poisoning-and-backdooring-pth-files/
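The mechanism behind those backdoors is depressingly simple; a classic minimal illustration of why unpickling is code execution (harmless here, but never run untrusted pickles):

import pickle

class Payload:
    # pickle asks __reduce__ how to reconstruct the object; returning
    # (callable, args) makes whoever *loads* the pickle call that callable
    def __reduce__(self):
        import os
        return (os.system, ("echo this ran during unpickling",))

blob = pickle.dumps(Payload())
pickle.loads(blob)  # the command runs here, on the loader's machine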
We are also one of the biggest targets for attacks, especially if something can be done with the generated files.
Well, similar to SafeTensors, GGUFs are secure unless they exploit a security vulnerability inside llama.cpp, of which every few months another one gets responsibly disclosed under https://github.com/ggml-org/llama.cpp/security
I'm mainly concerned about someone stealing our HuggingFace token to nuke our models or using it to push some garbage. Hopefully HuggingFace has a repository delete rate limit. Maybe we should also rate limit the nukerepo/nukeall commands just in case. Having a malicious insider should also be a concern given how much access all our nodes currently have.
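Rate limiting the destructive commands could be as simple as a rolling-window counter, e.g. (a sketch, not how llmc actually works):

import time

class NukeLimiter:
    """Allow at most max_ops destructive operations per rolling window."""

    def __init__(self, max_ops: int = 5, window_s: float = 3600.0):
        self.max_ops = max_ops
        self.window_s = window_s
        self.stamps: list[float] = []

    def allow(self) -> bool:
        now = time.monotonic()
        # forget operations that fell out of the window, then check the budget
        self.stamps = [t for t in self.stamps if now - t < self.window_s]
        if len(self.stamps) >= self.max_ops:
            return False
        self.stamps.append(now)
        return True

A nukeall wrapper would then refuse (or ask for explicit confirmation on) anything beyond the budget.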
Pride goes before the fall.
True, but I have rotating air-gapped offline backups for the worst case. Even if someone could somehow escape the LXC container, they would be user 100000 on the host, which can't do anything without a kernel exploit that gives them root. Given how often I update the kernel, especially when there is news about any security vulnerability, it seems quite unlikely that someone who gains access to your container could escape it, unless NVidia lets me down with their terrible driver security. The worst someone could do inside the LXC container is ruin/sabotage our operations and bother my ISP with malicious activity. I have quite tight firewall rules set and use separate subnets for our cluster, our internet network and my home network, and I generally don't have insecure devices in my network, so it is unlikely they could traverse to any other device from within the LXC container besides other LXC containers used for our operation, which have equal security constraints.
The wait queue hasn't been this low since early November. That was almost half a year ago!
964 additional job(s) in wait queue (total estimated size 24.760TB, imatrix 7.501TB, lownice 0):
To celebrate I reconstructed the following historic wait queue sizes based on manual status page backups I made:
Nov 09 01:05: 1107
Nov 09 10:21: 1456
Nov 10 02:47: 1489
Nov 10 10:14: 1537
Nov 10 11:50: 1589
Nov 10 18:20: 1611
Nov 10 20:58: 1636
Nov 11 14:14: 1637
Nov 11 17:22: 1633
Nov 12 00:10: 1678
Nov 13 11:29: 1738
Nov 13 13:49: 1796
Nov 14 10:32: 2020
Nov 18 19:43: 2077
Nov 19 10:22: 2962
Nov 20 02:22: 3073
Nov 20 02:25: 3107
Nov 20 09:45: 3319
Nov 21 10:26: 3324
Nov 22 16:27: 3329
Nov 22 17:42: 3330
Nov 27 00:00: 3466
Nov 27 02:20: 3468
Nov 27 16:37: 3459
Nov 28 23:09: 3441
Nov 29 10:28: 3440
Dec 01 07:21: 3534
Dec 01 17:17: 3613
Dec 01 20:22: 3729
Dec 02 14:36: 3720
Dec 03 01:15: 3698
Dec 03 17:19: 3848
Dec 04 01:53: 3816
Dec 11 10:49: 3800
Dec 12 13:03: 3830
Dec 13 09:58: 3919
Dec 13 16:46: 3959
Dec 14 23:37: 3977
Dec 15 02:25: 4001
Dec 15 02:51: 4000
Dec 15 06:59: 4052
Dec 15 12:31: 4051
Dec 15 18:25: 4056
Dec 16 15:21: 3987
Dec 16 19:59: 3969
Dec 17 11:30: 3907
Dec 17 13:47: 3881
Dec 17 15:39: 3831
Dec 17 21:57: 3754
Dec 18 05:15: 3731
Dec 18 17:35: 3636
Dec 19 04:42: 3620
Dec 19 11:11: 3556
Dec 20 00:49: 3465
Dec 20 16:06: 3386
Dec 21 05:12: 3379
Dec 21 15:09: 3325
Dec 21 21:43: 3295
Dec 22 16:02: 3183
Dec 23 16:50: 2982
Dec 24 04:15: 2898
Dec 24 15:15: 2769
Dec 25 01:49: 2612
Dec 25 16:08: 2599
Dec 26 03:54: 2598
Jan 03 02:57: 1450
Jan 03 03:06: 1447
Jan 03 03:09: 1446
Jan 03 04:02: 1464
Jan 03 14:47: 1393
Jan 03 23:56: 1299
Jan 04 01:20: 1283
Jan 04 13:34: 1160
Jan 04 19:05: 1094
Jan 05 02:01: 1022
Jan 05 04:43: 973
Jan 20 01:54: 1205
Jan 21 11:33: 1192
Jan 24 12:49: 1097
Feb 02 00:52: 1061
Feb 04 09:48: 1074
Feb 16 02:43: 2145
Feb 18 02:18: 2377
Mar 12 20:32: 1799
Mar 13 04:15: 1779
Mar 13 19:24: 1512
Mar 13 19:58: 1501
Mar 13 20:14: 1496
Mar 13 21:55: 1492
Mar 14 09:24: 1472
Mar 14 17:16: 1380
Mar 14 17:52: 1367
Mar 15 00:22: 1370
Mar 15 14:28: 1242
Mar 15 17:28: 1206
Mar 15 19:54: 1204
Mar 18 17:52: 1132
Mar 18 20:06: 1125
Mar 19 09:16: 1068
Mar 19 14:32: 1007
Mar 19 14:53: 1000
Mar 19 15:11: 995
Mar 19 17:50: 967
Mar 19 19:21: 964
The maximum it ever reached according to my measurements was 4056 on 15th of December 2024! :D
What makes you think that PyTorch would no longer be vulnerable to arbitrary code execution? As long you unpickle you allow arbitrary code to run.
I thought I had read that transformers had switched to a restricted unpickle library, but... indeed, that seems not the case.
However, my point was a different one: I think it does not unpickle by default, so pickle isn't the problem for unsuspecting users (it is for us, since we enable it). The problem is that asking for untrusted code execution wrongly implies that it won't execute untrusted code when not told to do so. It would be better to always execute untrusted code instead of giving a false sense of security.
Having a malicious insider should also be a concern given how much access all our nodes currently have.
Like, richard, me, you, marco and....? I mean, specifically the mradermacher nodes themselves, or the host, not other containers.
I am not very concerned about that, but maybe I don't value the repositories as highly, so vandalism is pretty low on my list of fears. But I wouldn't want to be part of a large scale attack on people downloading models :) Other than llama.cpp insecurities, which by necessity I don't care much about.
The wait queue hasn't been this low since early November. That was almost half a year ago!
Time to queue more older models, I suspect. Once I find the time again.
To celebrate I reconstructed the following historic wait queue sizes based on manual status page backups I made:
I infrequently wished we had such historical backups :_)
Like, richard, me, you, marco and....? I mean, specifically the mradermacher nodes themselves, or the host, not other containers.
I got betrayed by too many friends when I hosted a Minecraft server back in high school, so I probably take security against trusted insiders a bit too seriously. The good thing here is that all of the people currently involved have invested a ton of money and resources into this, so nobody will do anything malicious for sure. This is mainly a concern should we ever plan on onboarding someone new as giving someone we barely know this level of access is a major risk.
I am not very concerned about that, but maybe I don't value the repositories as highly, so vandalism is pretty low on my list of fears.
We put so much effort, work and resources into them so I value them quite a lot.
But I wouldn't want to be part of a large scale attack on people downloading models :) Other than llama.cpp insecurities, which by necessity I don't care much about.
It is very unlikely someone could use GGUFs to distribute malware. The format is relatively secure.
I infrequently wished we had such historical backups :_)
I will upload mine soon. They are tiny. To make your own, you could create a cron task that downloads the status page. I just Ctrl+S the status page from time to time so I can better see how well things progress. I'm way too obsessed with the status page.
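A cron-driven snapshot could be as small as this (the status page URL and output directory are placeholders):

import datetime
import pathlib
import urllib.request

STATUS_URL = "https://example.org/status"            # placeholder, not the real URL
OUT_DIR = pathlib.Path("~/status-backups").expanduser()

def snapshot() -> None:
    # save one timestamped copy of the status page per invocation
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
    data = urllib.request.urlopen(STATUS_URL, timeout=30).read()
    (OUT_DIR / f"status-{stamp}.html").write_bytes(data)

if __name__ == "__main__":
    snapshot()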
My list of fears:
- HuggingFace imposing a storage limit
- HuggingFace bugging us for using too much resources
- HuggingFace banning us for some really stupid reason like too many DMCA notices and not taking into account the number of models we have or some abuse report spam or other trash like this
- HuggingFace running out of money
- llama.cpp deciding to merge a change breaking support for all quants ever created because their maintainers don't value support for legacy quants.
- GGUF getting replaced by a far superior format as it happened for GGML which then got replaced by GGUF
- Stupid regulation from the USA messing with open AI models, especially with the current president behaving so unpredictably.
- EU being stupid as always and wanting to geoblock some "dangerous" AI models. I'm so glad Switzerland is not part of this organization.
- My ISP kicking me out due to using too much bandwidth.
- Someone or something sabotaging our operations
- HuggingFace Xet storage disrupting our operations. I think they already push 7% of traffic through Xet, so we might already be using it.
- HuggingFace doing stupid rate limits. Richard just got rate limited today:
While we are at things Richard doesn't like: it's him only having 2 tasks on rich1. He sent me this picture 2 hours ago - luckily there are more now. Also he would like the ability to set the number of parallel tasks:
In any case let's just enjoy the moment. There never was a better time to enjoy all these amazing openly available AI models!
I'm enjoying locally running AI models so much I ordered 2x Intel Arc 770 last weekend. They perform better for LLMs than they should, and you get 4x better performance/value than with NVidia according to the specifications, and even better performance/value in the unlikely case those unrealistic benchmarks were true: https://www.tweaktown.com/news/97705/metas-next-gen-llama3-llm-is-here-and-the-intel-arc-a770-outperforms-geforce-rtx-4060/index.html and https://www.plainconcepts.com/maximizing-ai-performance-intel-arc-a770-gpu/
This is mainly a concern should we ever plan on onboarding someone new as giving someone we barely know this level of access is a major risk.
Agreed. Maybe we can move repo creation more centrally (dryrun was a major step towards that btw.) and maybe have finer-grained tokens for, say, only uploads. At some point.
The format is relatively secure.
The format is completely secure, I think. But I don't trust the gguf parsers one bit.
list of fears
yeah, these are all definitely on the realistic part of the scale. In fact, I am surprised this got as far as it got, and I expect total breakdown daily. Enshittification is a thing, and it already happens with hf as well, although at a surprisingly low level so far.
HuggingFace Xet storage disrupting our operations.
my immediate short term concern, yes :)
Richard just got rate limited today:
Holy shit, what did they call the 5MB/s bandwidth before? unlimited? :-)
While we are at things Richard doesn't like: it's him only having 2 tasks on rich1.
Well, nice level 1400 is very far down the list, so the scheduler does reserve resources for higher priority things. Some tuning might always be required (the logic is pure mess, always changing :)
But the real problem with richard will be once we are through the low pri models, which will be soon. Richard and I will have to find a new mode of operations.
Also he would like the ability to set the number of parallel tasks:
As in less than two, or more than two? I suspect once we are through the models, it would make most sense to limit it to two, so he always has guaranteed resources available for himself, since we likely won't need him full time anymore (likewise nico2). rich1 is the fastest box we have that is always available (marco is hampered by disk speed mostly. he was thinking of buying an nvme for just this).
I'm enjoying locally running AI models so much I ordered 2x Intel Arc 770 last weekend.
Yeah, I wondered about intel arc, too, in the beginning, and then they cancelled their promised 64GB model and generally fucked up their drivers again, so I was dissuaded. But things seem to be improving, that is good. If anybody needs more competition, it's nvidia, and if anybody needs better, more competitive products, it's intel at the moment. We saw how long it took AMD to mirror the shitty nvidia price hikes (i.e. instant), and I don't doubt intel's death would immediately cause amd to become the new evil intel. Not to speak of the shady practice of artificially reducing PCIe lanes to sell more server hardware (which amd also copied from intel). Enshittification everywhere.
Although, I must admit, I was dreaming about intels death ever since I had a dual opteron.
In any case let's just enjoy the moment. There never was a better time to enjoy all these amazing openly available AI models!
Yes, very depressing. Oh, you meant this to be uplifting??
containing all relevant files for a GPTNeoXTokenizerFast tokenizer.
do you know a way of doing something about this? happens with https://huggingface.co/zaas12/pythia-1.4-orpo for example. if it is as simple as installing it, there might be a whole bunch of models with this or similar issues (missing python packages)
btw., so amusing: https://emygervais.github.io/2025/03/15/bytecraft.html (saw the model this morning)
sigh. the amusing bytecraft model caused 100% endless loop on rich, blocking everything.
CPU based prompt processing will soon be so much faster for Q4_K on AVX2 compatible CPUs: https://github.com/ggml-org/llama.cpp/pull/12332
Well, nice level 1400 is very far down the list, so the scheduler does reserve resources for higher priority things. Some tuning might always be required (the logic is pure mess, always changing :)
It would be nice if there were always some models waiting around idle, ready to take over. Especially for Richard, who cares way too much about his server being fully utilized all the time.
But the real problem with richard will be once we are through the low pri models, which will be soon. Richard and I will have to find a new mode of operations.
You will have to keep rich1 busy or he will start using it for his own quants of that list of trash models you gave him. By the way, he is doing an awesome job abusing Google Colab and soon his home internet for Native/AWQ/MLX quants.
As in less than two, or more than two? I suspect once we are through the models, it would make most sense to limit it to two, so he always has guaranteed resources available for himself, since we likely won't need him full time anymore (likewise nico2). rich1 is the fastest box we have that is always available (marco is hampered by disk speed mostly. he was thinking of buying an nvme for just this).
What do you expect from Richard? Obviously he wants to run 3. Here's a quote from him:
why rich1 so smol ?
I paid today, I want full load 💀
we need to find something to process on rich1
a third queue or something just to run to hit this 100% cpu
I pay for whole server, I use whole server
can I have a button to switch 2 models concurrently to 3/4 models concurrently?
Usually he would run his own quants on rich1 as well to make sure it is maxed out, but HuggingFace rate limited his repository creation today, so he cannot really max it out.
Yeah, I wondered about intel arc, too, in the beginning, and then they cancelled their promised 64GB model and generally fucked up their drivers again, so I was dissuaded. But things seem to be improving, that is good. If anybody needs more competition, it's nvidia, and if anybody needs better, more competitive products, it's intel at the moment. We saw how long it took AMD to mirror the shitty nvidia price hikes (i.e. instant), and I don't doubt intel's death would immediately cause amd to become the new evil intel. Not to speak of the shady practice of artificially reducing PCIe lanes to sell more server hardware (which amd also copied from intel). Enshittification everywhere.
Luckily StormPeak has 128 PCIe 5.0 lanes and 16 PCIe 4.0 lanes. AMD is quite generous with the latest gen Threadripper Pro, as even the cheapest 12-core model for $1400 comes with the full 128 PCIe lanes, offering an absolutely awesome price/PCIe-lane ratio.
All manufacturers' latest gen GPUs are shit. The Nvidia 50-series got backported to TSMC 5nm, is really inefficient, and is basically just a larger 40-series for an insane price. AMD costs way too much for just 16 GB of memory, and ROCm is the biggest pain ever to use for AI and basically anything else. Intel's latest Arc generation is decent but has only a 192-bit bus, worse than their previous generation, with only 12 GB of memory and so less bandwidth and far fewer AI cores, but there is hope for an awesome 24 GB clamshell model later this year.
The Intel Arc 770 is truly awesome. This is not the latest generation they released this year but what they released 2.5 years ago. They offer 16 GB of GDDR6 on a 256-bit bus with 560 GB/s bandwidth and 512 AI cores for $280, while NVidia offers a 4060 Ti 16 GB with a 128-bit bus using clamshell for $600. For the price of an RTX 5090 I could buy over 8x Intel ARC 770, which combined would be 128 GB GDDR6 at 4480 GB/s on a 2048-bit bus, totaling 4096 AI cores. Really, the price/performance you currently get with the last gen Intel Arc 770 is insane. It is also worth considering that despite its age it is using TSMC 5 nm like the NVidia 40-series of GPUs. And now, so many years after the initial Intel Arc launch, they have finally gotten the software side of things right. PyTorch, llama.cpp, axolotl and vLLM all work without any issues on Intel Arc, both on Windows and Linux. I just hope it doesn't have the audio stuttering or PCIe device reset issues I'm currently experiencing on the RTX 4090, or the random Linux kernel crashes I'm experiencing using Sparkle Intel ARC A310 4GB GPUs at my job. I will for sure let you know how it goes once they arrive. They will probably both go into StormPeak so it then has 2x RTX 4090 + 2x Intel Arc 770, so we can keep the RTX 3080 in CastlePeak and the RTX 2070s in Threadripper for the RPC setup.
Although, I must admit, I was dreaming about intels death ever since I had a dual opteron.
Regarding CPUs, Intel has been dead to me since they removed AVX-512. I'm just not buying any CPU without AVX-512. Doing so would be stupid. I want my fast memset. Jokes aside, there are applications like llama.cpp on CPU and AV1 encoding where AVX-512 makes a quite massive difference. But I'm generally not that happy with AMD64. I really wish we could soon move on and use RISC-V based CPUs for PCs. I'm already using RISC-V based CPUs for use-cases where security matters. I also really miss transactional memory, which Intel promised many times, then messed up, and has now just abandoned. With the latest security vulnerability, AMD CPUs got a whole lot more interesting anyway. You can now jailbreak them and write your own microcode: https://bughunters.google.com/blog/5424842357473280/zen-and-the-art-of-microcode-hacking
Yes, very depressing. Oh, you meant this to be uplifting??
It was meant to be uplifting, but I guess it can also be seen as quite depressing, depending on whether you value the now or the future. Things will likely never be as good as they are now, at least for me, as I have basically reached the peak of joy and happiness: I have an awesome job I truly enjoy and look forward to every day, and during my spare time I can have fun with all these awesome openly available AI models. There is just no way things stay anywhere close to as good as they currently are. I recommend just enjoying the moment as long as it lasts. I made sure to back up all the base models and the models I like most, just in case.
do you know a way of doing something about this?
The entire error is:
Can't load tokenizer for '/bpool/pythia-1.4-orpo'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/bpool/pythia-1.4-orpo' is the correct path to a directory containing all relevant files for a GPTNeoXTokenizerFast tokenizer.
https://huggingface.co/zaas12/pythia-1.4-orpo/tree/main does not contain a tokenizer.json or tokenizer.model, so the model simply has no tokenizer. To fix this error, just copy the GPTNeoXTokenizerFast tokenizer from a different model into the folder containing the downloaded model.
For this specific model you know, based on the "_name_or_path" inside the config.json, that it was trained based off "EleutherAI/pythia-1.4b".
So you could download:
- https://huggingface.co/EleutherAI/pythia-1.4b/raw/main/tokenizer.json
- https://huggingface.co/EleutherAI/pythia-1.4b/raw/main/tokenizer_config.json
- https://huggingface.co/EleutherAI/pythia-1.4b/raw/main/special_tokens_map.json
After which the model successfully converts into a GGUF. I stored the resulting GGUF under /tmp/quant/pythia-1.4-orpo.gguf.
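If this comes up more often, the manual step could be scripted with huggingface_hub (a sketch; the base repo and file list are the ones from this specific case):

from huggingface_hub import hf_hub_download

BASE_REPO = "EleutherAI/pythia-1.4b"
MODEL_DIR = "/bpool/pythia-1.4-orpo"

# place the base model's tokenizer next to the finetune before converting to GGUF
for fname in ("tokenizer.json", "tokenizer_config.json", "special_tokens_map.json"):
    hf_hub_download(repo_id=BASE_REPO, filename=fname, local_dir=MODEL_DIR)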
missing python packages
I noticed a ton of models with missing python packages, and I was wondering why we keep nuking them instead of installing the proper dependencies. It seems quite stupid that we don't support models where the HF to GGUF conversion depends on a specific python package. I guess now that we maintain our own llama.cpp fork, we could add them all to the requirements.txt
btw., so amusing: https://emygervais.github.io/2025/03/15/bytecraft.html (saw the model this morning)
Who would have guessed: "Working in the byte world is extremely challenging because a single wrong byte can break the whole functioning of the file."
This idea is so insane. You basically teach an LLM to write raw bytes instead of ASM. It might work, but damn is it an insane idea. They should have at least used a tokenizer that made some sense for this use case, or even better, trained a model from scratch, because this is so far removed from the common use-case of an LLM that starting fresh would likely be justified. What's next? An AI that creates a ROP chain so I can run a game inside a banking application by abusing a buffer overflow?
Not to speak of the shady practice of artificially reducing PCIe lanes to sell more server hardware (which amd also copied from intel). Enshittification everywhere.
Intel has done plenty of shady shit, but this situation is far more nuanced, as PCIe lanes take up valuable die space, and PCIe switches becoming too expensive for consumer usage, etc.
"PLX was acquired by Avago in 2014 in a deal that valued the company at $300m, and seemingly overnight the cost of these switches increased three-fold according to my sources at the time, making them unpalatable for consumer use." source: https://www.anandtech.com/show/15821/microchips-new-pcie-40-pcie-switches-100-lanes-174-gbps
second source that mentions this increase: https://www.servethehome.com/business-side-plx-acquisition-impediment-nvme-everywhere/
My go to example for intel screwing consumers over is ECC memory, there was no technical limitation on that at all, and consumer platforms having less stability because of it is a legacy we still deal with to this day.
Intel has done plenty of shady shit, but this situation is far more nuanced, as PCIe lanes take up valuable die space
That is why AMD puts a dedicated IO die in their CPUs. But even on a monolithic design PCIe lanes and memory channels are always worth the die space they use. More PCIe lanes mean more GPUs and more SSDs, and more memory channels mean faster CPU inference performance. I'm so happy StormPeak has octa-channel memory.
PCIe switches becoming too expensive for consumer usage
I never really saw the appeal of PCIe switches. What is the advantage of using a PCIe switch compared to just using PCIe bifurcation to split PCIe bandwidth between multiple devices? When I want to plug 4 GPUs into one PCIe x16 slot I'm just using a x4x4x4x4 bifurcation card for $30, and I have something reliable that equally distributes the bandwidth between all the GPUs. But cheap PCIe redrivers would be super useful. My mainboard luckily has some integrated, but having them after the PCIe riser cable would likely make way more sense.
My go to example for intel screwing consumers over is ECC memory, there was no technical limitation on that at all, and consumer platforms having less stability because of it is a legacy we still deal with to this day.
ECC memory is huge. Having ECC memory is an absolute must. I need to be able to trust my PC. Without ECC memory I would basically have to do all computations twice and compare results, which would be insanely wasteful. This is, by the way, exactly what I did before I had ECC memory. All my PCs since 2017 have ECC memory. Threadripper has 128 GB of ECC UDIMM DDR4 memory, CastlePeak has 256 GB of ECC UDIMM DDR4 memory and StormPeak has 512 GB of ECC RDIMM DDR5 memory. For anyone telling me that ECC errors are unlikely: no, they are not. My ECC RAM puts a kernel log entry every time one happens and they indeed do happen and ECC always manages to correct them. Same with bit rot on TLC SSDs, which is something that happens.
That is why AMD puts a dedicated IO die in their CPUs.
Yes, but that is a more recent thing compared to losing out on cheap PCIe switches.
But even on a monolithic design PCIe lanes and memory channels are always worth the die space they use.
Again, it really isn't that simple: it has to be beachfront silicon, as it is I/O, and if you want more of that you have to use a bigger chip. It's not like you can just shrink the cores or use fewer cores, as that won't give you any more beachfront silicon.
I never really saw the appeal of PCIe switches.
They are incredibly useful, and do far more than what bifurcation can. Even with the dedicated I/O die on zen CPUs the chipsets still are PCIe switches (with extra functionality), look up the block diagrams of the X570 and B650 chipsets and you'll see they are PCIe switches (although again they do offer a bit more than a standard PCIe switch, but the switching functionality is still core and important).
I agree with you on ECC, even if I haven't been fortunate enough to use exclusively ECC computers, my desktop is still not ECC but my NAS and my servers are.
You will have to keep rich1 busy or he will start using it for his own quants of that list of trash models you gave him.
I don't know what that means. Will he take my ability away to use it? That sucks. Will he use the idle time for other purposes? Sounds great to me - why keep it artificially busy. In any case, if for some reason things get too bad (there is no indication of that at the moment :) I'd rather not have rich1 then.
I can add a few more models to the queue, though.
What do you expect from Richard? Obviously he wants to run 3. Here's a quote from him:
We've been there before. There is no way to help technically illiterate people. We can run three models once his server has the disk and memory to do so. Right now, it clearly doesn't have the memory, nor the disk, nor the network, for more.
My ECC RAM puts a kernel log entry every time one happens and they indeed do happen and ECC always manages to correct them.
Sorry, but ECC errors and bit errors are exceedingly rare. I've had dozens of busy servers over the decades, and the only case where I had ECC errors was a CPU errata.
So, yeah, they do happen, but many other faults are more likely. It is certainly nice to have this fuzzy feeling of extra security, though, but the performance drain is objectively not worth it for most applications.
Same with bit rot on TLC SSDs, which is something that happens.
I thought so, too, but all my data happens to be checksummed, and even for ssds that have been stored for years, I've never had a bit error (admittedly, I only have crucial ssds that are that old). But I might simply have been lucky.
When I want to plug 4 GPUs into one PCIe x16 slot I'm just using a x4x4x4x4 bifurcation
and enjoy 25% of the speed for prompt processing and many other tasks?
Intel has done plenty of shady shit, but this situation is far more nuanced
I disagree with the "far". Intel has reduced the number of pcie lanes in desktop cpus over the years. So, yeah, some nuance, but intel did this for segmentation reasons. change my mind, but that is what intel has been doing for many, many years now.
Actually there must be a bug with the queue on rich1.
Actually, there isn't. There simply isn't enough budget to add more big jobs. The only way out is to reduce the size of the jobs it can accept, greatly reducing its usefulness.
Anyway, the queue will become empty at some point, and other than idiotically wasting cpu cycles, there is no way we can avoid becoming idle at some point.
I've reduced max model size for rich to 100B.
Sorry, but ECC errors and bit errors are exceedingly rare. I've had dozens of busy servers over the decades, and the only case where I had ECC errors was a CPU errata.
So, yeah, they do happen, but many other faults are more likely. It is certainly nice to have this fuzzy feeling of extra security, though, but the performance drain is objectively not worth it for most applications.
I don't have great information on the prevalence of errors so I'm not going to argue one way or the other, but ECC offers far more than just a "fuzzy feeling of extra security". Error detection and monitoring is a huge benefit, such as helping you find and deal with faulty hardware before it causes actual problems, and telling you if that is or is not the cause of the issue you are experiencing.
I disagree with the "far". Intel has reduced the number of pcie lanes in desktop cpus over the years. So, yeah, some nuance, but intel did this for segmentation reasons. change my mind, but that is what intel has been doing for many, many years now.
Huh? I agree with you that it is Intel intentionally doing this for market segmentation, but my point was that for consumers motherboard PCIe lanes went down far more because of the lack of cheap PCIe switches, as most lanes before were provided by PCIe switches, and that is still a thing with modern chipsets still stepping in to offer more lanes than the CPU provides. Memory channels and PCIe lanes are far more costly than cores and market segmentation based on that isn't entirely unreasonable. The scummy shit is them doing stuff like this: "When Intel introduced Haswell-E, it experimented with a new type of product separation: it also varied the number of CPU-host PCIe lanes among the SKUs. This practice continues in Broadwell-E, in an almost identical fashion. The lowest end CPU has 28 PCIe 3.0 lanes, capable of three-way GPU setups (and no 2x16 setups), while the other processors have a full 40 PCIe 3.0 lanes" source https://www.anandtech.com/show/10337/the-intel-broadwell-e-review-core-i7-6950x-6900k-6850k-and-6800k-tested-up-to-10-cores
If you wanted more memory channels or PCIe lanes, you went to the HEDT platforms; they also didn't really do HEDT for a while, but again that is far after the era we are talking about with PCIe lanes going down, and a whole different story.
I'm not trying to change your mind as I'm not really even sure where we disagree. I just think ECC is a far simpler and clearer example of Intel being scummy and locking consumers out of things, and even the mainstream platform being kept at quad cores for far longer than it should have been, and the way they segmented hyper-threading, both have less nuance than the PCIe/memory class segmentation between HEDT and consumer.
such as helping you find and deal with faulty hardware before it causes actual problems
Helping is a relative term. There are many places where you can get bit errors, such as inside your CPU. I've had more cases of faulty cpus and cpu memory controllers than I ever had memory errors.
Point being, ECC is a relatively minor thing. When I hear that without ECC memory, one does all the calculations twice, this is just cargo cult.
Also, it's really fun to pull nico's leg sometimes.
I'm not trying to change your mind as I'm not really even sure where we disagree.
I don't think we are in any significant disagreement :)
Again, it really isn't that simple: it has to be beachfront silicon, as it is I/O, and if you want more of that you have to use a bigger chip. It's not like you can just shrink the cores or use fewer cores, as that won't give you any more beachfront silicon.
Ah, that explains why the I/O die on StormPeak is so physically massive compared to the dies that contain the actual cores. I always wrongly assumed this was the case because they use cheaper, older wafers for I/O dies.
They are incredibly useful, and do far more than what bifurcation can. Even with the dedicated I/O die on zen CPUs the chipsets still are PCIe switches (with extra functionality), look up the block diagrams of the X570 and B650 chipsets and you'll see they are PCIe switches (although again they do offer a bit more than a standard PCIe switch, but the switching functionality is still core and important).
You are right. On my WRX90E-SAGE SE mainboard the chipset also acts as a PCIe switch. It serves the two SlimSAS ports, each running at PCIe 4.0 x4, and the 4 SATA ports.
I agree with you on ECC, even if I haven't been fortunate enough to use exclusively ECC computers, my desktop is still not ECC but my NAS and my servers are.
ECC is awesome! I love every bit of it.
I don't know what that means. Will he take my ability away to use it? That sucks. Will he use the idle time for other purposes? Sounds great to me - why keep it artificially busy. In any case, if for some reason things get too bad (there is no indication of that at the moment :) I'd rather not have rich1 then.
Sorry for being unclear. No, he will obviously not take rich1 away when we don't use it, but he will make use of any idle resources to do models for his own account. He even does this now during the short downtimes we sometimes have due to the repo creation rate limit.
I can add a few more models to the queue, though.
I personally would rather see the queue empty out and be like it was before November than keep things as crazy as they currently are. But if you find great models, please queue them; we just don't need to queue garbage to satisfy Richard. He can do his own quants if he is unsatisfied with rich1's utilization, which he actually already does every time we don't max out his CPU.
We've been there before. There is no way to help technically illiterate people. We can run three models once his server has the disk and memory to do so. Right now, it clearly doesn't have the memory, nor the disk, nor the network, for more.
Yes exactly. I fully agree. He found that out the hard way today after insisting that I tell him how to increase the default OpenWrt 65536 active connection limit. He increased it to 655390 just to figure out that the limit existed for a reason: above 80K concurrent connections things started to get unstable and he lost connection to the entire server. Sometimes he just has to see things for himself to learn. He is still young and so has to get his experience from somewhere. It's quite funny how he keeps complaining about why everything keeps breaking for him without ever wondering if he might be the reason why. There is a 150-Watt peak power limit for the GPU in his laptop. He thought it was a great idea to remove that and run it 24/7 at 200-Watt sustained power. Let's see how long that lasts. He just does all the stupid things a teenager would do, but with computer hardware.
Sorry, but ECC errors and bit errors are exceedingly rare. I've had dozens of busy servers over the decades, and the only case where I had ECC errors was a CPU errata.
Let's check the logs on Threadripper and see who is right. I investigated a ton of ECC errors in late 2024, so I for sure wouldn't consider them "exceedingly rare". They are surprisingly common for me.
So, yeah, they do happen, but many other faults are more likely. It is certainly nice to have this fuzzy feeling of extra security, though, but the performance drain is objectively not worth it for most applications.
It really depends on the type of computations you run. For me, correctness is way more important than anything else, especially back when I did scientific computing. I guess it matters less now that I mostly do AI. This is also why I have not enabled ECC on the GPUs you use for imatrix computation, as doing so would lead to an over 10% performance decrease for very little benefit for our use case. ECC doesn't matter for my current use cases as much as it did in the past, but it is still an important nice-to-have and worth the hardware and performance cost for sure.
I thought so, too, but all my data happens to be checksummed, and even for ssds that have been stored for years, I've never had a bit error (admittedly, I only have crucial ssds that are that old). But I might simply have been lucky.
It seems to really depend on the SSD controller, storage chips and ECC algorithm. I think so far it was only late PCIe 3 and early PCIe 4 SSDs from Samsung and Kingston with which I experienced bit rot issues. The SSDs you currently use for nico1 are notorious for uncorrectable bit rot, so if you ever stored a large file and didn't read it for half a year, and the host didn't run monthly scrubs, it would have a quite high likelihood of being corrupted after that time. This is one of the main reasons I gave you those specific SSDs. Their bit rot was a massive pain for me and I wasted dozens of hours on it. Corrupted, rotten blocks kept breaking my backups, as every time one was encountered the backup got canceled, resulting in me having to spend hours searching for whatever file contains the faulty block, restoring it from backup, trimming all empty space and hoping the issue is gone. I had Windows Server on those SSDs, so no fancy file system like ZFS or BTRFS to tell me which files are broken; I just had to dd the entire thing, see where it fails and then figure out which file is at that position.
and enjoy 25% of the speed for prompt processing and many other tasks?
I really should reduce the 4090 GPUs to x4 and see if there is a performance difference for imatrix computation. You are of the opinion that all that matters for imatrix performance is PCIe bandwidth, but I'm almost certain imatrix is RAM bottlenecked, as that is what gets hot while it is running. Last weekend I even installed a piece of plastic inside StormPeak to direct airflow towards the RAM, because before, every time we did imatrix computation, everyone in the entire house heard the fans and joked that I have a hay drier in the basement, and it was actually so loud that sitting next to it for a long period of time made my ears hurt. Since I made that modification, imatrix computation is almost quiet.
I disagree with the "far". Intel has reduced the number of pcie lanes in desktop cpus over the years. So, yeah, some nuance, but intel did this for segmentation reasons. change my mind, but that is what intel has been doing for many, many years now.
And this is why I don't buy Intel or normal AMD CPUs. I mainly care about PCIe lanes and memory channels when buying a CPU, as this is what ends up bottlenecking me. I really hope AMD keeps Threadripper around because EPYC server mainboards suck.
Actually there must be a bug with the queue on rich1.
Actually, there isn't. There simply isn't enough budget to add more big jobs. The only way out is to reduce the size of the jobs it can accept, greatly reducing its usefulness.
Ah yes, that explains why only 2 got put there. The queued models currently got massive because we reached the big-model part of the priority-1400 models. I still don't get why we sort models by priority. Doing so seems a bit dumb, because once we are done with them we have to rush to manually add more big ones to avoid getting HuggingFace repo creation limited.
Anyway, the queue will become empty at some point, and other than idiotically wasting cpu cycles, there is no way we can avoid becoming idle at some point.
Which is a good thing as then Richard can use it for his own purposes while we are not using it.
I've reduced max model size for rich to 100B.
Great. That should ensure that there are always some idle models on rich1.
I don't have great information on the prevalence of errors so I'm not going to argue one way or the other
That's why I will now actually check my kernel logs on Threadripper, because I should have the data there, as for some reason journalctl keeps all the kernel logs since the 31st of May.
ECC offers far more than just a "fuzzy feeling of extra security". Error detection and monitoring is a huge benefit, such as helping you find and deal with faulty hardware before it causes actual problems, and telling you if that is or is not the cause of the issue you are experiencing.
I absolutely awesome. It already helped me detect many issues, mainly while building a new PC.
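For that kind of checking, the corrected/uncorrected counters the kernel's EDAC subsystem exposes can be polled directly. A minimal sketch, assuming the usual EDAC sysfs layout is present (it is on typical Linux setups with an EDAC driver loaded):

import glob, pathlib

# Print the corrected (CE) and uncorrected (UE) error counters per memory controller.
for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc[0-9]*")):
    mc = pathlib.Path(mc)
    ce = (mc / "ce_count").read_text().strip()
    ue = (mc / "ue_count").read_text().strip()
    print(f"{mc.name}: corrected={ce} uncorrected={ue}")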
Memory channels and PCIe lanes are far more costly than cores and market segmentation based on that isn't entirely unreasonable.
I never realized that they would be so expensive, given that even the cheapest latest-gen Threadripper Pro has 128 PCIe 5.0 lanes and 8 memory channels despite only having 12 cores.
Helping is a relative term. There are many places where you can get bit errors, such as inside your CPU. I've had more cases of faulty CPUs and CPU memory controllers than I ever had memory errors.
ECC doesn't just correct memory errors caused by faulty bits but also errors that happen during the transfer from memory to CPU. Unless the CPU really has some bug, data integrity should be guaranteed. DDR5 has some on-die ECC, but that does not cover the transfer, so even for DDR5 I'm using proper ECC memory. Honestly though, if you don't use your PC for anything important, DDR5 without ECC will likely be fine, as the on-die ECC in all DDR5 memory is quite decent at preventing random memory errors from things like cosmic rays.
Point being, ECC is a relatively minor thing. When I hear that without ECC memory one does all the calculations twice, that is just cargo cult to me.
I actually did so back in university for all scientific calculations, because I couldn't risk them being wrong.
I don't think we are in any significant disagreement :)
We are not.
Also, it's really fun to pull Nico's leg sometimes.
Or more like make me spend 2 hours reading, researching and replying to the massive wall of text you all wrote today. Joking aside, it actually was a very interesting discussion, and this is the first time I closely looked at the ECC error log as a whole instead of investigating specific ECC events.
This is what a typical ECC event looks like:
Aug 30 19:47:21 Threadripper kernel: mce: [Hardware Error]: Machine check events logged
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: Corrected error, no action required.
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: Error Addr: 0x000000019a9b9040
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x0000fd010a400302
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
Aug 30 19:47:21 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Dec 24 02:54:42 Threadripper kernel: mce: [Hardware Error]: Machine check events logged
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: Corrected error, no action required.
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: Error Addr: 0x000000019a9b9040
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x0000fd010a400302
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
Dec 24 02:54:42 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
And here are all the ECC events that happened on Threadripper with 128 GB of DDR4 ECC memory since the 31st of May:
Aug 30 19:47:21 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Nov 28 14:45:22 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Nov 28 21:24:02 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Nov 29 05:30:06 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Nov 30 07:42:58 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 01 04:28:09 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 03 02:09:44 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 04 02:55:13 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 04 03:00:41 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 04 14:56:07 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 05 14:30:36 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 02:04:11 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 03:53:25 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 03:58:52 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 04:04:20 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 04:09:48 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 08:04:38 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 10:33:53 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 12:01:16 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 12:06:44 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 12:12:11 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 20:12:47 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 17:56:29 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 17:58:03 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:03:30 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:08:58 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:14:26 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:19:53 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:25:21 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:30:49 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:36:16 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 20:41:53 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:21:00 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:26:28 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:31:56 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:37:24 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:42:51 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:48:19 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:53:47 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:59:14 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 10:04:42 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 10:10:10 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 10:15:37 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 10:21:05 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 12:37:37 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:21:18 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:26:46 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:32:14 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:37:41 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:43:09 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:48:37 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:54:04 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:59:32 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:05:00 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:10:28 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:15:55 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:21:23 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:26:51 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:32:18 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 18:27:08 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 10 21:40:05 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 11 03:02:18 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 11 18:08:53 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 24 02:54:42 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
This is a total of 64 corrected ECC errors in less than 10 months. I wouldn't consider this rare. It's also quite surprising that the issue seems to always happen at the same address, so maybe there is some sort of hardware defect that makes this specific address much more likely to have issues.
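The tallying above can be reproduced mechanically. A small sketch, assuming journalctl still holds the kernel messages that far back; it groups the corrected-error lines by the page/offset shown in them:

import re, subprocess
from collections import Counter

# Pull kernel messages from the journal and tally corrected errors per address.
log = subprocess.run(
    ["journalctl", "--no-pager", "_TRANSPORT=kernel"],
    capture_output=True, text=True,
).stdout
counts = Counter()
for n, addr in re.findall(
        r"EDAC MC\d+: (\d+) CE .*?(page:0x[0-9a-f]+ offset:0x[0-9a-f]+)", log):
    counts[addr] += int(n)
print(f"total corrected errors: {sum(counts.values())}")
for addr, n in counts.most_common():
    print(f"{n:4d}  {addr}")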
Your results suggest that there is something seriously wrong with your hardware. I wouldn't trust it if it generates this massive amount of bit errors. Insert funny theory about the Swiss Alps being nearer to space.
The only times I ever got ECC errors in the last 20 years (I don't think I had ECC detection hardware before the 2000s) were a hardware errata in an Intel CPU (to be ignored) and actually faulty RAM sticks. I am seriously distrusting your hardware now. I mean, the RAM -> CPU path is not the only thing that can go wrong, and the failures you have are massive. Once every 5 years would be more acceptable, IMnsHO.
Since it seems to always be the same address (if I read that correctly, which I probably don't), this also indicates that your RAM is indeed faulty. So, yeah, ECC found it, but so would a burn-in with a memory checker.
Ah, that explains why the I/O die on StormPeak is so physically massive compared to the dies that contain the actual cores. I always wrongly assumed this was because they use cheaper, older wafers for I/O dies.
I hate to sound repetitive, but there is more nuance again. They do use older nodes for the I/O die, and that does result in it being larger, but not by that much, because I/O is one of those things that does not scale well with process nodes. That adds to the problem we were talking about before of it taking up valuable die area, as process node shrinks reduce the die area of cores far more than that of I/O.
You are right. On my WRX90E-SAGE SE mainboard the chipset also acts as a PCIe switch. It serves the two SlimSAS ports, each running at PCIe 4.0 x4, as well as the 4 SATA ports.
Yep, that is typical. But also, it's not just ubiquity; like I said before, PCIe switches are capable of things that bifurcation can't do.
ECC is awesome! I love every bit of it.
I do like it; memory instability sucks, as someone who has dealt with it in recent times (solved by RMA'ing the RAM).
Yes, exactly. I fully agree. He figured it out the hard way today, after insisting that I tell him how to increase OpenWrt's default limit of 65536 active connections. He increased it to 655390, just to discover that the limit existed for a reason: above 80K concurrent connections things started to get unstable and he lost connection to the entire server. Sometimes he just has to see things for himself to learn. He is still young and has to get his experience from somewhere. It's quite funny how he keeps complaining about why everything keeps breaking for him without ever wondering if he might be the reason. There is a 150-watt peak power limit for the GPU in his laptop. He thought it was a great idea to remove that and run it 24/7 at 200 watts sustained. Let's see how long that lasts. He just does all the stupid things a teenager would do, but with computer hardware.
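For what it's worth, how close a box is to that limit can be watched with the standard conntrack counters; a tiny sketch, assuming the usual Linux proc paths (present whenever nf_conntrack is loaded, which is typically the case on OpenWrt with its firewall running):

# Compare the number of tracked connections against the configured maximum.
def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
limit = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
print(f"tracked connections: {count}/{limit} ({100 * count / limit:.1f}% of limit)")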
His experimenting sounds fun (which might have something to do with the fact that I'm not at all impacted by it). You can learn by doing, but not for every mistake do you find out what you did wrong. I still don't know why I couldn't get jumbo frames working on a point-to-point link (so very few things involved, and all of them should support it) a few years ago.
It seems to really depend on the SSD controller, the flash chips and the ECC algorithm. I think so far it was only late PCIe 3 and early PCIe 4 SSDs from Samsung and Kingston on which I experienced bit rot issues. The SSDs you currently use for nico1 are notorious for uncorrectable bit rot: if you store a large file, never read it, and the host doesn't run monthly scrubs, it has a quite high likelihood of being corrupted after half a year. This is one of the main reasons I gave you those specific SSDs. Their bit rot was a massive pain for me and I wasted dozens of hours on it. Corrupted rotten blocks kept breaking my backups, because every time one was encountered the backup got canceled, leaving me to spend hours searching for whichever file contains the faulty block, restoring it from backup, trimming all empty space and hoping the issue is gone. I had Windows Server on those SSDs, so no fancy file system like ZFS or BTRFS to tell me which files are broken; I just had to dd the entire thing, see where it fails and then figure out which file sits at that position.
Thank you for this story (and I would love to know more if you don't mind). I'm very picky about buying SSDs (when I can afford to be). Quality like you saw varies, but what bothers me a lot is that it is easier to count the companies that don't do the scummy thing of changing internal components without changing the SKU; it is so commonplace that actually evaluating quality becomes MUCH harder.
I don't have great information on the prevalence of errors, so I'm not going to argue one way or the other.
That's why I will now actually check my kernel logs on Threadripper; I should have the data there, as for some reason journalctl keeps all the kernel logs since the 31st of May.
[...]
Sorry, but ECC errors and bit errors are exceedingly rare. I've had dozens of busy servers over the decades, and the only case where I had ECC errors was a CPU errata.
Let's check the logs on Threadripper and see who is right. I investigated a ton of ECC errors in late 2024, so I for sure wouldn't consider them "exceedingly rare". They are surprisingly common for me.
I've seen this conversation happen so many times, which is why I bowed out early, but as always it will be fun for me to hear it happen again.
I absolutely awesome.
??? Lol.
It already helped me detect many issues, mainly while building a new PC.
If you build a new PC you should do thorough testing, which includes memory testing; that would find those issues regardless of ECC. (Also, on that note, from what I found there is literally one memory checker that handles ECC intelligently, and the useful version is sadly paywalled, as I recently found out when dealing with a server and testing it.)
I never realized that they would be so expensive, given that even the cheapest latest-gen Threadripper Pro has 128 PCIe 5.0 lanes and 8 memory channels despite only having 12 cores.
The MSRP of the AMD Threadripper PRO 7945WX is $1399, which is well outside consumer CPU pricing, and it requires a motherboard and RAM that are also much more expensive than consumer parts (especially if you want to make use of the eight memory channels). I'm not making a value judgement here, but it is objectively in a different price segment than consumer hardware, as most consumers wouldn't even spend half of that CPU price on an entire system.
ECC doesn't just correct memory errors caused by faulty bits but also errors that happen during the transfer from memory to CPU. Unless the CPU really has some bug, data integrity should be guaranteed. DDR5 has some on-die ECC, but that does not cover the transfer, so even for DDR5 I'm using proper ECC memory. Honestly though, if you don't use your PC for anything important, DDR5 without ECC will likely be fine, as the on-die ECC in all DDR5 memory is quite decent at preventing random memory errors from things like cosmic rays.
You are correct about the difference in ECC, but your last sentence is very odd to me. If I'm not using a PC for anything important, then anything is fine; but even so I would trust a DDR4 system over a DDR5 one, as the on-die ECC exists because of the extremely high data transfer rates inherent to the standard, memory controllers for DDR5 are generally less mature, and DDR5 is still inherently more challenging to run.
Even the PCIe standard had to add error correction (though PCIe 6.0 also switched to PAM4, while DDR5 uses NRZ like previous PCIe revisions):
"because of the additional signal states a PAM4 signal itself is more fragile than a NRZ signal. And this means that along with PAM4, for the first time in PCIe’s history the standard is also getting Forward Error Correction (FEC). Living up to its name, Forward Error Correction is a means of correcting signal errors in a link by supplying a constant stream of error correction data, and it’s already commonly used in situations where data integrity is critical and there’s no time for a retransmission (such as DisplayPort 1.4 w/DSC). While FEC hasn’t been necessary for PCIe until now, PAM4’s fragility is going to change that. The inclusion of FEC shouldn’t make a noticeable difference to end-users, but for the PCI-SIG it’s another design requirement to contend with. In particular, the group needs to make sure that their FEC implementation is low-latency while still being appropriately robust, as PCIe users won’t want a significant increase in PCIe’s latency.
The upshot of the switch to PAM4 then is that by increasing the amount of data transmitted without increasing the frequency, the signal loss requirements won’t go up. PCIe 6.0 will have the same 36dB loss as PCIe 5.0, meaning that while trace lengths aren’t officially defined by the standard, a PCIe 6.0 link should be able to reach just as far as a PCIe 5.0 link. Which, coming from PCIe 5.0, is no doubt a relief to vendors and engineers alike." Source: https://www.anandtech.com/show/14559/pci-express-bandwidth-to-be-doubled-again-pcie-60-announced-spec-to-land-in-2021
Joking aside, it actually was a very interesting discussion
Same for me.
I've avoided Samsung for other reasons (ignoring FUA), but hearing that is a bit shocking. I have been keeping checksums of most of my files for decades now, so even before filesystems had data checksums (well, just btrfs out of the Linux ones, I think) I knew bitrot was a thing. I haven't caught an SSD doing that, but I have caught ext3 and xfs bugs that way in the past, and of course lots of hardware issues.
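A minimal sketch of that kind of checksum bookkeeping, assuming a plain manifest of SHA-256 hashes (the file layout and script name are made up for illustration):

import hashlib, os, sys

# Build or verify a manifest of "sha256  relative/path" lines for a directory tree.
def sha256(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def build(root, manifest):
    with open(manifest, "w") as out:
        for dirpath, _, files in os.walk(root):
            for name in sorted(files):
                p = os.path.join(dirpath, name)
                out.write(f"{sha256(p)}  {os.path.relpath(p, root)}\n")

def verify(root, manifest):
    for line in open(manifest):
        digest, rel = line.rstrip("\n").split("  ", 1)
        p = os.path.join(root, rel)
        if not os.path.exists(p):
            print(f"MISSING  {rel}")
        elif sha256(p) != digest:
            print(f"CHANGED  {rel}")  # bitrot, or a legitimate edit

if __name__ == "__main__":
    # usage: checkfiles.py build|verify <root> <manifest>
    cmd, root, manifest = sys.argv[1:4]
    (build if cmd == "build" else verify)(root, manifest)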
In any case, I can hardly believe that Samsungs would actually bitrot after just a few months while the disk is actually on (even if it were off it would be hard to believe). Surely this would be well known if that were really the case in general, rather than some faulty specimens? I mean, I believe you, but, sheesh, it can't be, can it?
In any case, I hope you run a monthly scrub on my disks then? :)