This model performs worse on complex problems than DeepSeek R1
I tested it on the lineage-bench benchmark. While performance on "simple" problems matches the original model, on "complex" problems such as lineage quizzes asking about relations among 32 or 64 people the performance is significantly worse than that of DeepSeek R1.
Nr | model_name | lineage (avg) | lineage-8 | lineage-16 | lineage-32 | lineage-64 |
---|---|---|---|---|---|---|
2 | deepseek/deepseek-r1 | 0.917 | 0.965 | 0.980 | 0.945 | 0.780 |
6 | perplexity/r1-1776 | 0.709 | 0.980 | 0.975 | 0.675 | 0.205 |
@sszymczyk I strongly doubt the benchmark was run correctly. Specifically, the DeepSeek API does not count thinking tokens as part of `max_tokens`. I would try running with a much longer `max_tokens` for r1-1776 and see if the results still hold.
I ran the benchmark in both cases via OpenRouter, so the meaning of `max_tokens` is unified here. Also, I don't see any output exceeding 10k tokens (I had `max_tokens` set to 16384).
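For reference, the requests look roughly like the sketch below (this is not the actual run_openrouter.py script, and the quiz prompt is a placeholder); `max_tokens` is passed explicitly through OpenRouter's OpenAI-compatible endpoint:

```python
# Minimal sketch of the kind of request the benchmark sends through OpenRouter
# (not the actual run_openrouter.py; the quiz prompt is a placeholder).
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "perplexity/r1-1776",
        "messages": [{"role": "user", "content": "<lineage quiz prompt>"}],
        "max_tokens": 16384,  # same limit as in the benchmark runs above
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```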
What I do see is weird behavior of the model in its reasoning traces. It starts normally, then begins switching to upper case, for example:
statement 22 says \"Natalie IS WILLIE'S ANCESTOR,\" so Natalie->Willie; which connects to Cynthia via statement 19 (\"Cynthia IS WILLIE'S DESCENDANT\")
or stops using spaces like:
WilliedescendsfromNatalie(statement22):Natalie--Willie ?\n\nWait no—statement22 states \"Natalieis Willies_s_ancestor\", meaning Natalie precedes him =>Natalie--Willie .\n\nAnd NataliedescendsfromGerald(statement12)
or inserts extra spaces like:
K arenisan cestortoD ebr a(D ebr a<K arenviapoin t25.)D ebr aisancestortoChristin a(C hristin a<D ebr aviap oint26.
or starts using weird accent characters:
Brúce hacia María(Brúce anc María). María hacia Káren(María anc Káren)
or starts garbling words, like:
neither one directly links as ancesotr or descedant of the other.Hence answer should be option5 None of the above.
As the reasoning progresses, the output becomes less and less comprehensible. This is not normal.
Hmm interesting, there may be an issue with our serving stack somewhere. Can you give me a prompt which produces this strange behavior in the COT that you're seeing?
I ran the benchmark in both cases via OpenRouter, so the meaning of `max_tokens` is unified here.
I highly doubt this, btw, if you're using the DeepSeek API directly, as it's not something their API allows you to do. The length of the COT is completely out of the user's hands.
@eugenhotaj-ppl I tried one example prompt in Perplexity Labs Playground and got the following output (prompt also included): https://pastebin.com/EPy06bqp
It displays most of the problems I mentioned above.
@sszymczyk thanks, looking into this.
@sszymczyk thanks a lot for bringing this to our attention, there is indeed an issue with our serving that we're trying to figure out.
I realize this is a bit unsatisfying, but I ran your eval with the following command on a local deployment that we confirmed to be good:
`./lineage_bench.py -s -l 64 -n 10 -r 42 | ./run_openrouter.py -m "perplexity/r1-1776" -t 96 -v | tee results/r1-1776_64.csv`
On `-l 64` we get 0.775, which looks on par with R1 to me.
Here is the output on the prompt you posted: https://pastebin.com/bGHr9HHQ
I will comment here again once we fix the deployment and you can re-run evals.
@sszymczyk thanks a lot for bringing this to our attention, there is indeed an issue with our serving that we're trying to figure out.
...
I will comment here again once we fix the deployment and you can re-run evals.
@eugenhotaj-ppl I'm looking forward to it. Too bad you don't have a bug bounty program, I could use some help in paying for all these reasoning tokens.
@sszymczyk we're deploying now, would give it another ~1 hour to be safe.
Too bad you don't have a bug bounty program, I could use some help in paying for all these reasoning tokens.
I hear you, we don't have a bug bounty but can give you API credits / compensate the ones you've used. Is this something you'd be interested in?
@eugenhotaj-ppl Sure, API credits would be great. I have a perplexity.ai account with username sszymczy11038.
On my OpenRouter activity page I see that I spent $38.22 overall on perplexity/r1-1776 in the last few days.
The benchmark run alone probably wouldn't cost as much, but there were other issues when using this model, like the server returning empty responses, internal server errors, etc., that I was trying to diagnose. Finally I gave up and changed my approach to "in case of error try again".
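The retry logic is roughly this shape, i.e. a loop around each request (a sketch, not the actual benchmark code; the endpoint and payload mirror the OpenRouter request above):

```python
# Sketch of the "in case of error try again" workaround (not the actual
# benchmark code): retry when the server errors out or returns an empty reply.
import os
import time
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

def query_with_retries(prompt: str, attempts: int = 5) -> str:
    for attempt in range(attempts):
        try:
            resp = requests.post(
                "https://openrouter.ai/api/v1/chat/completions",
                headers=HEADERS,
                json={
                    "model": "perplexity/r1-1776",
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": 16384,
                },
                timeout=600,
            )
            resp.raise_for_status()
            content = resp.json()["choices"][0]["message"]["content"]
            if content:  # treat empty responses as failures too
                return content
        except requests.RequestException:
            pass  # internal server errors, timeouts, etc.
        time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError(f"no usable response after {attempts} attempts")
```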
The benchmark run alone probably wouldn't cost as much, but there were other issues when using this model, like the server returning empty responses, internal server errors, etc., that I was trying to diagnose. Finally I gave up and changed my approach to "in case of error try again".
We're still scaling up r1-1776 capacity, so the qps we can handle right now is kind of low. If you send too many concurrent requests they might start getting rejected, unfortunately. When I re-ran your benchmark through our API I used 4 threads; it took ~10-15 min for lineage-64 and I didn't see any failures.
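If it helps, capping concurrency at a handful of workers is roughly what I mean; a hypothetical sketch (not the benchmark's own run_openrouter.py, and using the OpenRouter endpoint from earlier in the thread):

```python
# Rough sketch of capping concurrency at 4 in-flight requests (hypothetical
# client code, not the benchmark's run_openrouter.py).
import os
from concurrent.futures import ThreadPoolExecutor

import requests

HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
prompts = ["<quiz 1>", "<quiz 2>", "<quiz 3>"]  # placeholders for generated quizzes

def run_one(prompt: str) -> str:
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers=HEADERS,
        json={
            "model": "perplexity/r1-1776",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 16384,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# At most 4 requests are sent concurrently, matching the 4-thread run above.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_one, prompts))
```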
@eugenhotaj-ppl OK, I'll try to be gentle when re-running the benchmark in the future.
Thank you @sszymczyk for helping us improve our API!
BTW would appreciate an update to the reddit post as well 🙂: https://old.reddit.com/r/LocalLLaMA/comments/1izbmbb/perplexity_r1_1776_performs_worse_than_deepseek/
@eugenhotaj-ppl Post updated, also created a new one: https://www.reddit.com/r/LocalLLaMA/comments/1j3hjxb/perplexity_r1_1776_climbed_to_first_place_after/
While in my first Reddit post a wumao army of pro-China commenters attacked Perplexity, in the second they downplayed the importance of my benchmark result. Classic Reddit.
Anyway, I did some more testing and compared the performance of DeepSeek R1 and Perplexity R1 1776 on more complex lineage-128 problems, and found almost no difference between the models: https://www.reddit.com/r/LocalLLaMA/comments/1j49sbd/is_there_a_statistically_significant_difference/
@sszymczyk yea, this matches what I would expect; we did not do any significant tuning for reasoning capabilities and just tried to maintain the original performance. Any differences are very likely due to noise.