This model performs worse on complex problems than DeepSeek R1
I tested it on the lineage-bench benchmark. While performance on "simple" problems matches the original model, on "complex" problems such as lineage quizzes asking about relations among 32 or 64 people the performance is significantly worse than that of DeepSeek R1.
Nr | model_name | lineage (avg) | lineage-8 | lineage-16 | lineage-32 | lineage-64 |
---|---|---|---|---|---|---|
2 | deepseek/deepseek-r1 | 0.917 | 0.965 | 0.980 | 0.945 | 0.780 |
6 | perplexity/r1-1776 | 0.709 | 0.980 | 0.975 | 0.675 | 0.205 |
@sszymczyk I strongly doubt the benchmark was run correctly. Specifically, the DeepSeek API does not count thinking tokens as part of `max_tokens`. I would try running with a much longer `max_tokens` for r1-1776 and see if the results still hold.
I ran the benchmark in both cases via OpenRouter, so the meaning of `max_tokens` is unified here. Also, I don't see any output exceeding 10k tokens (I had `max_tokens` set to 16384).
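For reference, the requests look roughly like the sketch below (this is not the actual run_openrouter.py script, and the quiz prompt is a placeholder); `max_tokens` is passed explicitly through OpenRouter's OpenAI-compatible endpoint:

```python
# Minimal sketch of the kind of request the benchmark sends through OpenRouter
# (not the actual run_openrouter.py; the quiz prompt is a placeholder).
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "perplexity/r1-1776",
        "messages": [{"role": "user", "content": "<lineage quiz prompt>"}],
        "max_tokens": 16384,  # same limit as in the benchmark runs above
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```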
What I do see is weird behavior of the model in its reasoning traces. It starts normally, then begins switching to upper case, for example:
statement 22 says \"Natalie IS WILLIE'S ANCESTOR,\" so Natalie->Willie; which connects to Cynthia via statement 19 (\"Cynthia IS WILLIE'S DESCENDANT\")
or stops using spaces like:
WilliedescendsfromNatalie(statement22):Natalie--Willie ?\n\nWait no—statement22 states \"Natalieis Willies_s_ancestor\", meaning Natalie precedes him =>Natalie--Willie .\n\nAnd NataliedescendsfromGerald(statement12)
or inserts extra spaces like:
K arenisan cestortoD ebr a(D ebr a<K arenviapoin t25.)D ebr aisancestortoChristin a(C hristin a<D ebr aviap oint26.
or starts using weird accent characters:
Brúce hacia María(Brúce anc María). María hacia Káren(María anc Káren)
or starts garbling words, like:
neither one directly links as ancesotr or descedant of the other.Hence answer should be option5 None of the above.
As the reasoning progresses, the output becomes less and less comprehensible. This is not normal.
Hmm interesting, there may be an issue with our serving stack somewhere. Can you give me a prompt which produces this strange behavior in the COT that you're seeing?
I ran the benchmark in both cases via OpenRouter, so the meaning of `max_tokens` is unified here.
I highly doubt this, btw, if you're using the DeepSeek API directly, as it's not something their API allows you to do. The length of the COT is completely out of the user's hands.
@eugenhotaj-ppl I tried one example prompt in Perplexity Labs Playground and got the following output (prompt also included): https://pastebin.com/EPy06bqp
It displays most of the problems I mentioned above.
@sszymczyk thanks, looking into this.
@sszymczyk thanks a lot for bringing this to our attention, there is indeed an issue with our serving that we're trying to figure out.
I realize this is a bit unsatisfying, but I ran your eval with the following command on a local deployment that we confirmed to be good:
`./lineage_bench.py -s -l 64 -n 10 -r 42 | ./run_openrouter.py -m "perplexity/r1-1776" -t 96 -v | tee results/r1-1776_64.csv`
On `-l 64` we get 0.775, which looks on par with R1 to me.
Here is the output on the prompt you posted: https://pastebin.com/bGHr9HHQ
I will comment here again once we fix the deployment and you can re-run evals.
@sszymczyk thanks a lot for bringing this to our attention, there is indeed an issue with our serving that we're trying to figure out.
...
I will comment here again once we fix the deployment and you can re-run evals.
@eugenhotaj-ppl I'm looking forward to it. Too bad you don't have a bug bounty program, I could use some help in paying for all these reasoning tokens.
@sszymczyk we're deploying now, would give it another ~1 hour to be safe.
Too bad you don't have a bug bounty program, I could use some help in paying for all these reasoning tokens.
I hear you, we don't have a bug bounty but can give you API credits / compensate the ones you've used. Is this something you'd be interested in?
@eugenhotaj-ppl Sure, API credits would be great. I have a perplexity.ai account with username sszymczy11038.
On my OpenRouter activity page I see that I spent $38.22 overall on perplexity/r1-1776 in the last few days.
The benchmark run alone probably wouldn't cost as much, but there were other issues when using this model, like the server returning empty responses, internal server errors, etc., that I was trying to diagnose. Finally I gave up and changed my approach to "in case of error try again".
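The retry logic is roughly this shape, i.e. a loop around each request (a sketch, not the actual benchmark code; the endpoint and payload mirror the OpenRouter request above):

```python
# Sketch of the "in case of error try again" workaround (not the actual
# benchmark code): retry when the server errors out or returns an empty reply.
import os
import time
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

def query_with_retries(prompt: str, attempts: int = 5) -> str:
    for attempt in range(attempts):
        try:
            resp = requests.post(
                "https://openrouter.ai/api/v1/chat/completions",
                headers=HEADERS,
                json={
                    "model": "perplexity/r1-1776",
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": 16384,
                },
                timeout=600,
            )
            resp.raise_for_status()
            content = resp.json()["choices"][0]["message"]["content"]
            if content:  # treat empty responses as failures too
                return content
        except requests.RequestException:
            pass  # internal server errors, timeouts, etc.
        time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError(f"no usable response after {attempts} attempts")
```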
The benchmark run alone probably wouldn't cost as much, but there were other issues when using this model, like the server returning empty responses, internal server errors, etc., that I was trying to diagnose. Finally I gave up and changed my approach to "in case of error try again".
We're still scaling up r1-1776 capacity, so the qps we can handle right now is kind of low. If you send too many concurrent requests they might start getting rejected, unfortunately. When I re-ran your benchmark through our API I used 4 threads; it took ~10-15 min for lineage-64 and I didn't see any failures.
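If it helps, capping concurrency at a handful of workers is roughly what I mean; a hypothetical sketch (not the benchmark's own run_openrouter.py, and using the OpenRouter endpoint from earlier in the thread):

```python
# Rough sketch of capping concurrency at 4 in-flight requests (hypothetical
# client code, not the benchmark's run_openrouter.py).
import os
from concurrent.futures import ThreadPoolExecutor

import requests

HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
prompts = ["<quiz 1>", "<quiz 2>", "<quiz 3>"]  # placeholders for generated quizzes

def run_one(prompt: str) -> str:
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers=HEADERS,
        json={
            "model": "perplexity/r1-1776",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 16384,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# At most 4 requests are sent concurrently, matching the 4-thread run above.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_one, prompts))
```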
@eugenhotaj-ppl OK, I'll try to be gentle when re-running the benchmark in the future.
Thank you @sszymczyk for helping us improve our API!
BTW would appreciate an update to the reddit post as well 🙂: https://old.reddit.com/r/LocalLLaMA/comments/1izbmbb/perplexity_r1_1776_performs_worse_than_deepseek/
@eugenhotaj-ppl Post updated, also created a new one: https://www.reddit.com/r/LocalLLaMA/comments/1j3hjxb/perplexity_r1_1776_climbed_to_first_place_after/
While in my first Reddit post a wumao army of pro-China commenters attacked Perplexity, in the second they downplayed the importance of my benchmark result. Classic Reddit.
Anyway, I did some more testing and compared the performance of DeepSeek R1 and Perplexity R1 1776 on more complex lineage-128 problems, and found almost no difference between the models: https://www.reddit.com/r/LocalLLaMA/comments/1j49sbd/is_there_a_statistically_significant_difference/
@sszymczyk yea, this matches what I would expect; we did not do any significant tuning for reasoning capabilities and just tried to maintain the original performance. Any differences are very likely due to noise.