Perhaps we can provide a couple of thousand human annotations
Jason Corkill
jasoncorkill
AI & ML interests
Human data annotation
Recent Activity
liked a dataset 9 days ago
Rapidata/Recraft-v3-24-7-25_t2i_human_preference
published a dataset 9 days ago
Rapidata/Recraft-v3-24-7-25_t2i_human_preference
liked a dataset 9 days ago
Rapidata/Imagen-4-ultra-24-7-25_t2i_human_preference
Organizations

replied to their post about 2 months ago

replied to their post about 2 months ago
Interesting, what kind of data are you collecting?

replied to their post about 2 months ago
Funny, we also noticed that these models will almost always revert to the Question - Answer Style Joke if not prompted otherwise.

posted an update about 2 months ago
Post
3242
"Why did the bee get married?"
"Because he found his honey!"
This was the "funniest" joke out of 10'000 jokes we generated with LLMs, with 68% of respondents rating it as "funny".
Original jokes are particularly hard for LLMs, as jokes are very nuanced and a lot of context is needed to understand whether something is "funny". That is something that can only be measured reliably by asking humans.
LLMs are not equally good at generating jokes in every language. Generated English jokes turned out to be much funnier than the Japanese ones. On average, 46% of English-speaking voters found a generated joke funny. The same statistic for other languages:
Vietnamese: 44%
Portuguese: 40%
Arabic: 37%
Japanese: 28%
There is not much variance in generation quality among models for any fixed language. Still, Claude Sonnet 4 slightly outperforms the others in Vietnamese, Arabic and Japanese, and Gemini 2.5 Flash does so in Portuguese and English.
We have released the 1 million (!) native speaker ratings and the 10'000 jokes as a dataset for anyone to use:
Rapidata/multilingual-llm-jokes-4o-claude-gemini
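For anyone who wants to recompute numbers like the per-language percentages from the raw ratings, here is a minimal sketch of one way to do it. It assumes the "on average" figure is the mean of per-joke funny rates within a language, which the post does not spell out. The row layout (language, joke_id, rated_funny) and the toy rows are made up for illustration and are not the schema or contents of the released dataset.

```python
from collections import defaultdict

# Hypothetical toy ratings: (language, joke_id, rated_funny).
# These rows are invented for illustration; they are NOT from the dataset.
ratings = [
    ("English", "j1", True),
    ("English", "j1", False),
    ("English", "j2", True),
    ("Japanese", "j3", False),
    ("Japanese", "j3", False),
    ("Japanese", "j4", True),
    ("Japanese", "j4", False),
]


def funny_rate_by_joke(rows):
    """Return {(language, joke_id): share of ratings that were 'funny'}."""
    counts = defaultdict(lambda: [0, 0])  # [funny votes, total votes]
    for language, joke_id, rated_funny in rows:
        counts[(language, joke_id)][0] += int(rated_funny)
        counts[(language, joke_id)][1] += 1
    return {key: funny / total for key, (funny, total) in counts.items()}


def mean_funny_rate_by_language(rows):
    """Average the per-joke funny rates within each language (an assumption)."""
    per_language = defaultdict(list)
    for (language, _joke_id), rate in funny_rate_by_joke(rows).items():
        per_language[language].append(rate)
    return {lang: sum(rates) / len(rates) for lang, rates in per_language.items()}


per_joke = funny_rate_by_joke(ratings)
per_language = mean_funny_rate_by_language(ratings)
```

Averaging per joke first gives every joke equal weight regardless of how many votes it received. Pooling all votes per language instead would weight heavily rated jokes more, so the two approaches can give slightly different percentages.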

posted an update 3 months ago
Post
2430
Imagine you could have an Image Arena score equivalent at each checkpoint during training. We released the first version of just that:
Crowd-Eval
Add one line of code to your training loop and you will have a new real human loss curve in your W&B dashboard.
Thousands of real humans from around the world rating your model in real time at the cost of a few dollars per checkpoint is a game changer.
Check it out here: https://github.com/RapidataAI/crowd-eval
First 5 people to put it in their loop get 100'000 human responses for free! (ping me)

replied to their post 3 months ago
Good catch :) yes, we uploaded them shortly after!

replied to their post 3 months ago
Hey Jackson, can you please elaborate?

posted an update 3 months ago
Post
3952
Benchmark Update: @google Veo3 (Text-to-Video)
Two months ago, we benchmarked @google's Veo2 model. It fell short, struggling with style consistency and temporal coherence, trailing behind Runway, Pika, @tencent, and even @alibaba-pai.
That's changed.
We just wrapped up benchmarking Veo3, and the improvements are substantial. It outperformed every other model by a wide margin across all key metrics. Not just better, dominating across style, coherence, and prompt adherence. It's rare to see such a clear lead in today's hyper-competitive T2V landscape.
Dataset coming soon. Stay tuned.

posted an update 4 months ago
Post
2879
Hidream I1 is online!
We just added Hidream I1 to our T2I leaderboard (https://www.rapidata.ai/leaderboard/image-models) benchmarked using 195k+ human responses from 38k+ annotators, all collected in under 24 hours.
It landed #3 overall, right behind:
- @openai 4o
- @black-forest-labs Flux 1 Pro
...and just ahead of @black-forest-labs Flux 1.1 Pro, @xai-org Aurora and @google Imagen3.
Want to dig into the data? Check out our dataset here:
Rapidata/Hidream_t2i_human_preference
What model should we benchmark next?