Submitting new results to the leaderboard

#7
by leebird - opened

Hi ServiceNow team,

Thank you for creating the leaderboard and the browsergym library. We have been building agents with it and would like to submit some new results on WebArena and MiniWoB. One question: we used a slightly different WebArena task json file that incorporates eval fixes (e.g., typo fixes) from WebArena-Lite: https://github.com/THUDM/VisualAgentBench/blob/main/VAB-WebArena-Lite/new/test_webarena_lite.raw.json.

Do you accept this kind of fix, and if so, how can we indicate this information in the submission?

ServiceNow org

That's a good point, but it's a hard one to solve. Are these similar to AgentOccam's modification?
One objective for this leaderboard is to keep evaluations comparable. However, having a "better" version of WebArena with fixed issues would definitely be preferable. But we would need to accept only a single modified version of WebArena.

Does your proposed version qualify as a finalized version of a fixed WebArena, or should we expect further fixes from other contributors?

Thanks for the clarification. Our version is definitely not finalized: it is primarily based on WebArena-Lite's changes, and other works, such as the AgentOccam you mentioned, also modify the task json file. Some of them also require changes to the evaluator in addition to the json file. It makes sense to consolidate the changes from all these works into one finalized version for evaluation correctness.

That said, we agree with the point about fair comparison. We will first submit our results based on the original WebArena json file.
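
For anyone wanting to see exactly which tasks a modified task file changes relative to the original, a minimal sketch along these lines may help. It assumes both files are JSON lists of task objects that share a `task_id` field; the function names and the file names in the usage note are hypothetical. If the modified file renumbers its tasks, the ids would need to be mapped back to the original ones first.

```python
import json
from pathlib import Path


def load_tasks(path):
    """Load a task file and index its entries by task_id (assumed field name)."""
    tasks = json.loads(Path(path).read_text(encoding="utf-8"))
    return {task["task_id"]: task for task in tasks}


def diff_tasks(original, modified):
    """Return {task_id: sorted top-level keys whose values differ}."""
    changes = {}
    for task_id, new_task in modified.items():
        old_task = original.get(task_id)
        if old_task is None:
            # Task exists only in the modified file.
            changes[task_id] = ["<not in original>"]
            continue
        keys = set(old_task) | set(new_task)
        changed = sorted(k for k in keys if old_task.get(k) != new_task.get(k))
        if changed:
            changes[task_id] = changed
    return changes
```

Usage would look like `diff_tasks(load_tasks("test.raw.json"), load_tasks("test_modified.raw.json"))`, with both paths standing in for wherever the two files live locally. The resulting list of changed task ids and fields is one possible way to document a modified task file alongside a submission.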

Hi @recursix ,

We have created a draft PR for agents using the original WebArena task json config. Just to confirm, do we need any approval to submit it?

Thanks

Hi @recursix , I've submitted the PR for review. Please let us know if it looks good.

Thanks!

ServiceNow org

Hey, apologies for the delayed response. I will have a look at the PR and merge it soon!

ServiceNow org

I merged the PR, so I'm closing this discussion. Thanks for your contribution :)

meghsn changed discussion status to closed
