Submitting new results to the leaderboard
Hi ServiceNow team,
Thank you for creating the leaderboard and the BrowserGym library. We have been building agents with it and would like to submit some new results on WebArena and MiniWoB. One question: we used a slightly different WebArena task JSON file that incorporates evaluation fixes (e.g., typo fixes) from WebArena-Lite: https://github.com/THUDM/VisualAgentBench/blob/main/VAB-WebArena-Lite/new/test_webarena_lite.raw.json
Do you accept this kind of fix, and if so, how should we indicate it in the submission?
That's a good point, but it's a hard one to solve. Are these fixes similar to AgentOccam's modifications?
One objective of this leaderboard is to keep evaluations comparable. A "better" version of WebArena with known issues fixed would definitely be preferable, but we would need to accept only a single modified version of WebArena.
Does your proposed version qualify as a finalized version of a fixed WebArena, or should we expect further fixes from other contributors?
Thanks for the clarification. Our version is definitely not finalized: it is primarily based on WebArena-Lite's changes, and other works also modify the task JSON file, such as AgentOccam, which you mentioned. Some of them also require changes to the evaluator in addition to the JSON file. It makes sense to consolidate the changes from all of these works into a single finalized version to ensure evaluation correctness.
That said, we agree with the point about fair comparison, so we will first submit our results based on the original WebArena JSON file.
Hey, apologies for the delayed response. I will have a look at the PR and merge it soon!
I merged the PR, so I'm closing this discussion. Thanks for your contribution :)