ServiceNow/browsergym-leaderboard

Submitting new results to the leaderboard

by leebird - opened 6 days ago

Hi ServiceNow team,

Thank you for creating the leaderboard and the browsergym library. We have been building agents with it and would like to submit new results on WebArena and MiniWoB. One question: we used a slightly different WebArena task JSON file that incorporates evaluation fixes (e.g., typo fixes) from WebArena-Lite: https://github.com/THUDM/VisualAgentBench/blob/main/VAB-WebArena-Lite/new/test_webarena_lite.raw.json.

Do you accept this kind of fix, and if so, how should we indicate it in the submission?

recursix

ServiceNow org 5 days ago

That's a good point, but it's a hard one to solve. Are these similar to AgentOccam's modifications?
One objective of this leaderboard is to keep evaluations comparable. Having a "better" version of WebArena with the known issues fixed would definitely be preferable, but we would need to accept only a single modified version of WebArena.

Does your proposed version qualify as a finalized version of a fixed WebArena? Or should we expect further fixes from other contributors?

leebird

4 days ago

Thanks for the clarification. Our version is definitely not finalized: it is primarily based on WebArena-Lite's changes, and other works also modify the task JSON file, such as AgentOccam, which you mentioned. Some of them also require modifications to the evaluator in addition to the JSON file. It makes sense to consolidate the changes from all of these works into one finalized version for evaluation correctness.

That said, we agree with the point about fair comparison. We will therefore first submit our results based on the original WebArena JSON file.
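
For anyone who wants to see how a modified task file differs from the original before consolidating fixes, here is a minimal sketch. It assumes both files are JSON lists of task objects that share a `task_id` key. The helper names (`index_by_task_id`, `diff_task_files`) and the toy fields in the example are hypothetical and only for illustration, not part of browsergym or WebArena.

```python
import json
from pathlib import Path


def load_tasks(path):
    """Load a task file, assumed to be a JSON list of task objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def index_by_task_id(tasks):
    """Map task_id -> task dict (assumes every task has a 'task_id' key)."""
    return {task["task_id"]: task for task in tasks}


def diff_task_files(original_tasks, modified_tasks):
    """Return {task_id: [changed top-level keys]} for tasks present in both lists."""
    original = index_by_task_id(original_tasks)
    modified = index_by_task_id(modified_tasks)
    changes = {}
    for task_id in sorted(original.keys() & modified.keys()):
        before, after = original[task_id], modified[task_id]
        changed = sorted(
            key
            for key in before.keys() | after.keys()
            if before.get(key) != after.get(key)
        )
        if changed:
            changes[task_id] = changed
    return changes


if __name__ == "__main__":
    # Toy data standing in for the two task files (field names are assumptions).
    original = [
        {"task_id": 0, "intent": "Find the top seller", "eval": {"answer": "Quest"}},
        {"task_id": 1, "intent": "List open issues", "eval": {"answer": "3"}},
    ]
    modified = [
        {"task_id": 0, "intent": "Find the top seller", "eval": {"answer": "Quest Lumaflex"}},
        {"task_id": 1, "intent": "List open issues", "eval": {"answer": "3"}},
    ]
    for task_id, keys in diff_task_files(original, modified).items():
        print(f"task {task_id}: changed keys {keys}")
```

A per-task list like this could be attached to a submission to document exactly which tasks were evaluated differently. It does not cover changes to the evaluator code itself.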

leebird

about 17 hours ago

Hi @recursix ,

We have created a draft PR for agents using the original WebArena task JSON config. Just to confirm, do we need any approval to submit it?

Thanks

leebird

about 10 hours ago

Hi @recursix, I've submitted the PR for review. Please let us know if it looks good.

Thanks!

meghsn

ServiceNow org about 8 hours ago

Hey, apologies for the delayed response. I will have a look at the PR and merge it soon!

meghsn

ServiceNow org about 8 hours ago

I merged the PR and am closing this discussion. Thanks for your contribution :)

meghsn changed discussion status to closed about 8 hours ago
