// Screen-VLA/services/prompts.ts
// Gemini prompt templates for the VLA Data Generator (TypeScript/React app).
export const GET_OVERALL_GOAL_PROMPT = `
You are an expert Vision-Language-Action (VLA) data generation model. Your task is to analyze a few keyframes from a screen recording to determine the user's single, high-level overall goal for the entire session.
Based on the provided frames, what is the user's primary objective?
Provide your response ONLY in the following JSON format. Do not add any extra commentary or markdown.
{
  "overallGoal": "A string describing the user's main objective."
}
`;
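A hypothetical helper (not part of the original app; the function name, response type, and fence-stripping regex are assumptions) sketching how a caller might parse the reply this prompt requests:

```typescript
// Expected shape of the model's reply to GET_OVERALL_GOAL_PROMPT.
interface OverallGoalResponse {
  overallGoal: string;
}

// Strip any ```json fences the model may add despite the instructions,
// then parse and validate the single expected field.
function parseOverallGoal(raw: string): OverallGoalResponse {
  const cleaned = raw.replace(/^```(?:json)?\s*|\s*```$/g, "").trim();
  const data = JSON.parse(cleaned);
  if (typeof data.overallGoal !== "string") {
    throw new Error("Response is missing a string 'overallGoal' field.");
  }
  return data as OverallGoalResponse;
}
```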
export const GET_TASKS_AND_INTERACTIONS_PROMPT = (startFrame: number, endFrame: number): string => `
You are an expert Vision-Language-Action (VLA) data generation model. You are analyzing a segment of a screen recording. The frames provided are from global frame index ${startFrame} to ${endFrame}.
Your task is to identify distinct, chronological tasks AND the specific user interactions within each task.
**Overall Instructions:**
1. **Identify Tasks:** A task is a continuous sequence of actions that achieves a small sub-goal. Create a new task only when the user's immediate goal changes significantly. A task should span multiple actions, never a single one.
2. **Identify Interactions:** Within each task, identify every key user interaction ('click' or 'type').
3. **Crucially, if a user types into a field, there is almost always a 'click' interaction to select that field first. You MUST capture this preceding 'click' on the input field.**
4. **Frame Indices:** All frame indices you return MUST be within the global range of ${startFrame} to ${endFrame}.
5. **Accuracy is critical.** Do not guess. If you cannot see the text on an element, do not invent it.
**Task Definition:**
For each task, provide:
a. \`description\`: A concise, goal-oriented description (e.g., "Search for 'quarterly results'").
b. \`startFrame\` & \`endFrame\`: The global start and end frame indices for the task.
c. \`interactions\`: An array of all user interactions that occur during that task.
**Interaction Definition:**
For each interaction, provide:
a. \`type\`: Must be 'click' or 'type'.
b. \`details\`:
- For a 'click': The exact text of the element clicked. If no text, describe the icon (e.g., "Clicked the gear icon for settings.").
- For a 'type': The text that was typed and where (e.g., "Typed 'hello world' into the search bar.").
c. \`frameIndex\`: The global frame index where the interaction is most visible.
d. For 'click' interactions ONLY:
- \`x\`: Relative horizontal position (0.0 to 1.0).
- \`y\`: Relative vertical position (0.0 to 1.0).
**Output Format:**
Provide your response ONLY as a JSON array of task objects. Do not add any extra commentary or markdown.
Example Response:
[
  {
    "description": "Navigate to the settings page.",
    "startFrame": ${startFrame + 1},
    "endFrame": ${startFrame + 8},
    "interactions": [
      {
        "type": "click",
        "details": "Clicked the user profile avatar.",
        "frameIndex": ${startFrame + 3},
        "x": 0.95,
        "y": 0.1
      },
      {
        "type": "click",
        "details": "Clicked the 'Settings' menu item.",
        "frameIndex": ${startFrame + 7},
        "x": 0.9,
        "y": 0.25
      }
    ]
  },
  {
    "description": "Search for a document.",
    "startFrame": ${startFrame + 10},
    "endFrame": ${startFrame + 15},
    "interactions": [
      {
        "type": "click",
        "details": "Clicked on the search bar with placeholder 'Search documents'.",
        "frameIndex": ${startFrame + 11},
        "x": 0.45,
        "y": 0.15
      },
      {
        "type": "type",
        "details": "Typed 'product roadmap' into the search bar.",
        "frameIndex": ${startFrame + 14}
      }
    ]
  }
]
`;
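A hypothetical schema and validator (not part of the original app; the type and function names are assumptions) sketching how a caller might check a parsed reply against the constraints the prompt above imposes — global frame bounds, per-task interaction ranges, and click coordinates in [0, 1]:

```typescript
// Types mirroring the task/interaction schema described in the prompt.
type InteractionType = "click" | "type";

interface Interaction {
  type: InteractionType;
  details: string;
  frameIndex: number;
  x?: number; // present for 'click' only, 0.0 to 1.0
  y?: number; // present for 'click' only, 0.0 to 1.0
}

interface Task {
  description: string;
  startFrame: number;
  endFrame: number;
  interactions: Interaction[];
}

// Return a list of constraint violations; an empty array means the
// response obeys the prompt's rules.
function validateTasks(tasks: Task[], startFrame: number, endFrame: number): string[] {
  const errors: string[] = [];
  for (const task of tasks) {
    if (task.startFrame < startFrame || task.endFrame > endFrame) {
      errors.push(`Task "${task.description}" is outside frames ${startFrame}-${endFrame}.`);
    }
    for (const i of task.interactions) {
      if (i.frameIndex < task.startFrame || i.frameIndex > task.endFrame) {
        errors.push(`Interaction "${i.details}" is outside its task's frame range.`);
      }
      if (i.type === "click") {
        if (i.x === undefined || i.y === undefined || i.x < 0 || i.x > 1 || i.y < 0 || i.y > 1) {
          errors.push(`Click "${i.details}" needs x and y in [0, 1].`);
        }
      }
    }
  }
  return errors;
}
```

The validator is deliberately lenient about 'type' interactions, which the prompt exempts from carrying coordinates.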