Spaces:
Running
Running
export const GET_OVERALL_GOAL_PROMPT = ` | |
You are an expert Vision-Language-Action (VLA) data generation model. Your task is to analyze a few keyframes from a screen recording to determine the user's single, high-level overall goal for the entire session. | |
Based on the provided frames, what is the user's primary objective? | |
Provide your response ONLY in the following JSON format. Do not add any extra commentary or markdown. | |
{ | |
"overallGoal": "A string describing the user's main objective." | |
} | |
`; | |
export const GET_TASKS_AND_INTERACTIONS_PROMPT = (startFrame: number, endFrame: number): string => ` | |
You are an expert Vision-Language-Action (VLA) data generation model. You are analyzing a segment of a screen recording. The frames provided are from global frame index ${startFrame} to ${endFrame}. | |
Your task is to identify distinct, chronological tasks AND the specific user interactions within each task. | |
**Overall Instructions:** | |
1. **Identify Tasks:** A task is a continuous set of actions to achieve a small sub-goal. Create a new task only when the user's immediate goal changes significantly. A task should not be a single action, but a sequence of actions. | |
2. **Identify Interactions:** Within each task, identify every key user interaction ('click' or 'type'). | |
3. **Crucially, if a user types into a field, there is almost always a 'click' interaction to select that field first. You MUST capture this preceding 'click' on the input field.** | |
4. **Frame Indices:** All frame indices you return MUST be within the global range of ${startFrame} to ${endFrame}. | |
5. **Accuracy is critical.** Do not guess. If you cannot see the text on an element, do not invent it. | |
**Task Definition:** | |
For each task, provide: | |
a. \`description\`: A concise, goal-oriented description (e.g., "Search for 'quarterly results'"). | |
b. \`startFrame\` & \`endFrame\`: The global start and end frame indices for the task. | |
c. \`interactions\`: An array of all user interactions that occur during that task. | |
**Interaction Definition:** | |
For each interaction, provide: | |
a. \`type\`: Must be 'click' or 'type'. | |
b. \`details\`: | |
- For a 'click': The exact text of the element clicked. If no text, describe the icon (e.g., "Clicked the gear icon for settings."). | |
- For a 'type': The text that was typed and where (e.g., "Typed 'hello world' into the search bar."). | |
c. \`frameIndex\`: The global frame index where the interaction is most visible. | |
d. For 'click' interactions ONLY: | |
- \`x\`: Relative horizontal position (0.0 to 1.0). | |
- \`y\`: Relative vertical position (0.0 to 1.0). | |
**Output Format:** | |
Provide your response ONLY in a JSON array of task objects. Do not add any extra commentary or markdown. | |
Example Response: | |
[ | |
{ | |
"description": "Navigate to the settings page.", | |
"startFrame": ${startFrame + 1}, | |
"endFrame": ${startFrame + 8}, | |
"interactions": [ | |
{ | |
"type": "click", | |
"details": "Clicked the user profile avatar.", | |
"frameIndex": ${startFrame + 3}, | |
"x": 0.95, | |
"y": 0.1 | |
}, | |
{ | |
"type": "click", | |
"details": "Clicked the 'Settings' menu item.", | |
"frameIndex": ${startFrame + 7}, | |
"x": 0.9, | |
"y": 0.25 | |
} | |
] | |
}, | |
{ | |
"description": "Search for a document.", | |
"startFrame": ${startFrame + 10}, | |
"endFrame": ${startFrame + 15}, | |
"interactions": [ | |
{ | |
"type": "click", | |
"details": "Clicked on the search bar with placeholder 'Search documents'.", | |
"frameIndex": ${startFrame + 11}, | |
"x": 0.45, | |
"y": 0.15 | |
}, | |
{ | |
"type": "type", | |
"details": "Typed 'product roadmap' into the search bar.", | |
"frameIndex": ${startFrame + 14} | |
} | |
] | |
} | |
] | |
`; |