韩宇 committed on
Commit
1b7e88c
1 Parent(s): ecb4abf
This view is limited to 50 files because it contains too many changes. See the raw diff for the full changeset.
Files changed (50)
  1. README.md +127 -12
  2. agent/__init__.py +0 -0
  3. agent/conclude/__init__.py +0 -0
  4. agent/conclude/conclude.py +87 -0
  5. agent/conclude/sys_prompt.prompt +13 -0
  6. agent/conclude/user_prompt.prompt +7 -0
  7. agent/conclude/webpage_conclude.py +81 -0
  8. agent/memories/__init__.py +0 -0
  9. agent/memories/video_ltm_milvus.py +238 -0
  10. agent/misc/scene.py +249 -0
  11. agent/tools/video_rewinder/rewinder.py +99 -0
  12. agent/tools/video_rewinder/rewinder_sys_prompt.prompt +7 -0
  13. agent/tools/video_rewinder/rewinder_user_prompt.prompt +1 -0
  14. agent/video_preprocessor/__init__.py +0 -0
  15. agent/video_preprocessor/sys_prompt.prompt +18 -0
  16. agent/video_preprocessor/user_prompt.prompt +4 -0
  17. agent/video_preprocessor/video_preprocess.py +254 -0
  18. agent/video_preprocessor/webpage_vp.py +252 -0
  19. agent/video_qa/__init__.py +0 -0
  20. agent/video_qa/qa.py +82 -0
  21. agent/video_qa/sys_prompt.prompt +8 -0
  22. agent/video_qa/user_prompt.prompt +1 -0
  23. agent/video_qa/webpage_qa.py +73 -0
  24. app.py +136 -0
  25. calculator_code.py +4 -0
  26. compile_container.py +18 -0
  27. configs/llms/gpt4o.yml +7 -0
  28. configs/llms/json_res.yml +6 -0
  29. configs/llms/text_encoder.yml +3 -0
  30. configs/llms/text_res.yml +6 -0
  31. configs/llms/text_res_stream.yml +7 -0
  32. configs/tools/all_tools.yml +12 -0
  33. configs/workers/conclude.yml +4 -0
  34. configs/workers/dnc_workflow.yml +18 -0
  35. configs/workers/video_preprocessor.yml +12 -0
  36. configs/workers/video_qa.yml +5 -0
  37. container.yaml +154 -0
  38. docs/images/local-ai.png +0 -0
  39. docs/images/video_understanding_workflow_diagram.png +0 -0
  40. docs/local-ai.md +87 -0
  41. omagent_core/__init__.py +0 -0
  42. omagent_core/advanced_components/__init__.py +0 -0
  43. omagent_core/advanced_components/worker/__init__.py +0 -0
  44. omagent_core/advanced_components/worker/conclude/__init__.py +0 -0
  45. omagent_core/advanced_components/worker/conqueror/__init__.py +0 -0
  46. omagent_core/advanced_components/worker/divider/__init__.py +0 -0
  47. omagent_core/advanced_components/worker/task_exit_monitor/__init__.py +0 -0
  48. omagent_core/advanced_components/worker/video_preprocess/__init__.py +0 -0
  49. omagent_core/advanced_components/workflow/cot/README.md +36 -0
  50. omagent_core/advanced_components/workflow/cot/agent/cot_reasoning/cot_reasoning.py +67 -0
README.md CHANGED
@@ -1,12 +1,127 @@
- ---
- title: OmAgent
- emoji: 🦀
- colorFrom: yellow
- colorTo: red
- sdk: gradio
- sdk_version: 5.16.2
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Video Understanding Example
+
+ This example demonstrates how to use the framework for hour-long video understanding tasks. The example code can be found in the `examples/video_understanding` directory.
+
+ ```bash
+ cd examples/video_understanding
+ ```
+
+ ## Overview
+
+ This example implements a video understanding workflow based on the DnC workflow, which consists of the following components:
+
+ 1. **Video Preprocess Task**
+    - Preprocesses the video, transcribing its audio with speech-to-text capability
+    - Detects scene boundaries, splits the video into several chunks and extracts frames at specified intervals (see the sketch after the workflow diagram below)
+    - Each scene chunk is summarized by the MLLM with detailed information, cached and written into the vector database for Q&A retrieval
+    - Video metadata and the video file MD5 are stored for filtering
+
+ 2. **Video QA Task**
+    - Takes the user's question about the video
+    - Retrieves related information from the vector database with the question
+    - Extracts the approximate start and end time of the video segment related to the question
+    - Generates a video object from serialized data in short-term memory (STM)
+    - Builds the initial task tree with the question for the DnC task
+
+ 3. **Divide and Conquer Task**
+    - Executes the task tree with the question
+    - Detailed information can be found in the [DnC Example](./DnC.md#overview)
+
+ The system uses Redis for state management, Milvus for long-term memory storage and Conductor for workflow orchestration.
+
+ ### The whole workflow looks like the following diagram:
+
+ ![Video Understanding Workflow](docs/images/video_understanding_workflow_diagram.png)
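+ The scene splitting and frame extraction step above is built on PySceneDetect (see `agent/misc/scene.py`). Below is a minimal, illustrative sketch of that part of the pipeline, not the worker itself; it assumes `scenedetect` and `Pillow` are installed and uses a hypothetical local file `my_video.mp4`:
+
+ ```python
+ from PIL import Image
+ from scenedetect import ContentDetector, SceneManager, open_video
+
+ video = open_video("my_video.mp4")  # hypothetical local video file
+ scene_manager = SceneManager()
+ # A lower threshold makes detection more sensitive, so more scenes are found (this example defaults to 27)
+ scene_manager.add_detector(ContentDetector(threshold=27, min_scene_len=int(video.frame_rate * 1)))
+ scene_manager.detect_scenes(video)
+ scenes = scene_manager.get_scene_list(start_in_scene=True)
+
+ frame_extraction_interval = 5  # seconds between extracted frames, as in the worker config
+ for start, end in scenes:
+     video.seek(start)
+     step = int(frame_extraction_interval * video.frame_rate)
+     frames = []
+     for i in range(end.get_frames() - start.get_frames()):
+         if i % step == 0:
+             frames.append(Image.fromarray(video.read()))  # frame kept for MLLM summarization
+         else:
+             video.read(decode=False)  # skip intermediate frames without decoding
+     print(f"Scene {start.get_timecode()} - {end.get_timecode()}: {len(frames)} frames")
+ ```
+
+ In the actual workers (`agent/misc/scene.py` and `agent/video_preprocessor/video_preprocess.py`), each scene's frames and speech-to-text result are then summarized by the MLLM, embedded, and written to the Milvus-backed long-term memory.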
+
+ ## Prerequisites
+
+ - Python 3.10+
+ - Required packages installed (see requirements.txt)
+ - Access to OpenAI API or compatible endpoint (see configs/llms/*.yml)
+ - [Optional] Access to Bing API for WebSearch tool (see configs/tools/*.yml)
+ - Redis server running locally or remotely
+ - Conductor server running locally or remotely
+
+ ## Configuration
+
+ The container.yaml file is a configuration file that manages dependencies and settings for the different components of the system, including Conductor connections, Redis connections, and other service configurations. To set up your configuration:
+
+ 1. Generate the container.yaml file:
+    ```bash
+    python compile_container.py
+    ```
+    This will create a container.yaml file with default settings under `examples/video_understanding`.
+
+ 2. Configure your LLM and tool settings in `configs/llms/*.yml` and `configs/tools/*.yml`:
+    - Set your OpenAI API key or compatible endpoint through environment variables or by directly modifying the yml files
+      ```bash
+      export custom_openai_key="your_openai_api_key"
+      export custom_openai_endpoint="your_openai_endpoint"
+      ```
+    - [Optional] Set your Bing API key or compatible endpoint through environment variables or by directly modifying the yml files
+      ```bash
+      export bing_api_key="your_bing_api_key"
+      ```
+      **Note: Setting the Bing API key is not mandatory, as the WebSearch tool will fall back to DuckDuckGo search, but it is recommended for better search quality.**
+    - The default text encoder configuration uses OpenAI `text-embedding-3-large` with **3072** dimensions, so make sure the `dim` value of `MilvusLTM` in `container.yaml` matches (see the sketch after this list)
+    - Configure other model settings like temperature as needed through environment variables or by directly modifying the yml files
+
+ 3. Update settings in the generated `container.yaml`:
+    - Modify Redis connection settings:
+      - Set the host, port and credentials for your Redis instance
+      - Configure both the `redis_stream_client` and `redis_stm_client` sections
+    - Update the Conductor server URL under the conductor_config section
+    - Configure MilvusLTM in the `components` section:
+      - Set the `storage_name` and `dim` for MilvusLTM
+      - Set `dim` to **3072** if you use the default OpenAI encoder; adjust the dimension accordingly if you use another text encoder model or endpoint
+      - Adjust other settings as needed
+    - Configure hyper-parameters for the video preprocess task in `examples/video_understanding/configs/workers/video_preprocessor.yml`:
+      - `use_cache`: Whether to use the cache for the video preprocess task
+      - `scene_detect_threshold`: The threshold for scene detection, used to decide whether a scene change occurs in the video; a lower value means more scenes will be detected, default value is **27**
+      - `frame_extraction_interval`: The interval (in seconds) between frames extracted from the video, default value is **5**
+      - `kernel_size`: The kernel size for scene detection; should be an **odd** number. The default is calculated automatically from the video resolution. For hour-long videos it is recommended to leave it blank; for short videos a smaller value such as **3** or **5** makes detection more sensitive to scene changes
+      - `stt.endpoint`: The endpoint for the speech-to-text service, defaults to the OpenAI ASR service
+      - `stt.api_key`: The API key for the speech-to-text service, defaults to the OpenAI API key
+    - Adjust any other component settings as needed
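+ As a quick sanity check of step 2 and the `dim` setting in step 3, the hedged sketch below calls your configured endpoint once and prints the embedding dimension it returns. This is illustrative code using the `openai` Python SDK, not part of the example; the environment variable names follow the configs above, and the `gpt-4o` model name is assumed from `configs/llms/gpt4o.yml`:
+
+ ```python
+ import os
+ from openai import OpenAI
+
+ client = OpenAI(
+     api_key=os.environ["custom_openai_key"],
+     base_url=os.environ.get("custom_openai_endpoint") or None,  # None falls back to the default OpenAI endpoint
+ )
+
+ # The default text encoder config uses text-embedding-3-large (3072 dimensions)
+ emb = client.embeddings.create(model="text-embedding-3-large", input=["dimension check"])
+ print("embedding dim:", len(emb.data[0].embedding))  # should match the MilvusLTM `dim` in container.yaml
+
+ # One tiny chat call to confirm the key/endpoint also works for the LLM configs
+ chat = client.chat.completions.create(
+     model="gpt-4o",  # assumed from configs/llms/gpt4o.yml
+     messages=[{"role": "user", "content": "ping"}],
+     max_tokens=5,
+ )
+ print(chat.choices[0].message.content)
+ ```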
+
+ ## Running the Example
+
+ 1. Run the video understanding example via the webpage:
+
+    ```bash
+    python run_webpage.py
+    ```
+
+    First, select a video or upload a video file on the left; after the video preprocessing is completed, ask questions about the video content on the right.
+
+ 2. Run the video understanding example via the CLI:
+
+    ```bash
+    python run_cli.py
+    ```
+
+    The first time, you need to input the video file path; it will take a while to preprocess the video and store the information in the vector database.
+    After the video is preprocessed, you can input questions about the video and the system will answer them. Note that the agent may give wrong or vague answers, especially for questions related to the names of the characters in the video.
+
+ ## Troubleshooting
+
+ If you encounter issues:
+ - Verify Redis is running and accessible (see the connectivity sketch below)
+ - Try a larger `scene_detect_threshold` and `frame_extraction_interval` if you find too many scenes are detected
+ - Check that your OpenAI API key is valid
+ - Check that your Bing API key is valid if search results are not as expected
+ - Check that the `dim` value of `MilvusLTM` in `container.yaml` is set correctly; a mismatched dimension currently will not raise an error but will silently lose part of the information (more checks will be added in the future)
+ - Ensure all dependencies are installed correctly
+ - Review logs for any error messages
+ - **Open an issue on GitHub if you can't find a solution, we will do our best to help you out!**
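+
+ A hedged helper for the Redis and Milvus items above; this is illustrative code, not part of the example, and the host, port and URI are hypothetical placeholders that should match your `container.yaml`:
+
+ ```python
+ import redis
+ from pymilvus import MilvusClient
+
+ # Redis instance used by redis_stream_client / redis_stm_client
+ r = redis.Redis(host="localhost", port=6379)  # hypothetical connection settings
+ print("Redis ping:", r.ping())
+
+ # Milvus instance backing MilvusLTM
+ milvus = MilvusClient(uri="http://localhost:19530")  # hypothetical Milvus endpoint
+ print("Milvus collections:", milvus.list_collections())
+ ```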
agent/__init__.py ADDED
File without changes
agent/conclude/__init__.py ADDED
File without changes
agent/conclude/conclude.py ADDED
@@ -0,0 +1,87 @@
1
+ from pathlib import Path
2
+ from typing import List
3
+
4
+ from omagent_core.advanced_components.workflow.dnc.schemas.dnc_structure import \
5
+ TaskTree
6
+ from omagent_core.engine.worker.base import BaseWorker
7
+ from omagent_core.memories.ltms.ltm import LTM
8
+ from omagent_core.models.llms.base import BaseLLMBackend
9
+ from omagent_core.models.llms.prompt import PromptTemplate
10
+ from omagent_core.utils.logger import logging
11
+ from omagent_core.utils.registry import registry
12
+ from collections.abc import Iterator
13
+ from pydantic import Field
14
+
15
+ CURRENT_PATH = root_path = Path(__file__).parents[0]
16
+
17
+
18
+ @registry.register_worker()
19
+ class Conclude(BaseLLMBackend, BaseWorker):
20
+ prompts: List[PromptTemplate] = Field(
21
+ default=[
22
+ PromptTemplate.from_file(
23
+ CURRENT_PATH.joinpath("sys_prompt.prompt"), role="system"
24
+ ),
25
+ PromptTemplate.from_file(
26
+ CURRENT_PATH.joinpath("user_prompt.prompt"), role="user"
27
+ ),
28
+ ]
29
+ )
30
+
31
+ def _run(self, dnc_structure: dict, last_output: str, *args, **kwargs):
32
+ """A conclude node that summarizes and completes the root task.
33
+
34
+ This component acts as the final node that:
35
+ - Takes the root task and its execution results
36
+ - Generates a final conclusion/summary of the entire task execution
37
+ - Formats and presents the final output in a clear way
38
+ - Cleans up any temporary state/memory used during execution
39
+
40
+ The conclude node is responsible for providing a coherent final response that
41
+ addresses the original root task objective based on all the work done by
42
+ previous nodes.
43
+
44
+ Args:
45
+ dnc_structure (dict): The task tree containing the root task and results
46
+ last_output (str): The final output from previous task execution
47
+ *args: Additional arguments
48
+ **kwargs: Additional keyword arguments
49
+
50
+ Returns:
51
+ dict: Final response containing the conclusion/summary
52
+ """
53
+ task = TaskTree(**dnc_structure)
54
+ self.callback.info(
55
+ agent_id=self.workflow_instance_id,
56
+ progress=f"Conclude",
57
+ message=f"{task.get_current_node().task}",
58
+ )
59
+ chat_complete_res = self.simple_infer(
60
+ task=task.get_root().task,
61
+ result=str(last_output),
62
+ img_placeholders="".join(
63
+ list(self.stm(self.workflow_instance_id).get("image_cache", {}).keys())
64
+ ),
65
+ )
66
+ if isinstance(chat_complete_res, Iterator):
67
+ last_output = "Answer: "
68
+ self.callback.send_incomplete(
69
+ agent_id=self.workflow_instance_id, msg="Answer: "
70
+ )
71
+ for chunk in chat_complete_res:
72
+ if len(chunk.choices) > 0:
73
+ current_msg = chunk.choices[0].delta.content if chunk.choices[0].delta.content is not None else ''
74
+ self.callback.send_incomplete(
75
+ agent_id=self.workflow_instance_id,
76
+ msg=f"{current_msg}",
77
+ )
78
+ last_output += current_msg
79
+ self.callback.send_answer(agent_id=self.workflow_instance_id, msg="")
80
+ else:
81
+ last_output = chat_complete_res["choices"][0]["message"]["content"]
82
+ self.callback.send_answer(
83
+ agent_id=self.workflow_instance_id,
84
+ msg=f'Answer: {chat_complete_res["choices"][0]["message"]["content"]}',
85
+ )
86
+ self.stm(self.workflow_instance_id).clear()
87
+ return {"last_output": last_output}
agent/conclude/sys_prompt.prompt ADDED
@@ -0,0 +1,13 @@
1
+ As the final stage of our task processing workflow, your role is to inform the user about the final execution result of the task.
2
+ Your task includes three parts:
3
+ 1. Verify the result, ensure it is a valid result of the user's question or task.
4
+ 2. Image may be visual prompted by adding bound boxes and labels to the image, this is the important information.
5
+ 3. Generate the output message since you may get some raw data, you have to get the useful information and generate a detailed message.
6
+
7
+ The task may complete successfully or it can be failed for some reason. You just need to honestly express the situation.
8
+
9
+ *** Important Notice ***
10
+ 1. Please use the language used in the question when responding.
11
+ 2. Your response MUST be based on the results provided to you. Do not attempt to solve the problem on your own or try to correct any errors.
12
+ 3. Do not mention your source of information. Present the response as if it were your own.
13
+ 4. Handle the conversions between different units carefully.
agent/conclude/user_prompt.prompt ADDED
@@ -0,0 +1,7 @@
1
+ Now, it's your turn to complete the task.
2
+
3
+ Task (The task you need to complete.): {{task}}
4
+ result (The result from former agents.): {{result}}
5
+ images: {{img_placeholders}}
6
+
7
+ Now show your super capability as a super agent that beyond regular AIs or LLMs!
agent/conclude/webpage_conclude.py ADDED
@@ -0,0 +1,81 @@
1
+ from pathlib import Path
2
+ from typing import Iterator, List
3
+
4
+ from omagent_core.advanced_components.workflow.dnc.schemas.dnc_structure import \
5
+ TaskTree
6
+ from omagent_core.engine.worker.base import BaseWorker
7
+ from omagent_core.memories.ltms.ltm import LTM
8
+ from omagent_core.models.llms.base import BaseLLMBackend
9
+ from omagent_core.models.llms.prompt import PromptTemplate
10
+ from omagent_core.utils.logger import logging
11
+ from omagent_core.utils.registry import registry
12
+ from openai import Stream
13
+ from pydantic import Field
14
+
15
+ CURRENT_PATH = root_path = Path(__file__).parents[0]
16
+
17
+
18
+ @registry.register_worker()
19
+ class WebpageConclude(BaseLLMBackend, BaseWorker):
20
+ prompts: List[PromptTemplate] = Field(
21
+ default=[
22
+ PromptTemplate.from_file(
23
+ CURRENT_PATH.joinpath("sys_prompt.prompt"), role="system"
24
+ ),
25
+ PromptTemplate.from_file(
26
+ CURRENT_PATH.joinpath("user_prompt.prompt"), role="user"
27
+ ),
28
+ ]
29
+ )
30
+
31
+ def _run(self, dnc_structure: dict, last_output: str, *args, **kwargs):
32
+ """A conclude node that summarizes and completes the root task.
33
+
34
+ This component acts as the final node that:
35
+ - Takes the root task and its execution results
36
+ - Generates a final conclusion/summary of the entire task execution
37
+ - Formats and presents the final output in a clear way
38
+ - Cleans up any temporary state/memory used during execution
39
+
40
+ The conclude node is responsible for providing a coherent final response that
41
+ addresses the original root task objective based on all the work done by
42
+ previous nodes.
43
+
44
+ Args:
45
+ dnc_structure (dict): The task tree containing the root task and results
46
+ last_output (str): The final output from previous task execution
47
+ *args: Additional arguments
48
+ **kwargs: Additional keyword arguments
49
+
50
+ Returns:
51
+ dict: Final response containing the conclusion/summary
52
+ """
53
+ task = TaskTree(**dnc_structure)
54
+ self.callback.info(
55
+ agent_id=self.workflow_instance_id,
56
+ progress=f"Conclude",
57
+ message=f"{task.get_current_node().task}",
58
+ )
59
+ chat_complete_res = self.simple_infer(
60
+ task=task.get_root().task,
61
+ result=str(last_output),
62
+ img_placeholders="".join(
63
+ list(self.stm(self.workflow_instance_id).get("image_cache", {}).keys())
64
+ ),
65
+ )
66
+ if isinstance(chat_complete_res, Iterator):
67
+ last_output = "Answer: "
68
+ for chunk in chat_complete_res:
69
+ if len(chunk.choices) > 0:
70
+ current_msg = chunk.choices[0].delta.content if chunk.choices[0].delta.content is not None else ''
71
+ last_output += current_msg
72
+ self.callback.send_answer(agent_id=self.workflow_instance_id, msg=last_output)
73
+ else:
74
+ last_output = chat_complete_res["choices"][0]["message"]["content"]
75
+ self.callback.send_answer(
76
+ agent_id=self.workflow_instance_id,
77
+ msg=f'Answer: {chat_complete_res["choices"][0]["message"]["content"]}',
78
+ )
79
+ self.callback.send_answer(agent_id=self.workflow_instance_id, msg=f"Token usage: {self.token_usage}")
80
+ self.stm(self.workflow_instance_id).clear()
81
+ return {"last_output": last_output}
agent/memories/__init__.py ADDED
File without changes
agent/memories/video_ltm_milvus.py ADDED
@@ -0,0 +1,238 @@
1
+ import base64
2
+ import pickle
3
+ from typing import Any, Iterable, List, Optional, Tuple
4
+
5
+ from omagent_core.memories.ltms.ltm_base import LTMBase
6
+ from omagent_core.services.connectors.milvus import MilvusConnector
7
+ from omagent_core.utils.registry import registry
8
+ from pydantic import Field
9
+ from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
10
+ utility)
11
+
12
+
13
+ @registry.register_component()
14
+ class VideoMilvusLTM(LTMBase):
15
+ milvus_ltm_client: MilvusConnector
16
+ storage_name: str = Field(default="default")
17
+ dim: int = Field(default=128)
18
+
19
+ def model_post_init(self, __context: Any) -> None:
20
+ pass
21
+
22
+ def _create_collection(self) -> None:
23
+ # Check if collection exists
24
+ if not self.milvus_ltm_client._client.has_collection(self.storage_name):
25
+ index_params = self.milvus_ltm_client._client.prepare_index_params()
26
+ # Define field schemas
27
+ key_field = FieldSchema(
28
+ name="key", dtype=DataType.VARCHAR, is_primary=True, max_length=256
29
+ )
30
+ value_field = FieldSchema(
31
+ name="value", dtype=DataType.JSON, description="Json value"
32
+ )
33
+ embedding_field = FieldSchema(
34
+ name="embedding",
35
+ dtype=DataType.FLOAT_VECTOR,
36
+ description="Embedding vector",
37
+ dim=self.dim,
38
+ )
39
+ index_params = self.milvus_ltm_client._client.prepare_index_params()
40
+
41
+ # Create collection schema
42
+ schema = CollectionSchema(
43
+ fields=[key_field, value_field, embedding_field],
44
+ description="Key-Value storage with embeddings",
45
+ )
46
+ for field in schema.fields:
47
+ if (
48
+ field.dtype == DataType.FLOAT_VECTOR
49
+ or field.dtype == DataType.BINARY_VECTOR
50
+ ):
51
+ index_params.add_index(
52
+ field_name=field.name,
53
+ index_name=field.name,
54
+ index_type="FLAT",
55
+ metric_type="COSINE",
56
+ params={"nlist": 128},
57
+ )
58
+ self.milvus_ltm_client._client.create_collection(
59
+ self.storage_name, schema=schema, index_params=index_params
60
+ )
61
+
62
+ # Create index separately after collection creation
63
+ print(f"Created storage {self.storage_name} successfully")
64
+
65
+ def __getitem__(self, key: Any) -> Any:
66
+ key_str = str(key)
67
+ expr = f'key == "{key_str}"'
68
+ res = self.milvus_ltm_client._client.query(
69
+ self.storage_name, expr, output_fields=["value"]
70
+ )
71
+ if res:
72
+ value = res[0]["value"]
73
+ # value_bytes = base64.b64decode(value_base64)
74
+ # value = pickle.loads(value_bytes)
75
+ return value
76
+ else:
77
+ raise KeyError(f"Key {key} not found")
78
+
79
+ def __setitem__(self, key: Any, value: Any) -> None:
80
+ self._create_collection()
81
+
82
+ key_str = str(key)
83
+
84
+ # Check if value is a dictionary containing 'value' and 'embedding'
85
+ if isinstance(value, dict) and "value" in value and "embedding" in value:
86
+ actual_value = value["value"]
87
+ embedding = value["embedding"]
88
+ else:
89
+ raise ValueError(
90
+ "When setting an item, value must be a dictionary containing 'value' and 'embedding' keys."
91
+ )
92
+
93
+ # Serialize the actual value and encode it to base64
94
+ # value_bytes = pickle.dumps(actual_value)
95
+ # value_base64 = base64.b64encode(value_bytes).decode('utf-8')
96
+
97
+ # Ensure the embedding is provided
98
+ if embedding is None:
99
+ raise ValueError("An embedding vector must be provided.")
100
+
101
+ # Check if the key exists and delete it if it does
102
+ if key_str in self:
103
+ self.__delitem__(key_str)
104
+
105
+ # Prepare data for insertion (as a list of dictionaries)
106
+ data = [
107
+ {
108
+ "key": key_str,
109
+ "value": actual_value,
110
+ "embedding": embedding,
111
+ }
112
+ ]
113
+
114
+ # Insert the new record
115
+ self.milvus_ltm_client._client.insert(
116
+ collection_name=self.storage_name, data=data
117
+ )
118
+
119
+ def __delitem__(self, key: Any) -> None:
120
+ key_str = str(key)
121
+ if key_str in self:
122
+ expr = f'key == "{key_str}"'
123
+ self.milvus_ltm_client._client.delete(self.storage_name, expr)
124
+ else:
125
+ raise KeyError(f"Key {key} not found")
126
+
127
+ def __contains__(self, key: Any) -> bool:
128
+ key_str = str(key)
129
+ expr = f'key == "{key_str}"'
130
+ # Adjust the query call to match the expected signature
131
+ res = self.milvus_ltm_client._client.query(
132
+ self.storage_name, # Pass the collection name as the first argument
133
+ filter=expr,
134
+ output_fields=["key"],
135
+ )
136
+ return len(res) > 0
137
+
138
+ """
139
+ def __len__(self) -> int:
140
+ milvus_ltm.collection.flush()
141
+ return self.collection.num_entities
142
+ """
143
+
144
+ def __len__(self) -> int:
145
+ expr = 'key != ""' # Expression to match all entities
146
+ # self.milvus_ltm_client._client.load(refresh=True)
147
+ results = self.milvus_ltm_client._client.query(
148
+ self.storage_name, expr, output_fields=["key"], consistency_level="Strong"
149
+ )
150
+ return len(results)
151
+
152
+ def keys(self, limit=10) -> Iterable[Any]:
153
+ expr = ""
154
+ res = self.milvus_ltm_client._client.query(
155
+ self.storage_name, expr, output_fields=["key"], limit=limit
156
+ )
157
+ return (item["key"] for item in res)
158
+
159
+ def values(self) -> Iterable[Any]:
160
+ expr = 'key != ""' # Expression to match all active entities
161
+ self.milvus_ltm_client._client.load(refresh=True)
162
+ res = self.milvus_ltm_client._client.query(
163
+ self.storage_name, expr, output_fields=["value"], consistency_level="Strong"
164
+ )
165
+ for item in res:
166
+ value_base64 = item["value"]
167
+ value_bytes = base64.b64decode(value_base64)
168
+ value = pickle.loads(value_bytes)
169
+ yield value
170
+
171
+ def items(self) -> Iterable[Tuple[Any, Any]]:
172
+ expr = 'key != ""'
173
+ res = self.milvus_ltm_client._client.query(
174
+ self.storage_name, expr, output_fields=["key", "value"]
175
+ )
176
+ for item in res:
177
+ key = item["key"]
178
+ value = item["value"]
179
+ # value_bytes = base64.b64decode(value_base64)
180
+ # value = pickle.loads(value_bytes)
181
+ yield (key, value)
182
+
183
+ def get(self, key: Any, default: Any = None) -> Any:
184
+ try:
185
+ return self[key]
186
+ except KeyError:
187
+ return default
188
+
189
+ def clear(self) -> None:
190
+ expr = (
191
+ 'key != ""' # This expression matches all records where 'key' is not empty
192
+ )
193
+ self.milvus_ltm_client._client.delete(self.storage_name, filter=expr)
194
+
195
+ def pop(self, key: Any, default: Any = None) -> Any:
196
+ try:
197
+ value = self[key]
198
+ self.__delitem__(key)
199
+ return value
200
+ except KeyError:
201
+ if default is not None:
202
+ return default
203
+ else:
204
+ raise
205
+
206
+ def update(self, other: Iterable[Tuple[Any, Any]]) -> None:
207
+ for key, value in other:
208
+ self[key] = value
209
+
210
+ def get_by_vector(
211
+ self,
212
+ embedding: List[float],
213
+ top_k: int = 10,
214
+ threshold: float = 0.0,
215
+ filter: str = "",
216
+ ) -> List[Tuple[Any, Any, float]]:
217
+ search_params = {
218
+ "metric_type": "COSINE",
219
+ "params": {"nprobe": 10, "range_filter": 1, "radius": threshold},
220
+ }
221
+ results = self.milvus_ltm_client._client.search(
222
+ self.storage_name,
223
+ data=[embedding],
224
+ anns_field="embedding",
225
+ search_params=search_params,
226
+ limit=top_k,
227
+ output_fields=["key", "value"],
228
+ consistency_level="Strong",
229
+ filter=filter,
230
+ )
231
+
232
+ items = []
233
+ for match in results[0]:
234
+ key = match.get("entity").get("key")
235
+ value = match.get("entity").get("value")
236
+ items.append((key, value))
237
+
238
+ return items
agent/misc/scene.py ADDED
@@ -0,0 +1,249 @@
1
+ from typing import Dict, List, Optional, Tuple, Union
2
+
3
+ from PIL import Image
4
+ from pydantic import BaseModel
5
+ from pydub import AudioSegment
6
+ from pydub.effects import normalize
7
+ from scenedetect import (ContentDetector, FrameTimecode, SceneManager,
8
+ VideoStream, open_video)
9
+
10
+
11
+ class Scene(BaseModel):
12
+ start: FrameTimecode
13
+ end: FrameTimecode
14
+ stt_res: Optional[Dict] = None
15
+ summary: Optional[Dict] = None
16
+
17
+ class Config:
18
+ """Configuration for this pydantic object."""
19
+
20
+ arbitrary_types_allowed = True
21
+
22
+ @classmethod
23
+ def init(cls, start: FrameTimecode, end: FrameTimecode, summary: dict = None):
24
+ return cls(start=start, end=end, summary=summary)
25
+
26
+ @property
27
+ def conversation(self):
28
+ # for self deployed whisper
29
+ if isinstance(self.stt_res, list):
30
+ output_conversation = "\n".join(
31
+ [f"{item.get('text', None)}" for item in self.stt_res]
32
+ )
33
+ else:
34
+ output_conversation = self.stt_res
35
+ return output_conversation
36
+
37
+
38
+ class VideoScenes(BaseModel):
39
+ stream: VideoStream
40
+ audio: Union[AudioSegment, None]
41
+ scenes: List[Scene]
42
+ frame_extraction_interval: int
43
+
44
+ class Config:
45
+ """Configuration for this pydantic object."""
46
+
47
+ extra = "allow"
48
+ arbitrary_types_allowed = True
49
+
50
+ @classmethod
51
+ def load(
52
+ cls,
53
+ video_path: str,
54
+ threshold: int = 27,
55
+ min_scene_len: int = 1,
56
+ frame_extraction_interval: int = 5,
57
+ show_progress: bool = False,
58
+ kernel_size: Optional[int] = None,
59
+ ):
60
+ """Load a video file.
61
+
62
+ Args:
63
+ video_path (str): The path of the video file. Only support local file.
64
+ threshold (int): The scene detection threshold.
65
+ min_scene_len (int): Once a cut is detected, this long time must pass before a new one can
66
+ be added to the scene list. Count in seconds, defaults to 1.
67
+ show_progress (bool, optional): Whether to display the progress bar when processing the video. Defaults to False.
68
+ """
69
+ video = open_video(video_path)
70
+ scene_manager = SceneManager()
71
+ weight = ContentDetector.Components(
72
+ delta_hue=1.0,
73
+ delta_sat=1.0,
74
+ delta_lum=0.0,
75
+ delta_edges=1.0,
76
+ )
77
+ if kernel_size is None:
78
+ scene_manager.add_detector(
79
+ ContentDetector(
80
+ threshold=threshold,
81
+ min_scene_len=int(video.frame_rate * min_scene_len),
82
+ weights=weight,
83
+ )
84
+ )
85
+ else:
86
+ scene_manager.add_detector(
87
+ ContentDetector(
88
+ threshold=threshold,
89
+ min_scene_len=int(video.frame_rate * min_scene_len),
90
+ weights=weight,
91
+ kernel_size=kernel_size,
92
+ )
93
+ )
94
+ scene_manager.detect_scenes(video, show_progress=show_progress)
95
+ scenes = scene_manager.get_scene_list(start_in_scene=True)
96
+
97
+ try:
98
+ audio = AudioSegment.from_file(video_path)
99
+ audio = normalize(audio)
100
+ except (IndexError, OSError):
101
+ audio = None
102
+ return cls(
103
+ stream=video,
104
+ scenes=[Scene.init(*scene) for scene in scenes],
105
+ audio=audio,
106
+ frame_extraction_interval=frame_extraction_interval,
107
+ )
108
+
109
+ def get_video_frames(
110
+ self, scene: Union[int, Scene, Tuple[FrameTimecode]], interval: int = None
111
+ ) -> Tuple[List[Image.Image], List[float]]:
112
+ """Get the frames of a scene.
113
+
114
+ Args:
115
+ scene (Union[int, Scene, Tuple[FrameTimecode]]): The scene to get frames. Can be the index of the scene, the scene object or a tuple of start and end frame timecode.
116
+ interval (int, optional): The interval of the frames to get. Defaults to None.
117
+ Raises:
118
+ ValueError: If the type of scene is not int, Scene or tuple.
119
+
120
+ Returns:
121
+ List[ndarray]: The frames of the scene.
122
+ """
123
+ if isinstance(scene, int):
124
+ scene = self.scenes[scene]
125
+ start, end = scene.start, scene.end
126
+ elif isinstance(scene, Scene):
127
+ start, end = scene.start, scene.end
128
+ elif isinstance(scene, tuple):
129
+ start, end = scene
130
+ else:
131
+ raise ValueError(
132
+ f"scene should be int, Scene or tuple, not {type(scene).__name__}"
133
+ )
134
+ self.stream.seek(start)
135
+ frames = []
136
+ time_stamps = []
137
+ if interval is None:
138
+ interval = self.frame_extraction_interval * self.stream.frame_rate
139
+ scene_len = end.get_frames() - start.get_frames()
140
+ if scene_len / 10 > interval:
141
+ interval = int(scene_len / 10) + 1
142
+ for index in range(scene_len):
143
+ if index % interval == 0:
144
+ f = self.stream.read()
145
+ frames.append(Image.fromarray(f))
146
+ time_stamps.append(self.stream.position.get_seconds())
147
+ else:
148
+ self.stream.read(decode=False)
149
+ self.stream.seek(0)
150
+ return frames, time_stamps
151
+
152
+ def get_audio_clip(
153
+ self, scene: Union[int, Scene, Tuple[FrameTimecode]]
154
+ ) -> AudioSegment:
155
+ """Get the audio clip of a scene.
156
+
157
+ Args:
158
+ scene (Union[int, Scene, Tuple[FrameTimecode]]): The scene to get audio clip. Can be the index of the scene, the scene object or a tuple of start and end frame timecode.
159
+
160
+ Raises:
161
+ ValueError: If the type of scene is not int, Scene or tuple.
162
+
163
+ Returns:
164
+ AudioSegment: The audio clip of the scene.
165
+ """
166
+ if self.audio is None:
167
+ return None
168
+ if isinstance(scene, int):
169
+ scene = self.scenes[scene]
170
+ start, end = scene.start, scene.end
171
+ elif isinstance(scene, Scene):
172
+ start, end = scene.start, scene.end
173
+ elif isinstance(scene, tuple):
174
+ start, end = scene
175
+ else:
176
+ raise ValueError(
177
+ f"scene should be int, Scene or tuple, not {type(scene).__name__}"
178
+ )
179
+
180
+ return self.audio[
181
+ int(start.get_seconds() * 1000) : int(end.get_seconds() * 1000)
182
+ ]
183
+
184
+ def __len__(self):
185
+ return len(self.scenes)
186
+
187
+ def __iter__(self):
188
+ self.index = 0
189
+ return self
190
+
191
+ def __next__(self):
192
+ if self.index >= len(self.scenes):
193
+ raise StopIteration
194
+ scene = self.scenes[self.index]
195
+ self.index += 1
196
+ return scene
197
+
198
+ def __getitem__(self, index):
199
+ return self.scenes[index]
200
+
201
+ def __setitem__(self, index, value):
202
+ self.scenes[index] = value
203
+
204
+ def to_serializable(self) -> dict:
205
+ """Convert VideoScenes to a serializable dictionary."""
206
+ scenes_data = []
207
+ for scene in self.scenes:
208
+ scenes_data.append(
209
+ {
210
+ "start_frame": scene.start.frame_num,
211
+ "end_frame": scene.end.frame_num,
212
+ "stt_res": scene.stt_res,
213
+ "summary": scene.summary,
214
+ }
215
+ )
216
+
217
+ return {
218
+ "video_path": self.stream.path,
219
+ "frame_rate": self.stream.frame_rate,
220
+ "scenes": scenes_data,
221
+ "frame_extraction_interval": self.frame_extraction_interval,
222
+ }
223
+
224
+ @classmethod
225
+ def from_serializable(cls, data: dict):
226
+ """Rebuild VideoScenes from serialized data."""
227
+ video = open_video(data["video_path"])
228
+ try:
229
+ audio = AudioSegment.from_file(data["video_path"])
230
+ audio = normalize(audio)
231
+ except Exception:
232
+ audio = None
233
+
234
+ # Rebuild scenes list
235
+ scenes = []
236
+ for scene_data in data["scenes"]:
237
+ start = FrameTimecode(scene_data["start_frame"], data["frame_rate"])
238
+ end = FrameTimecode(scene_data["end_frame"], data["frame_rate"])
239
+ scene = Scene.init(start, end)
240
+ scene.stt_res = scene_data["stt_res"]
241
+ scene.summary = scene_data["summary"]
242
+ scenes.append(scene)
243
+
244
+ return cls(
245
+ stream=video,
246
+ scenes=scenes,
247
+ audio=audio,
248
+ frame_extraction_interval=data["frame_extraction_interval"],
249
+ )
agent/tools/video_rewinder/rewinder.py ADDED
@@ -0,0 +1,99 @@
1
+ import json
2
+ import re
3
+ from pathlib import Path
4
+ from typing import List
5
+
6
+ import json_repair
7
+ from omagent_core.models.llms.base import BaseLLMBackend
8
+ from omagent_core.models.llms.prompt import PromptTemplate
9
+ from omagent_core.tool_system.base import ArgSchema, BaseTool
10
+ from omagent_core.utils.logger import logging
11
+ from omagent_core.utils.registry import registry
12
+ from pydantic import Field
13
+ from scenedetect import FrameTimecode
14
+
15
+ from ...misc.scene import VideoScenes
16
+
17
+ CURRENT_PATH = Path(__file__).parents[0]
18
+
19
+ ARGSCHEMA = {
20
+ "start_time": {
21
+ "type": "number",
22
+ "description": "Start time (in seconds) of the video to extract frames from.",
23
+ "required": True,
24
+ },
25
+ "end_time": {
26
+ "type": "number",
27
+ "description": "End time (in seconds) of the video to extract frames from.",
28
+ "required": True,
29
+ },
30
+ "number": {
31
+ "type": "number",
32
+ "description": "Number of frames of extraction. More frames means more details but more cost. Do not exceed 10.",
33
+ "required": True,
34
+ },
35
+ }
36
+
37
+
38
+ @registry.register_tool()
39
+ class Rewinder(BaseTool, BaseLLMBackend):
40
+ args_schema: ArgSchema = ArgSchema(**ARGSCHEMA)
41
+ description: str = (
42
+ "Rollback and extract frames from video which is already loaded to get more specific details for further analysis."
43
+ )
44
+ prompts: List[PromptTemplate] = Field(
45
+ default=[
46
+ PromptTemplate.from_file(
47
+ CURRENT_PATH.joinpath("rewinder_sys_prompt.prompt"),
48
+ role="system",
49
+ ),
50
+ PromptTemplate.from_file(
51
+ CURRENT_PATH.joinpath("rewinder_user_prompt.prompt"),
52
+ role="user",
53
+ ),
54
+ ]
55
+ )
56
+
57
+ def _run(
58
+ self, start_time: float = 0.0, end_time: float = None, number: int = 1
59
+ ) -> str:
60
+ if self.stm(self.workflow_instance_id).get("video", None) is None:
61
+ raise ValueError("No video is loaded.")
62
+ else:
63
+ video: VideoScenes = VideoScenes.from_serializable(
64
+ self.stm(self.workflow_instance_id)["video"]
65
+ )
66
+ if number > 10:
67
+ logging.warning("Number of frames exceeds 10. Will extract 10 frames.")
68
+ number = 10
69
+
70
+ start = FrameTimecode(timecode=start_time, fps=video.stream.frame_rate)
71
+ if end_time is None:
72
+ end = video.stream.duration
73
+ else:
74
+ end = FrameTimecode(timecode=end_time, fps=video.stream.frame_rate)
75
+
76
+ if start_time == end_time:
77
+ frames, time_stamps = video.get_video_frames(
78
+ (start, end + 1), video.stream.frame_rate
79
+ )
80
+ else:
81
+ interval = int((end.get_frames() - start.get_frames()) / number)
82
+ frames, time_stamps = video.get_video_frames((start, end), interval)
83
+
84
+ # self.stm.image_cache.clear()
85
+ payload = []
86
+ for i, (frame, time_stamp) in enumerate(zip(frames, time_stamps)):
87
+ payload.append(f"timestamp_{time_stamp}")
88
+ payload.append(frame)
89
+ res = self.infer(input_list=[{"timestamp_with_images": payload}])[0]["choices"][
90
+ 0
91
+ ]["message"]["content"]
92
+ image_contents = json_repair.loads(res)
93
+ self.stm(self.workflow_instance_id)["image_cache"] = {}
94
+ return f"extracted_frames described as: {image_contents}."
95
+
96
+ async def _arun(
97
+ self, start_time: float = 0.0, end_time: float = None, number: int = 1
98
+ ) -> str:
99
+ return self._run(start_time, end_time, number=number)
agent/tools/video_rewinder/rewinder_sys_prompt.prompt ADDED
@@ -0,0 +1,7 @@
1
+ You are an intelligent agent, your job is to describe the content of each image and provide a summary for all the images.
2
+ The format should be in JSON data as follows:
3
+ ```json{
4
+ "timestamp_x": "Description of the image at timestamp_x",
5
+ ...
6
+ "timestamp_start - timestamp_end": "Summary of all images, where start is the timestamp of the first image and end is the timestamp of the last image"
7
+ }```
agent/tools/video_rewinder/rewinder_user_prompt.prompt ADDED
@@ -0,0 +1 @@
1
+ {{timestamp_with_images}}
agent/video_preprocessor/__init__.py ADDED
File without changes
agent/video_preprocessor/sys_prompt.prompt ADDED
@@ -0,0 +1,18 @@
1
+ You are the most powerful AI agent with the ability to see images. You will help me with video content comprehension and analysis based on continuous video frame extraction and textual content of video conversations.
2
+
3
+ --- Output ---
4
+ You will be provided with a series of video frame images arranged in the order of video playback. Please help me answer the following questions and provide the output in the specified json format.
5
+
6
+ {
7
+ "time": Optional[str].The time information of current video clip in terms of periods like morning or evening, seasons like spring or autumn, or specific years and time points. Please make sure to directly obtain the information from the provided context without inference or fabrication. If the relevant information cannot be obtained, please return null.
8
+ "location": Optional[str]. Describe the location where the current event is taking place, including scene details. If the relevant information cannot be obtained, please return null.
9
+ "character": Optional[str]. Provide a detailed description of the current characters, including their names, relationships, and what they are doing, etc. If the relevant information cannot be obtained, please return null.
10
+ "events": List[str]. List and describe all the detailed events in the video content in chronological order. Please integrate the information provided by the video frames and the textual information in the audio, and do not overlook any key points.
11
+ "scene": List[str]. Give some detailed description of the scene of the video. This includes, but is not limited to, scene information, textual information, character status expressions, and events displayed in the video.
12
+ "summary": str. Provide an detailed overall description and summary of the content of this video clip. Ensure that it remains objective and does not include any speculation or fabrication. This field is mandatory.
13
+ }
14
+
15
+
16
+ *** Important Notice ***
17
+ 1. You will be provided with the video frames and speech-to-text results. You have enough information to answer the questions.
18
+ 2. Sometimes the speech-to-text results maybe empty since there are no person talking. Please analyze based on the information in the images in this situation.
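The prompt above asks the MLLM for a JSON object with `time`, `location`, `character`, `events`, `scene` and `summary` fields. As a minimal illustration of how such a response can be consumed downstream (the repository parses model output with `json_repair`), here is a hedged sketch with a fabricated sample response; it is not part of the committed code:

```python
import json_repair

# Fabricated sample response following the schema in the system prompt above
raw = (
    '{"time": null, "location": "a kitchen in daytime", "character": null, '
    '"events": ["A person chops vegetables"], "scene": ["Bright kitchen, knife and cutting board"], '
    '"summary": "A person prepares food in a kitchen."}'
)

summary = json_repair.loads(raw)  # tolerant of minor JSON formatting issues in model output
print(summary["summary"])
print(summary.get("events", []))
```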
agent/video_preprocessor/user_prompt.prompt ADDED
@@ -0,0 +1,4 @@
1
+ Now, it's your turn to complete the task.
2
+
3
+ Textual content of video conversations: {{stt_res}}
4
+ Frame images of video playback: {{img_placeholders}}
agent/video_preprocessor/video_preprocess.py ADDED
@@ -0,0 +1,254 @@
1
+ import hashlib
2
+ import pickle
3
+ import time
4
+ from pathlib import Path
5
+ from typing import List, Optional, Union
6
+
7
+ import json_repair
8
+ from omagent_core.engine.worker.base import BaseWorker
9
+ from omagent_core.models.asr.stt import STT
10
+ from omagent_core.models.encoders.openai_encoder import OpenaiTextEmbeddingV3
11
+ from omagent_core.models.llms.base import BaseLLMBackend
12
+ from omagent_core.models.llms.prompt import PromptTemplate
13
+ from omagent_core.utils.registry import registry
14
+ from pydantic import Field, field_validator
15
+ from pydub import AudioSegment
16
+ from pydub.effects import normalize
17
+ from scenedetect import open_video
18
+
19
+ from ..misc.scene import VideoScenes
20
+
21
+ CURRENT_PATH = root_path = Path(__file__).parents[0]
22
+
23
+
24
+ @registry.register_worker()
25
+ class VideoPreprocessor(BaseLLMBackend, BaseWorker):
26
+ prompts: List[PromptTemplate] = Field(
27
+ default=[
28
+ PromptTemplate.from_file(
29
+ CURRENT_PATH.joinpath("sys_prompt.prompt"), role="system"
30
+ ),
31
+ PromptTemplate.from_file(
32
+ CURRENT_PATH.joinpath("user_prompt.prompt"), role="user"
33
+ ),
34
+ ]
35
+ )
36
+ text_encoder: OpenaiTextEmbeddingV3
37
+
38
+ stt: STT
39
+ scene_detect_threshold: Union[float, int] = 27
40
+ min_scene_len: int = 1
41
+ frame_extraction_interval: int = 5
42
+ kernel_size: Optional[int] = None
43
+ show_progress: bool = True
44
+
45
+ use_cache: bool = False
46
+ cache_dir: str = "./video_cache"
47
+
48
+ @field_validator("stt", mode="before")
49
+ @classmethod
50
+ def validate_asr(cls, stt):
51
+ if isinstance(stt, STT):
52
+ return stt
53
+ elif isinstance(stt, dict):
54
+ return STT(**stt)
55
+ else:
56
+ raise ValueError("Invalid STT type.")
57
+
58
+ def calculate_md5(self, file_path):
59
+ md5_hash = hashlib.md5()
60
+ with open(file_path, "rb") as file:
61
+ for byte_block in iter(lambda: file.read(4096), b""):
62
+ md5_hash.update(byte_block)
63
+ return md5_hash.hexdigest()
64
+
65
+ def _run(self, test: str, *args, **kwargs):
66
+ """
67
+ Process video files by:
68
+ 1. Calculating MD5 hash of input video for caching
69
+ 2. Loading video from cache if available and use_cache=True
70
+ 3. Otherwise, processing video by:
71
+ - Extracting audio and video streams
72
+ - Detecting scene boundaries
73
+ - Extracting frames at specified intervals
74
+ - Generating scene summaries using LLM
75
+ - Caching results for future use
76
+
77
+ Args:
78
+ video_path (str): Path to input video file
79
+ *args: Variable length argument list
80
+ **kwargs: Arbitrary keyword arguments
81
+
82
+ Returns:
83
+ dict: Dictionary containing video_md5 and video_path
84
+ """
85
+ video_path = self.input.read_input(
86
+ workflow_instance_id=self.workflow_instance_id,
87
+ input_prompt="Please input the video path:",
88
+ )["messages"][0]["content"][0]["data"]
89
+ video_md5 = self.calculate_md5(video_path)
90
+ kwargs["video_md5"] = video_md5
91
+
92
+ cache_path = (
93
+ Path(self.cache_dir)
94
+ .joinpath(video_path.replace("/", "-"))
95
+ .joinpath("video_cache.pkl")
96
+ )
97
+ # Load video from cache if available
98
+ if self.use_cache and cache_path.exists():
99
+ with open(cache_path, "rb") as f:
100
+ loaded_scene = pickle.load(f)
101
+ try:
102
+ audio = AudioSegment.from_file(video_path)
103
+ audio = normalize(audio)
104
+ except Exception:
105
+ audio = None
106
+ video = VideoScenes(
107
+ stream=open_video(video_path),
108
+ audio=audio,
109
+ scenes=loaded_scene,
110
+ frame_extraction_interval=self.frame_extraction_interval,
111
+ )
112
+ self.callback.send_block(
113
+ agent_id=self.workflow_instance_id,
114
+ msg="Loaded video scenes from cache.\nResume the interrupted transfer for results with scene.summary of None.",
115
+ )
116
+ for index, scene in enumerate(video.scenes):
117
+ if scene.summary is None:
118
+ self.callback.send_block(
119
+ agent_id=self.workflow_instance_id,
120
+ msg=f"Resume the interrupted transfer for scene {index}.",
121
+ )
122
+ video_frames, time_stamps = video.get_video_frames(scene)
123
+ try:
124
+ chat_complete_res = self.infer(
125
+ input_list=[
126
+ {
127
+ "stt_res": scene.conversation,
128
+ "img_placeholders": "".join(
129
+ [
130
+ f"<image_{i}>"
131
+ for i in range(len(video_frames))
132
+ ]
133
+ ),
134
+ }
135
+ ],
136
+ images=video_frames,
137
+ )
138
+ scene.summary = chat_complete_res[0]["choices"][0][
139
+ "message"
140
+ ]["content"]
141
+ scene_info = scene.summary.get("scene", [])
142
+ events = scene.summary.get("events", [])
143
+ start_time = scene.start.get_seconds()
144
+ end_time = scene.end.get_seconds()
145
+ content = (
146
+ f"Time in video: {scene.summary.get('time', 'null')}\n"
147
+ f"Location: {scene.summary.get('location', 'null')}\n"
148
+ f"Character': {scene.summary.get('character', 'null')}\n"
149
+ f"Events: {events}\n"
150
+ f"Scene: {scene_info}\n"
151
+ f"Summary: {scene.summary.get('summary', '')}"
152
+ )
153
+ content_vector = self.text_encoder.infer([content])[0]
154
+ self.ltm[index] = {
155
+ "value": {
156
+ "video_md5": video_md5,
157
+ "content": content,
158
+ "start_time": start_time,
159
+ "end_time": end_time,
160
+ },
161
+ "embedding": content_vector,
162
+ }
163
+ except Exception as e:
164
+ self.callback.error(
165
+ f"Failed to resume scene {index}: {e}. Set to default."
166
+ )
167
+ scene.summary = {
168
+ "time": "",
169
+ "location": "",
170
+ "character": "",
171
+ "events": [],
172
+ "scene": [],
173
+ "summary": "",
174
+ }
175
+ self.stm(self.workflow_instance_id)["video"] = video.to_serializable()
176
+ # Cache the processed video scenes
177
+ with open(cache_path, "wb") as f:
178
+ pickle.dump(video.scenes, f)
179
+
180
+ # Process video if not loaded from cache
181
+ if not self.stm(self.workflow_instance_id).get("video", None):
182
+ video = VideoScenes.load(
183
+ video_path=video_path,
184
+ threshold=self.scene_detect_threshold,
185
+ min_scene_len=self.min_scene_len,
186
+ frame_extraction_interval=self.frame_extraction_interval,
187
+ show_progress=self.show_progress,
188
+ kernel_size=self.kernel_size,
189
+ )
190
+ self.stm(self.workflow_instance_id)["video"] = video.to_serializable()
191
+
192
+ for index, scene in enumerate(video.scenes):
193
+ print(f"Processing scene {index} / {len(video.scenes)}...")
194
+ audio_clip = video.get_audio_clip(scene)
195
+ if audio_clip is None:
196
+ scene.stt_res = {"text": ""}
197
+ else:
198
+ scene.stt_res = self.stt.infer(audio_clip)
199
+ video_frames, time_stamps = video.get_video_frames(scene)
200
+ try:
201
+ face_rec = registry.get_tool("FaceRecognition")
202
+ for frame in video_frames:
203
+ objs = face_rec.infer(frame)
204
+ face_rec.visual_prompting(frame, objs)
205
+ except Exception:
206
+ pass
207
+ try:
208
+ chat_complete_res = self.infer(
209
+ input_list=[
210
+ {
211
+ "stt_res": scene.conversation,
212
+ "img_placeholders": "".join(
213
+ [f"<image_{i}>" for i in range(len(video_frames))]
214
+ ),
215
+ }
216
+ ],
217
+ images=video_frames,
218
+ )
219
+ scene.summary = chat_complete_res[0]["choices"][0]["message"][
220
+ "content"
221
+ ]
222
+ scene_info = scene.summary.get("scene", [])
223
+ events = scene.summary.get("events", [])
224
+ start_time = scene.start.get_seconds()
225
+ end_time = scene.end.get_seconds()
226
+ content = (
227
+ f"Time in video: {scene.summary.get('time', 'null')}\n"
228
+ f"Location: {scene.summary.get('location', 'null')}\n"
229
+ f"Character': {scene.summary.get('character', 'null')}\n"
230
+ f"Events: {events}\n"
231
+ f"Scene: {scene_info}\n"
232
+ f"Summary: {scene.summary.get('summary', '')}"
233
+ )
234
+ content_vector = self.text_encoder.infer([content])[0]
235
+ self.ltm[index] = {
236
+ "value": {
237
+ "video_md5": video_md5,
238
+ "content": content,
239
+ "start_time": start_time,
240
+ "end_time": end_time,
241
+ },
242
+ "embedding": content_vector,
243
+ }
244
+ except Exception as e:
245
+ self.callback.error(f"Failed to process scene {index}: {e}")
246
+ scene.summary = None
247
+
248
+ if self.use_cache and not cache_path.exists():
249
+ cache_path.parent.mkdir(parents=True, exist_ok=True)
250
+ with open(cache_path, "wb") as f:
251
+ pickle.dump(video.scenes, f)
252
+ return {
253
+ "video_md5": video_md5
254
+ }
agent/video_preprocessor/webpage_vp.py ADDED
@@ -0,0 +1,252 @@
1
+ import hashlib
2
+ import pickle
3
+ import time
4
+ from pathlib import Path
5
+ from typing import List, Optional, Union
6
+
7
+ import json_repair
8
+ from omagent_core.engine.worker.base import BaseWorker
9
+ from omagent_core.models.asr.stt import STT
10
+ from omagent_core.models.encoders.openai_encoder import OpenaiTextEmbeddingV3
11
+ from omagent_core.models.llms.base import BaseLLMBackend
12
+ from omagent_core.models.llms.prompt import PromptTemplate
13
+ from omagent_core.utils.registry import registry
14
+ from pydantic import Field, field_validator
15
+ from pydub import AudioSegment
16
+ from pydub.effects import normalize
17
+ from scenedetect import open_video
18
+
19
+ from ..misc.scene import VideoScenes
20
+
21
+ CURRENT_PATH = root_path = Path(__file__).parents[0]
22
+
23
+
24
+ @registry.register_worker()
25
+ class WebpageVideoPreprocessor(BaseLLMBackend, BaseWorker):
26
+ prompts: List[PromptTemplate] = Field(
27
+ default=[
28
+ PromptTemplate.from_file(
29
+ CURRENT_PATH.joinpath("sys_prompt.prompt"), role="system"
30
+ ),
31
+ PromptTemplate.from_file(
32
+ CURRENT_PATH.joinpath("user_prompt.prompt"), role="user"
33
+ ),
34
+ ]
35
+ )
36
+ text_encoder: OpenaiTextEmbeddingV3
37
+
38
+ stt: STT
39
+ scene_detect_threshold: Union[float, int] = 27
40
+ min_scene_len: int = 1
41
+ frame_extraction_interval: int = 5
42
+ kernel_size: Optional[int] = None
43
+ show_progress: bool = True
44
+
45
+ use_cache: bool = False
46
+ cache_dir: str = "./video_cache"
47
+
48
+ @field_validator("stt", mode="before")
+     @classmethod
+     def validate_asr(cls, stt):
+         if isinstance(stt, STT):
+             return stt
+         elif isinstance(stt, dict):
+             return STT(**stt)
+         else:
+             raise ValueError("Invalid STT type.")
+
+     def calculate_md5(self, file_path):
+         md5_hash = hashlib.md5()
+         with open(file_path, "rb") as file:
+             for byte_block in iter(lambda: file.read(4096), b""):
+                 md5_hash.update(byte_block)
+         return md5_hash.hexdigest()
+
+     def _run(self, video_path: str, *args, **kwargs):
+         """
+         Process video files by:
+         1. Calculating MD5 hash of input video for caching
+         2. Loading video from cache if available and use_cache=True
+         3. Otherwise, processing video by:
+            - Extracting audio and video streams
+            - Detecting scene boundaries
+            - Extracting frames at specified intervals
+            - Generating scene summaries using LLM
+            - Caching results for future use
+
+         Args:
+             video_path (str): Path to input video file
+             *args: Variable length argument list
+             **kwargs: Arbitrary keyword arguments
+
+         Returns:
+             dict: Dictionary containing video_md5 and video_path
+         """
+         video_md5 = self.calculate_md5(video_path)
+         kwargs["video_md5"] = video_md5
+
+         cache_path = (
+             Path(self.cache_dir)
+             .joinpath(video_path.replace("/", "-"))
+             .joinpath("video_cache.pkl")
+         )
+         # Load video from cache if available
+         if self.use_cache and cache_path.exists():
+             with open(cache_path, "rb") as f:
+                 loaded_scene = pickle.load(f)
+             try:
+                 audio = AudioSegment.from_file(video_path)
+                 audio = normalize(audio)
+             except Exception:
+                 audio = None
+             video = VideoScenes(
+                 stream=open_video(video_path),
+                 audio=audio,
+                 scenes=loaded_scene,
+                 frame_extraction_interval=self.frame_extraction_interval,
+             )
+             self.callback.send_block(
+                 agent_id=self.workflow_instance_id,
+                 msg="Loaded video scenes from cache.\nResume the interrupted transfer for results with scene.summary of None.",
+             )
+             for index, scene in enumerate(video.scenes):
+                 if scene.summary is None:
+                     self.callback.send_block(
+                         agent_id=self.workflow_instance_id,
+                         msg=f"Resume the interrupted transfer for scene {index}.",
+                     )
+                     video_frames, time_stamps = video.get_video_frames(scene)
+                     try:
+                         chat_complete_res = self.infer(
+                             input_list=[
+                                 {
+                                     "stt_res": scene.conversation,
+                                     "img_placeholders": "".join(
+                                         [
+                                             f"<image_{i}>"
+                                             for i in range(len(video_frames))
+                                         ]
+                                     ),
+                                 }
+                             ],
+                             images=video_frames,
+                         )
+                         scene.summary = chat_complete_res[0]["choices"][0][
+                             "message"
+                         ]["content"]
+                         scene_info = scene.summary.get("scene", [])
+                         events = scene.summary.get("events", [])
+                         start_time = scene.start.get_seconds()
+                         end_time = scene.end.get_seconds()
+                         content = (
+                             f"Time in video: {scene.summary.get('time', 'null')}\n"
+                             f"Location: {scene.summary.get('location', 'null')}\n"
+                             f"Character: {scene.summary.get('character', 'null')}\n"
+                             f"Events: {events}\n"
+                             f"Scene: {scene_info}\n"
+                             f"Summary: {scene.summary.get('summary', '')}"
+                         )
+                         content_vector = self.text_encoder.infer([content])[0]
+                         self.ltm[index] = {
+                             "value": {
+                                 "video_md5": video_md5,
+                                 "content": content,
+                                 "start_time": start_time,
+                                 "end_time": end_time,
+                             },
+                             "embedding": content_vector,
+                         }
+                     except Exception as e:
+                         self.callback.error(
+                             f"Failed to resume scene {index}: {e}. Set to default."
+                         )
+                         scene.summary = {
+                             "time": "",
+                             "location": "",
+                             "character": "",
+                             "events": [],
+                             "scene": [],
+                             "summary": "",
+                         }
+             self.stm(self.workflow_instance_id)["video"] = video.to_serializable()
+             # Cache the processed video scenes
+             with open(cache_path, "wb") as f:
+                 pickle.dump(video.scenes, f)
+
+         # Process video if not loaded from cache
+         if not self.stm(self.workflow_instance_id).get("video", None):
+             video = VideoScenes.load(
+                 video_path=video_path,
+                 threshold=self.scene_detect_threshold,
+                 min_scene_len=self.min_scene_len,
+                 frame_extraction_interval=self.frame_extraction_interval,
+                 show_progress=self.show_progress,
+                 kernel_size=self.kernel_size,
+             )
+             self.stm(self.workflow_instance_id)["video"] = video.to_serializable()
+
+             for index, scene in enumerate(video.scenes):
+                 print(f"Processing scene {index} / {len(video.scenes)}...")
+                 audio_clip = video.get_audio_clip(scene)
+                 if audio_clip is None:
+                     scene.stt_res = {"text": ""}
+                 else:
+                     scene.stt_res = self.stt.infer(audio_clip)
+                 video_frames, time_stamps = video.get_video_frames(scene)
+                 try:
+                     face_rec = registry.get_tool("FaceRecognition")
+                     for frame in video_frames:
+                         objs = face_rec.infer(frame)
+                         face_rec.visual_prompting(frame, objs)
+                 except Exception:
+                     pass
+                 try:
+                     chat_complete_res = self.infer(
+                         input_list=[
+                             {
+                                 "stt_res": scene.conversation,
+                                 "img_placeholders": "".join(
+                                     [f"<image_{i}>" for i in range(len(video_frames))]
+                                 ),
+                             }
+                         ],
+                         images=video_frames,
+                     )
+                     scene.summary = chat_complete_res[0]["choices"][0]["message"][
+                         "content"
+                     ]
+                     scene_info = scene.summary.get("scene", [])
+                     events = scene.summary.get("events", [])
+                     start_time = scene.start.get_seconds()
+                     end_time = scene.end.get_seconds()
+                     content = (
+                         f"Time in video: {scene.summary.get('time', 'null')}\n"
+                         f"Location: {scene.summary.get('location', 'null')}\n"
+                         f"Character: {scene.summary.get('character', 'null')}\n"
+                         f"Events: {events}\n"
+                         f"Scene: {scene_info}\n"
+                         f"Summary: {scene.summary.get('summary', '')}"
+                     )
+                     content_vector = self.text_encoder.infer([content])[0]
+                     self.ltm[index] = {
+                         "value": {
+                             "video_md5": video_md5,
+                             "content": content,
+                             "start_time": start_time,
+                             "end_time": end_time,
+                         },
+                         "embedding": content_vector,
+                     }
+                 except Exception as e:
+                     self.callback.error(f"Failed to process scene {index}: {e}")
+                     scene.summary = None
+
+         if self.use_cache and not cache_path.exists():
+             cache_path.parent.mkdir(parents=True, exist_ok=True)
+             with open(cache_path, "wb") as f:
+                 pickle.dump(video.scenes, f)
+         return {
+             "video_md5": video_md5,
+             "video_path": video_path,
+             "instance_id": self.workflow_instance_id,
+         }
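For reference, a minimal sketch of the shape of one long-term-memory record that `_run` writes into `self.ltm` above. The hash, text and timestamps are hypothetical placeholders; the embedding length follows the `dim: 3072` setting in `container.yaml`.

```python
# Illustrative only: one LTM entry as written by the preprocessing loop above.
ltm_record = {
    "value": {
        "video_md5": "9e107d9d372bb6826bd81d3542a419d6",  # hypothetical MD5 of the input video
        "content": "Time in video: 00:01:10\nLocation: kitchen\n...",  # scene summary text
        "start_time": 70.0,  # scene start in seconds
        "end_time": 95.5,    # scene end in seconds
    },
    "embedding": [0.0] * 3072,  # placeholder vector; real values come from the text encoder
}
```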
agent/video_qa/__init__.py ADDED
File without changes
agent/video_qa/qa.py ADDED
@@ -0,0 +1,82 @@
+ import json
+ import re
+ from pathlib import Path
+ from typing import List
+
+ import json_repair
+ from omagent_core.advanced_components.workflow.dnc.schemas.dnc_structure import \
+     TaskTree
+ from omagent_core.engine.worker.base import BaseWorker
+ from omagent_core.memories.ltms.ltm import LTM
+ from omagent_core.models.encoders.openai_encoder import OpenaiTextEmbeddingV3
+ from omagent_core.models.llms.base import BaseLLMBackend
+ from omagent_core.models.llms.prompt import PromptTemplate
+ from omagent_core.utils.registry import registry
+ from pydantic import Field
+
+ from ..misc.scene import VideoScenes
+
+ CURRENT_PATH = root_path = Path(__file__).parents[0]
+
+
+ @registry.register_worker()
+ class VideoQA(BaseWorker, BaseLLMBackend):
+     prompts: List[PromptTemplate] = Field(
+         default=[
+             PromptTemplate.from_file(
+                 CURRENT_PATH.joinpath("sys_prompt.prompt"), role="system"
+             ),
+             PromptTemplate.from_file(
+                 CURRENT_PATH.joinpath("user_prompt.prompt"), role="user"
+             ),
+         ]
+     )
+     text_encoder: OpenaiTextEmbeddingV3
+
+     def _run(self, video_md5: str, *args, **kwargs):
+         self.stm(self.workflow_instance_id)["image_cache"] = {}
+         self.stm(self.workflow_instance_id)["former_results"] = {}
+         question = self.input.read_input(
+             workflow_instance_id=self.workflow_instance_id,
+             input_prompt="Please input your question:",
+         )["messages"][0]["content"][0]["data"]
+         chat_complete_res = self.simple_infer(question=question)
+         content = chat_complete_res["choices"][0]["message"]["content"]
+         content = json_repair.loads(content)
+         try:
+             start_time = (
+                 None if content.get("start_time", -1) == -1 else content.get("start_time")
+             )
+             end_time = (
+                 None if content.get("end_time", -1) == -1 else content.get("end_time")
+             )
+         except Exception as e:
+             start_time = None
+             end_time = None
+         question_vector = self.text_encoder.infer([question])[0]
+         filter_expr = ""
+         if video_md5 is not None:
+             filter_expr = f"value['video_md5']=='{video_md5}'"
+         if start_time is not None and end_time is not None:
+             filter_expr += f" and (value['start_time']>={max(0, start_time - 10)} and value['end_time']<={end_time + 10})"
+         elif start_time is not None:
+             filter_expr += f" and value['start_time']>={max(0, start_time - 10)}"
+         elif end_time is not None:
+             filter_expr += f" and value['end_time']<={end_time + 10}"
+         related_information = self.ltm.get_by_vector(
+             embedding=question_vector, top_k=5, threshold=0.2, filter=filter_expr
+         )
+         related_information = [
+             f"Time span: {each['start_time']} - {each['end_time']}\n{each['content']}"
+             for _, each in related_information
+         ]
+         video = VideoScenes.from_serializable(
+             self.stm(self.workflow_instance_id)["video"]
+         )
+         self.stm(self.workflow_instance_id)["extra"] = {
+             "video_information": "video is already loaded in the short-term memory(stm).",
+             "video_duration_seconds(s)": video.stream.duration.get_seconds(),
+             "frame_rate": video.stream.frame_rate,
+             "video_summary": "\n---\n".join(related_information),
+         }
+         return {"query": question, "last_output": None}
agent/video_qa/sys_prompt.prompt ADDED
@@ -0,0 +1,8 @@
+ You are a master of time extraction, capable of analyzing the temporal relationships in question and extracting timestamps, with the extracted times in seconds.
+ Time format in question is in the form of "HH:MM:SS" or "MM:SS".
+ ---
+ The output should be a json object as follows:
+ {
+     "start_time": start time in seconds, -1 if not found,
+     "end_time": end time in seconds, -1 if not found,
+ }
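For example, given a hypothetical question such as "What happens between 01:10 and 01:30?", this prompt should yield `{"start_time": 70, "end_time": 90}`; a question with no explicit times should produce `{"start_time": -1, "end_time": -1}`.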
agent/video_qa/user_prompt.prompt ADDED
@@ -0,0 +1 @@
+ Question: {{question}}
agent/video_qa/webpage_qa.py ADDED
@@ -0,0 +1,73 @@
+ from pathlib import Path
+ from typing import List
+
+ import json_repair
+ from pydantic import Field
+
+ from omagent_core.engine.worker.base import BaseWorker
+ from omagent_core.models.encoders.openai_encoder import OpenaiTextEmbeddingV3
+ from omagent_core.models.llms.base import BaseLLMBackend
+ from omagent_core.models.llms.prompt import PromptTemplate
+ from omagent_core.utils.registry import registry
+ from ..misc.scene import VideoScenes
+
+ CURRENT_PATH = root_path = Path(__file__).parents[0]
+
+
+ @registry.register_worker()
+ class WebpageVideoQA(BaseWorker, BaseLLMBackend):
+     prompts: List[PromptTemplate] = Field(
+         default=[
+             PromptTemplate.from_file(
+                 CURRENT_PATH.joinpath("sys_prompt.prompt"), role="system"
+             ),
+             PromptTemplate.from_file(
+                 CURRENT_PATH.joinpath("user_prompt.prompt"), role="user"
+             ),
+         ]
+     )
+     text_encoder: OpenaiTextEmbeddingV3
+
+     def _run(self, video_md5: str, video_path: str, instance_id: str, question: str, *args, **kwargs):
+         self.stm(self.workflow_instance_id)["image_cache"] = {}
+         self.stm(self.workflow_instance_id)["former_results"] = {}
+         chat_complete_res = self.simple_infer(question=question)
+         content = chat_complete_res["choices"][0]["message"]["content"]
+         content = json_repair.loads(content)
+         try:
+             start_time = (
+                 None if content.get("start_time", -1) == -1 else content.get("start_time")
+             )
+             end_time = (
+                 None if content.get("end_time", -1) == -1 else content.get("end_time")
+             )
+         except Exception as e:
+             start_time = None
+             end_time = None
+         question_vector = self.text_encoder.infer([question])[0]
+         filter_expr = ""
+         if video_md5 is not None:
+             filter_expr = f"value['video_md5']=='{video_md5}'"
+         if start_time is not None and end_time is not None:
+             filter_expr += f" and (value['start_time']>={max(0, start_time - 10)} and value['end_time']<={end_time + 10})"
+         elif start_time is not None:
+             filter_expr += f" and value['start_time']>={max(0, start_time - 10)}"
+         elif end_time is not None:
+             filter_expr += f" and value['end_time']<={end_time + 10}"
+         related_information = self.ltm.get_by_vector(
+             embedding=question_vector, top_k=5, threshold=0.2, filter=filter_expr
+         )
+         related_information = [
+             f"Time span: {each['start_time']} - {each['end_time']}\n{each['content']}"
+             for _, each in related_information
+         ]
+         video = VideoScenes.from_serializable(
+             self.stm(self.workflow_instance_id)["video"]
+         )
+         self.stm(self.workflow_instance_id)["extra"] = {
+             "video_information": "video is already loaded in the short-term memory(stm).",
+             "video_duration_seconds(s)": video.stream.duration.get_seconds(),
+             "frame_rate": video.stream.frame_rate,
+             "video_summary": "\n---\n".join(related_information),
+         }
+         return {"query": question, "last_output": None}
app.py ADDED
@@ -0,0 +1,136 @@
+ # Import required modules and components
+ import base64
+ import hashlib
+ import json
+ import os
+ from pathlib import Path
+
+ from Crypto.Cipher import AES
+
+
+ class Encrypt(object):
+
+     @staticmethod
+     def pad(s):
+         AES_BLOCK_SIZE = 16  # Bytes
+         return s + (AES_BLOCK_SIZE - len(s) % AES_BLOCK_SIZE) * \
+             chr(AES_BLOCK_SIZE - len(s) % AES_BLOCK_SIZE)
+
+     @staticmethod
+     def unpad(s):
+         return s[:-ord(s[len(s) - 1:])]
+
+     # MD5 hash using hashlib
+     @staticmethod
+     def hash_md5_encrypt(data: (str, bytes), salt=None) -> str:
+         if isinstance(data, str):
+             data = data.encode('utf-8')
+         md5 = hashlib.md5()
+         if salt:
+             if isinstance(salt, str):
+                 salt = salt.encode('utf-8')
+             md5.update(salt)
+         md5.update(data)
+         return md5.hexdigest()
+
+     @staticmethod
+     # @catch_exc()
+     def aes_decrypt(key: str, data: str) -> str:
+         '''
+         :param key: secret key
+         :param data: encrypted data (ciphertext)
+         :return: plaintext
+         '''
+         key = key.encode('utf8')
+         data = base64.b64decode(data)
+         cipher = AES.new(key, AES.MODE_ECB)
+         # Strip the padding
+         text_decrypted = Encrypt.unpad(cipher.decrypt(data))
+         text_decrypted = text_decrypted.decode('utf8')
+         return text_decrypted
+
+
+ secret = 'FwALd7BY8IUrbnrigH3YYlhGD/XvMVX7'
+ encrypt = 'sJWveD1LIxIxYGZvZMRlb+8vJjq5yJmXnqSKfHM6AhgmZWMPcFuTNbpJCHNVnjqminXZLsIbFWazoyAUNP1piKOrtBGHF8NaunP/6lp2CJKVMIrxo8z/IxN0IstwcULjaFLilf68/PFXhwZ1gv4PZmu2Z2iwSLAyVxXkmIwjFUp0TQv7xtHpwj2KH/80BgjAOGFlZ8OSwlsum9BqD68a1q3QMi1IcyG1SlUSiiKB5bREfhfXxCgOV2EOYrPPurrT/hHLuZaFSu2YjV/ZkEkumjJZu5sGUElw7dZWdNhJibMjtsA4saNBrjp6gO3/q4i1YLWKTM5HQeTjMkAHt0FH2FigXIER1xZqma94bIaDZoo='

+ env = json.loads(Encrypt.aes_decrypt(secret, encrypt))
+ for k, v in env.items():
+     os.environ.setdefault(k, v)
+ from agent.conclude.webpage_conclude import WebpageConclude
+ from agent.video_preprocessor.webpage_vp import WebpageVideoPreprocessor
+ from agent.video_qa.webpage_qa import WebpageVideoQA
+ from webpage import WebpageClient
+
+ from omagent_core.advanced_components.workflow.dnc.workflow import DnCWorkflow
+ from omagent_core.engine.workflow.conductor_workflow import ConductorWorkflow
+ from omagent_core.engine.workflow.task.simple_task import simple_task
+ from omagent_core.utils.container import container
+ from omagent_core.utils.logger import logging
+ from omagent_core.utils.registry import registry
+
+
+ def app():
+     logging.init_logger("omagent", "omagent", level="INFO")
+
+     # Set current working directory path
+     CURRENT_PATH = root_path = Path(__file__).parents[0]
+
+     # Import registered modules
+     registry.import_module(project_path=CURRENT_PATH.joinpath("agent"))
+
+     # Load container configuration from YAML file
+     container.register_stm("SharedMemSTM")
+     container.register_ltm(ltm="VideoMilvusLTM")
+     container.from_config(CURRENT_PATH.joinpath("container.yaml"))
+
+     # Initialize the webpage video understanding workflows
+     workflow = ConductorWorkflow(name="webpage_video_understanding")
+     process_workflow = ConductorWorkflow(name="webpage_process_video_understanding")
+     # 1. Video preprocess task for video preprocessing
+     video_preprocess_task = simple_task(
+         task_def_name=WebpageVideoPreprocessor,
+         task_reference_name="webpage_video_preprocess",
+         inputs={"video_path": process_workflow.input("video_path")}
+     )
+
+     # 2. Video QA task for question answering over the video
+     video_qa_task = simple_task(
+         task_def_name=WebpageVideoQA,
+         task_reference_name="webpage_video_qa",
+         inputs={
+             "video_md5": workflow.input("video_md5"),
+             "video_path": workflow.input("video_path"),
+             "instance_id": workflow.input("instance_id"),
+             "question": workflow.input("question"),
+         },
+     )
+
+     dnc_workflow = DnCWorkflow()
+     dnc_workflow.set_input(query=video_qa_task.output("query"))
+     # 3. Conclude task for task conclusion
+     conclude_task = simple_task(
+         task_def_name=WebpageConclude,
+         task_reference_name="webpage_task_conclude",
+         inputs={
+             "dnc_structure": dnc_workflow.dnc_structure,
+             "last_output": dnc_workflow.last_output,
+         },
+     )
+
+     # Configure workflow execution flow: Video Preprocess -> Video QA -> DnC Loop -> Conclude
+     process_workflow >> video_preprocess_task
+     workflow >> video_preprocess_task >> video_qa_task >> dnc_workflow >> conclude_task
+
+     # Register workflows
+     workflow.register(overwrite=True)
+     process_workflow.register(overwrite=True)
+
+     # Initialize and start app client with workflow configuration
+     cli_client = WebpageClient(
+         interactor=workflow, processor=process_workflow, config_path="webpage_configs"
+     )
+     cli_client.start_interactor()
+
+
+ if __name__ == '__main__':
+     app()
calculator_code.py ADDED
@@ -0,0 +1,4 @@
+ duration = 10.117
+ frame_rate = 9.88
+ number_of_frames = duration * frame_rate
+ print(number_of_frames)
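For the values above, this prints 10.117 × 9.88 ≈ 99.96, i.e. roughly 100 frames.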
compile_container.py ADDED
@@ -0,0 +1,18 @@
+ # Import core modules and components
+ # Import workflow related modules
+ from pathlib import Path
+
+ from omagent_core.utils.container import container
+ from omagent_core.utils.registry import registry
+
+ # Set up path and import modules
+ CURRENT_PATH = root_path = Path(__file__).parents[0]
+ registry.import_module(project_path=CURRENT_PATH.joinpath("agent"))
+
+ # Register required components
+ container.register_callback(callback="DefaultCallback")
+ container.register_input(input="AppInput")
+ container.register_stm("SharedMemSTM")
+ container.register_ltm(ltm="VideoMilvusLTM")
+ # Compile container config
+ container.compile_config(CURRENT_PATH)
configs/llms/gpt4o.yml ADDED
@@ -0,0 +1,7 @@
+ name: OpenaiGPTLLM
+ model_id: gpt-4o
+ api_key: ${env| custom_openai_key, openai_api_key}
+ endpoint: ${env| custom_openai_endpoint, https://api.openai.com/v1}
+ temperature: 0
+ vision: true
+ response_format: json_object
configs/llms/json_res.yml ADDED
@@ -0,0 +1,6 @@
+ name: OpenaiGPTLLM
+ model_id: gpt-4o
+ api_key: ${env| custom_openai_key, openai_api_key}
+ endpoint: ${env| custom_openai_endpoint, https://api.openai.com/v1}
+ temperature: 0
+ response_format: json_object
configs/llms/text_encoder.yml ADDED
@@ -0,0 +1,3 @@
+ name: OpenaiTextEmbeddingV3
+ endpoint: ${env| custom_openai_endpoint, https://api.openai.com/v1}
+ api_key: ${env| custom_openai_key, openai_api_key}
configs/llms/text_res.yml ADDED
@@ -0,0 +1,6 @@
+ name: OpenaiGPTLLM
+ model_id: gpt-4o
+ api_key: ${env| custom_openai_key, openai_api_key}
+ endpoint: ${env| custom_openai_endpoint, https://api.openai.com/v1}
+ temperature: 0
+ response_format: text
configs/llms/text_res_stream.yml ADDED
@@ -0,0 +1,7 @@
+ name: OpenaiGPTLLM
+ model_id: gpt-4o-mini
+ api_key: ${env| custom_openai_key, openai_api_key}
+ endpoint: ${env| custom_openai_endpoint, https://api.openai.com/v1}
+ temperature: 0
+ stream: true
+ response_format: text
configs/tools/all_tools.yml ADDED
@@ -0,0 +1,12 @@
+ llm: ${sub|text_res}
+ tools:
+   - Calculator
+   - CodeInterpreter
+   - ReadFileContent
+   - WriteFileContent
+   - ShellTool
+   - name: Rewinder
+     llm: ${sub|text_res}
+   - name: WebSearch
+     bing_api_key: ${env|bing_api_key, microsoft_bing_api_key}
+     llm: ${sub|text_res}
configs/workers/conclude.yml ADDED
@@ -0,0 +1,4 @@
+ name: Conclude
+ llm: ${sub|text_res}
+ output_parser:
+   name: StrParser
configs/workers/dnc_workflow.yml ADDED
@@ -0,0 +1,18 @@
+ - name: ConstructDncPayload
+ - name: StructureUpdate
+ - name: TaskConqueror
+   llm: ${sub|json_res}
+   tool_manager: ${sub|all_tools}
+   output_parser:
+     name: StrParser
+ - name: TaskDivider
+   llm: ${sub|json_res}
+   tool_manager: ${sub|all_tools}
+   output_parser:
+     name: StrParser
+ - name: TaskRescue
+   llm: ${sub|text_res}
+   tool_manager: ${sub|all_tools}
+   output_parser:
+     name: StrParser
+ - name: TaskExitMonitor
configs/workers/video_preprocessor.yml ADDED
@@ -0,0 +1,12 @@
+ name: VideoPreprocessor
+ llm: ${sub|gpt4o}
+ use_cache: true
+ scene_detect_threshold: 27
+ frame_extraction_interval: 5
+ stt:
+   name: STT
+   endpoint: ${env| custom_openai_endpoint, https://api.openai.com/v1}
+   api_key: ${env| custom_openai_key, openai_api_key}
+ output_parser:
+   name: DictParser
+ text_encoder: ${sub| text_encoder}
configs/workers/video_qa.yml ADDED
@@ -0,0 +1,5 @@
+ name: VideoQA
+ llm: ${sub|gpt4o}
+ output_parser:
+   name: StrParser
+ text_encoder: ${sub| text_encoder}
container.yaml ADDED
@@ -0,0 +1,154 @@
+ conductor_config:
+   name: Configuration
+   base_url:
+     value: http://localhost:8080
+     description: The Conductor Server API endpoint
+     env_var: CONDUCTOR_SERVER_URL
+   auth_key:
+     value: null
+     description: The authorization key
+     env_var: AUTH_KEY
+   auth_secret:
+     value: null
+     description: The authorization secret
+     env_var: CONDUCTOR_AUTH_SECRET
+   auth_token_ttl_min:
+     value: 45
+     description: The authorization token refresh interval in minutes.
+     env_var: AUTH_TOKEN_TTL_MIN
+   debug:
+     value: false
+     description: Debug mode
+     env_var: DEBUG
+ aaas_config:
+   name: AaasConfig
+   base_url:
+     value: http://localhost:30002
+     description: The aaas task server API endpoint
+     env_var: AAAS_TASK_SERVER_URL
+   token:
+     value: null
+     description: The authorization token
+     env_var: AAAS_TOKEN
+   enable:
+     value: true
+     description: Whether to enable the aaas task server
+     env_var: AAAS_ENABLE
+   domain_token:
+     value: null
+     description: The domain token
+     env_var: DOMAIN_TOKEN
+   is_prod:
+     value: false
+     description: Whether it is a production environment
+     env_var: IS_PROD
+ connectors:
+   redis_stream_client:
+     name: RedisConnector
+     id:
+       value: null
+       env_var: ID
+     host:
+       value: localhost
+       env_var: HOST
+     port:
+       value: 6379
+       env_var: PORT
+     password:
+       value: null
+       env_var: PASSWORD
+     username:
+       value: null
+       env_var: USERNAME
+     db:
+       value: 0
+       env_var: DB
+     use_lite:
+       value: true
+       env_var: USE_LITE
+   milvus_ltm_client:
+     name: MilvusConnector
+     id:
+       value: null
+       env_var: ID
+     host:
+       value: ./db.db
+       env_var: HOST
+     port:
+       value: 19530
+       env_var: PORT
+     password:
+       value: ''
+       env_var: PASSWORD
+     username:
+       value: default
+       env_var: USERNAME
+     db:
+       value: default
+       env_var: DB
+     alias:
+       value: alias
+       env_var: ALIAS
+ components:
+   AppCallback:
+     name: AppCallback
+     id:
+       value: null
+       env_var: ID
+     bot_id:
+       value: ''
+       env_var: BOT_ID
+     start_time:
+       value: 2025-02-17_15:40:53
+       env_var: START_TIME
+   AppInput:
+     name: AppInput
+     id:
+       value: null
+       env_var: ID
+   AaasCallback:
+     name: AaasCallback
+     id:
+       value: null
+       env_var: ID
+     bot_id:
+       value: ''
+       env_var: BOT_ID
+     start_time:
+       value: 2025-02-17_15:40:53
+       env_var: START_TIME
+   AaasInput:
+     name: AaasInput
+     id:
+       value: null
+       env_var: ID
+   DefaultCallback:
+     name: DefaultCallback
+     id:
+       value: null
+       env_var: ID
+     bot_id:
+       value: ''
+       env_var: BOT_ID
+     start_time:
+       value: 2025-02-17_15:40:53
+       env_var: START_TIME
+     incomplete_flag:
+       value: false
+       env_var: INCOMPLETE_FLAG
+   SharedMemSTM:
+     name: SharedMemSTM
+     id:
+       value: null
+       env_var: ID
+   VideoMilvusLTM:
+     name: VideoMilvusLTM
+     id:
+       value: null
+       env_var: ID
+     storage_name:
+       value: default
+       env_var: STORAGE_NAME
+     dim:
+       value: 3072
+       env_var: DIM
docs/images/local-ai.png ADDED
docs/images/video_understanding_workflow_diagram.png ADDED
docs/local-ai.md ADDED
@@ -0,0 +1,87 @@
+ # Local-ai
+ You can use Local-ai to run your own models locally.
+ Follow the instructions of [Local-ai](https://github.com/mudler/LocalAI) to install Local-ai.
+
+ ### Download Local-ai models
+ Download [Whisper](https://huggingface.co/ggerganov/whisper.cpp) and an [embedding model](https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF).
+ Then move the model checkpoint files to /usr/share/local-ai/models/. **Other paths for models are not supported.**
+
+ ### Modify config files
+ Create the Local-ai config files.
+
+ Embedding model yaml:
+ ```yaml
+ name: text-embedding-ada-002
+ backend: llama-cpp
+ embeddings: true
+ parameters:
+   model: llama-3.2-1b-instruct-q4_k_m.gguf # model file name in /usr/share/local-ai/models/
+ ```
+ Whisper yaml:
+ ```yaml
+ name: whisper
+ backend: whisper
+ parameters:
+   model: ggml-model-whisper-base.en.bin # model file name in /usr/share/local-ai/models/
+ ```
+ ### Run the models
+ First run
+ ```bash
+ local-ai run <path-to-your-embedding-model-yaml>
+ ```
+ and
+ ```bash
+ local-ai run <path-to-your-whisper-yaml>
+ ```
+ once, to link each yaml file to its model.
+
+ After that, simply running
+ ```bash
+ local-ai run
+ ```
+ will load both models.
+
+ **Make sure the model names are correct, otherwise the embedding model may return empty results.**
+ ![local-ai get model names right](images/local-ai.png)
+
+ ### Modify the yaml of OmAgent
+ Modify ./configs/llms/text_encoder.yml
+ ```yaml
+ name: OpenaiTextEmbeddingV3
+ model_id: text-embedding-ada-002
+ dim: 2048
+ endpoint: ${env| custom_openai_endpoint, http://localhost:8080/v1}
+ api_key: ${env| custom_openai_key, openai_api_key} # api_key is not needed
+ ```
+ and ./configs/workers/video_preprocessor.yml
+ ```yaml
+ name: VideoPreprocessor
+ llm: ${sub|gpt4o}
+ use_cache: true
+ scene_detect_threshold: 27
+ frame_extraction_interval: 5
+ stt:
+   name: STT
+   endpoint: http://localhost:8080/v1
+   api_key: ${env| custom_openai_key, openai_api_key}
+   model_id: whisper
+ output_parser:
+   name: DictParser
+ text_encoder: ${sub| text_encoder}
+ ```
+ and set dim in ./container.yaml
+ ```yaml
+ VideoMilvusLTM:
+   name: VideoMilvusLTM
+   id:
+     value: null
+     env_var: ID
+   storage_name:
+     value: yyl_video_ltm
+     env_var: STORAGE_NAME
+   dim:
+     value: 2048
+     env_var: DIM
+ ```
+
+ Then you can use your models locally.
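As a quick sanity check (a hedged sketch, not part of the original docs), you can point the standard OpenAI Python client at the Local-ai endpoint and request an embedding; the model name must match the `name` field in the embedding yaml above.

```python
# Minimal sketch: verify the Local-ai embedding endpoint responds.
# Assumes Local-ai is serving on http://localhost:8080 as configured above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.embeddings.create(model="text-embedding-ada-002", input="hello world")
print(len(resp.data[0].embedding))  # should match the dim value set in container.yaml
```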
omagent_core/__init__.py ADDED
File without changes
omagent_core/advanced_components/__init__.py ADDED
File without changes
omagent_core/advanced_components/worker/__init__.py ADDED
File without changes
omagent_core/advanced_components/worker/conclude/__init__.py ADDED
File without changes
omagent_core/advanced_components/worker/conqueror/__init__.py ADDED
File without changes
omagent_core/advanced_components/worker/divider/__init__.py ADDED
File without changes
omagent_core/advanced_components/worker/task_exit_monitor/__init__.py ADDED
File without changes
omagent_core/advanced_components/worker/video_preprocess/__init__.py ADDED
File without changes
omagent_core/advanced_components/workflow/cot/README.md ADDED
@@ -0,0 +1,36 @@
+ # Chain-of-Thought Operator
+ Chain of Thought (CoT) is a workflow operator that breaks down a complex problem into a series of intermediate steps or thoughts that lead to a final answer.
+
+ You can refer to the example in the `examples/cot` directory to understand how to use this operator; a minimal wiring sketch also follows the config section below.
+
+ # Inputs, Outputs and Configs
+
+ ## Inputs:
+ The inputs that the Chain of Thought (CoT) operator requires are as follows:
+ | Name | Type | Required | Description |
+ | ----------- | ----- | -------- | ----------- |
+ | query | str | true | The user's question |
+ | cot_method | str | optional | The CoT method: `few_shot` or `zero_shot` |
+ | cot_examples| list | optional | Examples used for the `few_shot` CoT method |
+ | id | int | optional | An identifier for tracking the question |
+
+ ## Outputs:
+ The outputs that the Chain of Thought (CoT) operator returns are as follows:
+ | Name | Type | Description |
+ | -------- | ----- | ---- |
+ | id | int | An identifier for tracking the question |
+ | question | str | The complete prompt string containing the query |
+ | last_output | str | The final answer generated by the agent |
+ | prompt_tokens | int | The total number of tokens in the question |
+ | completion_tokens | int | The total number of tokens in the answer |
+
+ ## Configs:
+ The config of the Chain of Thought (CoT) operator is as follows; you can copy it into your project as a cot_workflow.yml file.
+ ```yml
+ - name: CoTReasoning
+   llm: ${sub|gpt4o}
+   output_parser:
+     name: StrParser
+   concurrency: 15
+
+ ```
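For orientation, here is a minimal sketch (not the packaged `examples/cot` code) of how the operator's reasoning worker could be wired into a workflow, following the same `simple_task` pattern used in `app.py` above; the workflow name and input wiring are illustrative assumptions.

```python
# Hypothetical wiring sketch for the CoT operator, mirroring the simple_task pattern in app.py.
from omagent_core.engine.workflow.conductor_workflow import ConductorWorkflow
from omagent_core.engine.workflow.task.simple_task import simple_task

workflow = ConductorWorkflow(name="cot_example")
cot_task = simple_task(
    task_def_name="CoTReasoning",           # the worker registered via @registry.register_worker()
    task_reference_name="cot_reasoning",
    inputs={
        "id": workflow.input("id"),
        "query": workflow.input("query"),
        "cot_method": "zero_shot",          # or "few_shot" together with cot_examples
        "cot_examples": [],
    },
)
workflow >> cot_task
workflow.register(overwrite=True)
```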
omagent_core/advanced_components/workflow/cot/agent/cot_reasoning/cot_reasoning.py ADDED
@@ -0,0 +1,67 @@
+ from omagent_core.models.llms.openai_gpt import OpenaiGPTLLM
+ from omagent_core.engine.worker.base import BaseWorker
+ from omagent_core.utils.registry import registry
+ from omagent_core.models.llms.base import BaseLLMBackend
+ from omagent_core.models.llms.schemas import Message, Content
+ from omagent_core.utils.logger import logging
+ from omagent_core.models.llms.prompt.prompt import PromptTemplate
+ from omagent_core.advanced_components.workflow.cot.schemas.cot_create_examples import CoTExample
+ from pydantic import Field
+ from pathlib import Path
+ from typing import List
+
+ CURRENT_PATH = Path( __file__ ).parents[ 0 ]
+
+
+ @registry.register_worker()
+ class CoTReasoning( BaseLLMBackend, BaseWorker ):
+
+     prompts: List[ PromptTemplate ] = Field( default=[] )
+
+     def _run( self, id: int, query: str, cot_method: str, cot_examples: List[ dict ] = [], *args, **kwargs ):
+         """
+         Executes a reasoning task based on the specified Chain-of-Thought (CoT) method.
+         Args:
+             id (int): The identifier for the reasoning task.
+             query (str): The query string to be processed.
+             cot_method (str): The CoT method to use, either 'few_shot' or 'zero_shot'.
+             cot_examples (List[dict], optional): A list of examples for few-shot CoT. Defaults to an empty list.
+             *args: Additional positional arguments.
+             **kwargs: Additional keyword arguments.
+         Returns:
+             dict: A dictionary containing the task id, question, model output, prompt tokens, and completion tokens.
+         Raises:
+             ValueError: If an invalid CoT method is provided.
+         """
+
+         if cot_method == 'few_shot':
+             self.prompts = [
+                 PromptTemplate.from_file( CURRENT_PATH.joinpath( "few_shot_cot.prompt" ), role="user" ),
+             ]
+
+             assert cot_examples, "Few-shot COT requires examples."
+
+             demo = CoTExample().create_examples( cot_examples )
+
+             res = self.simple_infer( query=query, demo=demo )
+
+             body = self.llm._msg2req( [ p for prompts in self.prep_prompt( [ { "query": query, "demo": demo} ] ) for p in prompts ] )
+         elif cot_method == 'zero_shot':
+             self.prompts = [
+                 PromptTemplate.from_file( CURRENT_PATH.joinpath( "zero_shot_cot.prompt" ), role="user" ),
+             ]
+
+             res = self.simple_infer( query=query )
+
+             body = self.llm._msg2req( [ p for prompts in self.prep_prompt( [ { "query": query} ] ) for p in prompts ] )
+         else:
+             raise ValueError( f"Invalid cot_method: {cot_method}" )
+
+         # Extract the reasoning result from the response.
+         prompt_tokens = res[ 'usage' ][ 'prompt_tokens' ]
+         completion_tokens = res[ 'usage' ][ 'completion_tokens' ]
+         last_output = res[ "choices" ][ 0 ][ "message" ][ "content" ]
+         question = body.get( 'messages' )[ 0 ][ 'content' ][ 0 ][ 'text' ]
+
+         self.callback.send_answer(self.workflow_instance_id, msg=last_output)
+         return { 'id': id, 'question': question, 'last_output': last_output, 'prompt_tokens': prompt_tokens, "completion_tokens": completion_tokens}