OpenAdapt Architecture (draft)
OpenAdapt is the open source AI-First Process Automation library. We have created, tested, and documented a number of reusable software components that we believe can serve as building blocks for AI-First Process Automation. We make these available to the community for free (MIT license).
We are seeking feedback on our proposed process automation architecture (below).
Image generated via https://colab.research.google.com/drive/1iH2QlFE06-_vDzDO0Z8Yov41xfdNVHdO
- Client is installed on user's desktop computer (Windows or Mac)
- User triggers "start recording" via Tray Icon to start recording time-aligned user Action events (mouse/keyboard), associated Screenshots, and active Window State (retrieved from operating system accessibility API)
- User triggers "stop recording" via Tray Icon to stop recording
- Operating-system level events (e.g. 100 mouse movements sampled at 100 Hz) are merged/reduced into Process-level events (e.g. a single mouse position)
- Protected Health Information (PHI) / Personally Identifiable Information (PII) is scrubbed from all recorded data
- Screenshots are segmented via Segment Anything (https://arxiv.org/abs/2304.02643) and Marks are overlaid on objects for Set-of-Mark prompting (https://arxiv.org/abs/2310.11441).
- Large Language Models (LLMs) / Large Multimodal Models (LMMs) are repeatedly prompted to summarize the Recording into a Process Description (i.e. high-level Python code) using Chain of Code prompting (https://arxiv.org/abs/2312.04474), in which function calls represent Process Steps (e.g. "scroll_in_options_tab_until_save_button()", "click_save_button()").
- For evaluating correctness on Recordings, models are also prompted to create a graph representation of the process that can be compared against the recorded events.
- LLMs/LMMs are prompted to generate the Next Action given the current Marked Screenshot and the current Process Step in the Process Description.
- Next Action is played
- LLMs/LMMs are prompted to determine whether the current Process Step in the Process Description was successfully completed (i.e. whether Completion Criteria are satisfied).
- If successfully completed, advance to the next Process Step and continue from step 8. Otherwise, start a Recording, and alert the user that assistance is required.
- If assistance is required, the user is asked to take corrective action, then to stop the recording and/or resume replay via the Tray Icon.
- Recordings and other data are optionally transferable peer-to-peer via a decentralized and safe open source protocol (Magic Wormhole).
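The replay portion of the steps above (generate the Next Action, play it, check the Completion Criteria, then advance or ask for help) can be sketched as a simple loop. This is a minimal illustration rather than OpenAdapt's implementation: `replay`, `generate_next_action`, `play_action`, and `is_step_complete` are hypothetical names, and the model prompts are replaced by plain callables.

```python
from typing import Callable, List


def replay(
    process_steps: List[str],
    generate_next_action: Callable[[str], str],
    play_action: Callable[[str], None],
    is_step_complete: Callable[[str], bool],
) -> str:
    """Replay each Process Step; return "done" or "needs_assistance:<step>"."""
    for step in process_steps:
        action = generate_next_action(step)  # stands in for an LLM/LMM prompt
        play_action(action)
        if not is_step_complete(step):  # stands in for the Completion Criteria check
            # A real system would start a Recording and alert the user here.
            return f"needs_assistance:{step}"
    return "done"


# Toy usage with stubbed "models": every step completes after one action.
played: List[str] = []
result = replay(
    ["scroll_in_options_tab_until_save_button", "click_save_button"],
    lambda step: f"action_for:{step}",
    played.append,
    lambda step: True,
)
```

Passing the model calls in as callables keeps the control flow testable without any hosted model.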
Event Merging optimizes user interaction data by applying a series of functions:
- merge_consecutive_keyboard_events
- merge_consecutive_mouse_move_events
- merge_consecutive_mouse_scroll_events
- remove_redundant_mouse_move_events
- merge_consecutive_mouse_click_events
These functions condense many events at the operating system level ("primitive" events) into process-level events ("composite" events), creating a dataset that is more economical and easier to reason about for both humans and models.
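As a rough illustration of what one of these functions might do, the sketch below collapses each run of consecutive mouse move events into the final event of the run. The function name comes from the list above, but the body and the event schema (dictionaries with `name`, `x`, and `y` keys) are assumptions made for this example, not OpenAdapt's actual code.

```python
from typing import Dict, List

Event = Dict[str, object]  # hypothetical event schema for this sketch


def merge_consecutive_mouse_move_events(events: List[Event]) -> List[Event]:
    """Collapse each run of consecutive "move" events into its final event."""
    merged: List[Event] = []
    for event in events:
        if merged and event["name"] == "move" and merged[-1]["name"] == "move":
            merged[-1] = event  # keep only the latest position in the run
        else:
            merged.append(event)
    return merged


events = [
    {"name": "move", "x": 0, "y": 0},
    {"name": "move", "x": 5, "y": 5},
    {"name": "move", "x": 9, "y": 9},
    {"name": "click", "x": 9, "y": 9},
    {"name": "move", "x": 20, "y": 3},
]
merged_events = merge_consecutive_mouse_move_events(events)
```

Here five primitive events reduce to three: the three leading moves become a single move ending at (9, 9), followed by the click and the final move.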
PHI/PII Scrubbing uses tools like AWS Comprehend, Presidio, and Private AI for data anonymization. Local models provide assessments of the more advanced capabilities of hosted models. Comprehensive and user-friendly visualizations are provided.
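To make the idea of scrubbing concrete, here is a deliberately simple stand-in that replaces email addresses and phone numbers in text with placeholders. It uses only regular expressions from the Python standard library and does not call AWS Comprehend, Presidio, or Private AI; the patterns and the `scrub_text` name are assumptions for illustration, and real scrubbers rely on much more robust entity recognition.

```python
import re

# Illustrative patterns only (assumed for this sketch); they will miss many
# real-world formats.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scrub_text(text: str) -> str:
    """Replace every match of each pattern with a <LABEL> placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


scrubbed = scrub_text("Contact jane.doe@example.com or 555-123-4567.")
# scrubbed == "Contact <EMAIL> or <PHONE>."
```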
Segment Anything, hosted on an EC2 server, offers easy infrastructure deployment and teardown via the openadapt CLI and SDK. It is used to enable Set-of-Mark prompting. Offline alternatives (e.g. LLaVA) are available for development and testing.
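Once a screenshot has been segmented, each segment needs a visible Mark before it can be referenced in a prompt. The sketch below assumes the segmentation step has already produced bounding boxes and simply numbers them in reading order, placing each label at the box center. `assign_marks` and the box format are hypothetical; no Segment Anything API is called.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def assign_marks(boxes: List[Box]) -> List[Dict[str, object]]:
    """Number segments in reading order and place each label at the box center."""
    ordered = sorted(boxes, key=lambda box: (box[1], box[0]))  # top, then left
    return [
        {
            "mark": index,
            "box": box,
            "label_xy": ((box[0] + box[2]) // 2, (box[1] + box[3]) // 2),
        }
        for index, box in enumerate(ordered, start=1)
    ]


marks = assign_marks([(200, 10, 260, 40), (10, 10, 90, 40), (10, 100, 90, 130)])
```

A model can then be asked about "mark 2" instead of raw pixel coordinates, which is the core idea behind Set-of-Mark prompting.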
A repeatable and versioned Process Description is extracted by combining Set-of-Mark and Chain-of-Code prompting:
- Set-of-Mark Analysis: Marks key objects in screenshots.
- Chain of Code Prompting: Generates high-level Python code, with function calls as process steps (e.g., "scroll_in_options_tab_until_save_button()", "click_save_button()").
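A Process Description of this kind might look like the following sketch, in which each function call is one Process Step. The two step names are taken from the example above; the `process_step` decorator, the `save_options` function, and the empty step bodies are assumptions added so that the snippet runs on its own.

```python
import functools
from typing import Callable, List

executed_steps: List[str] = []


def process_step(func: Callable[[], None]) -> Callable[[], None]:
    """Record each Process Step's name when it is called."""
    @functools.wraps(func)
    def wrapper() -> None:
        executed_steps.append(func.__name__)
        func()
    return wrapper


@process_step
def scroll_in_options_tab_until_save_button() -> None:
    pass  # placeholder: resolved into concrete Next Actions during replay


@process_step
def click_save_button() -> None:
    pass  # placeholder: resolved into concrete Next Actions during replay


def save_options() -> None:
    """Hypothetical Process Description: each call is one Process Step."""
    scroll_in_options_tab_until_save_button()
    click_save_button()


save_options()
```

Because the description is ordinary Python, it can be stored in version control and compared across revisions like any other source file.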
A Process Step is complete when a path from Start Step to End Step in the Process Graph is traversed. Nodes represent Process Steps, edges represent Completion Criteria. Completion Criteria are determined through Set-of-Mark + Chain-of-Code prompting.
- Chain-of-Code Analysis Prompt:
  - "Given the marks on the screenshot, generate Python code to verify if the specific conditions for the Process Step completion are met."
  - Example: If the Process Step is 'click_save_button', the code might check for a confirmation message or a change in the button state.
- Completion Criteria Validation Prompt:
  - "Based on the Python code, determine if the current Process Step has been successfully completed and describe the outcome."
  - This step involves executing the generated code and interpreting its results to confirm if the Process Step criteria have been satisfied.
  - Alternatively, prompt a Large Multimodal Model with the current application state (e.g. latest screenshot, active window states, open files/sockets) to determine whether the Completion Criteria have been satisfied.
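For the 'click_save_button' example, the generated verification code could resemble the sketch below. It assumes each Mark carries the visible `text` of its object and an optional `state`; that schema and the function name are hypothetical, chosen only to show the two checks mentioned above (a confirmation message or a changed button state).

```python
from typing import Dict, List

Mark = Dict[str, str]  # hypothetical schema: {"text": ..., "state": ...}


def click_save_button_completed(marks: List[Mark]) -> bool:
    """True if a confirmation message is visible or the Save button is disabled."""
    for mark in marks:
        text = mark.get("text", "").lower()
        if "saved" in text:  # e.g. a "Settings saved" confirmation message
            return True
        if text == "save" and mark.get("state") == "disabled":
            return True  # the button state changed after the click
    return False


before = [{"text": "Save", "state": "enabled"}, {"text": "Options"}]
after = [{"text": "Save", "state": "enabled"}, {"text": "Settings saved"}]
```

With these inputs, `click_save_button_completed(before)` is `False` and `click_save_button_completed(after)` is `True`.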
The evaluation of correctness in recordings is a critical step in ensuring the accuracy and reliability of the process automation system. To achieve this, models are prompted to create a graph representation of the process. This can then be compared against the recorded events, and used to ground the models' behaviors.
- Graph Creation from Model Prompts: The model, after analyzing the recorded user actions (such as mouse movements, keyboard inputs, and screenshots), generates a graph representation of the process. This graph is a structured format that delineates the sequence and nature of actions in the form of nodes and edges. Each node represents a significant action or event, while the edges represent the flow or transition between these actions.
- Structure of the Process Graph: The graph is structured to mirror the logical flow of the process as it was recorded. Nodes in the graph correspond to key actions or decision points in the process, and edges illustrate the progression or dependencies between these actions. For instance, a node might represent an action like "clicking a button," and the subsequent node might represent the result of that action, such as "confirmation message displayed."
- Comparison with Recorded Events: The key to evaluating correctness lies in comparing this model-generated graph with the actual events captured in the recording. This comparison involves aligning the sequence of actions in the recording with the flow represented in the graph. It checks whether the actions in the recording follow the expected sequence and meet the expected outcomes as represented by the graph.
- Identifying Discrepancies: Any discrepancies between the graph and the recording are flagged for further analysis. These discrepancies might indicate areas where the process did not execute as expected, or where the model's understanding of the process might be incomplete or inaccurate.
- Feedback Loop for Model Improvement: These identified discrepancies provide valuable feedback to the model, enabling it to learn and improve its accuracy in creating process graphs. Over time, this leads to a more refined and precise representation of processes, enhancing the overall effectiveness of the system.
- Practical Application: In practical scenarios, this method helps in automating complex tasks by ensuring that the automated process faithfully replicates the intended actions. It's particularly useful in scenarios where precision and accuracy are crucial, such as in data-sensitive environments or where specific sequences of actions are required.
This approach ensures the system not only automates processes but also continuously learns and adapts.
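The comparison between a Process Graph and a Recording can be illustrated with a small check that flags every recorded transition that is not an edge in the graph. This is a sketch under simplifying assumptions: recorded events are assumed to have already been mapped to node names, the node names are made up for the example, and `find_discrepancies` is a hypothetical helper.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def find_discrepancies(edges: Set[Edge], recorded: List[str]) -> List[Edge]:
    """Return consecutive recorded pairs that are not edges in the graph."""
    return [
        (source, target)
        for source, target in zip(recorded, recorded[1:])
        if (source, target) not in edges
    ]


graph_edges = {
    ("Options Tab Opened", "Save Button Clicked"),
    ("Save Button Clicked", "Confirmation Message Displayed"),
}
recorded_nodes = ["Options Tab Opened", "Confirmation Message Displayed"]
discrepancies = find_discrepancies(graph_edges, recorded_nodes)
```

In this example the recording skips the click, so the single transition it contains is flagged as a discrepancy for further analysis.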
- Initial Recording Analysis Prompt:
  - "Analyze the user's recorded actions, including mouse movements, keyboard inputs, and screenshots. Identify and list distinct actions and their sequence."
  - This step involves parsing the raw data from the user's interaction to form a coherent sequence of actions.
  - Example Output: ["Open browser", "Navigate to website", "Scroll down page", "Click download button", "Confirm download"]
- Process Event Categorization Prompt:
  - "Based on the listed actions, categorize each action into a defined process event type (e.g., navigation, input, selection, confirmation)."
  - This categorization helps in understanding the nature of each action within the process.
  - Example Output: { "Open browser": "Initialization", "Navigate to website": "Navigation", "Scroll down page": "Navigation", "Click download button": "Selection", "Confirm download": "Confirmation" }
- Graph Node Identification Prompt:
  - "Identify potential nodes for the Process Graph based on categorized process events. Suggest a node for each distinct action."
  - Nodes in the process graph represent significant steps or milestones in the process.
  - Example Output: Nodes = ["Browser Opened", "Website Loaded", "Reached Download Section", "Download Initiated", "Download Confirmed"]
- Graph Edge Construction Prompt:
  - "Construct edges between nodes based on the sequence of actions. Define the criteria for transition from one node to another."
  - Edges represent the flow or transition between different stages of the process.
  - Example Output: Edges = ["Browser Opened -> Website Loaded", "Website Loaded -> Reached Download Section", "Reached Download Section -> Download Initiated", "Download Initiated -> Download Confirmed"]
- Process Graph Synthesis Prompt:
  - "Synthesize a Process Graph using the identified nodes and constructed edges. Represent the process flow visually or in a structured format."
  - This final step involves creating a visual or structured representation of the entire process based on the nodes and edges.
  - Example Output: A visual graph or a JSON/XML representation detailing the nodes and edges, depicting the flow of the process.
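As one possible structured format, the example Nodes and Edges above can be combined into a JSON document. The sketch below parses the "A -> B" edge strings and rejects edges that reference unknown nodes; the `synthesize_process_graph` name and the JSON layout (`nodes`, `edges`, `source`, `target`) are assumptions, not a format defined by OpenAdapt.

```python
import json
from typing import Dict, List


def synthesize_process_graph(nodes: List[str], edges: List[str]) -> Dict[str, list]:
    """Combine node names and "A -> B" edge strings into one structure."""
    graph: Dict[str, list] = {"nodes": list(nodes), "edges": []}
    for edge in edges:
        source, target = (part.strip() for part in edge.split("->"))
        if source not in nodes or target not in nodes:
            raise ValueError(f"edge references unknown node: {edge}")
        graph["edges"].append({"source": source, "target": target})
    return graph


nodes = ["Browser Opened", "Website Loaded", "Reached Download Section",
         "Download Initiated", "Download Confirmed"]
edges = ["Browser Opened -> Website Loaded",
         "Website Loaded -> Reached Download Section",
         "Reached Download Section -> Download Initiated",
         "Download Initiated -> Download Confirmed"]

graph = synthesize_process_graph(nodes, edges)
graph_json = json.dumps(graph, indent=2)  # structured representation of the process
```

Validating edges against the node list catches one common failure mode of model-generated output: an edge that names a node the model never declared.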
Please submit comments/questions at https://github.com/OpenAdaptAI/OpenAdapt/discussions/552 🙏