It is possible to use the Windows API to determine which windows are active, minimized, in the background, etc.
The Windows API provides several functions to retrieve information about windows, such as:
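* `GetActiveWindow`: Retrieves a handle to the active window attached to the calling thread's message queue. To find the window the user is currently working with, regardless of which process owns it, use `GetForegroundWindow` instead.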
* `GetWindowPlacement`: Retrieves the show state and position of a window.
* `IsWindowVisible`: Determines whether a window is visible.
* `IsWindowEnabled`: Determines whether a window is enabled.
You can use these functions to determine the state of a window, such as whether it is active, minimized, or in the background.
Additionally, you can use the `ShowWindow` function to change the show state of a window, such as minimizing or restoring it.
It's worth noting that the Windows API also provides a set of constants that define the different show states of a window, such as `SW_SHOWNORMAL`, `SW_SHOWMINIMIZED`, `SW_SHOWMAXIMIZED`, etc.
You can use these constants and functions to write code that interacts with windows and determines their state.
For example, you can use the `GetWindowPlacement` function to retrieve the show state of a window and then use a switch statement to determine what to do based on the show state.
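It's also worth noting that the Windows API sends messages to a window when its state changes, such as `WM_ACTIVATE` when the window is activated or deactivated and `WM_ACTIVATEAPP` when activation moves to a window that belongs to a different application. These messages are delivered only to windows that your own process owns.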
You can use these messages to respond to changes in the state of windows and update your application accordingly.
In summary, the Windows API provides a set of functions and constants that allow you to determine the state of windows and interact with them, and you can use these to write code that determines which windows are active, minimized, in the background, etc.
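The state checks described above could be combined roughly as follows. This is a minimal sketch, not a definitive implementation: `WindowFacts`, `DescribeWindow`, and `QueryWindowFacts` are hypothetical names introduced here, and the sketch assumes that "active" means the system-wide foreground window. The decision logic is kept free of API calls so it can be tested without a live desktop, and the Win32 part only compiles on Windows.

```cpp
#include <string>
#ifdef _WIN32
#include <Windows.h>
#endif

// Hypothetical snapshot of the facts needed to classify one window.
struct WindowFacts {
    bool visible;     // from IsWindowVisible
    bool minimized;   // from IsIconic
    bool maximized;   // from IsZoomed
    bool foreground;  // handle == GetForegroundWindow()
};

// Pure decision logic: no API calls, so it can be unit-tested anywhere.
inline std::string DescribeWindow(const WindowFacts& f) {
    if (!f.visible)   return "hidden";
    if (f.minimized)  return "minimized";
    if (f.foreground) return f.maximized ? "active (maximized)" : "active";
    return "background";
}

#ifdef _WIN32
// Fills the snapshot from a live window handle.
inline WindowFacts QueryWindowFacts(HWND hwnd) {
    return WindowFacts{
        IsWindowVisible(hwnd) != FALSE,
        IsIconic(hwnd) != FALSE,
        IsZoomed(hwnd) != FALSE,
        GetForegroundWindow() == hwnd
    };
}
#endif
```

On Windows, `DescribeWindow(QueryWindowFacts(hwnd))` would yield one label per window. A visible window that is neither minimized nor in the foreground is reported as "background" here, which is only one possible reading of that term.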
It is possible to manipulate windows using the Windows API. The Windows API provides a set of functions that allow you to interact with windows, such as:
* `ShowWindow`: Sets the show state of a window (shown, hidden, minimized, maximized, or restored).
* `SetWindowPos`: Changes the position and size of a window.
* `SetWindowText`: Changes the title of a window.
* `EnableWindow`: Enables or disables a window.
* `BringWindowToTop`: Brings a window to the top of the Z-order.
* `SetFocus`: Sets the keyboard focus to a window; the window must be attached to the calling thread's message queue.
You can use these functions to manipulate windows, such as minimizing or maximizing them, changing their position or size, or bringing them to the front or back of the Z-order.
Additionally, the Windows API sends messages to a window when its state changes, such as `WM_ACTIVATE` when the window is activated or deactivated and `WM_ACTIVATEAPP` when activation moves to a window that belongs to a different application. You can use these messages to respond to changes in the state of your own windows and update your application accordingly.
It's worth noting that the Windows API also provides a set of constants that define the different show states of a window, such as `SW_SHOWNORMAL`, `SW_SHOWMINIMIZED`, `SW_SHOWMAXIMIZED`, etc. You can use these constants and functions to write code that interacts with windows and manipulates their state.
For example, you can use the `ShowWindow` function to minimize a window, or use the `SetWindowPos` function to move a window to a specific position on the screen.
It's also worth noting that the Windows API provides a set of functions that allow you to enumerate windows, such as `EnumWindows`, which allows you to enumerate all top-level windows on the screen. You can use these functions to find and manipulate specific windows.
In summary, the Windows API provides a set of functions and constants that allow you to manipulate windows, and you can use these to write code that interacts with windows and changes their state [1][2][3][4][5].
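As an illustration of `ShowWindow` and `SetWindowPos` working together, the sketch below restores a window if it is minimized and then centers it on the primary monitor's work area. `Box`, `Point2`, `CenteredOrigin`, and `CenterOnPrimaryWorkArea` are hypothetical names chosen for this example, and the choice of the primary work area is an assumption. The arithmetic is separated from the API calls so that it can be checked on its own.

```cpp
#ifdef _WIN32
#include <Windows.h>
#endif

// Hypothetical plain types so the arithmetic does not depend on RECT/POINT.
struct Box { long left, top, right, bottom; };
struct Point2 { long x, y; };

// Top-left corner that centers a width x height window inside `area`.
inline Point2 CenteredOrigin(const Box& area, long width, long height) {
    return { area.left + ((area.right - area.left) - width) / 2,
             area.top + ((area.bottom - area.top) - height) / 2 };
}

#ifdef _WIN32
// Restores the window if minimized, then moves it without resizing,
// reordering, or activating it. Returns false if any API call fails.
inline bool CenterOnPrimaryWorkArea(HWND hwnd) {
    if (IsIconic(hwnd)) ShowWindow(hwnd, SW_RESTORE);
    RECT work{};
    RECT win{};
    if (!SystemParametersInfoW(SPI_GETWORKAREA, 0, &work, 0)) return false;
    if (!GetWindowRect(hwnd, &win)) return false;
    const Point2 p = CenteredOrigin(
        Box{work.left, work.top, work.right, work.bottom},
        win.right - win.left, win.bottom - win.top);
    return SetWindowPos(hwnd, nullptr,
                        static_cast<int>(p.x), static_cast<int>(p.y), 0, 0,
                        SWP_NOSIZE | SWP_NOZORDER | SWP_NOACTIVATE) != FALSE;
}
#endif
```

The `SWP_NOSIZE`, `SWP_NOZORDER`, and `SWP_NOACTIVATE` flags keep the call limited to a move, so the size, Z-order, and activation of the target window stay as they were.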
Traditionally, AI agents perform screen parsing with a neural generative model that identifies each control in a window.
Is Microsoft OmniParser an example?
Yes, Microsoft OmniParser is an example of a screen parsing technique that uses neural models to identify each control in a window. According to [1][2][3], OmniParser is a general screen parsing tool that interprets and converts UI screenshots into a structured format, improving existing large language model (LLM) based UI agents. It uses a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. OmniParser has been shown to significantly improve the performance of LLMs such as GPT-4V on various benchmarks, including ScreenSpot, Mind2Web, and AITW.
My goal is to build an AI agent app developed in the Delphi IDE. Traditionally, AI agents use a neural generative model for screen parsing, that is, the model identifies each control in a window. I need a way to do the same thing using only the Windows API, because that approach is much lighter and consumes fewer resources.
Here is a proposed structure of modules for building an AI agent that interacts with windows and their controls using the Windows API:
A C++ console application project in Visual Studio 2022 that implements the basic logic of the Analysis Modules (suffix `analyzer`) for analyzing data extracted from various forms of interfaces, including Win32, WPF, WinUI, and .NET MAUI. The project should use a modern C++ programming style, follow good practices and conventions, and prioritize execution speed.
The project should consist of four main modules: `Source Modules` (suffix `Source`), `Data Modules` (suffix `Data`), `Processing Modules` (suffix `processor`), and `Analysis Modules` (suffix `analyzer`). The `Source Modules` provide the foundation for extracting data from various interfaces, the `Data Modules` should store and manage the extracted data, the `Processing Modules` should process and analyze the data, and the `Analysis Modules` should analyze the processed data.
**Source Modules (suffix `Source`):**
1. The `WindowSource` class that provides a foundation for extracting data from windows. This class should have the following methods:
* `GetWindowHandle()`: Returns the handle of the window.
* `GetWindowTitle()`: Returns the title of the window.
* `GetWindowPosition()`: Returns the position of the window (x, y, width, height).
* `GetWindowSize()`: Returns the size of the window (width, height).
* `GetWindowState()`: Returns the state of the window (e.g. minimized, maximized, normal).
2. The `ControlSource` class that provides a foundation for extracting data from controls. This class should have the following methods:
* `GetControlHandle()`: Returns the handle of the control.
* `GetControlType()`: Returns the type of control (e.g. button, text box, checkbox).
* `GetControlText()`: Returns the text associated with the control.
* `GetControlPosition()`: Returns the position of the control (x, y, width, height).
* `GetControlSize()`: Returns the size of the control (width, height).
* `GetControlState()`: Returns the state of the control (e.g. enabled, disabled, checked).
3. The `UIElementSource` class that provides a foundation for extracting data from UI elements. This class should have the following methods:
* `GetUIElementType()`: Returns the type of UI element (e.g. menu, toolbar, status bar).
* `GetUIElementPosition()`: Returns the position of the UI element (x, y, width, height).
* `GetUIElementSize()`: Returns the size of the UI element (width, height).
* `GetUIElementState()`: Returns the state of the UI element (e.g. visible, hidden).
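To make the `ControlSource` idea more concrete, here is a rough sketch of how `GetControlType()` might be derived from a control's window class name, together with a helper that collects the child controls of a window. `ControlKind`, `ClassifyControlClass`, `KindOf`, and `ChildControls` are hypothetical names, and the class list covers only a few standard Win32 controls.

```cpp
#include <cstddef>
#include <cwctype>
#include <string_view>
#include <vector>
#ifdef _WIN32
#include <Windows.h>
#endif

enum class ControlKind { Button, TextBox, Label, ComboBox, ListBox, Unknown };

// Window class names are not case-sensitive, so compare without regard to case.
inline bool EqualsNoCase(std::wstring_view a, std::wstring_view b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::towlower(a[i]) != std::towlower(b[i])) return false;
    }
    return true;
}

// Maps a standard Win32 class name to a coarse control kind.
// Check boxes and radio buttons also use the "Button" class; telling them
// apart would require inspecting the window style as well.
inline ControlKind ClassifyControlClass(std::wstring_view cls) {
    if (EqualsNoCase(cls, L"Button"))   return ControlKind::Button;
    if (EqualsNoCase(cls, L"Edit"))     return ControlKind::TextBox;
    if (EqualsNoCase(cls, L"Static"))   return ControlKind::Label;
    if (EqualsNoCase(cls, L"ComboBox")) return ControlKind::ComboBox;
    if (EqualsNoCase(cls, L"ListBox"))  return ControlKind::ListBox;
    return ControlKind::Unknown;
}

#ifdef _WIN32
// Reads the class name of a live control and classifies it.
inline ControlKind KindOf(HWND control) {
    wchar_t cls[256] = {};
    const int len = GetClassNameW(control, cls, 256);  // 0 on failure
    return ClassifyControlClass(
        std::wstring_view(cls, static_cast<std::size_t>(len)));
}

// Collects the child windows (including nested descendants) of `parent`.
inline std::vector<HWND> ChildControls(HWND parent) {
    std::vector<HWND> out;
    EnumChildWindows(parent, [](HWND h, LPARAM lp) -> BOOL {
        reinterpret_cast<std::vector<HWND>*>(lp)->push_back(h);
        return TRUE;  // keep enumerating
    }, reinterpret_cast<LPARAM>(&out));
    return out;
}
#endif
```

This approach only sees controls that own an `HWND`. WPF and WinUI generally draw their controls without one, so for those frameworks the same information would likely have to come from UI Automation rather than from `EnumChildWindows`.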
**Data Modules (suffix `Data`):**
1. The `WindowData` class that stores and manages data extracted from windows. This class should have the following properties:
* `handle`: The handle of the window.
* `title`: The title of the window.
* `position`: The position of the window (x, y, width, height).
* `size`: The size of the window (width, height).
* `state`: The state of the window (e.g. minimized, maximized, normal).
2. The `ControlData` class that stores and manages data extracted from controls. This class should have the following properties:
* `handle`: The handle of the control.
* `type`: The type of control (e.g. button, text box, checkbox).
* `text`: The text associated with the control.
* `position`: The position of the control (x, y, width, height).
* `size`: The size of the control (width, height).
* `state`: The state of the control (e.g. enabled, disabled, checked).
3. The `UIElementData` class that stores and manages data extracted from UI elements. This class should have the following properties:
* `type`: The type of UI element (e.g. menu, toolbar, status bar).
* `position`: The position of the UI element (x, y, width, height).
* `size`: The size of the UI element (width, height).
* `state`: The state of the UI element (e.g. visible, hidden).
**Processing Modules (suffix `processor`):**
1. The `WindowProcessor` class that processes and analyzes data extracted from windows. This class should have the following methods:
* `ProcessWindowData()`: Processes and analyzes the data extracted from a window.
* `GetWindowData()`: Returns the processed and analyzed data.
2. The `ControlProcessor` class that processes and analyzes data extracted from controls. This class should have the following methods:
* `ProcessControlData()`: Processes and analyzes the data extracted from a control.
* `AnalyzeControlData()`: Analyzes the control data.
3. The `UIElementProcessor` class that processes and analyzes data extracted from UI elements. This class should have the following methods:
* `ProcessUIElementData()`: Processes and analyzes the data extracted from a UI element.
* `GetUIElementType()`: Returns the type of UI element.
* `GetUIElementPosition()`: Returns the position of the UI element.
* `GetUIElementSize()`: Returns the size of the UI element.
* `AnalyzeUIElementState()`: Analyzes the state of the UI element.
**Analysis Modules (suffix `analyzer`):**
1. The `WindowAnalyzer` class that analyzes the processed data extracted from windows. This class should have the following methods:
* `AnalyzeWindowData()`: Analyzes the processed data extracted from a window.
* `GetWindowAnalysis()`: Returns the analysis results.
2. The `ControlAnalyzer` class that analyzes the processed data extracted from controls. This class should have the following methods:
* `AnalyzeControlData()`: Analyzes the processed data extracted from a control.
* `GetControlAnalysis()`: Returns the analysis results.
3. The `UIElementAnalyzer` class that analyzes the processed data extracted from UI elements. This class should have the following methods:
* `AnalyzeUIElementData()`: Analyzes the processed data extracted from a UI element.
* `GetUIElementAnalysis()`: Returns the analysis results.
Note that the `Analysis Modules` depend on the `Processing Modules`, `Data Modules`, and `Source Modules`, so the code should be designed to be efficient and scalable.
This structure provides a basic outline of the modules that can be used to build an AI agent that interacts with windows and their controls using the Windows API. Each module can be developed and tested independently, and then integrated with other modules to create a complete system.
Keep in mind that the final goal is to build a screen parser using only the Windows API (Win32, WPF, WinUI, MAUI, and others).
See what has been done so far below.
AnalysisResult.h:
```
#pragma once
#include "ProcessorResult.h"
#include <vector>
#include <chrono>
namespace WindowParser {
struct WindowAnalysisResult {
bool isAccessible;
bool isInteractive;
std::vector<std::wstring> potentialIssues;
std::chrono::system_clock::time_point analysisTime;
ProcessedWindowInfo processedInfo;
};
struct ControlAnalysisResult {
bool isAccessible;
bool isInteractable;
std::vector<std::wstring> accessibilityIssues;
std::chrono::system_clock::time_point analysisTime;
ProcessedControlInfo processedInfo;
};
struct UIElementAnalysisResult {
bool isAccessible;
bool meetsAccessibilityStandards;
std::vector<std::wstring> recommendations;
std::chrono::system_clock::time_point analysisTime;
ProcessedUIElementInfo processedInfo;
};
}
```
CommonTypes.h:
```
#pragma once
#include <Windows.h>
#include <string>
namespace WindowParser {
struct Position {
int x;
int y;
int width;
int height;
};
struct Size {
int width;
int height;
};
enum class WindowState {
Normal,
Minimized,
Maximized
};
enum class ControlState {
Enabled,
Disabled,
Checked,
Unchecked
};
enum class UIElementState {
Visible,
Hidden
};
}
```
ControlAnalyzer.h:
```
#pragma once
#include "AnalysisResult.h"
#include <functional>
namespace WindowParser {
class ControlAnalyzer {
public:
using AnalysisCallback = std::function<void(const ControlAnalysisResult&)>;
ControlAnalyzer() = default;
~ControlAnalyzer() = default;
ControlAnalysisResult AnalyzeControlData(const ProcessingResult<ProcessedControlInfo>& processedData);
void SetCallback(AnalysisCallback callback) { m_callback = callback; }
private:
AnalysisCallback m_callback;
std::vector<std::wstring> CheckAccessibility(const ProcessedControlInfo& info);
};
}
```
ProcessorResult.h:
```
#pragma once
#include <string>
#include <vector>
#include <memory>
#include "CommonTypes.h"
namespace WindowParser {
enum class ProcessingStatus {
Success,
Failed,
NoData,
InvalidHandle
};
template<typename T>
struct ProcessingResult {
ProcessingStatus status;
std::shared_ptr<T> data;
std::wstring message;
};
struct ProcessedWindowInfo {
bool isResponding;
bool isVisible;
bool hasFocus;
std::wstring processName;
DWORD processId;
};
struct ProcessedControlInfo {
bool isEnabled;
bool isVisible;
bool canInteract;
std::wstring className;
};
struct ProcessedUIElementInfo {
bool isVisible;
bool isEnabled;
std::wstring automationId;
};
}
```
UIElementAnalyzer.h:
```
#pragma once
#include "AnalysisResult.h"
#include <functional>
namespace WindowParser {
class UIElementAnalyzer {
public:
using AnalysisCallback = std::function<void(const UIElementAnalysisResult&)>;
UIElementAnalyzer() = default;
~UIElementAnalyzer() = default;
UIElementAnalysisResult AnalyzeUIElementData(const ProcessingResult<ProcessedUIElementInfo>& processedData);
void SetCallback(AnalysisCallback callback) { m_callback = callback; }
private:
AnalysisCallback m_callback;
std::vector<std::wstring> ValidateAccessibilityStandards(const ProcessedUIElementInfo& info);
};
}
```
WindowAnalyzer.cpp:
```
#include "WindowAnalyzer.h"
#include "WindowProcessor.h"
namespace WindowParser {
WindowAnalysisResult WindowAnalyzer::AnalyzeWindowData(
const ProcessingResult<ProcessedWindowInfo>& processedData) {
WindowAnalysisResult result;
result.analysisTime = std::chrono::system_clock::now();
if (processedData.status != ProcessingStatus::Success || !processedData.data) {
result.isAccessible = false;
result.isInteractive = false;
result.potentialIssues.push_back(L"Failed to process window data");
return result;
}
result.processedInfo = *processedData.data;
result.isInteractive = CheckInteractivity(result.processedInfo);
result.potentialIssues = AnalyzeAccessibility(result.processedInfo);
result.isAccessible = result.potentialIssues.empty();
if (m_callback) {
m_callback(result);
}
return result;
}
WindowAnalysisResult WindowAnalyzer::GetWindowAnalysis(HWND handle) {
WindowProcessor processor;
auto processedData = processor.GetProcessedWindowData(handle);
return AnalyzeWindowData(processedData);
}
std::vector<std::wstring> WindowAnalyzer::AnalyzeAccessibility(const ProcessedWindowInfo& info) {
std::vector<std::wstring> issues;
if (!info.isVisible) {
issues.push_back(L"Window is not visible");
}
if (!info.isResponding) {
issues.push_back(L"Window is not responding");
}
return issues;
}
bool WindowAnalyzer::CheckInteractivity(const ProcessedWindowInfo& info) {
return info.isResponding && info.isVisible;
}
}
```
WindowAnalyzer.h:
```
#pragma once
#include "AnalysisResult.h"
#include <functional>
namespace WindowParser {
class WindowAnalyzer {
public:
using AnalysisCallback = std::function<void(const WindowAnalysisResult&)>;
WindowAnalyzer() = default;
~WindowAnalyzer() = default;
WindowAnalysisResult AnalyzeWindowData(const ProcessingResult<ProcessedWindowInfo>& processedData);
WindowAnalysisResult GetWindowAnalysis(HWND handle);
void SetCallback(AnalysisCallback callback) { m_callback = callback; }
private:
AnalysisCallback m_callback;
std::vector<std::wstring> AnalyzeAccessibility(const ProcessedWindowInfo& info);
bool CheckInteractivity(const ProcessedWindowInfo& info);
};
}
```
WindowData.h:
```
#pragma once
#include "CommonTypes.h"
#include <string>
namespace WindowParser {
class WindowData {
public:
WindowData(HWND handle, const std::wstring& title,
const Position& pos, const Size& size,
WindowState state)
: m_handle(handle)
, m_title(title)
, m_position(pos)
, m_size(size)
, m_state(state) {
}
HWND handle() const { return m_handle; }
const std::wstring& title() const { return m_title; }
const Position& position() const { return m_position; }
const Size& size() const { return m_size; }
WindowState state() const { return m_state; }
private:
HWND m_handle;
std::wstring m_title;
Position m_position;
Size m_size;
WindowState m_state;
};
}
```
WindowProcessor.cpp:
```
#include "WindowProcessor.h"
#include <psapi.h>
namespace WindowParser {
ProcessingResult<ProcessedWindowInfo> WindowProcessor::ProcessWindowData(const WindowData& windowData) {
ProcessingResult<ProcessedWindowInfo> result;
result.data = std::make_shared<ProcessedWindowInfo>();
if (!IsWindow(windowData.handle())) {
result.status = ProcessingStatus::InvalidHandle;
result.message = L"Invalid window handle";
return result;
}
// Get process information
std::wstring processName;
DWORD processId = 0;
if (!GetWindowProcessInfo(windowData.handle(), processName, processId)) {
result.status = ProcessingStatus::Failed;
result.message = L"Failed to get process information";
return result;
}
// Fill processed info
// IsHungAppWindow reports whether the window has stopped processing messages;
// IsWindowEnabled only reports whether mouse and keyboard input is enabled.
result.data->isResponding = !IsHungAppWindow(windowData.handle());
result.data->isVisible = IsWindowVisible(windowData.handle());
result.data->hasFocus = (GetForegroundWindow() == windowData.handle());
result.data->processName = processName;
result.data->processId = processId;
result.status = ProcessingStatus::Success;
if (m_callback) {
m_callback(result);
}
return result;
}
ProcessingResult<ProcessedWindowInfo> WindowProcessor::GetProcessedWindowData(HWND handle) {
auto windowSource = WindowSource::FromHandle(handle);
if (!windowSource) {
ProcessingResult<ProcessedWindowInfo> result;
result.status = ProcessingStatus::InvalidHandle;
result.message = L"Could not create WindowSource from handle";
return result;
}
WindowData data(
handle,
windowSource->GetWindowTitle(),
windowSource->GetWindowPosition(),
windowSource->GetWindowSize(),
windowSource->GetWindowState()
);
return ProcessWindowData(data);
}
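bool WindowProcessor::GetWindowProcessInfo(HWND handle, std::wstring& processName, DWORD& processId) {
processId = 0;
processName.clear();
GetWindowThreadProcessId(handle, &processId);
if (processId == 0) return false;
// Limited query rights are enough to read the image path and are granted for
// many processes that refuse PROCESS_QUERY_INFORMATION | PROCESS_VM_READ.
HANDLE hProcess = OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, FALSE, processId);
if (!hProcess) return false;
wchar_t buffer[MAX_PATH];
DWORD size = MAX_PATH;
// On success, size receives the number of characters written (full image path).
if (QueryFullProcessImageNameW(hProcess, 0, buffer, &size)) {
processName.assign(buffer, size);
}
CloseHandle(hProcess);
return true;
}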
}
```
WindowProcessor.h:
```
#pragma once
// Declarations inferred from WindowProcessor.cpp. The callback alias and
// SetCallback mirror the analyzer classes and are assumptions.
#include "ProcessorResult.h"
#include "WindowData.h"
#include "WindowSource.h"
#include <functional>
namespace WindowParser {
class WindowProcessor {
public:
using ProcessingCallback = std::function<void(const ProcessingResult<ProcessedWindowInfo>&)>;
WindowProcessor() = default;
~WindowProcessor() = default;
ProcessingResult<ProcessedWindowInfo> ProcessWindowData(const WindowData& windowData);
ProcessingResult<ProcessedWindowInfo> GetProcessedWindowData(HWND handle);
void SetCallback(ProcessingCallback callback) { m_callback = callback; }
private:
ProcessingCallback m_callback;
bool GetWindowProcessInfo(HWND handle, std::wstring& processName, DWORD& processId);
};
}
```
WindowSource.cpp:
```
#include "WindowSource.h"
namespace WindowParser {
HWND WindowSource::GetWindowHandle() const {
return m_handle;
}
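std::wstring WindowSource::GetWindowTitle() const {
int length = GetWindowTextLengthW(m_handle);
if (length <= 0) return std::wstring();
std::wstring title(static_cast<size_t>(length) + 1, L'\0');
// GetWindowTextW returns the number of characters actually copied, which can
// be smaller than the length reported above.
int copied = GetWindowTextW(m_handle, &title[0], length + 1);
title.resize(copied > 0 ? static_cast<size_t>(copied) : 0);
return title;
}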
Position WindowSource::GetWindowPosition() const {
// Zero-initialize so a failed GetWindowRect yields an empty rectangle
// instead of indeterminate values.
RECT rect{};
GetWindowRect(m_handle, &rect);
return { rect.left, rect.top, rect.right - rect.left, rect.bottom - rect.top };
}
Size WindowSource::GetWindowSize() const {
RECT rect{};
GetWindowRect(m_handle, &rect);
return { rect.right - rect.left, rect.bottom - rect.top };
}
WindowState WindowSource::GetWindowState() const {
WINDOWPLACEMENT placement = { sizeof(WINDOWPLACEMENT) };
GetWindowPlacement(m_handle, &placement);
switch (placement.showCmd) {
case SW_SHOWMAXIMIZED:
return WindowState::Maximized;
case SW_SHOWMINIMIZED:
return WindowState::Minimized;
default:
return WindowState::Normal;
}
}
std::unique_ptr<WindowSource> WindowSource::FromHandle(HWND handle) {
if (!IsWindow(handle)) return nullptr;
return std::unique_ptr<WindowSource>(new WindowSource(handle));
}
}
```
WindowSource.h:
```
#pragma once
#include "CommonTypes.h"
#include <memory>
namespace WindowParser {
class WindowSource {
public:
HWND GetWindowHandle() const;
std::wstring GetWindowTitle() const;
Position GetWindowPosition() const;
Size GetWindowSize() const;
WindowState GetWindowState() const;
static std::unique_ptr<WindowSource> FromHandle(HWND handle);
private:
HWND m_handle;
WindowSource(HWND handle) : m_handle(handle) {}
};
}
```
Would it be possible now, based on the C++ files: WindowSource.h and WindowSource.cpp, to create a .cpp project (C++ Visual Studio Console App) to do a benchmark? Measure the execution time?
Please create a main() function in the main .cpp file of this new project (C++ Visual Studio Console App) that measures, in milliseconds, how long it takes to scan all windows. The benchmark should interfere as little as possible with the code of the tested modules, that is, the timed region should contain nothing except listing all open windows. Include a mechanism in main() to repeat the test n times; by default it should run only once (n=1).
My project is configured to use the `ISO C++17 Standard (/std:c++17)` language standard.
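One possible shape for the timing part of such a benchmark is sketched below, under a few assumptions: `ParseRepeatCount`, `TimeRuns`, and `ScanAllWindows` are hypothetical names, the repeat count is taken from the first command-line argument, and `ScanAllWindows` is only a stand-in that counts top-level windows with `EnumWindows`. In the real project its callback would call `WindowSource::FromHandle` and the getters being measured. The sketch compiles as C++17.

```cpp
#include <charconv>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <system_error>
#include <vector>
#ifdef _WIN32
#include <Windows.h>
#endif

// Reads the repeat count from argv[1]; a missing or invalid value means n = 1.
inline std::size_t ParseRepeatCount(int argc, char** argv) {
    if (argc < 2) return 1;
    const char* first = argv[1];
    const char* last = first + std::strlen(first);
    std::size_t n = 0;
    const auto [ptr, ec] = std::from_chars(first, last, n);
    return (ec == std::errc{} && ptr == last && n > 0) ? n : 1;
}

// Runs `scan` n times and returns the duration of each run in milliseconds.
// Only the call to scan() sits between the two clock reads.
template <typename Scan>
std::vector<double> TimeRuns(Scan&& scan, std::size_t n = 1) {
    using Clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    std::vector<double> samples;
    samples.reserve(n);  // allocate up front, outside the timed region
    for (std::size_t i = 0; i < n; ++i) {
        const auto start = Clock::now();
        scan();
        const auto stop = Clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return samples;
}

#ifdef _WIN32
// Stand-in for the scan under test: visits every top-level window once.
inline std::size_t ScanAllWindows() {
    std::size_t count = 0;
    EnumWindows([](HWND, LPARAM lp) -> BOOL {
        ++*reinterpret_cast<std::size_t*>(lp);
        return TRUE;  // continue with the next window
    }, reinterpret_cast<LPARAM>(&count));
    return count;
}
#endif
```

A `main` function could then call `TimeRuns(ScanAllWindows, ParseRepeatCount(argc, argv))` and print the samples only after the loop has finished, so that console output never falls inside a timed region. Measuring a Release build is likely to give more representative numbers than a Debug build.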