Google has launched a feature that could reshape how AI interacts with the everyday software tools millions of businesses rely on. Gemini 3.5 Flash, the latest iteration of its multimodal AI model, now includes native computer use capabilities, the company announced on June 24, 2026. Developers and enterprise customers can build agents that not only understand and generate text, images, and code but also observe, reason, and act directly on user interfaces: clicking buttons, filling forms, and navigating applications across browsers, mobile devices, and, in a later release, the Windows desktop. The move places Google in direct competition with Microsoft’s vision for AI‑first computing, while adding a layer of security checks aimed at the most risk‑averse enterprise IT departments.
What “Computer Use” Actually Means
The term “computer use” has circulated in AI labs for years, but Google’s implementation in Gemini 3.5 Flash brings it into a production‑ready, commercially available product. Rather than relying exclusively on structured APIs or pre‑defined UI element selectors, computer‑use agents ingest raw screenshots of whatever is on screen—be it a browser window, a legacy Windows application, or a mobile app—and decide where to click, type, or scroll based on their understanding of the visual interface. This is a leap beyond robotic process automation (RPA) bots that follow rigid scripts; Gemini‑powered agents parse the underlying intent of a user command and adapt when an interface changes or an unexpected dialog appears.
Google’s demonstration showed an agent booking a complex corporate travel itinerary across multiple websites, comparing prices, filling forms, and even handling pop‑up windows—all while a developer watched the agent’s reasoning trail in real time. The model was able to explain each action it took, from “I see a cookie consent banner; closing it” to “The cheapest flight is on a different site; navigating there now.” For Windows developers, this opens the door to automating tasks in applications that were never designed to be automated, from legacy ERP systems to custom line‑of‑business tools.
Under the Hood: How Gemini Sees and Acts
Gemini 3.5 Flash anchors its computer‑use abilities in a vision‑language model that processes sequences of screenshots alongside natural language instructions. Every 500 milliseconds, it captures the current screen state, maps UI elements to a semantic understanding of their function, and outputs a set of possible actions—click coordinates, text input strings, scroll gestures—that are then executed through a controlled runtime environment. The model runs inference continuously, allowing it to correct course if an action doesn’t yield the expected result, much like a human user.
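The capture, decide, act cycle described above can be illustrated with a toy loop. This is a minimal sketch, not Google’s runtime: `FakeScreen`, `stub_model`, and `run_agent` are hypothetical names, the "screen" is a small in-memory object rather than real screenshots, and the stub policy stands in for the vision‑language model. A real loop would also wait between captures.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Action:
    kind: str          # "click" or "type" in this toy example
    x: int = 0
    y: int = 0
    text: str = ""


class FakeScreen:
    """Toy stand-in for a display: a cookie banner covers a search box."""

    def __init__(self) -> None:
        self.banner_open = True
        self.query = ""

    def capture(self) -> dict:
        # A real agent would capture pixels; here the state is a plain dict.
        return {"banner_open": self.banner_open, "query": self.query}

    def apply(self, action: Action) -> None:
        if action.kind == "click" and self.banner_open:
            self.banner_open = False          # clicking dismisses the banner
        elif action.kind == "type" and not self.banner_open:
            self.query = action.text          # typing works once the banner is gone


def stub_model(state: dict, goal: str) -> Optional[Action]:
    """Hypothetical policy; a real agent would query a model here."""
    if state["banner_open"]:
        return Action("click", x=900, y=40)
    if state["query"] != goal:
        return Action("type", text=goal)
    return None                               # nothing left to do


def run_agent(screen: FakeScreen, goal: str,
              model: Callable[[dict, str], Optional[Action]] = stub_model,
              max_steps: int = 10) -> list[Action]:
    history: list[Action] = []
    for _ in range(max_steps):                # bounded so a confused agent cannot loop forever
        state = screen.capture()              # observe
        action = model(state, goal)           # reason
        if action is None:
            break
        screen.apply(action)                  # act, then re-observe on the next pass
        history.append(action)
    return history
```

Because the loop re-captures the screen after every action, an action that did not have the expected effect simply shows up as an unchanged state on the next pass, which is what lets the policy correct course.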
Crucially, Google has bundled this with a safety framework that goes far beyond simple content filters. The “security ready checks” flagged in the announcement refer to a series of guardrails that enterprise customers can configure. For example, an agent can be confined to a specific set of domains or application windows, blocked from accessing file system paths containing sensitive data, and forced to request human approval before executing any action that involves financial transactions or personal information. All actions are logged in an immutable audit trail that integrates with SIEM tools like Splunk or Google’s own Chronicle, satisfying compliance requirements under frameworks such as SOC 2, HIPAA, and PCI‑DSS.
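To make the guardrail categories concrete, here is a minimal sketch of how such a policy could be evaluated before an action runs. Google has not published the configuration format in this announcement, so `Policy`, `evaluate`, the tag names, and the example domains and paths are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from pathlib import PureWindowsPath
from typing import Iterable, Optional
from urllib.parse import urlparse


@dataclass
class Policy:
    allowed_domains: set[str]
    blocked_paths: tuple[str, ...] = ()
    # Hypothetical tags marking actions that need a human sign-off.
    approval_tags: set[str] = field(
        default_factory=lambda: {"financial", "personal_data"})


def evaluate(policy: Policy, *, url: Optional[str] = None,
             path: Optional[str] = None,
             tags: Iterable[str] = ()) -> str:
    """Return "deny", "needs_approval", or "allow" for one proposed action."""
    if url is not None:
        host = urlparse(url).hostname or ""
        # Accept the domain itself or any subdomain of it, nothing else.
        if not any(host == d or host.endswith("." + d)
                   for d in policy.allowed_domains):
            return "deny"
    if path is not None:
        target = PureWindowsPath(path)
        if any(target.is_relative_to(b) for b in policy.blocked_paths):
            return "deny"
    if policy.approval_tags & set(tags):
        return "needs_approval"
    return "allow"
```

Denials are checked before approvals so that a human is never asked to approve an action the policy forbids outright.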
Google also introduced a sandboxed execution layer that runs the agent’s input emulation in a separate, low‑privilege process. Even if an agent hallucinates a dangerous command, such as attempting to delete files, operating‑system‑level security boundaries prevent it from succeeding unless the action has been explicitly allowlisted. This architecture, combined with the model’s built‑in refusal to execute commands that violate a customer’s acceptable use policy, targets one of the largest concerns IT leaders have voiced about autonomous AI agents.
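The default‑deny idea behind that design can be shown in a few lines. The sketch below illustrates only the decision logic (anything not explicitly permitted is refused); it does not provide process isolation, and `AllowlistExecutor` and the action names are hypothetical.

```python
from typing import Callable


class AllowlistExecutor:
    """Runs an action only if its kind has been explicitly permitted."""

    def __init__(self, handlers: dict[str, Callable[..., str]],
                 allowlist: set[str]) -> None:
        self._handlers = handlers
        self._allowlist = allowlist

    def execute(self, kind: str, **kwargs) -> str:
        # Default deny: a handler existing is not enough, it must be allowlisted.
        if kind not in self._allowlist:
            raise PermissionError(f"action {kind!r} is not allowlisted")
        return self._handlers[kind](**kwargs)


handlers = {
    "click": lambda x, y: f"clicked ({x}, {y})",
    "delete_file": lambda path: f"deleted {path}",   # present but never permitted below
}
executor = AllowlistExecutor(handlers, allowlist={"click"})
```

In a real deployment the refusal would be enforced by the operating system rather than by the same process that hosts the agent, which is the point of running input emulation at low privilege.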
Enterprise Security Ready Checks: Going Beyond Prompts
Security in generative AI has largely revolved around prompt injection and data leakage. Computer use introduces a new threat vector: an agent that can interact with any desktop application could inadvertently expose sensitive information through cached screenshots, logged keystrokes, or a misread UI element that leads it to paste data into a public field. Google’s response is a multi‑layered defense that starts before the agent ever launches.
Administrators can define policies that dictate which screen regions an agent may observe. For instance, a payroll automation agent might see only the browser window containing the payroll portal, while the system tray and other running applications remain blacked out. The agent’s visual processing is further constrained by a pixel‑level redaction overlay that blanks out fields tagged as sensitive in a pre‑scan. Google claims this redaction runs on‑device, not in the cloud, using a small, dedicated model that runs on the local GPU or NPU—a feature that will resonate with Windows shops where data sovereignty is paramount.
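A rough sketch of region masking follows. It assumes the frame is a grid of pixel values and that the visible region and sensitive fields are already known as rectangles; detecting the sensitive fields (the job of the on‑device model mentioned above) is out of scope, and `redact` is a hypothetical helper, not part of any Google SDK.

```python
# A rectangle is (left, top, right, bottom); right and bottom are exclusive.
Rect = tuple[int, int, int, int]


def inside(x: int, y: int, rect: Rect) -> bool:
    left, top, right, bottom = rect
    return left <= x < right and top <= y < bottom


def redact(frame: list[list[int]], visible: Rect,
           sensitive: list[Rect], fill: int = 0) -> list[list[int]]:
    """Return a copy of frame with everything the agent may not see blanked."""
    result = []
    for y, row in enumerate(frame):
        new_row = []
        for x, pixel in enumerate(row):
            hidden = (not inside(x, y, visible)
                      or any(inside(x, y, s) for s in sensitive))
            new_row.append(fill if hidden else pixel)
        result.append(new_row)
    return result
```

The function returns a new frame instead of editing in place, so the unredacted pixels never need to be handed to the component that talks to the cloud model.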
Ready checks also extend to runtime behavior. Before any action sequence begins, the system verifies that the target application is in a known, safe state—no pending updates, no active remote desktop connection, no abnormal memory usage. If a deviation is detected, the agent is blocked and the attempt is flagged for a SOC review. Google has published a set of pre‑built compliance packs that map these checks to common regulatory frameworks, dramatically shortening the evaluation cycle for companies that were previously on the fence about agentic AI.
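A pre‑flight check of this kind might look like the following sketch. The three conditions come from the description above; the `AppSnapshot` fields, the memory threshold, and the failure labels are assumptions, since the article does not say how Google measures "abnormal" memory usage.

```python
from dataclasses import dataclass


@dataclass
class AppSnapshot:
    pending_update: bool
    remote_desktop_active: bool
    memory_mb: float
    baseline_memory_mb: float


def ready_check(snap: AppSnapshot, max_memory_ratio: float = 1.5) -> list[str]:
    """Return the list of failed checks; an empty list means the agent may start."""
    failures = []
    if snap.pending_update:
        failures.append("pending_update")
    if snap.remote_desktop_active:
        failures.append("remote_desktop_active")
    # Assumed heuristic: memory well above the known baseline counts as abnormal.
    if snap.memory_mb > snap.baseline_memory_mb * max_memory_ratio:
        failures.append("abnormal_memory")
    return failures
```

Returning every failed check, rather than stopping at the first, gives the reviewing analyst the full picture in a single flag.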
Windows Developers Get a New Automation Toolkit
For the Windows ecosystem, where a huge fraction of enterprise software still relies on Win32, .NET, or even ActiveX controls, computer‑use AI could be transformative. Traditional automation tools like Power Automate or UiPath require building flows that explicitly identify each UI element—a brittle process that breaks when an application is updated or a window resizes. Gemini’s visual grounding sidesteps that fragility, making it possible to automate tasks in applications that have never exposed a public API.
Microsoft has its own vision for AI‑driven desktop automation through Copilot and the Windows Copilot Runtime, but those efforts have been largely confined to modern, cloud‑connected applications. Google’s model, by contrast, can interoperate with anything that renders on screen, from a 20‑year‑old order management system to a custom Visual Basic front end. Early enterprise adopters are reportedly piloting Gemini‑powered agents for invoice processing, IT help desk ticket routing, and software regression testing—use cases where the ability to observe and interact with the actual UI is non‑negotiable.
However, Windows deployment is not yet seamless. The current agent runtime is browser‑based: it uses the Chrome DevTools Protocol to interact with web applications, and a lightweight companion agent, delivered via Google Play Services, covers Android. Native Windows desktop interaction is slated for a later release, and Google has confirmed that it will build on Windows accessibility APIs, meaning any application that supports screen readers can theoretically be driven by an agent. Developers who want to experiment immediately can run the agent inside a virtualized browser on a Windows machine, using containerized micro‑VMs to isolate the session.
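For readers unfamiliar with the Chrome DevTools Protocol, the sketch below builds the JSON messages a client would send to take a screenshot and perform a left click. `Page.captureScreenshot` and `Input.dispatchMouseEvent` are real protocol methods; how Google’s runtime actually uses them is not described, and the helper names here are hypothetical. The code only constructs messages and does not open a connection to a browser.

```python
import json
from itertools import count
from typing import Iterator, Optional


def cdp_screenshot(ids: Iterator[int]) -> str:
    """Build a request asking the page for a PNG screenshot."""
    return json.dumps({
        "id": next(ids),
        "method": "Page.captureScreenshot",
        "params": {"format": "png"},
    })


def cdp_click(x: int, y: int, ids: Optional[Iterator[int]] = None) -> list[str]:
    """Build the press and release events that make up one left click."""
    if ids is None:
        ids = count(1)
    messages = []
    for event_type in ("mousePressed", "mouseReleased"):
        messages.append(json.dumps({
            "id": next(ids),              # each request needs a unique id
            "method": "Input.dispatchMouseEvent",
            "params": {
                "type": event_type,
                "x": x,
                "y": y,
                "button": "left",
                "clickCount": 1,
            },
        }))
    return messages
```

In practice these strings would be sent over the browser’s debugging WebSocket, and the matching `id` in each response tells the client which request it answers.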
The Competitive Landscape: RPA Meets Generative AI
The computer‑use feature positions Gemini 3.5 Flash at the intersection of two massive markets: RPA, which Gartner estimates will top $4 billion in 2026, and generative AI platforms. Established RPA vendors like UiPath and Automation Anywhere have already begun integrating LLMs for text understanding, but they still rely largely on deterministic UI selectors. Google’s end‑to‑end approach handles both visual understanding and action generation in a single model, which could cut the effort of building an automation from weeks to hours for workflows that would otherwise need hand‑built selectors.
Rivals are already in the field. Anthropic has offered a “computer use” tool for Claude through its API since a public beta in October 2024, and OpenAI introduced a browser‑driving agent, Operator, in January 2025. Microsoft’s Copilot stack, deeply embedded in Windows 11 and Edge, has the home‑field advantage but currently requires developers to define specific skills and connectors, a far cry from the zero‑shot, visual approach Gemini offers.
For Windows‑focused enterprises, the decision will likely hinge on trust and ecosystems. A company already invested in Google Workspace, Vertex AI, and Chromebooks will find Gemini 3.5 Flash a natural fit. Microsoft‑centric organizations, particularly those bound by E5 licensing agreements that bundle Copilot, may wait for Redmond’s equivalent, but the pressure to offer similar agentic capabilities in a Windows‑native form factor is now immense.
Challenges That Still Loom
Despite its advances, computer‑use AI is not without pitfalls. Latency remains an issue: every action requires a round‑trip to the cloud, and even with Flash’s optimized inference, complex multi‑step workflows can feel sluggish compared to API‑based automation. Google mitigates this by caching screen states and batching actions, but users accustomed to instantaneous macros may be disappointed.
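One simple way to cache screen states is to key the model’s answer on a hash of the screenshot and the instruction, so an unchanged screen does not trigger another round‑trip. The article does not say how Google implements its caching; `CachedPlanner` is a hypothetical sketch, and `plan_fn` stands in for the remote model call.

```python
import hashlib
from typing import Callable


class CachedPlanner:
    """Skips the model call when the same screen and instruction recur."""

    def __init__(self, plan_fn: Callable[[bytes, str], str]) -> None:
        self._plan_fn = plan_fn
        self._cache: dict[str, str] = {}
        self.model_calls = 0

    def plan(self, screenshot: bytes, instruction: str) -> str:
        # Hash the two inputs separately so their boundary is unambiguous.
        digest = hashlib.sha256()
        digest.update(hashlib.sha256(screenshot).digest())
        digest.update(instruction.encode("utf-8"))
        key = digest.hexdigest()
        if key not in self._cache:
            self.model_calls += 1
            self._cache[key] = self._plan_fn(screenshot, instruction)
        return self._cache[key]
```

An exact‑match cache like this only helps when the pixels are byte‑for‑byte identical; a blinking cursor or a clock in the corner would defeat it, so a production system would need a more tolerant comparison.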
Hallucination is another concern. The model might misidentify a button, click the wrong link, or misinterpret a dropdown menu, leading to data being entered into the wrong field. Google’s sandboxing and human‑in‑the‑loop approvals catch many of these errors, but the residual risk is not zero. During internal red‑teaming exercises, a Gemini agent accidentally submitted a test purchase order three times because it misread a “processing, please wait” spinner as a stalled page and retried the submission. The incident resulted in a new guardrail that requires confirmation before retrying actions—a fix that highlights the iterative learning curve these agents impose on development teams.
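The confirm‑before‑retry guardrail can be approximated as follows. This is an illustrative sketch rather than Google’s fix: `RetryGuard` is a hypothetical class, actions are identified by a caller‑supplied string, and the `confirm` callback stands in for whatever prompt a human reviewer would see.

```python
from typing import Callable


class RetryGuard:
    """Lets an action run once freely; repeats need explicit confirmation."""

    def __init__(self, confirm: Callable[[str], bool]) -> None:
        self._confirm = confirm
        self._seen: set[str] = set()

    def execute(self, action_id: str, run: Callable[[], None]) -> bool:
        """Run the action and return True, or return False if a retry was refused."""
        if action_id in self._seen and not self._confirm(action_id):
            return False
        self._seen.add(action_id)
        run()
        return True
```

With a guard like this, an agent that misreads a spinner as a stalled page can still ask to resubmit, but the second purchase order goes out only if someone agrees.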
User acceptance is equally critical. Knowledge workers who have spent years building trust in automated tools like mail merge or Quick Steps may be wary of an AI that can autonomously navigate their desktop. Google is addressing this with a transparency pane that overlays the agent’s actions in a semi‑opaque window, letting users “ride along” and intervene with a single keystroke. Early feedback from beta testers suggests that this visibility is the single most important factor in building comfort with the technology.
What Comes Next
Google has committed to expanding computer‑use capabilities throughout 2026, with native Windows desktop integration expected in the fourth quarter. Future releases promise tighter OS integration, including the ability to interact with Windows’ native notification system, system tray, and even UAC prompts, though UAC interaction will require enterprise‑grade certificate signing to satisfy security standards.
The company also plans to open‑source the agent runtime’s communication protocol, allowing third‑party Windows application vendors to build “agent‑friendly” interfaces that expose visual semantics in a machine‑readable schema. Such a move could create a new ecosystem of agent‑optimized software, much as mobile‑first design reshaped web development a decade ago.
For the broader Windows community, Google’s announcement signals that the era of AI merely generating text is over. The next battle will be fought at the OS and application layer, where agents become active participants in the digital workplace. The winners will be those who balance autonomy with airtight security—a bar that Gemini 3.5 Flash, with its ready checks and sandboxed execution, has just raised.