
Verifiable Multimodal Systems: Agentic Vision for Gemini 3 Flash



TL;DR

Google introduced Agentic Vision for Gemini 3 Flash, enabling an iterative Think → Act → Observe loop powered by Python code execution. Instead of a single “static glance,” the model can zoom/crop/annotate and compute to ground answers in verifiable evidence—particularly useful for counting, reading fine print, and extracting values from charts/tables.



What’s new

Most multimodal models process an image once and respond from that first pass. If they miss a small detail (tiny text, dense tables, crowded objects), they often guess.


Agentic Vision changes that by making “look again” a first-class capability: Gemini 3 Flash can plan, run code to manipulate/analyze the image, then re-check the evidence before answering.


The Think → Act → Observe loop

Agentic Vision is organized around an iterative workflow:

  1. Think: interpret the user request + initial image; decide what must be verified.
  2. Act: generate and execute Python code (e.g., crop/zoom, rotate, annotate, compute, plot).
  3. Observe: append the tool outputs (cropped images, counts, calculations, plots) back into the model context and continue (or answer).
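The loop above can be sketched in plain Python. This is a conceptual stand-in, not the Gemini API: the helper names (`think`, `act`, `agentic_vision`) and the string-matching logic are invented for illustration, since the real loop runs inside the model's sandboxed code-execution environment.

```python
# Minimal, library-free sketch of the Think -> Act -> Observe loop.
# All names and the toy "evidence" strings here are illustrative only.

def think(context: list) -> dict:
    """Decide what to verify next; return an action or a final answer."""
    if any("crop:serial" in c for c in context):
        return {"answer": "serial number = SN-1234"}
    return {"code": "crop(region='serial_label')"}

def act(code: str) -> str:
    """Stand-in for sandboxed Python execution (crop/zoom/annotate/compute)."""
    return "crop:serial -> enlarged image of the serial label"

def agentic_vision(user_request: str, max_steps: int = 5) -> str:
    context = [user_request]
    for _ in range(max_steps):
        step = think(context)            # Think: plan the next verification
        if "answer" in step:             # enough evidence gathered
            return step["answer"]
        observation = act(step["code"])  # Act: execute generated code
        context.append(observation)      # Observe: feed results back
    return "inconclusive"

print(agentic_vision("Read the serial number on the device"))
```

The key structural point is that tool outputs are appended back into the context, so each pass reasons over strictly more evidence than the last.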

Agentic Vision uses a Think → Act → Observe loop to iteratively gather evidence before answering.

Why it matters (practical reliability)

Agentic Vision is most valuable when “almost right” is still expensive:

  • Counting in dense scenes (inventory, shelf items, parts)
  • Reading fine print (serial numbers, labels, signage)
  • Extracting numbers from screenshots of tables/charts
  • Verifying visual rules (compliance checks, plan inspection)

Google reports that enabling code execution with Gemini 3 Flash can provide a consistent 5–10% quality boost across most vision benchmarks.



Core capabilities (what you can build)


1) Active zooming & inspection

Gemini 3 Flash can implicitly detect when details are too fine for standard resolution and use code execution to crop and re-inspect relevant regions.


Google highlights a building-plan validation example where iterative cropping improved accuracy.
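Under the hood, "zoom and re-inspect" amounts to a crop plus an upscale before a second look. The sketch below uses Pillow with arbitrary coordinates as an assumption; in the real flow the model chooses the region itself during the Act step.

```python
# Crop a suspect region and enlarge it for re-inspection.
# The blank image and coordinates are placeholders for illustration.
from PIL import Image

page = Image.new("RGB", (1000, 1400), "white")   # stand-in for a scanned plan
region = (600, 1200, 760, 1260)                  # box around the fine print
crop = page.crop(region)                         # 160 x 60 px region
zoomed = crop.resize((crop.width * 4, crop.height * 4), Image.LANCZOS)
print(zoomed.size)
```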


2) Visual annotation (“visual scratchpad”)

Instead of describing only in text, the model can draw boxes/labels/arrows on the image, making tasks like counting auditable.
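A rough sketch of the scratchpad idea, using Pillow: each detection gets a numbered box drawn on the image, so the reported count can be checked against visible marks. The blank canvas and the `detections` list are invented for illustration; in practice the model produces them via code execution.

```python
# Annotate detections directly on the image so a count is auditable.
from PIL import Image, ImageDraw

image = Image.new("RGB", (320, 200), "white")    # stand-in for the input photo
detections = [(20, 30, 80, 90), (120, 40, 180, 100), (210, 60, 270, 120)]

draw = ImageDraw.Draw(image)
for i, (x0, y0, x1, y1) in enumerate(detections, start=1):
    draw.rectangle((x0, y0, x1, y1), outline="red", width=3)  # box each item
    draw.text((x0, y0 - 12), str(i), fill="red")              # numbered label

print(f"count = {len(detections)}")              # the auditable answer
image.save("annotated.png")                      # evidence artifact
```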



3) Deterministic visual math & plotting

Agentic Vision can extract values from visual tables/charts and use Python to compute results and generate plots (e.g., Matplotlib), reducing hallucinated visual arithmetic.
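The extract-then-compute pattern looks like this once values are read off a chart: the arithmetic happens in code, not in free-form text. The quarterly figures below are invented for illustration; in the real flow the model extracts them from the image via code execution (and could plot them with Matplotlib).

```python
# Deterministic arithmetic over values "extracted" from a chart.
# The numbers below are made-up example data.
quarterly_revenue = {"Q1": 4.2, "Q2": 5.1, "Q3": 4.8, "Q4": 6.3}  # in $M

total = sum(quarterly_revenue.values())
growth = (quarterly_revenue["Q4"] - quarterly_revenue["Q1"]) / quarterly_revenue["Q1"]

print(f"total = ${total:.1f}M, Q1->Q4 growth = {growth:.0%}")
```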



How to try it (AI Studio, Gemini API, Vertex AI)


Option A — Google AI Studio (fastest demo)

Use the AI Studio playground, select Gemini 3 Flash, and enable Tools → Code Execution.


Option B — Gemini API (programmatic)

Gemini 3 Flash preview model ID and docs:


Model page 👉 Gemini 3 Flash Preview | Gemini API | Google AI for Developers

Gemini 3 guide 👉 Gemini 3 Developer Guide | Gemini API | Google AI for Developers

Code execution tool 👉 Code execution | Gemini API | Google AI for Developers


Minimal Python example (conceptual)



# Conceptual example — refer to the official docs for full auth/setup.
# Sources:
# - Gemini 3 guide: https://ai.google.dev/gemini-api/docs/gemini-3
# - Code execution: https://ai.google.dev/gemini-api/docs/code-execution
from google import genai
from google.genai import types

client = genai.Client()

# image_file is a placeholder, e.g. client.files.upload(file="photo.png")
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[
        image_file,
        "Inspect the image step-by-step. Zoom into fine text if needed and verify it.",
    ],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())]
    ),
)

print(response.text)



Option C — Vertex AI (Cloud)

Model overview (Gemini 3 Flash on Vertex AI) 👉 Gemini 3 Flash | Generative AI on Vertex AI | Google Cloud Documentation

Code execution for multimodal on Vertex AI 👉 Code execution | Generative AI on Vertex AI | Google Cloud Documentation


Limits & considerations (read before production)


Code execution constraints

Code execution has a maximum execution timeout of 30 seconds.

Source 👉 Code execution | Gemini API | Google AI for Developers


Vertex AI reference notes limitations (including no file I/O).

Source 👉 Execute code with the Gemini API | Generative AI on Vertex AI | Google Cloud Documentation


Preview status

Gemini 3 Flash is positioned as Public Preview in the announcement and official docs.

Source 👉 Introducing Agentic Vision in Gemini 3 Flash


Safety notes (from the model card)

The official Gemini 3 Flash model card reports automated safety evaluation deltas vs Gemini 2.5 Flash and notes manual review context.


Model card landing page 👉 Gemini 3 Flash Model Card



Builder patterns (how to get agentic behavior reliably)

These prompt patterns help the model produce verifiable outputs instead of just fluent answers:


Pattern 1 — Evidence-first inspection


“Inspect the image step-by-step. If any text is small or ambiguous, zoom/crop to verify it. Return the final answer and briefly describe what regions you inspected.”

Pattern 2 — Auditable counting


“Count the items reliably. Use image annotation as a scratchpad (boxes/labels) so the count is verifiable. Report the count and note uncertain regions.”

Pattern 3 — Table/chart extraction → compute → plot


“Extract the chart values into a table first. Then compute the requested metric using code execution and generate a clean plot.”

Source context for tool-enabled workflows 👉 Introducing Agentic Vision in Gemini 3 Flash


⭐⭐⭐


Agentic Vision is a meaningful step toward verifiable multimodal systems. By combining vision with tool-backed code execution, Gemini 3 Flash can inspect, annotate, and compute—turning many “probably correct” visual answers into something closer to audited evidence.


Sources

Agentic Vision announcement 👉 Introducing Agentic Vision in Gemini 3 Flash

Gemini 3 Flash on Vertex AI (docs) 👉 Gemini 3 Flash | Generative AI on Vertex AI | Google Cloud Documentation

Gemini 3 Flash preview model (Gemini API) 👉 Gemini 3 Flash Preview | Gemini API | Google AI for Developers

Gemini 3 Developer Guide 👉 Gemini 3 Developer Guide | Gemini API | Google AI for Developers

Code execution (Gemini API) 👉 Code execution | Gemini API | Google AI for Developers

Code execution (Vertex AI) 👉 Code execution | Generative AI on Vertex AI | Google Cloud Documentation

Code execution API reference (Vertex AI) 👉 Execute code with the Gemini API | Generative AI on Vertex AI | Google Cloud Documentation

Gemini 3 Flash model card 👉 Gemini 3 Flash Model Card


Author: Ata Güneş

Date Published: Mar 3, 2026


