June 23, 2026 (Inside AI): Aerial imagery spanning more than 45 countries holds billions of pixels, yet turning it into a knowledge base that can be searched in natural language has long required manual inspection or custom computer vision models. A new collaboration between AWS and Vexcel, a major geospatial data provider, shows that combining multimodal embeddings, large language model captioning, and vector search on Amazon Bedrock and Amazon OpenSearch Serverless lets teams index imagery once and then query it in plain English.
The work, now evolved into the Vexcel Intelligence product, tackled a core question: what is the optimal combination of embedding model, fusion strategy, captioning, and search method for multi-view aerial imagery? The answer emerged from roughly 100 configurations tested on two benchmark queries in Chicago's Grant Park—"swimming pools" and "roads."
Amazon Nova Multimodal Embeddings delivered the highest average F1 scores: 0.621 for pools and 0.555 for roads. Caption integration proved the single most impactful optimization, boosting best-configuration F1 by 11% for pools and 13% for roads. Yet no single fusion or search method dominated; performance varied sharply by feature type, underscoring the need for modular, evaluable architectures.
Why Geospatial Search Demands a New Playbook
Unlike consumer photo search, each aerial tile comprises seven complementary views: an orthophoto, four oblique angles, a digital surface model (DSM), and a digital terrain model (DTM). A building's façade might appear only in the south oblique; tree canopy in the DSM can obscure ground features. An embedding model must fuse these perspectives into a single searchable representation, and how best to do that was an open question.
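To make the fusion question concrete, here is a minimal sketch of one simple option: a weighted average of per-view embeddings. It is not the attention fusion or Cohere batch method the team evaluated; the function names, view names, and weights are all illustrative assumptions.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_views(
    view_embeddings: Dict[str, Vector],
    weights: Optional[Dict[str, float]] = None,
) -> Vector:
    """Weighted mean of unit-normalized per-view embeddings, re-normalized.

    Assumption: every view is embedded into the same vector space.
    Views missing from `weights` get weight 0; with no weights, all views
    count equally.
    """
    if not view_embeddings:
        raise ValueError("at least one view embedding is required")
    dim = len(next(iter(view_embeddings.values())))
    fused = [0.0] * dim
    total_weight = 0.0
    for name, vec in view_embeddings.items():
        if len(vec) != dim:
            raise ValueError(f"view {name!r} has a different dimension")
        w = 1.0 if weights is None else weights.get(name, 0.0)
        for i, x in enumerate(l2_normalize(vec)):
            fused[i] += w * x
        total_weight += w
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return l2_normalize([x / total_weight for x in fused])
```

Normalizing each view before averaging keeps one high-magnitude view from dominating the fused vector, which is one reason a plain mean is a common baseline to compare learned fusion against.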
Ground truth is another hurdle. Without large labeled datasets, the team used OpenStreetMap to automate evaluation. They also had to define "correct"—tile-level matches (at least one feature present) versus entity-level matches (every pool counted)—which reward different system behaviors.
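The difference between the two definitions of "correct" can be sketched in a few lines. This is an illustration under stated assumptions, not the team's evaluation code: ground truth is taken to be a per-tile feature count (for example, pools derived from OpenStreetMap), and the entity-level metric shown is only a recall, the share of all features that fall in retrieved tiles.

```python
from typing import Dict, Set, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1, returning 0.0 on empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def tile_level_scores(
    retrieved: Set[str], truth_counts: Dict[str, int]
) -> Tuple[float, float, float]:
    """A tile counts as positive if it holds at least one feature."""
    positives = {tile for tile, n in truth_counts.items() if n > 0}
    tp = len(retrieved & positives)
    fp = len(retrieved - positives)
    fn = len(positives - retrieved)
    return precision_recall_f1(tp, fp, fn)


def entity_level_recall(retrieved: Set[str], truth_counts: Dict[str, int]) -> float:
    """Share of all individual features that sit inside retrieved tiles."""
    total = sum(truth_counts.values())
    found = sum(n for tile, n in truth_counts.items() if tile in retrieved)
    return found / total if total else 0.0
```

A system that returns many sparse tiles can score well at the tile level while missing the one dense tile that holds most of the features, which is why the two definitions reward different behavior.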
The modular pipeline, built on Amazon Bedrock and OpenSearch Serverless, let engineers swap embedding models, fusion strategies, and search methods via configuration. This enabled rapid A/B testing across five stages: area-of-interest selection, imagery ingestion, embedding and indexing, search, and evaluation against OpenStreetMap.
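One way to picture swapping components via configuration is a set of registries keyed by name. The sketch below is hypothetical throughout: the component names, the placeholder callables, and the `ExperimentConfig` fields are assumptions for illustration and do not reflect the actual pipeline or any Amazon Bedrock API.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class ExperimentConfig:
    """One experiment: which embedder, fusion strategy, and search method to use."""
    embedder: str
    fusion: str
    search: str


# Hypothetical registries. Real entries would wrap model and search calls.
EMBEDDERS: Dict[str, Callable[[str], List[float]]] = {
    "embedder_a": lambda tile_id: [1.0, 0.0],
    "embedder_b": lambda tile_id: [0.0, 1.0],
}
FUSIONS: Dict[str, Callable[[List[List[float]]], List[float]]] = {
    "mean": lambda vecs: [sum(col) / len(vecs) for col in zip(*vecs)],
    "first_view": lambda vecs: list(vecs[0]),
}
SEARCHES: Dict[str, Callable[[Dict[str, float], int], List[str]]] = {
    "knn": lambda scores, k: sorted(scores, key=scores.get, reverse=True)[:k],
    "threshold": lambda scores, k: [t for t in scores if scores[t] >= 0.5][:k],
}


def expand_grid() -> Iterator[ExperimentConfig]:
    """Enumerate every combination of registered components."""
    for e, f, s in product(sorted(EMBEDDERS), sorted(FUSIONS), sorted(SEARCHES)):
        yield ExperimentConfig(embedder=e, fusion=f, search=s)


def resolve(config: ExperimentConfig) -> Tuple[Callable, Callable, Callable]:
    """Look up the callables named in a config, failing loudly on typos."""
    try:
        return EMBEDDERS[config.embedder], FUSIONS[config.fusion], SEARCHES[config.search]
    except KeyError as exc:
        raise ValueError(f"unknown component name: {exc}") from exc
```

With this layout, adding a new model means registering one more entry, and the grid of experiments grows automatically.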
Experiments Expose Feature-Dependent Tradeoffs
Four experiments revealed actionable insights. First, model choice matters: Amazon Nova Multimodal Embeddings outperformed Cohere Embed v4 and Amazon Titan Multimodal Embeddings G1, especially on distributed features like roads. Second, fusion strategy is feature-dependent: Cohere batch and attention fusion tied for pools at F1 0.638, but attention fusion led for roads at 0.535, while Cohere batch dropped to 0.479.
Third, captions from Amazon Nova 2 Lite significantly boosted results when combined with image embeddings, but text-only search fell 17% short. Caption vocabulary also affected metadata filtering. Fourth, search method performance diverged: basic k-NN, image+caption fusion, and metadata filtering all hit F1 0.638 for pools, but metadata filtering collapsed to 0.358 for roads due to inconsistent tagging.
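The contrast between basic k-NN and metadata filtering can be illustrated with a brute-force search over a tiny in-memory index. This is a sketch only: it does not use the OpenSearch Serverless API, and the tile IDs, vectors, and tags are made up. It shows one plausible way inconsistent tagging hurts recall: a tile tagged "street" is dropped by a filter on "road" even though its embedding is close to the query.

```python
import math
from typing import Dict, List, Optional, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_search(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int,
    tags: Optional[Dict[str, Set[str]]] = None,
    required_tag: Optional[str] = None,
) -> List[Tuple[str, float]]:
    """Return the k most similar tiles, optionally pre-filtered by a tag."""
    candidates = list(index.items())
    if required_tag is not None:
        tag_map = tags or {}
        candidates = [
            (tile_id, vec)
            for tile_id, vec in candidates
            if required_tag in tag_map.get(tile_id, set())
        ]
    scored = [(tile_id, cosine(query, vec)) for tile_id, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: t2 shows a road but was tagged "street" (hypothetical example).
INDEX = {"t1": [1.0, 0.0], "t2": [0.9, 0.1], "t3": [0.0, 1.0]}
TAGS = {"t1": {"road"}, "t2": {"street"}, "t3": {"park"}}
```

Calling `knn_search([1.0, 0.0], INDEX, k=2)` ranks `t1` and `t2` first, while adding `tags=TAGS, required_tag="road"` leaves only `t1`.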
"The optimal search method depends entirely on the feature type," the team noted. They recommend starting with basic k-NN over caption-enriched embeddings for consistency, then adding specialized methods for underperforming query categories. The evaluation framework also computes nDCG and stratified metrics to reveal how systems handle sparse versus dense tiles.
Vexcel Intelligence is now in preview, offering searchable vector embeddings and an API across its global library. The collaboration also delivered an AI-powered code onboarding chat service for Vexcel's engineers. As new models launch on Amazon Bedrock, the pipeline can swap them in with a configuration change, and the evaluation harness can measure their impact right away.
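For readers unfamiliar with the ranking metrics the evaluation harness reports, the following sketch shows nDCG and a simple stratified average. It is an assumption-laden illustration rather than the harness itself: it uses linear gain, computes the ideal ranking only from the returned list, and the stratum labels ("sparse", "dense") are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the same results in ideal order.

    Simplification: the ideal ordering is built from the returned list only,
    so relevant items that were never retrieved do not lower the score.
    """
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def stratified_mean(scores: Dict[str, float], strata: Dict[str, str]) -> Dict[str, float]:
    """Average a per-tile score separately within each stratum."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for tile_id, score in scores.items():
        buckets[strata[tile_id]].append(score)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}
```

Reporting the mean per stratum, rather than one global number, makes it visible when a configuration does well on dense tiles but poorly on sparse ones.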