RAG on the Cloud: Google’s File Search Tool and the Beginning of the End for the DIY Vector Stack
Last week, Google quietly rolled out the File Search Tool inside the Gemini API. It looks small. It isn’t. Here’s what it does, what it costs, and why every team running a homemade RAG pipeline should at least be paying attention.
If you’ve shipped anything serious with an LLM in the last eighteen months, you’ve probably built (or watched a coworker build, or read someone on Hacker News build) the exact same thing: a Retrieval-Augmented Generation pipeline.
The recipe is depressingly familiar by now. Pick a vector database. Pinecone if you have budget, Weaviate if you don’t, Postgres with pgvector if you’re feeling thrifty, Qdrant if you read the right Substacks. Write a chunking script, and then rewrite it three more times because the first version split mid-sentence and the second version didn’t handle PDFs and the third one tokenized markdown headers as body text. Pick an embedding model. Wire up the indexer. Wire up the retriever. Decide on top-k. Argue about reranking. Inject the retrieved chunks into the prompt with the right delimiters. Add citations because your stakeholders want citations. Add a deletion endpoint because GDPR. Add a re-indexing job because the docs change. Monitor the whole thing.
It’s a six-week project that ships as a checkbox feature. And the worst part is that almost none of those weeks are spent on anything proprietary. You’re not building moat. You’re plumbing.
On November 6, Google rolled out the File Search Tool in the Gemini API. It is, in essence, that entire pipeline collapsed into a single API surface. Upload files. Query against them. Get grounded, cited responses back. Done.
I’ve spent the last week kicking the tires on it. This post is what I think it means.
What File Search Actually Is
File Search is a fully managed RAG system that sits inside the existing generateContent endpoint in the Gemini API. You don’t deploy anything. You don’t enable a separate service. You don’t pick a vector database. You point the tool at a “File Search Store,” Google’s name for a managed corpus, and the model uses semantic retrieval to ground its answers in your data.
What Google handles under the hood:
- File ingestion. PDF, DOCX, TXT, JSON, and a long list of programming-language file types, up to 100 MB per file.
- Chunking. They’ve picked a strategy. You don’t get to argue with it. (We’ll come back to this.)
- Embedding. Powered by gemini-embedding-001, which sat at the top of the Massive Text Embedding Benchmark for a stretch this year.
- Vector storage. No database to provision. No quotas to estimate. Just a “store.”
- Retrieval. Semantic vector search at query time, with no exposed knobs for top-k or reranking.
- Context injection. Retrieved chunks get stitched into the prompt automatically.
- Citations. Every grounded response comes back with pointers to the specific sections of the specific files that produced it.
That last bit matters more than it sounds. Citations have been the unsexy bottleneck of half the enterprise RAG demos I’ve seen. Half the projects that go from prototype to “we’ll get back to you” do so because legal or compliance asked for source attribution and nobody had wired it in cleanly. File Search ships with it as a default.
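For a sense of what consuming those citations might look like, here is a minimal Python sketch. It assumes the response has already been converted to a plain dict, and that the grounding data sits under candidates[0].grounding_metadata.grounding_chunks, with each chunk carrying a retrieved_context that holds a title and a text snippet. Those field names are my reading of the preview docs, not something guaranteed to stay stable, so check the current API reference before relying on them. The sample payload and file names are made up for illustration.

```python
from collections import OrderedDict


def extract_citations(response: dict) -> list[dict]:
    """Collect one (title, snippet) pair per cited file from a response dict.

    ASSUMPTION: the key names below mirror the preview response shape;
    adjust them to whatever the current API reference documents.
    """
    citations: "OrderedDict[str, str]" = OrderedDict()
    for candidate in response.get("candidates", []):
        metadata = candidate.get("grounding_metadata") or {}
        for chunk in metadata.get("grounding_chunks", []):
            context = chunk.get("retrieved_context") or {}
            title = context.get("title")
            if not title:
                continue  # skip chunks that cannot be attributed to a file
            # Keep only the first snippet seen per file so the list stays short.
            citations.setdefault(title, (context.get("text") or "")[:200])
    return [{"title": t, "snippet": s} for t, s in citations.items()]


# Hypothetical payload, hand-built for illustration only.
sample_response = {
    "candidates": [
        {
            "grounding_metadata": {
                "grounding_chunks": [
                    {"retrieved_context": {"title": "refunds.pdf", "text": "Refunds are issued within 14 days."}},
                    {"retrieved_context": {"title": "refunds.pdf", "text": "Exceptions apply to gift cards."}},
                    {"retrieved_context": {"title": "terms.docx", "text": "These terms govern all purchases."}},
                ]
            }
        }
    ]
}

for citation in extract_citations(sample_response):
    print(f"{citation['title']}: {citation['snippet']}")
```

Deduplicating by file title is a design choice for display purposes; an audit trail for legal or compliance would more likely keep every chunk.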
The Pricing Is the Real Story
Most managed RAG offerings die on pricing. Vector databases charge for storage plus per-query operations plus embedding API calls plus network egress, and after six months in production you’re looking at a bill that nobody signed up for and nobody knows how to reduce without re-architecting.
Google’s pricing on File Search is genuinely strange in a good way:
- Storage: free.
- Query-time embedding generation: free.
- Indexing-time embedding generation: $0.15 per 1 million tokens, one-time, when you first ingest a file.
That’s it. The retrieved tokens still count as normal Gemini input context, so the generation half of the bill is unchanged. But the retrieval half, the part that quietly compounds in every other architecture, is essentially zero after ingestion.
Translate that to a real workload. A hundred-thousand-page corpus is roughly 50 million tokens, at about 500 tokens per page. Indexing that costs you about $7.50, once. After that, you can run a million queries against it and the retrieval cost is zero. Compare that to a DIY Pinecone-plus-OpenAI-embeddings stack, where the same corpus might run you several hundred dollars a month in storage alone, plus embedding fees on every query that touches the index.
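The arithmetic is simple enough to sanity-check yourself. A minimal sketch, assuming the $0.15 per million tokens indexing price quoted above and a rough 500 tokens per page (the ratio implied by 100,000 pages being about 50 million tokens; your documents may differ):

```python
def indexing_cost_usd(
    pages: int,
    tokens_per_page: int = 500,            # ASSUMPTION: rough average, varies by corpus
    usd_per_million_tokens: float = 0.15,  # indexing price quoted in this post
) -> float:
    """One-time cost of embedding a corpus at ingestion."""
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * usd_per_million_tokens


cost = indexing_cost_usd(100_000)
print(f"One-time indexing cost: ${cost:.2f}")
```

Swap in your own page count and tokens-per-page estimate. Generation-side input tokens are billed separately and are not modeled here.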
This is not the pricing of a company trying to recoup infrastructure costs. This is the pricing of a company trying to make it irrational not to use their model.
What It Gets Right
A few things stood out when I actually started building with it.
The integration is invisible. You don’t import a separate SDK. You don’t authenticate against a different service. You add a file_search entry to the tools list of your existing generateContent call and you’re done. For a feature this consequential, the API surface is shockingly boring, which is the highest compliment you can pay an API.
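To make that concrete, here is a sketch of what the request body for a generateContent call with File Search attached might look like. The field names (file_search, file_search_store_names) follow my reading of the preview REST docs and should be treated as assumptions; the store name is a hypothetical placeholder. The code only builds and prints the JSON, it does not call the API.

```python
import json


def build_file_search_request(question: str, store_names: list[str]) -> dict:
    """Build a generateContent-style body with a File Search tool attached.

    ASSUMPTION: key names mirror the preview REST docs; verify them against
    the current API reference before sending this anywhere.
    """
    if not store_names:
        raise ValueError("at least one File Search Store name is required")
    return {
        "contents": [{"parts": [{"text": question}]}],
        "tools": [
            {"file_search": {"file_search_store_names": list(store_names)}}
        ],
    }


body = build_file_search_request(
    "What is our refund policy?",
    ["fileSearchStores/example-store"],  # hypothetical store name
)
print(json.dumps(body, indent=2))
```

Everything outside the tools entry is the same body you would send for an ungrounded call, which is the whole point.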
Chunking is no longer your problem. I’ve watched smart engineers spend literal months on chunking strategy. Sliding window? Recursive? Section-aware? Token-bounded? The honest answer is that for 90% of corpora, any reasonable strategy works fine and the right answer is “stop tuning this and ship.” File Search just makes that stance the default.
The embedding model is good. gemini-embedding-001 is genuinely strong on retrieval benchmarks. The fact that you don’t have to pick it, license it, host it, or even know about it is exactly the abstraction developers should want.
Latency is reasonable for the abstraction level. Google’s example is Phaser Studio’s Beam platform, which runs parallel queries across a 3,000-file corpus and returns combined results in under two seconds. That’s not “world-beating,” but it’s “well within the budget for an interactive product,” which is what matters.
It works inside an agent loop. File Search slots in as a tool in the agentic sense, which means you can hand it to a Gemini agent and let the model decide when to query its own corpus. Combined with function calling, this turns into the cleanest agentic-RAG pattern I’ve seen ship from a major lab.
What It Doesn’t Get Right (Yet)
Honest assessment: this is a public preview, and it has the rough edges of a public preview.
You can’t tune retrieval. No top-k knob. No reranker. No metadata filters in the way you’d want them. If the default semantic match doesn’t surface the right chunk for your use case, your only real option is to rewrite the user’s query or restructure your corpus. For specialist domains like legal, medical, or scientific, where retrieval quality is the entire game, this is going to bite.
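If query rewriting is the only lever, the workaround tends to look like a fan-out: run the original question plus a few rewrites, then merge whatever comes back. A minimal sketch of that pattern follows. The search callable is a hypothetical stand-in for whatever function issues a grounded call and returns the cited passages; here it is stubbed with a dict so the example runs on its own.

```python
from typing import Callable, Iterable


def search_with_rewrites(
    query: str,
    rewrites: Iterable[str],
    search: Callable[[str], list[str]],
) -> list[str]:
    """Run the original query and each rewrite, merging results in order.

    Passages are deduplicated; the first variant to surface a passage wins.
    """
    seen: set[str] = set()
    merged: list[str] = []
    for variant in [query, *rewrites]:
        for passage in search(variant):
            if passage not in seen:
                seen.add(passage)
                merged.append(passage)
    return merged


# Stub retrieval backend, for illustration only.
fake_index = {
    "how do I end the contract": ["Section 3 Term"],
    "termination clause": ["Section 12.1 Termination", "Section 3 Term"],
}


def fake_search(q: str) -> list[str]:
    return fake_index.get(q, [])


results = search_with_rewrites(
    "how do I end the contract",
    ["termination clause", "notice period"],
    fake_search,
)
print(results)
```

Each rewrite is another model call, so this trades cost and latency for recall. It is a workaround, not a substitute for a reranker.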
Chunking is opaque. “We picked a strategy” is great until your corpus is full of dense tables, complex hierarchical documents, or scanned PDFs where parsing matters as much as chunking. Several early users have already flagged that medical tables and legal contracts with nested numbering don’t extract cleanly. If your documents look weird, expect to do pre-processing before ingestion.
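One pre-processing trick that tends to help with tables, whatever the chunker does: flatten each row into a self-describing line before ingestion, so a row that lands in its own chunk still carries its column headers. A minimal sketch, assuming you have already extracted the table into headers and rows (the extraction itself is the hard part and is not shown; the table contents are invented):

```python
def flatten_table(headers: list[str], rows: list[list[str]], caption: str = "") -> str:
    """Turn a table into one 'header: value' line per row.

    Each line repeats the caption and column names, so it remains
    interpretable even if it is split away from the rest of the table.
    """
    lines = []
    for row in rows:
        if len(row) != len(headers):
            raise ValueError("row length does not match header length")
        cells = "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        lines.append(f"{caption} | {cells}" if caption else cells)
    return "\n".join(lines)


text = flatten_table(
    headers=["Drug", "Dose"],
    rows=[["A", "5 mg"], ["B", "10 mg"]],
    caption="Table 2",
)
print(text)
```

The output is verbose on purpose. Redundant headers cost a few indexing tokens and buy chunks that make sense in isolation.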
It’s locked in. The embeddings are proprietary. You can’t export them. If you decide in 18 months that you want to migrate to a different model provider, you re-embed everything from scratch on the way out. This is a real and growing concern as more workloads consolidate on a single foundation-model vendor.
It doesn’t compose with everything. As of launch, File Search can’t be combined with other Gemini built-in tools like Google Search grounding or URL Context in the same call. For applications that need to mix private corpus retrieval with live web data, you’re back to orchestrating manually.
Data residency is unclear. Google has not yet published a detailed regional compliance matrix. For EU enterprises, healthcare, or anyone with strict in-country storage requirements, this is the kind of thing you cannot ship on vibes. Wait for the documentation.
The free tier is small. You get 1 GB of storage per project on the free tier and 10 stores per project, with a soft recommendation to keep individual stores under 20 GB for latency reasons. Plenty for prototypes. Tight for serious enterprise corpora.
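Before uploading anything, it is worth checking a corpus against those numbers. A rough sketch using the limits quoted in this post (100 MB per file, 1 GB on the free tier); whether Google counts in binary or decimal units, and whether stored embeddings count against the quota, are assumptions you should verify. File names and sizes below are invented.

```python
MB = 1024 ** 2  # ASSUMPTION: binary units; the real quota may be decimal
GB = 1024 ** 3


def check_corpus(
    file_sizes: dict[str, int],
    max_file_bytes: int = 100 * MB,  # per-file limit quoted in this post
    tier_bytes: int = 1 * GB,        # free-tier storage quoted in this post
) -> dict:
    """Summarize whether a set of files fits the quoted limits."""
    oversized = sorted(n for n, size in file_sizes.items() if size > max_file_bytes)
    total = sum(file_sizes.values())
    return {
        "total_bytes": total,
        "oversized_files": oversized,
        "fits_tier": total <= tier_bytes and not oversized,
    }


report = check_corpus({
    "handbook.pdf": 40 * MB,
    "api-reference.json": 12 * MB,
    "scanned-archive.pdf": 250 * MB,  # exceeds the per-file limit
})
print(report)
```

Oversized files need splitting before ingestion, and a total that exceeds the tier means either a paid tier or a smaller corpus.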
Who Should Use It
The framing I’d use after a week of testing:
Use File Search if: you’re building a support bot, internal knowledge assistant, document Q&A feature, or anything where the corpus is small-to-medium, the documents are well-structured, the retrieval requirements are general-purpose, and you’d rather ship next week than build a pipeline for two months. This is the 80% case for RAG, and File Search nails it.
Hesitate if: you have a specialist retrieval problem (legal precedent search, scientific literature, medical records), you need fine-grained control over chunking or reranking, you have hard data-residency requirements, or you’re already heavily invested in a multi-model architecture and don’t want to deepen Gemini lock-in.
Don’t use it if: retrieval quality is your product’s core differentiator. The whole point of File Search is that the retrieval layer becomes a commodity. If your moat is your custom retrieval, commoditizing it is the opposite of what you want.
The Bigger Picture
Step back from File Search specifically and look at the trajectory.
OpenAI’s Assistants API has had a file search feature for a while. Microsoft has been pushing managed RAG inside Azure AI Search and various Copilot building blocks. Anthropic’s product surface is still leaner here, but their direction of travel is the same. The big labs have all clearly decided that DIY RAG is dead weight that keeps developers off their platforms, and they are racing to abstract it.
What Google did with File Search is push the abstraction further than anyone else has. Not “we’ll host the vector database for you.” Not “we’ll give you a client library that ties together five services.” Just: it’s an API parameter now. The retrieval layer disappears.
This is the same pattern we’ve seen play out in every other infrastructure category. Web servers used to be a thing you operated; now they’re a runtime detail. Databases used to be a thing you tuned; for most apps, they’re now a managed line item. Authentication used to be a thing you implemented; now it’s an SDK.
Retrieval is next. The teams winning in 2026 won’t be the ones with the cleverest custom RAG pipelines. They’ll be the ones who recognized that the retrieval layer was about to become a commodity, stopped maintaining their bespoke version, and redeployed those engineers onto problems that actually differentiate their product.
File Search isn’t the final form of this. It’s a public preview with real gaps. But it’s the clearest signal yet that “build your own RAG” is moving from “necessary skill” to “legacy work.” If you’re a developer, the question to ask yourself right now isn’t “is File Search better than my Pinecone setup?” It’s: “what would I work on next month if my RAG stack were no longer my problem?”
That answer is where your job is going.
