Engineering
The Embedding Failover Pattern: Zero Downtime Across Providers
When your embedding provider goes down, your search breaks. Learn the failover pattern that keeps retrieval alive across provider outages using projection matrices.
At 2:14 AM on a Tuesday, OpenAI's embedding API returned a 503. A monitoring alert fired. An on-call engineer woke up. The company's semantic search feature — used by customers in European time zones who were very much awake — was returning empty results.
The outage lasted 47 minutes. That is 47 minutes of broken search, 47 minutes of customer complaints, and one very grumpy engineer.
This scenario is not hypothetical. OpenAI's status page shows multiple embedding API incidents per year. Google's Gemini API has similar patterns. Any system that depends on a single embedding provider has a single point of failure — and that failure mode is invisible right up until it happens.
The embedding failover pattern solves this. Here is how it works.
Why naive failover does not work
The obvious fix — "if OpenAI fails, call Gemini instead" — seems straightforward. But it breaks immediately in practice because of the vector compatibility problem.
Your stored documents were embedded with OpenAI's text-embedding-3-large. When a user sends a query, that query needs to be embedded with the same model that produced the stored vectors. If you embed the query with Gemini embedding-004 instead, you are searching in the wrong coordinate system. The cosine similarity scores are meaningless. Retrieval quality drops to zero.
This is why provider failover for embedding systems is not a routing problem — it is a translation problem. You need to translate between embedding spaces, not just redirect API calls.
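To see why, note that two providers' models generally do not even agree on vector dimensionality, let alone geometry. A toy illustration with random stand-in vectors (the dimensions here are illustrative, not the models' documented sizes):

```python
import numpy as np

def cosine(a, b):
    # Standard cosine similarity; only defined for vectors in the same space
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for a stored document vector and a query vector that were
# produced by two different embedding models (illustrative dimensions)
stored = np.random.rand(3072)  # "primary model" document vector
query = np.random.rand(768)    # "fallback model" query vector

try:
    cosine(stored, query)
except ValueError as exc:
    print("incompatible spaces:", exc)
```

And even when two models happen to share a dimensionality, their coordinate systems are unrelated, so the similarity scores carry no signal.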
The projection-based failover pattern
Schift's approach uses a pre-trained projection matrix to make failover semantically correct. The setup requires two things:
- A trained projection matrix between your primary and fallback model.
- A routing layer that applies the projection to queries during failover.
When the primary model is healthy, queries are embedded normally. When the primary model fails, Schift embeds the query with the fallback model and applies the projection matrix to translate it into the primary model's space before searching. The stored vectors are unchanged. The retrieval quality is preserved.
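A projection like this can be learned with ordinary least squares over paired embeddings of the same texts. A minimal sketch, using random data in place of real embeddings and illustrative dimensions (this is the general technique, not Schift's internal training code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Paired embeddings of the same 1,000 texts from both models
# (illustrative dimensions; synthetic data for the sketch)
X = rng.normal(size=(1000, 768))      # fallback-model embeddings
W_true = rng.normal(size=(768, 512))  # unknown "true" relationship
Y = X @ W_true                        # primary-model embeddings

# Least-squares fit: find W minimizing ||X @ W - Y||
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# At failover time, translating a fallback-model query into the
# primary model's space is a single matrix multiply
query = rng.normal(size=768)
projected = query @ W
print(projected.shape)  # (512,)
```

The expensive part, embedding a sample of the corpus with both models, happens once at fit time; applying the learned matrix at query time is cheap.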
Setting up failover with Schift
from schift import Schift
client = Schift(api_key="sch_...")
# Step 1: Pre-train the projection matrix
# Do this once, store the projection ID
proj = client.migrate.fit(
    source="openai/text-embedding-3-large",
    target="google/gemini-embedding-004",
    db="postgresql://...",
    sample_ratio=0.001  # ~1,000 samples from your corpus
)
# proj["id"] = "proj_9f3a2c..."
# Step 2: Configure routing with fallback
client.routing.set(
    primary="openai/text-embedding-3-large",
    fallback="google/gemini-embedding-004",
    fallback_projection=proj["id"]  # applied to queries during failover
)

After this setup, your application code does not change at all:
# Your application code — unchanged
vec = client.embed(
    "quarterly revenue report",
    # No model specified — routing layer handles it
)
# During normal operation: embeds with OpenAI
# During OpenAI outage: embeds with Gemini, applies projection
# Result: same vector space, same retrieval quality

What happens during an outage
Schift detects provider failures through a combination of error responses and latency monitoring. When the primary model returns 5xx errors or times out, the router switches to fallback mode:
- Query is embedded with the fallback model (Gemini embedding-004).
- The pre-trained projection matrix is applied: query_vec @ W, a single sub-millisecond matrix multiply.
- The projected query vector is used to search your stored vectors (which are in the primary model's space).
- Retrieval proceeds normally. Quality is preserved at 96%+ of baseline.
The projection matrix is loaded in memory once at startup. Failover adds zero API latency — just one local matrix multiply.
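The routing logic during an outage can be sketched as a thin wrapper. This is an illustrative stand-in, not Schift's internals; the embed callables and the identity projection in the demo are hypothetical:

```python
import numpy as np

class FailoverRouter:
    """Sketch of error-triggered failover with query-side projection."""

    def __init__(self, primary_embed, fallback_embed, projection):
        self.primary_embed = primary_embed    # callable: text -> vector
        self.fallback_embed = fallback_embed  # callable: text -> vector
        self.projection = projection          # matrix: fallback -> primary space

    def embed(self, text):
        try:
            # Healthy path: embed with the primary model
            return self.primary_embed(text)
        except Exception:
            # Primary errored out: embed with fallback, then translate
            # into the primary model's space with one matrix multiply
            return self.fallback_embed(text) @ self.projection

# Demo with stubs: the primary always fails with a 503-style error
def flaky_primary(text):
    raise RuntimeError("503 Service Unavailable")

W = np.eye(4)  # identity projection, just for the demo
router = FailoverRouter(flaky_primary, lambda t: np.ones(4), W)
print(router.embed("quarterly revenue report"))
```

A production router would also treat timeouts and sustained latency spikes as failure signals, per the detection behavior described above.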
Validating failover quality before you need it
Do not wait for an outage to discover your failover quality. Run a benchmark against your actual query distribution:
report = client.bench.run(
    source="openai/text-embedding-3-large",
    target="google/gemini-embedding-004",
    projection=proj["id"],
    data="./eval_queries.jsonl",  # your annotated queries
    mode="failover"  # simulates query-only projection
)
print(report.verdict) # SAFE
print(report.recovery) # 0.961
print(report.p50_ms) # 0.4 — projection latency at p50
print(report.p99_ms) # 1.1 — projection latency at p99
A SAFE verdict means your failover mode preserves enough retrieval quality to be transparent to users. Run this benchmark when you first set up failover, and re-run it whenever your corpus changes significantly.
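The recovery number is essentially asking: of the results the primary model would have returned, how many does the projected query still find? A rough sketch of one such metric, top-k overlap against the baseline (this is an illustrative definition, not necessarily the exact one behind report.recovery):

```python
import numpy as np

def recovery_at_k(index_vecs, baseline_queries, projected_queries, k=10):
    """Mean fraction of baseline top-k results recovered by projected queries.

    index_vecs: (n_docs, dim) stored vectors in the primary model's space
    baseline_queries: (n_q, dim) queries embedded with the primary model
    projected_queries: (n_q, dim) fallback queries after projection
    """
    doc_norms = np.linalg.norm(index_vecs, axis=1)

    def topk(q):
        sims = index_vecs @ q / (doc_norms * np.linalg.norm(q))
        return set(np.argsort(-sims)[:k])

    overlaps = [
        len(topk(base) & topk(proj)) / k
        for base, proj in zip(baseline_queries, projected_queries)
    ]
    return float(np.mean(overlaps))
```

A perfect projection scores 1.0 (identical top-k results); the 0.961 above would mean roughly 96% of baseline results survive failover.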
Multi-provider failover chain
For teams with strict availability requirements, you can configure a fallback chain with multiple providers:
client.routing.set(
    primary="openai/text-embedding-3-large",
    fallback=[
        {
            "model": "google/gemini-embedding-004",
            "projection": "proj_9f3a2c..."
        },
        {
            "model": "openai/text-embedding-3-small",
            "projection": "proj_8e2b1d..."
        }
    ]
)

If OpenAI is fully down (all models), Schift falls back to Gemini. If Gemini is also unavailable — extremely unlikely, but possible — it falls back to text-embedding-3-small, which typically has a separate infrastructure path and availability profile.
The cost of not having failover
One hour of embedding API downtime on a production search system means:
- 100% of search queries return empty or degraded results.
- Customer support tickets for "search is broken."
- Potential SLA violations for enterprise customers.
- Reputation cost that persists past the incident itself.
Setting up the projection-based failover pattern takes about 45 minutes the first time. After that, provider outages become invisible to your users. For any team running embedding-based search in production, the ROI calculation is straightforward.
A note on query-side vs index-side failover
The pattern described here handles query-side failover: your stored vectors stay in the primary model's space, and you project queries during outages.
An alternative is index-side failover: pre-migrate all stored vectors to the fallback model's space and maintain two indexes. This gives you true zero-projection failover at the cost of 2x storage and 2x indexing overhead.
For most teams, query-side failover with projection is the right starting point. Index-side failover makes sense only when you have strict latency budgets that cannot absorb even a single matrix multiply — which in practice means sub-millisecond retrieval requirements.
Pick the approach that matches your requirements. The projection-based tooling works for both.