Claude's Corner: Captain, The RAG Infrastructure Play That's Playing Bloomberg

Captain (YC W2026) is building managed RAG-as-a-service: two API calls to connect your data sources, a reported 95% retrieval accuracy via contextual embeddings, hybrid search, and reranking, and an Odyssey data pivot that looks a lot like the Bloomberg Terminal strategy. Here's the architecture, the moat, and how to build a clone.

May 27 at 11:22 AM · 8 min read

TL;DR

Captain is a managed RAG-as-a-service platform that reports 95% retrieval accuracy via contextual embeddings, hybrid search, and multi-stage reranking; two API calls replace months of pipeline engineering. Their Odyssey private market intelligence dataset is a Bloomberg Terminal play that transforms the business from infrastructure pipe to proprietary data moat.

Build difficulty: 6.0

Every AI team eventually builds RAG. Every AI team also eventually hates the RAG they built.

You start with naive chunking: split every 512 tokens, done. Then you realize your PDFs have tables that get split across chunks, your embedding model doesn't know that "Q3 revenue" in chunk 14 refers to the fiscal year defined in chunk 1, and your retrieval returns the wrong page 30% of the time. You spend three months tuning chunk sizes, overlap windows, and reranking thresholds. Your accuracy plateaus at 78%.

Captain's pitch is simple: stop building this yourself. Two API calls, managed pipeline, 95% accuracy. Lewis Polansky and Edgar Babajanyan (the CTO, formerly of Purdue's NLP lab) have spent four years inside this problem. They built Captain because they kept getting hired to fix broken RAG pipelines and realized nobody actually wants to be in the retrieval infrastructure business.

What They Build

Captain is a managed RAG-as-a-service platform. You connect your data sources (S3, GCS, Azure Blob, SharePoint, Google Drive, Dropbox, Confluence, Slack, Gmail, Notion) and Captain handles everything from there: OCR, chunking, embedding, vector storage, hybrid search, reranking, and citation extraction. One /collections/query endpoint.

The target customer is an engineering team shipping an AI agent or assistant that needs accurate document retrieval without making retrieval infrastructure their second job. Pricing: $295/month (Starter), $1,600/month (Growth, 83k credits/month), Enterprise custom. They're SOC2 Type II certified, which matters a lot for the buyers they're going after.

Multimodal from day one: documents, PDFs, images, video, audio. Their accuracy claim is 95% on MRAG-Bench, an ICLR-published benchmark, versus ~78% for typical DIY pipelines. The numbers are self-reported but the methodology is public, which is more than most competitors offer.

In March 2026, they launched Odyssey, a private market intelligence dataset queryable via their API. VC deals, fund performance, LP profiles, company financials, exit probability predictions, patent filings. Bloomberg Terminal meets RAG endpoint. This is the move that actually matters.

How It Actually Works

The pipeline has four distinct stages where Captain makes non-obvious choices.

Ingestion and OCR. Not all documents are equal. Captain routes files through different extractors based on complexity. Gemini 3 Pro handles images and mixed-media content. Reducto handles complex structured documents: forms, tables, multi-column layouts. Extend handles basic text extraction. Everything gets converted to clean Markdown, a deliberate choice that forces a uniform intermediate representation before any chunking happens.

Chunking. This is where Edgar's Purdue NLP research lives. The chunking isn't fixed-window. It's context-aware, informed by document structure (headers, paragraphs, tables) and semantic coherence. Chunks that belong together conceptually stay together. The specific techniques are proprietary, but Edgar's published work on contextualized retrieval gives you a window into the approach. This is the hardest part to replicate accurately.

Embedding and Retrieval. Captain uses Voyage's voyage-context-3, a contextual embedding model that encodes surrounding context at embedding time rather than treating each chunk independently. Notably, they tested Voyage 4 (the newer model) and found voyage-context-3 outperformed it on their actual retrieval benchmarks. Better accuracy on real data beats the latest model number. They don't chase benchmarks that don't matter for their use case.

Retrieval is hybrid: dense vector similarity plus BM25 full-text search, fused with Reciprocal Rank Fusion (RRF). Neither keyword matching alone nor pure vector similarity wins. RRF gives them both precision on exact terminology and recall on semantic similarity; the combination consistently outperforms either method solo.
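
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. The function name `rrf_fuse`, the sample chunk IDs, and the constant k=60 (a value commonly used for RRF) are illustrative assumptions, not Captain's implementation.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of IDs with Reciprocal Rank Fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. A higher fused score means a better result.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, best first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from the two retrievers (chunk IDs, best first).
dense_hits = ["c7", "c2", "c9"]
bm25_hits = ["c2", "c4", "c7"]

fused = rrf_fuse([dense_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])
```

Chunks that appear in both lists (c2 and c7 here) end up ahead of chunks that only one retriever found, which is the behavior the hybrid approach relies on. RRF uses only ranks, so the raw cosine and keyword scores never need to be put on a common scale.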

Reranking. The initial hybrid pass returns 50 candidates. Voyage's rerank-2.5 model collapses that to the final top 15, using cross-encoder attention to score each chunk against the query holistically. The 50→15 collapse is where most RAG systems leave significant accuracy on the table: they return first-pass results and call it done. Captain treats reranking as a separate, critical production step, not an optional optimization.

Output includes automatic page-number citations. Enterprise teams will not accept retrieval results without audit trails. This detail separates demo toys from production systems.

The Odyssey Angle

The most interesting thing about Captain isn't the RAG platform. It's what the Odyssey launch implies about where they're going.

Developer infrastructure is a brutal business. Developers hate paying for pipes they believe they can build themselves (see: the skeptical HN comments on their launch). A SOC2-certified RAG endpoint is genuinely useful, but it's not a durable moat if Voyage, Cohere, and every major cloud provider are building adjacent capabilities.

Odyssey changes the math. Private market intelligence data (VC deal flow, fund performance metrics, LP profiles, exit probability signals) is not reproducible by a weekend warrior. Building a reliable, clean, continuously updated dataset of private market activity takes years of sourcing relationships, normalization work, and data quality investment. Nobody scrapes their way to accurate LP profiles.

The Bloomberg Terminal strategy: give the software away relatively cheaply, make the money on proprietary data. Customers who integrate Odyssey into their investment research workflows are not leaving. The switching cost isn't the RAG plumbing; it's the workflows their analysts have built on top of data nobody else has.

YC president Garry Tan called Captain "a step function increase vs existing RAG pipelines." That's a strong public endorsement, but what really validates the thesis is whether Odyssey grows into something enterprise investors actually depend on. That's a 3-5 year build.

Difficulty Scores

ML/AI: 7/10. Contextual embeddings, structure-aware chunking informed by real NLP research, multi-stage reranking pipeline. Not frontier ML, but applied research that took years to calibrate correctly. The gap between "implemented" and "accurate" here is real.
Data: 6/10. The Odyssey dataset sourcing is genuinely hard. The RAG pipeline itself handles complex multimodal content across 1,000+ integration sources. Data quality at scale is an underrated challenge.
Backend: 7/10. Multi-tenant SaaS with enterprise reliability requirements, SOC2 compliance, complex async processing queues, graceful failure handling across multiple third-party OCR and embedding APIs, per-org data isolation.
Frontend: 3/10. API-first product. Developer dashboard exists but it's not the product. The UX complexity lives in the API design, not the UI.
DevOps: 7/10. Managing OCR jobs, embedding queues, and vector index updates across a diverse integration catalog with SLA commitments requires serious infrastructure discipline. Latency consistency across providers is hard.

The Moat

The RAG platform has a thin moat on its own. Edgar's chunking research is real IP, but the retrieval field is moving fast and advanced chunking is increasingly table stakes. The specific stack (Voyage contextual embeddings, hybrid RRF retrieval, Voyage reranking) is reproducible by any competent ML team in a few weeks. Every component is a public API or open-source tooling.

What's harder to replicate:

Accuracy benchmarking discipline. Publishing claims against MRAG-Bench (an ICLR-published benchmark) creates a public standard that competitors have to beat on the same test. It's a positioning move as much as a technical one: enterprise buyers now have a number to hold you to, which is a double-edged sword but forces the market to compete on published metrics rather than vibes.

SOC2 Type II certification. Not technically hard but it takes 6-12 months and real legal and compliance investment. Meaningful barrier for smaller competitors trying to sell into the enterprise accounts Captain is targeting.

Odyssey, if executed well. This is the actual bet. If the private market intelligence dataset becomes comprehensive, accurate, and continuously updated, the switching costs become high. Enterprise workflows built on proprietary data don't migrate. That's the moat, not the chunking.

The biggest risk: AWS Bedrock Knowledge Bases, Azure AI Search, and GCP Vertex AI Search are all shipping managed RAG. Cloud-native buyers will default to their existing vendor. Captain's survival depends on either (a) Odyssey becoming genuinely irreplaceable, or (b) achieving integration depth in enterprise workflows before the hyperscalers commoditize the pure retrieval piece.

Replicability Score: 38 / 100

The core RAG pipeline (OCR routing, context-aware chunking, contextual embeddings, hybrid RRF retrieval, reranking) is entirely replicable with public APIs and open-source tooling. A strong ML engineer can ship a credible clone in 2-4 weeks. Voyage's APIs are available to anyone. Reducto has a public API. BM25 + RRF is textbook IR.

What earns 38 rather than 20: the chunking research takes real NLP expertise to match accurately; multimodal, multi-format support at production scale is genuinely complex; SOC2 compliance takes time and money; and Odyssey, if it executes, transitions this from a replicable pipe to a hard data moat. The gap between "can clone the pipeline" and "can match the accuracy claims across all file types" is meaningful in practice.

Score would drop to 25 if Odyssey becomes the actual product and scales. Score would rise to 55 if they stay pure infrastructure while the hyperscalers close the accuracy gap. The strategic choice Captain makes over the next 18 months will determine which number is right.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Build Captain: Managed RAG-as-a-Service

A step-by-step guide to building a Captain clone using Claude Code.

## Step 1: Data Model & Schema

Create a PostgreSQL (with pgvector) database with these core tables:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Collections: named groups of documents
CREATE TABLE collections (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  org_id UUID NOT NULL,
  name TEXT NOT NULL,
  description TEXT,
  metadata JSONB DEFAULT '{}'::jsonb,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Documents: individual files within a collection
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  collection_id UUID REFERENCES collections(id) ON DELETE CASCADE,
  source_type TEXT NOT NULL,
  source_uri TEXT NOT NULL,
  file_name TEXT,
  file_type TEXT,
  status TEXT DEFAULT 'pending',
  metadata JSONB DEFAULT '{}'::jsonb,
  indexed_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

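-- Chunks: the retrieval units
CREATE TABLE chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
  collection_id UUID NOT NULL REFERENCES collections(id) ON DELETE CASCADE,
  content TEXT NOT NULL,
  page_number INTEGER,
  chunk_index INTEGER,
  embedding VECTOR(1024),
  -- Kept in sync with content automatically. Postgres full-text search
  -- stands in for BM25 here; ts_rank is not a true BM25 score.
  bm25_tokens TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
  metadata JSONB DEFAULT '{}'::jsonb,
  created_at TIMESTAMPTZ DEFAULT NOW()
);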

CREATE INDEX idx_chunks_collection ON chunks(collection_id);
CREATE INDEX idx_chunks_embedding ON chunks USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX idx_chunks_bm25 ON chunks USING GIN(bm25_tokens);
```
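
The IVFFlat index above is created with pgvector's default number of lists, which is usually worth tuning once the table has data. As a rough sketch, the helper below encodes the starting points suggested in pgvector's documentation: rows / 1000 lists for tables up to about 1M rows, sqrt(rows) beyond that, and sqrt(lists) probes at query time. The function name `ivfflat_params` is hypothetical, and the returned values are starting points to benchmark, not guarantees.

```python
import math


def ivfflat_params(row_count: int) -> tuple[int, int]:
    """Return starting (lists, probes) values for a pgvector IVFFlat index."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = max(1, round(math.sqrt(row_count)))
    # More probes improve recall at the cost of query speed.
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes


print(ivfflat_params(200_000))
print(ivfflat_params(4_000_000))
```

The `lists` value goes into the index definition (`WITH (lists = ...)`) and the `probes` value into the session setting `ivfflat.probes`. IVFFlat picks its cluster centers from existing rows, so it is generally better to build the index after the initial bulk load.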

## Step 2: Ingestion Pipeline

Build an async worker that routes documents to the right OCR extractor:

```python
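class IngestionWorker:
    # db, is_complex, split_by_headers, and split_with_overlap are assumed
    # helpers defined elsewhere in the project.
    async def process_document(self, document_id: str):
        doc = await db.get_document(document_id)

        # Route each file to an extractor based on type and layout complexity.
        if doc.file_type in ['pdf', 'docx']:
            if is_complex(doc):  # tables, multi-column
                markdown = await self.reducto_extract(doc)
            else:
                markdown = await self.extend_extract(doc)
        elif doc.file_type in ['jpg', 'png']:
            markdown = await self.gemini_extract(doc)
        else:
            # Fail loudly instead of hitting an unbound `markdown` below.
            raise ValueError(f"Unsupported file type: {doc.file_type!r}")

        chunks = self.semantic_chunk(markdown)
        await self.embed_and_store(chunks, doc.collection_id)

    def semantic_chunk(self, markdown: str) -> list[dict]:
        # Split by markdown headers first; oversized sections are then split
        # into windows with 15% overlap. Sizes are in characters.
        sections = split_by_headers(markdown)
        chunks = []
        for section in sections:
            if len(section['content']) < 1500:
                chunks.append(section)
            else:
                # split_with_overlap is assumed to return strings; wrap each
                # piece so every chunk is a dict with the same keys.
                for piece in split_with_overlap(section['content'], max_size=1200, overlap=0.15):
                    chunks.append({**section, 'content': piece})
        return chunks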
```
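
The worker above calls `split_by_headers` and `split_with_overlap` without defining them. A minimal sketch of what they might look like follows. Both are hypothetical helpers written for this guide: they work on characters rather than tokens, they do not special-case headers inside code fences, and they are a simple stand-in, not Captain's proprietary chunker.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into sections, recording each section's header path."""
    sections: list[dict] = []
    path: list[str] = []    # stack of header titles, e.g. ["Report", "Q3"]
    buffer: list[str] = []  # body lines of the section being built

    def flush() -> None:
        content = "\n".join(buffer).strip()
        if content:
            sections.append({"header_path": " > ".join(path), "content": content})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the previous section under its own header path
            level = len(match.group(1))
            del path[level - 1:]  # drop headers at the same or deeper level
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return sections


def split_with_overlap(text: str, max_size: int = 1200, overlap: float = 0.15) -> list[str]:
    """Slide a fixed-size character window over text with fractional overlap."""
    step = max(1, round(max_size * (1 - overlap)))
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_size])
        if start + max_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


doc = "# Report\nIntro text.\n## Q3\nRevenue rose.\n## Q4\nRevenue fell.\n"
for section in split_by_headers(doc):
    print(section["header_path"], "->", section["content"])
```

Carrying `header_path` on each section is what lets a later step tell the embedding model that a "Revenue rose." chunk belongs to "Report > Q3" rather than treating it as free-floating text.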

## Step 3: Contextual Embeddings

Use Voyage's voyage-context-3 and prepend the section header path to each chunk before embedding:

```python
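import asyncio
import voyageai

voyage = voyageai.Client(api_key=VOYAGE_API_KEY)  # VOYAGE_API_KEY: loaded from your config

async def embed_chunks(chunks: list[dict]) -> list[list[float]]:
    """Embed the ordered chunks of ONE document; returns one vector per chunk."""
    texts = [f"Context: {c.get('header_path', '')}\n\n{c['content']}" for c in chunks]
    # voyage-context-3 is called through contextualized_embed rather than embed:
    # each inner list holds one document's chunks, which are embedded together.
    # Very large documents may need splitting to fit Voyage's per-request limits.
    result = await asyncio.to_thread(  # the client is synchronous; keep the event loop free
        voyage.contextualized_embed,
        inputs=[texts], model='voyage-context-3', input_type='document',
    )
    return result.results[0].embeddings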
```

## Step 4: Hybrid Search with RRF

Combine dense vector search and BM25 keyword search using Reciprocal Rank Fusion:

```python
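async def hybrid_search(query: str, collection_id: str, top_k=50):
    # voyage-context-3 goes through contextualized_embed; a query is a one-item inner list.
    qe = voyage.contextualized_embed(
        inputs=[[query]], model='voyage-context-3', input_type='query'
    ).results[0].embeddings[0]

    # Assumes the pgvector codec is registered on the connection
    # (pgvector.asyncpg.register_vector) so the embedding can be bound as $1.
    vector_results = await db.fetch("""
        SELECT id, document_id, content, page_number, 1-(embedding <=> $1) AS score
        FROM chunks WHERE collection_id=$2 ORDER BY embedding <=> $1 LIMIT $3
    """, qe, collection_id, top_k)

    # Postgres full-text search with ts_rank stands in for BM25; it is not true BM25.
    kw_results = await db.fetch("""
        SELECT id, document_id, content, page_number, ts_rank(bm25_tokens, plainto_tsquery('english',$1)) AS score
        FROM chunks WHERE collection_id=$2 AND bm25_tokens @@ plainto_tsquery('english',$1)
        ORDER BY score DESC LIMIT $3
    """, query, collection_id, top_k)

    # RRF fusion: each list contributes 1/(k + rank), with k=60 and rank starting at 1.
    rrf = {}
    for results in (vector_results, kw_results):
        for rank, r in enumerate(results, start=1):
            rrf[r['id']] = rrf.get(r['id'], 0) + 1 / (60 + rank)

    # Convert rows to plain dicts so later stages can copy and extend them.
    all_chunks = {r['id']: dict(r) for r in list(vector_results) + list(kw_results)}
    ranked = sorted(rrf.items(), key=lambda x: -x[1])[:top_k]
    return [all_chunks[cid] for cid, _ in ranked]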
```

## Step 5: Cross-Encoder Reranking

Collapse 50 candidates to top 15 using Voyage rerank-2.5:

```python
async def rerank(query: str, candidates: list[dict], top_n=15) -> list[dict]:
    result = voyage.rerank(
        query=query,
        documents=[c['content'] for c in candidates],
        model='rerank-2.5',
        top_k=top_n
    )
    reranked = []
    for item in result.results:
        chunk = dict(candidates[item.index])  # works for dicts and for DB row objects
        chunk['relevance_score'] = item.relevance_score
        reranked.append(chunk)
    return reranked
```

## Step 6: REST API

Expose a clean FastAPI endpoint:

```python
from fastapi import FastAPI, Depends, HTTPException
from pydantic import BaseModel

# auth, db, and queue are assumed to be defined elsewhere in the project.
app = FastAPI()

class QueryRequest(BaseModel):
    query: str
    top_k: int = 15  # final result count after reranking

async def get_owned_collection(collection_id: str, org):
    """Load a collection and enforce per-org isolation."""
    collection = await db.get_collection(collection_id)
    if collection is None:
        raise HTTPException(404)
    if collection.org_id != org.id:
        raise HTTPException(403)
    return collection

@app.post('/v2/collections/{collection_id}/query')
async def query_collection(collection_id: str, req: QueryRequest, org=Depends(auth)):
    await get_owned_collection(collection_id, org)
    candidates = await hybrid_search(req.query, collection_id, top_k=50)
    results = await rerank(req.query, candidates, top_n=req.top_k)
    return {
        'results': [{
            'content': r['content'],
            'page_number': r.get('page_number'),
            'relevance_score': r['relevance_score'],
            'document_id': str(r['document_id'])
        } for r in results]
    }

@app.post('/v2/collections/{collection_id}/files')
async def add_file(collection_id: str, source_uri: str, org=Depends(auth)):
    # Same ownership check as the query route, so one org cannot write
    # into another org's collection.
    await get_owned_collection(collection_id, org)
    doc = await db.create_document(collection_id, source_uri)
    await queue.enqueue('process_document', {'document_id': str(doc.id)})
    return {'document_id': str(doc.id), 'status': 'queued'}
```

## Step 7: Deployment

Deploy the API and the workers as independently scalable services (shown here in Docker Compose format):

```yaml
services:
  api:
    image: captain-api
    ports: ["8000:8000"]  # a fixed host port can clash when replicas > 1 outside Swarm; front the API with a load balancer
    deploy:
      replicas: 3
  worker:
    image: captain-worker
    command: python -m worker.main
    deploy:
      replicas: 5  # scale independently from API
  db:
    image: pgvector/pgvector:pg16
  redis:
    image: redis:7-alpine
```

Key production decisions:
- IVFFlat index for vector search at scale (as a starting point, set `lists` to row_count / 1000 up to about 1M rows and to sqrt(row_count) beyond that)
- Separate worker pods: ingestion (OCR, chunking, embedding calls) is slow and resource-heavy, so decouple it from the request path
- Per-org row-level security in Postgres for SOC2 data isolation
- Redis cache for repeated query embeddings (5-min TTL covers burst traffic)
- Rate-limit Voyage API calls with token bucket (300 RPM on Starter tier)
- Estimated infra cost at 100 customers: ~$2k/month vs. ~$160k/month revenue (assuming every customer is on the $1,600/month Growth plan)
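
The token-bucket rate limit in the list above can be sketched in a few lines. The class name `TokenBucket`, its parameters, and the burst capacity of 5 are illustrative assumptions; the 300 requests per minute figure is the one quoted above. The injectable `clock` argument exists only so the behavior can be checked without sleeping.

```python
import time


class TokenBucket:
    """Allow `rate_per_minute` calls per minute, with bursts up to `capacity`."""

    def __init__(self, rate_per_minute: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = capacity
        self.tokens = float(capacity)       # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def try_acquire(self, cost: int = 1) -> bool:
        """Take `cost` tokens if available; return False if the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage with a fake clock: 300 RPM means one new token every 0.2 seconds.
fake_now = [0.0]
bucket = TokenBucket(rate_per_minute=300, capacity=5, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(6)]
print(burst)
```

In a worker, a call that gets `False` back would typically sleep briefly or requeue the job rather than hit the embedding API and absorb a rate-limit error. With several worker replicas, the bucket state would need to live in shared storage such as Redis instead of process memory.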
