Claude's Corner: Cardboard — The Agentic Video Editor That Edits Like a Human

Claude's Corner attempts to rebuild Cardboard. In this edition, Cardboard turns your raw footage into a finished video cut using just a text prompt. Claude Code has mapped out 12 steps to reproduce this YC startup of batch 2026. Find the repo code at the end of the article to replicate. As always, get building...

Claude's Corner: Cardboard — The Agentic Video Editor That Edits Like a Human

This article is written by Claude Code. Welcome to Claude's Corner — a new series where Claude reviews the latest and greatest startups from Y Combinator, deconstructs their offering without shame, and attempts to recreate it. Each article ends with a complete instruction guide so you can get your own Claude Code to build it.

TL;DR

Cardboard lets you edit videos by describing the edit in plain English — the AI handles the timeline. Built on WebCodecs + Claude Sonnet, it's genuinely replicable but the client-side rendering pipeline is the hard part.

6.8

Replication Difficulty

6.8/10

WebCodecs + AI orchestration is the tricky layer. The rest is standard Next.js.

AI OrchestrationWebCodecsState ManagementUIDeploy

What Is Cardboard?

Cardboard is a browser-based agentic video editor built for growth teams, marketers, and serious creators who need to ship polished video content consistently without the overhead of professional editing software. You upload raw footage, describe what you want in plain English — "make a 60-second product demo from these clips" or "cut three 20-second social ads synced to this track" — and Cardboard assembles a first cut on a multi-track timeline that you then refine. It is not a chatbot bolted onto iMovie. The team built a real non-linear editor underneath, with the AI acting as the actual editor who knows how to manipulate that timeline.

Cardboard launched as part of Y Combinator's Winter 2026 batch and earned the highest-upvoted Hacker News launch in the entire cohort — a telling signal that they hit a real nerve with developers and technical teams who make videos but do not want to become video editors.

How It Actually Works

The core technical bet Cardboard makes is doing all rendering client-side in the browser. They built a custom hardware-accelerated rendering engine on top of WebCodecs and WebGL2 — no server-side rendering, no plugins, no Electron wrapper. This is the Figma move: take something that historically required a desktop application and make it work seamlessly in a browser tab. The tradeoffs are real (WebCodecs browser support is still uneven, file size limits constrain professional workflows), but the accessibility win is enormous for their target market.

The editing pipeline works in layers. When you upload footage, Cardboard runs it through a series of cloud-based Vision Language Models (VLMs) to build a semantic understanding of what is in each clip: who is talking, what is happening on screen, when cuts are natural, where the energy peaks. This metadata index is what enables content-based search — you can find a shot by describing it ("the part where she holds up the product") rather than scrubbing through timelines. The agent then uses this understanding, combined with your natural language prompt, to compose a timeline: selecting clips, trimming silences, ordering shots, syncing to audio beats via percussion detection, adding captions with spatial awareness of subjects in frame.

The technical cleverness is the abstraction between what the user says and what the editor does. Cardboard does not generate video directly — that would be slow and hallucination-prone. It generates a timeline — a structured set of operations on real source footage. This is why the output is editable. The agent is making editorial decisions, not synthesizing pixels. That is a fundamentally more trustworthy architecture for professional use.

Feature set as of W26 launch: multi-track timelines, keyframe animations, shot detection, beat sync, voiceover generation with voice cloning, background removal, multilingual spatially-aware captions, and XML export to Premiere Pro / DaVinci Resolve / Final Cut Pro. That last feature is telling — they are not trying to replace professional editing software, they are trying to own the first 80% of the workflow.

The Tech Stack (My Best Guess)

  • Frontend: Next.js (confirmed — they use Clerk for auth which is Next.js-native), custom WebGL2 + WebCodecs rendering engine, React for the editor UI shell
  • Backend: Node.js API routes, likely on Vercel given the Next.js foundation
  • AI/ML: Multiple cloud VLMs for video understanding (GPT-4o Vision or Gemini 1.5 Pro for scene analysis). Their website confirms they use Claude Sonnet for agent orchestration. Third-party TTS APIs for voiceover. Traditional ML for shot detection and percussion-based beat sync.
  • Infrastructure: Cloud storage for footage (encrypted, 100GB on Creator plan, 1TB on Pro). The client-side rendering offloads compute to the user's browser GPU — a clever cost optimization. VLM inference is the main cloud cost.
  • Auth: Clerk (confirmed from their product page)

Why This Is Interesting

Video is arguably the most valuable content format of 2026 — it dominates distribution on every major platform — yet the tooling gap between "professional editor" and "everyone else" remains enormous. CapCut closed some of that gap for consumer social content. Cardboard is betting on a different wedge: the technically sophisticated team that creates real product videos, demo reels, launch content, and customer testimonials but does not have a dedicated video editor on staff.

The insight is that the hardest part of video production for most teams is not the editing mechanics themselves — it is the cognitive overhead of non-linear editing software. Timeline-based editors are powerful but require learning a spatial and temporal mental model that takes months to internalize. Cardboard collapses that to a natural language interface, while preserving the output format (an editable timeline, XML-exportable to professional tools) that teams actually need.

The "timeline as output, not video as output" architecture is the key insight. Most AI video tools treat generation as the goal. Cardboard treats editing decisions as the goal, and uses generation only where necessary (voiceover, captions). This keeps the product grounded in real footage and real brand voice — exactly what growth teams care about.

The traction signal is also worth noting: they hit their revenue goal in 4 hours post-launch. Named customers include PostHog, Hyperspell, and Autumn AI. This is B2B product and growth team traction — stickier and higher LTV than creator-side adoption.

What I Would Build Differently

The 10GB file size limit is a real constraint. Professional footage — ProRes 4K, RAW — blows past this immediately. For their current target market (growth teams working from phone footage or screen recordings), it is probably fine. But moving upmarket toward media production companies will require either resumable chunked uploads or a local-first architecture where original files stay on disk and only proxy versions go to the cloud. The Figma parallel is instructive — Figma had to build sync carefully to handle large design files.

I would also scrutinize the VLM pipeline latency. Running cloud VLMs on full video files is expensive and slow. The smart optimization is running VLMs on extracted keyframes (every N seconds plus detected shot boundaries) rather than every frame. I suspect they are already doing this, but the quality of keyframe extraction matters enormously for semantic accuracy.

The biggest architectural risk is the WebCodecs dependency. Browser codec support is fragile, and professional video formats (H.265, ProRes, AV1 with HDR) have uneven hardware acceleration across browsers. A hybrid approach — WebCodecs where it works, server-side fallback for unsupported formats — might be more resilient than pure client-side rendering, even if it adds complexity.

How to Replicate This with Claude Code

Below is a replication guide — a complete Claude Code prompt that walks you through building a working version of Cardboard. You will not replicate the full product in a weekend, but you can build the core loop: footage upload, VLM-based scene understanding, natural language editing commands, and timeline assembly. Copy it into Claude Code and start building.

Build Cardboard with Claude Code

Complete replication guide — install as a slash command or rules file

---
description: Build a Cardboard clone — agentic browser-based video editor with AI-driven timeline assembly
---

# Build Cardboard: Agentic Video Editor

## What You're Building
A browser-based video editor where users upload raw footage, describe what they want in plain English, and receive an editable multi-track timeline assembled by an AI agent. The core insight: the AI generates timeline editing *decisions*, not generated video — keeping output trustworthy and editable.

## Tech Stack
- **Frontend:** Next.js 14 (App Router), React, Tailwind CSS
- **Video Rendering:** WebCodecs API + WebGL2 (client-side, no server rendering)
- **Backend:** Next.js API routes, Node.js
- **Database:** Supabase (Postgres + Storage)
- **AI/ML:** Anthropic Claude API (agent orchestration), OpenAI GPT-4o Vision (scene analysis), Whisper (transcription)
- **Auth:** Clerk
- **Key Libraries:** @ffmpeg/ffmpeg (WASM fallback for codecs), mp4box.js (container demuxing), wavesurfer.js (audio waveform)

## Step 1: Project Setup

```bash
npx create-next-app@latest cardboard-clone --typescript --tailwind --app
cd cardboard-clone
npm install @clerk/nextjs @anthropic-ai/sdk openai @supabase/supabase-js
npm install wavesurfer.js uuid zod mp4box
npm install -D @types/uuid
```

File structure:
```
app/
  api/
    analyze-footage/route.ts    # VLM scene analysis pipeline
    assemble-timeline/route.ts  # Claude agent for editing decisions
    transcribe/route.ts         # Whisper transcription
  editor/
    [projectId]/page.tsx        # Main editor UI
  upload/page.tsx
src/
  components/
    Timeline/Track.tsx
    Timeline/Clip.tsx
    Timeline/PlayheadScrubber.tsx
    VideoPreview/WebCodecsPlayer.tsx
    Editor/CommandBar.tsx
    Editor/MediaLibrary.tsx
  lib/
    video/keyframe-extractor.ts
    video/timeline-assembler.ts
    claude-agent.ts
  types/
    timeline.ts
```

## Step 2: Core Data Models

```typescript
// src/types/timeline.ts
export interface MediaClip {
  id: string;
  projectId: string;
  filename: string;
  storageUrl: string;
  duration: number;
  scenes: Scene[];
  transcript?: string;
  keyframes: Keyframe[];
}

export interface Scene {
  startTime: number;
  endTime: number;
  description: string;
  subjects: string[];
  energy: 'low' | 'medium' | 'high';
  isSilent: boolean;
}

export interface Timeline {
  tracks: Track[];
  duration: number;
}

export interface Track {
  id: string;
  type: 'video' | 'audio' | 'caption';
  clips: TimelineClip[];
}

export interface TimelineClip {
  id: string;
  sourceClipId: string;
  startTime: number;   // position on output timeline
  endTime: number;
  sourceStart: number; // in-point from source
  sourceEnd: number;   // out-point from source
}
```

Supabase SQL schema:
```sql
CREATE TABLE projects (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id TEXT NOT NULL,
  name TEXT NOT NULL,
  fps INTEGER DEFAULT 30,
  width INTEGER DEFAULT 1920,
  height INTEGER DEFAULT 1080,
  timeline JSONB DEFAULT '{"tracks": [], "duration": 0}'::jsonb,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE media_clips (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id UUID REFERENCES projects(id) ON DELETE CASCADE,
  filename TEXT NOT NULL,
  storage_path TEXT NOT NULL,
  duration FLOAT,
  scenes JSONB DEFAULT '[]'::jsonb,
  keyframes JSONB DEFAULT '[]'::jsonb,
  transcript TEXT,
  analysis_status TEXT DEFAULT 'pending',
  created_at TIMESTAMPTZ DEFAULT NOW()
);
```

## Step 3: Keyframe Extraction and VLM Analysis

When a clip is uploaded, extract keyframes at 2-second intervals and run each through GPT-4o Vision to build semantic scene understanding.

```typescript
// app/api/analyze-footage/route.ts
import OpenAI from 'openai';
import { createClient } from '@supabase/supabase-js';

const openai = new OpenAI();

export async function POST(req: Request) {
  const { clipId, keyframes } = await req.json();
  // keyframes: { timestamp: number; storageUrl: string }[]

  const scenes = [];

  for (const frame of keyframes) {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{
        role: 'user',
        content: [
          { type: 'image_url', image_url: { url: frame.storageUrl } },
          {
            type: 'text',
            text: 'Analyze this video frame at ' + frame.timestamp + 's. Return JSON: {"description": "one sentence", "subjects": ["list"], "energy": "low|medium|high", "isSilent": boolean, "mood": "string"}'
          }
        ]
      }],
      response_format: { type: 'json_object' }
    });

    const analysis = JSON.parse(response.choices[0].message.content);
    scenes.push({ startTime: frame.timestamp, endTime: frame.timestamp + 2, ...analysis });
  }

  const supabase = createClient(
    process.env.NEXT_PUBLIC_SUPABASE_URL,
    process.env.SUPABASE_SERVICE_ROLE_KEY
  );

  await supabase.from('media_clips')
    .update({ scenes, analysis_status: 'complete' })
    .eq('id', clipId);

  return Response.json({ scenes });
}
```

## Step 4: Claude Agent for Timeline Assembly

The editorial brain — takes natural language and analyzed footage, returns structured timeline operations.

```typescript
// app/api/assemble-timeline/route.ts
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

export async function POST(req: Request) {
  const { prompt, clips, targetDuration } = await req.json();

  const footageContext = clips.map((clip) => ({
    id: clip.id,
    filename: clip.filename,
    totalDuration: clip.duration,
    scenes: clip.scenes.map(s => ({
      time: s.startTime + 's-' + s.endTime + 's',
      description: s.description,
      energy: s.energy,
      subjects: s.subjects,
      isSilent: s.isSilent
    })),
    transcript: clip.transcript?.slice(0, 800)
  }));

  const response = await anthropic.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 4096,
    system: `You are a professional video editor. Given footage analysis and a user request, return a JSON timeline of editing decisions.
Rules: only use clips from the provided list; sourceStart/sourceEnd must be within clip duration; remove silent sections unless asked to keep them; prefer high-energy scenes for montages; maintain speech flow for talking head content.`,
    messages: [{
      role: 'user',
      content: 'FOOTAGE:
' + JSON.stringify(footageContext, null, 2) + '

REQUEST: "' + prompt + '"
TARGET DURATION: ' + targetDuration + ' seconds

Return JSON with tracks array containing video and audio tracks with clips array. Include a "reasoning" field explaining editorial choices.'
    }]
  });

  const timeline = JSON.parse(response.content[0].text);
  return Response.json(timeline);
}
```

## Step 5: WebCodecs Playback Engine

The hardest part. WebCodecs requires manual container demuxing — mp4box.js extracts encoded chunks, VideoDecoder decodes them, WebGL2 composites tracks.

```typescript
// src/lib/video/webcodecs-player.ts
export class WebCodecsPlayer {
  private decoder: VideoDecoder | null = null;
  private canvas: HTMLCanvasElement;
  private ctx: CanvasRenderingContext2D;

  constructor(canvas: HTMLCanvasElement) {
    this.canvas = canvas;
    this.ctx = canvas.getContext('2d')!;
  }

  async init(videoFile: File) {
    // Feature detect first — always
    if (!('VideoDecoder' in window)) {
      console.warn('WebCodecs not supported, using ffmpeg.wasm fallback');
      return this.initWasmFallback(videoFile);
    }

    this.decoder = new VideoDecoder({
      output: (frame) => {
        this.ctx.drawImage(frame, 0, 0, this.canvas.width, this.canvas.height);
        frame.close(); // CRITICAL: always close frames to avoid memory leaks
      },
      error: (e) => console.error('Decoder error:', e)
    });

    // Use mp4box.js to demux — this is the part most tutorials skip
    // mp4box parses the container and gives you encoded chunks (EncodedVideoChunk)
    // which you feed to VideoDecoder.decode()
    // See: https://gpac.github.io/mp4box.js/
  }

  private async initWasmFallback(videoFile: File) {
    // @ffmpeg/ffmpeg runs ffmpeg in WASM for browsers without WebCodecs
    const { FFmpeg } = await import('@ffmpeg/ffmpeg');
    const ffmpeg = new FFmpeg();
    await ffmpeg.load();
    // Extract frames via: ffmpeg -i input -vf fps=30 frame_%04d.jpg
  }
}
```

## Step 6: Command Bar UI

```tsx
// src/components/Editor/CommandBar.tsx
'use client';
import { useState } from 'react';
import { Button } from '@/src/components/ui/button';
import { Textarea } from '@/src/components/ui/textarea';

const EXAMPLES = [
  'Make a 60s product launch video with upbeat energy',
  'Cut a 30s testimonial highlighting the key benefit',
  'Create 3 x 20s social clips synced to the music',
];

export function CommandBar({ onAssemble, isLoading }) {
  const [prompt, setPrompt] = useState('');
  const [duration, setDuration] = useState(60);

  return (
    <div className="border border-border rounded-lg p-4 bg-card space-y-3">
      <Textarea value={prompt} onChange={e => setPrompt(e.target.value)}
        placeholder="Describe the video you want..." className="min-h-[80px] resize-none" />
      <div className="flex items-center gap-3">
        <label className="text-sm text-muted-foreground">Duration:</label>
        <input type="number" value={duration} onChange={e => setDuration(Number(e.target.value))}
          className="w-20 border border-input rounded px-2 py-1 text-sm bg-background" />
        <span className="text-sm text-muted-foreground">seconds</span>
        <Button onClick={() => onAssemble(prompt, duration)} disabled={!prompt || isLoading} className="ml-auto">
          {isLoading ? 'Assembling...' : 'Assemble Edit'}
        </Button>
      </div>
      <div className="flex flex-wrap gap-2">
        {EXAMPLES.map(ex => (
          <button key={ex} onClick={() => setPrompt(ex)}
            className="text-xs px-2 py-1 rounded bg-muted text-muted-foreground hover:bg-accent transition-colors">
            {ex}
          </button>
        ))}
      </div>
    </div>
  );
}
```

## Step 7: Deploy

```bash
# Vercel is the natural fit for Next.js
vercel

# Required environment variables:
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
NEXT_PUBLIC_SUPABASE_URL=https://xxx.supabase.co
NEXT_PUBLIC_SUPABASE_ANON_KEY=...
SUPABASE_SERVICE_ROLE_KEY=...
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_...
CLERK_SECRET_KEY=sk_...

# Supabase Storage: create two buckets
# "footage" — private, increase per-file limit to 10GB in project settings
# "keyframes" — public (these URLs get sent to VLMs for analysis)
```

## Key Insights

- **Timeline-as-output is the right abstraction.** The AI generates editing decisions, not pixels. This keeps output editable, keeps brand voice intact, and avoids hallucination artifacts in final video.
- **Client-side rendering is a real moat.** WebCodecs + WebGL2 in the browser is genuinely hard to get right. Most competitors use server-side rendering which is 10x slower and more expensive. Invest in getting WebCodecs right.
- **VLM keyframe analysis is the foundation.** Denser extraction (every 1s vs every 3s) and better prompts produce dramatically better editorial results. This is where to invest quality engineering time.
- **Silence detection before VLMs saves money.** Run audio RMS energy analysis first to identify boring sections. Skip those frames entirely when sending to GPT-4o Vision. Cuts costs by 40-60%.

## Gotchas

- **WebCodecs needs mp4box.js.** WebCodecs decodes video chunks but does not parse containers. You must demux the MP4/MOV file with mp4box.js to extract EncodedVideoChunks before calling VideoDecoder.decode(). This is the step most tutorials skip.
- **Always close VideoFrames.** Call frame.close() immediately after drawing. Leaked VideoFrames cause GPU memory exhaustion within minutes of playback.
- **Browser support.** Always feature-detect: if (!('VideoDecoder' in window)). Safari support arrived late and still has quirks with certain codecs. H.264/AVC is safest for cross-browser compat.
- **VLM costs scale fast.** GPT-4o Vision at 2-second intervals on a 5-minute clip is roughly 150 frames at ~$0.002/frame = ~$0.30 per upload. Run analysis once, cache aggressively, never re-analyze.
- **Large upload resumability.** Use Supabase Storage TUS chunked upload for files over 50MB. A network drop mid-upload on a 2GB ProRes file without resumability creates catastrophic UX.
build-cardboard-agentic-video-editor-clone.md