ScreenSense Voice is a multi-agent browser orchestrator that replaces the screenshot-paste-wait-read loop with a single voice command.

- The problem: Every day, millions of people screenshot their screen, paste it into ChatGPT, ask what to do, read the answer, then go back and do it manually. Over and over.
- ScreenSense eliminates this entirely. Hold one key, speak naturally, and a pipeline of six AI agents kicks in:
  1. ElevenLabs transcribes your voice in real time
  2. Firecrawl scrapes the full page into clean markdown, giving the AI complete page context, not just what's visible
  3. A vision agent captures your screen and extracts every interactive element with precise CSS selectors
  4. Claude reasons about the screenshot, the full page content, and your command, then returns a structured action
  5. The browser executes the action autonomously: clicking, typing, scrolling, navigating
  6. The loop repeats with fresh context until your task is complete (up to 25 steps)
- How it uses ElevenLabs: Voice-to-text transcription (primary STT) and natural voice readback (TTS) via the streaming API for instant audio response.
- How it uses Firecrawl: Every voice command triggers a Firecrawl scrape that converts the full page into LLM-ready markdown. This gives the agent context about content below the fold: forms it can't see in the screenshot, data hidden in tabs, full article text. The agent reads the entire page before acting, enabling intelligent form-filling (it knows which fields to ask about) and deep page understanding.
- Built as a Chrome MV3 extension with a FastAPI backend. 434 tests. Open source.
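The six-step pipeline above is, at its core, a perceive-reason-act loop with a hard step cap. Here is a minimal sketch of that loop in Python; the function names (`perceive`, `plan`, `execute`) and the action dictionary shape are illustrative assumptions, not the project's actual API:

```python
from typing import Callable

MAX_STEPS = 25  # the post's hard cap on loop iterations


def run_task(command: str,
             perceive: Callable[[], dict],
             plan: Callable[[dict], dict],
             execute: Callable[[dict], None]) -> list[dict]:
    """Repeat perceive -> reason -> act until the model signals 'done'.

    perceive() stands in for the screenshot capture, Firecrawl markdown
    scrape, and CSS-selector extraction; plan() stands in for the LLM call
    that returns a structured action; execute() applies it to the browser.
    """
    history: list[dict] = []
    for _ in range(MAX_STEPS):
        context = perceive()          # fresh page context on every step
        context["command"] = command
        context["history"] = history  # prior actions inform the next one
        action = plan(context)        # structured action from the LLM
        history.append(action)
        if action.get("type") == "done":
            break                     # task complete before the cap
        execute(action)               # click / type / scroll / navigate
    return history
```

Re-running `perceive()` on every iteration is what the post means by "fresh context": each action can change the page, so the scrape and screenshot must be redone before the next decision.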
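Step 4 says the LLM "returns a structured action" that the browser then executes. Before executing anything the model emits, an orchestrator like this typically validates the reply against a known schema. A sketch of such a validator, with a hypothetical action schema (the project's real field names may differ):

```python
import json

# Hypothetical schema: an action type plus an optional CSS selector/text.
ALLOWED_TYPES = {"click", "type", "scroll", "navigate", "done"}


def parse_action(raw: str) -> dict:
    """Turn the model's JSON reply into a validated action dict.

    Rejects unknown action types and actions that target the page
    without a CSS selector, so the executor only ever sees safe input.
    """
    action = json.loads(raw)
    kind = action.get("type")
    if kind not in ALLOWED_TYPES:
        raise ValueError(f"unknown action type: {kind!r}")
    if kind in ("click", "type") and not action.get("selector"):
        raise ValueError(f"{kind} action requires a CSS selector")
    return action
```

Validating here, rather than trusting the raw model output, is what keeps an autonomous executor from acting on a malformed or hallucinated instruction.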
Submitted 23 Mar 2026