VoiceInput vs Superwhisper
Superwhisper is the polished Whisper desktop wrapper for English-first users. VoiceInput is built around Chinese typing speed and turns every dictation into a searchable, persona-reviewable memory — not just text dropped at the cursor.
One-line verdict
If you're an English-first user who wants the cleanest Whisper desktop wrapper, Superwhisper is excellent. If you type Chinese (or mixed Chinese-English) all day, want under-1.5-second latency, and want everything you said to be searchable next month — VoiceInput is built for that.
Side-by-side comparison
| Dimension | VoiceInput | Superwhisper |
|---|---|---|
| End-to-end latency (Chinese) | ~1.4s | ~3-6s |
| End-to-end latency (English) | ~1.5s | ~1.5-3s |
| Mixed-language (CJK + English code/brand) | 200+ hotwords + pinyin disambiguation | Generic Whisper, frequent miscaps |
| Local / fully offline | Yes — SenseVoice / Paraformer / Apple | Yes — Whisper local models |
| Cloud option | Volcengine streaming (audio direct, no relay) | OpenAI Whisper / cloud Whisper variants |
| Memory layer (search past dictations) | Built-in. Every line archives with app, time, tags. Full-text search, export. | No. Text drops, then it's gone. |
| AI persona review | 7 built-in personas (Boss, Coach, Therapist, Editor…) re-read your week. Weekly Big5 sketch. | N/A |
| Local typography engine | CJK-Latin spacing, brand casing, unit spacing — all handled in <5ms, zero LLM call | N/A |
| Privacy boundary | Audio + text local; only recording-length seconds reported (toggleable) | Audio sent to cloud (Pro tier); local mode keeps everything offline |
| Free tier | 100% Local + BYOK forever — no card | 7-day trial, then paid |
| Paid tier | Cloud: $5/mo · $79/yr · $49 lifetime | $8.49/mo · $84/yr |
| Distribution | Direct DMG, Sparkle auto-update | Direct DMG, auto-update |
| Open API key (BYOK) | DeepSeek / Kimi / OpenAI / any OpenAI-compatible endpoint | Limited — Whisper-only routing |
Where Superwhisper wins
- English polish. Whisper's English transcription is industry-standard. If you only speak English, Superwhisper's output is hard to beat.
- Single-purpose simplicity. No memory features, no personas — just dictate and move on. Some people prefer that.
- Mature AI prompt library. Superwhisper has built-in prompt presets for email tone, tweet style, etc. VoiceInput's AI tidy stays focused on transcription correction.
Where VoiceInput wins
- Speed on Chinese. Volcengine streaming + local typography = sub-1.5-second on Mandarin and mixed CJK-English. Superwhisper's Whisper backbone is slower for the same input.
- Memory layer. Every voice line auto-archives. A month later you can search "what did I say about onboarding?" — Superwhisper drops the text and forgets.
- AI personas. Same line re-read by Boss, Coach, Therapist, Editor. Weekly MBTI sketch from your real talk. This is a different product category, not just dictation.
- Free tier covers real usage. 100% Local and BYOK paths are free forever. Superwhisper's free tier is a 7-day trial.
- BYOK flexibility. Drop in any OpenAI-compatible endpoint (DeepSeek, Kimi, your own server). Pay model providers directly, ~$0-2/month for typical use.
- Chinese typography. Half-width spacing between CJK and Latin, brand casing (Cursor, Kimi, API stay correct), unit spacing — all handled by a local rules engine in under 5ms.
Speed: where the 1.4 seconds come from
VoiceInput's end-to-end pipeline is built for one number: time from button release to text landing at your cursor. The path:
- Press right Option, audio streams to ASR (Volcengine) over a persistent connection — no handshake on each utterance.
- ASR returns partial transcripts in real time. The local typography engine fixes spacing, casing, and unit formatting in <5ms (no LLM call).
- If AI tidy is enabled, the cleaned text is sent to the LLM with a constrained prompt (three jobs only: homophones, fillers, punctuation). Confidence below 0.5 keeps the original.
- Text injects at the cursor via Accessibility API. Clipboard fallback triggers if the target field rejects the inject.
Superwhisper's Whisper-based pipeline batches audio into chunks of 200ms-1s before processing — fundamentally a different architecture, optimized for English transcription accuracy over real-time latency.
Memory layer: the real product difference
Most dictation apps stop at "speech becomes text." VoiceInput treats each utterance as data worth keeping:
- Tool layer (SPEAK). Hold-to-talk. Text lands at the cursor. Same job Superwhisper does.
- Data layer (RECALL). Every line archives locally with the source app, timestamp, and auto-tags. Full-text search across months of dictation. Export to Markdown / JSON / CSV.
- Memory layer (REFLECT). 7 built-in personas (or your own) re-read your week. AI-picked quotes worth echoing. Weekly Big5 sketch — patterns you wouldn't see yourself.
If you mostly use voice for quick text injection, you don't need the bottom two layers. If you talk through real decisions and want them retrievable later, no Whisper wrapper has built this.
Privacy
Both apps offer a local-only mode. The difference is what leaves your Mac when you opt into cloud:
- VoiceInput cloud. Audio streams directly to Volcengine ASR — no proxy server, no audio storage. Only one number leaves: recording length in seconds (for the global pulse counter, toggleable in Settings). API keys live in macOS Keychain.
- Superwhisper cloud. Audio routes through Whisper API providers. Standard OpenAI / equivalent retention policies apply.
Who should pick which
Pick VoiceInput if
You type Chinese or mixed Chinese-English daily, want sub-1.5-second latency, and want to be able to search what you said last month. Free forever covers most real usage.
Pick Superwhisper if
You only speak English, you want a polished Whisper wrapper, you don't need a memory layer, and $84/year fits your workflow.
FAQ
Is VoiceInput a Superwhisper alternative?
Yes. Both are macOS menu-bar dictation apps. VoiceInput differs in three ways: ~1.4s end-to-end on Chinese (vs 3-6s), every utterance auto-archives into a searchable local memory layer with AI personas, and the Local + BYOK tiers are free forever.
Which is more accurate for mixed Chinese-English?
VoiceInput. Volcengine ASR tuned for Mandarin, plus 200+ brand hotwords (Cursor, Kimi, GitHub) and pinyin disambiguation injected into LLM cleanup. Superwhisper's Whisper backbone underperforms on real-time code-switching CJK/Latin.
Can VoiceInput run fully offline?
Yes. Three local ASR engines: SenseVoice, Paraformer, Apple. Local typography engine handles formatting in <5ms. Toggle off cloud — nothing leaves your Mac.
How much does each cost?
Superwhisper: $84/yr or $8.49/mo. VoiceInput: 100% Local + BYOK free forever. Optional Cloud tier $5/mo, $79/yr, or $49 lifetime.
Can I migrate my Superwhisper history into VoiceInput?
Not yet. Superwhisper doesn't preserve a queryable history — there's nothing structural to migrate. From the moment you start using VoiceInput, every dictation is captured.
Try VoiceInput free
Free forever (100% local). No account, no API key, no setup. macOS 14+.
Download v0.47.0 · 21 MB