Voxtype vs Nerd-dictation
Two approaches to offline speech-to-text on Linux. Both work on Wayland. Which fits your workflow?
At a Glance
| Aspect | Voxtype | Nerd-dictation |
|---|---|---|
| Engine | Whisper (whisper.cpp) | VOSK |
| Language | Rust | Python (single file) |
| Architecture | Systemd daemon | Foreground process |
| Wayland | Native (evdev) | Via ydotool |
| Text Output | wtype (native Wayland) | xdotool/ydotool |
| CJK/Unicode Output | Yes | No (ydotool limitation) |
| Recording Feedback | Audio + Notifications | None |
| GPU Acceleration | Vulkan, CUDA, Metal, ROCm | No |
| Text Processing | Word replacements, spoken punctuation | Python callbacks |
Critical Differences
CJK/Multilingual Text Output
Voxtype uses wtype for text output, which properly handles Korean, Chinese, Japanese, and other Unicode characters. wtype needs no helper daemon of its own.
On Wayland, Nerd-dictation uses ydotool, which cannot output CJK characters. ydotool simulates physical key presses, but CJK characters don't map to keyboard keys. You'll get garbled output like "9 ." instead of Korean text.
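To make the key-press limitation concrete, here is a minimal Python sketch. It is not ydotool's actual code: the US_KEYMAP table and both helper functions are hypothetical, and they only model a tool that can emit characters that have a key on the active layout.

```python
import string

# Hypothetical model of a US keyboard layout: every character reachable
# by pressing a key, optionally with Shift held.
# This is an illustration, not ydotool's real key table.
US_KEYMAP = set(
    string.ascii_letters + string.digits + string.punctuation + " \n\t"
)


def typeable_by_keypress(text):
    """Return True if every character has a key on the modeled layout."""
    return all(ch in US_KEYMAP for ch in text)


def untypeable_chars(text):
    """List the characters that have no key on the modeled layout."""
    return [ch for ch in text if ch not in US_KEYMAP]
```

In this model, typeable_by_keypress("hello") is true while any Hangul, kana, or Han character fails the lookup, which is the situation a key-press simulator is in when asked to type Korean text.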
Daemon vs Foreground
Voxtype runs as a systemd user service. It starts automatically at login, runs invisibly, and is always ready.
Nerd-dictation must run in the foreground of a terminal. You need to keep a terminal window open with the process running. Close the terminal, lose dictation. You can work around this with tmux or a custom systemd unit, but that is manual setup.
Recording Feedback
Voxtype plays audio cues when recording starts and stops, plus optional desktop notifications. You know it's working without looking at the screen.
Nerd-dictation provides no feedback whatsoever. No sound, no visual indicator, nothing. You press the hotkey and hope it's recording. You find out if it worked when text appears (or doesn't).
Recognition Quality
Voxtype (Whisper)
Whisper provides exceptional accuracy across accents and speaking styles. It handles technical terminology, mixed-language phrases, punctuation and capitalization, and unusual names.
Typical accuracy: 95-99% depending on audio quality and model size.
Nerd-dictation (VOSK)
VOSK is remarkably lightweight but has lower raw accuracy. Output is all lowercase with no automatic punctuation. It works better with clear, deliberate speech.
Typical accuracy: 85-95% depending on clarity and vocabulary.
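The accuracy percentages above are not defined in this comparison. One common convention, assumed here, is word accuracy = 1 - word error rate (WER), where WER is the word-level edit distance divided by the number of reference words. A minimal Python sketch for measuring it on your own recordings (the function name word_error_rate is illustrative):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under this convention, one wrong word in twenty gives a WER of 0.05, or 95% accuracy. The sketch lowercases both strings, so it ignores capitalization, but punctuation attached to a word still counts as part of that word.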
Setup Complexity
Voxtype
# Install
curl -LO https://github.com/peteonrails/voxtype/releases/download/v0.2.1/voxtype_0.2.1-1_amd64.deb
sudo dpkg -i voxtype_0.2.1-1_amd64.deb
# Interactive model selection and systemd setup
voxtype setup model
voxtype setup systemd
Time to first transcription: ~5 minutes
Nerd-dictation
# Install VOSK
pip install vosk
# Download model manually
mkdir -p ~/.config/vosk
cd ~/.config/vosk
wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip
unzip vosk-model-en-us-0.22.zip
# Clone and run
git clone https://github.com/ideasman42/nerd-dictation
./nerd-dictation/nerd-dictation begin --vosk-model-dir ~/.config/vosk/vosk-model-en-us-0.22
Time to first transcription: ~15-30 minutes (including troubleshooting)
Resource Usage
| Metric | Voxtype | Nerd-dictation |
|---|---|---|
| Idle | ~50MB, 0% CPU | Not running |
| Active | High CPU for 1-3s | ~200MB, moderate CPU |
| Model size | 300MB - 3GB | ~50MB per language |
Customization
Voxtype
Configuration via ~/.config/voxtype/config.toml:
[hotkey]
key = "rightctrl"
mode = "toggle" # or "push_to_talk"
[audio.feedback]
enabled = true
theme = "subtle"
[text]
# Say "period" to get ".", "open paren" for "(", etc.
spoken_punctuation = true
# Custom word replacements
[text.replacements]
hyperwhisper = "hyprwhspr"
javascript = "JavaScript"
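The mode setting in the config above distinguishes toggle from push-to-talk. As a rough illustration of how the two modes usually interpret key events, here is a Python sketch. The HotkeyState class is hypothetical and is not Voxtype's Rust implementation; it assumes key-repeat events are filtered out before they reach it.

```python
class HotkeyState:
    """Tracks whether recording is active for a given hotkey mode."""

    def __init__(self, mode):
        if mode not in ("toggle", "push_to_talk"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.mode = mode
        self.recording = False

    def on_key_down(self):
        if self.mode == "toggle":
            # Each press flips the state: first press starts, second stops.
            self.recording = not self.recording
        else:
            # Push-to-talk records only while the key is held.
            self.recording = True
        return self.recording

    def on_key_up(self):
        if self.mode == "push_to_talk":
            self.recording = False
        # In toggle mode, releasing the key changes nothing.
        return self.recording
```

In toggle mode you press once to start and once more to stop. In push-to-talk mode, recording lasts exactly as long as the key is held.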
Nerd-dictation
Python callbacks let you transform text with full programming logic:
def process_text(text):
    # Arbitrary Python transformations
    text = text.replace("period", ".")
    text = text.replace("new line", "\n")
    return text.capitalize()
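The plain str.replace calls in process_text leave a space before the inserted symbol ("hello ." rather than "hello.") and also match inside longer words such as "periodic". A slightly more careful sketch uses regular-expression word boundaries. The SPOKEN_PUNCTUATION table and the process_text_strict name are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical mapping of spoken phrases to symbols; extend as needed.
SPOKEN_PUNCTUATION = {
    "period": ".",
    "comma": ",",
    "question mark": "?",
}


def process_text_strict(text):
    """Replace whole spoken phrases and attach symbols to the previous word."""
    for phrase, symbol in SPOKEN_PUNCTUATION.items():
        # \b keeps "period" from matching inside "periodic";
        # the leading \s* swallows the space before the symbol.
        pattern = r"\s*\b" + re.escape(phrase) + r"\b"
        # A function replacement inserts the symbol literally,
        # so backslashes in a symbol are never treated as escapes.
        text = re.sub(pattern, lambda _match: symbol, text)
    # Uppercase only the first character and leave the rest untouched.
    return text[:1].upper() + text[1:]
```

Unlike str.capitalize(), the final line does not lowercase the rest of the string, so any casing introduced by earlier replacements survives.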
The Verdict
Choose Voxtype if you want the best accuracy, GPU acceleration, built-in text processing (spoken punctuation, word replacements), and prefer tools that just work.
Choose Nerd-dictation if you need arbitrary Python transformations, prefer minimal footprint, or enjoy tinkering.
Voxtype now includes built-in text processing that covers most common use cases without needing Python code.