# Voxtype vs VoxInput
Different philosophies: embedded Whisper vs API-based transcription. Both target Linux power users.
## At a Glance
| Aspect | Voxtype | VoxInput |
|---|---|---|
| Engine | Whisper (embedded) | LocalAI/OpenAI API |
| Language | Rust | Go |
| Architecture | Self-contained daemon | API client + LocalAI server |
| Typing Backend | ydotool | dotool |
| Hotkey Detection | Built-in (evdev) | External (WM keybinds + signals) |
| Voice Activity Detection | No (push-to-talk) | Yes (realtime mode) |
| Setup Complexity | Low | High (Docker, LocalAI) |
| GPU Acceleration | Vulkan, CUDA, Metal, ROCm | Via LocalAI |
| Text Processing | Word replacements, spoken punctuation | None |
## Critical Differences

### Embedded vs API Architecture
Voxtype embeds whisper.cpp directly. One binary, one process, no external services needed.
VoxInput is an API client that connects to an OpenAI-compatible endpoint. The recommended setup involves running LocalAI in Docker, which provides the transcription service. This is more complex but allows swapping transcription backends.
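Anything that speaks the OpenAI audio API can serve as the backend. As a rough illustration, this is what a transcription request against a local LocalAI instance looks like; `sample.wav` is a placeholder file, and `whisper-1` matches the model installed in the setup steps below:

```bash
# Hedged example: assumes LocalAI is listening on its default port 8080
# and exposes the OpenAI-compatible transcription endpoint.
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=whisper-1
```

VoxInput drives this same API programmatically, which is why any compatible endpoint, local or hosted, can be swapped in.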
### Setup Complexity
Voxtype setup:

```bash
paru -S voxtype                        # install from the AUR
voxtype setup model                    # download a Whisper model
voxtype setup systemd                  # install the user service
systemctl --user enable --now voxtype  # start it and enable at login
```
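Since Voxtype runs as the user service enabled above, the usual systemd checks apply if anything misbehaves:

```bash
# Inspect the user unit and follow its logs
# (unit name taken from the enable command above)
systemctl --user status voxtype
journalctl --user -u voxtype -f
```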
VoxInput setup:

```bash
# Install LocalAI via Docker
docker run -d --name localai -p 8080:8080 localai/localai

# Install a Whisper model via the LocalAI web UI:
# open http://localhost:8080, then install whisper-1 and silero-vad-ggml

# Install dotool, configure its udev rules, and add your user to the input group

# Build VoxInput
git clone https://github.com/richiejp/VoxInput
cd VoxInput && go build -o voxinput

# Configure WM keybinds for the record/write commands (see the sketch below)
```
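The final step is window-manager specific. As a sketch, under Sway the record/write commands from the comment above might be bound like this; the keys and binary path are placeholders, and the exact subcommand names should be checked against VoxInput's README:

```
# ~/.config/sway/config (sketch only; adjust keys and path)
bindsym $mod+F9  exec /path/to/voxinput record
bindsym $mod+F10 exec /path/to/voxinput write
```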
### Voice Activity Detection
Voxtype uses push-to-talk exclusively. You control when recording happens.
VoxInput offers a realtime VAD mode using silero-vad that automatically detects when you're speaking. This enables hands-free continuous dictation, though the project itself describes the feature as partial/beta.
## Feature Comparison

### What VoxInput Does Better
- **VAD mode** - Automatic speech detection for continuous dictation
- **Flexible backend** - Swap between LocalAI, OpenAI, or any compatible API
- **Monitor capture** - Can transcribe system audio output, not just the microphone (see the sketch after this list)
- **AI button pressing** - Experimental feature that describes UI elements so an AI can click them
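On the monitor-capture point: under PulseAudio or PipeWire, every output device exposes a matching `.monitor` source, and recording from one captures what the system is playing rather than the microphone. Listing those sources shows what a tool like VoxInput has to work with:

```bash
# Sources ending in ".monitor" mirror an output device;
# recording from one captures system audio, not mic input.
pactl list short sources | grep monitor
```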
### What Voxtype Does Better
- **Simple setup** - No Docker, no API server, no manual model installation
- **Self-contained** - A single binary with embedded transcription
- **Lower resource usage** - No separate API server running
- **GPU acceleration** - Vulkan, CUDA, Metal, and ROCm for fast transcription
- **Text processing** - Word replacements and spoken punctuation
- **Built-in hotkeys** - No WM configuration required
- **Audio feedback** - Audible cues so you know when you're recording
- **Waybar integration** - Built-in status indicator
## The Verdict
Choose Voxtype if you want a simple, self-contained tool that works out of the box. No Docker, no API servers, just install and dictate.
Choose VoxInput if you're already running LocalAI infrastructure, want VAD-based continuous dictation, or need the flexibility to swap transcription backends. Be prepared for a more complex setup.