Skip to main content

Build and run a Silero VAD server

As the user speaks to the EchoKit device, it streams the audio data to the EchoKit server. The ASR service on the server must detect when the user is done talking and now expecting an answer. That is called VAD (Voice Activity Detection). When the ASR detects that the user has finished speaking, it will collect the audio transcript and send it to the LLM service for a response.

Streaming ASR services, such as Gemini Live, ElevenLabs, and OpenAI real-time, has built in VAD services. But for many services based on the /v1/audio/transcriptions API, you will need to supply your own VAD service in the [asr] section of EchoKit server's config.toml.

The VAD service is in fact, optional. If you do not supply it, the EchoKit server would still work. It would determine whether the user is finished speaking by detecting pauses in the speech, which is less reliable and offers inferior user experience than VAD.

The Silero VAD is a leading open-source VAD model. We have created a Rust-based Silero VAD server.

Install libtorch dependencies

Linux x86 CPU with or without CUDA and ROCm

# download libtorch
curl -LO https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.4.0%2Bcu124.zip
unzip libtorch-cxx11-abi-shared-with-deps-2.4.0%2Bcu124.zip

MacOS on Apple Silicon (M-series) devices

curl -LO https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.8.0.zip
unzip libtorch-macos-arm64-2.8.0.zip

Then, tell the system where to find your LibTorch.

# Add to ~/.zprofile or ~/.bash_profile
export LD_LIBRARY_PATH=$(pwd)/libtorch/lib:$LD_LIBRARY_PATH
export LIBTORCH=$(pwd)/libtorch

Build the API server

git clone https://github.com/second-state/silero_vad_server

cd silero_vad_server
cargo build --release

Run the API server

VAD_LISTEN=0.0.0.0:9093 nohup target/release/silero_vad_server &

In the EchoKit server configuration, you can now use the VAD server in the [asr] section to use it together with the /v1/audio/transcriptions ASR API.

[asr]
url = "https://api.openai.com/v1/audio/transcriptions"
api_key = "sk_ABCD"
model = "gpt-4o-mini-transcribe"
lang = "en"
vad_url = "http://localhost:9093/v1/audio/vad"