
3 posts tagged with "tts"


Day 20: Running GPT-SoVITS Locally as EchoKit’s TTS Provider | The First 30 Days with EchoKit

· 4 min read

Over the past few days, we’ve been switching EchoKit between different cloud-based TTS providers and voice styles. It’s fun, it’s flexible, and it really shows how modular the EchoKit pipeline is.

But today, I want to go one step further.

Today is about running TTS fully locally. No hosted APIs. No external requests. Just an open-source model running on your own machine — and EchoKit talking through it.

For Day 20, I’m using GPT-SoVITS as EchoKit’s local TTS provider.

What Is GPT-SoVITS?

GPT-SoVITS is an open-source text-to-speech and voice cloning system that combines:

  • A GPT-style text encoder for linguistic understanding
  • SoVITS-based voice synthesis for natural prosody and timbre

Compared to traditional TTS systems, GPT-SoVITS stands out for two reasons.

First, it produces very natural, expressive speech, especially for longer sentences and conversational content.

Second, it supports high-quality voice cloning with relatively small reference audio, which has made it popular in open-source voice communities.

Most importantly for us: GPT-SoVITS can run entirely on your own hardware.

Running GPT-SoVITS Locally

To make GPT-SoVITS easier to run locally, we also ported it to a Rust-based implementation.

This significantly simplifies local deployment and makes it much easier to integrate with EchoKit.

Check out Build and run a GPT-SoVITS server for details. The following steps are for an Apple-silicon MacBook.

First, install the LibTorch dependencies:

curl -LO https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.4.0.zip
unzip libtorch-macos-arm64-2.4.0.zip

Then, tell the system where to find LibTorch:

export DYLD_LIBRARY_PATH=$(pwd)/libtorch/lib:$DYLD_LIBRARY_PATH
export LIBTORCH=$(pwd)/libtorch
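Before building, it's worth confirming the environment actually points at the unzipped LibTorch directory — a wrong or unset `LIBTORCH` is the most common cause of link failures in the next step. This sanity check is not part of the original guide; run it in the same shell where you ran the exports:

```shell
# Sanity check: LIBTORCH should point at the unzipped libtorch directory.
# If it is unset or wrong, the cargo build below will fail to link.
if [ -d "${LIBTORCH:-/nonexistent}/lib" ]; then
  status="LibTorch found at $LIBTORCH"
else
  status="LibTorch not found; re-run the export commands in this shell"
fi
echo "$status"
```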

Next, clone the source code and build the GPT-SoVITS API server:

git clone https://github.com/second-state/gsv_tts
git clone https://github.com/second-state/gpt_sovits_rs

cd gsv_tts
cargo build --release

Then, download the required models. Since I’m running GPT-SoVITS locally on my MacBook, I’m using the CPU versions:

cd resources
curl -L -o t2s.pt https://huggingface.co/L-jasmine/GPT_Sovits/resolve/main/v2pro/t2s.cpu.pt
curl -L -o vits.pt https://huggingface.co/L-jasmine/GPT_Sovits/resolve/main/v2pro/vits.cpu.pt
curl -LO https://huggingface.co/L-jasmine/GPT_Sovits/resolve/main/v2pro/ssl_model.pt
curl -LO https://huggingface.co/L-jasmine/GPT_Sovits/resolve/main/v2pro/bert_model.pt
curl -LO https://huggingface.co/L-jasmine/GPT_Sovits/resolve/main/v2pro/g2pw_model.pt
curl -LO https://huggingface.co/L-jasmine/GPT_Sovits/resolve/main/v2pro/mini-bart-g2p.pt
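Partial downloads are easy to miss with `curl`, so a quick check that all six model files actually landed can save a confusing startup failure later. This snippet (not part of the original guide) lists the expected files; run it from the `gsv_tts` checkout:

```shell
# Verify that all six model files landed in resources/
# (names taken from the curl commands above).
missing=0
for f in t2s.pt vits.pt ssl_model.pt bert_model.pt g2pw_model.pt mini-bart-g2p.pt; do
  if [ -f "resources/$f" ]; then
    echo "found   resources/$f"
  else
    echo "MISSING resources/$f"
    missing=$((missing + 1))
  fi
done
echo "$missing file(s) missing"
```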

Finally, start the GPT-SoVITS API server:

TTS_LISTEN=0.0.0.0:9094 nohup target/release/gsv_tts &
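Since the server is started in the background with `nohup`, a failed launch is silent. Before pointing EchoKit at it, you can check that something is actually accepting connections on the configured port (9094, matching `TTS_LISTEN` above); any startup errors will be in `nohup.out`:

```shell
# Quick reachability check before configuring EchoKit.
if curl -s -o /dev/null --connect-timeout 2 http://localhost:9094/; then
  tts_status="gsv_tts is accepting connections on port 9094"
else
  tts_status="nothing is listening on port 9094; check nohup.out for startup errors"
fi
echo "$tts_status"
```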

Configure EchoKit to Use the Local TTS Provider

At this point, GPT-SoVITS is running as a local service and exposing a simple HTTP API.

Once the service is up, EchoKit only needs an endpoint that accepts text and returns audio.

Update the TTS section in the EchoKit server configuration:

[tts]
platform = "StreamGSV"
url = "http://localhost:9094/v1/audio/stream_speech"
speaker = "cooper"

Restart the EchoKit server, connect the service to the device, and EchoKit will start using the new local TTS provider.

A Fully Local Voice AI Pipeline

With today’s setup, we can now run the entire voice AI pipeline locally:

  • ASR: local speech-to-text
  • LLM: local open-source language models
  • TTS: GPT-SoVITS running on your own machine

That means:

  • No cloud dependency
  • No external APIs
  • No vendor lock-in

Just a complete, end-to-end voice AI system you can understand, modify, and truly own.


Want to get your own EchoKit device and make it unique?

Join the EchoKit Discord to share your custom voices and see how others are personalizing their voice AI agents.

Day 19: Switching EchoKit’s TTS Provider to Fish.audio | The First 30 Days with EchoKit

· 2 min read

Over the past few days, we’ve been iterating on different parts of EchoKit’s voice pipeline — ASR, LLMs, system prompts, and TTS (including ElevenLabs and Groq).

On Day 19, we switch EchoKit’s Text-to-Speech provider to Fish.audio, purely through a configuration change.

No code changes are required.

What Is Fish.audio?

Fish.audio is a modern text-to-speech platform focused on high-quality, expressive voices and fast iteration for developers.

One notable aspect of Fish.audio is the breadth of available voices. It offers a wide range of voice styles, including voices inspired by public figures, pop culture, and anime characters, which makes it easy to experiment with playful or character-driven agents.

In addition to preset voices, Fish.audio also supports voice cloning, allowing developers to generate speech in a customized voice when needed.

These features make it particularly interesting for conversational and personality-driven voice AI systems.

EchoKit is designed to be provider-agnostic. As long as a TTS service matches the expected interface, it can be plugged into the system without affecting the rest of the pipeline.

The Exact Change in config.toml

Switching to Fish.audio in EchoKit only requires updating the TTS section in the config.toml file:

[tts]
platform = "fish"
speaker = "03397b4c4be74759b72533b663fbd001"
api_key = "YOUR_FISH_AUDIO_API_KEY"

A brief explanation of each field:

  • platform set to "fish" tells EchoKit to use Fish.audio as the TTS provider.
  • speaker specifies the TTS model ID, which can be obtained from the Fish.audio model detail page.
  • api_key is the API key used to authenticate with the Fish.audio service.

After restarting the EchoKit server and reconnecting the device, all voice output is generated by Fish.audio.

Everything else remains unchanged:

  • ASR stays the same
  • The LLM and system prompts stay the same
  • Conversation flow and tool calls stay the same

With Fish.audio added to the list of supported TTS providers, EchoKit’s voice layer becomes even more flexible — making it easier to experiment with different voices without reworking the system.


Want to get your own EchoKit device and make it unique?

Join the EchoKit Discord to share your welcome voices and see how others are personalizing their voice AI agents!

Day 18: Switching EchoKit to Groq PlayAI TTS | The First 30 Days with EchoKit

· 3 min read

Over the past two weeks, we’ve built almost every core component of a voice AI agent on EchoKit:

  • ASR to turn speech into text
  • LLMs to reason, chat, and call tools
  • System prompts to shape personality
  • MCP servers to let the agent take real actions
  • TTS to give EchoKit a voice

Today, we close the loop again — but this time, with a new voice engine.

We’re switching EchoKit’s TTS backend to Groq’s PlayAI TTS.

Why change TTS?

Text-to-speech is often treated as the “last step” in a voice pipeline, but in practice, it’s the part users feel the most.

Latency, voice stability, and natural prosody directly affect whether a voice agent feels responsive or awkward. Since Groq already powers our ASR and LLM experiments with very low latency, it made sense to test their TTS offering as well.

PlayAI TTS fits EchoKit’s design goals nicely: It’s fast, simple to integrate, and exposed through an OpenAI-compatible API.

That means no special SDK, and no changes to EchoKit’s core architecture.

Switching EchoKit to Groq PlayAI TTS

On EchoKit, swapping TTS providers is mostly a configuration change.

To use Groq PlayAI TTS, we update the tts section in config.toml like this:

[tts]
platform = "openai"
url = "https://api.groq.com/openai/v1/audio/speech"
model = "playai-tts"
api_key = "gsk_xxx"
voice = "Fritz-PlayAI"

A few things worth calling out:

  • platform stays as openai because Groq exposes an OpenAI-compatible endpoint.
  • url points directly to Groq’s audio speech API.
  • model is set to playai-tts.
  • Voices are selected via the voice field; here we’re using Fritz-PlayAI.

Once this is in place, no other code changes are required.
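Because the endpoint is OpenAI-compatible, you can smoke-test it with plain `curl` before involving EchoKit at all. The request body below follows OpenAI's `/v1/audio/speech` schema (`model`, `voice`, `input`); the `gsk_xxx` key is a placeholder, and the model ID is written in lowercase as Groq lists it:

```shell
# Build an OpenAI-style speech request (model/voice match the config above).
GROQ_API_KEY="${GROQ_API_KEY:-gsk_xxx}"   # placeholder; export your real key
payload='{"model":"playai-tts","voice":"Fritz-PlayAI","input":"Hello from EchoKit"}'
echo "$payload"
# Uncomment to call the live endpoint (writes the audio response to speech.wav):
# curl -s https://api.groq.com/openai/v1/audio/speech \
#   -H "Authorization: Bearer $GROQ_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$payload" -o speech.wav
```

If the live call succeeds, `speech.wav` will contain the synthesized audio; an authentication or model error comes back as a JSON error body instead.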

Restart the EchoKit server, reconnect the EchoKit device to the new server, and the agent speaks with a new voice.

The bigger picture

Most importantly, switching between TTS providers reinforces one of EchoKit’s core ideas: every part of the voice pipeline should be swappable.

It’s about treating voice as a first-class system component — something you can experiment with, replace, and optimize just like models or prompts.

EchoKit doesn’t lock you into one vendor or one voice. If tomorrow you want to try a different TTS engine, or even run one locally, the architecture already supports that.


Want to get your own EchoKit device and make it unique?

Join the EchoKit Discord to share your welcome voices and see how others are personalizing their voice AI agents!