Day 20: Running GPT-SoVITS Locally as EchoKit’s TTS Provider | The First 30 Days with EchoKit

· 4 min read

Over the past few days, we’ve been switching EchoKit between different cloud-based TTS providers and voice styles. It’s fun, it’s flexible, and it really shows how modular the EchoKit pipeline is.

But today, I want to go one step further.

Today is about running TTS fully locally. No hosted APIs. No external requests. Just an open-source model running on your own machine — and EchoKit talking through it.

For Day 20, I’m using GPT-SoVITS as EchoKit’s local TTS provider.

What Is GPT-SoVITS?

GPT-SoVITS is an open-source text-to-speech and voice cloning system that combines:

  • A GPT-style text encoder for linguistic understanding
  • SoVITS-based voice synthesis for natural prosody and timbre

Compared to traditional TTS systems, GPT-SoVITS stands out for two reasons.

First, it produces very natural, expressive speech, especially for longer sentences and conversational content.

Second, it supports high-quality voice cloning with relatively small reference audio, which has made it popular in open-source voice communities.

Most importantly for us: GPT-SoVITS can run entirely on your own hardware.

Running GPT-SoVITS Locally

To make local GPT-SoVITS easier to run, we also ported GPT-SoVITS to a Rust-based implementation.

This significantly simplifies local deployment and makes it much easier to integrate with EchoKit.

Check out Build and run a GPT-SoVITS server for details. The following steps are for a MacBook with Apple Silicon.

First, install the LibTorch dependencies:

curl -LO https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.4.0.zip
unzip libtorch-macos-arm64-2.4.0.zip

Then, tell the system where to find LibTorch:

export DYLD_LIBRARY_PATH=$(pwd)/libtorch/lib:$DYLD_LIBRARY_PATH
export LIBTORCH=$(pwd)/libtorch

Next, clone the source code and build the GPT-SoVITS API server:

git clone https://github.com/second-state/gsv_tts
git clone https://github.com/second-state/gpt_sovits_rs

cd gsv_tts
cargo build --release

Then, download the required models. Since I’m running GPT-SoVITS locally on my MacBook, I’m using the CPU versions:

cd resources
curl -L -o t2s.pt https://huggingface.co/L-jasmine/GPT_Sovits/resolve/main/v2pro/t2s.cpu.pt
curl -L -o vits.pt https://huggingface.co/L-jasmine/GPT_Sovits/resolve/main/v2pro/vits.cpu.pt
curl -LO https://huggingface.co/L-jasmine/GPT_Sovits/resolve/main/v2pro/ssl_model.pt
curl -LO https://huggingface.co/L-jasmine/GPT_Sovits/resolve/main/v2pro/bert_model.pt
curl -LO https://huggingface.co/L-jasmine/GPT_Sovits/resolve/main/v2pro/g2pw_model.pt
curl -LO https://huggingface.co/L-jasmine/GPT_Sovits/resolve/main/v2pro/mini-bart-g2p.pt

Finally, start the GPT-SoVITS API server:

TTS_LISTEN=0.0.0.0:9094 nohup target/release/gsv_tts &

Configure EchoKit to Use the Local TTS Provider

At this point, GPT-SoVITS is running as a local service and exposing a simple HTTP API.

Once the service is up, EchoKit only needs an endpoint that accepts text and returns audio.

Update the TTS section in the EchoKit server configuration:

[tts]
platform = "StreamGSV"
url = "http://localhost:9094/v1/audio/stream_speech"
speaker = "cooper"

Restart the EchoKit server, connect the service to the device, and EchoKit will start using the new local TTS provider.

A Fully Local Voice AI Pipeline

With today’s setup, we can now run the entire voice AI pipeline locally:

  • ASR: local speech-to-text
  • LLM: local open-source language models
  • TTS: GPT-SoVITS running on your own machine

That means:

  • No cloud dependency
  • No external APIs
  • No vendor lock-in

Just a complete, end-to-end voice AI system you can understand, modify, and truly own.


Want to get your own EchoKit device and make it unique?

Join the EchoKit Discord to share your custom voices and see how others are personalizing their voice AI agents.

Day 19: Switching EchoKit’s TTS Provider to Fish.audio | The First 30 Days with EchoKit

· 2 min read

Over the past few days, we’ve been iterating on different parts of EchoKit’s voice pipeline — ASR, LLMs, system prompts, and TTS (including ElevenLabs and Groq).

On Day 19, we switch EchoKit’s Text-to-Speech provider to Fish.audio, purely through a configuration change.

No code changes are required.

What Is Fish.audio

Fish.audio is a modern text-to-speech platform focused on high-quality, expressive voices and fast iteration for developers.

One notable aspect of Fish.audio is the breadth of available voices. It offers a wide range of voice styles, including voices inspired by public figures, pop culture, and anime culture references, which makes it easy to experiment with playful or character-driven agents.

In addition to preset voices, Fish.audio also supports voice cloning, allowing developers to generate speech in a customized voice when needed.

These features make it particularly interesting for conversational and personality-driven voice AI systems.

EchoKit is designed to be provider-agnostic. As long as a TTS service matches the expected interface, it can be plugged into the system without affecting the rest of the pipeline.

The Exact Change in config.toml

Switching to Fish.audio in EchoKit only requires updating the TTS section in the config.toml file:

[tts]
platform = "fish"
speaker = "03397b4c4be74759b72533b663fbd001"
api_key = "YOUR_FISH_AUDIO_API_KEY"

A brief explanation of each field:

  • platform set to "fish" tells EchoKit to use Fish.audio as the TTS provider.
  • speaker specifies the TTS model ID, which can be obtained from the Fish.audio model detail page.
  • api_key is the API key used to authenticate with the Fish.audio service.

After restarting the EchoKit server and reconnecting the device, all voice output is generated by Fish.audio.

Everything else remains unchanged:

  • ASR stays the same
  • The LLM and system prompts stay the same
  • Conversation flow and tool calls stay the same

With Fish.audio added to the list of supported TTS providers, EchoKit’s voice layer becomes even more flexible — making it easier to experiment with different voices without reworking the system.


Want to get your own EchoKit device and make it unique?

Join the EchoKit Discord to share your welcome voices and see how others are personalizing their voice AI agents!

Day 18: Switching EchoKit to Groq PlayAI TTS | The First 30 Days with EchoKit

· 3 min read

Over the past two weeks, we’ve built almost every core component of a voice AI agent on EchoKit:

ASR to turn speech into text. LLMs to reason, chat, and call tools. System prompts to shape personality. MCP servers to let the agent take real actions. TTS to give EchoKit a voice.

Today, we close the loop again — but this time, with a new voice engine.

We’re switching EchoKit’s TTS backend to Groq’s PlayAI TTS.

Why change TTS?

Text-to-speech is often treated as the “last step” in a voice pipeline, but in practice, it’s the part users feel the most.

Latency, voice stability, and natural prosody directly affect whether a voice agent feels responsive or awkward. Since Groq already powers our ASR and LLM experiments with very low latency, it made sense to test their TTS offering as well.

PlayAI TTS fits EchoKit’s design goals nicely: It’s fast, simple to integrate, and exposed through an OpenAI-compatible API.

That means no special SDK, and no changes to EchoKit’s core architecture.

Switching EchoKit to Groq PlayAI TTS

On EchoKit, swapping TTS providers is mostly a configuration change.

To use Groq PlayAI TTS, we update the tts section in config.toml like this:

[tts]
platform = "openai"
url = "https://api.groq.com/openai/v1/audio/speech"
model = "Playai-tts"
api_key = "gsk_xxx"
voice = "Fritz-PlayAI"

A few things worth calling out:

The platform stays as openai because Groq exposes an OpenAI-compatible endpoint. We point the url directly to Groq’s audio speech API. The model is set to Playai-tts. Voices are selected via the voice field — here we’re using Fritz-PlayAI.
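
Because the endpoint is OpenAI-compatible, you can also test it directly with curl before restarting EchoKit. This is a minimal sketch that reuses the values from the config above and assumes the standard OpenAI-style audio/speech request body; replace the key with your own.

# Test Groq PlayAI TTS outside EchoKit with an OpenAI-style speech request.
curl -X POST https://api.groq.com/openai/v1/audio/speech \
  -H "Authorization: Bearer gsk_xxx" \
  -H "Content-Type: application/json" \
  -d '{"model": "Playai-tts", "voice": "Fritz-PlayAI", "input": "Hello from EchoKit"}' \
  --output speech.wav
# Writes the returned audio to speech.wav (the exact format depends on the provider's default).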

Once this is in place, no other code changes are required.

Restart the EchoKit server, reconnect the EchoKit device and the new server, and the agent speaks with a new voice.

The bigger picture

Most importantly, switching between TTS providers reinforces one of EchoKit’s core ideas: every part of the voice pipeline should be swappable.

It’s about treating voice as a first-class system component — something you can experiment with, replace, and optimize just like models or prompts.

EchoKit doesn’t lock you into one vendor or one voice. If tomorrow you want to try a different TTS engine, or even run one locally, the architecture already supports that.


Want to get your own EchoKit device and make it unique?

Join the EchoKit Discord to share your welcome voices and see how others are personalizing their voice AI agents!

Day 17: Giving EchoKit a Voice — Using ElevenLabs TTS | The First 30 Days with EchoKit

· 2 min read

Over the past three weeks, we’ve covered almost every core piece of a voice AI agent:

  • ASR: turning human speech into text
  • LLMs: reasoning, chatting, and tool calling
  • System prompts: shaping personality and behavior
  • MCP tools: letting EchoKit take real actions

Today, we complete the loop.

It’s time to talk about TTS — Text to Speech.

Without TTS, your agent can think, plan, and decide — but it can’t speak back. And for a voice-first device like EchoKit, that’s a deal breaker.

In Day 17, we’ll start with one of the most popular choices: ElevenLabs TTS.

Why ElevenLabs?

ElevenLabs is widely used because it offers:

  • Very natural-sounding voices
  • Low latency for real-time conversations
  • Multiple languages and accents
  • Voice cloning support (we’ll get to that later 😉)

For builders, it’s also simple to integrate and well-documented — which makes it a great first TTS provider for EchoKit.

What EchoKit Needs for ElevenLabs TTS

EchoKit’s ElevenLabs configuration lives in the EchoKit server’s config.toml file.

[tts]
platform = "Elevenlabs"
token = ""
voice = "yj30vwTGJxSHezdAGsv" # The voice I choose here is Jessa
  • platform: set to "Elevenlabs"
  • token: your ElevenLabs API key. You can generate one from the ElevenLabs Developer Dashboard
  • voice: the voice ID you want EchoKit to speak with

⚠️ Important: If you pick a voice in ElevenLabs, you must add it to “My Voices”. Otherwise, your API key may not be able to call it, even if the voice plays fine in the UI.

That’s it. model_id is optional in EchoKit’s config and not required for basic TTS.
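
If the device stays silent after the switch, it can help to call ElevenLabs directly and rule out a key or voice-ID problem. A minimal sketch, assuming the standard ElevenLabs text-to-speech REST endpoint and the voice ID from the config above:

# Hedged sanity check: synthesize a short phrase with the configured voice ID.
curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/yj30vwTGJxSHezdAGsv" \
  -H "xi-api-key: YOUR_ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from EchoKit"}' \
  --output test.mp3
# test.mp3 should contain the spoken phrase if the key and voice ID are valid.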

Restart and Reconnect the Server

After updating the config, restart the EchoKit server, then reconnect the EchoKit device.

When you chat with the device again, you should hear EchoKit speak back — using the voice you selected.

With TTS working, EchoKit finally feels complete as a voice AI companion.

Want to get your own EchoKit device and make it unique?

Join the EchoKit Discord to share your welcome voices and see how others are personalizing their voice AI agents!

Day 16: Dynamic Personality for EchoKit | The First 30 Days with EchoKit

· 2 min read

In previous instalments we explored switching LLM providers and giving EchoKit different personalities through system prompts. Today, let's learn about a powerful new feature: dynamic system prompt loading.

Why dynamic system prompts?

A system prompt sets EchoKit’s tone, role and behaviour. Thanks to the growing ecosystem of open‑source prompts, you can choose from thousands of prebuilt personalities—sites like LLMs.txt offer extensive collections. Previously, changing EchoKit’s character required editing a local file and restarting the server. Now the server can fetch a system prompt from a remote URL, insert it into the context and cache it. This lets you:

  • Update behaviour remotely. Change the text at the URL and EchoKit adopts a new persona on the next restart.
  • Experiment without redeploying. Quickly swap prompts or test new conversation flows without editing code.
  • Iterate on demos. Focus on creativity rather than configuration while your EchoKit responds in new ways.

How to use a remote prompt

Open your config.toml and find the [[llm.sys_prompts]] section. Instead of embedding the full text, wrap a plain‑text URL in double braces:

[[llm.sys_prompts]]
role = "system"
content = """
{{ https://raw.githubusercontent.com/alabulei1/echokit-dynamic-prompt/refs/heads/main/prompt.txt }}
"""

On startup, EchoKit will:

  1. Fetch the content from that URL.
  2. Insert it as the system prompt.
  3. Cache it for later use.

Want to give it a try? GitHub raw files are convenient hosts because they're free and return plain text.
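
You can confirm that the URL really returns plain text before pointing EchoKit at it:

# The response should be the raw prompt text, with no HTML wrapper.
curl -L https://raw.githubusercontent.com/alabulei1/echokit-dynamic-prompt/refs/heads/main/prompt.txt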

When does EchoKit reload the prompt?

Dynamic prompts are fetched only during a full restart:

  • When you power the device off and back on.
  • When you press the RST hardware button.

Interrupting a conversation with the K0 button or a temporary Wi‑Fi reconnection will not reload the prompt. This ensures ongoing sessions remain consistent while still giving you the freedom to change behaviour by updating the remote file.

Summary

Dynamic system prompt loading opens up a new level of flexibility for EchoKit. You no longer need to modify local files or redeploy the server to change your agent’s behaviour; instead, you can pull any prompt hosted on the web and swap personas simply by updating the remote file and restarting the device.


Want to get your EchoKit Device and make it unique?

Join the EchoKit Discord to share your creative welcome voices and see how others are personalizing their Voice AI agents!

Day 15: EchoKit × MCP — Search the Web with Your Voice | The First 30 Days with EchoKit

· 4 min read

Over the past few days in The First 30 Days with EchoKit, we’ve explored how EchoKit connects to various LLM providers—OpenAI, OpenRouter, Groq, Grok and even local models. But switching models only affects how smart EchoKit is.

Next, we showed how changing the system prompt can transform EchoKit’s personality without touching any code—turning it into a coach, a cat, or a Shakespearean actor. Today, we’re going to extend what EchoKit can do by plugging into the broader ecosystem of tools through the Model Context Protocol (MCP).

Recent industry news makes this especially timely: on December 9, 2025, Anthropic donated MCP to the Linux Foundation and co‑founded the Agentic AI Foundation (AAIF) with Block and OpenAI. MCP is now joined by Block’s Goose agent framework and OpenAI’s AGENTS.md spec as the founding projects of the AAIF.

🧠 What is MCP?

MCP acts like a “USB‑C port” for AI agents. It defines a client–server protocol that lets models call external tools, databases or APIs through standardised actions. MCP servers wrap services—such as file systems, web searches or device controls—behind simple JSON‑RPC endpoints. MCP Clients (like EchoKit or Anthropic’s Claude Code) connect to one or more MCP servers and dynamically discover available tools. When the model needs information or wants to perform an action, it sends a tool request; the server executes the tool and returns results for the model to use.
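
To make the "simple JSON-RPC endpoints" part concrete, here is a rough sketch of the kind of request an MCP client sends to list a server's tools. It is illustrative only: real clients perform an initialize handshake first, so a bare call like this may be rejected by many servers.

# Illustrative only -- shows the JSON-RPC shape of an MCP tools/list request.
# Real MCP clients run an initialize handshake and manage a session before this call.
curl -X POST "MCP_SERVER_URL" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'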

MCP’s adoption has been rapid: within a year of its release there were over 10,000 public MCP servers and more than 97 million SDK downloads. It’s been integrated into major platforms like ChatGPT, Claude, Cursor, Gemini, Microsoft Copilot and VS Code. By placing MCP under the AAIF, Anthropic and its partners ensure that this crucial infrastructure remains open, neutral and community‑driven.

🔧 Connect EchoKit to an MCP Server

To make EchoKit call external tools, we simply point it to an MCP server. Add a section like the following to your config.toml:

[[llm.mcp_server]]
server = "MCP_SERVER_URL"
type = "http_streamable"

  • server – the URL of the MCP server (replace this with the server you want to use).
  • type – the transport type; both http_streamable and SSE are supported.

Once configured, EchoKit will automatically maintain a connection to the MCP server. When the LLM detects that it needs to call a tool, it issues a request via MCP and merges the response back into the LLM context. So, if you want to use an MCP server, the LLM you use must support tool calling. Here are some recommendations:

  • Open-source models: Qwen3, GPT-OSS, Llama 3.1
  • Closed-source models: Gemini, OpenAI, Claude

🌐 Example: Adding a Web Search Tool

To demonstrate, let’s connect EchoKit to a web‑search MCP server. Many open‑source servers provide a search tool that scrapes public search engine results—often without requiring API keys.

Add the server to your configuration. Here I use the GPT-OSS-120B model hosted on Groq and the Tavily MCP server:

[llm]
llm_chat_url = "https://api.groq.com/openai/v1/chat/completions"
api_key = "YOUR API KEY"
model = "openai/gpt-oss-120b"
history = 5

[[llm.mcp_server]]
server = "http://eu.echokit.dev:8011/mcp"
type = "http_streamable"

After that, save the file and restart EchoKit as usual.

Ask: “Tell me the latest update of MCP.”

Under the hood, EchoKit’s LLM recognises that it needs up‑to‑date information. It invokes the search tool on your MCP server, passing your query.

The MCP server performs the web search and returns structured results (titles, URLs and snippets). EchoKit then synthesises a natural‑language answer, summarising the findings and citing the sources.

You can also use other MCP servers, such as the Google Calendar MCP server to add and edit events, the Slack MCP server to send messages to a Slack channel, or the Home Assistant MCP server to control home devices. All of these tools become accessible through your voice.

📌 Why This Matters

Integrating MCP gives EchoKit access to a rapidly expanding tool ecosystem. You’re no longer limited to predetermined voice commands; your agent can search the web, read files, run code, query databases or control smart devices—all through a voice interface. The AAIF’s stewardship of MCP ensures that these capabilities remain open and interoperable, so EchoKit can continue to evolve alongside the broader agentic AI community.


Want to explore more or share what you’ve built with MCP servers?

Ready to get your own EchoKit?

Start building your own voice AI agent today.

Day 14: Give EchoKit a New Personality with System Prompt | The First 30 Days with EchoKit

· 3 min read

Over the past few days, we explored how EchoKit connects to different LLM providers — OpenAI, OpenRouter, Groq, Grok and even fully local models like Qwen3.

But switching the model only decides how smart EchoKit is.

Today, we’re doing something much more fun: we’re changing who EchoKit is.

With one simple system prompt, you can turn EchoKit into a cat, a coach, a tired office worker, a sarcastic companion, or a dramatic Shakespeare actor. No code. No firmware change. Just one text block in your configuration.

Let’s make EchoKit come alive.

What Is a System Prompt, and Why Does It Matter?

A system prompt is the personality, behavior guideline, and “soul” you give your LLM.

It defines:

  • How EchoKit speaks
  • What role it plays
  • Its tone and attitude
  • How it should respond in different situations

The system prompt is incredibly powerful. Change it, and the same model can behave like a completely different agent.

Where the System Prompt Lives in EchoKit

In your config.toml, under the [[llm.sys_prompts]] section, you’ll find:

[[llm.sys_prompts]]
role = "system"
content = """
(your prompt goes here)
"""

Just edit this text, save the file, and restart the EchoKit server.

If your Wi-Fi and EchoKit server didn't change, press the RST button on the device to make the new system prompt take effect.

5 Fun and Hilarious Prompt Ideas You Can Try Today

Below are ready-to-use system prompts. Copy, paste, enjoy.

1. The “Explain Like I’m Five” Tutor

You explain everything as if you're teaching a five-year-old. 
Simple, patient, cute, and crystal clear.

2. The Shakespearean AI

You speak like a dramatic Shakespeare character, 
as if every mundane question is a matter of cosmic destiny.

3. The Confused but Hardworking AI Intern

You are a slightly confused intern who tries extremely hard. 
Sometimes you misunderstand things in funny ways, but you stay cheerful.

4. The Cat That Doesn’t Understand Human Problems

You are a cat. 
You interpret all human activities through a cat’s perspective.
Add 'meow' occasionally.
You don't truly understand technology.

5. The Absurd Metaphor Philosopher

You must include at least one ridiculous metaphor in every reply. 
Be philosophical but humorous.

Have fun — EchoKit becomes a completely different creature depending on what you choose.
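
For example, dropping prompt #4 into the [[llm.sys_prompts]] block shown earlier turns EchoKit into the cat:

[[llm.sys_prompts]]
role = "system"
content = """
You are a cat.
You interpret all human activities through a cat’s perspective.
Add 'meow' occasionally.
You don't truly understand technology.
"""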

Prompt Debugging Tips

If your character “breaks,” try adding:

  • “Stay in character.”
  • “Keep responses short.”
  • “If unsure, make up a fun explanation.”
  • “Use a consistent tone.”

Prompt tuning is an art. A few careful sentences can reshape the entire interaction.

Try giving your EchoKit different personalities now.


Want to explore more or share what you’ve built?

Ready to get your own EchoKit?

Start building your own voice AI agent today.

Day 13 — Running an LLM Locally for EchoKit | The First 30 Days with EchoKit

· 3 min read

Over the last few days, we explored several cloud-based LLM providers — OpenAI, OpenRouter, and Grok. Each offers unique advantages, but today we’re doing something completely different: we’re running the open-source Qwen3-4B model locally and using it as EchoKit’s LLM provider.

There’s no shortage of great open-source LLMs—Llama, Mistral, DeepSeek, Qwen, and many others—and you can pick whichever model best matches your use case.

Likewise, you can run a local model in several different ways. For today’s walkthrough, though, we’ll focus on a clean, lightweight, and portable setup: Qwen3-4B (GGUF) running inside a WASM LLM server powered by WasmEdge. This setup exposes an OpenAI-compatible API, which makes integrating it with EchoKit simple and seamless.

Run the Qwen3-4B Model Locally

Step 1 — Install WasmEdge

WasmEdge is a lightweight, secure WebAssembly runtime capable of running LLM workloads through the LlamaEdge extension.

Install it:

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s

Verify the installation:

wasmedge --version

You should see a version number printed.

Step 2 — Download Qwen3-4B in GGUF Format

We’ll use a quantized version of Qwen3-4B, which keeps memory usage manageable while delivering strong performance.

curl -Lo Qwen3-4B-Q5_K_M.gguf https://huggingface.co/second-state/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q5_K_M.gguf

Step 3 — Download the LlamaEdge API Server (WASM)

This small .wasm application loads GGUF models and exposes an OpenAI-compatible chat API, which EchoKit can connect to directly.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Step 4 — Start the Local LLM Server

Now let’s launch the Qwen3-4B model locally and expose the /v1/chat/completions endpoint:

wasmedge --dir .:. \
--nn-preload default:GGML:AUTO:Qwen3-4B-Q5_K_M.gguf \
llama-api-server.wasm \
--model-name Qwen3-4B \
--prompt-template qwen3-no-think \
--ctx-size 4096

If everything starts up correctly, the server will be available at:

http://localhost:8080
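
Before touching EchoKit's config, you can verify the OpenAI-compatible endpoint directly with curl. A quick sketch (the reply text will vary):

# Send a test chat request to the local LlamaEdge server.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-4B", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'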

Connect EchoKit to Your Local LLM

Open your EchoKit server’s config.toml and update the LLM settings:

[llm]
llm_chat_url = "http://localhost:8080/v1/chat/completions"
api_key = "N/A"
model = "Qwen3-4B"
history = 5

Save the file and restart your EchoKit server.

Next, pair your EchoKit device and connect it to your updated server.

Now try speaking to your device:

“EchoKit, what do you think about running local models?”

Watch your terminal — you should see EchoKit sending requests to your local endpoint.

Your EchoKit is now fully powered by a local Qwen3-4B model.

Today we reached a major milestone: EchoKit can now run entirely on your machine, with no external LLM provider required.


This tutorial is only one small piece of what EchoKit can do. If you want to build your own voice AI device, try different LLMs, or run fully local models like Qwen — EchoKit gives you everything you need in one open-source kit.

Want to explore more or share what you’ve built?

  • Join the EchoKit Discord
  • Show us your custom models, latency tests, and experiments — the community is growing fast.

Ready to get your own EchoKit?

Start building your own voice AI agent today.

End-to-End vs. ASR-LLM-TTS: Which One Is the Right Choice for Building a Voice AI Agent?

· 5 min read

The race to build the perfect Voice AI Agent has primarily split into two lanes: the seamless, ultra-low latency End-to-End (E2E) model (like Gemini Live), and the highly configurable ASR-LLM-TTS modular pipeline. While the speed and fluidity of the End-to-End approach have garnered significant attention, we argue that for enterprise-grade applications, the modular ASR-LLM-TTS architecture provides the strategic advantage of control, customization, and long-term scalability.

This is not simply a technical choice; it is a business decision that determines whether your AI Agent will be a generic tool or a highly specialized, branded extension of your operations.

The Allure of the Integrated Black Box (Low Latency, High Constraint)

End-to-End models are technologically impressive. By integrating the speech-to-text (ASR), large language model (LLM), and text-to-speech (TTS) functions into a single system, they achieve significantly lower latency compared to pipeline systems. The resulting conversation feels incredibly fluid, with minimal pauses—an experience that is highly compelling in demonstrations.

However, this integration creates a “black box”. Once the user's voice enters the system, you lose visibility and the ability to intervene until the synthesized voice comes out. For general consumer-grade assistants, this simplification works. But for companies with specialized vocabulary, unique brand voices, and strict compliance needs, simplicity comes at the cost of surgical control.

Lessons Learned from the Front Lines: The EchoKit Experience

Our understanding of this architectural divide is forged through experience building complex, scalable voice platforms. In the early days of advanced voice interaction—systems like EchoKit—we tackled the challenge of delivering functional, high-quality, and reliable Voice AI using the available modular components.

These pioneering efforts, long before current E2E models were mainstream, taught us a crucial lesson: The ability to inspect, isolate, and optimize each stage (ASR, NLU/LLM, TTS) is non-negotiable for achieving enterprise-level accuracy and customization. We realized that while assembling the pipeline was complex, the resulting control over domain-specific accuracy, language model behavior, and distinct voice output ultimately delivered superior business results and a truly unique brand experience.

More importantly, EchoKit, which is open source, ensures complete transparency and adaptability.

The Power of the Modular Pipeline: Control and Precision (Higher Latency, Full Control)

The ASR-LLM-TTS pipeline breaks the Voice AI process down into three discrete, controllable stages. While this sequential process often results in higher overall latency compared to E2E solutions, this modularity is a deliberate architectural choice that grants businesses the power to optimize every single touchpoint.

  1. ASR (Acoustic and Language Model Fine-tuning): You can specifically train the ASR component on your industry jargon, product names, or regional accents. This is crucial in sectors like finance, healthcare, or manufacturing, where misrecognition of a single term can be disastrous. The pipeline allows you to correct ASR errors before they even reach the LLM, ensuring higher fidelity input.
  2. LLM (Knowledge Injection and Logic Control): This is the brain. You have the flexibility to swap out the LLM (whether it's GPT, Claude, or a custom model) and deeply integrate your proprietary knowledge bases (RAG), MCP servers, business rules, and specific workflow logic. You maintain complete control over the reasoning path and ensure responses are accurate and traceable.
  3. TTS (Brand Voice and Emotional Context): This is the face and personality of your brand. You can select, fine-tune, or even clone a unique voice that perfectly matches your brand identity, adjusting emotional tone and pacing. Your agent should sound like your company, not a generic robot.

Voice AI Architecture Comparison: E2E vs. ASR-LLM-TTS

The choice boils down to a fundamental trade-off between Latency vs. Customization.

| Feature | End-to-End (E2E) Model (e.g., Gemini Live) | ASR-LLM-TTS Pipeline (Modular) |
| --- | --- | --- |
| Primary Advantage | Ultra-Low Latency & Fluidity. Excellent for fast, generic conversation. | Maximum Customization & Control. Optimized for business value. |
| Latency | Significantly Lower. Integrated processing minimizes delays. | Generally Higher. Sequential processing introduces latency between stages. |
| Architecture | Integrated Black Box. All components merged. | Three Discrete Modules. ASR → LLM → TTS. |
| Customization | Low. Limited ability to adjust individual components or voices. | High. Each module can be independently trained and swapped. |
| Brand Voice | Limited. Locked to vendor’s available TTS options. | Full Control. Can implement custom voice cloning and precise emotion tagging. |
| Optimization Path | All-or-Nothing. Optimization requires waiting for the vendor to update the entire model. | Component-Specific. Allows precise fixes and continuous improvement on any single module. |
| Strategic Lock-in | High. Tightly bound to the single End-to-End vendor/platform. | Low. Flexibility to integrate best-of-breed components from different vendors. |

The Verdict: Choosing a Strategic Asset

While the ultra-low latency of an End-to-End agent is undoubtedly attractive, it is crucial to ask: Does speed alone deliver business value?

For most enterprise use cases—where the Agent handles critical customer service, sales inquiries, or technical support—the ability to be accurate, on-brand, and deeply integrated is far more valuable than shaving milliseconds off the response time.

The ASR-LLM-TTS architecture, validated by our experience with systems like EchoKit, is the strategic choice because it treats the Voice AI Agent not as a simple conversational tool, but as a controllable, customizable, and continuously optimizable business asset. By opting for modularity, you retain the control necessary to adapt to market changes, ensure data compliance, and, most importantly, deliver a unique and expert-level experience that truly reflects your brand.

Which solution delivers the highest long-term ROI and the strongest brand experience? The answer is clear: Control is the key to enterprise Voice AI.

Day 12 — Switching EchoKit to Grok (with Built-in Web Search) | The First 30 Days with EchoKit

· 3 min read

Over the past days, we’ve been exploring how EchoKit’s ASR → LLM → TTS pipeline works. We learned how to replace different ASR providers, and this week we shifted our focus to the LLM — the part that thinks, reasons, and decides how EchoKit should reply.

We have connected EchoKit to OpenAI and OpenRouter. Today, we’re trying something different: Grok — a super-fast LLM with built-in web search.

Why Grok?

Grok, developed by xAI, stands out for a few practical reasons:

  • ⚡ Extremely fast inference: great for voice AI agents like EchoKit.

  • 🔍 Built-in web search: your device can answer questions using fresh information from the internet.

  • 🔌 OpenAI-compatible API: minimizes changes — EchoKit can talk to it just like it talks to OpenAI.

For a small device that depends on fast responses, Grok is an excellent option.

How to Use Grok as Your LLM in EchoKit

All you need to do is update your config.toml of your EchoKit Server. No code changes, no rewriting your server — just swap URLs and keys.

1. Set Grok as the LLM provider

In your config.toml, make sure the [llm] section points to Grok:

[llm]
llm_chat_url = "https://api.x.ai/v1/chat/completions"
api_key = "YOUR_API_KEY"
model = "grok-4-1-fast-non-reasoning"
history = 5

You can find your Grok API key in your xAI account dashboard. You will need to buy credits before using the Grok API.

Don't rush to close the config.toml window.

This is the special part.

Add the following section in the config.toml file:

[llm.extra]
search_parameters = { mode = "auto" }

mode = "auto" allows Grok to decide when it should fetch information from the web. Ask anything news-related, trending, or timely — Grok will search when needed.

Restart the EchoKit server

After that, save these changes, and restart your EchoKit server.

If your server is outdated, you'll need to recompile it from source. Support for Grok with built-in web search was added in a commit on December 5, 2025.

Try It Out

Press the K0 button to chat with EchoKit and try these prompts:

  • “What’s the latest news in AI today?”
  • “How’s the Bitcoin price right now?”
  • “What's the current time in San Francisco?”

If everything is configured correctly, you’ll notice Grok pulling fresh information in its responses. It feels different — the answers are more grounded in what’s happening right now.

Switching EchoKit to Grok was surprisingly simple — just a few lines in a config file. Now my device can do real-time search when a question needs up-to-date info.


If you want to share your experience or see what others are building with EchoKit + Grok:

  • Join the EchoKit Discord
  • Or share your latency tests, setups, and experiments — we love seeing them

Want to get your own EchoKit device?