Configure the ASR-LLM-TTS Pipeline in EchoKit Server
EchoKit supports two pipeline approaches: the ASR-LLM-TTS pipeline (classic modular approach) and the end-to-end pipeline (single integrated model like Gemini Live).
- ASR-LLM-TTS pipeline for EchoKit: Current article
- End-to-end pipeline for EchoKit
The ASR-LLM-TTS pipeline offers greater flexibility and customization - you can choose any voice, control costs by mixing different providers, integrate external knowledge, and run components locally for privacy. While end-to-end models can reduce the latency, the classic pipeline gives you full control over each component.
How the ASR-LLM-TTS Pipeline Works
EchoKit processes your voice through three distinct stages:
- ASR (Automatic Speech Recognition) converts speech to text
- LLM (Large Language Model) generates a natural language response
- TTS (Text-to-Speech) converts the response back into speech
You can customize each stage independently by editing the EchoKit server's config.toml
file.
Prerequisites
Before configuring your pipeline, ensure you have:
- EchoKit server soruce code – Follow the guide if needed
- API keys for your chosen providers (Groq, OpenAI, etc.)
- Audio files for welcome messages (optional)
- Local services running (if using local models)
Configure Server Address and Welcome Audio
addr = "0.0.0.0:8080"
hello_wav = "hello.wav"
addr
: The server's listening address and port- Use
0.0.0.0
to accept connections from any network interface - Common ports:
8080
,9090
,3000
or any numbers you prefer
- Use
hello_wav
: Welcome audio file played when a device connects- Supports WAV format
- Notice the file path
- Optional – remove this line if you don't want a greeting
Configure the ASR Model and VAD Model
[asr]
url = "https://api.groq.com/openai/v1/audio/transcriptions"
lang = "en"
api_key = "gsk_your_groq_api_key_here"
model = "whisper-large-v3-turbo"
vad_realtime_url = "ws://localhost:9093/v1/audio/realtime_vad"
ASR (Speech-to-Text)
- Converts speech into text
- OpenAI-compatible, works with multiple providers
- Popular Providers:
- Groq (fast, cost-effective)
- OpenAI (high accuracy)
- Azure Speech Services
- Local Whisper deployment
- Set
lang
to your target language (en
,zh
,es
,fr
, etc.) - Required: Yes – the pipeline cannot function without ASR
VAD (Voice Activity Detection)
- Detects when humans are speaking vs. silence/noise
- Reduces unnecessary processing and improves response quality
- Local Deployment: coming soon
- Optional: Remove
vad_realtime_url
if not using VAD
Configure the LLM Model
[llm]
llm_chat_url = "https://api.groq.com/openai/v1/chat/completions"
api_key = "gsk_your_groq_api_key_here"
model = "gemma2-9b-it"
history = 3
LLM Settings
- Generates intelligent responses from transcribed text
- OpenAI-compatible, supports many providers
- Popular Models:
qwen-3-4b
– Fast, efficient, good for conversationsllama-3.1-8b-instant
– Balanced performancegpt-4o-mini
– High quality, higher costclaude-4
– Fast, creative responses- Local LLM deployment
History Management
history
: Number of previous messages to remember1
– Only remembers the last exchange (minimal context)3-5
– Good balance of context and resource usage10+
– Rich context but higher API costs and latency
- Recommendation: Start with
3
and adjust based on your use case
Configure the TTS Model
[tts]
platform = "Groq"
url = "https://api.groq.com/openai/v1/audio/speech"
api_key = "gsk_your_groq_api_key_here"
model = "playai-tts"
voice = "Aaliyah-PlayAI"
TTS (Text-to-Speech)
- Converts LLM responses into natural-sounding speech
- OpenAI-compatible, supports many providers
- Required: Yes – needed for EchoKit to speak responses
- Popular Model Provider:
- Groq (fast, cost-effective)
- OpenAI (high accuracy)
- Azure Speech Services
- Local GPT-SoVits deployment: coming soon
Custom Voice Training
💡 Advanced: Create custom voices using GPT-SoVITS
Configure the System Prompt
[[llm.sys_prompts]]
role = "system"
content = """
Your name is EchoKit, a smart and highly individualistic AI assistant. Your current mission is to help users solve various problems and respond in a natural and friendly manner. Keep responses concise but helpful.
"""
System Prompt Guidelines
- Defines: EchoKit's personality, behavior, and response style
- Best Practices:
- Keep it concise but specific
- Include response length guidelines
- Specify the desired tone and personality
- Mention any domain expertise needed
Example System Prompts
Language Tutor:
content = """
You are EchoKit, a patient and encouraging language tutor. Help users practice conversation, correct mistakes gently, and provide vocabulary tips. Keep responses short and conversational.
"""
Technical Assistant:
content = """
You are EchoKit, a knowledgeable technical assistant. Provide clear, accurate answers about programming, software, and technology. Use simple explanations and offer practical solutions.
"""
Companion:
content = """
You are EchoKit, a friendly and empathetic companion. Listen actively, provide emotional support, and engage in meaningful conversations. Be warm, understanding, and encouraging.
"""
Complete Configuration Example
# Server settings
addr = "0.0.0.0:8080"
hello_wav = "welcome.wav"
# Speech recognition
[asr]
url = "https://api.groq.com/openai/v1/audio/transcriptions"
lang = "en"
api_key = "gsk_your_api_key_here"
model = "whisper-large-v3-turbo"
vad_realtime_url = "ws://localhost:9093/v1/audio/realtime_vad"
# Language model
[llm]
llm_chat_url = "https://api.groq.com/openai/v1/chat/completions"
api_key = "gsk_your_api_key_here"
model = "gemma2-9b-it"
history = 3
# Text-to-speech
[tts]
platform = "Groq"
url = "https://api.groq.com/openai/v1/audio/speech"
api_key = "gsk_your_api_key_here"
model = "playai-tts"
voice = "Cooper-PlayAI"
# System personality
[[llm.sys_prompts]]
role = "system"
content = """
Your name is EchoKit, a helpful AI assistant. Provide clear, concise responses and maintain a friendly, professional tone. Keep answers brief but informative.
"""
Performance Optimization
For Speed:
- Use Groq for fast inference
- Set
history = 1
for minimal context - Choose "turbo" or "instant" model variants
For Quality:
- Use OpenAI or Anthropic models
- Increase
history
for better context - Use premium TTS services like ElevenLabs
For Cost:
- Use local model deployments
- Reduce
history
to minimize token usage - Choose cost-effective providers like Groq
Next Steps
👉 After modifying the configuration file, restart the server to apply changes.