# EchoKit server config options
The EchoKit server orchestrates multiple AI services to turn user voice input into voice responses. It supports two general approaches.
- The pipeline approach divides the task into multiple steps and uses a different AI service for each step.
  - The ASR service turns the user's voice audio into text.
  - The LLM service generates a text response to the user input. The LLM can be aided by built-in tools, such as web searches, and by custom tools in MCP servers.
  - The TTS service converts the response text to voice.
- The end-to-end real-time model approach uses multimodal models that can directly ingest voice input and generate voice output, such as Google Gemini Live.
The pipeline approach offers greater flexibility and customization: you can choose any voice, control costs by mixing different providers, integrate external knowledge, and run components locally for privacy. While end-to-end models can reduce latency, the classic pipeline gives you full control over each component.
You can configure how those AI services work together through the EchoKit server's `config.toml` file.
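As a quick orientation, the overall shape of `config.toml` looks roughly like the sketch below. The values shown are placeholders; the actual fields in each section are covered in the chapters that follow.

```toml
# Server settings (listening address and optional welcome audio)
addr = "0.0.0.0:8080"
hello_wav = "hello.wav"

# Voice-to-text settings
[asr]

# Large language model settings, including tools and MCP actions
[llm]

# Text-to-voice settings
[tts]
```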
## Prerequisites
- A running EchoKit server. Follow the quick start guide if needed.
- API keys for your favorite AI API providers (OpenAI, Groq, xAI, OpenRouter, ElevenLabs, Gemini, etc.)
## Configure server address and welcome audio

```toml
addr = "0.0.0.0:8080"
hello_wav = "hello.wav"
```

- `addr`: The server's listening address and port.
  - Use `0.0.0.0` to accept connections from any network interface.
  - Make sure that your firewall allows incoming connections to the port (`8080` in this example).
- `hello_wav`: Optional welcome audio file played when a device connects.
  - It supports the 16kHz WAV format.
  - Make sure that the file is in the same folder as `config.toml`.
## Configure AI services

The rest of the `config.toml` file specifies how to use different AI services. Each service will be covered in its own chapter.
- The `[asr]` section configures the voice-to-text services.
- The `[llm]` section configures the large language model services, including tools and MCP actions.
- The `[tts]` section configures the text-to-voice services.
It is important to note that each of those sections has the following fields.
- A `platform` field that designates the service protocol. A common example is `openai` for OpenAI-compatible API endpoints.
- A `url` field for the service endpoint. It is typically an `https://` or `wss://` URL; the latter is the WebSocket address for streaming services.
- Optional fields that are specific to the `platform`, such as `api_key`, `model`, and others.
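Putting those fields together, a single service section might look like the sketch below. The endpoint, key, and model values here are placeholders for illustration, not a real provider.

```toml
# Hypothetical ASR section; url, api_key, and model values are placeholders
[asr]
platform = "openai"   # speak the OpenAI-compatible protocol
url = "https://api.example.com/v1/audio/transcriptions"   # https:// or wss:// endpoint
api_key = "your_api_key_here"   # platform-specific optional field
model = "example-whisper-model" # platform-specific optional field
```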
## Complete Configuration Example
You will need a free API key from Groq.
```toml
# Server settings
addr = "0.0.0.0:8080"
hello_wav = "hello.wav"

# Speech recognition using the OpenAI transcriptions API, but hosted by Groq (instead of OpenAI)
[asr]
platform = "openai"
url = "https://api.groq.com/openai/v1/audio/transcriptions"
lang = "en"
api_key = "gsk_your_api_key_here"
model = "whisper-large-v3-turbo"

# Language model using the OpenAI chat completions API, but hosted by Groq (instead of OpenAI)
[llm]
platform = "openai_chat"
url = "https://api.groq.com/openai/v1/chat/completions"
api_key = "gsk_your_api_key_here"
model = "gpt-oss-20b"
history = 10

# Text-to-speech using the OpenAI speech API, but hosted by Groq (instead of OpenAI)
[tts]
platform = "openai"
url = "https://api.groq.com/openai/v1/audio/speech"
api_key = "gsk_your_api_key_here"
model = "playai-tts"
voice = "Cooper-PlayAI"

# System personality
[[llm.sys_prompts]]
role = "system"
content = """
Your name is EchoKit, a helpful AI assistant. Provide clear, concise responses and maintain a friendly, professional tone. Keep answers brief but informative.
"""
```