# Voice-to-text services (ASR)
The EchoKit server supports popular ASR providers.
| Platform | URL example | Notes |
|---|---|---|
| openai | https://api.openai.com/v1/audio/transcriptions | Supports endpoint URLs from any OpenAI-compatible service, such as Groq and OpenRouter. |
| paraformer_v2 | wss://dashscope.aliyuncs.com/api-ws/v1/inference | A WebSocket streaming ASR endpoint provided by Ali Cloud. |
## OpenAI and compatible services
The OpenAI `/v1/audio/transcriptions` API is supported by OpenAI, OpenRouter, Groq, Azure, AWS, and many other providers.
This is a non-streaming service endpoint, meaning that the EchoKit server must determine when the user is done talking (via a VAD service), and then submit the entire audio to get a transcription.
### OpenAI example
```toml
[asr]
platform = "openai"
url = "https://api.openai.com/v1/audio/transcriptions"
api_key = "sk_ABCD"
model = "gpt-4o-mini-transcribe"
lang = "en"
vad_url = "http://localhost:9093/v1/audio/vad"
```
### Groq example
```toml
[asr]
platform = "openai"
url = "https://api.groq.com/openai/v1/audio/transcriptions"
api_key = "gsk_ABCD"
model = "whisper-large-v3"
lang = "en"
prompt = "Hello\n你好\n(noise)\n(bgm)\n(silence)\n"
vad_url = "http://localhost:9093/v1/audio/vad"
```
Notice that in both examples, we use a locally hosted VAD service to detect when the user has finished speaking. The VAD service is optional, and you can learn about it here.
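As a rough, hypothetical sketch of what a client-side check against the `vad_url` endpoint might look like (the request and response shapes below are assumptions; the actual schema is defined in the VAD service's own documentation):

```python
# Hypothetical sketch only: the real request/response schema of the EchoKit
# VAD endpoint is defined in its own docs; the shapes below are assumptions.
import requests

with open("chunk.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:9093/v1/audio/vad",  # vad_url from the configs above
        files={"file": f},
    )

# An endpoint like this would typically return detected speech segments or a
# flag marking whether the user has stopped talking.
print(resp.json())
```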
## Ali Cloud streaming ASR
The Bailian service from Ali Cloud provides excellent ASR models for Chinese speech recognition. It is also a streaming ASR service -- it takes an audio stream as input and sends back text and voice activity events as they happen. There is no need for a separate VAD service in this case.
```toml
[asr]
platform = "paraformer_v2"
url = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"
paraformer_token = "sk-API-KEY"
```
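For reference, the WebSocket exchange behind this config looks roughly like the Python sketch below. It follows Ali Cloud's published run-task protocol for real-time recognition, but treat the exact message fields and model name as assumptions and check the DashScope documentation before relying on them.

```python
# A rough sketch of the DashScope streaming exchange; message fields follow
# the published run-task protocol but should be verified against Ali Cloud's
# documentation. Requires `pip install websocket-client`.
import json
import uuid
from websocket import create_connection

ws = create_connection(
    "wss://dashscope.aliyuncs.com/api-ws/v1/inference",
    header={"Authorization": "bearer sk-API-KEY"},
)
task_id = uuid.uuid4().hex

# 1. Open a recognition task for 16 kHz 16-bit mono PCM audio.
ws.send(json.dumps({
    "header": {"action": "run-task", "task_id": task_id, "streaming": "duplex"},
    "payload": {
        "task_group": "audio", "task": "asr", "function": "recognition",
        "model": "paraformer-realtime-v2",
        "parameters": {"format": "pcm", "sample_rate": 16000},
        "input": {},
    },
}))

# 2. Stream audio as binary frames. The server interleaves JSON events with
#    partial transcripts and voice-activity boundaries; a real client reads
#    them concurrently instead of waiting until the end as done here.
with open("utterance.pcm", "rb") as f:
    while chunk := f.read(3200):  # ~100 ms of audio per frame
        ws.send_binary(chunk)

# 3. Close the task and drain the remaining result events.
ws.send(json.dumps({
    "header": {"action": "finish-task", "task_id": task_id, "streaming": "duplex"},
    "payload": {"input": {}},
}))
print(ws.recv())
ws.close()
```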
## ElevenLabs streaming ASR
Coming soon ...