# Turning spoken commands into JSON tool calls on iPhones

> Source: <https://blog.wildedge.dev/posts/in-app-voice-assistant>
> Published: 2026-06-22 08:51:50+00:00

# Speech-to-tool pipeline performance measured 

Voice interfaces feel good only when the action happens quickly. For dictation, users tolerate some delay because the output is long-form text. For [tool calls](https://platform.openai.com/docs/guides/function-calling), the expected output is small: start a timer, create a reminder, change a setting, trigger a workflow. A few seconds of latency can make the interaction feel heavy and artificial.

We ran a benchmark inside an iOS app to compare two ways of turning spoken intent into a structured tool call:

- **Direct speech-to-tool:** pass audio to one model and ask it to produce the tool call.
- **Two-step speech-to-text, then text-to-tool:** transcribe the audio first, then pass the transcript to a small text model that returns the tool call.

The practical question was simple: if the feature has to run on device, which path gets to a valid JSON tool call faster?

## The benchmark 

The benchmark used the [WildEdge Swift SDK](https://cocoapods.org/pods/WildEdge) to report processing-time telemetry from the app. [WildEdge](https://wildedge.dev) remote configuration handled prompt and model selection between benchmark runs, which let us compare paths without rebuilding the [TestFlight](https://developer.apple.com/testflight/) app.

The test set was intentionally narrow:

- 18 `.m4a` recordings across two voices
- 9 short command cases
- 3 voice-to-action use cases
- English input from non-native English speakers
- Expected JSON tool-call output for each case

Download the [benchmark audio dataset](https://drive.google.com/file/d/1TIa-j4y5QI6amfmk7Da8LF-nR_ag5Pn0/view).

The dataset contains 18 short `.m4a` command recordings. In total, it covers **60.203 seconds** of audio and **981.3 kB** of files. The average clip length is **3.345 seconds**, and the average clip size is **54.5 kB**.

This was primarily a latency benchmark, not a full model accuracy evaluation. Proper accuracy evaluation would require a much larger and more varied dataset. In this narrow test set, output accuracy was similar across models and close to 100% for most cases because the commands were intentionally simple.
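To make "expected JSON tool-call output" concrete, here is a minimal sketch of how a model's output might be compared with the expected call for a test case. The `name`/`arguments` shape and the `set_timer` example are assumptions for illustration; the article does not publish the benchmark's tool schema.

```python
import json


def matches_expected(model_output: str, expected: dict) -> bool:
    """Return True when the output parses as JSON and equals the expected call."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # not valid JSON, so not a valid tool call
    if not isinstance(call, dict):
        return False
    # Dict comparison ignores key order, so formatting differences do not matter.
    return (
        call.get("name") == expected["name"]
        and call.get("arguments") == expected["arguments"]
    )


# Hypothetical case for the spoken command "set a timer for five minutes".
expected = {"name": "set_timer", "arguments": {"minutes": 5}}
print(matches_expected('{"arguments": {"minutes": 5}, "name": "set_timer"}', expected))
print(matches_expected("set_timer(5)", expected))
```

An exact-match check like this is only reasonable for simple commands with a single correct answer, which is the situation this narrow test set describes.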

Some other constraints matter:

- No streaming. Each run starts after the full speech file is available.
- Audio conversion happens outside the measured interval. The accepted input format was a WAV container with Linear PCM audio, 16,000 Hz sample rate, one mono channel, and 16-bit integer samples.
- No model fine-tuning was involved; task-specific fine-tuning may change final latency results by reducing prompt and schema-handling overhead. For more context, see [Let’s build an on-device voice agent](https://paulabartabajo.substack.com/p/lets-build-an-on-device-voice-agent).
- Text-to-tool prompts are lightly adapted per model while staying similar in length.
- The combined two-step results below are summed stage medians, not full paired end-to-end runs.
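The accepted input format in the list above can be checked mechanically. The following is a rough sketch using Python's standard `wave` module; the helper names `is_accepted_wav` and `make_wav` are made up for this example, and the benchmark app's own conversion code is not shown in the article.

```python
import io
import wave


def is_accepted_wav(source) -> bool:
    """Check for a PCM WAV with 16,000 Hz sample rate, one channel, 16-bit samples.

    `source` is a file path or a binary file-like object.
    """
    try:
        with wave.open(source, "rb") as wav:
            return (
                wav.getframerate() == 16_000
                and wav.getnchannels() == 1
                and wav.getsampwidth() == 2  # bytes per sample, so 16 bits
            )
    except (wave.Error, EOFError):
        return False  # not a WAV container, or not uncompressed PCM


def make_wav(rate: int, channels: int) -> io.BytesIO:
    """Build a short silent WAV in memory, only for trying out the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * 160)
    buffer.seek(0)
    return buffer


print(is_accepted_wav(make_wav(16_000, 1)))
print(is_accepted_wav(make_wav(44_100, 2)))
```

Running a check like this before the timer starts keeps format problems out of the measured interval, in line with the constraint that conversion happens outside it.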

Several techniques can reduce perceived latency or time to first token: streaming input, partial decoding, voice activity detection, speculative execution, and overlapping pipeline stages. We did not evaluate those here. For this benchmark, we intentionally provided complete recordings first, then measured raw processing time across different models and devices.

## Approaches compared 

The charts below show the shape of each pipeline. They are normalized stage diagrams, not the measured benchmark medians; the measured results are reported in the sections below.

### Direct speech-to-tool 

The model receives speech input and directly produces the tool call.

This avoids an explicit intermediate transcript step, which may reduce latency and simplify the pipeline. The tradeoff is that the app needs multimodal speech-to-tool capability, which is more complex to package than a simple text-only `llama.cpp` setup.

### Speech-to-text, then text-to-tool 

The speech input is first transcribed into text. That text is then passed to a second step that generates the tool call.

This approach may be easier to debug and inspect because the app does not need to run a multimodal speech-to-tool model. The first stage produces plain text, and the second stage can use a smaller text-to-tool model. The tradeoff is that the app now has two stages to orchestrate.
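The orchestration cost of the two-step path is small in code terms. Below is a language-neutral sketch, written in Python for brevity, that runs both stages and times each one separately. The `transcribe` and `to_tool_call` callables are stand-ins, not real SDK calls; an app would plug in its own speech recognizer and text model.

```python
import time
from typing import Callable


def run_two_step(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    to_tool_call: Callable[[str], str],
) -> dict:
    """Run speech-to-text, then text-to-tool, timing each stage separately."""
    start = time.perf_counter()
    transcript = transcribe(audio)
    after_stt = time.perf_counter()
    tool_call = to_tool_call(transcript)
    end = time.perf_counter()
    return {
        "transcript": transcript,  # kept so a wrong call can be traced to a stage
        "tool_call": tool_call,
        "stt_ms": (after_stt - start) * 1000,
        "text_to_tool_ms": (end - after_stt) * 1000,
    }


# Stand-in stages with fixed outputs, only to show the call shape.
def fake_transcribe(audio: bytes) -> str:
    return "set a timer for five minutes"


def fake_to_tool_call(transcript: str) -> str:
    return '{"name": "set_timer", "arguments": {"minutes": 5}}'


result = run_two_step(b"", fake_transcribe, fake_to_tool_call)
print(result["transcript"])
print(result["tool_call"])
```

Keeping the per-stage timings apart is what later makes it possible to say which stage dominates the total.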

## Hardware matters 

We used iPhone 16 Pro as the primary benchmark device, then ran a smaller hardware sweep to understand how the direct speech-to-tool path changes across older Apple hardware. For this baseline, [LeapSDK](https://github.com/Liquid4All/leap-sdk) loaded [Liquid's LFM2 Audio 1.5B model](https://huggingface.co/LiquidAI/LFM2-Audio-1.5B).

The device order below is oldest to newest: iPhone 11, iPhone 12, iPhone 13 mini, iPhone 13 Pro Max, and iPhone 16 Pro.

The largest jump was from iPhone 11 to iPhone 12. Median direct speech-to-tool latency dropped from **17.39 seconds** on iPhone 11 to **2.61 seconds** on iPhone 12, a roughly **6.7x** improvement.

On iPhone 16 Pro, the same LeapSDK direct LFM speech-to-tool path produced a valid tool-call JSON in **1.36 seconds median**. That is inside our rough 1.5-second practical line for a voice-to-action feature, though the preferred target for this interaction is still sub-second.

LeapSDK-reported token throughput showed the same hardware story:

Throughput rose from **0.94 tokens/s** on iPhone 11 to **15.21 tokens/s** on iPhone 16 Pro, about **16.1x** higher. We treat this as [LeapSDK](https://www.liquid.ai/)-reported throughput because `GenerationStats.tokenPerSecond` does not specify which tokens it counts.

Thermals were not perfectly controlled, so these results are best read as real-device engineering data, not a lab-grade hardware benchmark. One mid-generation result was especially noisy: the iPhone 13 mini did not clearly beat the iPhone 12 on median latency in this run. Its thermal profile suggests that the two devices had different thermal characteristics, whether from hardware design or simply from age and wear.

## The framework used to load the model also matters 

The cross-device run was still useful because it kept the direct speech-to-tool baseline consistent: [LeapSDK](https://github.com/Liquid4All/leap-sdk) loaded the same LFM model across iPhones. But on the main iPhone 16 Pro benchmark device, we also compared two one-step loading paths for the same model: **Liquid LFM2.5 Audio 1.5B Speech To Tool**.

With [llama.cpp](https://github.com/ggml-org/llama.cpp) plus `libmtmd`, the one-step LFM path landed at **726 ms median**. With LeapSDK, the same direct speech-to-tool path landed at **1,359.5 ms median**.

That made `llama.cpp + libmtmd` about **1.87x faster** than the LeapSDK direct baseline, or **46.6% lower latency**. This combination turned out to be more useful for the direct speech-to-tool path, so the architecture comparisons below use the **726 ms** result as the better one-step LFM baseline. The LeapSDK results are still valuable for the cross-device iPhone comparison because they isolate the hardware effect.
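The "times faster" and "percent lower latency" figures describe the same two medians in different ways. A short sketch of the arithmetic, using the two medians reported above (the function names are just for this example):

```python
def speedup(baseline_ms: float, faster_ms: float) -> float:
    """How many times faster the second path is than the baseline."""
    return baseline_ms / faster_ms


def latency_reduction_pct(baseline_ms: float, faster_ms: float) -> float:
    """How much lower the second path's latency is, as a percentage of the baseline."""
    return (1 - faster_ms / baseline_ms) * 100


leap_ms = 1359.5   # LeapSDK direct path, median
llama_ms = 726.0   # llama.cpp + libmtmd direct path, median

print(f"{speedup(leap_ms, llama_ms):.2f}x faster")                      # 1.87x faster
print(f"{latency_reduction_pct(leap_ms, llama_ms):.1f}% lower latency")  # 46.6% lower latency
```

The two numbers are easy to confuse: a 1.87x speedup is a 46.6% reduction, not an 87% one.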

## Speech-to-text is cheap here 

For the two-step path, we first measured speech-to-text alone.

| Rank | Model | Model size [MB] | Median duration [ms] |
|---|---|---|---|
| 1 | [OpenAI Whisper Tiny](https://github.com/openai/whisper) | | 87 |
| 2 | [OpenAI Whisper Base](https://github.com/openai/whisper) | | 118 |
| 3 | [Apple Speech Recognizer](https://developer.apple.com/documentation/speech/sfspeechrecognizer) | system-provided | 184 |

The size column is the approximate model asset download size. Apple Speech Recognizer is system-provided rather than downloaded as an app-managed model file.

[OpenAI Whisper Tiny](https://github.com/openai/whisper) was the fastest speech-to-text result at **87 ms median**. [OpenAI Whisper Base](https://github.com/openai/whisper) followed at **118 ms median**. [Apple Speech Recognizer](https://developer.apple.com/documentation/speech/sfspeechrecognizer) was **184 ms median**, about **2.1x** slower than Whisper Tiny in this run, but all three are small compared with the generation-heavy stages.

That matters because the two-step path is often dismissed as obviously slower. In this benchmark, transcription was not the bottleneck.

## Text-to-tool processing is significant in the two-step path 

The next stage was text-to-tool: take the transcript and produce the JSON tool call.

| Rank | Model | Model size [MB] | Median duration [ms] |
|---|---|---|---|
| 1 | [LFM2 350M](https://arxiv.org/abs/2511.23404) Text To Tool | | 301 |
| 2 | [FunctionGemma 270M](https://huggingface.co/bartowski/google_functiongemma-270m-it-GGUF) Q4_K_M GGUF Text To Tool | | 371 |
| 3 | [Qwen3 0.6B](https://qwenlm.github.io/blog/qwen3/) 4-bit Text To Tool | | 490 |
| 4 | [TinyLlama 1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) 4-bit GGUF Text To Tool | | |
| 5 | [Apple Foundation Models](https://developer.apple.com/documentation/foundationmodels) Text To Tool | system-provided | 1111 |

The size column is the approximate model asset download size used by the benchmark app. It excludes runtime and framework binaries; Apple Foundation Models is system-provided rather than downloaded as an app-managed model file.

The fastest text-to-tool model was **LFM2 350M** at about **301 ms median**. [FunctionGemma 270M](https://huggingface.co/bartowski/google_functiongemma-270m-it-GGUF) followed at **371 ms median**. The slowest model in this run was [Apple Foundation Models](https://developer.apple.com/documentation/foundationmodels) at **1111 ms median**.

The FunctionGemma run used Bartowski's GGUF conversion of `google_functiongemma-270m-it`: `google_functiongemma-270m-it-Q4_K_M.gguf`, `Q4_K_M` quantization, and the `llama_cpp` provider. The model file is available from [Hugging Face](https://huggingface.co/bartowski/google_functiongemma-270m-it-GGUF/resolve/main/google_functiongemma-270m-it-Q4_K_M.gguf).

For voice-to-action, model and prompt size matter more than architectural neatness. The output is small and highly structured. If the model is much larger than the task requires, the user pays for it directly in latency.

## Two-step processing results

Because Whisper Tiny was the fastest speech-to-text stage, we used it for the compact combined two-step estimates below. The table includes the faster candidates for the architecture comparison plus an all-Apple reference path.

These are summed medians, not full paired end-to-end runs:

| Rank | Speech-to-text model | Speech-to-text model size [MB] | Text-to-tool model | Text-to-tool model size [MB] | Total model size [MB] | STT median [ms] | Text-to-tool median [ms] | Total median [ms] |
|---|---|---|---|---|---|---|---|---|
| 1 | [OpenAI Whisper Tiny](https://github.com/openai/whisper) | | [LFM2 350M](https://arxiv.org/abs/2511.23404) Text To Tool | | ~304 | 87 | 301 | 388 |
| 2 | [OpenAI Whisper Tiny](https://github.com/openai/whisper) | | [FunctionGemma 270M](https://huggingface.co/bartowski/google_functiongemma-270m-it-GGUF) Q4_K_M GGUF Text To Tool | | | 87 | 371 | 458 |
| 3 | [OpenAI Whisper Tiny](https://github.com/openai/whisper) | | [Qwen3 0.6B](https://qwenlm.github.io/blog/qwen3/) 4-bit Text To Tool | | | 87 | 490 | 577 |
| 4 | [OpenAI Whisper Tiny](https://github.com/openai/whisper) | | [TinyLlama 1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) 4-bit GGUF Text To Tool | | | 87 | | |
| 5 | [Apple Speech Recognizer](https://developer.apple.com/documentation/speech/sfspeechrecognizer) | system-provided | [Apple Foundation Models](https://developer.apple.com/documentation/foundationmodels) Text To Tool | system-provided | 0 | 184 | 1111 | 1295 |

The fastest two-step estimate, **Whisper Tiny + LFM2 350M**, landed at about **388 ms median** before orchestration overhead.

The size columns show the packaging tradeoff. The fastest path needs about **304 MB** of app-managed model downloads. The all-Apple path was not the most performant at **1295 ms**, but both pieces are provided "on device" by Apple, so the app-managed model download is **0 MB**.
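The caveat that these totals are summed medians rather than paired end-to-end runs is worth spelling out, because the two can differ. The sketch below first reproduces the summed totals from the stage medians reported in this article, then uses made-up per-run numbers (purely hypothetical, not benchmark data) to show that the sum of stage medians need not equal the median of per-run sums.

```python
from statistics import median

# Stage medians reported in this article, in milliseconds.
STT_MS = {"Whisper Tiny": 87, "Apple Speech Recognizer": 184}
TEXT_TO_TOOL_MS = {
    "LFM2 350M": 301,
    "FunctionGemma 270M": 371,
    "Apple Foundation Models": 1111,
}


def summed_median_ms(stt_model: str, tool_model: str) -> int:
    """Combined estimate: one stage median plus the other stage median."""
    return STT_MS[stt_model] + TEXT_TO_TOOL_MS[tool_model]


print(summed_median_ms("Whisper Tiny", "LFM2 350M"))  # 388

# Hypothetical paired runs: each index is one clip going through both stages.
stt_runs = [80, 87, 120]
tool_runs = [320, 301, 290]

sum_of_medians = median(stt_runs) + median(tool_runs)
median_of_sums = median([s + t for s, t in zip(stt_runs, tool_runs)])
print(sum_of_medians, median_of_sums)  # 388 400
```

A paired end-to-end measurement would also include orchestration overhead between the stages, which the summed estimate leaves out.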

## Direct vs two-step on iPhone 16 Pro 

To make the architecture comparison explicit, we put the better direct one-step LFM result next to selected two-step estimates, including the all-Apple path.

Using the better `llama.cpp + libmtmd` direct result from the framework comparison above, the fastest two-step estimate was about **1.9x faster** than the direct one-step LFM path: **388 ms** vs **726 ms**. The **Whisper Tiny + FunctionGemma 270M** and **Whisper Tiny + Qwen3 0.6B** estimates also stayed below that direct baseline at **458 ms** and **577 ms**.
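One way to read these numbers is against the two latency lines mentioned earlier: a rough 1.5-second practical line and a preferred sub-second target. The sketch below sorts the iPhone 16 Pro medians reported in this article into those bands; the band labels and the `latency_band` helper are our own naming for this example.

```python
PREFERRED_MS = 1000  # preferred sub-second target
PRACTICAL_MS = 1500  # rough practical line for a voice-to-action feature


def latency_band(total_ms: float) -> str:
    """Place a median latency into one of three bands."""
    if total_ms < PREFERRED_MS:
        return "sub-second"
    if total_ms <= PRACTICAL_MS:
        return "inside practical line"
    return "over practical line"


# Medians reported in this article for iPhone 16 Pro, in milliseconds.
TOTALS_MS = {
    "Whisper Tiny + LFM2 350M": 388,
    "Whisper Tiny + FunctionGemma 270M": 458,
    "Whisper Tiny + Qwen3 0.6B": 577,
    "Direct LFM (llama.cpp + libmtmd)": 726,
    "Apple Speech Recognizer + Apple Foundation Models": 1295,
    "Direct LFM (LeapSDK)": 1359.5,
}

for name, ms in sorted(TOTALS_MS.items(), key=lambda item: item[1]):
    print(f"{ms:>7.1f} ms  {latency_band(ms):<22} {name}")
```

The two-step totals here are still summed stage medians, so their bands are estimates rather than measured end-to-end results.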

There is also a slower but operationally interesting all-Apple option: [Apple Speech Recognizer](https://developer.apple.com/documentation/speech/sfspeechrecognizer) plus [Apple Foundation Models](https://developer.apple.com/documentation/foundationmodels). That is not the fastest path, but for simple cases it may be easier to include in an app because both pieces are provided by Apple.

This does not prove that two-step is always better. It does show that direct audio-to-tool is not automatically the low-latency choice.

Several unmeasured factors could change the outcome:

- **Streaming performance:** this benchmark starts after the full recording is available. A streaming direct speech-to-tool model might start work earlier, while a two-step pipeline might overlap speech recognition and tool-call generation differently.
- **Fine-tuned speech-to-tool models:** no model was fine-tuned for these commands. A task-specific direct speech-to-tool model could reduce prompt overhead, produce shorter generations, and close part of the latency gap.
- **Errors:** this dataset was intentionally simple, and most outputs were close to 100% correct. A harder dataset could change the tradeoff. Two-step systems can fail through transcription errors or text-to-tool errors, while direct speech-to-tool systems hide both problems inside one model output.

## Why two steps can win 

Direct speech-to-tool has an appealing architecture: one model, one prompt, one output. It avoids an intermediate transcript and can be easier to package conceptually. If you later decide to fine-tune for the task, there is also one model surface to tune.

But the model has to solve two problems at once, and the multimodal path usually requires more initial work:

- understand the audio
- map the intent into a constrained JSON tool call

The two-step path lets each model do a smaller job. Speech-to-text is usually very fast for short command clips. Text-to-tool is also simpler than a multimodal speech-to-tool mapping because the model receives a short transcript instead of raw audio. For short commands, that split can be faster than asking one audio model to do everything.

The bigger practical difference is app footprint. In this benchmark, the best direct speech-to-tool path required loading `llama.cpp + libmtmd` plus an audio-capable model in the app. The two-step path can use a speech recognizer and a smaller text-to-tool model instead. Direct speech-to-tool can still expose a raw transcription, so debugging access is not the main differentiator here.

## What WildEdge made visible 

This benchmark is small, but it is a good example of why [edge ML telemetry](/posts/your-ml-telemetry-should-live-in-your-data-stack) needs to be stage-level and device-aware.

The useful questions were not just "which model is faster?"

They were:

- How does latency change across different devices (iPhone generations)?
- What processing stage is the critical path that we could optimize?
- Are thermal states distorting a device comparison?
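Questions like these come down to grouping the same latency records in different ways. The sketch below is a generic illustration, not the WildEdge SDK or its data model: the record fields are hypothetical, and the thermal state names simply mirror the iOS `ProcessInfo.ThermalState` cases (nominal, fair, serious, critical).

```python
from collections import defaultdict
from statistics import median


def median_by_device_and_stage(records, allowed_thermal=("nominal", "fair")):
    """Group durations by (device, stage), skipping runs recorded while hot."""
    groups = defaultdict(list)
    for record in records:
        if record["thermal_state"] not in allowed_thermal:
            continue  # a throttled run would distort the device comparison
        groups[(record["device"], record["stage"])].append(record["duration_ms"])
    return {key: median(values) for key, values in groups.items()}


# Hypothetical telemetry records, made up for this example.
records = [
    {"device": "iPhone 16 Pro", "stage": "stt", "duration_ms": 85, "thermal_state": "nominal"},
    {"device": "iPhone 16 Pro", "stage": "stt", "duration_ms": 91, "thermal_state": "fair"},
    {"device": "iPhone 16 Pro", "stage": "stt", "duration_ms": 240, "thermal_state": "serious"},
    {"device": "iPhone 16 Pro", "stage": "text_to_tool", "duration_ms": 300, "thermal_state": "nominal"},
    {"device": "iPhone 12", "stage": "stt", "duration_ms": 150, "thermal_state": "nominal"},
]

medians = median_by_device_and_stage(records)
for (device, stage), value in sorted(medians.items()):
    print(device, stage, value)
```

Filtering on thermal state is one possible policy; reporting medians per thermal state side by side would answer the third question more directly.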

## Conclusions 

For short on-device voice commands, our current takeaways are:

- Recommended single-shot model: **Liquid LFM2.5 Audio 1.5B Speech To Tool**, measured at **726 ms median** on iPhone 16 Pro with `llama.cpp + libmtmd`. This was the only direct speech-to-tool model we managed to fit and run on our test devices.
- For two-step processing, **LFM2 350M**, **FunctionGemma 270M**, and **Qwen3 0.6B** are all worth checking further. With Whisper Tiny as the speech-to-text stage, all three stayed below one second in this benchmark; a larger follow-up should evaluate accuracy before treating any of them as the default.
- **Apple Speech Recognizer + Apple Foundation Models** is an interesting alternative for apps that want to stay in the Apple ecosystem. It was not the fastest path, but both pieces are provided on device by Apple and do not require an app-managed model download.
