|Platform|Details|
|---|---|
|Linux|x86_64: KleidiAI is disabled by default; aarch64: KleidiAI is enabled by default.|
|Android|Cross-compile for an Android device; ensure the Android NDK path is set and the correct toolchain file is provided. KleidiAI is enabled by default. SME kernels can be used if available on the device.|
|macOS|Native or cross-compilation for a Mac device. KleidiAI and SME kernels can be used if available on device.|

Currently, this module provides a thin C++ layer as well as JNI bindings for developers targeting Android-based applications. The supported backends are:
|Framework|Dependency|Input modalities supported|Output modalities supported|Neural Network|
|---|---|---|---|---|
|llama.cpp|https://github.com/ggml-org/llama.cpp|`image`, `text`|`text`|phi-2, qwen-2-VL, llama-3.2-1B|
|onnxruntime-genai|https://github.com/microsoft/onnxruntime-genai|`text`|`text`|phi-4-mini-instruct-onnx|
|mnn|https://github.com/alibaba/MNN|`image`, `text`|`text`|qwen-2.5-VL, llama-3.2-1B|
|mediapipe|https://github.com/google-ai-edge/mediapipe|`text`|`text`|gemma-2b-it-cpu-int4|


In the graphic below, a Google Pixel 8 Pro phone is connected to the USB cable:

## Launch the Voice Assistant

The app starts with this welcome screen, where you can choose between Chat and Benchmark mode:

![welcome image alt-text#center](voice_assistant_welcome.png "Welcome Screen")

Tap **Chat** to see the voice assistant pipeline in action.

## Voice Assistant Chat

In chat mode, tap **Press to talk** at the bottom of the screen to begin speaking your request.

![Chat image alt-text#center](voice_assistant_view1.png "Chat Screen")

## Voice Assistant controls

You can use application controls to enable extra functionality or gather performance data.

|Button|Control name|Description|
|---|---|---|
|1|Back to welcome screen|Go back to the welcome screen to select a mode: chat or benchmark.|
|2|Speech generation|Speech generation is disabled by default; tap this to use Android Text-to-Speech and hear spoken answers.|
|3|Reset conversation|By default, the application keeps context so you can ask follow-up questions; tap this to reset the voice assistant conversation history.|
|4|Memory used|This metric shows the memory used by the application as well as the memory available on the device.|
|5|Device thermal status|This metric shows the current heat level of the device and whether the system is applying performance throttling to prevent overheating.|
|6|User performance metrics|Performance metrics for the user's query, including the time to transcribe the query (STT, the Speech-to-Text module) and the time for the LLM to encode the query, measured in tokens per second.|
|7|Voice assistant metrics|Performance metrics for the voice assistant's reply: decode performance, measured in tokens per second.|

Click the icon circled in red in the top left corner to show or hide these metrics:

Choose the image, and add it for the voice assistant:

![add image alt-text#center](add_image.png "Add image to the question")

You can now ask questions related to this image; the large language model will use both the image and text for multimodal question answering.

![ask question image alt-text#center](voice_assistant_multimodal_2.png "Add image to the question")


As newer versions of the architecture become available, KleidiAI becomes even more powerful: simply updating the library allows applications like the multimodal Voice Assistant to take advantage of the latest architectural improvements such as SME2, without requiring any code changes. This means better performance on newer devices with no additional effort from developers.

Now that you can build the Voice Assistant with and without KleidiAI, you can test out the benchmarking functionality it provides.
---

title: Benchmark Voice Assistant

weight: 8

### FIXED, DO NOT MODIFY

layout: learningpathall

---

## Benchmarking

The Voice Assistant application also provides a benchmark mode, so you can easily test the performance of an LLM with a chosen number of input and output tokens.

![welcome image alt-text#center](voice_assistant_welcome.png "Welcome Screen")

Tap **Benchmark** to navigate to the benchmark screen.

![Benchmark image alt-text#center](voice_assistant_benchmark_1.png "Benchmark Screen")

## Benchmark controls

You can adjust the following settings to control the benchmark run:

|Setting|Default|Description|
|---|---|---|
|Input tokens|128|Number of prompt (input) tokens fed to the model before generation starts.|
|Output tokens|128|Number of new tokens the model should generate after the prompt.|
|Threads|4|Number of CPU threads used for inference.|
|Iterations|5|Number of measured benchmark runs to collect stable, averaged measurements.|
|Warmup|1|Number of warmup iterations, not counted in the benchmark results; these eliminate one-time overheads before measuring.|
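These settings correspond to the flags of the standalone `arm-llm-bench-cli` tool used later in this Learning Path. As a minimal sketch (the model path flag `-m` is omitted here, since the app manages the model itself), the defaults above compose into the following command line:

```sh
# Sketch: compose the equivalent CLI invocation from the default settings above.
# The model path (-m) is omitted; the app selects the model for you.
INPUT_TOKENS=128
OUTPUT_TOKENS=128
THREADS=4
ITERATIONS=5
WARMUP=1
CMD="./arm-llm-bench-cli -i $INPUT_TOKENS -o $OUTPUT_TOKENS -t $THREADS -n $ITERATIONS -w $WARMUP"
echo "$CMD"
```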

To dive deeper into specific areas of performance, you can build the Voice Assistant modules individually and run benchmarks on your Android device.


---

title: Performance

weight: 9

### FIXED, DO NOT MODIFY

layout: learningpathall

---

## Benchmarking LLM on Android phone

You can also benchmark the LLM functionality on an Android phone outside of the RTVA application. To do this, use the Large Language Models repository:

```
https://gitlab.arm.com/kleidi/kleidi-examples/large-language-models
```

Build for your chosen LLM backend, ensuring that `NDK_PATH` is set correctly. SME kernels are enabled by default, so first build with SME disabled:

```sh
cmake --preset=x-android-aarch64 -B build/ -DBUILD_BENCHMARK=ON -DLLM_FRAMEWORK=mnn -DMNN_SME2=OFF
cmake --build ./build
```

{{% notice %}}
For troubleshooting any build issues, refer to the [large-language-models README](https://gitlab.arm.com/kleidi/kleidi-examples/large-language-models/-/blob/main/README.md?ref_type=heads).
{{% /notice %}}

### Phone setup

Now that you have all the libraries and executables, you can create a benchmarking directory and push the libraries to the phone:

```sh
adb shell mkdir /data/local/tmp/benchmark_test/
adb push build/lib/* /data/local/tmp/benchmark_test/
```
```output
build/lib/archive/: 9 files pushed. 140.0 MB/s (36970298 bytes in 0.252s)
build/lib/libMNN.so: 1 file pushed. 139.5 MB/s (4973176 bytes in 0.034s)
build/lib/libarm-llm-jni.so: 1 file pushed. 153.8 MB/s (3832152 bytes in 0.024s)
11 files pushed. 137.0 MB/s (45775626 bytes in 0.319s)
```

Next, push the executables you will run:
```sh
adb push build/bin/* /data/local/tmp/benchmark_test/
```
```output
build/bin/arm-llm-bench-cli: 1 file pushed. 134.3 MB/s (3415344 bytes in 0.024s)
build/bin/llm-cpp-tests: 1 file pushed. 157.7 MB/s (17783848 bytes in 0.108s)
build/bin/llm_bench: 1 file pushed. 22.6 MB/s (85688 bytes in 0.004s)
build/bin/llm_demo: 1 file pushed. 12.6 MB/s (34656 bytes in 0.003s)
4 files pushed. 141.7 MB/s (21319536 bytes in 0.143s)
```
Finally, copy the models to benchmark:
```sh
adb push resources_downloaded/models/mnn/ /data/local/tmp/benchmark_test/
```

### Benchmarking the models

To make sure the screen stays on and the CPU is not throttled, use the following commands:

```sh
adb shell svc power stayon true
adb shell dumpsys deviceidle disable
```

You can now run the executable in ADB shell, providing the path to libraries and the number of iterations to benchmark:

```sh
adb shell
cd /data/local/tmp/benchmark_test/
LD_LIBRARY_PATH=./ ./arm-llm-bench-cli -m mnn/llama-3.2-1b/ -i 128 -o 128 -t 1 -n 5 -w 1
```

The flags used by the executable are listed below:
* `-m` : path to the specific model or a directory with model and configuration files
* `-i` : number of input tokens to use
* `-o` : number of output tokens to generate
* `-t` : number of threads to use
* `-n` : number of iterations used for benchmarking
* `-w` : number of warmup iterations, not included in the benchmark results

```output

=== ARM LLM Benchmark ===

Parameters:
model_path : mnn/llama-3.2-1b/
num_input_tokens : 128
num_output_tokens : 128
num_threads : 1
num_iterations : 5
num_warmup : 1


======= Results =========

| Framework | Threads | Test | Performance |
| ------------------ | ------- | ------ | -------------------------- |
| mnn | 1 | pp128 | 196.446 ± 0.377 (t/s) |
| mnn | 1 | tg128 | 27.222 ± 0.369 (t/s) |
| mnn | 1 | TTFT | 687.931 ± 2.279 (ms) |
| mnn | 1 | Total | 5354.526 ± 63.163 (ms) |

```
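If you save the benchmark output to a log file, you can extract individual numbers from the results table for further processing. A minimal sketch, using a here-document to stand in for a saved log (the results line is copied from the output above):

```sh
# Sketch: pull the pp128 throughput (tokens/s) out of a saved results table.
# The here-document stands in for a saved benchmark log.
pp128=$(awk -F'|' '/pp128/ { split($5, a, " "); print a[1] }' <<'EOF'
| mnn                | 1       | pp128  | 196.446 ± 0.377 (t/s)      |
EOF
)
echo "$pp128"
```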

To get benchmark numbers using the SME kernels, rerun the full "Benchmarking LLM on Android phone" section without setting `MNN_SME2` to `OFF`; omitting the flag enables SME instructions by default:

```sh
cmake --preset=x-android-aarch64 -B build/ -DBUILD_BENCHMARK=ON -DLLM_FRAMEWORK=mnn
cmake --build ./build
```


## Example performance with a Vivo X300 Android phone

The table below shows the measurements taken on a Vivo X300 Android phone:

| LLM Framework | Model | Threads | Without SME2 (tokens/s) | With SME2 (tokens/s) | Uplift |
|-------------------|-----------------------|---------|----------------|-----------|----------|
| mnn | qwen25vl-3b | 1 | 85 | 134 | 57.65 % |
| | | 2 | 95 | 140 | 47.37 % |
| | llama-3.2-1B | 1 | 196 | 339 | 72.96 % |
| | | 2 | 275 | 396 | 44.00 % |
| llama.cpp | qwen-2-VL | 1 | 113 | 146 | 29.20 % |
| | | 2 | 92 | 139 | 51.09 % |
| | llama-3.2-1B | 1 | 148 | 173 | 16.89 % |
| | | 2 | 124 | 191 | 54.03 % |
| | phi-2 | 1 | 58 | 77 | 32.76 % |
| | | 2 | 46 | 60 | 30.43 % |


{{% notice Note %}}
The Android system enforces throttling, so your own results may vary slightly.
{{% /notice %}}

These measurements show how fast the model processes (encodes) 128 input tokens. As the results illustrate, SME2 delivers a significant performance boost even when using just one or two CPU cores on an Android phone, so you get faster processing without needing to involve more cores.
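The uplift column is simply the relative improvement of the SME2 rate over the baseline. You can verify the arithmetic for the llama-3.2-1B single-thread mnn result (196 vs 339 tokens/s in the table above):

```sh
# Uplift = (with_SME2 - without_SME2) / without_SME2 * 100
awk 'BEGIN { printf "%.2f %%\n", (339 - 196) / 196 * 100 }'
# prints: 72.96 %
```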

learning_objectives:
- Optimize performance of multimodal Voice Assistant using KleidiAI and SME2.

prerequisites:
- An Android phone that supports the i8mm Arm architecture feature (8-bit integer matrix multiplication).
- An Android phone with support for SME (Scalable Matrix Extension) instructions, required for checking SME performance.
- This Learning Path was tested on a Vivo X300 Pro.
- A development machine with [Android Studio](https://developer.android.com/studio) installed.

author: