The RTen machine learning runtime - a 2024 retrospective
Dec 30, 2024 · 9-minute read
RTen is a machine learning runtime for Rust that I have been working on since mid 2022. It allows you to take models that have been trained in Python using a framework such as PyTorch or Keras, and run them in a Rust project.
This includes many pre-trained models that are available on places such as Hugging Face. Models are first exported to ONNX, which is a portable exchange format for models, then converted to a FlatBuffers-based format (.rten) which can be loaded more efficiently at runtime.
The project started out as a runtime for the specific use case of Ocrs, an OCR library and CLI tool. Since then it has grown into a much more general-purpose ONNX runtime that can run a wide range of models.
This is the largest open source project I have created from scratch, clocking in at around 60K lines of Rust (thanks tokei!). Along the way I have learned a lot about writing high performance number-crunching code, as well as how Rust’s approach to memory and thread safety fares in this context. I plan to do a separate blog post about that.
The Rust machine learning runtime landscape
There are several machine learning runtimes available for Rust. These include Tract, Candle and Burn. There are also wrappers around established C++ runtimes such as Ort (ONNX Runtime).
A complete comparison is out of scope for this post, but I will list some of the main dimensions along which they vary, plus try to indicate where RTen sits in this context. Broadly speaking, runtimes vary by:
- The hardware they are best optimized for (CPU, GPU, NPU, low or high end)
- Whether models are written as Rust programs that perform operations eagerly (like PyTorch), represented as model graphs which are interpreted (like CoreML, TensorFlow Lite or ONNX Runtime), or compiled into model-specific optimized programs.
- Whether the runtime is Rust-native or a wrapper around an existing C++ project.
- Maturity, popularity and other non-ML specific attributes
In this framing, RTen is a Rust-native, inference-only interpreter of exported model graphs, and is initially focused on CPU inference. The focus on CPU inference is because the initial use case involved smaller models for which this is viable, and because I wanted to support a wide range of systems.
Some of its relative strengths are:
- It has a lean set of dependencies, all of which are written entirely in Rust. This makes it portable to any environment where the Rust standard library can build, including WebAssembly. It also keeps compile times from blowing up too much. The rten CLI tool is an example of a small application that uses the library: a from-scratch release build takes about 30 seconds on my 2020 Intel MacBook Pro. To prove the point about portability, earlier this year Xie Jingyi figured out how to embed a copy of RTen inside a font, in order to synthesize handwriting as you type using the font rendering engine’s WebAssembly shaper support. Crazy, but also pretty cool :)
- Among the options that are not wrappers for existing C++ projects, its CPU inference performance is comparatively well optimized.
By contrast, its major limitations are:
- There is no support for GPUs or other accelerators.
- Support for quantized models is incomplete (they run, but very slowly). This means it isn’t a great choice for LLMs beyond 1B params or so, as memory bandwidth (i.e. the time taken to stream weight data into the CPU cores) becomes the dominant factor in inference performance at that scale.
RTen development highlights in 2024
RTen was initially released along with Ocrs at the end of 2023. Since then a lot of development work has happened, which I will split into features and performance:
Features
- Support for ONNX operators improved. 17 new operators were added, including conditionally run subgraphs (If). This was fewer than I anticipated adding, but it turns out that many recent models can be described with a relatively small set of operations, which were already supported in the initial release.
- A new rten-generate crate was added to make it much easier to run generative models (aka transformer decoders) such as Whisper or LLMs. This crate handles the process of feeding the model input tokens and sampling results in a loop (see the decoding-loop sketch after this list).
- 14 examples were added or improved, covering a variety of use cases, including:
  - Transformer-based OCR models: Nougat (PDF -> Markdown) and TrOCR (printed or handwritten text)
  - Speech recognition: Whisper and wav2vec. Here I’d like to thank Igor Yusupov for creating an initial Whisper demo project that prompted me to work on many general improvements that benefit the broad class of encoder-decoder models which Whisper falls into.
  - Voice activity detection: Silero VAD
  - TTS: Piper
  - Object detection: YOLOv11
  - Image captioning and image/text similarity: CLIP, Mozilla’s DistilViT
  - Image segmentation: RMBG (background removal) and Segment Anything
  - Image depth estimation: Depth Anything
  - LLMs / Chat: GPT2, Qwen 2 / SmolLM
- The built-in debugging and profiling tools were improved. There is a CLI tool that makes it easy to inspect the inputs and outputs of models, run them with randomly generated inputs and benchmark execution times.
- Support for models larger than 2GB. Both the FlatBuffers and ONNX formats have 2GB file size limits. ONNX handles this by storing larger weights in a separate file. For RTen I designed a simple container format which consists of a header, then the model graph as a FlatBuffer, then the tensor data with an alignment that allows for memory mapping (see the layout sketch after this list).
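To make the shape of what rten-generate handles concrete, here is a minimal sketch of a greedy decoding loop for an autoregressive model. The ModelSession trait and its next_token_logits method are hypothetical stand-ins, not the actual rten-generate API; the real crate wraps this feed-tokens-and-sample pattern (plus details like KV-cache management and configurable sampling) behind a higher-level interface.

```rust
/// Hypothetical interface for an autoregressive model; this only illustrates
/// the feed-tokens-and-sample loop, not rten-generate's real API.
trait ModelSession {
    /// Run the model on the tokens generated so far and return logits
    /// (one score per vocabulary entry) for the next token.
    fn next_token_logits(&mut self, tokens: &[u32]) -> Vec<f32>;
}

/// Greedy decoding: repeatedly pick the highest-scoring next token until the
/// end-of-sequence token is produced or a length limit is reached.
fn generate(model: &mut dyn ModelSession, prompt: &[u32], eos: u32, max_len: usize) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    while tokens.len() < max_len {
        let logits = model.next_token_logits(&tokens);
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(idx, _)| idx as u32)
            .expect("empty vocabulary");
        tokens.push(next);
        if next == eos {
            break;
        }
    }
    tokens
}
```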
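The container format from the last item can be pictured roughly as below. This is an illustrative sketch of the layout idea (fixed header, FlatBuffers graph, aligned tensor data), not the actual .rten specification: the header fields and helper functions are made up, and the memmap2 crate is an assumed dependency used here to show why the alignment matters.

```rust
// Illustrative layout only; the real .rten header differs.
// [ header | model graph (FlatBuffers) | padding | tensor data (aligned) ]
//
// Because the tensor data section starts at an aligned offset, the runtime can
// memory-map the file and hand out slices into it directly, instead of copying
// each weight tensor into its own heap allocation.

use memmap2::Mmap; // assumed dependency for memory mapping
use std::borrow::Cow;
use std::fs::File;

/// Hypothetical header (unused in this sketch; shown only to illustrate the layout).
struct HypotheticalHeader {
    magic: [u8; 4],          // file identification
    version: u32,            // format version
    graph_offset: u64,       // where the FlatBuffers graph starts
    graph_len: u64,
    tensor_data_offset: u64, // aligned start of the weight data
}

fn load(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Safety: the file must not be modified while mapped.
    unsafe { Mmap::map(&file) }
}

/// Serve a weight tensor as a borrowed slice of the mapped file, falling back
/// to a copy only if the stored alignment does not match what the kernels need.
fn tensor_bytes(mmap: &Mmap, offset: usize, len: usize, align: usize) -> Cow<'_, [u8]> {
    let bytes = &mmap[offset..offset + len];
    if (bytes.as_ptr() as usize) % align == 0 {
        Cow::Borrowed(bytes)
    } else {
        Cow::Owned(bytes.to_vec())
    }
}
```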
Performance status and progress
Broadly speaking, machine learning model inference consists of two kinds of work: mixing and transforming numbers, and moving data around in memory. In order to achieve good performance you need to use all available forms of parallelism for the first kind (ILP, SIMD and multi-threading), and minimize the amount of time spent on the second. Per-operation overhead also needs to be minimized, since more complex models have thousands of operations in total.
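As a small illustration of the first kind of work: writing a reduction with several independent accumulators lets the CPU keep multiple multiply-adds in flight at once (instruction-level parallelism) and gives the compiler an easy shape to vectorize, while multi-threading then splits the outer loops of larger operations across cores. This is a generic sketch, not RTen's actual kernel code.

```rust
/// Dot product with four independent accumulators. Splitting the reduction
/// breaks the dependency chain between iterations, letting the CPU overlap
/// several multiply-adds (and making auto-vectorization easier).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 4];
    let chunks = a.len() / 4;
    for i in 0..chunks {
        for lane in 0..4 {
            let idx = i * 4 + lane;
            acc[lane] += a[idx] * b[idx];
        }
    }
    // Handle the tail elements that don't fill a whole chunk.
    let mut tail = 0.0f32;
    for idx in chunks * 4..a.len() {
        tail += a[idx] * b[idx];
    }
    acc.iter().sum::<f32>() + tail
}
```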
I primarily use ONNX Runtime (ORT) as a yardstick for performance since it is a mature and widely used runtime. RTen’s relative performance will vary depending on the model and CPU. On the very mid-level Intel i5 on which most development has been done, RTen is typically about 20% slower than ONNX Runtime, down from over 100% slower at the start of the year. About half of the gap comes from less efficient use of multiple cores and the remainder from tuning and optimizations that apply in both single and multi-core contexts.
For transformer decoders I also look at whisper.cpp / llama.cpp. When I first got a Whisper (base model) demo running, it took 25 seconds to transcribe two minutes of audio (~5x realtime). That has now been reduced to under 5.5 seconds (>20x realtime). This is slightly faster than whisper.cpp on my system: whisper.cpp is faster in the decoding phase, but this is offset by it being slower during encoding.
Changes made in 2024 that contributed to this progress include:
- A lot of work was done to reduce unnecessary memory reads and writes during inference. This includes:
  - Avoiding zero-initializing operator output buffers that are going to be fully overwritten.
  - Avoiding repeatedly allocating and freeing large buffers during inference. Allocating and freeing large buffers initially accounted for a small but significant fraction of execution time. To mitigate these costs, RTen now maintains a pool of used buffers for the duration of each inference. Operators allocate from this pool if possible, otherwise they fall back to the system allocator. When a tensor is no longer needed, its buffer is extracted and added to the pool until the end of inference (see the buffer pool sketch after this list).
  - Avoiding copying weights into individual buffers when loading the model. Instead the entire model is either memory mapped, or “served” from the single Vec<u8> into which the model file was read from disk initially. The file format is designed so that alignment of tensors almost always enables this in practice, with a fallback to copying if not.
- A graph optimizer was added. This reduces the memory bandwidth used for inference by applying optimizations such as:
  - Constant propagation. This eliminates parts of the graph that don’t change between runs by computing them once, ahead of time.
  - Fusions. These pattern-match sequences of operations and replace them with alternatives that compute the same result but require fewer passes over the data (see the fusion sketch after this list).
- Better matrix-vector multiplication kernels were added. Auto-regressive models such as LLMs or Whisper spend a lot of time in matrix-multiply operations which are actually vector-matrix products. These require a different strategy to achieve optimal performance compared to large matrix-matrix multiplications with square inputs (M=N=K).
- The execution planner was made smarter. It will now re-order the execution sequence to allow running more operations in-place on the input rather than allocating a new output buffer.
- Copying of tensors has been optimized via blocking and tiling. This turns out to be especially important in the common case where tensor dimension sizes are powers of 2, as such sizes lead to cache conflicts and poor cache usage (see the transpose sketch after this list).
- The thread pool was tuned so that the number of threads matches the number of physical rather than logical processors. For compute-heavy work like ML inference, this can be significantly more efficient on systems with SMT / Hyper-Threading (see the thread pool sketch after this list). A related issue I plan to look at next year is tuning the thread pool size better on systems with heterogeneous processors (eg. performance and efficiency cores).
- Vectorized (SIMD) kernels were added for key operations such as reductions, normalization (batch norm, layer norm, RMS norm) and popular activation functions (GeLU, SiLU, Swish).
- The image-matrix conversions (im2col, col2im) used by Conv and ConvTranspose operations were significantly optimized.
- The portable SIMD library used for implementing vectorized kernels gained support for Arm and AVX-512. AVX-512 currently requires nightly Rust.
- A recent addition is the ability to prepack weights, when the model is loaded, into a format that is optimized for the matrix multiplication kernels used by the current hardware. This trades slower model load times and higher memory usage for modest wins in inference speed.
- Various optimizations were applied to reduce general interpreter overhead (reducing object sizes, fewer allocations, more efficient ref counting, faster hashing).
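The buffer pool idea from the memory item above can be sketched as follows. This is a much-simplified illustration, assuming buffers are plain Vec<f32>s matched by capacity; RTen's actual pool is more involved.

```rust
/// A very simplified buffer pool: buffers released during an inference run are
/// kept here and handed back out to later operators, so large allocations are
/// reused instead of repeatedly going through the system allocator.
struct BufferPool {
    free: Vec<Vec<f32>>,
}

impl BufferPool {
    fn new() -> Self {
        BufferPool { free: Vec::new() }
    }

    /// Take a buffer with at least `capacity` elements from the pool,
    /// falling back to a fresh allocation if none is available.
    fn alloc(&mut self, capacity: usize) -> Vec<f32> {
        if let Some(pos) = self.free.iter().position(|b| b.capacity() >= capacity) {
            let mut buf = self.free.swap_remove(pos);
            buf.clear();
            buf
        } else {
            Vec::with_capacity(capacity)
        }
    }

    /// Return a buffer to the pool once its tensor is no longer needed.
    fn release(&mut self, buf: Vec<f32>) {
        self.free.push(buf);
    }
}

// Usage within one inference run (sketch):
// let mut pool = BufferPool::new();
// let out = pool.alloc(output_len);  // operator output buffer
// ... run the operator, wrap `out` in a tensor ...
// pool.release(out);                 // when the tensor is no longer needed
```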
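To make the fusion idea concrete, here is a toy pattern matcher that merges a MatMul immediately followed by an Add into a single fused op, a common example of this kind of rewrite. Real fusions in RTen operate on a graph structure and have to check shapes and operand use counts; this sketch uses a flat chain of ops purely to show the match-and-replace shape of the transformation.

```rust
/// Toy operation set; a real graph has tensors, attributes and edges.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    MatMul,
    Add,
    Relu,
    /// Fused matrix multiply + bias add, computed in one pass over the output.
    MatMulAdd,
}

/// Replace every `MatMul` immediately followed by `Add` with the fused op.
/// The fused version produces the same result but writes the output once,
/// instead of materializing the intermediate MatMul result in memory.
fn fuse_matmul_add(ops: &[Op]) -> Vec<Op> {
    let mut out = Vec::with_capacity(ops.len());
    let mut i = 0;
    while i < ops.len() {
        if ops[i] == Op::MatMul && ops.get(i + 1) == Some(&Op::Add) {
            out.push(Op::MatMulAdd);
            i += 2;
        } else {
            out.push(ops[i].clone());
            i += 1;
        }
    }
    out
}

fn main() {
    let chain = vec![Op::MatMul, Op::Add, Op::Relu];
    assert_eq!(fuse_matmul_add(&chain), vec![Op::MatMulAdd, Op::Relu]);
}
```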
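The tensor-copy item refers to a standard blocking/tiling trick, sketched below as a blocked 2D transpose-copy. Processing the matrix in small tiles keeps both the source and destination accesses within a handful of cache lines, which matters most when dimensions are powers of two and naive strided accesses keep evicting each other. This is a generic illustration, not RTen's actual copy routine.

```rust
/// Transpose `src` (rows x cols, row-major) into `dst` (cols x rows, row-major),
/// processing BLOCK x BLOCK tiles so that reads and writes stay cache-friendly.
fn transpose_blocked(src: &[f32], dst: &mut [f32], rows: usize, cols: usize) {
    const BLOCK: usize = 16;
    assert_eq!(src.len(), rows * cols);
    assert_eq!(dst.len(), rows * cols);
    for row_block in (0..rows).step_by(BLOCK) {
        for col_block in (0..cols).step_by(BLOCK) {
            // Copy one tile. Within the tile, both source and destination
            // touch only a small, reused set of cache lines.
            for r in row_block..(row_block + BLOCK).min(rows) {
                for c in col_block..(col_block + BLOCK).min(cols) {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
        }
    }
}
```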
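For the thread pool sizing item, the idea looks roughly like this, using the num_cpus and rayon crates as assumed dependencies; RTen's own thread pool setup may differ.

```rust
// Assumes the `rayon` and `num_cpus` crates; shown only to illustrate sizing
// the pool by physical core count rather than the default logical count.
fn init_thread_pool() {
    // The logical count includes SMT / Hyper-Threading siblings, which share
    // execution resources and rarely help compute-bound kernels.
    let physical = num_cpus::get_physical();
    rayon::ThreadPoolBuilder::new()
        .num_threads(physical)
        .build_global()
        .expect("thread pool already initialized");
}
```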
Plans for 2025
Some of the areas where I plan to make progress next year are:
- Continue to work towards performance parity with ONNX Runtime for fp32 inference, as determined on a few key models (eg. the most downloaded models on Hugging Face).
- Ship production-grade support for quantized models.
- Take advantage of WebAssembly Relaxed SIMD, which exposes some important fused multiply-add and int8 dot-product instructions.
- Look into newer instruction sets and co-processors on more modern CPUs (eg. Arm SVE / SME, Apple’s AMX co-processor).
- Start to look at GPU support.
- Actually do some more work on the downstream Ocrs project, which was the original reason for creating the runtime.
Feedback
I am interested in hearing from folks who have use cases in this area, especially if you encountered shortcomings with the options you’ve looked at so far.