RTen in 2025
Dec 31, 2025 · 12-minute read
RTen is a machine learning runtime for Rust that was first released in December 2023. It enables you to take models that have been trained in Python (eg. using PyTorch) and run them in Rust. It supports models in ONNX format, many of which you can find on Hugging Face or export from frameworks such as PyTorch.
The project started out as a runtime for Ocrs, an OCR library and CLI tool. It has since evolved into a general purpose ONNX runtime which can run a wide variety of popular models across many domains (vision, audio, text, etc.). The current version only supports CPU inference, though I plan to change this next year.
Some general characteristics of RTen as an ML runtime are:
- Support for a wide range of models
- Good CPU inference performance
- Written entirely in Rust, including all of the low-level compute kernels. This keeps the build process simple and makes it easy to port to new environments.
- Relatively lightweight, in terms of number of dependencies, build time and binary size.
- A focus on fast model load times (good for eg. CLI tools)
- A holistic approach to safety and robustness. This includes following the Rustic approach of limiting the scope of unsafe and building up safe abstractions, as well as being judicious about the choice of dependencies.
Progress in 2025
I wrote about progress in 2024 in a previous post. During 2025, there were 9 feature releases bringing improvements in:
- Ease of use
- Model compatibility
- Quantization
- Performance and memory safety
Ease of use
The biggest improvement to ease of use is that RTen can now load .onnx models directly. Previously you had to run a Python tool to convert them to a custom .rten format. The custom format is designed for efficient loading, avoiding the need to copy weights. ONNX models, in contrast, are just very large Protocol Buffers messages whose data is not appropriately aligned for memory mapping. Most inference runtimes read an ONNX file into memory, then copy each tensor's weights into aligned memory. By writing a custom, entirely safe Protocol Buffers parser, I was able to minimize the number of copies that happen when reading large .onnx files. This brings the convenience of not needing to convert models, with only a modest hit to loading time. More recent ONNX tooling also supports using separate files for the model structure (.onnx) and weights (.onnx.data or .onnx_data), where the data in the weights files is already appropriately aligned. RTen can now efficiently load models with external weights.
Support for loading ONNX models directly makes it easier to switch to, or away from, RTen as an inference runtime. This reduces the risk of adoption because you can just ship a new binary and don't have to coordinate deployment of new model files.
Other changes include easier-to-use APIs for constructing values passed into the model and extracting outputs, better documentation and more helpful error messages.
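In outline, loading and running a model looks something like the sketch below. It is based on the API as documented in earlier releases (Model::load_file, run_one and NdTensor from the rten-tensor crate); exact names and signatures may differ in the current version, so treat it as a rough sketch rather than a definitive example.

```rust
use rten::Model;
use rten_tensor::NdTensor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load an .onnx file directly; the same call also accepts .rten files.
    let model = Model::load_file("model.onnx")?;

    // Create a dummy NCHW image input and run the model, assuming a single
    // input and a single output. Real code would fill the tensor with data.
    let input = NdTensor::<f32, 4>::zeros([1, 3, 224, 224]);
    let _output = model.run_one(input.view().into(), None)?;
    Ok(())
}
```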
To test these changes, I ported a few projects which use ONNX Runtime via the ort crate to use RTen. This includes a port of parakeet-rs (speech recognition via NVIDIA's Parakeet V3) and a port of Magika (Google's machine-learning powered file type detection). Note that you need to check out the correct branch: the main/master branch in these repositories is the original code. The resulting ports run at approximately the same speed as the original versions on my M3 MacBook Pro; your mileage may vary. The upside is that the ported versions have fewer dependencies, smaller binaries and avoid some of the complications that come with relying on a large pre-built static library of C++ code. Note that I don't intend to maintain these ports; their purpose is just to demonstrate the changes required.
Model compatibility
A machine learning runtime is only useful if it can run the models you care about. In the past year the set of supported operators and data types has expanded, enabling RTen to run a wide variety of pre-trained models across various domains (vision, audio, text etc.). New capabilities include:
- Support for the Loop operator and sequence types. This is ONNX's version of "for" or "while" loops and it comes up in models that deal with variable-length sequences.
- Support for 18 new operators in total
- Support for int8 and int4-quantized models, taking advantage of various architecture-specific instructions (VNNI on x86, SDOT and i8mm on Arm)
- Initial support for the MatMulNBits operator, which is not part of the ONNX standard but is used in practice by many published LLM models.
Enabled by these changes, I have added examples for various new models including Llama 3, SmolLMv3, Kokoro, ByT5 (tokenizer-free translation), RT-DETR (object detection) and ModernBERT. For people who follow /r/LocalLLaMA, I know Llama 3 is old in AI years, but one has to start somewhere.
Ecosystem reliance on non-standard operators
One current problem in the ONNX ecosystem is that many published models for generative AI rely on operators which are supported by ONNX Runtime but are not part of the ONNX standard. There is fortunately an easy way to export many of these models using only standard operators, but nevertheless this situation is confusing for non-expert developers. I think this reliance is a mistake that adds technical debt, as the non-standard ops are more prone to specification problems.
That said, I tend to believe that de facto standards matter more than de jure ones, so I have started to add support for the most important non-standard ops, starting with MatMulNBits. This operator provides fine-grained (blockwise) quantization with int4 weights and is essential to achieve a good combination of accuracy, performance and memory usage.
Fortunately many of these operators do have standard equivalents in the latest ONNX version, so this need may diminish over the next year.
Quantization
Quantization reduces file sizes and improves performance by reducing the amount of memory bandwidth needed to move weights between main memory and compute cores. This year RTen gained support for int8-quantized models, taking advantage of architecture-specific instructions for int8 dot products (AVX-512 VNNI or "DL Boost", Arm dot product and i8mm), as well as initial support for int4-quantized models, which is not yet fully optimized.
In my view, the ONNX Runtime documentation around quantization doesn't make it easy to choose which quantization settings will give good results across a range of hardware. So I wrote some documentation of my own to explain the different choices around quantization that can affect accuracy and performance, and an associated script to make it easy to quantize models with what I consider to be sensible defaults.
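To give a rough idea of how blockwise int4 quantization works, here is a deliberately simplified, illustrative sketch. The data layout (nibble order, how scales and zero points are stored, block size) is an assumption for illustration and this is not RTen's internal code:

```rust
/// Illustrative blockwise int4 dequantization. Each block of weights shares a
/// single f32 scale and integer zero point; two 4-bit values are packed per byte.
fn dequantize_int4_block(packed: &[u8], scale: f32, zero_point: i32) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &byte in packed {
        // Low nibble first, then high nibble (the ordering is an assumption).
        for q in [byte & 0x0F, byte >> 4] {
            // w ≈ scale * (q - zero_point)
            out.push(scale * (q as i32 - zero_point) as f32);
        }
    }
    out
}

fn main() {
    // One block of 8 packed bytes -> 16 int4 weights.
    let packed = [0x21u8, 0x43, 0x65, 0x87, 0xA9, 0xCB, 0xED, 0x0F];
    let weights = dequantize_int4_block(&packed, 0.05, 8);
    assert_eq!(weights.len(), 16);
    println!("{:?}", weights);
}
```

Because each small block gets its own scale (and zero point), outlier weights in one block don't force a coarse scale onto the rest of the tensor, which is why blockwise schemes preserve accuracy better than per-tensor int4 quantization.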
Performance and safety
Outside of quantization, there were performance improvements in many other areas. On my M3 Pro system, I generally now find that CPU inference performance is in the same ballpark as ONNX Runtime. It may be a little better or worse depending on the model. Improvements include:
- A re-designed portable SIMD library. This allows defining efficient vectorized operations across all supported platforms using entirely safe code. Unlike std::simd, the API is designed to be vector-length agnostic, similar to Google's Highway. The intent is that the API will also support ISAs with runtime-determined vector lengths such as SVE and RVV, although support for that doesn't exist in rustc yet.
- More vectorized and parallelized operations. In order to get the maximum amount of performance out of a CPU, you need to use all available kinds of parallelism (superscalar execution, SIMD, multi-threading) and minimize bottlenecks (loading data from memory, branch mispredictions). Various operators were revised to use more of this parallelism.
- More efficient data copying and filling. A lot of operations in ML models are about moving data around (eg. gathering embeddings, concatenating tensors, transposing weights). It is important to minimize the amount of data transfer that happens and to optimize that which has to be done. For example, the most common uses of Gather can be expressed as memcpys rather than element-by-element copies (see the sketch after this list). When using Pad to zero-fill part of a tensor, there are specific CPU instructions that can zero entire cache lines at a time, which will be used automatically if the compiler knows at compile time that the fill value is zero.
- Improvements to matrix multiplication on Arm. The Arm kernels did not previously take advantage of fused instructions that combine broadcasting from a lane with a multiply-accumulate or dot product. Now they do, plus there have been various other optimizations.
- A symbolic shape inference system which can infer symbolic expressions describing the size of dynamic dimensions of values in a model. This enables applying more complex graph optimizations by making it possible to verify that preconditions are met. It is currently used to enable various attention-related fusions for operations like Grouped-Query Attention. This is currently opt-in because it is a work in progress. One other interesting possibility from symbolic shape inference is to infer constraints on dynamic input dimensions (eg. there may be a maximum size for a sequence_length dimension). It would be useful to surface these via an API.
- More graph optimizations and better infrastructure for writing graph optimization patterns. Graph optimizations are changes made to the model graph when it is loaded to eliminate redundant operations and fuse (combine) multiple operations together to reduce the amount of data movement in memory. They mostly work via a pattern-match-and-replace system. There are also passes similar to what you'd find in an optimizing compiler, such as constant propagation.
- Improved profiling infrastructure. RTen has a handy CLI tool which can be used to run a model with randomly generated inputs for quick testing. It also has built-in profiling that can be conveniently turned on via an environment variable.
- More intelligent selection of the number of CPU cores. On systems with hyper-threading (SMT) or heterogeneous cores (a mix of performance and efficiency cores) it is often sub-optimal to naively create as many threads as there are virtual cores (as returned by std::thread::available_parallelism), which is what Rayon, for example, will do by default.
- Support for AVX-512 out of the box. The necessary intrinsics and target features were stabilized in rustc in the middle of the year, so this CPU feature is now used by default.
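To illustrate the Gather point above, here is a simplified sketch (not RTen's actual implementation): when the gathered indices select whole contiguous rows, as in an embedding lookup, each row can be copied with a single slice copy, which the standard library lowers to a memcpy, instead of element by element.

```rust
/// Gather whole rows from a row-major [rows, row_len] matrix. Because each row
/// is contiguous, copy_from_slice performs one memcpy per row rather than a
/// per-element copy.
fn gather_rows(data: &[f32], row_len: usize, indices: &[usize]) -> Vec<f32> {
    let mut out = vec![0.0f32; indices.len() * row_len];
    for (dst, &idx) in out.chunks_exact_mut(row_len).zip(indices) {
        dst.copy_from_slice(&data[idx * row_len..(idx + 1) * row_len]);
    }
    out
}

fn main() {
    // A 3x2 matrix; gather rows 2 and 0 (eg. embedding lookup by token id).
    let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    assert_eq!(gather_rows(&data, 2, &[2, 0]), vec![5.0, 6.0, 1.0, 2.0]);
}
```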
Working with a growing Rust project
The RTen repository now contains around 90K lines of Rust code, so I've gained some perspective on techniques and tools for managing a medium-large project with a very small team.
Build times
As the volume of code has grown, I have done some refactoring to mitigate the impact on build times and keep iteration velocity high. This includes:
- Splitting out code from the main crate into sub-crates which can be compiled in parallel. For this, cargo build --timings is used to track build times for crates and the order in which they are built.
- Refactoring code to use concrete types or trait objects rather than generics in contexts where this doesn't hurt runtime performance (a small example follows this list). For this I used the cargo llvm-lines tool to help analyze the amount of code generated within each crate.
- Auditing dependencies to ensure unused features are disabled where possible.
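As a schematic example of the generics refactor mentioned above (not actual RTen code): a generic function is monomorphized for every concrete type it is used with, while a version taking a trait object is compiled only once, trading a little dynamic dispatch for less generated code.

```rust
use std::io::Read;

// Generic version: the compiler generates a separate copy of this function
// for every `R` it is called with, adding to compile time and code size.
fn checksum_generic<R: Read>(mut reader: R) -> std::io::Result<u32> {
    let mut buf = Vec::new();
    reader.read_to_end(&mut buf)?;
    Ok(buf.iter().map(|&b| b as u32).sum())
}

// Trait-object version: compiled once; callers pay a small dynamic-dispatch
// cost, which is fine when the function is not performance-critical.
fn checksum_dyn(reader: &mut dyn Read) -> std::io::Result<u32> {
    let mut buf = Vec::new();
    reader.read_to_end(&mut buf)?;
    Ok(buf.iter().map(|&b| b as u32).sum())
}

fn main() -> std::io::Result<()> {
    let data: &[u8] = b"hello";
    println!("{}", checksum_generic(data)?);
    println!("{}", checksum_dyn(&mut &data[..])?);
    Ok(())
}
```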
There is still more that can be done in this space. For example, quite a few operations in ONNX models only copy elements around (eg. Transpose, Gather, Concat, Tile) and don't care about the meaning of the bits. Only one implementation is needed per element size (1, 2 or 4 bytes), but the current code uses separate implementations for i8 and u8, and for i32 and f32.
Use of AI tooling
The vast majority of the code is written by hand. Going forwards I expect most code to still be written this way because I like to have a good understanding of the details, and the act of writing code prompts useful thinking about the wider problem. There are however times when repetitive changes need to be made across the codebase, and AI tooling can be very useful for this. I make AI-driven edits using Claude Code, which has gotten much better at Rust over the past year. A common technique is to do the first instance of a refactoring, or create the first implementation of a new trait, and then ask Claude to read my initial manual commit and follow the example in other cases. I break its work into easily reviewable chunks of no more than a few hundred lines at a time. This follows guidelines my team uses at work to make the PR review process easier for other humans.
Future roadmap
GPU and NPU support
One of the main goals for next year is to introduce initial support for accelerators (ie. GPUs and NPUs). I plan to start with Metal because that is an area where RTen can offer a valuable alternative to ONNX Runtime's limited CoreML support. For CPU inference I gained a lot of mileage from using a portable SIMD abstraction, as this enables me to quickly and safely write vectorized kernels for all supported architectures (AVX2, AVX-512, Arm, WebAssembly, auto-vectorization-friendly generic).
For targeting GPUs and NPUs the range of options is much more open and less mature. Choices range from vendor-specific graph compilers (eg. Metal Performance Shaders Graph), to cross-platform graphics APIs (WebGPU, Vulkan), to the nascent support for writing GPU kernels directly in Rust. I would prefer the last of these, in order to minimize dependencies and benefit from having this code be understandable by standard Rust tooling. However a mixture of approaches will probably be required. For NPUs specifically there isn't a standard abstraction yet, and on some platforms (Apple) there is no supported low-level interface for targeting the NPU directly; the only option is to target a whole-graph compiler (MPS Graph).
Outside GPUs and NPUs, CPUs have also gained matrix multiplication co-processors such as AMX (on Intel Xeon) and SME (on Arm). Support for these would be valuable as well, although the hardware has limited distribution at present.
WebNN backend and implementation
There is work happening on a standard web API for constructing model graphs called WebNN. RTen could both provide an implementation of this API and use it as a backend when compiling to WASM.
Contributing
RTen follows a largely Demo-Driven Development approach: I find interesting models, try to make them run, then optimize them. For anyone interested in contributing, I would recommend taking the same approach. Start by finding an ONNX model you want to run and try to build a demo. If something doesn't work or is slow, file an issue.