MIT boffins cram ML training into microcontroller memory

Neat algorithmic trick squeezes training into 256KB of RAM, barely enough for inference, let alone teaching


Researchers claim to have developed techniques that enable the training of a machine learning model using less than a quarter of a megabyte of memory, making it suitable for microcontrollers and other edge hardware with limited resources.

The researchers at MIT and the MIT-IBM Watson AI Lab say they have found "algorithmic solutions" that make the training process more efficient and less memory-intensive.

The techniques can be used to train a machine learning model on a microcontroller in a matter of minutes, it is claimed, and the researchers have written up the work in a paper titled "On-Device Training Under 256KB Memory" [PDF].

According to the authors, on-device training of a model will enable it to adapt in response to new data collected by the device's sensors. By training and adapting locally at the edge, the model can learn to continuously improve its predictions for the life of the application.

However, the problem with implementing such a solution is that edge devices are often constrained in their memory size and processing power. At one end of the scale, tiny IoT devices based on microcontrollers may have as little as 256KB of SRAM, the paper states, which is barely enough for the inference work of some deep learning models, let alone the training.

Meanwhile, deep learning training systems such as PyTorch and TensorFlow typically run on clusters of servers with gigabytes of memory at their disposal, and while edge deep learning inference frameworks exist, some lack support for the back-propagation needed to adjust a model's weights during training.

In contrast, the intelligent algorithms and framework the researchers have developed are able to reduce the amount of computation required to train a model, it is claimed.

This is no mean feat: a typical deep learning model undergoes hundreds of updates as it learns, and with potentially millions of weights and activations involved, training a model requires much more memory than running a pre-trained one.
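
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The parameter and activation counts are invented for illustration, not taken from the paper, but they show why training blows past a budget that inference can barely fit.

```python
# Hypothetical model sizes, for illustration only.
BYTES_FP32 = 4
num_weights = 500_000        # model parameters
peak_activation = 200_000    # largest single activation tensor

# Inference: weights plus the activations of the layer being computed.
inference_kb = (num_weights + peak_activation) * BYTES_FP32 / 1024

# Training: weights, gradients, optimizer state (e.g. SGD momentum),
# plus ALL intermediate activations saved for back-propagation.
saved_activations = 1_500_000  # summed over layers, not just the peak
training_kb = (num_weights * 3 + saved_activations) * BYTES_FP32 / 1024

print(f"inference ~{inference_kb:,.0f} KB, training ~{training_kb:,.0f} KB")
# inference ~2,734 KB, training ~11,719 KB: training costs several times
# more, and both dwarf a 256KB SRAM budget without further tricks.
```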

(That said, if there are similar projects out there doing non-trivial training on microcontroller devices, let us know.)

One of the MIT solutions developed to make the training process more efficient is sparse update, which skips the gradient computation for less important layers and sub-tensors, using an algorithm to identify only the most important weights to update during each round of training.

The algorithm works by freezing weights one at a time until it sees accuracy dip to a set threshold. The remaining weights are then updated, while the activations corresponding to the frozen weights do not need to be stored.

"Updating the whole model is very expensive because there are a lot of activations, so people tend to update only the last layer, but as you can imagine, this hurts the accuracy," explained MIT Associate Professor Song Han, one of the paper's authors. "For our method, we selectively update those important weights and make sure the accuracy is fully preserved," he added.

The second solution is to reduce the size of the weights using quantization, typically from 32 bits to just 8 bits, to cut the amount of memory needed for both training and inference. Quantization-aware scaling (QAS) is then used to adjust the ratio between weight and gradient, to avoid any drop in accuracy that may result from training with the quantized values.
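
In code, the two halves of that step might look like the sketch below. The symmetric per-tensor quantization and the division of the gradient by the square of the scale are our simplified reading of the paper's scheme, so treat the details as an approximation rather than the authors' exact method.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated
    by scale * w_q, where w_q holds values in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

def qas_gradient(grad_wq, scale):
    """Quantization-aware scaling: with w ~ scale * w_q, the gradient
    seen by w_q is the full-precision gradient times scale, so the
    weight-to-gradient ratio is off by a factor of scale**2.
    Dividing by scale**2 restores the full-precision ratio."""
    return grad_wq / scale**2
```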

The system changes the order of steps in the training process so more work is completed in the compilation stage, before the model is deployed on the edge device, according to Han.

"We push a lot of the computation, such as auto-differentiation and graph optimization, to compile time. We also aggressively prune the redundant operators to support sparse updates. Once at runtime, we have much less workload to do on the device," he said.

The final part of the solution is a lightweight training system, Tiny Training Engine (TTE), that implements these algorithms on a simple microcontroller.

According to the paper, the framework is the first machine learning solution to enable on-device training of convolutional neural networks with a memory budget of less than 256KB.

The authors say the training system has been demonstrated on a commercially available microcontroller, an STMicroelectronics STM32F746 based on an Arm Cortex-M7 core with 320KB of SRAM.

This setup was used to train a computer vision model to detect people in images, a task it completed successfully after just 10 minutes of training, the research states.

With this success under their belt, the researchers now say they want to apply what they have learned to other machine learning models and types of data, such as language models and time-series data.

They believe these techniques could be used to shrink the size of larger models without sacrificing accuracy, which could help reduce the carbon footprint of training large-scale machine-learning models in future. ®
