[AINews] Mixture of Depths: Dynamically allocating compute in transformer-based language models
This is AI News! an MVP of a service that goes thru all AI discords/Twitters/reddits and summarizes what people are talking about, so that you can keep up without the fatigue. Signing up here opts you in to the real thing when we launch it 🔜
AI News for 4/4/2024-4/5/2024. We checked 5 subreddits and 364 Twitters and 26 Discords (386 channels, and 5819 messages) for you. Estimated reading time saved (at 200wpm): 631 minutes.
Top news of the day is DeepMind's Mixture-of-Depths (MoD) paper, which describes a technique that, given a fixed compute budget, dynamically allocates FLOPs to tokens at different layers rather than spending the same amount everywhere. The motivation is well written:
Not all problems require the same amount of time or effort to solve. Analogously, in language modeling not all tokens and sequences require the same time or effort to accurately make a prediction. And yet, transformer models expend the same amount of compute per token in a forward pass. Ideally, transformers would use smaller total compute budgets by not spending compute unnecessarily.
The method uses top-k routing to select which tokens each layer processes, which keeps the compute budget fixed and known in advance. You can think of it as a "depth" sparsity counterpart to the "width" sparsity that MoEs use to scale models (a minimal sketch follows below):
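For intuition, here is a minimal, self-contained sketch of top-k token routing in a single transformer block, assuming PyTorch. The class name `MoDBlock`, the `capacity_ratio` parameter, the sigmoid gate, and the toy dimensions are illustrative choices for this sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn


class MoDBlock(nn.Module):
    """One transformer block that runs attention/MLP only on the top-k tokens
    picked by a learned router; all other tokens pass through on the residual stream."""

    def __init__(self, d_model: int, n_heads: int, capacity_ratio: float = 0.5):
        super().__init__()
        self.capacity_ratio = capacity_ratio          # fraction of tokens this block processes
        self.router = nn.Linear(d_model, 1)           # scalar importance score per token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity_ratio * T))      # fixed, known compute budget per sequence

        scores = self.router(x).squeeze(-1)                           # (B, T) router logits
        idx = torch.topk(scores, k, dim=-1).indices.sort(-1).values   # keep temporal order
        idx_exp = idx.unsqueeze(-1).expand(B, k, D)

        # Gather only the selected tokens; the rest skip this block entirely.
        sel = torch.gather(x, 1, idx_exp)

        # Causal mask over the k selected tokens (True = blocked).
        causal = torch.triu(torch.ones(k, k, dtype=torch.bool, device=x.device), 1)

        h = self.norm1(sel)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        delta = attn_out + self.mlp(self.norm2(sel + attn_out))

        # Gate the block's contribution by the router score so routing stays differentiable
        # (the sigmoid here is an illustrative choice, not the paper's exact weighting).
        gate = torch.sigmoid(torch.gather(scores, 1, idx)).unsqueeze(-1)

        out = x.clone()
        out.scatter_(1, idx_exp, sel + gate * delta)
        return out


# Toy usage: with capacity_ratio=0.5, only 8 of the 16 tokens go through attention/MLP.
block = MoDBlock(d_model=64, n_heads=4)
tokens = torch.randn(2, 16, 64)   # (batch, sequence, d_model)
out = block(tokens)               # same shape as the input
```

One caveat the paper itself raises: top-k over a whole sequence is non-causal, so autoregressive sampling can't use it directly; the authors discuss workarounds such as training a small auxiliary predictor to decide routing token by token.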