Tunadorable’s Substack
Weekly AI Paper Summaries
Last Week's New AI Papers - Aug 6, 2024

We're a couple of days late because I was sick.

Welcome to Tunadorable's weekly AI newsletter, where we summarize his favorite articles of the week that he plans to read.

This article was written by gpt-3.5-turbo-16k on 2024-05-06.

# KAN - Kolmogorov-Arnold Networks

This paper introduces Kolmogorov-Arnold Networks (KANs) as an alternative to Multi-Layer Perceptrons (MLPs) in deep learning models. While MLPs have fixed activation functions on nodes, KANs have learnable activation functions on edges. KANs replace weight parameters with spline functions, allowing for greater accuracy and interpretability. The authors demonstrate that KANs can achieve comparable or better accuracy than MLPs with much smaller models in data fitting and PDE solving tasks. Theoretical and empirical analysis also shows that KANs have faster neural scaling laws than MLPs. KANs are shown to be interpretable and can be used for scientific discoveries, as demonstrated in examples from mathematics and physics. The paper concludes that KANs are promising alternatives to MLPs, with potential for improving deep learning models.
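To make the learnable-edge-activation idea concrete, here is a minimal sketch of a KAN-style layer. It is not the authors' implementation: it swaps the paper's B-spline parameterization for a fixed Gaussian basis for brevity, and the class and parameter names (`KANLayer`, `num_basis`, `grid_range`) are purely illustrative.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """Toy KAN-style layer: each input-output edge gets its own learnable
    1-D function, parameterized as a weighted sum of fixed basis functions.
    (The paper uses B-splines; Gaussian bumps are used here for brevity.)"""
    def __init__(self, in_dim, out_dim, num_basis=8, grid_range=(-2.0, 2.0)):
        super().__init__()
        # Fixed basis centers spread over the expected input range.
        self.register_buffer("centers", torch.linspace(*grid_range, num_basis))
        # One coefficient vector per edge: shape (out_dim, in_dim, num_basis).
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, num_basis) * 0.1)

    def forward(self, x):  # x: (batch, in_dim)
        # Evaluate every basis function at every input coordinate.
        phi = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)  # (batch, in_dim, num_basis)
        # Apply each edge's learned function, then sum over incoming edges.
        return torch.einsum("bik,oik->bo", phi, self.coef)       # (batch, out_dim)

# Stacking such layers gives an MLP-shaped network with no fixed activations:
# all the nonlinearity lives in the learned per-edge functions.
model = nn.Sequential(KANLayer(4, 16), KANLayer(16, 1))
y = model(torch.randn(32, 4))
```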

# Time Machine GPT

This paper introduces a series of language models called Time Machine GPT (TiMaGPT) that are pre-trained on historical data up until specific cutoff dates. These models are designed to be nonprognosticative, meaning they do not have knowledge of future events or linguistic changes. The authors highlight the importance of temporal separation in language models and the potential for information leakage from future states to past states. They provide access to the models and training datasets for researchers interested in analyzing language evolution and evaluating the performance of temporally dynamic models. The authors compare their models with conventionally temporally adapted models and show that the traditional approach can lead to information leakage and unrealistic knowledge of future events. They also demonstrate the performance of their models on static benchmarks, showing consistent performance over time. The models are particularly useful for time-series forecasting tasks where strict separation of temporal data is important. The limitations of the study include the use of small GPT-2 models and the need for larger models trained on expanded datasets.
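The data-curation principle behind this is simple to state in code. Below is a hypothetical sketch (the field names, documents, and dates are made up) of the kind of cutoff filtering that keeps a pretraining corpus free of post-cutoff text:

```python
from datetime import date

# Hypothetical corpus: each record carries its publication date.
corpus = [
    {"text": "example document A", "published": date(2015, 3, 1)},
    {"text": "example document B", "published": date(2021, 7, 9)},
]

def temporal_slice(docs, cutoff):
    """Keep only documents published strictly before the cutoff date,
    so a model trained on the slice cannot 'see' later events."""
    return [d for d in docs if d["published"] < cutoff]

train_2016 = temporal_slice(corpus, date(2016, 1, 1))  # only pre-2016 text
```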

# Exponentially Faster Language Modelling

In this paper, the authors introduce UltraFastBERT, a modified version of the BERT language model that uses fast feedforward networks (FFFs) instead of traditional feedforward networks in its intermediate layers. They demonstrate that UltraFastBERT achieves comparable performance to traditional BERT models while using only a fraction of its neurons during inference. Specifically, UltraFastBERT-1x11, the deepest model with the highest acceleration potential, uses only 0.3% of its neurons for inference and achieves a 78x speedup over the corresponding feedforward layer on CPUs. The authors provide a high-level CPU implementation of conditional matrix multiplication (CMM), the operation underlying fast feedforward network inference, which achieves significant acceleration. They also discuss the compatibility of FFFs with existing hardware and provide comparisons of CPU and GPU implementations. The results suggest that conditional neural execution has considerable potential for accelerating language modeling tasks.
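As a rough illustration of why so few neurons are needed, here is a simplified NumPy sketch of the conditional-execution idea, not UltraFastBERT's CMM kernel: neurons are arranged in a binary tree and each input follows a single root-to-leaf path, so a layer with 2^11 - 1 = 2047 tree neurons evaluates only 11 of them per token. The weight shapes and the per-node output combination are assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, depth = 64, 11                 # depth 11 -> 2**11 - 1 = 2047 tree neurons
n_nodes = 2 ** depth - 1
W_in  = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)
W_out = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def conditional_ffn(x):
    """Walk the neuron tree: evaluate one neuron per level and branch on the
    sign of its pre-activation, so only `depth` of the n_nodes neurons run."""
    out, node = np.zeros(d_model), 0
    for _ in range(depth):
        pre = W_in[node] @ x                      # single dot product at this level
        out += gelu(pre) * W_out[node]            # that neuron's contribution
        node = 2 * node + (1 if pre > 0 else 2)   # descend to left/right child
    return out

y = conditional_ffn(rng.standard_normal(d_model))
```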

# Better & Faster Large Language Models via Multi-token Prediction

This research proposes a training method called multi-token prediction for large language models. Instead of predicting just the next token, the models are trained to predict multiple future tokens at once. This approach improves the efficiency of the models and enhances their downstream capabilities without increasing training time. The authors conducted experiments on code and natural language tasks, showing that multi-token prediction consistently outperforms baseline models. The method is particularly effective for larger models and remains beneficial even with multiple training epochs. Additionally, models trained with multi-token prediction are faster at inference, making them more efficient. The findings suggest that multi-token prediction is a promising technique for training more powerful and faster language models.
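Below is a minimal sketch of what such a training objective can look like, assuming a shared trunk with one output head per future offset. The trunk here is a toy stand-in rather than a real transformer, and all sizes and names are illustrative rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, n_future = 1000, 128, 4   # predict the next 4 tokens per position

trunk = nn.Sequential(nn.Embedding(vocab, d_model), nn.Linear(d_model, d_model), nn.GELU())
heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_future))

def multi_token_loss(tokens):
    """tokens: (batch, seq). Each position's hidden state feeds n_future heads;
    head i is trained to predict the token i+1 steps ahead."""
    h = trunk(tokens)                                   # (batch, seq, d_model)
    loss = 0.0
    for i, head in enumerate(heads):
        logits  = head(h[:, : -(i + 1)])                # positions that have a target i+1 ahead
        targets = tokens[:, i + 1 :]
        loss = loss + F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
    return loss / n_future

loss = multi_token_loss(torch.randint(0, vocab, (2, 32)))
loss.backward()
```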

# Fast Feedforward Networks

This research introduces the Fast Feedforward (FFF) architecture, which is a log-time alternative to feedforward networks. FFF divides the input space into regions and performs learning on these regions using small leaf feedforward networks. This allows FFFs to be much faster than traditional feedforward networks. The study shows that FFFs can achieve comparable predictive performance to feedforward networks while being up to 220 times faster. FFFs also outperform mixture-of-experts networks in terms of both speed and training performance. Additionally, FFFs can be used as parts of larger architectures, such as vision transformers, while preserving predictive performance. The research provides detailed algorithms for training and inference in FFFs and conducts experiments to evaluate their performance on image classification tasks. The results demonstrate the impact of different parameters, such as leaf size, depth, and training width, on the predictive performance and speed of FFFs. Overall, the research highlights the potential of FFFs as a faster and more efficient alternative to traditional feedforward networks.
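For intuition, here is a toy PyTorch sketch of the inference path of an FFF-style layer: a small tree of routing neurons sends each input to exactly one leaf feedforward network, so only that leaf is evaluated. Training in the paper uses a differentiable (soft) routing scheme that this hard-routing sketch omits, and all sizes and names here are illustrative.

```python
import torch
import torch.nn as nn

class FastFeedforward(nn.Module):
    """Toy FFF: a depth-d binary tree of routing neurons partitions the input
    space into 2**d regions; each region owns a small leaf MLP. At inference,
    an input is routed down the tree and only its one leaf MLP runs."""
    def __init__(self, d_in, d_out, depth=3, leaf_width=16):
        super().__init__()
        self.depth = depth
        self.routers = nn.Parameter(torch.randn(2 ** depth - 1, d_in) * 0.1)
        self.leaves = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, leaf_width), nn.GELU(), nn.Linear(leaf_width, d_out))
            for _ in range(2 ** depth)
        )

    def forward(self, x):            # x: (d_in,) -- single example for clarity
        node = 0
        for _ in range(self.depth):  # hard routing: follow the sign of each router
            go_right = (self.routers[node] @ x) > 0
            node = 2 * node + (2 if go_right else 1)
        leaf = node - (2 ** self.depth - 1)   # convert tree index to leaf index
        return self.leaves[leaf](x)

fff = FastFeedforward(d_in=32, d_out=32)
y = fff(torch.randn(32))
```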

# Understanding LLMs Requires More Than Statistical Generalization

This paper argues that understanding Large Language Models (LLMs) requires more than just statistical generalization. While LLMs achieve low training and test loss, there are additional properties that are not captured by statistical generalization measures. The authors highlight the non-identifiability of autoregressive (AR) probabilistic models, where models with zero or near-zero KL divergence can exhibit different behaviors. They provide three case studies illustrating the practical implications of non-identifiability: non-identifiability of zero-shot rule extrapolation, approximate non-identifiability of in-context learning, and non-identifiability of fine-tunability. The authors propose studying LLMs in the saturation regime and suggest research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.
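A minimal way to state the non-identifiability point, in notation assumed here rather than taken from the paper: let $p^{*}$ be the data distribution over token sequences and $p_{\theta_1}, p_{\theta_2}$ two autoregressive models. Then

$$
\mathrm{KL}\!\left(p^{*}\,\middle\|\,p_{\theta_1}\right)=\mathrm{KL}\!\left(p^{*}\,\middle\|\,p_{\theta_2}\right)=0
\quad\not\Rightarrow\quad
p_{\theta_1}(\,\cdot \mid x_{<t}) = p_{\theta_2}(\,\cdot \mid x_{<t}) \ \text{ for prefixes } x_{<t} \text{ with } p^{*}(x_{<t})=0,
$$

since matching the data distribution only pins down the conditionals on prompts that actually occur under $p^{*}$. Two models that fit the training distribution equally well can therefore behave arbitrarily differently on out-of-support prompts, which is exactly where rule extrapolation and novel in-context tasks live.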

# A Comprehensive Survey of Dynamic Graph Neural Networks - Models, Frameworks, Benchmarks, Experiments and Challenges

This paper presents a comprehensive survey of dynamic graph neural networks (DGNNs). The authors analyze 81 recent DGNN models using a novel taxonomy and provide an overview of 12 DGNN frameworks. They also introduce evaluation benchmarks for DGNNs and conduct experimental comparisons of nine representative DGNN models and three frameworks. The results show variations in training accuracy, efficiency, and memory usage. The paper identifies key challenges in DGNN research and offers suggestions for future improvements. Overall, this survey provides a valuable resource for understanding and comparing different DGNN models and frameworks.

# An exactly solvable model for emergence and scaling laws

In this paper, the authors investigate the phenomenon of emergence and scaling laws in deep learning models. They propose a framework where each new skill or ability is represented as a basis function. They solve a simple multilinear model in this skill basis and derive analytic expressions for the emergence of new skills and the scaling laws of the loss with training time, data size, model size, and optimal compute.

The authors compare their calculations to simulations of a two-layer neural network trained on a multitask sparse parity problem, where the tasks in the dataset follow a power-law distribution. They find that their simple model captures the sigmoidal emergence of multiple new skills as training time, data size, or model size increases in the neural network.

The authors also provide an intuitive derivation of the scaling laws, showing how stage-like training in the multilinear model leads to power-law scaling of the loss with respect to various factors.
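To see how stage-like skill acquisition under a power-law skill distribution can produce a power-law loss curve, here is a toy numerical illustration. It is not the paper's model: the rule that skill k is acquired after roughly 1/p_k training steps is an assumption made purely for illustration.

```python
import numpy as np

alpha, n_skills = 2.0, 10_000
k = np.arange(1, n_skills + 1)
p = k ** (-alpha)
p /= p.sum()                      # skill k appears with power-law frequency p[k]

def loss_at(t):
    """Toy stage-like learning: skill k counts as 'acquired' once t exceeds
    roughly 1/p_k steps; the loss is the frequency mass of unlearned skills."""
    learned = t * p > 1.0
    return p[~learned].sum()

for t in [10, 100, 1_000, 10_000, 100_000]:
    print(t, loss_at(t))          # loss falls roughly as a power law in t
```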

One potential critique is that the model is simplified and may not fully capture the complexity of real-world deep learning models. Additionally, the study focuses on a specific problem and dataset, so the generalizability of the findings to other domains may be limited.

The implications of this research are that the emergence of new skills in deep learning models can be predicted and understood using a simple multilinear model. This can have implications for designing more efficient and scalable deep learning models in the future.

# Octopus v4 - Graph of language models

This paper introduces Octopus v4, a novel approach that integrates multiple open-source language models into a graph structure. The Octopus v4 model leverages functional tokens to intelligently direct user queries to the most appropriate vertical model and reformat the query for optimal performance. The graph structure represents the relationships between different models, their capabilities, and their optimal use cases. By treating each language model as a node in the graph and establishing edges based on their compatibility, complementary features, or task-specific performance, the Octopus v4 model coordinates the collaboration between multiple models. This framework optimizes the inference process by selecting the most suitable specialized models based on the user's query, activating only two models that each have fewer than 10B parameters for one-step inference. The use of functional tokens allows the Octopus v4 model to select functions accurately and generate responses efficiently, while the graph structure enables fast execution and coordination among the participating language models.
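The coordination pattern is easy to picture with a small sketch. Everything below is hypothetical (the graph, the model names, and the `route` stand-in are made up) and only illustrates the two-model, token-directed routing idea rather than Octopus v4's implementation.

```python
# A small router model emits a "functional token" naming a specialist; a graph
# of nodes maps that token to the worker model that actually answers.

GRAPH = {
    "<law>":  {"model": "some-legal-7b", "neighbors": ["<finance>"]},
    "<math>": {"model": "some-math-7b",  "neighbors": ["<code>"]},
    "<code>": {"model": "some-coder-7b", "neighbors": ["<math>"]},
}

def route(query: str) -> str:
    """Stand-in for the router model: return a functional token for the query."""
    return "<math>" if any(c.isdigit() for c in query) else "<code>"

def answer(query: str) -> str:
    token = route(query)                      # model 1: pick the specialist
    worker = GRAPH[token]["model"]            # model 2: the specialist itself
    return f"[{worker}] would answer: {query}"

print(answer("What is 17 * 24?"))
```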

# DynaMo - Accelerating Language Model Inference with Dynamic Multi-Token Sampling

This paper introduces DynaMo, a suite of language models that accelerate text generation by predicting multiple tokens at a time. The authors propose a modified training objective and architecture for multi-token prediction. They also propose novel methods to enhance the estimated joint probability distribution and improve text generation quality. The models are evaluated on NLU benchmarks, multi-token perplexity, and open-ended text generation tasks. The results show that DynaMo models achieve significant speed-ups while maintaining the quality of generated text. Potential critiques include the reliance on independent token predictions and the need for further evaluation on different datasets. The implications of this work include the potential for faster and more efficient text generation in resource-constrained environments.
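As a rough sketch of the sampling side of this idea, the snippet below accepts extra proposed tokens only while an estimated joint probability (under an independence assumption) stays above a threshold. The `propose` function is a random stand-in for the model, and none of this is DynaMo's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def propose(prefix, n_future=3, vocab=100):
    """Stand-in for a multi-token model: for the current prefix, return a
    (token, probability) proposal for each of the next n_future positions."""
    return [(int(rng.integers(vocab)), float(p)) for p in rng.random(n_future)]

def dynamic_sample(prefix, steps=5, threshold=0.4):
    out = list(prefix)
    for _ in range(steps):
        proposals = propose(out)
        # Always accept the first token; keep accepting later ones only while
        # the running joint probability (independence assumption) stays high.
        joint = 1.0
        for i, (tok, p) in enumerate(proposals):
            joint *= p
            if i > 0 and joint < threshold:
                break
            out.append(tok)
    return out

print(dynamic_sample([1, 2, 3]))
```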

Thanks for reading/listening, that's all for this week.

Please consider checking out Tunadorable's YouTube channel, where he provides commentary on the papers above.

https://youtube.com/@Tunadorable

Here is the most up-to-date version of the python scripts I currently use to create this newsletter:

https://github.com/evintunador/arxiv-summaries-workflow
