Tunadorable’s Substack
Weekly AI Paper Summaries
This Week's New AI Papers - June 9, 2024

Welcome to Tunadorable's weekly AI newsletter, where we summarize his favorite articles of the week, the ones he plans to read.

This article was written by gpt-3.5-turbo-16k on 2024-06-09.


# What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions

This research paper explores the question of what information is captured in the embeddings of autoregressive language models (LMs). The authors connect the autoregressive prediction objective to the concept of predictive sufficient statistics to identify three cases where the optimal content of embeddings can be identified: exchangeable data, latent state models, and discrete hypothesis spaces. In these cases, the embeddings should capture the relevant predictive sufficient statistics, such as the sufficient statistics of the data, the posterior distribution over states, or the posterior distribution over hypotheses. The authors conduct empirical probing studies on transformers to validate these findings and show that the embeddings indeed encode the expected information. The results suggest that LMs represent latent structure that captures the posterior distribution over the generative process underlying the text. The findings have implications for the design and evaluation of LMs and provide insights into their behavior. Potential critiques may include the limited scope of the study and the assumption of perfect autoregressive modeling.
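
For a concrete picture of the kind of probing study described here, a minimal sketch is shown below: fit a linear probe from frozen model embeddings to the ground-truth posterior over latent states and check how much of it is linearly recoverable. The arrays are random placeholders, not the paper's data or code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical probing setup: `embeddings` stands in for final-token hidden states
# from a frozen autoregressive model, and `posteriors` stands in for the true
# posterior distributions over latent states under the known generative process.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))           # placeholder for model activations
posteriors = rng.dirichlet(np.ones(8), size=1000)   # placeholder probe targets

# Linear probe: if embeddings encode the posterior, a linear map should recover it.
probe = LinearRegression().fit(embeddings[:800], posteriors[:800])
r2 = probe.score(embeddings[800:], posteriors[800:])
print(f"probe R^2 on held-out examples: {r2:.3f}")
```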

# Scalable MatMul-free Language Modeling

This paper introduces a MatMul-free language model (LM) that eliminates matrix multiplication operations, which are computationally expensive, from large language models. The authors propose using ternary weights in dense layers and a modified Gated Recurrent Unit (GRU) for self-attention-like functions. They demonstrate that their MatMul-free models achieve performance comparable to state-of-the-art Transformers while reducing memory usage and computational cost. The authors also provide a hardware-efficient implementation for GPUs and an FPGA accelerator to further optimize memory consumption and energy efficiency. The results show that their approach significantly reduces memory usage and improves computational efficiency compared to traditional models. The implications of this work are that it paves the way for more efficient and scalable language models, bringing us closer to brain-like efficiency in natural language processing tasks.
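
To make the "MatMul-free" idea concrete, here is a minimal sketch of a ternary-weight linear layer, where every weight is in {-1, 0, +1} so the usual matrix multiplication reduces to additions, subtractions, and skips. This illustrates the general technique, not the paper's implementation, which also involves scaling factors and a hardware-aware formulation.

```python
import torch

def ternarize(w: torch.Tensor, threshold: float = 0.05) -> torch.Tensor:
    """Quantize weights to {-1, 0, +1}; values near zero are pruned to 0."""
    return torch.where(w.abs() < threshold, torch.zeros_like(w), torch.sign(w))

def ternary_linear(x: torch.Tensor, w_ternary: torch.Tensor) -> torch.Tensor:
    # With ternary weights, x @ w.T only ever adds, subtracts, or skips
    # activations, so no true multiplications are required in hardware.
    return x @ w_ternary.t()

x = torch.randn(4, 16)
w = torch.randn(32, 16)
y = ternary_linear(x, ternarize(w))   # shape (4, 32)
```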

# Open-Endedness is Essential for Artificial Superhuman Intelligence

This paper argues that open-endedness is an essential property of artificial superhuman intelligence (ASI). The authors provide a formal definition of open-endedness, which states that a system is open-ended if it continuously generates artifacts that are both novel and learnable to an observer. They propose that open-endedness can be achieved by combining open-ended algorithms with foundation models, such as large language models trained on vast amounts of data. The authors suggest that research into open-ended systems is crucial for the safe and beneficial development of increasingly general and autonomous AI.

# Artificial Generational Intelligence - Cultural Accumulation in Reinforcement Learning

This paper explores the concept of cultural accumulation in reinforcement learning (RL) agents. Cultural accumulation refers to the ability of agents to learn and accumulate knowledge and skills over multiple generations. The authors propose two models of cultural accumulation: in-context accumulation, where agents learn from the behavior of previous generations within a single episode, and in-weights accumulation, where agents learn from the weights of previous generations across multiple training runs.

The authors conduct experiments on three different environments: Memory Sequence, Goal Sequence, and Travelling Salesperson. They compare the performance of agents trained with cultural accumulation to agents trained for a single lifetime with the same total experience. They find that agents with cultural accumulation consistently outperform those without it.

In the in-context accumulation model, agents learn to learn from the behavior of other agents within a single episode. They use an oracle agent during training to provide guidance, and gradually reduce the dependence on the oracle during evaluation. The best performing agent from each generation is used as a guide for the next generation.

In the in-weights accumulation model, agents learn from the weights of previous generations across multiple training runs. The weights of the best performing agent from each generation are used as a starting point for the next generation.
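
Here is a schematic sketch of how an in-weights accumulation loop like this might look, assuming PyTorch-style agents and placeholder training and evaluation functions; it reflects our reading of the setup, not the paper's code.

```python
import copy
import torch.nn as nn

def in_weights_accumulation(make_agent, train_one_lifetime, evaluate,
                            n_generations=5, population_size=4):
    """Schematic in-weights accumulation: each generation starts from the
    best-performing weights of the previous one (all callables are placeholders)."""
    best_weights = None
    for _ in range(n_generations):
        population = []
        for _ in range(population_size):
            agent = make_agent()
            if best_weights is not None:
                agent.load_state_dict(copy.deepcopy(best_weights))  # inherit prior knowledge
            train_one_lifetime(agent)            # one RL "lifetime" of training (placeholder)
            population.append(agent)
        best = max(population, key=evaluate)     # select this generation's best agent
        best_weights = best.state_dict()
    return best_weights

# Hypothetical usage with trivial stand-ins for the agent, training, and evaluation:
weights = in_weights_accumulation(
    make_agent=lambda: nn.Linear(4, 2),
    train_one_lifetime=lambda agent: None,       # no-op stand-in for RL training
    evaluate=lambda agent: float(agent.weight.sum()),
)
```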

The results demonstrate that both in-context and in-weights accumulation lead to improved performance compared to agents trained for a single lifetime. This suggests that cultural accumulation can enhance the learning capabilities of RL agents.

One potential critique of this work is that the experiments are limited to a small set of environments. It would be valuable to explore the effectiveness of cultural accumulation in a broader range of tasks.

The implications of this work are significant for the field of RL. Cultural accumulation provides a mechanism for agents to learn from the experiences of previous generations, leading to improved performance and the potential for open-ended learning. This research opens up new avenues for developing more advanced and flexible RL algorithms. Additionally, the concept of cultural accumulation has parallels to human cultural evolution, providing insights into how knowledge and skills are accumulated and transmitted in human societies.

# Transformers are SSMs - Generalized Models and Efficient Algorithms Through Structured State Space Duality

This paper introduces a framework called Structured State Space Duality (SSD) that connects state space models (SSMs) and attention mechanisms, such as those used in Transformers. The authors show that SSMs and attention have a deep theoretical connection through the abstraction of structured matrices. This connection allows for the development of efficient algorithms to compute SSMs and opens up new possibilities for architecture design and systems optimizations.
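
For intuition about the duality, here is a toy scalar-state example showing that a linear SSM recurrence computes exactly the same outputs as multiplication by a lower-triangular, attention-mask-like matrix whose entries are products of decay terms. The paper's semiseparable-matrix formulation is far more general; this is just the simplest instance.

```python
import numpy as np

T = 6
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, T)   # per-step decay (state transition)
b = rng.normal(size=T)         # input projection
c = rng.normal(size=T)         # output projection
x = rng.normal(size=T)

# Recurrent (SSM) view: a sequential scan over the sequence.
h, y_scan = 0.0, np.zeros(T)
for t in range(T):
    h = a[t] * h + b[t] * x[t]
    y_scan[t] = c[t] * h

# Matrix (attention-like) view: y = M x with a lower-triangular "mask" whose
# entries are products of decays, M[t, s] = c[t] * a[s+1] * ... * a[t] * b[s].
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = c[t] * np.prod(a[s + 1:t + 1]) * b[s]
y_matrix = M @ x

assert np.allclose(y_scan, y_matrix)   # both views give identical outputs
```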

The authors propose a new architecture called Mamba-2, which is based on the SSD framework. Mamba-2 is a refinement of the Mamba SSM, with a core layer that is 2-8 times faster than Mamba while maintaining competitive performance with Transformers on language modeling tasks. The authors also demonstrate that Mamba-2 outperforms other models on downstream evaluations.

The SSD framework enables the use of established conventions and techniques for attention in the design of SSMs. It also allows for the application of optimization techniques developed for Transformers to improve the training efficiency of SSMs. The authors demonstrate the effectiveness of these techniques in achieving faster training and inference times for SSMs.

One potential critique of the paper is that the theoretical connections between SSMs and attention may not be immediately intuitive to readers unfamiliar with the field. However, the authors provide clear explanations and visualizations to help readers understand the concepts.

The implications of this work are significant for the development of more efficient and powerful sequence models. The SSD framework bridges the gap between SSMs and attention mechanisms, enabling the transfer of algorithms and optimizations between these two families of models. This opens up new directions for research and development in the field of deep learning.

# Memorization in deep learning - A survey

This survey explores the phenomenon of memorization in deep learning, where deep neural networks (DNNs) tend to memorize specific details from training examples instead of learning general patterns. The authors provide a systematic framework to organize memorization definitions and evaluation methods based on generalization and security/privacy domains. They review literature on DNN memorization behaviors, its impact on security and privacy, and its connection with forgetting. Additionally, the survey discusses various applications leveraging memorization and forgetting mechanisms. The findings contribute to a better understanding of memorization in DNNs and its implications for AI development and ethical concerns.

# Transformers need glasses! Information over-squashing in language tasks

This paper investigates the limitations of decoder-only Transformers, which are the backbone of many large language models (LLMs). The authors conduct a theoretical analysis of the information propagation in these models and identify two key limitations: representational collapse and over-squashing.

Representational collapse refers to the phenomenon where certain sequences of inputs yield arbitrarily close representations in the final token of the Transformer. This means that the model is unable to distinguish between these sequences, leading to errors in tasks involving counting or copying.
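
A toy illustration of the flavor of this result (not the paper's exact construction): if the final token's representation behaves roughly like an average over the sequence's value vectors, then appending one more repeated token changes it less and less as the sequence grows, so sequences of different lengths become nearly indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)
v_one, v_zero = rng.normal(size=8), rng.normal(size=8)   # toy token value vectors

def final_rep(n_ones: int) -> np.ndarray:
    # Sequence of n_ones copies of "1" followed by a single "0".
    values = np.vstack([np.tile(v_one, (n_ones, 1)), v_zero[None, :]])
    return values.mean(axis=0)   # stand-in for near-uniform attention at the last token

for n in (10, 100, 1000):
    gap = np.linalg.norm(final_rep(n) - final_rep(n + 1))
    print(f"n={n:4d}  ||rep(n) - rep(n+1)|| = {gap:.5f}")   # shrinks toward 0 as n grows
```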

Over-squashing, on the other hand, is a result of the unidirectional flow of information in decoder-only Transformers. Tokens that appear earlier in the input sequence have more paths to reach the final token representation, while tokens that appear later have fewer paths. This leads to a loss of sensitivity to specific tokens in the input.

The authors provide empirical evidence to support their theoretical analysis, showing that contemporary LLMs are indeed affected by these limitations. They also propose simple solutions to mitigate these issues.

Critiques of this work may include the limited scope of the analysis, as it focuses only on decoder-only Transformers and specific tasks. Additionally, the empirical evidence provided may not cover a wide range of LLMs and may not be generalizable to all language models.

The implications of this research are significant, as it highlights fundamental limitations in the architecture of decoder-only Transformers. Understanding these limitations can guide further improvements in LLMs and help address failure cases in tasks involving counting, copying, and other fundamental computation operations.

# Position Paper - An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience

This paper proposes a conceptual framework and methodological strategies for inner interpretability in AI, drawing lessons from the field of Cognitive Neuroscience. Inner interpretability aims to understand the internal mechanisms of AI systems, but recent critiques question its usefulness and highlight issues such as incomplete mechanisms, weak motivations for abstractions, and overoptimistic bottom-up approaches. These issues are similar to those faced by Cognitive Neuroscience in understanding the brain. Cognitive Neuroscience has tackled these issues by developing multilevel conceptual frameworks and methodological strategies. The paper suggests applying these lessons to inner interpretability, including the importance of complete mechanistic explanations, choosing appropriate levels of abstraction, and balancing bottom-up and top-down approaches. By adopting this framework, inner interpretability can address critiques and advance our understanding of AI systems.

# Federated Model Heterogeneous Matryoshka Representation Learning

The paper proposes a federated model heterogeneous Matryoshka Representation Learning (FedMRL) approach for supervised learning tasks in federated learning (FL). FedMRL addresses the challenges of data, system, and model heterogeneity by adding a shared global auxiliary homogeneous small model that interacts with clients' local heterogeneous models. It introduces adaptive representation fusion and multi-granular representation learning to enhance knowledge transfer between the server and client models. Theoretical analysis shows that FedMRL achieves an O(1/T) convergence rate in the non-convex setting. Experimental results on benchmark datasets demonstrate the superiority of FedMRL in terms of model accuracy, communication cost, and computation overhead compared to seven state-of-the-art baselines.
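
For readers unfamiliar with Matryoshka-style representations, here is a minimal sketch of a multi-granular loss that trains nested prefixes of a representation so that coarse and fine slices are all useful. The function and head names are ours, and this is not FedMRL's actual training procedure, just the underlying idea.

```python
import torch
import torch.nn.functional as F

def matryoshka_loss(reps: torch.Tensor, labels: torch.Tensor,
                    heads: torch.nn.ModuleDict, granularities=(16, 32, 64, 128)):
    """Illustrative multi-granular loss: classify from nested prefixes of the
    representation, so every prefix size carries usable information."""
    total = 0.0
    for d in granularities:
        logits = heads[str(d)](reps[:, :d])      # one linear head per prefix size
        total = total + F.cross_entropy(logits, labels)
    return total / len(granularities)

# Hypothetical usage with a 128-dim representation and 10 classes:
heads = torch.nn.ModuleDict({str(d): torch.nn.Linear(d, 10) for d in (16, 32, 64, 128)})
reps, labels = torch.randn(8, 128), torch.randint(0, 10, (8,))
loss = matryoshka_loss(reps, labels, heads)
```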

# Multi-layer Learnable Attention Mask for Multimodal Tasks

This paper introduces the Learnable Attention Mask (LAM), a mechanism designed to prioritize and regulate tokens in complex input sequences. The LAM module generates a mask that is applied to the attention scores in a transformer-based model, allowing for dynamic token prioritization. The authors validate the efficacy of the LAM through experiments on various datasets, including tasks such as audio description generation, moment retrieval, highlight detection, image classification, and video captioning. The results demonstrate significant performance improvements when the LAM is applied to multimodal encoders, while minimal improvements are observed for single-modality encoders. The multi-layer version of the LAM further enhances performance by incorporating different information aspects at each layer of the transformer network. The paper also provides an analysis of the attention mask's influence on model performance, showcasing the benefits of the LAM module. However, potential critiques could include the need for further evaluation on additional datasets and the scalability of the LAM to even longer input sequences. Overall, the LAM presents a valuable contribution to enhancing model performance in multimodal settings and provides a flexible solution for integrating token prioritization capabilities into existing transformer encoder architectures.
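
A minimal sketch of what a learnable attention mask could look like, assuming it scores tokens with a small learned module and adds those scores to the attention logits before the softmax; the class and variable names are ours, not the paper's.

```python
import torch
import torch.nn as nn

class LearnableAttentionMask(nn.Module):
    """Illustrative sketch: score each token, then add the scores to the attention
    logits so low-priority tokens are down-weighted across all heads and queries."""
    def __init__(self, d_model: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, attn_logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, d_model); attn_logits: (batch, heads, seq, seq)
        token_scores = self.scorer(tokens).squeeze(-1)   # (batch, seq)
        mask = token_scores[:, None, None, :]            # broadcast over heads and queries
        return torch.softmax(attn_logits + mask, dim=-1)

# Hypothetical usage:
lam = LearnableAttentionMask(d_model=64)
tokens = torch.randn(2, 10, 64)
logits = torch.randn(2, 4, 10, 10)
attn = lam(logits, tokens)   # (2, 4, 10, 10); each row sums to 1
```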

# Too Big to Fail - Larger Language Models are Disproportionately Resilient to Induction of Dementia-Related Linguistic Anomalies

This study investigates the behavior of large-scale neural language models (NLMs) in relation to dementia-related linguistic anomalies. The authors propose a method to simulate cognitive impairment in NLMs by masking attention heads, which represent different aspects of language processing. They hypothesize that larger NLMs require a larger proportion of attention heads to be masked to exhibit similar degradation to smaller models. The study finds that larger models are indeed more resilient to masking, suggesting an analogue to the concept of cognitive reserve in the human brain. The results demonstrate the potential of NLMs to model aspects of neurodegenerative disorders and aging.
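
Here is a simple sketch of attention-head masking of the sort described, assuming head outputs are zeroed out at random; the exact masking procedure in the paper may differ.

```python
import torch

def mask_attention_heads(head_outputs: torch.Tensor, fraction: float) -> torch.Tensor:
    """Zero out a random fraction of heads, a simple stand-in for the impairment
    procedure (head_outputs: batch x heads x seq x head_dim)."""
    n_heads = head_outputs.shape[1]
    n_masked = int(round(fraction * n_heads))
    idx = torch.randperm(n_heads)[:n_masked]   # which heads to silence
    masked = head_outputs.clone()
    masked[:, idx] = 0.0
    return masked

# Hypothetical usage: mask 25% of 12 heads.
out = torch.randn(2, 12, 16, 64)
impaired = mask_attention_heads(out, fraction=0.25)
```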

# Towards Scalable Automated Alignment of LLMs - A Survey

This survey explores the recent advancements in automated alignment methods for large language models (LLMs). The traditional alignment methods based on human annotation are becoming unsustainable due to the rapid development of LLMs. Automated alignment aims to construct scalable alignment systems with minimal human intervention. The survey categorizes automated alignment methods into four categories: aligning through inductive bias, aligning through behavior imitation, aligning through model feedback, and aligning through environment feedback. Each category is discussed in detail, covering the current progress, limitations, and potential future directions. The survey also explores the underlying mechanisms of automated alignment and discusses the essential factors that make it feasible and effective.

# What Do Language Models Learn in Context? The Structured Task Hypothesis

This paper investigates the theories behind in-context learning (ICL) in large language models (LLMs). The authors examine three hypotheses that explain how LLMs learn in context: task selection, meta-learning, and the structured task hypothesis. Through a series of experiments on text classification tasks, they provide evidence against the first two hypotheses and support for the third. The results suggest that LLMs can learn a novel task in context by composing tasks learned during pre-training. These findings have implications for understanding the capabilities and learning mechanisms of LLMs.

# ReLU-KAN - New Kolmogorov-Arnold Networks that Only Need Matrix Addition, Dot Multiplication, and ReLU

This paper presents ReLU-KAN, a new implementation of Kolmogorov-Arnold Networks (KAN) that addresses the original KAN's poor suitability for GPU parallel computing. ReLU-KAN simplifies KAN's basis function design by using only ReLU and point-wise multiplication, allowing for efficient CUDA computation. The architecture is implemented in PyTorch and achieves a 20x speedup over traditional KAN for 4-layer networks. ReLU-KAN also exhibits stable training and superior fitting accuracy compared to KAN, while still preserving the "catastrophic forgetting avoidance" property. The results demonstrate the potential of ReLU-KAN for efficient and accurate deep learning applications.
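
As we read it, the core trick is a bump-shaped basis function built only from ReLU and point-wise multiplication; a hedged sketch is below (the normalization and the surrounding convolution/linear layers may differ from the paper's exact formulation).

```python
import torch

def relu_kan_basis(x: torch.Tensor, starts: torch.Tensor, ends: torch.Tensor) -> torch.Tensor:
    """Bump-shaped basis built only from ReLU and point-wise multiplication.
    x: (batch, 1); starts/ends: (num_basis,) interval endpoints with ends > starts."""
    left = torch.relu(x - starts)        # rises inside each interval
    right = torch.relu(ends - x)         # falls toward each interval's end
    bump = (left * right) ** 2           # smooth bump supported on [start, end]
    return bump * (16.0 / (ends - starts) ** 4)   # scale each bump's peak to ~1

# Hypothetical usage: three overlapping basis bumps evaluated on 5 inputs.
x = torch.linspace(0, 1, 5).unsqueeze(1)        # (5, 1)
starts = torch.tensor([0.0, 0.25, 0.5])
ends = torch.tensor([0.5, 0.75, 1.0])
phi = relu_kan_basis(x, starts, ends)            # (5, 3) basis activations
```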

# Social Simulacra - Creating Populated Prototypes for Social Computing Systems

This paper introduces social simulacra, a prototyping technique that uses large language models to generate realistic social behaviors that may arise in a populated social computing system. The authors create a web-based tool called SimReddit to demonstrate the capabilities of social simulacra. The tool allows designers to create a new subreddit and generates user personas and interactions based on the designer's input. The authors conduct evaluations to assess the believability of the generated content and the usefulness of social simulacra for designers. The results show that participants have difficulty distinguishing between real and generated content, and designers find social simulacra valuable for exploring the range of social behaviors and refining their designs.

# Representations as Language - An Information-Theoretic Framework for Interpretability

This research introduces a new approach to interpretability in deep-learning models, focusing on the structure of the representations they learn. The authors propose an information-theoretic framework to measure the compression, regularity, variation, and disentanglement of the representations. They apply this framework to Transformer models trained on two semantic parsing datasets and analyze the trajectory of training in terms of two distinct phases. In the first phase, the model aligns representations with tokens and parts of speech, while in the second phase, representations become more robust to noise. The authors also investigate the impact of model size on the structure of representations. The results show that larger models compress their representations more than smaller models and that the structure of the representations is related to generalization performance. However, it's worth noting that the specific results may be influenced by the chosen hyperparameters and dataset. Overall, this research provides insights into the systematic structure of representations learned by deep-learning models and highlights the importance of robustness and compression for generalization.
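
As a loose illustration of what a "compression" measurement on representations can look like (a generic proxy, not the paper's information-theoretic metric), one can count how many principal directions are needed to explain most of the variance of the hidden states; more compressed representations need fewer.

```python
import numpy as np

def effective_dimensionality(reps: np.ndarray, variance_threshold: float = 0.95) -> int:
    """Generic compression proxy: number of principal directions needed to
    explain `variance_threshold` of the variance (lower = more compressed)."""
    centered = reps - reps.mean(axis=0, keepdims=True)
    # Singular values of the centered matrix give per-direction variance.
    s = np.linalg.svd(centered, compute_uv=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(explained, variance_threshold) + 1)

reps = np.random.default_rng(0).normal(size=(500, 128))   # placeholder hidden states
print(effective_dimensionality(reps))
```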

# Prediction-powered Generalization of Causal Inferences

This paper addresses the challenge of generalizing causal inferences from a randomized controlled trial (RCT) to a target population where some effect modifiers have a different distribution. The authors propose a method that combines data from the trial with an additional observational study (OS) to improve generalization. They develop prediction-powered estimators that leverage a predictive model learned from the OS, without making any assumptions about the OS. The authors show that their methods facilitate better generalization when the OS is of high quality, and remain robust even when the OS has unmeasured confounding. They provide theoretical and empirical evidence to support their claims.
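
The general prediction-powered recipe, illustrated here for a simple population mean rather than the paper's causal estimands: use the model's predictions on the large target sample, then correct them with a "rectifier" estimated from the labeled trial sample. The data below are synthetic placeholders.

```python
import numpy as np

def prediction_powered_mean(y_trial: np.ndarray, preds_trial: np.ndarray,
                            preds_target: np.ndarray) -> float:
    """Prediction-powered style estimate of a target-population mean: model
    predictions on the target sample plus a bias correction ("rectifier")
    from the labeled trial sample. Illustrative, not the paper's estimator."""
    rectifier = np.mean(y_trial - preds_trial)   # how wrong the model is on labeled data
    return float(np.mean(preds_target) + rectifier)

rng = np.random.default_rng(0)
y_trial = rng.normal(1.0, 1.0, size=200)             # outcomes observed in the trial
preds_trial = y_trial + rng.normal(0.3, 0.5, 200)    # biased model predictions on the trial
preds_target = rng.normal(1.3, 0.5, size=5000)       # model predictions on the target population
print(prediction_powered_mean(y_trial, preds_trial, preds_target))
```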

Thanks for reading/listening, that's all for this week.

Please consider checking out Tunadorable's youtube channel where he provides commentary on the above papers.

https://youtube.com/@Tunadorable

Here is the most up-to-date version of the python scripts I currently use to create this newsletter:

https://github.com/evintunador/arxiv-summaries-workflow
