Tunadorable’s Substack
Weekly AI Paper Summaries
This Week's New AI Papers - May 25, 2024

one of these days imma need to add arxiv links

Welcome to Tunadorable's weekly AI newsletter, where we summarize his favorite articles of the week that he plans to read.

This article was written by gpt-3.5-turbo-16k on 2024-05-25.


# Preparing for Black Swans - The Antifragility Imperative for Machine Learning

This paper introduces the concept of "antifragility" in the context of machine learning systems and proposes a formal definition based on online decision making. Antifragility refers to systems that not only withstand volatility and uncertainty but actually benefit from them. The paper argues that current approaches focused on robustness and resistance to change are insufficient for high-stakes applications. Instead, the paper advocates for the design of systems that can adapt and thrive in the face of nonstationarity and distribution shifts.

The paper provides a rigorous definition of antifragility based on dynamic regret, which measures the system's performance relative to the best possible action over time. Antifragility is defined as achieving sublinear dynamic regret for any sequence of actions in a changing environment. The key requirement is that the system's response to environmental variability is strictly concave, indicating that it benefits from volatility.
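
For orientation, the standard notion of dynamic regret compares the learner's cumulative loss to that of the best time-varying comparator (the paper's exact notation may differ):

```latex
R_T^{\mathrm{dyn}} \;=\; \sum_{t=1}^{T} \ell_t(x_t) \;-\; \sum_{t=1}^{T} \ell_t(x_t^{\ast}),
\qquad x_t^{\ast} \in \arg\min_{x \in \mathcal{X}} \ell_t(x).
```

Antifragility in the paper's sense then asks that this quantity grow sublinearly in T even as the loss sequence shifts.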

The paper also discusses the challenges and limitations of achieving antifragility in machine learning systems. It highlights existing lower bounds that suggest antifragility may be theoretically unattainable. However, the paper suggests that these bounds may not capture the full range of real-world scenarios and that practical systems with appropriate inductive biases could potentially exceed these bounds.

The implications of antifragility in machine learning are significant, particularly for high-stakes applications where reliability and safety are paramount. Antifragile systems have the potential to adapt and improve in the face of unforeseen disruptions, reducing the risk of catastrophic failures. The paper identifies potential computational pathways for engineering antifragility, including meta-learning, safe exploration, continual learning, and multi-objective optimization. It also emphasizes the need for clear guidelines, risk assessment frameworks, and interdisciplinary collaboration to ensure responsible application of antifragility principles.

One potential critique of the paper is the lack of concrete examples or case studies to illustrate the concept of antifragility in machine learning. While the paper provides examples from various domains, it would be helpful to see specific applications or algorithms that demonstrate antifragility. Additionally, the paper acknowledges the need for further research and development to address the challenges and limitations of achieving antifragility.

Overall, this paper provides a thought-provoking exploration of antifragility in machine learning and lays the foundation for future research in this area. It highlights the importance of designing systems that can thrive in volatile and uncertain environments, rather than simply resisting or mitigating the effects of change.

# xLSTM - Extended Long Short-Term Memory

This paper introduces Extended Long Short-Term Memory (xLSTM), an extension of the LSTM architecture that overcomes some of its limitations and scales to large language models. The xLSTM introduces exponential gating and modified memory structures to enhance the storage capacity and revisability of LSTMs. It consists of two variants: sLSTM, which uses a scalar memory and introduces exponential gating, and mLSTM, which uses a matrix memory and a covariance update rule. These variants are integrated into residual block backbones to create xLSTM blocks and architectures. The xLSTM models show promising performance in language modeling tasks compared to state-of-the-art Transformers and State Space Models. The xLSTM architecture has linear computation and constant memory complexity, making it suitable for industrial applications and edge implementations.
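
To make the exponential-gating idea more concrete, here is a minimal NumPy sketch of an sLSTM-style cell with a scalar memory, a normalizer state, and a max-based stabilizer. It is a simplified illustration of the mechanism described above, not the authors' exact formulation (gate parameterizations, block structure, and the mLSTM matrix memory are omitted).

```python
import numpy as np

def slstm_step(x, h_prev, c_prev, n_prev, m_prev, W, R, b):
    """One step of a simplified sLSTM-style cell with exponential gating.

    Pre-activations for the cell input (z), input gate (i), forget gate (f),
    and output gate (o) come from the current input and previous hidden state.
    A normalizer state n and stabilizer state m keep the exponential gates
    numerically well-behaved. Simplified sketch, not the paper's exact cell.
    """
    pre = W @ x + R @ h_prev + b            # stacked pre-activations, shape (4 * hidden,)
    z_t, i_t, f_t, o_t = np.split(pre, 4)

    z = np.tanh(z_t)                        # cell input
    o = 1.0 / (1.0 + np.exp(-o_t))          # sigmoid output gate

    # Exponential input/forget gates, stabilized by a running max m.
    m = np.maximum(f_t + m_prev, i_t)
    i = np.exp(i_t - m)
    f = np.exp(f_t + m_prev - m)

    c = f * c_prev + i * z                  # scalar memory cell update
    n = f * n_prev + i                      # normalizer update
    h = o * (c / n)                         # normalized hidden state
    return h, c, n, m
```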

# Visualizing, Rethinking, and Mining the Loss Landscape of Deep Neural Networks

This paper explores the loss landscape of deep neural networks (DNNs) by visualizing and mining 1D and 2D curves. The authors categorize the 1D curves into v-basin, v-side, w-basin, w-peak, and vvv-basin curves. They find that Gaussian perturbations commonly lead to v-basin curves, while using the negative gradient or the direction to subsequent checkpoints results in v-side curves. They also propose algorithms to mine w-basin and w-peak curves. The authors provide theoretical insights into the observed phenomena and demonstrate various types of 2D surfaces. The results show that the loss surfaces are smoother than expected, and it is difficult to obtain complex 1D curves by Gaussian perturbation alone. The findings have implications for understanding the loss landscape of DNNs and could aid in improving training strategies.
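
As an illustration of the kind of 1D probe involved (not the authors' exact code), one can evaluate the loss along a line through the current parameters in a chosen direction, such as a Gaussian perturbation or the negative gradient, and inspect the shape of the resulting curve:

```python
import numpy as np

def loss_curve_1d(loss_fn, theta, direction, radius=1.0, num_points=51):
    """Evaluate loss_fn along theta + alpha * direction for alpha in [-radius, radius].

    loss_fn:   callable mapping a flat parameter vector to a scalar loss.
    theta:     flat parameter vector (1D numpy array) at the center of the probe.
    direction: probe direction, e.g. a Gaussian sample or the negative gradient.
    """
    direction = direction / (np.linalg.norm(direction) + 1e-12)  # unit length
    alphas = np.linspace(-radius, radius, num_points)
    losses = np.array([loss_fn(theta + a * direction) for a in alphas])
    return alphas, losses

# Toy example: Gaussian probes of a quadratic bowl produce v-basin curves.
theta = np.zeros(10)
gaussian_dir = np.random.randn(10)
alphas, losses = loss_curve_1d(lambda p: float(p @ p), theta, gaussian_dir)
```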

# Equipping Transformer with Random-Access Reading for Long-Context Understanding

This paper introduces a random-access reading strategy for transformer-based language models to efficiently process long documents. The strategy allows the model to skip irrelevant tokens based on its confidence in predicting future context. The skipping mechanism is applied during both pretraining and fine-tuning phases and has been shown to improve model performance in long-context language modeling tasks. The effectiveness of the method is demonstrated through experiments on the C4 corpus, where skipping during pretraining reduces perplexity and skipping during fine-tuning adapts short-text models to handle long contexts. The proposed method can be further enhanced by incorporating a memory module and leveraging hierarchical structures in documents.

# Images that Sound - Composing Images and Sounds on a Single Canvas

In this paper, the authors propose a method for generating spectrograms that simultaneously look like natural images and sound like natural audio. They call these spectrograms "images that sound". The approach leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. By denoising with both the audio and image diffusion models in parallel, they generate samples that are likely under both models. The generated spectrograms can be converted into natural sounds using a pretrained vocoder or colorized to obtain visually pleasing results. The authors demonstrate the effectiveness of their method through quantitative evaluations, perceptual studies, and qualitative comparisons against baselines. The results show that their method generates spectrograms that better align with both the audio and image prompts, producing high-quality samples of images that sound.
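
The parallel-denoising idea can be sketched roughly as below. This is an illustrative outline assuming two diffusion models that operate on a shared latent, with hypothetical `image_model`, `audio_model`, and `scheduler` interfaces; the authors' actual score combination and guidance details differ.

```python
import torch

@torch.no_grad()
def joint_denoise(image_model, audio_model, scheduler, shape,
                  image_prompt_emb, audio_prompt_emb, w_img=0.5, steps=50):
    """Rough sketch: denoise one shared latent with two diffusion models in parallel.

    At every step each model predicts the noise in the current latent conditioned
    on its own prompt; the predictions are blended before the scheduler update,
    so the final latent is (heuristically) likely under both models.
    Hypothetical model/scheduler interfaces, for illustration only.
    """
    x = torch.randn(shape)                               # shared latent, pure noise
    for t in scheduler.timesteps[:steps]:
        eps_img = image_model(x, t, image_prompt_emb)    # image model's noise estimate
        eps_aud = audio_model(x, t, audio_prompt_emb)    # audio model's noise estimate
        eps = w_img * eps_img + (1.0 - w_img) * eps_aud  # blend the two estimates
        x = scheduler.step(eps, t, x).prev_sample        # standard reverse-diffusion update
    return x   # decode to a spectrogram image, then vocode to audio
```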

# Worldwide Federated Training of Language Models

This paper introduces WorldLM, a system for training language models (LMs) using federated learning (FL) on a global scale. FL allows organizations to collaborate and train LMs without sharing their raw data, addressing concerns around privacy and data ownership. WorldLM addresses the challenges of global FL by creating federations of federations, where each federation can adapt to its unique legal, privacy, and security requirements. It uses a partially-personalized aggregation approach, where a shared backbone model is combined with personalized key layers. This allows the model to adapt to the data distribution within each federation. WorldLM also includes a cross-federation information sharing mechanism, where residual layer embeddings are routed to the most relevant sub-federation based on similarity. Experimental results show that WorldLM outperforms standard FL and approaches the performance of fully personalized local models. It also maintains its advantages even when privacy-enhancing techniques are applied. The methodology is evaluated using hierarchical datasets constructed from The Pile and multilingual datasets from mC4. The results demonstrate the effectiveness of WorldLM in training LMs across different industries and languages.
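
A minimal sketch of the partially-personalized aggregation idea: the shared backbone is averaged across clients while each client keeps its personalized layers local. This is illustrative only; WorldLM's hierarchical federations, key-layer selection, and residual-embedding routing are not shown.

```python
import torch

def partially_personalized_aggregate(client_states, backbone_keys):
    """Average the shared backbone across clients; leave other layers personalized.

    client_states: list of per-client state dicts {param_name: torch.Tensor}.
    backbone_keys: parameter names that belong to the shared backbone.
    """
    shared = {
        key: torch.stack([state[key] for state in client_states]).mean(dim=0)
        for key in backbone_keys
    }
    # Each client adopts the averaged backbone but keeps its personalized layers.
    updated = []
    for state in client_states:
        new_state = dict(state)
        new_state.update(shared)
        updated.append(new_state)
    return shared, updated
```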

# Scaling Monosemanticity - Extracting Interpretable Features from Claude 3 Sonnet

This paper focuses on using sparse autoencoders to extract interpretable features from a large transformer model called Claude 3 Sonnet. The authors successfully trained sparse autoencoders on the model and found a diversity of highly abstract features that respond to and cause abstract behaviors. These features include concepts like famous people, countries, cities, and type signatures in code. They are multilingual, multimodal, and encompass both abstract and concrete instantiations of the same idea. Some of the features are potentially safety-relevant, including those related to security vulnerabilities, bias, deception, and dangerous content. However, the existence of these features does not imply actual harm, and further research is needed to understand their implications and how they can be used for AI safety. The authors also discuss the scaling laws they used to optimize the training of sparse autoencoders and the methodology they employed to assess the interpretability of the learned features. They provide examples of interpretable features related to the Golden Gate Bridge, brain sciences, monuments, and transit infrastructure. The specificity of these features was evaluated using an automated interpretability pipeline and the results support their proposed interpretations. Overall, this research demonstrates that sparse autoencoders can extract meaningful and interpretable features from large transformer models, paving the way for further investigations into AI safety and model behavior.
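
For readers unfamiliar with the tool, a sparse autoencoder of the kind used in this line of work can be sketched as follows. This is a generic formulation with an L1 sparsity penalty on hidden activations, not Anthropic's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Generic sparse autoencoder over residual-stream activations (illustrative).

    Encodes d_model-dimensional activations into a much wider, mostly-zero
    feature vector and reconstructs the input from it; an L1 penalty on the
    features encourages each one to fire rarely and (hopefully) interpretably.
    """
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(features)            # reconstruction of the input
        return recon, features

def sae_loss(x, recon, features, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    mse = ((recon - x) ** 2).sum(dim=-1).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```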

# The Future of Large Language Model Pre-training is Federated

This paper proposes training large language models (LLMs) with federated learning (FL). FL allows different organizations to collaborate by training models on their own data and computational resources, without sharing the data directly. The authors have developed a system for federated LLM training that is scalable, flexible, and reproducible. They have successfully trained a billion-scale federated LLM using limited resources, demonstrating that FL can achieve competitive performance with centralized training. The authors also found that larger federated models find a consensus across clients more easily than smaller ones. This approach allows data-rich actors to participate in LLM training and democratizes the process. The paper highlights the challenges of training LLMs, such as the limited availability of high-quality language data and expensive hardware requirements. It discusses the potential of FL to address these challenges and expand the data and computational sources for LLM training. The paper also mentions previous work on federated fine-tuning of LLMs and its benefits, but emphasizes the need for federated pre-training to fully leverage the advantages of FL. Overall, the paper presents a promising direction for the future of LLM training, enabling collaboration and access to diverse data sources while maintaining privacy and efficiency.
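
The core federated-averaging loop underlying such systems looks roughly like the sketch below (standard FedAvg, shown for orientation; the paper's actual training system addresses much more, including scale, scheduling, and infrastructure).

```python
import copy

def federated_round(global_model, clients, local_train_fn):
    """One round of federated averaging (FedAvg), weighted by client data size.

    clients: list of (dataloader, num_examples) pairs; raw data never leaves a client.
    local_train_fn: trains a copy of the model on one client's data and returns its state dict.
    """
    total_examples = sum(n for _, n in clients)
    client_states, weights = [], []
    for dataloader, num_examples in clients:
        local_model = copy.deepcopy(global_model)       # each client starts from the global weights
        client_states.append(local_train_fn(local_model, dataloader))
        weights.append(num_examples / total_examples)

    # The weighted average of client updates becomes the new global model.
    new_state = {
        key: sum(w * state[key] for w, state in zip(weights, client_states))
        for key in client_states[0]
    }
    global_model.load_state_dict(new_state)
    return global_model
```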

# Quantifying Emergence in Large Language Models

This research proposes a quantifiable definition of emergence in large language models (LLMs) and a low-cost method to estimate the strength of emergence. The authors model LLMs as a Markov process and define emergence as a process where the entropy reduction of the entire sequence exceeds the entropy reduction of individual tokens. They use mutual information between transformer layers to compute entropy reduction and employ an estimation algorithm for high-dimensional continuous representations.

The authors conduct comprehensive experiments on different LLMs under in-context learning (ICL) and natural sentence settings. They find that their emergence metric aligns with existing observations based on performance metrics and reveals novel emergence patterns. They also suggest potential applications of their methodology, such as developing empirical formulas about emergence, detecting hallucinations in LLMs, and estimating the emergence of larger LLMs using smaller ones.

Critiques of the research could include the limitations of using performance metrics to evaluate emergence and the reliance on synthetic datasets for the ICL experiments. Additionally, the generalizability of the findings to other LLMs and the scalability of the proposed method to extremely large or closed-resource LLMs could be questioned.

The implications of this research are significant as it provides a quantifiable definition of emergence in LLMs and a low-cost method to estimate its strength. This opens up possibilities for further exploration and understanding of emergence in LLMs and its potential applications in various fields.

# Generative modeling through internal high-dimensional chaotic activity

This research explores the use of high-dimensional chaotic dynamics in recurrent neural networks as a way to generate new data points that resemble a given training dataset. The authors propose three different architectures and training algorithms for these generative models. The models are trained using simple learning rules that involve simulating the dynamics of the system and updating the parameters based on the training dataset. The authors demonstrate that these models can generate samples that resemble the training dataset, and they quantify the quality of the generated samples using standard accuracy measures. The results show that the models can successfully generate samples that capture the statistical properties of the training dataset. The authors suggest that this approach could be a more biologically plausible way to train generative models, as it does not require external noise injection. They also discuss potential improvements to the learning rules and future directions for research.

# Towards Graph Contrastive Learning - A Survey and Beyond

This survey focuses on Graph Contrastive Learning (GCL) in the context of self-supervised learning (SSL) on graph-structured data. GCL is a technique that aims to learn informative representations from unlabeled graph data by comparing positive and negative examples in the embedding space. The survey provides a comprehensive overview of GCL, including augmentation strategies, contrastive modes, and contrastive optimization objectives. It also explores extensions of GCL to weakly supervised learning, transfer learning, and other data-efficient learning scenarios. Additionally, the survey discusses real-world applications of GCL in domains such as drug discovery, genomics analysis, and recommender systems. The challenges and potential future directions of GCL are also outlined. The survey fills a gap in the existing literature by providing a dedicated exploration of GCL and its potential in various contexts.
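
As a concrete anchor for the contrastive optimization objectives the survey covers, the widely used InfoNCE loss between two augmented views of the same graph (or node) looks roughly like this; specific GCL methods differ in how views and negatives are constructed.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.2):
    """InfoNCE between two views: z1[i] and z2[i] embed the same graph or node.

    Each z1[i] should be most similar to its positive z2[i]; all other rows of z2
    act as in-batch negatives. Shapes: (batch, dim).
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                   # scaled cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)              # positives sit on the diagonal
```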

# Future You - A Conversation with an AI-Generated Future Self Reduces Anxiety, Negative Emotions, and Increases Future Self-Continuity

This paper introduces "Future You," an AI-powered chat intervention designed to improve future self-continuity and mental well-being. The system allows users to chat with a virtual version of their future selves, personalized based on a pre-intervention survey. The future self character adopts the persona of an age-progressed image of the user's present self, and the system generates a synthetic memory to create a believable narrative. After interacting with the system, users reported decreased anxiety and increased future self-continuity. This research demonstrates the effectiveness of AI-generated characters in improving future self-continuity and well-being.

# Scaling-laws for Large Time-series Models

This research paper investigates the scaling behavior of large time-series models (LTMs) and establishes power-law scaling relations with respect to parameter count, dataset size, and training compute. The study uses decoder-only transformer models trained on a diverse dataset comprising around 8 billion data points across 30 million time-series from various domains. The results show that the performance of LTMs scales approximately as a power law with model size, compute resources, and dataset size. The authors also find that the model performance is only weakly sensitive to architecture details such as aspect ratio and the number of attention heads. The findings suggest that LTMs have the potential to achieve state-of-the-art performance in time-series forecasting tasks when provided with sufficient data and model size. One potential critique is the break in power-law behavior observed in the mean squared error and continuous ranked probability score, which warrants further investigation. The implications of this research are that LTMs can serve as foundation models for time-series forecasting across various domains, enabling zero-shot prediction capabilities and improved accuracy.
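
The scaling relations referred to have the generic power-law form familiar from language-model scaling studies; the fitted exponents and constants for time-series models are the paper's own.

```latex
L(X) \;\approx\; \left(\frac{X_0}{X}\right)^{\alpha_X},
\qquad X \in \{\, N \text{ (parameters)},\; D \text{ (dataset size)},\; C \text{ (compute)} \,\}
```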

# Attention as an RNN

This paper proposes Aaren, an attention-based module that combines the efficiency of traditional RNNs with the performance of Transformers. The authors show that attention can be viewed as a special kind of RNN and introduce a parallelized method for computing attention as a many-to-many RNN. Aaren achieves performance comparable to Transformers on various sequential tasks, including reinforcement learning, event forecasting, time series classification, and time series forecasting, while being more time- and memory-efficient. The experiments demonstrate that Aarens can process new tokens efficiently, making them suitable for streaming data.
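
The attention-as-RNN view can be illustrated with an online (token-by-token) computation of softmax attention for a single query, maintaining running numerator, denominator, and max states. This is a sketch of the underlying observation only; Aaren itself adds a parallel prefix-scan formulation and a learned query, which are omitted here.

```python
import numpy as np

def attention_as_rnn(q, keys, values):
    """Compute softmax attention for one query by scanning over (key, value) pairs.

    Maintains a running max m, a rescaled denominator c, and a rescaled numerator a,
    so the result after the final token equals standard softmax attention.
    """
    m = -np.inf                                   # running max of scores, for stability
    c = 0.0                                       # running softmax denominator
    a = np.zeros_like(values[0], dtype=float)     # running weighted sum of values

    for k, v in zip(keys, values):
        s = float(q @ k)                          # attention score for this token
        m_new = max(m, s)
        scale = np.exp(m - m_new)                 # rescale old accumulators to the new max
        c = c * scale + np.exp(s - m_new)
        a = a * scale + np.exp(s - m_new) * v
        m = m_new
    return a / c                                  # equals softmax(q @ K^T) @ V for this query
```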

# Language Reconstruction with Brain Predictive Coding from fMRI Data

This paper explores the use of brain predictive coding in fMRI-to-text decoding, where the goal is to reconstruct natural language from brain signals. The authors propose a model called PREDFT, which consists of a main decoding network and a side network for brain predictive coding. The main decoding network uses a combination of convolutional neural networks and Transformers to reconstruct the text, while the side network extracts brain predictive coding representations from related brain regions. The two networks are then fused together to improve the decoding performance.

The authors conduct experiments on a large naturalistic language comprehension fMRI dataset and compare the performance of PREDFT to a state-of-the-art model called UniCoRN. They evaluate the decoding performance using automatic metrics such as BLEU and ROUGE. The results show that PREDFT outperforms UniCoRN in terms of decoding quality, achieving higher scores in BLEU and ROUGE.

One potential critique of the study is that the selection of regions of interest for the side network is arbitrary and may not capture the full range of brain predictive coding functions. Additionally, the study focuses on a specific dataset and may not generalize to other fMRI datasets or language tasks.

The implications of this research are significant as it provides insights into how the human brain encodes and decodes language, and how this knowledge can be leveraged to improve language reconstruction from brain signals. The findings suggest that brain predictive coding can be a useful heuristic for guiding fMRI-to-text decoding, leading to more accurate and coherent language generation. This research opens up new avenues for studying the neural basis of language and has potential applications in brain-computer interfaces and neurorehabilitation.

# Nonequilibrium physics of generative diffusion models

This paper analyzes generative diffusion models from a physics perspective, specifically focusing on the forward and reverse diffusion processes. The authors derive the fluctuation theorem, entropy production, and potential energy to understand the underlying mechanisms of the models. They show how the reverse generative dynamics can be treated as a statistical inference problem, with the time-dependent state variables serving as quenched disorder. The paper provides a unified principle that connects machine learning and non-equilibrium thermodynamics. The results have implications for designing better algorithms and understanding the phase transitions in generative diffusion models.
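
For reference, the forward and reverse processes being analyzed are usually written as the following pair of stochastic differential equations (standard score-based diffusion notation; the paper's thermodynamic quantities are built on top of these):

```latex
\text{forward:}\quad \mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t \\
\text{reverse:}\quad \mathrm{d}x = \bigl[f(x,t) - g(t)^{2}\,\nabla_x \log p_t(x)\bigr]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w}_t
```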

# Your Transformer is Secretly Linear

This paper investigates the linearity properties of transformer decoders and explores their implications for model optimization and efficiency. The authors analyze the embedding transformations between sequential layers and find a near-perfect linear relationship. However, linearity decreases when the residual component is removed. The authors propose new techniques for pruning and distillation of linear layers, which show promising results in reducing model size without significant performance loss. They also introduce a regularization approach based on cosine similarity, which improves model performance on benchmarks and decreases layer linearity. The findings challenge the traditional understanding of transformer architectures and provide insights into their operation and potential for optimization. However, the limitations of the study include the focus on transformer decoders and the need for further evaluation on larger models and different domains.
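
One simple way to probe this kind of layer-to-layer linearity (an illustrative procedure, not necessarily the authors' exact metric) is to fit a least-squares linear map from one layer's embeddings to the next and measure how much variance it explains:

```python
import torch

def linearity_score(X, Y):
    """Fit Y ~ X @ A by least squares and return 1 - residual/total variance.

    X, Y: (num_tokens, hidden_dim) embeddings from two consecutive layers.
    A score near 1 means the transformation between layers is nearly linear.
    """
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    A = torch.linalg.lstsq(X, Y).solution      # best linear map X -> Y
    residual = ((X @ A - Y) ** 2).sum()
    total = (Y ** 2).sum()
    return float(1.0 - residual / total)
```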

# Super Tiny Language Models

This paper introduces Super Tiny Language Models (STLMs) that aim to deliver high performance with significantly reduced parameter counts compared to traditional large language models (LLMs). The approach involves innovative techniques such as byte-level tokenization with pooling, weight tying, and efficient training strategies. These methods can reduce parameter count by 90% to 95% while maintaining competitive performance. The research aims to make high-performance language models more accessible and practical for a wide range of applications.
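
Of these techniques, weight tying is the easiest to show concretely: the token embedding matrix is reused as the output projection, which removes one of the two largest weight matrices in a small model. A generic sketch, not the authors' code:

```python
import torch.nn as nn

class TinyTiedLM(nn.Module):
    """Minimal LM skeleton illustrating weight tying (illustrative sketch).

    The (vocab_size, d_model) embedding matrix is reused as the output projection,
    roughly halving the parameter count of these two layers.
    """
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.body = nn.Identity()                 # stand-in for the transformer blocks
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight   # tie: one shared parameter tensor

    def forward(self, token_ids):
        hidden = self.body(self.embed(token_ids)) # (batch, seq, d_model)
        return self.lm_head(hidden)               # logits over the vocabulary
```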

The paper discusses related works on parameter reduction techniques such as weight tying, pruning, quantization, and knowledge distillation. It also explores approaches to improve data quality and training efficiency through data selection and knowledge distillation.

The authors propose a technical approach that includes a research repository with clean interfaces and understandable model code to facilitate experiments on small models. They provide details on the training data, evaluation metrics, and the initial benchmarking results of a 10-layer baseline model.

The paper outlines several research projects that will be explored in the future, including weight tying, byte-level/tokenizer-free models with pooling mechanisms, early exit and conditional computation, next thought prediction, dropout and learning rate scheduling, and curriculums/data mixes for training.

The results of the initial benchmarking show that the baseline model is overfitting the small training dataset, indicating the need for further improvement. The proposed research projects aim to address this and improve the performance of STLMs.

Critiques of this approach could include concerns about the trade-off between model size reduction and performance, the generalizability of findings from small models to larger ones, and the limitations of the training data used.

The implications of this research are significant as it addresses the challenges posed by large language models in terms of computational demands, energy consumption, and accessibility. The development of STLMs with reduced parameter counts can make high-performance language models more practical for various applications.

# TimeMixer - Decomposable Multiscale Mixing for Time Series Forecasting

This paper proposes TimeMixer, a novel architecture for time series forecasting that leverages multiscale mixing. The authors observe that time series exhibit distinct temporal variations at different scales, with fine scales capturing microscopic information and coarse scales reflecting macroscopic trends. Based on this observation, TimeMixer extracts multiscale information using Past-Decomposable-Mixing (PDM) blocks, which decompose and mix seasonal and trend components separately. In the future prediction phase, TimeMixer uses Future-Multipredictor-Mixing (FMM) to ensemble predictions from multiple scales. The authors evaluate TimeMixer on various benchmarks and demonstrate its superior performance compared to state-of-the-art models in both long-term and short-term forecasting tasks. TimeMixer's design also yields favorable runtime efficiency.
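
The seasonal/trend split that decomposable-mixing blocks operate on is commonly implemented with a moving average, as in the sketch below. This is the standard decomposition used throughout this line of forecasting work, shown for orientation; TimeMixer's multiscale mixing layers then process the two components separately.

```python
import torch
import torch.nn.functional as F

def series_decompose(x, kernel_size=25):
    """Split a time series into trend and seasonal parts via a moving average.

    x: (batch, length, channels). The moving average (with replicate padding to
    preserve the length) gives the trend; subtracting it leaves the seasonal part.
    """
    pad_left = (kernel_size - 1) // 2
    pad_right = kernel_size - 1 - pad_left
    xt = x.transpose(1, 2)                                    # (batch, channels, length)
    xt = F.pad(xt, (pad_left, pad_right), mode="replicate")   # pad ends to keep the length
    trend = F.avg_pool1d(xt, kernel_size, stride=1).transpose(1, 2)
    seasonal = x - trend
    return seasonal, trend
```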

Thanks for reading/listening, that's all for this week.

Please consider checking out Tunadorable's youtube channel where he provides commentary on the above papers.

https://youtube.com/@Tunadorable

Here is the most up-to-date version of the python scripts I currently use to create this newsletter:

https://github.com/evintunador/arxiv-summaries-workflow
