Tunadorable’s Substack
Weekly AI Paper Summaries
This Week's New AI Papers - June 15, 2024

Welcome to Tunadorable's weekly AI newsletter, where we summarize the papers he found most interesting this week and plans to read.

This article was written by gpt-3.5-turbo-16k on 2024-06-15.

# Grokfast - Accelerated Grokking by Amplifying Slow Gradients

This paper introduces a method called GROKFAST that accelerates grokking, the phenomenon in which a model generalizes long after it has fit the training data. The authors treat the change in model parameters over training iterations as a random signal and use a low-pass filter to amplify its slow-varying component, on the hypothesis that the low-frequency part of the gradient is what drives generalization.

The authors conduct experiments on various tasks and datasets, including algorithmic data, images, languages, and graphs, to demonstrate the effectiveness of their method. They show that GROKFAST can achieve more than 50 times faster generalization compared to the baseline approach. The experiments also reveal that applying GROKFAST in combination with weight decay leads to even faster generalization.

One limitation of the approach is the increased memory requirement when using a windowed moving average filter. To address this, the authors propose an alternative approach using an exponential moving average filter, which reduces the memory footprint while still achieving similar acceleration results.
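
For intuition, here is a minimal sketch of the EMA variant in PyTorch, called between `loss.backward()` and `optimizer.step()`. The function name and hyperparameter values (`alpha` for filter strength, `lamb` for the amplification factor) are illustrative, not the paper's exact code or defaults:

```python
import torch

def grokfast_ema(model, ema=None, alpha=0.98, lamb=2.0):
    # Low-pass filter the gradients with an exponential moving average (EMA),
    # then add the amplified slow component back onto the raw gradient.
    if ema is None:
        ema = {n: torch.zeros_like(p.grad) for n, p in model.named_parameters()
               if p.grad is not None}
    for n, p in model.named_parameters():
        if p.grad is not None:
            ema[n] = alpha * ema[n] + (1 - alpha) * p.grad  # slow-varying component
            p.grad = p.grad + lamb * ema[n]                 # amplify it
    return ema
```

The returned `ema` dict is the filter state, so it should be carried across training steps (`ema = grokfast_ema(model, ema)`); unlike a windowed filter, it stores only one extra tensor per parameter.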

Overall, GROKFAST provides a practical way to accelerate the grokking phenomenon, letting practitioners reach generalization sooner. The method is simple to implement and applicable to a variety of tasks and datasets, though the memory overhead of the windowed variant may make it less suitable for larger models.

# Why Warmup the Learning Rate? Underlying Mechanisms and Improvements

This paper investigates the practice of warmup in deep learning, specifically the warmup of the learning rate. The authors conduct extensive experiments using different architectures, datasets, and optimizers to understand why warmup is beneficial and how it can be improved.

The authors find that the primary benefit of warmup is to allow the network to tolerate larger learning rates, which leads to better performance and more robust hyperparameter tuning. They show that warmup helps the network move towards flatter regions of the loss landscape that can handle larger learning rates.
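
For reference, the warmup under study is typically a simple linear ramp like the generic sketch below (an illustration of the practice, not code from the paper):

```python
def warmup_lr(step, base_lr, warmup_steps):
    # Linearly ramp the learning rate from near zero up to base_lr over
    # warmup_steps, then hold it constant (a decay phase would follow in practice).
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```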

The authors identify two main mechanisms at play during warmup: natural progressive sharpening and natural sharpness reduction. The type of mechanism depends on factors such as initialization and parameterization. They also observe that the warmup duration has an impact on the warmup mechanism, with shorter warmup durations resulting in more intense catapults (training instabilities) and longer warmup durations leading to smaller catapults.

The authors extend their analysis to adaptive optimizers like Adam and find that the underlying mechanisms of warmup are similar to SGD but with a different measure of sharpness. They propose an alternative initialization for Adam, called GI-Adam, that provides benefits similar to warmup and consistently improves performance.

The authors also suggest a way to save on warmup time by using the catapult mechanism to estimate the initial sharpness scale, which allows for a more principled choice of the initial learning rate. This can significantly reduce or even eliminate the need for warmup.

Overall, the paper provides valuable insights into the mechanisms and benefits of warmup in deep learning. The findings can help improve training practices and make hyperparameter tuning more robust. One potential critique is that the experiments are mostly focused on image classification tasks, so the generalizability of the findings to other domains may be limited. Nevertheless, the implications of the study are significant for the deep learning community.

# The Factorization Curse - Which Tokens You Predict Underlie the Reversal Curse and More

In this paper, the authors investigate the "reversal curse," a phenomenon where language models struggle to retrieve information in a different order than they were trained on. They propose a framework called the "factorization curse," which characterizes the reversal curse as a failure to learn the same joint distribution under different factorizations.
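
In symbols (our gloss, not the paper's notation): an autoregressive model fits one particular factorization of the joint distribution over a sequence, even though the same joint admits many others, such as the right-to-left ordering:

```latex
p(x_1, \ldots, x_T) \;=\; \prod_{t=1}^{T} p(x_t \mid x_{<t}) \;=\; \prod_{t=1}^{T} p(x_t \mid x_{>t})
```

A model trained to be accurate only under the first product has no guarantee of being accurate under the second, which is one way to read the reversal curse.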

The authors conduct a series of experiments to explore different training objectives and their effects on knowledge retrieval. They compare autoregressive training, autoregressive training with reversed sequences, masked language modeling with fixed masking rates, and a factorization-agnostic objective called MLM-U.
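
For concreteness, a factorization-agnostic masking objective in the spirit of MLM-U might look like the sketch below, assuming the masking rate is drawn uniformly per sequence (the paper's exact recipe may differ; `mask_id` and the ignore index `-100` are illustrative conventions):

```python
import torch

def mlmu_batch(tokens, mask_id):
    # Sample one masking rate per sequence uniformly from [0, 1), hide that
    # fraction of tokens, and compute loss only on the hidden positions.
    rate = torch.rand(tokens.size(0), 1)
    mask = torch.rand(tokens.shape) < rate
    inputs = tokens.masked_fill(mask, mask_id)   # what the model sees
    targets = tokens.masked_fill(~mask, -100)    # what it must reconstruct
    return inputs, targets
```

Because any subset of tokens can end up conditioned on any other subset, no single factorization order is baked into training.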

Through their experiments, the authors find that the factorization curse is an inherent failure of the next-token prediction objective used in popular language models. They also observe that factorization-agnostic objectives, such as MLM-U, show improved knowledge retrieval capabilities.

The authors highlight the implications of their findings for downstream tasks that require reliable knowledge retrieval. They suggest that finetuning strategies for these tasks may provide mixed results unless the models have already seen the right sequence of tokens.

Overall, the paper provides insights into the limitations of current language models and proposes factorization-agnostic objectives as a potential solution to improve knowledge storage and retrieval capabilities.

# Attention as a Hypernetwork

This research investigates the mechanisms underlying the ability of transformers to generalize to novel problem instances. The authors propose a perspective that views multi-head attention as a hypernetwork, where a low-dimensional latent code specifies key-query specific operations. They find that this latent code is highly structured and captures information about the subtasks performed by the network. The authors also propose a modification of linear attention called Hypernetwork Linear Attention (HYLA), which strengthens compositional generalization on abstract reasoning tasks. They develop a challenging abstract reasoning task called SRAVEN, based on Raven's Progressive Matrices, and show how scaling model size and data enables compositional generalization. Overall, the study provides insight into how transformers achieve compositional generalization and suggests potential improvements to attention mechanisms.
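
To make the hypernetwork reading concrete: for every query-key pair, the vector of attention scores gathered across heads can be read as a latent code. The sketch below (our illustration, not the paper's HYLA implementation) computes ordinary multi-head attention scores and reshapes them to expose that code:

```python
import torch

B, H, T, D = 1, 8, 4, 16  # batch, heads, sequence length, head dimension
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)

scores = torch.einsum('bhqd,bhkd->bhqk', q, k) / D**0.5
attn = scores.softmax(dim=-1)            # (B, H, T_query, T_key)

# Hypernetwork view: for each (query, key) pair, this H-dimensional vector of
# head scores selects which value-space operation the layer applies.
latent_code = attn.permute(0, 2, 3, 1)   # (B, T_query, T_key, H)
```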

# Standard Language Ideology in AI-Generated Language

This position paper explores standard language ideology in AI-generated language, focusing on the reinforcement of language hierarchies in generative AI technologies. The authors discuss how standard language ideology is reflected and reinforced in language models, such as ChatGPT, and the implications for minoritized language communities. They present a taxonomy of open problems, including the default production of "standard" language varieties, lower quality of service for minoritized varieties, stereotyping of languages, appropriation and manipulation of minoritized varieties, and the erasure of minoritized language varieties. The paper highlights the societal implications of these issues, such as the reinforcement of linguistic biases and the perpetuation of harmful stereotypes. The authors call for alternative, more emancipatory approaches to AI-generated language that challenge existing power structures.

# Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

This paper examines the challenge of predicting specific downstream capabilities of advanced AI systems with increasing scale. While scaling laws for pretraining performance are well-established, the predictability of downstream capabilities remains elusive. The authors identify a new factor that hinders predictability on multiple-choice benchmarks: the sequence of transformations that compute downstream metrics degrades the statistical relationship between performance and scale. Specifically, downstream metrics rely on comparing the correct choice against a small set of specific incorrect choices. This means that accurately predicting downstream capabilities requires not only understanding how probability mass concentrates on the correct choice with scale but also how it fluctuates on specific incorrect choices. The authors empirically study the fluctuations of probability mass on incorrect choices and find that it varies with increasing compute. This research sheds light on the challenges of predicting downstream capabilities and contributes to developing more predictable evaluations of frontier AI models.
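
A toy example of the mechanism (entirely hypothetical numbers, not data from the paper): because accuracy compares the correct choice against specific incorrect ones, the metric can flip even when the probability of the correct answer barely moves.

```python
import numpy as np

# Hypothetical per-choice log-likelihoods for one question; column 0 is correct.
# Rows are models of increasing scale.
logps = np.array([
    [-2.0, -1.8, -2.3, -2.4],  # smaller model: an incorrect choice scores highest
    [-1.9, -2.6, -2.7, -2.8],  # larger model: the correct choice wins
])

accuracy = (logps.argmax(axis=1) == 0).astype(float)  # -> [0., 1.]
# The correct choice's score barely changed (-2.0 -> -1.9); the benchmark score
# flipped because mass moved off the competing incorrect choices, which is why
# predicting the metric requires modeling those incorrect choices as well.
```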

# Explore the Limits of Omni-modal Pretraining at Scale

This paper proposes a scalable pretraining paradigm, called Multimodal Context (MiCo), to build omni-modal intelligence that can understand any modality and learn universal representations. The authors collect large-scale multimodal paired data, including text, image, audio, video, depth, and normal maps, to train models using MiCo. They design an omni-modal learning architecture inspired by how the human brain processes multimedia signals and construct a multimodal context to enhance coherent multimodal understanding. The pretrained models are evaluated on various tasks and achieve state-of-the-art performance, demonstrating the effectiveness of MiCo in omni-modal learning. The paper also discusses the challenges and limitations of current multimodal learning approaches. The proposed paradigm opens up possibilities for developing comprehensive omni-modal intelligence and has implications for the future development of AI models.

# The Prompt Report - A Systematic Survey of Prompting Techniques

This paper presents a systematic survey of prompting techniques in the field of Generative Artificial Intelligence (GenAI). The authors establish a structured understanding of prompts by creating a taxonomy of 58 text-based prompting techniques and analyzing their usage. They also provide a comprehensive vocabulary of 33 terms related to prompting. The paper includes a meta-analysis of the literature on natural language prefix-prompting and discusses the use of prompts in multilingual and multimodal settings. Additionally, the authors explore extensions of prompting techniques, such as agents and evaluation methods, and discuss potential issues related to security and alignment. The paper concludes with a discussion of benchmarking and two case studies on prompt engineering. Overall, this paper provides a valuable resource for researchers and developers working with GenAI systems.

# Towards Bidirectional Human-AI Alignment - A Systematic Review for Clarifications, Framework, and Future Directions

This paper presents a systematic review of over 400 research papers on human-AI alignment, focusing on the period between 2019 and 2024. The goal is to clarify the definitions and scopes of human-AI alignment and propose a conceptual framework called "Bidirectional Human-AI Alignment" that encompasses both aligning AI to humans and aligning humans to AI. The framework emphasizes the long-term and dynamic nature of alignment, which is often overlooked in current research. The paper also identifies four key research questions within this framework and discusses the findings from the literature analysis. Finally, the paper outlines three challenges for future research directions and provides examples of potential solutions. The contributions of this paper include providing clarified definitions and scopes of human-AI alignment, developing a comprehensive framework, and identifying future research directions to achieve long-term and dynamic alignment.

# Step-by-Step Diffusion - An Elementary Tutorial

This tutorial introduces diffusion models, which are generative models that learn to transform an easy-to-sample distribution into a target distribution. The main idea is to construct a sequence of distributions that interpolate between the two distributions and then learn a reverse sampler to generate samples from the target distribution.
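
In the Gaussian case such tutorials center on, one concrete interpolation (notation ours) simply adds noise of growing variance:

```latex
x_t = x_0 + \sigma\sqrt{t}\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \qquad t \in [0, 1]
```

Here p_0 is the target distribution, while p_1 is noise-dominated (for large enough sigma) and therefore easy to sample from.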

The tutorial presents the DDPM (Denoising Diffusion Probabilistic Models) sampler, a stochastic reverse sampler that works by learning the conditional expectations between adjacent distributions in the sequence. The correctness of the DDPM sampler is argued heuristically: for small enough step sizes, the true per-step reverse conditional is well-approximated by a Gaussian.
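
A minimal sketch of such a stochastic reverse sampler under the variance-exploding setup above, assuming a hypothetical `x0_predictor(x, t)` that approximates E[x_0 | x_t] (our illustration, not the tutorial's code):

```python
import torch

def ddpm_sample(x0_predictor, n_steps=100, sigma=1.0, shape=(1, 2)):
    # Reverse the process x_t = x_0 + N(0, sigma^2 * t) step by step.
    dt = 1.0 / n_steps
    x = torch.randn(shape) * sigma                # sample from the easy base p_1
    for i in range(n_steps, 0, -1):
        t = i * dt
        x0_hat = x0_predictor(x, t)               # learned conditional expectation
        mu = x + (dt / t) * (x0_hat - x)          # approx. E[x_{t-dt} | x_t]
        if i > 1:
            x = mu + torch.randn_like(x) * sigma * dt**0.5  # re-inject step noise
        else:
            x = mu                                # final step: return the mean
    return x
```

Each step nudges the sample toward the current estimate of x_0 and adds back roughly one step's worth of noise, which is what keeps the per-step Gaussian approximation reasonable.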

The tutorial also discusses the abstract concept of diffusions, which can be applied to various settings, including deterministic samplers, discrete domains, and flow matching. It provides an algorithmic description of the DDPM sampler and explains how to use it to sample from the target distribution.

The core assertion is that diffusion models offer a general framework for generative modeling by reducing the problem of sampling from a target distribution to the problem of learning a reverse sampler. The tutorial provides a simplified mathematical explanation of diffusion models and their samplers, making it accessible to a technical audience.

Potential critiques of diffusion models include the choice of base distribution, the accuracy of the reverse sampler approximation, and the scalability of the method to high-dimensional data. The tutorial acknowledges these limitations and provides references for further exploration.

The implications of diffusion models are that they provide a principled approach to generative modeling and offer flexibility in modeling complex distributions. They can be used in various applications, such as image generation and data augmentation. The tutorial highlights the importance of parameter tuning, noise schedules, and model architecture in practice.

Thanks for reading/listening, that's all for this week.

Please consider checking out Tunadorable's youtube channel where he provides commentary on the above papers.

https://youtube.com/@Tunadorable

Here is the most up-to-date version of the Python scripts I currently use to create this newsletter:

https://github.com/evintunador/arxiv-summaries-workflow
