
This Week's New AI Papers - May 18, 2024

Welcome to Tunadorable's weekly AI newsletter, where we summarize his favorite papers of the week, the ones he plans to read.

This article was written by gpt-3.5-turbo-16k on 2024-05-18.


# What is it for a Machine Learning Model to Have a Capability?

This paper addresses the question of what it means for a machine learning (ML) model to have a capability. The authors develop an account of ML model capabilities by drawing on the philosophical literature on abilities. Their core proposal is a conditional analysis of model abilities (CAMA): a machine learning model has a capability to perform a task X if it would reliably succeed at doing X if it tried. The authors provide an operationalization of CAMA applicable to large language models (LLMs) and show how it can help make sense of various features of ML model evaluation practice and suggest procedures for fair inter-model comparisons.
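
As a toy illustration of how such an operationalization might look in code, here is a minimal sketch. The `model_attempt` elicitation function and `check_success` grader are hypothetical stand-ins, and the reliability threshold is an arbitrary choice rather than anything the paper specifies.

```python
def has_capability(model_attempt, task_inputs, check_success,
                   reliability_threshold=0.8, trials_per_input=5):
    """Rough CAMA-style check: would the model reliably succeed at the
    task if it tried? `model_attempt` stands in for an elicitation
    prompt that gets the model to genuinely try; `check_success` stands
    in for a grader returning True/False."""
    successes, total = 0, 0
    for x in task_inputs:
        for _ in range(trials_per_input):
            output = model_attempt(x)          # elicit a genuine "try" at X
            successes += check_success(x, output)  # bool counts as 0/1
            total += 1
    return successes / total >= reliability_threshold
```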

# Societal Adaptation to Advanced AI

This paper argues that in addition to modifying AI capabilities, society should focus on adapting to advanced AI to reduce potential negative impacts. Capability-modifying interventions, such as regulating deployment or filtering inputs and outputs, become less feasible as the number of AI developers increases. Adaptive interventions, on the other hand, can reduce negative impacts downstream of AI diffusion. The paper introduces a conceptual framework that identifies adaptive interventions, such as avoidance, defense, and remedy strategies, and applies it to examples of AI risk, including election manipulation, cyberterrorism, and loss of control to AI decision-makers. The framework helps society build resilience to advanced AI by implementing a three-step cycle of adaptation. The paper concludes with recommendations for governments, industry, academia, and nonprofits.

# Generalized Holographic Reduced Representations

This paper introduces Generalized Holographic Reduced Representations (GHRR), an extension of Hyperdimensional Computing (HDC), a brain-inspired computational paradigm. GHRR addresses the limitation of HDC in encoding complex data structures by introducing a flexible, non-commutative binding operation. The authors provide a theoretical analysis of GHRR, showing that it satisfies the basic properties of HDC. They also explore the kernel and binding properties of GHRR and compare it to FHRR, a specific implementation of HDC. The authors conduct empirical experiments that demonstrate the flexible non-commutativity of GHRR, its improved decoding accuracy for compositional structures, and its enhanced memorization capacity compared to FHRR. The results suggest that GHRR has the potential to be a computationally and data-efficient alternative to deep learning for learning representations. However, potential critiques could include the need for further empirical validation and scalability analysis in larger-scale applications. The implications of this research include the development of more powerful and flexible models for encoding complex data structures and the potential to reduce the computational and data costs of learning representations.
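
To make the binding contrast concrete, here is a simplified sketch (assuming numpy, and not the paper's exact construction): FHRR binds unit-modulus complex phasors by elementwise multiplication, which is commutative, while a GHRR-style representation replaces each slot with a small unitary matrix so that binding becomes slot-wise matrix multiplication, which generally is not.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100  # slots per hypervector (kept small for the demo)

# FHRR: each slot is a unit-modulus complex phasor; binding is
# elementwise multiplication, which is commutative.
a = np.exp(1j * rng.uniform(0, 2 * np.pi, d))
b = np.exp(1j * rng.uniform(0, 2 * np.pi, d))
assert np.allclose(a * b, b * a)

# GHRR-style sketch: each slot holds a small random unitary matrix and
# binding is slot-wise matrix multiplication, which is non-commutative.
def random_unitary(m=2):
    q, _ = np.linalg.qr(rng.normal(size=(m, m)) +
                        1j * rng.normal(size=(m, m)))
    return q

A = np.stack([random_unitary() for _ in range(d)])
B = np.stack([random_unitary() for _ in range(d)])
AB = np.einsum('nij,njk->nik', A, B)  # bind A then B
BA = np.einsum('nij,njk->nik', B, A)  # bind B then A
print(np.allclose(AB, BA))            # False: binding order now matters
```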

# The Platonic Representation Hypothesis

The Platonic Representation Hypothesis proposes that representations in AI models, particularly deep neural networks, are converging towards a shared statistical model of reality. This convergence is observed across different model architectures, training objectives, and data modalities. The hypothesis suggests that as models become larger and are trained on more diverse tasks and data, they require representations that capture more information about the underlying reality. The alignment of representations is measured using similarity metrics, and evidence from various studies supports the convergence of representations. For example, different models with different architectures and objectives align with each other, larger models exhibit greater alignment, and representations across modalities (such as vision and language) also align. The convergence of representations has implications for the universality and generalizability of AI models, as well as their ability to capture the statistical regularities of the real world. However, there are limitations and potential counterexamples to the hypothesis that should be considered.
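
Alignment between two models can be scored with a mutual nearest-neighbor overlap, in the spirit of the metrics the paper uses. The sketch below (assuming numpy, with an arbitrary choice of k) is an illustration rather than the paper's exact implementation.

```python
import numpy as np

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Fraction of shared k-nearest neighbors between two feature sets
    computed over the same n inputs (one row per input). Higher means
    the two models impose more similar neighborhood structure."""
    def knn_sets(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        sim = x @ x.T                            # cosine similarities
        np.fill_diagonal(sim, -np.inf)           # exclude self-matches
        return np.argsort(-sim, axis=1)[:, :k]   # indices of k nearest
    nn_a, nn_b = knn_sets(feats_a), knn_sets(feats_b)
    overlaps = [len(set(ra) & set(rb)) / k for ra, rb in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))
```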

# Machine Unlearning - A Comprehensive Survey

This survey provides a comprehensive overview of machine unlearning, a recent research area that aims to protect users' privacy by removing their data from trained machine learning models. The survey categorizes machine unlearning methods into four scenarios: centralized unlearning, distributed and irregular data unlearning, unlearning verification, and privacy and security issues in machine unlearning.

In the centralized unlearning scenario, the survey classifies methods into two categories: exact unlearning and approximate unlearning. Exact unlearning designs retraining schemes that reduce computational cost, while approximate unlearning adjusts the trained model so that its posterior approximates that of a model retrained from scratch on the remaining data. The survey provides a detailed introduction to the techniques used in these two categories.
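
A minimal sketch of one well-known exact-unlearning scheme, sharded retraining in the style of SISA (assuming scikit-learn and binary 0/1 labels); the survey covers many more methods than this one:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_shard(X, y):
    return LogisticRegression(max_iter=1000).fit(X, y)

class ShardedEnsemble:
    """Train one model per data shard; deleting a point then only
    requires retraining its shard, giving exact unlearning cheaply."""
    def __init__(self, X, y, n_shards=4):
        idx = np.array_split(np.arange(len(X)), n_shards)
        self.shards = [(X[i], y[i]) for i in idx]
        self.models = [train_shard(Xs, ys) for Xs, ys in self.shards]

    def unlearn(self, shard_id, row_in_shard):
        Xs, ys = self.shards[shard_id]
        keep = np.arange(len(Xs)) != row_in_shard
        self.shards[shard_id] = (Xs[keep], ys[keep])
        self.models[shard_id] = train_shard(*self.shards[shard_id])

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.models])
        return np.round(votes.mean(axis=0))  # majority vote, 0/1 labels

X = np.random.default_rng(0).normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
ens = ShardedEnsemble(X, y)
ens.unlearn(shard_id=2, row_in_shard=10)  # "forget" one training point
```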

In addition to centralized unlearning, the survey also discusses distributed and irregular data unlearning scenarios, including federated unlearning and graph unlearning. It reviews the studies on unlearning verification and explores the privacy and security issues associated with machine unlearning.

The survey highlights the challenges in different machine unlearning scenarios and presents potential research directions to address these challenges. It also discusses the implications of machine unlearning in terms of privacy protection and security enhancement.

Overall, this survey provides a comprehensive understanding of machine unlearning and its various applications and challenges. It serves as a valuable resource for researchers and practitioners in the field of machine learning and privacy protection.

# HMT - Hierarchical Memory Transformer for Long Context Language Processing

This paper introduces a novel framework called Hierarchical Memory Transformer (HMT) for processing long-context language. HMT imitates the memory hierarchy of the human brain to enable and improve the long-context processing ability of any model. It uses a memory-augmented segment-level recurrent model to organize memory into sensory, short-term, and long-term layers, allowing the model to recall relevant information from history. HMT outperforms existing models designed for long-context processing and can further enhance the effectiveness of these models. It is a model-independent framework that can be easily integrated into future language models.
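
A schematic of the segment-level recurrence in plain Python: the `encode` function and the sensory/short-term/long-term handoff below are stand-ins for the paper's learned components (assuming numpy), so this is a sketch of the memory flow rather than HMT itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(segment):
    """Stand-in for a language model's segment encoder (hypothetical)."""
    return rng.normal(size=64)

def recall(query, long_term, k=3):
    """Retrieve the k most similar long-term memory embeddings."""
    if not long_term:
        return np.zeros_like(query)
    mem = np.stack(long_term)
    sims = mem @ query / (np.linalg.norm(mem, axis=1)
                          * np.linalg.norm(query) + 1e-9)
    top = np.argsort(-sims)[:k]
    return mem[top].mean(axis=0)

short_term, long_term = [], []
for segment in ["seg1", "seg2", "seg3"]:    # stand-ins for token chunks
    h = encode(segment)                     # "sensory" summary of segment
    context = recall(h, long_term)          # recall relevant history
    h = h + context                         # condition on recalled memory
    short_term.append(h)
    if len(short_term) > 2:                 # oldest short-term entries
        long_term.append(short_term.pop(0)) # graduate to long-term memory
```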

# A Brief Introduction to Causal Inference in Machine Learning

This lecture note introduces the concept of causal inference in machine learning, focusing on how to determine the causal relationship between variables. It uses probabilistic graphical models and structural causal models to represent the causal relationships among variables. The note explains how to learn from data to infer the parameters of the models and how to use them to make predictions. It also discusses the limitations and challenges of causal inference in machine learning. The implications of causal inference in machine learning include improving out-of-distribution generalization and understanding the underlying causal mechanisms in complex systems. However, there are still many open questions and challenges in the field that need further research.
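
A small worked example of the observational/interventional distinction on a toy structural causal model (assuming numpy; the coefficients are arbitrary): regressing Y on X observationally mixes the causal effect with a confounding path, while intervening with do(X = x0) severs the confounder's influence on X.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy SCM: Z -> X, Z -> Y, X -> Y.
z = rng.normal(size=n)                 # confounder
x = 0.8 * z + rng.normal(size=n)       # X := f(Z, noise)
y = 1.5 * x + 2.0 * z + rng.normal(size=n)

# Observational slope of Y on X mixes the causal effect (1.5) with the
# confounding path through Z.
obs_slope = np.cov(x, y)[0, 1] / np.var(x)

# Intervening with do(X = 1) cuts the Z -> X edge.
x_do = np.full(n, 1.0)
y_do = 1.5 * x_do + 2.0 * z + rng.normal(size=n)

print(obs_slope)     # ~2.5, inflated by confounding
print(y_do.mean())   # ~1.5, the true causal effect of do(X = 1)
```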

# A Survey of Large Language Models for Graphs

This paper presents a comprehensive survey of the use of Large Language Models (LLMs) in graph learning tasks. The authors propose a taxonomy to categorize existing methods based on the framework design, including GNNs as Prefix, LLMs as Prefix, LLMs-Graphs Integration, and LLMs-Only. They discuss representative works in each category and highlight their strengths and limitations. The survey also explores potential avenues for future research, including addressing integration challenges and venturing into new application areas. The paper serves as a valuable resource for researchers and practitioners interested in leveraging LLMs for graph learning tasks.
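
As a rough sketch of the "GNNs as Prefix" category, here is the data flow with stand-in components (assuming numpy): a graph encoder produces node embeddings that are prepended as soft tokens to the LLM input. Real systems use a trained GNN and an LLM that accepts embedding inputs; everything below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_encode(adj, feats, hops=2):
    """Toy message passing: average neighbor features, `hops` times."""
    deg = adj.sum(axis=1, keepdims=True) + 1e-9
    h = feats
    for _ in range(hops):
        h = adj @ h / deg
    return h

adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
feats = rng.normal(size=(3, 16))            # 3 nodes, 16-dim features
graph_tokens = gnn_encode(adj, feats)       # soft "prefix" tokens
text_tokens = rng.normal(size=(5, 16))      # stand-in text embeddings
llm_input = np.concatenate([graph_tokens, text_tokens])  # prefix + text
```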

# DEPTH - Discourse Education through Pre-Training Hierarchically

This paper presents DEPTH, a hierarchical language model that improves the discourse understanding capabilities of encoder-decoder models like T5. DEPTH combines the pre-training objectives of T5 with the discourse-oriented objectives of the Sentence-Level Language Model (SLM) to train the model to represent both sub-word and sentence-level dependencies. During pre-training, DEPTH learns semantic and discourse-level representations faster than T5, achieving lower span-corruption loss. In downstream evaluations, DEPTH demonstrates its ability to quickly learn diverse tasks that require syntactic, semantic, and discourse capabilities. These findings suggest that DEPTH can enhance the discourse understanding capabilities of encoder-decoder models without sacrificing performance on other natural language understanding tasks.
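
For context, here is a toy version of the T5-style span-corruption objective that DEPTH builds on (sentinel names follow T5's convention; the random, non-overlapping span placement below is a simplification):

```python
import random

def span_corrupt(tokens, span_len=3, n_spans=2, seed=0):
    """Toy T5-style span corruption: replace spans with sentinels in the
    input; the target reproduces each sentinel followed by its span."""
    rng = random.Random(seed)
    starts = sorted(rng.sample(range(0, len(tokens) - span_len), n_spans))
    inp, tgt, cursor = [], [], 0
    for i, s in enumerate(starts):
        s = max(s, cursor)                   # keep spans non-overlapping
        sentinel = f"<extra_id_{i}>"
        inp += tokens[cursor:s] + [sentinel]
        tgt += [sentinel] + tokens[s:s + span_len]
        cursor = s + span_len
    inp += tokens[cursor:]
    return inp, tgt

inp, tgt = span_corrupt("the quick brown fox jumps over the lazy dog".split())
```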

# Dynamic Activation Pitfalls in LLaMA Models - An Empirical Study

This study investigates the efficacy of dynamic activation mechanisms in LLaMA language models. The researchers conducted extensive experiments on different dynamic activation strategies and found that dynamic-activation LLaMA models generally underperform their ReLU counterparts, particularly when high sparsity is required. The authors attribute these deficiencies to the complexity of predicting activation heads and neurons, inadequate sparsity resulting from the activation functions, and insufficient preservation of information resulting from KV cache skipping. The study provides empirical evidence and analysis on the limitations of dynamic activation in LLaMA models and suggests avenues for improving future sparsity schemes.
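
A schematic of what dynamic activation sparsity means (assuming numpy): a cheap predictor keeps only the neurons expected to matter for a given input. This is nearly lossless under ReLU's exact zeros but lossier with smoother activations, which is the core tension the study probes. The top-10%-by-magnitude heuristic below is an arbitrary stand-in for a learned predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 64, 256
W1 = rng.normal(size=(d_in, d_hidden))
W2 = rng.normal(size=(d_hidden, d_in))
x = rng.normal(size=d_in)

pre = x @ W1
# Toy "predictor": keep only the top 10% of neurons by magnitude.
active = np.abs(pre) > np.quantile(np.abs(pre), 0.9)
h = np.where(active, np.maximum(pre, 0.0), 0.0)  # compute active neurons only
y_sparse = h @ W2
y_dense = np.maximum(pre, 0.0) @ W2              # full (dense) computation

# Relative error introduced by skipping the "inactive" neurons.
print(np.linalg.norm(y_sparse - y_dense) / np.linalg.norm(y_dense))
```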

# Matching domain experts by training from scratch on domain knowledge

This study explores the performance of language models trained on domain-specific knowledge in predicting the results of neuroscience experiments. The researchers trained a relatively small language model (GPT-2) on a dataset of 1.3 billion tokens of neuroscience literature. They found that by fine-tuning the pretrained GPT-2 on neuroscience data, or by training GPT-2 from scratch with a specialized tokenizer trained on neuroscience data, the models achieved expert-level performance in predicting the outcomes of neuroscience experiments. The results suggest that even small language models can attain expert-level performance through domain-specific training approaches. The study also highlights the importance of specialized tokenization in preserving domain-specific terminologies, which improves the models' performance on specialized tasks. The findings raise questions about the nature of scientific progress and suggest that statistical pattern recognition by language models may play a significant role in scientific discovery. However, it is important to note that there is still a performance gap between the language models used in this study and larger, more advanced models. Future research should aim to narrow this gap and explore the specific elements necessary for achieving human-like performance.

# Improving Transformers with Dynamically Composable Multi-Head Attention

This paper introduces Dynamically Composable Multi-Head Attention (DCMHA), a new attention architecture for Transformers that addresses the limitations of Multi-Head Attention (MHA) by dynamically composing attention heads. DCMHA increases the expressive power of the model while remaining parameter and computation efficient. The authors propose a Compose function that transforms the attention score and weight matrices in an input-dependent way, allowing for dynamic composition of attention heads. DCMHA can be used as a drop-in replacement for MHA in any Transformer architecture. The experiments show that DCFormer, the Transformer model with DCMHA, outperforms the original Transformer on various tasks and model scales, matching the performance of models trained with substantially more compute. The code and models are available for further exploration.
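
A minimal sketch of the cross-head composition idea (assuming numpy): per-head attention score matrices are mixed across the head dimension by an H x H matrix. In DCMHA that mixing is computed from the input itself; the random near-identity matrix below is a stand-in for that input-dependent step.

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 4, 6                        # heads, sequence length

# Per-head attention scores (stand-ins for QK^T / sqrt(d)).
scores = rng.normal(size=(H, T, T))

def compose(scores, mix):
    """new_head_h = sum_h' mix[h, h'] * scores[h']."""
    return np.einsum('hk,kij->hij', mix, scores)

mix = np.eye(H) + 0.1 * rng.normal(size=(H, H))  # near-identity mixing
composed = compose(scores, mix)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

weights = softmax(composed)        # attention weights after composition
```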

# Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional neural networks

This research introduces a new type of neural network called the Hierarchical Correlation Reconstruction (HCR) network. Unlike traditional neural networks, HCR networks model joint distributions instead of just values. This allows for multidirectional propagation of both values and probability distributions.

The HCR neuron, the basic building block of the HCR network, represents the joint distribution of variables as a linear combination using an orthonormal polynomial basis. The coefficients of this linear combination correspond to the moments of the joint distribution and can be estimated and updated using various methods.
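
A minimal sketch of coefficient estimation for a two-variable HCR joint density (assuming numpy), using a degree-2 rescaled Legendre basis and rank-normalized marginals; the data-generating process below is an arbitrary toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

def to_quantiles(v):
    """Map a variable to ~uniform [0, 1] via empirical ranks, since HCR
    assumes normalized marginals."""
    return (np.argsort(np.argsort(v)) + 0.5) / len(v)

# Toy correlated pair with normalized marginals.
z = rng.normal(size=n)
x = to_quantiles(z + 0.5 * rng.normal(size=n))
y = to_quantiles(z + 0.5 * rng.normal(size=n))

# Orthonormal (rescaled Legendre) basis on [0, 1], degree <= 2.
basis = [
    lambda t: np.ones_like(t),
    lambda t: np.sqrt(3.0) * (2 * t - 1),
    lambda t: np.sqrt(5.0) * (6 * t ** 2 - 6 * t + 1),
]

# Coefficients a_ij = E[f_i(x) f_j(y)], estimated as sample means; each
# is a (mixed) moment of the joint distribution and can be updated online.
a = np.array([[np.mean(fi(x) * fj(y)) for fj in basis] for fi in basis])

def joint_density(xq, yq):
    """rho(x, y) ~= sum_ij a_ij f_i(x) f_j(y)."""
    return sum(a[i, j] * basis[i](xq) * basis[j](yq)
               for i in range(3) for j in range(3))
```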

The HCR neuron can be trained using standard backpropagation or other techniques. It can also be optimized by selecting an appropriate basis and reducing the number of considered coefficients. Additionally, the HCR network can propagate not only values but also entire probability distributions through its neurons.

The implications of this research are significant. HCR networks have the potential to improve upon existing neural network architectures by incorporating joint distribution models and allowing for multidirectional propagation. They can be applied in various fields, such as biology, finance, and information theory.

However, there are challenges and limitations to consider. Training the intermediate layers of HCR networks can be difficult, and further research is needed to optimize training methods. The selection of an appropriate basis and the reduction of coefficients also require careful consideration. Additionally, the interpretation and analysis of the coefficients in HCR networks may pose challenges.

Overall, the HCR network shows promise as a powerful and flexible neural network architecture. Further research is needed to explore its practical applications, optimize training methods, and address the challenges and limitations associated with its implementation.

# Memory Mosaics

This paper introduces a new architecture called Memory Mosaics, which consists of multiple associative memories working together to carry out a prediction task. Memory Mosaics possess compositional and in-context learning capabilities, similar to transformers but with more transparent mechanisms. The paper demonstrates these capabilities on toy examples and shows that Memory Mosaics perform as well as transformers on medium-scale language modeling tasks. The training process of Memory Mosaics leads to predictive disentanglement, where the overall prediction task is decomposed into smaller sub-problems that can be independently solved and recombined. The paper also discusses the implications of this disentanglement and the potential for using Memory Mosaics in various domains.
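
One associative memory unit can be sketched as kernel smoothing over stored key-value pairs (assuming numpy): retrieval is a similarity-weighted average, a transparent analogue of attention. This is a simplification; the paper combines many such units with learned key/value feature maps.

```python
import numpy as np

rng = np.random.default_rng(0)

class AssociativeMemory:
    """Kernel-smoothing memory: store (key, value) pairs, retrieve a
    Gaussian-kernel-weighted average of stored values."""
    def __init__(self, beta=5.0):
        self.keys, self.values, self.beta = [], [], beta

    def store(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def retrieve(self, query):
        K, V = np.stack(self.keys), np.stack(self.values)
        w = np.exp(-self.beta * np.sum((K - query) ** 2, axis=1))
        return (w[:, None] * V).sum(axis=0) / (w.sum() + 1e-9)

mem = AssociativeMemory()
for _ in range(100):                 # store context (key, value) pairs
    k = rng.normal(size=8)
    mem.store(k, np.tanh(k))         # toy target: value = tanh(key)
q = rng.normal(size=8)
pred = mem.retrieve(q)               # in-context prediction for q
```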

# Deep video representation learning - a survey

This paper provides a comprehensive survey of deep video representation learning, focusing on the extraction of spatial and temporal features from videos. The authors classify different types of features based on their spatial and temporal modeling approaches. They compare the pros and cons of these features in terms of robustness to occlusion, view, illumination, and background variations. The authors discuss how extra modules, such as part information, additional input information, and attention mechanisms, can be added to improve the robustness of spatial features. For spatially sparse features, the authors discuss the use of RNNs, CNNs, and GNN/GCN architectures, and highlight their advantages and limitations. They also present extra modules that can be added to these architectures for better robustness. The paper concludes by discussing the remaining challenges in deep video representation learning. Overall, this survey provides valuable insights into the design and evaluation of deep video features, and offers suggestions for selecting suitable features for different video processing and analysis tasks.

# Beyond Scaling Laws - Understanding Transformer Performance with Associative Memory

This paper investigates the performance of Transformer-based language models and the relationship between model size, dataset size, and memorization. The authors propose a theoretical framework based on associative memory, specifically using Hopfield networks, to model the behavior of Transformer layers. They introduce a new energy function that captures the memorization process and provides a better explanation for the attention mechanism. The authors also conduct experiments with GPT-2 and vanilla Transformers to validate their theoretical results. The paper highlights the importance of understanding the convergence dynamics of training loss during memorization and provides insights into the optimal cross-entropy loss for model training.
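
For reference, the standard modern Hopfield retrieval update, xi <- X^T softmax(beta X xi), is exactly attention over the stored patterns; the paper builds its own energy function on top of this kind of correspondence. A small numpy demo of retrieval from a noisy query:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

X = rng.normal(size=(32, 64))            # 32 stored patterns, dim 64
X /= np.linalg.norm(X, axis=1, keepdims=True)
beta = 8.0

xi = X[3] + 0.3 * rng.normal(size=64)    # noisy query near pattern 3
for _ in range(3):                       # a few retrieval iterations
    xi = X.T @ softmax(beta * (X @ xi))  # attention over stored patterns

print(np.argmax(X @ xi))                 # recovers pattern 3 (w.h.p.)
```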

Thanks for reading/listening, that's all for this week.

Please consider checking out Tunadorable's youtube channel where he provides commentary on the above papers.

https://youtube.com/@Tunadorable

Here is the most up-to-date version of the python scripts I currently use to create this newsletter:

https://github.com/evintunador/arxiv-summaries-workflow
