Tunadorable’s Substack
Weekly AI Paper Summaries
Newest AI Papers This Week - June 1, 2024


Welcome to Tunadorable's weekly AI newsletter, summarizing his favorite papers of the week, the ones he plans to read.

This article was written by gpt-3.5-turbo-16k on 2024-05-31.


# Why Algorithms Remain Unjust - Power Structures Surrounding Algorithmic Activity

This paper examines the power structure surrounding Algorithmic Activity, which refers to the research, development, training, and deployment of algorithms in society. The author argues that Algorithmic Activity perpetuates social injustices because it is dominated by economic power rather than social empowerment. Traditional approaches to addressing algorithmic injustices, such as algorithmic reformism, have failed because they do not consider the larger power dynamics at play. The author employs a framework developed by Erik Olin Wright to analyze the power configuration surrounding Algorithmic Activity and proposes transformative strategies for social empowerment. The paper concludes by calling for future work to integrate these transformations and develop mechanisms for social empowerment in Algorithmic Activity.

# Transformers represent belief state geometry in their residual stream

This study explores the computational structure learned by transformers, specifically in relation to the representation of belief states. The researchers propose that transformers trained on next-token prediction develop internal structures that reflect the meta-dynamics of belief updating over hidden states of the data-generating process. They use the theory of optimal prediction and the Mixed State Presentation (MSP) to make predictions about the geometry of internal activations in transformers.

To test their framework, the researchers conduct experiments using data generated from processes with hidden ground truth structure. They find that the predicted belief state geometry, even when highly nontrivial and fractal, is linearly represented in the residual stream of transformers. This suggests that transformers encode and utilize information beyond the local next-token predictions they are explicitly trained on.
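To make the setup concrete, here is a minimal sketch of the Bayesian belief updating that the Mixed State Presentation formalizes, for a toy two-state hidden Markov process. The transition and emission matrices below are made up for illustration; this is not the authors' code.

```python
import numpy as np

def belief_update(belief, token, T, E):
    """One step of Bayesian belief updating over hidden states.

    belief: current distribution over hidden states, shape (S,)
    token:  index of the observed emission
    T:      transition matrix, T[i, j] = P(next state j | state i)
    E:      emission matrix,   E[i, k] = P(token k | state i)
    """
    # Weight each state by how likely it was to emit the token,
    # propagate through the transition dynamics, and renormalize.
    posterior = belief * E[:, token]
    posterior = posterior @ T
    return posterior / posterior.sum()

# Toy 2-state process (hypothetical numbers, illustration only).
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],
              [0.1, 0.9]])

belief = np.array([0.5, 0.5])
for tok in [0, 0, 1]:
    belief = belief_update(belief, tok, T, E)
```

The sequence of `belief` vectors traced out by such updates is exactly the geometry the authors look for (and find, linearly embedded) in the residual stream.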

The researchers also investigate cases where the belief state geometry is represented in the final residual stream or distributed across multiple layers. They provide a framework to explain these observations, showing that the representation of belief states can depend on the specific characteristics of the data-generating process.

The results of the study provide strong evidence that transformers learn to represent the geometry of belief states in their internal activations. This suggests that transformers not only capture the hidden structure of the data-generating process but also learn how to update their beliefs about the hidden state of the world as they synchronize to it in context.

One potential critique of the study is that it focuses on a specific type of language model (transformers) and a specific type of training task (next-token prediction). It would be interesting to see if similar results hold for other types of models and tasks.

The implications of this research are significant. It provides a theoretical framework for understanding the computational structure of transformers and sheds light on how they encode and utilize information beyond local predictions. This has implications for the development of more interpretable and explainable language models, as well as for improving their performance on tasks that require reasoning and understanding of complex structures.

# Intelligence as Computation

This paper proposes a conceptualization of intelligence as computation, aiming to provide a unified view for all disciplines of intelligence research. It argues that the existing conceptualizations of intelligence, such as computational intelligence, physical intelligence, neural intelligence, embodied intelligence, morphological intelligence, and mechanical intelligence, are influenced by remnants of dualism and lack consistency. The paper suggests that intelligence should be understood as a subset of computation implementing specific computational principles. It emphasizes the need to focus on computational mechanisms rather than describing intelligent behavior and highlights the importance of interactions between agents and their environment. The proposed conceptualization has implications for embodiment and suggests a multidisciplinary research agenda for a unified science of intelligence.

# Matryoshka Multimodal Models

The paper introduces Matryoshka Multimodal Models (M3), a novel approach for representing visual content in large multimodal models. M3 learns nested sets of visual tokens that capture information at different levels of granularity, allowing for flexible control over the number of tokens used during inference. The approach is evaluated on image and video understanding benchmarks and achieves comparable performance to existing models while using significantly fewer tokens. The results show that most benchmarks only require a small number of tokens to achieve high accuracy. The study also highlights the gap between the model's performance and the oracle upper bound, suggesting room for improvement in the trade-off between performance and token length.
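To illustrate the nesting idea, here is a hedged sketch of building coarse-to-fine token sets by average pooling. The real M3 pools over 2-D spatial grids inside the model, so treat this 1-D version, and its pool factors, as illustrative assumptions only.

```python
import numpy as np

def nested_token_sets(tokens, factors=(1, 4, 16)):
    """Build Matryoshka-style nested sets of visual tokens.

    Coarser sets are average-pools of the full set, so each granularity
    summarizes the same content with fewer tokens.
    """
    n, d = tokens.shape
    sets = {}
    for f in factors:
        assert n % f == 0, "pool factor must divide the token count"
        # (n//f) groups of f consecutive tokens, averaged per group
        sets[n // f] = tokens.reshape(n // f, f, d).mean(axis=1)
    return sets
```

At inference, one simply feeds the language model whichever granularity fits the compute budget, e.g. `sets[1]` for a single summary token or `sets[16]` for full detail.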

# A Systematic Review of Federated Generative Models

This systematic review focuses on the intersection of Federated Learning (FL) and Generative Models, providing a comprehensive overview of research conducted between 2019 and 2024. The review compares nearly 100 papers, categorizing them based on FL and Generative Model methods, privacy considerations, data types, evaluation methods, and more.

The review highlights that Federated Generative Models, particularly Federated GANs, have gained significant attention from researchers and have been applied to various applications such as medical imaging, anomaly detection, data augmentation, and financial fraud detection. Many papers in this field have addressed privacy concerns and have satisfied the requirements of differential privacy, enhancing the robustness of Federated GANs.

The review also identifies emerging topics such as Diffusion-based Federated Models, which have shown faster convergence and lower communication costs than GAN-based FL models. However, scalability and cross-device FL remain open challenges that require further investigation.

Additionally, the review emphasizes the unresolved challenges in privacy and integrity considerations for tabular data-based models and non-GAN-based FL.

Overall, the review provides insights into the state-of-the-art Federated Generative Models, identifies research gaps, and offers a roadmap for future research in this evolving field.
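For readers new to the FL side, the aggregation step underlying most of the surveyed systems is federated averaging (FedAvg). The following is a minimal sketch of that one step, plain data-size-weighted averaging of client parameters, not any specific surveyed system:

```python
def fed_avg(client_weights, client_sizes):
    """One FedAvg aggregation round: the server replaces the global
    model with the data-size-weighted average of client parameters.

    client_weights: list of flat parameter lists, one per client
    client_sizes:   number of local training examples per client
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * size for w, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]
```

In a federated GAN, the same averaging is applied to generator (and sometimes discriminator) weights after each round of local training, which is why privacy mechanisms such as differential privacy attach naturally at this aggregation point.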

# Emergence of a High-Dimensional Abstraction Phase in Language Transformers

This research paper investigates the geometric properties of language models (LMs), specifically the intrinsic dimension (ID) of their representations. The authors find that LMs go through a distinct phase characterized by a peak in ID, which corresponds to the first full linguistic abstraction of the input. This phase is significantly reduced when the models are trained on random text or untrained, indicating the importance of linguistic processing. The layer at which the ID peak occurs also correlates with the quality of the LM. Additionally, the highest-dimensional representations of different networks predict each other, suggesting that they encode similar information. The ID peak marks a boundary between representations that perform poorly and those that perform fairly well on syntactic, semantic, and downstream NLP tasks. The results suggest that a high-dimensional abstraction phase underlies core linguistic processing in many LM architectures.
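As an illustration of how intrinsic dimension can be measured, here is a sketch of a TwoNN-style estimator, which infers ID from the ratio of each point's two nearest-neighbor distances. The paper's exact estimator and preprocessing may differ; this is only the general idea.

```python
import numpy as np

def twonn_id(X):
    """TwoNN-style intrinsic-dimension estimate.

    Uses only each point's two nearest neighbors: for data of intrinsic
    dimension d, the ratio mu = r2/r1 follows a Pareto law with exponent
    d, giving the MLE  d_hat = N / sum_i log(r2_i / r1_i).
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)      # ignore self-distances
    D.sort(axis=1)
    r1, r2 = D[:, 0], D[:, 1]        # two nearest-neighbor distances
    return len(X) / np.log(r2 / r1).sum()
```

Applied layer by layer to a model's activations, such an estimator produces the ID-vs-depth profile in which the paper's abstraction peak appears.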

# An Introduction to Vision-Language Modeling

This paper provides an introduction to Vision-Language Models (VLMs) and discusses different training paradigms and evaluation methods for these models. VLMs aim to connect vision and language, enabling applications such as image captioning and visual question answering. The paper categorizes VLMs into four training paradigms: contrastive, masking, generative, and pretrained backbones. It discusses the strengths and weaknesses of each paradigm and provides examples of models within each category. The paper also provides guidelines for training VLMs, including data selection, software tools, and hyperparameter choices. It highlights the importance of grounding and alignment in VLM training. The paper then discusses evaluation methods for VLMs, including benchmarking visio-linguistic abilities, assessing biases, and detecting hallucinations and memorization. It emphasizes the need for robust evaluation benchmarks and highlights current challenges in evaluating VLMs. Finally, the paper briefly touches upon extending VLMs to videos and the challenges associated with video data.
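As a concrete example of the contrastive paradigm, here is a hedged sketch of a CLIP-style symmetric InfoNCE loss over a batch of paired image/text embeddings. The `temperature` value is illustrative, not taken from any particular model.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: each image should score highest against its
    own caption (and vice versa) among all pairings in the batch."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (batch, batch) similarity grid
    n = len(logits)

    def xent(l):
        # cross-entropy with the matched pair (the diagonal) as target
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls matched image/text embeddings together and pushes mismatched pairs apart, which is the alignment mechanism behind the contrastive family the paper surveys.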

# Models That Prove Their Own Correctness

This paper introduces the concept of Self-Proving models, which are machine learning models that can prove the correctness of their outputs to a verification algorithm. The authors propose two methods for training Self-Proving models: Transcript Learning (TL) and Reinforcement Learning from Verifier Feedback (RLVF). TL relies on access to transcripts of accepting interactions between the model and the verifier, while RLVF emulates the interaction with the verifier. The authors also present variants of these algorithms that use annotations to improve learning. The efficacy of TL and Annotated Transcript Learning (ATL) is demonstrated through experiments on Self-Proving transformers that compute the Greatest Common Divisor (GCD) of two integers. The results show that TL and ATL significantly improve the verifiability of the models compared to a baseline model.
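The flavor of verification involved can be illustrated with a toy, non-interactive certificate for the GCD task. The paper's actual verifier is an interactive proof system, so treat this purely as an analogy for what "proving correctness to a verifier" means.

```python
def verify_gcd(a, b, g, x, y):
    """Accept a claimed gcd g of (a, b) given a Bezout certificate (x, y).

    g | a and g | b    -> g is a common divisor;
    a*x + b*y == g     -> every common divisor of a and b divides g;
    together these force g == gcd(a, b), so the check is sound.
    """
    return g > 0 and a % g == 0 and b % g == 0 and a * x + b * y == g
```

A Self-Proving model for this task would emit not just `g` but also the certificate `(x, y)` that makes the verifier accept; training methods like TL and RLVF teach the model to produce such accepting transcripts.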

# Yuan 2.0-M32 - Mixture of Experts with Attention Router

Yuan 2.0-M32 is a language model based on a Mixture of Experts (MoE) architecture that utilizes 32 experts, of which 2 are active for each token. It introduces a new router network called the Attention Router, which considers the correlation between experts, resulting in a 3.8% improvement in accuracy compared to the classical router network. The model is trained from scratch on 2,000B tokens and achieves competitive performance with only 3.7B active parameters and 7.4 GFLOPs of forward computation per token, significantly lower than other large-scale models. It outperforms Llama3-70B on the MATH and ARC-Challenge benchmarks with accuracy scores of 55.89 and 95.8 respectively. The model's training computation consumption is only 9.25% that of a dense model at the same parameter scale. The source code and models are released on GitHub for further research and development.
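The following is a loose sketch of routing with inter-expert correlation, the idea behind the Attention Router, contrasted with a classical router that scores each expert independently. The paper's exact formulation (its Q/K/V construction and dimensions) differs, so treat every detail here as an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_router(token, expert_emb, top_k=2):
    """Score experts for one token, mixing the per-expert logits through
    an attention map over expert embeddings so that correlated experts
    are scored jointly rather than independently."""
    d = expert_emb.shape[1]
    raw = expert_emb @ token                                # independent logits
    corr = softmax(expert_emb @ expert_emb.T / np.sqrt(d))  # expert-expert attention
    mixed = corr @ raw                                      # correlation-aware logits
    idx = np.argsort(mixed)[-top_k:]                        # chosen experts
    return idx, softmax(mixed[idx])                         # experts + gate weights
```

A classical router would stop at `raw` and take its top-k directly; the extra `corr @ raw` mixing is the hypothetical analogue of letting expert selection account for which experts tend to fire together.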

# Strategies to Counter Artificial Intelligence in Law Enforcement - Cross-Country Comparison of Citizens in Greece, Italy and Spain

This study explores citizens' strategies to counter the use of Artificial Intelligence (AI) by law enforcement agencies (LEAs) in Greece, Italy, and Spain. The researchers conducted an online survey and found that citizens were moderately likely to use counter-strategies, with Greek participants being the most likely. The most popular strategies were technical (privacy settings, encryption) and social (asking others not to post personal information). Overall, attitudes towards AI usage by LEAs, perceived vulnerability to AI biases, and fear of crime influenced the likelihood of counter-strategies. However, AI knowledge did not have a significant impact. These findings highlight the conscious and strategic choices made by citizens in response to LEAs' AI capabilities. Further research is needed to investigate counter-strategies in other countries and explore individual motivations for their choices.

# Matryoshka Query Transformer for Large Vision-Language Models

This paper introduces the Matryoshka Query Transformer (MQT), a model architecture that allows for flexible selection of the number of visual tokens in large vision-language models (LVLMs). The model utilizes a query transformer to encode images into visual tokens, and during training, only a subset of the tokens is used. The MQT-LLAVA model, which combines MQT with LLaVA-1.5, achieves similar or better performance compared to LLaVA-1.5 while using a significantly reduced number of tokens. The performance and computational trade-offs of different numbers of visual tokens are explored across 11 benchmarks, with varying impacts on different tasks. The results demonstrate the potential for achieving high accuracy with reduced computational costs by adjusting the number of visual tokens.
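The elastic-token idea can be sketched as a single cross-attention step in which only the first m of M learnable queries are used, so the model emits exactly m visual tokens. Function names and shapes here are hypothetical, not LLaVA's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encode_with_m_queries(image_feats, queries, m):
    """Cross-attend the first m of M learnable queries onto the image
    features, producing exactly m visual tokens for the language model.

    image_feats: (patches, d) frozen vision-encoder outputs
    queries:     (M, d) learnable query vectors, ordered so that a
                 prefix of any length is a usable query set
    """
    q = queries[:m]                                  # drop the tail queries
    d = q.shape[1]
    attn = softmax(q @ image_feats.T / np.sqrt(d))   # (m, patches)
    return attn @ image_feats                        # (m, d) visual tokens
```

Because training randomly varies m, the query prefix of every length learns to summarize the image on its own, which is what lets inference-time token count be chosen per task.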

# Accelerating Transformers with Spectrum-Preserving Token Merging

This paper introduces a novel token merging method called PITOME that accelerates Transformer models, such as Vision Transformers (ViTs), by reducing computational and memory requirements while maintaining accuracy. PITOME prioritizes the preservation of informative tokens by using an energy score metric, which identifies large clusters of similar tokens as high-energy and suitable for merging, while smaller isolated clusters are considered low-energy and preserved. Experimental results demonstrate that PITOME achieves superior performance compared to existing token merging methods across various tasks, including image-text retrieval, visual question answering, image classification, and text classification. Theoretical analysis shows that PITOME preserves the spectral properties of the original token space under certain assumptions.
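To give intuition for token merging in general, here is a simplified greedy merge by cosine similarity. PITOME's energy-score criterion and merge schedule are more sophisticated than this, so read it as a baseline-style sketch of the operation being accelerated, not the paper's method.

```python
import numpy as np

def merge_most_similar(tokens, n_merge):
    """Repeatedly average the most cosine-similar pair of tokens,
    shrinking the sequence by one token per merge."""
    toks = list(tokens)
    for _ in range(n_merge):
        X = np.stack(toks)
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        sim = Xn @ Xn.T
        np.fill_diagonal(sim, -np.inf)           # never merge a token with itself
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (toks[i] + toks[j]) / 2         # average the closest pair
        toks = [t for k, t in enumerate(toks) if k not in (i, j)]
        toks.append(merged)
    return np.stack(toks)
```

PITOME's contribution is in choosing *which* tokens to merge: large clusters of redundant tokens (high energy) get merged while small isolated clusters (low energy) are preserved, instead of this purely greedy pairwise rule.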

Thanks for reading/listening, that's all for this week.

Please consider checking out Tunadorable's YouTube channel, where he provides commentary on the above papers.

https://youtube.com/@Tunadorable

Here is the most up-to-date version of the python scripts I currently use to create this newsletter:

https://github.com/evintunador/arxiv-summaries-workflow
