Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization This work studies whether transformers can learn to reason implicitly over parametric knowledge, finding that this skill is acquired only through grokking, i.e., training far beyond the point of overfitting. The generalization capabilities vary: transformers generalize for comparison but not for composition when tested on out-of-distribution examples.
Phased Consistency Model The Consistency Model (CM) has advanced diffusion model generation, yet its adaptation for high-resolution, text-conditioned image generation in latent space (LCM) has been suboptimal. This paper identifies three critical flaws in LCM and introduces the Phased Consistency Model (PCM), which expands the design space and resolves these issues. Evaluations show that PCM significantly outperforms LCM in settings ranging from 1 to 16 generation steps. Notably, PCM is designed for multi-step refinement but also excels in 1-step generation, matching or surpassing the performance of state-of-the-art methods tailored for single-step processes. Moreover, PCM's approach proves versatile, extending to video generation and achieving leading results in few-step text-to-video generation.
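For context, here is a simplified sketch of generic multi-step consistency sampling, the 1-to-N step setting that PCM operates in. This is not PCM's phased formulation, and `consistency_fn` is a hypothetical model interface that maps a noisy sample and its noise level to a clean estimate.

```python
# Simplified sketch of generic multi-step consistency sampling: each step maps
# a noisy sample straight to a clean estimate, which is then re-noised to the
# next (lower) noise level. PCM's phased variant, which splits the trajectory
# into sub-trajectories, is not modeled here.
import torch

def consistency_sample(consistency_fn, sigmas, shape):
    """sigmas: decreasing noise levels, e.g. [sigma_max, ..., sigma_min];
    consistency_fn(x, sigma) -> clean estimate x0 (hypothetical interface)."""
    x = sigmas[0] * torch.randn(shape)       # start from pure noise at sigma_max
    x0 = consistency_fn(x, sigmas[0])        # 1-step generation is just this line
    for sigma in sigmas[1:]:
        x = x0 + sigma * torch.randn(shape)  # re-noise the estimate to the next level
        x0 = consistency_fn(x, sigma)        # refine with another consistency step
    return x0
```

With one entry in `sigmas` this collapses to single-step generation, which is why a model trained for multi-step refinement can also be evaluated in the 1-step regime.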
An Introduction to Vision-Language Modeling The recent surge in LLMs has spurred efforts to adapt these models for visual applications, leading to the development of vision-language models (VLMs). VLMs, capable of tasks like navigating unfamiliar environments or generating images from text descriptions, are poised to significantly change our interaction with technology. However, the integration of discrete language with the high-dimensional, continuous nature of vision presents unique challenges. This paper serves as an introduction to VLMs, covering their fundamentals, operation, and training methodologies. It also explores evaluation techniques for VLMs and extends the discussion to video applications, aiming to clarify the complexities of bridging vision with language for newcomers to the field.
GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning Knowledge Graphs (KGs), which represent factual knowledge as a graph of triplets (head, relation, tail), facilitate Question Answering over KGs (KGQA) by grounding reasoning in provided information. While LLMs excel in natural language understanding and are thus dominant in QA tasks, Graph Neural Networks (GNNs) are effective in handling the complex graph structure of KGs. This paper introduces GNN-RAG, a novel method that merges the language understanding capabilities of LLMs with the reasoning power of GNNs in a retrieval-augmented generation (RAG) approach. The process involves using a GNN to reason over a dense KG subgraph to retrieve answer candidates, then extracting and verbalizing the shortest paths between question entities and these candidates for LLM processing. Additionally, a retrieval augmentation technique is developed to enhance KGQA performance. GNN-RAG has been shown to surpass or match GPT-4 on widely recognized KGQA benchmarks such as WebQSP and CWQ, particularly excelling in multi-hop and multi-entity question scenarios, where it improves answer F1 scores by 8.9–15.5 percentage points.
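A minimal sketch of the retrieve-then-verbalize flow described above, assuming hypothetical `gnn_score_nodes` and `llm` callables and a `networkx` graph as the KG subgraph; the paper's actual interfaces and prompt format may differ.

```python
# Hypothetical sketch of the GNN-RAG pipeline: a GNN scores candidate answer
# nodes on a KG subgraph, shortest paths from question entities to the top
# candidates are verbalized, and the resulting text is passed to an LLM.
import networkx as nx

def gnn_rag_answer(question, question_entities, kg_subgraph,
                   gnn_score_nodes, llm, top_k=5):
    # 1) GNN reasoning: score every subgraph node as a potential answer.
    scores = gnn_score_nodes(question, kg_subgraph)            # {node: probability}
    candidates = sorted(scores, key=scores.get, reverse=True)[:top_k]

    # 2) Path extraction: shortest paths from question entities to candidates.
    paths = []
    for src in question_entities:
        for dst in candidates:
            try:
                paths.append(nx.shortest_path(kg_subgraph, src, dst))
            except nx.NetworkXNoPath:
                continue

    # 3) Verbalization: turn each (head, relation, tail) hop into text.
    verbalized = []
    for path in paths:
        hops = []
        for head, tail in zip(path, path[1:]):
            relation = kg_subgraph.edges[head, tail].get("relation", "related_to")
            hops.append(f"{head} -> {relation} -> {tail}")
        verbalized.append(" ; ".join(hops))

    # 4) RAG: the LLM answers conditioned on the retrieved, verbalized paths.
    prompt = f"Question: {question}\nKnowledge paths:\n" + "\n".join(verbalized)
    return llm(prompt)
```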
Transformers Can Do Arithmetic with the Right Embeddings The limited capability of transformers in arithmetic tasks is primarily due to their inability to precisely track digit positions within long numbers. This issue is addressed by adding an embedding that encodes each digit's position within its number, markedly improving the transformer's performance on arithmetic operations. Further architectural enhancements such as input injection and recurrent layers amplify this effect. With improved position tracking, the study explores whether transformers can tackle arithmetic problems that surpass the complexity and size encountered during training. Results show that after training on only 20-digit numbers with a single GPU for one day, the enhanced model reaches up to 99% accuracy on 100-digit addition problems. These gains in numeracy also carry over to other complex reasoning tasks such as sorting and multiplication.
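An illustrative sketch of the core idea of giving each digit an embedding of its position inside its own number, so digits of the same significance can be aligned across operands. The class and helper names here are assumptions for illustration, not the paper's exact embedding scheme.

```python
# Illustrative sketch: add a per-digit positional embedding that encodes each
# digit's index within its own number. Offsets and other training details used
# in the paper are deliberately omitted.
import torch
import torch.nn as nn

DIGITS = set("0123456789")

def digit_positions(tokens):
    """For each token, return its 1-based position inside a run of digits (0 for non-digits)."""
    positions, run = [], 0
    for tok in tokens:
        run = run + 1 if tok in DIGITS else 0
        positions.append(run)
    return positions

class DigitPositionEmbedding(nn.Module):
    def __init__(self, d_model, max_digits=128):
        super().__init__()
        self.embed = nn.Embedding(max_digits + 1, d_model)  # index 0 = "not a digit"

    def forward(self, token_embeddings, tokens):
        # token_embeddings: (seq_len, d_model), tokens: list of single-character strings
        pos = torch.tensor(digit_positions(tokens), device=token_embeddings.device)
        pos = pos.clamp(max=self.embed.num_embeddings - 1)
        return token_embeddings + self.embed(pos)

# Example: the tokens of "314+27=" get digit positions 1, 2, 3, 0, 1, 2, 0.
```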
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series LLMs have achieved notable success across various tasks, yet leading models like GPT, Gemini, and Claude remain proprietary, often without detailed public insights into their training. In contrast, open-source initiatives have released models such as LLaMA-3, although these typically lack comprehensive disclosure, such as intermediate checkpoints and training codes. To enhance transparency in the field, the research community has introduced fully open LLMs like Pythia, Amber, and OLMo, which provide extensive details including pre-training corpora and training methodologies. Despite these efforts, these fully open models still lag behind the performance of top proprietary LLMs in reasoning, knowledge, and coding tasks. Addressing this gap, MAP-Neo, a transparent, bilingual 7B parameter LLM trained on 4.5T high-quality tokens, is introduced as the first fully open-sourced bilingual LLM matching the performance of leading LLMs.
Attention as an RNN The introduction of Transformers has been a significant advancement in sequence modeling, capitalizing on GPU parallelism to enhance performance. Yet their high computational cost at inference limits their use in resource-constrained environments, such as mobile and embedded devices. This paper presents a novel perspective in which the attention mechanism is interpreted as a type of Recurrent Neural Network (RNN) that can efficiently produce its many-to-one output. It further argues that Transformers can be viewed as RNN variants, but ones that cannot efficiently incorporate new tokens, a capability crucial for sequence modeling. To address this, a new method leveraging the parallel prefix scan algorithm is introduced to compute attention's many-to-many RNN output efficiently.
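A minimal sketch of the recurrence view of attention for a single query: the output is accumulated token by token with a running (numerator, denominator, max-score) state instead of materializing all scores at once. This shows only the sequential many-to-one form; the paper's contribution is computing the many-to-many outputs efficiently with a parallel prefix scan, which is not reproduced here.

```python
# Attention viewed as an RNN: softmax attention for one query is computed by a
# recurrence over (numerator, denominator, running max score), one token at a time.
import numpy as np

def attention_as_rnn(query, keys, values):
    num = np.zeros_like(values[0], dtype=float)  # running weighted sum of values
    den = 0.0                                    # running sum of exp(scores)
    m = -np.inf                                  # running max score, for numerical stability
    for k, v in zip(keys, values):
        s = float(query @ k)
        m_new = max(m, s)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        num = num * scale + v * np.exp(s - m_new)
        den = den * scale + np.exp(s - m_new)
        m = m_new
    return num / den                             # equals softmax(q @ K.T) @ V

# Sanity check against standard (parallel) attention:
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
w = np.exp(K @ q - (K @ q).max()); w /= w.sum()
assert np.allclose(attention_as_rnn(q, K, V), w @ V)
```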
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models The rapid advancement of large language and vision models (LLVMs) has been driven largely by visual instruction tuning, particularly through the use of open-source datasets and enhanced vision encoders to compete with sophisticated proprietary LLVMs. These advances respond to the complex information demands of tasks requiring deep image understanding, common-sense knowledge, and procedural reasoning for complex problem-solving. This paper introduces Meteor, a new efficient LLVM that leverages multifaceted rationales to boost its understanding and response capabilities. Meteor employs the Mamba architecture, which processes sequential data with linear time complexity, and introduces the concept of traversal of rationale to embed lengthy rationales efficiently.