Episodes

  • 🤗 Upvotes: 46 | cs.CR, cs.AI, cs.LG

    Authors:
    Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi

    Title:
    GuardReasoner: Towards Reasoning-based LLM Safeguards

    Arxiv:
    http://arxiv.org/abs/2501.18492v1

    Abstract:
    As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models at different scales (1B, 3B, 8B) of GuardReasoner: https://github.com/yueliu1999/GuardReasoner/.
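
    Code sketch:
    A minimal, illustrative sketch of what a reasoning-SFT training example for a guard model could look like: the target contains step-by-step reasoning followed by labels for the three guardrail tasks. The field names and label format are assumptions, not the released GuardReasonerTrain schema.

```python
# Hypothetical reasoning-SFT example builder for a guard model; the schema and
# label names are assumptions for illustration only.
def build_reasoning_sft_example(user_prompt: str, model_response: str,
                                reasoning_steps: list, prompt_label: str,
                                response_label: str, refusal_label: str) -> dict:
    target = (
        "Reasoning:\n" + "\n".join(f"- {s}" for s in reasoning_steps) + "\n"
        f"Prompt harmfulness: {prompt_label}\n"
        f"Response harmfulness: {response_label}\n"
        f"Refusal: {refusal_label}"
    )
    return {
        "input": f"User: {user_prompt}\nAssistant: {model_response}",
        "target": target,  # SFT loss would be computed on these target tokens
    }
```

    Hard-sample DPO would then, roughly, pair a correct target (chosen) against an incorrect one (rejected) on inputs the SFT model still gets wrong; the exact pairing strategy is described in the paper.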

  • 🤗 Upvotes: 22 | cs.CL

    Authors:
    Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu

    Title:
    Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

    Arxiv:
    http://arxiv.org/abs/2501.18585v1

    Abstract:
    Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with a thought-switching penalty (TIP) that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.
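
    Code sketch:
    One way to realize a thought-switching penalty at decoding time is as a logits processor that down-weights tokens which typically open a new line of thought while the current thought is still young. The marker token ids, penalty strength, and window size below are illustrative assumptions, not the paper's settings.

```python
import torch

# Hypothetical ids for markers like "Alternatively" or "Wait"; real ids depend
# on the tokenizer. Penalty strength and window are assumptions.
SWITCH_MARKER_IDS = {1534, 7093}
PENALTY = 3.0          # logit penalty applied to switch markers
PENALTY_WINDOW = 256   # minimum tokens in the current thought before switching is free

def apply_thought_switch_penalty(logits: torch.Tensor,
                                 generated_ids: list,
                                 thought_start: int) -> torch.Tensor:
    """Down-weight thought-switch markers while the current thought is still short."""
    tokens_in_current_thought = len(generated_ids) - thought_start
    if tokens_in_current_thought < PENALTY_WINDOW:
        for tid in SWITCH_MARKER_IDS:
            logits[tid] -= PENALTY
    return logits
```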

  • 🤗 Upvotes: 15 | cs.CL

    Authors:
    Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc'Aurelio Ranzato, Paul Barham

    Title:
    Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

    Arxiv:
    http://arxiv.org/abs/2501.18512v1

    Abstract:
    Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at every single gradient step, all devices need to be co-located using low-latency, high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint: accelerators can be grouped into "workers", where synchronizations between workers occur only infrequently. This in turn means that workers can afford to be connected by lower-bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-parameter-scale models and reach similar quality as before, while reducing the required bandwidth by two orders of magnitude.
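
    Code sketch:
    A single-process toy illustration of the streaming idea: instead of averaging every parameter across workers at each outer step, workers average one parameter shard at a time in a round-robin schedule, and the exchanged values are quantized. The shard layout, schedule, and int8 round-trip are assumptions, and the communication/compute overlap is omitted.

```python
import torch

def quantize_dequantize_int8(t: torch.Tensor) -> torch.Tensor:
    """Simulate lossy low-bit communication by round-tripping through int8."""
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    return (t / scale).round().clamp(-127, 127) * scale

def streaming_sync(workers, shard_names, outer_step: int) -> None:
    """Average only one parameter shard across workers at this outer step."""
    shard = shard_names[outer_step % len(shard_names)]
    for name in shard:
        mean = torch.stack(
            [quantize_dequantize_int8(w[name]) for w in workers]
        ).mean(dim=0)
        for w in workers:
            w[name].copy_(mean)  # in a real setup this would be an all-reduce

# Toy example: two workers, parameters split into two shards.
workers = [{"w1": torch.randn(4), "w2": torch.randn(4)} for _ in range(2)]
shards = [["w1"], ["w2"]]
for step in range(4):
    # ... each worker would take many local optimizer steps here ...
    streaming_sync(workers, shards, step)
```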

  • 🤗 Upvotes: 15 | cs.AI, cs.CL, cs.CV, cs.LG

    Authors:
    Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou

    Title:
    MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

    Arxiv:
    http://arxiv.org/abs/2501.18362v1

    Abstract:
    We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.

  • 🤗 Upvotes: 10 | cs.AI, q-bio.NC

    Authors:
    Lan Pan, Hanbo Xie, Robert C. Wilson

    Title:
    Large Language Models Think Too Fast To Explore Effectively

    Arxiv:
    http://arxiv.org/abs/2501.18009v1

    Abstract:
    Large Language Models have exhibited many emergent intellectual capacities. While numerous benchmarks assess their intelligence, limited attention has been given to their ability to explore, an essential capacity for discovering new information and adapting to novel environments in both natural and artificial systems. The extent to which LLMs can effectively explore, particularly in open-ended tasks, remains unclear. This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones. Results show most LLMs underperform compared to humans, except for the o1 model, with traditional LLMs relying primarily on uncertainty-driven strategies, unlike humans, who balance uncertainty and empowerment. Representational analysis of the models with Sparse Autoencoders revealed that uncertainty and choices are represented in earlier transformer blocks, while empowerment values are processed later, causing LLMs to think too fast and make premature decisions, hindering effective exploration. These findings shed light on the limitations of LLM exploration and suggest directions for improving their adaptability.

  • 🤗 Upvotes: 10 | cs.LG, cs.CL

    Authors:
    Benjamin Feuer, Chinmay Hegde

    Title:
    WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

    Arxiv:
    http://arxiv.org/abs/2501.18511v1

    Abstract:
    Large language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples, and code are available at https://github.com/penfever/wildchat-50m.

  • 🤗 Upvotes: 10 | cs.CV, cs.AI, cs.CL, cs.LG, cs.RO

    Authors:
    Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, Yue Wang

    Title:
    PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

    Arxiv:
    http://arxiv.org/abs/2501.16411v2

    Abstract:
    Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs, reveal that while these models excel in common-sense reasoning, they struggle with understanding the physical world -- likely due to the absence of physical knowledge in their training data and the lack of embedded physical priors. To tackle the shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs' physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs' physical world understanding capabilities can help embodied agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding.

  • 🤗 Upvotes: 6 | cs.SE, cs.AI

    Authors:
    Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura

    Title:
    o3-mini vs DeepSeek-R1: Which One is Safer?

    Arxiv:
    http://arxiv.org/abs/2501.18438v1

    Abstract:
    The emergence of DeepSeek-R1 constitutes a turning point for the AI industry in general and for LLMs in particular. Its capabilities have demonstrated outstanding performance in several tasks, including creative thinking, code generation, maths and automated program repair, at apparently lower execution cost. However, LLMs must adhere to an important qualitative property, i.e., their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI's o3-mini model, which is expected to set high standards in terms of performance, safety and cost. In this paper we conduct a systematic assessment of the safety level of both DeepSeek-R1 (70b version) and OpenAI's o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, named ASTRAL. By leveraging this tool, we automatically and systematically generate and execute a total of 1,260 unsafe test inputs on both models. After conducting a semi-automated assessment of the outcomes provided by both LLMs, the results indicate that DeepSeek-R1 is highly unsafe compared to OpenAI's o3-mini. Based on our evaluation, DeepSeek-R1 responded unsafely to 11.98% of the executed prompts, whereas o3-mini did so for only 1.19%.

  • 🤗 Upvotes: 1 | cs.AI, cs.CL, cs.HC

    Authors:
    Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P. Bigham, Graham Neubig

    Title:
    CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation

    Arxiv:
    http://arxiv.org/abs/2501.16609v1

    Abstract:
    While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and in modeling user preferences. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, with evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research on how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html
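
    Code sketch:
    A hypothetical version of the collaboration loop described above: the agent proposes each next web action, and the user may accept it, override it with their own action, or stop. The interface below is a stand-in for illustration, not CowPilot's actual API.

```python
def collaborative_episode(agent_propose, execute, ask_user, task_done,
                          max_steps: int = 30) -> list:
    """Run one human-agent collaborative navigation episode (illustrative)."""
    trace = []
    for _ in range(max_steps):
        proposal = agent_propose(trace)        # agent suggests the next web action
        decision = ask_user(proposal)          # {"kind": "accept" | "override" | "stop", ...}
        if decision["kind"] == "stop":
            break
        action = proposal if decision["kind"] == "accept" else decision["action"]
        observation = execute(action)          # click / type / navigate on the page
        trace.append({
            "action": action,
            "by": "agent" if decision["kind"] == "accept" else "human",
            "observation": observation,
        })
        if task_done(trace):
            break
    return trace
```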

  • 🤗 Upvotes: 28 | cs.CL

    Authors:
    Yubo Wang, Xiang Yue, Wenhu Chen

    Title:
    Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate

    Arxiv:
    http://arxiv.org/abs/2501.17703v2

    Abstract:
    Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding, traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of ([query; noisy response], critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks with different base models like Qwen2.5, Qwen2.5-Math and DeepSeek-Math. We further expand to MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our model Qwen2.5-Math-CFT requires only 1 hour of training on 8xH100 GPUs over the 50K examples. It can match or outperform strong competitors like Qwen2.5-Math-Instruct on most benchmarks, which use over 2M samples. Moreover, it can match the performance of SimpleRL, a DeepSeek-R1 replication trained with 140x more compute. Ablation studies show that CFT is robust to the source of the noisy responses and to the teacher critique model. Through these findings, we argue that CFT offers a more effective alternative to advance the reasoning of language models.
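
    Code sketch:
    A minimal illustration of how a CFT training pair could be assembled: the input shows the query plus a (possibly wrong) candidate solution, and the training target is the critique rather than a reference answer. The prompt template is an assumption, not the paper's.

```python
def build_cft_example(query: str, noisy_response: str, critique: str) -> dict:
    """Assemble one Critique Fine-Tuning pair (illustrative template)."""
    prompt = (
        "Question:\n" + query + "\n\n"
        "Candidate solution:\n" + noisy_response + "\n\n"
        "Critique the candidate solution step by step, pointing out any errors."
    )
    # Standard SFT-style pair: the loss is computed only on the critique tokens.
    return {"prompt": prompt, "completion": critique}

example = build_cft_example(
    query="What is 17 * 24?",
    noisy_response="17 * 24 = 398",
    critique="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408, so 398 is incorrect.",
)
```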

  • 🤗 Upvotes: 24 | cs.CL, cs.AI

    Authors:
    Andrei Alexandru, Antonia Calvi, Henry Broomfield, Jackson Golden, Kyle Dai, Mathias Leys, Maurice Burger, Max Bartolo, Roman Engeler, Sashank Pisupati, Toby Drane, Young Sun Park

    Title:
    Atla Selene Mini: A General Purpose Evaluation Model

    Arxiv:
    http://arxiv.org/abs/2501.17195v1

    Abstract:
    We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges. To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios. Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on HuggingFace (https://hf.co/AtlaAI/Selene-1-Mini-Llama-3.1-8B) and Ollama to encourage widespread community adoption.
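
    Code sketch:
    A rough sketch of a combined DPO + SFT objective of the kind mentioned in the abstract. The mixing weight and beta are illustrative assumptions; the inputs are sequence log-probabilities of chosen and rejected responses under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO term on preference pairs (beta is an assumed value)."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def combined_loss(sft_nll, logp_chosen, logp_rejected,
                  ref_chosen, ref_rejected, dpo_weight=1.0):
    """Add an SFT negative log-likelihood term to the DPO term (assumed weighting)."""
    return sft_nll + dpo_weight * dpo_loss(
        logp_chosen, logp_rejected, ref_chosen, ref_rejected
    )
```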

  • 🤗 Upvotes: 14 | cs.AI, cs.CY, cs.LG

    Authors:
    Clément Desroches, Martin Chauvin, Louis Ladan, Caroline Vateau, Simon Gosset, Philippe Cordier

    Title:
    Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts

    Arxiv:
    http://arxiv.org/abs/2501.14334v2

    Abstract:
    The rapid growth of artificial intelligence (AI), particularly Large Language Models (LLMs), has raised concerns regarding its global environmental impact, which extends beyond greenhouse gas emissions to include hardware fabrication and end-of-life processes. The opacity of major providers hinders companies' ability to evaluate their AI-related environmental impacts and achieve net-zero targets. In this paper, we propose a methodology to estimate the environmental impact of a company's AI portfolio, providing actionable insights without necessitating extensive AI and Life-Cycle Assessment (LCA) expertise. Results confirm that large generative AI models consume up to 4600x more energy than traditional models. Our modelling approach, which accounts for increased AI usage, hardware computing efficiency, and changes in electricity mix in line with IPCC scenarios, forecasts AI electricity use up to 2030. Under a high adoption scenario, driven by widespread adoption of generative AI and agents together with increasingly complex models and frameworks, AI electricity use is projected to rise by a factor of 24.4. Mitigating the environmental impact of Generative AI by 2030 requires coordinated efforts across the AI value chain. Isolated measures in hardware efficiency, model efficiency, or grid improvements alone are insufficient. We advocate for standardized environmental assessment frameworks, greater transparency from all actors of the value chain, and the introduction of a "Return on Environment" metric to align AI development with net-zero goals.

  • 🤗 Upvotes: 8 | cs.SE, cs.AI

    Authors:
    Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura

    Title:
    Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation

    Arxiv:
    http://arxiv.org/abs/2501.17749v1

    Abstract:
    Large Language Models (LLMs) have become an integral part of our daily lives. However, they pose certain risks, including those that can harm individuals' privacy, perpetuate biases and spread misinformation. These risks highlight the need for robust safety mechanisms, ethical guidelines, and thorough testing to ensure their responsible deployment. Safety of LLMs is a key property that needs to be thoroughly tested before a model is deployed and made accessible to general users. This paper reports the external safety testing experience conducted by researchers from Mondragon University and University of Seville on OpenAI's new o3-mini LLM as part of OpenAI's early access for safety testing program. In particular, we apply our tool, ASTRAL, to automatically and systematically generate up-to-date unsafe test inputs (i.e., prompts) that help us test and assess different safety categories of LLMs. We automatically generate and execute a total of 10,080 unsafe test inputs on an early o3-mini beta version. After manually verifying the test cases classified as unsafe by ASTRAL, we identify a total of 87 actual instances of unsafe LLM behavior. We highlight key insights and findings uncovered during the pre-deployment external testing phase of OpenAI's latest LLM.

  • 🤗 Upvotes: 8 | cs.CV

    Authors:
    Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Chuang Zhang, Jiaming Liu

    Title:
    Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks

    Arxiv:
    http://arxiv.org/abs/2501.15891v1

    Abstract:
    Image-based virtual try-on (VTON) aims to generate a virtual try-on result by transferring an input garment onto a target person's image. However, the scarcity of paired garment-model data makes it challenging for existing methods to achieve high generalization and quality in VTON. Also, it limits the ability to generate mask-free try-ons. To tackle the data scarcity problem, approaches such as Stable Garment and MMTryon use a synthetic data strategy, effectively increasing the amount of paired data on the model side. However, existing methods are typically limited to performing specific try-on tasks and lack user-friendliness. To enhance the generalization and controllability of VTON generation, we propose Any2AnyTryon, which can generate try-on results based on different textual instructions and model garment images to meet various needs, eliminating the reliance on masks, poses, or other conditions. Specifically, we first construct the virtual try-on dataset LAION-Garment, the largest known open-source garment try-on dataset. Then, we introduce adaptive position embedding, which enables the model to generate satisfactory outfitted model images or garment images based on input images of different sizes and categories, significantly enhancing the generalization and controllability of VTON generation. In our experiments, we demonstrate the effectiveness of our Any2AnyTryon and compare it with existing methods. The results show that Any2AnyTryon enables flexible, controllable, and high-quality image-based virtual try-on generation. https://logn-2024.github.io/Any2anyTryonProjectPage/

  • 🤗 Upvotes: 6 | cs.CR, cs.AI, cs.CL, cs.LG

    Authors:
    Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu

    Title:
    Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

    Arxiv:
    http://arxiv.org/abs/2501.17433v1

    Abstract:
    Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks -- models lose their safety alignment ability after fine-tuning on a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that purely relying on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail, with up to a 100% leakage ratio, and can simultaneously achieve superior attack performance. Finally, the key message we want to convey through this paper is that it is reckless to treat guardrail moderation as clutching at straws against harmful fine-tuning attacks, as it cannot solve the inherent safety issue of the pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus.

  • 🤗 Upvotes: 6 | cs.CL, cs.AI

    Authors:
    Jenna Russell, Marzena Karpinska, Mohit Iyyer

    Title:
    People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text

    Arxiv:
    http://arxiv.org/abs/2501.15654v1

    Abstract:
    In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). We hire annotators to read 300 non-fiction English articles, label them as either human-written or AI-generated, and provide paragraph-length explanations for their decisions. Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such "expert" annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated even in the presence of evasion tactics like paraphrasing and humanization. Qualitative analysis of the experts' free-form explanations shows that while they rely heavily on specific lexical clues ('AI vocabulary'), they also pick up on more complex phenomena within the text (e.g., formality, originality, clarity) that are challenging to assess for automatic detectors. We release our annotated dataset and code to spur future research into both human and automated detection of AI-generated text.

  • 🤗 Upvotes: 29 | cs.AI, cs.CV, cs.LG

    Authors:
    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma

    Title:
    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Arxiv:
    http://arxiv.org/abs/2501.17161v1

    Abstract:
    Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.

  • 🤗 Upvotes: 15 | cs.LG, cs.CL

    Authors:
    Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng

    Title:
    Optimizing Large Language Model Training Using FP4 Quantization

    Arxiv:
    http://arxiv.org/abs/2501.17116v1

    Abstract:
    The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
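
    Code sketch:
    A simplified illustration of FP4-style fake quantization with a straight-through estimator and outlier clamping. The E2M1 value grid, per-tensor scaling, and quantile-based clamping threshold are assumptions, and the paper's outlier compensation term is omitted.

```python
import torch

# Symmetric E2M1-like value grid (positive values mirrored to negatives).
_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = torch.cat([-_POS.flip(0)[:-1], _POS])

def fake_quant_fp4(x: torch.Tensor, clamp_quantile: float = 0.999) -> torch.Tensor:
    """Forward: snap to the nearest FP4 value; backward: straight-through identity."""
    # Clamp rare outliers so they do not inflate the quantization scale.
    limit = x.abs().flatten().quantile(clamp_quantile).item()
    x_clamped = x.clamp(-limit, limit)
    scale = max(limit, 1e-8) / 6.0          # map the clamp range onto the grid
    # Snap each value to the nearest representable FP4 grid point.
    idx = (x_clamped / scale).unsqueeze(-1).sub(FP4_GRID).abs().argmin(dim=-1)
    q = FP4_GRID[idx] * scale
    # Straight-through estimator: gradients flow as if quantization were identity.
    return x + (q - x).detach()
```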

  • 🤗 Upvotes: 11 | cs.CV

    Authors:
    Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu

    Title:
    DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation

    Arxiv:
    http://arxiv.org/abs/2501.16764v1

    Abstract:
    Recent advancements in 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussian splats by taming large-scale text-to-image diffusion models. It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap the training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian splat grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views. The compatibility with image diffusion models enables seamless adaptation of numerous techniques for image generation to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism.

  • 🤗 Upvotes: 10 | cs.CL, cs.LG

    Authors:
    Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou

    Title:
    Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

    Arxiv:
    http://arxiv.org/abs/2501.16975v1

    Abstract:
    Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
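
    Code sketch:
    A toy illustration of decoupling input and output vocabularies: the input embedding adds a hashed 2-gram embedding on top of the ordinary token embedding, while the output head keeps the original vocabulary. The hashing scheme and sizes below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class OverTokenizedEmbedding(nn.Module):
    """Input-side embedding over unigrams plus hashed 2-grams (illustrative)."""
    def __init__(self, vocab_size: int, ngram_vocab_size: int, d_model: int):
        super().__init__()
        self.unigram = nn.Embedding(vocab_size, d_model)
        self.bigram = nn.Embedding(ngram_vocab_size, d_model)
        self.ngram_vocab_size = ngram_vocab_size

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Hash each (previous, current) token pair into the larger input vocabulary.
        prev = torch.roll(input_ids, shifts=1, dims=-1)
        prev[..., 0] = 0                       # no left context at the first position
        bigram_ids = (prev * 1_000_003 + input_ids) % self.ngram_vocab_size
        return self.unigram(input_ids) + self.bigram(bigram_ids)

# The output projection would still map d_model -> vocab_size, so only the
# input vocabulary is scaled up.
emb = OverTokenizedEmbedding(vocab_size=32_000, ngram_vocab_size=1_000_000, d_model=64)
hidden = emb(torch.randint(0, 32_000, (2, 16)))
```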