Episodes

  • In this episode of Neural Search Talks, we're chatting with Manuel Faysse, a 2nd year PhD student from CentraleSupélec & Illuin Technology, who is the first author of the paper "ColPali: Efficient Document Retrieval with Vision Language Models". ColPali is making waves in the IR community as a simple but effective new take on embedding documents using their image patches and the late-interaction paradigm popularized by ColBERT. Tune in to learn how Manu conceptualized ColPali, his methodology for tackling new research ideas, and why this new approach outperforms all classic multimodal embedding models. A must-watch episode!

    Timestamps:

    0:00 Introduction with Jakub & Manu

    4:09 The "Aha!" moment that led to ColPali

    7:06 Challenges that had to be solved

    9:16 The main idea behind ColPali

    13:20 How ColPali simplifies the IR pipeline

    15:54 The ViDoRe benchmark

    18:23 Why ColPali is superior to CLIP-based retrievers

    20:41 The training setup used for ColPali

    24:00 Optimizations to make ColPali more efficient

    29:00 How ColPali could work with text-only datasets

    31:21 Outro: The next steps for this line of research

  • In this episode of Neural Search Talks, we're chatting with Ronak Pradeep, a PhD student from the University of Waterloo, about his experience using LLMs in Information Retrieval, both as a backbone of ranking systems and for their end-to-end evaluation. Ronak analyzes the impact of the advancements in language models on the way we think about IR systems and shares his insights on efficiently integrating them in production pipelines, with techniques such as knowledge distillation.

    Timestamps:

    0:00 Introduction & the impact of the LLM day in SIGIR 2024

    2:11 The perspective of the IR community on LLMs

    6:10 Language models as backbones for Information Retrieval

    13:49 The feasibility & tricks for using LLMs in production IR pipelines

    20:11 Ronak's hidden gems from the SIGIR 2024 programme

    21:36 Outro

  • In this episode of Neural Search Talks, we're chatting with Omar Khattab, the author behind popular IR & LLM frameworks like ColBERT and DSPy. Omar describes the current state of using AI models in production systems, highlighting how thinking at the right level of abstraction with the right tools for optimization can deliver reliable solutions that extract the most out of the current generation of models. He also lays out his vision for a future of Artificial Programmable Intelligence (API), rather than jumping on the hype of Artificial General Intelligence (AGI), where the goal would be to build systems that effectively integrate AI, with self-improving mechanisms that allow the developers to focus on the design and the problem, rather than the optimization of the lower-level hyperparameters.

    Timestamps:

    0:00 Introduction with Omar Khattab

    1:14 How to reliably integrate LLMs in production-grade software

    12:19 DSPy's philosophy differences from agentic approaches

    14:55 Omar's background in IR that helped him pivot to DSPy

    25:47 The strengths of DSPy's optimization framework

    39:22 How DSPy has reimagined modularity in AI systems

    45:45 The future of using AI models for self-improvement

    49:41 How open-sourcing a project like DSPy influences its development

    52:32 Omar's vision for the future of AI and his research agenda

    59:12 Outro

  • In this episode of Neural Search Talks, we're chatting with Florin Cuconasu, the first author of the paper "The Power of Noise", presented at SIGIR 2024. We discuss the current state of the field of Retrieval-Augmented Generation (RAG), and how LLMs interact with retrievers to power modern Generative AI applications, with Florin delivering practical advice for those developing RAG systems, and laying out his research agenda for the near future.

    Timestamps:

    0:00 Introduction & how RAG has taken over the IR literature

    1:40 How retrievers and LLMs interact in Retrieval-Augmented Generation

    2:55 What practitioners should pay attention to when developing RAG systems

    5:04 What is the power of noise in the context of RAG?

    7:31 Florin's long-term research agenda on RAG interactions

    9:25 How advances in LLMs can impact IR research

    11:26 Outro

  • In this episode of Neural Search Talks, we're chatting with Nandan Thakur about the state of model evaluations in Information Retrieval. Nandan is the first author of the paper that introduced the BEIR benchmark, and since its publication in 2021, we've seen models try to hill-climb on the leaderboard, but also fail to outperform the BM25 baseline in subsets like Touché 2020. We also get some insights into what the future of benchmarking IR systems might look like, such as the newly announced TREC RAG track this year.

    Timestamps:

    0:00 Introduction & the vibe at SIGIR'24

    1:19 Nandan's two papers at the conference

    2:09 The backstory of the BEIR benchmark

    5:55 The shortcomings of BEIR in 2024

    8:04 What's up with the Touché 2020 subset of BEIR

    11:24 The problem with overfitting on benchmarks

    13:09 TREC-RAG: the future of IR benchmarking

    17:34 MIRACL & the importance of multilinguality in IR

    21:38 Outro

  • In this episode of Neural Search Talks, we're chatting with Aamir Shakir from Mixed Bread AI, who shares his insights on starting a company that aims to make search smarter with AI. He details their approach to overcoming challenges in embedding models, touching on the significance of data diversity, novel loss functions, and the future of multilingual and multimodal capabilities. We also get insights on their journey, the ups and downs, and what they're excited about for the future.

    Timestamps:

    0:00 Introduction

    0:25 How did mixedbread.ai start?

    2:16 The story behind the company name and its "bakers"

    4:25 What makes Berlin a great pool for AI talent

    6:12 Building as a GPU-poor team

    7:05 The recipe behind mxbai-embed-large-v1

    9:56 The Angle objective for embedding models

    15:00 Going beyond Matryoshka with mxbai-embed-2d-large-v1

    17:45 Supporting binary embeddings & quantization

    19:07 Collecting large-scale data is key for robust embedding models

    21:50 The importance of multilingual and multimodal models for IR

    24:07 Where will mixedbread.ai be in 12 months?

    26:46 Outro

  • Ash shares his journey from software development to pioneering in the AI infrastructure space with Unum. He discusses Unum's focus on unleashing the full potential of modern computers for AI, search, and database applications through efficient data processing and infrastructure. Highlighting Unum's technical achievements, including SIMD instructions and just-in-time compilation, Ash also touches on the future of computing and his vision for Unum to contribute to advances in personalized medicine and extending human productivity.

    Timestamps:

    0:00 Introduction

    0:44 How did Unum start and what is it about?

    6:12 Differentiating from the competition in vector search

    17:45 Supporting modern features like large dimensions & binary embeddings

    27:49 Upcoming model releases from Unum

    30:00 The future of hardware for AI

    34:56 The impact of AI in society

    37:35 Outro

  • In this episode of Neural Search Talks, Andrew Yates (Assistant Prof at the University of Amsterdam), Sergi Castella (Analyst at Zeta Alpha), and Gabriel Bénédict (PhD student at the University of Amsterdam) discuss the prospect of using GPT-like models as a replacement for conventional search engines.

    Generative Information Retrieval (Gen IR) SIGIR Workshop

    Workshop organized by Gabriel Bénédict, Ruqing Zhang, and Donald Metzler: https://coda.io/@sigir/gen-ir

    Resources on Gen IR: https://github.com/gabriben/awesome-generative-information-retrieval

    References

    Rethinking Search: https://arxiv.org/abs/2105.02274

    Survey on Augmented Language Models: https://arxiv.org/abs/2302.07842

    Differentiable Search Index: https://arxiv.org/abs/2202.06991

    Recommender Systems with Generative Retrieval: https://shashankrajput.github.io/Generative.pdf

    Timestamps:

    00:00 Introduction, ChatGPT Plugins

    02:01 ChatGPT plugins, LangChain

    04:37 What is even Information Retrieval?

    06:14 Index-centric vs. model-centric Retrieval

    12:22 Generative Information Retrieval (Gen IR)

    21:34 Gen IR emerging applications

    24:19 How Retrieval Augmented LMs incorporate external knowledge

    29:19 What is hallucination?

    35:04 Factuality and Faithfulness

    41:04 Evaluating generation of Language Models

    47:44 Do we even need to "measure" performance?

    54:07 How would you evaluate Bing's Sydney?

    57:22 Will language models take over commercial search?

    1:01:44 NLP academic research in the times of GPT-4

    1:06:59 Outro

  • Andrew Yates (Assistant Prof at University of Amsterdam) and Sergi Castella (Analyst at Zeta Alpha) discuss the paper "Task-aware Retrieval with Instructions" by Akari Asai et al. This paper proposes to augment a conglomerate of existing retrieval and NLP datasets with natural language instructions (BERRI, Bank of Explicit RetRieval Instructions) and use it to train TART (Multi-task Instructed Retriever).  

    📄 Paper: https://arxiv.org/abs/2211.09260

    🍻 BEIR benchmark: https://arxiv.org/abs/2104.08663

    📈 LOTTE (Long-Tail Topic-stratified Evaluation, introduced in ColBERT v2): https://arxiv.org/abs/2112.01488

    Timestamps: 

    00:00 Intro: "Task-aware Retrieval with Instructions"

    02:20 BERRI, TART, X^2 evaluation

    04:00 Background: recent works in domain adaptation

    06:50 Instruction Tuning

    08:50 Retrieval with descriptions

    11:30 Retrieval with instructions

    17:28 BERRI, Bank of Explicit RetRieval Instructions

    21:48 Repurposing NLP tasks as retrieval tasks

    23:53 Negative document selection

    27:47 TART, Multi-task Instructed Retriever

    31:50 Evaluation: Zero-shot and X^2 evaluation

    39:20 Results on Table 3 (BEIR, LOTTE)

    50:30 Results on Table 4 (X^2-Retrieval)

    55:50 Ablations

    57:17 Discussion: user modeling, future work, scale

  • Marzieh Fadaee — NLP Research Lead at Zeta Alpha — joins Andrew Yates and Sergi Castella to chat about her work in using large Language Models like GPT-3 to generate domain-specific training data for retrieval models with little-to-no human input. The two papers discussed are "InPars: Data Augmentation for Information Retrieval using Large Language Models" and "Promptagator: Few-shot Dense Retrieval From 8 Examples".

    InPars: https://arxiv.org/abs/2202.05144

    Promptagator: https://arxiv.org/abs/2209.11755

    Timestamps:

    00:00 Introduction

    02:00 Background and journey of Marzieh Fadaee

    03:10 Challenges of leveraging Large LMs in Information Retrieval

    05:20 InPars, motivation and method

    14:30 Vanilla vs GBQ prompting

    24:40 Evaluation and Benchmark

    26:30 Baselines

    27:40 Main results and takeaways (Table 1, InPars)

    35:40 Ablations: prompting, in-domain vs. MSMARCO input documents

    40:40 Promptagator overview and main differences with InPars

    48:40 Retriever training and filtering in Promptagator

    54:37 Main Results (Table 2, Promptagator)

    1:02:30 Ablations on consistency filtering (Figure 2, Promptagator)

    1:07:39 Is this the magic black-box pipeline for neural retrieval on any documents

    1:11:14 Limitations of using LMs for synthetic data

    1:13:00 Future directions for this line of research

  • Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella (Analyst at Zeta Alpha) discuss the two influential papers introducing ColBERT (from 2020) and ColBERT v2 (from 2022), which mainly propose a fast late interaction operation that achieves performance close to full cross-encoders at a more manageable computational cost at inference, along with many other optimizations.

    📄 ColBERT: "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT" by Omar Khattab and Matei Zaharia. https://arxiv.org/abs/2004.12832

    📄 ColBERTv2: "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction" by Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. https://arxiv.org/abs/2112.01488

    📄 PLAID: "An Efficient Engine for Late Interaction Retrieval" by Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia. https://arxiv.org/abs/2205.09707

    📄 CEDR: "CEDR: Contextualized Embeddings for Document Ranking" by Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. https://arxiv.org/abs/1904.07094

    🪃 Feedback form: https://scastella.typeform.com/to/rg7a5GfJ

    Timestamps:

    00:00 Introduction

    00:42 Why ColBERT?

    03:34 Retrieval paradigms recap

    08:04 ColBERT query formulation and architecture

    09:04 Using ColBERT as a reranker or as an end-to-end retriever

    11:28 Space Footprint vs. MRR on MS MARCO

    12:24 Methodology: datasets and negative sampling

    14:37 Terminology for cross encoders, interaction-based models, etc.

    16:12 Results (ColBERT v1) on MS MARCO

    18:41 Ablations on model components

    20:34 Max pooling vs. mean pooling

    22:54 Why did ColBERT have a big impact?

    26:31 ColBERTv2: knowledge distillation

    29:34 ColBERTv2: indexing improvements

    33:59 Effects of clustering compression in performance

    35:19 Results (ColBERT v2): MS MARCO

    38:54 Results (ColBERT v2): BEIR

    41:27 Takeaway: especially strong in out-of-domain evaluation

    43:59 Qualitatively, what do ColBERT scores look like?

    46:21 What's the most promising of all current neural IR paradigms

    49:34 How come there's still so much interest in Dense retrieval?

    51:09 Many to many similarity at different granularities

    53:44 What would ColBERT v3 include?

    56:39 PLAID: An Efficient Engine for Late Interaction Retrieval

    Contact: [email protected]

  • How much of the training and test sets in TREC or MS Marco overlap? Can we evaluate on different splits of the data to isolate the extrapolation performance?

    In this episode of Neural Information Retrieval Talks, Andrew Yates and Sergi Castella i Sapé discuss the paper "Evaluating Extrapolation Performance of Dense Retrieval" by Jingtao Zhan, Xiaohui Xie, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma.

    📄 Paper: https://arxiv.org/abs/2204.11447

    ❓ About MS Marco: https://microsoft.github.io/msmarco/

    ❓About TREC: https://trec.nist.gov/

    🪃 Feedback form: https://scastella.typeform.com/to/rg7a5GfJ  

    Timestamps: 

    00:00 Introduction 

    01:08 Evaluation in Information Retrieval, why is it exciting 

    07:40 Extrapolation Performance in Dense Retrieval 

    10:30 Learning in High Dimension Always Amounts to Extrapolation 

    11:40 3 Research questions 

    16:18 Defining Train-Test label overlap: entity and query intent overlap 

    21:00 Train-test Overlap in existing benchmarks TREC 

    23:29 Resampling evaluation methods: constructing distinct train-test sets 

    25:37 Baselines and results: ColBERT, SPLADE

    29:36 Table 6: interpolation vs. extrapolation performance in TREC 

    33:06 Table 7: interpolation vs. extrapolation in MS Marco

    35:55 Table 8: Comparing different DR training approaches 

    40:00 Research Question 1 resolved: cross encoders are more robust than dense retrieval in extrapolation 

    42:00 Extrapolation and Domain Transfer: BEIR benchmark. 

    44:46 Figure 2: correlation between extrapolation performance and domain transfer performance 

    48:35 Broad strokes takeaways from this work 

    52:30 Is there any intuition behind the results where Dense Retrieval generalizes worse than Cross Encoders? 

    56:14 Will this have an impact on the IR benchmarking culture? 

    57:40 Outro   

    Contact: [email protected]

  • Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella i Sapé discuss the recent "Open Pre-trained Transformer (OPT) Language Models" from Meta AI (formerly Facebook). In this replication work, Meta developed and trained a 175 Billion parameter Transformer very similar to GPT-3 from OpenAI, documenting the process in detail to share their findings with the community. The code, pretrained weights, and logbook are available on their GitHub repository (links below).

    Links 

    ❓Feedback Form: https://scastella.typeform.com/to/rg7a5GfJ

    📄 OPT paper: https://arxiv.org/abs/2205.01068

    👾 Code: https://github.com/facebookresearch/metaseq

    📒 Logbook: https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf

    ✍️ OPT Official Blog Post: https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/  

    OpenAI Embeddings API: https://openai.com/blog/introducing-text-and-code-embeddings/

    Nils Reimers' critique of OpenAI Embeddings API: https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9 

    Timestamps: 

    00:00 Introduction and housekeeping: new feedback form, ACL conference highlights 

    02:42 The convergence between NLP and Neural IR techniques 

    06:43 Open Pretrained Transformer motivation and scope, reproducing GPT-3 and open-sourcing 

    08:16 Basics of OPT: architecture, pre-training objective, teacher forcing, tokenizer, training data 

    13:40 Preliminary experiments findings: hyperparameters, training stability, spikiness 

    20:08 Problems that appear at scale when training with 992 GPUs

    23:01 Using temperature to check whether GPUs are working

    25:00 Training the largest model: what to do when the loss explodes? (which happens quite often)

    29:15 When they switched away from AdamW to SGD

    32:00 Results: successful but not quite GPT-3 level. Toxicity?

    35:45 Replicability of Large Language Models research. Was GPT-3 replicable? What difference does it make?

    37:25 What makes a paper replicable?

    40:33 Directions in which large Language Models are applied to Information Retrieval

    45:15 Final thoughts and takeaways

  • We discuss Conversational Search with our usual cohosts Andrew Yates and Sergi Castella i Sapé; along with a special guest Antonios Minas Krasakis, PhD candidate at the University of Amsterdam. 

    We center our discussion around the ConvDR paper: "Few-Shot Conversational Dense Retrieval" by Shi Yu et al., which was the first work to perform Conversational Search without an explicit conversation-to-query rewriting step.

    Timestamps:

    00:00 Introduction

    00:50 Conversational AI and Conversational Search

    05:40 What makes Conversational Search challenging

    07:00 ConvDR paper introduction

    10:10 Passage representations

    11:30 Conversation representations: query rewriting

    19:12 ConvDR novel proposed method: teacher-student setup with ANCE

    22:50 Datasets and benchmarks: CAsT, CANARD

    25:32 Teacher-student advantages and knowledge distillation vs. ranking loss functions

    28:09 TREC CAsT and OR-QuAC

    35:50 Metrics: MRR, NDCG, holes@10

    44:16 Main Results on CAsT and OR-QuAC (Table 2)

    57:35 Ablations on combinations of loss functions (Table 4)

    1:00:10 How fast is ConvDR? (Table 3)

    1:02:40 Qualitative analysis on ConvDR embeddings (Figure 4)

    1:04:50 How has this work aged? More recent works in similar directions: Contextualized Query Embeddings for Conversational Search.

    1:07:02 Is "end-to-end" the silver-bullet for Conversational Search?

    1:10:04 Will conversational search become more mainstream?

    1:18:44 Latest initiatives for Conversational Search

  • Andrew Yates and Sergi Castella discuss the paper titled "Transformer Memory as a Differentiable Search Index" by Yi Tay et al. at Google. This work proposes a new approach to document retrieval in which document ids are memorized by a transformer during training (or "indexing"). For retrieval, a query is fed to the model, which then autoregressively generates relevant doc ids for that query.

    Paper: https://arxiv.org/abs/2202.06991

    Timestamps:

    00:00 Intro: Transformer memory as a Differentiable Search Index (DSI)

    01:15 The gist of the paper, motivation

    4:20 Related work: Autoregressive Entity Linking

    7:38 What is an index? Conventional vs. "differentiable"

    10:20 Indexing and Retrieval definitions in the context of the DSI

    12:40 Learning representations for documents

    17:20 How to represent document ids: atomic, string, semantically relevant

    22:00 Zero-shot vs. finetuned settings

    24:10 Datasets and baselines

    27:08 Finetuned results

    36:40 Zero-shot results

    43:50 Ablation results

    47:15 Where could this model be used?

    52:00 Is memory efficiency a fundamental problem of this approach?

    55:14 What about semantically relevant doc ids?

    60:30 Closing remarks 

    Contact: [email protected]

  • In this third episode of the Neural Information Retrieval Talks podcast, Andrew Yates and Sergi Castella discuss the paper "Learning to Retrieve Passages without Supervision" by Ori Ram et al.  

    Despite the massive advances in Neural Information Retrieval in the past few years, statistical models still outperform neural models when no annotations are available at all. This paper proposes a new self-supervised pretraining task for Dense Information Retrieval that manages to beat BM25 on some benchmarks without using any labels.

    Paper: https://arxiv.org/abs/2112.07708 

    Timestamps:

    00:00 Introduction

    00:36 "Learning to Retrieve Passages Without Supervision"

    02:20 Open Domain Question Answering

    05:05 Related work: Families of Retrieval Models

    08:30 Contrastive Learning

    11:18 Siamese Networks, Bi-Encoders and Dual-Encoders

    13:33 Choosing Negative Samples

    17:46 Self supervision: how to train IR models without labels.

    21:31 The modern recipe for SOTA Retrieval Models

    23:50 Methodology: a new proposed self supervision task

    26:40 Datasets, metrics and baselines

    33:50 Results: Zero-Shot performance

    43:07 Results: Few-shot performance

    47:15 Practically, is not using labels relevant after all?

    51:37 How would you "break" the Spider model?

    53:23 How long until Neural IR models outperform BM25 out-of-the-box robustly?

    54:50 Models as a service: OpenAI's text embeddings API

    Contact: [email protected]

  • We discuss the Information Retrieval publication "The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes" by Nils Reimers and Iryna Gurevych, which explores how Dense Passage Retrieval performance degrades as the index size varies and how it compares to traditional sparse or keyword-based methods.

    Timestamps:

    00:00 Co-host introduction

    00:26 Paper introduction

    02:18 Dense vs. Sparse retrieval

    05:46 Theoretical analysis of false positives (1)

    08:17 What are low- vs. high-dimensional representations?

    11:49 Theoretical analysis of false positives (2)

    20:10 First results: growing the MS-Marco index

    28:35 Adding random strings to the index

    39:17 Discussion, takeaways

    44:26 Will dense retrieval replace or coexist with sparse methods?

    50:50 Sparse, Dense and Attentional Representations for Text Retrieval

    Referenced work:

    Sparse, Dense and Attentional Representations for Text Retrieval by Yi Luan et al. 2020. 

  • In this first episode of Neural Information Retrieval Talks, Andrew Yates and Sergi Castella discuss the paper "Shallow Pooling for Sparse Labels" by Negar Arabzadeh, Alexandra Vtyurina, Xinyi Yan and Charles L. A. Clarke from the University of Waterloo, Canada.

    This paper puts the spotlight on the popular IR benchmark MS MARCO and investigates whether modern neural retrieval models retrieve documents that are even more relevant than the original top relevance annotations. The results have important implications and raise the question of to what degree this benchmark is still an informative north star to follow.

    Contact: [email protected]

    Timestamps:

    00:00 — Introduction.

    01:52 — Overview and motivation of the paper.

    04:00 — Origins of MS MARCO.

    07:30 — Modern approaches to IR: keyword-based, dense retrieval, rerankers and learned sparse representations.

    13:40 — What is "better than perfect" performance on MS MARCO?

    17:15 — Results and discussion: how often are neural rankers preferred over original annotations on MS MARCO? How should we interpret these results?

    26:55 — The authors' proposal to "fix" MS MARCO: shallow pooling

    32:40 — How does TREC Deep Learning compare?

    38:30 — How do models compare after re-annotating MS MARCO passages?

    45:00 — Figure 5 audio description.

    47:00 — Discussion on models' performance after re-annotations.

    51:50 — Exciting directions in the space of IR benchmarking.

    1:06:20 — Outro.

    Related material:

    - Leo Boytsov paper critique blog post: http://searchivarius.org/blog/ir-leaderboards-never-tell-full-story-they-are-still-useful-and-what-can-be-done-make-them-even

    - "MS MARCO Chameleons: Challenging the MS MARCO Leaderboard with Extremely Obstinate Queries" https://dl.acm.org/doi/abs/10.1145/3459637.3482011