GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval

Company

Zilliz

Date Published

Jan. 9, 2025

Author

Haziqa Sajid

Word count

2837

Language

English

Hacker News points

None

URL

zilliz.com/blog/generative-pseudo-labeling-for-unsupervised-domain-adaptation-of-dense-retrieval

Summary

GPL (Generative Pseudo Labeling) is an unsupervised domain adaptation technique that improves dense retrieval models when they are applied to new domains. It combines a query generator with pseudo-labeling: a T5 model generates synthetic queries for each passage in the target domain, and a cross-encoder scores the resulting (query, passage) pairs. These fine-grained relevance scores carry more information than the binary labels used in other methods. GPL outperforms other domain adaptation methods, improving nDCG@10 by up to 9.3 points over a dense retriever trained on MS MARCO and by up to 4.5 points over QGen (query generation). It targets known challenges of dense retrieval models, including large training-data requirements, sensitivity to domain shift, the lexical gap, and weak zero-shot performance. The method is robust to variability in query generation, the choice of initialization checkpoint, and corpus size, and it achieves consistent improvements across 18 datasets. For vector databases like Milvus, GPL points toward better semantic search, because models can be adapted to a new domain without large amounts of labeled training data. Future research directions include simplifying the training pipeline, exploring alternative pre-training methods, domain-specific tuning, investigating alternatives to cross-encoders, adapting GPL to low-resource languages, and combining GPL with other adaptation methods.
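
To make the contrast between fine-grained scores and binary labels concrete, here is a minimal Python sketch. It rests on several assumptions that the summary itself does not state: that each generated query is paired with its source passage and one non-matching (negative) passage, that the pseudo-label is the difference between the two teacher scores, and that the retriever is trained with a squared-error loss on that difference. The function `toy_cross_encoder` is a hypothetical token-overlap stand-in, not a real cross-encoder, and all names are illustrative.

```python
from typing import Callable, Sequence


def toy_cross_encoder(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder.

    Returns the fraction of query tokens that also appear in the passage.
    A real cross-encoder would be a trained neural model.
    """
    q_tokens = query.lower().split()
    p_tokens = set(passage.lower().split())
    if not q_tokens:
        return 0.0
    return sum(t in p_tokens for t in q_tokens) / len(q_tokens)


def pseudo_label(
    query: str,
    positive: str,
    negative: str,
    scorer: Callable[[str, str], float] = toy_cross_encoder,
) -> float:
    """Fine-grained label (assumed form): teacher score margin, positive minus negative."""
    return scorer(query, positive) - scorer(query, negative)


def margin_mse(student_margins: Sequence[float], teacher_margins: Sequence[float]) -> float:
    """Mean squared error between student and teacher margins (assumed training objective)."""
    if len(student_margins) != len(teacher_margins):
        raise ValueError("margin lists must have the same length")
    if not student_margins:
        return 0.0
    squared = [(s - t) ** 2 for s, t in zip(student_margins, teacher_margins)]
    return sum(squared) / len(squared)


# Illustrative inputs; in GPL the query would come from a T5 query generator.
query = "what is dense retrieval"
positive = "dense retrieval is a neural search method"
negative = "milvus is a vector database"

teacher_margin = pseudo_label(query, positive, negative)
loss_if_student_agrees = margin_mse([teacher_margin], [teacher_margin])
loss_if_student_is_flat = margin_mse([0.0], [teacher_margin])
```

A binary scheme would label the first pair 1 and the second 0 regardless of how close they are. The margin keeps the degree of difference, which is the extra information the summary attributes to cross-encoder scoring.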