Company
Date Published
Jan. 9, 2025
Author
Haziqa Sajid
Word count
2837
Language
English
Hacker News points
None

Summary

GPL (Generative Pseudo Labeling) is an unsupervised domain adaptation technique that improves dense retrieval models when they are applied to new domains. It combines a query generator with pseudo-labeling: a T5 model generates synthetic queries for each passage in the target domain, and a cross-encoder scores the resulting (query, passage) pairs. These fine-grained relevance scores carry more information than the binary labels used in other methods. GPL outperforms other domain adaptation methods, improving on a dense retriever trained on MS MARCO by up to 9.3 nDCG@10 points and on QGen (query generation) by up to 4.5 nDCG@10 points. It targets known weaknesses of dense retrieval models: their need for large amounts of labeled data, sensitivity to domain shift, the lexical gap, and weak zero-shot performance. The method is robust to the quality of the generated queries, the choice of initialization checkpoint, and corpus size, and it achieves consistent improvements across 18 datasets. For semantic search in vector databases such as Milvus, GPL makes it possible to adapt an embedding model to a new domain without large amounts of labeled training data. Future research directions include simplifying the training pipeline, exploring alternative pre-training methods, domain-specific tuning, investigating alternatives to cross-encoders, adapting GPL to low-resource languages, and combining GPL with other adaptation methods.
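
The pseudo-labeling step can be illustrated with a minimal sketch in plain Python. It assumes the setup described in the GPL paper, where the label for a (query, positive passage, negative passage) triple is the difference between two cross-encoder scores and the dense retriever is trained to reproduce that margin with a squared-error loss. The summary above does not spell out the loss, so treat that part as an assumption. The function `toy_cross_encoder` is a hypothetical token-overlap scorer standing in for a real cross-encoder, and the two-dimensional vectors are made-up embeddings.

```python
from typing import Callable, Sequence


def toy_cross_encoder(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder:
    the fraction of query tokens that also appear in the passage."""
    q_tokens = query.lower().split()
    p_tokens = set(passage.lower().split())
    if not q_tokens:
        return 0.0
    return sum(t in p_tokens for t in q_tokens) / len(q_tokens)


def pseudo_label(
    score: Callable[[str, str], float], query: str, positive: str, negative: str
) -> float:
    """Fine-grained label: how much more relevant the positive passage is
    than the negative one, according to the scoring function."""
    return score(query, positive) - score(query, negative)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two embeddings."""
    return sum(x * y for x, y in zip(a, b))


def margin_mse(student_margin: float, teacher_margin: float) -> float:
    """Squared error between the retriever's margin and the pseudo label."""
    return (student_margin - teacher_margin) ** 2


# A synthetic query (in GPL this would come from the T5 query generator).
query = "what is dense retrieval"
positive = "dense retrieval encodes text as vectors"
negative = "milvus is a vector database"

# Teacher side: the margin produced by the (toy) cross-encoder.
teacher_margin = pseudo_label(toy_cross_encoder, query, positive, negative)

# Student side: made-up embeddings from the dense retriever being adapted.
q_vec, pos_vec, neg_vec = [1.0, 0.0], [0.5, 0.5], [0.0, 1.0]
student_margin = dot(q_vec, pos_vec) - dot(q_vec, neg_vec)

loss = margin_mse(student_margin, teacher_margin)
print(teacher_margin, student_margin, loss)
```

Because the label is a continuous margin rather than a 0/1 flag, a negative passage that is partly relevant yields a small target margin instead of being treated as entirely irrelevant. In a real pipeline the toy scorer would be replaced by a trained cross-encoder and the loss would be minimized over many such triples.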