Announcing pg_tiktoken: A Postgres Extension for Fast BPE Tokenization
Neon has announced pg_tiktoken, a Postgres extension for fast BPE tokenization. The extension tokenizes text directly inside a Postgres database using the Byte Pair Encoding (BPE) algorithm, so text can be analyzed and processed without leaving the database. It is a wrapper around OpenAI's tiktoken tokenizer, which is known for its speed in natural language processing tasks. The extension provides two functions: tiktoken_encode tokenizes a text input, and tiktoken_count returns the number of tokens in a text. Supported encodings are cl100k_base, p50k_base, p50k_edit, and r50k_base (also called gpt2). The extension handles a range of text inputs, including multiple languages and special characters.
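To make the idea behind the two functions concrete, here is a minimal Python sketch of byte-level BPE. It is a toy illustration only, not the extension's or tiktoken's implementation: the names learn_merges, apply_merge, toy_encode, and toy_count are hypothetical, and the merge rules are learned from a tiny sample string rather than loaded from a real vocabulary such as cl100k_base.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from `corpus` (hypothetical helper).

    Each round merges the most frequent adjacent pair of tokens,
    starting from single UTF-8 bytes.
    """
    tokens = [bytes([b]) for b in corpus.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best, freq = pairs.most_common(1)[0]
        if freq < 2:  # nothing repeats, so further merges gain nothing
            break
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges


def toy_encode(text, merges):
    """Toy analogue of an encode step: split into bytes, then apply merges in order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


def toy_count(text, merges):
    """Toy analogue of a count step: the number of tokens after encoding."""
    return len(toy_encode(text, merges))


if __name__ == "__main__":
    rules = learn_merges("low lower lowest", 3)
    print(toy_encode("lowest", rules))
    print(toy_count("lowest", rules))
```

Working on UTF-8 bytes rather than characters is what lets a BPE tokenizer accept any language or special character: every input falls back to single-byte tokens if no merge rule applies.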
Company
Neon
Date published
March 14, 2023
Author(s)
Stas Kelvich
Word count
775
Language
English
Hacker News points
None found.