Announcing pg_tiktoken: A Postgres Extension for Fast BPE Tokenization

What's this blog post about?

Neon has announced pg_tiktoken, a Postgres extension for fast Byte Pair Encoding (BPE) tokenization. The extension wraps OpenAI's tiktoken tokenizer so that text can be tokenized and analyzed directly inside a Postgres database. It provides two functions: tiktoken_encode, which tokenizes a text input, and tiktoken_count, which returns the number of tokens in a text. Supported encodings include cl100k_base, p50k_base, p50k_edit, and r50k_base (also known as gpt2). The extension is built for speed and handles a range of text inputs, including multiple languages and special characters.
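To make the two operations concrete, here is a minimal Python sketch of BPE encoding and token counting. It is not the extension's implementation: the merge table `TOY_MERGES` and the helpers `toy_encode` and `toy_count` are hypothetical names invented for illustration, and real encodings such as cl100k_base use a far larger learned merge table plus a pre-splitting step that this sketch omits.

```python
# Toy byte-level BPE. The merge table is made up for illustration;
# it does NOT correspond to cl100k_base or any other real encoding.
TOY_MERGES = [
    (b"l", b"l"),     # rank 0 (highest priority)
    (b"h", b"e"),     # rank 1
    (b"ll", b"o"),    # rank 2
    (b"he", b"llo"),  # rank 3
]


def toy_encode(text, merges=TOY_MERGES):
    """Return token ids for `text` (hypothetical analogue of tiktoken_encode)."""
    # Lower rank means the pair is merged earlier.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    # Single bytes keep ids 0..255; each merged token gets 256 + its rank.
    merged_ids = {a + b: 256 + rank for rank, (a, b) in enumerate(merges)}

    # Start from the individual UTF-8 bytes of the input.
    parts = [bytes([b]) for b in text.encode("utf-8")]

    while len(parts) > 1:
        # Find the adjacent pair with the lowest rank (leftmost on ties).
        best_index, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get((parts[i], parts[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no known pair left to merge
        # Replace the two parts with their concatenation.
        parts[best_index:best_index + 2] = [parts[best_index] + parts[best_index + 1]]

    return [p[0] if len(p) == 1 else merged_ids[p] for p in parts]


def toy_count(text, merges=TOY_MERGES):
    """Return the number of tokens (hypothetical analogue of tiktoken_count)."""
    return len(toy_encode(text, merges))


if __name__ == "__main__":
    print(toy_encode("hello"))   # a single merged token
    print(toy_encode("hell"))    # two tokens: "he" and "ll"
    print(toy_count("hello!"))   # merged "hello" plus the byte for "!"
```

Because the sketch works on UTF-8 bytes, any input can be encoded, including non-English text and special characters; characters not covered by the merge table simply fall back to one token per byte.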

Company
Neon

Date published
March 14, 2023

Author(s)
Stas Kelvich

Word count
775

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.