Airbyte now supports extracting text from documents
Airbyte now supports extracting text from documents stored in S3, Azure Blob Storage, and Google Drive sources. The extracted text is emitted as Markdown, so it can be used for search and for building applications powered by language models. This makes unstructured data such as meeting notes, specifications, roadmaps, and descriptions of planned features available alongside structured data: Airbyte extracts the text from these documents and sends it to a warehouse for further processing. The new experimental "Document File Type Format" lets users extract text from PDF, Word, PowerPoint, and Google documents in the same way they already sync structured data stored in CSV or Avro files.
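As a rough illustration of the search use case mentioned above, the sketch below splits one extracted Markdown document into heading-based sections that could be indexed individually. It is not Airbyte code: the record shape and the field names `document_key` and `content` are assumptions made for this example, and `split_markdown_by_heading` is a hypothetical helper.

```python
import re
from typing import Dict, List


def split_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown text into sections, one per heading (hypothetical helper)."""
    sections: List[Dict[str, str]] = []
    title = ""
    body: List[str] = []

    def flush() -> None:
        # Store the section collected so far, skipping empty ones.
        if title or any(line.strip() for line in body):
            sections.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            # A new heading closes the previous section.
            flush()
            title = match.group(1).strip()
            body = []
        else:
            body.append(line)
    flush()
    return sections


# Assumed record shape: the field names are illustrative, not confirmed by the source.
record = {
    "document_key": "notes/2023-11-07-planning.pdf",
    "content": "# Roadmap\nShip document extraction.\n\n## Open questions\nWhich file types come next?",
}

chunks = split_markdown_by_heading(record["content"])
for chunk in chunks:
    print(record["document_key"], "|", chunk["title"], "|", chunk["text"])
```

Splitting on headings is only one possible chunking strategy; fixed-size or sentence-based chunks may suit some search setups better.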
Company
Airbyte
Date published
Nov. 7, 2023
Author(s)
Joe Reuter
Word count
634
Language
English
Hacker News points
None found.