Ingesting PDFs into Weaviate

Company

Weaviate

Date Published

May 23, 2023

Author

Erika Cardenas, Mohd Shukri Hasan

Word count

1776

Language

English

Hacker News points

None

URL

weaviate.io/blog/ingesting-pdfs-into-weaviate

Summary

Recent advances in multimodal deep learning make it possible to extract high-quality data from PDF documents and add it to a Weaviate workflow. Optical Character Recognition (OCR) converts visual documents into machine-readable text, and newer models such as LayoutLMv3 and Donut use multimodal transformers to draw on both text and visual information. Unstructured, a company building open-source tools at the cutting edge of PDF processing, lets businesses ingest diverse data sources and convert them into data that can be passed to a large language model (LLM). Once a company's private documents have been converted to text in this way, users can chat with their PDFs.