
Ingesting PDFs into Weaviate

What's this blog post about?

Recent advances in multimodal deep learning have made it possible to extract high-quality data from PDF documents and add it to a Weaviate workflow. Optical Character Recognition (OCR) converts different types of visual documents into machine-readable formats. Newer models such as LayoutLMv3 and Donut use multimodal transformers to draw on both text and visual information. Unstructured, a company building open-source tooling at the cutting edge of PDF processing, lets businesses ingest diverse data sources and convert them into data that can be passed to a large language model (LLM). Once a company's private documents have been converted to text in this way, users can chat with their PDFs.

Company
Weaviate

Date published
May 23, 2023

Author(s)
Erika Cardenas, Mohd Shukri Hasan

Word count
1776

Language
English

Hacker News points
None found.