Although enterprises have spent decades mastering structured data, an estimated 80% of enterprise knowledge remains locked away in PDFs, images, and office documents. Traditional Intelligent Document Processing (IDP) solutions have historically been fragmented, stitching together separate NLP and computer vision APIs with little integration or governance. Databricks aims to change this with a unified approach that builds data intelligence directly into the data lifecycle. The company announced Databricks Document Intelligence and Lakeflow, solutions designed to help data engineers build and automate end-to-end IDP workflows.
The new offering lets teams ingest unstructured data, parse it with AI grounded in enterprise context, and orchestrate the resulting workflows at scale, all within Databricks' governed platform. The goal is to turn previously hidden documents into trusted, queryable datasets, unlocking new insights and business value.
Ingestion with Lakeflow Connect
Enterprise documents often reside in siloed systems, accessible only through fragile custom integrations. Lakeflow Connect addresses this by offering built-in connectors for sources like SharePoint and Google Drive, providing zero-maintenance ingestion. Documents are directly ingested into Unity Catalog Volumes and tables, immediately benefiting from access control, lineage, and auditing.
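For readers unfamiliar with Unity Catalog Volumes, files stored in a volume are addressed by a path of the form /Volumes/<catalog>/<schema>/<volume>/<path>. The short Python sketch below builds such a path. The catalog, schema, volume, and file names are hypothetical placeholders chosen for illustration, not names used by Lakeflow Connect.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, relative_path: str) -> str:
    """Build a path of the form /Volumes/<catalog>/<schema>/<volume>/<path>."""
    # Strip leading slashes so the relative part cannot reset the path root.
    return str(PurePosixPath("/Volumes", catalog, schema, volume, relative_path.lstrip("/")))


# Hypothetical names: a landing volume for documents synced from SharePoint.
landing = volume_path("main", "idp", "raw_documents", "sharepoint/contracts/msa.pdf")
print(landing)
```

Because the document lives under a catalog and schema, whatever governance is applied at those levels covers it as well.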
As a result, the granular, attribute-based policies already in place for structured data can also be applied to unstructured content. Lakeflow Connect also supports fast, incremental reads and writes, which suits large document libraries and enables both batch processing and near-real-time data flows.
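To make the idea of incremental reads concrete, here is a minimal Python sketch of watermark-based change detection, a common general technique for syncing only what has changed. It is an illustration under stated assumptions, not Lakeflow Connect's actual implementation or API: the SourceDocument type, the helper functions, and the sample file listing are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SourceDocument:
    """A document as listed by a hypothetical source system."""
    path: str
    modified_at: datetime


def select_changed(documents, watermark: Optional[datetime]):
    """Return documents modified after the watermark, oldest first."""
    changed = [d for d in documents if watermark is None or d.modified_at > watermark]
    return sorted(changed, key=lambda d: d.modified_at)


def advance_watermark(watermark: Optional[datetime], ingested) -> Optional[datetime]:
    """Move the watermark to the newest modification time just ingested."""
    if not ingested:
        return watermark
    newest = max(d.modified_at for d in ingested)
    return newest if watermark is None else max(watermark, newest)


# Hypothetical listing returned by a document source.
listing = [
    SourceDocument("contracts/a.pdf", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    SourceDocument("contracts/b.pdf", datetime(2024, 1, 5, tzinfo=timezone.utc)),
    SourceDocument("invoices/c.docx", datetime(2024, 1, 9, tzinfo=timezone.utc)),
]

# First run: no watermark yet, so every document is selected.
first_batch = select_changed(listing, None)
watermark = advance_watermark(None, first_batch)

# Second run: one file was edited since the first run, so only it is selected.
listing[0] = SourceDocument("contracts/a.pdf", datetime(2024, 2, 1, tzinfo=timezone.utc))
second_batch = select_changed(listing, watermark)
watermark = advance_watermark(watermark, second_batch)
```

The cost of each run scales with the number of changed documents rather than the size of the whole library, which is why incremental processing suits large document collections. The same loop can serve scheduled batch runs or more frequent, near-real-time syncs.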