OSI-open

Vectors, documents and extraction

langextract

google

Python library from Google for LLM-powered structured extraction with source grounding.

36.8k stars (as of 2026-06-07)

Overview

What is langextract?

A Python library from Google that uses an LLM to pull structured information out of unstructured text, then maps every extraction back to its exact location in the source ('source grounding') and renders an interactive HTML view. It ships no model of its own; you bring a provider: Gemini (default), OpenAI, or local models via Ollama (no API key needed).
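To make 'source grounding' concrete, here is a minimal sketch of the idea in plain Python. It is not langextract's API or its alignment algorithm: the names `GroundedExtraction` and `ground` are hypothetical, the candidate list stands in for what an LLM might return, and exact substring search is an assumed simplification.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GroundedExtraction:
    """Hypothetical record: an extracted value plus where it sits in the source."""
    label: str
    text: str
    start: Optional[int]  # character offset in the source, None if not found
    end: Optional[int]    # exclusive end offset, None if not found


def ground(source: str, candidates: List[Tuple[str, str]]) -> List[GroundedExtraction]:
    """Attach character offsets to (label, text) pairs via exact substring search."""
    grounded = []
    cursor = 0  # prefer matches after the previous one, so repeated strings stay in order
    for label, text in candidates:
        idx = source.find(text, cursor)
        if idx == -1:
            idx = source.find(text)  # fall back to searching the whole source
        if idx == -1:
            # The value does not appear verbatim: flag it as ungrounded.
            grounded.append(GroundedExtraction(label, text, None, None))
        else:
            grounded.append(GroundedExtraction(label, text, idx, idx + len(text)))
            cursor = idx + len(text)
    return grounded


source = "Ada Lovelace wrote the first program in 1843."
# Stand-in for model output; "London" is deliberately absent from the source.
candidates = [("person", "Ada Lovelace"), ("year", "1843"), ("place", "London")]
results = ground(source, candidates)

for r in results:
    print(r)
```

An extraction whose offsets come back as `None` is one the source does not support verbatim, which is the kind of case traceable provenance is meant to expose.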

Analysis

Pros & Cons

Pros

Apache-2.0, permissive and OSI-open, no copyleft
Provider-agnostic: cloud (Gemini/OpenAI/Vertex) or fully local via Ollama with no API key
Source-grounding and an out-of-the-box HTML visualization are a genuine differentiator

Cons

Cloud models require an external LLM API, which means ongoing token costs and your text leaving your machine; the only fully local option is Ollama
The README states plainly 'this is not an officially supported Google product' - no SLA
Accuracy figures are the project's own claims, and results depend on the chosen model, prompt and examples

License

Apache-2.0 (OSI-open)

When it is interesting

Turning documents, reports or notes into structured data with traceable provenance.

When it is too early

If you need a supported product with guarantees, or cannot send text to a cloud model and do not want to run Ollama locally.

This repo featured in the 2026-06 edition of the Open-Source AI Radar.

Similar repositories

LEANN

StarTrail-org

11.9k

RAG on everything - graph-based vector index claiming 97% storage savings for private on-device search.

OSI-open

Vectors, documents and extraction

turbovec

RyanCodrai

11.5k

Rust vector index with TurboQuant compression (ICLR 2026) - SIMD kernels, online ingest.

OSI-open

Vectors, documents and extraction

chandra

datalab-to

11.1k

High-accuracy document digitization (OCR/layout) with code and an open model.

Open weight, with conditions

Vectors, documents and extraction