langextract
Python library from Google for LLM-powered structured extraction with source grounding.
What is langextract?
A Python library from Google that uses an LLM to pull structured information out of unstructured text, maps every extraction back to its exact location in the source ('source grounding'), and renders the result as an interactive HTML view. It ships no model of its own; you bring a provider: Gemini (default), OpenAI, or local models via Ollama (no API key needed).
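To make 'source grounding' concrete, here is a minimal sketch of the idea in plain Python. It is not langextract's API or its implementation: the `GroundedExtraction` class and the `ground` function are hypothetical names invented for illustration, and the matching is a simple exact substring search, whereas a real library would likely need something more tolerant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedExtraction:
    """Hypothetical record: an extracted value plus where it sits in the source."""
    extraction_class: str        # label for the extraction, e.g. "dosage"
    extraction_text: str         # the text the model claims to have extracted
    start: Optional[int] = None  # character offset in the source, None if not found
    end: Optional[int] = None    # exclusive end offset, None if not found


def ground(source: str, extraction_class: str, extraction_text: str) -> GroundedExtraction:
    """Attach character offsets to an extraction using an exact substring match.

    Assumption: exact, case-sensitive matching of the first occurrence.
    An extraction that does not appear verbatim stays ungrounded, which is
    a useful signal that the model may have paraphrased or invented it.
    """
    idx = source.find(extraction_text)
    if idx == -1:
        return GroundedExtraction(extraction_class, extraction_text)
    return GroundedExtraction(
        extraction_class, extraction_text, idx, idx + len(extraction_text)
    )


source = "Patient was given 250 mg amoxicillin twice daily."

grounded = ground(source, "dosage", "250 mg")      # appears verbatim in the source
ungrounded = ground(source, "dosage", "500 mg")    # does not appear in the source

# A grounded extraction can always be traced back by slicing the source.
if grounded.start is not None:
    print(source[grounded.start:grounded.end])
print(ungrounded.start)
```

The point of keeping offsets rather than just the extracted string is that every value can be highlighted in context, and anything that cannot be located in the source is easy to flag for review.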
Pros & Cons
Pros
- Apache-2.0: permissive, OSI-approved, no copyleft
- Provider-agnostic: cloud (Gemini/OpenAI/Vertex) or fully local via Ollama with no API key
- Source-grounding and an out-of-the-box HTML visualization are a genuine differentiator
Cons
- Cloud models require an external LLM API, which means ongoing token costs and text leaving your machine (fully local operation is possible only via Ollama)
- The README states plainly 'this is not an officially supported Google product' - no SLA
- Accuracy claims come from the project itself, and results depend on the chosen model, prompt and examples
License
Apache-2.0 (OSI-approved)
When it is interesting
Turning documents, reports or notes into structured data with traceable provenance.
When it is too early
If you need a supported product with guarantees, or cannot send text to a cloud model and do not want to run Ollama locally.
This repo featured in the 2026-06 edition of the Open-Source AI Radar.
LEANN
StarTrail-org
RAG on everything - graph-based vector index claiming 97% storage savings for private on-device search.
turbovec
RyanCodrai
Rust vector index with TurboQuant compression (ICLR 2026) - SIMD kernels, online ingest.
chandra
datalab-to
High-accuracy document digitization (OCR/layout) with code and an open model.