chandra
datalab-to
High-accuracy document digitization (OCR/layout) with code and an open model.
What is chandra?
A document digitization model for demanding OCR and layout extraction, usable with a GPU locally or through Datalab's managed API.
Pros & Cons
Pros
- Broad coverage: tables, forms, handwriting, and 90+ languages
- Usable both locally (HuggingFace) and as a hosted API
- Backed by an established team (Marker/Surya)
Cons
- The model uses a Modified OpenRAIL-M license: free only for research, personal use, and startups under $2M, so it is not unrestricted, OSI-style open source
- A GPU is effectively required for local use
- Benchmark claims are self-reported
License
Code: Apache-2.0. Model: Modified OpenRAIL-M (open weight, with a revenue/use condition).
Check the model license carefully before commercial use, especially commercial self-use above the $2M threshold.
When it is interesting
Demanding document digitization with a GPU or via the API.
When it is too early
Commercial self-use above the $2M threshold (check the model license carefully).
Commercial alternative & related
- Commercial counterpart: Datalab API
This repo was featured in the 2026-06 edition of the Open-Source AI Radar.
langextract
Python library from Google for LLM-powered structured extraction with source grounding.
LEANN
StarTrail-org
RAG on everything - graph-based vector index claiming 97% storage savings for private on-device search.
turbovec
RyanCodrai
Rust vector index with TurboQuant compression (ICLR 2026) - SIMD kernels, online ingest.