Open weight, with conditions

Vectors, documents and extraction

chandra

datalab-to

High-accuracy document digitization (OCR/layout) with code and an open model.

11.1k stars (as of 2026-06-05)

Overview

What is chandra?

A document digitization model for demanding OCR and layout extraction, usable with a GPU locally or through Datalab's managed API.

Analysis

Pros & Cons

Pros

Very broad coverage: tables, forms, handwriting, and 90+ languages
Usable both locally (HuggingFace) and as a hosted API
Backed by an established team (Marker/Surya)

Cons

The model is licensed under Modified OpenRAIL-M: free only for research, personal use, and startups under the $2M threshold, so it is not unrestricted OSI-open
A GPU is effectively required for local use
Benchmark claims are self-reported

License

Code: Apache-2.0. Model: Modified OpenRAIL-M (open weight, with conditions).

The model license carries a revenue/use condition. Check it carefully before commercial use, especially commercial self-use above the $2M threshold.

When it is interesting

Demanding document digitization with a GPU or via the API.

When it is too early

Commercial self-use above the $2M threshold (check the model license carefully).

Context

Commercial alternative & related

Commercial counterpart: Datalab API

This repo featured in the 2026-06 edition of the Open-Source AI Radar.

Similar repositories

langextract

google

36.8k

Python library from Google for LLM-powered structured extraction with source grounding.

OSI-open

Vectors, documents and extraction

LEANN

StarTrail-org

11.9k

RAG on everything - graph-based vector index claiming 97% storage savings for private on-device search.

OSI-open

Vectors, documents and extraction

turbovec

RyanCodrai

11.5k

Rust vector index with TurboQuant compression (ICLR 2026) - SIMD kernels, online ingest.

OSI-open

Vectors, documents and extraction