OSI-open | Vectors, documents and extraction

PDF Oxide

yfedoseev

Rust-core PDF toolkit with 7 language bindings - extraction, markdown conversion and an MCP server.

825 stars (as of 2026-06-14)

Overview

What is PDF Oxide?

PDF Oxide is a Rust-native PDF library for text/image extraction, markdown/HTML conversion, creation, editing, merging, splitting, watermarking and forms. Bindings cover Python, Go, JS/TS, .NET, Java/Kotlin and WebAssembly, plus a CLI and an MCP server. The project reports a mean of 0.8 ms per document and a 5-29x speedup over common Python libraries, validated on 3,830 test PDFs. These figures are the project's own.

Analysis

Pros & Cons

Pros

Broad language coverage (7 bindings + CLI + MCP) from one Rust core
70 releases and a 100% pass rate on 3,830 diverse PDFs suggest real reliability
MCP server is a direct on-ramp for RAG document pipelines

Cons

Low star count relative to scope - community support and longevity less proven
Speed figures are self-reported with no linked independent benchmark
Markdown quality on complex tables/multi-column layouts not demonstrated
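
Because the speed figures are self-reported, one way to check them on your own corpus is a small timing harness. The sketch below is a minimal, library-agnostic example: `mean_ms_per_document` and `speedup` are hypothetical helper names, and `extract` stands for whatever extraction call you wrap (PDF Oxide, PyMuPDF, or another library). No specific PDF Oxide API is assumed here.

```python
import statistics
import time
from typing import Callable, Iterable


def mean_ms_per_document(
    extract: Callable[[str], object],
    paths: Iterable[str],
    repeats: int = 3,
) -> float:
    """Return the mean wall-clock time in milliseconds per document.

    `extract` is any callable that takes a file path and performs the
    extraction under test (an assumption: wrap your library call in it).
    """
    per_document = []
    for path in paths:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            extract(path)
            timings.append((time.perf_counter() - start) * 1000.0)
        # Keep the fastest run per document to reduce scheduling noise.
        per_document.append(min(timings))
    if not per_document:
        raise ValueError("no documents to benchmark")
    return statistics.mean(per_document)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms
```

Running both libraries through the same harness, on the same files and machine, gives a ratio that can be compared with the 5-29x range quoted above. Taking the fastest of several runs per document is a design choice, not the project's methodology. The result also depends heavily on the mix of PDFs in the corpus.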

License

MIT OR Apache-2.0 (OSI-open)

When it is interesting

Building document-ingestion pipelines for RAG where PDF extraction speed and multi-language support matter.

When it is too early

If you need battle-tested handling of malformed or scanned PDFs - PyMuPDF has a larger edge-case community.

Context

Commercial alternative & related

Commercial counterpart: LlamaParse / AWS Textract

This repo was featured in the 2026-07 edition of the Open-Source AI Radar.

Similar repositories

langextract

google

36.8k

Python library from Google for LLM-powered structured extraction with source grounding.

OSI-open | Vectors, documents and extraction

LEANN

StarTrail-org

11.9k

RAG on everything - graph-based vector index claiming 97% storage savings for private on-device search.

OSI-open | Vectors, documents and extraction

turbovec

RyanCodrai

11.5k

Rust vector index with TurboQuant compression (ICLR 2026) - SIMD kernels, online ingest.

OSI-open | Vectors, documents and extraction