PDF Oxide
yfedoseev
Rust-core PDF toolkit with 7 language bindings - extraction, markdown conversion and an MCP server.
What is PDF Oxide?
PDF Oxide is a Rust-native PDF library for text/image extraction, markdown/HTML conversion, creation, editing, merging, splitting, watermarking and forms. Bindings cover Python, Go, JS/TS, .NET, Java/Kotlin and WebAssembly, plus a CLI and an MCP server. The project reports a mean of 0.8 ms per document, which it says is 5-29x faster than common Python libraries, and states that the library is validated on 3,830 test PDFs.
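Since the speed figures come from the project itself, a reader may want to measure them on their own files. The sketch below is one minimal way to do that, not the project's benchmark. The names `mean_ms_per_document` and `speedup` are hypothetical, and `extract` stands for any callable that takes PDF bytes and returns text; no PDF Oxide API is assumed here.

```python
import statistics
import time
from typing import Callable, Iterable


def mean_ms_per_document(
    extract: Callable[[bytes], str],
    documents: Iterable[bytes],
) -> float:
    """Mean wall-clock time, in milliseconds, that `extract` takes per document.

    `extract` is a placeholder for whichever extractor you wrap
    (an assumption for illustration, not a real library call).
    """
    timings_ms = []
    for data in documents:
        start = time.perf_counter()
        extract(data)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    if not timings_ms:
        raise ValueError("no documents to benchmark")
    return statistics.mean(timings_ms)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    if candidate_ms <= 0:
        raise ValueError("candidate_ms must be positive")
    return baseline_ms / candidate_ms
```

Run the same document set through each extractor you want to compare, then pass the two means to `speedup`. Results depend heavily on the PDFs used, so a mix of simple, scanned and malformed files gives a fairer picture than a single sample.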
Pros & Cons
Pros
- Broad language coverage (7 bindings + CLI + MCP) from one Rust core
- 70 releases and a 100% pass rate on 3,830 diverse PDFs suggest real reliability
- MCP server is a direct on-ramp for RAG document pipelines
Cons
- Low star count relative to scope - community support and longevity less proven
- Speed figures are self-reported with no linked independent benchmark
- Markdown quality on complex tables/multi-column layouts not demonstrated
License
MIT OR Apache-2.0 (OSI-open)
When it is interesting
Building document-ingestion pipelines for RAG where PDF extraction speed and multi-language support matter.
When it is too early
If you need battle-tested handling of malformed or scanned PDFs - PyMuPDF has a larger edge-case community.
Commercial alternative & related
- Commercial counterpart: LlamaParse / AWS Textract
This repo featured in the 2026-07 edition of the Open-Source AI Radar.
langextract
Python library from Google for LLM-powered structured extraction with source grounding.
LEANN
StarTrail-org
RAG on everything - graph-based vector index claiming 97% storage savings for private on-device search.
turbovec
RyanCodrai
Rust vector index with TurboQuant compression (ICLR 2026) - SIMD kernels, online ingest.