AI Tool Radar
OSI-open · Vectors, documents and extraction

PDF Oxide

yfedoseev

Rust-core PDF toolkit with 7 language bindings - extraction, markdown conversion and an MCP server.

825 stars (as of 2026-06-14)

What is PDF Oxide?

PDF Oxide is a Rust-native PDF library for text/image extraction, markdown/HTML conversion, creation, editing, merging, splitting, watermarking and forms. Bindings cover Python, Go, JS/TS, .NET, Java/Kotlin and WebAssembly, plus a CLI and an MCP server. By the project's own figures, it averages 0.8 ms per document and runs 5-29x faster than common Python libraries, validated on 3,830 test PDFs.

Pros & Cons

Pros

  • Broad language coverage (7 bindings + CLI + MCP) from one Rust core
  • 70 releases and a 100% pass rate on 3,830 diverse PDFs suggest real reliability
  • MCP server is a direct on-ramp for RAG document pipelines

Cons

  • Low star count relative to scope - community support and longevity less proven
  • Speed figures are self-reported with no linked independent benchmark
  • Markdown quality on complex tables/multi-column layouts not demonstrated
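Because the speed and pass-rate figures are self-reported, one way to check them on your own corpus is a small timing harness. The sketch below is a minimal example, not the project's benchmark: `benchmark` and `speedup` are hypothetical helper names, and `extract` stands for any callable you supply that turns PDF bytes into text (no PDF Oxide or PyMuPDF API is assumed here).

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterable


def benchmark(extract: Callable[[bytes], str], documents: Iterable[bytes]) -> Dict[str, float]:
    """Run `extract` on every document and report pass rate and mean latency.

    `extract` is a placeholder for whichever extraction call you want to test;
    a document counts as failed if the callable raises an exception.
    """
    timings_ms = []
    total = 0
    failures = 0
    for data in documents:
        total += 1
        start = time.perf_counter()
        try:
            extract(data)
        except Exception:
            failures += 1
            continue
        # Only successful extractions contribute to the mean.
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "documents": total,
        "pass_rate": (total - failures) / total if total else 0.0,
        "mean_ms": mean(timings_ms) if timings_ms else float("nan"),
    }


def speedup(baseline_mean_ms: float, candidate_mean_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_mean_ms / candidate_mean_ms
```

Running `benchmark` once per library on the same set of files, then passing the two `mean_ms` values to `speedup`, gives a figure directly comparable to the "5-29x" range quoted above.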

License

MIT OR Apache-2.0 (OSI-open)

When it is interesting

Building document-ingestion pipelines for RAG where PDF extraction speed and multi-language support matter.

When it is too early

If you need battle-tested handling of malformed or scanned PDFs, PyMuPDF has a larger community and a longer track record with edge cases.

Commercial alternative & related

  • Commercial counterpart: LlamaParse / AWS Textract

This repo was featured in the 2026-07 edition of the Open-Source AI Radar.