doc2md: Solving PDF Reading in Claude Code (15K lines, multi-extractor, QC pipeline) #813

neoncapy · 2026-02-22T20:57:40Z


The Problem

Anthropic's copyright filter blocks most PDF content in Claude Code. If you work with academic papers, regulatory documents, or corporate presentations, you have likely hit this wall. Your agent reads a PDF and gets partial or empty results.

doc2md

doc2md is a 15,000-line Python pipeline that converts PDF, DOCX, and PPTX files into high-fidelity Markdown with full image extraction and multi-stage quality control.

How It Works

2-tier architecture:

Python tier (zero LLM tokens) — extracts text, tables, and images using multiple extractors
Claude vision tier — 8 expert personas analyze extracted images with document-aware context

Multi-extractor support:

pymupdf4llm (default)
pdfplumber (cross-validation)
MinerU (complex layouts with tables/figures)
Automatic fallback when one extractor struggles
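
For readers wondering what "automatic fallback" could look like in practice, here is a minimal sketch. It is not doc2md's actual code: the `looks_usable` check, the 200-character threshold, and the function names are assumptions made for illustration. Each extractor is any callable that takes a file path and returns Markdown.

```python
from typing import Callable, List, Tuple

Extractor = Callable[[str], str]  # takes a file path, returns Markdown text


def looks_usable(markdown: str, min_chars: int = 200) -> bool:
    """Hypothetical quality gate: enough non-whitespace text came back."""
    return len("".join(markdown.split())) >= min_chars


def extract_with_fallback(
    path: str,
    extractors: List[Tuple[str, Extractor]],
    min_chars: int = 200,
) -> Tuple[str, str]:
    """Try each (name, extractor) in order and return the first usable result."""
    errors = []
    for name, extract in extractors:
        try:
            markdown = extract(path)
        except Exception as exc:  # an extractor crashing counts as "struggling"
            errors.append(f"{name}: {exc}")
            continue
        if looks_usable(markdown, min_chars):
            return name, markdown
        errors.append(f"{name}: output too short")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

In real use the first entry might wrap `pymupdf4llm.to_markdown(path)`, with the pdfplumber and MinerU wrappers listed after it. How doc2md actually decides that an extractor is struggling is not stated in this post.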

Per-image classification:
8 heuristics classify every image as substantive or decorative. The signals include file size, pixel variance, aspect ratio, color count, vector content detection, journal branding patterns, and near-black detection.
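
To make the idea concrete, here is a rough sketch of how a few of these signals could be combined. Everything in it is an assumption for illustration: the `ImageInfo` type, the thresholds, and the use of distinct grayscale levels as a stand-in for color count are not taken from doc2md. Vector content and journal branding checks are left out.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import List


@dataclass
class ImageInfo:
    """Hypothetical container for the measurements the heuristics need."""
    file_size: int          # bytes on disk
    width: int              # pixels
    height: int             # pixels
    gray_pixels: List[int]  # flattened grayscale values, 0-255


def classify_image(img: ImageInfo) -> str:
    """Return 'decorative' or 'substantive' using illustrative thresholds."""
    aspect = max(img.width, img.height) / max(1, min(img.width, img.height))
    variance = pvariance(img.gray_pixels) if img.gray_pixels else 0.0
    mean = sum(img.gray_pixels) / max(1, len(img.gray_pixels))
    distinct = len(set(img.gray_pixels))  # stand-in for color count

    if img.file_size < 2_000:  # tiny files are usually icons or bullets
        return "decorative"
    if aspect > 8:             # long, thin images are usually rules or banners
        return "decorative"
    if variance < 25:          # nearly flat images carry little detail
        return "decorative"
    if mean < 10:              # near-black
        return "decorative"
    if distinct < 4:           # too few levels to be a chart or photo
        return "decorative"
    return "substantive"
```

Any single failed check marks the image as decorative here. A weighted score would be an equally reasonable design, and the post does not say which approach the project takes.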

QC pipeline:
Structural QC catches table collapse, missing headings, and dropped content. It re-runs after each fix until no issues remain.
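
A check-and-repair loop of this kind might look roughly like the sketch below. The two checks, the `repair` callback, and the `max_rounds` safety bound are assumptions added for illustration. The post only says the real pipeline loops until zero issues remain.

```python
from typing import Callable, List, Tuple


def find_issues(markdown: str) -> List[str]:
    """Return structural issues found in a Markdown string."""
    issues = []
    lines = markdown.splitlines()
    if not any(line.startswith("#") for line in lines):
        issues.append("missing headings")
    # Table collapse: consecutive pipe rows whose cell counts disagree.
    previous = None
    for number, line in enumerate(lines, start=1):
        row = line.strip()
        if row.startswith("|"):
            cells = row.strip("|").count("|") + 1
            if previous is not None and cells != previous:
                issues.append(f"table collapse near line {number}")
            previous = cells
        else:
            previous = None
    return issues


def qc_loop(
    markdown: str,
    repair: Callable[[str, List[str]], str],
    max_rounds: int = 5,
) -> Tuple[str, List[str]]:
    """Re-check after each repair pass until clean or out of rounds."""
    for _ in range(max_rounds):
        issues = find_issues(markdown)
        if not issues:
            return markdown, []
        markdown = repair(markdown, issues)  # hypothetical fixer callback
    return markdown, find_issues(markdown)
```

The second return value lists whatever is still wrong when the loop gives up, so a caller can tell a clean result from an exhausted one.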

Claude Code Integration

PreToolUse hook intercepts PDF/DOCX/PPTX reads and redirects to converted Markdown
SHA-256 conversion registry tracks every file
SKILL.md provides the full pipeline as a slash command
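
For anyone curious how the hook and the registry could fit together, here is a hedged sketch. It assumes the hook receives a JSON event on stdin with `tool_name` and `tool_input.file_path`, and that exit code 2 blocks the call and returns stderr to the agent, which is how Claude Code hooks are commonly described. The registry file name and its hash-to-path layout are hypothetical, not doc2md's format.

```python
import hashlib
import json
import sys
from pathlib import Path
from typing import Tuple

CONVERTIBLE = {".pdf", ".docx", ".pptx"}
REGISTRY_PATH = Path("conversion_registry.json")  # hypothetical location


def sha256_of(path: Path) -> str:
    """Hash file contents in chunks so large documents are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def decide(event: dict, registry: dict) -> Tuple[int, str]:
    """Return (exit_code, message) for one PreToolUse event."""
    if event.get("tool_name") != "Read":
        return 0, ""
    source = Path(event.get("tool_input", {}).get("file_path", ""))
    if source.suffix.lower() not in CONVERTIBLE or not source.is_file():
        return 0, ""
    markdown = registry.get(sha256_of(source))  # assumed layout: hash -> .md path
    if markdown:
        return 2, f"Read the converted Markdown instead: {markdown}"
    return 2, f"No conversion found for {source}; run the converter first."


def main() -> None:
    """Entry point for the hook script: read the event, answer via exit code."""
    registry = {}
    if REGISTRY_PATH.exists():
        registry = json.loads(REGISTRY_PATH.read_text())
    code, message = decide(json.load(sys.stdin), registry)
    if message:
        print(message, file=sys.stderr)
    sys.exit(code)
```

A hook script would call `main()` under an `if __name__ == "__main__":` guard. Keying the registry on the content hash rather than the path means a renamed file still maps to its existing conversion, while an edited file is treated as new.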

What It Handles

Scientific papers with complex tables and figures
Pharmaceutical regulatory documents
Corporate PPTX presentations with charts and SmartArt
DOCX files with merged cells, images, and nested tables

MIT licensed. Feedback welcome — what document types cause you the most problems?

https://github.com/orangefineblue/doc2md

hesreallyhim (Maintainer) · 2026-02-22T21:36:58Z

Please don't open any more self-promotion stuff.
