Replies: 1 comment
Please don't open any more self-promotion stuff.
The Problem
Anthropic's copyright filter blocks most PDF content in Claude Code. If you work with academic papers, regulatory documents, or corporate presentations, you have likely hit this wall. Your agent reads a PDF and gets partial or empty results.
doc2md
doc2md is a 15,000-line Python pipeline that converts PDF, DOCX, and PPTX files into high-fidelity Markdown with full image extraction and multi-stage quality control.
How It Works
2-tier architecture:
Multi-extractor support:
Per-image classification:
8 heuristics classify every image as substantive or decorative, including file size, pixel variance, aspect ratio, color count, vector content detection, journal branding patterns, and near-black detection.
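To make the idea concrete, here is a minimal sketch of how a few of these heuristics could be combined. It is not doc2md's actual code: the `ImageStats` container, the function names, every threshold, and the "any signal means decorative" rule are assumptions chosen for illustration, and it works on a plain list of grayscale values so it needs no imaging library.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass
class ImageStats:
    """Hypothetical summary of one extracted image."""
    file_size: int     # bytes on disk
    width: int         # pixels
    height: int        # pixels
    pixels: list[int]  # grayscale values 0-255, row-major


def decorative_signals(stats: ImageStats) -> list[str]:
    """Return the names of the heuristics that flag the image as decorative."""
    if not stats.pixels:
        return ["empty"]
    aspect = max(stats.width, stats.height) / max(1, min(stats.width, stats.height))
    mean = sum(stats.pixels) / len(stats.pixels)
    # All thresholds below are illustrative guesses, not tuned values.
    checks = {
        "tiny_file": stats.file_size < 2_000,
        "low_variance": pvariance(stats.pixels) < 25.0,
        "extreme_aspect": aspect > 8.0,
        "few_colors": len(set(stats.pixels)) < 4,
        "near_black": mean < 10,
    }
    return [name for name, hit in checks.items() if hit]


def classify_image(stats: ImageStats) -> str:
    """Assumed rule: one decorative signal is enough to drop the image."""
    return "decorative" if decorative_signals(stats) else "substantive"


# A thin black rule versus a small image with varied content.
rule = ImageStats(file_size=500, width=800, height=4, pixels=[0] * 3200)
chart = ImageStats(
    file_size=50_000, width=40, height=30,
    pixels=[(i * 7) % 256 for i in range(1200)],
)
print(classify_image(rule), classify_image(chart))
```

Returning the list of triggered signals, rather than only the verdict, makes it easier to see why an image was discarded when a heuristic misfires.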
QC pipeline:
Structural QC catches table collapse, missing headings, and dropped content. The pipeline loops until zero issues remain.
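A check-and-repair loop of this kind might look roughly like the sketch below. This is an assumption about the general shape, not the project's implementation: `find_issues`, `run_qc`, the `repair` callback, and the `max_rounds` safety cap are hypothetical names, and only two of the checks (missing headings and mismatched table column counts) are shown.

```python
from typing import Callable


def find_issues(markdown: str, expected_headings: list[str]) -> list[str]:
    """Report missing headings and table rows whose column count changes."""
    issues: list[str] = []
    lines = markdown.splitlines()

    # Missing headings: compare against headings known from the source file.
    present = {ln.lstrip("#").strip() for ln in lines if ln.startswith("#")}
    for heading in expected_headings:
        if heading not in present:
            issues.append(f"missing heading: {heading}")

    # Table collapse: adjacent pipe rows should have the same column count.
    prev_cols = None
    for number, ln in enumerate(lines, start=1):
        row = ln.strip()
        if row.startswith("|"):
            cols = row.strip("|").count("|") + 1
            if prev_cols is not None and cols != prev_cols:
                issues.append(f"table column mismatch at line {number}")
            prev_cols = cols
        else:
            prev_cols = None
    return issues


def run_qc(
    markdown: str,
    expected_headings: list[str],
    repair: Callable[[str, list[str]], str],
    max_rounds: int = 5,
) -> tuple[str, list[str]]:
    """Check, repair, and re-check until clean or max_rounds is reached."""
    for _ in range(max_rounds):
        issues = find_issues(markdown, expected_headings)
        if not issues:
            return markdown, []
        markdown = repair(markdown, issues)
    # Report whatever is still wrong after the last repair attempt.
    return markdown, find_issues(markdown, expected_headings)
```

The cap on rounds is a design choice added here so that a repair step which never converges cannot loop forever. The caller can treat a non-empty issue list as a failed conversion.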
Claude Code Integration
What It Handles
MIT licensed. Feedback welcome: what document types cause you the most problems?
https://github.com/orangefineblue/doc2md