
Inside PDF Xtract: How YOLO Detects 11 Document Element Types in a Single Pass

A look at how PDF Xtract's YOLO-based detection model identifies and classifies images, tables, formulas, headings, and eight other element types — and why this approach outperforms traditional PDF parsers.

Most PDF extraction tools work by parsing the internal structure of a PDF file — reading text runs, embedded image objects, and line coordinates. This approach is fast and works well on simple, well-formed documents. But it fails on scanned PDFs (no text layer), complex layouts where elements overlap, and ambiguous content like a chart that could be a table or a diagram. PDF Xtract takes a fundamentally different approach: it treats PDF pages as images and runs object detection.

The Rendering Step

Every page is first rendered to a raster image using PyMuPDF (MuPDF under the hood). The rendering resolution directly affects detection accuracy and output quality — at higher DPI, small text in tables is clearer, formula symbols are legible, and the detection model has more pixels to work with. PDF Xtract supports 10–800 DPI, with 200–300 DPI as a practical sweet spot for most documents.

The YOLO Detection Model

The rendered page images are passed through a YOLO (You Only Look Once) object detection model fine-tuned on document layout data from the DocLayNet dataset. YOLO is a single-pass detector: it scans the image once and outputs bounding boxes with class labels and confidence scores simultaneously, making it fast enough for practical use even on multi-page documents. The model outputs eleven element categories, each with a bounding box and a confidence score between 0 and 1.
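For reference, these are DocLayNet's eleven layout classes, along with a minimal sketch of how raw detections might be filtered by confidence. The tuple format and the 0.25 threshold are illustrative assumptions, not PDF Xtract's actual interface:

```python
# The eleven DocLayNet layout classes a detector trained on it can emit.
DOCLAYNET_CLASSES = [
    "Caption", "Footnote", "Formula", "List-item", "Page-footer",
    "Page-header", "Picture", "Section-header", "Table", "Text", "Title",
]

def keep_confident(detections, min_conf=0.25):
    """Map raw (class_id, confidence, x1, y1, x2, y2) tuples to labeled
    boxes, dropping anything below the confidence threshold.

    Illustrative post-processing sketch; the tuple layout and default
    threshold are assumptions, not PDF Xtract's real API.
    """
    kept = []
    for cls_id, conf, x1, y1, x2, y2 in detections:
        if conf >= min_conf:
            kept.append((DOCLAYNET_CLASSES[cls_id], conf, (x1, y1, x2, y2)))
    return kept
```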

Accuracy on DocLayNet

The model was evaluated on a held-out validation set of 1,088 document images containing 17,040 annotated element instances from the DocLayNet dataset. The results:

  • 98.9% mAP@50 — nearly every element is located and classified correctly at an IoU threshold of 0.50 (the predicted box overlaps ground truth by at least half)
  • 89.3% mAP@50-95 — the stricter metric averaging across IoU thresholds from 0.50 to 0.95
  • 98% recall at the optimal confidence threshold — only 2% of elements are missed
  • Table-specific: 98.9% mAP@50, 93.0% mAP@50-95
  • Formula-specific: 99.0% mAP@50, 87.7% mAP@50-95
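The IoU (intersection-over-union) thresholds behind these metrics measure how tightly a predicted box overlaps the ground-truth box; a minimal implementation:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) pixel boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero area if the boxes are disjoint).
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0
```

A prediction counts as correct under mAP@50 when `iou(pred, truth) >= 0.5`; mAP@50-95 averages precision over thresholds 0.50, 0.55, …, 0.95, so loosely placed boxes are penalized even when they are "correct" at 0.50.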

Published benchmarks on DocLayNet report state-of-the-art general-purpose models at around 79.7% mAP. The PDF Xtract scores reflect fine-tuning for the extraction use case, which maximizes accuracy on the target document types; the two sets of figures are therefore not a like-for-like comparison.

From Bounding Box to Extracted File

Once bounding boxes are detected, each region is cropped from the high-resolution rendered image and saved at the requested DPI and format (JPG, PNG, or TIFF). This means output quality is a function of your requested resolution, not the original PDF's embedded resolution: text and vector content re-renders cleanly at any DPI, so a table that displayed at 96 DPI can be extracted at 400 DPI simply by re-rendering the page at higher resolution before cropping. (Raster images embedded in the PDF remain limited by their native pixel count.)
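A sketch of the coordinate mapping this implies, assuming detection ran on a page rendered at one DPI and the crop is taken from a re-render at another (the function name is an illustrative assumption):

```python
def scale_box(box, detect_dpi, output_dpi):
    """Scale a pixel-space (x1, y1, x2, y2) box from the DPI the detector
    saw to the DPI of the re-rendered page before cropping.

    Illustrative helper, not PDF Xtract's actual code.
    """
    s = output_dpi / detect_dpi
    x1, y1, x2, y2 = box
    return (round(x1 * s), round(y1 * s), round(x2 * s), round(y2 * s))
```

Cropping the scaled box from the 400 DPI render yields a sharp extraction even though detection ran on a smaller image.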

Why Not Just Parse the PDF Structure?

Structural parsing remains useful for simple cases: if you only need image blobs embedded as JPEG objects in a well-formed digital PDF, Poppler's pdfimages is faster and lossless. But for scanned documents, vector-drawn diagrams, complex table layouts, or mixed-content pages, visual detection handles what structural parsing cannot. It's also the only approach that can classify elements — telling you whether a detected region is a table, a figure, a formula, or a heading, rather than just that it's a non-text object.

Try PDF Xtract on your own documents at pdf-xtract.com — the free tier processes up to 2,000 pages/month. Desktop app with unlimited local extraction at pdf-xtract.com/download.