How to Extract Tables from PDFs Using the PDF Xtract API

Extracting tables from PDFs programmatically is one of the most common document processing tasks — and one of the most frustrating. Rule-based tools like Tabula and Camelot work well on simple, text-based PDFs, but fall apart on scanned documents, complex multi-column layouts, or tables embedded in dense page designs. PDF Xtract takes a different approach: it renders pages as images and runs a YOLO object detection model to find tables visually, regardless of whether there's an underlying text layer.

The API Endpoint

All extraction happens through a single POST endpoint. You upload your PDF as multipart form data along with parameters that control what gets extracted, at what resolution, and in what format.

bash

POST /api/extract-elements
Content-Type: multipart/form-data

File: <your PDF>
Categories: "Table"          # Extract only tables
ImageResolution: 300          # 300 DPI output images
ImageOutputFormat: "png"      # png | jpg | tiff
PageRange: "1-10"             # optional page range
OcrEnabled: false             # set true for text extraction

Python Example

python

import base64

import requests

# Upload the PDF as multipart form data and request tables only.
with open('report.pdf', 'rb') as f:
    response = requests.post(
        'https://pdf-xtract.com/api/extract-elements',
        headers={'Authorization': 'Bearer YOUR_API_KEY'},
        files={'File': ('report.pdf', f, 'application/pdf')},
        data={
            'Categories': 'Table',
            'ImageResolution': 300,
            'ImageOutputFormat': 'png',
            'OcrEnabled': False,
        }
    )

# Stop on HTTP errors (bad key, oversized file) before parsing the body.
response.raise_for_status()
result = response.json()
print(f"Extracted {len(result['Files'])} table(s) from {result['PageCount']} pages")

for item in result['Files']:
    # item['FileData'] is base64-encoded PNG when StoreFile=false
    img_bytes = base64.b64decode(item['FileData'])
    with open(f"table_{item['ElementId']}.png", 'wb') as out:
        out.write(img_bytes)

Response Structure

The response JSON contains extraction metadata and the extracted files themselves. When StoreFile is false (the default), each extracted element is returned as base64-encoded file data inline. When StoreFile is true, the API stores the files to S3 and returns URLs with an expiration timestamp — useful for large documents or when you want to inspect results in a browser before downloading.

ExtractionId — unique identifier for this extraction, linked to usage history
PageCount — number of pages processed
Files — array of extracted elements, each with Category, PageNumber, BoundingBox, and either FileData (base64) or Url
ExpiresAt — URL expiration timestamp (when StoreFile=true)
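
Because each entry in Files carries either FileData or Url, client code usually needs to handle both cases. The following is a minimal sketch under that assumption, not an official client: the helper name save_elements, the output file naming, and the extension argument are hypothetical and chosen only for illustration.

```python
import base64
from pathlib import Path


def save_elements(result, out_dir, extension="png"):
    """Write inline elements to disk and collect URLs for stored ones.

    Assumes the response shape described above: each entry in Files has
    either FileData (base64) or Url. The file naming scheme is made up,
    and `extension` should match the ImageOutputFormat you requested.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    saved, pending_urls = [], []
    for index, item in enumerate(result.get("Files", [])):
        if item.get("FileData"):
            # Inline mode (StoreFile=false): decode and write the bytes.
            category = item.get("Category", "element")
            page = item.get("PageNumber", 0)
            target = out_path / f"{category}_p{page}_{index}.{extension}"
            target.write_bytes(base64.b64decode(item["FileData"]))
            saved.append(target)
        elif item.get("Url"):
            # Stored mode (StoreFile=true): fetch these before ExpiresAt.
            pending_urls.append(item["Url"])
    return saved, pending_urls
```

Calling save_elements(response.json(), "tables") returns the paths it wrote plus any URLs still to be downloaded, so the same code works whichever StoreFile setting was used.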

Filtering and Confidence Tuning

The Categories parameter accepts a comma-separated list of element types. To extract only tables: Categories=Table. To extract both tables and figures: Categories=Table,Image. The confidence_threshold parameter (default 0.25) controls how confidently the model must identify an element before including it. Raise it (e.g., to 0.5) to reduce false positives on clean documents; lower it to catch more elements on complex or low-quality scans, at the cost of more false positives.
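
To make the filtering parameters concrete, here is a small sketch that builds the form fields and tallies what came back per category. Both helper names are hypothetical, and the 0 to 1 range check is an assumption based on how detection confidence scores normally work, not something the API documentation states.

```python
from collections import Counter


def build_form_fields(categories, confidence_threshold=0.25):
    """Build form fields for an extraction request (illustrative helper).

    Field names follow the text above. The 0..1 range check is an
    assumption about the threshold, not documented API behaviour.
    """
    if not categories:
        raise ValueError("at least one category is required")
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0 and 1")
    return {
        "Categories": ",".join(categories),
        "confidence_threshold": str(confidence_threshold),
    }


def count_by_category(result):
    """Count extracted elements per Category in a parsed response."""
    return Counter(item.get("Category") for item in result.get("Files", []))
```

The returned dictionary can be passed as the data argument of requests.post alongside the other fields. Comparing the counts from a run at 0.25 with a run at 0.5 is a quick way to see how many detections the stricter threshold drops.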

Sign up for a free API key at pdf-xtract.com. The Starter plan includes 2,000 pages/month free. The desktop app offers unlimited local extraction for $24/year.
