Extracting tables from PDFs programmatically is one of the most common document processing tasks — and one of the most frustrating. Rule-based tools like Tabula and Camelot work well on simple, text-based PDFs, but fall apart on scanned documents, complex multi-column layouts, or tables embedded in dense page designs. PDF Xtract takes a different approach: it renders pages as images and runs a YOLO object detection model to find tables visually, regardless of whether there's an underlying text layer.
The API Endpoint
All extraction happens through a single POST endpoint. You upload your PDF as multipart form data along with parameters that control what gets extracted, at what resolution, and in what format.
POST /api/extract-elements
Content-Type: multipart/form-data
File: <your PDF>
Categories: "Table" # Extract only tables
ImageResolution: 300 # 300 DPI output images
ImageOutputFormat: "png" # png | jpg | tiff
PageRange: "1-10" # optional page range
OcrEnabled: false # set true for text extraction
Python Example
import base64
import requests

# Upload the PDF as multipart form data and request tables only
with open('report.pdf', 'rb') as f:
    response = requests.post(
        'https://pdf-xtract.com/api/extract-elements',
        headers={'Authorization': 'Bearer YOUR_API_KEY'},
        files={'File': ('report.pdf', f, 'application/pdf')},
        data={
            'Categories': 'Table',
            'ImageResolution': 300,
            'ImageOutputFormat': 'png',
            'OcrEnabled': False,
        }
    )
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error body
result = response.json()
print(f"Extracted {len(result['Files'])} table(s) from {result['PageCount']} pages")
for item in result['Files']:
    # item['FileData'] is base64-encoded PNG when StoreFile=false
    img_bytes = base64.b64decode(item['FileData'])
    with open(f"table_{item['ElementId']}.png", 'wb') as out:
        out.write(img_bytes)
Response Structure
The response JSON contains extraction metadata and the extracted files themselves. When StoreFile is false (the default), each extracted element is returned as base64-encoded file data inline. When StoreFile is true, the API stores the files to S3 and returns URLs with an expiration timestamp — useful for large documents or when you want to inspect results in a browser before downloading.
- ExtractionId — unique identifier for this extraction, linked to usage history
- PageCount — number of pages processed
- Files — array of extracted elements, each with Category, PageNumber, BoundingBox, and either FileData (base64) or Url
- ExpiresAt — URL expiration timestamp (when StoreFile=true)
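Because each entry in Files carries either FileData or Url, client code usually has to branch on which one is present. The following is a minimal sketch that assumes the field names listed above. The helper name element_bytes and the injectable fetch callable are illustrative and not part of the API.

```python
import base64
from typing import Callable, Optional


def element_bytes(item: dict, fetch: Optional[Callable[[str], bytes]] = None) -> bytes:
    """Return the raw bytes of one entry from result['Files'].

    Covers both response shapes: inline base64 in 'FileData'
    (StoreFile=false) or a download link in 'Url' (StoreFile=true).
    `fetch` is a hypothetical hook that downloads a URL and returns bytes.
    """
    if item.get("FileData"):
        return base64.b64decode(item["FileData"])
    if item.get("Url"):
        if fetch is None:
            raise ValueError("a fetch callable is required for stored files")
        return fetch(item["Url"])
    raise KeyError("element has neither 'FileData' nor 'Url'")
```

With the requests library, one possible fetch is lambda url: requests.get(url, timeout=30).content. Stored URLs expire, so it is safer to download them soon after the extraction rather than keeping the links around.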
Filtering and Confidence Tuning
The Categories parameter accepts a comma-separated list of element types. To extract only tables: Categories=Table. To extract both tables and figures: Categories=Table,Image. The confidence_threshold parameter (default 0.25) controls how confidently the model must identify an element before including it. Raise it (e.g., 0.5) to reduce false positives on clean documents; lower it to catch more elements on complex or low-quality scans.
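As a rough illustration of the two parameters described above, the form fields can be assembled by a small helper before the request is sent. This is a sketch under assumptions: build_form_fields is a hypothetical name, the 0 to 1 range check is a guess based on how detection confidences usually work, and the confidence_threshold field name is copied from the text as written.

```python
from typing import Iterable


def build_form_fields(categories: Iterable[str], confidence_threshold: float = 0.25) -> dict:
    """Build the `data` dict for a multipart request to /api/extract-elements."""
    # Assumption: the threshold is a probability-like score between 0 and 1.
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0 and 1")
    names = [c.strip() for c in categories if c.strip()]
    if not names:
        raise ValueError("at least one category is required")
    return {
        # The API expects a comma-separated list, e.g. "Table,Image".
        "Categories": ",".join(names),
        # Field name taken from the text above; its exact casing is an assumption.
        "confidence_threshold": confidence_threshold,
    }


# Stricter threshold for a clean, digitally produced document
clean_doc_fields = build_form_fields(["Table", "Image"], confidence_threshold=0.5)
# Looser threshold for a noisy scan, tables only
noisy_scan_fields = build_form_fields(["Table"], confidence_threshold=0.15)
```

Either dict can be passed as the data argument of requests.post alongside the file upload.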
All Supported Element Categories
- Image — photos, diagrams, illustrations
- Table — data tables
- Text — text blocks and paragraphs
- Title — document titles
- SectionTitle — section headings
- Caption — figure/table captions
- Footnote — footnotes
- Expression — mathematical formulas
- Entry — list items
- Header — page headers
- Footer — page footers
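When a request asks for several categories at once, it can help to catch misspelled names before sending and to tally what came back. The sketch below assumes the category names are case-sensitive and match the list above exactly. Both helper names are illustrative.

```python
from collections import Counter
from typing import Iterable, List

# Category names copied from the list above
SUPPORTED_CATEGORIES = {
    "Image", "Table", "Text", "Title", "SectionTitle", "Caption",
    "Footnote", "Expression", "Entry", "Header", "Footer",
}


def unknown_categories(requested: Iterable[str]) -> List[str]:
    """Return requested names that are not in the supported list, sorted."""
    return sorted(set(requested) - SUPPORTED_CATEGORIES)


def count_by_category(files: Iterable[dict]) -> Counter:
    """Count extracted elements per Category in result['Files']."""
    return Counter(item["Category"] for item in files)


# Sample entries shaped like result['Files']; the values are made up for illustration
sample_files = [
    {"Category": "Table", "PageNumber": 1},
    {"Category": "Table", "PageNumber": 3},
    {"Category": "Caption", "PageNumber": 3},
]
counts = count_by_category(sample_files)
typos = unknown_categories(["Table", "Figure"])
```

Here typos contains "Figure", which is not one of the listed names; the corresponding category in the list is Image.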