Validate Document
Submit a document for asynchronous validation
POST
Submit a base64-encoded document for asynchronous compliance validation.
Request Body
Base64-encoded document content (PDF, DOCX, etc.)
Pre-defined category for the document. Required if document_metadata is not provided.
Options:
- retail_investor_letter
- retail_fact_sheet_registered_fund
- retail_fact_sheet_non_registered
- pitch_book_registered_fund
- pitch_book_non_registered
- scenario_retail_investor_letter
- scenario_retail_fact_sheet_registered_fund
- scenario_retail_fact_sheet_non_registered
- scenario_pitch_book_registered_fund
- scenario_pitch_book_non_registered
Detailed metadata for precise rule matching. Required if document_category is not provided.
At least one of document_category or document_metadata must be provided.
Scanned PDF Support (OCR)
The validation service automatically handles scanned PDFs using AWS Textract OCR. No additional parameters are needed: OCR is triggered transparently when text extraction yields insufficient content.
How it works:
- The service first attempts standard text extraction via pypdf
- If a page yields fewer than 50 characters, it is classified as a scanned/image page
- Scanned pages are automatically sent to AWS Textract for OCR
- The OCR text is merged with any text-extracted pages before validation
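The page classification and merge steps above can be illustrated with a minimal sketch. It is not the service's implementation: the function names are hypothetical, the per-page strings stand in for whatever pypdf extraction returns, and counting characters after stripping whitespace is an assumption, since the source only states the 50-character threshold.

```python
from typing import Dict, List, Tuple

# From the description above: a page with fewer than 50 extracted
# characters is treated as a scanned/image page.
SCANNED_CHAR_THRESHOLD = 50


def split_scanned_pages(page_texts: List[str]) -> Tuple[List[int], List[int]]:
    """Return (text_page_indexes, scanned_page_indexes) for a document.

    Assumption: whitespace is stripped before counting characters.
    """
    text_pages: List[int] = []
    scanned_pages: List[int] = []
    for index, text in enumerate(page_texts):
        if len(text.strip()) < SCANNED_CHAR_THRESHOLD:
            scanned_pages.append(index)
        else:
            text_pages.append(index)
    return text_pages, scanned_pages


def merge_page_text(page_texts: List[str], ocr_texts: Dict[int, str]) -> List[str]:
    """Replace scanned pages with their OCR text, keeping text pages as-is.

    `ocr_texts` maps a page index to the OCR result for that page
    (hypothetical shape; the real OCR output format is not documented here).
    """
    return [ocr_texts.get(index, text) for index, text in enumerate(page_texts)]
```

For a mixed PDF, only the indexes in the second list would be sent to OCR, which matches the behavior table below.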
| Case | Behavior |
|---|---|
| Text-only PDF | Standard text extraction, no OCR |
| Fully scanned PDF | All pages sent to Textract OCR |
| Mixed PDF (text + scanned pages) | Only scanned pages are OCR’d, text pages kept as-is |

| Workflow | OCR Limit |
|---|---|
| POST /api/validate_stream/ (direct base64 upload) | Up to 50 pages, 10 MB per page (sync, page-by-page) |
| Presigned URL + POST /api/validate_stream_start/ | Up to 3,000 pages, 500 MB total (async via S3) |
Scanned PDFs may take longer to process due to OCR. For direct uploads via this endpoint, OCR is performed page-by-page (sync). For large scanned documents (50+ pages), use the presigned URL workflow, which enables asynchronous Textract processing with higher limits.
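A client could apply the OCR limits above before uploading a scanned document. The following is a rough sketch only: the function name and return values are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the source does not say how the limits are measured.

```python
# OCR limits from the table above. The byte conversion is an assumption.
MB = 1024 * 1024
DIRECT_MAX_PAGES = 50
DIRECT_MAX_PAGE_BYTES = 10 * MB
PRESIGNED_MAX_PAGES = 3000
PRESIGNED_MAX_TOTAL_BYTES = 500 * MB


def choose_ocr_workflow(page_count: int, largest_page_bytes: int, total_bytes: int) -> str:
    """Pick an upload workflow for a scanned PDF.

    Returns "direct" for POST /api/validate_stream/ or "presigned" for the
    presigned URL + POST /api/validate_stream_start/ workflow.
    Raises ValueError when the document exceeds both sets of limits.
    """
    if page_count <= DIRECT_MAX_PAGES and largest_page_bytes <= DIRECT_MAX_PAGE_BYTES:
        return "direct"
    if page_count <= PRESIGNED_MAX_PAGES and total_bytes <= PRESIGNED_MAX_TOTAL_BYTES:
        return "presigned"
    raise ValueError("document exceeds the documented OCR limits")
```

For example, `choose_ocr_workflow(120, 2 * MB, 240 * MB)` selects the presigned workflow because the page count is above the direct-upload limit.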
Response
Unique identifier for the validation job
Job status: accepted
Status message
ISO 8601 timestamp
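The request body rules from the Request Body section can be sketched as a small payload builder. This is an illustration under stated assumptions, not the official client: the `document` key for the base64 content is a hypothetical name (the field name is not shown on this page), the function name is invented, and sending the payload as JSON is also an assumption. Only `document_category`, `document_metadata`, and the category options come from the source.

```python
import base64
from typing import Any, Dict, Optional

# Category options listed in the Request Body section.
DOCUMENT_CATEGORIES = frozenset({
    "retail_investor_letter",
    "retail_fact_sheet_registered_fund",
    "retail_fact_sheet_non_registered",
    "pitch_book_registered_fund",
    "pitch_book_non_registered",
    "scenario_retail_investor_letter",
    "scenario_retail_fact_sheet_registered_fund",
    "scenario_retail_fact_sheet_non_registered",
    "scenario_pitch_book_registered_fund",
    "scenario_pitch_book_non_registered",
})


def build_validation_payload(
    document_bytes: bytes,
    document_category: Optional[str] = None,
    document_metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a request body for POST /api/validate_stream/."""
    # At least one of document_category or document_metadata must be provided.
    if document_category is None and document_metadata is None:
        raise ValueError("provide document_category or document_metadata")
    if document_category is not None and document_category not in DOCUMENT_CATEGORIES:
        raise ValueError(f"unknown document_category: {document_category}")

    # "document" is a hypothetical key name for the base64-encoded content.
    payload: Dict[str, Any] = {
        "document": base64.b64encode(document_bytes).decode("ascii"),
    }
    if document_category is not None:
        payload["document_category"] = document_category
    if document_metadata is not None:
        payload["document_metadata"] = document_metadata
    return payload
```

Validating the category and the either-or rule on the client avoids a round trip for requests the service would presumably reject.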

