OCR & Metadata

OCR stands for Optical Character Recognition, the technique of translating an image of text, obtained through scanning, faxing, or another imaging system, into the machine-readable text data used in computing.

In Udocx, scanned documents are converted to PDF/A files with optional OCR text.  When OCR is enabled, the resulting PDF may be searched, text may be copied and pasted into other documents, and the PDF may be indexed by document management systems for future reference.


Metadata is data about data.  It includes information embedded within a document file that is not part of the document's visible content, as well as information that a document management system associates with the document.  For example, most computer operating systems store date/time stamps recording when each file on the system was created, last accessed, and last modified.
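
The file system timestamps mentioned above can be read with a few lines of Python.  This is a minimal sketch using only the standard library; the function name file_timestamps and the dictionary keys are illustrative choices and are not part of Udocx.  Which timestamps are available, and what they mean, varies by operating system.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_timestamps(path):
    """Return the file system timestamps for path as UTC datetimes."""
    st = os.stat(path)

    def to_utc(seconds):
        return datetime.fromtimestamp(seconds, tz=timezone.utc)

    return {
        "accessed": to_utc(st.st_atime),
        "modified": to_utc(st.st_mtime),
        # st_ctime is the creation time on Windows, but the time of the
        # last metadata change on most Unix-like systems.
        "changed_or_created": to_utc(st.st_ctime),
    }


# Usage: create a throwaway file and inspect its timestamps.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example")
    sample_path = tmp.name

stamps = file_timestamps(sample_path)
os.remove(sample_path)

for name, value in stamps.items():
    print(name, value.isoformat())
```

None of these values are stored inside the file's contents; the operating system keeps them alongside the file, which is what makes them metadata.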

A familiar example of metadata comes from the world of digital photography, where the EXIF information embedded in an image file is the metadata.  EXIF information includes the camera manufacturer and model, the orientation of the camera when the photo was taken, exposure time, f-stop, flash status, focal length, and more.  Some modern cameras even add geolocation information from a built-in GPS receiver to the EXIF data.

Udocx always embeds metadata in compliance with the PDF/A specification, which uses XMP (Extensible Metadata Platform), a metadata standard published by ISO.  As a result, metadata added by Udocx can be used by other applications such as Microsoft SharePoint.  Udocx even allows you to pass custom metadata to store locations such as SharePoint, for example whether a document has been reviewed or approved.
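
To give a feel for what XMP metadata looks like, the following Python sketch parses a small hand-written XMP packet using only the standard library.  The packet is an illustration and not actual Udocx output.  The pdfaid and dc namespaces are the standard PDF/A identification and Dublin Core schemas, while the ex namespace and its Approved property are hypothetical stand-ins for a custom field.  The function name read_xmp is also an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used when searching the packet.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdfaid": "http://www.aiim.org/pdfa/ns/id/",
    "ex": "http://example.com/ns/review/",  # hypothetical custom schema
}

# Hand-written sample packet; not produced by Udocx.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
        xmlns:ex="http://example.com/ns/review/">
      <pdfaid:part>1</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Scanned invoice</rdf:li>
        </rdf:Alt>
      </dc:title>
      <ex:Approved>true</ex:Approved>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_xmp(packet):
    """Extract a few properties from an XMP packet given as a string."""
    root = ET.fromstring(packet)
    description = root.find("rdf:RDF/rdf:Description", NS)

    def text(path):
        node = description.find(path, NS)
        return node.text.strip() if node is not None and node.text else None

    return {
        "pdfa_part": text("pdfaid:part"),
        "pdfa_conformance": text("pdfaid:conformance"),
        "title": text("dc:title/rdf:Alt/rdf:li"),
        "approved": text("ex:Approved"),  # hypothetical custom property
    }


metadata = read_xmp(SAMPLE_XMP)
print(metadata)
```

Because XMP is plain XML in a published format, any application that understands the schema can read the same fields, which is why metadata embedded this way can travel with the document from one system to another.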