Symphony OCR searches the document repository for documents to process, then sorts each document into one of several lists (available on the left side of the Symphony OCR web interface). The following diagram shows the tools and lists and explains how they interact:
Symphony OCR consists of three main tools that interact to provide full OCR services:
Finder - locates documents in your document repository
Analyzer - determines if a given document is a candidate for OCR
Processor - performs the actual OCR
As documents flow through Symphony OCR, each of the above components works on the document, then places it in a particular Document List, as described in the next section.
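The flow described above can be pictured as a small state machine. The sketch below is illustrative only: the function and parameter names are hypothetical, it covers only a subset of the lists named in this article, and it is not Symphony OCR's actual implementation.

```python
from enum import Enum


class DocList(Enum):
    """A subset of the document lists named in this article."""
    NEW = "New"
    PROCESSING = "Processing"
    REPROCESSING = "Reprocessing"
    PROCESSED = "Processed"
    CONTAINS_TEXT = "Contains text"
    NEEDS_ATTENTION = "Needs Attention"


def after_finder() -> DocList:
    # Assumption: a newly found document sits briefly in the New list.
    return DocList.NEW


def after_analyzer(is_ocr_candidate: bool) -> DocList:
    # Assumption: candidates wait for the Processor; for simplicity every
    # non-candidate is treated here as already text searchable.
    return DocList.PROCESSING if is_ocr_candidate else DocList.CONTAINS_TEXT


def after_processor(succeeded: bool, recoverable: bool = False) -> DocList:
    # Assumption: recoverable failures are retried later; others need review.
    if succeeded:
        return DocList.PROCESSED
    return DocList.REPROCESSING if recoverable else DocList.NEEDS_ATTENTION
```

In this simplified model, a document that fails OCR because a user had it open would land in Reprocessing rather than Needs Attention.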
The backlog consists of documents that have not been analyzed or OCRed. These are documents that Symphony OCR is still working on. The following document lists represent the backlog:
Analyzing - documents waiting for the Analyzer to determine whether they are candidates for OCR
In Process - documents that are currently being analyzed
Processing - documents that are candidates for OCR but have not been processed yet
Reprocessing - documents that had a recoverable problem during OCR and will be processed again later. Typical causes are that the document was open by a user or was modified while OCR was taking place
These lists represent documents that were successfully analyzed or OCRed. They are documents that have either been OCRed or were already text searchable (and thus not in need of OCR):
Processed - documents that have been successfully OCRed by Symphony OCR
In Process - documents that are currently being OCRed
Already OCRed - documents that were already OCRed (by some other processor or by an earlier version of Symphony OCR)
Contains text - documents that are already text searchable (no OCR needed)
No image or text - some rare documents contain neither text nor images. These generally do not need to be OCRed; however, you may choose to process them on a one-off basis. See "How to Process No Image or Text Documents" for instructions
Email Messages - contains the list of email messages that contained attachments that were processed by Symphony OCR
These lists represent documents that could not be processed for some reason. In most cases, an administrator will want to glance over these lists from time to time to ensure that there are no issues with the documents that didn't get processed:
Needs Attention: Documents in the Needs Attention list are those that appear to be eligible for OCR, but encountered problems during processing. Files in this list could be corrupted or contain invalid images (try opening them in an image viewer to be sure), or they may be image types that Symphony OCR does not handle yet. If the image appears in your viewer, contact your Symphony reseller to see if handling for the file can be added. If the document is corrupted, you can either remove the document from Worldox, or manually tell Symphony OCR to 'ignore' it, which will put it on the Ignored list. If the Needs Attention list contains any documents, the overall system condition will show as "Warn". Ignoring a document that you have already checked is a good way to change the system condition back to "OK".
New: Documents in the New list are those that have been found by the Finder tool, but not yet allocated to another document list (documents are only in the New state for a very short period of time).
Deleted: Documents in the Deleted list are those whose document record is in the process of being purged from the database. Documents should only be in this state for a very short period of time.
Too Old: Documents in the Too Old list are those that have a file modified date older than the cutoff age defined in the Processor configuration.
Inaccessible: Documents in the Inaccessible list are those that could not be processed because of file system security, Worldox security, read-only attributes or other conditions that prevent the document from being accessed and worked on. In addition, if the profile group in which the documents reside contains an invalid base path (containing a space, for example), or if the file has a space immediately prior to the document extension, the documents will be shown in the Inaccessible list.
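As a rough illustration of the Too Old rule, the check amounts to comparing a file's modified date against a cutoff. This is a minimal sketch, not Symphony OCR's code: the function name is hypothetical, and expressing the cutoff age in days is an assumption.

```python
from datetime import datetime, timedelta


def is_too_old(modified: datetime, now: datetime, cutoff_days: int) -> bool:
    """Return True if the file was modified more than cutoff_days before now.

    cutoff_days is a hypothetical stand-in for the cutoff age set in the
    Processor configuration; its real unit is not stated in this article.
    """
    return now - modified > timedelta(days=cutoff_days)
```

For example, with a cutoff of 365 days, a file last modified on 1 January 2015 would count as too old when checked on 1 January 2020.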
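To find files affected by the "space immediately prior to the document extension" condition before they reach the Inaccessible list, a check along these lines could be used. It is a sketch with a hypothetical function name; it looks only at the file name and does not reproduce any of Symphony OCR's other accessibility checks.

```python
import re


def has_space_before_extension(path: str) -> bool:
    """Return True if the file name has a space right before its extension."""
    # Keep only the file name, accepting either path separator.
    name = re.split(r"[\\/]", path)[-1]
    stem, dot, _ext = name.rpartition(".")
    # No dot means no extension, so the condition cannot apply.
    return dot == "." and stem.endswith(" ")
```

A name such as `report .pdf` is flagged, while `report.pdf` is not.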
Corrupted: Documents in the Corrupted list are those that Symphony OCR does not recognize as valid files. The most common reason is that the file is an invalid or corrupted PDF (try opening it in Adobe Acrobat to be sure). Another possibility is that there is some characteristic of the PDF that the Symphony OCR parsing algorithm isn't handling properly. Trumpet does periodically update the PDF parsing algorithms to address corner cases that have not been encountered before.
Try opening the file in Acrobat, then hit Save (Acrobat will try to open and auto-repair corrupted files - when you save the document, it will save uncorrupted). After saving and closing the document, click the Re-Analyze button on the document record in Symphony OCR. This will only work if the file is only lightly corrupted, but is worth a shot.
If that doesn't help, next check whether the file is already text searchable (i.e., can you already search for text inside the PDF?). If you can, then the document isn't a candidate for OCR anyway, and you can just move the document to the Ignored list.
If the document does need to be OCRed, and the Adobe repair doesn't help, then you may want to submit the document to us for analysis. Open a support ticket by emailing email@example.com and we will send information on how to securely upload the document to us. If we find a problem in our parsing algorithms, we'll fix the issue and get you a patch.
If a large number of files have the same corruption reason, and the files don't appear to actually be corrupted, please open a support ticket by emailing firstname.lastname@example.org and we will send information on how to securely upload a sample document to us. If we find a problem in our parsing algorithms, we'll fix the issue and get you a patch. Alternatively, you can use a bulk Ignore operation to move the documents to the Ignored list.
Encrypted / Restricted: Documents in the Encrypted/Restricted list are those that are restricted from being processed because of some characteristic of the file itself (for example, an encrypted or partially restricted PDF file will not be processed).
Ignored: Documents in the Ignored list are documents that a Symphony OCR administrator has explicitly told Symphony OCR not to process. Any document on this list was explicitly placed there by human intervention.
Wrong Type: Documents in the Wrong Type list are TIFF (.tif) documents that were found while TIFF processing is not enabled.
Moved / Unavailable: Documents in the Moved / Unavailable list are no longer available in the Document Management System (DMS). This could mean that the DMS has gone "offline" or that the DMS settings have been adjusted so that the documents are no longer found for processing (e.g., a user selected a profile group to analyze and OCR, and later un-checked that profile group). Documents can also appear in this list if they are no longer at their previous location. Document records in the Moved / Unavailable list are deleted from the database after 15 days.
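The 15-day retention can be expressed as a simple date comparison. The sketch below is an illustration only: the function name is hypothetical, and it assumes the 15 days are counted from the moment the record entered the Moved / Unavailable list, which this article does not state.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=15)  # retention period stated in this article


def is_due_for_deletion(listed_at: datetime, now: datetime) -> bool:
    """Return True once a record has been on the list for at least 15 days."""
    return now - listed_at >= RETENTION
```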
Digitally Signed: Documents that are digitally signed will not be processed by Symphony OCR because adding OCR information to these documents would invalidate the digital signature. If you wish to have these documents OCRed anyway (and are OK with invalidating the digital signature), please send an email to email@example.com and request that functionality be added.
Too Big: Documents that contain an individual page larger than 10,000 x 12,000 pixels.
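The Too Big rule can be read as a per-page dimension check. The following sketch assumes a page counts as too large when its width exceeds 10,000 pixels or its height exceeds 12,000 pixels; the article does not say how the two dimensions are combined, so treat that reading, and the function name, as assumptions.

```python
MAX_WIDTH_PX = 10_000   # limits taken from this article
MAX_HEIGHT_PX = 12_000


def is_too_big(page_sizes: list[tuple[int, int]]) -> bool:
    """Return True if any (width, height) page exceeds either limit."""
    return any(
        width > MAX_WIDTH_PX or height > MAX_HEIGHT_PX
        for width, height in page_sizes
    )
```

Under this reading, a single oversized page is enough to put the whole document in the Too Big list.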
Advanced Configuration Setting
If you wish to attempt to process documents that have an individual page larger than 10,000 x 12,000 pixels, you may opt to do so by updating the settings.xml file. Here's how:
Note: If Symphony OCR is not able to process these documents they may end up in the Needs Attention list.
See "Manipulating Document Lists" for more information on how to manage these lists.