Symphony OCR Workflow, Tools & Document Lists
Symphony OCR searches the document repository for documents to process. It then organizes those documents into one of several lists (available on the left side of the Symphony OCR web interface). The following diagram shows the tools and lists and explains how they interact:
Symphony OCR consists of three main tools that interact to provide full OCR services:
Finder - locates documents in your document repository
Analyzer - determines if a given document is a candidate for OCR
Processor - performs the actual OCR
As documents flow through Symphony OCR, each of the above components works on the document, then places it in a particular Document List, as described in the next section.
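The hand-off between the three tools can be pictured as a small state model. The Python sketch below is purely illustrative: the `DocList` enum and the `next_list` function are hypothetical names invented for this example and are not part of any Symphony OCR API. It only mirrors the order in which the Finder, Analyzer and Processor pass a document from one list to the next, using a subset of the lists described in this guide.

```python
from enum import Enum


class DocList(Enum):
    """A few of the document lists named in this guide (illustrative subset)."""
    NEW = "New"
    ANALYZING = "Analyzing"
    PROCESSING = "Processing"
    PROCESSED = "Processed"
    CONTAINS_TEXT = "Contains text"


def next_list(current: DocList, needs_ocr: bool = True) -> DocList:
    """Return the list a document moves to once the current tool is done.

    Hypothetical model of the flow: Finder -> Analyzer -> Processor.
    """
    if current is DocList.NEW:
        # The Finder located the document; it now waits for the Analyzer.
        return DocList.ANALYZING
    if current is DocList.ANALYZING:
        # The Analyzer decides whether the document is a candidate for OCR.
        return DocList.PROCESSING if needs_ocr else DocList.CONTAINS_TEXT
    if current is DocList.PROCESSING:
        # The Processor performed the OCR.
        return DocList.PROCESSED
    # Any other list is treated as final in this sketch.
    return current
```

In this simplified model a document that is already text searchable leaves the flow after analysis, while an OCR candidate continues to the Processor.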
The backlog consists of documents that have not been analyzed or OCRed. These are documents that Symphony OCR is still working on. The following document lists represent the backlog:
Analyzing - Documents waiting for the Analyzer to determine if they are candidates for OCR or not
In Process - Documents that are in the process of being Analyzed
Processing - Documents that are candidates for OCR but have not been processed yet
Reprocessing - Documents that had a recoverable problem during OCR and will be processed again later. Typical causes are that the document was open by a user or was modified while OCR was taking place
These lists represent documents that were successfully analyzed or OCRed. They are documents that have either been OCRed or were already text searchable (and thus not in need of OCR):
Processed - documents that have been successfully OCRed by Symphony OCR
In Process - documents that are currently being OCRed
Already OCRed - documents that were already OCRed (by some other processor or by an earlier version of Symphony OCR)
Contains text - documents that are already text searchable (no OCR needed)
No image or text - some rare documents contain neither text nor images. These generally do not need to be OCRed; however, you may choose to process them on a one-off basis. See "How to Process No Image or Text Documents" for instructions
Email Messages - contains the list of email messages that contained attachments that were processed by Symphony OCR
These lists represent documents that could not be processed for some reason. In most cases, an administrator will want to glance over these lists from time to time to ensure that there are no issues with the documents that didn't get processed:
Needs Attention: Documents in the 'Needs Attention' list are those that appear to be eligible for OCR, but encountered problems during processing. Files in this list could be corrupted or contain invalid images (try opening them in an image viewer to be sure), or they may be images that Symphony OCR does not handle yet.
Occasionally, a document can fall into the 'Needs Attention' list because of bad timing: Symphony OCR tried to process the document when it wasn't fully available. To rule this out, we always recommend clicking the "Show Bulk Operations" button and then "Re-Analyze All".
If the document is corrupted, you can either remove the document from Worldox, or manually tell Symphony OCR to "ignore" it, which will put it on the 'Ignored' list. If the 'Needs Attention' list contains any documents, the overall system condition will show as "Warn." Ignoring a document that you have already checked is a good way to change the system condition back to "OK".
If the document does not appear corrupted, the next step is to let us see a copy of the file. Because PDFs can be generated in countless different ways, we occasionally run into a specific sub-type of PDF that we have not encountered before. If we can get a copy of the file that is falling into the 'Needs Attention' list, we can, in almost all cases, add support for the file. Please contact us at support@trumpetinc.com for instructions to upload documents to our secure site.
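The relationship between the 'Needs Attention' list and the overall system condition can be sketched as follows. This is a minimal illustration, not Symphony OCR code: `system_condition` and `ignore_document` are hypothetical helpers, the lists are modeled as a plain dictionary, and the real system condition may depend on factors that this guide does not describe.

```python
def system_condition(lists: dict) -> str:
    """Return "Warn" if 'Needs Attention' holds any document, otherwise "OK"."""
    return "Warn" if lists.get("Needs Attention") else "OK"


def ignore_document(lists: dict, name: str) -> None:
    """Move a document that was already checked from 'Needs Attention' to 'Ignored'."""
    lists["Needs Attention"].remove(name)
    lists.setdefault("Ignored", []).append(name)


# Hypothetical example data: one document is waiting for review.
lists = {"Needs Attention": ["scan-001.pdf"], "Ignored": []}
print(system_condition(lists))  # Warn
ignore_document(lists, "scan-001.pdf")
print(system_condition(lists))  # OK
```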
New: Documents in the New list are those that have been found by the Finder tool, but not yet allocated to another document list (documents are only in the New state for a very short period of time).
Deleted: Documents in the Deleted list are those whose document record is in the process of being purged from the database (documents should only be in this state for a very short period of time).
Too Old: Documents in the Too Old list are those that have a file modified date older than the cut-off age defined in the Processor configuration.
Inaccessible: Documents in the Inaccessible list are those that could not be processed because of file system security, Worldox security, read-only attributes or other conditions that prevent the document from being accessed and worked on. In addition, if the profile group in which the documents reside contains an invalid base path (containing a space, for example), or if the file has a space immediately before the document extension, the documents will be shown in the Inaccessible list.
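One of the conditions above, a space immediately before the file extension, is easy to check for in a list of file names. The sketch below is an assumption-laden helper, not a Symphony OCR feature: `has_space_before_extension` is a hypothetical function and the file names are made-up examples. It uses only the standard `os.path.splitext` call.

```python
import os


def has_space_before_extension(filename: str) -> bool:
    """Return True if the name has an extension preceded directly by a space."""
    stem, ext = os.path.splitext(filename)
    return ext != "" and stem.endswith(" ")


# Hypothetical file names for illustration.
names = ["contract.pdf", "contract .pdf", "notes"]
flagged = [n for n in names if has_space_before_extension(n)]
print(flagged)  # ['contract .pdf']
```

Renaming the flagged files and then re-analyzing them is one way to test whether the file name was the cause.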
Corrupted Documents: Documents in the Corrupted list are those that Symphony OCR does not recognize as valid files. The most common reason is that the file is an invalid or corrupted PDF (try opening in Adobe to be sure). Another possibility is that there is some characteristic of the PDF that the Symphony OCR parsing algorithm isn't handling properly. Trumpet does periodically update the PDF parsing algorithms to address corner cases that have not been encountered before.
What to do?
Try opening the file in Acrobat, then hit Save (Acrobat will try to open and auto-repair corrupted files - when you save the document, it will save uncorrupted). After saving and closing the document, click the Re-Analyze button on the document record in Symphony OCR. This will only work if the file is only lightly corrupted, but is worth a shot.
If that doesn't help, next check to see if the file is already text searchable (i.e., can you already search for text inside the PDF?). If you can, then the document isn't a candidate for OCR anyway, and you can just move the document to the Ignored list.
If the document does need to be OCRed, and the Adobe repair doesn't help, then you may want to submit the document to us for analysis. Open a support ticket by emailing support@trumpetinc.com and we will send information on how to securely upload the document to us. If we find a problem in our parsing algorithms, we'll fix the issue and get you a patch.
If there are a large number of files that have the same corruption reason, and the files don't appear to actually be corrupted, please open a support ticket by emailing support@trumpetinc.com and we will send information on how to securely upload a sample document to us. If we find a problem in our parsing algorithms, we'll fix the issue and get you a patch. Alternatively, you can use a bulk Ignore operation to move the documents to the Ignored list.
Encrypted / Restricted: Documents in the Encrypted/Restricted list are those that are restricted from being processed because of some characteristic of the file itself (for example, an encrypted or partially restricted PDF file will not be processed).
Ignored: Documents in the Ignored list are documents that a Symphony OCR administrator has explicitly told Symphony OCR not to process. Any document on this list was explicitly placed there by human intervention.
Wrong Type: Documents in the Wrong Type list are TIFF (.tif) documents that were found while TIFF processing is not enabled.
Moved / Unavailable: Documents in the Moved / Unavailable list are no longer available in the Document Management System (DMS). This could mean that the DMS has gone "offline" or that the DMS settings have been adjusted so that the documents would no longer be found for processing (e.g., if a user selects a profile group to analyze and OCR, and then un-checks that profile group or chooses to no longer process it). Documents can also appear in the Moved / Unavailable list if they are no longer at their recorded location. Document records in the Moved / Unavailable list will be deleted from the database after 15 days.
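The 15-day retention rule can be expressed as a simple date comparison. The sketch below is illustrative only: `is_due_for_purge` is a hypothetical function, and it assumes that the 15 days are counted from the moment the record entered the Moved / Unavailable list and that a record is purged once exactly 15 days have passed. This guide does not specify either detail.

```python
from datetime import datetime, timedelta

# Retention period stated in this guide for the Moved / Unavailable list.
RETENTION = timedelta(days=15)


def is_due_for_purge(entered_list_at: datetime, now: datetime) -> bool:
    """Return True once the record has been in the list for 15 days or more.

    Assumption: the clock starts when the record entered the list.
    """
    return now - entered_list_at >= RETENTION


entered = datetime(2024, 1, 1)
print(is_due_for_purge(entered, datetime(2024, 1, 10)))  # False
print(is_due_for_purge(entered, datetime(2024, 1, 20)))  # True
```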
Digitally Signed: Documents that are digitally signed will not be processed by Symphony OCR because adding OCR information to these documents would invalidate the digital signature. If you wish to have these documents OCRed anyway (and are OK with invalidating the digital signature), please send an email to support@trumpetinc.com and request that functionality be added.
Too Big (8.0.0 and higher)
If a document falls into this list, it does NOT mean the document contains too many pages. Symphony OCR processes files one page at a time, so a document in this list contains one or more pages with pixel dimensions larger than a specified value. In this version of Symphony OCR that value is 32512 x 32512 pixels.
This is a hard limit and cannot be overridden.
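The per-page nature of this limit can be illustrated with a short check. This is a sketch under stated assumptions, not the product's implementation: `oversized_pages` is a hypothetical function, page sizes are given as (width, height) tuples in pixels, and a page is treated as too big when either dimension exceeds 32512, which is one reading of the limit described above.

```python
# Per-side pixel limit quoted in this guide for version 8.0.0 and higher.
MAX_PAGE_PIXELS = 32512


def oversized_pages(page_sizes):
    """Return the 1-based numbers of pages whose width or height exceeds the limit."""
    return [
        number
        for number, (width, height) in enumerate(page_sizes, start=1)
        if width > MAX_PAGE_PIXELS or height > MAX_PAGE_PIXELS
    ]


# A long document with ordinary pages is fine: page count does not matter.
print(oversized_pages([(2550, 3300)] * 500))  # []

# A two-page document with one very large drawing is flagged for page 2.
print(oversized_pages([(2550, 3300), (40000, 20000)]))  # [2]
```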
Too Big (Prior to 8.0.0)
If a document falls into this list, it does NOT mean the document is too big. Symphony OCR processes files one page at a time, so a document in this list contains one or more pages with pixel dimensions larger than a specified value (i.e., the page couldn't be loaded into memory). We usually see this in documents like blueprints or schematic drawings. There are some things we can do to try to get these types of documents processed, if you find that they need to be processed.
Clicking on the document in the 'Too Big' list will tell you the size of the offending page.
1) Click on the 'Too big' list.
2) Click on the individual document in question.
3) The offending size of the document is available in the document details.
If you find you have a series of documents of the same type, it is usually the same page size that is exceeding the limit. You can attempt to process these documents by modifying the value(s) declared in the setting.xml file. (Defaults differ depending on the version you're running.)
See Manipulating Document Lists for more information on how to manage these lists.