Training Manuals - Symphony Suite :: Symphony OCR - Administrator
Symphony OCR is part of Symphony Suite, The Complete Imaging Solution. Symphony OCR is a back-end OCR engine. It will locate all image-only PDF and TIF files in your document management system and convert them to fully text searchable PDFs by adding an invisible layer of text over the image. Symphony OCR typically runs on a back-end PC or server (for Worldox sites, this is typically the Indexer PC).
Tip: Turn OCR off at your scanners and see significant (2 to 5x) improvement in scanning speeds. In fact, Adobe Acrobat turns OCR on by default, and we strongly recommend turning it off. Let Symphony OCR take care of the OCR in the background.
There are four different types of files pertinent to this discussion:
By default, Symphony OCR queries the Worldox document repository for newly saved and modified files every 15 minutes. Generally speaking, newly saved files will be OCRed within about 15 minutes. Depending on the volume of image-only documents already filed to Worldox, it may take a while for Symphony OCR to process the backlog (legacy files). Symphony OCR gives precedence to newer files, so documents that are scanned today will be processed before the backlog.
Refer to the section, Configuration Guide - Worldox - Finder, for further information on finder settings that determine when Symphony OCR locates files for processing.
Refer to the section, Configuration Guide - Worldox - Processor, for further information on configuration settings that determine which files are processed.
Note: While Symphony OCR may process documents within about 15 minutes, full-text searches in Worldox will not find them until the Worldox text indexes have been updated (typically overnight).
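As a rough illustration of the scheduling described above, the following Python sketch orders a batch of found documents newest-first and computes the next finder run from the 15-minute default. The dictionary layout (`path`, `modified`) and the function names are assumptions made for this example; they are not part of Symphony OCR.

```python
from datetime import datetime, timedelta

# Default Worldox finder interval quoted in this manual.
FINDER_INTERVAL = timedelta(minutes=15)


def order_for_processing(documents):
    """Return documents sorted newest-first by modified date.

    Each document is a dict with the hypothetical keys 'path' and 'modified'.
    """
    return sorted(documents, key=lambda doc: doc["modified"], reverse=True)


def next_finder_run(last_run):
    """Return the time of the next repository query after last_run."""
    return last_run + FINDER_INTERVAL


documents = [
    {"path": "legacy-2015.pdf", "modified": datetime(2015, 3, 1, 9, 0)},
    {"path": "scanned-today.pdf", "modified": datetime(2024, 5, 2, 14, 30)},
    {"path": "last-year.tif", "modified": datetime(2023, 8, 17, 11, 15)},
]

queue = order_for_processing(documents)
print([doc["path"] for doc in queue])
print(next_finder_run(datetime(2024, 5, 2, 14, 30)))
```

In this sketch the file scanned today reaches the front of the queue, and the legacy file from 2015 is handled last.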
By default, Symphony OCR queries the Folder document repository for newly saved and modified files every 120 minutes. Generally speaking, newly saved files will be OCRed within about 120 minutes. Depending on the volume of image-only documents already stored in the repository, it may take a while for Symphony OCR to process the backlog (legacy files). Symphony OCR gives precedence to newer files, so documents that are scanned today will be processed before the backlog.
Refer to the section, Configuration Guide - Folder - Finder, for further information on finder settings that determine when Symphony OCR locates files for processing.
Refer to the section, Configuration Guide - Folder - Processor, for further information on configuration settings that determine which files are processed.
By default, Symphony OCR queries the NetDocuments document repository for newly saved and modified files every 15 minutes. Generally speaking, newly saved files will be OCRed within about 15 minutes. Symphony OCR can also optionally process the files already stored in NetDocuments. By default, it performs a query for these files every 7 days. Symphony OCR gives precedence to newer files, so documents that are scanned today will be processed before the legacy documents.
Note: While Symphony OCR may process documents within about 15 minutes, NetDocuments may take up to 6-8 hours to update its text index. This means that a NetDocuments search for words within a recently OCRed document may not find it for up to 6-8 hours. This is due to how NetDocuments prioritizes API activity.
Refer to the section, Configuration Guide - NetDocuments - Finder, for further information on finder settings that determine when Symphony OCR locates files for processing.
Refer to the section, Configuration Guide - NetDocuments - Processor, for further information on settings that determine which files are processed.
To access Symphony OCR directly, log on to the workstation where it is installed.
You can also access Symphony OCR from your own workstation; see Accessing Symphony OCR.
Because Symphony OCR uses a web interface, most pages do not refresh automatically as it performs its work. The Symphony OCR summary page refreshes automatically every 60 seconds; all other pages require a manual refresh, either with the "Refresh" button in Symphony OCR or with the refresh button in the web browser.
Symphony OCR can be accessed from the web browser of any workstation connected to the network by typing in the address found in the web browser in which Symphony OCR runs:
If you would prefer that end users see only the Summary view of Symphony OCR, do the following:
In the main Symphony OCR page, select "Simple View"
This will open the Summary Screen without the additional navigation panel:
Copy and paste the URL from here to provide to those users.
Symphony OCR searches the document repository for documents to process. It then organizes those documents into one of several lists (these lists are available on the left side of the Symphony OCR web interface). The following diagram shows the tools and lists and explains how they interact:
Symphony OCR consists of three main tools that interact to provide full OCR services:
Finder — locates documents in your document repository
Analyzer — determines if a given document is a candidate for OCR
Processor — performs the actual OCR
As documents flow through Symphony OCR, each of the above components works on the document, then places it in a particular Document List, as described in the next section.
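The flow above can be pictured as a small pipeline. The sketch below is illustrative only: the three functions are hypothetical stand-ins for the Finder, Analyzer and Processor, the in-memory repository is made up, and the real tools do considerably more than this.

```python
def finder(repository):
    """Locate documents: here, every PDF or TIF path in a list of paths."""
    return [path for path in repository if path.lower().endswith((".pdf", ".tif"))]


def analyzer(path, searchable):
    """Decide whether a document is a candidate for OCR.

    'searchable' is a hypothetical set of paths that already contain text.
    """
    return path not in searchable


def processor(path):
    """Stand-in for the OCR step; returns the list the document lands in."""
    return "Processed"


def run_pipeline(repository, searchable):
    """Pass each found document through the analyzer and, if needed, the processor."""
    lists = {}
    for path in finder(repository):
        if analyzer(path, searchable):
            lists[path] = processor(path)
        else:
            lists[path] = "Contains text"
    return lists


repository = ["scan1.pdf", "memo.docx", "report.pdf", "fax.TIF"]
result = run_pipeline(repository, searchable={"report.pdf"})
print(result)
```

Here `memo.docx` is never found, `report.pdf` is found but skipped by the analyzer, and the two image-only files go on to the processor.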
The backlog consists of documents that have not been analyzed or OCRed. These are documents that Symphony OCR is still working on. The following document lists represent the backlog:
Analyzing: Documents waiting for the Analyzer to determine whether they are candidates for OCR
In Process: Documents that are currently being analyzed
Processing: Documents that are candidates for OCR but have not been processed yet
Reprocessing: Documents that had a recoverable problem during OCR and will be processed again later. Typical causes are that the document had been opened by a user or was modified while OCR was taking place
These lists represent documents that were successfully analyzed or OCRed. They are documents that have either been OCRed or were already text searchable (and thus not in need of OCR):
Processed: Documents that have been successfully OCRed by Symphony OCR
In Process: Documents that are currently being OCRed
Already OCRed: Documents that were already OCRed (by some other processor or by an earlier version of Symphony OCR)
Contains text: Documents that are already text searchable (no OCR needed)
No image or text: Some rare documents contain neither text nor images. These generally do not need to be OCRed; however, you may choose to process them on a one-off basis. See "How to Process No Image or Text Documents" for instructions
Email Messages: Email messages with attachments that were processed by Symphony OCR
These lists represent documents that could not be processed for some reason. In most cases, an administrator will want to glance over these lists from time to time to ensure that there are no issues with the documents that didn't get processed:
Needs Attention: Documents in the 'Needs Attention' list are those that appear to be eligible for OCR, but encountered problems during processing. Files in this list could be corrupted or contain invalid images (try opening them in an image viewer to be sure), or they may be images that Symphony OCR does not handle yet.
Occasionally, a document can fall into the 'Needs Attention' list because of bad timing — Symphony OCR trying to process the document when it isn't fully available. So we always recommend clicking the "Show Bulk Operations" button and then "Re-Analyze All", just to ensure this isn't the case.
If the document is corrupted, you can either remove the document from Worldox, or manually tell Symphony OCR to "ignore" it, which will put it on the 'Ignored' list. If the 'Needs Attention' list contains any documents, the overall system condition will show as "Warn." Ignoring a document that you have already checked is a good way to change the system condition back to "OK".
If the document does not appear corrupted, the next step is to let us see a copy of the file. Because PDFs can be generated in countless different ways, we occasionally run into a specific sub-type of PDF that we have not encountered before. If we can get a copy of the file that is falling into the 'Needs Attention' list, we can, in almost all cases, add support for it. Please contact us at support@trumpetinc.com for instructions on uploading documents to our secure site.
New: Documents in the New list are those that have been found by the Finder tool but not yet allocated to another document list (documents are only in the New state for a very short period of time).
Deleted: Documents in the Deleted list are those whose document records are in the process of being purged from the database (documents should only be in this state for a very short period of time).
Too Old: Documents in the Too Old list are those that have a file modified date older than the cutoff age defined in the Processor configuration.
Inaccessible: Documents in the Inaccessible list are those that could not be processed because of file system security, Worldox security, read-only attributes or other conditions that prevent the document from being accessed and worked on. In addition, if the profile group in which the documents reside contains an invalid base path (one containing a space, for example), or if the file name has a space immediately before the extension, the documents will be shown in the Inaccessible list.
Corrupted Documents: Documents in the Corrupted list are those that Symphony OCR does not recognize as valid files. The most common reason is that the file is an invalid or corrupted PDF (try opening it in Adobe Acrobat to be sure). Another possibility is that some characteristic of the PDF is not being handled properly by the Symphony OCR parsing algorithm. Trumpet periodically updates the PDF parsing algorithms to address corner cases that have not been encountered before.
What to do?
Try opening the file in Acrobat, then hit Save (Acrobat will try to open and auto-repair corrupted files - when you save the document, it will save uncorrupted). After saving and closing the document, click the Re-Analyze button on the document record in Symphony OCR. This will only work if the file is only lightly corrupted, but is worth a shot.
If that doesn't help, check whether the file is already text searchable (i.e., can you already search for text inside the PDF?). If you can, then the document isn't a candidate for OCR anyway, and you can just move the document to the Ignored list.
If the document does need to be OCRed, and the Adobe repair doesn't help, then you may want to submit the document to us for analysis. Open a support ticket by emailing support@trumpetinc.com and we will send information on how to securely upload the document to us. If we find a problem in our parsing algorithms, we'll fix the issue and get you a patch.
If there are a large number of files that have the same corruption reason, and the files don't appear to actually be corrupted, please open a support ticket by emailing support@trumpetinc.com and we will send information on how to securely upload a sample document to us. If we find a problem in our parsing algorithms, we'll fix the issue and get you a patch. Alternatively, you can use a bulk Ignore operation to move the documents to the Ignored list.
Encrypted / Restricted: Documents in the Encrypted / Restricted list are those that are restricted from being processed because of some characteristic of the file itself (for example, an encrypted or partially restricted PDF file will not be processed).
Ignored: Documents in the Ignored list are documents that a Symphony OCR administrator has explicitly told Symphony OCR not to process. Any document on this list was explicitly placed there by human intervention.
Wrong Type: Documents in the Wrong Type list are TIF documents that were found while TIFF processing is not enabled.
Moved / Unavailable: Documents in the Moved / Unavailable list are no longer available in the Document Management System (DMS). This could mean that the DMS has gone "offline" or the DMS settings have been adjusted so that the documents would not have been found for processing (e.g., if a user selects a profile group to analyze and OCR, and then chooses to un-check that profile group or no longer process it). Document records in the Moved/Unavailable list will be deleted from the database after 15 days. Documents can also appear in the Moved / Unavailable list if they are no longer at that current location.
Digitally Signed: Documents that are digitally signed will not be processed by Symphony OCR because adding OCR information to these documents would invalidate the digital signature. If you wish to have these documents OCRed anyway (and are OK with invalidating the digital signature), please send an email to support@trumpetinc.com and request that functionality be added.
Too Big (8.0.0 and higher)
If a document falls into this list, it does NOT mean the document contains too many pages. Symphony OCR processes files one page at a time, so a document in this list contains one or more pages with pixel dimensions larger than a specified value. In this version of Symphony OCR, that value is 32512 x 32512 pixels.
This is a hard limit and cannot be overridden.
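To make the rule concrete, here is a minimal Python sketch of a per-page check, assuming the page sizes are already known as (width, height) pixel pairs. The function name and data layout are hypothetical, and treating a page exactly at the limit as acceptable is an assumption; only the 32512 x 32512 figure comes from the text above.

```python
# Per-dimension pixel limit quoted above for version 8.0.0 and higher.
MAX_PAGE_PIXELS = 32512


def is_too_big(page_sizes, limit=MAX_PAGE_PIXELS):
    """Return True if any page exceeds the limit in width or height."""
    return any(width > limit or height > limit for width, height in page_sizes)


# A 500-page document of letter-size scans at 300 dpi (2550 x 3300 pixels per page).
letter_scans = [(2550, 3300)] * 500

# A two-page document in which a single oversized drawing exceeds the limit.
blueprint = [(2550, 3300), (40000, 28000)]

print(is_too_big(letter_scans))  # False: page count does not matter
print(is_too_big(blueprint))     # True: one page is wider than the limit
```

The example reflects the point made above: a long document of ordinary pages passes, while a short document with one very large page does not.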
Too Big (Prior to 8.0.0)
If a document falls into this list, it does NOT mean the document is too big. Symphony OCR processes files one page at a time, so a document in this list contains one or more pages with pixel dimensions larger than a specified value (i.e., the page couldn't be loaded into memory). We usually see this in documents like blueprints or schematic drawings. If you find that such a document needs to be processed, there are some things we can do to try to get it processed.
Clicking on the document in the 'Too Big' list will tell you the size of the offending page.
1) Click on the 'Too big' list.
2) Click on the individual document in question.
3) The offending size of the document is available in the document details.
If you have a series of documents of the same type, it is usually the same page size that exceeds the limit. You can attempt to process these documents by modifying the value(s) declared in the setting.xml file. (Defaults differ depending on the version you're running.)
See Manipulating Document Lists for more information on how to manage these lists
Document Timelines give a week-by-week summary of the number of documents and pages in a given document list. The timelines are organized around the document's modified date, so they represent approximately when the document was added to the system. To view the timeline for a given document list, click into the list then click the "View Timeline" button at the top of the list.
Timelines can be useful for determining how quickly new documents are added to your document management system. For example, the timelines of the Processing and Processed document lists show how many documents and pages eligible for OCR have been added to the system in the past 52 weeks. This gives an approximate rate of new documents per year.
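The weekly grouping and the yearly rate estimate can be illustrated with a short Python sketch. It assumes each document is represented by a (modified date, page count) pair and groups by ISO week; how Symphony OCR actually buckets weeks is not stated here, so treat the grouping rule and all names as assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_timeline(documents):
    """Group (modified_date, pages) pairs into per-week document and page counts."""
    timeline = defaultdict(lambda: {"documents": 0, "pages": 0})
    for modified, pages in documents:
        iso = modified.isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        timeline[week]["documents"] += 1
        timeline[week]["pages"] += pages
    return dict(sorted(timeline.items()))


def documents_in_past_year(documents, today):
    """Count documents modified within the 52 weeks before 'today'."""
    cutoff = today - timedelta(weeks=52)
    return sum(1 for modified, _ in documents if cutoff < modified <= today)


documents = [
    (date(2024, 4, 29), 3),
    (date(2024, 5, 2), 10),
    (date(2024, 5, 8), 1),
    (date(2022, 1, 10), 25),
]

for week, totals in weekly_timeline(documents).items():
    print(week, totals)
print(documents_in_past_year(documents, today=date(2024, 5, 10)))
```

Counting only the documents from the last 52 weeks gives the approximate yearly rate; the 2022 document appears in the timeline but is excluded from that count.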
To view the timeline for processed documents:
2. On the "Documents of Type Processed" screen, locate and select the "View Timeline" link.
3. This will take you to the page for "The Timeline of Processed" documents.
This screen will show how many documents and pages were processed from week to week (cumulatively as well). The timeline can also be exported as a CSV or Image file by selecting the appropriate button (this will give you the full history as opposed to going back only 100 weeks).
There are two methods for looking up the details of a particular document:
Lookup By Path — Enter the full path of the document and click "Query" (See also: Checking the Status of a Document).
Look up a document - NetDocuments
Look up a document - Worldox
Look up a document - Windows Folder Tree
Once on the details page, the user can perform these functions:
Refresh: Provides the most current details of a document. There are also various pieces of data and history showing what has been found in the file and what has been done to it. For example, the "History" section shows all of the events logged for that file.
Note: If you delete the details of this document, it will delete this history and start from scratch. The "Page Analysis Details (before processing)" indicates how many words per page were found within the file BEFORE Symphony OCRed it. Visible words are computer-readable words (like digital headers or footers, or text generated by Word, etc). Hidden (aka invisible) words would be words applied by something like Symphony OCR. Note that these numbers are PRE-processing and they do not update after Symphony OCRs the file.
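The per-page word counts described above lend themselves to a simple summary. The following Python sketch is only an illustration: the `PageAnalysis` record and its field names are hypothetical, not Symphony OCR's data model, and the idea that pages with no words at all are the ones still lacking searchable text is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PageAnalysis:
    page: int
    visible_words: int  # computer-readable text, e.g. generated by Word
    hidden_words: int   # invisible text layer, e.g. added by an OCR engine


def summarize(pages):
    """Total the visible and hidden words and list pages that have neither."""
    return {
        "visible_words": sum(p.visible_words for p in pages),
        "hidden_words": sum(p.hidden_words for p in pages),
        "pages_without_words": [
            p.page for p in pages if p.visible_words == 0 and p.hidden_words == 0
        ],
    }


pages = [
    PageAnalysis(page=1, visible_words=120, hidden_words=0),
    PageAnalysis(page=2, visible_words=0, hidden_words=0),
    PageAnalysis(page=3, visible_words=4, hidden_words=310),
]

summary = summarize(pages)
print(summary)
```

In this made-up document, page 2 is the only page with no searchable words before processing.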
Symphony OCR is a back-end processing engine. This means that very little user interaction is required, but as an administrator, you may wish to check on the status of the software.
For additional suggestions on monitoring Symphony OCR, visit Ongoing Care & Feeding.
Depending on how Symphony OCR Notifications are configured, Status Notifications may be sent to you nightly, when there are errors, or when there are warnings. See Notifications for how to set this up / edit the notification frequency.
The email notification will look and feel very similar to the Summary Page but without the large graph.
You can use the buttons in the Notifications to manage Symphony OCR, provided that you have network connectivity to the Symphony OCR server.
If you're not on the same network as Symphony OCR then you can't use the buttons, but the data presented can still give you a quick glance at its progress.
System Statistics tell you how many files are in the Analyzer and OCR backlogs, and how long Symphony OCR estimates it will take to complete those backlogs.
Document Lists give you the itemized numbers of documents found that were Processed or Not Processed. Read the article titled "Symphony OCR Workflow, Tools & Document Lists" for more information on those lists.