Training Manuals- Symphony Suite :: Symphony OCR - Administrator

1. Background Information

1.1. What is Symphony OCR

Symphony OCR is part of Symphony Suite, The Complete Imaging Solution.  Symphony OCR is a back-end OCR engine. It will locate all image-only PDF and TIF files in your document management system and convert them to fully text searchable PDFs by adding an invisible layer of text over the image. Symphony OCR typically runs on a back-end PC or server (for Worldox sites, this is typically the Indexer PC).

Tip: Turn OCR off at your scanners and see significant (2 to 5x) improvement in scanning speeds. In fact, Adobe Acrobat turns OCR on by default, and we strongly recommend turning it off. Let Symphony OCR take care of the OCR in the background.

 

...

1.2. What Symphony OCR Will and Will not Process

There are four different types of files pertinent to this discussion:

  • Image-only PDF: This is an image created by a scanner.
  • Rendered PDF: This is a PDF created by a computer (e.g. a Word document converted to PDF). It contains computer-readable text by default, so it is fully text searchable as is, without the need for OCR.
  • Hybrid PDF: This is a PDF that contains both images and rendered content or annotations. For example, if you scan a document and then use Adobe's markup tools to annotate the image, the result is a hybrid PDF.
  • Image + text PDF: This is a PDF that is created when the OCR engine 'reads' an image-only PDF and adds a layer of invisible, computer-readable text to the original image. These files retain the exact original image, but also let you search for text inside the PDF and copy text to the Windows clipboard.

What Symphony OCR Will Process by Default

  • Symphony OCR will process image-only PDF files (and TIFF files if you choose to do so) and convert them to image + text PDF files.
  • Symphony OCR will process image-only pages within a hybrid PDF and convert them to an image + text PDF (but will not process the rendered pages as they are already text searchable).

What Symphony OCR Will Not Process

  • Symphony OCR will not process rendered PDFs, as these documents are already text searchable. Instead, it will place these files into the "Contains Text" or "Already OCRed" lists.
  • To ensure integrity of the original PDF content, Symphony OCR will not process a PDF if the PDF has been encrypted.
  • Symphony OCR will not recognize handwriting.
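
The default rules in this section can be summarized as a small decision function. This is only an illustrative sketch: the names `PdfKind` and `default_action` are invented for this example and are not part of Symphony OCR, and the returned strings are descriptions, not actual list names.

```python
from enum import Enum


class PdfKind(Enum):
    """The four PDF types described in this section (names are illustrative)."""
    IMAGE_ONLY = "image-only"
    RENDERED = "rendered"
    HYBRID = "hybrid"
    IMAGE_PLUS_TEXT = "image + text"


def default_action(kind: PdfKind, encrypted: bool = False) -> str:
    """Return a short description of what happens to a PDF by default."""
    if encrypted:
        # Encrypted PDFs are left untouched to protect the original content.
        return "not processed (encrypted)"
    if kind is PdfKind.IMAGE_ONLY:
        return "OCR all pages"
    if kind is PdfKind.HYBRID:
        # Rendered pages are skipped because they are already searchable.
        return "OCR image-only pages only"
    # Rendered and image + text PDFs are already text searchable.
    return "not processed (already text searchable)"


print(default_action(PdfKind.HYBRID))
```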

...

1.3. When Symphony OCR Will Process Documents

Worldox Document Repository

By default, Symphony OCR queries the Worldox document repository for newly saved and modified files every 15 minutes.  Generally speaking, newly saved files will be OCRed within about 15 minutes.  Depending on the volume of image-only documents already filed to Worldox, it may take a while for Symphony OCR to process the backlog (legacy files).  Symphony OCR gives precedence to newer files, so documents that are scanned today will be processed before the backlog.  

Refer to the section, Configuration Guide - Worldox - Finder, for further information on finder settings that determine when Symphony OCR locates files for processing.

Refer to the section, Configuration Guide - Worldox - Processor, for further information on configuration settings that determine which files are processed.

Note:  While Symphony OCR may process documents within about 15 minutes, you will need to wait until the text indexes are updated (typically overnight) in order to do full text in file searching for the documents using the Worldox document management system.

Folder Document Repository

By default, Symphony OCR queries the Folder document repository for newly saved and modified files every 120 minutes. Generally speaking, newly saved files will be OCRed within about 120 minutes. Depending on the volume of image-only documents already stored in the folder repository, it may take a while for Symphony OCR to process the backlog (legacy files). Symphony OCR gives precedence to newer files, so documents that are scanned today will be processed before the backlog.

Refer to the section, Configuration Guide - Folder - Finder, for further information on finder settings that determine when Symphony OCR locates files for processing.

Refer to the section, Configuration Guide - Folder - Processor, for further information on configuration settings that determine which files are processed.

NetDocuments Repository

By default, Symphony OCR queries the NetDocuments repository for newly saved and modified files every 15 minutes. Generally speaking, newly saved files will be OCRed within about 15 minutes. Symphony OCR can also optionally process the files already stored in NetDocuments. By default, it performs a query for these files every 7 days. Symphony OCR gives precedence to newer files, so documents that are scanned today will be processed before the legacy documents.

Note: While Symphony OCR may process documents within about 15 minutes, NetDocuments may take up to 6-8 hours to update its text index. This means that if you run a NetDocuments search for words within a document that was recently OCRed, there may be a 6-8 hour delay. This is due to how NetDocuments prioritizes API activity.

Refer to the section, Configuration Guide - NetDocuments - Finder, for further information on finder settings that determine when Symphony OCR locates files for processing.

Refer to the section, Configuration Guide - NetDocuments - Processor, for further information on settings that determine which files are processed.
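
As a rough illustration of the defaults above, the sketch below records the default Finder intervals and shows what "newer files take precedence" means for a mixed queue. The function `processing_order` and the sample file names are invented for this example. How Symphony OCR actually schedules work internally is not documented here, so treat the sort as an assumption.

```python
from datetime import datetime

# Default Finder polling intervals in minutes, as stated in this section.
DEFAULT_POLL_MINUTES = {"Worldox": 15, "Folder": 120, "NetDocuments": 15}
# Default interval for the optional query of files already in NetDocuments.
NETDOCUMENTS_LEGACY_QUERY_DAYS = 7


def processing_order(documents):
    """Sort (path, modified) pairs so the most recently modified come first."""
    return sorted(documents, key=lambda doc: doc[1], reverse=True)


docs = [
    ("legacy-1998.pdf", datetime(1998, 3, 1)),
    ("scanned-today.pdf", datetime(2012, 6, 5)),
    ("legacy-2005.pdf", datetime(2005, 9, 12)),
]
ordered = [path for path, _ in processing_order(docs)]
print(ordered)  # newest document first, oldest backlog file last
```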

...

2. Administering Symphony OCR

2.1. Opening and Closing Symphony OCR

Opening and Closing Symphony

To access Symphony OCR directly, log on to the workstation where it is installed.

  • To open Symphony OCR:
    • If the web browser is closed, but Symphony OCR is still open, the user interface can be accessed by right-clicking on the Symphony OCR icon in the system tray and choosing "Show Browser Window"
    • If Symphony OCR is not running, use the desktop shortcut "Symphony OCR" to launch Symphony OCR
  • To close Symphony OCR (Run as a logged in User):
    • Right-click on the Symphony OCR icon in the system tray and select "Quit"
    • Or, select "Quit" from the bottom left corner of the web browser window
  • To close Symphony OCR (Run as a Windows Service):
    • Navigate to the Control Panel -> Administrative Tools -> Services
    • Select Symphony OCR and click the "Stop" link

You can also access Symphony OCR from your own workstation. See: Accessing Symphony OCR

 

Refreshing Data

Since Symphony OCR uses a web interface, the display may not automatically refresh as it performs its work. You can manually refresh by selecting the "Refresh" button in Symphony OCR or in the web browser. The Symphony OCR summary page refreshes automatically every 60 seconds. All other pages require a manual refresh.

...

2.2. Accessing Symphony OCR from a User's Desktop

Symphony OCR can be accessed from the web browser of any workstation connected to the network by entering the address shown in the address bar of the browser in which Symphony OCR runs.


To let end users see only the Summary view of Symphony OCR, do the following:

On the main Symphony OCR page, select "Simple View".


This will open the Summary screen without the additional navigation panel.

Copy the URL of this screen and provide it to those users.


...

2.3. Symphony OCR Operations and Document Lists

Common Workflow Diagram

Symphony OCR searches the document repository for documents to process. It then organizes those documents into one of several lists (these lists are available on the left side of the Symphony OCR web interface). The following diagram shows the tools and lists and explains how they interact:

Symphony OCR Tools

Symphony OCR consists of three main tools that interact to provide full OCR services:

Finder - locates documents in your document repository

Analyzer - determines if a given document is a candidate for OCR

Processor - performs the actual OCR

As documents flow through Symphony OCR, each of the above components works on the document, then places it in a particular Document List, as described in the next section.

Symphony OCR Document Lists

Backlog Lists

The backlog consists of documents that have not been analyzed or OCRed. These are documents that Symphony OCR is still working on.  The following document lists represent the backlog:

Analyzing - Documents waiting for the Analyzer to determine whether they are candidates for OCR

In Process - Documents that are in the process of being analyzed

Processing - Documents that are candidates for OCR but have not been processed yet

Reprocessing - Documents that had some recoverable problem during OCR and will be processed again later. Typical causes are that the document was open by a user or was modified while OCR was taking place
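
One way to picture how a document moves through the backlog is a simple transition table. The sketch below is a simplified, happy-path reading of the list descriptions in this section, not the product's actual internal state machine. The names `NEXT_LIST` and `path_to_processed` are invented for illustration, and the short-lived "In Process" state is left out.

```python
# Simplified happy-path flow between document lists (an assumption drawn
# from the list descriptions, not from the product's internals).
NEXT_LIST = {
    "New": "Analyzing",           # found by the Finder, waiting for the Analyzer
    "Analyzing": "Processing",    # Analyzer decided the document is an OCR candidate
    "Processing": "Processed",    # Processor completed OCR
    "Reprocessing": "Processed",  # recoverable problem, retried successfully later
}


def path_to_processed(start):
    """Follow NEXT_LIST from a starting list until the document is Processed."""
    path = [start]
    while path[-1] != "Processed":
        path.append(NEXT_LIST[path[-1]])
    return path


print(path_to_processed("New"))
```

Documents that turn out not to be OCR candidates leave this path and land in one of the other lists instead.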

Processed Lists

These lists represent documents that were successfully analyzed or OCRed.  They are documents that have either been OCRed or were already text searchable (and thus not in need of OCR):

Processed - documents that have been successfully OCRed by Symphony OCR

In Process - documents that are currently being OCRed

Already OCRed - documents that were already OCRed (by some other processor or by an earlier version of Symphony OCR)

Contains text - documents that are already text searchable (no OCR needed)

No image or text - some rare documents contain no text but also contain no images. These generally do not need to be OCRed; however, you may choose to process them on a one-off basis. See "How to Process No Image or Text Documents" for instructions

Email Messages - the list of email messages that contained attachments that were processed by Symphony OCR

Not Processed Lists

These lists represent documents that could not be processed for some reason. In most cases, an administrator will want to glance over these lists from time to time to ensure that there are no issues with the documents that didn't get processed:

Needs Attention:  Documents in the 'Needs Attention' list are those that appear to be eligible for OCR, but encountered problems during processing.  Files in this list could be corrupted or contain invalid images (try opening them in an image viewer to be sure), or they may be images that Symphony OCR does not handle yet. 

Occasionally, a document can fall into the 'Needs Attention' list because of bad timing: Symphony OCR tried to process the document when it wasn't fully available. So we always recommend clicking the "Show Bulk Operations" button and then "Re-Analyze All", just to ensure this isn't the case.

If the document is corrupted, you can either remove the document from Worldox, or manually tell Symphony OCR to "ignore" it, which will put it on the 'Ignored' list.  If the 'Needs Attention' list contains any documents, the overall system condition will show as "Warn."  Ignoring a document that you have already checked is a good way to change the system condition back to "OK". 

If the document does not appear corrupted, the next step is to let us see a copy of the file. Because PDFs can be generated in countless different ways, we occasionally run into a specific sub-type of PDF that we've not encountered before. If we can get a copy of the file that is falling into the 'Needs Attention' list, we can, in almost all cases, add support for the file. Please contact us at support@trumpetinc.com for instructions to upload documents to our secure site.


New:  Documents in the New list are those that have been found by the Finder tool but not yet allocated to another document list (documents are only in the New state for a very short period of time).

Deleted:  Documents in the Deleted list are those whose document record is in the process of being purged from the database. Documents should only be in this state for a very short period of time.

Too Old:  Documents in the Too Old list are those that have a file modified date older than the cutoff age defined in the Processor configuration.

Inaccessible:  Documents in the Inaccessible list are those that could not be processed because of file system security, Worldox security, read-only attributes or other conditions that prevent the document from being accessed and worked on. In addition, documents are shown in the Inaccessible list if the profile group in which they reside contains an invalid base path (one containing a space, for example), or if the file name has a space immediately before the document extension.

Corrupted Documents

Documents in the corrupted list are those that Symphony OCR does not recognize as valid files. The most common reason is that the file is an invalid or corrupted PDF (try opening in Adobe to be sure).  Another possibility is that there is some characteristic of the PDF that the Symphony OCR parsing algorithm isn't handling properly.  Trumpet does periodically update the PDF parsing algorithms to address corner cases that have not been encountered before.

What to do?

Try opening the file in Acrobat, then hit Save (Acrobat will try to open and auto-repair corrupted files - when you save the document, it will save uncorrupted).  After saving and closing the document, click the Re-Analyze button on the document record in Symphony OCR.  This will only work if the file is only lightly corrupted, but is worth a shot.

If that doesn't help, next check to see if the file is already text searchable (i.e. can you search for text inside the PDF already?). If you can, then the document isn't a candidate for OCR anyway, and you can just move the document to the Ignored list.

If the document does need to be OCRed, and the Adobe repair doesn't help, then you may want to submit the document to us for analysis.  Open a support ticket by emailing support@trumpetinc.com and we will send information on how to securely upload the document to us.  If we find a problem in our parsing algorithms, we'll fix the issue and get you a patch.

If a large number of files have the same corruption reason, and the files don't appear to actually be corrupted, please open a support ticket by emailing support@trumpetinc.com and we will send information on how to securely upload a sample document to us. If we find a problem in our parsing algorithms, we'll fix the issue and get you a patch. Alternatively, you can use a bulk Ignore operation to move the documents to the Ignored list.

Encrypted / Restricted:  Documents in the Encrypted/Restricted list are those that are restricted from being processed because of some characteristic of the file itself (for example, an encrypted or partially restricted PDF file will not be processed).

Ignored:  Documents in the Ignored list are documents that a Symphony OCR administrator has explicitly told Symphony OCR not to process. Any document on this list was explicitly placed there by human intervention.

Wrong Type:  Documents in the Wrong Type list are TIF documents that were found while TIFF processing is not enabled.

Moved / Unavailable:  Documents in the Moved / Unavailable list are no longer available in the Document Management System (DMS). This could mean that the DMS has gone "offline" or that the DMS settings have been adjusted so that the documents would no longer be found for processing (e.g., if a user selects a profile group to analyze and OCR, and then un-checks that profile group or chooses to no longer process it). Documents can also appear in the Moved / Unavailable list if they are no longer at their recorded location. Document records in the Moved / Unavailable list will be deleted from the database after 15 days.

Digitally Signed:  Documents that are digitally signed will not be processed by Symphony OCR because adding OCR information to these documents would invalidate the digital signature.  If you wish to have these documents OCRed anyway (and are OK with invalidating the digital signature), please send an email to support@trumpetinc.com and request that functionality be added.

Too Big (8.0.0 and higher)

If a document falls into this list, it does NOT mean the document contains too many pages. Symphony OCR processes files one page at a time. So if a document falls into this list, it means the document contains one or more pages with pixel dimensions larger than a specified value. In this version of Symphony OCR that value is 32512 x 32512 pixels.

This is a hard limit and cannot be overridden.
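
To check a page against the 8.0.0 limit before filing a document, a comparison like the one below may help. It is a minimal sketch: the function name `page_too_big` is invented here, and it assumes a page counts as too big when either side exceeds 32512 pixels, which the text does not state explicitly.

```python
MAX_DIMENSION = 32512  # hard limit per side in version 8.0.0 and higher


def page_too_big(width_px: int, height_px: int) -> bool:
    """Assumption: a page is too big if either side exceeds the limit."""
    return width_px > MAX_DIMENSION or height_px > MAX_DIMENSION


print(page_too_big(2550, 3300))   # a letter-size page at 300 dpi
print(page_too_big(40000, 3000))  # a very wide drawing
```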


Too Big (Prior to 8.0.0)

If a document falls into this list, it does NOT mean the document is too big. Symphony OCR processes files one page at a time. So if a document falls into this list, it means the document contains one or more pages with pixel dimensions larger than a specified value (i.e., the page couldn't be loaded into memory). We usually see this in documents like blueprints or schematic drawings. But there are some things we can do to try to get these types of documents processed, if you find that they need to be processed.

Clicking on the document in the 'Too Big' list will tell you the size of the offending page. 

1) Click on the 'Too big' list.

2) Click on the individual document in question.

3) The offending size of the document is available in the document details.



If you have a series of documents of the same type, it is usually the same page size that exceeds the limit. You can attempt to process these documents by modifying the value(s) declared in the settings.xml file. (Defaults differ depending on the version you're running.)

For versions NEWER than 6.5.32

Default: If an individual page contains a total pixel count higher than 36,000,000 pixels the entire document will be filed under the "Too Big" list.

Advanced Configuration Setting

If you wish to attempt to process documents that contain pages with a total pixel count larger than 36,000,000 pixels, you may opt to do so by updating the settings.xml file.  Here's how:

  • Close Symphony OCR (stop Service if installed as Service).
  • Navigate to C:\Program Files\Trumpet\SymphonyOCR\Config\ and open the settings.xml file using Notepad.
  • The setting you want to adjust is the maxPixels attribute shown below:
          <documentPreProcessor ..... maxPixels="36000000" ..... />
  • Update the maxPixels value (within the " ") to whatever you feel is appropriate.
    • Tip: Refer to the details that SOCR reports for your document to find the actual size of the page. Set maxPixels to equal or exceed that.
  • Save the settings.xml file
  • Launch Symphony OCR (Start Service if installed as Service)

Note:  If Symphony OCR is not able to process these documents they may end up in the Needs Attention list.
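
If you prefer to script the edit rather than use Notepad, the steps above could be sketched as follows. This is an illustration only: the function `set_max_pixels` is invented for this example, and the sample XML is a hypothetical, heavily trimmed stand-in because the full layout of settings.xml is not shown in this manual. Only the `documentPreProcessor` element and its `maxPixels` attribute come from the text above.

```python
import xml.etree.ElementTree as ET


def set_max_pixels(xml_text: str, max_pixels: int) -> str:
    """Return xml_text with maxPixels updated on every documentPreProcessor element."""
    root = ET.fromstring(xml_text)
    for element in root.iter("documentPreProcessor"):
        element.set("maxPixels", str(max_pixels))
    return ET.tostring(root, encoding="unicode")


# Hypothetical stand-in for settings.xml; the real file contains more settings.
sample = '<settings><documentPreProcessor maxPixels="36000000" /></settings>'
updated = set_max_pixels(sample, 67320000)
print(updated)
```

Note that ElementTree drops XML comments when it parses a file, so keep a backup copy of the original settings.xml, and stop Symphony OCR before changing the file, as described in the steps above.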


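General reference guide for converting page sizes in inches to total pixels (the values correspond to a 300 dpi scan):

A size (8.5 x 11 inches) = 8,415,000 pixels

Legal (8.5 x 14 inches) = 10,710,000 pixels

B size (two A sizes, 17 x 11 inches) = 16,830,000 pixels

C size (two B sizes, 17 x 22 inches) = 33,660,000 pixels

Default limit (equivalent to 20 x 20 inches) = 36,000,000 pixels

D size (two C sizes, 22 x 34 inches) = 67,320,000 pixels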
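
The figures in this reference guide work out to width times height at 300 dots per inch (for example, 8.5 x 300 = 2,550 and 11 x 300 = 3,300, and 2,550 x 3,300 = 8,415,000). The resolution is inferred from that arithmetic rather than stated in the guide. A small helper, named `total_pixels` here purely for illustration, makes it easy to work out a value for other page sizes or scan resolutions:

```python
def total_pixels(width_in: float, height_in: float, dpi: int = 300) -> int:
    """Total pixel count of a page scanned at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


for name, (width, height) in {"A": (8.5, 11), "Legal": (8.5, 14), "D": (22, 34)}.items():
    print(name, total_pixels(width, height))

# The same A-size page scanned at 600 dpi has four times as many pixels.
print(total_pixels(8.5, 11, dpi=600))
```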


For versions OLDER than 6.5.32

Default: If an individual page is larger than 10,000 x 12,000 pixels the entire document will be filed under the "Too Big" list.

Advanced Configuration Setting

If you wish to attempt to process documents that have individual pages larger than 10,000 x 12,000 pixels, you may opt to do so by updating the settings.xml file.  Here's how:

  • Close Symphony OCR (stop Service if installed as Service)
  • Navigate to C:\Program Files\Trumpet\SymphonyOCR\Config\ and open the settings.xml file using Notepad
  • The settings you want to adjust are the maxHeightPixels and maxWidthPixels attributes shown below:
          <documentPreProcessor ..... maxHeightPixels="10000" maxWidthPixels="12000" ..... />
  • Update the maxHeightPixels and maxWidthPixels values (within the "") to whatever you feel is appropriate.
    • Tip: Refer to the details that SOCR reports to find the actual size of the page. Set each maximum just above that.
  • Save the settings.xml file
  • Launch Symphony OCR (Start Service if installed as Service)

Note:  If Symphony OCR is not able to process these documents they may end up in the Needs Attention list.

Note: If you update to version 6.5.32 or above, tell SOCR to re-analyze the documents in the 'Too Big' list. When it re-analyzes them, it will evaluate their total pixel count instead of their height and width.
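
To see why re-analysis after an upgrade can change the outcome, compare the two rules side by side. The sketch below uses invented function names and assumes the older rule rejects a page when either its height or its width exceeds the matching limit. The default values are the ones given in this section.

```python
def too_big_legacy(width_px, height_px, max_width=12000, max_height=10000):
    """Rule before 6.5.32: compare width and height against separate limits."""
    return width_px > max_width or height_px > max_height


def too_big_total(width_px, height_px, max_pixels=36_000_000):
    """Rule from 6.5.32 onward: compare the total pixel count against one limit."""
    return width_px * height_px > max_pixels


page = (3000, 11000)  # narrow but tall page, 33,000,000 pixels in total
print(too_big_legacy(*page), too_big_total(*page))
```

Under these assumptions, the page is rejected by the older rule because of its height, but it passes the newer rule because its total pixel count is below the default.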


See Manipulating Document Lists for more information on how to manage these lists.

...

2.4. Understanding Document Timelines

Document Timelines give a week-by-week summary of the number of documents and pages in a given document list. The timelines are organized around the document's modified date, so they represent approximately when the document was added to the system. To view the timeline for a given document list, click into the list then click the "View Timeline" button at the top of the list.

Timelines can be useful for determining how quickly new documents are added to your document management system. For example, the timeline of the Processing and Processed document lists can provide how many documents and pages that are eligible for OCR have been added to the system in the past 52 weeks. This will give an approximate rate of new documents per year.

To view the timeline for processed documents:

1. Click on the Processed link on the left side of the screen.

2. On the "Documents of Type Processed" screen, locate and select the "View Timeline" link.

3. This will take you to the page for "The Timeline of Processed" documents.

This screen will show how many documents and pages were processed from week to week (cumulatively as well). The timeline can also be exported as a CSV or Image file by selecting the appropriate button (this will give you the full history as opposed to going back only 100 weeks).
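
Once a timeline has been exported as CSV, the yearly rate described above can be estimated by summing the most recent 52 weekly rows. The sketch below is hypothetical: the column names `week_start`, `documents` and `pages` are assumptions made for this example, so adjust them to match the header row of your actual export.

```python
import csv
import io


def documents_last_52_weeks(csv_text: str) -> int:
    """Sum the per-week document counts of the 52 most recent rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: row["week_start"])  # ISO dates sort chronologically
    return sum(int(row["documents"]) for row in rows[-52:])


# Tiny made-up sample in the assumed layout (per-week counts, not cumulative).
sample = "week_start,documents,pages\n2012-01-02,40,310\n2012-01-09,55,420\n"
print(documents_last_52_weeks(sample))
```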

...

2.5. Checking the Details of a Document

There are two methods for looking up the details of a particular document:

Lookup By Path - Enter the full path of the document and click "Query" (See also:  Checking the Status of a Document).

Document Lists - Simply select the document in any of the document lists and this will open a details page.

See also our YouTube videos:

Look up a document - NetDocuments

Look up Document- Worldox

Look up a document - Windows Folder Tree

Interpreting the Details of a Document

Once on the details page, the user can perform these functions:

Refresh - Provides the most current details of a document.

View - Opens the file.

Delete detail - Deletes the details for the document.

Re-Analyze - Re-analyzes the document (if it has been unable to be processed) and attempts to process the document again.

Purge Backups - Deletes any copies of the file that Symphony has saved.

There are also various bits of data or history showing what's been found on the file, and what's been done to it. For example, the "History" section shows all of the events logged for that file. Note that if you delete the details of this document, it will delete this history and start from scratch. There is also "Page Analysis Details (before processing)" that indicate how many words per page were found within the file BEFORE Symphony OCRed it. Visible words are computer-readable words (like digital headers or footers, or text generated by Word, etc). Hidden (aka invisible) words would be words applied by something like Symphony OCR. Note that these numbers are PRE-processing and they do not update after Symphony OCRs the file.

...

2.6. System Condition

Symphony OCR is a back-end processing engine. This means that very little user interaction is required, but as an administrator, you may wish to check on the status of the software.

For additional suggestions on monitoring Symphony OCR, visit Ongoing Care & Feeding.

Symphony OCR has three system status settings:

  • OK (green) - System is running with no errors or warnings
  • Warn (orange) - System has warnings.  Possible causes could be:
    • There are documents in the "Needs Attention" list (Refer to the section, Not Processed Lists, for additional information)
    • The Analyzer or Processor is not running
    • There are configuration problems in the non-critical sub-systems, such as the Heartbeat system
  • Error (red) - System is not running.  Possible causes could be:
    • The Symphony OCR license has expired or is not valid
    • The Worldox Indexer application is not running
    • The Finder, Analyzer or Processor has errors
    • The annual page count limitation has been reached (please contact Trumpet, Inc. (support@trumpetinc.com) for information on increasing the page count license)
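
The three status levels can be read as a simple precedence rule: any Error cause wins over any Warn cause, and OK is shown only when neither applies. The sketch below expresses that reading. The function and parameter names are invented for illustration and do not correspond to any Symphony OCR API.

```python
def system_condition(
    license_valid=True,
    indexer_running=True,
    tool_errors=False,
    page_limit_reached=False,
    needs_attention_count=0,
    analyzer_running=True,
    processor_running=True,
    heartbeat_ok=True,
):
    """Derive OK / Warn / Error from the causes listed in this section."""
    if not license_valid or not indexer_running or tool_errors or page_limit_reached:
        return "Error"
    if needs_attention_count > 0 or not (analyzer_running and processor_running) or not heartbeat_ok:
        return "Warn"
    return "OK"


print(system_condition())
print(system_condition(needs_attention_count=2))
print(system_condition(license_valid=False, needs_attention_count=2))
```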

 

...

2.7. Status Notification Emails

Depending on how Symphony OCR Notifications are configured, Status Notifications may be sent to you nightly, when there are errors, or when there are warnings.  See Notifications for how to set this up / edit the notification frequency.

The email notification will look and feel very similar to the Summary Page but without the large graph.

You can use the buttons in the notification emails to manage Symphony OCR, provided that you have network connectivity to the Symphony OCR server.

If you're not on the same network as Symphony OCR then you can't use the buttons, but the data presented can still give you a quick glance at its progress.

System Statistics tells you how many files are in the Analyzer or OCRing backlog, and how long it estimates it will take to complete those backlogs.

Document Lists give you the itemized numbers of documents found that were Processed or Not Processed. Read the article titled "Symphony OCR Workflow, Tools & Document Lists" for more information on those lists.

...

© 2012 Trumpet, Inc., All Rights Reserved