ProjectWise 101 – Chapter 9 Document Indexing. Gary Cochrane – Technical Director Geospatial Sales – North America. Introduction. ProjectWise Document Indexing Really means three things Full Text Indexing, in support of full text searching Thumbnail Extraction Document Property Extraction
ProjectWise 101 – Chapter 9 Document Indexing Gary Cochrane – Technical Director Geospatial Sales – North America
Introduction • ProjectWise Document Indexing • Really means three things • Full Text Indexing, in support of full text searching • Thumbnail Extraction • Document Property Extraction • We won’t cover this one in PW101 • See Bentley Institute PW Admin course guide for this
Full Text Indexing • We did not write the engine for this • But elected to use the one Microsoft provides • Included with every copy of Windows • That engine is called the MS Indexing Service • And it was installed in the VM as an optional Windows component • Microsoft indexes the following file formats • MSWord, Excel, PPT, HTML, XML, TXT
Pre-installed in VM ProjectWise Integration Server ProjectWise Orchestration Framework MicroStation V8i-SS1 Supported Database Engine Microsoft Message Queuing Service Microsoft Indexing Service Microsoft .NET Framework 2.0 Windows Server 2003 with SP2
Extending the MS Index Service • Microsoft provides an SDK for third parties to extend the Indexing service • So the Indexing service will know how to “filter” files from that vendor • For instance, Adobe provides an “iFilter” that teaches the MS Index Service how to extract text from a PDF file • The Adobe PDF iFilter is installed with Acrobat Reader V9x
Indexing Overview • Within PW, Indexing consists of: • Scheduling • A process that wakes up, checks for new (or modified) files, adds them to the Copy-out queue, and goes back to sleep • Copy-out • Copy the file from the Storage Area to the machine running the Indexing Service. Then add the file to the Extraction queue. • Remember, files may be stored on multiple servers • Also, in large installations, a machine may be dedicated to indexing
Indexing Overview – Part II • Overview – continued • Extraction • This process gets the text from the file and adds it to the MS Index catalog. Then adds the file to the Update queue • Update • This process sets the flag on the file (in the PW database) that says it is “done” • New files are added with the flag set to “undone” • Check-out/in causes the flag to be set to “undone”
A note on “done” • Done does not necessarily mean it was successful • It means the file has been processed • In other words, what happens if an unknown file (Ex: an AutoCAD file) is sent to the Indexing Service? • The file is attempted… • And the Indexing Service says, “I don’t know how to extract text from this file” • There would be no point in trying the file again • So it is marked as “done”, even when unsuccessful
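The four stages and the “done” rule above can be sketched in a few lines of Python. This is only an illustrative model, not the ProjectWise or Orchestration Framework API: the `Document` class, the `KNOWN_EXTENSIONS` set, and the function names are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical list of extensions the indexing engine can filter.
KNOWN_EXTENSIONS = {"doc", "xls", "ppt", "html", "xml", "txt", "pdf"}


@dataclass
class Document:
    name: str
    text: str = ""
    done: bool = False  # new documents start out "undone"


def run_pass(documents, catalog, batch_size=60):
    """One scheduler wake-up: schedule, copy out, extract, update."""
    # Scheduling: queue up to batch_size documents that are not done.
    copy_out_queue = deque([d for d in documents if not d.done][:batch_size])

    # Copy-out: the real system copies the file from the storage area to
    # the indexing machine; this model only moves it to the next queue.
    extraction_queue = deque()
    while copy_out_queue:
        extraction_queue.append(copy_out_queue.popleft())

    # Extraction: add text to the catalog when the format is known.
    update_queue = deque()
    while extraction_queue:
        doc = extraction_queue.popleft()
        ext = doc.name.rsplit(".", 1)[-1].lower()
        if ext in KNOWN_EXTENSIONS:
            catalog[doc.name] = doc.text
        # Unknown formats are still passed on: attempted, not retried.
        update_queue.append(doc)

    # Update: flag every processed document as done, success or not.
    processed = 0
    while update_queue:
        update_queue.popleft().done = True
        processed += 1
    return processed


def check_in(doc, new_text):
    """A check-out/check-in with changes resets the flag to undone."""
    doc.text = new_text
    doc.done = False
```

In this model a `.dwg` file with no special handling ends up flagged done without ever reaching the catalog, which is the behaviour the slide describes.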
MicroStation and AutoCAD • ProjectWise provides a mechanism to index the text from these file types • Instead of writing an iFilter, Bentley elected to: • Copy-out the file • Run MicroStation in the background, extract all the text, and write it to an XML file • Send the XML file to the Indexing Engine • Since MicroStation can parse DWG as well… • Then this method saved us from having to write two iFilters
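As a rough illustration of the “extract the text, write it to an XML file” step, the sketch below wraps a list of text strings in XML using Python's standard library. The slides do not show the XML layout ProjectWise actually uses, so the `document` and `text` element names and the `source` attribute are invented for this example.

```python
import xml.etree.ElementTree as ET


def text_to_xml(source_name, text_strings):
    """Wrap extracted text strings in a simple, hypothetical XML layout."""
    root = ET.Element("document", {"source": source_name})
    for s in text_strings:
        # One <text> element per string pulled from the drawing.
        ET.SubElement(root, "text").text = s
    return ET.tostring(root, encoding="unicode")
```

Because XML is one of the formats the Indexing Service already understands, handing it a file like this avoids the need for a DGN or DWG iFilter.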
Summary • So within ProjectWise, we index: • Word, PPT, Excel, XML, HTML, TXT • Adobe PDF • DGN and DWG • More good news • iFilters can be found for many file formats • Some free, and some for purchase
PW Orchestration Framework • Remember when we installed this? • PWOF is responsible for managing batch processes for ProjectWise • This includes all those processes discussed on the previous slides • For Full Text Indexing, that means • Scheduler process, Copy-out process, Extraction process, Updater process, and the MicroStation instance running in the background
Lab 1a • PW Orchestration Framework • Start the Windows Task Manager • Hint: Right-click on empty part of Taskbar • Examine memory usage • On the Performance tab • Switch to Processes tab • Sort by Mem Usage column (descending) • Look for ustation.exe • Look for DmsAfpEngine(s) • Lots of memory consumed here…
Lab 1b • Now open Services dialog • Remember “gears” icon on Quick-Launch • Locate PW Orchestration Framework service • Select the PW OF service, and choose> Stop • Watch memory usage in Task Manager • For remainder of exercise, we need PWOF running • So start it back up now • Note PWOF is configured for automatic startup • It will run each time machine is booted • Close Services and Task Manager
Lab 2a • Open PW Administrator • Log in as> adminpw • Drill down to: • Document Processors> Full Text Indexing • Right-click, choose> Properties
Lab 2b - Full Text Indexing • Accept default, unless Indexing is to be run on another machine • Turn on • adminpw • adminpw • Set to 60
Lab 2c - Full Text Indexing • Enable all times in the schedule • Set to 2
Lab 2d • Switch to File Type Associations tab • Press> Add • In the Extension field, enter> DWG • In the bottom field, enter> DGN • So that DWG files are processed as if they were DGN • Press> OK
Lab 2f • Still on the File Type Associations tab • Again, press> Add • In the Extension field, enter> itiff • In the bottom, enable> Do not process these documents • You can’t extract text from a raster so this prevents wasted file transfers • Press> OK • Press OK again • To close the Full Text Indexing Properties
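The two associations entered in this lab behave like a small lookup table. The sketch below is a hypothetical model of that table, not how ProjectWise stores it: the `associations` dictionary and the `resolve` function are names made up for illustration, and `None` stands in for “Do not process these documents”.

```python
# Hypothetical model of the File Type Associations tab.
associations = {
    "dwg": "dgn",    # process DWG files as if they were DGN
    "itiff": None,   # raster: nothing to extract, so skip the transfer
}


def resolve(filename, associations):
    """Return the extension to process the file as, or None to skip it."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in associations:
        return associations[ext]
    # No association: the file is processed under its own extension.
    return ext
```

For example, `resolve("site.DWG", associations)` returns `"dgn"`, while `resolve("scan.itiff", associations)` returns `None`, so the raster is never copied out.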
Lab 2g • Open Task Manager again • Switch to Performance tab • Within 2 minutes, you should see heavy CPU usage • Memory usage will also go up • Up to 60 documents will be indexed in the first pass • If there are more than 60 documents to be done, then they will be queued in the next pass • 2 minutes from now
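The batching arithmetic is easy to check. The sketch below assumes, as this lab suggests, 60 documents per pass and one pass every 2 minutes, and it also assumes each pass finishes before the next one starts. The function names are made up for the example.

```python
import math


def passes_needed(pending, batch_size=60):
    """Number of scheduler passes required to work through the backlog."""
    return math.ceil(pending / batch_size)


def minutes_until_last_pass(pending, batch_size=60, interval_minutes=2):
    """Upper bound, in minutes, until the final pass has started."""
    return passes_needed(pending, batch_size) * interval_minutes
```

With 150 pending documents, that is 3 passes, and the last pass starts within about 6 minutes.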
Analysis • All documents will eventually be processed • When done, the index will be ready for fast full text searches • Once the indexer has caught up, future load will be lighter due to only processing incremental documents
Lab 3a • When done, close Task Manager, open PW Explorer • Log in as user1 • From the main tool box, select> Find Documents • Binocular icon • Change to Full Text tab • Enter Look For> detail • Press OK to start search • Then Close the Search dialog • Your results should include: DGNs, DWGs, and PDFs
Lab 3b • Browse to: • User1/Document Indexing/MS-SHT • These files were not successful because they have an unknown extension • But they were attempted, and flagged as done • Return to PW Administrator • Select datasource name (pwdemo) • Right-click, choose> Properties • Change to Statistics tab • Choose Refresh • Review Full Text Statistics • Close dialog
Lab 3c • While still in PW Administrator • Open Full Text Indexing Properties again • Switch to the File Type Associations tab • Press Add • In the Extension field, enter> SHT • In the bottom Extension field, enter> DGN • So that SHT files will be processed as if they were DGN files • Press OK to complete the Extension mapping • Press OK again to close the Properties dialog
Lab 3d • Once new file type has been added… • Now a small problem • These files were flagged as done, and the Indexer won’t try them again unless they are checked out/in • And even that won’t work unless you actually make changes… • PW compares files to the version on the server, and doesn’t transfer back if there are no changes
Lab 3e • Rather than check them all out, and back in • From PW Administrator • Right-click Full Text Indexing • Choose> Mark folder Documents for Reprocessing • Browse “…” to • User1/Document Indexing/MS-SHT • Press OK • Press OK again
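Conceptually, marking a folder for reprocessing just resets the “done” flag so the scheduler picks the files up again. The sketch below models that idea with plain dictionaries. It is not the ProjectWise implementation, and whether the real command also descends into subfolders is not stated in these slides, so this version only touches documents directly in the named folder.

```python
def mark_folder_for_reprocessing(documents, folder):
    """Reset the 'done' flag for documents directly in `folder`."""
    count = 0
    for doc in documents:
        if doc["folder"] == folder and doc["done"]:
            doc["done"] = False  # scheduler will queue it on the next pass
            count += 1
    return count


# Hypothetical records for illustration only.
documents = [
    {"name": "sheet1.sht", "folder": "User1/Document Indexing/MS-SHT", "done": True},
    {"name": "sheet2.sht", "folder": "User1/Document Indexing/MS-SHT", "done": True},
    {"name": "plan.dgn", "folder": "User1/Document Indexing/MS-V8", "done": True},
]
reset = mark_folder_for_reprocessing(documents, "User1/Document Indexing/MS-SHT")
```

Only the two SHT records are reset here. The DGN file in the other folder keeps its flag and is not re-indexed.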
Analysis • Within 2 minutes, these documents will be re-processed • If you run the search again (in a few minutes), you should also get SHT files in your results • Re-visit Datasource statistics to see if the Full Text categories have changed
Summary • Once the index is created, • You can stop the PW Orchestration Framework service • It is used to create the index, but not to search the index • This will save memory and CPU cycles • So in a demo, your machine will run faster • BUT new (or modified) files will not be re-indexed • Up until now, the PWOF was not being used at all • Full Text Indexing is the first time we’ve needed PWOF, even though it has been running since installation
PW Thumbnails • PW Thumbnails is not “indexing” in the proper sense, but it is similar in nature to Full Text • PW Thumbnails extracts a thumbnail from the document, and stores a copy in the PW database • This allows one to browse PW Explorer, and see thumbnails in the Preview Pane • Not all file types support thumbnails • Among those that do, some don’t do it per the industry standard
Thumbnails – Part II • Important to remember • ProjectWise does not create thumbnails • It only extracts what might be in the file • A good test is to check to see if Windows Explorer displays a thumbnail for the file • If it does, then PW should as well
Lab 4a • Open Windows Explorer • Browse to: • C:\PW-101 Class Files\Document Indexing\MS-V8 • Change to Thumbnail display • MicroStation V8 files have thumbnails
Lab 4b • Browse through remaining Document Indexing folders • Note which include thumbnails • Additional notes • PDF files take a long time because you are really looking at a small view of the whole file, not a thumbnail • AutoCAD doesn’t adhere to the Industry standard • These files only display correctly because MicroStation is installed, and is responsible for displaying a thumbnail • Autodesk may have fixed this in later versions?
Lab 5a • Open PW Administrator • Log in as> adminpw • Drill down to: • Document Processors> Thumbnail Extraction • Right-click, choose> Properties • Similar to Full Text Indexing • But actually less involved
Lab 5b • Turn on • adminpw • adminpw • Set to 60
Lab 5c • Enable all times in the schedule • Set to 2
Lab 5d • No changes required on the File Type Associations tab • Press OK to complete the configuration and close the dialog • Within a few minutes, thumbnails should show up in the preview pane
Analysis • Thumbnails are extracted and stored in the PW database • Because document storage may not be local • Thus “touching” the document to see thumbnail in real-time is not practical • Thumbnail notes • Requires less processing than full text • MicroStation not running in this process • Requires PWOF to extract, but not to display
Review • Topics covered in this Chapter • Full text Indexing – Configuration • Full Text Searches • ProjectWise Orchestration Framework • Thumbnail Extraction • Microsoft Indexing Service • And iFilters to extend default supported file types • (I have a free Visio, and MSG iFilter from Microsoft)