
Content Categorization: A Road Map

Julia Marshall, USAID (Bridgeborn Inc.). http://dec.usaid.gov


Presentation Transcript


  1. Content Categorization: A Road Map. Julia Marshall, USAID (Bridgeborn Inc.)

  2. http://dec.usaid.gov

  3. What Is Your Goal?

  4. Write Down Your Goal. Samples: (1) Discover what topics are most mentioned in a set of documents and whether they change over time. (2) Create metadata records more quickly than we can with human catalogers/indexers.

  5. Flavors of Text Analytics. Text Mining: discovering data; finding patterns within the data. Auto-Categorization: structuring data to a schema; assigning pre-determined tags.

  6. Assess Your Resources: People; Materials to be categorized; Processes; Metadata schemas; Systems; Budget.

  7. People: How many people will you have? What is their expertise (IT people, indexers/subject matter experts, web developers/designers, project manager)? How much time will they have to devote to this project?

  8. Materials to be categorized: How much material? What format is it in? Paper? Digital files? OCR’d files? What shape is it in?

  9. Processes: Are processes already in place for categorization? If so, how is the process done? Who does the process? How standardized is the process?

  10. Metadata Schemas. Does your organization have: A thesaurus of topics? Personal name authority files? Organizational name authority files? Gazetteers or geographic names? A standard list of types of documents? A standard way dates are handled?

  11. Systems: Will there be a system that consumes output from the SAS Content Categorization Studio? How will the system consume the SAS output? Will there need to be code to pull the text of the documents through SAS? Code to push the SAS output into your consuming system?

  12. Budget: How much money can you spend?

  13. Assess the Costs: Tools; Application; Server space/equipment; Staff time; Preparatory costs.

  14. Select a plan/tool that best fits your organization’s needs: Revisit your original goal. What do you have the resources to do? Revise your goal to fit your circumstances. Find the best tool for the job.

  15. Strategize the Implementation: What metadata/processes to automate? What are the priorities for the above processes? What are the easiest to automate? How much time will it take? Who’s doing what?

  16. A Brief Digression on Project Management

  17. Manage the Management: Manage expectations. Pick a “Quick Win” piece of the project. Keep them informed at a level that they can understand.

  18. The SAS Content Categorization Studio is Plugged in. Now What Do I Do?

  19. Create Profiles: For each piece of metadata to auto-categorize, write a profile that tells the application which terms to assign to each document. Each term will need its own set of rules that tell the application when to apply that particular term, and when not to.

  20. Tips for Writing Profile Rules: Simpler is better, at first. Analyze a sample of documents to be auto-categorized: what words show up with the term? Differentiate between “Concept” and “Context”. Document your rules and your updates as you write them.

  21. Sample Profile Build. Logging: (OR, (MIN_2, “Logging”, “Selective logging”, “Illegal logging”, “Logging concession”, “Timber extraction”, “Sawmill”, (SENT, “logging”, “impact”), (SENT, “timber”, “harvest”))). Trees: (OR, (MAXOC_50, (NOTIN, “Trees”, “Teak trees”), (NOTIN, “trees”, “fruit trees”), (NOTINSENT, “Trees”, “Timber”), (NOTINSENT, “trees”, “logging”)))
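To make the shape of such rules concrete, here is a minimal Python sketch that evaluates the Logging rule against a piece of text. It is not how SAS Content Categorization Studio works internally. The semantics are assumptions made for illustration: a quoted string matches as a case-insensitive substring, OR matches if any argument matches, MIN_n matches if at least n arguments match, and SENT matches if all of its arguments occur in one sentence. The names `matches`, `split_sentences`, and `LOGGING_RULE` are hypothetical.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def matches(rule, text):
    """Evaluate a nested rule tuple against text using simplified, assumed semantics."""
    if isinstance(rule, str):
        # A plain term matches as a case-insensitive substring.
        return rule.lower() in text.lower()
    op, *args = rule
    if op == "OR":
        return any(matches(a, text) for a in args)
    if op == "SENT":
        # All arguments must occur within the same sentence.
        return any(all(matches(a, s) for a in args) for s in split_sentences(text))
    if op.startswith("MIN_"):
        # At least n of the arguments must match.
        needed = int(op.split("_")[1])
        return sum(matches(a, text) for a in args) >= needed
    raise ValueError(f"unsupported operator: {op}")

# The Logging rule from the slide, transcribed as nested tuples.
LOGGING_RULE = ("OR", ("MIN_2",
    "Logging", "Selective logging", "Illegal logging", "Logging concession",
    "Timber extraction", "Sawmill",
    ("SENT", "logging", "impact"), ("SENT", "timber", "harvest")))

doc_a = "Illegal logging expanded after the sawmill opened."
doc_b = "The village planted fruit trees near the school."
print(matches(LOGGING_RULE, doc_a), matches(LOGGING_RULE, doc_b))
```

Because terms match as substrings here, “Logging” also matches inside “Illegal logging”, so one phrase can satisfy two arguments. A real profile engine may count matches differently.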

  22. Collect Sample Sets of Documents: Need at least 3 sets (probably more). 1st set for writing the profile; 2nd set for testing; 3rd set for the final test.
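One way to get three non-overlapping sets is to shuffle the document identifiers once and deal them out in turn. The sketch below assumes documents can be referred to by simple string IDs. The function name `split_sample_sets` and the IDs are hypothetical.

```python
import random

def split_sample_sets(doc_ids, n_sets=3, seed=0):
    """Shuffle document IDs and deal them into n_sets disjoint sets."""
    shuffled = list(doc_ids)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_sets] for i in range(n_sets)]

doc_ids = [f"doc-{i:03d}" for i in range(10)]  # hypothetical document IDs
writing_set, testing_set, final_set = split_sample_sets(doc_ids)
print(len(writing_set), len(testing_set), len(final_set))
```

Keeping the sets disjoint means the final test runs on documents that never influenced the rules.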

  23. Run the 1st Sample Set Against Your Profile: Each document will have terms that SAS assigned to it. Each term will have a relevancy score. Rank the terms from highest to lowest relevancy score. Look at the top 5-10 terms.
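The ranking step can be sketched in a few lines of Python. The sketch assumes the assigned terms for one document have already been exported into a dictionary of term to relevancy score. The actual SAS output format is not described in the slides, and the scores below are invented for illustration.

```python
def top_terms(assigned_terms, limit=10):
    """Return (term, score) pairs sorted from highest to lowest relevancy."""
    ranked = sorted(assigned_terms.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]

# Hypothetical scores for a single document.
sample_output = {"Logging": 0.91, "Trees": 0.40, "Forestry": 0.77, "Agriculture": 0.12}

for term, score in top_terms(sample_output, limit=3):
    print(f"{score:.2f}  {term}")
```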

  24. Evaluate the Output: Do the top 5-10 terms make sense? Are the terms too general? What phrases in the set of documents caused SAS to pick those terms? How do you need to rewrite the rules?

  25. Rewrite the Rules in the Profile Based on the Output

  26. RepeatRepeatRepeatRepeatRepeat RepeatRepeat As Needed

  27. I’ve Created the Profile. The Output is the Way I Want. Now What?

  28. Integrate the Output: Design the workflow; Interface design; Connect to local systems; Train staff; More tests.

  29. Design the Workflow I: Where is the data in each step? Who is handling the data? What has to happen to move the data to the next step? (Diagram labels: Documents; SAS Profile; Java Code; XML Code; Metadata in DEC.)
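As an illustration of one hand-off in such a workflow, the sketch below turns a document's assigned terms into a small XML record that a consuming system could load. The slide mentions Java and XML but does not show the real code or schema, so this is a Python sketch under assumptions: the element names `record` and `term`, the `relevancy` attribute, and the function `build_metadata_xml` are all hypothetical.

```python
import xml.etree.ElementTree as ET

def build_metadata_xml(doc_id, terms):
    """Serialize (term, score) pairs for one document as an XML string."""
    record = ET.Element("record", attrib={"id": doc_id})  # hypothetical schema
    for term, score in terms:
        el = ET.SubElement(record, "term", attrib={"relevancy": f"{score:.2f}"})
        el.text = term
    return ET.tostring(record, encoding="unicode")

xml_text = build_metadata_xml("doc-001", [("Logging", 0.91), ("Forestry", 0.77)])
print(xml_text)
```

Writing the record out at each step also answers the slide's first question, since the data's location is then explicit at every stage.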

  30. Design the Workflow II

  31. Interface Design. Sample form fields: USAID Geographic Term(s); USAID Geographic Term(s) SAS values; SAS GeoDescriptor Run Date.

  32. Connect to Local Systems

  33. Train Staff: IT staff; Profile managers; Output evaluators.

  34. Test the Integrated System: Gather test samples, again! Run the profile in your test environment. Does the output stay the same? Can you update the profiles? Are other users of the system able to use/update the output?
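The question "Does the output stay the same?" can be checked mechanically by comparing the terms assigned in an earlier run with those from the integrated system. The sketch below assumes both runs have been exported as dictionaries mapping a document ID to its list of assigned terms. The function name `diff_assignments` and the sample data are hypothetical.

```python
def diff_assignments(baseline, current):
    """Return {doc_id: (missing_terms, extra_terms)} for documents whose terms changed."""
    changes = {}
    for doc_id in sorted(set(baseline) | set(current)):
        before = set(baseline.get(doc_id, ()))
        after = set(current.get(doc_id, ()))
        if before != after:
            # Terms lost since the baseline, then terms newly assigned.
            changes[doc_id] = (sorted(before - after), sorted(after - before))
    return changes

baseline_run = {"doc-001": ["Logging", "Trees"], "doc-002": ["Forestry"]}
integrated_run = {"doc-001": ["Logging"], "doc-002": ["Forestry"]}
print(diff_assignments(baseline_run, integrated_run))
```

An empty result means the integrated system reproduced the earlier assignments for every document in the sample.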

  35. If you answered yes: CELEBRATE!

  36. Maintain the System: Documentation; Tests; Staff training; Follow-up evaluations.

  37. Lessons Learned the Hard Way: Be careful using outside data. Buy only what you need.

  38. Thank you! Email: jmmarshalljb@gmail.com
