
National Center for Text Mining Launch Event

National Center for Text Mining Launch Event. Ray R. Larson, University of California, Berkeley, School of Information Management and Systems. Thanks to Dr. Eric Yen, Prof. Michael Buckland, Dr. Rob Sanderson and Prof. Marti Hearst for parts of this presentation.


Presentation Transcript


  1. National Center for Text Mining Launch Event. Ray R. Larson, University of California, Berkeley, School of Information Management and Systems. Thanks to Dr. Eric Yen, Prof. Michael Buckland, Dr. Rob Sanderson and Prof. Marti Hearst for parts of this presentation

  2. Overview • The Grid, Text Mining and Digital Libraries • Cheshire3: Bringing Search and Text Mining to Grid-Based Digital Libraries • Other Related Berkeley work: The BioText Project

  3. Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan) [layered diagram]
     • Applications: High energy physics, Chemical Engineering, Climate, Astrophysics, Cosmology, Combustion, …
     • Application Toolkits: Remote Computing, Remote Visualization, Collaboratories, Remote sensors, Data Grid, Portals, …
     • Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
     • Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services

  4. Grid Architecture (ECAI/AS Grid Digital Library Workshop) [layered diagram]
     • Applications: Digital Libraries, High energy physics, Bio-Medical, Humanities computing, Astrophysics, Chemical Engineering, Climate, Cosmology, Combustion, …
     • Application Toolkits: Text Mining, Remote Computing, Remote Visualization, Search & Retrieval, Metadata management, Collaboratories, Remote sensors, Data Grid, Portals, …
     • Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
     • Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services

  5. Grid-Based Digital Libraries • Large-scale distributed storage requirements and technologies, • Distributed Information Retrieval issues and algorithms, • Organizing distributed digital collections, • Shared Metadata – standards and requirements • Managing distributed digital collections, • Security and access control, • Collection Replication and backup.

  6. Cheshire3 Overview • XML Information Retrieval Engine • 3rd Generation of the UC Berkeley Cheshire system, as co-developed at the University of Liverpool. • Uses Python for flexibility and extensibility, but imports C/C++ based libraries for processing speed • Standards based: XML, XSLT, CQL, SRW/U, Z39.50, OAI to name a few. • Grid capable. Uses distributed configuration files, workflow definitions and PVM (currently) to scale from one machine to thousands of parallel nodes. • Free and Open Source Software. (GPL Licence) • http://www.cheshire3.org/ (under development!)

  7. Cheshire3 Server Overview [diagram of the Cheshire III server; labels recovered from the figure] Network; config & control; server control; user info; API; Apache interface; staff UI; native calls; normalization; indexing; clustering; search; scan; authentication; config; Z39.50; SOAP; OAI; SRW; client; fetch ID; users/clients; put ID; OpenURL; UDDI; WSRP; remote systems (any protocol); OGIS; JDBC; DB API; local DB; XML indexes; config & metadata info; result sets; access info

  8. Cheshire Additions for NaCTeM

  9. Cheshire3 Object Model [diagram; objects shown] ProtocolHandler, Server, UserStore, Database, DocumentGroup, Document, PreParser, Parser, Record, Transformer, Extracter, Normaliser, Index, IndexStore, RecordStore

  10. Cheshire3 Data Objects
     • DocumentGroup: A collection of Document objects (e.g. from a file, directory, or external search)
     • Document: A single item, in any format (e.g. PDF file, raw XML string, relational table)
     • Record: A single item, represented as parsed XML
     • Query: A search query, in the form of CQL (an abstract query language for Information Retrieval)
     • ResultSet: An ordered list of pointers to records
     • Index: An ordered list of terms extracted from Records
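To make the data objects above concrete, here is a minimal Python sketch. It is not the Cheshire3 API: the class names mirror the slide, but every field, method and helper (`postings`, `add`, `search`, `build_index`) is a hypothetical stand-in chosen for illustration, and the "ResultSet" is reduced to a plain ordered list of record ids.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Document:
    """A single item in any format; here simply a raw XML string."""
    data: str


@dataclass
class Record:
    """A single item represented as parsed XML."""
    record_id: int
    tree: ET.Element


@dataclass
class Index:
    """Terms extracted from Records (hypothetical in-memory layout)."""
    postings: dict = field(default_factory=dict)  # term -> list of record ids

    def add(self, term, record_id):
        ids = self.postings.setdefault(term, [])
        if record_id not in ids:
            ids.append(record_id)

    def terms(self):
        """The ordered list of terms held by the index."""
        return sorted(self.postings)

    def search(self, term):
        """Return a ResultSet-like ordered list of pointers (record ids)."""
        return list(self.postings.get(term, []))


def build_index(documents, path="title"):
    """Parse each Document into a Record and index the words found at `path`."""
    index = Index()
    for record_id, doc in enumerate(documents):
        record = Record(record_id, ET.fromstring(doc.data))
        for word in (record.tree.findtext(path) or "").lower().split():
            index.add(word, record.record_id)
    return index
```

For example, `build_index([...]).search("mining")` returns the ids of every record whose title contains that word, in the order the records were added.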

  11. Cheshire3 Process Objects
     • PreParser: Given a Document, transform it into another Document (e.g. PDF to Text, Text to XML)
     • Parser: Given a Document as a raw XML string, return a parsed Record for the item.
     • Transformer: Given a Record, transform it into a Document (e.g. via XSLT, from XML to PDF, or XML to relational table)
     • Extracter: Extract terms of a given type from an XML sub-tree (e.g. extract Dates, Keywords, Exact string value)
     • Normaliser: Given the results of an extracter, transform the terms, maintaining the data structure (e.g. CaseNormaliser)
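The chain of process objects above can be sketched as plain Python functions. This is an illustration under stated assumptions, not Cheshire3 code: the function names, the `<record>` XML layout and the term-to-count dictionary are all invented here to show how a PreParser, Parser, Extracter and Normaliser might hand data to one another.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def text_to_xml_preparser(text):
    """PreParser: Document -> Document (plain text to a raw XML string)."""
    title, _, body = text.partition("\n")
    return "<record><title>%s</title><body>%s</body></record>" % (
        escape(title), escape(body))


def xml_parser(xml_string):
    """Parser: raw XML string -> parsed Record (an ElementTree element here)."""
    return ET.fromstring(xml_string)


def keyword_extracter(record, path):
    """Extracter: pull keyword terms from one XML sub-tree, with counts."""
    terms = {}
    for node in record.findall(path):
        for word in (node.text or "").split():
            word = word.strip(".,;:()")
            if word:
                terms[word] = terms.get(word, 0) + 1
    return terms


def case_normaliser(terms):
    """Normaliser: lower-case each term, keeping the term -> count structure."""
    normalised = {}
    for term, count in terms.items():
        key = term.lower()
        normalised[key] = normalised.get(key, 0) + count
    return normalised


def process(text, path="title"):
    """Run the whole chain on one plain-text document."""
    record = xml_parser(text_to_xml_preparser(text))
    return case_normaliser(keyword_extracter(record, path))
```

Each stage takes the output type of the previous one, which is what lets stages be swapped or reordered by configuration rather than by editing code.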

  12. Cheshire3 Abstract Objects
     • Server: A logical collection of databases
     • Database: A logical collection of Documents, their Record representations and Indexes of extracted terms.
     • Workflow: A 'meta-process' object that takes a workflow definition in XML and converts it into executable code.
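The Workflow idea, an XML definition turned into something executable, can be illustrated with a small sketch. The XML vocabulary (`<workflow>`, `<step ref="...">`), the `REGISTRY` of named steps and `compile_workflow` are assumptions made up for this example; the actual Cheshire3 workflow format is not shown in the slides and may look quite different.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry mapping step names to callables.
REGISTRY = {
    "strip": str.strip,
    "lower": str.lower,
    "tokenise": str.split,
}

# Hypothetical workflow definition; not the real Cheshire3 schema.
WORKFLOW_XML = """
<workflow id="demo">
  <step ref="strip"/>
  <step ref="lower"/>
  <step ref="tokenise"/>
</workflow>
"""


def compile_workflow(xml_text, registry):
    """Turn an XML workflow definition into a single callable."""
    root = ET.fromstring(xml_text.strip())
    steps = [registry[step.get("ref")] for step in root.findall("step")]

    def run(value):
        # Feed the output of each step into the next one, in document order.
        for step in steps:
            value = step(value)
        return value

    return run
```

Calling `compile_workflow(WORKFLOW_XML, REGISTRY)` yields a function that strips, lower-cases and tokenises its input, so changing the processing order only requires editing the XML.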

  13. Cheshire3 Grid Tests
     • Running on a 30-processor cluster in Liverpool using PVM (Parallel Virtual Machine)
     • Using 17 processors, with one “master” and 16 “slave” processes, we were able to parse and index MARC data at about 10,000 records per second
     • On a similar setup, 610 Mb of TEI data can be parsed and indexed in seconds
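The master/worker arrangement described above can be sketched in a few lines. Note the substitution: this example uses Python's standard-library thread pool instead of PVM, purely to show the split, index and merge pattern on one machine; it will not reproduce the reported throughput, and all function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def index_chunk(records):
    """Worker: build a partial term -> frequency index for one chunk."""
    counts = Counter()
    for record in records:
        counts.update(record.lower().split())
    return counts


def split(records, n_workers):
    """Master: deal records out into n_workers roughly equal chunks."""
    return [records[i::n_workers] for i in range(n_workers)]


def parallel_index(records, n_workers=16):
    """Master: farm chunks out to workers, then merge the partial indexes."""
    merged = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(index_chunk, split(records, n_workers)):
            merged.update(partial)
    return merged
```

As a rough reading of the reported figure, about 10,000 records per second across 16 worker processes would average about 625 records per second per worker, assuming the load is spread evenly.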

  14. Other Work at Berkeley • BioText Project • Directed by Prof. Marti Hearst of SIMS • Currently working on a number of areas in NLP analysis and use of Bio-Medical texts • Developing new and efficient methods for • Abbreviation recognition Slide Credit: Prof Marti Hearst

  15. BioText: Main Goals Sophisticated Text Analysis Annotations in Database Improved Search Interface Slide Credit: Prof Marti Hearst

  16. BioText: A Two-Sided Approach Blast Medline Mesh SwissProt Word Net GO Journal Full Text Empirical Computational Linguistics Algorithms Sophisticated Database Design & Algorithms Slide Credit: Prof Marti Hearst

  17. Computational Language Goals • Recognizing and annotating entities within textual documents • Identifying semantic relations among entities • To (eventually) be used in tandem with semi-automated reasoning systems. Slide Credit: Prof Marti Hearst

  18. Computational Linguistics Goals • Mark up text with semantic relations • <protein1><inhibits><protein2> • <protein><binds-with><receptor> • <chemical><increases><level-of><chemical> Slide Credit: Prof Marti Hearst

  19. Database Research Issues • Efficient querying and updating • Semi-structured information • Fuzzy synonyms • Collection subsets • Efficiently and effectively combining • Relational databases • Text databases • Layers of processing • Hierarchical Ontologies Slide Credit: Prof Marti Hearst

  20. Abbreviation Recognition
     • Example of BioText work (Schwartz and Hearst 03)…
     • Fast, simple algorithm for recognizing abbreviation definitions.
        - Simpler and faster than the rest
        - Higher precision and recall
     • Idea: Work backwards from the end
     • Examples:
        - In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
        - Gcn5-related N-acetyltransferase (GNAT)
     • Idea: use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
     Slide Credit: Prof Marti Hearst
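The "work backwards from the end" idea can be sketched as follows. This is a simplified illustration in the spirit of the algorithm cited on the slide, not the authors' full implementation: it matches the characters of the short form right to left against the text preceding the parentheses, requires the first character to start a word, and omits the candidate-window and validity checks a complete system would need. The helper `find_definitions` and its regular expression are assumptions added for the example.

```python
import re


def find_best_long_form(short_form, long_form):
    """Match short_form against long_form from right to left.

    Returns the shortest trailing span of long_form that contains the
    characters of short_form in order, or None if there is no match.
    """
    s_index = len(short_form) - 1
    l_index = len(long_form) - 1
    while s_index >= 0:
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            s_index -= 1  # skip punctuation inside the short form
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while ((l_index >= 0 and long_form[l_index].lower() != curr) or
               (s_index == 0 and l_index > 0
                and long_form[l_index - 1].isalnum())):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    # Extend left to the beginning of the word where matching stopped.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


def find_definitions(sentence):
    """Hypothetical helper: pair each parenthesised short form with a long form."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        candidate = sentence[:match.start()].rstrip()
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs
```

On the two slide examples this sketch pairs HSF with "Heat Shock Transcription Factor" and GNAT with "Gcn5-related N-acetyltransferase".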
