
Grid Search and Categorization Engine for Content Source Integration at CERN

This project focuses on integrating content sources into GRACE, a grid search and categorization engine, at CERN. It includes steps for submitting a search query, parsing the results, and testing the integration. The project also explores parallelization techniques and simulation to optimize performance.


Presentation Transcript


  1. GRACE • Jan Fiete Grosse-Oetringhaus • CERN IT/EGE • 29.11.04

  2. Grid Search and Categorization Engine Image by Hector Garcia Puigcerver

  3. GRACE Workflow • CERN’s Tasks • Content Source Integration • Grid Integration • Grid Testing

  4. Content Sources Integration • Content Source • Input: Search Query • Output: Search Results • HTML output • OAI (Open Archives Initiative) compliant output • Personalized configuration file for each Content Source (SPEC file) • Integration Steps • Submit the search • Parse the result • Retrieve associated documents

  5. Step 1: Submit the Search • Goal: Submit Search Query • Input: Query in GRACE format • Go to content source, find search fields • Add field to SPEC file: <get-param name='p'> <paramval name='/query/Quick-Search'/> </get-param>
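
The mapping in the SPEC fragment above can be illustrated with a minimal sketch. It assumes that a `<get-param>` element names an HTTP GET parameter and that its `<paramval>` child points at a field of the GRACE query. Those semantics are inferred from the slide, and the helper `build_search_url`, the base URL, and the query dictionary are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# SPEC fragment from the slide: GET parameter 'p' is filled from the
# query field at path /query/Quick-Search (assumed semantics).
SPEC_FRAGMENT = """
<get-param name='p'>
  <paramval name='/query/Quick-Search'/>
</get-param>
"""


def build_search_url(base_url, spec_fragment, query_fields):
    """Build a GET URL from one <get-param> element (hypothetical helper)."""
    param = ET.fromstring(spec_fragment)
    field_path = param.find("paramval").get("name")
    value = query_fields[field_path]
    return base_url + "?" + urlencode({param.get("name"): value})


# Invented content source URL and query, for illustration only.
url = build_search_url(
    "http://contentsource.example/search",
    SPEC_FRAGMENT,
    {"/query/Quick-Search": "higgs boson"},
)
print(url)
```

A real SPEC file presumably carries more than one parameter per content source; the sketch handles a single element to keep the mapping visible.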

  6. Step 2: Parse the Result • Goal: Produce result sets interpretable by GRACE • Input: Search Result in HTML format

  7. Step 2: Parse the Result

  8. Step 2: Parse the Result • Goal: Produce result sets interpretable by GRACE • Input: Search Result in HTML format • Identify Fields: Title, Author, Abstract, … • Produce XPath expressions, e.g. /root/html/body/div/a • Produce XSL (Extensible Stylesheet Language) transformation code • Produce code for retrieval of associated documents • Output: XML result sets
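
The field-extraction step can be sketched as follows. This is not the XSL transformation GRACE uses: it applies the XPath expression directly with Python's standard library, assumes the HTML has already been tidied into well-formed XML wrapped in a `<root>` element, and writes to an invented result-set schema (`results`, `result`, `title`, `document`). The page content is made up.

```python
import xml.etree.ElementTree as ET

# A tidied, well-formed result page wrapped in <root>, matching the slide's
# example path /root/html/body/div/a. The content is invented.
PAGE = """
<root><html><body>
  <div><a href="http://contentsource.example/doc/1">Grid search engines</a></div>
  <div><a href="http://contentsource.example/doc/2">Text categorization</a></div>
</body></html></root>
"""


def parse_results(page_xml):
    """Turn link elements into a simple XML result set (hypothetical schema)."""
    root = ET.fromstring(page_xml)
    results = ET.Element("results")
    # ElementTree paths are relative to the root element, so
    # /root/html/body/div/a becomes ./html/body/div/a here.
    for link in root.findall("./html/body/div/a"):
        result = ET.SubElement(results, "result")
        ET.SubElement(result, "title").text = link.text
        # The href is kept so the associated document can be retrieved later.
        ET.SubElement(result, "document").text = link.get("href")
    return results


result_set = parse_results(PAGE)
print(ET.tostring(result_set, encoding="unicode"))
```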

  9. Step 3: Test your file • Test application (seaLion): part of GRACE application • Submits search using a given SPEC file • Returns GRACE result set • Provides debug output • CSTest script • Uses seaLion • Validates results • Batch testing

  10. Results • 16 Content Sources integrated • Input for Deliverable 6.1 • Workflow of Integration • Status, common problems and risks • HowTo: Configuration of Content Sources for Integration with GRACE • Usable by content providers who want to integrate their content source into GRACE • TestKit • Test application & scripts • HowTo & TestKit available on GRACE website

  11. Grid Integration • Two Grid components: • Text Normalizing • Categorizing • Components provided by partners, CERN responsible for integration

  12. First approach • “One for all” (model M1) • Parallel execution of simultaneous searches • O(hours) for complete process

  13. M1 Performance

  14. How to parallelize?

  15. Parallelized Model • Split text normalization

  16. Parallelized Model • Split outside the Grid • Launch N jobs • Perform text normalization • Store results in the Grid (using Replica Manager) • Monitor Status • Launch Categorization job • Pick up documents from the Grid and merge them • Perform Categorization • Get result from Categorization job
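
The control flow of this model (split, N normalization jobs, merge, one categorization job) can be sketched locally. This is only an analogy: a thread pool stands in for Grid job submission and the Replica Manager, and `normalize` and `categorize` are toy placeholders, not the partners' components.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize(chunk):
    """Stand-in text normalization: lowercase and collapse whitespace."""
    return [" ".join(doc.lower().split()) for doc in chunk]


def categorize(docs):
    """Stand-in categorization: group documents by their first word."""
    categories = {}
    for doc in docs:
        categories.setdefault(doc.split()[0], []).append(doc)
    return categories


def split(docs, n_jobs):
    """Split the documents into n_jobs roughly equal chunks."""
    return [docs[i::n_jobs] for i in range(n_jobs)]


def run_parallel_model(docs, n_jobs):
    # Split outside the "Grid", then run one normalization job per chunk.
    chunks = split(docs, n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        normalized_chunks = list(pool.map(normalize, chunks))
    # The single categorization job picks up all chunks and merges them.
    merged = [doc for chunk in normalized_chunks for doc in chunk]
    return categorize(merged)


docs = ["Grid  Computing at CERN", "grid testing", "Physics   computing"]
print(run_parallel_model(docs, n_jobs=2))
```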

  17. Simulation • Simulate parallelized model including • Submission time • Grid overhead • Application overhead • Application performance • Interesting values • User (UI) Waiting Time • Spent Computing Time

  18. Simulation

  19. Conclusions from Simulation • Derived rules for splitting parameters • Minimize user waiting time → Kopt • Save “unnecessary” resources by splitting less than the optimal value: let the user wait 20% more (unnoticeable) → Keff • Calculated formulas for splitting parameters • Implemented in Java class for GRACE application
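
The slides do not give the formulas, so the following is only a toy illustration of how Kopt and Keff relate. It assumes a waiting time of the form T(K) = K·submit + overhead + work/K, where per-job submission cost grows with the number of jobs K and the work per job shrinks. The model, the parameter values, and the function names are all assumptions, not the formulas implemented in GRACE.

```python
def waiting_time(k, submit, overhead, work):
    """User waiting time for k jobs under the assumed toy model."""
    return k * submit + overhead + work / k


def splitting_parameters(submit, overhead, work, max_k=1000, tolerance=0.20):
    """Return (k_opt, k_eff) by brute force over 1..max_k."""
    times = {k: waiting_time(k, submit, overhead, work)
             for k in range(1, max_k + 1)}
    # k_opt minimizes the user waiting time.
    k_opt = min(times, key=times.get)
    # k_eff is the smallest split whose waiting time stays within
    # `tolerance` (20%) of the optimum, which saves jobs.
    limit = (1 + tolerance) * times[k_opt]
    k_eff = min(k for k, t in times.items() if t <= limit)
    return k_opt, k_eff


# Invented values in seconds: 10 s per submission, 300 s fixed Grid
# overhead, 40000 s of total normalization work.
k_opt, k_eff = splitting_parameters(submit=10.0, overhead=300.0, work=40000.0)
print(k_opt, k_eff)
```

Under this toy model the waiting-time curve is flat near its minimum, which is why accepting a 20% longer wait can roughly halve the number of jobs.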

  20. Measured Results

  21. Measured Results

  22. Results • JDLs for Grid Jobs created for both models • GRACE can run on the Grid • Description of Grid Jobs • Input for Deliverable 6.1 • Parallelized Job Model • Used in Grid Tests

  23. Grid Tests • Test plan for both models and comparison • Creation of input corpus • Creation of test scripts for semi-automatic testing • Creation of scripts for validation of output and parsing of logging • General tests started 20.10.04 • Main test period from 05.11 to 25.11.04 • Tests performed in GILDA testbed • Submitted more than 1000 jobs • Made about 1 million Java API calls

  24. Results

  25. Comparison

  26. Results • Input for Deliverable 7.2 • Validation of the suitability of GRACE for the Grid • Performance testing of the Application • Validation of the parallelized model • Validation of simulated results • Intensive use of GILDA • Feedback to GILDA • Feedback to EGEE • New requirements list

  27. What else…

  28. gContainer • SSL Web service container following WSRF standard • Based upon WSRF::Lite • Service discovery • Load management • Factory service • Can start and manage arbitrary service • Hosted services • Grid Access Service • API Service for Communication with ROOT

  29. Grid Access Service (GAS) • The Grid Access Service represents the user entry point to a set of core services • Composed of different modules (diagram labels: File Catalogue Metadata client GAS WMS)

  30. Trips • CERN School of Computing, Vico Equense • Grid Computing • Physics Computing • Software Techniques • GRACE General Meeting, Brussels • Project Meeting • Workshop at Global Grid Forum • EGEE JRA1 Design Team Meeting, Padova • Presenting the Grid Access Service

  31. Thanks… • … for your attention • … this very nice time at CERN!
