Grid Search and Categorization Engine for Content Source Integration at CERN

GRACE • Jan Fiete Grosse-Oetringhaus • CERN IT/EGE • 29.11.04

Grid Search and Categorization Engine • Image by Hector Garcia Puigcerver

GRACE Workflow • CERN’s Tasks • Content Source Integration • Grid Integration • Grid Testing

Content Sources Integration • Content Source • Input: Search Query • Output: Search Results • HTML output • OAI (Open Archives Initiative) compliant output • Personalized configuration file for each Content Source (SPEC file) • Integration Steps • Submit the search • Parse the result • Retrieve associated documents

Step 1: Submit the Search • Goal: Submit Search Query • Input: Query in GRACE format • Go to the content source and find its search fields • Add the field to the SPEC file: <get-param name='p'> <paramval name='/query/Quick-Search'/> </get-param>
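As an illustration of what the SPEC fragment in Step 1 expresses, the following Python sketch resolves the paramval path against a query document and builds the GET request URL. It is a sketch under stated assumptions, not the GRACE implementation: the shape of the query document, the base URL, and the helper name build_search_url are hypothetical, and reading the paramval name as a path into the GRACE query is an interpretation of the slide.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical GRACE query document; the real query format is not shown in the slides.
GRACE_QUERY = "<query><Quick-Search>higgs boson</Quick-Search></query>"

# SPEC fragment as given on the slide.
SPEC_FRAGMENT = "<get-param name='p'><paramval name='/query/Quick-Search'/></get-param>"


def build_search_url(base_url, spec_xml, query_xml):
    """Map one query field to one GET parameter, as the SPEC fragment suggests."""
    spec = ET.fromstring(spec_xml)
    query = ET.fromstring(query_xml)

    # Path into the query document, e.g. '/query/Quick-Search'.
    path = spec.find("paramval").get("name")
    parts = path.strip("/").split("/")
    if parts[0] != query.tag:
        raise ValueError("path does not start at the query root: " + path)

    # ElementTree resolves paths relative to the root element, so drop the root name.
    value = query.findtext("/".join(parts[1:]))
    if value is None:
        raise ValueError("query has no field at " + path)

    # The get-param name ('p') becomes the name of the GET parameter.
    return base_url + "?" + urlencode({spec.get("name"): value})


url = build_search_url("http://example.org/search", SPEC_FRAGMENT, GRACE_QUERY)
```

With the sample query above, the Quick-Search value is sent to the content source as the GET parameter p.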

Step 2: Parse the Result • Goal: Produce result sets interpretable by GRACE • Input: Search Result in HTML format

Step 2: Parse the Result

Step 2: Parse the Result • Goal: Produce result sets interpretable by GRACE • Input: Search Result in HTML format • Identify Fields: Title, Author, Abstract, … • Produce XPath expressions, e.g. /root/html/body/div/a • Produce XSL (Extensible Stylesheet Language) transformation code • Produce code for retrieval of associated documents • Output: XML result sets
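The field extraction in Step 2 can be sketched in Python. The actual integration uses XSL transformations driven by the SPEC file; this sketch swaps in Python's standard ElementTree module only to show how a path like /root/html/body/div/a selects result entries. The sample page and the output element names (results, record, title, url) are hypothetical, since the GRACE result set schema is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page wrapped in a <root> element,
# matching the path shape /root/html/body/div/a from the slide.
SAMPLE_PAGE = """<root><html><body>
<div><a href="http://example.org/doc/1">First title</a></div>
<div><a href="http://example.org/doc/2">Second title</a></div>
</body></html></root>"""


def to_result_set(page_xml):
    """Extract one record per matching link and return an XML result set."""
    page = ET.fromstring(page_xml)
    results = ET.Element("results")
    # ElementTree paths are relative to the root element, so the leading
    # '/root' of the slide's expression is written here as './'.
    for link in page.findall("./html/body/div/a"):
        record = ET.SubElement(results, "record")
        ET.SubElement(record, "title").text = link.text
        ET.SubElement(record, "url").text = link.get("href")
    return results


result_set = to_result_set(SAMPLE_PAGE)
result_xml = ET.tostring(result_set, encoding="unicode")
```

Real search pages are rarely well-formed XML, so a production pipeline would first need an HTML clean-up step; that step is outside this sketch.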

Step 3: Test your file • Test application (seaLion): part of GRACE application • Submits search using a given SPEC file • Returns GRACE result set • Provides debug output • CSTest script • Uses seaLion • Validates results • Batch testing

Results • 16 Content Sources integrated • Input for Deliverable 6.1 • Workflow of Integration • Status, common problems and risks • HowTo: Configuration of Content Sources for Integration with GRACE • Usable by content providers who want to integrate their content source into GRACE • TestKit • Test application & scripts • HowTo & TestKit available on GRACE website

Grid Integration • Two Grid components: • Text Normalizing • Categorizing • Components provided by partners, CERN responsible for integration

First approach • “One for all” (model M1) • Parallel execution of simultaneous searches • O(hours) for complete process

M1 Performance

How to parallelize?

Parallelized Model • Split text normalization

Parallelized Model • Split outside the Grid • Launch N jobs • Perform text normalization • Store results in the Grid (using Replica Manager) • Monitor Status • Launch Categorization job • Pick up documents from the Grid and merge them • Perform Categorization • Get result from Categorization job
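The control flow of the parallelized model can be sketched locally in Python. This is only an analogy: a thread pool stands in for Grid job submission, an in-memory list stands in for storage through the Replica Manager, and the normalization and categorization functions are trivial hypothetical placeholders for the partner components.

```python
from concurrent.futures import ThreadPoolExecutor


def split(documents, n_jobs):
    """Split outside the Grid: deal the documents into n_jobs chunks."""
    return [documents[i::n_jobs] for i in range(n_jobs)]


def normalize_job(chunk):
    """Placeholder for one text normalization job (lowercase, collapse spaces)."""
    return [" ".join(doc.lower().split()) for doc in chunk]


def categorize_job(documents):
    """Placeholder for the categorization job: group documents by first word."""
    categories = {}
    for doc in documents:
        key = doc.split()[0] if doc else ""
        categories.setdefault(key, []).append(doc)
    return categories


def run_parallelized(documents, n_jobs):
    chunks = split(documents, n_jobs)
    # Launch N normalization jobs; leaving the 'with' block waits for all of them.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        normalized_chunks = list(pool.map(normalize_job, chunks))
    # Pick up the stored results and merge them before categorization.
    merged = [doc for chunk in normalized_chunks for doc in chunk]
    return categorize_job(merged)


categories = run_parallelized(["Grid  Computing", "grid testing", "Physics  Data"], 2)
```

The single categorization step after the merge mirrors the slide: only text normalization is split, while categorization runs once over the merged documents.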

Simulation • Simulate parallelized model including • Submission time • Grid overhead • Application overhead • Application performance • Interesting values • User (UI) Waiting Time • Spent Computing Time

Simulation

Conclusions from Simulation • Derived rules for splitting parameters • Minimize user waiting time → Kopt • Save “unnecessary” resources by splitting less than the optimal value, letting the user wait 20% more (unnoticeable) → Keff • Calculated formulas for splitting parameters • Implemented in a Java class for the GRACE application
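The trade-off between Kopt and Keff can be illustrated with a toy cost model in Python. The slides do not give the actual formulas, so everything here is an assumption made for illustration: jobs are submitted one after another at a fixed cost t_submit, each job pays a fixed Grid overhead t_grid, and the normalization work divides evenly across K jobs. The function names and the sample numbers are hypothetical.

```python
def waiting_time(k, work, t_submit, t_grid):
    """Assumed user waiting time: sequential submission + overhead + work share."""
    return k * t_submit + t_grid + work / k


def computing_time(k, work, t_grid):
    """Assumed total computing time spent: every job pays the Grid overhead."""
    return k * t_grid + work


def k_opt(work, t_submit, t_grid, k_max=1000):
    """Number of splits that minimizes the user waiting time."""
    return min(range(1, k_max + 1),
               key=lambda k: waiting_time(k, work, t_submit, t_grid))


def k_eff(work, t_submit, t_grid, tolerance=0.20, k_max=1000):
    """Smallest number of splits whose waiting time stays within
    (1 + tolerance) of the optimum, saving resources at a small cost."""
    best = k_opt(work, t_submit, t_grid, k_max)
    limit = (1 + tolerance) * waiting_time(best, work, t_submit, t_grid)
    for k in range(1, best + 1):
        if waiting_time(k, work, t_submit, t_grid) <= limit:
            return k
    return best


# Hypothetical numbers in seconds: one hour of work, 10 s per submission,
# 120 s Grid overhead per job.
WORK, T_SUBMIT, T_GRID = 3600, 10, 120
optimal = k_opt(WORK, T_SUBMIT, T_GRID)
efficient = k_eff(WORK, T_SUBMIT, T_GRID)
```

Under these assumed numbers the optimum is 19 splits, while 10 splits already stay within the 20% tolerance and spend noticeably less computing time, which is the effect the Keff rule aims for.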

Measured Results

Results • JDLs for Grid Jobs created for both models • GRACE can run on the Grid • Description of Grid Jobs • Input for Deliverable 6.1 • Parallelized Job Model • Used in Grid Tests

Grid Tests • Test plan for both models and comparison • Creation of input corpus • Creation of test scripts for semi-automatic testing • Creation of scripts for validation of output and parsing of logging • General tests started 20.10.04 • Main test period from 05.11 to 25.11.04 • Tests performed in GILDA testbed • Submitted more than 1000 jobs • Made about 1 million Java API calls

Results

Comparison

Results • Input for Deliverable 7.2 • Validation of the suitability of GRACE for the Grid • Performance testing of the Application • Validation of the parallelized model • Validation of simulated results • Intensive use of GILDA • Feedback to GILDA • Feedback to EGEE • New requirements list

What else…

gContainer • SSL Web service container following WSRF standard • Based upon WSRF::Lite • Service discovery • Load management • Factory service • Can start and manage arbitrary services • Hosted services • Grid Access Service • API Service for Communication with ROOT

Grid Access Service (GAS) • The Grid Access Service represents the user entry point to a set of core services • Composed of different modules • [Diagram labels: client, GAS, File Catalogue, Metadata, WMS]

Trips • CERN School of Computing, Vico Equense • Grid Computing • Physics Computing • Software Techniques • GRACE General Meeting, Brussels • Project Meeting • Workshop at Global Grid Forum • EGEE JRA1 Design Team Meeting, Padova • Presenting the Grid Access Service

Thanks… • … for your attention • … for this very nice time at CERN!
