
[Figure: CiteSeerX architecture diagram. Panel titles: Execution Layer, Data Layer, Ingestion Workflow. Labeled components: Load Balancer, Servlet Containers with Session Replication, Spring MVC Framework (Controller, Data Model, JSP Views), MyCiteSeer, CAS and Acegi Security (Authorize, Authenticate), replicated User DB, CSEL Scripts, CSEL Engines (Control Flow: <sequence>, <flow>, <invoke, <link…), Native Load Balancing, Index Servers, Citation Matching, Data Acquisition, Central Logger, Task Repositories, File Server, Distributor, Fedora or DBMS, Replicated File System, Repository, XML File Server, Exact Dup Detector, Crawl Manager, Receiver, Crawlers, WWW, BPEL Engine, Header, Acknowl, Figures, Tables, Citations.]

CiteSeerX: Next-Gen CiteSeer

Isaac G. Councill¹, Huajing Li², Levent Bolelli², Yang Song², Ziming Zhuang¹, Jian Huang¹, Yang Sun¹, Ding Zhou², Wang-Chien Lee², Anand Sivasubramaniam², C. Lee Giles¹,²

¹ College of Information Sciences and Technology

² Department of Computer Science and Engineering

The Pennsylvania State University, University Park, PA 16802, USA

Supported in part by NSF CRI 0454052, NSF 0202007, Microsoft Research, and NASA

Background

The CiteSeer Research Library was created in 1997 to demonstrate autonomous citation indexing (ACI), and has since grown to a collection of over 770,000 documents. CiteSeer currently receives over 1 million requests daily and serves over 1 terabyte of data every month. Having outgrown its original architecture, CiteSeer is being re-architected from the ground up.

Legacy CiteSeer

Web Application

  • MyCiteSeer - New Personal Content Portal

    • Personal collections, RSS-like notifications, social bookmarking, social network facilities

    • Personalized search settings

    • Institutional data tracking possible

    • Transparent document submission system

  • J2EE Servlet Deployment

    • Uses the Spring MVC Framework for improved organization and extensibility; a generic model calls the lower-tier execution system for data

    • Secure single sign-on (SSO) through Acegi Security and the Central Authentication Service (CAS)

Execution System

  • CSEL Distributed Execution Engine

    • CiteSeer Execution Language: very simple BPEL-like XML scripting, sequential/parallel task flows

    • CSEL scripts control complex service workflow

    • Scalable and flexible: fine-tuned service replication; built-in load balancing, fail-over

  • Component Task Provider Framework

    • Base libraries provide high-performance container for auto-registering and executing component tasks, including all low-level CiteSeer functionality

    • Scheduled batch events, e.g. near-dup detection

  • High-Performance Object Communication

    • Custom framework transfers large serialized Java objects in microseconds - connection type plug-ins

  • Fedora Integration: Investigating low-level API

    • Currently backed by a DBMS; pursuing the benefits of Fedora
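
The sequential/parallel task flows described for CSEL can be illustrated with a small interpreter. This is a minimal sketch, not the CSEL engine: the poster shows only the element names `<sequence>`, `<flow>`, and `<invoke`, so the `task` attribute, the script contents, and the function names `run` and `execute` are all assumptions made for illustration. Children of `<sequence>` run in order, while children of `<flow>` run concurrently in a thread pool.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical script: the task attribute and task names are illustrative,
# not actual CSEL syntax.
SCRIPT = """
<sequence>
  <invoke task="fetch"/>
  <flow>
    <invoke task="parse_header"/>
    <invoke task="parse_citations"/>
  </flow>
  <invoke task="index"/>
</sequence>
"""


def run(node, tasks, log):
    """Execute one script node, appending finished task names to log."""
    if node.tag == "invoke":
        name = node.get("task")
        tasks[name]()          # call the registered task
        log.append(name)       # list.append is thread-safe in CPython
    elif node.tag == "sequence":
        for child in node:     # children run one after another
            run(child, tasks, log)
    elif node.tag == "flow":
        with ThreadPoolExecutor() as pool:  # children run concurrently
            futures = [pool.submit(run, child, tasks, log) for child in node]
            for future in futures:
                future.result()  # re-raise any failure from a child
    else:
        raise ValueError(f"unknown element: {node.tag}")


def execute(script, tasks):
    """Parse a script and return the order in which tasks completed."""
    log = []
    run(ET.fromstring(script), tasks, log)
    return log


# Placeholder tasks that do nothing; real tasks would call services.
TASKS = {name: (lambda: None)
         for name in ("fetch", "parse_header", "parse_citations", "index")}
```

Calling `execute(SCRIPT, TASKS)` returns a log that always begins with `fetch` and ends with `index`; the two parse tasks in between may finish in either order because they share a `<flow>`.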

Data Ingestion

Configurable Ingestion Pipeline

  • Extraction and classification algorithms deployed as a collection of web services (WS); implementations in multiple languages possible

  • Orchestration via BPEL (Business Process Execution Language), a standard backed by leading companies that supports complex workflows

  • Results submitted as XML with URI pointers to file resources - local server provides file access
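
As a rough illustration of results submitted as XML with URI pointers, the sketch below builds and reads back such a message. The poster does not give the schema, so the element names `result` and `resource`, the attributes `doc`, `type`, and `uri`, and the example URIs are all hypothetical.

```python
import xml.etree.ElementTree as ET


def build_result(doc_id, resources):
    """Serialize extraction results as XML; each resource is a URI pointer.

    The element and attribute names here are assumptions, not the
    CiteSeerX schema.
    """
    root = ET.Element("result", {"doc": doc_id})
    for kind, uri in resources.items():
        ET.SubElement(root, "resource", {"type": kind, "uri": uri})
    return ET.tostring(root, encoding="unicode")


def read_resources(xml_text):
    """Return a {type: uri} mapping from a result message."""
    root = ET.fromstring(xml_text)
    return {r.get("type"): r.get("uri") for r in root.iter("resource")}


# Hypothetical file-server URIs for one ingested document.
message = build_result("doc-42", {
    "header": "http://files.example.org/doc-42/header.xml",
    "citations": "http://files.example.org/doc-42/citations.xml",
})
```

The message carries only pointers, so large extracted files stay on the local file server and the receiver fetches them when needed.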

Continuous, Manageable Crawling

  • Crawl results are piped to the ingestion receiver. The pipe can be closed at will; the crawl manager then queues results
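
The closable pipe between crawler and receiver can be sketched as follows. This is an illustrative model only: the class name `CrawlManager`, its methods, and the choice to flush the backlog in arrival order when the pipe reopens are assumptions, since the poster states only that the crawl manager queues while the pipe is closed.

```python
from collections import deque


class CrawlManager:
    """Forwards crawl results to a receiver, queueing while the pipe is closed."""

    def __init__(self, receiver):
        self.receiver = receiver   # callable that ingests one result
        self.pipe_open = True
        self.backlog = deque()     # results held while the pipe is closed

    def close_pipe(self):
        self.pipe_open = False

    def open_pipe(self):
        # Assumption: queued results are delivered in arrival order on reopen.
        self.pipe_open = True
        while self.backlog:
            self.receiver(self.backlog.popleft())

    def submit(self, result):
        if self.pipe_open:
            self.receiver(result)
        else:
            self.backlog.append(result)
```

With this shape, crawling continues uninterrupted while ingestion is paused for maintenance, and no crawl result is dropped.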

New and Improved Algorithms

  • Tables, figures, acknowledgments, header

