Web Site Management Based on Declarative Specifications

Alon Levy, University of Washington
Joint work with:
Strudel: Dana Florescu (INRIA), Mary Fernandez, Dan Suciu (AT&T), Khaled Yagoub (INRIA)
Tiramisu: Corin Anderson and Dan Weld (UW)

Problem: Building Web sites • Building Web sites involves three tasks: • Selecting and managing the site’s content • Organizing the site’s structure (pages and links) • Designing the graphical presentation of pages. • In current tools, these tasks are (mostly) interdependent. • Strudel’s key ideas: • Separate the three tasks. • Manage content and structure declaratively.

Content Management and Graphical Presentation • Content may be derived from multiple sources: • Databases: relational, object-oriented • Semi-structured sources (XML, Word, Excel, BibTeX). Classical data integration problem! (see TSIMMIS, Garlic, Information Manifold, Tukwila) • Graphical presentation: • Need to integrate with tools that create animations, images, Java applets. • Create sets of similar HTML pages using templates.

Web-Site Structure • The structure includes: • Set of pages and contents of each page, and • Links between the pages.

Current practice • Current tools separate only content management from presentation: • Content managed by database: • Embed queries in HTML templates • Simple tools to view and modify structure at the extensional level. • WYSIWYG tools for managing presentation. • But they still cannot: • explicitly manage a site’s global structure, or • flexibly choose a content-management system. • As a result it’s hard to: • modify the structure of a web-site, build multiple versions for different classes of users, enforce integrity constraints.

Talk Outline • Problem definition • Strudel architecture • Advantages of declarative specifications: • Specifying and verifying integrity constraints. • Automatic generation of run-time plans for managing data-intensive web sites. • Tiramisu: • Separating the design tool from the implementation. • Using a collection of tools to build a site.

Strudel Evolution • Strudel (Nov. 96) [AT&T] • Strudel AT&T Release: http://www.research.att.com/sw/tools/strudel • Strudel-R (INRIA) • Tiramisu (Sept. 98) (U. Washington)

Strudel Architecture and System

Strudel • Features: • Integrates content from multiple sources. • High-level declarative language for managing site’s structure (StruQL). • Advantages: • Derives multiple sites from the same data. • Supports easy restructuring and modification. • Provides platform for: • Enforcing integrity constraints • Designing policies for efficient run-time management of sites.

Strudel Architecture

Data Model • Strudel is based on a semi-structured data model: • labeled directed graphs, • nodes in the graph represent objects, • labels on arcs represent attribute names, • named collections. • Why semi-structured data? • raw data is often semi-structured (and I don’t mean that it’s embedded in HTML) • convenient for data integration (à la TSIMMIS) • web-sites are ultimately graphs.
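The data model above can be sketched in a few lines of Python. This is a minimal illustration, not Strudel's implementation: the class name `Graph` and its methods are hypothetical, and the sample values are taken from the example raw data used later in the talk.

```python
from collections import defaultdict


class Graph:
    """Labeled directed graph with named collections (illustrative only)."""

    def __init__(self):
        self.edges = defaultdict(list)       # source node -> [(label, target)]
        self.collections = defaultdict(set)  # collection name -> set of nodes

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def add_to_collection(self, name, node):
        self.collections[name].add(node)

    def targets(self, source, label):
        """All targets of arcs leaving `source` that carry `label`."""
        return [t for (l, t) in self.edges.get(source, []) if l == label]


g = Graph()
for article in ("article1", "article2"):
    g.add_to_collection("Articles", article)

# The two objects do not share a schema: only article1 has images,
# only article2 has a video.
g.add_edge("article1", "Date", "8/1/97")
g.add_edge("article1", "Image", "im1.gif")
g.add_edge("article1", "Image", "im.gif")
g.add_edge("article2", "Date", "8/2/97")
g.add_edge("article2", "Video", "vid1.avi")
```

Because attributes are just labeled arcs, a missing or repeated attribute needs no schema change, which is the property that makes the model convenient for integrating irregular sources.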

The StruQL Query Language • A StruQL query is a function from a set of input graphs to an output graph. • A StruQL expression contains two parts: • A query component, and • A restructuring component. • Formally: • INPUT: graph names • WHERE: conjunction of regular path expression atoms • CREATE: name the nodes in the output graph using Skolem functions • LINK: specify the links in the resulting graph. • StruQL evolved into XML-QL (see WWW8 Conference)

Example Raw Data
Article 1:
    Date: 8/1/97
    Title: “Clinton announces new …”
    Priority: Headline
    Category: USA News
    Images: im1.gif, im.gif
    Text: “President Clinton announced…”
    Related article: article2
Article 2:
    Date: 8/2/97
    Title: “FDA approves new cure for…”
    Priority: Top Story
    Category: Health
    Video: vid1.avi
    Text: “The Federal Drug Administration…”

CNN Web-site Query (part 1)
Input graph of articles:
    INPUT CNN-ARTICLES
Create a web page for each article (note the arc variable l):
    WHERE Articles(a), a -> l -> t,
          l in { "Title", "Abstract", "Date", "Text", "Image", "Topimage", "RelatedSite" },
          a -> "Category" -> c
    CREATE ArticlePage(a)
    LINK ArticlePage(a) -> l -> t
    { WHERE a -> "RelatedArticle" -> r
      LINK ArticlePage(a) -> "RelatedArticle" -> ArticlePage(r) }
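To make the WHERE / CREATE / LINK reading concrete, here is a rough Python sketch that evaluates this query by hand over a few triples. It is only an illustration under stated assumptions, not how Strudel evaluates StruQL: the triple encoding, `build_site`, and `article_page` are hypothetical names, and the edge labels follow the query ("RelatedArticle") rather than the raw-data slide.

```python
# Input graph as (source, label, target) triples.
EDGES = [
    ("article1", "Title", "Clinton announces new …"),
    ("article1", "Date", "8/1/97"),
    ("article1", "Category", "USA News"),
    ("article1", "RelatedArticle", "article2"),
    ("article2", "Title", "FDA approves new cure for…"),
    ("article2", "Date", "8/2/97"),
    ("article2", "Category", "Health"),
]
ARTICLES = {"article1", "article2"}  # the Articles collection
COPIED_LABELS = {"Title", "Abstract", "Date", "Text",
                 "Image", "Topimage", "RelatedSite"}


def article_page(a):
    """Skolem function: the same argument always names the same output node."""
    return ("ArticlePage", a)


def build_site(edges, articles):
    nodes, links = set(), set()
    for a, l, t in edges:
        # WHERE Articles(a), a -> l -> t, l in {...}, a -> "Category" -> c
        has_category = any(s == a and lab == "Category" for s, lab, _ in edges)
        if a in articles and l in COPIED_LABELS and has_category:
            nodes.add(article_page(a))           # CREATE ArticlePage(a)
            links.add((article_page(a), l, t))   # LINK ArticlePage(a) -> l -> t
            # Nested block: inherits the outer bindings, adds one more atom.
            for s, lab, r in edges:
                if s == a and lab == "RelatedArticle":
                    links.add((article_page(a), "RelatedArticle", article_page(r)))
    return nodes, links


nodes, links = build_site(EDGES, ARTICLES)
```

Since `article_page` is deterministic, every binding that mentions the same article refers to the same page node, so pages are never duplicated however many attributes an article has.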

CNN Site Schema (diagram) • Nodes: RootPage(), RootPageEntry(a), CategoryEntry(c), CategoryPage(c), ArticlePage(a) • Conditions on the arcs: a -> priority -> “headline”; a -> category -> c • RootPageEntry(a): Data(t) :- a -> l -> t, l in {“title”, “top-image”} • ArticlePage(a): Data(t) :- a -> l -> t, l in { "Title", "Abstract",…}

CNN Web-site Query (part 2)
Link each headline story to its title, date, top image and full article:
    CREATE RootPage
    { WHERE a -> "Priority" -> "headline",
            l in { "Title", "Date", "Topimage" }
      CREATE RootEntry(a)
      LINK RootPage -> "HeadlineStory" -> RootEntry(a),
           RootEntry(a) -> "FullStory" -> ArticlePage(a),
           RootEntry(a) -> l -> t }

HTML Templates
    <h1> <SFMT title EMBED> </h1>
    <h2> <SFMT date EMBED> </h2>
    <SIF top-image>, <SFMT top-image EMBED> <SFMT text EMBED> </SIF>
    <SFOR a IN related-article ORDER=descend KEY=date>
        <SFMT @a LINK=title>
    </SFOR>
    <BR>

CNN Sports Query
    INPUT CNN
    WHERE TopCategory(c), c -> "CategoryName" -> cn, cn = "Sports",
          c -> "SubTopic" -> top,
          Articles(a), a -> l -> t,
          l in { "Title", "Abstract", "Date", "Text", "Image", "Topimage", "RelatedSite" },
          a -> "Category" -> c, c = top
    CREATE ArticlePage(a)
    LINK ArticlePage(a) -> l -> t

StruQL Details • Regular path expressions are constructed by a grammar: • R <- “a” | ε | R1.R2 | (R1 | R2) | R1* | L | _ • Atoms in the WHERE clause are of the form X -> R -> Y or C(X) • The LINK clause includes atoms of the form: • LINK f(X) -> “new link” -> g(X) or • LINK f(X) -> L -> g(X) • Queries can be nested, inheriting the WHERE clauses of their outer blocks. • Note separation between querying part and restructuring part!
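As a rough illustration of what an atom X -> R -> Y means, the sketch below computes, for a start node, every node reachable along a path matching R. The tuple encoding of expressions and the function `reach` are assumptions made for this example, not StruQL syntax, and arc variables (L) are left out for brevity.

```python
# Expressions are nested tuples (a hypothetical encoding):
#   ("label", "a")   a single arc labeled "a"
#   ("any",)         the wildcard _
#   ("empty",)       the empty path
#   ("seq", R1, R2)  R1.R2
#   ("alt", R1, R2)  R1 | R2
#   ("star", R1)     R1*

def reach(graph, start, expr):
    """Return the set of nodes y such that start -> expr -> y holds."""
    kind = expr[0]
    if kind == "label":
        return {t for (l, t) in graph.get(start, []) if l == expr[1]}
    if kind == "any":
        return {t for (_, t) in graph.get(start, [])}
    if kind == "empty":
        return {start}
    if kind == "seq":
        return {y
                for mid in reach(graph, start, expr[1])
                for y in reach(graph, mid, expr[2])}
    if kind == "alt":
        return reach(graph, start, expr[1]) | reach(graph, start, expr[2])
    if kind == "star":
        # Fixpoint over repeated applications; `seen` guards against cycles.
        seen, frontier = {start}, [start]
        while frontier:
            node = frontier.pop()
            for nxt in reach(graph, node, expr[1]):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return seen
    raise ValueError(f"unknown expression kind: {kind}")


# Node -> list of (label, target); the node names are made up for the example.
GRAPH = {
    "root": [("HeadlineStory", "entry1")],
    "entry1": [("FullStory", "page1")],
    "page1": [("RelatedArticle", "page2")],
    "page2": [("RelatedArticle", "page1")],
}

everything = reach(GRAPH, "root", ("star", ("any",)))
full_story = reach(GRAPH, "root",
                   ("seq", ("label", "HeadlineStory"), ("label", "FullStory")))
```

The `("star", ("any",))` expression corresponds to the `*` path used in reachability constraints such as Root -> * -> ArticlePage(a).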

More on StruQL • Bare bones language for semi-structured data: includes the essential features. • More expressive than Lorel or UnQL (e.g., can reverse graphs) • Conceptually and in practice: separation between query component and restructuring component is important. • Containment is decidable for StruQL-WHERE (Florescu, Levy & Suciu, PODS-98)

Advantages of Declarative Specifications

Enforcing Integrity Constraints • We often want to verify some constraints on site structure: • all articles from the last two days are reachable from the root • all paths to confidential data must go through an authentication node • Good site design principles are summarized as integrity constraints [Lohse, CACM, 98]. • When site specs are long, constraints are hard to enforce. • Want to verify constraints intensionally.

Intensional IC Verification • Formally, we want to check whether S(D) |= IC for any instance D of the underlying data, where: • S is the site specification (e.g., a StruQL query) • IC is a formula describing the constraint, e.g.: ∀a, Article(a) & date(a) > today-2 => Root -> * -> ArticlePage(a). • Results: • Sound and complete algorithms for verification of a class of integrity constraints (path constraints). • Algorithms will also propose corrections when ICs are violated.
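For contrast with the intensional approach, the following sketch checks the reachability constraint extensionally, on one already-built site graph. It is not the verification algorithm referred to above, which reasons about the specification for all instances D; the names `reachable` and `violations` and the sample links are hypothetical, and the date filter is assumed to have been applied already.

```python
from collections import deque


def reachable(links, root):
    """Nodes reachable from `root` over (source, label, target) links."""
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for src, _label, dst in links:
            if src == node and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


def violations(links, root, recent_articles):
    """Recent articles whose ArticlePage cannot be reached from the root."""
    reached = reachable(links, root)
    return sorted(a for a in recent_articles
                  if ("ArticlePage", a) not in reached)


# One concrete site instance: article2 has a page but nothing links to it.
LINKS = [
    ("RootPage", "HeadlineStory", ("RootEntry", "article1")),
    (("RootEntry", "article1"), "FullStory", ("ArticlePage", "article1")),
    (("ArticlePage", "article2"), "Title", "FDA approves new cure for…"),
]

missing = violations(LINKS, "RootPage", ["article1", "article2"])
```

A check like this has to be rerun whenever the data changes, and a passing run says nothing about other instances; that limitation is what motivates verifying the constraint against the specification itself.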

Run-time Management of Sites • When do we compute web pages? • Static approach: completely precompute site • Doesn’t work for large sites or forms, and is hard to update. • Dynamic approach: compute pages on request • Users may wait, a lot of repeated computation, structure of the site is not exploited. • Current tools use one of the extremes, or specify policy per collection of pages. • The specification is implicit in code. • Our goal: use site specification to automatically find an optimal strategy.

Possible Run-time Optimizations • View materialization • Function caching: • when web sites represent hierarchically structured data, successive queries in the site differ only in their projected attributes. • Simplification under preconditions: • previous queries on the path may have already verified some conditions for current query. • Lookahead computation: • often it is possible with little cost to compute the data necessary for subsequent pages.
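The function-caching idea can be illustrated with a small Python sketch: the unprojected result for an article is computed once and cached, and each page then projects only the attributes it needs. This is an assumption-laden toy, not Strudel-R's mechanism; `fetch_article`, `project`, the in-memory `ARTICLES` table, and the call counter are all hypothetical.

```python
from functools import lru_cache

# Stand-in for the underlying data source.
ARTICLES = {
    "article1": {
        "Title": "Clinton announces new …",
        "Date": "8/1/97",
        "Text": "President Clinton announced…",
    },
}
CALLS = {"count": 0}  # counts how often the "expensive" query really runs


@lru_cache(maxsize=None)
def fetch_article(article_id):
    """Pretend-expensive query returning all attributes of one article."""
    CALLS["count"] += 1
    return tuple(sorted(ARTICLES[article_id].items()))


def project(article_id, attributes):
    """Answer a page's query by projecting from the cached full result."""
    row = dict(fetch_article(article_id))
    return {name: row[name] for name in attributes if name in row}


# The root entry and the full article page differ only in projected attributes.
root_entry = project("article1", ["Title", "Date"])
full_page = project("article1", ["Title", "Date", "Text"])
```

Both pages are served from a single underlying fetch, which is the saving the slide attributes to hierarchically structured sites.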

Problem Statement • Given: • site specification • knowledge about browsing patterns • cost function • Produce: • Operational plan: operational schema + a set of queries to compute on a given page request. • Results: (in Strudel-R): framework + • Performance study of the optimizations. • Algorithm for generating operational plans. • Identification of many open problems.
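To give a feel for the shape of this problem, here is a deliberately simplified sketch that picks a per-page policy from request and update rates. Everything in it is an assumption for illustration: the cost model, the function `choose_policy`, and the numbers are invented, and the actual Strudel-R framework produces richer operational plans than a precompute/dynamic flag.

```python
def choose_policy(pages):
    """pages maps a page name to (requests_per_day, updates_per_day, compute_cost).

    Toy cost model (hypothetical): a dynamic page is recomputed on every
    request, a precomputed page is recomputed on every data update.
    """
    plan = {}
    for name, (requests, updates, cost) in pages.items():
        dynamic_cost = requests * cost
        precompute_cost = updates * cost
        plan[name] = "precompute" if precompute_cost < dynamic_cost else "dynamic"
    return plan


# Made-up browsing patterns: a hot root page and a rarely visited search page.
PAGES = {
    "RootPage": (10000, 24, 5.0),
    "ArchiveSearch": (3, 24, 5.0),
}

plan = choose_policy(PAGES)
```

Even this toy version shows why browsing patterns are an input: the same page specification gets a different policy depending on how often it is requested relative to how often its data changes.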

Strudel Experience --> Tiramisu

Experiences with Strudel (except for the lousy GUI) • Integrating data from multiple sources when building a Web site is a prime concern. Sources are semi-structured! • Declarative specification of site structure is very important because: • site creation is a highly iterative process • site owners often need redesign after experience from deployment • we often generate multiple versions of sites from the same data. • Design of web-sites is done in a top-down fashion. • Strudel can’t be the all-encompassing web-site management tool.

Tiramisu: the Second Generation • Strudel and its siblings (Araneus, YAT, WebOQL, WIRM) force the design and implementation of the site to be done in the same tool. • Furthermore, there will always be tools that are specialized for specific tasks. • Tiramisu: • Separate design phase from implementation. • Allow the implementation to be done by a set of cooperating tools.

Tiramisu Architecture (diagram) • Components shown: data sources, each with a wrapper • mediator • E/R style diagram of site (site schema) • Implementation manager • Tools: ASP, FrontPage, Strudel • web site

Screenshot of a TERD

Conclusions • Web-site management is an important area for Database research. • First-generation systems (Strudel, Araneus, YAT, WebOQL) offer important advantages: • Easy modification, creation of multiple versions • enforcing constraints, run-time management • Second generation: (Tiramisu) • Emphasize design phase of site • Implement with a collection of cooperating tools.
