Maximizing Peer-to-Peer Portals Through XML: An Integration Case Study from the EPA. Presentation for the Enterprise Web and Corporate Portal Conference, September 5-6, 2001, Santa Clara, California, by Brand Niemann, Ph.D., Office of Environmental Information, U.S. Environmental Protection Agency
Overview • “The Big Picture” • XML and P2P 101 • Integrating Information and Applications • Questions and Answers
“The Big Picture” The Semantic Web XML Structured Data & Information Categories, Metadata, & Databases Integrated Web Pages Titles & Metatags Personal Web Pages P2P Content Networks Web Pages Portals
Portals and Content Networks NXT 3 Interface Search, Personalization, Document Management, Metadata, etc. Content Network: Hierarchical Folders – Each a Portal! “Portlets” Portal(s) “Portlets”
Portals and Content Networks • NXT 3 options: • Customize NXT 3 interface as a portal – 4-day class. • Integrate with Groupware (e.g., Lotus Notes) and Content Management System (e.g., Interwoven, Vignette, and DOCS Open). • NextPage Product Announcement • Universal Updates in Peer Space - called “Proactive Delivery” at this conference. • Solo: offline distribution of portal. • Matrix: collaborate on distributed content in context of a business process.
XML and P2P are a Disruptive Technology and Architecture • Repurpose or republish content. • Breaks down information silos/stovepipes. • Challenges traditional centralization and security practices. • Improves upon a “simple topics” view of categorization (as content grows in size and diversity, need multiple topics and more topics). • Queries produce new content. • Etc.
The Value Proposition • Corporate and government information is undervalued and hidden because it is trapped in proprietary formats and “stovepipe” systems, so it is not fully accessible and is difficult and expensive to integrate. • XML makes information more accessible and interoperable and “future proofs” it against periodic technology change.
XML 101 • Key Questions: • 1. Is it a programming language like Java or something in plain text that is read and acted upon by a browser? • 2. What browsers can handle it and how prevalent are they? • 3. What are some EPA problems and applications? • 4. Is it as easy to learn as HTML? • 5. Are there helpful tools, as there are for HTML? • 6. How soon will it be prevalent? • 7. What would be a “killer application” for EPA?
Key Question 1 • Is it a programming language like Java or something in plain text that is read and acted upon by a browser? • Simple answer – Plain text on purpose. • More complete answer – eXtensible Markup Language (XML) is an incredibly powerful system for managing information. Use it with many other technologies (Java, ASP (Active Server Pages), etc.). HTML defines how elements are displayed; XML defines what those elements contain.
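To make the "plain text on purpose" answer concrete, here is a minimal sketch using Python's standard library. The `<facility>` record and its element names are invented for illustration and are not an EPA schema; the point is only that the tags name what each value is, not how it should look.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names are illustrative only, not an EPA schema.
RECORD = """<facility>
  <name>Example Plant</name>
  <state>CA</state>
  <permits>3</permits>
</facility>"""


def describe(xml_text):
    """Return {element name: text} for the children of the root element."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


fields = describe(RECORD)
print(fields)
```

Any program that can read text can recover the labeled values this way, which is what lets XML sit alongside Java, ASP, and other technologies rather than compete with them.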
Background • 1991: Tim Berners-Lee designed the WWW (Weaving the Web, HarperBusiness, 2000, paperback) • 1993: Marc Andreessen created the Mosaic and Netscape Web browsers • 1996: XML proposed by the W3C* • 2001: About 2 billion Web pages (mostly HTML) • HTML (HyperText Markup Language): A simple but elegant way of formatting data with special tags in a text file that can be viewed on virtually any computer platform. • XML (eXtensible Markup Language): Based on the same parent as HTML (SGML#) and designed to better handle the task of managing information. • HTML lets everyone do some things and XML lets some people do practically anything. *World Wide Web Consortium #Standard Generalized Markup Language
Key Question 2 • What browsers can handle it and how prevalent are they? • Simple answer – Internet Explorer 5.5 and W3C’s Amaya (also an editor). Netscape 6 (Mozilla)? • More complete answer – Use XML to manage data now and convert it to HTML on the server-side for Web browsers that lack XML support. Client-side technology is lagging, but the new SVG* is an important step forward in Web user interface technology. (See http://maps.map.net/start) *Scalable Vector Graphics – nearly a stable W3C Recommendation.
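The "manage in XML, convert to HTML on the server" advice can be sketched as follows. This is an illustrative sketch, not the method used on any EPA site: the `<committees>` data and element names are hypothetical, and XSLT is the more usual tool for this job; plain Python is used here only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical data; the element names are illustrative only.
XML_DATA = """<committees>
  <committee><name>Alpha LEPC</name><state>CA</state></committee>
  <committee><name>Beta &amp; Gamma LEPC</name><state>NV</state></committee>
</committees>"""


def xml_to_html(xml_text):
    """Render each <committee> element as one row of a plain HTML table."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.findall("committee"):
        # Escape text again so characters such as & stay valid in HTML.
        name = escape(item.findtext("name", default=""))
        state = escape(item.findtext("state", default=""))
        rows.append(f"<tr><td>{name}</td><td>{state}</td></tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


html_page = xml_to_html(XML_DATA)
print(html_page)
```

The browser receives ordinary HTML and needs no XML support, while the source data stays in XML for other uses.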
Background • World Wide Web Consortium (W3C): • Created in 1994 to lead the WWW to its full potential by developing common protocols that promote its evolution and ensure its interoperability. • More than 500 organizations worldwide participate in this forum for information, commerce, communication, and collective understanding. • http://www.w3.org/ • XML is the universal format for structured documents and data on the Web and became a W3C Recommendation in February 1998.
Key Question 3 • What are some EPA problems and applications? • Simple answer – XML technology and Peer-to-Peer (P2P) architecture will make practically everything we do better, faster, and cheaper (XML: A Manager’s Guide, Addison-Wesley Information Technology Series, 2000). • More complete answer – It is being used or planned for use in Web database delivery, data exchange and integration, electronic records management, public access content management, and distributed content integration. The EPA XML Technical Advisory Group* has a database of projects and applications in Lotus Notes. *Chartered by OEI management in July 2000.
Selected Examples • Web database delivery: OSWER's Chemical Emergency Preparedness and Prevention Office Local Emergency Planning Committee Database (LEPC), http://www.epa.gov/ceppo/lepclist.htm • Data exchange and integration: Integrated Taxonomic Information System (ITIS) Canadian XML Version, http://sis.agr.gc.ca/pls/itisca/taxaget?p_ifx= • Public access content management and distributed content integration: EPA Node on a Federal Government Content Network, http://www.sdi.gov/server.htm
Six Databases Need 30 Filters Oracle PostgreSQL Sybase MySQL Informix Access
Six Databases and an XML Hub Only Need 12 Filters Oracle PostgreSQL Sybase MySQL XML Hub Informix Access
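The two filter counts follow from simple arithmetic, sketched here under the assumption that a "filter" is a one-way converter: direct exchange needs one filter per ordered pair of systems, n(n-1), while a hub needs one filter into XML and one out of XML per system, 2n. The function names are illustrative.

```python
def pairwise_filters(n):
    """One-way filters needed when every system converts directly to every other."""
    return n * (n - 1)


def hub_filters(n):
    """One-way filters needed when every system converts only to and from a hub."""
    return 2 * n


# Compare the two approaches as the number of systems grows.
for n in (2, 3, 6, 10):
    print(n, pairwise_filters(n), hub_filters(n))
```

For six databases this gives 30 versus 12. The two approaches break even at three systems, and the hub's advantage widens with every system added after that.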
XML for Interchange Between Applications Database GIS Spreadsheet XML Repository XML OLAP Data Warehouse 3D Visualization
Key Question 4 • Is it as easy to learn as HTML? • Simple answer: No, but there are resources that make learning it much like learning HTML. See XML for the World Wide Web: Visual QuickStart Guide, Elizabeth Castro, Peachpit Press, http://www.cookwood.com/xml/index.html • More complete answer: I recommend training for all managers and for hands-on workers because: • If you think XML is just for techies, or aren’t sure what it is, you’re already behind the curve. • XML is the new standard for exchanging data electronically. • XML is a better way of organizing Web content. • XML will help you do a lot of things faster, better, and cheaper. • Source: XML: A Manager’s Guide, book foreword by Dr. David A. Taylor.
Key Question 5 • Are there helpful tools, as there are for HTML? • Simple answer – Yes, and they continue to improve. • More complete answer - XML-Journal Readers' Choice Awards (the Oscars of the Software Industry), http://www.sys-con.com/xml/readerschoice/index.html • XMLSpy 3.0 won 6 of the 13 award categories! http://www.xmlspy.com/
Key Question 6 • How soon will it be prevalent? • Simple answer – It is becoming prevalent outside the agency now because of the Federal CIO XML Working Group and will become so at EPA because of the CDX* and NEIEN# projects. • More complete answer – See “The State of XML: Why Individuals Matter,” XML.COM, http://www.xml.com/pub/a/2001/05/30/stateofxml.html • Many existing technologies are being re-engineered to take advantage of XML, gaining interoperability benefits previously too costly to realize (called the “attack of the angle brackets”). • Industries are finding that XML vocabularies can form a basis for collaboration and cost-cutting. • XML’s influence is proving disruptive to the technological status quo. * Central Data Exchange #National Environmental Information Exchange Network
Key Question 7 • What would be a “killer application” for EPA? Bridge the Digital Divide and provide universal access to Web content! • Simple answer – The phone remains the ubiquitous communications device and can be used to meet the new Section 508 requirements, so if you can access content via the Web from a browser, you can access it using VoiceXML from the telephone. • More complete answer – The VoiceXML Forum, the W3C Voice Browser Working Group, and vendors provide standards and tools: • http://www.voicexml.org/ • http://www.w3.org/Voice/ • http://studio.tellme.com/
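As a rough illustration of the "same content by phone" idea, the sketch below wraps plain sentences in a minimal VoiceXML document using the `vxml`, `form`, `block`, and `prompt` elements. The sample sentences, the function name, and the `version` value are assumptions made for illustration; the output has not been validated against any voice browser or vendor platform.

```python
import xml.etree.ElementTree as ET


def to_voicexml(sentences, version="1.0"):
    """Wrap plain sentences in a minimal VoiceXML form that reads them aloud.

    The version value is an assumption; check the target platform's requirements.
    """
    vxml = ET.Element("vxml", version=version)
    form = ET.SubElement(vxml, "form")
    block = ET.SubElement(form, "block")
    for sentence in sentences:
        # One <prompt> per sentence to be spoken.
        prompt = ET.SubElement(block, "prompt")
        prompt.text = sentence
    return ET.tostring(vxml, encoding="unicode")


# Hypothetical content, for illustration only.
doc = to_voicexml(["Welcome.", "Air quality today is good."])
print(doc)
```

The same source text that feeds a Web page could feed a generator like this one, which is what makes the telephone a second front end rather than a separate system.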
P2P 101 • P2P Collaboration – Turn desktop PC into a server that directly shares data with other PCs. • P2P Networking – Each node in the network functions as both server and client without the need for central systems. • P2P Technology – Just like the Internet – fast, flexible, and decentralized. (Ossi Urchs, Internet Guru) • P2P and Web Services – Both are about Web servers communicating with one another.
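The idea that each node acts as both server and client can be sketched with a small in-memory model. This is not NXT 3 or any real P2P protocol: the `Peer` class, the peer names, and the documents are all hypothetical, and no network code is involved.

```python
class Peer:
    """A node that both serves its own documents and requests others'."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = dict(documents)
        self.neighbors = []

    def connect(self, other):
        """Link two peers symmetrically; no central server is involved."""
        self.neighbors.append(other)
        other.neighbors.append(self)

    def serve(self, title):
        """Server role: answer a request from local content only."""
        return self.documents.get(title)

    def fetch(self, title):
        """Client role: check local content, then ask each direct neighbor."""
        local = self.serve(title)
        if local is not None:
            return self.name, local
        for peer in self.neighbors:
            found = peer.serve(title)
            if found is not None:
                return peer.name, found
        return None


# Hypothetical peers and documents, for illustration only.
epa = Peer("epa", {"air": "Air quality summary"})
usgs = Peer("usgs", {"water": "Streamflow summary"})
epa.connect(usgs)
print(epa.fetch("water"))
print(usgs.fetch("air"))
```

Each peer answers requests with `serve` and issues them with `fetch`, so either side can play either role and content stays where it was created.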
“Hierarchical Peer-to-Peer” Architecture Key: Client Nodes (outer circles); Server Nodes (inner circles)
“True Peer-to-Peer” Architecture Key: Peer Nodes (all circles)
Integrating Information and Applications • Environmental Node and Earth Science Portal – EPA and USGS • Federal Statistics and Tools – FedStats.Net and Beyond • XML Pilot Projects Portal – National Environmental Information Exchange Network and EPA’s Central Data Exchange • The Uberportal* – FedGov Content Network *Gartner Group, Local Briefing, Emerging Internet Technologies, June 27, 2001, p. 19.
Content Network Concepts • Folders can contain files, databases, and Web resources. • Folders can/should be on different Web servers, but look and function as though they are on the same Web server. • This is accomplished by two new XML-based standards that send lean XML messages between the Web servers: • Content Network Protocol (CNP) • eXtensible Indexing Language (XIL) • Distributed folders and nodes can be managed both centrally and locally by the Content Network Manager and the Manage Content Administration Tools.
NXT 3 Technology • Version 3.2 just released: • Cross-platform with port to Sun Solaris (Intel/Windows and Sun/Solaris can be connected peer Web servers). • ORACLE database adapter and 87 new language encoders. • New installation wizard migrates previous content collections. • Application Framework for use with leading portal products.
Federal Statistics and Analysis Tools - FedStats.Net and Beyond • Virtual centralization of Federal statistics. http://www.fcw.com/fcw/articles/2001/0108/cov-xmlbx3-01-08-01.asp • Repurposing of the Annual Statistical Abstract 1999 into an XML document database. • Republishing of the Annual Statistical Abstract 2000 into an Integrated Distributed Statistical Compendia for live content management. • Integration of Federal statistics with analysis and visualization tools from Insightful Corporation.
XML Pilot Projects Portal • National Environmental Information Exchange Network • A standards-based, highly interconnected, dynamic, flexible and secure network, operating with broad-based voluntary participation of individual state environmental agencies and EPA. • Federal Computer Week, June 18, 2001 http://www.fcw.com/fcw/articles/2001/0618/pol-epa-06-18-01.asp • EPA’s Central Data Exchange • The point for electronic entry for nearly all environmental data submissions to the agency. • http://www.epa.gov/cdx/ • TIBCO Extensibility Canon/Developer • A leading design-time repository that manages the development and deployment of XML assets utilizing a Web-based interface.
The Uberportal*: FedGov Content Network • Purpose: • The FedStats.Net Content Network “proof of concept” was a success, so apply it to the integration of Federal Web portals (about 50) into a Federal Government Content Network. • FirstGov.Gov, Science.Gov, etc. could use content network technology to complement and supplement their work with search engine-based technologies.
Concepts • Web search engine-based technology and efforts help find and organize content for content networks. • Content network technology adds value to the Web experience by providing more structure to and improved searching of the actual content, either in its original form and location or repurposed or republished using XML and P2P technologies and tools.
Strategy • There are 6 basic ways to integrate Web portals into a Content Network using NXT 3 technology: • 1. Use the Web Content Service to crawl and index the contents of external Web sites to integrate their content. • 2. Use the Content Network Link to connect to other Web servers running NXT 3 to syndicate their content (“Server P2P”). • 3. Replicate the content of a Web server on a central Web server because of agency security constraints.
Strategy (continued) • Six basic ways (continued): • 4. Re-purpose or re-publish key content to improve its usability in a content network. • 5. XML-ize proprietary search engine indices. • 6. Use distributed content generation technologies to feed the content network from the “grassroots level” (“Desktop P2P”).
Examples of Work in Progress • Work on an example for each of the six basic ways: • 1. Crawl and index – many Web documents and Web sites. • 2. Syndicate NXT 3 servers – NextPage, EPA, and USGS so far. • 3. Replicate content – Federal databases on CD/DVD and the Web (“the top 100”!?).
Examples of Work in Progress (continued) • Work on an example for each of the six basic ways (continued): • 4. Re-purpose or re-publish – Distributed Integrated Statistical Compendia based on the Census Bureau’s Year 2000 Statistical Abstract and other examples. • 5. XML-ize proprietary search engine indices – looking for search engine that will provide it. • 6. Distributed content generation technologies – using Manage Content feature of NXT 3 and encouraging organizations to try My Association technology.
Questions and Answers • Brand Niemann, Ph.D. • USEPA Headquarters, EPA West, Room 6143D • Office of Environmental Information, MC 2822T • 1200 Pennsylvania Avenue, NW, Washington, DC 20460 • 202-566-1657 • email@example.com • EPA: http://220.127.116.11 • Outside EPA: http://18.104.22.168