
Programming for WWW (ICE 1338)

Programming for WWW (ICE 1338). Lecture #4, July 2, 2004. In-Young Ko, iko .AT. icu.ac.kr, Information and Communications University (ICU). Announcements. Our TA: Mr. Trinh Minh Cuong. Email: minhcuong .AT. icu.ac.kr. Office: F641. Office Hours: Tuesday 11-12PM, Thursday 2-4PM


Presentation Transcript


  1. Programming for WWW (ICE 1338), Lecture #4, July 2, 2004. In-Young Ko, iko .AT. icu.ac.kr, Information and Communications University (ICU)

  2. Announcements
     • Our TA
       • Name: Mr. Trinh Minh Cuong
       • Email: minhcuong .AT. icu.ac.kr
       • Office: F641
       • Office Hours: Tuesday 11-12PM, Thursday 2-4PM
     • Please send the instructor your team information
     • Please send the instructor your information for creating a Unix account
     • Submit your homework #1 (a URL or HTML source) by tomorrow
     Programming for WWW (Lecture#4) In-Young Ko, Information Communications University

  3. Review of the Previous Lecture
     • Cascading Style Sheets
     • Web-based Information Integration
       • Examples
       • Information Mediators
       • Information Wrappers (Web Wrappers)

  4. Contents of Today’s Lecture
     • Basic UNIX Commands
     • More on Web-based Information Integration
     • JavaScript

  5. UNIX Operating System
     • A multi-user, multi-tasking operating system
     • Developed by Ken Thompson and Dennis Ritchie at Bell Labs in the early 1970s
     • Success factors of UNIX
       • Written in a high-level language (C), improving readability and portability
       • Support of primitives (system calls), permitting complex programs to be built efficiently
       • A hierarchical file system, allowing easy maintenance
       • Hiding the machine architecture from the user, allowing programs to be run on different machines
     • http://www.unix-systems.org/

  6. Architecture of UNIX Systems
     [Figure: layered diagram with Hardware at the center, surrounded by the Kernel, then by programs such as sh, who, nroff, a.out, date, wc, grep, vi, ed, and cc (with cpp, comp, as, ld), and an outer ring of other application programs]

  7. Basic UNIX Shell Commands
     • cd - Changes to the named directory
     • pwd - Displays the current working directory
     • ls - Lists the contents of the current directory
     • ls -l - Same as above, but lists more information for each entry
     • mkdir - Makes a directory
     • rmdir - Removes a directory
     • cat - Concatenates files or shows a file's contents
     • cp - Copies a file
     • mv - Renames a file or moves it to a different directory
     • rm - Removes a file
     • logout - Terminates a Unix shell session
     • man - Accesses the manual pages
     http://infohost.nmt.edu/tcc/help/unix/unix_cmd.html

  8. Publishing Web Pages on the Server
     • Copy your files to the ‘public_html’ directory under your home directory on the server
     • Use FTP to copy the files in a local directory to the server directory:
         ftp vega.icu.ac.kr      (login with your user ID)
         cd public_html
         lcd d:\myweb
         put index.html          (or: mput *.html)
         quit
     • Your homepage is now accessible from http://vega.icu.ac.kr/~yourid

  9. Connections Between Web Clients and Servers
     [Figure: a Web browser and a Web server communicating through port 80. Server steps: Listen, Accept, Process, Return. Browser steps: Connect, Write, Read.]
     • A Web server is a daemon process that executes in the background, waiting for some event to occur

  10. Sockets
      • A socket is an end point for communication between two machines
      • A socket is an association of a protocol, address and process to an end point of communication
      [Figure: the browser/server diagram from the previous slide (Listen, Connect, Accept, Write, port 80, Process, Read, Return), with the sockets at both ends of the connection labeled]
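
The Listen / Accept / Read / Process / Return cycle on the server and the Connect / Write / Read steps on the browser can be sketched with Java's `ServerSocket` and `Socket` classes. This is a minimal illustration, not the lecture's own code: the class name `TinyServer`, its methods, and the reply text are hypothetical, and it listens on a free port chosen by the operating system instead of port 80 so that it runs without special privileges.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class TinyServer {
    /** Server side: accept one connection, read the request, return a reply. */
    static void serveOnce(ServerSocket listener) throws IOException {
        try (Socket client = listener.accept();                        // Accept
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream())) {
            String requestLine = in.readLine();                        // Read
            String header;
            while ((header = in.readLine()) != null && !header.isEmpty()) {
                // skip any request headers up to the blank line
            }
            String body = "You sent: " + requestLine;                  // Process
            out.print("HTTP/1.0 200 OK\r\n");
            out.print("Content-Type: text/plain\r\n");
            out.print("Content-Length: " + body.length() + "\r\n\r\n");
            out.print(body);                                           // Return
            out.flush();
        }
    }

    /** Browser side: connect, write a request, read the whole response. */
    static String fetch(int port, String path) throws IOException {
        try (Socket sk = new Socket(InetAddress.getLoopbackAddress(), port); // Connect
             PrintWriter out = new PrintWriter(sk.getOutputStream());
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(sk.getInputStream()))) {
            out.print("GET " + path + " HTTP/1.0\r\n\r\n");            // Write
            out.flush();
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {                   // Read
                response.append(line).append('\n');
            }
            return response.toString();
        }
    }

    /** Runs one request/response exchange over the loopback interface. */
    static String roundTrip(String path) {
        try (ServerSocket listener = new ServerSocket(0)) {            // Listen (port 0 = any free port)
            Thread server = new Thread(() -> {
                try {
                    serveOnce(listener);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            server.start();
            String response = fetch(listener.getLocalPort(), path);
            server.join();
            return response;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("/index.html"));
    }
}
```

A real Web server would loop over `accept()` and hand each connection to a worker; the single-shot version keeps the correspondence with the diagram easy to see.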

  11. Accessing Web Contents from Java Programs via Sockets

      import java.net.*;
      import java.io.*;
      …
      // Socket creation
      Socket sk = new Socket("www.icu.ac.kr", 80);

      // Write request
      OutputStream os = sk.getOutputStream();
      PrintWriter pw = new PrintWriter(os);
      pw.println("GET /index.html");
      pw.println();
      pw.flush();

      // Read results
      InputStream is = sk.getInputStream();
      InputStreamReader ips = new InputStreamReader(is);
      BufferedReader in = new BufferedReader(ips);
      String line;
      while ((line = in.readLine()) != null) {
          System.out.println(line);
      }

  12. Accessing Web Contents from Java Programs via URL Connections

      import java.net.*;
      import java.io.*;
      …
      // URL object creation
      URL url = new URL("http://www.icu.ac.kr");

      // URL connection creation
      URLConnection urlc = url.openConnection();

      // Read results
      InputStream is = urlc.getInputStream();
      InputStreamReader ips = new InputStreamReader(is);
      BufferedReader in = new BufferedReader(ips);
      String line;
      while ((line = in.readLine()) != null) {
          System.out.println(line);
      }

  13. Java String Manipulation Methods for Result Parsing
      • int indexOf(String str, int fromIndex)
      • int lastIndexOf(String str, int fromIndex)
      • boolean startsWith(String prefix)
      • boolean endsWith(String suffix)
      • boolean matches(String regex)
      • String[] split(String regex)
      • String substring(int beginIndex, int endIndex)
      • String toLowerCase()
      • String toUpperCase()
      http://java.sun.com/j2se/1.4.2/docs/api/index.html
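
To make the method list concrete, the sketch below applies each method to one sample string. The class name `StringMethodsDemo`, the `describe` helper, and the sample record are made up for illustration; only the `String` methods themselves come from the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StringMethodsDemo {
    /** Applies each String method from the slide to s and labels the result. */
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        out.add("indexOf=" + s.indexOf("URL=", 0));                  // first "URL=" at or after index 0
        out.add("lastIndexOf=" + s.lastIndexOf("=", s.length() - 1)); // last "=" searching backward
        out.add("startsWith=" + s.startsWith("Title"));
        out.add("endsWith=" + s.endsWith("/"));
        out.add("matches=" + s.matches(".*URL=http://.*"));           // regex must match the whole string
        out.add("split=" + Arrays.toString(s.split(";")));            // split on a regex delimiter
        out.add("substring=" + s.substring(6, 9));                    // characters 6, 7 and 8
        out.add("lower=" + s.toLowerCase());
        out.add("upper=" + s.toUpperCase());
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical record; a real wrapper would read lines from a result page.
        for (String result : describe("Title=ICU;URL=http://www.icu.ac.kr/")) {
            System.out.println(result);
        }
    }
}
```

Note that `substring` excludes the character at `endIndex`, and `matches` succeeds only when the pattern covers the entire string, which is why the regex is wrapped in `.*`.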

  14. Web Wrapper for Naver.com
      [Figure: a Naver.com search result page with the Title, Summary, and URL of a result item labeled]

  15. Result Parsing Strategies
      • Structure-based Parsing
        • Analyzes Web pages based on tag hierarchies
        • Cannot be used for ill-formed HTML documents
      • Pattern-based Parsing
        • Searches for a unique string pattern to locate a result item
        • Needs to identify such unique string patterns first

  16. Structure-based Result Parsing

  17. Pattern-based Result Parsing
      • Find a unique pattern that locates a result item
        • e.g., "<tr><td><font" in the Naver result pages
      • Find the prefix and suffix patterns that delimit an information piece (e.g., URL, title, summary) within the result item
        • e.g., "a href=" to extract a URL from a result line
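
The two steps above (locate each result item by a unique marker, then cut out a field between a prefix and a suffix) can be sketched with `indexOf` and `substring`. Everything here other than the "<tr><td><font" marker and the "a href=" prefix is an assumption made for illustration: the class name `PatternParser`, the quoted-attribute suffix, and the sample markup, which is not real Naver output.

```java
import java.util.ArrayList;
import java.util.List;

class PatternParser {
    static final String ITEM_MARKER = "<tr><td><font"; // marker from the slide
    static final String URL_PREFIX = "a href=\"";       // assumes double-quoted attribute values
    static final String URL_SUFFIX = "\"";

    // Hypothetical result page with two items.
    static final String SAMPLE =
            "<tr><td><font size=2><a href=\"http://www.icu.ac.kr/\">ICU</a></font></td></tr>"
          + "<tr><td><font size=2><a href=\"http://vega.icu.ac.kr/\">Vega</a></font></td></tr>";

    /** Returns the text between prefix and the next suffix, or null if either is missing. */
    static String between(String text, String prefix, String suffix, int from) {
        int start = text.indexOf(prefix, from);
        if (start < 0) {
            return null;
        }
        start += prefix.length();
        int end = text.indexOf(suffix, start);
        return end < 0 ? null : text.substring(start, end);
    }

    /** Splits the page into result items at each marker and extracts one URL per item. */
    static List<String> extractUrls(String page) {
        List<String> urls = new ArrayList<>();
        int item = page.indexOf(ITEM_MARKER);
        while (item >= 0) {
            int next = page.indexOf(ITEM_MARKER, item + ITEM_MARKER.length());
            // Restrict the search to this item so a missing link cannot match the next item's URL.
            String block = next < 0 ? page.substring(item) : page.substring(item, next);
            String url = between(block, URL_PREFIX, URL_SUFFIX, 0);
            if (url != null) {
                urls.add(url);
            }
            item = next;
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extractUrls(SAMPLE));
    }
}
```

The same `between` helper can pull out a title or summary once suitable prefix and suffix strings have been identified in the page source.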

  18. Java Implementation of Web Wrapper

      public void WebWrapper(String host, String path, String query,
                             int startIndex, int pageSize) {
          try {
              // Query translation: build the search URL from the arguments
              String address = "http://" + host + path + "?where=webkr"
                  + "&query=" + query
                  + "&start=" + startIndex + "1"
                  + "&display=" + pageSize;
              URL url = new URL(address);
              URLConnection urlc = url.openConnection();
              urlc.setRequestProperty("Accept", "*/*");
              urlc.setRequestProperty("User-Agent", "Mozilla/4.0");

              // Parsing results: read the result page line by line
              InputStream is = urlc.getInputStream();
              InputStreamReader ips = new InputStreamReader(is);
              BufferedReader in = new BufferedReader(ips);
              String line;
              while ((line = in.readLine()) != null) {
                  // System.out.println(line);
                  //
              }
          } catch (Exception e) {
              e.printStackTrace();
          }
      }

  19. Web Robots
      • A Web robot is a program (agent) that collects information while following all the links on a Web page
      • Web Robots = Crawlers = Spiders
      • Web search engines use Web robots to collect and index Web documents
      • A tag to tell Web robots not to index a page:
          <meta name="robots" content="noindex,nofollow"/>
      • Crawling methods:
        • Breadth-first crawling
        • Depth-first crawling

  20. Breadth First Crawlers
      http://ibook.ics.uci.edu/Slides/39

  21. Depth First Crawlers
      http://ibook.ics.uci.edu/Slides/39
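
The difference between the two crawling methods is only the order in which discovered links are visited. The sketch below shows both orders on a small in-memory link graph instead of the live Web; the class name `CrawlOrder`, the page names A to E, and the `sampleWeb` graph are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlOrder {
    /** A made-up Web: each page maps to the pages it links to. */
    static Map<String, List<String>> sampleWeb() {
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("D"));
        links.put("C", Arrays.asList("D", "E"));
        return links;
    }

    /** Breadth-first: visit all pages one link away before pages two links away. */
    static List<String> breadthFirst(Map<String, List<String>> links, String seed) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(seed);
        queue.add(seed);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            order.add(page);
            for (String next : links.getOrDefault(page, Collections.emptyList())) {
                if (seen.add(next)) {   // add() returns false for already-seen pages
                    queue.add(next);
                }
            }
        }
        return order;
    }

    /** Depth-first: follow the first link as far as possible before backing up. */
    static List<String> depthFirst(Map<String, List<String>> links, String seed) {
        List<String> order = new ArrayList<>();
        visit(links, seed, new HashSet<>(), order);
        return order;
    }

    private static void visit(Map<String, List<String>> links, String page,
                              Set<String> seen, List<String> order) {
        if (!seen.add(page)) {
            return;                     // already crawled
        }
        order.add(page);
        for (String next : links.getOrDefault(page, Collections.emptyList())) {
            visit(links, next, seen, order);
        }
    }

    public static void main(String[] args) {
        System.out.println("BFS: " + breadthFirst(sampleWeb(), "A")); // BFS: [A, B, C, D, E]
        System.out.println("DFS: " + depthFirst(sampleWeb(), "A"));   // DFS: [A, B, D, C, E]
    }
}
```

In both versions the `seen` set is what keeps the robot from crawling a page twice when several pages link to it, as D is linked from both B and C here.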

  22. Web-based Information Management Applications (Example Scenario)
      Identify recurring disaster areas in China, e.g., locations of floods
      • Cross-product between place names and the disaster-type categories
      • A Web document collection about ‘China disasters’
      • Classify documents based on the disaster types mentioned
      • For each map layer displayed, get the set of place names and classify the documents based on the place names
      • Plot the document clusters on the map to figure out the major flooding areas

  23. Web-based Information Management Applications (Example App. Design)
      [Figure: component diagram. Legend: sequential connection, pipelined connection. Component labels as extracted: Keyword Editor; Keyword Extractor; Product Categories Mapping Clusters; Search Engines; Place Name Extractor; Place Name Generator. Annotations: "Pipelined components", "Generate multiple sets of place names".]

  24. Problems in Composing Large-scale Information Management Applications
      • Time-consuming to explore and test a large number of options
      • Hard to choose appropriate services for collections
      • Hard to quickly substitute and test a service within a sequence of steps
      • Difficulties of capturing and reusing shared patterns of information management steps
      • Difficult to record and recurrently perform information management steps
      • Necessity of extracting abstract patterns of information management steps and reusing them
      • Hard to cope with dynamic aspects of Web resources

  25. Characteristics of Large-scale Information Management Tasks
      • Incremental development of information management steps for an abstract task goal
      • Recurrent executions of the steps
      • Evolving requirements of users
      • Shared patterns of management steps
      • Collection-based information processing
      • Dynamic aspects of information sources and services
      • Large and growing number of component services

  26. Improvement Goals
      • Significantly reduce construction time, keeping costs low
      • Enable very rapid construction/adaptation of new applications
      • Provide static and run-time diagnostic tools, facilitating debugging and performance tuning tasks
      Rapid Composition and Reconfiguration of Large-scale Custom Applications

  27. JavaScript
      • The goal of JavaScript is to provide programming capability at both the client and server ends of a Web connection
      • Originally developed by Netscape, as LiveScript
      • Became a joint venture of Netscape and Sun in 1995, renamed JavaScript
      • Now standardized by the European Computer Manufacturers Association as ECMA-262 (also ISO 16262)
      • User interactions with HTML documents in JavaScript use the event-driven model of computation

  28. A Popup Window

      <html>
      <head><title>ICE1338</title>
      <style type = "text/css">
      <!--
        p { font-size: 12pt; color: blue; background-color: yellow }
        h2, h3 { font-size: 16pt; color: red; font-style: oblique }
      -->
      </style>
      <script language = "JavaScript">
        // Shows the current date in a popup as soon as the page has loaded
        function displayDate() {
          alert("Today's date is: " + new Date() + "!!");
        }
      </script>
      </head>
      <body onLoad="displayDate()">
      <br/>
      <h2>Programming for WWW</h2>
      </body>
      </html>

  29. JavaScript vs. Java
      • The two languages share a similar syntax
      • JavaScript is a scripting language, not a programming language
      • JavaScript is an interpreter-based language
      • JavaScript is dynamically typed
      • JavaScript does not support class-based inheritance
      • JavaScript code is usually embedded in HTML documents

  30. General Syntax of JavaScript
      • Direct embedding of JavaScript code:
          <script language = "JavaScript">
            -- JavaScript script --
          </script>
      • Indirect JavaScript specification:
          <script language = "JavaScript" src = "myScript.js"></script>
      • Identifier form: begin with a letter or underscore, followed by any number of letters, underscores, and digits
      • Case sensitive
      • 25 reserved words, plus future reserved words
      • Comments: both // and /* … */
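
The identifier rule on this slide translates directly into a regular expression. The check below is written in Java, like the other examples in this lecture, and implements the rule only as the slide states it: it restricts letters to ASCII, does not reject reserved words, and does not accept the dollar sign, which ECMA-262 also permits in identifiers. The class name `IdentifierCheck` is hypothetical.

```java
import java.util.regex.Pattern;

class IdentifierCheck {
    // A letter or underscore, followed by any number of letters, underscores, and digits.
    private static final Pattern SLIDE_RULE = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Returns true if name has the identifier form described on the slide. */
    static boolean isIdentifier(String name) {
        return SLIDE_RULE.matcher(name).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"count", "_tmp1", "1stItem", "my-var"}) {
            System.out.println(name + " -> " + isIdentifier(name));
        }
    }
}
```

Because identifiers are case sensitive, `count` and `Count` both pass this check but name two different variables.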

  31. Document Object Model
      “A platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents”
      http://www.mozilla.org/docs/dom/technote/intro/

      HTML:
      <html>
        <head>
          <title>My Document</title>
        </head>
        <body>
          <h1>Header</h1>
          <p>Paragraph</p>
        </body>
      </html>

      var header = document.getElementsByTagName("H1").item(0);
      header.firstChild.data = "A dynamic document";
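
Because the DOM is language-neutral, the same two steps (look up the first h1 element, then replace the data of its text node) can be written against Java's `org.w3c.dom` interfaces. This is a sketch under stated assumptions: the page is parsed as well-formed XML with the JDK's built-in parser, so the tag name must be given in lowercase to match the markup, whereas the browser's HTML DOM accepts "H1". The class name `DomDemo` and the `retitle` method are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomDemo {
    // The document from the slide, written on one line.
    static final String PAGE =
            "<html><head><title>My Document</title></head>"
          + "<body><h1>Header</h1><p>Paragraph</p></body></html>";

    /** Replaces the text of the first h1 element and returns the element's new text. */
    static String retitle(String xhtml, String newText) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xhtml)));
            // Same calls as the JavaScript version: getElementsByTagName(...).item(0)
            Node header = document.getElementsByTagName("h1").item(0);
            // firstChild is the text node; setNodeValue changes its character data
            header.getFirstChild().setNodeValue(newText);
            return header.getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(retitle(PAGE, "A dynamic document"));
    }
}
```

In a browser the change is rendered immediately; here it only updates the in-memory tree, which would have to be serialized to be seen.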

  32. DOM Specification
      • http://www.w3.org/TR/DOM-Level-2-HTML/html.html
      • e.g.,

  33. Screen Outputs
      • The model for the browser display window is the Window object
      • Properties:
        • window.document
        • window.screenLeft
        • window.screenTop
        • …
      • Methods:
        • alert
        • confirm
        • prompt
      http://devedge.netscape.com/central/javascript/
