
Notes on Final Project of MIR Course


Presentation Transcript


  1. Notes on Final Project of MIR Course
     Part I: Crawling Phase
     Modern Information Retrieval Course, Semantic Web Research Laboratory

  2. Crawling Phase
     • Crawling the Dmoz directory
     • It has a taxonomic (tree-like) structure
     • Each subdirectory is crawled by one group

  3. Crawling Phase
     • This tree-like structure has two important components:
       • Internal nodes (also known as “topics”)
       • Leaves (also known as “pages”)
     (Figure labels: Topics, Pages)

  4. Crawling Phase
     • Each topic has:
       • a list of children (subtopics)
       • a unique path to the root node (supertopics)
       • a description
       • a list of related pages
     • Each page has:
       • a topic
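The topic and page attributes listed on this slide could be held in memory roughly as follows. This is only a sketch: the class names `Topic` and `Page`, the helper methods, and the `url` and `title` fields on a page are assumptions made for illustration, not part of the assignment.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Page:
    """A leaf of the directory tree; it belongs to exactly one topic."""
    url: str
    title: str
    topic: "Topic" = field(repr=False)  # back-reference, hidden from repr


@dataclass(eq=False)
class Topic:
    """An internal node of the directory tree."""
    name: str
    description: str = ""
    parent: Optional["Topic"] = field(default=None, repr=False)
    children: List["Topic"] = field(default_factory=list)
    pages: List[Page] = field(default_factory=list)

    def add_child(self, name, description=""):
        """Create a subtopic and link it to this topic."""
        child = Topic(name=name, description=description, parent=self)
        self.children.append(child)
        return child

    def add_page(self, url, title):
        """Attach a related page (leaf) to this topic."""
        page = Page(url=url, title=title, topic=self)
        self.pages.append(page)
        return page

    def path_to_root(self):
        """Return the supertopics, nearest first, ending at the root."""
        path = []
        node = self.parent
        while node is not None:
            path.append(node)
            node = node.parent
        return path

    def full_name(self):
        """Build a name such as Top/Science/Agriculture from the path."""
        names = [self.name] + [t.name for t in self.path_to_root()]
        return "/".join(reversed(names))
```

Storing only the parent link is enough to recover the unique path to the root, since every topic has a single parent.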

  5. Crawling Phase
     • Each topic has some characteristics
     (Figure labels: Description of Current Topic; The Current Topic (Node); List of Supertopics; List of Subtopics; List of Related Pages (Leaves))

  6. Crawling Phase
     • Deliverables for the first phase:
       • TopicNames.txt
         Each line contains a topic number and the full name of that topic, separated by a tab character (e.g. 46 Top/Science/Agriculture).
       • TopicDescs.txt
         Each line contains a topic number and the description of that topic, separated by a tab character. For some topics, the description is a zero-length string.
       • TopicHierarchy.txt
         Each line contains a pair of topic numbers, separated by a tab character. The first of the two topics is the parent of the second. Each topic has exactly one parent, except for the root (topic 0), which has no parent.
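A minimal sketch of how the three topic files described on this slide might be written. The function name `write_topic_files`, the tuple layout of its input, the UTF-8 encoding, and the collapsing of whitespace inside descriptions (so that a stray tab or newline cannot break the one-record-per-line format) are all assumptions, not requirements stated in the slides.

```python
import os


def write_topic_files(topics, out_dir):
    """Write TopicNames.txt, TopicDescs.txt and TopicHierarchy.txt.

    topics: iterable of (number, full_name, description, parent_number)
            tuples; parent_number is None for the root (hypothetical layout).
    """
    names, descs, hierarchy = [], [], []
    for number, full_name, description, parent in topics:
        names.append(f"{number}\t{full_name}\n")
        # Assumption: collapse tabs/newlines so each record stays on one line.
        clean = " ".join(description.split())
        descs.append(f"{number}\t{clean}\n")
        if parent is not None:
            # Parent first, child second, as the slide specifies.
            hierarchy.append(f"{parent}\t{number}\n")

    outputs = (
        ("TopicNames.txt", names),
        ("TopicDescs.txt", descs),
        ("TopicHierarchy.txt", hierarchy),
    )
    for filename, lines in outputs:
        path = os.path.join(out_dir, filename)
        with open(path, "w", encoding="utf-8", newline="\n") as f:
            f.writelines(lines)
```

The root produces a line in TopicNames.txt and TopicDescs.txt but none in TopicHierarchy.txt, which matches the rule that topic 0 has no parent.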

  7. Crawling Phase
     • Deliverables for the first phase:
       • DocUrls.txt
         Each line contains a document number and its URL, separated by a tab character.
       • DocTitles.txt
         Each line contains a document number and its title, separated by a tab character.
       • DocTopics.txt
         Each line contains a document number and a topic number, separated by a tab character. This indicates that the document belongs to the given topic.
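Before submitting, the document files can be cross-checked against each other. The sketch below parses the tab-separated lines and reports document numbers that have no URL and topic numbers that are not known. The function names and the choice of checks are assumptions; the slides do not prescribe any validation.

```python
def parse_pairs(lines):
    """Parse 'number<TAB>value' lines into a list of (int, str) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate a trailing blank line
        number, _, value = line.partition("\t")
        pairs.append((int(number), value))
    return pairs


def check_documents(url_lines, title_lines, doc_topic_lines, topic_numbers):
    """Return a list of human-readable problems (empty if none found).

    topic_numbers: set of topic numbers, e.g. taken from TopicNames.txt.
    """
    urls = dict(parse_pairs(url_lines))
    problems = []
    for doc, _title in parse_pairs(title_lines):
        if doc not in urls:
            problems.append(f"document {doc} in DocTitles.txt has no URL")
    for doc, topic in parse_pairs(doc_topic_lines):
        if doc not in urls:
            problems.append(f"document {doc} in DocTopics.txt has no URL")
        if int(topic) not in topic_numbers:
            problems.append(f"document {doc} refers to unknown topic {topic}")
    return problems
```

Each argument ending in `_lines` can be an open file object, since iterating over a text file yields its lines.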

  8. Crawling Phase
     • Deliverables for the first phase:
       • Documents.zip
         The contents of the documents, each stored separately.
     • A list of samples for each output file has been added to the Assignments page (for the “Science” subdirectory).

  9. Crawling Phase
     • Naming convention:
       • Names in each subdirectory start with a special character:

  10. Crawling Phase
     • Then, for each subtree, generate numeric names for the children in BFS order.
     • E.g. in the Science subdirectory:
     (Figure: sample tree with topics numbered 1 to 5 and pages labeled L1 to L8; legend: Sample Topic, Sample Page)
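The BFS numbering described on this slide can be sketched as below. The function name, the dictionary representation of the tree, and the `start` parameter are assumptions for illustration. The per-subdirectory special character from the previous slide is not shown in the transcript, so the sketch leaves it out.

```python
from collections import deque


def number_topics_bfs(root, children, start=0):
    """Assign consecutive numbers to topics in breadth-first order.

    root:     name of the subtree root
    children: dict mapping a topic name to the ordered list of its subtopics
    start:    number given to the root (assumed; adjust to the course rules)
    """
    numbers = {}
    queue = deque([root])
    next_number = start
    while queue:
        topic = queue.popleft()
        numbers[topic] = next_number
        next_number += 1
        # Children are appended at the back of the queue, so every topic at
        # depth d is numbered before any topic at depth d + 1.
        queue.extend(children.get(topic, []))
    return numbers


# Usage with a small hypothetical subtree:
tree = {
    "Science": ["Agriculture", "Biology"],
    "Agriculture": ["Forestry"],
}
print(number_topics_bfs("Science", tree, start=1))
```

Pages could be numbered the same way with a second counter while the topics are dequeued.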

  11. Crawling Phase
     • Assignments of subdirectories to groups:
