
Web Information Retrieval Projects


Presentation Transcript


  1. Web Information Retrieval Projects. Ida Mele

  2. Rules • Students can work in teams (max 3 people) • The project must be delivered by the deadline that will be published on my web site. Usually the project discussion is held on the same day as the written exam. Students who register for the first exam call can present the software project in either the first or the second exam call • The project score ranges from 0 to 10. The professor decides the final mark • The same project can be assigned to at most 2 groups • For any question, doubt, or problem, send me an email

  3. Project Request • Students have to send me an email with the subject "WebIR - project request", specifying: • First and last name of each student in the group • Title of the project and the dataset the students intend to use • Short description of what the students intend to do (up to 250 words) Important: all the members of the group should be cc'ed on the email • If everything is OK, you will receive a confirmation email • There is no deadline for the project request

  4. Project Delivery • The presentation of the project takes 15 minutes • The presentation should contain: • the description of the problem and of the dataset • the most important implementation issues and how they were addressed • the results achieved • Students can use slides for their presentations and, if they want, they can prepare a demo as well • The deadline and more instructions about the project delivery will be published on my web site

  5. List of Projects • Analyze the link structure of a large graph from the Web • Find circles in a social network through link analysis • Find communities in a network of users • Classification of online reviews • Topic classification of tweets • Personalized ranking of query results • Hadoop implementation of a link-based ranking algorithm • Hadoop implementation of an inverted index

  6. Projects 1) Analyze the link structure of a large graph from the Web • Create the web graph and analyze its link structure by computing degree, in-degree, out-degree, PageRank, TruncatedPageRank, edge reciprocity, graph assortativity, number of triangles, etc. Plot the distributions of these features • List of datasets you can use: • http://law.di.unimi.it/datasets.php use one of the graphs available in the section "Larger crawls" • http://snap.stanford.edu/data/index.html use graphs in the section "Web graphs" (e.g., web-Google, web-Stanford, web-NotreDame) • http://webdatacommons.org/hyperlinkgraph/ use the graph representing subdomains
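
A few of the features listed for project 1 can be sketched in plain Python as follows. This is only an illustrative sketch on a tiny in-memory edge list: the function names (`degree_stats`, `reciprocity`, `pagerank`) are hypothetical, the damping factor 0.85 and the fixed iteration count are assumptions, and a real crawl would need a graph library or a compressed representation instead.

```python
from collections import defaultdict


def degree_stats(edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return dict(in_deg), dict(out_deg)


def reciprocity(edges):
    """Fraction of directed edges (u, v) whose reverse edge (v, u) also exists."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling mass is spread uniformly (assumption)."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = defaultdict(list)
    for u, v in edges:
        out_links[u].append(v)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for u in nodes:
            for v in out_links[u]:
                new_rank[v] += damping * rank[u] / len(out_links[u])
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "a")]
    print(degree_stats(toy_edges))
    print(reciprocity(toy_edges))
    print(pagerank(toy_edges))
```

The degree dictionaries and the PageRank scores are the values whose distributions the project asks you to plot.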

  7. Projects 2) Find circles in a social network through link analysis • Create the graph of the users of a popular social network (e.g., Twitter, Facebook, or Google+). Analyze the network and apply link-based features to identify circles. Check whether the circles you get match the ones obtained from the analysis of common features • List of datasets you can use: • http://snap.stanford.edu/data/index.html use one of the ego graphs available in the section "Social networks": ego-Facebook, ego-Gplus, or ego-Twitter. Each dataset is made of the ego network, the set of circles for the ego node, and the connections among ego networks. You can use the file with the set of circles as ground truth
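
One possible way to check detected circles against the ground-truth file is to score each ground-truth circle by its best Jaccard overlap with any detected group. The sketch below assumes that each line of a circles file holds a circle name followed by whitespace-separated node ids; the helper names (`parse_circles`, `jaccard`, `best_matches`) are hypothetical, and Jaccard similarity is just one reasonable choice of measure.

```python
def parse_circles(lines):
    """Parse lines of the assumed form '<circle name> <node id> <node id> ...'."""
    circles = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            circles[parts[0]] = set(parts[1:])
    return circles


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(truth, predicted):
    """For each ground-truth circle, the best Jaccard score over predicted groups."""
    return {
        name: max((jaccard(members, group) for group in predicted), default=0.0)
        for name, members in truth.items()
    }


if __name__ == "__main__":
    truth = parse_circles(["circle0\t1\t2\t3", "circle1\t4\t5"])
    detected = [{"1", "2"}, {"4", "5", "6"}]
    print(best_matches(truth, detected))
```

Averaging the per-circle scores gives a single number that can be compared across different link-based features.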

  8. Projects 3) Find communities in a network of users • Create a graph where nodes are people and a link between two people means that they have something in common. For example, they are collaborators (DBLP co-authorship network) or they have bought the same product (Amazon product co-purchasing network). Use this graph to find communities of people and check the results against the ground truth provided in the dataset • List of datasets you can use: • http://snap.stanford.edu/data/index.html use one of the graphs available in the section "Networks with ground-truth communities" (e.g., com-DBLP, com-Amazon, com-YouTube, com-Friendster)
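
As a rough illustration of project 3, the sketch below runs a simple label-propagation pass on an undirected graph and scores the resulting partition with modularity. Label propagation is only one of many community-detection techniques and is not prescribed by the project; the deterministic tie-breaking (highest label wins) and the function names are assumptions made to keep the example reproducible.

```python
from collections import Counter, defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list (self-loops not expected)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def label_propagation(adj, max_rounds=100):
    """Each node repeatedly adopts the most frequent label among its neighbors."""
    labels = {node: node for node in adj}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adj):
            counts = Counter(labels[nbr] for nbr in adj[node])
            top = max(counts.values())
            # Tie-breaking rule (assumption): pick the highest label.
            best = max(label for label, c in counts.items() if c == top)
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return list(groups.values())


def modularity(adj, communities):
    """Newman modularity: sum over communities of L_c/m - (d_c / 2m)^2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    q = 0.0
    for community in communities:
        internal = sum(1 for u in community for v in adj[u] if v in community) / 2
        degree = sum(len(adj[u]) for u in community)
        q += internal / m - (degree / (2 * m)) ** 2
    return q


if __name__ == "__main__":
    # Two triangles joined by a single bridge edge (2, 3).
    toy = build_adjacency([(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])
    found = label_propagation(toy)
    print(found, modularity(toy, found))
```

The detected groups can then be compared with the ground-truth communities shipped with the dataset, for example by set overlap.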

  9. Projects 4) Classification of online reviews • Given a set of user reviews about products (food, wine, etc.), analyze the text and other features to classify the reviews. Possible classifications include grouping reviews by kind or brand of product, by judgment (positive/neutral/negative), by helpfulness, etc. • List of datasets you can use: • http://snap.stanford.edu/data/index.html use the data available in the section "Online Reviews" (e.g., CellarTracker, Amazon reviews, Fine Foods, Movies)
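
A text-only baseline for project 4 could look like the following multinomial Naive Bayes classifier with add-one smoothing. It is a minimal sketch under stated assumptions: the toy reviews and labels are made up for illustration, the `NaiveBayes` class and `tokenize` helper are hypothetical names, and Naive Bayes is just one possible classifier. Non-textual features such as ratings or helpfulness votes are not used here.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # words unseen in training are ignored
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Made-up training reviews, for illustration only.
    train = [
        ("great wine, lovely aroma", "positive"),
        ("excellent taste and great value", "positive"),
        ("terrible smell, awful taste", "negative"),
        ("awful and bland", "negative"),
    ]
    model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
    print(model.predict("great aroma"))
```

The same class can be reused for the other suggested classifications by changing the labels, for example to product kinds or helpfulness levels.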

  10. Projects 5) Topic classification of tweets • Given a set of English tweets, implement a topic-classification algorithm that divides the tweets into categories. Possible categories are personal updates, news, politics, economics, sports, music, gossip, etc. You can also use ODP categories (http://www.dmoz.org/) to create the list of possible topics • List of datasets you can use: • Send me an email, and I will give you the link to the dataset you can download
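
One simple approach to project 5 is a nearest-centroid classifier over TF-IDF vectors: each topic is represented by the average vector of its labeled tweets, and a new tweet gets the topic with the highest cosine similarity. The sketch below assumes a small set of hand-labeled tweets (the examples are made up), uses a smoothed IDF formula chosen for illustration, and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase tokens; hashtags and mentions are kept as single tokens."""
    return re.findall(r"[a-z#@']+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in training."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector restricted to terms known from training."""
    tf = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(tweets, topics):
    """Return (idf, centroids) where each centroid is a topic's mean TF-IDF vector."""
    docs = [tokenize(tweet) for tweet in tweets]
    idf = fit_idf(docs)
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter(topics)
    for doc, topic in zip(docs, topics):
        for term, weight in tfidf(doc, idf).items():
            sums[topic][term] += weight
    centroids = {
        topic: {term: w / sizes[topic] for term, w in vector.items()}
        for topic, vector in sums.items()
    }
    return idf, centroids


def classify(tweet, idf, centroids):
    vector = tfidf(tokenize(tweet), idf)
    return max(centroids, key=lambda topic: cosine(vector, centroids[topic]))


if __name__ == "__main__":
    # Made-up labeled tweets, for illustration only.
    tweets = ["the team won the match", "great goal in the final match",
              "parliament passed the new law", "the minister announced elections"]
    topics = ["sports", "sports", "politics", "politics"]
    idf, centroids = train_centroids(tweets, topics)
    print(classify("what a match and what a goal", idf, centroids))
```

Tweets are short, so many test tweets will share few terms with any centroid; expanding the vocabulary with hashtags or ODP category descriptions is one way to reduce that sparsity.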

  11. Projects 6) Personalized ranking of query results • Create a system for query-result personalization. The users of the system can specify their interests by selecting them from a list of keywords (e.g., gossip, sport, politics, …). You can use an HTML form for registration to the system. • Crawl a portion of the web (e.g., news websites) and create the corresponding web graph. Use a personalized ranking algorithm, for example Topic-Specific PageRank, to rank the pages according to user interests, and compare the personalized ranking against the non-personalized one.
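
The idea behind Topic-Specific PageRank can be sketched as a PageRank whose random jumps land only on pages matching the user's interests. The code below is an illustrative sketch, not a prescribed design: the page-to-topic mapping, the function names, the damping factor 0.85, and the handling of dangling pages (their mass follows the preference vector) are all assumptions.

```python
def teleport_for_interests(page_topics, interests):
    """Teleport weights: 1.0 for every page whose topic is among the interests."""
    return {page: 1.0 for page, topic in page_topics.items() if topic in interests}


def personalized_pagerank(out_links, teleport, damping=0.85, iterations=100):
    """PageRank with a preference vector; an empty teleport set means uniform."""
    nodes = sorted(set(out_links) | {v for targets in out_links.values() for v in targets})
    total = sum(teleport.get(node, 0.0) for node in nodes)
    if total == 0:
        pref = {node: 1.0 / len(nodes) for node in nodes}
    else:
        pref = {node: teleport.get(node, 0.0) / total for node in nodes}
    rank = dict(pref)
    for _ in range(iterations):
        new_rank = {node: (1 - damping) * pref[node] for node in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling page: its mass follows the preference vector.
                for node in nodes:
                    new_rank[node] += damping * rank[u] * pref[node]
        rank = new_rank
    return rank


if __name__ == "__main__":
    links = {"news": ["sport", "politics"], "sport": ["news"],
             "politics": ["news"], "gossip": ["news"]}
    topics = {"news": "news", "sport": "sport",
              "politics": "politics", "gossip": "gossip"}
    plain = personalized_pagerank(links, {})
    personal = personalized_pagerank(links, teleport_for_interests(topics, {"sport"}))
    print(plain)
    print(personal)
```

Comparing `plain` with `personal` on the same query results is the kind of personalized versus non-personalized comparison the project asks for.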

  12. Projects 7) Hadoop implementation of a link-based ranking algorithm • Given a web graph, where nodes represent web pages and an edge between two nodes u and v represents the link from the source page u to the target page v, implement in Hadoop a ranking algorithm (PageRank or HITS) to compute the scores of the nodes. Plot and analyze the distribution of the obtained scores • List of datasets you can use: • http://law.di.unimi.it/datasets.php use one of the graphs available in the section "Larger crawls" • http://snap.stanford.edu/data/index.html use graphs in the section "Web graphs" (e.g., web-Google, web-Stanford, web-NotreDame)
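
Before writing the Hadoop job for project 7, it can help to simulate one PageRank iteration as a map step and a reduce step locally. The sketch below is plain Python, not Hadoop code: the function names are hypothetical, the shuffle phase is imitated with a dictionary, and it assumes that every node appears as a key with at least one out-link (dangling nodes would need extra handling).

```python
from collections import defaultdict


def map_node(node, rank, out_links):
    """Map: pass the adjacency list along and send a rank share to each target."""
    yield node, ("links", out_links)
    for target in out_links:
        yield target, ("rank", rank / len(out_links))


def reduce_node(node, values, n_nodes, damping=0.85):
    """Reduce: rebuild the adjacency list and sum the incoming rank shares."""
    out_links, incoming = [], 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload
        else:
            incoming += payload
    new_rank = (1 - damping) / n_nodes + damping * incoming
    return node, new_rank, out_links


def run_iteration(state, damping=0.85):
    """One simulated MapReduce round; state maps node -> (rank, out_links)."""
    shuffled = defaultdict(list)  # stands in for the shuffle/sort phase
    for node, (rank, out_links) in state.items():
        for key, value in map_node(node, rank, out_links):
            shuffled[key].append(value)
    new_state = {}
    for node, values in shuffled.items():
        _, rank, out_links = reduce_node(node, values, len(state), damping)
        new_state[node] = (rank, out_links)
    return new_state


if __name__ == "__main__":
    state = {"a": (1 / 3, ["b"]), "b": (1 / 3, ["a", "c"]), "c": (1 / 3, ["a"])}
    for _ in range(100):
        state = run_iteration(state)
    print({node: round(rank, 4) for node, (rank, _) in state.items()})
```

In a real job each call to `run_iteration` corresponds to one MapReduce pass over the graph, with the output of one pass used as the input of the next.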

  13. Projects 8) Hadoop implementation of an inverted index • Given a large collection of documents, create the inverted index, which is made of a dictionary and the posting lists. The dictionary contains the indexed terms (remove stop words and use stemming for preprocessing). For each term in the dictionary, the posting list contains information about the documents where the term appears. Each posting has the ID of the document, the frequency of the term in the document, and the positions of the occurrences of the term in the document • List of datasets you can use: • Project Gutenberg (http://www.gutenberg.org/) offers free ebooks that can be used to create the document collection
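
The index structure described for project 8 can be prototyped in memory before moving to Hadoop. The sketch below is a simplification under stated assumptions: the stop-word list is a tiny hypothetical sample, `stem` is a crude stand-in that only strips a trailing "s" (a real project would use a proper stemmer such as Porter's), and positions are counted in the original token stream, before stop-word removal.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def stem(token):
    """Crude placeholder stemmer: strip a trailing 's' from longer tokens."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def build_index(docs):
    """Map term -> {doc_id: (frequency, [positions])} for a dict of doc_id -> text."""
    positions = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for position, token in enumerate(tokens):
            if token in STOP_WORDS:
                continue
            positions[stem(token)].setdefault(doc_id, []).append(position)
    return {
        term: {doc_id: (len(pos), pos) for doc_id, pos in postings.items()}
        for term, postings in sorted(positions.items())
    }


if __name__ == "__main__":
    index = build_index({1: "The whale and the whales", 2: "Call me Ishmael"})
    for term, postings in index.items():
        print(term, postings)
```

In a MapReduce version, the mapper would emit (term, (doc_id, position)) pairs and the reducer would assemble each term's posting list, which is the same grouping that `build_index` performs here.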

  14. Important Information • Students can choose one of the projects in the list, or they can propose a different project • There are no constraints on the datasets to use: students can use the datasets suggested in the list of projects, other datasets available on the Web, or even a new dataset they create for their project • Links to other dataset sources: • http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm • http://www.trustlet.org/wiki/Repositories_of_datasets • http://www-personal.umich.edu/~mejn/netdata/

  15. Important Information • There are no constraints on the programming languages, libraries, and tools to use • Links to some tools/libraries for working with graphs: • Graph visualization: Gephi (http://gephi.org/), Graphviz (http://www.graphviz.org/) • Large-graph partitioning: METIS (http://glaros.dtc.umn.edu/gkhome/metis/metis/overview) • Java libraries: WebGraph (http://webgraph.di.unimi.it/), JUNG (http://jung.sourceforge.net/) • Python library: NetworkX (http://networkx.github.io/)
