
CS240A Project



Presentation Transcript


  1. CS240A Project • Project proposal • Teams of 2-3 students • Interesting parallel application or system • Conference-quality paper as a leveraging point (what is the challenge?) • ACM/IEEE/SIAM/USENIX conferences (SuperComputing, SIGIR, WSDM, SIGMOD, etc.) • High performance and measurement are key: • Understanding performance, tuning, scaling, etc. • More important than the difficulty of the problem • Leverage • Research projects, Master's projects

  2. Some Project Ideas • Examples • Parallel applications • Data mining. Ranking (parallel algorithms). • Duplicate detection • Secure search (encrypted data, slow to process) • Search engines/graph search, etc. • Matrix multiplication for similarity/recommendation • Systems • Speed up MapReduce I/O. Incremental computing • Integrate MPI with MapReduce • Parallel storage systems.

  3. Timeline • Week of Feb 7: Preliminary proposal • Meeting with me now or later • Select paper(s) for reviewing. • Week of Feb 15 (probably delayed by 1 week): Project progress presentation • Background review and progress report. • Week of March 17: Final project presentation. 5-page report.

  4. Datasets • Wikipedia data sets • http://www.search-engines-book.com/collections/ • 121K documents, 715MB. • NSF Research Award Abstracts 1990-2003 Data Set • (129,000 abstracts) http://archive.ics.uci.edu/ml/datasets/NSF+Research+Award+Abstracts+1990-2003 • Edmunds car reviews & Tripadvisor hotel reviews • from Tripadvisor (~259,000 reviews) and Edmunds (~42,230 reviews). • http://archive.ics.uci.edu/ml/datasets/OpinRank+Review+Dataset • Yahoo music • KDD Cup 2011 Yahoo music ratings (717M over 600K items). Columbia million song collection. • Yahoo answer dataset • 4.4M questions and their corresponding answers.

  5. Processed Datasets • Microsoft Web page ranking data (LETOR) • Feature vectors extracted from query-url pairs along with relevance judgment training data (10K, 30K). • Bag-of-words datasets • musiXmatch (song lyrics, 237,662 tracks). • http://labrosa.ee.columbia.edu/millionsong/musixmatch • Enron emails (39,861), PubMed abstracts (8.2M), NY Times articles (0.3M). http://archive.ics.uci.edu/ml/datasets/Bag+of+Words • KDD 2012 datasets for ads click and social networking. • Yahoo front page click data (45M clicks)

  6. Cache-aware Matrix Multiplication for Similarity Comparison • Similarity computation. • Two items are similar if the dot product of their vectors exceeds a threshold. • N-item similarity is a matrix multiplication problem. • Expected work: • Cache analysis and performance tuning with Hadoop C/C++ • Data: • document dataset • More details: • http://cs.ucsb.edu/~xtang/streaming.html • Contact: maha@cs, xtang@cs
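
The tiling idea behind a cache-aware similarity computation can be sketched as follows. This is a minimal illustration in Python, not the project's code: the function name `similar_pairs_blocked` and the `block` parameter are assumptions made for this example, and vectors are stored densely for simplicity. In Python the cache benefit itself is negligible; the loop structure is what would carry over to a C/C++ implementation.

```python
def similar_pairs_blocked(vectors, threshold, block=2):
    """Return sorted (i, j) pairs, i < j, whose dot product exceeds threshold.

    vectors: list of equal-length lists of numbers (dense, for simplicity).
    block:   tile size (hypothetical tuning knob). Pairs are visited tile by
             tile so that a small group of rows is reused while it would
             still be resident in cache.
    """
    n = len(vectors)
    pairs = []
    for bi in range(0, n, block):            # tile of "left" rows
        for bj in range(bi, n, block):       # tile of "right" rows (upper triangle)
            for i in range(bi, min(bi + block, n)):
                # Inside the diagonal tile, start past i so each pair is seen once.
                j_start = max(bj, i + 1)
                for j in range(j_start, min(bj + block, n)):
                    dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                    if dot > threshold:
                        pairs.append((i, j))
    return sorted(pairs)
```

The result does not depend on `block`; only the order in which pairs are visited changes, which is why the tile size can be tuned purely for performance.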

  7. Music Recommendation • Recommendation uses item similarity computation. • Two items are similar if the dot product of their vectors exceeds a threshold. • N-item similarity is a matrix multiplication problem. • Expected work: MapReduce Java programming. Data analysis • Data: • Yahoo music • Challenge: MapReduce programming. • Related paper: Fidel Cacheda, Víctor Carneiro, Diego Fernández, and Vreixo Formoso. Comparison of collaborative filtering algorithms: Limitations of current techniques and proposals for scalable, high-performance recommender systems. ACM Trans. Web, 5(1), February 2011. • http://kddcup.yahoo.com/workshop.php • Contact: maha@cs
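
One way to phrase item similarity as map and reduce steps is sketched below. It is a single-process Python simulation rather than Hadoop Java code, and every name in it (`map_user`, `reduce_pair`, `item_similarity`, the `{user: {item: rating}}` input layout) is an assumption made for illustration. The map step emits one partial product per co-rated item pair, and the reduce step sums them into a dot product.

```python
from collections import defaultdict
from itertools import combinations


def map_user(ratings):
    """Map step: for one user's {item: rating}, emit ((item_a, item_b), product)."""
    for (a, ra), (b, rb) in combinations(sorted(ratings.items()), 2):
        yield (a, b), ra * rb


def reduce_pair(products):
    """Reduce step: sum the partial products collected for one item pair."""
    return sum(products)


def item_similarity(user_ratings, threshold):
    """Return {(item_a, item_b): dot_product} for pairs above the threshold."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for ratings in user_ratings.values():
        for key, value in map_user(ratings):
            grouped[key].append(value)
    sims = {key: reduce_pair(values) for key, values in grouped.items()}
    return {key: s for key, s in sims.items() if s > threshold}
```

Emitting only co-rated pairs avoids materializing the full N x N matrix, but users with very long rating lists still generate a quadratic number of pairs, which is one place where tuning effort would likely go.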

  8. Clustered Storage and Deduplication in Cloud Backup • Build a parallel cloud backup system with deduplication support under various constraints • Work involved: use a distributed file system and build a deduplication layer. • Contact: wei@cs • Paper: Wei Zhang et al. Multi-level Selective Deduplication for VM Snapshots in Cloud Storage. In Proc. of IEEE Cloud 2012.
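
The core of a deduplication layer can be sketched as a content-addressed chunk store. The example below is a simplified, in-memory illustration and not the method of the cited paper: the class name `DedupStore`, the fixed-size chunking, and the tiny default chunk size are all assumptions chosen to keep the example short. A real system would keep the chunks in a distributed file system.

```python
import hashlib


class DedupStore:
    """In-memory chunk store: identical chunks are stored only once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # hypothetical; real systems use KB-sized chunks
        self.chunks = {}              # fingerprint -> chunk bytes

    def put(self, data):
        """Split data into fixed-size chunks and return its list of fingerprints."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fingerprint, chunk)  # store only if unseen
            recipe.append(fingerprint)
        return recipe

    def get(self, recipe):
        """Rebuild the original bytes from a list of fingerprints."""
        return b"".join(self.chunks[fingerprint] for fingerprint in recipe)
```

Each backup is represented by its recipe (the ordered fingerprint list), so two snapshots that share most of their content share most of their stored chunks.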

  9. Computation and Communication Re-Scheduling for MapReduce • This project studies alternative task scheduling and communication methods for repetitive MapReduce jobs • Schedule mapper/reducer tasks when certain computation and communication patterns of the jobs are known. • Communication can also be re-arranged. • Papers: A platform for scalable one-pass analytics using MapReduce, SIGMOD 2011 • ParaTimer: a progress indicator for MapReduce DAGs, SIGMOD 2010

  10. High Performance or Secure Inverted Index Search • Develop a prototype inverted index search code on a multi-threaded multicore shared memory system • Input: inverted index for a set of words • Output: results matching a query • Expected work: algorithm understanding. C/C++ programming using pthreads. • If you are interested in security: • Secure search reference: Privacy-Preserving Multi-keyword Ranked Search over Encrypted Cloud Data. INFOCOM 2011 • If you are interested in algorithm performance: • Culpepper, J., Moffat, A.: Efficient set intersection for inverted indexing. ACM Transactions on Information Systems (TOIS) 29(1) (2010)
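
A conjunctive query over an inverted index reduces to intersecting sorted posting lists. The sketch below shows the plain merge-based intersection as a baseline; it is single-threaded Python rather than the pthreads C/C++ code the project calls for, and the names `intersect` and `search` and the `{word: sorted list of doc ids}` index layout are assumptions for illustration.

```python
def intersect(a, b):
    """Merge-intersect two sorted lists of document ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index, terms):
    """Return sorted doc ids containing every term (AND query)."""
    postings = [index.get(term, []) for term in terms]
    if not postings:
        return []
    postings.sort(key=len)            # start from the shortest list
    result = list(postings[0])        # copy so the index is never mutated
    for posting in postings[1:]:
        result = intersect(result, posting)
        if not result:                # early exit: nothing can match
            break
    return result
```

Processing the shortest list first keeps intermediate results small. A multi-threaded version could, for example, split the document id range among threads, which is left open here.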

  11. Query Log Analysis • AOL has a leaked query log dataset • Compute query similarity using the results that users have selected. • Expected work: MapReduce programming. Algorithm/data analysis. • Reference: J. Guo, et al. Intent-aware query similarity. CIKM 2011
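
One simple way to compare queries by the results users selected is the Jaccard overlap of their clicked URL sets. The sketch below is an illustrative baseline, not the method of the cited paper; the `(query, clicked_url)` record layout, the function names, and the sample URLs are assumptions made up for this example.

```python
from collections import defaultdict


def clicked_sets(log):
    """Group a log of (query, clicked_url) records into {query: set of urls}."""
    clicks = defaultdict(set)
    for query, url in log:
        clicks[query].add(url)
    return clicks


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def query_similarity(clicks, q1, q2):
    """Similarity of two queries based on the URLs users clicked for them."""
    return jaccard(clicks.get(q1, set()), clicks.get(q2, set()))
```

For example, with made-up records:

```python
log = [
    ("cheap flights", "travel-a.example"),
    ("cheap flights", "travel-b.example"),
    ("airline tickets", "travel-b.example"),
    ("airline tickets", "travel-c.example"),
]
clicks = clicked_sets(log)
score = query_similarity(clicks, "cheap flights", "airline tickets")  # one shared URL out of three
```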

  12. Parallel Online Search System • Set up searchable datasets with a search engine • For large datasets, set up data partitions and multiple services • Expected work: integrate open source search engine software and run it on multiple machines. • Partition the data and build a search engine on a cluster. • Performance metrics: response time, throughput • Document Allocation Policies for Selective Searching of Distributed Indexes (Kulkarni and Callan). In Proceedings of the 19th ACM Conference on Information and Knowledge Management, Oct 2010, Toronto, Canada. • Kai Shen, Tao Yang, Lingkun Chu, JoAnne L. Holliday, Doug Kuschner, and Huican Zhu, Neptune: Scalable Replica Management and Programming Support for Cluster-based Network Services. In Proc. of the 3rd USENIX Symposium on Internet Technologies and Systems (USITS'01), Pages 197-208, San Francisco CA
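
The partitioned design can be sketched as document partitioning plus scatter-gather querying. The example below is a toy, single-machine Python illustration in which threads stand in for separate search services; the modulo partitioning rule and all function names are assumptions, and a real deployment would use an open source search engine on several machines. The `measure` helper shows one simple way to report the two metrics named above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def build_partitions(docs, num_partitions):
    """docs: {int doc_id: text}. Each partition gets its own inverted index."""
    partitions = [dict() for _ in range(num_partitions)]
    for doc_id, text in docs.items():
        index = partitions[doc_id % num_partitions]  # assumed allocation policy
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return partitions


def search_partition(index, terms):
    """AND query against one partition's index."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def search_all(partitions, query):
    """Scatter the query to every partition and merge the partial results."""
    terms = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(lambda idx: search_partition(idx, terms), partitions)
        return sorted(set().union(*partials))


def measure(partitions, queries):
    """Return (average response time in seconds, throughput in queries/second)."""
    start = time.perf_counter()
    for query in queries:
        search_all(partitions, query)
    elapsed = time.perf_counter() - start
    return elapsed / len(queries), len(queries) / elapsed
```

Because every document lives in exactly one partition, the union of the per-partition answers equals the answer over the whole collection. Selective search, as in the Kulkarni and Callan paper, would instead query only some partitions.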

  13. Other parallel application papers • Parallel Boosted Regression Trees for Web Search Ranking Stephen Tyree, Kilian W. Weinberger, Kunal Agrawal and Jennifer Paykin. WWW 2011. • Hybrid MPI/OpenMP Parallel Linear Support Vector Machine Training. Journal of Machine Learning Research, 2009.
