CS 540 Database Management Systems

CS 540 Database Management Systems Project Definitions

CS 540 Project Predict the 2014 NCAA Basketball Tournament Canhua Huang, Minfeng Wen

General definition of the problem: predict the 2014 NCAA Basketball Tournament. • Develop a program to test how well machine learning and statistical techniques improve the forecast. • The program has two stages. In the first stage, build and test models against previous NCAA tournaments. In the second stage, predict the outcome of the 2014 tournament.

Challenges Data analysis: • It is hard to find a reliable and accurate algorithm that improves the forecast using machine learning and statistical techniques. • No absolutely reliable data is available. • Contingency • The dataset contains too much noise, and filling in missing data is a difficult task. • A large volume of meaningful data must be processed. • Some valuable data may be missing.

Data Data from “Kaggle”: • teams.csv: Identifies the 356 different college teams that are present in at least one of the seasons from 1995-1996 through 2013-2014 • seasons.csv: Identifies the 18 different seasons included in the historical data, along with certain season-level properties • regular_season_results.csv: Identifies the game-by-game results for all 18 seasons of historical data, from season A (1995-6) through season R (2012-3)

Data • tourney_results.csv: Identifies the game-by-game NCAA tournament results for all 18 seasons of historical data • tourney_seeds.csv: Identifies the seeds for the final 64 teams in each NCAA tournament, for all 18 seasons of historical data • tourney_slots.csv: Identifies the mechanism by which teams are paired against each other, depending upon their seeds

Schedule · January 20 - January 26: Research the problem definition and the data for this program. · January 27 - February 9: Design the algorithm using machine learning and statistical techniques. · February 10 - February 16: Implement the first stage of the program. · February 17 - February 23: Implement the second stage of the program. · February 24 - March 5: Write the project report. · March 6 - March 10: Prepare for the final presentation.

CS540 Project Proposal Establish Strategy for New Businesses Based on Yelp Dataset January 21, 2014 Yaonan Zhong, Xinyang Chen, Fan Ke OSU School of Electrical Eng. & Computer Sci.

Establish Strategy for New Businesses Based on Yelp Dataset Problem and Motivation In the first round of the Yelp Dataset Challenge, most researchers focused on solving problems for existing businesses based on their review data. In our project, we try to establish commercial strategies for new businesses by analyzing information about existing ones. We will study how the Yelp dataset can help people explore new businesses based on the locations, user reviews, and check-in records of existing ones.

Establish Strategy for New Businesses Based on Yelp Dataset • Issues that concern new businesses • Why successful businesses succeed in a specific area and period of time. • Why unsuccessful ones fail in the same area and period. • What potential factors may favor new businesses over established ones.

Establish Strategy for New Businesses Based on Yelp Dataset • What we can extract from YelpDB • View the regional distribution and growth of a specific kind of business over a specific period of time. • Parse reviews of businesses of interest and extract keywords from them. • Retrieve check-in records to set and adjust suitable business hours.

Project Flow

YelpDB Object--Business { 'type': 'business', 'business_id': (encrypted business id), 'name': (business name), 'neighborhoods': [(hood names)], 'full_address': (localized address), 'city': (city), 'state': (state), 'latitude': latitude, 'longitude': longitude, 'stars': (star rating, rounded to half-stars), 'review_count': review count, 'categories': [(localized category names)], 'open': True / False (corresponds to closed, not business hours) }

YelpDB Object--Review { 'type': 'review', 'business_id': (encrypted business id), 'user_id': (encrypted user id), 'stars': (star rating, rounded to half-stars), 'text': (review text), 'date': (date, formatted like '2012-03-14'), 'votes': {(vote type): (count)} }

YelpDB Object--Checkin { 'type': 'checkin', 'business_id': (encrypted business id), 'checkin_info': { '0-0': (number of checkins from 00:00 to 01:00 on all Sundays), '1-0': (number of checkins from 01:00 to 02:00 on all Sundays), ... '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays), ... '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays) } # if there was no checkin for an hour-day block it will not be in the dict }
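
The 'hour-day' keys of checkin_info can be decoded to see when a business is busiest, which is the kind of signal the proposal wants to use for setting business hours. The following is a minimal sketch, assuming day 0 is Sunday as in the object description; the function name and the sample record are hypothetical.

```python
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]

def busiest_hour_per_day(checkin_info):
    """Return {day name: (hour, count)} for the busiest hour of each day.

    Keys look like '14-4': hour 14 (14:00 to 15:00) on day 4 (Thursday).
    Hour-day blocks with no check-ins are absent from the dict.
    """
    best = {}
    for key, count in checkin_info.items():
        hour_text, day_text = key.split("-")
        hour, day = int(hour_text), int(day_text)
        name = DAY_NAMES[day]
        # Keep the first block seen when two hours tie on count.
        if name not in best or count > best[name][1]:
            best[name] = (hour, count)
    return best

# Hypothetical check-in record for illustration only.
sample = {
    "type": "checkin",
    "business_id": "example-id",
    "checkin_info": {"11-3": 17, "12-3": 19, "14-4": 1, "12-0": 20, "11-0": 3},
}
result = busiest_hour_per_day(sample["checkin_info"])
print(result)
```

Days that never appear in checkin_info are simply missing from the result, so a caller should treat a missing day as "no recorded check-ins" rather than as an error.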

Challenge 1. Review text parsing: how to decide whether a word is positive or negative. 2. Relational model: how to build a well-organized relational schema over the given data types. 3. Analysis of discrete data: Since we want to give predictive information to new merchants, we need to analyze the review data collected from Yelp. These data are discrete. We can extract certain trends from them, but it will be a challenge to determine whether data points that do not follow a trend are useless to a new merchant.
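
One possible way to approach the first challenge is to let the star ratings label the words: a word that mostly appears in high-star reviews is treated as positive, and one that mostly appears in low-star reviews as negative. The sketch below is an illustration of that idea, not the team's method; the neutral point of 3 stars, the tokenizer, and the sample reviews are all assumptions.

```python
import re
from collections import defaultdict

def word_polarity(reviews, neutral=3.0):
    """Score each word as (mean stars of reviews containing it) - neutral.

    Positive scores suggest a positive word, negative scores a negative one.
    Each review counts a word once, however often it repeats in the text.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for review in reviews:
        words = set(re.findall(r"[a-z']+", review["text"].lower()))
        for word in words:
            totals[word] += review["stars"]
            counts[word] += 1
    return {word: totals[word] / counts[word] - neutral for word in totals}

# Hypothetical reviews using the 'stars' and 'text' fields of the review object.
reviews = [
    {"stars": 5, "text": "Great food, friendly staff"},
    {"stars": 4, "text": "Great coffee"},
    {"stars": 1, "text": "Rude staff and cold food"},
]
polarity = word_polarity(reviews)
print(polarity["great"], polarity["rude"], polarity["staff"])
```

Words such as "staff" that occur in both good and bad reviews end up near zero, which is a reminder that a real system would need far more data and probably a minimum-count threshold before trusting any score.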

Schedule

ARA Aquatic Research Accelerator Matt Viehdorfer Nels Oscar Wyatt Allen CS 540 - Winter 2014

Data • Ongoing annual longitudinal survey of Willamette River fish populations • Provided by the Oregon Department of Fisheries and Wildlife (ODFW), which is working with the Northwest Alliance for Computational Science & Engineering (NACSE) • Stan Gregory (ODFW) • Currently 67 designations of fish • Additional species in the future • Collected from various sites with differing characteristics • Most recent data covers a 13-year period • Over 56,000 samples to date • Includes local environmental and habitat conditions

Problem • Heterogeneous, multidimensional data • Inconsistent data formatting between years • Need a standard structure moving forward with controlled vocabularies • Currently there are limited tools for examining the data • Researchers need more useful tools for exploring and querying the data to create and validate their hypotheses

Deliverables We aim to provide: • An intuitive and flexible web interface for visually constructing complex queries against this data set • A method for querying arbitrarily defined geographic regions • Exploration by attribute • Constant visual feedback for query results to inform and refine the analytic process

Schedule

Predicting NCAA Basketball Results Vahid Ghadakchi Jose Picado Tadesse Zemicheal

Definition of the problem • Predict the 2014 NCAA Basketball Tournament results based on historical data. • March Machine Learning Madness competition on kaggle.com (http://www.kaggle.com/c/march-machine-learning-madness).

Specific challenges • Research previous work on sports result prediction. • Search for useful data. • Integrate data from different sources. • Explore and compare different learning algorithms. • Implement the prediction algorithm.

Description of data • Data provided in Kaggle competition: • Teams(id, name) • Seasons(season, years, dayzero, regionW/X/Y/Z) • RegularSeasonResults(season, daynum, wteam, wscore, lteam, lscore, ...) • TourneyResults(season, daynum, wteam, wscore, lteam, lscore, ...) • TourneySeeds(season, seed, team)
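
A simple baseline can be built directly on these tables: compute each team's regular-season win rate, pick the team with the higher rate in every tournament game, and measure how often that pick matches the recorded winner. The sketch below assumes only the listed columns (season, wteam, lteam); the inline CSV rows and team ids are made up for illustration, and the win-rate rule is a stand-in baseline rather than the learning algorithm the team proposes.

```python
import csv
import io
from collections import Counter

# Made-up rows in the column layout of RegularSeasonResults / TourneyResults.
REGULAR = """season,daynum,wteam,wscore,lteam,lscore
A,10,501,70,502,60
A,12,501,65,503,64
A,15,502,80,503,75
A,20,503,66,502,61
"""
TOURNEY = """season,daynum,wteam,wscore,lteam,lscore
A,136,501,72,503,70
A,138,502,68,501,67
"""

def win_rates(rows):
    """Return {(season, team): wins / games} from game-by-game results."""
    wins, games = Counter(), Counter()
    for row in rows:
        winner = (row["season"], row["wteam"])
        loser = (row["season"], row["lteam"])
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {team: wins[team] / games[team] for team in games}

def backtest(rates, tourney_rows):
    """Fraction of games where the higher regular-season win rate won."""
    correct = total = 0
    for row in tourney_rows:
        winner_rate = rates.get((row["season"], row["wteam"]), 0.5)
        loser_rate = rates.get((row["season"], row["lteam"]), 0.5)
        correct += winner_rate > loser_rate  # ties count as misses
        total += 1
    return correct / total if total else 0.0

rates = win_rates(csv.DictReader(io.StringIO(REGULAR)))
accuracy = backtest(rates, csv.DictReader(io.StringIO(TOURNEY)))
print(accuracy)
```

With the real files, the same functions could be fed `csv.DictReader(open(path))` for each table, and the resulting accuracy gives a floor that any learned model should beat.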

More data • Published rankings, offense and defense stats, team players. • Consider the effect of time on the value of data.

Proposed schedule

Predicting a User’s Rating of a Restaurant CS540 Database Management Systems

Team Members Rui Qin Chao Peng Jianqing Cui

The General Definition of the Problem • About Yelp. • The goal of the project is to predict the score a user will give to a restaurant. • The rating will be predicted based on two factors: • the user’s previous ratings and reviews of similar restaurants. • other customers’ ratings and reviews of the restaurant in question.

The challenges we’d like to address Natural Language Processing: • Keywords from the user’s reviews of similar restaurants • Keywords from other customers’ reviews of the restaurant • Calculating the similarity of keywords and ratings
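
One common way to compare two keyword sets is cosine similarity over keyword counts. The sketch below is only an illustration under that assumption; the proposal does not say which similarity measure will be used, and the keyword lists are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two keyword-count mappings (0.0 to 1.0)."""
    shared = set(a) & set(b)
    dot = sum(a[word] * b[word] for word in shared)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty keyword set is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical keywords: one list from a user's past reviews,
# one from other customers' reviews of the target restaurant.
user_keywords = Counter(["spicy", "noodles", "cheap", "spicy"])
restaurant_keywords = Counter(["spicy", "noodles", "crowded"])

similarity = cosine_similarity(user_keywords, restaurant_keywords)
print(round(similarity, 3))
```

A score near 1 means the user tends to write about the same things other customers mention for this restaurant; how that score is then mapped to a predicted rating is a separate modeling decision.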

Data • The data come from the Yelp Dataset Challenge: 43,873 users and 229,907 reviews in total. • We will use only part of the data, discarding users who have only a few reviews. • We will focus on reviews that carry the useful tag. • Divide the data into training and testing sets.
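
The filtering and splitting steps above can be sketched as follows. The minimum of 3 reviews per user, the 20% test fraction, and the fixed random seed are assumptions chosen for illustration; the proposal does not specify them.

```python
import random
from collections import Counter

def filter_and_split(reviews, min_reviews=3, test_fraction=0.2, seed=0):
    """Drop users with few reviews, then split the rest into train and test."""
    per_user = Counter(review["user_id"] for review in reviews)
    kept = [r for r in reviews if per_user[r["user_id"]] >= min_reviews]
    # Shuffle a copy with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(kept)
    n_test = int(len(kept) * test_fraction)
    return kept[n_test:], kept[:n_test]

# Hypothetical reviews: user u1 wrote five, user u2 wrote one.
reviews = [{"user_id": "u1", "stars": s} for s in (5, 4, 3, 4, 5)]
reviews.append({"user_id": "u2", "stars": 2})

train, test = filter_and_split(reviews)
print(len(train), len(test))
```

This splits individual reviews at random; if the goal is to predict ratings for users never seen in training, splitting by user_id instead would be the safer design.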

Proposed schedule • Week 3-4: Data processing (data filtering, creating training and testing sets). • Week 4-5: Keyword extraction from users’ reviews and restaurant ratings. • Week 5-8: Keyword quantization and similarity comparison algorithm, coding. • Week 8-9: Project testing. • Week 9-10: Report writing and presentation preparation.

ENVISIONING THE YELP DATA SET TEAM MEMBERS: Mohammad Amin Alipour <alipourm@onid.oregonstate.edu>, Sumanth Suresh Avadhani <avadhans@onid.oregonstate.edu>, Varun Sharma <sharmava@onid.oregonstate.edu>, Nitin Jogee Bella Subramanian <subraman@onid.oregonstate.edu>

PROBLEM DEFINITION How well can you guess a review's rating from its text alone? Can you take all of the reviews of a business and predict when it will be the most busy? What makes a review useful, funny, or cool? Can you figure out which business a user is likely to review next? How much of a business's success is really just location, location, location? What businesses deserve their own subcategory, and can you learn this from the review text?

DATA DESCRIPTION Business Data Review Data User Data Check-In Data

BUSINESS DATA {"business_id": String , "full_address": String, "open": true, "categories": [], "city": , "review_count":, "name": , "neighborhoods": [], "longitude":, "latitude":, "state": "AZ", "stars":, "type": "business"}

REVIEW DATA {"votes": {"funny": , "useful":, "cool": 2}, "user_id": , "review_id":, "stars":, "date": , "text": "", "type": "review", "business_id": }

USER DATA {"votes": {"funny": , "useful": , "cool": }, "user_id": , "name": , "average_stars":, "review_count":, "type": "user"}

CHECK-IN DATA {"checkin_info": {"11-3": 17, "8-5": 1, "15-0": 2, "15-3": 2, "15-5": 2, "14-4": 1, "14-5": 3, "14-6": 6, "14-0": 2, "14-1": 2, "14-3": 2, "0-5": 1, "1-6": 1, "11-5": 3, "11-4": 11, "13-1": 1, "11-6": 6, "11-1": 18, "13-6": 5, "13-5": 4, "11-2": 9, "12-6": 5, "12-4": 8, "12-5": 5, "12-2": 12, "12-3": 19, "12-0": 20, "12-1": 14, "13-3": 1, "9-5": 2, "9-4": 1, "13-2": 6, "20-1": 1, "9-6": 4, "16-3": 1, "16-1": 1, "16-5": 1, "10-0": 3, "10-1": 4, "10-2": 4, "10-3": 4, "10-4": 1, "10-5": 2, "10-6": 2, "11-0": 3}, "type": "checkin", "business_id": }

QUERIES AND CHALLENGES • Businesses in parts of the city that have more reviews, e.g. • Which businesses in Phoenix have relatively few reviews? • Density of different businesses in the city • Which parts of the city have more Yelp-reviewed accounting firms? • Visual comparison of the number of reviews for businesses • Which accounting firm has the most positive reviews? • Sentiment analysis of reviews for businesses • Which business has the funniest reviews? • Which user has the most critical reviews? • Crediting and weighting users based on comparative reviews • Map-selection-based queries • In selected area X, which ice cream shop is the best? • What are the average reviews? • Data fusion with other APIs to provide some interesting information, maybe Zillow and Yahoo Listings.
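
A map-selection query such as "in selected area X, which ice cream shop is the best" could be approximated by filtering businesses to a latitude/longitude bounding box and ranking by stars. The sketch below assumes a rectangular selection and uses field names from the business object; the function name, the tie-break on review_count, and the sample records are hypothetical.

```python
def best_in_area(businesses, category, south, north, west, east):
    """Return the highest-rated business of a category inside a bounding box.

    Returns None when no business of that category lies inside the box.
    """
    candidates = [
        b for b in businesses
        if category in b["categories"]
        and south <= b["latitude"] <= north
        and west <= b["longitude"] <= east
    ]
    # Break ties on star rating by preferring the business with more reviews.
    return max(candidates,
               key=lambda b: (b["stars"], b["review_count"]),
               default=None)

# Hypothetical records for illustration only.
businesses = [
    {"name": "Scoops", "categories": ["Ice Cream"], "latitude": 33.45,
     "longitude": -112.07, "stars": 4.5, "review_count": 40},
    {"name": "Cold Cone", "categories": ["Ice Cream"], "latitude": 33.46,
     "longitude": -112.08, "stars": 4.5, "review_count": 12},
    {"name": "Far Freeze", "categories": ["Ice Cream"], "latitude": 34.90,
     "longitude": -111.00, "stars": 5.0, "review_count": 90},
    {"name": "Ledger Co", "categories": ["Accountants"], "latitude": 33.45,
     "longitude": -112.07, "stars": 5.0, "review_count": 5},
]

best = best_in_area(businesses, "Ice Cream",
                    south=33.40, north=33.50, west=-112.10, east=-112.00)
print(best["name"])
```

A linear scan like this is fine for a sketch; for interactive map selection over the full dataset, a spatial index in the chosen backend would be the more likely design.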

SCHEDULE • Choosing the backend: Google API, Yahoo API, Zillow, ... (3 days) • Visualizing current data (2 weeks) • Mashups (1 week) • Generating meta-data (2 weeks) • Integrating results (2 weeks)

YELL! Because Yelp doesn’t help!

Team • Shravya Varakantham • Amit Bawaskar

Project Features • All the basic features of an application like Yelp • User groups • In-group privacy (sharing, suggestions, etc.) • User-specific and timely notifications • Online ordering (individual/group) • Split your bill • Timely consolidation and budget setting • MORE LIKELY TO READ AND WRITE REVIEWS • HIGHLY PERSONALIZED INTERCONNECTION.

Challenges • Interconnecting the different features using a coherent database. • Pattern analysis • Working with images in the database

Data Set • Yelp data set tables to start with: • Business • Review • User • Check-in

Schedule

Facial Keypoints Detection Shuai Xu, Hai Yu, Guanheng Liu, Fan Yang

Definition of Problem The objective of this task is to predict keypoint positions on face images. This can be used as a building block in several applications, such as: • Tracking faces in images and video • Analyzing facial expressions • Detecting dysmorphic facial signs for medical diagnosis • Biometrics / face recognition
