Explore the use of sentimental learning and geo-ranking in local search optimization. Learn how sentiment analysis can enhance rankings and how location-based metrics impact search results. Discover the benefits of dimensionality reduction and supervised learning in sentiment analysis.
Deep Web Mining and Learning for Advanced Local Search • CS8803 • Advisor: Prof. Liu • Yu Liu, Dan Hou, Zhigang Hua, Xin Sun, Yanbing Yu
Competitors • Yahoo! Local • Yelp • CitySearch • Google Local • Yellow Pages • How can we beat them?
Research Background • Deep Web Crawling • Sentimental Learning • Sentimental Ranking Model • Geo-credit Ranking Model • Social Network for Businesses
Architecture • Query-based Crawler • Sentimental Learner • Super Local-Search • HTML Parser • Apache Server • JDBC • Database
Tools • Open-source social network platforms: Elgg, OpenSocial • LAMP server: Linux + Apache + MySQL + PHP • Google Maps API, e.g., Geocoding
Sentimental Learning • Can we use ONE score to show how good or bad a store is?
Sentimental Learning • Objective • To identify positive and negative opinions of a store • Dataset • Reviews represented as bags of terms • Normalized TF-IDF features • Two ways of representing sentiment • Simply averaging the scores • but "what you think is good might be bad for me" • Manual labeling • 1 to 5 ("least satisfied" to "most satisfied") • consensus-based • time-accuracy tradeoff
Dimension Reduction • High dimensionality • 6857 tokens • Memory limitation • Risk of overfitting with few training samples • Dimension Reduction • PCA (Principal Component Analysis) • an orthogonal linear transformation • transforms the data to a new coordinate system • retains the characteristics of the data set that contribute most to its variance • Keeps the most important features without losing generality
Principal Component Analysis • Original dimension: 6857 • Variance retained: 95% • Different granularity (results shown as charts): • Manual labeling • Score averaging
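The PCA step on this slide can be sketched with plain NumPy via the SVD of the centered matrix, keeping enough components to retain 95% of the variance as stated. The matrix here is random stand-in data, much smaller than the real 6857-term matrix.

```python
# PCA via SVD: project the review-term matrix onto the top
# principal components that retain 95% of the variance, as on
# this slide. X is synthetic stand-in data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((300, 200))          # stand-in for the ~300 x 6857 TF-IDF matrix

Xc = X - X.mean(axis=0)             # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
var_ratio = S**2 / (S**2).sum()     # variance explained per component
k = int(np.searchsorted(np.cumsum(var_ratio), 0.95)) + 1

X_reduced = Xc @ Vt[:k].T           # entities in the reduced k-dim space
print(X_reduced.shape)              # (300, k) with k < 200
```

On the real 6857-dimensional data, this is what makes the SVM training on the following slides tractable in memory.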
Sentimental learning • Features used for sentimental learning: • Vector Space Model (reviews/comments) • Some keywords are related to sentiment: • Positive: good, happy, wonderful, excellent, awesome, great, ok, nice, etc. • Negative: bad, sad, ugly, outdated, shabby, stupid, wrong, awful, etc. • Most words are unrelated to sentiment: • e.g., buy, take, go, iPod, apple, comment • These cause noise for sentimental learning!
What do we do? • How do we learn sentiments from a large, noisy feature set? • Vector Space Model: an M×N entity-term matrix (e.g., 6,000×20,000) • Dimensionality reduction (PCA) • Supervised learning for sentiment • Human labeling vs. average rating • An online entity typically has many reviews, each with its own rating • The average rating is an alternative label for the entity • Manual labeling: • 1 (least satisfactory) to 5 (most satisfactory) • Three people label each entity; the majority vote is adopted
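The bag-of-terms representation above can be sketched as a minimal TF-IDF vectorizer with L2 normalization, matching the "normalized TF-IDF features" on the earlier slide. The review texts here are invented examples, not data from the project.

```python
# Minimal normalized TF-IDF bag-of-terms sketch for an entity-term
# matrix; the three reviews below are synthetic illustrations.
import math
from collections import Counter

reviews = ["good food great service",
           "bad service awful food",
           "great place nice staff"]

docs = [r.split() for r in reviews]
vocab = sorted({t for d in docs for t in d})
df = Counter(t for d in docs for t in set(d))   # document frequency
N = len(docs)

def tfidf(doc):
    tf = Counter(doc)
    vec = [tf[t] / len(doc) * math.log(N / df[t]) for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]              # L2-normalized, as on the slide

matrix = [tfidf(d) for d in docs]
print(len(matrix), len(vocab))                  # 3 documents, |vocab| terms
```

In the real system the matrix would be far larger (6,000×20,000 on the slide above), which is exactly why the PCA step follows.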
Manual labeling vs. Average rating • Machine learning • Around 300 entities from local search; 6800 features after stop-word removal and stemming • Tried different SVM kernels • Avoiding overfitting • Leave-one-out estimation • Nonlinearity of features • The polynomial kernel achieves the best performance • Manual labeling • Training is more precise • Labels are more consistent • Rating averaging • Training is less precise • Ratings are more random • E.g., average(5, 5, 1) = 3.67, masking the disagreement
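The leave-one-out estimation mentioned above can be sketched as follows. To keep the snippet dependency-free, a 1-nearest-neighbour classifier stands in for the polynomial-kernel SVM used in the project; the two tiny clusters are synthetic.

```python
# Leave-one-out estimation: hold out each sample in turn, train on
# the rest, and test on the held-out sample. A 1-NN classifier is a
# stand-in for the polynomial-kernel SVM; data is synthetic.
def leave_one_out_accuracy(X, y):
    correct = 0
    for i in range(len(X)):
        train = [(x, lab) for j, (x, lab) in enumerate(zip(X, y)) if j != i]
        # predict the label of the nearest remaining training point
        nearest = min(train, key=lambda p: sum((a - b) ** 2
                                               for a, b in zip(p[0], X[i])))
        correct += nearest[1] == y[i]
    return correct / len(X)

# Two tiny clusters: low-rated entities near (0,0), high-rated near (5,5)
X = [(0.1, 0.2), (0.0, 0.4), (0.3, 0.1), (5.1, 4.9), (4.8, 5.2), (5.0, 5.0)]
y = [1, 1, 1, 5, 5, 5]
print(leave_one_out_accuracy(X, y))   # 1.0 on this separable toy set
```

With only ~300 labeled entities, leave-one-out makes the most of the data: every sample is used for testing exactly once without ever appearing in its own training fold.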
What did we learn? • Dimensionality reduction is necessary • The term Vector Space Model (VSM) is inherently huge • Human labeling is necessary • Sentimental learning involves subjective judgment, not objective judgment • Raw ratings are noisy because they are not consistent across people • More labeled data is needed • Other methods to try: • Unsupervised learning (clustering) • Gaussian Mixture Model (an alternative for learning sentiments, though the number of hidden sentiments is hard to know in advance)
How to use learned sentiments? • Sentimental learning can improve the ranking of local search • The sentimental value is an important metric for ranking an entity • Local search is influenced by sentiment • Sentimental ranking model (SRM): • SentiRank = a*ContentSim + (1-a)*SentiValue • The parameter a is set empirically to 0.5 • Similar in spirit to combining content similarity with PageRank: • Score = b*ContentSim + (1-b)*PageImportance
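The SentiRank combination above can be sketched directly; the store names, similarity scores, and sentiment values below are invented, and both inputs are assumed to be pre-normalized to [0, 1].

```python
# SentiRank = a*ContentSim + (1-a)*SentiValue, with a = 0.5 as on
# the slide. Scores below are synthetic illustrations.
def senti_rank(content_sim, senti_value, a=0.5):
    return a * content_sim + (1 - a) * senti_value

stores = {
    "store_a": (0.9, 0.3),   # very relevant, but poorly reviewed
    "store_b": (0.7, 0.8),   # slightly less relevant, well reviewed
}
ranked = sorted(stores, key=lambda s: senti_rank(*stores[s]), reverse=True)
print(ranked)   # store_b (0.75) outranks store_a (0.6)
```

The effect is exactly what the slide describes: a well-reviewed store can outrank a slightly more relevant but poorly reviewed one.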
Geocoding • Geocoding of addresses • For example, the geo-center of store AA National Auto Parts is located at 3410 Washington St, Phoenix, AZ 85009 • Using geocoding, we can get the exact latitude and longitude: (33.447708, -112.13246) • Great-circle distance (spherical law of cosines; 3959 = Earth's radius in miles): • Distance between two pairs of coordinates on a sphere = (3959 * acos( cos( radians(33.448) ) * cos( radians( lat ) ) * cos( radians( lng ) - radians(-112.132) ) + sin( radians(33.448) ) * sin( radians( lat ) ) ))
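The distance formula above can be sketched as a small function, using the store's coordinates from this slide as the fixed point. The `min(1.0, …)` guard is an added detail to keep floating-point round-off from pushing the argument outside `acos`'s domain.

```python
# Great-circle distance (spherical law of cosines) in miles,
# matching the formula on this slide; 3959 is the Earth's radius.
import math

def great_circle_miles(lat1, lng1, lat2, lng2):
    lat1, lng1, lat2, lng2 = map(math.radians, (lat1, lng1, lat2, lng2))
    cos_arg = (math.cos(lat1) * math.cos(lat2) * math.cos(lng2 - lng1)
               + math.sin(lat1) * math.sin(lat2))
    return 3959 * math.acos(min(1.0, cos_arg))  # guard against round-off

store = (33.447708, -112.13246)   # AA National Auto Parts, Phoenix
print(great_circle_miles(*store, *store))   # 0.0: the store to itself
```

In the running system this expression would typically be evaluated inside a SQL query, so every store's distance to the query location can be computed and sorted in one pass.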
Geo-Sentimental Ranking Model (GSRM) • Three measurements • Content similarity -- term frequency • Sentimental value -- sentimental learning • Geo-distance -- Google Maps API • The GSRM ranking model combines all three
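The slide names GSRM's three measurements but not how they are weighted, so the combination below is a hypothetical sketch: the weights `w1`..`w3` and the `max_dist` cutoff are assumptions, and distance is converted into a [0, 1] proximity score so all three terms share a scale.

```python
# Hypothetical GSRM combination of the three measurements on this
# slide. Weights w1..w3 and the 25-mile cutoff are assumptions,
# not values from the project.
def gsrm_score(content_sim, senti_value, geo_dist_miles,
               w1=0.4, w2=0.4, w3=0.2, max_dist=25.0):
    # Map distance to a proximity score in [0, 1]: 1 at the query
    # location, 0 at or beyond max_dist miles away.
    proximity = max(0.0, 1.0 - geo_dist_miles / max_dist)
    return w1 * content_sim + w2 * senti_value + w3 * proximity

print(round(gsrm_score(0.8, 0.9, 5.0), 3))   # 0.84
```

With weights summing to 1 and all three inputs in [0, 1], the final score stays in [0, 1] as well, which keeps GSRM scores comparable across queries.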
Thank You! • Q&A time