
Mining Recipes in Microblog



Presentation Transcript


  1. Mining Recipes in Microblog Shengyu Liu, Qingcai Chen, Shanshan Guan, Xiaolong Wang, Huimiao Shi Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen Graduate School • Program: first-year CS master's student • Student ID: 102598007 • Presenter: 林怡靜 (Lin Yi-Jing)

  2. Abstract • Microblog, as an online communication platform, is becoming more and more popular. Users generate large volumes of data every day, and this user-generated content contains a lot of useful knowledge such as practical skills and technical expertise. • This paper proposes a cross-data method to mine recipes in Microblog. • First, snippets of text relevant to recipes are extracted from Baidu Encyclopedia. • Second, the extracted snippets are used to train a domain-specific unigram language model. • Third, candidate recipes in Microblog are mined using the unigram language model. • Finally, some heuristic rules are applied to identify real recipes among the candidates. • Experimental results show the effectiveness of the proposed method.

  3. Introduction • Sina Weibo is one of the most notable Chinese microblogging services. • Users can post short messages, known as tweets, to broadcast anything within a strict limit of 140 characters. • It is reported that more than 100 million new tweets are posted every day on Sina Weibo. • The huge amount of data available on Sina Weibo makes it a valuable source for data mining and attracts a lot of attention in the research community.

  4. Existing research focuses on two aspects of Microblog: • the social network (i.e. the network formed by links between users) • the actual text of tweets. • Research on the social network concentrates on measuring user influence and the dynamics of popularity [1], analyzing the formation of communities and discovering communities in Microblog [2, 3], and the diffusion of tweets in Microblog [4]. • Research on the actual text of tweets includes classification of tweets [5], sentiment analysis [6], trending topic detection [7], etc. • We have not yet found research on discovering tweets that describe specific kinds of knowledge.

  5. Although a tweet is very short, it can still describe many kinds of useful knowledge. • Discovering specific kinds of knowledge and recommending them to interested users is meaningful work. In this paper we specifically focus on the task of mining recipes from Sina Weibo.

  6. Most tweets are ungrammatical and often contain noisy text such as abbreviations, emoticons and spelling errors. • The length limitation and the above characteristics of tweets cause great difficulties for traditional NLP tools, but they are helpful for discovering knowledge-descriptive tweets. • Because of the length limitation, most terms in a knowledge-descriptive tweet are highly relevant to that tweet's specific domain. • Therefore, terms in a domain-specific language model are good indicators of whether or not a tweet is relevant to the specific domain.

  7. Baidu Encyclopedia can be viewed as a potential training corpus. • We propose a cross-data method to automatically extract a training corpus from Baidu Encyclopedia and produce the domain-specific language model. • Recipes are then mined from Microblog using this language model.

  8. Related Work • At present, data mining tasks on the actual text of tweets mainly include automatic summarization, sentiment analysis, and event detection and tracking. • B. Sharifi [8] proposed an algorithm called Phrase Reinforcement to create summaries of posts containing a specified phrase. Its central idea is to find the most commonly used phrase that encompasses the topic phrase. • Using Twitter as a corpus, A. Pak [5] built a sentiment classifier that is able to determine positive, negative and neutral sentiment for a document.

  9. L. Barbosa [9] proposed a method to detect sentiment in tweets. The method explores characteristics of how tweets are written and meta-information of the words composing them. • A.-M. [10] focused on detecting controversial events from Twitter and used three regression machine-learning models to solve the problem. • H. Sayyadi [11] proposed an algorithm to detect and describe events using the co-occurrence of keywords in documents. • Moreover, some researchers try to predict the stock market [12] and elections [13] from Twitter data. We focus on mining one kind of knowledge-descriptive tweet, recipes, in this paper.

  10. Baidu Encyclopedia • Baidu Encyclopedia is an open-content online encyclopedia that aims at creating a Chinese encyclopedia covering knowledge of all areas. • It is written collaboratively by volunteers, and almost anyone can add a new entry or change an existing one. • Each entry in Baidu Encyclopedia is described by an article. In this paper, the directory and open categories of an article are used to extract domain-specific snippets of text from articles in Baidu Encyclopedia.

  11. A. Directory • The directory of an article contains the titles of all paragraphs in the article, and each paragraph depicts a certain characteristic of the entry. • B. Open category • Every article in Baidu Encyclopedia is labeled with some tags by the volunteers. The tags are usually categories to which the article belongs, and are called the open categories of the entry.

  12. The proposed Method

  13. A. Tweet preprocessing • Tweets shorter than 10 characters are removed, because a knowledge-descriptive tweet should not be too short. • Strings such as “@***” and URLs in the tweets are also removed. “@***” is usually the screen name of a user mentioned in the tweet.
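The preprocessing step above can be sketched as follows. This is a minimal illustration, not the authors' exact code; the regular expressions for URLs and @-mentions are assumptions about how the filtering was done.

```python
import re

def preprocess(tweets):
    """Clean raw tweets: strip URLs and @-mentions, drop very short tweets."""
    cleaned = []
    for t in tweets:
        # Remove URLs and "@***" screen names of mentioned users.
        t = re.sub(r"https?://\S+", "", t)
        t = re.sub(r"@\S+", "", t)
        t = t.strip()
        # A knowledge-descriptive tweet should not be shorter than 10 characters.
        if len(t) >= 10:
            cleaned.append(t)
    return cleaned
```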

  14. B. Information Extraction from Baidu Encyclopedia • Each paragraph of an article depicts a certain characteristic of the entry, so paragraphs with the same heading in different articles can be used to train a domain-specific language model. • The steps of constructing the corpus are as follows: • 1. Articles belonging to the open category 美食 (gourmet food) are collected; 760 articles are acquired in total. • 2. Articles whose directory contains the paragraph title 做法 (cooking method) are selected from the 760 articles.

  15. 3. Paragraphs corresponding to the paragraph title 做法 (cooking method) are extracted from the articles acquired in step 2. • The collection of all extracted paragraphs is used as a corpus to train a language model of recipes.
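The three corpus-construction steps can be sketched as a filter over crawled articles. The dictionary layout for an article (`open_categories`, `sections`) is a hypothetical representation of a crawled Baidu Encyclopedia page, not a real API.

```python
def build_recipe_corpus(articles, category="美食", section="做法"):
    """Collect 做法 (cooking method) paragraphs from articles tagged with
    the open category 美食 (gourmet food).

    Each article is assumed to look like:
      {"open_categories": [...], "sections": {paragraph_title: paragraph_text}}
    """
    corpus = []
    for art in articles:
        # Step 1: keep only articles in the gourmet-food open category.
        if category not in art.get("open_categories", []):
            continue
        # Steps 2-3: if the directory contains the target paragraph title,
        # extract that paragraph into the corpus.
        if section in art.get("sections", {}):
            corpus.append(art["sections"][section])
    return corpus
```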

  16. C. Domain-Specific Language Model • A unigram language model of recipes is trained on the automatically constructed corpus. Each sentence in the corpus is segmented into words and all stop words are removed. Then the number of times each word appears in the corpus is counted. • The probability of a word wi in the unigram language model is estimated by maximum likelihood estimation, P(wi) = c(wi) / Σj c(wj), where c(wi) is the number of times wi appears in the corpus.

  17. Moreover, for words that do not appear in the corpus, relative frequencies have to be smoothed to avoid the zero probabilities caused by data sparsity. We adopt additive smoothing to estimate the probabilities of words. The probability of wi then becomes P(wi) = (c(wi) + δ) / (Σj c(wj) + δ|V|), where δ > 0 is the smoothing constant and V is the set of all considered words in the unigram language model.
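Training the additive-smoothed unigram model can be sketched as below (assuming segmentation and stop-word removal have already been applied, so the input is lists of words):

```python
from collections import Counter

def train_unigram(corpus_tokens, vocab, delta=1.0):
    """Additive-smoothed unigram language model.

    corpus_tokens: list of word lists (segmented, stop words removed).
    vocab: the set V of all considered words.
    Returns P(w) = (c(w) + delta) / (N + delta * |V|) for each w in V,
    where N is the total word count of the corpus.
    """
    counts = Counter(w for sent in corpus_tokens for w in sent)
    total = sum(counts.values())
    denom = total + delta * len(vocab)
    return {w: (counts[w] + delta) / denom for w in vocab}
```

Unseen vocabulary words get the small but non-zero probability delta / (N + delta·|V|), which is exactly the point of the smoothing step.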

  18. D. Discovering Candidate Recipes Using the Unigram Language Model • For a tweet s, the probability of s being generated by the unigram language model of recipes is P(s) = P(w1) · P(w2) · ... · P(wn), where w1, w2, ..., wn are the words appearing in s. • Intuitively, the larger P(s) is, the more likely s is a recipe.

  19. However, there are cases where the probability of a very short tweet is larger than that of a long tweet. • Consider the following example: s1 is a tweet containing only one word, whose probability is 0.5, and s2 is a tweet containing ten words, whose probabilities are all 0.9. Then P(s1) = 0.5 while P(s2) = 0.9^10 ≈ 0.35, so P(s1) is larger than P(s2). • Obviously, this contradicts the actual situation, because knowledge-descriptive tweets are usually longer than other tweets.

  20. According to the above analysis, we define the degree of correlation between s and recipes so that it grows with both word relevance and tweet length. • P(wi) can be viewed as the degree of correlation between wi and recipes. Tweets whose score P(s) is larger than a given threshold θ are identified as candidate recipes. The threshold θ is determined experimentally, and it is set to 3 in this paper. • Finally, duplicate tweets are removed from the identified tweets.
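The exact correlation formula is not reproduced in this transcript. A plausible form consistent with the surrounding discussion (per-word correlations P(wi), a score that grows with tweet length, and a threshold θ > 1) is the sum of per-word correlation values; the sketch below assumes that form and should be read as an illustration, not the paper's definitive formula.

```python
def correlation_score(tweet_words, model):
    """Degree of correlation between a tweet and recipes.

    ASSUMPTION: the score is the SUM of per-word correlation values P(w).
    Unlike the product P(s), a sum grows with tweet length, avoiding the
    short-tweet bias illustrated on slide 19, and can exceed a threshold
    such as theta = 3.
    """
    return sum(model.get(w, 0.0) for w in tweet_words)

def is_candidate(tweet_words, model, theta=3):
    """A tweet whose score exceeds theta is kept as a candidate recipe."""
    return correlation_score(tweet_words, model) > theta
```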

  21. Experiment • A. Dataset • The proposed method is evaluated on two datasets. • Dataset 1 is obtained from cnpameng, a platform on which researchers collaboratively crawl data from Sina Weibo and share the crawled data. • Dataset 1 contains more than 65 million tweets, about 24.5 GB. All tweets in dataset 1 were posted by Sina Weibo users before August 10, 2012.

  22. Because of the large number of tweets in dataset 1, we cannot know how many recipe-describing tweets there actually are in it. Therefore, the recall of the proposed method on dataset 1 cannot be calculated. • In order to measure recall, we construct dataset 2 manually. Dataset 2 contains 1000 randomly selected tweets posted by Sina Weibo users after August 10, 2012. 500 tweets in dataset 2 are recipes, and the rest are tweets on other topics.

  23. B. Evaluation Metrics • Precision, recall and F1 measure are adopted to evaluate the performance of the proposed method: Precision = m / n, Recall = m / d, F1 = 2 · Precision · Recall / (Precision + Recall), • where m is the number of real recipes identified by the proposed method, n is the number of tweets identified as recipes by the proposed method, and d is the number of real recipes in the dataset.
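The metrics follow directly from the counts m, n and d defined on the slide:

```python
def evaluate(m, n, d):
    """Precision, recall and F1 from the slide's counts:
    m = real recipes identified by the method,
    n = tweets identified as recipes by the method,
    d = real recipes in the dataset."""
    precision = m / n
    recall = m / d
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```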

  24. C. Experimental Results • 65279 candidate tweets are mined from dataset 1 by the proposed method. • After examining the candidate tweets, we find that almost all of them are relevant to recipes, but there are still many negative examples. • Most of these negative examples are tweets that merely list items such as foods, health-care effects and places.

  25. So we filter the negative examples out of the candidate recipes with some heuristic rules, summarized as follows: • If a tweet contains several numerals and the strings between consecutive numerals are shorter than 10 characters, it is very likely to be a list of items and should be removed. The number 10 is determined experimentally. • If a tweet contains substrings that all mean “several kinds of”, it tends to list items rather than describe a cooking method and should be removed. • If a tweet contains substrings meaning “* kinds of”, where * denotes a number, it should also be removed.
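The first heuristic rule (numerals separated by short strings indicate an item list) can be sketched as:

```python
import re

def lists_items(tweet, max_gap=10):
    """Heuristic rule 1: a tweet with several numerals where each string
    between consecutive numerals is shorter than `max_gap` characters is
    probably an enumerated list of items, not a recipe."""
    # Split on runs of digits; the middle pieces are the strings
    # strictly between consecutive numerals.
    parts = re.split(r"\d+", tweet)
    gaps = parts[1:-1]
    return len(gaps) >= 2 and all(len(g) < max_gap for g in gaps)
```

A real recipe with numbered steps has long descriptive text between the step numbers, so it passes this filter; a terse enumeration of foods does not.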

  26. After filtering the negative examples, 12756 tweets are identified as recipes in dataset 1. • It is hard to count the number of real recipes among 12756 tweets, so we calculate the precision of the proposed method on dataset 1 through random sampling. • Random sampling is carried out 5 times; each time, 500 tweets are randomly selected from the 12756 tweets and manually examined, and the real recipes are picked out. Results are as follows:

  27. D. Discussion • From the performance results, we can see that the precision and recall of the proposed method are satisfactory. • This supports our hypothesis that the length limitation of tweets can be viewed as an advantage for the task of this paper: the terms users choose when posting a knowledge-descriptive tweet are often carefully selected and highly relevant to the specific field. • Heuristic rules can further refine the results, which demonstrates that the text of knowledge-descriptive tweets is structured more formally and can be used to distinguish knowledge-descriptive tweets from other tweets.

  28. Conclusion and Future Work • The large volume of data available on Sina Weibo makes it a valuable data source for knowledge discovery. In this paper we mined recipes from tweets through a cross-data method. • The proposed method constructs a corpus and trains a unigram language model automatically. It was evaluated on two datasets and achieved good performance. • For future work, we will try discovering knowledge in other fields and building a personalized knowledge recommendation system for Microblog.

  29. Thank you for listening

  30. We find that almost all the candidate tweets are relevant to recipes, but there are still many negative examples, which can be divided into seven classes: • Tweets recommending health-care foods; • Tweets listing foods that have the same health-care effect; • Tweets describing the nutritional value of a particular food in detail; • Tweets introducing knowledge about food pairings; • Tweets about what the users have eaten; • Tweets introducing special foods of various places; • Advertisements for food.
