This study explores the methods of processing and analyzing massive log data from web browsing and social network interactions. Using tools like Map/Reduce and databases such as Cassandra and HBase, significant user behavior insights can be extracted from large datasets. The research focuses on real-time analysis of user actions, the effectiveness of search results, and behavior patterns derived from 66 million clicks in one month. The strategic application of data mining and machine learning is discussed to enhance business processes and user recommendations.
Processing and Analyzing Large Logs from a Search Engine
Meng Dou, 13/9/2012
Big data
Web-browsing data, social network communications, sensor data -> behavior data
Google and Facebook, for example, are Big Data companies.
Opportunities
• Behavior data (like web logs) can be used for improving and supporting business processes.
• Data mining, process mining and so on
Challenges
• Big data processing
• Extracting useful information that reflects user behavior from massive logs
• Instance data management
• Data analysis
[Figure: big data stack]
• Analytic applications: BI/Reporting, Data Mining, Machine Learning, Process Mining
• Big Data access: Hive, NoSQL, Cassandra
• Big Data processing: cloud computing (Map/Reduce framework), instance data
• Cloud storage: Distributed File System (HDFS), key-value databases (HBase, Cassandra, MongoDB)
• Raw data: unstructured data
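The Map/Reduce layer in the stack can be illustrated with a minimal single-process sketch. This is not the job used in the study; the record field `product` and the function names are hypothetical, and a real deployment would run the same map and reduce steps on a distributed framework over HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_click(record):
    """Map step: emit one (key, 1) pair per click record.

    The 'product' field is a hypothetical name for the clicked service.
    """
    yield (record["product"], 1)


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum all values emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in one process (illustration only)."""
    mapped = chain.from_iterable(map_click(r) for r in records)
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


# Made-up sample records, not taken from the dataset.
clicks = [{"product": "News"}, {"product": "Maps"}, {"product": "News"}]
print(run_job(clicks))
```

Because each map call looks at one record and each reduce call looks at one key, both steps can be spread over many nodes without changing the logic.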
Case study: Search Engine Company
• News, Page, Image, Maps, Music, Navigation
• Dataset:
  • 66 million clicks in one month, 2.2 million clicks per day
  • -> behavior generated in 10 minutes
• User behavior:
  • Visiting path (Referer)
  • Effectiveness of search results
  • Ad clicking behavior
  • Source and destination of user visits
  • Robot behavior recognition and analysis
  • Visiting page layout
  • Behavior comparison and product improvement
  • User grouping and recommendation
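As a rough illustration of the "source and destination" behavior listed above, the sketch below derives a (source, destination) pair from each click's Referer and requested URL and counts the pairs. The field names `referer` and `url`, the `(direct)` label and the example hostnames are assumptions, not the log schema used in the study.

```python
from collections import Counter
from urllib.parse import urlparse


def source_and_destination(click):
    """Return (source host, destination host) for one click record.

    An empty Referer is treated as a direct visit (an assumption).
    """
    source = urlparse(click["referer"]).netloc or "(direct)"
    destination = urlparse(click["url"]).netloc
    return source, destination


def transition_counts(clicks):
    """Count how often each source -> destination pair occurs."""
    return Counter(source_and_destination(c) for c in clicks)


# Made-up sample clicks with placeholder hostnames.
clicks = [
    {"referer": "", "url": "http://search.example.com/news"},
    {"referer": "http://search.example.com/news", "url": "http://maps.example.com/"},
    {"referer": "http://search.example.com/", "url": "http://maps.example.com/route"},
]
for (source, destination), count in transition_counts(clicks).items():
    print(source, "->", destination, count)
```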
Data features
• Contains massive information in a well-recorded format
• Large scale with high growth potential
• Real-time analysis
Existing tools
• Data extraction: XESame, ProM Import
• Process mining: ProM
• Because the data set is large, analysis is slow and in most cases the tools crash
• Offline analysis -> real-time analysis
Extracting data from cloud: Cloud Storage / non-relational DB -> Instance data (XES)
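The step from key-value or cloud storage to XES instance data can be sketched as follows. This is an illustrative converter, not the XESame or ProM Import implementation: the input row layout `(case_id, activity, timestamp)` and the function name `rows_to_xes` are assumptions. Only the standard XES attribute keys `concept:name` and `time:timestamp` are used, and a full log for ProM may also need extension declarations that are left out here.

```python
import xml.etree.ElementTree as ET


def rows_to_xes(rows):
    """Convert (case_id, activity, iso_timestamp) rows into a minimal XES string.

    Rows may arrive in any order; they are sorted by case and timestamp.
    Timestamps are assumed to be ISO 8601 strings in one uniform format,
    so that string order equals time order.
    """
    log = ET.Element("log", {"xes.version": "1.0"})
    traces = {}
    for case_id, activity, timestamp in sorted(rows, key=lambda r: (r[0], r[2])):
        trace = traces.get(case_id)
        if trace is None:
            # First event of this case: open a new trace named after the case.
            trace = ET.SubElement(log, "trace")
            ET.SubElement(trace, "string", {"key": "concept:name", "value": case_id})
            traces[case_id] = trace
        event = ET.SubElement(trace, "event")
        ET.SubElement(event, "string", {"key": "concept:name", "value": activity})
        ET.SubElement(event, "date", {"key": "time:timestamp", "value": timestamp})
    return ET.tostring(log, encoding="unicode")


# Made-up sample rows, as they might come out of a key-value store scan.
rows = [
    ("case-2", "Search", "2012-09-01T10:00:00"),
    ("case-1", "News", "2012-09-01T09:01:00"),
    ("case-1", "Search", "2012-09-01T09:00:00"),
]
print(rows_to_xes(rows))
```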
System Structure
• Log processing: extracting useful information that reflects user behavior from massive logs
• Understandable model
• CPU: Intel Xeon 2.40 GHz
• RAM: 2 GB
• 14 nodes
Process Discovery
• One instance/case is defined as a single visit by one visitor.
  • IP+UA
  • CookieID
• Activity varies based on different requirements
• Alpha miner
• Heuristic miner
• Fuzzy miner
• Sequence model
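The case definition above can be sketched in a few lines: events are grouped per visitor (identified by IP+UA or CookieID) and split into separate visits. The 30-minute inactivity gap used to separate visits is an assumption, since the slides do not say how one visit is delimited, and the event layout `(visitor_id, unix_time, activity)` is hypothetical. The second function counts the directly-follows relation, which is the input that the Alpha miner and the Heuristic miner start from.

```python
from collections import Counter, defaultdict

SESSION_GAP = 30 * 60  # seconds; assumed cut-off between two visits


def build_cases(events):
    """Group (visitor_id, unix_time, activity) events into one trace per visit."""
    by_visitor = defaultdict(list)
    for visitor, ts, activity in events:
        by_visitor[visitor].append((ts, activity))

    cases = []
    for visitor in sorted(by_visitor):
        last_ts = None
        for ts, activity in sorted(by_visitor[visitor]):
            # Start a new case on the first event or after a long pause.
            if last_ts is None or ts - last_ts > SESSION_GAP:
                cases.append([])
            cases[-1].append(activity)
            last_ts = ts
    return cases


def directly_follows(cases):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for trace in cases:
        counts.update(zip(trace, trace[1:]))
    return counts


# Made-up sample events; visitor ids stand for IP+UA or CookieID values.
events = [
    ("ip1|ua1", 0, "Search"),
    ("ip1|ua1", 60, "News"),
    ("ip1|ua1", 4000, "Search"),
    ("cookie42", 10, "Search"),
    ("cookie42", 20, "Maps"),
]
cases = build_cases(events)
print(cases)
print(directly_follows(cases))
```

Changing what counts as an activity (for example product name versus page type) only changes the third field of each event, which matches the point that activity varies with the requirement.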
Conclusion
• This project is a good entry point into the data analysis field, combining web data analysis, process mining and cloud computing technology.
• Future work:
  • 1 More algorithms and technologies should be applied to this data set.
  • 2 Behavior comparison and user recommendation still need to be accomplished.
  • 3 Can process mining analyze behavior that does not follow a fixed pattern?
  • 1 Log sampling
  • 2 Detect errors in the logs before applying analysis techniques to them.
  • 3 Extend the function "converting data from key-value database or cloud storage to event log" in ProM or XESame.
Feedback
• 1 What are the real questions?
• 2 Why process mining?
Thank you!
Meng Dou, 13/9/2012