Processing and Analyzing Large log from Search Engine

Meng Dou, 13/9/2012

Presentation Transcript


  1. Processing and Analyzing Large log from Search Engine. Meng Dou, 13/9/2012

  2. Big data: opportunities and challenges
     Web-browsing data, social network communications, sensor data -> behavior data. Google and Facebook, for example, are Big Data companies.
     Challenges:
     • Big data processing
     • Extracting useful information that reflects user behavior from massive logs
     • Instance data management
     • Data analysis
     Opportunities: behavior data (like web logs) can be used for improving and supporting business processes, through data mining, process mining and so on.

  3. [Architecture diagram] Labels in the figure:
     • Analytic applications: BI/Reporting, Data Mining, Machine Learning, Process Mining
     • Big Data processing: cloud computing (Map/Reduce framework)
     • Big Data access: Hive, NoSQL
     • Instance data: key-value database (HBase, Cassandra, MongoDB)
     • Raw data (unstructured data): Distributed File System (HDFS), cloud storage

  4. Case study: Search Engine Company
     • News, Page, Image, Maps, Music, navigation
     • Dataset:
       • 66 million clicks in one month, 2.2 million clicks per day
       • -> behavior generated in 10 minutes
     • User behavior:
       • Visiting path (Referer)
       • Search result effectiveness
       • Ads clicking behavior
       • Source and destination of user visits
       • Robot behavior recognition and analysis
       • Visiting page layout
       • Behavior comparison and product improvement
       • User grouping and recommendation

  5. Data features
     • Contains massive information in a well-recorded format
     • Large scale with high growth potential
     • Real-time analysis

  6. Existing tools
     • Data extraction: XESame, ProM Import
     • Process mining: ProM
     • With a large data set, analysis is slow and in most situations the tool crashes
     • Offline analysis -> real-time analysis
     • Extracting data from the cloud: cloud storage / non-relational DB, instance data (XES)
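As a rough illustration of what "instance data (XES)" looks like, here is a minimal sketch that serializes traces to XES-style XML with Python's standard library. The function name `traces_to_xes` and the input layout are hypothetical, and a complete XES file would also declare the extensions it uses (concept, time), which this sketch leaves out.

```python
import xml.etree.ElementTree as ET


def traces_to_xes(traces):
    """Serialize {case_id: [(timestamp, activity), ...]} to an XES-style string.

    Hypothetical helper for illustration; not the output of XESame or ProM Import.
    """
    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, events in traces.items():
        trace = ET.SubElement(log, "trace")
        # The case identifier is stored as the trace's concept:name attribute.
        ET.SubElement(trace, "string", {"key": "concept:name", "value": case_id})
        for timestamp, activity in events:
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string", {"key": "concept:name", "value": activity})
            ET.SubElement(event, "date", {"key": "time:timestamp", "value": timestamp})
    return ET.tostring(log, encoding="unicode")


# Made-up example data: one case with two events.
example = {
    "visitor-1": [
        ("2012-09-01T10:00:01+00:00", "main"),
        ("2012-09-01T10:00:05+00:00", "search"),
    ]
}
xes_text = traces_to_xes(example)
```

Each trace corresponds to one case and each event carries an activity name and a timestamp, which is the minimum a process mining tool needs to order events within a case.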

  7. System structure: log processing; extracting useful information that reflects user behavior from massive logs; understandable model

  8. Convert raw log to instance data (event log) with Map/Reduce
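The conversion step can be sketched in a single process as follows. This is only an illustration under stated assumptions: the tab-separated log layout (timestamp, IP, user agent, URL), the sample lines, and the function names are all hypothetical, and a real job would run the map and reduce steps on a Map/Reduce framework rather than in one Python process.

```python
from collections import defaultdict

# Hypothetical raw log lines: timestamp <TAB> ip <TAB> user_agent <TAB> url.
# The deck does not show the real log schema; this layout is assumed.
RAW_LOG = [
    "2012-09-01T10:00:05\t1.2.3.4\tFirefox\t/search?q=news",
    "2012-09-01T10:00:01\t1.2.3.4\tFirefox\t/",
    "2012-09-01T10:00:03\t5.6.7.8\tChrome\t/maps",
]


def map_line(line):
    """Map step: emit (case_id, (timestamp, activity)) for one log line."""
    timestamp, ip, user_agent, url = line.split("\t")
    case_id = f"{ip}|{user_agent}"      # visitor identified by IP + UA
    activity = url.split("?", 1)[0]     # drop the query string
    return case_id, (timestamp, activity)


def reduce_case(case_id, events):
    """Reduce step: sort one case's events by timestamp to form a trace."""
    return case_id, [activity for _, activity in sorted(events)]


def run_map_reduce(lines):
    """Single-process stand-in for the shuffle between map and reduce."""
    grouped = defaultdict(list)
    for line in lines:
        key, value = map_line(line)
        grouped[key].append(value)
    return dict(reduce_case(key, values) for key, values in grouped.items())


event_log = run_map_reduce(RAW_LOG)
```

Keying the map output by visitor lets the framework bring all events of one case to the same reducer, where they can be ordered by time independently of other cases.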

  9. CPU: Intel Xeon 2.40 GHz • RAM: 2 GB • 14 nodes

  10. Process Discovery
     • One instance/case is defined as a single visit by one visitor.
       • IP+UA
       • CookieID
     • Activity varies based on different requirements
     • Alpha miner
     • Heuristic miner
     • Fuzzy miner
     • Sequence model
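Both the Alpha miner and the Heuristic miner start from the directly-follows relation between activities within a case. The sketch below counts that relation over a few made-up traces; the trace data, the case identifiers, and the function name `directly_follows` are hypothetical and are not taken from the deck's data set.

```python
from collections import Counter

# Hypothetical traces: one ordered list of activities per case (IP + UA).
traces = {
    "1.2.3.4|Firefox": ["main", "search", "result"],
    "5.6.7.8|Chrome": ["main", "search", "search", "result"],
    "9.9.9.9|Safari": ["main", "maps"],
}


def directly_follows(traces):
    """Count how often activity b immediately follows activity a in any case."""
    counts = Counter()
    for activities in traces.values():
        for a, b in zip(activities, activities[1:]):
            counts[(a, b)] += 1
    return counts


dfg = directly_follows(traces)
```

A discovery algorithm then interprets these counts, for example by thresholding on frequency, to decide which arcs appear in the discovered model.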

  11. Behavior analysis

  12. Behavior analysis

  13. Active visitor’s visiting path

  14. Behavior analysis

  15. Main page

  16. Sequence model

  18. XES statistics

  19. Conclusion
     • It is a good project for getting into the data analysis field, combining web data analysis, process mining and cloud computing technology.
     • Future work:
       • 1 More algorithms and technologies should be applied to this data set.
       • 2 Behavior comparison and user recommendation still need to be accomplished.
       • 3 Can process mining analyze behavior that does not follow a fixed pattern?
       • 1 Log sampling
       • 2 Detect incorrect entries in logs before applying analysis technologies to them.
       • 3 Extend ProM or XESame with a function for converting data from a key-value database or cloud storage to an event log.

  20. Feedback
     • 1 What are the real questions?
     • 2 Why process mining?

  21. Thank you! Meng Dou, 13/9/2012
