
Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

This case study explores the use of LISp-Miner for clickstream analysis, data collection, preprocessing, and mining. It covers topics such as effective placement of online advertising, data collection on the server application layer, and data preprocessing techniques. The study also discusses advantages and disadvantages of the LISp-Miner system and presents a methodology using CRISP-DM. Key topics include UML sequence diagrams, segment procedure, merge procedure, data mining, and the use of association rules.


Presentation Transcript


  1. Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system • A case study approach • Effective placement of on-line advertising • Tomáš Kliegr, KIZI

  2. Methodology • CRISP-DM

  3. I. Data collection • Data are collected on the server application layer • No demands on the tracked website • ASP.NET must be supported

  4. UML Sequence diagram

  5. Comparison with log-file based approaches • Advantages: works with all browsers that have cookies enabled; automatic robot filtering; storage efficiency; easy to integrate and safe to operate • Disadvantages: database required; hosting must support .NET Framework

  6. II. Data preprocessing • Problem: collected clickstreams have varying lengths. • This phase creates a fixed-length visitor profile in a two-step process: • Segment procedure: classifies pages into a domain-specific taxonomy on several levels of granularity. • Merge procedure: extracts important and characteristic information from the visitor’s clickstream.

  7. Segment procedure • Classifies pages into a domain-specific taxonomy on several levels of granularity. • Assigns Time on page and Score to each page in the visitor’s clickstream. • Score expresses the absolute weight of a particular page in the user’s clickstream: S = (ln(o) + 1) · t, where o is the order of the page in the user’s clickstream and t is the time on page.
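
The Score formula on this slide can be sketched in a few lines of Python. This is only an illustration: the slide does not say whether the page order starts at 0 or 1 (1-based is assumed here, since ln(0) is undefined), the time unit is assumed to be seconds, and the function name and sample clickstream are hypothetical.

```python
import math


def page_score(order: int, time_on_page: float) -> float:
    """Score S = (ln(o) + 1) * t for the page at position `order`.

    Assumption: `order` is 1-based, so the first page gets S = t.
    """
    if order < 1:
        raise ValueError("order must be >= 1 (1-based position assumed)")
    return (math.log(order) + 1.0) * time_on_page


# Hypothetical clickstream: (page, seconds spent on the page), in visiting order.
clickstream = [("/", 10.0), ("/hiking-alps/", 40.0), ("/discounts/", 20.0)]

scores = [
    (page, page_score(position, seconds))
    for position, (page, seconds) in enumerate(clickstream, start=1)
]
```

With this definition, a page viewed later in the visit gets a logarithmically larger weight for the same time on page.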

  8. Assigning pages to categories • Inputs: prespecified taxonomy (tuples ProductID – category, tuples URL pattern – category) and visited pages (URL addresses stored in a database) • Processing: SQL Server stored procedure (SP) Segment • Output: pages classified on several levels of granularity
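
The slides implement this step as a SQL Server stored procedure. The matching of URL patterns to categories can nevertheless be sketched in Python to show the idea. Everything in the sketch is an assumption made for illustration: the regular-expression pattern syntax, the first-match-wins rule, the names `TAXONOMY`, `Categories` and `segment`, and the taxonomy entries themselves.

```python
import re
from typing import NamedTuple, Optional


class Categories(NamedTuple):
    cat: str              # general category (Cat)
    ecat: str             # extended category (ECat)
    topic: Optional[str]  # topic, if the page has one


# Hypothetical taxonomy of (URL pattern, categories) tuples.
# The first matching pattern wins (an assumption, not stated on the slide).
TAXONOMY = [
    (re.compile(r"/hiking-alps/"), Categories("Search", "Catalogue", "Alps")),
    (re.compile(r"/discounts/"), Categories("General information", "Discounts", None)),
]


def segment(url: str) -> Optional[Categories]:
    """Return the categories of the first taxonomy pattern found in `url`."""
    for pattern, categories in TAXONOMY:
        if pattern.search(url):
            return categories
    return None  # page not covered by the taxonomy
```

For example, `segment("www.poznani.cz/hiking-alps/")` returns the three category levels in one tuple, and an unmatched URL returns `None` so that uncovered pages can be reviewed by hand.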

  9. Segment – example output • Page: www.poznani.cz/hiking-alps/ • General category (Cat): Search • Extended category (ECat): Catalogue • Topic: Alps

  10. Merge procedure This procedure creates the visitor profile: • Basic attributes (6): Total time on website, Number of displayed pages, Day of week, Hour of day, Referring domain (constituted by the URL and Cat attributes). • Important points on the path (12): Entry page, Exit page, Conversion page (each with Page name, Cat, ECat and S). • Attributes conceptualizing the path (11): Range of interest, Most favourite topic (Topic, S), Search in total (S) and by type (Fulltext (S), Extended search (S), Catalogue search (S)), General information pages in total (S) and by type (Discounts (S), Insurance (S), About (S)).
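
A reduced version of the merge step, covering only a handful of the attributes listed above, might look like the sketch below. It is not the original implementation. The input record layout, the attribute names, the reading of "Range of interest" as the number of distinct topics, and the summing of scores per topic are all assumptions.

```python
import math
from collections import defaultdict


def build_profile(clicks):
    """Collapse a variable-length clickstream into a fixed-length profile.

    `clicks` is a list of dicts with keys page, cat, topic and time
    (seconds), in visiting order. This layout is hypothetical.
    """
    if not clicks:
        raise ValueError("empty clickstream")

    topic_score = defaultdict(float)
    search_total = 0.0
    for order, click in enumerate(clicks, start=1):
        # Score S = (ln(o) + 1) * t, with a 1-based order assumed.
        score = (math.log(order) + 1.0) * click["time"]
        if click["topic"] is not None:
            topic_score[click["topic"]] += score
        if click["cat"] == "Search":
            search_total += score

    favourite = max(topic_score, key=topic_score.get) if topic_score else None
    return {
        "total_time": sum(click["time"] for click in clicks),
        "pages_displayed": len(clicks),
        "entry_page": clicks[0]["page"],
        "exit_page": clicks[-1]["page"],
        "range_of_interest": len(topic_score),  # assumed: distinct topics
        "favourite_topic": favourite,
        "search_total": search_total,
    }


# Hypothetical visit with three page views.
visit = [
    {"page": "/", "cat": "Home", "topic": None, "time": 10.0},
    {"page": "/hiking-alps/", "cat": "Search", "topic": "Alps", "time": 40.0},
    {"page": "/biking/", "cat": "Search", "topic": "Biking", "time": 5.0},
]
profile = build_profile(visit)
```

Whatever the length of the visit, the resulting dictionary has the same keys, which is what makes the profiles usable as rows of a data matrix.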

  11. Merge – example output

  12. III. Data mining • Association rules are the most frequently used approach [Facci, Lanza] • LISp-Miner system: 4ft-Miner, SD4ft-Miner • Categories created in LMDataSource

  13. Sample tasks • Task 1: From which referring class of websites do most converted visitors come? • Task 2: What are the visitors’ interests in relation to the referring server? • Task 3: What is the relation between the provision of information on discounts and insurance, the entry page, and conversion?

  14. Choosing the right quantifier • Founded implication • Support: a, or a/(a+b+c+d) in relative terms • Confidence: a/(a+b) • Problem: tight dependencies are rarely found and rarely required in clickstream data • Above average quantifier: “Among objects satisfying Ant there are at least 100*p per cent more objects satisfying Suc than there are objects satisfying Suc in the whole data matrix.” (LISp-Miner Help)
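
The quantities on this slide are all computed from a four-fold contingency table, where a counts the rows satisfying both Ant and Suc, b those satisfying Ant but not Suc, c those satisfying Suc but not Ant, and d those satisfying neither. The Python sketch below shows the arithmetic only and is not LISp-Miner code. The class and method names and the sample counts are hypothetical, and the above average test is written as a/(a+b) ≥ (1+p)·(a+c)/(a+b+c+d), which is how the quoted help text reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FourFold:
    a: int  # Ant and Suc
    b: int  # Ant and not Suc
    c: int  # not Ant and Suc
    d: int  # neither Ant nor Suc

    @property
    def n(self) -> int:
        return self.a + self.b + self.c + self.d

    def support(self) -> float:
        return self.a / self.n

    def confidence(self) -> float:
        return self.a / (self.a + self.b)

    def relative_increase(self) -> float:
        """How much more frequent Suc is among Ant rows than in the whole matrix."""
        base_rate = (self.a + self.c) / self.n
        return self.confidence() / base_rate - 1.0

    def above_average(self, p: float) -> bool:
        """Above average quantifier with parameter p (e.g. 0.5 = 50 % more)."""
        return self.relative_increase() >= p


# Hypothetical counts: 100 visitors satisfy Ant, 30 of them also satisfy Suc.
table = FourFold(a=30, b=70, c=70, d=830)
```

In this made-up table the confidence is only 0.3, so a founded implication with a high confidence threshold would reject the rule, while the above average quantifier accepts it for moderate p because Suc is three times as frequent among the Ant rows as in the whole matrix.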

  15. SD4ft-Miner • Mines for patterns of the form φ ≈ ψ / (α, β, γ) • This SD4ft-pattern means that the subsets given by Boolean attributes α, β differ in the relation between Boolean attributes φ, ψ when condition γ is satisfied. • Which groups of customers α, β (e.g. depending on where they come from) differ remarkably in the probability of conversion, and under what condition? • We express “the conversion condition” by setting only the succedent (ψ) and leaving the antecedent unset.

  16. 4ft-Miner vs SD4ft-Miner • 4ft-Miner: above average quantifier • SD4ft-Miner: negative gace type for the 2nd subset • The increase in the conversion rate is more suitable for our purposes, as the 2nd set is disjoint from the 1st set. • The conversion rate for partner websites is 78% higher than the average for other referrers: Conf1/Conf2 = 0.132/0.074 = 1.784

  17. Solution to Task 1 From which referring class of websites do most converted visitors come?

  18. SD4ft – cont. • The output is sorted by the difference of the confidence values. • The first rule says: the conversion rate for visitors coming from partner websites is 13.2%, while the conversion rate for visitors coming from the company’s own websites is only 4.9%.
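
The comparison reported here can be imitated with a short Python sketch that computes the conversion rate (confidence) per referrer class and sorts pairs of classes by the difference. This is not the SD4ft-Miner procedure, only the arithmetic behind its sorted output. The visitor counts are hypothetical and chosen so that two of the rates match the 13.2% and 4.9% on the slide. The third class and all names are invented for illustration.

```python
from itertools import combinations

# Hypothetical (converted visitors, all visitors) per referrer class.
groups = {
    "partner websites": (132, 1000),
    "own websites": (49, 1000),
    "search engines": (80, 1000),
}


def compare_groups(groups):
    """Return pairs of groups sorted by difference of conversion rates."""
    rows = []
    for (name1, (conv1, n1)), (name2, (conv2, n2)) in combinations(groups.items(), 2):
        conf1 = conv1 / n1
        conf2 = conv2 / n2
        rows.append((name1, name2, conf1, conf2, abs(conf1 - conf2)))
    # Largest difference first, as in the sorted output described above.
    return sorted(rows, key=lambda row: row[4], reverse=True)


ranking = compare_groups(groups)
first_rule = ranking[0]
```

With these counts the top-ranked pair is partner websites versus own websites, mirroring the first rule on the slide.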

  19. Review • The goal of the second run of the CRISP-DM cycle is to: • improve the currently used tools • increase the quality of the current attributes • add new attributes by involving page texts • wrap feasible solutions into Ferda modules

  20. I. Data collection • Track visitors across visits • Permanent cookies • Track real actions, not only page views • Add parameters • Stronger normalization • The database can easily fill up under the current implementation

  21. II. Data preprocessing • Provide a tool for taxonomy design and matching • Match pages to taxonomies semi-manually • Based on patterns in the URL • Based on words in documents • Automatically cluster pages using information retrieval methods • Functionally – repeating content in sidebars, etc. • Semantically – use headings, title, em, strong, desc. • Assumption: commercial content is written for search engines. • Use WordNet to assign hypernyms to keywords • Negative use of WordNet could help distinguish product names
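
The idea of using headings, title, em and strong to find the characteristic words of a page can be sketched with Python's standard `html.parser` module. The slides describe this only as planned work, so the sketch is purely illustrative. The tag weights, the class name, the ASCII-only tokenizer and the rule that a word takes the weight of its heaviest enclosing tag are all assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; any tag not listed counts as 1.0.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "strong": 2.0, "em": 2.0}


class WeightedTerms(HTMLParser):
    """Accumulate a weight per word based on the enclosing HTML tags."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags
        self.weights = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.stack), default=1.0)
        # Simplification: only ASCII letters are treated as words.
        for word in re.findall(r"[a-z]+", data.lower()):
            self.weights[word] += weight


page = (
    "<html><head><title>Alps hiking</title></head>"
    "<body><p>Cheap <strong>hiking</strong> trips</p></body></html>"
)
parser = WeightedTerms()
parser.feed(page)
parser.close()
```

Afterwards `parser.weights.most_common()` lists the words of the page in descending weight, which could serve as input for clustering pages or matching them to a taxonomy.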

  22. This Boring Headline is Written for Google • New York Times: “About a year ago, The Sacramento Bee changed online section titles. ‘Real Estate’ became ‘Homes,’ ‘Scene’ turned into ‘Lifestyle,’ and dining information found in newsprint under ‘Taste,’ is online under ‘Taste/Food.’”

  23. Preprocessing cont. • First question: Are the keywords used to find the document on a search engine contained in the document? (Yes / No) • Follow-up questions: Are there more relevant pages for this keyword? (Yes / No) Does this keyword occur on some other page of the website? (Yes / No) • Possible outcomes: All is the way it should be; Possible Google Bomb / negative reputation; Possible mistake in SEO

  24. III. Data mining • Example DM task 1: Which “classes” of words are most frequently used? • Example DM task 2: Which two groups of people (e.g. googling for Africa vs. mountain biking) differ remarkably, and under what condition (did they buy something?), in the relation between the number of visited pages and the number of visited topics?

  25. Conclusion • To do: • Utilize (Euro)WordNet • Assign different weights based on HTML tags • Test the feasibility of query/document co-occurrences (sample DM tasks) • If it works: • Include or write a spider • Write a taxonomy editor/miner • Wrap it all as Ferda modules
