
Personalisation

Personalisation. Seminar on Unlocking the Secrets of the Past: Text Mining for Historical Documents. Sven Steudter. Domain: museums offer a vast amount of information, but visitors' receptivity and time are limited. Challenge: selecting (subjectively) interesting exhibits.


Presentation Transcript


  1. Personalisation Seminar on Unlocking the Secrets of the Past: Text Mining for Historical Documents Sven Steudter

  2. Domain • Museums offer a vast amount of information • But: visitors' receptivity and time are limited • Challenge: selecting (subjectively) interesting exhibits • Idea of a mobile, electronic handheld, like a PDA, assisting the visitor by: • Delivering content based on observations of the visit • Recommending exhibits • Non-intrusive, adaptive user-modelling technologies used

  3. Prediction stimuli • Different stimuli: • Physical proximity of exhibits • Conceptual similarity (based on textual description of the exhibit) • Relative sequence in which other visitors visited exhibits (popularity) • Evaluate the relative impact of the different factors => separate stimuli • Language-based models simulate the visitor's thought process

  4. Experimental Setup • Melbourne Museum, Australia • Largest museum in the Southern Hemisphere • Restricted to the Australia Gallery collection, which presents the history of the city of Melbourne: • Phar Lap • CSIRAC • Variation of exhibits: cannot be classified into a single category

  5. Experimental Setup • Wide range of modalities: • Information plaques • Audio-visual enhancement • Multiple displays interacting with the visitor • Here: NO differentiation between exhibit types or modalities • The Australia Gallery collection consists of 53 exhibits • Topology of the floor: open-plan design => no sequence predetermined by the architecture

  6. Resources • Floor plan of the exhibition, located on the 2nd floor • Physical distances between the exhibits • The Melbourne Museum website provides a corresponding web page for every exhibit • Dataset of 60 visitor paths through the gallery, used for: • Training (machine learning) • Evaluation

  7. Predictions based on Proximity and Popularity • Proximity-based predictions: • Exhibits ranked in order of physical distance • Prediction: closest not-yet-visited exhibit to the visitor's current location • In evaluation: baseline • Popularity-based predictions: • Visitor paths provided by the Melbourne Museum • Convert paths into a matrix of transition probabilities • Zero probabilities removed with Laplacian smoothing • Markov Model
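The popularity model above (transition counts over visitor paths, smoothed so no transition has zero probability) can be sketched as follows; the function names and the add-one smoothing constant are illustrative, not taken from the slides:

```python
from collections import defaultdict

def transition_probs(paths, exhibits, alpha=1.0):
    """First-order Markov transition probabilities between exhibits,
    estimated from visitor paths with Laplace (add-alpha) smoothing
    so that unseen transitions keep a small non-zero probability."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    probs = {}
    for a in exhibits:
        total = sum(counts[a].values()) + alpha * len(exhibits)
        probs[a] = {b: (counts[a][b] + alpha) / total for b in exhibits}
    return probs
```

The most popular next exhibit from exhibit `a` is then simply `max(probs[a], key=probs[a].get)`.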

  8. Text-based Prediction • Exhibits related to each other by information content • Every exhibit's web page consists of: • A body of text describing the exhibit • A set of attribute keywords • Prediction of the most similar exhibit: • Keywords as queries • Web pages as document space • Simple term frequency–inverse document frequency, tf-idf • Score of each query over each document normalised
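A minimal sketch of this scoring step: keyword queries ranked against each exhibit's page text with plain tf-idf weights. The tokenised toy documents and function name are assumptions for illustration, not from the slides:

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """Score a keyword query against each document using
    tf-idf: term frequency x inverse document frequency.
    docs: dict mapping exhibit name -> list of tokens."""
    n_docs = len(docs)
    df = Counter()                      # document frequency per term
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        scores[name] = sum((tf[t] / len(tokens)) * math.log(n_docs / df[t])
                           for t in query_terms if t in tf)
    return scores
```

In the setup described, the scores would additionally be normalised per query before comparing across exhibits.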

  9. WSD • Why do visitors make connections between exhibits? • Multiple similarities between exhibits are possible • Use of Word Sense Disambiguation: • Path of the visitor as a sentence of exhibits • Each exhibit in the sentence has an associated meaning • Determine the meaning of the next exhibit • For each word in the keyword set of each exhibit: • WordNet similarity is calculated against each other word in the other exhibits

  10. WordNet Similarity • Similarity methods used: • Lin (measures the difference in information content of two terms as a function of their probability of occurrence in a corpus) • Leacock-Chodorow (edge-counting: a function of the length of the path linking the terms and the position of the terms in the taxonomy) • Banerjee-Pedersen (Lesk algorithm) • Exhibit similarity as the sum of WordNet similarities between each keyword • The visitor's history may be important for prediction • The latest visited exhibits have a higher impact on the visitor than the first visited exhibits
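The keyword-pair sum, plus one possible recency weighting of the visitor's history, can be sketched as below. `sim` stands in for any word-pair measure (Lin, Leacock-Chodorow, Lesk, e.g. via NLTK's WordNet interface); the exponential decay is an assumption here, since the slides only state that later exhibits weigh more:

```python
def exhibit_similarity(keywords_a, keywords_b, sim):
    """Similarity between two exhibits: sum of pairwise WordNet
    similarities between their keyword sets (sim is any word-pair
    similarity function)."""
    return sum(sim(a, b) for a in keywords_a for b in keywords_b)

def history_score(history, candidate, keywords, sim, decay=0.5):
    """Recency-weighted score of a candidate exhibit against the
    visitor's path: the last exhibit gets weight 1, earlier ones
    are down-weighted geometrically (decay is a free parameter)."""
    n = len(history)
    return sum(decay ** (n - 1 - i) *
               exhibit_similarity(keywords[h], keywords[candidate], sim)
               for i, h in enumerate(history))
```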

  11. Evaluation: Method • For each method, two tests: • Predict the next exhibit in the visitor's path • Restrict predictions: only predict if the prediction is over a threshold • Evaluation data: the aforementioned 60 visitor paths • 60-fold cross-validation used; for Popularity: • 59 visitor paths as training data • The 1 remaining path used for evaluation • Repeat this for all 60 paths • Combine the results into a single estimate (e.g. average)
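The 60-fold (i.e. leave-one-out) procedure described above can be written generically; `train_fn` and `eval_fn` are placeholders for any of the prediction methods and metrics in these slides:

```python
def leave_one_out(paths, train_fn, eval_fn):
    """Leave-one-out cross-validation: each fold trains on all
    paths but one, evaluates on the held-out path, and the fold
    results are combined by averaging."""
    results = []
    for i in range(len(paths)):
        train = paths[:i] + paths[i + 1:]   # 59 of 60 paths
        model = train_fn(train)
        results.append(eval_fn(model, paths[i]))
    return sum(results) / len(results)
```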

  12. Evaluation • Accuracy: percentage of times the event that occurred was predicted with the highest probability • BOE (Bag of Exhibits): percentage of exhibits visited by the visitor, not necessarily in the order of recommendation • BOE is, in this case, identical to precision • Single exhibit history

  13. Evaluation • Single exhibit history: results without threshold and with threshold (results chart)

  14. Evaluation • Visitor's-history-enhanced vs. single exhibit history (results chart)

  15. Conclusion • Best performing method: popularity-based prediction • History-enhanced models were low performers; possible reason: • Visitors had no preconceived task in mind • Moving from one impressive exhibit to the next • History not relevant here; the current location is more important • Keep in mind: • Small dataset • The Melbourne Gallery (history of the city) was perhaps not a good choice

  16. BACKUP

  17. tf-idf • Term frequency – inverse document frequency • Term count = number of times a given term appears in a document • Number n of term t_i in document d_j • In larger documents a term is more likely to occur, therefore normalise • Inverse document frequency, idf, measures the general importance of a term • Total number of documents • Divided by the number of documents containing the term
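The quantities on this slide combine into the standard tf-idf weight (with n_{i,j} the count of term t_i in document d_j, and |D| the total number of documents):

```latex
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad
\mathrm{idf}_i = \log \frac{|D|}{\left|\{\, d : t_i \in d \,\}\right|}, \qquad
\text{tf-idf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i
```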

  18. tf-idf: Similarity • Vector space model used • Documents and queries represented as vectors • Each dimension corresponds to a term • tf-idf used for weighting • Compare the angle between query and document
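Comparing the angle between query and document vectors is the usual cosine similarity over the tf-idf-weighted vectors:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors:
    1.0 for identical direction, 0.0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```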

  19. WordNet similarities • Lin: • a method to compute the semantic relatedness of word senses using the information content of the concepts in WordNet and the 'Similarity Theorem' • Leacock-Chodorow: • counts up the number of edges between the senses in the 'is-a' hierarchy of WordNet • the value is then scaled by the maximum depth of the WordNet 'is-a' hierarchy • Banerjee-Pedersen, Lesk: • chooses pairs of ambiguous words within a neighbourhood • checks their definitions in a dictionary • chooses the senses so as to maximise the number of common terms in the definitions of the chosen words

  20. Precision, Recall • Precision: percentage of relevant documents with respect to the number of documents retrieved • Recall: percentage of relevant documents retrieved with respect to the total number of relevant documents in the data space
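A direct translation of these two definitions, treating the retrieved and relevant documents as sets (the helper name is illustrative):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of all relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```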

  21. F-Score • The F-Score combines precision and recall • Harmonic mean of precision and recall
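The harmonic mean, written in its common F_beta form (beta = 1 gives the plain F-score of this slide):

```python
def f_score(precision, recall, beta=1.0):
    """F-measure: weighted harmonic mean of precision and recall.
    beta=1 weights both equally (the standard F1 score)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```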
