Personalisation

Presentation Transcript



Personalisation

Seminar on

Unlocking the Secrets of the Past:

Text Mining for Historical Documents

Sven Steudter



Domain

  • Museums offer a vast amount of information

  • But: visitors' receptivity and time are limited

  • Challenge: selecting (subjectively) interesting exhibits

  • Idea of a mobile, electronic handheld device, like a PDA, assisting the visitor by:

    • Delivering content based on observations of the visit

    • Recommending exhibits

  • Non-intrusive, adaptive user-modelling technologies are used



Prediction Stimuli

  • Different stimuli:

    • Physical proximity of exhibits

    • Conceptual similarity (based on the textual description of the exhibit)

    • Relative sequence in which other visitors visited the exhibits (popularity)

  • Evaluate the relative impact of the different factors => separate stimuli

  • Language-based models simulate the visitor's thought process



Experimental Setup

  • Melbourne Museum, Australia

  • Largest museum in the Southern Hemisphere

  • Restricted to the Australia Gallery collection, which presents the history of the city of Melbourne:

    • Phar Lap

    • CSIRAC

  • Variation of exhibits: they cannot be classified into a single category



Experimental Setup

  • Wide range of modalities:

    • Information plaques

    • Audio-visual enhancements

    • Multiple displays interacting with the visitor

  • Here: exhibit types and modalities are NOT differentiated

  • The Australia Gallery collection consists of 53 exhibits

  • Topology of the floor: open-plan design => no sequence predetermined by the architecture



Resources

  • Floor plan of the exhibition, located on the 2nd floor

  • Physical distances between the exhibits

  • The Melbourne Museum website provides a corresponding web page for every exhibit

  • Dataset of 60 visitor paths through the gallery, used for:

    • Training (machine learning)

    • Evaluation



Predictions Based on Proximity and Popularity

  • Proximity-based predictions:

    • Exhibits ranked in order of physical distance

    • Prediction: the closest not-yet-visited exhibit to the visitor's current location

    • Serves as the baseline in the evaluation

  • Popularity-based predictions:

    • Visitor paths provided by the Melbourne Museum

    • Convert the paths into a matrix of transition probabilities

    • Zero probabilities removed with Laplacian smoothing

    • Markov model (see the sketch below)
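
A minimal sketch of how such a popularity model could be built, assuming visitor paths are given as lists of exhibit identifiers; the exhibit names, the helper names and the smoothing constant below are illustrative, not taken from the original work:

```python
from collections import defaultdict

def build_transition_matrix(paths, exhibits, alpha=1.0):
    """Estimate first-order Markov transition probabilities from visitor paths,
    with Laplace (add-alpha) smoothing so no transition has zero probability."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    probs = {}
    for src in exhibits:
        total = sum(counts[src].values()) + alpha * len(exhibits)
        probs[src] = {dst: (counts[src][dst] + alpha) / total for dst in exhibits}
    return probs

def predict_next(probs, current, visited):
    """Recommend the most probable not-yet-visited exhibit from the current location."""
    candidates = {e: p for e, p in probs[current].items() if e not in visited}
    return max(candidates, key=candidates.get)

# Illustrative toy data: three exhibits and two short visitor paths.
exhibits = ["Phar Lap", "CSIRAC", "Gold Rush display"]
paths = [["Phar Lap", "CSIRAC"], ["Phar Lap", "Gold Rush display", "CSIRAC"]]
probs = build_transition_matrix(paths, exhibits)
print(predict_next(probs, "Phar Lap", visited={"Phar Lap"}))
```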



Text-based Prediction

  • Exhibits are related to each other by their information content

  • Every exhibit's web page consists of:

    • A body of text describing the exhibit

    • A set of attribute keywords

  • Prediction of the most similar exhibit:

    • Keywords as queries

    • Web pages as the document space

  • Simple term frequency – inverse document frequency (tf-idf)

  • The score of each query over each document is normalised (see the sketch below)
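
A minimal sketch of this retrieval step, assuming scikit-learn is available; the page texts and keyword sets below are placeholders for the real Melbourne Museum web pages and attribute keywords:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder exhibit pages and attribute keywords.
pages = {
    "Phar Lap": "Phar Lap was a champion Australian racehorse ...",
    "CSIRAC": "CSIRAC was Australia's first digital computer ...",
}
keywords = {
    "Phar Lap": "racehorse champion racing",
    "CSIRAC": "computer digital science",
}

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(pages.values())    # web pages = document space
query_matrix = vectorizer.transform(keywords.values())   # keyword sets = queries

# Normalised score of each keyword query against each exhibit page.
scores = cosine_similarity(query_matrix, doc_matrix)
for exhibit, row in zip(keywords, scores):
    print(exhibit, dict(zip(pages, row)))
```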



WSD

  • Why do visitors make connections between exhibits?

  • Multiple similarities between exhibits are possible

  • Use of Word Sense Disambiguation (WSD):

    • The visitor's path as a sentence of exhibits

    • Each exhibit in the sentence has an associated meaning

    • Determine the meaning of the next exhibit

  • For each word in the keyword set of each exhibit:

    • The WordNet similarity is calculated against each other word in the other exhibits



WordNet Similarity

  • Similarity methods used:

    • Lin (measures the difference in information content of two terms as a function of their probability of occurrence in a corpus)

    • Leacock-Chodorow (edge counting: a function of the length of the path linking the terms and of the position of the terms in the taxonomy)

    • Banerjee-Pedersen (Lesk algorithm)

  • Similarity between two exhibits as the sum of the WordNet similarities between their keywords (see the sketch after this list)

  • The visitor's history may be important for the prediction

  • The most recently visited exhibits have a higher impact on the visitor than the first visited exhibits
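
A minimal sketch of summing keyword similarities, assuming NLTK and its WordNet corpus are available and using the Leacock-Chodorow measure as one example of the measures listed above; the function names and keyword sets are illustrative:

```python
from itertools import product
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def keyword_similarity(word_a, word_b):
    """Leacock-Chodorow similarity between the best-matching noun senses of two
    keywords; 0.0 if either word has no noun sense in WordNet."""
    pairs = product(wn.synsets(word_a, pos=wn.NOUN), wn.synsets(word_b, pos=wn.NOUN))
    scores = [s1.lch_similarity(s2) for s1, s2 in pairs]
    return max((s for s in scores if s is not None), default=0.0)

def exhibit_similarity(keywords_a, keywords_b):
    """Exhibit-to-exhibit similarity as the sum of pairwise keyword similarities."""
    return sum(keyword_similarity(a, b) for a, b in product(keywords_a, keywords_b))

# Illustrative keyword sets for two exhibits.
print(exhibit_similarity({"horse", "racing"}, {"computer", "machine"}))
```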



Evaluation: Method

  • For each method, two tests:

    • Predict the next exhibit in the visitor's path

    • Restricted predictions: predict only if the prediction score is above a threshold

  • Evaluation data: the aforementioned 60 visitor paths

  • 60-fold cross-validation is used (sketched below); for Popularity:

    • 59 visitor paths as training data

    • The 1 remaining path is used for evaluation

    • Repeat this for all 60 paths

    • Combine the results into a single estimate (e.g. the average)
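
A minimal sketch of this leave-one-out scheme for the popularity model, reusing the hypothetical build_transition_matrix and predict_next helpers from the earlier sketch:

```python
def leave_one_out_accuracy(paths, exhibits):
    """60-fold (leave-one-out) cross-validation of the popularity model: train on
    59 paths, predict every next exhibit of the held-out path, then average."""
    hits = total = 0
    for i, held_out in enumerate(paths):
        training = paths[:i] + paths[i + 1:]
        probs = build_transition_matrix(training, exhibits)
        visited = {held_out[0]}
        for current, actual_next in zip(held_out, held_out[1:]):
            if predict_next(probs, current, visited) == actual_next:
                hits += 1
            total += 1
            visited.add(actual_next)
    return hits / total
```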



Evaluation

Accuracy: percentage of times the event that occurred was predicted with the highest probability

BOE (Bag of Exhibits): percentage of exhibits visited by the visitor, not necessarily in the order of recommendation

BOE is, in this case, identical to precision (see the sketch below)
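
A minimal sketch of these two metrics, assuming the per-step top predictions and the recommended/visited exhibits are available as plain Python lists and sets; all names are illustrative:

```python
def accuracy(top_predictions, actual_next_exhibits):
    """Fraction of steps where the top-ranked prediction was the exhibit visited next."""
    hits = sum(p == a for p, a in zip(top_predictions, actual_next_exhibits))
    return hits / len(actual_next_exhibits)

def bag_of_exhibits(recommended, visited):
    """BOE: fraction of recommended exhibits the visitor actually saw, in any order
    (identical to precision in this setting)."""
    return len(set(recommended) & set(visited)) / len(set(recommended))
```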

[Results table: single-exhibit history]



Evaluation

[Results table: single-exhibit history, without threshold vs. with threshold]



Evaluation

[Results table: visitor-history-enhanced vs. single-exhibit history]



Conclusion

  • Best-performing method: popularity-based prediction

  • The history-enhanced models were low performers; possible reasons:

    • Visitors had no preconceived task in mind

    • They moved from one impressive exhibit to the next

  • History is not relevant here; the current location is more important

  • Keep in mind:

    • Small dataset

    • The Melbourne Gallery (history of the city) is perhaps not a good choice



BACKUP



tf-idf

  • Term frequency – inverse document frequency

  • Term count = number of times a given term appears in a document

  • The count n of term t_i in document d_j

  • A term is more likely to occur in larger documents, therefore normalise

  • The inverse document frequency (idf) measures the general importance of a term:

  • The total number of documents,

  • divided by the number of documents containing the term (see the formulas below)
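
In standard notation (reconstructed from the bullets above; the exact weighting variant used in the presented work is not specified), with n_{i,j} the count of term t_i in document d_j and D the document collection:

```latex
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad
\mathrm{idf}_i = \log\frac{|D|}{|\{\, d \in D : t_i \in d \,\}|}, \qquad
\text{tf-idf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i
```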



tf-idf: Similarity

  • A vector space model is used

  • Documents and queries are represented as vectors

  • Each dimension corresponds to a term

  • tf-idf is used for weighting

  • Compare the angle between the query and the document (formula below)
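
The comparison is presumably the standard cosine similarity between the query vector q and the document vector d:

```latex
\cos(\theta) = \frac{\vec{q} \cdot \vec{d}}{\lVert\vec{q}\rVert \, \lVert\vec{d}\rVert}
```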



WordNet Similarities

  • Lin:

    • A method to compute the semantic relatedness of word senses using the information content of the concepts in WordNet and the 'Similarity Theorem'

  • Leacock-Chodorow:

    • Counts the number of edges between the senses in the 'is-a' hierarchy of WordNet

    • The value is then scaled by the maximum depth of the WordNet 'is-a' hierarchy

  • Banerjee-Pedersen (Lesk):

    • Chooses pairs of ambiguous words within a neighbourhood

    • Checks their definitions in a dictionary

    • Chooses the senses so as to maximise the number of common terms in the definitions of the chosen words



Precision, Recall

Precision: percentage of retrieved documents that are relevant, i.e. relevant retrieved documents relative to the total number of documents retrieved.

Recall: percentage of relevant documents that were retrieved, i.e. relevant retrieved documents relative to the total number of relevant documents in the data space (formulas below).
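
In set notation:

```latex
\mathrm{Precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|},
\qquad
\mathrm{Recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}
```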



F-Score

  • F-Score combines Precision and Recall

  • Harmonic mean of precision and recall (formula below)
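
The balanced F-score (F1), i.e. the harmonic mean of the two quantities:

```latex
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```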

