Artificial Intelligence and Data Mining in Information Retrieval. Presentation by Volker Rehberg, University of Konstanz, 31.01.2011. Agenda: Definitions, Indexing, Classification, Clustering, Feedback & Ranking, Conclusion.


Artificial Intelligence and Data Mining in Information Retrieval

31.01.2011

Presentation by Volker Rehberg University of Konstanz

Agenda

Sections: Definitions, Indexing, Classification, Clustering, Feedback & Ranking, Conclusion

Overall goal: present the most important AI and data mining methods for information retrieval and their most valuable impact.

  • Definitions: Information Retrieval vs. Artificial Intelligence and Data Mining

Indexing and Dimension Reduction:

  • Term Vector Model
  • Dimension Reduction by PCA
  • Latent Semantic Analysis

Classification:

  • Support Vector Machines
  • Bayes Classifier
  • Fuzzy Classification

Clustering:

  • Query Reformulation
  • Document Clustering for Presentation

Relevance Feedback and Ranking:

  • with Neural Networks

Summary and Conclusion

Definitions

Definition of Information Retrieval:

“Information retrieval (IR) is finding material (usually documents) … that satisfies an information need from within large collections…” Christopher Manning [1]

Four distinct phases:

  • indexing
  • query formulation
  • comparison
  • feedback

[2]

Definition of Artificial Intelligence:

“Is the science and engineering of making intelligent machines.” John McCarthy [3]

  • Weak AI hypothesis: machines can behave intelligently.
  • Strong AI hypothesis: machines are really able to think. [4]

Definition of Data Mining:

“Data Mining is … generating knowledge from data and its presentation. It … originated in statistics or artificial intelligence and should be applicable to large databases...” Wolfgang Ertel [5]

Term Vector Model

A Boolean or numeric vector records the occurrences of words in a document.

Documents are similar if they lie close together in vector space.

Retrieval ranks documents by the distance or angle between the query vector and the document vector.

The term-document matrix has a large number of dimensions.

  • Problem: high processing cost
  • Need to reduce dimensions [1]
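
A minimal sketch of the term vector model described above, assuming raw term counts as vector entries and the cosine of the angle as the similarity measure. The toy documents and the helper names (`term_vector`, `cosine_similarity`, `rank`) are invented for illustration and are not part of the presentation.

```python
import math
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection, invented for illustration only.
documents = {
    "d1": "data mining finds patterns in data",
    "d2": "information retrieval finds documents",
    "d3": "neural networks learn weights",
}
vocabulary = sorted({w for text in documents.values() for w in text.split()})
doc_vectors = {name: term_vector(text, vocabulary) for name, text in documents.items()}

def rank(query):
    """Return document names ordered by decreasing cosine similarity to the query."""
    q = term_vector(query, vocabulary)
    return sorted(doc_vectors,
                  key=lambda name: cosine_similarity(q, doc_vectors[name]),
                  reverse=True)

print(rank("information retrieval"))
```

Even in this toy case the vocabulary has one dimension per distinct word, which hints at why a real term-document matrix becomes expensive to process.
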

Dimension Reduction by PCA

Advantage: processing on a reduced and uncorrelated matrix

PCA is a rotation of the data in space (into a new coordinate system), so that:

  • the first dimensions store most of the information
  • the 1st dimension has the highest eigenvalue (importance)
  • Two main approaches: variance approach vs. error approach
  • We get a set of uncorrelated dimensions ordered by relevance
  • After that we can reduce the dimensionality [6]
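
The rotation idea can be sketched as follows, assuming the variance approach: the first principal axis is the eigenvector of the covariance matrix with the highest eigenvalue. Power iteration is used here only as a dependency-free stand-in for a library eigen-solver, and the five 2-D points are hypothetical.

```python
import math

def mean_center(rows):
    """Subtract the column means so the data is centered at the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpair(matrix, iterations=200):
    """Power iteration: dominant eigenvalue and its unit eigenvector."""
    d = len(matrix)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))
    return eigenvalue, v

# Hypothetical 2-D points that lie roughly along the diagonal y = x.
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
centered = mean_center(points)
cov = covariance_matrix(centered)
eigenvalue, axis = top_eigenpair(cov)

# Project each point onto the first principal axis: 2 dimensions become 1.
reduced = [sum(x * a for x, a in zip(row, axis)) for row in centered]

# Share of the total variance that the first dimension keeps.
explained = eigenvalue / (cov[0][0] + cov[1][1])
print(axis, explained)
```

Because almost all of the variance falls on the first axis in this example, dropping the second dimension loses very little information.
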

Latent Semantic Indexing

  • LSI reduces the dimensions of the term-document matrix by grouping terms into concepts

Advantages:

  • Dimension reduction of the term-document matrix
  • A document can be retrieved even if it does not contain the query words
  • LSI applies Singular Value Decomposition to the term-document matrix
  • Models concepts related to terms and concepts related to documents. [1]
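
A small sketch of the effect, assuming a rank-2 approximation of a hypothetical term-document matrix. The truncated SVD is computed with power iteration and deflation purely to keep the example dependency-free; in practice a library SVD routine would be used. All names (`top_singular_triplets`, `fold_in`, `lsi_scores`) are illustrative.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_singular_triplets(a, k, iterations=500):
    """Top-k (sigma, u, v) of matrix a via power iteration on a^T a with deflation."""
    columns = [list(col) for col in zip(*a)]
    n = len(columns)
    gram = [[sum(x * y for x, y in zip(columns[i], columns[j])) for j in range(n)]
            for i in range(n)]
    triplets = []
    for _ in range(k):
        v = [1.0] * n
        for _ in range(iterations):
            w = matvec(gram, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(x * y for x, y in zip(v, matvec(gram, v)))
        sigma = math.sqrt(eigenvalue)
        u = [x / sigma for x in matvec(a, v)]
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found so the next pass finds the next one.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return triplets

terms = ["car", "automobile", "engine", "flower", "petal"]
# Hypothetical term-document matrix (rows = terms, columns = documents d0..d3).
td = [
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
]
triplets = top_singular_triplets(td, k=2)

def fold_in(term_vector):
    """Map a term-space vector into the k-dimensional concept space."""
    return [sum(t * x for t, x in zip(term_vector, u)) / sigma
            for sigma, u, _ in triplets]

def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norms if norms > 0 else 0.0

def lsi_scores(query_terms):
    """Cosine similarity between the query and every document in concept space."""
    q = fold_in([1 if t in query_terms else 0 for t in terms])
    doc_coords = [[v[j] for _, _, v in triplets] for j in range(len(td[0]))]
    return [cosine(q, d) for d in doc_coords]

scores = lsi_scores({"car"})
print(scores)
```

Document d1 contains "automobile" and "engine" but not "car", yet it scores high for the query "car" because both documents load on the same concept. The two plant documents stay near zero.
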

Classification

Classification is assigning an object (e.g. a text or audio document) to a distinct class.

Advantages for indexing, query formulation, comparison, and ranking

Example benefits for Information Retrieval:

  • Identify the language of a document
  • Identify spam pages and do not index them [1]
  • Identify collocations (e.g. “New York”, “Data Mining”) and index them together
  • Categorization in library cataloging systems or of newswire stories [7]
  • Categorization of multimedia content (e.g. audio, video) [8]
  • Classify the query to see which text “category” the user is searching for
  • Relevance ranking (classes relevant / non-relevant) [1]

Several methods:

  • Support Vector Machines
  • Naïve Bayes
  • K-Nearest Neighbour
  • Decision Trees
  • Neural Networks
  • Genetic Algorithms

Supervised (often used) vs. unsupervised learning (rarely used):

  • Unsupervised learning requires no training but is much more computation-intensive than supervised schemes. [1]

Support Vector Machines

Divide documents into 2 classes by drawing a hyperplane with maximum margin to the support vectors.

Application: Video Retrieval [8]
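
A rough sketch of a linear SVM, assuming sub-gradient descent on the regularized hinge loss as the training method (one of several possible solvers, chosen here because it needs no external library). The 2-D feature vectors, the hyperparameters, and the function names are hypothetical.

```python
import math

def train_linear_svm(samples, labels, learning_rate=0.01, regularization=0.01, epochs=1000):
    """Sub-gradient descent on the regularized hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w, which widens the margin.
            w = [wi - learning_rate * regularization * wi for wi in w]
            if margin < 1:
                # Point is inside the margin or misclassified: push the hyperplane away.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b

def predict(w, b, x):
    """Class +1 on one side of the hyperplane w.x + b = 0, class -1 on the other."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical 2-D feature vectors for documents of two classes (+1 and -1).
samples = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(samples, labels)

# Width of the margin between the two classes for the learned hyperplane.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))
print(w, b, margin_width)
```

Only the points closest to the hyperplane keep triggering updates once training settles. Those points play the role of the support vectors.
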

Bayes Classifier

  • The theorem states: P(A|B) = P(B|A) · P(A) / P(B)

P(A) is the prior probability of A (also called the “unconditional” probability)

P(A|B) is the conditional probability of A, given B (also called the posterior probability)

P(B|A) is the conditional probability of B, given A (also called the likelihood)

P(B) is the prior probability of B

[9]

  • Example: spam filter
  • Was used in the 1960s for the first ranking systems
  • Implementation of the probabilistic model [10]
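
A minimal sketch of the spam filter example, assuming a naive Bayes classifier with add-one smoothing. Here A is the class (spam or ham) and B is the observed words. P(B) is the same for both classes, so the code compares only P(B|A) · P(A), in log form. The training messages are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical training messages, invented for illustration.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule for monday", "ham"),
    ("project meeting notes", "ham"),
]

class_counts = Counter(label for _, label in training)
word_counts = {label: Counter() for label in class_counts}
for text, label in training:
    word_counts[label].update(text.split())
vocabulary = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label):
    """log P(label) + sum of log P(word | label), with add-one smoothing.

    This is the posterior up to the constant -log P(words), which is the
    same for every class and therefore does not affect the decision.
    """
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(training))
    for word in text.split():
        score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
    return score

def classify(text):
    """Pick the class with the highest (unnormalized) posterior."""
    return max(class_counts, key=lambda label: log_posterior(text, label))

print(classify("cheap money"), classify("project meeting"))
```
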

Fuzzy Classification

  • In standard logic: something is true / not true
  • In fuzzy logic: something can be true or false to a certain degree [6]

Example application and benefits to IR:

  • Fuzzy Classification (e.g. relevant / not relevant -> Ranking)
  • Fuzzy Clustering
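
As an illustration of "true to a certain degree", the sketch below assumes a piecewise-linear membership function for the fuzzy set "relevant". The thresholds 0.2 and 0.8 and the similarity scores are hypothetical values chosen only for the example.

```python
def relevance_degree(score, low=0.2, high=0.8):
    """Membership in the fuzzy set 'relevant', between 0.0 and 1.0."""
    if score <= low:
        return 0.0
    if score >= high:
        return 1.0
    return (score - low) / (high - low)

# Hypothetical similarity scores for three documents.
scores = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
degrees = {name: relevance_degree(s) for name, s in scores.items()}

# The degree of membership directly yields a ranking.
ranking = sorted(degrees, key=degrees.get, reverse=True)
print(degrees, ranking)
```

A crisp classifier would have to put d2 into either "relevant" or "not relevant". The fuzzy version keeps it as partially relevant, which is what makes the result usable for ranking.
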
Clustering

Clustering is finding groups of similar objects.

Term Clustering:

Cluster search terms by their appearance in documents and add similar terms to the query.

Advantage:

  • Query expansion -> increased recall

Document Clustering:

Find similar documents with respect to relevance to information needs. [1]

Advantages:

  • Retrieve similar documents
  • Enhanced presentation of documents

Query Reformulation by Clustering

  • Query = “training” & “aircraft captain”
  • Does not return documents that contain only the term “pilot”

-> Increase recall by adding the related term “pilot” to the query

Association cluster: terms that often appear in the same documents are semantically related or synonyms [11]
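
The association cluster idea can be sketched as follows, assuming plain co-occurrence counts as the measure of relatedness. The mini-collection and the helper `expand_query` are hypothetical and only mirror the "pilot" example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-collection; each document is reduced to its set of terms.
documents = [
    {"pilot", "aircraft", "captain", "training"},
    {"pilot", "training", "simulator"},
    {"aircraft", "captain", "pilot"},
    {"football", "captain", "team"},
]

# Count how often each ordered pair of terms appears in the same document.
cooccurrence = Counter()
for doc in documents:
    for a, b in combinations(sorted(doc), 2):
        cooccurrence[(a, b)] += 1
        cooccurrence[(b, a)] += 1

def expand_query(query_terms, extra=1):
    """Add the `extra` terms that co-occur most often with the query terms."""
    candidates = Counter()
    for (a, b), count in cooccurrence.items():
        if a in query_terms and b not in query_terms:
            candidates[b] += count
    best = [term for term, _ in candidates.most_common(extra)]
    return list(query_terms) + best

print(expand_query(["training", "aircraft", "captain"]))
```
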

Document Clustering for Presentation

www.yippy.com

Advantage: clears up ambiguity through semantic clustering

Ranking and Feedback with NN

Neurons are connected into a network.

Neurons have:

  • Input and output values
  • Activation function
  • Weighted connections

Implementation for:

  • Vector space model
  • Probabilistic model
  • Boolean model

3 layers of neurons connected through weights.

Input layer: query

Hidden layer: terms

Output layer: documents

Query: propagation from input to hidden layer -> relevance ranking

Feedback: backpropagation -> relevance feedback [12] [13]
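
A simplified sketch of the three layers, assuming linear neurons and a delta-rule weight update as a stand-in for full backpropagation (the delta rule is the single-layer special case). The terms, connection weights, and function names are hypothetical.

```python
terms = ["neural", "network", "retrieval", "ranking"]
# Hypothetical term-to-document connection weights (rows = terms, columns = documents).
weights = [
    [0.8, 0.1],  # neural
    [0.7, 0.2],  # network
    [0.1, 0.6],  # retrieval
    [0.2, 0.5],  # ranking
]

def propagate(query_terms):
    """Forward pass: the query activates term neurons, which activate document neurons."""
    term_activation = [1.0 if t in query_terms else 0.0 for t in terms]
    doc_count = len(weights[0])
    outputs = [sum(term_activation[i] * weights[i][d] for i in range(len(terms)))
               for d in range(doc_count)]
    return term_activation, outputs

def feedback(query_terms, doc_index, target, learning_rate=0.5):
    """Delta-rule update: move one document's output toward the user's judgment."""
    term_activation, outputs = propagate(query_terms)
    error = target - outputs[doc_index]
    for i, activation in enumerate(term_activation):
        weights[i][doc_index] += learning_rate * error * activation

query = {"retrieval", "ranking"}
_, before = propagate(query)

feedback(query, doc_index=0, target=1.0)  # user marks document 0 as relevant
feedback(query, doc_index=1, target=0.0)  # user marks document 1 as not relevant
_, after = propagate(query)
print(before, after)
```

Before feedback the network ranks document 1 first for this query. After the two judgments the order is reversed, because only the weights of the active term neurons were adjusted.
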

Summary and Conclusion

  • AI & DM have a high influence on all processes of Information Retrieval
  • Multimedia retrieval is one of the most interesting fields of research at the moment.

Interested in AI/DM? -> Go to the courses:

  • “Data Mining” 1 & 2
  • “Computational Methods for Document Analysis”

Resources

  • [1] Manning, Christopher; Raghavan, Prabhakar; Schütze, Hinrich: “Introduction to Information Retrieval”
  • [2] Lewis, D.D.: “Learning in intelligent information retrieval.” Proceedings of the International Workshop on Machine Learning (Evanston, Illinois), pp. 235–239, 1991
  • [3] McCarthy, John: “What is Artificial Intelligence?”, www-formal.stanford.edu/jmc/whatisai/node, 2007
  • [4] Russell, Stuart; Norvig, Peter: “Artificial Intelligence: A Modern Approach”, 3rd ed., Prentice Hall, 2010
  • [5] Ertel, Wolfgang: “Grundkurs Künstliche Intelligenz: eine praxisorientierte Einführung”, 2nd ed., Vieweg und Teubner, 2009
  • [6] Berthold, Michael; Hand, David J.: “Intelligent Data Analysis: An Introduction”, 2nd ed., 2007
  • [7] Cunningham, S.J.; Littin, J.N.; Witten: “Applications of machine learning in information retrieval.” Annual Review of Information Science and Technology, edited by M.E. Williams, pp. 341-419, American Society for Information Science
  • [8] Blanken, Henk; Vries, Arjen P.; Blok, Henk Ernst; Feng, Ling: “Multimedia Retrieval”, Springer, 2007
  • [9] Mansmann, F.; Berthold, M.; Keim, D.: “Data Mining Foundations: Finding Explanations”, lecture slides, Universität Konstanz, 2011
  • [10] Stock, Wolfgang G.: “Information Retrieval: Informationen suchen und finden”, Oldenbourg Wissenschaftsverlag, 2007
  • [11] Eigenstuhler, Gerald; Hubmann, Alexander; Wischounig, Daniel: “Information Search and Retrieval Vorlesungsblock 05: Query Reformulation, AI in Information Retrieval”, Graz: Institut für Informationssysteme und Computer Medien, http://www.iicm.tu-graz.ac.at/isr/vo/inhalte/block_05/block05.htm#automatic_local_analysis, Jan 2010
  • [12] Sigel, Christian: “Inferenznetzwerke und Neuronale Netze im Information Retrieval”, Johannes Gutenberg-Universität Mainz, 2010, http://www.informatik.uni-mainz.de/lehre/ir/seminar-wise-0910/Sigel-INNN-Folien.pdf
  • [13] Chen, Hsinchun: “Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms”, Journal of the American Society for Information Science, 46(3):194-216, 1995