
COMP3740 CR32: Knowledge Management and Adaptive Systems




Presentation Transcript


  1. COMP3740 CR32: Knowledge Management and Adaptive Systems. Overview and example KM exam questions. By Eric Atwell, School of Computing, University of Leeds

  2. S1: Eric Atwell. Office: 6.06a; eric@comp.leeds.ac.uk; http://www.comp.leeds.ac.uk/eric; http://www.comp.leeds.ac.uk/nlp
  S2: Vania Dimitrova. Office: 9.10p; vania@comp.leeds.ac.uk; http://www.comp.leeds.ac.uk/vania; http://www.comp.leeds.ac.uk/agc/krgroup.html

  3. Semester 1 Topics in KM
  Knowledge in Knowledge Management:
  - the nature of knowledge, definitions and different types
  - knowledge used in Knowledge Based Systems and KM systems
  Knowledge and Information Retrieval / Extraction:
  - analysis of WWW data: Google tools, SketchEngine, BootCat
  - IR: finding documents which match keywords / concepts
  - IE: extracting key terms and facts (DB fields) from documents
  - matching user requirements; advanced/intelligent matching
  - mining the WWW as a source of data and knowledge
  Knowledge Discovery:
  - collating data in a data warehouse; transforming and cleaning
  - Cross-Industry Standard Process for Data Mining (CRISP-DM)
  - OLAP, knowledge visualisation, machine learning in WEKA
  - analysis of WWW-sourced data

  4. Past Exam Papers? One way to see what you need to learn is to look at past exam papers – this gives a “bird’s eye view”. COMP3740 CR32 is a new module, BUT it was developed from COMP3410 Technologies for Knowledge Management and COMP3640 Personalisation and User-Adaptive Systems. For example, past COMP3410 exam papers cover some of the topics in CR32.

  5. Q1a: KM for bibliographic search. Serge Sharoff is a lecturer at Leeds University who has published many research papers relating to technologies for knowledge management, for example: … (i) Imagine you are asked to assess the impact of Dr Sharoff’s research, by finding a list of papers by other researchers which cite these publications. Suggest three Information Retrieval tools you could use for this task. State an advantage and a disadvantage of each of these three IR tools for this search task, in comparison to the other tools.

  6. A1a: KM for bibliographic search. (i) Name 3 appropriate tools, e.g. Google Scholar, CiteSeer, ISI Web of Knowledge, Google Books, with an appropriate pro and con of each, e.g.:
  - Google Scholar. Pro: wider coverage, all publications on the open WWW. Con: does not give full references, just a URL and some details.
  - CiteSeer. Pro: stores papers in several formats plus BibTeX references. Con: not as good coverage, especially of interdisciplinary work.
  - ISI Web of Knowledge (Web of Science). Pro: good coverage of top journals, including “paid-for” ones. Con: most papers in this field are not in top journals.

  7. Q1a (ii): KM doesn’t always work. Q: Suggest three reasons why citations for some papers might not be found by any of your suggested IR tools. A:
  - Two of these papers are in Russian, and their citations may be too; these tools focus on English-language papers.
  - Papers in this field are mainly in conference/workshop proceedings, not journals, and hence less likely to be indexed by IR tools (especially Web of Science).
  - Older papers may not be online, so they are less likely to be found and cited by others.

  8. Q1b: Info Retrieval v Info Extraction. What is the difference between Information Retrieval and Information Extraction? A Knowledge Management consultancy aims to build a database of all Data Mining tools available for download via the WWW, including name, cost, implementation language, input/output format(s), and Machine Learning algorithm(s) included; should they use IR or IE for this task, and why?

  9. A1b: Info Retrieval v Info Extraction. IR: finding whole documents which match a query. IE: extracting data/info from a given text to populate fields in database or knowledge-base records. Both IR and IE are appropriate here: the task requires IR to find DM-tool description webpages on the whole WWW, but then finding the specific details in each webpage is an “identifying fields in records for DB population” task. The sketch below makes the contrast concrete.
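A minimal Python sketch of the IR/IE contrast (not from the original slides; the documents, query, and “cost” field pattern are invented for illustration):

```python
import re

# Two toy "webpages" (hypothetical content)
docs = {
    "weka.html":  "WEKA is a data mining toolkit written in Java, cost: free.",
    "other.html": "A page about something else entirely.",
}

# IR: return the names of whole documents matching all query keywords
def retrieve(query):
    return [name for name, text in docs.items()
            if all(word.lower() in text.lower() for word in query.split())]

# IE: extract one specific field from a given text, to populate a DB record
def extract_cost(text):
    match = re.search(r"cost:\s*(\w+)", text, re.IGNORECASE)
    return match.group(1) if match else None

print(retrieve("data mining"))          # ['weka.html']
print(extract_cost(docs["weka.html"]))  # 'free'
```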

  10. Q1c: using relevance feedback to adapt a query. An IR query finds “matching” documents. The user may say some are not relevant. Relevance feedback can guide the system to adapt the initial query, so that the new query finds “more of the same”. This may look complicated, but it is just putting the numbers into the equation.

  11. Relevance feedback example. [4 marks: 1 for correct q vector, 1 for realising Σ sums a single d vector, 1 for 3 weighted vectors, 1 for the answer]
  q' = α q + β Σ di / |HR| − γ Σ di / |HNR|   (sums over di in HR and HNR respectively; here α = β = γ = 0.5, HR = {d1}, HNR = {d4})
     = 0.5 q + 0.5 d1 − 0.5 d4
     = 0.5 × (1.0, 0.6, 0.0, 0.0, 0.0) + 0.5 × (0.8, 0.8, 0.0, 0.0, 0.4) − 0.5 × (0.6, 0.8, 0.4, 0.6, 0.0)
     = (0.5, 0.3, 0.0, 0.0, 0.0) + (0.4, 0.4, 0.0, 0.0, 0.2) − (0.3, 0.4, 0.2, 0.3, 0.0)
     = (0.6, 0.3, −0.2, −0.3, 0.2)
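The same Rocchio-style update in a few lines of Python (a minimal sketch, not from the original slides; the vectors are the worked example’s values, and the weights α = β = γ = 0.5 are read off the arithmetic above):

```python
def rocchio(q, relevant, non_relevant, alpha=0.5, beta=0.5, gamma=0.5):
    """Adapted query q' = alpha*q + beta*mean(relevant) - gamma*mean(non_relevant)."""
    def mean(docs):
        if not docs:
            return [0.0] * len(q)
        return [sum(d[i] for d in docs) / len(docs) for i in range(len(q))]
    r, nr = mean(relevant), mean(non_relevant)
    return [alpha * q[i] + beta * r[i] - gamma * nr[i] for i in range(len(q))]

q  = [1.0, 0.6, 0.0, 0.0, 0.0]   # initial query vector
d1 = [0.8, 0.8, 0.0, 0.0, 0.4]   # judged relevant:     HR  = {d1}
d4 = [0.6, 0.8, 0.4, 0.6, 0.0]   # judged non-relevant: HNR = {d4}

print([round(v, 2) for v in rocchio(q, [d1], [d4])])
# [0.6, 0.3, -0.2, -0.3, 0.2] -- matches the worked answer
```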

  12. Q2: Knowledge processes. “In 2008, Leeds University adopted the Blackboard Virtual Learning Environment (VLE) to be used in undergraduate taught modules in all schools and departments. In future, lectures and tutorials may become redundant at Leeds University: if we assume that student learning fits Coleman’s model of Knowledge Management processes, then the Virtual Learning Environment provides technologies to deal with all stages in this model. All relevant explicit, implicit, tacit and cultural knowledge can be captured and stored in our Virtual Learning Environment, for students to access using Information Retrieval technologies.” Is this claim plausible? In your answer, explain what is meant by Coleman’s model of Knowledge Management processes, citing examples relating to learning and teaching at Leeds University. Define and give relevant examples of the four types of knowledge, and state whether they could be captured and stored in our VLE, and searched for via an Information Retrieval system. [20 marks]

  13. Even an “essay” has a marking scheme. Key points:
  - Coleman process of knowledge gathering/acquisition: the big problem would be data capture and preparation.
  - Coleman process of knowledge storage/organisation: KM/IR could be of great benefit.
  - Coleman process of knowledge refining/adding value: lectures aim at more than “rote learning”.
  - Coleman process of knowledge transfer/dissemination: students prefer the human factors of lectures?
  - Explicit Knowledge has been articulated. Example: lecture notes, course handbooks; already captured, and already accessible via IR search.
  - Implicit Knowledge hasn’t been articulated (but could be). Example: extra material known to the lecturer but not on the handouts; could potentially be captured, and accessible if in text form, e.g. transcripts.
  - Tacit Knowledge can’t be articulated but is applied “without thinking”. Example: how to design and implement elegant programs; tacit knowledge cannot be captured, hence cannot be searched for via IR.
  - Cultural Knowledge is shared norms/beliefs that enable concerted action. Example: students cooperate in groupwork; written guidelines can be captured and retrieved, but not “group spirit”.

  14. Q3: Data Mining with WEKA. Association rules link arbitrary features, e.g. (center = 0) => (color = 0) (100%: a perfect predictor). Classification rules predict the final feature, the class english = UK/US, e.g. (color < 3) => (english = UK) (100%: a perfect predictor). The sketch below makes the distinction concrete.
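Both kinds of rule have the form “if antecedent then consequent”; only classification rules are constrained to predict the class attribute. A minimal Python sketch (the toy records are hypothetical, not the actual .arff data from the exam):

```python
# Hypothetical records in the spirit of the WEKA exercise above
records = [
    {"center": 0, "color": 0, "english": "UK"},
    {"center": 0, "color": 0, "english": "UK"},
    {"center": 1, "color": 5, "english": "US"},
]

def confidence(records, antecedent, consequent):
    """Confidence of a rule: P(consequent | antecedent) over the records."""
    matches = [r for r in records if antecedent(r)]
    if not matches:
        return 0.0
    return sum(consequent(r) for r in matches) / len(matches)

# Association rule links arbitrary features: (center = 0) => (color = 0)
print(confidence(records, lambda r: r["center"] == 0,
                          lambda r: r["color"] == 0))        # 1.0 (100%)

# Classification rule predicts the class: (color < 3) => (english = UK)
print(confidence(records, lambda r: r["color"] < 3,
                          lambda r: r["english"] == "UK"))   # 1.0 (100%)
```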

  15. Simple decision tree:
      (colorpercent <= 40)?
        Yes -> UK
        No  -> US
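Rendered as code, the tree is a single-test classifier (a trivial sketch; only the attribute name colorpercent and the threshold come from the slide):

```python
def classify(instance):
    """The simple decision tree above: one test at the root, two leaves."""
    return "UK" if instance["colorpercent"] <= 40 else "US"

print(classify({"colorpercent": 35}))  # UK
print(classify({"colorpercent": 80}))  # US
```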

  16. How to choose the root? Aim to balance the decision tree: the best attribute is the one which naturally splits the instances into homogeneous subtrees with the fewest errors. E.g. (colorpercent <= 40) splits the training set into perfectly-predictive subsets.
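One standard way to make “most homogeneous subtrees” precise is information gain, the entropy reduction used by ID3/C4.5 (J48 in WEKA). A minimal sketch, with hypothetical instances chosen to match the slide’s perfectly-predictive split:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (0 = perfectly homogeneous)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(instances, labels, test):
    """Entropy reduction achieved by splitting the instances on a boolean test."""
    yes = [l for x, l in zip(instances, labels) if test(x)]
    no  = [l for x, l in zip(instances, labels) if not test(x)]
    n = len(labels)
    remainder = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return entropy(labels) - remainder

instances = [{"colorpercent": v} for v in (10, 30, 40, 70, 90)]  # hypothetical
labels    = ["UK", "UK", "UK", "US", "US"]
print(information_gain(instances, labels, lambda x: x["colorpercent"] <= 40))
# ~0.971: the gain equals the full label entropy, i.e. the split is perfect
```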

  17. Confusion matrix. This depends on the decision point given in (b); e.g. for (colorpercent <= 40) we get 2 wrong classifications:
      === Confusion Matrix ===
       a b   <-- classified as
       1 2 | a = UK
       0 0 | b = US
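A confusion matrix is easy to compute directly; this sketch uses hypothetical predictions chosen to reproduce the matrix above (rows are actual classes, columns predicted, as in WEKA’s output):

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows = actual class, columns = predicted class (WEKA-style layout)."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

actual    = ["UK", "UK", "UK"]        # hypothetical test labels
predicted = ["UK", "US", "US"]        # 1 correct, 2 wrong classifications
print(confusion_matrix(actual, predicted, ["UK", "US"]))
# [[1, 2], [0, 0]]
```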

  18. Supervised v unsupervised ML. Supervised learning involves learning from example instances together with the desired “answer” or classification for each, e.g. building a decision tree to predict the last attribute, english = UK/US, given the arff instances. Unsupervised learning involves learning from example instances without being shown the desired “answer” for each, e.g. clustering instances into groups of similar documents on the basis of discriminative feature values, not including english as the target class; this may yield another division of the documents.
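A minimal sketch of the contrast in scikit-learn (the module itself uses WEKA, so Python is only an illustrative stand-in; the feature vectors and labels are invented toy data):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[10, 1], [30, 0], [40, 1], [70, 5], [90, 4]]  # toy feature vectors
y = ["UK", "UK", "UK", "US", "US"]                 # class labels (english)

# Supervised: the learner is shown the desired answers y
tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[35, 2]]))        # e.g. ['UK']

# Unsupervised: no labels; instances are grouped by feature similarity
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(clusters)                       # e.g. [0 0 0 1 1] -- cluster ids are arbitrary
```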

  19. Reminder: bird’s eye overview of KM. Knowledge in Knowledge Management; Knowledge and Information Retrieval / Extraction; Knowledge Discovery. January mock exam: Knowledge Management.
