Introduction to Information Retrieval

Introduction to Information Retrieval Chap. 16 Flat clustering

Contents

Before starting (1/2) • Clustering algorithm • Role: groups a set of documents into subsets, or clusters • Goal: produce clusters that are internally coherent but clearly different from one another • Some notes on clustering • Unsupervised learning • Clustering is the most common form of unsupervised learning • No expert has to assign documents to classes • Distance between documents • Often computed as Euclidean distance

Before starting (2/2) • Types of clustering • Hierarchy or not: flat vs. hierarchical • Exclusivity: hard vs. soft • Hard clustering: each document is a member of exactly one cluster • Soft clustering: a document's membership is a distribution over the clusters (e.g., Latent Semantic Indexing)

16.1 Clustering in information retrieval • Cluster hypothesis • Documents in the same cluster behave similarly with respect to relevance to information needs. • In particular, documents in the same cluster are likely to be relevant to the same queries.

16.1 Clustering in information retrieval • Applications • Search result clustering • Useful when a search term has several meanings (e.g., jaguar: the car, the animal, the Apple operating system) • (e.g., http://vivisimo.com)

16.1 Clustering in information retrieval • Scatter-gather

16.1 Clustering in information retrieval • Collection clustering • Assigns the documents of a collection to static clusters • e.g., Google News, Columbia NewsBlaster system • Language modeling • When few documents match a query, other documents from the same cluster can be retrieved instead of relying only on inverted-index search. • This increases recall. • If the documents within a cluster are highly similar, they tend to be relevant together. • Language modeling adopted this idea. • Standard language model: tf at document level + tf at collection level • Cluster-based variant: tf at document level + tf at cluster level, which gives a more accurate estimate of the probability that a term occurs in a document.

16.1 Clustering in information retrieval • Cluster-based retrieval • Unlike conventional retrieval, first find the cluster closest to the query and search only the documents in that cluster; this speeds up retrieval.

16.2 Problem statement • Hard flat clustering • Goal • Given • a set of documents D = {d1, …, dN} • a desired number of clusters K • an objective function that evaluates the quality of a clustering • compute an assignment γ: D → {1, …, K} that minimizes the objective function • Measure of relatedness between documents: this chapter uses similarity or distance • Terminology • Partitional clustering: each document belongs to exactly one cluster • Exhaustive clustering: every document is assigned to a cluster (cf. nonexhaustive clustering: some documents may remain unassigned)

16.2 Problem statement • Cardinality – the number of clusters • Choosing the number of clusters K is a difficult issue. • Enumerating every possible clustering and picking the best one is infeasible because there are exponentially many partitions.
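To make "exponentially many partitions" concrete, the sketch below counts the ways to split N labeled documents into K non-empty clusters (the Stirling number of the second kind). The function name `stirling2` is chosen here for illustration and does not come from the slides.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n labeled items into k non-empty clusters."""
    if n == k:
        return 1  # also covers the empty case n == k == 0
    if k == 0 or k > n:
        return 0
    # The n-th item either forms a new cluster on its own
    # or joins one of the k clusters built from the other n - 1 items.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, stirling2(n, 3))
```

Even for K = 3 the count grows roughly like 3^N / 6, which is why exhaustive search is ruled out for realistic collections.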

16.3 Evaluation of clustering • Criteria of quality • Internal criterion • High intracluster similarity & low intercluster similarity • External criterion • Uses an external reference (a gold standard, or a set of classes in an evaluation benchmark) • Measures how well the clustering matches the gold standard

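16.3 Evaluation of clustering • Four external criteria • Purity • A simple measure • Assign each cluster to the class that is most frequent in it • Count the documents that agree with the class assigned to their cluster and divide by the total number of documents: purity(Ω, C) = (1/N) Σk maxj |ωk ∩ cj| • Ω = {ω1, ω2, …, ωK}: the set of clusters, ωk: the documents in cluster k • C = {c1, c2, …, cJ}: the set of classes, cj: the documents in class j • Purity increases as the number of clusters grows. In the extreme case where every cluster holds a single document, purity is 1, but then clustering serves no purpose.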
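A minimal sketch of the purity computation described above. The input layout (two parallel lists of cluster ids and gold classes) and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def purity(clusters, classes):
    """Purity of a clustering.

    clusters[i] is the cluster id of document i,
    classes[i] is its gold-standard class.
    """
    if len(clusters) != len(classes) or not clusters:
        raise ValueError("need two non-empty label lists of equal length")
    per_cluster = {}
    for k, c in zip(clusters, classes):
        per_cluster.setdefault(k, Counter())[c] += 1
    # For each cluster, count the documents of its majority class.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(clusters)


if __name__ == "__main__":
    # Toy data: 17 documents, 3 clusters, 3 classes ("x", "o", "d").
    clusters = [1] * 6 + [2] * 6 + [3] * 5
    classes = ["x"] * 5 + ["o"] + ["x"] + ["o"] * 4 + ["d"] + ["x"] * 2 + ["d"] * 3
    print(purity(clusters, classes))
```

Putting every document in its own cluster drives this value to 1, which is the weakness the slide points out.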

16.3 Evaluation of clustering • Normalized mutual information (NMI) • NMI(Ω, C) = I(Ω; C) / ([H(Ω) + H(C)]/2) • I: mutual information • H: entropy
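The following sketch computes NMI from maximum-likelihood estimates of the probabilities (relative frequencies of the labels). The function names and the convention of returning 1.0 when both labelings consist of a single group are assumptions of this example.

```python
import math
from collections import Counter


def entropy(labels):
    """H of a labeling, estimated from relative frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(clusters, classes):
    """I(clusters; classes), estimated from joint relative frequencies."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    n_k = Counter(clusters)
    n_j = Counter(classes)
    return sum(
        (n_kj / n) * math.log(n * n_kj / (n_k[k] * n_j[j]))
        for (k, j), n_kj in joint.items()
    )


def nmi(clusters, classes):
    """Mutual information divided by the mean of the two entropies."""
    denom = (entropy(clusters) + entropy(classes)) / 2
    if denom == 0:
        return 1.0  # assumed convention: both labelings are one single group
    return mutual_information(clusters, classes) / denom


if __name__ == "__main__":
    print(nmi([0, 0, 1, 1], ["a", "a", "b", "b"]))
    print(nmi([0, 0, 1, 1], ["a", "b", "a", "b"]))
```

The first call compares two labelings that agree up to renaming; the second compares two statistically independent labelings.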

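16.3 Evaluation of clustering • Like purity, MI reaches its maximum when K = N, so it has the same weakness • NMI compensates by dividing by [H(Ω) + H(C)]/2 • Entropy grows with the number of clusters, which offsets the growth of MI • Because of the normalization, clusterings with different numbers of clusters can be compared • NMI always lies between 0 and 1 • Rand index (RI) • Measures the accuracy of the clustering, i.e., the fraction of document pairs that are handled correctly: RI = (TP + TN) / (TP + FP + FN + TN)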

16.3 Evaluation of clustering • Example from Figure 16.4

16.3 Evaluation of clustering • RI gives equal weight to FP and FN. • Separating similar documents into different clusters is sometimes worse than putting dissimilar documents into the same cluster. • The F measure penalizes FN more heavily (when β > 1)
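The pair-counting view behind RI and the F measure can be sketched as follows. A pair of documents is a "positive" decision when both land in the same cluster; P = TP/(TP+FP), R = TP/(TP+FN) and Fβ = (β² + 1)PR / (β²P + R). Function names and the toy labels are illustrative assumptions.

```python
from itertools import combinations


def pair_counts(clusters, classes):
    """Return (TP, FP, FN, TN) over all unordered document pairs."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1  # dissimilar documents put in the same cluster
        elif same_class:
            fn += 1  # similar documents separated
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(clusters, classes):
    tp, fp, fn, tn = pair_counts(clusters, classes)
    return (tp + tn) / (tp + fp + fn + tn)


def f_beta(clusters, classes, beta=1.0):
    tp, fp, fn, _ = pair_counts(clusters, classes)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    clusters = [1] * 6 + [2] * 6 + [3] * 5
    classes = ["x"] * 5 + ["o"] + ["x"] + ["o"] * 4 + ["d"] + ["x"] * 2 + ["d"] * 3
    print(pair_counts(clusters, classes))
    print(rand_index(clusters, classes), f_beta(clusters, classes, beta=5.0))
```

With β > 1 recall dominates the score, so false negatives cost more than false positives.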

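16.4 K-means • K-means • The most important flat clustering algorithm • Objective: minimize the average squared Euclidean distance between documents and the center of their cluster • Centroid: the mean of the documents in a cluster, μ(ω) = (1/|ω|) Σx∈ω x • Ideally, each cluster is a sphere with the centroid as its center of gravity, and clusters do not overlap.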

16.4 K-means

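16.4 K-means • The goal of K-means is to minimize the residual sum of squares (RSS), i.e., the sum of squared distances between each document and its centroid. • Termination conditions for the K-means algorithm • A fixed number of iterations I has been completed • With too few iterations the clustering quality is poor. • The assignment of documents to clusters no longer changes between iterations • The running time is hard to predict. • The centroids no longer change • RSS falls below a threshold • The decrease in RSS falls below a threshold (the algorithm is close to convergence)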
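The reassignment and recomputation steps, together with two of the termination conditions above, can be sketched in plain Python. Random seed selection, the handling of empty clusters (the old centroid is kept), and all function names are assumptions of this sketch.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def rss(docs, assignment, centroids):
    """Residual sum of squares of a clustering."""
    return sum(sq_dist(d, centroids[k]) for d, k in zip(docs, assignment))


def kmeans(docs, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(docs, k)]  # random documents as seeds
    assignment = None
    for _ in range(max_iter):  # stop after a fixed number of iterations
        # Reassignment step: nearest centroid; ties go to the lowest index.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(d, centroids[c])) for d in docs
        ]
        if new_assignment == assignment:  # stop when assignments are stable
            break
        assignment = new_assignment
        # Recomputation step: move each centroid to the mean of its members.
        for c in range(k):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = centroid(members)
    return assignment, centroids


if __name__ == "__main__":
    docs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    assignment, centroids = kmeans(docs, k=2)
    print(assignment, rss(docs, assignment, centroids))
```

Neither step can increase RSS, which is why the loop is guaranteed to stop at a (possibly local) minimum.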

16.4 K-means • The K-means centroid reinterpreted as the minimizer of RSS
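The formulas of this slide were not recovered. A standard derivation of the result named in the title, written with the notation used elsewhere in these slides (v is a candidate center for cluster ωk, M the number of dimensions), runs as follows:

```latex
\begin{align*}
\mathrm{RSS}_k(\vec{v})
  &= \sum_{\vec{x} \in \omega_k} \lvert \vec{v} - \vec{x} \rvert^2
   = \sum_{\vec{x} \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2 \\
\frac{\partial\, \mathrm{RSS}_k(\vec{v})}{\partial v_m}
  &= \sum_{\vec{x} \in \omega_k} 2\,(v_m - x_m) = 0 \\
\Rightarrow\quad v_m
  &= \frac{1}{\lvert \omega_k \rvert} \sum_{\vec{x} \in \omega_k} x_m
\end{align*}
```

The minimizer is the component-wise mean of the cluster, i.e., the centroid, so the recomputation step never increases RSS.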

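16.4 K-means • Tie breaking • When a document is equidistant from the centroids of different clusters, a tie-breaking rule is needed (e.g., first come, first served) • Outliers • If the document set contains many outliers, the algorithm is more likely to miss the global minimum • The choice of initial seeds therefore matters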

16.4 K-means • Heuristics for seed selection • Exclude outliers from the seed set • Try multiple starting points and keep the clustering with the lowest cost • Obtain seeds from another method, such as hierarchical clustering • Deterministic hierarchical clustering methods are more predictable than K-means • As in the Buckshot algorithm, clustering a small random sample of size iK can provide good seeds • A robust method • For each cluster, draw i random vectors and use their centroid as the seed
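The last heuristic (averaging i random vectors per seed) can be sketched as below. The function name `robust_seeds`, the default i = 10, and sampling without replacement within each seed are assumptions made for illustration.

```python
import random


def robust_seeds(docs, k, i=10, seed=0):
    """For each of the k clusters, average i randomly drawn document vectors."""
    rng = random.Random(seed)
    size = min(i, len(docs))  # cannot draw more vectors than there are documents
    seeds = []
    for _ in range(k):
        sample = rng.sample(docs, size)
        seeds.append([sum(col) / size for col in zip(*sample)])
    return seeds


if __name__ == "__main__":
    docs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [50.0, 50.0]]
    print(robust_seeds(docs, k=2, i=3))
```

Averaging dampens the effect of any single outlier that happens to be drawn, at the price of seeds that lie closer to the middle of the collection.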

16.4 K-means • Time complexity of K-means • Most of the time is spent computing distances • Overall complexity of the reassignment step: Θ(KNM) • K: number of clusters, N: number of documents, M: number of unique terms in the collection • Overall complexity of the recomputation step: Θ(NM) • With a fixed number of iterations I, the overall complexity is Θ(IKNM) • K-means is linear in all of these factors, which makes it more efficient than hierarchical clustering • In most cases K-means quickly reaches complete convergence or a clustering close to convergence

16.4 K-means • High dimensionality • Document vectors are usually sparse, so distances between documents are cheap to compute • Centroids are dense, so document-to-centroid distances become expensive in high dimensions • Truncating centroids • Represent each centroid by only its k most significant terms (e.g., k = 1,000) • This speeds up the reassignment step considerably with little loss in clustering quality • K-medoids: a variant of K-means • Uses the medoid, the document closest to the centroid, in place of the centroid • Robust to outliers • Same time complexity as K-means
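Both ideas can be sketched on sparse vectors stored as term-to-weight dictionaries. Interpreting "most significant" as "largest absolute weight" is an assumption of this example, as are the function names.

```python
def truncate_centroid(centroid, k):
    """Keep only the k terms with the largest absolute weight."""
    top = sorted(centroid.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]
    return dict(top)


def sparse_sq_dist(a, b):
    """Squared Euclidean distance between two sparse term->weight dicts."""
    terms = set(a) | set(b)
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms)


def medoid(docs, centroid):
    """The document vector closest to the given centroid."""
    return min(docs, key=lambda d: sparse_sq_dist(d, centroid))


if __name__ == "__main__":
    docs = [{"a": 1.0}, {"a": 1.0, "b": 1.0}, {"b": 5.0}]
    center = {"a": 2 / 3, "b": 2.0}
    print(medoid(docs, center))
    print(truncate_centroid({"a": 0.1, "b": 0.7, "c": 0.4}, 2))
```

Because a medoid is itself a document, it stays as sparse as the documents, which keeps distance computations cheap.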

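16.4 K-means • Another heuristic for choosing K • Penalize each new cluster • Start with a single cluster containing all documents and increase the number of clusters step by step • The objective function combines two elements • Distortion • How far the documents deviate from the prototype of their cluster • RSS is the usual measure • Model complexity • Proportional to the number of clusters • K = argminK [RSSmin(K) + λK] • λ: weighting factor; larger values favor fewer clusters • Akaike information criterion (AIC) • K = argminK [−2L(K) + 2q(K)] • −L(K): the negative maximum log-likelihood of the data for K clusters • q(K): the number of parameters of a model with K clusters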

16.4 K-means • For K-means, the AIC can be written as K = argminK [RSSmin(K) + 2MK] • That is, L(K) = −(1/2)RSSmin(K), q(K) = KM, and λ = 2M
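The selection rule is a one-line argmin once RSSmin(K) is known for each candidate K. The RSS values below are hypothetical numbers invented for this example; in practice each would come from the best of several K-means runs for that K.

```python
def choose_k(rss_min, lam):
    """Return the K that minimizes RSS_min(K) + lam * K.

    rss_min maps each candidate K to the smallest RSS found for that K.
    """
    return min(rss_min, key=lambda k: rss_min[k] + lam * k)


def choose_k_aic(rss_min, num_terms):
    """AIC for K-means: the same rule with lam = 2 * M."""
    return choose_k(rss_min, 2 * num_terms)


if __name__ == "__main__":
    # Hypothetical RSS_min values for K = 1..4 (illustration only).
    rss_min = {1: 100.0, 2: 40.0, 3: 30.0, 4: 28.0}
    for lam in (0, 5, 20, 100):
        print(lam, choose_k(rss_min, lam))
```

With λ = 0 the rule always picks the largest K, since RSS never increases as clusters are added; larger λ pushes the choice toward fewer clusters.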

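16.5 Model-based clustering • EM algorithm: a generalization of K-means • Applicable to a wider variety of document distributions than K-means • Premise • The data are assumed to have been generated by a model; the aim is to recover that model from the data • The most widely used criterion for estimating model parameters is maximum likelihood • In K-means, exp(−RSS) is proportional to the likelihood that a particular model generated the data • L(D|Θ): the objective function that measures the quality of the clustering; Θ = argmaxΘ L(D|Θ) • Once Θ is known, P(d|ωk; Θ) can be computed for every document-cluster pair. • Hard clustering such as K-means cannot model a document that is relevant to several topics.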

16.5 Model-based clustering • Rigid vs. flexible • Unlike K-means and hierarchical clustering, model-based clustering is flexible • Expectation-maximization (EM) • An iterative algorithm that maximizes L(D|Θ) • Here the documents are modeled as a mixture of multivariate Bernoulli distributions • Θ = {Θ1, …, ΘK}, Θk = {αk, q1k, …, qMk}, qmk = P(Um=1|ωk) • P(Um=1|ωk): the probability that a document from cluster k contains term tm • αk: the prior of cluster ωk, i.e., the probability that document d belongs to cluster ωk when nothing is known about d

16.5 Model-based clustering • Mixture model • A document is generated by picking a cluster ωk with probability αk and then generating its terms according to qmk • In the multivariate Bernoulli model, a document is a vector of M Boolean values • The parameters of K-means are the centroids; the parameters of EM here are αk and qmk • Expectation step: analogous to the reassignment step of K-means • Maximization step: analogous to the recomputation step of K-means

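16.5 Model-based clustering • Maximization step: analogous to the recomputation step of K-means • Recomputes the parameters from the soft assignments: qmk = Σn rnk I(tm ∈ dn) / Σn rnk, αk = Σn rnk / N • I(tm ∈ dn) = 1 if tm ∈ dn, and 0 otherwise • rnk: the soft assignment of document dn to cluster k, as computed in the previous iteration • Expectation step: analogous to the reassignment step of K-means • Recomputes rnk from the current αk and qmk: rnk is proportional to αk times the product of qmk over tm ∈ dn times the product of (1 − qmk) over tm ∉ dn, normalized so that Σk rnk = 1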
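A compact sketch of EM for a mixture of multivariate Bernoulli distributions follows. Several details are assumptions of this example and not taken from the slides: random soft assignments as the starting point, clipping qmk to [eps, 1 − eps] to avoid log(0), working in log space, and the function name `em_bernoulli`.

```python
import math
import random


def em_bernoulli(docs, vocab, k, iters=25, seed=0, eps=1e-4):
    """EM for a Bernoulli mixture. docs is a list of sets of terms.

    Returns (alpha, q, r): priors, per-cluster term probabilities,
    and soft assignments r[n][c]. Requires iters >= 1.
    """
    rng = random.Random(seed)
    n = len(docs)
    # Assumed initialization: random soft assignments, normalized per document.
    r = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        r.append([x / total for x in row])

    for _ in range(iters):
        # Maximization step: re-estimate alpha_k and q_mk from r_nk.
        mass = [max(sum(r[i][c] for i in range(n)), eps) for c in range(k)]
        total_mass = sum(mass)
        alpha = [m / total_mass for m in mass]
        q = []
        for c in range(k):
            q_c = {}
            for t in vocab:
                weighted = sum(r[i][c] for i in range(n) if t in docs[i])
                q_c[t] = min(1 - eps, max(eps, weighted / mass[c]))  # clipped
            q.append(q_c)

        # Expectation step: recompute r_nk from the current parameters.
        for i, d in enumerate(docs):
            logp = [
                math.log(alpha[c])
                + sum(math.log(q[c][t] if t in d else 1 - q[c][t]) for t in vocab)
                for c in range(k)
            ]
            top = max(logp)  # subtract the max for numerical stability
            weights = [math.exp(lp - top) for lp in logp]
            total = sum(weights)
            r[i] = [w / total for w in weights]
    return alpha, q, r


if __name__ == "__main__":
    docs = [{"sugar", "sweet"}, {"sugar", "sweet"}, {"oil", "crude"}, {"oil", "crude"}]
    vocab = sorted(set().union(*docs))
    alpha, q, r = em_bernoulli(docs, vocab, k=2)
    for row in r:
        print([round(x, 3) for x in row])
```

Unlike K-means, each document keeps a weight in every cluster; a hard clustering can be read off by taking the largest rnk per document.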

16.5 Model-based clustering • [EM clustering example, not recovered; surviving label: "Cluster 1 seeds"]

16.5 Model-based clustering • Finding good seeds is an even more critical issue for EM than for K-means. • Without suitable seeds, EM tends to get stuck in local optima. • A separate method is therefore often used to set the initial seeds; hard K-means is one option.

Thank You !
