
An extended K-means++ with mixed attributes for outlier detection

An extended K-means++ with mixed attributes for outlier detection. Presented by Miss Sarunya Kanjanawattana. Examination Committee. Dr. Sumanta Guha (Chairperson) Prof. Dr. Phan Minh Dung (Committee) Dr. Matthew N. Dailey (Committee). :: Agenda ::. Background Literature review


Presentation Transcript


  1. An extended K-means++ with mixed attributes for outlier detection Presented by Miss Sarunya Kanjanawattana

  2. Examination Committee Dr. Sumanta Guha (Chairperson) Prof. Dr. Phan Minh Dung (Committee) Dr. Matthew N. Dailey (Committee)

  3. :: Agenda :: • Background • Literature review • Methodologies

  4. Background • Problem statement • Objective of the study • Scope and Limitation • Contribution

  5. « Background » • Data mining: • Huge volumes of data and information are collected in databases. • This volume of data far exceeds the human ability to analyze it and extract valuable information for decision-making support. “Data mining helps to transform the collected data into valuable information.”

  6. « Background » • Outlier detection: • Outlier clustering is a popular method used to detect fraud in data sets. • It identifies each data point as “normal” or “outlier”. Outlier data point => fraudulent sample

  7. « Background » • Fraud detection • Health insurance fraud detection is a beneficial and challenging task. • Detection helps to reveal patterns of fraud and abuse. Example: health insurance fraud led by institutions or health professionals includes the falsification of information on forms.

  8. « Background » • The National Health Security Office (NHSO) • is an autonomous state agency, officially founded in 2002 under the National Health Security Act. • The vital duties of NHSO • are to manage the health security fund and allocate the subsidy budget to 236 clinics and 963 hospitals, to promote and develop a good health care system for all Thai people.

  9. « Problem statement » • Fraud and abuse • lead to significant additional expense in the health care system. • A case study: the NHSO database • It holds a large volume of data. • Many transactions arrive constantly, every hour of every day. • The data become too large for human inspection to detect fraud. • Outlier clustering approach: • A faster and more accurate algorithm is needed to monitor outliers.

  10. « Objective of the study » • To provide a process for extracting fraud instances and uncovering unusual activities in NHSO data. • To extend K-means++, a variation of the standard k-means algorithm, to datasets with mixed attributes for detecting outliers. • To answer what is the optimal “”.

  11. « Scope and Limitation » • The data source covers only 4 provinces in Thailand: • Nakhon Ratchasima, Chaiyaphum, Buriram and Surin. • The transactions come from the group of high-cost diseases, • where fraudulent behavior is more likely to occur than in other disease groups.

  12. « Contribution » • The proposed study provides a methodology to detect fraud and abuse in NHSO, Thailand. It will present some results of outlier clustering. • This study proposes a novel algorithm, based on an extended K-means++, that works with mixed attributes and detects outliers.

  13. Literature review • Fraud detection • The process of data mining

  14. « Literature review » Fraud detection Yi et al. 2006: • Understand and detect suspicious health care fraud in large databases using clustering techniques. • Compare two clustering tools: SAS EM and CLUTO. • The experimental results indicate that CLUTO is faster than SAS EM, while SAS EM provides more useful clusters than CLUTO.

  15. « Literature review » Fraud detection Liou, Tang, and Chen 2008: • Apply data mining techniques to detect fraudulent or abusive reporting by healthcare providers, using their invoices for diabetic outpatient services. • Logistic regression, neural network, classification trees • The classification tree model performs best, with an overall correct identification rate of 99%.

  16. « Literature review » The process of data mining • Data preprocessing • Data obtained from real databases are often incomplete, noisy and inconsistent. • The goal of data preprocessing is to clean a raw data set in order to improve accuracy. • The steps of data preprocessing: • data cleaning, data transformation and integration, and data reduction.

  17. « Literature review » The process of data mining • Data preprocessing Wang and Chiang 2009: • Present an efficient data preprocessing procedure for support vector clustering (SVC) that reduces the size of the training dataset.

  18. « Literature review » The process of data mining • K-means algorithm

  19. « Literature review » The process of data mining • K-means algorithm • The benefits of K-means • Speed and simplicity: the algorithm is easy to understand and implement. • The shortcomings of K-means • Dependency on the number of clusters • Degeneracy

  20. « Literature review » The process of data mining • K-means++ algorithm

  21. « Literature review » The process of data mining • K-means++ algorithm • Arthur and Vassilvitskii 2007 • Fast and more efficient • K-means: O(i * n * k) running time for i iterations, n data points and k clusters • K-means++: O(log k), the factor by which the expected clustering cost after seeding can exceed the optimum (an approximation guarantee, not a running time) • Not well suited to datasets that combine categorical and numerical attributes

  22. « Literature review » The process of data mining • K-means++ algorithm • Example (k=3) • D(x) = the shortest distance from a data point x to the closest center we have already chosen.

  23. « Literature review » The process of data mining • K-means++ algorithm • Example (k=3)

  24. « Literature review » The process of data mining • K-means++ algorithm • Example (k=3) • $D^2 = 1^2 + 7^2$ • $D^2 = 8^2 + 4^2$ • $D^2 = 7^2 + 3^2$ • $D^2 = 2^2 + 1^2$

  25. « Literature review » The process of data mining • K-means++ algorithm • Example (k=3) • $D^2 = 1^2 + 7^2$ • $D^2 = 8^2 + 4^2$ • $D^2 = 7^2 + 3^2$ • $D^2 = 2^2 + 1^2$

  26. « Literature review » The process of data mining • K-means++ algorithm • Example (k=3) • $D^2 = 1^2 + 7^2$ • $D^2 = 1^2 + 1^2$ • $D^2 = 2^2 + 1^2$

  27. « Literature review » The process of data mining • K-means++ algorithm • Example (k=3) • $D^2 = 1^2 + 7^2$ • $D^2 = 1^2 + 1^2$ • $D^2 = 2^2 + 1^2$

  28. « Literature review » The process of data mining • K-means++ algorithm • Example (k=3)
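The squared distances on slide 24 can be turned into selection probabilities. The sketch below assumes that the four values belong to four candidate points measured against the centers chosen so far, and it applies the K-means++ rule of picking the next center with probability proportional to D(x)^2. The variable names are illustrative only.

```python
# Squared distances D(x)^2 read from the slide 24 example
# (assumed to belong to four candidate points).
squared_distances = [1**2 + 7**2, 8**2 + 4**2, 7**2 + 3**2, 2**2 + 1**2]

total = sum(squared_distances)

# K-means++ picks the next center with probability D(x)^2 / sum of all D(x)^2,
# so far-away points are favoured but never guaranteed.
probabilities = [d2 / total for d2 in squared_distances]

for d2, p in zip(squared_distances, probabilities):
    print(f"D^2 = {d2:3d}  ->  selection probability {p:.3f}")
```

The point with the largest squared distance is the most likely next center, while the nearest point still keeps a small chance of being chosen.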

  29. « Literature review » The process of data mining • Y-means algorithm

  30. « Literature review » The process of data mining • Y-means algorithm • Guan, Ghorbani, and Belacel 2003 • Based on the K-means algorithm • It overcomes two shortcomings of K-means: • dependency on the number of clusters, and degeneracy

  31. « Literature review » The process of data mining • Koufakou, Ortiz, Georgiopoulos, Anagnostopoulos, and Reynolds 2007 • Introduced a strategy named “Attribute Value Frequency (AVF)”, • a fast and scalable outlier detection strategy for categorical data.

  32. Methodologies • Methodology • Data collection • Data evaluation • Tasks and timeline

  33. « Methodologies » • The method is divided into 3 phases. • Phase 1: Data preprocessing • Convert categorical data to numeric data • Phase 2: Clustering • Uses the K-means++ algorithm • Phase 3: Outlier detection • Local and global outliers • Determine which clusters are outliers

  34. « Methodologies » • Overview of the extended K-means++ algorithm

  35. « Methodologies » • Phase 1: Data preprocessing

  36. « Methodologies » • Phase 1: Data preprocessing 1) Normalize the values of the numeric attributes into the range 0 to 1.

  37. « Methodologies » • Phase 1: Data preprocessing 1) Normalize the values of the numeric attributes into the range 0 to 1.
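The normalization formula itself is not reproduced in this transcript. A minimal sketch, assuming ordinary min-max scaling; the function name `min_max_normalize` and the sample values are hypothetical:

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so that min -> 0.0 and max -> 1.0.

    Assumption: the slides use min-max scaling; the transcript only says
    the values end up in the range 0 to 1.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical claim amounts, for illustration only.
print(min_max_normalize([10, 20, 30]))
```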

  38. « Methodologies » • Phase 1: Data preprocessing 2) The categorical attribute with the largest number of items is selected as the base attribute. • Attribute with 2 items: A, B • Attribute with 3 items: C, D, E (the base attribute)

  39. « Methodologies » • Phase 1: Data preprocessing 3) Count the co-occurrence frequencies of the items, represented by matrix M:

          A  B  C  D  E
      A   4  0  2  2  0
      B   0  3  1  1  1
      C   0  0  3  0  0
      D   0  0  0  3  0
      E   0  0  0  0  1

  40. « Methodologies » • Phase 1: Data preprocessing 4) Calculate the similarity between items, represented by equation D, from matrix M:

          A  B  C  D  E
      A   4  0  2  2  0
      B   0  3  1  1  1
      C   0  0  3  0  0
      D   0  0  0  3  0
      E   0  0  0  0  1
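Equation D does not survive in this transcript. The sketch below assumes a Jaccard-style ratio, co-occurrences of x and y divided by the number of records containing x or y. That assumption reproduces the coefficients 0.4, 0.2 and 0.33 that appear in step 7. The records are hypothetical, chosen only so that their counts match matrix M.

```python
from collections import Counter

# Hypothetical records (non-base item, base item) whose counts match matrix M.
records = [("A", "C"), ("A", "C"), ("A", "D"), ("A", "D"),
           ("B", "C"), ("B", "D"), ("B", "E")]

item_count = Counter()  # diagonal of M: how often each item occurs
pair_count = Counter()  # off-diagonal of M: how often two items co-occur
for non_base, base in records:
    item_count[non_base] += 1
    item_count[base] += 1
    pair_count[(non_base, base)] += 1


def similarity(x, y):
    """Assumed form of equation D: |x and y| / (|x| + |y| - |x and y|)."""
    both = pair_count[(x, y)]
    return both / (item_count[x] + item_count[y] - both)


for x in ("A", "B"):
    for y in ("C", "D", "E"):
        print(f"D({x},{y}) = {similarity(x, y):.2f}")
```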

  41. « Methodologies » • Phase 1: Data preprocessing 5) Find the within-group variance of each numeric attribute: • SSw(Y) = 0.04 • SSw(Z) = 1.294 • Select Y, the attribute with the smaller value.

  42. « Methodologies » • Phase 1: Data preprocessing 6) Each base item is quantified by assigning it the mean of its mapped values in the selected numeric attribute.
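Steps 5 and 6 can be sketched as follows, assuming SSw is the usual within-group sum of squares with records grouped by base item. The underlying data are not given in the slides, so the values below are hypothetical: `y` was picked to give SSw(Y) = 0.04 and group means of 0.2, 0.8 and 0.6, while `z` is arbitrary and does not reproduce the slide's 1.294.

```python
from collections import defaultdict


def group_values(base_items, values):
    """Collect the numeric values that map to each base item."""
    groups = defaultdict(list)
    for item, value in zip(base_items, values):
        groups[item].append(value)
    return groups


def within_group_ss(base_items, values):
    """Sum of squared deviations of each value from its own group mean."""
    total = 0.0
    for members in group_values(base_items, values).values():
        mean = sum(members) / len(members)
        total += sum((v - mean) ** 2 for v in members)
    return total


def group_means(base_items, values):
    """Step 6: quantify each base item by the mean of its mapped values."""
    return {item: sum(m) / len(m)
            for item, m in group_values(base_items, values).items()}


# Hypothetical normalized data, for illustration only.
base = ["C", "C", "C", "D", "D", "D", "E"]
y = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 0.6]
z = [0.1, 0.9, 0.5, 0.2, 1.0, 0.0, 0.4]

scores = {"Y": within_group_ss(base, y), "Z": within_group_ss(base, z)}
selected = min(scores, key=scores.get)  # the smaller SSw wins
print(selected, group_means(base, y))
```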

  43. « Methodologies » • Phase 1: Data preprocessing 7) All other categorical items are quantified by applying the function: • F(A) = 0.4 * 0.2 + 0.4 * 0.8 + 0 * 0.6 = 0.4 • F(B) = 0.2 * 0.2 + 0.2 * 0.8 + 0.33 * 0.6 = 0.398 • All data in the data set are now numeric.
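The arithmetic in step 7 reads as a similarity-weighted sum over the base items. A minimal sketch, assuming that reading; the dictionary names are hypothetical and the numbers are copied from the slide:

```python
# Numeric values of the base items, as used in the slide's arithmetic.
base_values = {"C": 0.2, "D": 0.8, "E": 0.6}

# Similarity of each non-base item to every base item, as used on the slide.
similarity_to_base = {
    "A": {"C": 0.4, "D": 0.4, "E": 0.0},
    "B": {"C": 0.2, "D": 0.2, "E": 0.33},
}


def quantify(item):
    """F(item) = sum over base items b of similarity(item, b) * value(b)."""
    return sum(similarity_to_base[item][b] * base_values[b]
               for b in base_values)


for item in ("A", "B"):
    print(f"F({item}) = {quantify(item):.3f}")
```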

  44. « Methodologies » • Phase 2: Clustering • Probability of choosing a data point x as the next center: $D(x)^2 / \sum_{x'} D(x')^2$ • D(x) denotes the shortest distance from a data point x to the closest center we have already chosen.
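The seeding step of K-means++ (Arthur and Vassilvitskii 2007) can be sketched as below. This is an illustration of the standard seeding rule, not the author's extended algorithm. The function names and sample points are hypothetical, and the data are assumed to be numeric already, as they are after Phase 1.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centers with D(x)^2-weighted sampling."""
    rng = rng or random.Random()
    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x)^2: squared distance from each point to its closest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        # Sample the next center with probability proportional to D(x)^2.
        # Chosen points have weight 0, so they cannot be drawn again.
        # Raises ValueError if every weight is 0 (fewer than k distinct points).
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Hypothetical 2-D points, for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
print(kmeans_pp_seeds(points, 3, random.Random(0)))
```

After seeding, the usual K-means iterations (assign each point to its nearest center, then recompute the centers) run unchanged.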

  45. « Methodologies » • Phase 2: Clustering • Define initial values: • Cluster width • used to detect local outliers • set to 2.32, following a previous study • Cluster population ratio • used to detect global outliers • my assumption: 0.9 • The detection rate and false negative rate should reach their highest values with the optimal “”.

  46. « Methodologies » • Phase 3: Outlier detection

  47. « Methodologies » • Phase 3: Outlier detection • There are 2 stages • Local outlier detection: • Parameter: cluster width

  48. « Methodologies » • Phase 3: Outlier detection • There are 2 stages • Global outlier detection: • Parameter: cluster population ratio
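The transcript names the two parameters but does not give the decision rules. The sketch below is therefore an assumption, not the author's method: a point is treated as a local outlier when its distance to its cluster centroid exceeds the cluster width, and a cluster is treated as a global outlier when it is left over after the largest clusters already cover the population ratio. All names and sample values are hypothetical.

```python
import math


def local_outliers(cluster_points, centroid, width):
    """Assumed rule: points farther than `width` from their centroid."""
    return [p for p in cluster_points if math.dist(p, centroid) > width]


def global_outlier_clusters(cluster_sizes, ratio):
    """Assumed rule: walk clusters from largest to smallest; once the clusters
    seen so far hold at least `ratio` of all points, the rest are outliers."""
    total = sum(cluster_sizes.values())
    covered = 0
    outliers = []
    ranked = sorted(cluster_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ranked:
        if covered >= ratio * total:
            outliers.append(name)
        covered += size
    return outliers


# Hypothetical data, using the parameter values quoted on slide 45.
print(local_outliers([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)], (0.0, 0.0), 2.32))
print(global_outlier_clusters({"c1": 60, "c2": 35, "c3": 5}, 0.9))
```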

  49. « Data collection » • A real dataset provided by the National Health Security Office of Thailand is used to demonstrate the effectiveness of the proposed method. • Primary data will be gathered from the database, especially the statement information, which contains all financial transactions.

  50. « Data collection » • Overview of data set
