Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection

Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection. Mining Data Semantics (MDS'2011) Workshop in conjunction with SIGKDD 2011, August 21-24, 2011, San Diego, CA, USA. João Botelho, [email protected]


Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection

Mining Data Semantics (MDS'2011) Workshop

in conjunction with SIGKDD 2011, August 21-24, 2011, San Diego, CA, USA.

João Botelho, [email protected]

Cláudia Antunes, [email protected]


CONTENTS

  • Motivation and problem statement

  • S2C+SNA methodology

  • Case study

  • Conclusions


FRAUD DETECTION IN TAX PAYMENTS

  • Fraud in Tax Payments

    • Improper tax payments due to fraud, waste and abuse;

  • Involves millions of possible fraud targets;

  • Effective tools are needed to prevent fraud, or at least to identify it in time.


CHALLENGES IN FRAUD DETECTION


Solution Methodology

S2C+SNA METHODOLOGY


WHY SEMI-SUPERVISED CLUSTERING?


WHY SOCIAL NETWORKS?


DATA PREPARATION>DATASET

This methodology assumes the existence of two datasets:

- Dataset with labeled and unlabeled instances;

- Social network data (describing interactions between these instances);


DATA PREPARATION>SNOWBALL SAMPLING

  • To discard irrelevant components of the social network and save computational resources, the target population can be reached using snowball sampling.
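As an illustration of the idea, snowball sampling can be sketched as a breadth-first expansion from a set of known entities, stopped after a fixed number of waves. This is only a minimal sketch: the slides do not say how the sampling was implemented, and the function name `snowball_sample`, the `max_waves` parameter and the toy network are assumptions made for the example.

```python
from collections import deque


def snowball_sample(adjacency, seeds, max_waves):
    """Return the nodes reachable from `seeds` within `max_waves` hops.

    `adjacency` maps each node to the list of nodes it interacts with.
    Hypothetical helper, not taken from the original work.
    """
    sampled = set(seeds)
    frontier = deque((node, 0) for node in seeds)
    while frontier:
        node, wave = frontier.popleft()
        if wave == max_waves:
            continue  # nodes found in the last wave are kept but not expanded
        for neighbor in adjacency.get(node, ()):
            if neighbor not in sampled:
                sampled.add(neighbor)
                frontier.append((neighbor, wave + 1))
    return sampled


# Hypothetical toy network: "a" is a known (labeled) entity.
network = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
    "x": ["y"],  # disconnected component, never reached from "a"
    "y": ["x"],
}
sample = snowball_sample(network, seeds=["a"], max_waves=2)
```

In this toy case the component containing "x" and "y" is discarded, and "e" is left out because it is three hops away from the seed.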


DATA PREPARATION>BAD RANK

  • Derived from PageRank and HITS

  • Used by Google to detect web spam

  • BadRank allows us to identify the risk associated with a member by analyzing its links to other “bad” members.
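The slides do not give the BadRank formula. A minimal sketch, assuming one commonly cited formulation, BR(A) = E(A)(1 - d) + d · Σ BR(T) / C(T), where the sum runs over the members T that A links to, C(T) is the number of links pointing at T, E(A) is 1 for known "bad" members and 0 otherwise, and d is a damping factor. The function name `bad_rank`, the damping value and the toy network are assumptions for illustration only.

```python
def bad_rank(out_links, known_bad, damping=0.85, iterations=50):
    """Iteratively propagate "badness" backwards along links.

    `out_links` maps each member to the members it links to.
    Assumed formulation; the original work may use a different variant.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    # C(T): number of links pointing at each member.
    in_degree = {node: 0 for node in nodes}
    for targets in out_links.values():
        for target in targets:
            in_degree[target] += 1

    # E(A): initial badness, 1 for members already known to be bad.
    seed = {node: (1.0 if node in known_bad else 0.0) for node in nodes}
    score = dict(seed)
    for _ in range(iterations):
        score = {
            node: (1 - damping) * seed[node]
            + damping * sum(score[t] / in_degree[t] for t in out_links.get(node, ()))
            for node in nodes
        }
    return score


# Hypothetical toy network: "associate" links to a known fraudster.
links = {
    "fraudster": [],
    "associate": ["fraudster"],
    "friend": ["associate"],
    "stranger": [],
}
scores = bad_rank(links, known_bad={"fraudster"})
```

Under these assumptions the score decays with distance from the known fraudster, and a member with no path to a "bad" member keeps a score of zero.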


DATA PREPARATION>BAD RANK (DEMO)


DATA PREPARATION>BAD RANK

  • Applying BadRank yields a new attribute that enriches the entity description used in the classification process.


MODELING>SEMI-SUPERVISED CLUSTERING

  • The most common semi-supervised clustering algorithms, which are the ones studied in this paper, are modifications of the unsupervised K-Means algorithm that incorporate domain knowledge.

  • Typically, this knowledge can be incorporated:

    • when the initial centroids are chosen (by seeding)

      • Seeded-Kmeans

      • Constrained-Kmeans

    • in the form of constraints that have to be satisfied when grouping similar objects (constrained algorithms).

      • PCK-Means

      • MPCK-Means
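The seeding variants listed above can be sketched as follows. This is a simplified illustration and not the implementation used in the study: the initial centroids are the means of the labeled (seed) points of each cluster, after which ordinary K-Means iterations run. With `keep_seed_labels=True` the seed points are never reassigned, which is the usual description of Constrained-KMeans. The function name, parameters and toy data are assumptions, and the pairwise-constraint algorithms (PCK-Means, MPCK-Means) are not covered.

```python
def seeded_kmeans(points, seed_labels, k, iterations=20, keep_seed_labels=False):
    """K-Means initialised from labeled seeds.

    `points` is a list of numeric tuples; `seed_labels` maps a point index
    to its known cluster id in range(k). Every cluster needs at least one seed.
    """
    def mean(rows):
        return tuple(sum(column) / len(rows) for column in zip(*rows))

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = []
    for cluster in range(k):
        seeds = [points[i] for i, label in seed_labels.items() if label == cluster]
        if not seeds:
            raise ValueError(f"cluster {cluster} has no seed point")
        centroids.append(mean(seeds))  # seeding step

    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid, unless the seed label is fixed.
        for i, point in enumerate(points):
            if keep_seed_labels and i in seed_labels:
                assignment[i] = seed_labels[i]
            else:
                assignment[i] = min(range(k), key=lambda c: sq_dist(point, centroids[c]))
        # Update step: recompute each centroid from its current members.
        for cluster in range(k):
            members = [p for p, a in zip(points, assignment) if a == cluster]
            if members:
                centroids[cluster] = mean(members)
    return assignment, centroids


# Hypothetical toy data: two well-separated groups, one labeled point in each.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels, centers = seeded_kmeans(data, seed_labels={0: 0, 3: 1}, k=2)
```

Because the clusters are initialised from labeled points, cluster ids keep the meaning of the labels (for example fraud and non-fraud), which plain K-Means does not guarantee.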


MODELING>SEMI-SUPERVISED CLUSTERING


CASE STUDY

  • Dataset: Fraud in Tax Payments;

  • Since the experiments presented in this work focus only on the problem of detecting fraud with small fractions of labeled data, a balanced dataset was extracted, with equal numbers of fraud and non-fraud instances.

    • 3000 instances;

    • 50% fraud; 50% non-fraud;


EXPERIMENTS SETUP

  • All experiments were conducted by randomly selecting 10 different sets of pre-labeled instances for each algorithm and for each fraction of incorporated labeled instances.

  • The results presented next report the best, worst and average accuracy obtained on these datasets.
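The protocol above can be sketched as a small evaluation loop. This is an assumption-laden illustration: the slides do not state whether accuracy was measured on all instances or only on the unlabeled ones (all instances are used here), and `run_protocol`, `placeholder_algorithm` and the toy labels are hypothetical names introduced for the example.

```python
import random
from statistics import mean


def accuracy(predicted, truth):
    """Fraction of instances whose predicted class matches the true class."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def run_protocol(algorithm, truth, fraction, repetitions=10, rng_seed=0):
    """Run `algorithm` on `repetitions` random labeled subsets.

    `algorithm` takes a dict {instance index: known label} and returns one
    predicted label per instance.
    """
    rng = random.Random(rng_seed)
    n_labeled = round(fraction * len(truth))
    scores = []
    for _ in range(repetitions):
        labeled = rng.sample(range(len(truth)), n_labeled)
        seeds = {i: truth[i] for i in labeled}
        scores.append(accuracy(algorithm(seeds), truth))
    return {"best": max(scores), "worst": min(scores), "average": mean(scores)}


# Hypothetical balanced toy labels: 0 = non-fraud, 1 = fraud.
truth = [0] * 50 + [1] * 50


def placeholder_algorithm(seeds):
    # Stand-in for a semi-supervised clusterer: it only trusts the seeds
    # and predicts non-fraud everywhere else.
    return [seeds.get(i, 0) for i in range(len(truth))]


summary = run_protocol(placeholder_algorithm, truth, fraction=0.10)
```

Replacing `placeholder_algorithm` with a real clusterer and looping over several values of `fraction` would give one best, worst and average figure per algorithm and per fraction.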


CLUSTERING RESULTS WITH AND WITHOUT BADRANK ATTRIBUTE


BEST AND WORST RESULTS WITHOUT BADRANK


BEST AND WORST RESULTS WITH BADRANK


CONCLUSIONS

  • With only a small fraction of labeled instances, all the semi-supervised algorithms improve significantly on unsupervised clustering (K-Means).

    • Constrained K-Means has the best performance among the semi-supervised algorithms.

  • Semi-supervised clustering performs better when the data is enriched with social network analysis.

    • With BadRank, the results show significant improvements in all experiments from 15% of labeled instances onward.


CONCLUSIONS

  • This methodology can also be applied to other areas:

    • where supervised information is very difficult to obtain

    • where Social Network Analysis can provide important information about human entities, revealing patterns, linkages and connections that could not be discovered using only static data (transitional data).

  • Churn detection is a good candidate for this methodology.


END

QUESTIONS?

