Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection. Mining Data Semantics (MDS'2011) Workshop in conjunction with SIGKDD2011, August 21-24, 2011, San Diego, CA, USA. João Botelho, [email protected]



Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection

Mining Data Semantics (MDS'2011) Workshop

in conjunction with SIGKDD2011, August 21-24, 2011, San Diego, CA, USA.

João Botelho, [email protected]

Cláudia Antunes, [email protected]


CONTENTS

  • Motivation and problem statement

  • S2C+SNA methodology

  • Case study

  • Conclusions


FRAUD DETECTION IN TAX PAYMENTS

  • Fraud in tax payments

    • Improper tax payments due to fraud, waste and abuse;

  • Involves millions of possible fraud targets;

  • Need for effective tools to prevent fraud, or at least to identify it in time;


CHALLENGES ON FRAUD DETECTION


Solution Methodology

S2C+SNA METHODOLOGY


WHY SEMI-SUPERVISED CLUSTERING?


WHY SOCIAL NETWORKS?


DATA PREPARATION > DATASET

This methodology assumes the existence of two datasets:

- Dataset with labeled and unlabeled instances;

- Social network data (describing interactions between these instances);


DATA PREPARATION > SNOWBALL SAMPLING

  • To discard components of the social network that are not useful and to save computational resources, the target population can be reached using snowball sampling.
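
A minimal sketch of snowball sampling, assuming the social network is held as an adjacency dictionary and that sampling starts from a set of seed members and expands a fixed number of waves. The function name `snowball_sample`, the `waves` parameter and the toy network are illustrative assumptions, not taken from the slides.

```python
from collections import deque


def snowball_sample(graph, seeds, waves):
    """Return every node reachable from the seeds in at most `waves` hops.

    graph: dict mapping a node to the list of its neighbours (assumed format).
    """
    sampled = {s for s in seeds if s in graph}
    frontier = deque((s, 0) for s in sampled)
    while frontier:
        node, depth = frontier.popleft()
        if depth == waves:
            continue  # do not expand beyond the last wave
        for neighbour in graph.get(node, ()):
            if neighbour not in sampled:
                sampled.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return sampled


# Toy network (hypothetical): the component around "x" is unrelated to seed "a".
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
    "x": ["y"],
    "y": ["x"],
}
sample = snowball_sample(toy_graph, seeds=["a"], waves=2)
```

With two waves the sample keeps "a", "b", "c" and "d"; the unrelated component and anything further than two hops away is never visited, which is where the saving in computation comes from.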


DATA PREPARATION > BAD RANK

  • Derived from PageRank and HITS

  • Used by Google to detect web spam

  • BadRank lets us estimate the risk associated with a member by analyzing that member's links to other “bad” members.
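
A minimal sketch of a BadRank-style score, assuming one commonly cited formulation: BR(A) = E(A)(1 - d) + d * sum(BR(T) / C(T)) over the members T that A links to, where E(A) is 1 for known bad members and C(T) is the number of links pointing at T. The slides do not give the formula, so the undirected adjacency-dictionary input, the damping factor, the iteration count, the name `bad_rank` and the toy network are all assumptions made for illustration.

```python
def bad_rank(graph, known_bad, damping=0.85, iterations=50):
    """Propagate "badness" from known bad members to the members linked to them.

    graph: dict node -> list of neighbours, assumed undirected and symmetric,
           so every neighbour has at least one link (no division by zero).
    known_bad: set of nodes already labeled as bad (fraudulent).
    """
    seed = {n: (1.0 if n in known_bad else 0.0) for n in graph}
    score = dict(seed)
    for _ in range(iterations):
        updated = {}
        for node, neighbours in graph.items():
            # Each neighbour passes on its score divided by its number of links.
            inherited = sum(score[m] / len(graph[m]) for m in neighbours)
            updated[node] = (1 - damping) * seed[node] + damping * inherited
        score = updated
    return score


# Toy network (hypothetical): "f" is a known fraudster on the chain f-a-b-c;
# "y" and "z" have no path to any bad member.
toy_graph = {
    "f": ["a"],
    "a": ["f", "b"],
    "b": ["a", "c"],
    "c": ["b"],
    "y": ["z"],
    "z": ["y"],
}
scores = bad_rank(toy_graph, known_bad={"f"})
```

In this toy network the score falls with distance from the known bad member ("a" above "b" above "c"), and members with no path to a bad member stay at zero.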


DATA PREPARATION > BAD RANK (DEMO)


DATA PREPARATION > BAD RANK

  • Applying BadRank produces a new attribute that enriches the entity description used in the classification process.


MODELING > SEMI-SUPERVISED CLUSTERING

  • The semi-supervised algorithms studied in this paper are mostly modifications of the (unsupervised) K-Means algorithm that incorporate domain knowledge.

  • Typically, this knowledge can be incorporated:

    • when the initial centroids are chosen (by seeding)

      • Seeded-Kmeans

      • Constrained-Kmeans

    • in the form of constraints that have to be satisfied when grouping similar objects (constrained algorithms).

      • PCK-Means

      • MPCK-Means
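
A minimal sketch of the two seeding variants, assuming their usual description: Seeded-KMeans uses the labeled instances only to compute the initial centroids, while Constrained-KMeans also keeps each labeled instance in its labeled cluster on every assignment step. The function names, the plain-tuple data format and the toy points are illustrative assumptions; the pairwise-constraint variants (PCK-Means, MPCK-Means) are not covered here.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dims))


def seeded_kmeans(points, labels, k, constrained=False, iterations=20):
    """points: list of tuples; labels: dict index -> cluster id in 0..k-1.

    Assumes every cluster has at least one labeled seed.
    """
    # Initial centroids come from the labeled seeds, not from random picks.
    centers = [
        centroid([points[i] for i, c in labels.items() if c == j])
        for j in range(k)
    ]
    assignment = None
    for _ in range(iterations):
        new_assignment = []
        for i, p in enumerate(points):
            if constrained and i in labels:
                new_assignment.append(labels[i])  # seed label is never changed
            else:
                nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
                new_assignment.append(nearest)
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for j in range(k):
            members = [points[i] for i, c in enumerate(assignment) if c == j]
            if members:
                centers[j] = centroid(members)
    return assignment


# Toy data (hypothetical): point 6 is labeled 0 but lies near cluster 1.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (4, 4)]
labels = {0: 0, 3: 1, 6: 0}
seeded = seeded_kmeans(points, labels, k=2)
constrained = seeded_kmeans(points, labels, k=2, constrained=True)
```

The two variants differ only on the labeled point that sits close to the other cluster: the seeded run lets it move, the constrained run keeps it where its label says.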


MODELING > SEMI-SUPERVISED CLUSTERING


CASE STUDY

  • Dataset: fraud in tax payments;

  • Because the experiments in this work focus only on detecting fraud with small fractions of labeled data, a balanced dataset was extracted, with equal numbers of fraud and non-fraud instances.

    • 3000 instances;

    • 50% fraud; 50% non-fraud;
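
A minimal sketch of how such a balanced dataset could be drawn, assuming simple random undersampling of each class. The slides do not say how the extraction was done, so the helper `balanced_sample`, the record layout and the synthetic population below are assumptions for illustration only.

```python
import random


def balanced_sample(records, label_of, per_class, seed=0):
    """Draw `per_class` records at random from every class, then shuffle."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    sample = []
    for group in by_class.values():
        # rng.sample raises ValueError if a class has fewer than per_class records.
        sample.extend(rng.sample(group, per_class))
    rng.shuffle(sample)
    return sample


# Synthetic, imbalanced population (hypothetical): 2000 fraud, 8000 non-fraud.
population = [{"id": i, "fraud": i < 2000} for i in range(10000)]
dataset = balanced_sample(population, lambda r: r["fraud"], per_class=1500)
n_fraud = sum(1 for r in dataset if r["fraud"])
```

Drawing 1500 records per class gives the 3000-instance, 50/50 split described above.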


EXPERIMENTS SETUP

  • Every experiment was run on 10 different randomly selected sets of pre-labeled instances, for each algorithm and for each fraction of labeled instances incorporated.

  • The results presented next report the best, worst and average accuracy obtained over these sets.
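
A minimal sketch of this protocol, assuming each run draws a fresh random set of instances to pre-label and returns one accuracy value. The names `summarize_runs` and `run_experiment` are hypothetical, and the stand-in experiment below returns a synthetic number so the loop can be exercised; it does not reproduce any result from the case study.

```python
import random


def summarize_runs(run_experiment, n_instances, fraction, runs=10, seed=0):
    """Repeat an experiment over random pre-labeled sets and summarize accuracy.

    run_experiment: callable taking the set of pre-labeled indices and
                    returning an accuracy in [0, 1] (assumed interface).
    """
    rng = random.Random(seed)
    n_labeled = round(fraction * n_instances)
    accuracies = []
    for _ in range(runs):
        labeled = set(rng.sample(range(n_instances), n_labeled))
        accuracies.append(run_experiment(labeled))
    return {
        "best": max(accuracies),
        "worst": min(accuracies),
        "average": sum(accuracies) / len(accuracies),
    }


seen_sizes = []


def stand_in_experiment(labeled):
    """Synthetic placeholder: share of even indices among the labeled set."""
    seen_sizes.append(len(labeled))
    return sum(1 for i in labeled if i % 2 == 0) / len(labeled)


summary = summarize_runs(stand_in_experiment, n_instances=3000, fraction=0.10)
```

In a real run, `run_experiment` would cluster the data with one of the algorithms using the given pre-labeled set and score the result; the loop would then be repeated for each algorithm and each labeled fraction.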


CLUSTERING RESULTS WITH AND WITHOUT BADRANK ATTRIBUTE


BEST AND WORST RESULTS WITHOUT BADRANK


BEST AND WORST RESULTS WITH BADRANK


CONCLUSIONS

  • With only a small fraction of labeled instances, all the semi-supervised algorithms improve significantly on unsupervised clustering (K-Means).

    • Constrained K-Means performs best among the semi-supervised algorithms.

  • Semi-supervised clustering performs better when the data is enriched with social network analysis.

    • With BadRank, the results show significant improvements in all experiments from 15% of labeled instances onward.


CONCLUSIONS

  • This methodology can also be applied to other areas:

    • where supervised information is very difficult to obtain

    • where Social Network Analysis can provide important information about human entities, making visible patterns, linkages and connections that could not be discovered using only static data (transactional data).

  • Churn detection is a good candidate to apply this methodology.


END

QUESTIONS?

