Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection
Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection. Mining Data Semantics (MDS'2011) Workshop in conjunction with SIGKDD2011, August 21-24, 2011, San Diego, CA, USA. João Botelho, [email protected]

Presentation Transcript
Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection

Mining Data Semantics (MDS'2011) Workshop

in conjunction with SIGKDD2011, August 21-24, 2011, San Diego, CA, USA.

João Botelho, [email protected]

Cláudia Antunes, [email protected]


CONTENTS

  • Motivation and problem statement

  • S2C+SNA methodology

  • Case study

  • Conclusions



FRAUD DETECTION IN TAX PAYMENTS

  • Fraud in tax payments

    • Improper tax payments due to fraud, waste and abuse;

  • Involves millions of possible fraud targets;

  • Effective tools are needed to prevent fraud, or at least to identify it in time;




Solution Methodology

S2C+SNA METHODOLOGY




DATA PREPARATION > DATASET

This methodology assumes the existence of two datasets:

- Dataset with labeled and unlabeled instances;

- Social network data (describing interactions between these instances);


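DATA PREPARATION > SNOWBALL SAMPLING

  • To discard components of the social network that are not useful and to save computational resources, the target population can be reached using snowball sampling: starting from a set of known instances, their contacts are added, then the contacts of those contacts, and so on.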
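As an illustration of snowball sampling, here is a minimal sketch. It assumes the social network is available as an adjacency dictionary and that sampling stops after a fixed number of waves; the function name `snowball_sample`, the parameter `max_waves` and the toy network are hypothetical and do not come from the slides.

```python
from collections import deque


def snowball_sample(adjacency, seeds, max_waves):
    """Return the nodes reachable from `seeds` in at most `max_waves` hops."""
    sampled = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, wave = frontier.popleft()
        if wave == max_waves:
            continue  # do not expand beyond the last wave
        for neighbor in adjacency.get(node, ()):
            if neighbor not in sampled:
                sampled.add(neighbor)
                frontier.append((neighbor, wave + 1))
    return sampled


# Toy network (hypothetical): "x" and "y" form a component unrelated to the seed.
network = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
    "x": ["y"],
    "y": ["x"],
}
sample = snowball_sample(network, seeds=["a"], max_waves=2)
print(sorted(sample))
```

Components that are never reached from the seeds, such as the `x`/`y` pair here, are left out of the sample, which is the resource-saving effect the slide describes.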


DATA PREPARATION > BAD RANK

  • Derived from PageRank and HITS

  • Used by Google to detect web spam

  • BadRank allows us to estimate the risk associated with a member by analyzing its links to other “bad” members.



DATA PREPARATION > BAD RANK

  • Applying BadRank produces a new attribute that enriches the entity description used in the classification process.
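The slides do not give the BadRank formula. The sketch below assumes a commonly cited formulation, BR(A) = E(A)(1 - d) + d Σ BR(T_i)/C(T_i), where the T_i are the members that A links to, C(T_i) is the number of links pointing at T_i, E(A) is 1 for known "bad" members and 0 otherwise, and d is a damping factor. The function name `bad_rank`, the default parameters and the toy data are hypothetical, and the authors' actual computation may differ.

```python
def bad_rank(out_links, bad_seeds, damping=0.85, iterations=50):
    """Iteratively propagate risk from known bad members to those linking to them."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    # C(T): number of links pointing at each node.
    in_degree = {node: 0 for node in nodes}
    for targets in out_links.values():
        for target in targets:
            in_degree[target] += 1

    # E(A): 1 for labeled bad members, 0 for everyone else.
    seed_value = {node: (1.0 if node in bad_seeds else 0.0) for node in nodes}
    score = dict(seed_value)

    for _ in range(iterations):
        updated = {}
        for node in nodes:
            inherited = sum(
                score[target] / in_degree[target]
                for target in out_links.get(node, ())
            )
            updated[node] = seed_value[node] * (1 - damping) + damping * inherited
        score = updated
    return score


# Toy network (hypothetical): "a" links to a known fraudster, "b" links to "a".
links = {"a": ["fraud"], "b": ["a"], "c": []}
scores = bad_rank(links, bad_seeds={"fraud"})

# Enrich each entity description with the new attribute.
entities = {name: {"id": name} for name in scores}
for name, value in scores.items():
    entities[name]["bad_rank"] = value
print(entities)
```

In this toy network the score decays with distance from the known bad member, and a member with no path to it keeps a score of zero.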


MODELING > SEMI-SUPERVISED CLUSTERING

  • The most common semi-supervised clustering algorithms, and the ones studied in this paper, are modifications of the (unsupervised) K-Means algorithm that incorporate domain knowledge.

  • Typically, this knowledge can be incorporated:

    • when the initial centroids are chosen (by seeding)

      • Seeded-KMeans

      • Constrained-KMeans

    • in the form of constraints that have to be satisfied when grouping similar objects (constrained algorithms).

      • PCK-Means

      • MPCK-Means
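The two seeding variants can be sketched as follows. This is a minimal illustration rather than the authors' implementation: both variants initialise each centroid as the mean of the labeled seeds of that cluster; Seeded-KMeans then lets the seeds be reassigned like any other point, while Constrained-KMeans keeps every seed in its labeled cluster. The function name `seeded_kmeans`, the `constrained` flag and the toy points are hypothetical.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def seeded_kmeans(points, seeds, k, constrained=False, iterations=20):
    """`seeds` maps a point index to its known cluster in range(k).

    Every cluster is assumed to have at least one seed.
    """
    centroids = [
        mean([points[i] for i, cluster in seeds.items() if cluster == c])
        for c in range(k)
    ]
    labels = [0] * len(points)
    for _ in range(iterations):
        for i, point in enumerate(points):
            if constrained and i in seeds:
                labels[i] = seeds[i]  # Constrained-KMeans: seed labels are fixed
            else:
                labels[i] = nearest(point, centroids)
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels


# Toy data (hypothetical): the last point sits between the groups and is
# labeled as cluster 1 by the seeds.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (2.4, 2.4)]
seeds = {0: 0, 3: 1, 6: 1}
seeded = seeded_kmeans(points, seeds, k=2)
constrained = seeded_kmeans(points, seeds, k=2, constrained=True)
print(seeded)
print(constrained)
```

On this toy data the two variants differ only on the last point: the seeded variant moves it to the nearer cluster, while the constrained variant keeps its labeled cluster. PCK-Means and MPCK-Means, which use pairwise constraints, are not covered by this sketch.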




CASE STUDY

  • Dataset: fraud in tax payments;

  • Because the experiments presented in this work focus only on the problem of detecting fraud with small fractions of labeled data, a balanced dataset was extracted, with equal numbers of fraud and non-fraud instances.

    • 3000 instances;

    • 50% Fraud; 50% NonFraud;
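A balanced extraction of this kind could look like the sketch below. The slides do not say how the sample was drawn; this version assumes simple random undersampling of each class, and the function name `balanced_sample`, the record layout and the synthetic population sizes are hypothetical. Only the target of 3000 instances split 50/50 comes from the slide.

```python
import random


def balanced_sample(records, label_key, per_class, seed=0):
    """Draw `per_class` records at random from every class.

    Raises ValueError if a class has fewer than `per_class` records.
    """
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(record[label_key], []).append(record)
    sample = []
    for group in by_class.values():
        sample.extend(rng.sample(group, per_class))
    rng.shuffle(sample)
    return sample


# Synthetic, imbalanced population (hypothetical sizes, not the real data).
population = [{"id": i, "label": "Fraud"} for i in range(2000)]
population += [{"id": i, "label": "NonFraud"} for i in range(2000, 22000)]

dataset = balanced_sample(population, label_key="label", per_class=1500)
n_fraud = sum(1 for record in dataset if record["label"] == "Fraud")
print(len(dataset), n_fraud)
```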


EXPERIMENTAL SETUP

  • All experiments were conducted by randomly selecting 10 different sets of pre-labeled instances for each algorithm and for each fraction of incorporated labeled instances.

  • The results presented next report the best, worst and average accuracy obtained over these sets.
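The protocol above can be sketched as a small evaluation loop. The callable `run_once`, which stands for running one clustering algorithm with a given set of pre-labeled instances and returning its accuracy, is hypothetical, as are the function name `evaluate_fractions` and the fractions used in the example. The stand-in `placeholder_run` only makes the sketch executable and does not reproduce any result from the paper.

```python
import random
import statistics


def evaluate_fractions(run_once, n_instances, fractions, repetitions=10, seed=0):
    """For each labeled fraction, draw `repetitions` random labeled sets and
    summarise the accuracies returned by `run_once`."""
    rng = random.Random(seed)
    summary = {}
    for fraction in fractions:
        n_labeled = round(fraction * n_instances)
        accuracies = []
        for _ in range(repetitions):
            labeled_ids = rng.sample(range(n_instances), n_labeled)
            accuracies.append(run_once(labeled_ids))
        summary[fraction] = {
            "best": max(accuracies),
            "worst": min(accuracies),
            "average": statistics.mean(accuracies),
        }
    return summary


def placeholder_run(labeled_ids):
    """Stand-in for a real algorithm run; returns a dummy score, not a result."""
    return len(labeled_ids) / 3000


summary = evaluate_fractions(placeholder_run, n_instances=3000,
                             fractions=[0.05, 0.10, 0.15])
for fraction, stats in summary.items():
    print(fraction, stats)
```

In a real run, `run_once` would be replaced by a call to each algorithm under test, so that every algorithm is evaluated with 10 random labeled sets per fraction.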






CONCLUSIONS

  • With a small fraction of labeled instances, all the semi-supervised algorithms obtain a significant improvement over unsupervised clustering (K-Means).

    • Constrained K-Means has the best performance among the semi-supervised algorithms compared.

  • Semi-supervised clustering performs better when the data is enriched with social network analysis.

    • With BadRank, the results show significant improvements in all experiments once 15% of labeled instances are used.


CONCLUSIONS

  • This methodology can also be applied to other areas:

    • where supervised information is very difficult to obtain

    • where Social Network Analysis can provide important information about human entities, making visible patterns, linkages and connections that could not be discovered using only static data (transitional data).

  • Churn detection is a good candidate for applying this methodology.


END

QUESTIONS?

