Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection. Mining Data Semantics (MDS'2011) Workshop in conjunction with SIGKDD2011, August 21-24, 2011, San Diego, CA, USA. João Botelho, [email protected]

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Mining Data Semantics (MDS'2011) Workshop' - december


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection

Mining Data Semantics (MDS'2011) Workshop

in conjunction with SIGKDD2011, August 21-24, 2011, San Diego, CA, USA.

João Botelho, [email protected]

Cláudia Antunes, [email protected]

slide2

CONTENTS

  • Motivation and problem statement
  • S2C+SNA methodology
  • Case study
  • Conclusions
slide4

FRAUD DETECTION IN TAX PAYMENTS

  • Fraud in Tax Payments
    • Improper tax payments due to fraud, waste and abuse;
  • Involves millions of possible fraud targets;
  • Need for effective tools to prevent fraud, or at least to identify it in time;
Solution Methodology

S2C+SNA METHODOLOGY

slide10

DATA PREPARATION > DATASET

This methodology assumes the existence of two datasets:

- Dataset with labeled and unlabeled instances;

- Social network data (describing interactions between these instances);

slide11

DATA PREPARATION > SNOWBALL SAMPLING

  • To discard irrelevant components of the social network and save computational resources, the target population can be reached using snowball sampling.
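
As an illustration of the idea, snowball sampling can be sketched as a breadth-first expansion from a set of seed members, keeping only the nodes reached within a fixed number of waves. This is a minimal sketch, not the authors' implementation: the function name `snowball_sample`, the adjacency-dictionary representation and the `max_waves` parameter are assumptions made for the example.

```python
from collections import deque


def snowball_sample(adjacency, seeds, max_waves):
    """Return the nodes reachable from `seeds` in at most `max_waves` hops.

    `adjacency` maps each node to an iterable of its neighbors
    (hypothetical representation chosen for this sketch).
    """
    sampled = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, wave = frontier.popleft()
        if wave == max_waves:
            continue  # do not expand beyond the last wave
        for neighbor in adjacency.get(node, ()):
            if neighbor not in sampled:
                sampled.add(neighbor)
                frontier.append((neighbor, wave + 1))
    return sampled


# Toy network: "x" and "y" form a component unrelated to the seed "a".
network = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
    "x": ["y"],
    "y": ["x"],
}
sample = snowball_sample(network, seeds=["a"], max_waves=2)
```

With two waves the sample contains the seed, its direct contacts and their contacts, while the disconnected component is never visited.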
slide12

DATA PREPARATION > BAD RANK

  • Derived from PageRank and HITS
  • Used by Google to detect web spam
  • BadRank allows us to identify the risk associated with a member by analyzing its links to other “bad” members.
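
The slides do not give the exact formula used, so the sketch below follows a commonly cited formulation of BadRank, in which a node inherits "badness" from the nodes it links to: BR(A) = E(A)(1 - d) + d · Σ BR(T) / C(T), where the sum runs over the nodes T that A links to, C(T) is the number of inbound links of T, and E(A) is 1 for known bad nodes and 0 otherwise. The function name `bad_rank`, the damping value and the fixed iteration count are assumptions for illustration.

```python
def bad_rank(out_links, known_bad, damping=0.85, iterations=50):
    """Iteratively propagate 'badness' backwards along links.

    `out_links` maps a node to the list of nodes it links to.
    `known_bad` is the set of nodes already labeled as bad (e.g. fraud).
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    # C(T): number of inbound links of each node.
    in_degree = {node: 0 for node in nodes}
    for targets in out_links.values():
        for target in targets:
            in_degree[target] += 1

    seed = {node: 1.0 if node in known_bad else 0.0 for node in nodes}
    scores = dict(seed)
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Every target has in_degree >= 1, so the division is safe.
            inherited = sum(
                scores[target] / in_degree[target]
                for target in out_links.get(node, ())
            )
            updated[node] = (1 - damping) * seed[node] + damping * inherited
        scores = updated
    return scores


# "a" links to a known fraudster, "b" links to "a", "c" is isolated.
links = {"a": ["fraud"], "b": ["a"], "c": [], "fraud": []}
scores = bad_rank(links, known_bad={"fraud"})
```

In this toy graph the score decays with distance from the known bad node, and the isolated member keeps a score of zero.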
slide14

DATA PREPARATION > BAD RANK

  • Applying BadRank produces a new attribute that enriches the entity description used in the classification process.
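
A minimal sketch of this enrichment step, assuming each entity is a dictionary with an `id` key and that BadRank scores are available as a dictionary keyed by the same identifiers. The field names `id`, `declared_income` and `bad_rank` are hypothetical and do not come from the original dataset.

```python
def add_bad_rank_attribute(records, scores, default=0.0):
    """Return copies of `records` extended with a `bad_rank` attribute.

    Entities absent from the social network receive `default`.
    """
    return [
        dict(record, bad_rank=scores.get(record["id"], default))
        for record in records
    ]


# Hypothetical entity descriptions and previously computed scores.
taxpayers = [
    {"id": "t1", "declared_income": 52000.0},
    {"id": "t2", "declared_income": 18000.0},
]
bad_rank_scores = {"t1": 0.1275}
enriched = add_bad_rank_attribute(taxpayers, bad_rank_scores)
```

The original records are left untouched, so the same dataset can be clustered with and without the new attribute for comparison.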
slide15

MODELING > SEMI-SUPERVISED CLUSTERING

  • The semi-supervised clustering algorithms studied in this paper, which are among the most common, modify the unsupervised K-Means algorithm to incorporate domain knowledge.
  • Typically, this knowledge can be incorporated:
    • when the initial centroids are chosen (by seeding)
      • Seeded-Kmeans
      • Constrained-Kmeans
    • in the form of constraints that have to be satisfied when grouping similar objects (constrained algorithms).
      • PCK-Means
      • MPCK-Means
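
The two seeding-based variants can be sketched as follows, based on their usual description in the literature: both initialize each centroid as the mean of the labeled seeds of that cluster, and the constrained variant additionally keeps every seed in its labeled cluster during all iterations. This is an illustrative sketch, not the code used in the case study; the function names and the fixed iteration count are assumptions, and the pairwise-constraint variants (PCK-Means, MPCK-Means) are not covered.

```python
import math


def _mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def _nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def seeded_kmeans(points, seed_labels, k, constrained=False, iterations=20):
    """Seeded K-Means; with `constrained=True`, seeds never change cluster.

    `seed_labels` maps a point index to a cluster id in range(k).
    Each cluster is assumed to have at least one seed.
    """
    centroids = [
        _mean([points[i] for i, c in seed_labels.items() if c == cluster])
        for cluster in range(k)
    ]
    assignment = []
    for _ in range(iterations):
        assignment = [
            seed_labels[i] if constrained and i in seed_labels
            else _nearest(p, centroids)
            for i, p in enumerate(points)
        ]
        for cluster in range(k):
            members = [p for p, c in zip(points, assignment) if c == cluster]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[cluster] = _mean(members)
    return assignment, centroids


# Point 6 is labeled as cluster 1 although it lies nearer to cluster 0.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (4.0, 4.0)]
seeds = {0: 0, 3: 1, 6: 1}
seeded, _ = seeded_kmeans(data, seeds, k=2)
constrained, _ = seeded_kmeans(data, seeds, k=2, constrained=True)
```

On this toy data the seeded variant eventually reassigns point 6 to cluster 0, whereas the constrained variant keeps it in cluster 1, which shows the difference between using labels only for initialization and enforcing them throughout.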
slide18

CASE STUDY

  • Dataset: Fraud in Tax Payments;
  • Since the experiments in this work focus only on detecting fraud with small fractions of labeled data, a balanced dataset was extracted, with equal numbers of fraud and non-fraud instances.
    • 3000 instances;
    • 50% Fraud; 50% Non-Fraud;
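
One way to extract such a balanced dataset is to draw the same number of instances at random from each class. The sketch below assumes records are tuples of an identifier and a label; the function name `balanced_sample` and the toy class sizes are hypothetical and are not taken from the case study.

```python
import random


def balanced_sample(records, label_of, per_class, seed=0):
    """Draw `per_class` records at random from every class.

    Raises ValueError if a class has fewer than `per_class` records.
    """
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)

    sample = []
    for group in by_class.values():
        sample.extend(rng.sample(group, per_class))
    rng.shuffle(sample)  # avoid leaving the classes in contiguous blocks
    return sample


# Toy imbalanced data: 100 fraud and 900 non-fraud records.
records = [(i, "fraud") for i in range(100)]
records += [(i, "non-fraud") for i in range(100, 1000)]
balanced = balanced_sample(records, label_of=lambda r: r[1], per_class=50)
```

For the dataset described above, the equivalent call would draw 1500 instances per class to obtain 3000 instances in total.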
slide19

EXPERIMENTAL SETUP

  • All experiments were conducted by randomly selecting 10 different sets of pre-labeled instances for each algorithm and for each fraction of labeled instances incorporated.
  • The results presented next report the best, worst and average accuracy obtained over these sets.
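
The protocol above can be sketched as a small evaluation loop. Everything here is an assumption made for illustration: `evaluate` takes any clustering function with the signature `cluster_fn(points, seed_labels)`, and `nearest_seed_stub` is only a placeholder standing in for a real semi-supervised clustering algorithm so that the example runs on its own.

```python
import random
from statistics import mean


def accuracy(predicted, truth):
    """Fraction of instances whose predicted class matches the true class."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def evaluate(cluster_fn, points, truth, fraction, runs=10, seed=0):
    """Run `cluster_fn` on `runs` random labeled subsets of size `fraction`."""
    rng = random.Random(seed)
    n_labeled = max(1, round(fraction * len(points)))
    scores = []
    for _ in range(runs):
        labeled = rng.sample(range(len(points)), n_labeled)
        seed_labels = {i: truth[i] for i in labeled}
        predicted = cluster_fn(points, seed_labels)
        scores.append(accuracy(predicted, truth))
    return {"best": max(scores), "worst": min(scores), "average": mean(scores)}


def nearest_seed_stub(points, seed_labels):
    """Placeholder algorithm: give each point the label of its nearest seed."""
    return [
        seed_labels[min(seed_labels, key=lambda i: abs(points[i] - p))]
        for p in points
    ]


# Toy one-dimensional data with two well separated classes.
points = [x / 10 for x in range(20)] + [5 + x / 10 for x in range(20)]
truth = [0] * 20 + [1] * 20
results = {
    fraction: evaluate(nearest_seed_stub, points, truth, fraction)
    for fraction in (0.05, 0.10, 0.20)
}
```

Each entry of `results` holds the best, worst and average accuracy for one fraction of labeled instances, mirroring the way the results are reported.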
slide24

CONCLUSIONS

  • With only a small fraction of labeled instances, all the semi-supervised algorithms improve significantly on unsupervised clustering (K-Means).
    • Constrained K-Means performs best among the semi-supervised algorithms.
  • Semi-supervised clustering performs better when the data is enriched with social network analysis.
    • With BadRank, the results show significant improvements in all experiments after the fraction of labeled instances reaches 15%.
slide25

CONCLUSIONS

  • This methodology can also be applied to other areas:
    • where supervised information is very difficult to obtain
    • where Social Network Analysis can provide important information about human entities, revealing patterns, linkages and connections that could not be discovered using only static data (transitional data).
  • Churn detection is a good candidate for applying this methodology.
slide26
END

QUESTIONS?
