
Entity Resolution Tool ‘sdlink’

Darshana Pathak, Dr. Hye-Chung Kum


Presentation Transcript


  1. Entity Resolution Tool ‘sdlink’ - Darshana Pathak - Dr. Hye-Chung Kum

  2. Index: • Overview • Entity resolution process • About Framework • Configuration file • Class Details • How to … • Future Work • Questions?

  3. Overview: • A framework, named ‘sdlink’, for developing entity resolution tools • The idea is to provide a ‘lab’ • For whom? Research assistants and students • Why? To contribute to research

  4. Entity Resolution Process: • Configure: define link variables • Compare: similarity metrics, find distance • Decide: supervised/unsupervised decision model • Search: reduce the search space (blocking) • Evaluate: assess the linked data • Analyze: error propagation • Refine: relationships and deduplication • Data management

  5. Entity Resolution Process:

  6. Various Tools: • Searching Methods: Blocking, Sorting, Hashing, Sorted Neighborhood • Comparison Functions: Hamming Distance, Edit Distance, Jaro’s Algorithm, N-grams, Soundex Code
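
Three of the comparison functions named on this slide can be sketched in a few lines of Java. This is an illustration only, not code from sdlink: the class and method names are hypothetical, and the n-gram similarity shown is one common variant (a Dice coefficient over character bigrams).

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative comparison functions; not taken from the sdlink source. */
final class ComparisonSketch {

    /** Hamming distance: count of positions where two equal-length strings differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int diff = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) diff++;
        }
        return diff;
    }

    /** Edit (Levenshtein) distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(substitution, Math.min(prev[j], curr[j - 1]) + 1);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Character 2-grams of a string, as a set. */
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    /** N-gram similarity with n = 2: Dice coefficient over the two bigram sets. */
    static double bigramSimilarity(String a, String b) {
        Set<String> x = bigrams(a);
        Set<String> y = bigrams(b);
        if (x.isEmpty() && y.isEmpty()) return 1.0;
        Set<String> shared = new HashSet<>(x);
        shared.retainAll(y);
        return 2.0 * shared.size() / (x.size() + y.size());
    }

    public static void main(String[] args) {
        System.out.println(hamming("karolin", "kathrin"));
        System.out.println(editDistance("kitten", "sitting"));
        System.out.println(bigramSimilarity("night", "nacht"));
    }
}
```

Hamming distance only applies to equal-length values (fixed-width codes, for example), while edit distance and n-grams tolerate insertions and deletions, which is why a toolbox usually offers more than one.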

  7. Various Tools: • Decision Models: Probabilistic Model, Induction Model, Clustering Model, Hybrid Model • Measurement Tools: Reduction Ratio, Pairs Completeness, Accuracy, Completeness
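
The first two measurement tools can be made concrete with a short sketch. It assumes the definitions commonly used in the record linkage literature: reduction ratio is one minus the fraction of all cross-file pairs that survive blocking, and pairs completeness is the fraction of true matches that survive blocking. The class name and the "idA:idB" string encoding of a pair are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative blocking-quality measures; names and pair encoding are assumptions. */
final class BlockingMetricsSketch {

    /** Reduction ratio: 1 - (pairs kept after blocking) / (all pairs between file A and file B). */
    static double reductionRatio(long candidatePairs, long sizeA, long sizeB) {
        if (sizeA <= 0 || sizeB <= 0) {
            throw new IllegalArgumentException("Both files must contain records");
        }
        return 1.0 - (double) candidatePairs / ((double) sizeA * sizeB);
    }

    /** Pairs completeness: share of the true matches that are still among the candidate pairs. */
    static double pairsCompleteness(Set<String> candidatePairs, Set<String> trueMatches) {
        if (trueMatches.isEmpty()) {
            throw new IllegalArgumentException("Need at least one known true match");
        }
        Set<String> kept = new HashSet<>(trueMatches);
        kept.retainAll(candidatePairs);
        return (double) kept.size() / trueMatches.size();
    }

    public static void main(String[] args) {
        Set<String> candidates = new HashSet<>();
        candidates.add("1:1");
        candidates.add("2:5");
        candidates.add("3:3");
        Set<String> truth = new HashSet<>();
        truth.add("1:1");
        truth.add("2:2");
        truth.add("3:3");
        truth.add("4:4");
        System.out.println(reductionRatio(candidates.size(), 10, 10));
        System.out.println(pairsCompleteness(candidates, truth));
    }
}
```

The two measures pull against each other: more aggressive blocking raises the reduction ratio but tends to lower pairs completeness.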

  8. About Framework: • The basic framework includes: • Configuration file: configure.xml • Main class: SDLink.java • ConfigFile.java and ConfigReader.java • CSVFile.java, CSVReader.java and CSVWriter.java • BlockingModel.java • DistanceCalculator.java • Each is explained in the following slides.

  9. Configuration File: • Name: configure.xml • Specifies: • The two CSV files to be linked • The list of attributes • The blocking method • The weight for each attribute • The clustering method
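
The slides do not show the actual layout of configure.xml, so the sketch below invents one purely for illustration: every element name, attribute name, and value in SAMPLE_XML is hypothetical, as is the ConfigSketch class. It only shows how the five kinds of settings listed above could be read with the JDK's built-in DOM parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Reads a made-up configuration layout; the real configure.xml schema is not known. */
final class ConfigSketch {
    // Hypothetical layout: all tag and attribute names here are assumptions.
    static final String SAMPLE_XML =
          "<sdlink>"
        + "<file path=\"a.csv\"/>"
        + "<file path=\"b.csv\"/>"
        + "<blocking method=\"exact\" attribute=\"zip\"/>"
        + "<attribute name=\"first_name\" weight=\"0.4\"/>"
        + "<attribute name=\"last_name\" weight=\"0.6\"/>"
        + "<clustering method=\"threshold\"/>"
        + "</sdlink>";

    final List<String> csvPaths = new ArrayList<>();
    final Map<String, Double> weights = new LinkedHashMap<>();
    String blockingAttribute;
    String clusteringMethod;

    /** Parses the XML text and copies the settings into a ConfigSketch instance. */
    static ConfigSketch parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            ConfigSketch cfg = new ConfigSketch();
            NodeList files = doc.getElementsByTagName("file");
            for (int i = 0; i < files.getLength(); i++) {
                cfg.csvPaths.add(((Element) files.item(i)).getAttribute("path"));
            }
            NodeList attrs = doc.getElementsByTagName("attribute");
            for (int i = 0; i < attrs.getLength(); i++) {
                Element e = (Element) attrs.item(i);
                cfg.weights.put(e.getAttribute("name"), Double.parseDouble(e.getAttribute("weight")));
            }
            Element blocking = (Element) doc.getElementsByTagName("blocking").item(0);
            cfg.blockingAttribute = blocking.getAttribute("attribute");
            Element clustering = (Element) doc.getElementsByTagName("clustering").item(0);
            cfg.clusteringMethod = clustering.getAttribute("method");
            return cfg;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        ConfigSketch cfg = parse(SAMPLE_XML);
        System.out.println(cfg.csvPaths + " " + cfg.weights + " " + cfg.blockingAttribute);
    }
}
```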

  10. Java Class Details: • SDLink.java: initializes all classes in order to • Read the configuration file • Read the two CSV files • Perform blocking • Calculate distances • Perform clustering • Write the results to output files

  11. Java Class Details: • ConfigFile.java and ConfigReader.java • Read configure.xml • Know everything about the CSV files, attributes, blocking methods and clustering method • Store all of this information in an instance of ConfigFile.java so that other classes can readily access it whenever required.

  12. Java Class Details: • CSVFile.java, CSVReader.java & CSVWriter.java • Read both CSV files • Combine the two files into one • Form a 2-D matrix of all attributes in the CSV files • Store all the data from the CSV files in an instance of CSVFile.java
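
A minimal sketch of the read-and-combine step follows. It is not the sdlink CSVReader: the class name is hypothetical, the parser is a naive comma split that assumes a header row and no quoted fields, and the extra first column tagging each row with its source file ("A" or "B") is an assumption added so that rows can still be told apart after the merge.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative CSV loading into one 2-D matrix; assumes simple, unquoted CSV. */
final class CsvMatrixSketch {

    /** Reads data rows (skipping the header) and prefixes each with a source tag. */
    static List<String[]> readRows(Reader in, String sourceTag) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            reader.readLine(); // skip the header line
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                String[] cells = line.split(",", -1); // -1 keeps trailing empty cells
                String[] row = new String[cells.length + 1];
                row[0] = sourceTag;
                for (int i = 0; i < cells.length; i++) row[i + 1] = cells[i].trim();
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    /** Combines both inputs into a single matrix: rows of A first, then rows of B. */
    static String[][] combine(Reader fileA, Reader fileB) {
        List<String[]> all = new ArrayList<>(readRows(fileA, "A"));
        all.addAll(readRows(fileB, "B"));
        return all.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        String[][] matrix = combine(
            new StringReader("name,zip\nAnn,27514\nBob,27599\n"),
            new StringReader("name,zip\nAnne,27514\n"));
        for (String[] row : matrix) System.out.println(String.join(" | ", row));
    }
}
```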

  13. Java Class Details: • BlockingModel.java • Performs blocking on the 2-D matrix of data • Learns how to partition the rows from configure.xml • An important step, as the subsequent clustering is done on each block • Necessary for handling large data.
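
The simplest form of blocking is to partition the rows by the exact value of one column. The sketch below assumes that form; the slides do not say which blocking methods BlockingModel.java supports, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative exact-key blocking over a 2-D matrix of records. */
final class BlockingSketch {

    /** Groups rows that share the same (case-insensitive) value in the key column. */
    static Map<String, List<String[]>> block(String[][] rows, int keyColumn) {
        Map<String, List<String[]>> blocks = new LinkedHashMap<>();
        for (String[] row : rows) {
            String key = row[keyColumn].trim().toLowerCase(Locale.ROOT);
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return blocks;
    }

    /** Number of within-block record pairs that still have to be compared. */
    static long candidatePairs(Map<String, List<String[]>> blocks) {
        long total = 0;
        for (List<String[]> b : blocks.values()) {
            long n = b.size();
            total += n * (n - 1) / 2;
        }
        return total;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"Ann", "27514"}, {"Bob", "27599"}, {"Anne", "27514"},
            {"Rob", "27599"}, {"Cy", "10001"}
        };
        Map<String, List<String[]>> blocks = block(rows, 1);
        System.out.println(blocks.keySet() + " pairs=" + candidatePairs(blocks));
    }
}
```

Five records would need ten pairwise comparisons without blocking; partitioning first means only pairs inside the same block are compared, which is what makes large inputs tractable.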

  14. Java Class Details: • DistanceCalculator.java • Operates on each block (formed in the blocking step) separately • Calculates the distance between two attribute values • Compares distances and calculates densities iteratively • Forms many tiny clusters as the process runs over multiple iterations • The process runs until no more clusters can be formed.
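
The slides do not describe the density calculation, so the sketch below substitutes a simpler, clearly different technique to show the overall shape of the step: a weighted sum of per-attribute distances, followed by single-linkage merging that repeats until no two clusters are within a threshold. The 0/1 attribute distance, the threshold parameter, and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative weighted distance plus iterative merging; not the sdlink algorithm. */
final class ClusteringSketch {

    /** Placeholder attribute distance: 0 when equal ignoring case, 1 otherwise. */
    static double attributeDistance(String a, String b) {
        return a.equalsIgnoreCase(b) ? 0.0 : 1.0;
    }

    /** Weighted sum of attribute distances between two records. */
    static double recordDistance(String[] r1, String[] r2, double[] weights) {
        double d = 0.0;
        for (int i = 0; i < weights.length; i++) {
            d += weights[i] * attributeDistance(r1[i], r2[i]);
        }
        return d;
    }

    /** Smallest record distance between any member of one cluster and any member of the other. */
    static double linkDistance(List<Integer> c1, List<Integer> c2, String[][] block, double[] weights) {
        double best = Double.MAX_VALUE;
        for (int i : c1) {
            for (int j : c2) {
                best = Math.min(best, recordDistance(block[i], block[j], weights));
            }
        }
        return best;
    }

    /** Starts with one cluster per row and merges until no pair of clusters is within the threshold. */
    static List<List<Integer>> cluster(String[][] block, double[] weights, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < block.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        boolean merged = true;
        while (merged) { // each merge removes one cluster, so the loop must terminate
            merged = false;
            search:
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    if (linkDistance(clusters.get(a), clusters.get(b), block, weights) <= threshold) {
                        clusters.get(a).addAll(clusters.remove(b));
                        merged = true;
                        break search;
                    }
                }
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        String[][] block = {{"ann", "smith"}, {"Ann", "Smith"}, {"bob", "jones"}, {"ann", "smyth"}};
        System.out.println(cluster(block, new double[] {0.5, 0.5}, 0.5));
    }
}
```

In practice the placeholder attributeDistance would be replaced by a graded measure such as a normalized edit distance, so that near-matches like "smith" and "smyth" score between 0 and 1.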

  15. Java Class Details: • Everything runs in a big LOOP… • There can be multiple blocking attributes. • The whole process of blocking and clustering runs for each blocking attribute. • The output of every iteration is an input to the next iteration. • Be careful: It should not be an infinitely long process!
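
One way to picture the loop is as a series of passes, one per blocking attribute, that all update the same cluster assignment, so links found in one pass are already in place when the next pass starts. The sketch below is an assumption-laden illustration, not sdlink code: it keeps clusters in a union-find array, links two records when a single comparison column matches exactly, and is bounded because it makes exactly one pass per blocking column.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative multi-pass blocking where cluster state carries over between passes. */
final class MultiPassSketch {

    /** Union-find root lookup with path halving. */
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    /** Returns a cluster label per row after blocking on each given column in turn. */
    static int[] link(String[][] rows, int[] blockingColumns, int compareColumn) {
        int[] parent = new int[rows.length];
        for (int i = 0; i < parent.length; i++) parent[i] = i;

        for (int column : blockingColumns) { // one pass per blocking attribute: a finite loop
            Map<String, List<Integer>> blocks = new LinkedHashMap<>();
            for (int i = 0; i < rows.length; i++) {
                blocks.computeIfAbsent(rows[i][column], k -> new ArrayList<>()).add(i);
            }
            for (List<Integer> block : blocks.values()) {
                for (int x = 0; x < block.size(); x++) {
                    for (int y = x + 1; y < block.size(); y++) {
                        int i = block.get(x);
                        int j = block.get(y);
                        if (rows[i][compareColumn].equalsIgnoreCase(rows[j][compareColumn])) {
                            parent[find(parent, j)] = find(parent, i); // merge the two clusters
                        }
                    }
                }
            }
        }
        int[] labels = new int[rows.length];
        for (int i = 0; i < labels.length; i++) labels[i] = find(parent, i);
        return labels;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"ann", "27514", "111"}, {"ann", "27514", "222"},
            {"ann", "99999", "222"}, {"bob", "27514", "111"}
        };
        System.out.println(java.util.Arrays.toString(link(rows, new int[] {1, 2}, 0)));
    }
}
```

In the sample data, blocking on the second column alone never compares the third row with the first two; the later pass on the third column brings it into the same cluster.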

  16. How to… : • Using this basic framework, you can implement your own ideas • E.g. a new clustering algorithm: • Write the code and plug it into the distance calculator class • Make sure not to disturb existing functionality • Be purely object oriented • Check the new algorithm’s output
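
One object-oriented way to make an algorithm pluggable without touching existing code is to put it behind an interface. The sketch below is a suggestion only: ClusteringAlgorithm, ExactMatchClustering, and PluggableCalculatorSketch are hypothetical names, and the slides do not say that DistanceCalculator.java exposes such an extension point.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical extension point: takes one block, returns clusters of row indices. */
interface ClusteringAlgorithm {
    List<List<Integer>> cluster(String[][] block);
}

/** Example plug-in: rows fall in the same cluster when all cells match, ignoring case. */
final class ExactMatchClustering implements ClusteringAlgorithm {
    @Override
    public List<List<Integer>> cluster(String[][] block) {
        Map<List<String>, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < block.length; i++) {
            List<String> key = new ArrayList<>();
            for (String cell : block[i]) key.add(cell.toLowerCase(Locale.ROOT));
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
        }
        return new ArrayList<>(groups.values());
    }
}

/** Stand-in for a calculator class that delegates clustering to whichever algorithm it is given. */
final class PluggableCalculatorSketch {
    private final ClusteringAlgorithm algorithm;

    PluggableCalculatorSketch(ClusteringAlgorithm algorithm) {
        this.algorithm = algorithm;
    }

    List<List<Integer>> run(String[][] block) {
        return algorithm.cluster(block);
    }

    public static void main(String[] args) {
        String[][] block = {{"Ann", "Smith"}, {"ann", "smith"}, {"Bob", "Jones"}};
        System.out.println(new PluggableCalculatorSketch(new ExactMatchClustering()).run(block));
    }
}
```

With this arrangement a new algorithm is a new class implementing the interface, and checking its output means running it on the same blocks as an existing implementation and comparing the clusters.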

  17. How to… : • This code is available on Macbeth (but with no version control so far…) • We will adopt a version control system such as SVN, so that multiple developers can check code out and in • To reduce risk, we can add separate methods and classes without touching existing code.

  18. Future Work: • Version Control System • Generate proper output files • Implement and test various clustering algorithms • Develop graphical user interface • And much more…

  19. References: • TAILOR: A Record Linkage Toolbox (2002). Mohamed Elfeky, Vassilios Verykios, Ahmed Elmagarmid. • A Glass Box Approach for Linking Administrative Records. PI: Gale Boyd; Co-PIs: Wayne Gray and Hye-Chung Kum.

  20. Questions ???
