Arkansas Research Center

Presentation Transcript


  1. Arkansas Research Center arc.arkansas.gov

  2. You are the identity manager…
     Record A: MARIA | WILSON HIGH SCHOOL | CASTILLO-DELGADO
     Record B: MARIA | WILSON HIGH SCHOOL | CASTILLO-DELG

  3. You are the identity manager…
     Record A: MARIA D | WILSON HS | CASTILLO-DELGADO
     Record B: MARIA C | WILSON HS | CASTILLO-DELG

  4. You are the identity manager…
     Record A: MARIA D | WILSON HS | CASTILLO-DELGADO | DOB: 11/05/1995
     Record B: MARIA C | WILSON HS | CASTILLO-DELG | DOB: 9/24/1994

  5. Identity Resolution Problems (K12)
     • There are ~55,000 unique first names and ~40,000 unique last names among students in Arkansas.
     • Approximately 20% of Arkansas students share both first and last name with another student.

  6. Identity Resolution (K12)
     • There are 4,026 students in Arkansas who share an SSN with at least one other student in the state.
     • Between August and January, 874 student transfers to other schools resulted in an SSN change.
     • Between August and January, an additional 1,018 students changed their SSN; we have records for only 300 of these changes.
     • There are ~17,000 students in Arkansas with a “900” SSN.

  7. Identity Resolution (Workforce)
     • ~55,000,000 records covering 10 years, 2,938,718 unique SSNs, no DOBs, inconsistent naming standards.
     • 7,865 SSNs are used by two or more people, for a total of 18,278 different individuals. If SSN were the primary key, their incomes would be combined and they would be treated as the same person.
     • The same person has two or more SSNs (because of a typo/transposition) 13,373 times. If SSN were the primary key, there would be 13,373 additional (non-existent) people with separate incomes.
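To make the two failure modes on slide 7 concrete, here is a minimal sketch using made-up wage records. The person labels, SSNs, and incomes are hypothetical and are not taken from the Arkansas data. It compares income totals keyed by SSN with totals keyed by the true person.

```python
from collections import defaultdict

# Hypothetical wage records: (true_person, reported_ssn, income).
# None of these values come from the presentation.
records = [
    ("person_a", "111-11-1111", 30000),
    ("person_b", "111-11-1111", 20000),  # shares an SSN with person_a
    ("person_c", "222-22-2222", 25000),
    ("person_c", "222-22-2223", 15000),  # typo in person_c's SSN
]


def income_by_key(rows, key_index):
    """Sum income per key, where key_index selects the person or the SSN."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


by_ssn = income_by_key(records, 1)     # SSN treated as the primary key
by_person = income_by_key(records, 0)  # ground truth

print(by_ssn)
print(by_person)
```

Keyed by SSN, person_a and person_b collapse into one 50,000 earner (over-consolidation), while person_c is split into two separate earners (under-consolidation).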

  8. Problem Statement
     “There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.”
     D. Rumsfeld, 2/12/2002

  9. Record Linking: Merge/Purge
     File A and File B: your knowledge is limited to what’s in these two files ONLY.

  10. Knowledge Base Approach
      All known representations are stored to facilitate matching in the future and possibly resolve past matching errors.
      Representations stored in the Knowledge Base:
      • Bob Smith, Conway High School
      • Robert Smith, Acxiom
      • Bob Smith, UCA

  11. Knowledge Base Steps
      Record A: MARIA D | WILSON HS | CASTILLO-DELGADO | DOB: 11/05/1995
      Record B: MARIA C | WILSON HS | CASTILLO-DELG | DOB: 9/24/1994
      • Do all 5 values match exactly (E5)? No.
      • Do 4 values match (E4)? No.
      • Do 3 values match (E3)? No.*
      • Do 2 values match (E2)? Yes. Are they enough for confidence? No.
      CONCLUSION: NO MATCH. They are kept separate in the Knowledge Base.
      * Last name is a special case.
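The cascade on slide 11 can be sketched as a count of exactly agreeing values. The slide does not name the five values, so the field set below (first name, middle initial, last name, school, DOB) is an assumption, as are the function names. The "special case" for last names is modelled here only as a prefix check, which may not be how the real system treats it.

```python
# Assumed field set; the slide says "5 values" but does not list them.
FIELDS = ("first", "middle", "last", "school", "dob")


def match_level(a, b):
    """Return how many of the assumed five fields agree exactly (E5 down to E0)."""
    return sum(1 for field in FIELDS if a[field] == b[field])


def last_name_truncated(a, b):
    """Flag the special case where one last name is a proper prefix of the other."""
    x, y = a["last"], b["last"]
    return x != y and (x.startswith(y) or y.startswith(x))


rec_a = {"first": "MARIA", "middle": "D", "last": "CASTILLO-DELGADO",
         "school": "WILSON HS", "dob": "11/05/1995"}
rec_b = {"first": "MARIA", "middle": "C", "last": "CASTILLO-DELG",
         "school": "WILSON HS", "dob": "9/24/1994"}

level = match_level(rec_a, rec_b)
print(f"E{level}, last name truncated: {last_name_truncated(rec_a, rec_b)}")
```

For the two MARIA records only first name and school agree exactly, which gives E2. The prefix flag shows why the last name deserves a second look even though it does not count as an exact match.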

  12. Exact v. Fuzzy (Deterministic v. Probabilistic)
      • Exact matching drives the majority of identity resolution (Pareto Rule: 80% is easy).
      • Probabilistic algorithms: Soundex, QTR, Edit Distance, Neural Networks (Pareto Rule: 20% require 80% of effort).
      • You want a system that does what YOU, a human, would do.
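As an illustration of one of the fuzzy techniques named on slide 12, here is a minimal sketch of edit distance (Levenshtein distance) using the standard dynamic-programming recurrence. It is a generic textbook version, not the implementation used in Oyster or KIM, and the function name is an assumption.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(edit_distance("NEIL", "NEAL"))
print(edit_distance("CASTILLO-DELGADO", "CASTILLO-DELG"))
```

Exact comparison treats both pairs as plain mismatches, while a small edit distance relative to the string length suggests a typo or truncation that a human reviewer would notice.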

  13. Possible Matching Errors
      • False Positives (Over-consolidation)
      • False Negatives (Under-consolidation)

  14. Identity Management
      • Over-consolidation: split the records apart and update all affected systems.
      • Under-consolidation: bring the records together and update all affected systems.

  15. Actual Results
      100,000+ records from Explore and Plan exams, 2008 and 2009. Match rate: 99%.

  16. Examples: 1% Not Matched
      100% is not realistic. 99% is realistic, but what’s important is the ability to manage problems as they arise.

  17. Oyster Development
      • 1st Generation: built in Access, with automation of queries/functions creating the Knowledge Base (started in 2009); shared with W. Virginia.
      • Data was longitudinal, but sourced from K-12 exclusively.
      • 2009 IES Grant included funding for research with UALR; this work became “Oyster”.
      • Oyster was also funded with a 2009 ARRA Grant.

  18. What is Oyster?
      • Open-System Entity Resolution
      • Not database-driven, pure XML
      • Java source code (Unicode support)
      • Matching by R-Swoosh methodologies, but could be adapted to Fellegi-Sunter

  19. Timeline (figure, 2009 to 2012)
      Tracks shown: 1stGenIDs (Access); Oyster (Java/XML); K.I.M. (SQL/PHP); GUI 2.0.
      Version labels shown: 1.1, 1.2, 1.3, 1.4, 1.5, 2.0, 1.x, 2.x, 3.0, 3.1, 3.2.

  20. Oyster and KIM
      • Documentation: Oyster has thorough documentation and a GUI; KIM has little documentation and no GUI.
      • Performance: Oyster has not been benchmarked since the memory fix; KIM throughput is 1 to 5 million records an hour, depending on the data and use.
      • Matching: Oyster uses R-Swoosh; KIM uses Fellegi-Sunter.
      • Both deal with over- and under-consolidations.

  21. Fellegi-Sunter: Record-based matching. R-Swoosh: Attribute-based matching.
      Already determined to be the same individual:
      • Neil Gibson, 987654321
      • Neal Gibson, 222222222
      • Neal Gibbs, 987654321
      What about:
      • Neal Gibson, 987654321 (all correct)
      • Neil Gibbs, 222222222 (none correct)
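The contrast on slide 21 can be illustrated with a deliberately simplified sketch. It is neither Fellegi-Sunter (which weighs agreements probabilistically) nor R-Swoosh (which merges records iteratively); it only shows the difference between comparing a candidate against each stored record and comparing it against the pooled attribute values of the resolved individual. Splitting names into first and last fields, and both function names, are assumptions.

```python
FIELDS = ("first", "last", "ssn")

# The three records already resolved to one individual on the slide.
cluster = [
    {"first": "Neil", "last": "Gibson", "ssn": "987654321"},
    {"first": "Neal", "last": "Gibson", "ssn": "222222222"},
    {"first": "Neal", "last": "Gibbs", "ssn": "987654321"},
]


def best_record_agreement(candidate, records):
    """Record-based view: best field agreement against any single record."""
    return max(sum(candidate[f] == r[f] for f in FIELDS) for r in records)


def attribute_agreement(candidate, records):
    """Attribute-based view: agreement against the pooled values per field."""
    pooled = {f: {r[f] for r in records} for f in FIELDS}
    return sum(candidate[f] in pooled[f] for f in FIELDS)


all_correct = {"first": "Neal", "last": "Gibson", "ssn": "987654321"}
none_correct = {"first": "Neil", "last": "Gibbs", "ssn": "222222222"}

for name, cand in (("all correct", all_correct), ("none correct", none_correct)):
    print(name, best_record_agreement(cand, cluster), attribute_agreement(cand, cluster))
```

Against individual records the two candidates score differently (2 of 3 versus 1 of 3), while against the pooled attributes both score 3 of 3. In this simplified setting, the attribute-based view recognises the fully correct record but also accepts the record assembled entirely from erroneous values.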

  22. Oyster XML Run Script

  23. Oyster XML Index

  24. Oyster Input GUI

  25. Oyster Run Script GUI

  26. TrustEd: Knowledgebase Identity Management (KIM) and TrustEd Identifier Management (TIM)
      Diagram components: TIM (Identifier Management); KIM (Identity Resolution); De-identified Research Databases.

  27. TrustEd: KIM & TIM
      Diagram labels: Research Data, RecID, PII, SourceID.

  28. TrustEd: KIM & TIM
      Diagram labels: PII, KBID, KIMID, RecID.

  29. TrustEd: KIM & TIM
      Diagram labels: KIMID, SourceID, RecID, TIMID, Research Data, AgencyID.

  30. TrustEd: KIM & TIM
      Diagram labels: RecID, SourceID, TIMID, Management, Agency Crosswalks, Research Data, PII.

  31. TrustEd Results
      TrustEd validates the request based on sharing rules and translates the requesting agency’s local IDs to those of the other agency. The results are then returned to the requesting agency without the use of personally identifiable information.
      • ADHE asks DWS: “What are the salaries for these individuals?” HE0236, HE0651, HE1327
      • DWS records: WF4297 Salary: $36,000; WF8516 Salary: $28,000; WF3508 Salary: $41,000
      • TIM crosswalk: HE0236 ↔ WF4297; HE0651 ↔ WF8516; HE1327 ↔ WF3508
      • Brokered Result 1: HE0236 Salary: $36,000; HE0651 Salary: $28,000; HE1327 Salary: $41,000
      • Brokered Result 2: Salary: $41,000; Salary: $36,000; Salary: $28,000
      • Brokered Result 3: Average Salary: $35,000
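The brokering on slide 31 can be sketched as an ID translation followed by one of three levels of disclosure. The IDs and salaries below are the ones on the slide; the function name, the `level` parameter, and the choice to sort Result 2 in descending order (so values are detached from the request order) are assumptions. Validation against sharing rules is omitted.

```python
from statistics import mean

# TIM crosswalk and DWS salary data as shown on the slide.
crosswalk = {"HE0236": "WF4297", "HE0651": "WF8516", "HE1327": "WF3508"}
dws_salaries = {"WF4297": 36000, "WF8516": 28000, "WF3508": 41000}


def broker(requested_ids, level):
    """Translate ADHE IDs to DWS IDs and return salaries at a disclosure level.

    level 1: salary per requesting-agency ID
    level 2: salaries only, with no IDs attached
    level 3: average salary only
    """
    salaries = {he_id: dws_salaries[crosswalk[he_id]] for he_id in requested_ids}
    if level == 1:
        return salaries
    if level == 2:
        return sorted(salaries.values(), reverse=True)
    if level == 3:
        return mean(salaries.values())
    raise ValueError(f"unknown disclosure level: {level}")


request = ["HE0236", "HE0651", "HE1327"]
for level in (1, 2, 3):
    print(level, broker(request, level))
```

At no point does the requesting agency see the DWS identifiers or any personally identifiable information; it receives only values keyed by its own IDs, or less.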

  32. Examples of Multi-agency Research
      • UAMS nICU: 1998 births to 2011 K12 assessments
      • Pre-K programs to K12 preparedness/assessments
      • K12 indicators for Higher Ed on-time graduation
      • Employment outcomes: Higher Ed to Workforce
      • Special Ed outcomes: K12, Higher Ed, Workforce, and Dept. of Corrections

  33. Questions?
      • Oyster Information (UALR): http://sourceforge.net/projects/oysterer/, jrtalburt@ualr.edu
      • KIM Information (ARC): http://arc.arkansas.gov, Neal.Gibson@arkansas.gov, Greg.Holland@arkansas.gov