False Identity Detection

QR’09 Research News Exclusive
False Identity Detection: Order-of-Magnitude Based Approach
Tossapon Boongoen & Qiang Shen, Aberystwyth University, UK
An integration of qualitative reasoning and link analysis to detect the possible use of false (or deceptive) identities.

Outline
Date 24/06/09
• Background of False Identity
• False Identity Detection Approaches
• Order-of-Magnitude Based Model
• Experimental Results
• Conclusion

Background: Age of Terror
False identity has become the common denominator of serious crimes and terrorism.
• In the 9/11 attacks in particular, US authorities failed to discover the terrorists' use of false identities: 19 terrorists entered the US with false identities.
• In the UK, financial losses due to false identity are reported to be around 1.3 billion pounds each year.

Background: Identity
Identity is a set of characteristic descriptors unique to a specific person.
• Attributed: name, date of birth (easy to falsify!)
• Biographical: educational, financial or criminal history
• Biometric: DNA and fingerprint

Attributed Identity
Name deception is the most common practice with attributed identity.
False identity (attributed):
• Resident (33.3%)
• Name (100%)
• DOB (66.7%)
• ID (56.3%)
Forms of name deception:
• Completely different name
• Add-on abbreviation
• Similar pronunciation
• First-second name swap

False Identity Detection: Text-based approach
The text-based approach uses string-matching techniques, e.g. Edit distance and Jaro, to compare the similarity of two strings (X, Y).
• Edit distance is based on the number of edit operations needed to transform X into Y.
• Jaro relies on the number and order of the characters common to X and Y. It is effective for short strings, especially personal names.
This approach is effective for problems caused by data-entry or translation errors, but it fails to deal with 'highly deceptive cases'.
Problems of high deception:
• Bin laden ↔ The prince
• Bin laden ↔ The emir
• Fadil muhamad ↔ Harun fazul
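The edit-distance idea above can be sketched as follows. This is a minimal illustration using the standard Levenshtein formulation (insert, delete, substitute), not necessarily the exact variant evaluated in the talk; the helper `similarity` and the misspelling "bin ladin" are assumptions added for the example.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn x into y."""
    prev = list(range(len(y) + 1))  # distances from "" to each prefix of y
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to ""
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cy
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(x: str, y: str) -> float:
    """Hypothetical helper: distance normalised to [0, 1]; 1.0 means identical."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - edit_distance(x, y) / longest


if __name__ == "__main__":
    # A data-entry style error versus a highly deceptive alias.
    for a, b in [("bin laden", "bin ladin"), ("bin laden", "the prince")]:
        print(f"{a!r} vs {b!r}: similarity {similarity(a, b):.2f}")
```

A one-character typo keeps the similarity high, whereas an alias that shares almost no characters with the real name scores low, which is why string matching alone cannot link the two.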

False Identity Detection: Link-based approach
Despite using several false identities, a criminal (e.g. a terrorist) typically exhibits a unique pattern of relations to other information objects. The similarity of objects can therefore be estimated from the link patterns they take part in.
Example methods:
• SimRank (publication domain)
• PageSim (Internet domain)
[Figure: Identity A, Identity B and Identity C linked to the objects Cash Card, House, Email and Phone No.]

Link Analysis
Terminology:
• Vertex: name
• Edge: co-occurrence relation
• Edge weight: co-occurrence frequency
Link-based similarity: several methods use different properties of shared neighbours. E.g., for the shared neighbours of (A, B):
• Cardinality = 2 (i.e. C and D)
• Uniqueness = average of the uniquenesses of the individual neighbours = (Uniqueness of C + Uniqueness of D) / 2
[Figure: example link graph over the vertices A, B, C and D with weighted edges.]

Uniqueness Measure
Uniqueness is estimated for each shared neighbour k of vertices i and j:
$UQ_k = \frac{f_{ik} + f_{jk}}{\sum_{m} f_{mk}}$
where
• $f_{ik}$: frequency of the link between vertices i and k,
• $f_{jk}$: frequency of the link between vertices j and k,
• $f_{mk}$: frequency of the link between any vertex m and the vertex k.
The uniqueness measure captures the relative density of unique links to the nodes in question.

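Uniqueness Measure: Example
For the shared neighbours D and C of (A, B):
Uniqueness of D = $\frac{1 + 3}{1 + 3} = 1$
Uniqueness of C = $\frac{2 + 6}{2 + 6 + 1 + 1} = 0.8$
[Figure: the example link graph, shown once with the links to D highlighted and once with the links to C highlighted.]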
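The cardinality and uniqueness calculations for the pair (A, B) can be sketched as below. This is an illustrative sketch only: the edge list contains just the links that enter the worked example, and the vertices E and F are hypothetical stand-ins for the two unnamed neighbours of C with frequency 1 each.

```python
from statistics import mean


def build_graph(edges):
    """Undirected weighted graph: graph[u][v] = co-occurrence frequency."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def shared_neighbours(graph, i, j):
    """Vertices linked to both i and j."""
    return sorted(set(graph[i]) & set(graph[j]))


def uniqueness(graph, i, j, k):
    """(f_ik + f_jk) divided by the total frequency of all links into k."""
    return (graph[i][k] + graph[j][k]) / sum(graph[k].values())


# Links used in the worked example; E and F are assumed names.
EDGES = [("A", "C", 2), ("B", "C", 6), ("C", "E", 1), ("C", "F", 1),
         ("A", "D", 1), ("B", "D", 3)]
GRAPH = build_graph(EDGES)
SHARED = shared_neighbours(GRAPH, "A", "B")
CARDINALITY = len(SHARED)
AVG_UNIQUENESS = mean(uniqueness(GRAPH, "A", "B", k) for k in SHARED)

if __name__ == "__main__":
    print("shared neighbours:", SHARED)
    print("cardinality:", CARDINALITY)
    print("average uniqueness:", AVG_UNIQUENESS)
```

D is linked only to A and B, so all of its link frequency is shared by the pair; C also co-occurs with other vertices, which lowers its uniqueness.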

OM-based Model: Motivations
• Existing numerical techniques suffer from inaccurate description, often caused by unduly large values. E.g. a cardinality of 100 is normalised to 0.1 when the maximum cardinality is 1,000.
• Link properties, such as cardinality, are usually a matter of degree. Link property measures are therefore gauged and described qualitatively, using the order-of-magnitude formalism.
• Additionally, most link-based similarity methods take into account only one property of the neighbourhood context. Here, multiple properties (e.g. cardinality and uniqueness) are combined to improve the quality of the similarity measure.

 0 6 … 2 Numerical scale Small Medium Large Human analyst (6, ) (2, 6] [0, 2] OM scale OM-based Model Date 24/06/09 Constructing an OM scale: Cardinality Landmark set = {2, 6}

OM-based Model: OM Space (Cardinality)
Elements of the OM space, from abstraction down to precision:
• [small, large]
• [small, medium], [medium, large]
• [small, small], [medium, medium], [large, large]
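The OM space can be enumerated mechanically as every range [a, b] of labels with a no higher than b. The sketch below groups the ranges by how many label steps they span; the function name `om_space_by_span` is an assumption for illustration.

```python
from itertools import combinations_with_replacement


def om_space_by_span(labels):
    """Group all label ranges [a, b] (a not above b) by their span.

    Span 0 holds the most precise descriptions, e.g. [small, small];
    the largest span holds the most abstract one, e.g. [small, large].
    """
    levels = {}
    for i, j in combinations_with_replacement(range(len(labels)), 2):
        levels.setdefault(j - i, []).append((labels[i], labels[j]))
    return levels


CT_SPACE = om_space_by_span(["small", "medium", "large"])

if __name__ == "__main__":
    for span in sorted(CT_SPACE, reverse=True):
        print(span, CT_SPACE[span])
```

For three labels this yields six elements: one at the most abstract level, two in the middle and three at the most precise level.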

OM-based Model: Semi-supervised determination of landmarks
Human-directed landmarks are not optimal for different datasets. A better alternative is to learn them from data. In this work, a density function is used to determine landmarks:
$D(t) = \frac{N(t)}{N^{*}}$
where
• D(t): density of property measure t,
• N(t): number of entity pairs whose property measure ≥ t,
• N*: number of all entity pairs.
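A rough sketch of how a density function could drive landmark selection is given below. It assumes D(t) is the fraction N(t)/N* and, as a further assumption not stated on the slide, places a landmark at the smallest observed value whose density has dropped to each target level (here 0.1 and 0.01). The toy data and the function names are hypothetical.

```python
def density(values, t):
    """D(t): share of entity pairs whose property measure is >= t."""
    return sum(v >= t for v in values) / len(values)


def learn_landmarks(values, targets):
    """For each target density, pick the smallest observed t with D(t) <= target.

    Assumed selection rule for illustration only.
    """
    candidates = sorted(set(values))
    landmarks = []
    for target in targets:
        for t in candidates:
            if density(values, t) <= target:
                landmarks.append(t)
                break
    return landmarks


# Toy cardinalities for 100 entity pairs (made-up data).
TOY_VALUES = [1] * 90 + [3] * 9 + [8] * 1
TOY_LANDMARKS = learn_landmarks(TOY_VALUES, targets=[0.1, 0.01])

if __name__ == "__main__":
    for t in sorted(set(TOY_VALUES)):
        print(f"D({t}) = {density(TOY_VALUES, t)}")
    print("landmarks:", TOY_LANDMARKS)
```

Because most entity pairs have small property values, landmarks chosen this way spread out towards the sparse, large values instead of dividing the numerical range evenly.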

OM-based Model: Learning landmark values
[Figure: landmark values 4, 7, 10 and 23 marked at order-of-magnitude values of D(t).]

OM-based Model: Homogenisation of OM Models
Multiple link properties are described in different OM spaces. Before these measures can be combined, the property-specific OM scales must be homogenised.
For instance, the landmark sets of cardinality and uniqueness, {2, 6} and {0.1, 0.3, 0.6, 0.8}, are to be mapped onto the homogenised scale {-3, -2, -1, 0, 1, 2, 3}.
Step 1.1: Select the central landmark (l_c), which lies in the middle of each ordered landmark set.
• CT = {2, 6} → l_c = 2 or l_c = 6
• UQ = {0.1, 0.3, 0.6, 0.8} → l_c = 0.3 or l_c = 0.6

Homogenisation
Step 1.2: Shift each original landmark l_i to its new value sl_i, such that sl_i = l_i - l_c.
• CT = {2, 6} → {0, 4}, with l_c = 2
• UQ = {0.1, 0.3, 0.6, 0.8} → {-0.2, 0, 0.3, 0.5}, with l_c = 0.3
Step 2: Add landmark values so that they appear symmetrically on the positive and negative sides of 0.
• CT = {0, 4} → {-4, 0, 4}
• UQ = {-0.2, 0, 0.3, 0.5} → {-0.5, -0.3, -0.2, 0, 0.2, 0.3, 0.5}
Step 3: Add further landmarks so that all landmark sets have the same granularity.
• CT = {-4, 0, 4} → {-4, -2, -1, 0, 1, 2, 4}
• UQ = {-0.5, -0.3, -0.2, 0, 0.2, 0.3, 0.5} (unchanged)
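Steps 1.2 and 2 above are purely mechanical and can be sketched as follows. Step 3 is left out because the slide does not say how the additional landmarks are chosen. The function names and the rounding to 10 decimal places (to hide floating-point noise) are assumptions for this illustration.

```python
def centre(landmarks, lc, ndigits=10):
    """Step 1.2: shift every landmark by the chosen central landmark lc."""
    if lc not in landmarks:
        raise ValueError("lc must be one of the landmarks")
    # Adding 0.0 turns a possible -0.0 into 0.0 and ints into floats.
    return [round(l - lc, ndigits) + 0.0 for l in landmarks]


def symmetrise(landmarks, ndigits=10):
    """Step 2: add the mirror image of every landmark around 0."""
    mirrored = {round(-l, ndigits) for l in landmarks} | set(landmarks)
    return sorted(v + 0.0 for v in mirrored)


CT_STEP2 = symmetrise(centre([2, 6], lc=2))
UQ_STEP2 = symmetrise(centre([0.1, 0.3, 0.6, 0.8], lc=0.3))

if __name__ == "__main__":
    print("CT:", CT_STEP2)
    print("UQ:", UQ_STEP2)
```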

Homogenisation
Finally, map the modified scales to the homogenised set.

OM-based Model: Homogenised and Original Scales
Property     Label      Original     Homogenised
Cardinality  small      [0, 2]       (-∞, 0]
Cardinality  medium     (2, 6]       (0, 3]
Cardinality  large      (6, +∞)      (3, +∞)
Uniqueness   very low   [0, 0.1]     (-∞, -1]
Uniqueness   low        (0.1, 0.3]   (-1, 0]
Uniqueness   moderate   (0.3, 0.6]   (0, 2]
Uniqueness   high       (0.6, 0.8]   (2, 3]
Uniqueness   very high  (0.8, 1]     (3, +∞)

OM-based Model: Combining property measures
Different properties have different degrees of relevance (importance). Qualitative relevance is used here: Cardinality (CT) = ++ (or 2) and Uniqueness (UQ) = + (or 1).
OMS (Order-of-Magnitude based Similarity):
$OMS = [RV_{CT} \times CT + RV_{UQ} \times UQ]_{S^{*}}$
where
• $RV_{CT}$, $RV_{UQ}$: relevance degrees of CT and UQ, respectively,
• the expression inside the brackets is the real weighted sum,
• [·]: qualitative expression of that sum,
• S*: OM space for expressing OMS values.

OM-based Model: Combining property measures (example)
CT = [medium, medium] and UQ = [moderate, high]
OMS = 2CT + UQ
    = (2 × (0, 3] + (0, 2]) ∪ (2 × (0, 3] + (2, 3])
    = (0, 8] ∪ (2, 9]
    = (0, 9]
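The interval arithmetic in this example can be reproduced with a few lines of code. In this sketch an interval (lo, hi] is stored as the tuple (lo, hi), leaving the open/closed ends implicit; the function names are assumptions for the example.

```python
from functools import reduce


def scale(k, iv):
    """k * (lo, hi] for a positive weight k."""
    if k <= 0:
        raise ValueError("relevance weight must be positive")
    lo, hi = iv
    return (k * lo, k * hi)


def add(a, b):
    """Sum of two intervals: add the lower bounds and the upper bounds."""
    return (a[0] + b[0], a[1] + b[1])


def hull(a, b):
    """Smallest interval covering a and b (their union when they overlap)."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def weighted_sum(ct, uq_parts, rv_ct=2, rv_uq=1):
    """rv_ct * CT + rv_uq * UQ, with UQ given as a list of adjacent intervals."""
    terms = [add(scale(rv_ct, ct), scale(rv_uq, uq)) for uq in uq_parts]
    return terms, reduce(hull, terms)


# CT = [medium, medium] -> (0, 3]; UQ = [moderate, high] -> (0, 2] and (2, 3]
TERMS, OMS = weighted_sum((0, 3), [(0, 2), (2, 3)])

if __name__ == "__main__":
    print("partial sums:", TERMS)
    print("OMS interval:", OMS)
```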

OM-based Model: Order-of-magnitude Similarity (OMS)
• Estimated with respect to the homogenised scale
• Described using the OM space of S*
• A different S* can be used depending on the precision level required.
Example: with S* labels VL, L, M, H, VH and landmarks {-1, 0, 6, 9}, the OMS of (0, 9] is [M, H].
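Expressing a numerical interval in S* can be sketched as below. The landmark set {-1, 0, 6, 9} is read off the slide's scale figure and should be treated as an assumption, as should the function name `qualitative`; intervals are taken to be open on the left and closed on the right, as elsewhere in the model.

```python
from bisect import bisect_left, bisect_right


def qualitative(interval, landmarks, labels):
    """Describe the interval (lo, hi] as a range of OM labels [first, last]."""
    if len(labels) != len(landmarks) + 1:
        raise ValueError("need exactly one more label than landmarks")
    lo, hi = interval
    # lo is excluded, so the first label is the one just above lo.
    first = bisect_right(landmarks, lo)
    # hi is included, so a value equal to a landmark stays below it.
    last = bisect_left(landmarks, hi)
    return (labels[first], labels[last])


S_LANDMARKS = [-1, 0, 6, 9]
S_LABELS = ["VL", "L", "M", "H", "VH"]

if __name__ == "__main__":
    print(qualitative((0, 9), S_LANDMARKS, S_LABELS))
```

A coarser or finer S* only changes the landmark and label lists passed in, which reflects the point that different S* can serve different precision levels.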

Terrorist Data
Terrorist data is extracted from online news and web stories, e.g.:
• "... Osama bin Laden and Ayman al-Zawahri, moved out of Pakistan and are believed to have crossed the border back into Afghanistan ..."
• "Wanted Al-Qaeda chief Osama bin Laden and his top aide, Ayman al-Zawahri, have been witnessed ..."
[Figure: co-occurrence graph linking Al-Qaeda, Ayman al-Zawahri, Osama bin Laden and Afghanistan, with edge weights counting co-occurrences.]
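Building such a co-occurrence graph from text can be sketched as follows. The entity-extraction step is not described on the slide, so the entity lists below are hand-picked, hypothetical outputs for the two snippets (limited to the entities that appear in the slide's graph), and the function name is an assumption.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(documents):
    """Count, for every pair of entities, the documents mentioning both."""
    weights = Counter()
    for entities in documents:
        # Sort so that each unordered pair always gets the same key.
        for u, v in combinations(sorted(set(entities)), 2):
            weights[(u, v)] += 1
    return weights


# Hypothetical entity lists for the two snippets above.
DOCS = [
    ["Osama bin Laden", "Ayman al-Zawahri", "Afghanistan"],
    ["Al-Qaeda", "Osama bin Laden", "Ayman al-Zawahri"],
]
WEIGHTS = cooccurrence(DOCS)

if __name__ == "__main__":
    for (u, v), w in sorted(WEIGHTS.items()):
        print(f"{u} -- {v}: {w}")
```

Each key of `WEIGHTS` is an edge and each count an edge weight, which is the co-occurrence frequency that the link-based measures operate on.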

Example Data
[Figure: example link graphs with edge weights from the two datasets. Terrorist: Abu abdallah, Abu muhammad, Bin laden, Al qaida, September-11 attack, Afghanistan. DBLP: Chung-Hsing Yeh, Rowena Chen, Jisong Chen, Hepu Dong, Hepu Deng, Kate A. Smith.]

OMS Performance: Different Combination Methods
• OMS: order-of-magnitude model with semi-supervised landmarks.
  For Terrorist, CT = {4, 7, 10, 23}, UQ = {0.05, 0.12, 0.27, 0.43, 1}
  For DBLP, CT = {2, 5, 9, 15}, UQ = {0.008, 0.04, 0.17, 0.31, 1}
• OMSH: OMS with human-directed landmarks. CT = {2, 6}, UQ = {0.1, 0.3, 0.6, 0.8}
• QT: numerical weighted summation.
Note that, again, the relevance degrees of CT and UQ are 2:1 here.

OMS Performance: With Terrorist Data (Precision/Recall)

OMS Performance: With DBLP Data (Precision/Recall)

OMS Performance: OMS against other link-based methods
With Terrorist Data

OMS Performance: OMS against other link-based methods
With DBLP Data

Conclusion
Contribution: OMS, a combination of OM reasoning and link analysis, with (semi-supervised) data-driven determination of landmarks.
• Usually performs better than numerical link-based approaches.
• Improves the similarity measure by combining link properties.
• Allows explanation, for a possible reduction of false positives.
Further Work:
• Evaluation with more relevant data.
• Learning of relevance degrees from data.
Acknowledgement: This research is supported by UK EPSRC grant EP/D057086.
