Untangling names





Untangling names

Untangling Names

Lessons learned (so far) from the linking of

IPNI and TROPICOS

Julius Welby

RBG Kew

[email protected]


Untangling names

TROPICOS + IPNI


Untangling names

Why match?


Untangling names

Why is this difficult?


Untangling names

Variation

Calophyllum kiong K.Schum. & Lauterb.

Fl. Deutsch. Sudsee, 450.

Calophyllum kiong Lauterb. & K.Schum.

Die Flora der Deutschen Schutzgebiete in der Sudsee 1900


Untangling names

Duplication

  • Poa annua L. -- Sp. Pl. 68. 1753 (GCI)

  • Poa annua L. -- Species Plantarum 2 1753 (APNI)

  • Poa annua L. -- Sp. Pl. 68. (IK)


Untangling names

Duplication

  • Calophyllum microphyllum Scheff. in Tijdschr. Nederl. Ind. xxxii. (1871) 406. (IK)

  • Calophyllum microphyllum Planch. & Triana in Ann. Sc. Nat. Ser. IV. xv. (1861) 282. (IK)

  • Calophyllum microphyllum T.Anders. Fl. Brit. Ind. (J. D. Hooker). i. 272. (IK)


Untangling names

Matching


Untangling names

Fields

1  Calophyllum             Calophyllum
2  kiong                   kiong
3  K.Schum. & Lauterb.     Lauterb. & K.Schum.
  • Fl. Deutsch. Sudsee    Die Flora der Deutschen…
  • 450.                   1900


Untangling names

Lesson 1

Speed matters


Untangling names

Speed matters

2,500 by 2,000 by 4 fields

20,000,000 comparisons

~5.5 hours at 1ms per comparison
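The estimate can be checked with a few lines of arithmetic. The dataset sizes and the 1 ms cost are the slide's own figures; the variable names are only illustrative.

```python
# Figures taken from the slide; the names are illustrative only.
names_a = 2_500                  # records in the first dataset
names_b = 2_000                  # records in the second dataset
fields = 4                       # fields compared per pair of records
seconds_per_comparison = 0.001   # 1 ms

comparisons = names_a * names_b * fields
hours = comparisons * seconds_per_comparison / 3600

print(comparisons)       # 20000000
print(round(hours, 2))   # 5.56
```

Every pair of records is compared on every field here, which is the worst case the following slides try to avoid.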


Untangling names

Be lazy

  • Do as little as possible

  • Do easy things if possible

  • Do hard things only if necessary

  • Only expend effort when it’s worth it


Untangling names

Be lazy

  • Do as little as possible

    • Specify fields as ‘must match’

    • If a ‘must match’ field fails

      • Mark the match as failed

      • Stop comparing fields
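A minimal sketch of the 'must match' idea, assuming each record is a dict of already-cleaned strings. `FieldRule`, `exact` and `match_records` are hypothetical names for illustration, not the framework's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class FieldRule:
    name: str                             # key into the record dict
    compare: Callable[[str, str], bool]   # field-level test
    must_match: bool = True               # a failure here ends the comparison


def exact(a: str, b: str) -> bool:
    return a == b


def match_records(a: Dict[str, str], b: Dict[str, str],
                  rules: List[FieldRule]) -> Tuple[bool, List[str]]:
    """Return (matched, mismatched_fields), stopping at the first
    'must match' field that fails."""
    mismatches = []
    for rule in rules:
        if rule.compare(a[rule.name], b[rule.name]):
            continue
        mismatches.append(rule.name)
        if rule.must_match:
            return False, mismatches      # mark as failed, stop comparing
    return True, mismatches


rules = [
    FieldRule("genus", exact),
    FieldRule("species", exact),
    FieldRule("authors", exact, must_match=False),
]
record_a = {"genus": "Calophyllum", "species": "kiong",
            "authors": "K.Schum. & Lauterb."}
record_b = {"genus": "Calophyllum", "species": "kiong",
            "authors": "Lauterb. & K.Schum."}
record_c = dict(record_a, species="microphyllum")

print(match_records(record_a, record_b, rules))  # (True, ['authors'])
print(match_records(record_a, record_c, rules))  # (False, ['species'])
```

In the second call the authors field is never compared, because the species field has already failed.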


Parameterised matching

species  infragenus  infraspecies  authors  rank …


Untangling names

How lazy?


Untangling names

Optimising

  • The order of field matching is important

    • Choose suitable fields to match first

    • Aim to fail matches early

  • Significant speed-up
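The effect of field order can be seen by counting field comparisons on a toy dataset. This is only a sketch: the three names come from earlier slides, `count_field_comparisons` is a hypothetical helper, and which field is best to test first depends on the real data.

```python
from typing import Dict, List

names = [
    {"genus": "Calophyllum", "species": "kiong"},
    {"genus": "Calophyllum", "species": "microphyllum"},
    {"genus": "Poa", "species": "annua"},
]


def count_field_comparisons(left: List[Dict[str, str]],
                            right: List[Dict[str, str]],
                            order: List[str]) -> int:
    """Compare every pair field by field in the given order,
    stopping at the first field that differs."""
    count = 0
    for a in left:
        for b in right:
            for field in order:
                count += 1
                if a[field] != b[field]:
                    break             # fail early, skip remaining fields
    return count


print(count_field_comparisons(names, names, ["genus", "species"]))  # 14
print(count_field_comparisons(names, names, ["species", "genus"]))  # 12
```

Two of the three names share a genus, so testing the species first rejects more pairs after a single comparison.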


Untangling names

Also, for speed

  • Do as little as possible

    • Do escaping or standardisation once

    • Done on import for each dataset

    • Keep field matching functions clean
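One way to read this is sketched below: clean every value once at import, so the matching functions can compare values directly. The particular clean-up (collapsing whitespace and lower-casing) and the function names are assumptions for illustration.

```python
from typing import Dict, List


def standardise(value: str) -> str:
    """Assumed minimal clean-up: collapse whitespace and lower-case."""
    return " ".join(value.split()).lower()


def import_dataset(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Clean every field once, at import time."""
    return [{key: standardise(value) for key, value in row.items()}
            for row in rows]


def fields_equal(a: str, b: str) -> bool:
    # No cleaning here: both inputs are already standardised.
    return a == b


raw = [{"genus": " Calophyllum", "species": "Kiong "}]
clean = import_dataset(raw)
print(clean)  # [{'genus': 'calophyllum', 'species': 'kiong'}]
```

The cost of cleaning then grows with the number of records, not with the number of pairs compared.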


Untangling names

More speed optimisation

  • Do easy things if possible

    • Define cascading tests

    • Do easy tests first, if practical

      • Length comparisons

      • Composition comparisons
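A cascade of this kind might look like the sketch below: cheap length and character-composition checks run first, and the more expensive similarity score is computed only for pairs that survive them. The thresholds, the use of `difflib.SequenceMatcher` as the expensive test, and the epithet `smithii` are all assumptions for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def cascade_match(a: str, b: str, max_len_diff: int = 2,
                  max_char_diff: int = 2, min_ratio: float = 0.85) -> bool:
    """Fuzzy comparison that tries cheap tests before expensive ones."""
    if a == b:
        return True                       # cheapest test of all
    if abs(len(a) - len(b)) > max_len_diff:
        return False                      # length comparison
    count_a, count_b = Counter(a), Counter(b)
    unshared = sum(((count_a - count_b) + (count_b - count_a)).values())
    if unshared > max_char_diff:
        return False                      # composition comparison
    # Only now pay for a full similarity score.
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


print(cascade_match("kiong", "kiong"))          # True
print(cascade_match("annua", "microphyllum"))   # False
print(cascade_match("smithii", "smithi"))       # True
```

The second pair is rejected on length alone, so neither the character counts nor the similarity score are computed for it.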


Speed lessons

  • Speed matters

  • Minimise comparisons made

    • ‘Must match’ parameters

    • Match fields in an efficient order

  • Do data cleaning once, up front

  • Look for ways to fail matches cheaply


Untangling names

Accuracy


Untangling names

Accuracy

False -

OK

False +


Untangling names

Strict match

F-

OK


Untangling names

Fuzzy match

OK

F+


Untangling names

Doughnut of uncertainty


Untangling names

Lesson 2: Look at near misses


Untangling names

Near misses are checkable


Untangling names

One approach

  • Currently, to get best results:

    • Tend towards strictness

    • Handle false negatives


Untangling names

One approach

  • Currently, to get best results:

    • Tend towards strictness

    • Handle false negatives

  • Failures on ‘rightmost’ fields can be written to a report

  • Checked and fed back in as escapes

  • Rerun
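The report-check-rerun loop could be sketched as follows. The representation of an escape as a `(field, value_a, value_b)` tuple and the function name `classify` are assumptions, not the framework's actual format.

```python
from typing import Dict, Set, Tuple

FIELDS = ["genus", "species", "authors"]   # compared left to right

Escape = Tuple[str, str, str]              # (field, value in a, value in b)


def classify(a: Dict[str, str], b: Dict[str, str],
             escapes: Set[Escape]) -> str:
    """Return 'match', 'near miss' or 'fail' for one pair of records."""
    for position, field in enumerate(FIELDS):
        if a[field] == b[field]:
            continue
        if (field, a[field], b[field]) in escapes:
            continue                       # a reviewer confirmed this pair
        is_rightmost = position == len(FIELDS) - 1
        return "near miss" if is_rightmost else "fail"
    return "match"


a = {"genus": "Calophyllum", "species": "kiong",
     "authors": "K.Schum. & Lauterb."}
b = {"genus": "Calophyllum", "species": "kiong",
     "authors": "Lauterb. & K.Schum."}

escapes: Set[Escape] = set()
first_run = classify(a, b, escapes)        # goes into the report
print(first_run)                           # near miss

# After a manual check, the pair is fed back in as an escape.
escapes.add(("authors", a["authors"], b["authors"]))
second_run = classify(a, b, escapes)
print(second_run)                          # match
```

Matching itself stays strict; only pairs that a person has checked are allowed through on the rerun.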


Untangling names

Lesson 3: Remove predictable variation


Untangling names

Predictable variation

  • Gendered endings

  • Common alternatives

    • Endings:

      • ii, i

      • iae, ae

  • Dataset specific quirks:

    • &amp;, &
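A minimal sketch of removing this kind of variation before comparison. It covers only the two ending pairs listed above and leaves gendered endings out. The HTML-escaped ampersand is an assumed example of a dataset quirk, and the epithets are made up for illustration.

```python
import html
import re


def normalise_epithet(epithet: str) -> str:
    """Reduce the alternative endings ii/i and iae/ae to one form."""
    e = epithet.lower()
    e = re.sub(r"iae$", "ae", e)
    e = re.sub(r"ii$", "i", e)
    return e


def normalise_authors(authors: str) -> str:
    """Undo HTML escaping such as '&amp;' (assumed dataset quirk)."""
    return html.unescape(authors)


print(normalise_epithet("smithii"))    # smithi
print(normalise_epithet("smithiae"))   # smithae
print(normalise_authors("K.Schum. &amp; Lauterb."))  # K.Schum. & Lauterb.
```

Both datasets would be passed through the same functions, so the normalised forms only need to agree with each other, not to be valid names themselves.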


Untangling names

The framework

  • Python

  • Psyco

  • Modular

  • Extensible

  • In progress

  • More details will be available on the TDWG website

  • Source code availability


Untangling names

The framework

  • Some results (HTML)


Untangling names

Thanks to

  • Bob Magill

  • Sally Hinchcliffe

  • The Moore Foundation

  • Contact:

  • [email protected]

  • or after Jan 2007: [email protected]

