Untangling Names

Lessons learned (so far) from the linking of

IPNI and TROPICOS

Julius Welby

RBG Kew

[email protected]


TROPICOS + IPNI


Why match?


Why is this difficult?





Variation

Calophyllum kiong K.Schum. & Lauterb.

Fl. Deutsch. Sudsee, 450.

Calophyllum kiong Lauterb. & K.Schum.

Die Flora der Deutschen Schutzgebiete in der Sudsee 1900


Duplication

  • Poa annua L. -- Sp. Pl. 68. 1753 (GCI)

  • Poa annua L. -- Species Plantarum 2 1753 (APNI)

  • Poa annua L. -- Sp. Pl. 68. (IK)


Duplication

  • Calophyllum microphyllum Scheff in Tijdschr. Nederl. Ind. xxxii. (1871) 406. (IK)

  • Calophyllum microphyllum Planch. & Triana in Ann. Sc. Nat. Ser. IV. xv. (1861) 282. (IK)

  • Calophyllum microphyllum T.Anders. Fl. Brit. Ind. (J. D. Hooker). i. 272. (IK)



Fields

1  Calophyllum            Calophyllum

2  kiong                  kiong

3  K.Schum. & Lauterb.    Lauterb. & K.Schum.

•  Fl. Deutsch. Sudsee    Die Flora der Deutschen…

•  450.                   1900


Lesson 1: Speed matters

2,500 by 2,000 by 4 fields

20,000,000 comparisons

~5.5 hours at 1 ms per comparison
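
The arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-envelope sketch: the variable names are illustrative, and it assumes that 2,500 and 2,000 are the record counts of the two datasets, with every record compared against every other on all four fields.

```python
# Back-of-envelope cost of an exhaustive comparison.
# Assumption: 2,500 and 2,000 are record counts of the two datasets.
records_a = 2_500
records_b = 2_000
fields = 4
seconds_per_comparison = 0.001  # 1 ms

comparisons = records_a * records_b * fields
hours = comparisons * seconds_per_comparison / 3600

print(f"{comparisons:,} comparisons")  # 20,000,000 comparisons
print(f"about {hours:.1f} hours")      # about 5.6 hours
```

At these sizes the run already takes roughly five and a half hours, which is why the later slides concentrate on avoiding comparisons altogether.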


Be lazy

  • Do as little as possible

    • Specify fields as ‘must match’

    • If a ‘must match’ field fails

      • Mark the match as failed

      • Stop comparing fields

  • Do easy things if possible

  • Do hard things only if necessary

  • Only expend effort when it’s worth it
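
The 'must match' idea could be sketched as follows. This is not the framework's actual code: the `FieldSpec` layout, the `compare` function and the field names are hypothetical, and treating a difference in a non-'must match' field as "still a match, but noted" is an assumption made for the illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
# Hypothetical parameter layout: (field name, test function, must_match flag).
FieldSpec = Tuple[str, Callable[[str, str], bool], bool]


def exact(a: str, b: str) -> bool:
    return a == b


def compare(rec_a: Record, rec_b: Record, specs: List[FieldSpec]) -> Tuple[bool, List[str]]:
    """Return (matched, names of the fields that differed)."""
    differed: List[str] = []
    for name, test, must_match in specs:
        if test(rec_a[name], rec_b[name]):
            continue
        differed.append(name)
        if must_match:
            # Mark the match as failed and stop comparing fields.
            return False, differed
    return True, differed


SPECS = [("genus", exact, True), ("species", exact, True), ("authors", exact, False)]

kiong_1 = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
kiong_2 = {"genus": "Calophyllum", "species": "kiong", "authors": "Lauterb. & K.Schum."}
poa = {"genus": "Poa", "species": "annua", "authors": "L."}

print(compare(kiong_1, kiong_2, SPECS))  # (True, ['authors'])
print(compare(kiong_1, poa, SPECS))      # (False, ['genus'])
```

In the second call only one field is ever tested, because the genus is marked 'must match' and fails immediately.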


Parameterised matching

Species, infragenus, infraspecies, authors, rank …



Optimising

  • The order of field matching is important

    • Choose suitable fields to match first

    • Aim to fail matches early

  • Significant speed-up
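
A toy illustration of why field order matters, under the assumption that every field is 'must match' and comparison stops at the first failure. The data are made up for the example (most records share a genus, so the genus rarely fails), and `count_field_comparisons` is a hypothetical helper, not part of the framework.

```python
from itertools import product

# Illustrative (genus, species) pairs; most records share the same genus.
names_a = [("Calophyllum", "kiong"), ("Calophyllum", "microphyllum"), ("Calophyllum", "inophyllum")]
names_b = [("Calophyllum", "kiong"), ("Calophyllum", "microphyllum"), ("Poa", "annua")]


def count_field_comparisons(order):
    """Count field comparisons when fields are tested in `order`, failing early."""
    count = 0
    for a, b in product(names_a, names_b):
        for index in order:
            count += 1
            if a[index] != b[index]:
                break  # fail early: skip the remaining fields
    return count


genus_first = count_field_comparisons([0, 1])
species_first = count_field_comparisons([1, 0])
print(genus_first, species_first)  # 15 11
```

Testing the more discriminating field first gives the same match results with fewer comparisons. Which field discriminates best depends on the datasets being linked.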


Also, for speed

  • Do as little as possible

    • Do escaping or standardisation once

    • Done on import for each dataset

    • Keep field matching functions clean
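
One way to read "do standardisation once" is to clean every field when a dataset is imported, so that the matching functions never repeat that work. The sketch below assumes this reading; the clean-up rules in `standardise` (Unicode normalisation, whitespace collapsing, lower-casing) are placeholders rather than the rules the framework uses.

```python
import unicodedata


def standardise(value: str) -> str:
    """Hypothetical clean-up: NFC-normalise, collapse whitespace, lower-case."""
    value = unicodedata.normalize("NFC", value)
    return " ".join(value.split()).lower()


def import_dataset(rows):
    """Clean every field once, at load time."""
    return [{field: standardise(text) for field, text in row.items()} for row in rows]


def fields_equal(a: str, b: str) -> bool:
    """Matching stays trivial because both inputs are already clean."""
    return a == b


ik = import_dataset([{"genus": "  Poa ", "species": "annua"}])
gci = import_dataset([{"genus": "Poa", "species": "ANNUA"}])
print(fields_equal(ik[0]["genus"], gci[0]["genus"]))      # True
print(fields_equal(ik[0]["species"], gci[0]["species"]))  # True
```

With a few thousand records per dataset, cleaning costs a few thousand calls at import instead of millions inside the comparison loop.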


More speed optimisation

  • Do easy things if possible

    • Define cascading tests

    • Do easy tests first, if practical

      • Length comparisons

      • Composition comparisons
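
A cascade of tests might look like the sketch below: an equality check, then a length comparison, then a composition (character count) comparison, and only then a costly similarity measure. The thresholds and the choice of `difflib.SequenceMatcher` for the final step are assumptions for illustration; the slides do not say which measures the framework uses.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a: str, b: str, max_length_gap: int = 2,
            max_char_gap: int = 4, min_ratio: float = 0.85) -> bool:
    """Cheap tests first; the expensive test runs only if they all pass."""
    # 1. Cheapest: identical strings need no further work.
    if a == b:
        return True
    # 2. Length comparison: very different lengths cannot be close.
    if abs(len(a) - len(b)) > max_length_gap:
        return False
    # 3. Composition comparison: count characters not shared by both strings.
    count_a, count_b = Counter(a), Counter(b)
    unshared = sum(((count_a - count_b) + (count_b - count_a)).values())
    if unshared > max_char_gap:
        return False
    # 4. Expensive: full sequence similarity.
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


print(similar("microphyllum", "microphylum"))  # True
print(similar("kiong", "microphyllum"))        # False (fails on length)
print(similar("annua", "kiong"))               # False (fails on composition)
```

Only the first of these three calls reaches the expensive step.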


Speed Lessons

  • Speed matters

  • Minimise comparisons made

    • ‘Must match’ parameters

    • Match fields in an efficient order

  • Do data cleaning once, up front

  • Look for ways to fail matches cheaply



Accuracy

[Diagram labels: False - (false negatives), OK, False + (false positives)]





Lesson 2: Look at near misses



One approach

  • Currently, to get best results:

    • Tend towards strictness

    • Handle false negatives

  • Failures on ‘rightmost’ fields can be written to a report

  • Checked and fed back in as escapes

  • Rerun
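
The report, check, rerun loop could be sketched like this. It assumes that 'rightmost' means the last field in the comparison order, that matching is strict equality, and that an escape is a checked pair of values accepted as equivalent for one field. `FIELDS`, `classify` and the escape format are hypothetical names for the illustration.

```python
FIELDS = ["genus", "species", "authors"]  # compared left to right


def classify(rec_a, rec_b, escapes):
    """Return 'match', 'near miss' or 'fail' under strict, ordered comparison."""
    for position, field in enumerate(FIELDS):
        a, b = rec_a[field], rec_b[field]
        if a == b or (field, a, b) in escapes:
            continue
        is_rightmost = position == len(FIELDS) - 1
        # A failure on the last field only is worth a human look.
        return "near miss" if is_rightmost else "fail"
    return "match"


rec_1 = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
rec_2 = {"genus": "Calophyllum", "species": "kiong", "authors": "Lauterb. & K.Schum."}

escapes = set()
print(classify(rec_1, rec_2, escapes))  # near miss -> written to the report

# After a person has checked the report, the pair is fed back in as an escape.
escapes.add(("authors", "K.Schum. & Lauterb.", "Lauterb. & K.Schum."))
print(classify(rec_1, rec_2, escapes))  # match
```

The escape here is directional; a fuller version would probably store the pair in both orders.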


Lesson 3: Remove predictable variation


Predictable variation

  • Gendered endings

  • Common alternatives

    • Endings:

      • ii, i

      • iae, ae

  • Dataset-specific quirks:

    • &amp;, &
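
Removing predictable variation before matching might look like the sketch below. The rules are guesses based on the bullets above: `-ii` and `-iae` endings are reduced to `-i` and `-ae`, gendered endings are stripped to a shared stem, and the dataset-specific quirk is assumed to be an HTML-escaped ampersand (`&amp;`) that should read `&`. All function names are hypothetical, and the epithets in the example are made up.

```python
import re

# Hypothetical ending rules, longest ending first.
ENDING_RULES = [
    (re.compile(r"iae$"), "ae"),
    (re.compile(r"ii$"), "i"),
]
GENDER_ENDINGS = ("us", "um", "a")


def normalise_epithet(epithet: str) -> str:
    """Reduce common alternative endings to one form."""
    for pattern, replacement in ENDING_RULES:
        if pattern.search(epithet):
            return pattern.sub(replacement, epithet)
    return epithet


def gender_key(epithet: str) -> str:
    """Strip a gendered ending so -us/-a/-um forms share one key."""
    for ending in GENDER_ENDINGS:
        if epithet.endswith(ending):
            return epithet[: -len(ending)]
    return epithet


def normalise_authors(authors: str) -> str:
    """Undo an HTML-escaped ampersand (assumed dataset quirk)."""
    return authors.replace("&amp;", "&")


print(normalise_epithet("smithii"), normalise_epithet("smithiae"))  # smithi smithae
print(gender_key("microphyllum") == gender_key("microphylla"))      # True
print(normalise_authors("K.Schum. &amp; Lauterb."))                 # K.Schum. & Lauterb.
```

Rules this blunt will occasionally conflate distinct names, which is one reason to pair them with a near-miss report rather than trust them blindly.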


The framework

  • Python

  • Psyco

  • Modular

  • Extensible

  • In progress

  • More details will be available on the TDWG website

  • Source code availability

  • Some results (HTML)


Thanks to

  • Bob Magill

  • Sally Hinchcliffe

  • The Moore Foundation

  • Contact:

  • [email protected]

  • or after Jan 2007: [email protected]

