Untangling Names. Lessons learned (so far) from the linking of IPNI and TROPICOS Julius Welby RBG Kew [email protected] TROPICOS + IPNI. Why match?. Why is this difficult?. Variation. Calophyllum kiong K.Schum. & Lauterb. Fl. Deutsch. Sudsee, 450. Calophyllum kiong Lauterb. & K.Schum.

Presentation Transcript
Untangling Names

Lessons learned (so far) from the linking of

IPNI and TROPICOS

Julius Welby

RBG Kew

[email protected]

slide5

Variation

Calophyllum kiong K.Schum. & Lauterb.

Fl. Deutsch. Sudsee, 450.

Calophyllum kiong Lauterb. & K.Schum.

Die Flora der Deutschen Schutzgebiete in der Sudsee 1900

slide6

Duplication

  • Poa annua L. -- Sp. Pl. 68. 1753 (GCI)
  • Poa annua L. -- Species Plantarum 2 1753 (APNI)
  • Poa annua L. -- Sp. Pl. 68. (IK)
slide7

Duplication

  • Calophyllum microphyllum Scheff. in Tijdschr. Nederl. Ind. xxxii. (1871) 406. (IK)
  • Calophyllum microphyllum Planch. & Triana in Ann. Sc. Nat. Ser. IV. xv. (1861) 282. (IK)
  • Calophyllum microphyllum T.Anders. Fl. Brit. Ind. (J. D. Hooker). i. 272. (IK)
slide9

Fields

1   Calophyllum            Calophyllum
2   kiong                  kiong
3   K.Schum. & Lauterb.    Lauterb. & K.Schum.
4   Fl. Deutsch. Sudsee    Die Flora der Deutschen…
5   450.                   1900
slide10

Lesson 1

Speed matters

slide11

Speed matters

2,500 by 2,000 by 4 fields

20,000,000 comparisons

~5.5 hours at 1ms per comparison
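The arithmetic behind this slide can be checked with a few lines of Python. This is only a worked example of the figures quoted above; the variable names are illustrative.

```python
# Figures taken from the slide: 2,500 records against 2,000 records, 4 fields each.
records_a = 2_500
records_b = 2_000
fields_per_record = 4
seconds_per_comparison = 0.001  # 1 ms per field comparison

comparisons = records_a * records_b * fields_per_record
total_hours = comparisons * seconds_per_comparison / 3600

print(f"{comparisons:,} comparisons")   # 20,000,000 comparisons
print(f"{total_hours:.2f} hours")       # 5.56 hours
```

Even these small samples need millions of field comparisons when every record is compared with every other, which is why the later slides focus on failing matches early.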

slide12

Be lazy

  • Do as little as possible
  • Do easy things if possible
  • Do hard things only if necessary
  • Only expend effort when it’s worth it
slide13

Be lazy

  • Do as little as possible
    • Specify fields as ‘must match’
    • If a ‘must match’ field fails
      • Mark the match as failed
      • Stop comparing fields
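A minimal sketch of the 'must match' idea follows. It is not the framework's own code: the function name, the field names and the exact-equality test are all assumptions made for illustration.

```python
def compare_records(rec_a, rec_b, field_order, must_match):
    """Compare two records field by field.

    Returns (passed, results). Comparison stops at the first failed
    'must match' field, so later fields are never examined.
    """
    results = {}
    for field in field_order:
        results[field] = rec_a[field] == rec_b[field]
        if not results[field] and field in must_match:
            return False, results  # mark the match as failed, stop comparing
    return True, results


FIELDS = ["genus", "species", "authors"]  # hypothetical field names
MUST_MATCH = {"genus", "species"}         # hypothetical choice of strict fields

kiong = {"genus": "Calophyllum", "species": "kiong",
         "authors": "K.Schum. & Lauterb."}
kiong_variant = {"genus": "Calophyllum", "species": "kiong",
                 "authors": "Lauterb. & K.Schum."}
poa = {"genus": "Poa", "species": "annua", "authors": "L."}

print(compare_records(kiong, kiong_variant, FIELDS, MUST_MATCH))
print(compare_records(kiong, poa, FIELDS, MUST_MATCH))
```

In the second call only the genus is ever compared: it is a 'must match' field, it fails, and the remaining fields are skipped.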
slide16

Optimising

  • The order of field matching is important
    • Choose suitable fields to match first
    • Aim to fail matches early
  • Significant speed-up
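The effect of field order can be illustrated by counting field comparisons under two orderings. The sketch below assumes exact-equality tests and a tiny made-up sample in which every name shares a genus; the real speed-up depends on the data.

```python
from itertools import product


def count_comparisons(names_a, names_b, field_order):
    """Count field comparisons when each pair stops at its first failure."""
    count = 0
    for a, b in product(names_a, names_b):
        for field in field_order:
            count += 1
            if a[field] != b[field]:
                break  # fail early: skip the remaining fields
    return count


# Illustrative sample: three names in the same genus.
names = [
    {"genus": "Calophyllum", "species": "kiong"},
    {"genus": "Calophyllum", "species": "microphyllum"},
    {"genus": "Calophyllum", "species": "inophyllum"},
]

genus_first = count_comparisons(names, names, ["genus", "species"])
species_first = count_comparisons(names, names, ["species", "genus"])
print(genus_first, species_first)  # 18 12
```

Because the genus never discriminates in this sample, testing the species first lets the six non-matching pairs fail after a single comparison.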
slide17

Also, for speed

  • Do as little as possible
    • Do escaping or standardisation once
    • Done on import for each dataset
    • Keep field matching functions clean
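One way to read this slide is that all cleaning happens in an import step, so the matching functions only ever see standardised values. The sketch below assumes the cleaning consists of unescaping HTML entities, collapsing whitespace and lowercasing; the actual rules used in the project are not given in the slides.

```python
import html


def standardise(value):
    """Clean one field value: unescape HTML entities, collapse whitespace, lowercase."""
    return " ".join(html.unescape(value).split()).lower()


def import_dataset(rows):
    """Standardise every field once, at import time."""
    return [{field: standardise(value) for field, value in row.items()}
            for row in rows]


def fields_match(a, b):
    """The matching function stays trivial because inputs are already clean."""
    return a == b


dataset_a = import_dataset([{"authors": "K.Schum. &amp; Lauterb."}])
dataset_b = import_dataset([{"authors": "K.Schum.  & Lauterb."}])
print(fields_match(dataset_a[0]["authors"], dataset_b[0]["authors"]))  # True
```

Cleaning on import costs one pass per dataset, whereas cleaning inside the matching function would repeat the work for every pair compared.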
slide18

More speed optimisation

  • Do easy things if possible
    • Define cascading tests
    • Do easy tests first, if practical
        • Length comparisons
        • Composition comparisons
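A cascade of tests might look like the sketch below: a length comparison first, then a character-composition comparison, and only then a more expensive similarity measure. The thresholds are arbitrary, and `difflib.SequenceMatcher` is used here merely as a stand-in for whatever costly comparison the framework applies.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a, b, max_len_diff=2, max_char_diff=4, min_ratio=0.85):
    """Cascading fuzzy match: cheap tests first, the expensive test last."""
    if a == b:
        return True
    # Easy test 1: strings of very different length cannot be close variants.
    if abs(len(a) - len(b)) > max_len_diff:
        return False
    # Easy test 2: compare character composition, ignoring order.
    count_a, count_b = Counter(a), Counter(b)
    char_diff = sum(((count_a - count_b) + (count_b - count_a)).values())
    if char_diff > max_char_diff:
        return False
    # Hard test: only reached by pairs that survived the cheap checks.
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


print(similar("kiong", "microphyllum"))  # False, rejected on length alone
print(similar("abcde", "vwxyz"))         # False, rejected on composition
print(similar("smithii", "smithi"))      # True, passes all three tests
```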
Speed Lessons
  • Speed matters
  • Minimise comparisons made
    • ‘Must match’ parameters
    • Match fields in an efficient order
  • Do data cleaning once, up front
  • Look for ways to fail matches cheaply
slide21

Accuracy

False negatives

OK

False positives

slide27

One approach

  • Currently, to get best results:
    • Tend towards strictness
    • Handle false negatives
slide28

One approach

  • Currently, to get best results:
    • Tend towards strictness
    • Handle false negatives
  • Failures on ‘rightmost’ fields can be written to a report
  • The report is checked and confirmed matches are fed back in as escapes
  • The match is then rerun
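The report-and-rerun cycle can be sketched as follows. The representation of an escape as a pair of values treated as equal is an assumption, as are the function and variable names; only a single field is shown to keep the example short.

```python
def match_publications(pairs, escapes):
    """Strict matching on one field; failures go to a report for checking."""
    matched, report = [], []
    for pub_a, pub_b in pairs:
        if pub_a == pub_b or (pub_a, pub_b) in escapes:
            matched.append((pub_a, pub_b))
        else:
            report.append((pub_a, pub_b))
    return matched, report


pairs = [("Sp. Pl.", "Sp. Pl."), ("Sp. Pl.", "Species Plantarum")]

# First run: strict matching produces a false negative.
escapes = set()
matched, report = match_publications(pairs, escapes)
print(report)  # [('Sp. Pl.', 'Species Plantarum')]

# A person checks the report; confirmed pairs are fed back in as escapes.
escapes.update(report)

# Rerun: the checked pair now matches.
matched_after, report_after = match_publications(pairs, escapes)
print(len(matched_after), len(report_after))  # 2 0
```

Staying strict keeps false positives out of the results, while the escapes accumulate checked decisions so that each false negative only has to be reviewed once.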
slide30

Predictable variation

  • Gendered endings
  • Common alternatives
    • Endings:
      • ii, i
      • iae, ae
  • Dataset specific quirks:
    • &amp;, &
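Predictable variation of this kind can be folded away before comparison. The sketch below covers only the two ending pairs listed above and assumes the dataset quirk is an HTML-escaped ampersand; the function names and the sample epithets are illustrative.

```python
# Pairs of (long ending, short ending) treated as equivalent.
ENDING_PAIRS = (("iae", "ae"), ("ii", "i"))


def normalise_epithet(epithet):
    """Reduce an epithet to the shorter of two equivalent endings."""
    for long_ending, short_ending in ENDING_PAIRS:
        if epithet.endswith(long_ending):
            return epithet[: -len(long_ending)] + short_ending
    return epithet


def normalise_authors(authors):
    """Undo a dataset-specific quirk: an escaped ampersand (assumed)."""
    return authors.replace("&amp;", "&")


print(normalise_epithet("smithii"), normalise_epithet("smithi"))       # smithi smithi
print(normalise_epithet("hookeriae"), normalise_epithet("hookerae"))   # hookerae hookerae
print(normalise_authors("K.Schum. &amp; Lauterb."))                    # K.Schum. & Lauterb.
```

Applying these rules once on import fits the earlier advice to keep the field matching functions clean.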
slide31

The framework

  • Python
  • Psyco
  • Modular
  • Extensible
  • In progress
  • More details will be available on the TDWG website
  • Source code availability
slide32

The framework

  • Some results (HTML)
slide33

Thanks to
