Multithreaded ingestion of BUFR messages from the IDD

John Caron
Oct 8, 2008

Overview
  • BUFR format
  • IDD HRS BUFR data stream
  • Multithreaded processing of IDD messages
  • Indexing data
BUFR data format
  • WMO standard for observational met data
    • circa 1988: “Table Driven Forms” (TDF)
    • Improvement over “character oriented codes” (e.g. METARs)
    • Migration from previous forms is still a large WMO focus
    • Today: Edition 4 format, Version 13 of the tables
  • Table driven (12000 entries in global tables)
    • Each record contains a set of data descriptors (dds)
    • Global WMO and local tables
  • Simple “Compressed binary”
    • Packed bits, scale/offset convert to float
    • Fixed precision, no dynamic range
    • Difference from reference value
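The scale/offset conversion can be sketched as follows. This is a minimal illustration, assuming the usual BUFR Table B convention that the physical value is (raw + refVal) / 10^scale and that an all-ones bit field marks a missing value. The class and method names are hypothetical, and the raw value 13000 is invented for the example.

```java
/** Sketch of BUFR Table B unpacking: value = (raw + refVal) / 10^scale. */
final class BufrValue {
    private BufrValue() {}

    /** True when every bit of an nbits-wide field is set (assumed "missing" marker). */
    static boolean isMissing(long raw, int nbits) {
        return raw == (1L << nbits) - 1;
    }

    /** Converts the raw unsigned integer read from the message to a physical value. */
    static double decode(long raw, int scale, int refVal) {
        return (raw + refVal) / Math.pow(10.0, scale);
    }

    public static void main(String[] args) {
        // Latitude: units=Degree scale=2 refVal=-9000 nbits=15
        long raw = 13000; // invented raw value
        if (!isMissing(raw, 15)) {
            System.out.println(decode(raw, 2, -9000)); // prints 40.0
        }
    }
}
```

Because the scale is fixed per table entry, the precision is fixed too: a latitude can never be stored more finely than 0.01 degree, which is the "no dynamic range" point above.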

3-1-32 : tableD
3-1-1 : tableD
0-1-1 : WMO_block_number units=Numeric scale=0 refVal=0 nbits=7
0-1-2 : WMO_station_number units=Numeric scale=0 refVal=0 nbits=10
0-2-1 : Type_of_station units=Code table scale=0 refVal=0 nbits=2
3-1-11 : tableD
0-4-1 : Year units=Year scale=0 refVal=0 nbits=12
0-4-2 : Month units=Month scale=0 refVal=0 nbits=4
0-4-3 : Day units=Day scale=0 refVal=0 nbits=6
3-1-12 : tableD
0-4-4 : Hour units=Hour scale=0 refVal=0 nbits=5
0-4-5 : Minute units=Minute scale=0 refVal=0 nbits=6
3-1-24 : tableD
0-5-2 : Latitude units=Degree scale=2 refVal=-9000 nbits=15
0-6-2 : Longitude units=Degree scale=2 refVal=-18000 nbits=16
0-7-1 : Height_of_station units=m scale=0 refVal=-400 nbits=15
0-1-18 : Short_station_or_site_name units=CCITT IA5 nchars=5
0-2-3 : Type_of_measuring_equipment_used units=Code table scale=0 refVal=0
2-1-132 : tableC-operators
2-2-130 : tableC-operators
0-2-121 : Mean_frequency units=Hz scale=-8 refVal=0 nbits=7
2-2-0 : tableC-operators
2-1-0 : tableC-operators
0-8-21 : Time_significance units=Code table scale=0 refVal=0 nbits=5
0-4-26 : Time_period_or_displacement units=Second scale=0 refVal=-4096 nbits=13
1-9-0 : replication
0-31-1 : Delayed_descriptor_replication_factor units=Numeric scale=0 refVal=0
0-7-6 : Height_above_station units=m scale=0 refVal=0 nbits=15
0-25-34 : Wind_profiler_quality_control_test_results units=Flag table scale=0
0-11-1 : Wind_direction units=Degree true scale=0 refVal=0 nbits=9
0-11-2 : Wind_speed units=m s-1 scale=1 refVal=0 nbits=12
2-1-127 : tableC-operators
0-11-50 : Standard_deviation_of_horizontal_wind_speed units=m s-1 scale=1 refVal=0 nbits=12
2-1-0 : tableC-operators
0-11-6 : w-component units=m s-1 scale=2 refVal=-4096 nbits=13
0-11-51 : Standard_deviation_of_vertical_wind_speed units=m s-1 scale=1 refVal=0 nbits=8
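Reading fields of the widths listed above means pulling an arbitrary number of bits from a byte array, most significant bit first. The sketch below is one way to do that. BitReader is a hypothetical name, not a class from any BUFR library, and the sample bytes encode invented values for 0-1-1 (7 bits) and 0-1-2 (10 bits).

```java
/** Minimal MSB-first bit reader for packed fields of up to 32 bits. */
final class BitReader {
    private final byte[] data;
    private long bitPos; // index of the next unread bit

    BitReader(byte[] data) {
        this.data = data;
    }

    /** Reads the next nbits bits as an unsigned value. */
    long read(int nbits) {
        if (nbits < 0 || nbits > 32) {
            throw new IllegalArgumentException("nbits out of range: " + nbits);
        }
        if (bitPos + nbits > data.length * 8L) {
            throw new IllegalStateException("read past end of data");
        }
        long value = 0;
        for (int i = 0; i < nbits; i++) {
            int octet = data[(int) (bitPos >>> 3)] & 0xFF;
            int bit = (octet >>> (7 - (int) (bitPos & 7))) & 1;
            value = (value << 1) | bit;
            bitPos++;
        }
        return value;
    }

    /** Number of bits consumed so far; useful for bit counting checks. */
    long position() {
        return bitPos;
    }

    public static void main(String[] args) {
        // 7-bit block number 72 followed by 10-bit station number 469, zero padded
        BitReader in = new BitReader(new byte[] {(byte) 0x90, (byte) 0xEA, (byte) 0x80});
        System.out.println(in.read(7));  // prints 72
        System.out.println(in.read(10)); // prints 469
    }
}
```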

BUFR problems (1)

BUFR format is too complex:

  • Looks like design by committee
  • Specification not exact
  • No coding/decoding reference implementation
  • Mixture of data model / data encoding / standard quantities

BUFR format is too simple:

  • Fixed length tables (64 categories, 256 entries) eventually run out
  • Fixed dynamic range (no exponents)
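The 64 x 256 limit follows from the descriptor layout: each F-X-Y descriptor is packed into 16 bits, with 2 bits for F, 6 bits for X (the category) and 8 bits for Y (the entry). A small sketch, with hypothetical class and method names:

```java
/** Packs and formats 16-bit BUFR descriptors: F (2 bits), X (6 bits), Y (8 bits). */
final class Descriptor {
    private Descriptor() {}

    static int pack(int f, int x, int y) {
        if (f < 0 || f > 3 || x < 0 || x > 63 || y < 0 || y > 255) {
            throw new IllegalArgumentException(f + "-" + x + "-" + y + " does not fit in 16 bits");
        }
        return (f << 14) | (x << 8) | y;
    }

    static String format(int packed) {
        return ((packed >>> 14) & 0x3) + "-" + ((packed >>> 8) & 0x3F) + "-" + (packed & 0xFF);
    }

    public static void main(String[] args) {
        int windSpeed = pack(0, 11, 2);
        System.out.println(windSpeed);         // prints 2818
        System.out.println(format(windSpeed)); // prints 0-11-2
    }
}
```

With only 6 bits for X and 8 for Y, a category that needs a 257th entry has nowhere to put it, which is why the fixed-length tables eventually run out.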
BUFR problems (2)

Table-driven parsing is brittle

  • No authoritative registry of local Tables
  • WMO global table is not machine-readable
  • Past versions are not available

It seems that:

  • Each provider has their own set of software and tables
  • Often legacy Fortran
BUFR Table mismatch
  • No way to be sure that the coder and decoder use the same table
  • If a table entry is missing, the message can't be decoded
  • If the wrong table entry is used
    • Bit size wrong: can usually be detected with bit counting
    • Scale/Factor/Name/Units wrong = “silent failure” (an expert/human may detect it)
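Bit counting can be sketched as a comparison between the number of bits the tables say the data should occupy and the number of bits actually present. The example below is a simplified illustration only: it assumes uncompressed data with a fixed layout (no delayed replication), and the class name, the padding allowance and the sample widths are all hypothetical.

```java
/** Plausibility check: do the table bit widths account for the data section size? */
final class BitCountCheck {
    private BitCountCheck() {}

    /** Total bits expected for nObs observations with the given field widths. */
    static long expectedBits(int[] nbitsPerField, int nObs) {
        long perObs = 0;
        for (int nbits : nbitsPerField) {
            perObs += nbits;
        }
        return perObs * nObs;
    }

    /** True if the data fits and leaves no more than maxPadBits of trailing padding. */
    static boolean plausible(int[] nbitsPerField, int nObs, int dataBytes, int maxPadBits) {
        long expected = expectedBits(nbitsPerField, nObs);
        long available = dataBytes * 8L;
        return expected <= available && available - expected <= maxPadBits;
    }

    public static void main(String[] args) {
        int[] rightTable = {7, 10, 2}; // widths of 0-1-1, 0-1-2, 0-2-1
        int[] wrongTable = {7, 10, 4}; // a local table with a different width
        System.out.println(plausible(rightTable, 4, 10, 15)); // prints true
        System.out.println(plausible(wrongTable, 4, 10, 15)); // prints false
    }
}
```

A check like this only catches width errors. A wrong scale, reference value, name or unit leaves the bit count unchanged, which is the silent failure case.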
Table mismatches

Each archive center probably has solved this coder/decoder matching internally

  • NCEP encodes the tables in BUFR messages, and stores in the archive files
  • Others???
BUFR progress
  • As of 9/2008, WMO decided
    • Will make tables available in Microsoft Access format
    • Clarified versioning (sort of)
  • Progress in detecting/fixing encoding errors
  • Unidata nudge: email group, validation web site
  • BritMet effort to map BUFR to ISO, define XML version of tables
BUFR data on IDD
  • 177 K messages / day
  • 6.7 M observations / day
  • 1.2 Gbytes / day
  • Avg message size = 7227 bytes
  • Avg obs/message = 37
  • Unique wmo Headers = 555
  • Unique dds = 125
  • wmoHeaders with multiple dds = 61
Originating Stations
  • CWAO Montreal
  • EDZW Offenbach (RSMC) (78.0)
  • EGRR UK Meteorological Office Bracknell (RSMC) (74.0)
  • EKMI Copenhagen (94.0)
  • EUMG EUMETSAT Operation Centre (254.0)
  • EUSR
  • KBOU The NOAA Forecast Systems Laboratory (59.0)
  • KKCI US National Weather Service (NCEP) (7.0)
  • KNES US NOAA/NESDIS (160.0)
  • KWBC US National Weather Service (NCEP) (7.0)
  • KWNH US National Weather Service (NCEP)
  • KWNO NCEP / Central Operations (7.3)
  • LFPW Toulouse (RSMC) (85.0)
  • RJTD Tokyo (RSMC), Japan Meteorological Agency (34.0)
  • RKSL Seoul (40.0)
  • SBBR Brazilian Space Agency - INPE (46.0)
  • VHHH Hong Kong (110.0)
Data heterogeneity
  • Each BUFR record could in principle have its own data schema: 2M database schemas!
  • In reality, there is a much smaller number of groups of homogeneous records
    • WMO headers are not sufficient
    • Can’t use pqact FILE by matching the header
    • Only the dds themselves are reliable
    • So we must crack the message to reliably group the records
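Grouping by dds rather than by header can be sketched with a map keyed on the descriptor list itself. This is an illustration under assumptions: descriptors are represented as packed integers, DdsGrouper is a hypothetical class, and the WMO headers in the usage are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Groups messages by their full list of data descriptors (dds). */
final class DdsGrouper {
    // List equality is element-wise, so an immutable copy of the dds works as a map key.
    private final Map<List<Integer>, List<String>> groups = new HashMap<>();

    /** Records that a message with this WMO header used this descriptor list. */
    void add(String wmoHeader, List<Integer> dds) {
        groups.computeIfAbsent(List.copyOf(dds), key -> new ArrayList<>()).add(wmoHeader);
    }

    /** Number of distinct descriptor lists seen so far. */
    int groupCount() {
        return groups.size();
    }

    /** Headers of all messages that used exactly this descriptor list. */
    List<String> headersFor(List<Integer> dds) {
        return groups.getOrDefault(dds, List.of());
    }

    public static void main(String[] args) {
        DdsGrouper grouper = new DdsGrouper();
        grouper.add("IUPT01 KBOU", List.of(49440, 2818)); // invented headers and dds
        grouper.add("IUPT02 KBOU", List.of(49440, 2818));
        grouper.add("IUPT01 KBOU", List.of(49440, 2817));
        System.out.println(grouper.groupCount()); // prints 2
    }
}
```

The third message shares a header with the first but lands in a different group, which is the case that defeats header matching in pqact.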
Overview
  • Get messages from LDM pipe
  • Process in memory, write out to disk
  • Must be very fast, no blocking I/O
  • Use java.util.concurrent library for multithreading
LDM pqact

# Get all BUFR messages from HRS
HRS	^[IJ]
	PIPE	-metadata	java -jar ldm.jar

Step 1 and 2: Extract and dispatch
  • 1. extract: pipeReadingThread (1) (io)
    • Reads the LDM stream from the pipe
    • Breaks it into separate messages
    • Puts each message on the message queue, an ArrayBlockingQueue<MessageTask>
  • 2. dispatch: messageThread (1?) (cpu)
    • Blocking take from the message queue
    • Reads the contents
    • Classifies the type by dds
    • Dispatches to the MessType processor for that type
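Steps 1 and 2 can be sketched as a producer that splits the stream into messages and puts them on a bounded queue. The sketch assumes BUFR edition 2 or later, where the "BUFR" marker is followed by a 3-byte total message length. It also ignores the metadata that pqact's -metadata option prepends. BufrExtractor and the body of MessageTask are hypothetical; only the ArrayBlockingQueue<MessageTask> type comes from the slide.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;

/** Hypothetical queue element: the raw bytes of one BUFR message. */
final class MessageTask {
    final byte[] bytes;

    MessageTask(byte[] bytes) {
        this.bytes = bytes;
    }
}

final class BufrExtractor {
    private static final byte[] MAGIC = "BUFR".getBytes(StandardCharsets.US_ASCII);

    private BufrExtractor() {}

    /** Returns the next complete message, or null at end of stream. */
    static byte[] nextMessage(InputStream in) throws IOException {
        int matched = 0;
        while (matched < MAGIC.length) { // skip anything before the next "BUFR"
            int b = in.read();
            if (b < 0) {
                return null;
            }
            if (b == MAGIC[matched]) {
                matched++;
            } else {
                matched = (b == MAGIC[0]) ? 1 : 0;
            }
        }
        DataInputStream din = new DataInputStream(in);
        int b1 = din.readUnsignedByte();
        int b2 = din.readUnsignedByte();
        int b3 = din.readUnsignedByte();
        int length = (b1 << 16) | (b2 << 8) | b3; // total length, including "BUFR"
        if (length < 8) {
            throw new IOException("implausible BUFR length " + length);
        }
        byte[] message = new byte[length];
        System.arraycopy(MAGIC, 0, message, 0, MAGIC.length);
        message[4] = (byte) b1;
        message[5] = (byte) b2;
        message[6] = (byte) b3;
        din.readFully(message, 7, length - 7);
        return message;
    }

    /** Step 1: read messages until end of stream; put() blocks while the queue is full. */
    static void pump(InputStream in, BlockingQueue<MessageTask> queue)
            throws IOException, InterruptedException {
        byte[] message;
        while ((message = nextMessage(in)) != null) {
            queue.put(new MessageTask(message));
        }
    }
}
```

In a decoder started by pqact, the reading thread would call pump(System.in, queue) with an ArrayBlockingQueue, while the message thread loops on queue.take() and classifies each message by its dds. The bounded queue gives back-pressure if the CPU side falls behind.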

Step 3: Write message
  • messageThread (1) (cpu)
    • Each MessType processor dispatches the message to a MessageWriter
    • Submits the MessageWriter to an Executor, CompletionService<Result>
  • MessageWriter implements Callable<Result>
    • ConcurrentLinkedQueue<Message>
    • Owns a file, e.g. 2008-09-11.bufr
    • Result call() { write message(s) }
  • 3. write: threadPool (n) (io)
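A MessageWriter along the lines of step 3 might look like the sketch below. The names MessageWriter, Result, Callable and ConcurrentLinkedQueue come from the slide. The fields of Result, the use of raw byte arrays instead of a Message class, and the append-on-each-call file handling are assumptions made for the example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical result: which file was written and how many messages went into it. */
final class Result {
    final Path file;
    final int count;

    Result(Path file, int count) {
        this.file = file;
        this.count = count;
    }
}

/** Owns one output file; the cpu thread queues messages, a pool thread writes them. */
final class MessageWriter implements Callable<Result> {
    private final Path file;
    private final ConcurrentLinkedQueue<byte[]> pending = new ConcurrentLinkedQueue<>();

    MessageWriter(Path file) {
        this.file = file;
    }

    /** Called by the message thread; never blocks. */
    void add(byte[] message) {
        pending.add(message);
    }

    /** Drains the queue to the file. Synchronized so overlapping submits cannot interleave. */
    @Override
    public synchronized Result call() throws IOException {
        int count = 0;
        try (OutputStream out = Files.newOutputStream(file,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            byte[] message;
            while ((message = pending.poll()) != null) {
                out.write(message);
                count++;
            }
        }
        return new Result(file, count);
    }
}
```

The message thread would wrap a fixed thread pool in an ExecutorCompletionService<Result>, call submit(writer) after adding messages, and collect finished writes with take(). Keeping the blocking file I/O on pool threads is what lets the message thread stay CPU bound.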

Step 4: Index
  • MessageWriter implements Callable<IndexerTask>
    • call() writes the message(s) and returns an IndexerTask
  • Executor results go on a Queue<Future<IndexerTask>>
  • indexThread (1?) (io)
    • Blocking take from the queue
    • Add to Index
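The index thread of step 4 can be sketched as a loop that takes each Future from the queue, waits for the write behind it to finish, and records the result. IndexerTask and the Queue<Future<IndexerTask>> type are from the slide. The fields of IndexerTask, the in-memory sorted map standing in for a real index, and the STOP sentinel are assumptions for illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Hypothetical index entry: a lookup key plus a pointer into a data file. */
final class IndexerTask {
    final String key;
    final String file;
    final long offset;

    IndexerTask(String key, String file, long offset) {
        this.key = key;
        this.file = file;
        this.offset = offset;
    }
}

final class IndexLoop implements Runnable {
    /** Sentinel that tells the loop to exit (an assumption, not from the slides). */
    static final IndexerTask STOP = new IndexerTask("", "", -1);

    private final BlockingQueue<Future<IndexerTask>> queue;
    /** Stand-in for the real index; sorted like a 1D B-tree index would be. */
    final ConcurrentSkipListMap<String, IndexerTask> index = new ConcurrentSkipListMap<>();
    int failedWrites;

    IndexLoop(BlockingQueue<Future<IndexerTask>> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        while (true) {
            IndexerTask task;
            try {
                task = queue.take().get(); // blocking take, then wait for the write
            } catch (ExecutionException e) {
                failedWrites++;            // the write failed; nothing to index
                continue;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            if (task == STOP) {
                return;
            }
            index.put(task.key, task);
        }
    }
}
```

Because index entries are only added after the Future completes, the index never points at bytes that have not been written yet.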

Step 5: Cleanup
  • messageThread (1) (cpu)
    • MessType processor dispatches to a MessageWriter (ConcurrentHashMap?)
    • Submits it to the Executor, CompletionService<Result>
  • MessageWriter implements Callable<Result>
    • ConcurrentLinkedQueue<Message>
    • Owns file 2008-09-11.bufr
  • cleanupThread (1) (io)
    • Close files
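One way the cleanup step could work, assuming the open writers do live in a ConcurrentHashMap as the slide tentatively suggests, is to stamp each entry with its last use and let the cleanup thread close entries that have been idle too long. Everything in this sketch (the OpenFiles class, the idle-time rule, the method names) is hypothetical.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Registry of open file handles shared by the message thread and the cleanup thread. */
final class OpenFiles {
    private static final class Entry {
        final Closeable handle;
        final long lastUsedMillis;

        Entry(Closeable handle, long lastUsedMillis) {
            this.handle = handle;
            this.lastUsedMillis = lastUsedMillis;
        }
    }

    private final ConcurrentHashMap<String, Entry> open = new ConcurrentHashMap<>();

    /** Message thread: get the handle for a file name, opening it on first use. */
    Closeable touch(String name, long nowMillis, Function<String, Closeable> opener) {
        return open.compute(name, (key, entry) ->
                new Entry(entry == null ? opener.apply(key) : entry.handle, nowMillis)).handle;
    }

    /** Cleanup thread: close and forget handles idle for more than maxIdleMillis. */
    int closeIdle(long nowMillis, long maxIdleMillis) {
        int[] closed = {0};
        for (String name : open.keySet()) {
            // computeIfPresent is atomic per key, so a concurrent touch() cannot slip in between.
            open.computeIfPresent(name, (key, entry) -> {
                if (nowMillis - entry.lastUsedMillis <= maxIdleMillis) {
                    return entry;
                }
                try {
                    entry.handle.close();
                } catch (IOException e) {
                    // sketch only: a real decoder would log this
                }
                closed[0]++;
                return null; // returning null removes the mapping
            });
        }
        return closed[0];
    }

    int size() {
        return open.size();
    }
}
```

The per-key atomicity matters here: without it, the cleanup thread could close a file at the same moment the message thread hands it a new message.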

Step 6: Scour
  • scourThread (1) (io)
    • Remove from Index
    • Delete file
  • indexThread (1?) (io)
    • Blocking take from the Executor's Queue<Future<IndexerTask>>
    • Add to Index
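Scouring whole files can be sketched as a directory pass that removes index entries first and then deletes the file. The sketch assumes files are named by day, as in 2008-09-11.bufr, and that retention is a simple number of days. Scourer, keepDays and the removeFromIndex callback are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Deletes whole data files older than a retention period. */
final class Scourer {
    private static final String SUFFIX = ".bufr";

    private Scourer() {}

    /** Returns the files that were removed from the index and deleted. */
    static List<Path> scour(Path dir, LocalDate today, int keepDays,
                            Consumer<Path> removeFromIndex) throws IOException {
        LocalDate cutoff = today.minusDays(keepDays);
        List<Path> deleted = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*" + SUFFIX)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                LocalDate day;
                try {
                    day = LocalDate.parse(name.substring(0, name.length() - SUFFIX.length()));
                } catch (DateTimeParseException e) {
                    continue; // not one of our dated files; leave it alone
                }
                if (day.isBefore(cutoff)) {
                    removeFromIndex.accept(file); // index first, so it never points at a missing file
                    Files.delete(file);
                    deleted.add(file);
                }
            }
        }
        return deleted;
    }
}
```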

Why isn't scouring part of LDM?
  • LDM is message oriented; it doesn’t know the contents
  • Decoders know about the contents of the messages
  • Put scouring into the decoders
Threads
  • Read from LDM pipe
  • Read message content and dispatch
  • Write Messages to files
  • Index
  • Cleanup / close MessageWriters
  • Scour
Design prejudices
  • Keep data in original format
    • Data reliability
  • Aggregate homogeneous data into files
    • Data locality
  • Create external indices, with pointers into the files
    • Data recovery
  • Scour entire files, not parts of a file
Indexing
  • Need 1D indexes (B-trees)
  • Want 2D indices for spatial data
    • Rtree (areas)
    • Quadtree (points)
  • Index selectivity: seek vs. scan
    • Sequential access ~100x faster than random access
    • Index must select < 1% data to be useful
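The 1% rule follows from a simple cost comparison. If a sequential scan costs one unit per record and an index lookup costs about 100 units per selected record, the index only wins when fewer than 1 in 100 records are selected. A tiny sketch of that model, with hypothetical names and the 100x ratio taken from the bullet above:

```java
/** Back-of-envelope model of seek vs. scan cost. */
final class Selectivity {
    private Selectivity() {}

    /**
     * True if fetching the selected fraction by random access is cheaper than one full scan.
     * Costs are per record, relative to a sequential read costing 1.
     */
    static boolean indexWins(double selectedFraction, double randomAccessPenalty) {
        return selectedFraction * randomAccessPenalty < 1.0;
    }

    public static void main(String[] args) {
        System.out.println(indexWins(0.005, 100.0)); // prints true
        System.out.println(indexWins(0.02, 100.0));  // prints false
    }
}
```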
Possible Open Source Indexers
  • Berkeley DB Java edition
    • Btree, very fast, no SQL
    • Dual GPL/commercial license
  • Relational databases “SQL on Btrees”
    • Java (Derby, H2, many others)
    • C (MySQL, Postgres)
  • Object databases
    • Db4o (dual GPL/commercial license)
High performance
  • Embeddable in the decoder
    • Same process space
    • Not client/server
  • Access from server answering queries
    • Multiprocess access or client/server
    • Bdb must sync periodically (perf?)
  • Transactions probably too slow
    • Need recovery strategy
Test Assumptions
  • Process IDD messages in memory (vs) write to file then postprocess
  • Store in files – add external indexing (vs) store data in database
  • One database vs many?
  • Embedded vs client/server
  • SQL vs specific queries
    • SQL allows ad-hoc queries - performance?
  • 2D indexing
Conclusions
  • Test/time various indexing strategies and technologies
    • Production
    • Scouring
  • Eventually part of IDD/TDS
    • Must be easy to maintain (Java)
    • Scale to large archives / data volumes