1 / 51

Adding Custom Types to PostgreSQL

Adding Custom Types to PostgreSQL. Slides 2 to 21 - Introduction to custom types Slides 22 to 30 - Memory Alignment issues Slides 31 to 51 - Details of the LWGEOM type. By David Blasby, Refractions Reserach. Extending Postgresql with Custom Datatypes. At the end of this course you will:

esma
Download Presentation

Adding Custom Types to PostgreSQL

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Adding Custom Types to PostgreSQL Slides 2 to 21 - Introduction to custom types Slides 22 to 30 - Memory Alignment issues Slides 31 to 51 - Details of the LWGEOM type By David Blasby, Refractions Reserach

  2. Extending Postgresql with Custom Datatypes • At the end of this course you will: • Be able to explain how postgresql stores data • Create fixed length and variable length types • Add “C” functions required to create a type • Add “C” functions needed to manipulate a type • Test and compile resources needed for a type • Debug your “C” code

  3. Two types of Postgresql types • Fixed-Length • Every instance requires exactly the same amount of space to store • Examples: simple numbers, date, ip address • Variable-Length • Different instances require different amounts of space to store • Examples: text, lists, geometries

  4. Simple Example - list of 4 integers Fixed Length (size of 16 bytes): Variable Length (size of 24 bytes): Total size of this object (includes this integer)

  5. Postgresql always stores variables as a contiguous set of bytes. This is usually refereed to as its “serialized form”. The difference between fixed length and variable length types is that the variable length type’s first four bytes contains an integer saying how long the structure is. The size of a fixed length type is known.

  6. Postgresql types CANNOT contain pointers //for variable length types char *postgresql_data; int length = *((int*) postgresql_data); char *copy_of_postgresql_data; copy_of_postgresql_data = palloc( length); memcpy(copy_of_postgresql_data, postgresql_data, length); pfree(postgresql_data); This block of code finds the length of the postgresql variable length type (it’s the first 4 bytes of the structure). It then makes a copy of it. This must always be a valid object. Postgresql may move your data around on you. NEVER PUT A MEMORY POINTER IN YOUR TYPE

  7. SQL command to create types CREATE TYPE name ( INPUT = input_function, OUTPUT = output_function [ , RECEIVE = receive_function ] [ , SEND = send_function ] [ , INTERNALLENGTH = { internallength | VARIABLE } ] [ , PASSEDBYVALUE ] [ , ALIGNMENT = alignment ] [ , STORAGE = storage ] [ , DEFAULT = default ] [ , ELEMENT = element ] [ , DELIMITER = delimiter ] ); input_function - converts text to this type output_function - converts this type to text alignment - machine byte alignment INTERNALLENGTH - fixed length size or ‘VARIABLE’

  8. Example - 3d/2d bounding box; fixed length CREATE TYPE bounding_box_fixed ( INPUT = bounding_box_fixed_in, OUTPUT = bounding_box_fixed_out, INTERNALLENGTH = 48, ALIGNMENT = double ); (xmax, ymax, zmax) (xmin, ymin, zmin)

  9. Variable length 2d/3d bounding box CREATE TYPE bounding_box_variable ( INPUT = bounding_box_variable_in, OUTPUT = bounding_box_variable_out, INTERNALLENGTH = VARIABLE , ALIGNMENT = double ); Two morphs - 2d and 3d:

  10. //convert a null-terminated string into a bounding_box_fixed // input should look exactly like: // 'BBOX(xmin ymin, xmax ymax)' // OR 'BBOX(xmin ymin zmin, xmax ymax zmax)' // returns a bounding_box_fixed PG_FUNCTION_INFO_V1(bounding_box_fixed_in); Datum bounding_box_fixed_in(PG_FUNCTION_ARGS) { char *str = PG_GETARG_CSTRING(0); bounding_box_fixed result; int nmatching; //for sscaf testing double xmin, ymin, xmax,ymax, zmin, zmax; zmin = zmax = 0; // defaults //assume 3d nmatching = sscanf(str, "BBOX(%lf %lf %lf, %lf %lf %lf)", &xmin, &ymin, &zmin, &xmax, &ymax, &zmax ); if (nmatching != 6) { //try 2d nmatching = sscanf(str, "BBOX(%lf %lf, %lf %lf)", &xmin, &ymin, &xmax, &ymax ); if (nmatching != 4) { elog(ERROR,"bounding_box_fixed_in: couldnt parse input"); PG_RETURN_NULL(); //never get here } } result = (bounding_box_fixed) palloc(48); // fixed length memcpy(result + 0, &xmin, sizeof(double)); memcpy(result + 8, &ymin, sizeof(double)); memcpy(result + 16, &xmax, sizeof(double)); memcpy(result + 24, &ymax, sizeof(double)); memcpy(result + 32, &zmin, sizeof(double)); memcpy(result + 40, &zmax, sizeof(double)); PG_RETURN_POINTER( result); } Macro to get postgresql arguments Postgresql macros for defining functions Return error Return object Postgresql memory management NEVER USE malloc()

  11. //takes a bounding_box_fixed and returns a null-terminated string // 'BBOX(<xmin> <ymin> <zmin>,<xmax> <ymax> <zmax>)' //NOTE: this will return too much memory - should use better string // handling PG_FUNCTION_INFO_V1(bounding_box_fixed_out); Datum bounding_box_fixed_out(PG_FUNCTION_ARGS) { bounding_box_fixed bbox = PG_GETARG_POINTER(0); char *result; double xmin,ymin,zmin, xmax,ymax,zmax; xmin = *((double *) (bbox + 0)); ymin = *((double *) (bbox + 8)); xmax = *((double *) (bbox + 16)); ymax = *((double *) (bbox + 24)); zmin = *((double *) (bbox + 32)); zmax = *((double *) (bbox + 40)); result = palloc(20 + 6*34); sprintf(result,"BBOX(%.15g %.15g %.15g,%.15g %.15g %.15g)", xmin, ymin, zmin, xmax, ymax, zmax ); PG_RETURN_CSTRING(result); } Postgresql macros for defining functions Macro to get postgresql arguments Postgresql memory management NEVER USE malloc() we estimate the required size Return text

  12. CREATE FUNCTION bounding_box_fixed_in(cstring) RETURNS bounding_box_fixed AS ’<location of .so>' LANGUAGE 'C' WITH (isstrict); CREATE FUNCTION bounding_box_fixed_out(bounding_box_fixed) RETURNS cstring AS ’<location of .so>' LANGUAGE 'C' WITH (isstrict); CREATE TYPE bounding_box_fixed ( alignment = double, internallength = 48, input = bounding_box_fixed_in, output = bounding_box_fixed_out ); Your ‘C’ code will becompiled to a .so file Convert string to bounding box Covert bounding boxto string Actual definition of type isstrict - means ‘return NULL if any of the arguments are NULL’

  13. dblasby=# select 'BBOX(0 0, 10 10)'::bounding_box_fixed; bounding_box_fixed --------------------- BBOX(0 0 0,10 10 0) (1 row) dblasby=# select 'BBOX(1 2 3, 4 5 6)'::bounding_box_fixed; bounding_box_fixed -------------------- BBOX(1 2 3,4 5 6) (1 row) dblasby=# select 'BBOX(10 10, 0 0)'::bounding_box_fixed; bounding_box_fixed --------------------- BBOX(10 10 0,0 0 0) (1 row) dblasby=# select 'BBOX(1 2 5 6)'::bounding_box_fixed; ERROR: bounding_box_fixed_in: couldnt parse input Oops, should have caught this!

  14. // return a double representing xmin of the bounding box. PG_FUNCTION_INFO_V1(bounding_box_fixed_xmin); Datum bounding_box_fixed_xmin(PG_FUNCTION_ARGS) { bounding_box_fixed bbox = PG_GETARG_POINTER(0); PG_RETURN_FLOAT8( *( (double *) bbox) ) ; } CREATE FUNCTION xmin(bounding_box_fixed) RETURNS FLOAT8 AS '@MODULE_FILENAME@' ,'bounding_box_fixed_xmin' LANGUAGE 'C' WITH (isstrict,iscachable); dblasby=# select xmin('BBOX(1 2 3, 4 5 6)'::bounding_box_fixed); xmin ------ 1 (1 row)

  15. PG_FUNCTION_INFO_V1(bounding_box_var_in); Datum bounding_box_var_in(PG_FUNCTION_ARGS) { char *str = PG_GETARG_CSTRING(0); bounding_box_fixed result; int nmatching; double xmin, ymin, xmax,ymax, zmin, zmax; char is3d; int length; zmin = zmax = 0; // defaults //assume 3d is3d = TRUE; nmatching = sscanf(str, "BBOX(%lf %lf %lf,%lf %lf %lf)", &xmin, &ymin, &zmin, &xmax, &ymax, &zmax ); if (nmatching != 6) { is3d = FALSE; //try 2d nmatching = sscanf(str, "BBOX(%lf %lf,%lf %lf)", &xmin, &ymin, &xmax, &ymax ); if (nmatching != 4) { elog(ERROR,"bounding_box_var_in: couldnt parse input"); PG_RETURN_NULL();//never get here } } if (is3d) length = 53; else length = 37; result = palloc(length); memcpy(result, &length, sizeof(int32)); result[4] = is3d; memcpy(result + 5, &xmin, sizeof(double)); memcpy(result + 13, &ymin, sizeof(double)); memcpy(result + 21, &xmax, sizeof(double)); memcpy(result + 29, &ymax, sizeof(double)); if (is3d) { memcpy(result + 37, &zmin, sizeof(double)); memcpy(result + 45, &zmax, sizeof(double)); } PG_RETURN_POINTER( result); } Two different sizes Add variable length info Only fill in if its 3d

  16. Homework: • add xmin(), ymin(), zmin(), xmax(), ymax(), zmax() • add area() • add volume() • add more checking to bounding_box_fixed_in() function • add a constructor: • construct_fixed_bbox(xmin,ymin,zmin,xmax,ymax,zmax) • construct_fixed_bbox(xmin,ymin,xmax,ymax) • add equal(<bbox>, <bbox>) • add bbox_gt(<bbox>, <bbox>) • add bbox_lt(<bbox>, <bbox>) • finish off the variable length version For a>b: return true if a.xmax > b.xmax if (a.xmax == b.xmax) return a.ymax > b.ymax

  17. Compiling • 1. create_type.c -- “C” functions • 2. create_type.sql.in -- SQL to create functions/types • 3. Makefile • converts the .c to a .so • converts the .sql.in to a .sql Good luck writing your own makefile! Just fills in the name of the .so • I’ve already set you up with these in: • L:\Refractions\Projects\PostGIS\training\pati • L:\Refractions\Projects\PostGIS\training\kevin You may have to setup an environment variable: setenv PGSQL_SRC /usr/local/src/postgresql-7.3.5/

  18. Running • Use the development server on port 8765 • psql -p 8765 <database name> • psql -p 8765 ptozer • psql -p 8765 kneufeld You’re going to crash the server a lot - don’t screw other people cd /data1/Refractions/Projects/PostGIS/training/dave make psql -p 8765 dblasby dblasby=# \i create_type.sql Or just copy and paste

  19. Debugging Debugging is difficult because your code is quite far removed from the server. … elog(NOTICE, “gonna do the multiply”); elog(NOTICE,”%i * %i = %i”, a, b, a*b); … Use elog() just like sprintf() • For more detailed debugging you’ll either have to: • run the server with valgrind • valgrind --num-callers=10 /opt/pgsql73/bin/postmaster -D /raid/pgdata/pgsql_devel_73 • run the server, then attach to it with gdb Login as postgres to do this. Shutdown the server first. Good luck figuring this out!!

  20. Don’t know what function to call? • fmgr.h defines most of the postgresql macros: • PG_RETURN_*() • PG_GETARG_*() • /usr/local/src/postgresql-7.3.5/src/include/fmgr.h • Lots of examples in postgis: • check it out of CVS (details on the postgis.org site) • Other examples in the postgresql contrib/ directory: • /usr/local/src/postgresql-7.3.5/contrib/ • Online documents available: • http://hydra/pgsql74/

  21. Operators Want to be able to do queries like:SELECT … WHERE my_bbox = ‘…’; SELECT … WHERE my_bbox < ‘…’; SELECT … WHERE … ORDER BY <bbox column>; CREATE OPERATOR < ( LEFTARG = bounding_box_fixed, RIGHTARG = bounding_box_fixed, PROCEDURE = bbox_lt, COMMUTATOR = '>', NEGATOR = '>=', RESTRICT = contsel, JOIN = contjoinsel ); You already wrote this as part of your homework These are forplanning estimates • You’ll need to define these operators: • > • >= • < • <= • =

  22. Memory Alignment typdef struct { char isfemale; //0=male, 1 = female double salary; //yearly salery int32 age; // in years } PERSONTYPE; You might think this structure is 13 bytes long (1 byte for isfemale, 8 for salary, and 4 for age). In most applications, you never have to worry about alignment because the compiler takes care of it for you.

  23. Intel representation typdef struct { char isfemale; //0=male, 1 = female char filler1; char filler2; char filler3; double salary; //yearly salery int32 age; // in years } PERSONTYPE; The Intel representation takes up 16 bytes. Intel requires that doubles start on “4-byte boundaries” - this means the actual memory address is evenly divisible by 4. Since a word is 4 bytes long, this is also called “word-aligned”.

  24. SPARC/PowerPC representation typdef struct { char isfemale; //0=male, 1 = female char[7] filler; double salary; //yearly salery int32 age; // in years } PERSONTYPE; The SPARC representation takes up 20 bytes. SPARC requires that doubles start on “8-byte boundaries” - this means the actual memory address is evenly divisible by 8. Since a word is 4 bytes long, this is also called “double-word-aligned”.

  25. “C” representation Postgresql representation typdef struct { char isfemale; char filler1; char filler2; char filler3; double salary; int32 age; } PERSONTYPE; In the ‘CREATE TYPE’ statement, we specified ‘double’ alignment, so this is on a word (4 byte) boundary on an Intel machine. On a SPARC this will be on an 8 byte boundary. These are definitely not correctly aligned! When you try to directly access salary inside the postgresql representation, the machine will panic and segfault. Basically, the computer will be executing a ‘get two words starting from this memory address’ command. Since you asked for memory starting from an un-aligned location, the CPU will not know how to do it.

  26. memcpy() is your friend because it doesn’t worry about memory alignment! char isfemale; double mySalery; // compiler makes sure this is aligned int age; // compiler makes sure this is aligned char *postgresql_data; isfemale = postgresql_data[0]; // chars are not aligned // copy the unaligned data to an aligned location memcpy(&mySalery, postgresql_data+1, 8); memcpy(&age, postgresql_data + 9, 4); // this crashes because you’re grabbing a double from an // unaligned location. mySalery = *((double *) postgresql_data+1);

  27. Really advanced stuff - making your postgresql representation equivalent to the “C” representation! Postgresql representation typdef struct { char isfemale; char filler1; char filler2; char filler3; int32 age; double salary; } PERSONTYPE; This is exactly equivalent to the “C” version, for both Intel and SPARC machines. NOTE: there’s 3 wasted bytes Note: switched age & salary and I’ve explicitly put in the filler. These are correctly aligned on both Intel and SPARC machines.

  28. When the postgresql serialized form and “C” form are exactly the same, you can simplify your code. Char *postgresql_data; PERSONTYPE *aPerson; aPerson = (PERSONTYPE *) postgresql_data; elog(NOTICE,”person.isfemale = %i”, (int) aPerson->isfemale); elog(NOTICE,”person.age = %i”, aPerson->age); elog(NOTICE,”person.salary = %g”, aPerson->salary); This is difficult to do, and often wastes space (a few bytes * 100,000,000 rows is a lot) but your code is much easier to write, understand and maintain.

  29. Making things easier for yourself by making an ADT Abstract Data Type - the non-object oriented equivalent of a class. double bbox_get_zmin(char *serialized_bbox) { double zmin; if (serialized_bbox == NULL) elog(ERROR,”bbox_get_zmin -- got NULL input”); memcpy(&zmin, serialized_bbox +32, 8); return zmin; } void bbox_set_zmin(char *serialized_bbox, double zmin) { if (serialized_bbox != NULL) memcpy(serialized_bbox +32,&zmin, 8); }

  30. Debugging with valgrind 1. Make sure no one else is using/connected to the database 2. Shut down the database: /etc/init.d/pgsql_devel_73 stop 3. Log-on as the postgres user su - postgres 4. Run the postmaster under valgrind /opt/valgrind/bin/valgrind --num-callers=10 /opt/pgsql73/bin/postmaster -D /raid/pgdata/pgsql_devel_73 5. When finished, control-c the step #4 command to kill postgresql 6. Re-start the database normally /etc/init.d/pgsql_devel_73 start If this takes forever, it means someone’s connected to the database. It will ask you for the postgres password Valgrind simulates a linux computer and runs postgresql in it. When the simulated machine does something its not supposed to do (like an illegal memory access) valgrind will immediately report the error and where in your program it occurred.

  31. Light Weight Geometry (LWGEOM) Introduction to LWGEOM Support types for LWGEOM Using simple LWGEOMs Using MULTI* LWGEOMs

  32. Introduction to LWGEOM • OGC compliant datatype: • POINT • LINESTRING • POLYGON • MULTIPOINT (set of points) • MULTILINESTRING (set of linestrings) • MULTIPOLYGON (set of polygons) • GEOMETRYCOLLECTION (set of geometries) • OGC only specifies 2D geometries. LWGEOM supports • 2D, 3D, and 4D (x,y,z,m) geometries. The LWGEOM object is very similar to the PostGIS geometry object, but is optimized for space.

  33. The serialized form of the LWGEOM type is very similar to the OGC WKB type (fully specified in the SF SQL spec). • Indexing is based on a FLOAT4-based bounding box instead of a FLOAT8-based bounding box. This makes indexes much smaller and more likely to be cached. • All the redundant information has been removed • Source code is based on a LWGEOM API. This should make learning, maintenance, and change management easier.

  34. Support Types POINT2D POINT3D POINT4D BOX2DFLOAT4 POINTARRAY LWPOINT LWLINE LWPOLY 8bit type

  35. Points PostGIS held all points as 3D points. This wasted space for people using 2D geometries, and didn’t support 4D points. LWGEOM supports 2D, 3D and 4D points. We’ll see how they are abstracted and used later. If you were to convert a 2D points to a 3D point, the z coordinate would be either 0 or NaN. This implementation detail hasn’t been decided yet. typedef struct { double x, y; } POINT2D; typedef struct { double x,y,z; } POINT3D; typedef struct { double x,y,z,m; } POINT4D;

  36. BOX2DFLOAT4 typedef struct { float xmin; float ymin; float xmax; float ymax; } BOX2DFLOAT4; Since a normal 4-byte float does not have the same precision as an 8-byte double, we slightly enlarge the bounding box during a conversion. This ensures that we’re not losing anything important in the double-to-float conversion. For example, the numbers closest to 10,000,000 in float4 are: 10000001.00000000000000000000 10000000.01862645149230957031 9999999.00000000000000000000

  37. POINTARRAY POINTARRAY abstracts a set of points. Its responsible for conversions between 2D, 3D, and 4D and handles memory alignment issues. typedef struct { char *serialized_pointlist; // probably missaligned. 2d or 3d. points to a double char dim; // see dim definition (0=2d,1=3d,2=4d) uint32 npoints; // number of points } POINTARRAY; Returns a point extern POINT2D getPoint4d(POINTARRAY *pa, int n); extern void getPoint4d_p(POINTARRAY *pa, int n, char *point); Modifies memory If you ask for a Point4D from a 2D POINTARRAY, you’ll get the (x,y) from the actual POINTARRAY. The (z,m) coordinates will be set with either 0 or NaN (implementation is undefined right now).

  38. LWPOINT - Represents a single point. typedef struct { char dims; // see dim definition (0=2d,1=3d,2=4d) int SRID; // spatial ref sys POINTARRAY *point; // hide 2d/3d/4d (this will be an array of 1 point) } LWPOINT; // "light-weight point" -1 if undefined There are several support functions that will convert this into something that can be serialized, a constructor, a de-serializer, and a few utility functions. LWGEOM LWPOINT LWGEOM De-serialize Serialize In memory, easy-to-use In postgresql In postgresql

  39. LWLINE - represents a connected set of points typedef struct { char dims; // see dim definition (0=2d,1=3d,2=4d) int SRID; // spatial ref sys -1=none POINTARRAY *points; // set of POINTs } LWLINE; //"light-weight line" There are several support functions that will convert this into something that can be serialized, a constructor, a de-serializer, and a few utility functions. LWGEOM LWLINE LWGEOM De-serialize Serialize In memory, easy-to-use In postgresql In postgresql

  40. LWPOLYGON - represents a single polygon (set of rings) typedef struct { int SRID; // spatial ref sys -1=none char dims; // see dim definition (0=2d,1=3d,2=4d) int nrings; POINTARRAY **rings; // list of rings (list of points) } LWPOLY; //"light-weight polygon" There are several support functions that will convert this into something that can be serialized, a constructor, a de-serializer, manage rings, and a few utility functions. LWGEOM LWPOLY LWGEOM De-serialize Serialize In memory, easy-to-use In postgresql In postgresql

  41. LWGEOM ‘type’ structure (8 bits) /* LWGEOM types are an 8-bit char in this format: xSDDtttt WHERE x = unused S = 4 byte SRID attached (0= not attached (-1), 1= attached) DD = dimentionality (00=2d, 01=3d, 10= 4d, 11 = undef) tttt = actual type (as per the WKB type): enum wkbGeometryType { wkbPoint = 1, wkbLineString = 2, wkbPolygon = 3, wkbMultiPoint = 4, wkbMultiLineString = 5, wkbMultiPolygon = 6, wkbGeometryCollection = 7 }; */ bool lwgeom_hasSRID(char type); // true iff S bit is set int lwgeom_getDims(char type); // 2 if 2d, 3 if 3d, 4 if 4d int lwgeom_getType(char type); // returns the tttt value (1 to 7)

  42. Generic Interpretation of an LWGEOMs Postgresql variable length ovehead The type byte tells you if this is present Actual geometry information (i.e. its points)

  43. Point LWPOINT *lwpoint_deserialize(char *serialized_form); Type tells you if these are present typedef struct { char dims; // (0=2d,1=3d,2=4d) int SRID; // spatial ref sys POINTARRAY *point; // array of 1 point } LWPOINT; // "light-weight point"

  44. Line LWLINE *lwline_deserialize(char *serialized_form); Could be 2D, 3D, or 4D typedef struct { char dims; // see dim definition (0=2d,1=3d,2=4d) int SRID; // spatial ref sys -1=none POINTARRAY *points; // set of POINTs } LWLINE; //"light-weight line"

  45. Polygon LWPOLY *lwpoly_deserialize(char *serialized_form); typedef struct { int SRID; // spatial ref sys -1=none char dims; // see dim definition (0=2d,1=3d,2=4d) int nrings; POINTARRAY **rings; // list of rings (list of points) } LWPOLY; //"light-weight polygon"

  46. Multi-Geometries Each one of these will start with an 8 bit type.

  47. Example - ‘MULTIPOINT(0 0, 10 10)’ ‘MULTIPOINT’ 2D 2 ‘POINT’ 2D (almost certainly not present) (0 0) ‘POINT’ 2D (almost certainly not present) (10 10)

  48. GEOMETRYCOLLECTION(MULTIPOINT(0 0, 10 10), LINESTRING(0 0, 10 10)) A multi with a multi inside. Be scared. geometrycollection M u l t i p o i n t linestring

  49. LWGEOM_INSPECTED typedef struct { int SRID; char *serialized_form; // orginal structure char type; // 8-bit type for the LWGEOM int ngeometries; // number of sub-geometries char **sub_geoms; // list of pointers (into serialized_form) of the sub-geoms } LWGEOM_INSPECTED; LWGEOM_INSPECTED *lwgeom_inspect(char *serialized_form);

  50. More Info 1. PostGIS mailing lists (best info here) www.postgis.org 2. “Use the source” checkout the PostGIS CVS (instructions at www.postgis.org) 3. OGC Simple Features for SQL Specification

More Related