1 / 76

NoSQL and NOSQL

NoSQL and NOSQL. Beginning with o-r databases. First, CRUD…. Global notions of managing persistent data, regardless of the model or system Create, Read, Update, Delete But there is also DDL And there are implementation issues, like sorts and indices. Why not standard tables?.

gerek
Download Presentation

NoSQL and NOSQL

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. NoSQL and NOSQL Beginning with o-r databases

  2. First, CRUD… • Global notions of managing persistent data, regardless of the model or system • Create, Read, Update, Delete • But there is also DDL • And there are implementation issues, like sorts and indices

  3. Why not standard tables? • Extreme data structuring conflict between host language and database language: • Impedance mismatch • Atomic values are the only common data type • To retrieve all of an object requires lots of joins • Difficult to look for objects that are similar but have some different attributes • Difficult to retrieve an attribute that is a collection • You have to program with two programming languages • We have value-based semantics – difficult to know if two people have the same mother, or just a mother with the same name/ID and this causes us to make inference

  4. What is o-o? • Relation is a set of tuples • Objects are arranged in sets of objects • In a relation, a tuple’s components are primitive (int, string) • The components of an object can be complex types (sets, tuples, other objects) • SQL: programs are global • Object: programs are local

  5. Key concept: Object Id’s • Every object has a unique Id: different objects have different Ids • Immutable: does not change as the object changes • Different from primary key! • Like a key, identifies an object uniquely • But key values can change – oidscannot • And there are inferences based on values

  6. Objects and Values • An object is a pair: (oid, value) • Example: A Joe Public’s object (#32, [ SSN: 111-22-3333, Name: “Joe Public”, PhoneN: {“516-123-4567”, “516-345-6789”}, Child: {#445, #73} ] )

  7. Classes • Class: set of semantically similar objects (eg, people, students, cars, motorcycles) • A class has: • Type: describes common structure of all objects in the class (semantically similar objects are also structurally similar) • Method signatures: declarations of the operations that can be applied to all objects in the class. • Extent: the set of all objects in the class • Classes are organized in a class hierarchy • The extent of a class contains the extent of any of its subclasses

  8. The ODMG Standard • ODMG 3.0 was released in 2000 • Includes the data model (more or less) • ODL: The object definition language • OQL: The object query language • A transaction specification mechanism • Language bindings: How to access an ODMG database from C++, Smalltalk, and Java (expect C# to be added to the mix)

  9. Main Idea: Host Language = Data Language • Objects in the host language are mapped directly to database objects • Some objects in the host program are persistent.

  10. The Structure of an ODMG Application

  11. Objects in SQL • Object-relational extension of SQL-92 • Includes the legacy relational model • SQLdatabase= a finite set of relations • relation = a set of tuples (extends legacy relations) OR a set of objects (completelynew) • object = (oid, tuple-value) • tuple = tuple-value • tuple-value = [Attr1: v1, …, Attrn: vn] • multiset-value = {v1, …, vn}

  12. Path expressions SELECT T.Student.Name, T.Grade FROM TRANSCRIPT T WHERE T.Student.Address.Street = ‘Main St.’

  13. PostgreSQL vs. MySQL • PostgreSQL is a generation newer • It has nice UDT capabilities • There are libraries of UDTs that can be imported and used • Both PostgreSQL and MySQL • Full text search • XML data types • To some degree free • MySQL • Never underestimate the value of a heavily understood piece of software • Lots of stacks and development environments come configured to work with it (but to a lesser extent, this is true of PostgreSQL, too). • It is a “core” SQL database, in that we can move pretty much to any other server-based DBMS is we start with MySQL

  14. Triggers in PostgreSQL • Triggers automatically fire stored procedures when some event happens, like an insert or update. They allow the database to enforce some required behavior in response to changing data. • PL/pgSQL – Procedural Language of PostgreSQL

  15. Example CREATE TABLE logs ( event_idinteger, old_titlevarchar(255), old_startstimestamp, old_ends timestamp, logged_at timestamp DEFAULT current_timestamp ); A logs table

  16. Continued CREATE OR REPLACE FUNCTION log_event() RETURNS trigger AS $$ DECLARE BEGIN INSERT INTO logs (event_id, old_title, old_starts, old_ends) VALUES (OLD.event_id, OLD.title, OLD.starts, OLD.ends); RAISE NOTICE 'Someone just changed event #%', OLD.event_id; RETURN NEW; END; A function to insert old data in to the log

  17. Continued… a trigger CREATE TRIGGER log_events AFTER UPDATE ON events FOR EACH ROW EXECUTE PROCEDURE log_event(); Logs changes after any row is updated

  18. Rules • A RULE is a description of how to alter the parsed query tree. • Every time Postgresruns an SQL statement, it parses the statement into a query tree (generally called an abstract syntax tree).

  19. Back to PostgreSQL Page 32 of Seven Databases, onward Fuzzy searches, full text searching

  20. Postgres and spatial data • For manipulating 2D/3D spatial data • Points, lines, and polygons formed from points and lines • Can perform union, intersection, operations • Can project shapes into 2D areas • Has a 3D geometry type (relatively new) • Can calculate accurate distances in meters • Works with an open source server that allows folks to share geospatial data • Command line interface • Also supports some forms of raster data • Provides spatial indices • Has a notion of a geometric column

  21. Queries SELECT superhero.name FROM city, superhero WHEREST_Contains(city.geom, superhero.geom) and city.name = 'Gotham'; SELECTAsBinary(the_geom) as wkb_geometryFROMriver AS r, state AS s WHERE intersects(r.the_geom, s.the_geom)

  22. Mapnik • Used for OSM (open street map) data and uses postgis • Mapnik is an open source system for rendering maps • Used to design maps • Written in C++ • It renders maps from postgis databases

  23. Next: full text and approximate text search • But first, not to be confused with the Like operator • Used % as the wild card • Or with regular expressions for character string comparison

  24. Full text search • First, you index the words in a document and create an array of lexemes • Second, specify a boolean phrase using and, or, not, and parens • We typically don’t index “stop” words like and, or, the, etc. • Dictionaries are used to find roots of related words, like dead and dying • Thesauruses dictionaries are used to for recognition of domain-specific and similar words

  25. Documents • A document is a text attribute in a row of a table • Often we use part of a document or concatenate various parts of documents

  26. Details: dictionaries • Define stop words that should not be indexed • Map synonyms to a single word. • Map phrases to a single word using a thesaurus. • Map different variations of a word to a canonical form

  27. Searching • Uses a match operator - @@ • Basic search consists of asking about the relationship to a vector of words to a given document, which is also a vector • The vector can have and, or, etc. in it • tsvector – document – normalized lexemes • tsquery – query

  28. Examples SELECT title FROM pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10; select the ten most recent documents that contain create and table in the title or body Results can be ranked

  29. Recent addition: fuzziness • soundex(text) returns text • Converts a string to its Soundexcode • Based on pronunciation • difference(text, text) returns int • converts two strings to their Soundex codes and then reports the number of matching code positions • 0 is a no match • 4 is a full match • Def: A phonetic coding system intended to suppress spelling variation and determining the relationship between two (similar) words

  30. Levenshtein • Levenshtein distance is a metric for evaluating the difference between two sequences, in particular, words • E.g.: test=# SELECT levenshtein('GUMBO', 'GAMBOL'); • E.g.: SELECT * FROM some_table WHERE levenshtein(code, 'AB123-lHdfj') <= 3 ORDER BY levenshtein(code, 'AB123-lHdfj') LIMIT 10 • Used in particular, to detect nicknames

  31. Metaphone • E.g., metaphone(text source, intmax_output_length) returns text • Similar to soundex • Used to classify words according to their english pronunciation • Apparently better for non-english languages, compared to soundex • E.g.: SELECT * FROM users WHERE METAPHONE(users.first_name, 2) = METAPHONE('Willem', 2) should detect similarity to word William

  32. Class project, beginning • Build an application on top of • PostgreSQL or MySQL • And one of the other NoSQL databases in the 3 books • But pick from only key/value, key/document, or column-based databases • The application is written in a language of your choice • Each of the databases must be used to manage the kind of data it is intended for • Traditional relational table data • And nontraditional data

  33. Final grades • Choice 1: each of exam 1, exam 2, project are 1/3 of your final grade • Choice 2: if you build a web app for your project, I will use the best two of three grades

  34. NoSQL DBs • Why?

  35. Relational DBs • SQL - Fixed schema, row oriented & optimized • SQL - Rigid 2Phase transactions, with locking bottleneck • SQL - Set theoretic • SQL - Centralized distribution • SQL - Computational, not navigational/inter-connected & set-oriented • Sql - Poor support for heterogeneity & compression

  36. No SQL - no or not only • Column-oriented - HBase (uses column families and no schema, has versioning and consistence transactions) • Key/value pairs - Google Dynamo • Graph like - Neo4J • Document based - MongoDB (cluster based for huge scale, supports nested docs, and uses JavaScript for queries, and no schema)

  37. But remember - • Categories not distinct - take each one for what it is • Heterogeneous structure & polyglot language environment is common • NoSQL DBs tend to be unsupported with funky GUIs - but there are very active volunteer user bases maintaining and evolving them • NoSQL DBs also tend to use programming languages for queries

  38. When do you want non-2P transactions and no SQL? • Interactive, zillion user apps where user fixes errors via some form of compensation • Minimal interconnectedness • Individual data values are not mission-critical • Read-heavy environments • Cloud -based environments • Queries are not set-oriented & are computational and imperative, and perhaps long • Real time apps

  39. SQL is here to stay... • Formal & unambiguous semantics • Declarative language with clean separation of application and queries • Consistent • Flexible • Black boxed, tested, and supported - and very well understood with many thousands of trained programmers - SQL is a basic language, like Java, Javascript, PHP, C#. etc. • Great GUIs that are very rich and debugged

  40. And importantly... • Lots of apps need clean, well understood stacks, not speed or the cloud • In particular, websites that do retail business need consistent transactions and do not need the speed that comes with delayed updates • Relational DBs scale reasonably well, too, at least in non-cloud environments

  41. Again… • The classification of the various nosql databases is imprecise, semi-controversial, and we have to be careful about reading too much into it. • Rather than focusing on categorizing dbs, we should be concerned with what they do, how they relate to each other with respect to functionality, and how they compare to sql databases.

  42. Key-value and key-document DBs • Databases that access aggregate data • Key-value dbs know nothing about the structure of the aggregate • Key-document databases do know, but the interpretation of these aggregates happens outside the db • Keep in mind that these two categories of databases overlap in practice • Importantly, both of these two database systems categories focus on storing and retrieving individual aggregates, and not on interrelating (horizontally) multiple aggregates • There is something similar to this in SQL DBs – and that is highly un-normalized tables

  43. Important notions… • It can be a difficult problem to represent some domains as key-value or key-document databases, as the boundaries of aggregates might not be easy to determine. • This basic data modeling issue has a lot of influence on the sort of database you should use. • Relational databases don’t manipulate aggregates, but they are aggregate neutral for the most part, leaving the construction of aggregates to run time … but we might have hidden, un-normalized tables that make some commonly used aggregates much faster to materialize

  44. Key-value vs. key-document • In key-value databases, we can only retrieve data via a key • In key-document databases, we may be able to ask questions about the content of documents – but again, we are not cross-associating them • Mongo is perhaps the most talked about key-document system, and so we will start there

  45. Installing Mongo • Mongo • http://docs.mongodb.org/manual/installation • A GUI • http://www.mongodb.org/display/DOCS/Admin+UIs

  46. GUIs for Mongo • There are a few GUIs that seem pretty good • Mongo-vision: http://code.google.com/p/mongo-vision/ (web page) • Needs Prudence as a web server • MongoVue: http://mongovue.com, but Windows only • RockMongo (web based): http://rockmongo.com/ (web page) • Needs an apache web server • Very easy to install, just download • http://docs.mongodb.org/manual/installation

  47. Getting an Apache web server • XAMPP for windows (mac version is way out of date) • MAMP for Macs (on the app store) • WAMP for windows (bitnami.org) • All of these give you PHP and MySQL as well. If we have time, we will look at MySQL full text search. • You might want to install PostgreSQL, too. There is a bitnami stack. If there is time, we will look at PostgreSQL UDTs and full text search.

  48. Mongo overview • Document based • Focuses on clusters for extremely large scaling • Supports nested documents • Uses JavaScript for queries • No schema

  49. Terminology • A database consists of collections • Collections are made up of documents • A document is made up of fields • There are also indices • There are also cursors

  50. When to use Mongo • Medical records and other large document systems • Read heavy environments like analytics and mining • Partnered with relational databases • Relational for live data • Mongo for huge largely read only archives • Online applications • Massively wide e-commerce

More Related