
Searching for Success



  1. Searching for Success Amazon CloudSearch and Relational Databases

  2. Agenda • Finding things • Types of Databases • Making Choices • What is CloudSearch? • Combining CloudSearch with Relational • Sample Code

  3. Finding Things So Many Databases

  4. Finding Your Information • Your users need to find things • What do you use? • A Database! • What Kind?

  5. It's a Big World Out There! • "Database" != "Relational Database" • Tons of relational databases • Amazon RDS • MySQL • MSSQL • Oracle • but…

  6. Many Other Types • NoSQL databases • Dynamo, Cassandra, CouchDB… • Graph databases • Neo4J, Titan, … • Column oriented databases • Redshift, Bigtable… • Text Search Engine • CloudSearch, Lucene, Autonomy...

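  7. Text Search Engine • Good at text queries • "Harry Potter and the Philosopher's Stone" [Diagram: the title is split into words, and each word is shown in its original, lowercased, and normalized forms (Harry / harry, Potter / potter, Philosopher's / philosopher's / philosopher, Stone / stone, plus "and" and "the"); the terms that remain are: harry potter philosopher stone]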
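
The word handling on this slide can be sketched in a few lines of Java. This is only an illustration of the idea, not how CloudSearch is implemented: the class name `ToyAnalyzer`, the tiny stopword list, and the possessive-stripping rule standing in for real stemming are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ToyAnalyzer {
    // Tiny illustrative stopword list (an assumption); real engines use longer, configurable lists.
    private static final Set<String> STOPWORDS = Set.of("a", "an", "and", "of", "the");

    /** Splits on whitespace, lowercases, strips a trailing 's and punctuation, drops stopwords. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            term = term.replaceAll("'s$", "");        // crude stand-in for stemming
            term = term.replaceAll("[^a-z0-9]", "");  // strip remaining punctuation
            if (!term.isEmpty() && !STOPWORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [harry, potter, philosopher, stone]
        System.out.println(analyze("Harry Potter and the Philosopher's Stone"));
    }
}
```

Running the same analysis over both documents and queries is what lets a query for "potter" find a title containing "Potter".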

  8. Text Search Engine • Basic element is the document • Documents are made of fields • "title" => "star wars" • Fields can be • Missing • Multi-valued • Variable length

  9. Text Search Engine • Documents are not "normalized" • In a relational database • A movie table • A director table • An actor table • In CloudSearch • One document per movie
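
A minimal sketch of what "one document per movie" means in practice, assuming a movie row, a director lookup table, and a list of (movie id, actor) rows. The class name `Denormalizer`, the field names, and the in-memory tables are hypothetical stand-ins for real query results.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class Denormalizer {
    /**
     * Flattens one movie plus its related rows into a single document.
     * A field with no value is simply left out; "actor" is multi-valued.
     */
    static Map<String, List<String>> toDocument(
            int movieId, String title, int directorId,
            Map<Integer, String> directors,               // director id -> name
            List<Map.Entry<Integer, String>> roles) {     // (movie id, actor name) rows
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("title", List.of(title));
        String director = directors.get(directorId);
        if (director != null) {
            doc.put("director", List.of(director));
        }
        List<String> actors = new ArrayList<>();
        for (Map.Entry<Integer, String> role : roles) {
            if (role.getKey() == movieId) {
                actors.add(role.getValue());
            }
        }
        if (!actors.isEmpty()) {
            doc.put("actor", actors);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<Integer, String> directors = Map.of(10, "George Lucas");
        List<Map.Entry<Integer, String>> roles = List.of(
                Map.entry(1, "Mark Hamill"), Map.entry(1, "Harrison Ford"), Map.entry(2, "Someone Else"));
        System.out.println(toDocument(1, "star wars", 10, directors, roles));
    }
}
```

The joins happen once, at load time, so that no join is needed at query time.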

  10. Text Search Engine / Relational [Diagram comparing the two data models]

  11. Relevance • Key differentiator for text search • Not "does this match?" • "How WELL does this match?" • Includes multiple factors • Term Frequency, Document Frequency, Proximity • Users can customize this • Distance • Popularity • Field Weighting
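
To make term frequency and document frequency concrete, here is a sketch of the textbook TF-IDF weight. It is not CloudSearch's ranking function, which the deck does not describe; proximity and the customizations listed above are left out, and the class name `TfIdf` is hypothetical.

```java
import java.util.List;

final class TfIdf {
    /**
     * Textbook TF-IDF: (occurrences of term in doc) * ln(corpus size / docs containing term).
     * Returns 0 when the term does not occur in the document.
     */
    static double score(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        double idf = Math.log((double) corpus.size() / df);
        return tf * idf;
    }

    public static void main(String[] args) {
        List<String> d0 = List.of("harry", "potter", "stone");
        List<String> d1 = List.of("harry", "met", "sally");
        List<String> d2 = List.of("rolling", "stone", "stone");
        List<List<String>> corpus = List.of(d0, d1, d2);
        // "potter" occurs in one document, "harry" in two, so "potter" weighs more in d0.
        System.out.println(score("potter", d0, corpus) > score("harry", d0, corpus));
    }
}
```

A rare term counts for more than a common one, and a document that repeats the term counts for more than one that mentions it once.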

  12. Text is more than "War & Peace" • It's not just books & blog posts • Meta-data • Author, Title, Category, Tags • Can include numbers: counts, dates, latitude,…

  13. Making Choices Relational? CloudSearch?

  14. Relational Database • Good at • Exact matches • Joins • Atomic Transactions • Not so good at • Relevance • How well does this match? • Handling words

  15. Text Search Engines • Good at finding • Words, Phrases • Relevance • Not so good at • Joins • Transactions

  16. Options for Search • Can I just use a relational database? • Yes. • Do I want to just use a relational database? • Probably not

  17. Simple Approach • Widely supported, easy SELECT id, title FROM books WHERE title LIKE "%amazon%" • Does not perform well • Doesn't deal with multiple words
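
Why LIKE "doesn't deal with multiple words" can be shown without a database. The sketch below mimics `LIKE '%query%'` with a substring test (assuming a case-insensitive collation) and compares it with matching each query word separately. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class LikeVersusTerms {
    /** Mimics WHERE title LIKE '%query%': the whole query must appear as one substring. */
    static boolean likeMatch(String title, String query) {
        return title.toLowerCase(Locale.ROOT).contains(query.toLowerCase(Locale.ROOT));
    }

    /** True when every query word occurs somewhere in the title, in any order. */
    static boolean allTermsMatch(String title, String query) {
        Set<String> titleTerms =
                new HashSet<>(Arrays.asList(title.toLowerCase(Locale.ROOT).split("\\W+")));
        for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty() && !titleTerms.contains(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String title = "Harry Potter and the Philosopher's Stone";
        System.out.println(likeMatch(title, "potter harry"));     // false: words are in the wrong order
        System.out.println(allTermsMatch(title, "potter harry")); // true
    }
}
```

The substring test also has to scan every row, which is one reason the slide notes it does not perform well.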

  18. Text Extensions for Relational Databases • Vendor specific SELECT id, title FROM books WHERE MATCH(title) AGAINST('Harry Potter' IN NATURAL LANGUAGE MODE) • Use different index structures • Typically MUCH less mature than relational code • More manual processes • Scaling (if possible) • Managing • Minimal relevance, no control

  19. Appropriate Tools VS

  20. Options • Relational database • Weak relevance • Scaling & performance limits • Text Search Engine • No transactions & locking • No Joins • Both • Some extra effort, then best of both worlds

  21. What is Amazon CloudSearch?

  22. CloudSearch • Fully-managed text search engine • High Performance • Automatically Scaling • Reliable, Resilient • Based on Amazon Product Search

  23. Search Features • Faceting • Complex queries • (and 'potter harry' (not author:'rowling')) • Configurable synonyms, stemming & stopwords • Custom Sorting/Ranking
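
Faceting, the first feature listed, amounts to counting field values across the matching documents. The sketch below shows only that counting step over in-memory documents; the class name `FacetCounter` and the `genre` field are hypothetical, and CloudSearch computes facets on the service side.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FacetCounter {
    /** Counts how many times each value of one field occurs across the matching documents. */
    static Map<String, Integer> count(List<Map<String, List<String>>> hits, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, List<String>> doc : hits) {
            // A document may lack the field entirely or carry several values for it.
            for (String value : doc.getOrDefault(field, List.of())) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, List<String>>> hits = List.of(
                Map.of("genre", List.of("Sci-Fi", "Adventure")),
                Map.of("genre", List.of("Sci-Fi")),
                Map.of("title", List.of("no genre here")));
        System.out.println(count(hits, "genre")); // {Adventure=1, Sci-Fi=2}
    }
}
```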

  24. Scaling • CloudSearch scales automatically • Handle your spikes • Plan for success, but don't spend until you need it • Handle more data • Scaling is seamless – no downtime

  25. Automatic Scaling [Diagram: a grid of search instances. Along the DATA axis (document quantity and size) the index is split into partitions 1 to n; along the TRAFFIC axis (search request volume and complexity) each partition is replicated as copies 1 to n. Each search instance holds one copy of one index partition.]

  26. Easy to Use • REST API • Simple to add • HTTP POST • Simple to query • q=star trek • Simple to integrate • JSON [Diagram: Queries → HTTP → CloudSearch ← HTTP ← Documents]
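
A sketch of turning `q=star trek` into a request URL. The endpoint host below is a made-up placeholder, and the `/2011-02-01/search` path is an assumption about the API version in use when the deck was written; check the endpoint shown for your own search domain.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SearchUrl {
    /** Builds a search URL; the path segment is an assumed API version, not taken from the deck. */
    static String build(String endpoint, String query) {
        // URLEncoder applies form encoding, so the space becomes '+'.
        return "https://" + endpoint + "/2011-02-01/search?q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical endpoint for a domain named "movies".
        System.out.println(build("search-movies-example.us-east-1.cloudsearch.amazonaws.com", "star trek"));
    }
}
```

The resulting URL can be fetched with any HTTP client; the response comes back as JSON.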

  27. Amazon CloudSearch Architecture [Diagram: a search domain reached through AWS Query and DNS / load balancing, made up of three services: the CONFIG SERVICE (Config API, command line tools, console), the DOCUMENT SERVICE (Doc Svc API, command line tools, console), and the SEARCH SERVICE (Search API, console)]

  28. What Can You Search For With CloudSearch? • Wine • Your college buddies • Curly hair products • Downton Abbey episodes • News in Bermuda • Playoff tickets • Online courses • Cat memes • Furniture • Doctor reviews • Take out food • Vacation rentals • Trademarks • African safaris • Kids arts & crafts • French dating/marriage • Online videos • Recipes • Weather insurance • Fashion news • Bollywood music • Stock art And more!

  29. Combining CloudSearch+Relational Database

  30. Combining the Two • Best of both worlds • Relational queries run on relational database • Text queries run on CloudSearch • Downside: Complexity • More moving parts • Synchronization

  31. Synchronization • Which one is the master? • Usually the relational database • Updates • All at once • At regular intervals • When data is available • Deletes
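
One way to handle the "at regular intervals" case, including deletes, is to compare the ids and versions in the master database with what was last sent to the index. This is a sketch under that assumption, not the presenter's method; the class name `SyncPlanner` and the id-to-version maps are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SyncPlanner {
    /** Ids that are new in the source, or whose source version is newer than the indexed one. */
    static Set<String> toUpsert(Map<String, Long> source, Map<String, Long> indexed) {
        Set<String> ids = new TreeSet<>();
        for (Map.Entry<String, Long> e : source.entrySet()) {
            Long indexedVersion = indexed.get(e.getKey());
            if (indexedVersion == null || indexedVersion < e.getValue()) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    /** Ids still in the index that no longer exist in the source. */
    static Set<String> toDelete(Map<String, Long> source, Map<String, Long> indexed) {
        Set<String> ids = new TreeSet<>(indexed.keySet());
        ids.removeAll(source.keySet());
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Long> source = Map.of("a", 2L, "b", 1L, "c", 1L);
        Map<String, Long> indexed = Map.of("a", 1L, "b", 1L, "d", 1L);
        System.out.println(toUpsert(source, indexed)); // [a, c]
        System.out.println(toDelete(source, indexed)); // [d]
    }
}
```

Deletes are the easy part to forget: a row removed from the database stays searchable until something tells the index to drop it.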

  32. Dataflow • One source • Simultaneous updates [Diagram: Source → Loader → CloudSearch and RDBMS]

  33. Dataflow • One source • Two loaders [Diagram: Source → Loader → CloudSearch; Source → Loader → RDBMS]

  34. Dataflow • One source • Log updates • Two loaders [Diagram labels: Source, Log, Loader (×2), RDBMS, CloudSearch]

  35. Dataflow [Diagram labels: Source (×3), Log, Loader (×2), RDBMS, CloudSearch]

  36. Sample Code

  37. Dataflow • One source • Two loaders [Diagram: Source → Loader → CloudSearch; Source → Loader → RDBMS]

  38. Java Example • Read from MySQL • JDBC – Nothing special • Post to CloudSearch • Apache HTTP Client

  39. Libraries • Apache • HTTP Client • HTTP Core • Commons Logging • AWS Java SDK • MySQL connector

  40. Source Files • CloudSearchRDS • Just does the setup for the demo • ExtractAndUpload • Does the main work • Batcher • Groups documents into batches • PosterHttp • Posts to CloudSearch
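
The deck does not show the Batcher source, so the following is only a guess at the idea: collect JSON documents into a JSON array and start a new batch before a size limit is exceeded. The class name `SimpleBatcher`, its one-argument `addDocument`, and the byte limit passed to the constructor are all assumptions; use the batch size limit documented for the service.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SimpleBatcher {
    private final int maxBatchBytes;
    private final List<String> batches = new ArrayList<>();
    private StringBuilder current = new StringBuilder();
    private int currentBytes = 0;

    SimpleBatcher(int maxBatchBytes) {
        this.maxBatchBytes = maxBatchBytes;
    }

    /** Adds one JSON document, closing the current batch first if it would overflow. */
    void addDocument(String jsonDoc) {
        int docBytes = jsonDoc.getBytes(StandardCharsets.UTF_8).length;
        // +3 = separating comma plus the two enclosing brackets.
        if (currentBytes > 0 && currentBytes + docBytes + 3 > maxBatchBytes) {
            flush();
        }
        if (currentBytes > 0) {
            current.append(',');
            currentBytes++;
        }
        current.append(jsonDoc);
        currentBytes += docBytes;
    }

    /** Closes the current batch, if it holds anything, as a JSON array. */
    void flush() {
        if (currentBytes == 0) {
            return;
        }
        batches.add("[" + current + "]");
        current = new StringBuilder();
        currentBytes = 0;
    }

    List<String> batches() {
        return batches;
    }
}
```

Each finished batch is what a poster class would send in one HTTP POST. A single document larger than the limit still ends up in a batch of its own, so a real implementation would need to reject or split it.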

  41. Main Loop
      ResultSet rs = stmt.executeQuery("select * from movies");
      ResultSetMetaData meta = rs.getMetaData();
      // Collect the column names once; they become the document field names.
      for (int col = 1; col <= meta.getColumnCount(); col++)
          names.add(meta.getColumnName(col));
      while (rs.next()) {
          // Document version: last-modified time in seconds.
          int version = (int) (lastModified.getTime() / 1000);
          JSONObject doc = new JSONObject();
          for (String name : names) {
              doc.put(name, rs.getString(name));
          }
          String id = rs.getString("id");
          if (batcher != null) {
              batcher.addDocument(doc, version, id);
          }
      }
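
The loop above hands each row to the batcher as a JSON object with a version and an id. A sketch of what the uploaded "add" operation might look like follows, written with plain string building so it runs without a JSON library. The envelope keys (`type`, `id`, `version`, `lang`, `fields`) are an assumption about the document format of that era and are not shown in the deck; the class name `SdfBuilder` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SdfBuilder {
    /** Wraps a string in double quotes, escaping the characters JSON requires. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Builds one "add" operation; fields whose value is null (SQL NULL) are omitted. */
    static String addDocument(String id, long version, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"type\":\"add\",\"id\":").append(quote(id))
          .append(",\"version\":").append(version)
          .append(",\"lang\":\"en\",\"fields\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (e.getValue() == null) continue;
            if (!first) sb.append(',');
            sb.append(quote(e.getKey())).append(':').append(quote(e.getValue()));
            first = false;
        }
        return sb.append("}}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "star wars");
        fields.put("tagline", null); // omitted from the output
        System.out.println(addDocument("tt1", 5, fields));
    }
}
```

In real code a JSON library, as used on the slide, is the safer choice; the hand-rolled escaping here exists only to keep the example dependency-free.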

  42. SQL • select * from movies; • select key as id, title as name from movies • Denormalizing may require multiple queries

  43. Demo

  44. Search: It's not just for Relational Data • You can pull data from • S3 • Redshift • Web • Internal Documents • And more… • And make it searchable

  45. Indexing S3
      ListObjectsRequest listObjectsRequest = new ListObjectsRequest().withBucketName(bucketName);
      ObjectListing objectListing;
      do {
          // Each call returns one page of object summaries.
          objectListing = s3client.listObjects(listObjectsRequest);
          for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
              processObject(objectSummary);
          }
          // Continue from where the previous page ended.
          listObjectsRequest.setMarker(objectListing.getNextMarker());
      } while (objectListing.isTruncated());

  46. Summary • Use the right tool! • Text Search for Searching Text • CloudSearch is fully managed text search • Easy to get data from relational DB • Easy to load data into CloudSearch

  47. Next Step: Free Trial • One month (750 hours) free. • Set up an account • Give it a try! • Questions? • TomHill@amazon.com
