Vagif Jalilov Rivet Logic

Integrating Apache Solr with Alfresco WCM for Faceted Search and Navigation of Next-Generation Web Sites. Vagif Jalilov Rivet Logic. About Rivet Logic. Award-winning professional services focused on: Enterprise Content Management Web Content Management Collaboration and Social Communities

Presentation Transcript


  1. Integrating Apache Solr with Alfresco WCM for Faceted Search and Navigation of Next-Generation Web Sites. Vagif Jalilov, Rivet Logic

  2. About Rivet Logic
     • Award-winning professional services focused on:
       • Enterprise Content Management
       • Web Content Management
       • Collaboration and Social Communities
     • Using Leading Open Source Software

  3. Business Case for Alfresco & Solr
     • Large-scale sites
     • Need for real-time updates
     • Full-text search
     • Faceted search

  4. Technical Challenges for Search
     • Accurately index each page
       • Solution: Assemble the relevant content to index
     • Targeted, real-time indexing
       • Solution: Trigger indexing from the publishing mechanism

  5. Possible Index Solutions
     • Spidering/Crawling
       • Follow navigational & cross-links
       • Parse HTML and fetch relevant content
       • Spider full (or partial) site each time
     • Real-time Indexing
       • Triggered by FSR (File System Receiver) deployment
       • Process only the change-set (incremental updates)
       • Assemble relevant page content

  6. Typical Web Application
     • CMS (Alfresco)
       • Binary Content
     • Source Control
       • Source code & libs
       • View templates
       • Site navigation
       • Web content

  7. “Managed” (Riveted) Web Application
     • CMS (Alfresco)
       • Binary Content
       • Web Content
       • Site Navigation
       • (View templates)
     • Source Control
       • Source code & libs
       • (View templates)

  8. Page Composition
     • Meta-content.xml
     • Page-metadata.xml
     • Related-links.xml (dynamic)
     • Section-html.xml
     • Supporting-items.xml (dynamic)

  9. Content Delivery (http://crafterrivet.org)

  10. Alfresco WCM Lifecycle

  11. Indexing Architecture

  12. Solr Customizations
      • Custom Solr
        • schema.xml
          • Fields (Type, Indexed/Stored)
          • Unique key
        • solrconfig.xml
          • “dismax” type request handler to define queried fields
          • ExtractingRequestHandler (indexing rich-text documents)

  13. Custom Solr Schema

          <fields>
            <field name="page_url" type="string" indexed="true" stored="true" required="true"/>
            <field name="page_title" type="text" indexed="true" stored="true"/>
            <field name="page_category" type="string" indexed="true" stored="true"/>
            <field name="page_type" type="string" indexed="true" stored="true"/>
            <field name="page_last_modified" type="date" indexed="true" stored="true"/>
            <field name="page_text" type="text" indexed="true" stored="true"/>
            <field name="page_file_size" type="int" indexed="false" stored="true"/>
          </fields>
          <uniqueKey>page_url</uniqueKey>

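  14. ExtractingRequestHandler

          <!-- solrconfig.xml -->
          <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
          <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
            <lst name="defaults">
              <str name="fmap.content">page_text</str>
              <str name="fmap.title">page_title</str>
              <str name="uprefix">ignored_</str>
            </lst>
          </requestHandler>

          <!-- schema.xml: catches extracted fields that have no schema field of their own -->
          <dynamicField name="ignored_*" type="ignored"/>

          // SolrJ client: filePath and solrServerUrl are supplied by the caller
          import java.io.File;
          import org.apache.solr.client.solrj.SolrServer;
          import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
          import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

          // Send the rich-text file to the extracting handler, then commit
          ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
          up.addFile(new File(filePath));
          SolrServer solrServer = new CommonsHttpSolrServer(solrServerUrl);
          solrServer.request(up);
          solrServer.commit();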

  15. Custom RequestHandler

          <!-- DisMaxRequestHandler allows easy searching across multiple
               fields for simple user-entered phrases. Its implementation is
               now just the standard SearchHandler with a default query type
               of "dismax". See http://wiki.apache.org/solr/DisMaxRequestHandler -->
          <requestHandler name="solrDemoDismax" class="solr.SearchHandler">
            <lst name="defaults">
              <str name="defType">dismax</str>
              <str name="qf">page_title^5.0 page_text^1.0</str>
            </lst>
          </requestHandler>
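A minimal sketch of how a client might send a user query to the handler defined on this slide, using only the JDK. The base URL `http://localhost:8983/solr` and the class name `DismaxQueryUrl` are assumptions for illustration, not part of the deck. It relies on the `qt` parameter, which selects a named request handler on `/select` in the Solr releases of that period; the `defType` and `qf` values then come from the handler defaults rather than from the request.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds a select URL that routes a user query to the "solrDemoDismax" handler. */
final class DismaxQueryUrl {

    // Assumption: Solr base URL; adjust for the actual deployment.
    static final String SOLR_BASE = "http://localhost:8983/solr";

    /** Returns the full request URL for a raw, user-entered query string. */
    static String build(String userQuery) {
        // Form-encode the user input so spaces and punctuation survive the URL.
        String q = URLEncoder.encode(userQuery, StandardCharsets.UTF_8);
        // "qt" names the handler; boosts (page_title^5.0 page_text^1.0) are handler defaults.
        return SOLR_BASE + "/select?qt=solrDemoDismax&q=" + q;
    }

    public static void main(String[] args) {
        System.out.println(build("content management"));
    }
}
```

Because the field boosts live in `solrconfig.xml`, the relevance tuning can change without touching client code.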

  16. Compilation
      • Compiler Engine processes all instructions
      • Dispatches each instruction to the appropriate Page Type Compiler

  17. Content Deployment & Solr Update

  18. Compiler Instructions

          <updates deploy-root="/path/to/content/root">
            ...
            <update>/solutions/security/article.xml</update>
            <delete>/products/widget/top-section.xml</delete>
            ...
          </updates>
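The instruction format on this slide is a flat change-set, so reading it could be sketched as below with the JDK's DOM parser. The class name `CompilerInstructions` and its fields are hypothetical; the deck does not show how the compiler engine actually parses this file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical holder for one deployment change-set. */
final class CompilerInstructions {
    final String deployRoot;
    final List<String> updates = new ArrayList<>();
    final List<String> deletes = new ArrayList<>();

    private CompilerInstructions(String deployRoot) {
        this.deployRoot = deployRoot;
    }

    /** Parses an <updates> document into update and delete path lists. */
    static CompilerInstructions parse(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            CompilerInstructions result =
                    new CompilerInstructions(root.getAttribute("deploy-root"));
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node n = children.item(i);
                if (n.getNodeType() != Node.ELEMENT_NODE) continue; // skip whitespace text
                String path = n.getTextContent().trim();
                if ("update".equals(n.getNodeName())) result.updates.add(path);
                else if ("delete".equals(n.getNodeName())) result.deletes.add(path);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Malformed compiler instructions", e);
        }
    }

    public static void main(String[] args) {
        CompilerInstructions ci = parse(
                "<updates deploy-root=\"/path/to/content/root\">"
                + "<update>/solutions/security/article.xml</update>"
                + "<delete>/products/widget/top-section.xml</delete>"
                + "</updates>");
        System.out.println(ci.deployRoot + " updates=" + ci.updates + " deletes=" + ci.deletes);
    }
}
```

Keeping updates and deletes in separate lists lets a caller re-compile changed pages and remove index entries for deleted ones in two independent passes.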

  19. Compilation Types
      • Web Pages (HTML)
      • Rich Text (PDF)

  20. Web Page Compilation & Indexing Indexer Instructions

  21. HTML Indexer Instruction

          <?xml version="1.0" encoding="ISO-8859-1"?>
          <add>
            <doc>
              <field name="page_url">/solutions/content-mgmt/overview.html</field>
              <field name="page_title">Increase productivity and streamline workflow throughout the enterprise</field>
              <field name="page_description">Commercial enterprises and government agencies face significant challenges as they strive to meet a rapidly growing need to manage thousands ...</field>
              <field name="page_category">Solutions</field>
              <field name="page_type">Web Page</field>
              <field name="page_last_modified">2009-12-18T15:03:57Z</field>
              <field name="page_text">Rivet Logic addresses many of today's workplace challenges with Enterprise Content Management (ECM) solutions that enable organizations to transform traditional content repositories and static intranets into dynamic, collaborative work environments through open source functionality. Through ...</field>
            </doc>
          </add>
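An indexer instruction like the one on this slide is a standard Solr XML update message, so a page compiler could plausibly emit it from a field map. The sketch below is an illustration under that assumption; `AddDocXml` and its methods are hypothetical names, and the deck does not show the real generator. The escaping step matters because page text routinely contains `&` and `<`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that renders a field map as a Solr <add><doc> message. */
final class AddDocXml {

    /** Escapes the characters that are unsafe in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")   // must run first so later entities stay intact
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders fields in map iteration order as one <doc> inside <add>. */
    static String build(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("page_url", "/solutions/content-mgmt/overview.html");
        fields.put("page_category", "Solutions");
        fields.put("page_type", "Web Page");
        System.out.println(build(fields));
    }
}
```

A `LinkedHashMap` keeps the fields in insertion order, which makes the generated message easy to compare against the slide.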

  22. Rich Text Compilation & Indexing

  23. Rich Text Indexer Instruction

          <?xml version="1.0" encoding="ISO-8859-1"?>
          <add>
            <doc>
              <field name="page_file">/docroot/static/about-us/press-releases/2010/rl_crafter_studio.pdf</field>
              <field name="page_url">/about-us/press-releases/2010/rl_crafter_studio.pdf</field>
              <field name="page_title">Rivet Logic launches Crafter Studio for user friendly Web content authoring and publishing.</field>
              <field name="page_category">News</field>
              <field name="page_type">Press Release</field>
              <field name="page_last_modified">2007-12-19T08:00:00Z</field>
              <field name="page_file_size">135</field>
            </doc>
          </add>
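For rich-text files the body text comes from the extracting handler, so the remaining instruction fields have to travel alongside the uploaded file. One way to do that, sketched here as an assumption rather than the presenter's actual method, is Solr Cell's `literal.<fieldname>` request parameters. The class name `ExtractRequestParams` is hypothetical, and `page_file` is treated as the local path of the file to upload, not as an index field.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that turns instruction fields into Solr Cell literal.* parameters. */
final class ExtractRequestParams {

    /** Returns a query string such as "literal.page_url=...&literal.page_type=...". */
    static String toQueryString(Map<String, String> instructionFields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : instructionFields.entrySet()) {
            // Assumption: page_file only locates the file that is sent as the content stream.
            if ("page_file".equals(e.getKey())) continue;
            if (sb.length() > 0) sb.append('&');
            sb.append("literal.").append(e.getKey()).append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("page_file", "/docroot/static/about-us/press-releases/2010/rl_crafter_studio.pdf");
        fields.put("page_url", "/about-us/press-releases/2010/rl_crafter_studio.pdf");
        fields.put("page_category", "News");
        fields.put("page_type", "Press Release");
        System.out.println("/update/extract?" + toQueryString(fields));
    }
}
```

One caution with this approach: if a literal and an extracted value (for example the document title mapped by `fmap.title`) target the same single-valued field, the two can collide, so one of them may need to be dropped.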

  24. Compiler Configuration

  25. Compiler Configuration

          <compiler-config>
            <page-types>
              <page-type name="Solution Page" compiler="com.rivetlogic.index.compile.ArticleCompiler">
                <uri-pattern pattern=".*/page-content/solutions/.*(article|page-metadata|meta-content).xml$"/>
                <properties>
                  <property field="page_type" value="Web Page"/>
                  <property field="page_category" value="Solutions"/>
                </properties>
              </page-type>
              <page-type name="Press Release Page" compiler="com.paetec.index.model.compile.PressReleaseCompiler">
                <uri-pattern pattern=".*/press-releases/.*/(press-release|meta-content).xml$"/>
                <properties>
                  <property field="page_type" value="Press Release"/>
                  <property field="page_category" value="News"/>
                </properties>
              </page-type>
            </page-types>
          </compiler-config>
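The configuration above maps changed content paths to page types by regular expression. A minimal sketch of that lookup follows, assuming first-match-wins ordering and whole-path matching; both are assumptions, as are the class name `PageTypeMatcher` and the sample paths. The two patterns are copied verbatim from the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical lookup from a changed content path to a configured page type. */
final class PageTypeMatcher {

    // Insertion order is kept so earlier <page-type> entries win (an assumption).
    private final Map<String, Pattern> patternsByType = new LinkedHashMap<>();

    void register(String pageTypeName, String uriPattern) {
        patternsByType.put(pageTypeName, Pattern.compile(uriPattern));
    }

    /** Returns the first page type whose pattern matches the whole path. */
    Optional<String> match(String path) {
        for (Map.Entry<String, Pattern> e : patternsByType.entrySet()) {
            if (e.getValue().matcher(path).matches()) return Optional.of(e.getKey());
        }
        return Optional.empty();
    }

    /** Registers the two page types shown on the slide. */
    static PageTypeMatcher fromSlideConfig() {
        PageTypeMatcher m = new PageTypeMatcher();
        m.register("Solution Page",
                ".*/page-content/solutions/.*(article|page-metadata|meta-content).xml$");
        m.register("Press Release Page",
                ".*/press-releases/.*/(press-release|meta-content).xml$");
        return m;
    }

    public static void main(String[] args) {
        PageTypeMatcher m = fromSlideConfig();
        // Hypothetical paths, chosen only to exercise each pattern.
        System.out.println(m.match("/page-content/solutions/security/article.xml"));
        System.out.println(m.match("/page-content/about-us/press-releases/2010/press-release.xml"));
        System.out.println(m.match("/page-content/products/widget/top-section.xml"));
    }
}
```

A path that matches no pattern yields an empty result, which a caller could treat as "nothing to index" for that change.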

  26. Search UI
      • Full-text search
      • Faceted search on category & type
      • Pagination or search result clustering
      • Keyword highlighting in search results
      • Track user queries
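Several of the UI features listed here correspond to standard Solr request parameters: `facet` and `facet.field` for facet counts, `fq` for drilling into a facet value, `hl` and `hl.fl` for highlighting, and `start` and `rows` for pagination. The sketch below assembles such a query string; the class name `FacetedSearchParams`, the 1-based page convention, and the choice to highlight only `page_text` are assumptions, not details from the deck.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical builder for the query string behind a faceted results page. */
final class FacetedSearchParams {

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    /**
     * @param userQuery      raw text typed by the user
     * @param categoryFilter selected page_category facet value, or null for none
     * @param page           1-based page number
     * @param pageSize       results per page
     */
    static String build(String userQuery, String categoryFilter, int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be positive");
        }
        StringBuilder sb = new StringBuilder("qt=solrDemoDismax");
        sb.append("&q=").append(enc(userQuery));
        // Ask for value counts on the two facet fields named on the slide.
        sb.append("&facet=true&facet.field=page_category&facet.field=page_type");
        if (categoryFilter != null) {
            // Filter query narrows results without affecting relevance scoring.
            sb.append("&fq=").append(enc("page_category:\"" + categoryFilter + "\""));
        }
        sb.append("&hl=true&hl.fl=page_text");
        sb.append("&start=").append((page - 1) * pageSize).append("&rows=").append(pageSize);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("content management", "Solutions", 1, 10));
    }
}
```

Faceting works on `page_category` and `page_type` because both are indexed as untokenized `string` fields in the schema, so each whole value is counted as one facet entry.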

  27. Search Results Page

  28. Clustered Results

  29. Summary
      • Requirements:
        • Real-time updates
        • Full editorial control
        • Faceted search
      • Solution:
        • Alfresco CMS
        • Alfresco plugin for Solr indexing
        • Compile updates & index
        • Serve in UI (full-text search + facets)

  30. Q & A
      • Thank you for attending :-)
      • Questions, comments…

  31. Appendix

  32. Search Model/API
