
Metadata Extractors, Content Transformers & Renditions

Neil Mc Erlean


  1. Metadata Extractors, Content Transformers & Renditions • Neil Mc Erlean

  2. Who am I? • Lead Engineer in the Services Team • 4 years at Alfresco (since 3.2) • Previously worked on • Hybrid Sync • Alfresco in the Cloud • Various services/components • Transformers & Extractors • REST APIs • Actions & Behaviours and more… • Ex-astrophysicist (of which more later)

  3. Talk content • What data is in your content? • How does Alfresco get at it? • What does Alfresco do with it? • How can you use these features? • Introductory material • no prior knowledge assumed

  4. Talk content - Breaking it down • Your content & its metadata • Alternative renditions of your content • Overviews of the 3 services • Java Foundation APIs. JavaScript. • Configuring & extending Alfresco. • All code samples available as runnable tests - download from the website.

  5. #1 Metadata Extraction

  6. #2 Content Transformation • Alfresco uses them to produce • images (thumbnails) • plain text (indexing) • inter-Office transforms • Also generally useful

  7. #3 Rendition Service • Very similar to transformations • More general service • More than just content to content

  8. How do these components work? • Mostly by leveraging existing OSS Java libs • Notably Apache Tika • Some external OS processes too • OpenOffice.org (OOo), LibreOffice • ImageMagick • pdf2swf (swftools) • Some bespoke impls, e.g. zip to txt • ‘embedded’ thumbnails/previews (iWork, Office)

  9. General Considerations • CPU, memory • In process vs. out of process vs. Remote CPU • Selection of ‘best’ extractor/transformer • Stay for Andy Hunt’s talk for Support’s troubleshooting tips

  10. Metadata Extraction

  11. #1 Metadata Extraction • Triggered on content creation or update. • or on demand • ‘Best’ available extractor obtained from MetadataExtracterRegistry. • This Extractor pulls out the metadata. • Format depends on the extractor lib/impl. • key/value pairs • These data are mapped onto the Alfresco content model • configurable mapping. <ExtractorClass>.properties

  12. Metadata extraction - Java

      MetadataExtracterRegistry registry = appContext.getBean(
              "metadataExtracterRegistry", MetadataExtracterRegistry.class);
      ContentReader reader =
              contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
      MetadataExtracter extractor = registry.getExtracter(reader.getMimetype());
      Map<QName, Serializable> props = new HashMap<QName, Serializable>();
      extractor.extract(reader, OverwritePolicy.EAGER, props);

  13. Overwrite Policy – when re-extracting • EAGER: overwrite when the extracted value is not null • PRUDENT: as EAGER, and only when the db property doesn’t exist or is null or “” • CAUTIOUS: only when the existing property is undefined
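The three policies on this slide can be read as a small decision rule. The sketch below is a simplified, stand-alone model of that rule, not Alfresco's own `OverwritePolicy` class: the enum name `OverwritePolicySketch`, the `shouldWrite` and `merge` methods, and the use of plain `String` keys are all assumptions made for illustration. The slide does not say whether CAUTIOUS also requires a non-null extracted value, so the sketch does not add that check.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the slide's overwrite policies (hypothetical, not the Alfresco class). */
enum OverwritePolicySketch {
    /** Write whenever the extractor produced a non-null value. */
    EAGER {
        @Override
        boolean shouldWrite(String key, Object extracted, Map<String, Object> existing) {
            return extracted != null;
        }
    },
    /** As EAGER, but only if the existing value is absent, null or an empty string. */
    PRUDENT {
        @Override
        boolean shouldWrite(String key, Object extracted, Map<String, Object> existing) {
            if (extracted == null) {
                return false;
            }
            Object current = existing.get(key); // null when absent or stored as null
            return current == null || current.toString().isEmpty();
        }
    },
    /** Write only if the property is not defined at all on the node. */
    CAUTIOUS {
        @Override
        boolean shouldWrite(String key, Object extracted, Map<String, Object> existing) {
            return !existing.containsKey(key);
        }
    };

    abstract boolean shouldWrite(String key, Object extracted, Map<String, Object> existing);

    /** Returns a copy of the existing properties with the policy applied to each extracted entry. */
    static Map<String, Object> merge(OverwritePolicySketch policy,
                                     Map<String, Object> extracted,
                                     Map<String, Object> existing) {
        Map<String, Object> result = new HashMap<>(existing);
        for (Map.Entry<String, Object> entry : extracted.entrySet()) {
            if (policy.shouldWrite(entry.getKey(), entry.getValue(), existing)) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

With an existing node holding an empty `cm:title` and a populated `cm:author`, EAGER replaces both, PRUDENT replaces only the empty title, and CAUTIOUS leaves both alone and only adds properties the node did not have.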

  14. <ExtractorClass>.properties mapping

      namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
      author=cm:author
      title=cm:title
      # Note: need to escape ':' in key name
      geo\:lat=cm:latitude
      geo\:long=cm:longitude

  15. Mapping properties • Can map extracted key-value onto multiple content properties • Can ignore extracted key-values i.e. not map.
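To make the mapping format concrete, here is a minimal sketch of how such a file could be read with the standard `java.util.Properties` class, which already unescapes `geo\:lat` to the key `geo:lat`. The class `MappingSketch` and its `parse` method are hypothetical and are not how Alfresco itself loads these files. The comma-separated list used to map one extracted key onto several content properties is an assumption about the file syntax, and fully qualified names are rendered as `{uri}localName` purely for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

/** Hypothetical reader for an extractor mapping file (illustration only). */
class MappingSketch {
    static final String NS_PREFIX_KEY = "namespace.prefix.";

    /** Maps each extracted key to the list of "{uri}localName" targets it feeds. */
    static Map<String, List<String>> parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text)); // unescapes "geo\:lat" to "geo:lat"
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // First pass: collect the namespace prefix declarations.
        Map<String, String> namespaces = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(NS_PREFIX_KEY)) {
                namespaces.put(key.substring(NS_PREFIX_KEY.length()),
                               props.getProperty(key).trim());
            }
        }

        // Second pass: resolve every "prefix:name" target against those declarations.
        Map<String, List<String>> mapping = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(NS_PREFIX_KEY)) {
                continue;
            }
            List<String> targets = new ArrayList<>();
            for (String target : props.getProperty(key).split(",")) { // assumed list syntax
                String[] parts = target.trim().split(":", 2);
                if (parts.length != 2) {
                    throw new IllegalArgumentException("Expected prefix:name, got '" + target + "'");
                }
                String uri = namespaces.get(parts[0]);
                if (uri == null) {
                    throw new IllegalArgumentException("Unknown prefix: " + parts[0]);
                }
                targets.add("{" + uri + "}" + parts[1]);
            }
            mapping.put(key, targets);
        }
        return mapping;
    }
}
```

An extracted key that has no entry in the file simply never appears in the result, which corresponds to the "ignore, i.e. do not map" case on the slide.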

  16. Metadata extraction - JavaScript

      var action = actions.create('extract-metadata');
      action.execute(nodeRef);

  17. Ways to customise & extend • Customisation of existing extractors • Define new mappings – to an existing or a new content model. • Adding new extractors • Identify 3rd party lib that can read the binary file • Or write your own code to do this • Extend AbstractMappingMetadataExtracter • Or write a Tika plugin • Define metadata mappings • org.alfresco.repo.content.metadata

  18. Recap • Metadata extraction harvests ‘hidden’ data and maps it into Alfresco content model. • Support for many MIME types • Metadata insertion coming • it’s on HEAD but currently disabled • also maps metadata tags to cm:taggable • “Best” extractor selection covered below

  19. Content Transformers

  20. Out of the box transformers • text, html, xml • Microsoft Office (doc & docx formats) • OpenDocument Format • iWork (Keynote, Pages, Numbers) • Images • Shockwave Flash (SWF) • RFC822 email, Outlook .msg email • Adobe PDF, Illustrator, PSD • Electronic publication (epub) • Rich Text (RTF) • MP3 • Archives (ZIP, tar) • Many more

  21. Available transformers • No ‘graph’ of transform paths/mime types • Spring beans extend “baseContentTransformer” • They implement isTransformable(from, to) • They can be • simple (A to B) • ‘complex’ (A to C, via B) • failover (A to B, A to B…) • overlapping (multiple beans for same path) • dynamically un/available (e.g. OOo)
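The simple, complex and failover shapes listed here can be modelled in a few lines of plain Java. This is a stand-alone sketch, not Alfresco's transformer classes: `TransformerKinds`, its nested `Transformer` interface and the factory methods are hypothetical names, and content is represented as a `String` instead of a reader/writer pair to keep the example short.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical model of simple, complex and failover transformers (illustration only). */
class TransformerKinds {

    interface Transformer {
        boolean isTransformable(String from, String to);
        String transform(String content, String from, String to);
    }

    /** Simple: handles exactly one source/target pair (A to B). */
    static Transformer simple(String from, String to, UnaryOperator<String> fn) {
        return new Transformer() {
            public boolean isTransformable(String f, String t) {
                return from.equals(f) && to.equals(t);
            }
            public String transform(String content, String f, String t) {
                if (!isTransformable(f, t)) {
                    throw new IllegalArgumentException("Unsupported: " + f + " -> " + t);
                }
                return fn.apply(content);
            }
        };
    }

    /** Complex: A to C by chaining two transformers through an intermediate type B. */
    static Transformer complex(Transformer first, String via, Transformer second) {
        return new Transformer() {
            public boolean isTransformable(String f, String t) {
                return first.isTransformable(f, via) && second.isTransformable(via, t);
            }
            public String transform(String content, String f, String t) {
                return second.transform(first.transform(content, f, via), via, t);
            }
        };
    }

    /** Failover: try each candidate for the same path in order until one succeeds. */
    static Transformer failover(List<Transformer> candidates) {
        return new Transformer() {
            public boolean isTransformable(String f, String t) {
                return candidates.stream().anyMatch(c -> c.isTransformable(f, t));
            }
            public String transform(String content, String f, String t) {
                RuntimeException last = null;
                for (Transformer candidate : candidates) {
                    if (!candidate.isTransformable(f, t)) {
                        continue;
                    }
                    try {
                        return candidate.transform(content, f, t);
                    } catch (RuntimeException e) {
                        last = e; // remember the failure and fall through to the next one
                    }
                }
                throw last != null ? last
                        : new IllegalArgumentException("No transformer for " + f + " -> " + t);
            }
        };
    }
}
```

Because each transformer only answers `isTransformable(from, to)` for itself, there is no global graph of paths to consult; a caller has to ask the candidates, which matches the first bullet of the slide.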

  22. /api/service/mimetypes webscript • http://localhost:8080/alfresco/service/mimetypes • MIME types • Metadata Extractors • Content Transformers • As services come and go (OOo), entries may disappear

  23. /api/service/mimetypes webscript
      application/vnd.openxmlformats-officedocument.presentationml.presentation - pptx
      • Extractors: org.alfresco.repo.content.metadata.PoiMetadataExtracter
      • Transformable To:
          application/pdf = Using a Direct Open Office Connection
          application/vnd.ms-powerpoint = Using a Direct Open Office Connection
          application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection
          application/x-shockwave-flash = Complex via: application/pdf
          image/jpeg = Complex via: application/pdf
          image/png = Complex via: application/pdf
          text/html = org.alfresco.repo.content.transform.TikaAutoContentTransformer
          text/plain = org.alfresco.repo.content.transform.TikaAutoContentTransformer
          text/xml = org.alfresco.repo.content.transform.TikaAutoContentTransformer
      • Transformable From:
          application/vnd.ms-powerpoint = Using a Direct Open Office Connection
          application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection

  24. “Best” transformer selection • Alfresco prefers • available transformers (obviously) • ‘explicit’ transformers • previously fast transformers* • Alfresco doesn’t understand the output quality • pass/fail • fast/slow • * past performance is not a guide to future performance.
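One way to picture the preferences on this slide is as a filter followed by a two-level sort. The sketch below is only an illustration under stated assumptions: `SelectionSketch`, `Candidate` and `pickBest` are hypothetical names, and the relative priority (available first, then explicit, then fastest recorded average time) is a guess at how the three bullets combine, not a description of Alfresco's registry code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical "best transformer" selection (illustration only). */
class SelectionSketch {

    static final class Candidate {
        final String name;
        final boolean available;     // e.g. false while OpenOffice is down
        final boolean explicit;      // explicitly configured for this source/target pair
        final double averageMillis;  // average duration of previous runs

        Candidate(String name, boolean available, boolean explicit, double averageMillis) {
            this.name = name;
            this.available = available;
            this.explicit = explicit;
            this.averageMillis = averageMillis;
        }
    }

    /** Drops unavailable candidates, then prefers explicit ones, then the historically fastest. */
    static Optional<Candidate> pickBest(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> c.available)
                .min(Comparator
                        .comparing((Candidate c) -> !c.explicit) // false sorts first, so explicit wins
                        .thenComparingDouble(c -> c.averageMillis));
    }
}
```

Nothing in this ranking looks at the quality of the output; a run is only recorded as pass or fail and fast or slow, which is the limitation the slide points out.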

  25. Content Transformation - Java

      ContentTransformerRegistry registry = appContext.getBean(
              "contentTransformerRegistry", ContentTransformerRegistry.class);
      ContentReader reader =
              contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
      ContentWriter writer =
              contentService.getWriter(targetNode, ContentModel.PROP_CONTENT, true);
      writer.setEncoding("UTF-8");
      writer.setMimetype(MimetypeMap.MIMETYPE_TEXT_PLAIN);
      // Now have a reader & writer ready to go

  26. Content Transformation – Java ctd.

      ContentTransformer transformer = registry.getTransformer(
              MimetypeMap.MIMETYPE_ZIP, reader.getSize(),
              MimetypeMap.MIMETYPE_TEXT_PLAIN, null);
      transformer.transform(reader, writer);

  27. Content Transformation - JavaScript

      var action = actions.create('transform');
      action.parameters["destination-folder"] = node.parent;
      action.parameters["assoc-type"] =
              "{http://www.alfresco.org/model/content/1.0}contains";
      action.parameters["assoc-name"] = node.name + "transformed";
      action.parameters["mime-type"] = "text/plain";
      action.execute(testNode);

  28. Config: Transformer Filtering/Debugging • org.alfresco.service.cmr.repository.TransformationOptionLimits • timeouts, size limits, page limits • content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes=5120 • org.alfresco.repo.content.TransformerDebug • contextual logging
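As a rough illustration of what a source size limit means in practice, the sketch below reads the property shown on this slide and checks a source size against it. `SizeLimitSketch` and `withinLimit` are hypothetical names, and two behaviours are assumptions rather than documented facts: that a missing property means no limit, and that a negative value also means no limit.

```java
import java.util.Properties;

/** Hypothetical check of a transformer source size limit (illustration only). */
class SizeLimitSketch {

    /** Returns true if a source of the given size may be handed to the transformer. */
    static boolean withinLimit(Properties config, String key, long sourceSizeBytes) {
        String raw = config.getProperty(key);
        if (raw == null) {
            return true; // assumption: no property means no limit
        }
        long maxKBytes = Long.parseLong(raw.trim());
        if (maxKBytes < 0) {
            return true; // assumption: a negative value means unlimited
        }
        return sourceSizeBytes <= maxKBytes * 1024L;
    }
}
```

With the slide's value of 5120, a 5 MB text file (5120 × 1024 bytes) is still accepted for the txt to pdf path and anything larger is turned away.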

  29. Extending • Follow the Alfresco patterns • org.alfresco.repo.content.transform • Remember the chains • Remember the subsystems • ImageMagick • OpenOffice • Remember the Enterprise variants • JodConverter

  30. Recap • Many transformations & paths possible • No graph • Can be expensive in CPU/memory • Transformation to text = free indexing • No link between source & transformed content • Thumbnails are children of their source nodes • Bespoke behaviours ensure thumbnails are updated

  31. Renditions

  32. Renditions • A more general feature than transformers • Although with a strong overlap • Thumbnails are renditions • Previews are renditions • Not all renditions are thumbnails/previews

  33. Renditions • Flexible location • Always associated to their source node. • Child nodes of their source node. • Child nodes of another folder node. • Updated when their source updates. • Can be disabled with marker aspect • rn:preventRenditions • See ‘preventRenditions’ spring bean to register other ‘unrenditionable’ content classes • Can reflect the content and/or metadata of their source node.

  34. Standard rendition engines • reformat: redirects to vanilla transforms • image: image manipulation parameters • freemarker: run some FTL against source content • xslt: run XSLT on (XML) source node • composite: rendition series, e.g. [reformat, crop]

  35. Persistence of Rendition Definitions • Create Rendition Definition • Set parameter values on it • Execute it against a source node • Definitions can be persisted • Useful for complex or commonly used • RenditionService.save(), .load() • Saved into Alfresco’s Data Dictionary

  36. Renditions - Java

      NodeRef jpgNodeRef;  // must reference an existing JPEG node before use
      QName renditionName = QName.createQName(
              NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
      RenditionDefinition renditionDef = renditionService.createRenditionDefinition(
              renditionName, "imageRenderingEngine");
      renditionDef.setParameterValue(ImageRenderingEngine.PARAM_RESIZE_WIDTH, 128);
      renditionDef.setParameterValue(ImageRenderingEngine.PARAM_RESIZE_HEIGHT, 512);
      renditionDef.setParameterValue(
              ImageRenderingEngine.PARAM_MAINTAIN_ASPECT_RATIO, false);
      ChildAssociationRef chAssRef = renditionService.render(jpgNodeRef, renditionDef);

  37. Renditions - JavaScript

      var renditionDef = renditionService.createRenditionDefinition(
              "cm:cropResize", "imageRenderingEngine");
      renditionDef.parameters["destination-path-template"] =
              "/Company Home/Cropped Images/${name}.jpg";
      renditionDef.parameters["isAbsolute"] = true;
      renditionDef.parameters["xSize"] = 50;
      renditionDef.parameters["ySize"] = 50;
      renditionService.render(testNode, renditionDef);
      var renditions = renditionService.getRenditions(testNode);

  38. Recap • Renditions == Transformations++ • More complex, more powerful

  39. End
