
Centera Integration Training

Centera API Best Practices. Corporate Systems Engineering, July 2007. Agenda: Refresher on Centera API; Performance Optimizations; Metadata; Storage Strategies & ClipID Formats; SDK Options (Buffers / Failover); Managing Retention; Wrap Up and Questions.


Presentation Transcript


  1. EMC Centera – API Best Practices. Centera Integration Training: Centera API Best Practices. Corporate Systems Engineering, July 2007

  2. Agenda
     • Refresher on Centera API
     • Performance Optimizations
     • Metadata
     • Storage Strategies & ClipID Formats
     • SDK Options (Buffers / Failover)
     • Managing Retention
     • Wrap Up and Questions

  3. API Background
     • Understand the main Centera API concepts
       • Pools, Clips, Streams, Tags, Blobs, Query
     • Appreciate the following points:
       • Open, read, write, query and delete transactions contact the Centera, while clip create, get/set attribute and close are local SDK operations
       • 1 transaction uses 1 socket, which is 1 thread of access
       • C-Clips only have relationships with blobs, never with other C-Clips
       • Any write operation carries a high transactional overhead, which affects the writing of small objects
     • Understanding these points makes integration easier, more efficient and more effective!

  4. Deploy In A Multi-Tier Architecture
     [Diagram labels: Clients Running Application, Application, CentraStar Software, API Library, OS]
     • CentraStar Software resides on the Centera.
     • API Library resides on the application server.
     • Client application on PC connects to application server.
     • Server application makes calls to Centera-supplied API.
     • API interacts with Centera over TCP/IP using HPP protocol.

  5. Transaction Example (Write)
     [Diagram labels: Application Server, Application, API Library, Operating System, Centera, Client]
     • The client sends a write (import) request to the server
     • The server sends an acknowledgement to the client
     • The client begins streaming its data to the server
     • After the client has sent its last packet of data, it begins waiting for an ACK from the server
     • The server distributes the data to the Storage Nodes, and updates its indices
     • The server sends an acknowledgement to the client and the write operation returns
     Request ------>
             <------ ACK
     Data    ------>
     Data    ------>
     Data    ------>
     Data    ------>
       .
       . (Client waiting for ACK)
       .
             <------ ACK

  6. Performance Optimizations
     • Multithreading
     • Small Object Management
       • Embedded Blobs
       • Containerization
       • Hybrid Containerization
     • Huge Object Management
       • Blob Slicing
     • Pool Management

  7. Optimizing Write Performance
     [Diagram: Large Object (~5MB) with setup and commit phases; Small Object (~50KB) with setup and commit phases]
     So for a single stream over time we see that, with small objects, most of the total transaction time is spent in setup and commit, and hardly any in data transfer!
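
The effect described on this slide can be sketched with a simple cost model: one write costs a fixed setup-plus-commit overhead plus the time to move the bytes. The overhead and bandwidth figures below are hypothetical values chosen only to show the shape of the result; they are not measurements of any Centera system.

```python
def transfer_fraction(size_bytes, overhead_s, bandwidth_bytes_per_s):
    """Fraction of one write transaction spent moving data rather than in setup/commit."""
    transfer_s = size_bytes / bandwidth_bytes_per_s
    return transfer_s / (overhead_s + transfer_s)

# Hypothetical figures, chosen only to illustrate the shape of the curve.
OVERHEAD_S = 0.05              # assumed fixed setup + commit cost per write
BANDWIDTH = 50 * 1024 * 1024   # assumed 50 MB/s effective throughput

large = transfer_fraction(5 * 1024 * 1024, OVERHEAD_S, BANDWIDTH)
small = transfer_fraction(50 * 1024, OVERHEAD_S, BANDWIDTH)
print(f"large object: {large:.0%} of the time is data transfer")
print(f"small object: {small:.0%} of the time is data transfer")
```

Under these assumptions the large object spends most of its transaction moving data, while the small object spends almost all of it in fixed overhead, which is why the following slides focus on overlapping or amortising that overhead.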

  8. Multithreaded Writes
     Multithreading provides overlapped I/O: more data is transferred in a given period of time.
     [Diagram: Thread 1 to Thread 8, overlapped]

  9. Multithreading
     • Advantages
       • Better utilization of available bandwidth
       • Overlapped I/O yields better throughput
       • Takes advantage of multiple access nodes
       • Shared PoolConnection for improved load balancing
     • Disadvantages
       • More coding required
       • Multithreaded coding/debugging is generally trickier than single-threaded programming
       • Thread packages differ from platform to platform
       • Scales to a point, then rolls off

  10. Multithreading
      • Make the number of threads configurable individually for Read and Write
        • A good combined number is 20 threads per access node
        • This needs to be configured at install time
      • For large numbers of threads, increase the value of FP_OPTION_MAXCONNECTIONS (default is 100)
      • No application exists in a vacuum
        • Be conscious of workload imposed by other applications
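
A minimal sketch of the configurable thread budget described above, using Python's standard thread pool. The `store` function is a stand-in for a real write transaction, and the `write_share` parameter is an assumption of this sketch (the slide only says the read and write counts should be individually configurable).

```python
from concurrent.futures import ThreadPoolExecutor

THREADS_PER_ACCESS_NODE = 20  # combined read + write figure from the slide

def thread_budget(access_nodes, write_share=0.5):
    """Split the combined per-node thread budget into (read, write) counts."""
    total = THREADS_PER_ACCESS_NODE * access_nodes
    write_threads = max(1, round(total * write_share))
    return total - write_threads, write_threads

def store(payload):
    """Stand-in for one write transaction; a real application would call the SDK here."""
    return len(payload)

# Install-time configuration: 2 access nodes, a quarter of the threads for writes.
read_threads, write_threads = thread_budget(access_nodes=2, write_share=0.25)

payloads = [b"x" * n for n in range(1, 101)]
with ThreadPoolExecutor(max_workers=write_threads) as writers:
    written = list(writers.map(store, payloads))
```

Keeping the two counts as configuration rather than constants lets an installer tune them against the workload other applications place on the same cluster.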

  11. Managing Small Content - Object Count Limitations
      • CentraStar 3.1 / Gen 4 has a limit of 50 million objects per node
      • This count includes all types of Centera objects
        • CDF
        • Blobs
        • Mirror copies
        • Parity fragments
        • Reflections
      • CDF should be designed to fully utilise capacity (bytes) before these object count limits are encountered.
      • Embedded Blobs cut down object usage by at least 50%
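
The "at least 50%" figure can be checked with simple counting. The sketch below is a simplified model: it assumes mirrored protection (two copies of every object) and ignores parity fragments and reflections, so the absolute numbers are illustrative only.

```python
OBJECT_LIMIT_PER_NODE = 50_000_000  # slide: CentraStar 3.1 / Gen 4

def objects_per_clip(blobs_per_clip, embedded, copies=2):
    """Centera objects consumed by one C-Clip, counting each protected copy.

    copies=2 models mirroring; parity fragments and reflections are ignored
    to keep the arithmetic simple.
    """
    logical = 1 if embedded else 1 + blobs_per_clip  # CDF plus any linked blobs
    return logical * copies

linked = objects_per_clip(blobs_per_clip=1, embedded=False)
embedded = objects_per_clip(blobs_per_clip=1, embedded=True)
clips_before_limit_linked = OBJECT_LIMIT_PER_NODE // linked
clips_before_limit_embedded = OBJECT_LIMIT_PER_NODE // embedded
```

With one blob per clip the saving is exactly half; with several blobs per clip the saving grows, because all of them collapse into the single CDF object.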

  12. Managing Small Content - Embedded Blobs
      [Diagram labels: CDF, Blob]
      • Whenever a write or read is done, two objects are transferred
      • Writes
        • the Blob is written and added to the Tag in the CDF being constructed; when fully constructed, the CDF is written.
      • Reads
        • the CDF is read, the application navigates to the Tag containing the content and the Blob is retrieved

  13. Managing Small Content - Embedded Blobs
      [Diagram labels: CDF, Blob]
      • With embedded blobs, there is no separate blob, so all data is transferred as a single object when the CDF is read or written
      • The SDK transparently stores the Blob as an Attribute on the Tag inside the CDF
        • Base64 encoded to adhere to the XML character set
      • Developer does not write any “special” code other than enabling the feature
      • I/O operations are reduced by at least half
        • only the CDF is read/written
        • proportionately greater savings if multiple blobs are stored in the CDF
      • Can only be used for relatively small objects (< 100KB)

  14. Managing Small Content - Embedded Blobs
      • Embedded Blobs are easily enabled within an application
      • Globally via an FPPool option
        • FPPool_SetGlobalOption(FP_OPTION_EMBEDDED_DATA_THRESHOLD, 100*1024)
        • Threshold (100KB in the example above, max is 100KB) is then used to determine how the data is stored
      • Explicitly on the FPTag_BlobWrite call
        • FPTag_BlobWrite(theTag, theStream, FP_OPTION_LINK_DATA)
        • FPTag_BlobWrite(theTag, theStream, FP_OPTION_EMBED_DATA)
        • Overrides any Global setting that is in force
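
The interaction between the global threshold and the per-call override can be modelled as a small decision function. This is an illustrative model, not SDK code: the function name and string return values are hypothetical, and treating a blob whose size equals the threshold as linked is an assumption of this sketch.

```python
MAX_EMBED_THRESHOLD = 100 * 1024  # slide: the threshold maximum is 100KB

def storage_mode(blob_size, global_threshold=0, override=None):
    """Return "embed" or "link" for a blob of blob_size bytes.

    override models a per-call option ("embed" or "link") and, as on the
    slide, beats the global threshold. Treating a blob exactly at the
    threshold as linked is an assumption of this sketch.
    """
    if override is not None:
        return override
    threshold = min(global_threshold, MAX_EMBED_THRESHOLD)
    return "embed" if blob_size < threshold else "link"

small_default = storage_mode(50 * 1024, global_threshold=100 * 1024)
large_default = storage_mode(200 * 1024, global_threshold=100 * 1024)
small_forced_link = storage_mode(50 * 1024, global_threshold=100 * 1024, override="link")
```

The sketch does not model what the SDK does when embedding is forced on a blob above the maximum; that behaviour should be checked against the API guide.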

  15. Managing Small Content - Embedded Blobs
      • Advantages
        • Can dramatically decrease object count usage if multiple blobs are stored embedded in each CDF
        • Reduces I/Os (blob does not need to be read separately)
        • Easy to code
      • Disadvantages
        • Single instance storage is lost
        • XML-compatible data encoding (Base64) increases storage requirement by 33%
        • Read performance can be impacted
          • All blob content is retrieved when opening the clip
          • The larger CDF takes longer to parse
        • Standard guidelines should be followed, i.e. CDF size < 10MB
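
The 33% figure follows directly from how Base64 works: every 3 input bytes become 4 output characters. A quick check with the standard library, using arbitrary sample data:

```python
import base64

def embedded_size(raw_bytes):
    """Bytes needed to hold raw_bytes once Base64 encoded (padding included)."""
    return 4 * ((raw_bytes + 2) // 3)

blob = bytes(range(256)) * 300          # 76,800 bytes of sample data
encoded = base64.b64encode(blob)
overhead = len(encoded) / len(blob) - 1  # fractional growth caused by encoding
```

For inputs whose length is not a multiple of 3 the padding adds up to 2 further characters, so 33% is a floor rather than an exact figure for very small blobs.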

  16. Managing Small Content - Containerization
      [Diagram: a Content Descriptor File indexing objects inside a Container Blob]
      Image name   Byte offset   Length
      1023.jpg     21078         2497
      • Small objects are collected and inserted into a larger container object
      • Each individual object’s byte offset and length are stored in the metadata
      • When an object is retrieved, a byte offset read is done through the API and only the small object is returned
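
The offset-and-length bookkeeping behind a container can be sketched in a few lines. This is an in-memory illustration only: the function names are hypothetical, and in a real integration the index would live in the CDF metadata and the read would be a partial blob read through the API.

```python
def build_container(objects):
    """Pack named byte strings into one container blob plus an offset index."""
    container = bytearray()
    index = {}
    for name, data in objects.items():
        index[name] = (len(container), len(data))  # (byte offset, length)
        container.extend(data)
    return bytes(container), index

def read_object(container, index, name):
    """Return a single object using its stored offset and length."""
    offset, length = index[name]
    return container[offset:offset + length]

images = {"1021.jpg": b"A" * 10, "1022.jpg": b"B" * 5, "1023.jpg": b"C" * 7}
container, index = build_container(images)
one_image = read_object(container, index, "1023.jpg")
```

Because every later offset depends on everything packed before it, removing one object means rebuilding both the container and the index, which is the deletion cost noted on the next slide.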

  17. Managing Small Content - Containerization
      • Advantages
        • Better utilization of available bandwidth
        • Much faster ingest of a large number of small pieces of content
        • Reduces the object count
      • Disadvantages
        • More coding involved
        • Deletion of an individual object requires re-writing and re-indexing the entire container
        • No Single Instance Storage
        • Limited “use case”
          • Only applicable where huge numbers of small objects need to be stored in the same C-Clip
          • Size of CDF would become unmanageable / non-performant (100MB limit)
        • Embedded Blobs strategy is preferable in most situations

  18. Managing Small Content - Hybrid Containerization
      [Diagram: a CDF containing Check1003, Check1004, Check1005 and Check1006]
      • Combines aspects of Embedded Blobs and Containers
      • “Containers” are constructed using multiple embedded blobs
      • CDF effectively becomes the container
      • Each blob still represents a single application-level object

  19. Managing Small Content - Hybrid vs. Classic Containers
      • Individual object indexing not required
      • Local storage managed by SDK rather than application
        • Application does not need to build a local container
        • The CDF becomes the container
      • Simplified deletion of individual objects from container
      • Automatic Single Instance Storage for objects larger than the embedded blob threshold
      • No code changes required

  20. Managing Huge Content - Blob Slicing
      • Write blob data to Centera using multiple threads provided by the application
      • Enables increased performance at time of write
        • No increase in performance for blobs < 5MB in size
      • Segments are exported to Centera as if they are different blobs
      • Segments are referenced by a single tag
      • The same method as the internal 100MB blob segmentation feature

  21. Managing Huge Content - Blob Slicing
      • FPTag_BlobWritePartial(Tag, Stream, options, sequenceID)
      • Sequence ID determines the sequence of data written by multiple threads for one tag
        • sets the order in which data is to be read back by the SDK
        • must be greater than 0
        • Duplicate IDs on a tag are not allowed and will return an error
      • Read-back is performed in ascending Sequence ID order
      • Transparently supports FPTag_BlobRead() and FPTag_BlobReadPartial().

  22. Managing Huge Content - Blob Slicing
      • Does not operate with any embedded options
        • Linked data only
        • FP_OPTION_EMBED_DATA causes an error
        • FP_OPTION_EMBEDDED_DATA_THRESHOLD setting is ignored

  23. Managing Huge Content - Blob Slicing
      • FPTag_CreatePartialFileForInput(FilePath, Perm, BuffSize, Offset, Length)
      • Similar to CreateFileForInput
      • Allows for a section of the input file to be written
      • Two additional parameters
        • Offset where reading should begin within the file
        • Length of the segment to be read
      • Transparently supports FPTag_BlobWrite() and FPTag_BlobWritePartial().
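
The slicing scheme across the last four slides comes down to planning (sequence ID, offset, length) triples that each thread can then write independently. The sketch below plans the segments and checks that reading them back in ascending sequence ID order restores the original bytes. It is an in-memory model: the function name is hypothetical and no SDK calls are made.

```python
def plan_segments(total_size, segment_size):
    """Split total_size bytes into (sequence_id, offset, length) segments.

    Sequence IDs start at 1 and increase with the offset, so reading the
    segments back in ascending ID order restores the original byte order.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    segments = []
    offset, sequence_id = 0, 1
    while offset < total_size:
        length = min(segment_size, total_size - offset)
        segments.append((sequence_id, offset, length))
        offset += length
        sequence_id += 1
    return segments

data = bytes(range(256)) * 100               # stand-in for a large file
plan = plan_segments(len(data), 10_000)

# Each (offset, length) pair is what one writer thread would be handed.
pieces = {sid: data[off:off + n] for sid, off, n in plan}
rebuilt = b"".join(pieces[sid] for sid in sorted(pieces))
```

Deriving the sequence IDs from a single plan, rather than letting each thread pick its own, guarantees they are unique and greater than zero as the slides require.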

  24. Pool Management – Creation process
      FPPool_Open("10.0.0.1,10.0.0.2")
      • Pool of sockets (and associated data structures) is allocated
      • Probe packet is sent to the first address provided
        • if a response is not received before timeout (configurable, default is 2 minutes), subsequent addresses will be tried in sequence
      • Response is received from another AN (access node), which contains a list of all known ANs ordered by load
      • The replica for the cluster is probed (with the same default timeout of 2 minutes)
      • The entire replica chain is walked
      • Allowing for timeouts, this could be a lengthy process
      [Diagram labels: X, Primary, Replica1, Replica2]
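
To see why opening a pool can be slow, the sequential probing can be simulated. This is a toy model that assumes probes run strictly one after another and that each unreachable address costs one full timeout; the function name and the reachability set are hypothetical.

```python
PROBE_TIMEOUT_S = 120  # slide: configurable, default is 2 minutes

def probe_in_sequence(addresses, reachable, timeout_s=PROBE_TIMEOUT_S):
    """Return (first responding address or None, seconds lost to timeouts)."""
    waited = 0
    for address in addresses:
        if address in reachable:
            return address, waited
        waited += timeout_s
    return None, waited

# The first address is down, the second responds.
responder, delay = probe_in_sequence(["10.0.0.1", "10.0.0.2"], reachable={"10.0.0.2"})
```

The same cost applies again for each unreachable cluster in the replica chain, which is the reason the next slide recommends opening the pool once and reusing it.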

  25. Pool Management – Recommended Strategy
      • FPPool_Open should be called once when the application is started
      • Subsequent Centera I/O should be done using this single Pool Reference
      • When the application shuts down, call FPPool_Close
      [Diagram: FPPool_Open, then many Read and Write operations, then FPPool_Close]
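
The open-once, close-once lifecycle maps naturally onto a scoped resource. In the sketch below `FakePool` is a hypothetical stand-in that only counts calls; it is not the SDK and exists purely to show the shape of the pattern.

```python
from contextlib import contextmanager

class FakePool:
    """Hypothetical stand-in for an SDK pool reference; it only counts calls."""
    opens = 0

    def __init__(self, addresses):
        FakePool.opens += 1
        self.addresses = addresses
        self.closed = False
        self.operations = 0

    def write(self, data):
        self.operations += 1

    def read(self, clip_id):
        self.operations += 1

    def close(self):
        self.closed = True

@contextmanager
def application_pool(addresses):
    """Open the pool once at start-up and close it at shutdown, even on error."""
    pool = FakePool(addresses)
    try:
        yield pool
    finally:
        pool.close()

# All I/O for the life of the application shares the one pool reference.
with application_pool("10.0.0.1,10.0.0.2") as pool:
    for i in range(3):
        pool.write(b"content")
        pool.read(f"clip-{i}")
```

The anti-pattern this avoids is opening and closing a pool around every read or write, which would pay the probing cost from the previous slide on each operation.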

  26. Centera in a Multi-Application Repository Environment
      [Diagram: Application #1 (ECM), Application #2 (Email Archiving) and Application #3 (PC Backup) connected to one Centera]

  27. Virtual Pools
      How can we partition data access?
      [Diagram labels: Cluster Pool; App Pool 1, App Pool 2, App Pool 3, Default Pool; legend: Blob, CDF; Pool 1, Pool 2, Pool 3, Default Pool]

  28. Virtual Pools
      • Virtual Pools implement a form of Data Partitioning
      • Think of Virtual Pools as Virtual Centeras within a physical Centera cluster
      • Applications should
        • connect to Virtual Pools through Access Profiles and Capabilities

  29. Metadata
      • Metadata is the key to disaster recovery
        • Rich metadata helps to identify individual pieces of content given local repository failure
        • Clip level metadata can be retrieved as part of a Query Result set when performing Disaster Recovery
          • FPQueryExpression_SelectField(myQueryExp, "aClipAttribute");
      • CenteraSeek relies on metadata
        • capability to “Google” the Centera
          • this is not what Query is intended for!
        • chargeback reporting
        • uses metadata only, not the document content itself
        • all types of Metadata can be queried
          • Standard SDK metadata e.g. creation.date
          • Clip level metadata e.g. ApplicationName (attribute added by application)
          • Tag level metadata e.g. InvoiceNumber (attribute added by application)

  30. Metadata - Sample CDF
      <?xml version='1.0' encoding='UTF-8' standalone='no'?>
      <ecml version="3.0">
        <eclipdescription>
          <meta name="type" value="Standard"/>
          <meta name="name" value="TrainingClip"/>
          <meta name="creation.date" value="2004.08.05 09:31:19 GMT"/>
          <meta name="modification.date" value="2004.08.05 09:35:51 GMT"/>
          <meta name="numfiles" value="1"/>
          <meta name="totalsize" value="2082"/>
          <meta name="refid" value="5DD0B54HG7OCG3UTUGV1FP004Q"/>
          <meta name="prev.clip" value=""/>
          <meta name="clip.naming.scheme" value="MD5"/>
          <meta name="numtags" value="1"/>
          <meta name="sdk.version" value="3.0.377"/>
          <custom-meta name="ApplicationName" value="MetadataExample"/>
          <custom-meta name="ApplicationVendor" value="EMC_Engineering"/>
          <custom-meta name="ApplicationVersion" value="3.4"/>
        </eclipdescription>
        <eclipcontents>
          <myMainTag someMeaningfulAttribute="aValue" anotherAttribute="andAnotherValue">
            <eclipblob md5="DM6JEBLJFH9I" size="694327" offset="0"/>
          </myMainTag>
        </eclipcontents>
      </ecml>
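
The three kinds of metadata in the sample CDF (standard, clip level, tag level) can be pulled apart with any XML parser. The sketch below uses Python's standard library on a trimmed copy of the sample; it only illustrates the document structure and is not how the SDK itself exposes attributes.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample CDF from the slide.
CDF = """<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<ecml version="3.0">
  <eclipdescription>
    <meta name="name" value="TrainingClip"/>
    <meta name="creation.date" value="2004.08.05 09:31:19 GMT"/>
    <custom-meta name="ApplicationName" value="MetadataExample"/>
    <custom-meta name="ApplicationVersion" value="3.4"/>
  </eclipdescription>
  <eclipcontents>
    <myMainTag someMeaningfulAttribute="aValue">
      <eclipblob md5="DM6JEBLJFH9I" size="694327" offset="0"/>
    </myMainTag>
  </eclipcontents>
</ecml>"""

root = ET.fromstring(CDF.encode("utf-8"))
description = root.find("eclipdescription")

# Standard SDK metadata, e.g. creation.date
standard = {m.get("name"): m.get("value") for m in description.findall("meta")}
# Clip level metadata added by the application
custom = {m.get("name"): m.get("value") for m in description.findall("custom-meta")}
# Tag level metadata: attributes on each application-defined tag
tag_attributes = {tag.tag: dict(tag.attrib) for tag in root.find("eclipcontents")}
```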

  31. Disaster Recovery
      [Diagram: Application Server, DB and Centera; an email “To: DBA” carrying the clip ID EFG242769LH32e57R23E2IBC4FE]
      • What happens if the local Content Address repository is lost?
      • Protect the relationship between the archive and the local store
        • Metadata assists in local store reconstruction using Query
        • Store the Transaction Logs or Incremental database backups on Centera
        • Email resulting C-Clip IDs to DBA or store on Profile Clip
        • Use a separate Virtual Pool exclusively for these backups
        • Logs / Backups can be easily retrieved in DR scenario

  32. Storage Strategy Capacity (SSC)
      • Allows for Single Instance Storage
        • prevents identical content being stored numerous times
      • Uses M or M++ naming schemes
        • M++ performs additional SHA-256 calculation (performance overhead)
        • Reduces collision potential
      • Content Address is 27 (M) or 53 (M++) bytes long
        • M++ CA incorporates part of the SHA-256 calculated ID

  33. Storage Strategy Performance (SSPP / SSPF)
      • Blobs below the threshold (default 250K) are written using a 53 byte Content Address incorporating a Time element
      • 2 different types are available
        • Partial (SSPP)
          • C-Clips retain the standard 27 byte Content Address
          • Should only be used by applications which cannot handle a 53 byte CA
        • Full (SSPF)
          • C-Clips and Blobs both use the 53 byte CA
      • Always use FP_OPTION_CLIENT_CALCID_STREAMING as content cannot pre-exist due to the change in the Content Address
      • When using Generic Streams of unknown size, set FP_OPTION_PREFETCH_SIZE >= SSP threshold to ensure the correct strategy is used
      • Use the SDK defined constant for Content Address Length to ensure compatibility with future naming schemes

  34. SDK Options
      • The SDK allows for configuration of many options to control different aspects of PoolConnection behaviour
        • Buffer sizes
        • Failover strategies
        • Storage strategies
        • Timeouts / Retry limits
      • 2 main types of options
        • Global options which apply to all PoolReferences created by the application
        • Local options which affect individual PoolReference instances

  35. Buffer Sizes
      • FP_OPTION_BUFFERSIZE
        • Size of the CDF buffer in bytes.
        • Will swap out to disc if CDF is larger than the buffer
        • Default 16KB / Min 1KB / Max 10MB
        • Set this value to exceed the total CDF size (XML + Blob data) to avoid swapping to disc
          • e.g. (150 * 1024) for a C-Clip with a single embedded blob
      • FP_OPTION_PREFETCH_SIZE
        • Size of the temporary buffer used by the SDK to assist with content length based storage decisions
          • Storage Strategy Performance or Capacity
          • Parity or Mirroring
        • Default 32KB / Maximum 1MB.
        • If using Generic Streams of unknown length, set this value to 1MB
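
The 150 * 1024 example can be sanity-checked by sizing the buffer from the embedded data. The sketch below counts each embedded blob at its Base64 size and adds an allowance for the XML itself; the 4KB allowance is an assumption of this sketch, not an SDK figure.

```python
KB = 1024
MIN_BUFFER, MAX_BUFFER = 1 * KB, 10 * KB * KB  # slide: min 1KB, max 10MB

def cdf_buffer_size(blob_sizes, xml_overhead=4 * KB):
    """Suggest a CDF buffer size (bytes) that holds the XML plus embedded blobs.

    Each blob is counted at its Base64 size. xml_overhead is an assumed
    allowance for tags and metadata, not a figure from the SDK.
    """
    encoded = sum(4 * ((size + 2) // 3) for size in blob_sizes)
    return max(MIN_BUFFER, min(MAX_BUFFER, encoded + xml_overhead))

single_max_blob = cdf_buffer_size([100 * KB])  # one blob at the 100KB embed limit
```

Under these assumptions a single blob at the 100KB embedding limit needs a little under 140KB of buffer, which is consistent with the slide's suggestion of 150 * 1024.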

  36. Failover Strategies
      • Can be enabled or disabled for all operations
        • FP_OPTION_ENABLE_MULTICLUSTER_FAILOVER
        • Network failover occurs when connectivity to the Primary cluster is lost
        • Content failover occurs when an operation (read, write, delete, query or exists) relating to a particular C-Clip does not return success
      • Can be configured separately for each operation
        • FP_OPTION_MULTICLUSTER_XXXX_STRATEGY
        • Three possible strategies
          • FP_NO_STRATEGY / FP_FAILOVER_STRATEGY / FP_REPLICATION_STRATEGY
      • NB: Not all strategies are supported for all operations and default behavior differs.

  37. Storage Strategies
      • FP_OPTION_DEFAULT_COLLISION_AVOIDANCE
        • Enable (by supplying value FP_TRUE) if
          • Single Instance Storage is unlikely to be of value
          • Even a remote chance of a collision is unacceptable
        • Can also be used as an option on FPTag_BlobWrite to control behaviour at the Object (rather than Pool) level
      • FP_OPTION_EMBEDDED_DATA_THRESHOLD
        • Already covered in depth

  38. Setting Options as Environment Variables
      • Any pool option may be set as an environment variable
        • When set in this way, all options behave as if they are GlobalOptions
        • e.g. set FP_OPTION_OPENSTRATEGY="FP_LAZY_OPEN"
      • Options set within the application code take precedence over those set in the environment
      • Benefits:
        • Allows customers to adopt options that are not used by the developer/ISV
        • Increases options for troubleshooting applications in the field
        • Can sometimes enable immediate bug fixes in the field as the developer/ISV works on a patch release
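
The precedence rule (application code beats environment) can be expressed as a small lookup. This models the rule only; the function and the "SET_IN_CODE" value are hypothetical, and the SDK performs this resolution internally.

```python
import os

def resolve_option(name, code_value=None, default=None, environ=os.environ):
    """Pick an option value: application code first, then environment, then default."""
    if code_value is not None:
        return code_value
    return environ.get(name, default)

# A simulated environment keeps the example independent of the real one.
env = {"FP_OPTION_OPENSTRATEGY": "FP_LAZY_OPEN"}

from_env = resolve_option("FP_OPTION_OPENSTRATEGY", environ=env)
from_code = resolve_option("FP_OPTION_OPENSTRATEGY", code_value="SET_IN_CODE", environ=env)
unset = resolve_option("FP_OPTION_BUFFERSIZE", default="16384", environ=env)
```

One practical consequence is that an environment variable only helps in the field for options the application does not already set in code.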

  39. Authorized Capabilities
      • Identifies the Authorized operations that can be performed by an Application connecting to a Pool / Profile combination
        • Read, Write, Delete, Exists, Query, Privileged Delete, Monitor
        • Purge capability is deprecated and should not be used
      • Applications should proactively determine which capabilities are available (see FPPool_GetCapability() in API Guide)
        • Configure the application user interface / options accordingly

  40. Managing Retention
      • C-Clips can have an associated Retention Period or Retention Class
      • Mandating the setting of retention can be enforced at the pool level.
      • If the application does not require Retention, the Retention Period should be explicitly set to zero.
        • Do not rely on default Retention Period of the cluster
        • Default on CE+ is infinity!
      • Retention is applied to the whole CDF rather than individual Tags.
        • C-Clip should contain objects with the same retention requirements
      • Use Retention Classes for well defined “common” retention Periods (typically set by Regulatory Body)
        • Allows “painless” update should the period change
      • CentraStar 3.1 introduced Advanced Retention features
        • Separate license
        • Event Based Retention / Litigation Hold / Retention Governors

  41. Managing Retention
      • Retention is calculated from the creation.date timestamp
        • set when FPClip_Create is called
      • Retention can be changed only by creating a new clip
        • existing C-Clip is opened
        • updates are made e.g. retention period / class changed
        • FPClip_Write is called and a new Content Address is returned
        • old Content Address is saved in prev.clipid
        • new creation timestamp is set in modification.date
      • Important: a new clip is created, so you must either
        • Delete the old one immediately using DELETE_PRIVILEGED
          • Not available on CE+ editions
        • Or “Walk” the clip chain and delete when *all* the retentions have expired
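
The "walk the chain" option can be sketched as a loop over previous-clip links that only reports success once every clip's retention has lapsed. The `Clip` class and its field names are illustrative, not SDK names, and each clip is given its own creation time as a simplification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Clip:
    """Minimal model of a C-Clip; field names are illustrative, not SDK names."""
    clip_id: str
    created: datetime
    retention: timedelta
    prev_clip: Optional[str] = None

def chain_deletable(clips, newest_id, now):
    """Walk back from newest_id and report whether every retention has expired."""
    current = clips.get(newest_id)
    while current is not None:
        if current.created + current.retention > now:
            return False
        current = clips.get(current.prev_clip)
    return True

start = datetime(2007, 1, 1)
clips = {
    "A": Clip("A", start, timedelta(days=365)),                # original clip
    "B": Clip("B", start, timedelta(days=30), prev_clip="A"),  # retention shortened
}
after_two_months = chain_deletable(clips, "B", start + timedelta(days=60))
after_400_days = chain_deletable(clips, "B", start + timedelta(days=400))
```

In this example shortening the retention on the new clip does not free the content any sooner, because the original clip still holds its one-year retention.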

  42. Application Registration
      • Applications should register via FPPool_RegisterApplication() prior to calling FPPool_Open()
      • For each pool connection established, the Centera records:
        • Application name / version (as specified in the API call)
        • Hostname
        • Hardware platform
        • Operating system
        • Profile used to connect
        • SDK version
      • The Application Name supplied to the API call should be constant
        • do not use argv[0]!
      • Information is retained for all application instances which have authenticated for a period equal to the ‘audit log retention period’.

  43. Questions?
      • Please visit the SDK Forums on the Centera Developer’s Portal
        • http://lighthouse.emc.com
      • Corporate Systems Engineering team provides:
        • Info, Training, Whitepapers
        • Design Advice / Design Reviews
        • Debugging Support
        • Code Reviews
      • Email Support available from:
        • CenteraCSE@emc.com
