Changes to Sizing Spread Sheet for Documentum 5.3

Changes to Sizing Spread Sheet for Documentum 5.3 Documentum Performance Group

Agenda • Changes to the Customer Input Page • Changes to the Output Page • Some Sizing Examples

Changes to the Customer Input Page • App server cluster support in WDK/Webtop • Fulltext query rate • Fulltext space

App Server Cluster support overhead Will Factor in CPU cost associated with Session Serialization in Clustered HA environment

5.3 Sizing changes for WDK • 5.3 Webtop consumes 40% more CPU than 5.2.5 • Due partly to inclusion of new features (drag & drop) and infrastructure changes • This overhead is being reduced for SP1; the sizing spreadsheet for SP1 will reflect this • 5.3 App Server cluster support has an additional 50% overhead • This is due to the cost of replicating state • This is the worst case: memory-based replication (between two App servers) • To be reduced in 5.3 SP1; will be reflected in the SP1 spreadsheet
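The two overhead figures above can be turned into a quick back-of-the-envelope calculation. The sketch below is only an illustration, not the spreadsheet's actual formula: the function name `webtop_53_cpu` is hypothetical, and it assumes the 50% cluster overhead applies on top of the 40% Webtop increase (that is, the two factors multiply).

```python
# Factors taken from the slide; how they combine is an assumption.
WDK_53_CPU_FACTOR = 1.40   # 5.3 Webtop uses 40% more CPU than 5.2.5
CLUSTER_CPU_FACTOR = 1.50  # additional 50% for app server cluster support


def webtop_53_cpu(cpu_525: float, clustered: bool) -> float:
    """Estimate 5.3 Webtop CPU need from a known 5.2.5 CPU need.

    Assumes the cluster overhead multiplies the already-increased
    5.3 figure (hypothetical; the spreadsheet may combine them differently).
    """
    cpu = cpu_525 * WDK_53_CPU_FACTOR
    if clustered:
        cpu *= CLUSTER_CPU_FACTOR
    return cpu


if __name__ == "__main__":
    for clustered in (False, True):
        print(f"clustered={clustered}: {webtop_53_cpu(4.0, clustered):.1f} CPUs")
```

Under these assumptions a 4-CPU 5.2.5 Webtop tier grows to about 5.6 CPUs on 5.3, or about 8.4 CPUs with worst-case memory-based session replication.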

Fulltext query rate Will factor in CPU cost associated with large numbers of full text queries

Fulltext indexing Characteristics Most sizing requests specify docs/day, but normally that load is not for 365 days out of the year

Fulltext indexing Characteristics Will Factor in CPU cost and Disk I/O associated with the indexing portion of fulltext

Fulltext indexing Characteristics: Options • None = No full text indexing enabled • Immediate Indexing = Attempt to minimize index time from ‘save’ to ‘searchable’ • Default for 5.3 • Expensive in terms of disk space, CPU utilization, and I/O • Delayed Indexing = Attempt to reduce disk space, memory, or CPU utilization at the cost of ‘save to searchable’ latency • Initial Focus: Transient Disk Space tuning • Requires some detailed Index Server tuning

Transient Fulltext Index Space Tuning • Transient space needs for building one large partition with all documents • Transient space needs for building four small, equal-sized partitions within the index • More information on this tuning to be provided in an FAQ

Fulltext space consumption Will Factor in content information for fulltext disk space and CPU calculations

Will at times factor in known platform differences

Output Page changes Hardware resources needed for Index Agent and Index Server

Option #2 is changed to reflect likely “Content Server and Indexing Servers” on same host scenario

Example Option #2 • Content Server & Indexing software on same host • Pros: Easy to install and administer; grow capacity by adding more CPUs, disk, and memory • Cons: Resource contention risks; footprint of Indexing subsystem could exceed excess capacity of a pre-5.3 production system
[Diagram: Content Server, Content, Index Agent, Staging Area, Index Server (FAST), and Index; flows labeled Dftxml msg, Meta data & content, Query & results]

Option #3 is changed to add Index Agent/Index Server on separate host scenario Note: The initial release will not cover multi-node configurations of the Index Agent/Server

Additional Supported Scenarios for FCS • All Full Text Components on a separate host • Pros: Separates resource consumption of the “new” 5.3 full text from the rest of Content Server; likely to arise in upgrade scenarios from 5.2.X • Cons: Additional server required
[Diagram: Content Server, Content, Index Agent, Staging Area, Index Server (FAST), and Index; flows labeled Dftxml msg, Meta data & content, Query & results]

Sizing Exercises • Generic document repository (< 2 million docs) • Large system: 100,000 docs/day

Generic Document repository • Provided System characteristics: • Upgrade from 5.2.5 (repository already existing) • Total size of system < 1 million objects • Total content Size = 240 GB • Ingest: ½ GB/day • Approximately 1,000 objects/day • Average file size ½ MB • Less than 1000 users (20 active at any one time)

Questions: How much of the content might be fulltext indexable? • Check size and number of objects by format • Example: • 40 GB of the 240 GB of content is in a format that can be indexed • Fewer than 500,000 objects have content that can be indexed • About 360,000 objects have content that can’t be indexed • At least 102 separate formats! • However, Word and PDF dominate the content space that can be fulltext indexed (90%) • All objects have at least their meta-data indexed

Enter average size, number of docs, and whether content can be indexed for 4 rows below: • Word: 106K byte average, 160,000 docs, content indexed=Y • PDF: 352K bytes average, 56,000 docs, content indexed=Y • Other: 20K bytes average, 275,000 docs, content indexed=Y • Images: 550K bytes average, 360,000 docs, content indexed=N
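As a cross-check, the four rows can be totaled to confirm they agree with the figures quoted for this repository (roughly 40 GB indexable out of 240 GB, fewer than 500,000 indexable objects). This is a minimal sketch, assuming "K" means 1,000 bytes and GB means 10^9 bytes; the `FormatRow` type and `content_gb` helper are illustrative names, not part of the spreadsheet.

```python
from dataclasses import dataclass

GB = 10**9  # assumption: decimal gigabytes, and "K" read as 1,000 bytes


@dataclass(frozen=True)
class FormatRow:
    name: str
    avg_bytes: int         # average document size in bytes
    docs: int              # number of documents of this format
    content_indexed: bool  # True if content (not only meta-data) is indexed


# The four rows from the slide.
ROWS = [
    FormatRow("Word", 106_000, 160_000, True),
    FormatRow("PDF", 352_000, 56_000, True),
    FormatRow("Other", 20_000, 275_000, True),
    FormatRow("Images", 550_000, 360_000, False),
]


def content_gb(rows):
    """Total content size of the given rows, in GB."""
    return sum(r.avg_bytes * r.docs for r in rows) / GB


indexable = [r for r in ROWS if r.content_indexed]
indexable_docs = sum(r.docs for r in indexable)

if __name__ == "__main__":
    print(f"indexable docs:    {indexable_docs:,}")
    print(f"indexable content: {content_gb(indexable):.1f} GB")
    print(f"total content:     {content_gb(ROWS):.1f} GB")
```

With these inputs the indexable content comes to about 42 GB across 491,000 documents, and the total to about 240 GB, which is consistent with the repository characteristics given earlier.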

What would that imply for the hardware needed to do the upgrade? • So far we haven’t calculated growth • Estimate for space needed for fulltext: 19 GB

What about growth? • Assume 260 busy days in the year and 1,000 docs per busy day • Assume document proportions remain the same: • Word (19%) → 19% of 1000 = 190 → 190 x 260 = 49,400/yr • PDF (7%) → 7% of 1000 = 70 → 70 x 260 = 18,200/yr • Other (32%) → 32% of 1000 = 320 → 320 x 260 ≈ 83,000/yr • Image (42%) → 42% of 1000 = 420 → 420 x 260 ≈ 109,000/yr
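The growth arithmetic is the same for every format, so it can be written once. The sketch below uses the percentages and the 260-busy-day assumption from the slide; the function name `yearly_growth` is illustrative. Note that the exact products for Other and Image are 83,200 and 109,200, which the slide rounds to 83,000 and 109,000.

```python
DOCS_PER_BUSY_DAY = 1_000
BUSY_DAYS_PER_YEAR = 260

# Share of new documents per format, in whole percent (from the slide).
SHARE_PERCENT = {"Word": 19, "PDF": 7, "Other": 32, "Image": 42}


def yearly_growth(share_percent, docs_per_day, busy_days):
    """Return new documents per year for each format."""
    growth = {}
    for fmt, pct in share_percent.items():
        per_day = docs_per_day * pct // 100  # integer math avoids float noise
        growth[fmt] = per_day * busy_days
    return growth


if __name__ == "__main__":
    growth = yearly_growth(SHARE_PERCENT, DOCS_PER_BUSY_DAY, BUSY_DAYS_PER_YEAR)
    for fmt, docs in growth.items():
        print(f"{fmt:<6} {docs:>8,} docs/yr")
    print(f"{'Total':<6} {sum(growth.values()):>8,} docs/yr")
```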

Index Size after 3 years • Around 30 GB needed

Could I size fulltext as a simple 40% of total content size? • Old, tried and true(?) method • It can overestimate badly, especially if “non-indexable” content dominates! • In this example [system without growth] • 40% of 240 GB = 96 GB vs. 19 GB • Example with system including growth • 40% of 385 GB = 154 GB vs. 30 GB • For small systems, the cost of overestimating is small
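To make the size of the gap concrete, the two comparisons above can be expressed as ratios. This is only a restatement of the slide's numbers; the 19 GB and 30 GB figures are taken as given from the spreadsheet output rather than recomputed, and `rule_of_thumb_gb` is an illustrative name.

```python
RULE_OF_THUMB = 0.40  # "fulltext index is 40% of content size"


def rule_of_thumb_gb(total_content_gb: float) -> float:
    """Index size predicted by the simple 40% rule."""
    return total_content_gb * RULE_OF_THUMB


# (label, total content in GB, spreadsheet estimate in GB), from the slides.
CASES = [
    ("without growth", 240, 19),
    ("with growth", 385, 30),
]

if __name__ == "__main__":
    for label, content_gb, sheet_gb in CASES:
        rough = rule_of_thumb_gb(content_gb)
        print(f"{label}: rule of thumb {rough:.0f} GB, "
              f"spreadsheet {sheet_gb} GB, ratio {rough / sheet_gb:.1f}x")
```

In both cases the 40% rule lands at roughly five times the spreadsheet estimate, because most of this repository's bytes are images whose content is not indexed.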

Other notes • Index Subsystem can co-reside with Content Server • Existing system must have spare CPU capacity & memory capacity • New fulltext index should reside on high capacity disk array or SAN, not on NAS device or single disk • At 1+ million docs the indexing side could bottleneck on the disk • Spreadsheet shows minimal disk I/O requirements, but these are averages spread over 24 hour periods • actual ones will be higher during indexing process

Large system (100,000 docs per day) • Provided System characteristics: • Ingest: 110 GB/day • The data is primarily static once submitted • Approximately 100,000 objects/day • Average file size 1 MB • Average metadata size per file: 10 KB • Estimated total: 4 TB in 3 years on Tier 1, 120 TB on Tier 2, 1 TB database • Tier 1 Storage: Symmetrix for 30 days • Tier 2 Storage: Centera • Initial pilot: 50 users • 10% of objects/capacity applying text search

Initial observations on provided information • How many days in a year will see 100,000 docs/day? • Let’s assume 260 busy days a year • If the weekend load rate is significant, it should be factored into the average per day • 100,000 docs/day x 260 days/yr x 3 yrs = 78 million docs • This is more than can be handled by 5.3 FCS! • 5.3 SP1 features needed for large systems: • Ability for single repository to have multiple “collections” • Multi-node Index Server support
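The 78 million figure follows directly from the busy-day assumption. The sketch below reproduces it and, as a purely hypothetical extension, shows one way a non-zero weekend rate could be folded in; the `off_day_docs` parameter and the 365-day year are assumptions for illustration, not something the spreadsheet is known to expose.

```python
def projected_docs(docs_per_busy_day: int,
                   busy_days_per_year: int,
                   years: int,
                   off_day_docs: int = 0) -> int:
    """Total documents ingested over the planning horizon.

    off_day_docs is a hypothetical knob for weekend/holiday load;
    the remaining (365 - busy_days_per_year) days use that rate.
    """
    off_days = 365 - busy_days_per_year
    per_year = (docs_per_busy_day * busy_days_per_year
                + off_day_docs * off_days)
    return per_year * years


if __name__ == "__main__":
    print(f"{projected_docs(100_000, 260, 3):,}")          # busy days only
    print(f"{projected_docs(100_000, 260, 3, 10_000):,}")  # with light weekends
```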

5.3 Large Full Text support: FCS vs. SP1 • In 5.3 FCS each Content Server repository is mapped to a single Index Server “collection” • In 5.3 SP1: • Collections can be mapped to a single index search “column” • Content Server will be able to have multiple collections per repository • Index Agent to provide mapping of “a_storage_type” to Index Server collection • This can be used to “range partition” the fulltext data • Once a collection reaches a certain size ( < 10 million) data can be routed to • Older “static” data can be put in older collections • CPU burn no longer needed to rebuild older collections

In which area of the spreadsheet should I enter the document profile?

Normally, this input area could be used exclusively • This assumes about 40% of the original content size is for fulltext • Probably not a big deal for small repositories, but could potentially lead to large overestimate for ones like this (with 78 million docs)
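A rough calculation shows why the 40% assumption matters at this scale. The sketch assumes the 1 MB average file size and 78 million documents given for this system, and decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); `forty_percent_index_tb` is an illustrative name.

```python
MB = 10**6   # assumption: decimal megabytes
TB = 10**12  # assumption: decimal terabytes


def forty_percent_index_tb(docs: int, avg_bytes: int, ratio: float = 0.40) -> float:
    """Fulltext index size implied by sizing at a fixed ratio of content size."""
    return docs * avg_bytes * ratio / TB


if __name__ == "__main__":
    docs = 78_000_000  # 100,000 docs/day x 260 days/yr x 3 yrs
    print(f"content: {docs * MB / TB:.0f} TB")
    print(f"index at 40%: {forty_percent_index_tb(docs, MB):.1f} TB")
```

Under these assumptions 78 TB of content implies an index of about 31 TB, which is in line with the roughly 30 TB this model reports.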

What does “10% of objects/capacity applying text search” mean? • Does it mean: “10% of objects will have content to fulltext index”? • Does it mean: “10% of the objects will be fulltext indexed”? • Does it mean: “10% of the searches will be fulltext (as opposed to just the attributes)”? • Assume the first one. • Note that the “Content Loading” area does not allow you to model this!

Alternate model • 10% Word docs to have content fulltext indexed: space consumption based on meta-data + content • 90% images to have only meta-data fulltext indexed: space consumption based only on the attributes fulltext indexed

Other model (cont’d) • Uses an alternate space calculation • Reflects that most documents will just have a small amount of meta-data to fulltext index • Total fulltext index size is now 4 TB vs. 30 TB in the previous model

Other model (cont’d): note CPU • Note that the CPUs have not changed between models • This is incorrect (the initial model should have at least twice as many CPUs as stated) • To be fixed in an upcoming version of the spreadsheet

Other items to worry about • Disk I/O needs (I/Os per second) reported in the spreadsheet reflect the average over the loading period, not the peak values needed • To reach high fulltext throughput, the disk I/O subsystem always needs to be able to achieve several hundred I/Os per second • Do not put the fulltext index on a single drive (except in the case of a tiny repository)
