
Streams Service Review




  1. Streams Service Review
     Distributed Database Operations Workshop
     Eva Dafonte Pérez

  2. Outline
     • Tier0 responsibilities
     • Tier1 responsibilities
     • What do I have to do?
     • Recent problems
     • Bugs related to Streams
     • Recommended patches
     • Pending requests
     • New 11g features
     • Summary

  3. Overview ATLAS

  4. Overview LHCB CMS

  5. Tier0 responsibilities
     • Initial Streams setup
     • Add new schemas to the Streams environment
     • Split & Merge
     • Streams re-synchronization
     • Analyze and test new features and optimizations
     • Validate upgrades and patches
     • Monitoring

  6. Tier1 responsibilities
     • Announce interventions
       • schedule new interventions using the 3D wiki
       • submit EGEE broadcasts
       • register outages in the CIC portal
       • long interventions: contact Tier0 to analyze whether it is necessary to split the Streams setup
     • Unplanned downtime: update Tier0
       • problem description, progress and expected duration
     • Report regularly
     • Read-only replica: ensure only the reader account is open

  7. Tier1 responsibilities
     • Keep the 3D OEM operational
       • check agents status
       • configure targets
     • After an intervention: check and re-enable Streams processes
       • re-start apply process @destination
       • re-enable propagation job @downstream box
       When SPLIT:
       • re-start capture process @downstream box

  8. Tier1 responsibilities

     # connect as streams administrator @destination database
     strmadmin@db> select apply_name, status from dba_apply;

     Apply Process Name            Status
     ----------------------------  -----------
     STRMADMIN_APPLY_STREVA        DISABLED

     strmadmin@db> exec dbms_apply_adm.start_apply('STRMADMIN_APPLY_STREVA');

     PL/SQL procedure successfully completed.

  9. Tier1 responsibilities
     An account with privileges to re-start the Streams components @downstream database is provided, one per Tier1 site.

     # connect as strmprop user @downstream database
     strmprop_cern@db> select propagation_name, status from dba_propagation;

     Propagation Name                Status
     ------------------------------  -----------
     STREAMS_PROP_STREVA_DWSDB       DISABLED
     STREAMS_PROP_STREVA_STRMTEST    ENABLED

     strmprop_cern@db> exec dbms_propagation_adm.start_propagation('STREAMS_PROP_STREVA_DWSDB');

     PL/SQL procedure successfully completed.

  10. Tier1 responsibilities
      • ensure you can connect using your strmprop account
        • (password, connection string)
      • check that you are using the correct process name

      # connect as strmprop user @downstream database
      strmprop_cern@db> select propagation_name, status from dba_propagation;
      strmprop_cern@db> select capture_name, status from dba_capture;

      Capture Name                  Status
      ----------------------------  -----------
      STRMADMIN_CAPTURE_STREVA      ENABLED
      STRMADMIN_CAP_TEMP            DISABLED

      strmprop_cern@db> exec dbms_propagation_adm.start_propagation('STREAMS_PROP_TEMP');
      strmprop_cern@db> exec dbms_capture_adm.start_capture('STRMADMIN_CAP_TEMP');

      PL/SQL procedure successfully completed.
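
The post-intervention check on slides 8 to 10 follows one pattern: list the Streams components, find the ones that are DISABLED, and issue the matching start call. The sketch below only builds the `exec` lines as text; it does not connect to any database. The helper name `restart_statements` and the tuple layout are assumptions made for illustration, while the procedure and process names are the ones shown on the slides. ABORTED processes are deliberately left out, since the later slides ask for the error to be investigated first.

```python
# Hypothetical helper: turns status rows into the restart calls shown on the slides.
# It only produces text for review; nothing is executed against a database.
RESTART_CALLS = {
    "apply": "dbms_apply_adm.start_apply",
    "propagation": "dbms_propagation_adm.start_propagation",
    "capture": "dbms_capture_adm.start_capture",
}


def restart_statements(components):
    """Return one 'exec ...' line per component whose status is DISABLED."""
    statements = []
    for kind, name, status in components:
        # ABORTED is skipped on purpose: the cause has to be understood first.
        if status.upper() == "DISABLED":
            statements.append(f"exec {RESTART_CALLS[kind]}('{name}');")
    return statements


# Rows as they appear in the dba_apply / dba_propagation / dba_capture output above.
components = [
    ("apply", "STRMADMIN_APPLY_STREVA", "DISABLED"),
    ("propagation", "STREAMS_PROP_STREVA_DWSDB", "DISABLED"),
    ("propagation", "STREAMS_PROP_STREVA_STRMTEST", "ENABLED"),
    ("capture", "STRMADMIN_CAP_TEMP", "DISABLED"),
]

for line in restart_statements(components):
    print(line)
```

Generating the statements rather than running them keeps a person in the loop, which matches the slide's advice to check that the correct process name is being used.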

  11. What do I have to do?

      Hi all,
      it looks like it is our turn now…
      So, what do I have to do?
      Cheers, Olli

      Streams Monitor wrote:
      > Streams Monitor Error Report
      > Report date: 2008-09-25 14:43:09
      > Affected Site: NDGF-T1
      > Affected Database: ATLAS.DB1TIER1.NDGF.ORG
      > Process Name: STRMADMIN_APPLY_ATLN
      > Error Time: 25-09-2008 15:42:51
      > Error Message: ORA-26714: User error encountered while applying
      > Current process status: ABORTED

      1. Check Streams monitoring
      2. Check Streams Service Manual for Tier1s
      3. Ask for help
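
A report like the one quoted above is a list of "Key: value" lines, so its fields are easy to pull out before deciding what to do. The following is a minimal sketch under that assumption; `parse_monitor_report` is a hypothetical name, and the only format knowledge it uses is what is visible in the quoted email.

```python
def parse_monitor_report(text):
    """Parse the 'Key: value' lines of a Streams Monitor error report into a dict."""
    fields = {}
    for raw in text.splitlines():
        # Drop the mail quoting prefix ("> ") and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        if ":" not in line:
            continue  # e.g. the "Streams Monitor Error Report" title line
        # Split on the first colon only, so timestamps and ORA messages stay intact.
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


sample = """\
> Streams Monitor Error Report
> Report date: 2008-09-25 14:43:09
> Affected Site: NDGF-T1
> Affected Database: ATLAS.DB1TIER1.NDGF.ORG
> Process Name: STRMADMIN_APPLY_ATLN
> Error Time: 25-09-2008 15:42:51
> Error Message: ORA-26714: User error encountered while applying
> Current process status: ABORTED
"""

report = parse_monitor_report(sample)
print(report["Process Name"], report["Current process status"])
```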

  12. What do I have to do?
      • Apply process status: ABORTED
        • “user error encountered while applying”
        • get more details: run print_errors.sql as streams administrator @destination
      • human errors
        • ex: modifications to system-generated names, updates by users which don’t exist at destination, …
      • destination schema is overwritten
        • ex: statement is executed first at the destination, then at the source (online – offline), Tier1 database is not read-only, …

      ORA-01403: no data found
      ORA-00955: name is already used by an existing object

      Check with Tier0 which actions are needed.

  13. What do I have to do?
      • Apply process status: ABORTED
        • “user error encountered while applying”
      • database administration related
        • ex: unable to extend tablespace, deadlock waiting for resource, …

      ORA-01652: unable to extend tablespace
      ORA-00060: deadlock detected while waiting for resource

      Fix the problem, then re-execute the errors and re-start the apply process:

      exec dbms_apply_adm.execute_all_errors('STRMADMIN_APPLY');
      exec dbms_apply_adm.start_apply('STRMADMIN_APPLY');

  14. What do I have to do?
      • Propagation is DISABLED after 16 attempts:

      ORA-00257: archiver error. Connect internal only, until freed
      ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
      ORA-12545: Connect failed because target host or object does not exist
      ORA-12170: TNS:Connect timeout occurred
      ORA-12560: TNS:protocol adapter error

      Fix the problem, then re-enable propagation.
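
Slides 12 to 14 amount to a lookup from error code to next step. The sketch below encodes exactly the codes and actions listed on those slides and falls back to the generic advice from slide 11 for anything else. The function name `suggested_action` and the dictionary layout are assumptions for illustration; this is not part of the Streams monitoring tools.

```python
import re

# Codes and actions taken from slides 12 to 14; nothing else is assumed to be known.
ACTIONS = {
    "check with Tier0 which actions are needed": {
        "ORA-01403", "ORA-00955",
    },
    "fix the problem, re-execute the errors, re-start apply": {
        "ORA-01652", "ORA-00060",
    },
    "fix the problem, re-enable propagation": {
        "ORA-00257", "ORA-12514", "ORA-12545", "ORA-12170", "ORA-12560",
    },
}

# Slide 11 advice, used for any code the slides do not cover.
FALLBACK = "check the monitoring and the Streams Service Manual, then ask for help"


def suggested_action(message):
    """Map the first ORA code found in an error message to the slide's advice."""
    match = re.search(r"ORA-\d{5}", message)
    if match is None:
        return FALLBACK
    code = match.group(0)
    for action, codes in ACTIONS.items():
        if code in codes:
            return action
    return FALLBACK


print(suggested_action("ORA-01652: unable to extend tablespace"))
print(suggested_action("ORA-12170: TNS:Connect timeout occurred"))
```

Note that ORA-26714 itself is only the wrapper ("user error encountered while applying"); the code that matters for this lookup is the underlying one reported by print_errors.sql.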

  15. What do I have to do?
      • Check our wiki: https://twiki.cern.ch/twiki/bin/view/PSSGroup/StreamsServiceReview
      • Oracle Streams Documentation
        • Oracle Streams Concepts and Administration 10g Release 2
        • Oracle Streams Replication Administrator's Guide 10g Release 2
      • Send us an email with your questions
      • Help us keep the wiki up to date
        • you can also update it !!!

  16. Recent problems
      • Missing primary keys / indexes
        • Apply is aborted because of duplicated rows
          • cannot identify a unique row to apply the change
        • Apply performance seriously impacted
          • apply server performs full table scans
      → Delay on the whole replication system
        • dependent transactions have to wait
      • ATLAS has already implemented an automatic job to detect tables without a primary key
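
The detection job mentioned on this slide is not shown, so the following is only a sketch of the idea: given the replicated tables and their constraint rows, report the tables that have no primary key. It assumes the constraint rows use the `DBA_CONSTRAINTS` convention where constraint type `P` marks a primary key; the table names and the function name are hypothetical.

```python
def tables_without_primary_key(tables, constraints):
    """Return the tables that have no constraint of type 'P' (primary key)."""
    keyed = {table for table, constraint_type in constraints if constraint_type == "P"}
    return sorted(set(tables) - keyed)


# Hypothetical schema content: (table_name, constraint_type) pairs.
tables = ["RUNS", "EVENTS", "TAGS"]
constraints = [
    ("RUNS", "P"),    # primary key
    ("EVENTS", "R"),  # foreign key only
    ("EVENTS", "C"),  # check constraint only
]

print(tables_without_primary_key(tables, constraints))
```

Any table this reports is a candidate for the two symptoms listed above: ambiguous row matching and full table scans on the apply side.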

  17. Recent problems
      • Apply gets stuck in “applying” status
        • Reader and coordinator are IDLE
        • Server shows APPLYING
        • LCRs spilled over to disk
        • Under investigation by Oracle
      • Connection to Gridka lost
        • Only LFC replication to Gridka affected
        • Under investigation by Oracle
        • Diagnostic patch installed

  18. Recent problems
      • Unresponsive NDGF
        • propagation could not send LCRs to destination
        • processes were healthy – no errors reported
        • large number of spilled LCRs triggered flow control (≈ 6.000.000 LCRs)
        • capture process « temporarily » paused
      • Additional capture latency monitoring
        • alert sent when the 90-minute threshold is exceeded
      • Tests on the streams pool memory usage
        • new node allocated for the downstream cluster
      [Figure: Streams Pool Size (MB); annotated value 2.6 GB]
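
The latency alert can be pictured as a simple threshold check. The sketch below assumes capture latency means the gap between the time a change was created at the source and the time the capture process picked it up; that definition, the function names and the timestamps are assumptions, while the 90-minute threshold comes from the slide.

```python
from datetime import datetime, timedelta

# Threshold stated on the slide.
LATENCY_THRESHOLD = timedelta(minutes=90)


def capture_latency(message_created, captured_at):
    """Latency between the creation of a change and its capture (assumed definition)."""
    return captured_at - message_created


def needs_alert(latency, threshold=LATENCY_THRESHOLD):
    """True when the latency is strictly above the threshold."""
    return latency > threshold


# Hypothetical timestamps for illustration only.
created = datetime(2008, 9, 25, 12, 0, 0)
captured = datetime(2008, 9, 25, 13, 45, 0)

latency = capture_latency(created, captured)
print(latency, needs_alert(latency))
```

A check like this catches the NDGF case described above, where every process reported a healthy status while the backlog kept growing.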

  19. Interventions
      • LFC migration out of SRM v1 endpoint
        • Streams replication stopped
        • Data updated at source and all destinations
        • problems with RAL, where data was finally imported from CERN
      • CNAF, PIC and IN2P3 hardware migration
        • re-synchronization using transportable tablespaces
        • Tier1 sites should consider the use of Data Guard in order to minimize the impact

  20. Bugs related to Streams
      • Fixed:
        • ORA-600 when dropping propagation
        • ORA-26687 no instantiation SCN provided when drop table (2 streams setup between same source and destination databases)
      • To be fixed:
        • <BUG:6402302> create view on schema not in streams is replicated
          • drop view is not replicated!

  21. Recommended patches
      Metalink note 437838.1
      • 7363767 addresses performance improvement for capture process and logminer: merge label request on top of 10.2.0.4 for bugs:
        • Bug 7345904 Streams capture slow processing direct path insert, high cpu for logmnr builder
        • Bug 6683178 High latencies in Streams capture, while capturing primary workload with a lot of DDL activities such as truncates of empty tables
        • Bug 6994160 Capture reader process constantly writing messages to trace file
        • Bug 6413089 Restarting a logminer session can be slow if the session has fallen behind
        • Bug 6650256 Parallel DDL (PDDL) transactions can cause logminer memory spill for Streams, or run slowly during adhoc log mining
      • 7263055 + 7480651 in order to fix ORA-600 [KWQBMCRCPTS101] when dropping propagation
      • 5933656 Propagation ORA-600 [KWQPCBK179], [1], [1369]
      • 6827260 Excessive memory usage for LCR cache due to large freelists
      • 7219752 ORA-26773 Malformed redo on capture of long
      • 6452375 ORA-26687 No instantiation SCN provided when drop child table
      • 7033630 Apply aborting with ORA-600 [KNLQDQM2USR:4] after installing 10.2.0.4 patchset

  22. Pending requests
      • MUON sites replication to CERN
        • master: 3 Tier2 sites (Rome, Munich, Michigan)
        • target: ATLAS offline
      • AMI replication to CERN
        • master: Tier1 Lyon
        • target: ATLAS offline
      • Resources:
        • currently 2 apply processes @ATLAS offline
        • 4 more to be added!!
      • Service level:
        • problems must be addressed to the master side

  23. New 11g features
      • Combined Capture and Apply
        • capture sends LCRs directly to apply
        • only 1 target, detected automatically
        • big performance improvement
        • rate: 14.000 LCRs/sec (before 5.000 LCRs/sec)
      • Split/Merge of Streams
      • Cross-database LCR tracking
      • Source and Target data compare & converge
        • compare rows in an object at 2 databases
        • converge objects in case of differences
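
The compare and converge idea in the last bullet can be illustrated without a database. The sketch below is not the Oracle 11g feature; it is a pure-Python analogy that models each copy of a table as a dictionary keyed by primary key, finds the keys whose rows differ, and then makes the target match the source. Letting the source win is an assumption made here for simplicity, and all names and rows are hypothetical.

```python
def compare_rows(source, target):
    """Return the keys whose rows differ or exist on only one side."""
    all_keys = source.keys() | target.keys()
    return sorted(key for key in all_keys if source.get(key) != target.get(key))


def converge(source, target):
    """Make target match source for every differing key (source wins)."""
    for key in compare_rows(source, target):
        if key in source:
            target[key] = source[key]  # missing or different row: copy from source
        else:
            del target[key]            # row exists only at the target: remove it
    return target


# Hypothetical table contents: primary key -> row tuple.
source = {1: ("a",), 2: ("b",), 3: ("c",)}
target = {1: ("a",), 2: ("x",), 4: ("d",)}

differences = compare_rows(source, target)
print(differences)
converge(source, target)
print(target == source)
```

The same two-step shape, first list the differences and then decide which side wins, is what makes a re-synchronization after an intervention reviewable before any data is changed.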

  24. Summary
      • Keep the monitoring operational
        • spot problems quickly, understand bottlenecks, ...
      • Coordination with Tier0
        • complex streams environments where the activity at one point might impact the whole system
      • Feedback!!!
        • and collaboration to improve the documentation and the service
      [Figure: Interventions during the last 3 months]
