
Designing High-Performance Distributed Systems for Event Processing


Presentation Transcript


  1. Designing High-Performance Distributed Systems for Event Processing Roman Elizarov, 2005-07 Devexperts [JUG 2007 version]

  2. Event processing? • Common situation: • You are designing a system that processes events (stock quotes, sports bets, telemetry from factory/network hardware, etc) • This system is usually distributed, because events come from different places. • What design choices are you going to consider?

  3. The usual design dichotomy • Remote Procedure Calls (RPC) for synchronous event processing VS Message Oriented Middleware (MOM) for asynchronous event processing • I will show that this question, as it is usually stated, is misleading. You should not ask this question at all.

  4. Message Oriented Middleware? • Popular approach to message passing • Lots of books and publications • Wikipedia definition: • Message Oriented Middleware is a category of inter-application communication software that relies on asynchronous message passing as opposed to a request/response metaphor.

  5. MOM Advantages • Asynchronous message transfer (sender is decoupled from receiver) • Message persistence • Transactional support • Interoperability (cross-platform) • Standards-based APIs (for example, JMS)

  6. MOM Disadvantages • It adds an additional (and usually 3rd-party) component to the system architecture (Message Transfer Agent) • Harder to maintain • Reduces reliability • Reduces performance • It requires learning a large 3rd-party or standard API.

  7. More on performance • Events that you are planning to process may happen 10K times per second or more often. • This rate is above the peak performance of most MOM systems. Rates of 100K+ events per second cannot be reached by any modern MOM on decent hardware.

  8. Common misconception #1 • Myth: Asynchronous event processing implies MOM • Truth: You can design a high-performance system for [asynchronous] event processing yourself without MOM.

  9. Common misconception #2 • Myth: MOM vendors are spending millions of $$$ on their software. How could we possibly design something with a higher performance in our project within a much smaller budget? • Truth: Design is not about RPC vs MOM. Design is about data structures, algorithms, patterns, and constraints [of your particular project].

  10. Data structures &amp; algorithms? • … Hold on: • You said “data structures and algorithms”. What does this have to do with design anyway? We always thought that those were the coder’s issues, not the designer’s ... A designer draws diagrams and determines system components and interfaces. He does not implement any data structures and algorithms, does he?

  11. Software Designer? • The designer does not code data structures and algorithms. • The designer defines interfaces that place constraints on the implementation (data structures and algorithms). • Which leads to constraints on maximum system performance.

  12. Design example • Compare two designs: interface Foo1 { void bar(Set s); } interface Foo2 { void bar(SortedSet s); }

  13. Design example cont’d • SortedSet operations can be implemented efficiently in O(log n) time using different kinds of trees (B-Trees, Red-Black trees, Splay trees, etc). • Set operations can be implemented efficiently in O(1) time using hashing.
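The constraint difference between the two interfaces can be sketched in Java. This is a minimal illustration (class and method names are hypothetical, not from the slides): a plain Set lets the caller pass a hash-based implementation with O(1) expected lookups, while a SortedSet forces implementations to maintain ordering, which costs O(log n) per operation in a TreeSet.

```java
import java.util.*;

public class SetConstraintDemo {
    // Accepts any Set: a HashSet gives O(1) expected contains().
    static boolean bar1(Set<String> s, String key) {
        return s.contains(key);
    }

    // Demands a SortedSet: implementations must keep elements ordered,
    // so a TreeSet pays O(log n) per operation, but first() is meaningful.
    static String bar2(SortedSet<String> s) {
        return s.first();
    }

    public static void main(String[] args) {
        Set<String> fast = new HashSet<>(Arrays.asList("IBM", "MSFT"));
        SortedSet<String> ordered = new TreeSet<>(Arrays.asList("MSFT", "IBM"));
        System.out.println(bar1(fast, "IBM"));  // true
        System.out.println(bar2(ordered));      // IBM (smallest element)
    }
}
```

The point is that the interface signature alone, before any code is written, already decides which asymptotic costs an implementation can achieve.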

  14. Designing with algorithms in mind • Design must enforce only constraints that are absolutely necessary. • You should follow the rule of the least possible constraint. (we’ll get back to it later)

  15. MOM Performance demystified • MOM systems are usually slow by design. It is not because of some negligence or oversight of coders who implement them. • This is especially true for the standards-conforming MOM systems. • No JMS implementation can be really fast because of the complexity of the JMS specification and its requirements.

  16. Questions to ask before using MOM • Does my system absolutely need message persistence, transactional processing, and/or interoperability (cross-platform)? For many systems the answer is NO. • Does my system absolutely need maintainability, reliability, and high-performance? For many systems the answer is YES.

  17. The Sample Design Problem • Financial application with quote table (ticker). • Keeps track of last known data and shows it to the client. • Updates data as it changes. • 100K+ (target=1M) quotes per second on 300K distinct symbols enter the system.

  18. Deployment scenario (Cluster) • Data events are distributed via a tree of multiplexing nodes (multiplexors) • Data goes downwards, subscription goes upwards • Each multiplexor, data source, and client uses “Ticker Core”

  19. Overall Architecture • Make two layers – separate data structure layer (Ticker Core) from data transport layer (Socket Connector) • Ticker core can be used with different transport layers • There is no need for a dedicated MTA process. You can multiplex data inside any process on your deployment diagram using Ticker Core component.

  20. Optimization rule • Do not optimize your code until you can prove by profiling that it needs optimization. • But this does not apply to Design! • If your design is broken, it might be impossible to optimize your implementation without actually redesigning and rewriting everything from scratch. • You have to have at least one efficient implementation in mind when you are doing design.

  21. Ways to spoil high-performance • Locking (synchronization) • Locking/unlocking too much • Locking for too long • Memory operations • Allocating too much memory • Locality of data access • Ineffective algorithms & data structures • Using inappropriate data structures

  22. First design attempt • Let us try to design a method that feeds data into Ticker Core. Why is it wrong? Think about possible implementations of the processEvent method… Hint: our data structure must be MT-Safe. interface Distributor { void processEvent(Event e); } Not good

  23. Locking measurement • You cannot process 1M events per second with MT-Safe data structure (even on a single-CPU machine). * Performance is measured on Pentium 4M 1.7GHz laptop with Java 1.5.0_03 (server JVM) under Windows XP SP2.

  24. Locking measurement cont’d • On an SMP system it becomes even worse. • And we’re not even doing anything inside the synchronized section of code (just “k++”). * Performance is measured on 2-way SPARC Sun Fire V240 with Java 1.5.0_01 (server JVM) under Solaris 5.9
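The measurement on these two slides can be reproduced with a rough microbenchmark along the following lines (a sketch, not the deck's actual harness): time N increments of a counter with and without entering a synchronized block on each increment. Absolute numbers will differ wildly by JVM and hardware, but the per-operation lock overhead shows up clearly.

```java
public class LockCostDemo {
    static int k;
    static final Object lock = new Object();

    // Returns average nanoseconds per increment, optionally taking the
    // lock on every single operation (the "per event" cost from the slides).
    static long timePerOp(boolean locked, int n) {
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            if (locked) {
                synchronized (lock) { k++; }
            } else {
                k++;
            }
        }
        return (System.nanoTime() - t0) / n;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        System.out.println("unlocked ns/op: " + timePerOp(false, n));
        System.out.println("locked   ns/op: " + timePerOp(true, n));
    }
}
```

On a multi-CPU machine, adding a second thread contending on the same lock makes the gap far larger still, which is the effect the SMP measurement shows.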

  25. Lock contention & context switches

  26. Locking solution • Events must be processed in blocks, thus paying the locking cost “per block” instead of “per event”. interface Distributor { void processEvents(Event[] e); } Better!
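A minimal sketch of this per-block locking (the internal buffer and its element type are illustrative, not from the slides): one synchronized entry covers the whole block, so feeding 1,000 events in blocks of 100 costs 10 lock acquisitions instead of 1,000.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BatchingDistributor {
    private final List<String> buffer = new ArrayList<>();

    // One lock acquisition amortized over the entire block of events.
    public synchronized void processEvents(String[] events) {
        buffer.addAll(Arrays.asList(events));
    }

    public synchronized int size() {
        return buffer.size();
    }
}
```

The bigger the blocks, the smaller the per-event share of the lock cost, which is why the deck later recommends growing block size under load.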

  27. Locking conclusion • Keep in mind the cost of locking when you are designing MT-Safe data structures. • Some data structures can be implemented without locking, but not all of them, so do not count on a work-around unless you know it. • Contention becomes worse if you hold locks during time-consuming operations.

  28. Object allocation inquiry • Let us take a close look at “Distributor” • You have to allocate an “Event” object for every incoming event. Is it good or bad? interface Distributor { void processEvents(Event[] e); }

  29. Object allocation measurement • It is tough to allocate 1M objects per second if you plan to keep 1M+ of them in memory. * Performance is measured on Pentium 4M 1.7GHz laptop with Java 1.5.0_03 (server JVM) under Windows XP SP2. Test is allocating 32 byte objects keeping 1M objects in memory array.

  30. Garbage collection • The more objects you keep in memory, the slower garbage collection becomes. • The more objects you allocate, the more often garbage collection occurs. • It all becomes relevant for high-performance event-processing systems. • However, object allocation/GC performance can be improved (multiple times) by fine-tuning GC and memory settings (up to 6.2M objects per second on previous test).

  31. Design patterns • If you use arrays (like Event[]) in your design, then you inherently limit performance and constrain implementation: • What if the data source keeps events in a hash? • What if the data source constructs events on-the-fly (deserializes them from an external source)? • There are design patterns to help you: • Iterator pattern • Visitor pattern

  32. Iterator pattern • Keeps flow control on the data receiver side. interface Distributor { void processData(DataIterator it); } interface DataIterator { Event nextEvent(); }

  33. Iterator pattern cont’d • Lock is held inside Ticker for the duration of processData. • DataSource can get data from anywhere (from array, hash, deserialize, etc).
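A compilable sketch of the iterator-style feed from these slides, with the event type simplified to String and the implementing class names hypothetical. The receiver pulls events inside one locked drain, while the source is free to back them with an array, a hash, or on-the-fly deserialization.

```java
// Simplified from the slides: nextEvent() returns null when exhausted.
interface DataIterator {
    String nextEvent();
}

// One possible source: events backed by a plain array.
class ArrayDataSource implements DataIterator {
    private final String[] events;
    private int i;

    ArrayDataSource(String[] events) { this.events = events; }

    public String nextEvent() {
        return i < events.length ? events[i++] : null;
    }
}

class CountingTicker {
    int count;

    // The lock is held for the whole drain: locking cost is paid per block,
    // no matter how many events the iterator yields.
    synchronized void processData(DataIterator it) {
        for (String e = it.nextEvent(); e != null; e = it.nextEvent()) {
            count++;
        }
    }
}
```

Swapping ArrayDataSource for a deserializing source requires no change to CountingTicker, which is exactly the constraint-loosening the pattern buys.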

  34. Visitor pattern • Keeps flow control on the data provider side. interface Agent { void retrieveData(DataVisitor vis); } interface DataVisitor { void visitEvent(Event e); }

  35. Visitor pattern cont’d • Lock is held inside Ticker for the duration of retrieveData. • DataConsumer can do anything with data (store to array, hash, serialize, etc)

  36. Patterns contemplation • So, what design is better from “the least possible constraint” point of view? interface Distributor1 { void processEvents(Event[] e); } ------------------------------------- OR ------------------------------------- interface Distributor2 { void processData(DataIterator it); } interface DataIterator { Event nextEvent(); }

  37. Patterns bonuses • Both patterns let you transfer complex/composite data without actually creating objects in memory (if you need to). interface DataIterator { int nextBid(); int nextAsk(); } interface DataVisitor { void visitEvent(int bid, int ask); }
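The allocation-free bonus can be shown with the visitor variant (class and field names below are illustrative): bid and ask travel as primitive ints, so no Event object is boxed per quote even though the consumer sees every update.

```java
// Primitive-valued visitor: no per-quote Event allocation.
interface PrimitiveDataVisitor {
    void visitEvent(int bid, int ask);
}

class QuoteFeed {
    private final int[] bids = {100, 101};
    private final int[] asks = {102, 103};

    // The provider drives iteration; quotes are handed over as primitives
    // straight out of the backing arrays.
    void retrieveData(PrimitiveDataVisitor vis) {
        for (int i = 0; i < bids.length; i++) {
            vis.visitEvent(bids[i], asks[i]);
        }
    }
}
```

At 1M quotes per second this is the difference between zero garbage and a million short-lived objects per second for the collector to chase.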

  38. Patterns conclusion • Design patterns are your friends. • Design Patterns book by Erich Gamma, et al (aka “Gang of Four”, aka GoF) is a highly recommended reading for everybody.

  39. Data structures &amp; algorithms • What class (data structure) would you typically use as a temporary data buffer, list of events, etc? • ArrayList (array data structure) • LinkedList (linked list data structure) • LinkedList only outperforms ArrayList on operations that involve inserting/deleting elements at an arbitrary position inside the list. • Most programs can be immediately made faster by replacing LinkedList with ArrayList. • If you add and remove elements from both sides (head and tail), arrays are still faster (cyclic buffer).
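The cyclic-buffer case mentioned in the last bullet is covered in the JDK by java.util.ArrayDeque (added in Java 6, after this deck's Java 5 benchmarks): an array-backed double-ended queue with O(1) operations at both head and tail and the data locality of a plain array. A minimal sketch:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CyclicBufferDemo {
    public static void main(String[] args) {
        // Array-backed cyclic buffer: O(1) add/remove at both ends,
        // contiguous storage instead of per-node LinkedList objects.
        Deque<Integer> buf = new ArrayDeque<>();
        buf.addLast(1);
        buf.addLast(2);
        buf.addLast(3);
        System.out.println(buf.pollFirst()); // 1: FIFO from the head
        buf.addLast(4);
        System.out.println(buf.pollLast());  // 4: LIFO from the tail
    }
}
```

For event buffers this gives queue semantics without LinkedList's per-element node allocations or its poor cache behavior.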

  40. Iteration measurement • Let us measure how many objects we can iterate over in ArrayList vs LinkedList. * Performance is measured on Pentium 4M 1.7GHz laptop with Java 1.5.0_03 (server JVM) under Windows XP SP2.

  41. Data locality • The primary reason for 8x difference is locality of data access in the case of ArrayList.

  42. Algorithms conclusion • General knowledge of different data structures and algorithms helps even in design stage. • Introduction to Algorithms by Cormen, et al is a highly recommended reading for everybody.

  43. Designing for overload • What if the load is too high? (too many events) • The system may fail, but how it fails is quite an important design decision. • What usually happens under the load beyond high CPU consumption: • Lock contention • Buffers overflow • The problems should not be getting progressively worse under the load.

  44. Designing for overload attempt • Let us try to design a means to notify data consumers of available events. interface Agent { void setEventListener(EventListener l); } interface EventListener { void eventsArrived(Event[] e); // or a version with visitor }

  45. Overload scenario • What would EventListener do if it receives more events than it can put into the outgoing TCP stream? • Buffer extra events • Drop extra events • The design should inherently accommodate the corresponding buffering and dropping strategies.

  46. Design for overload solution • Let the data consumer fetch data from the data source only when it has spare CPU cycles to do so. • Automatically increase block size under the load. It reduces all kinds of per-block overheads (lock overhead, I/O overhead, etc). interface Agent { void setEventListener(EventListener l); void fetchData(DataVisitor v); } interface EventListener { void eventsArrived(); }
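A sketch of this pull model (the internal deque, the String event type, and the publish method are illustrative additions): eventsArrived() carries no data and fires only when the buffer transitions from empty to non-empty; the consumer later drains everything accumulated since its last fetch, so block size grows automatically under load.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class PullAgent {
    interface EventListener { void eventsArrived(); }
    interface DataVisitor { void visitEvent(String e); }

    private final Deque<String> pending = new ArrayDeque<>();
    private EventListener listener;

    synchronized void setEventListener(EventListener l) {
        listener = l;
    }

    // Producer side: notify only on the empty -> non-empty transition,
    // so a slow consumer gets one notification per batch, not per event.
    synchronized void publish(String event) {
        boolean wasEmpty = pending.isEmpty();
        pending.addLast(event);
        if (wasEmpty && listener != null) {
            listener.eventsArrived();
        }
    }

    // Consumer side: drain the whole accumulated block in one locked pass.
    synchronized void fetchData(DataVisitor v) {
        for (String e; (e = pending.pollFirst()) != null; ) {
            v.visitEvent(e);
        }
    }
}
```

When the consumer falls behind, more events pile up between fetches, the drained blocks get bigger, and the per-block overheads shrink relative to useful work.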

  47. Scalability problem • You can scale by splitting processing onto multiple servers (one server processes symbols A-M, the other N-Z, etc) • However, any design that attempts to always deliver every event is inherently unstable – under peak loads queues begin to build, delays increase, and the “real-timeness” of events suffers, potentially rendering all delivered events unusable.

  48. Scalability solution • Scalable design must have a dropping strategy (if we have two back-to-back IBM quotes in our outgoing buffer, then we don’t need to keep the first one!) • This is a key to a truly graceful degradation under load – the fewer CPU cycles you have, the fewer events are delivered, but all delivered events are real-time.
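The dropping strategy described here is conflation, and it falls out of keying the outgoing buffer by symbol. A minimal sketch (class and method names hypothetical): an undelivered quote is simply overwritten by the next quote for the same symbol, so the two back-to-back IBM quotes from the slide collapse into one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ConflatingBuffer {
    // Keyed by symbol: only the most recent undelivered quote survives.
    private final Map<String, Integer> latest = new LinkedHashMap<>();

    synchronized void put(String symbol, int quote) {
        latest.put(symbol, quote); // overwrites any stale, undelivered quote
    }

    // Drain everything currently buffered as one block for delivery.
    synchronized Map<String, Integer> drain() {
        Map<String, Integer> out = new LinkedHashMap<>(latest);
        latest.clear();
        return out;
    }
}
```

Memory use is bounded by the number of distinct symbols rather than the event rate, and the slower the consumer drains, the more updates conflate away, which is precisely the graceful degradation the slide describes.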

  49. Overall conclusion • Process events in blocks everywhere in your system that handles a high number of events per second • Save on locking • Save on I/O cost • Increase block size under the load • Use design patterns • Use appropriate algorithms and data structures

  50. The End • Thank you for your attention. • Any questions? • elizarov@devexperts.com
