Measuring Media Gateway Software Efficiency Using Performance Monitor Counters Mikko Viitanen S-38.310 Thesis seminar on networking technology Helsinki University of Technology 12.05.2005
Basic Information • Thesis written at Oy L M Ericsson Ab, Finland • Supervisor: Professor Jörg Ott • Instructors: M.Sc. Stefan Blomqvist and M.Sc. Dietmar Fiedler
Contents • Background • Problem Description • Objectives • Scope • Performance Monitor Counters • Memory Hierarchy • Measurement Environment • Results • Future Work
Background (1/2) • The Universal Mobile Telecommunications System (UMTS) is a third generation mobile network standard specified by the 3rd Generation Partnership Project (3GPP) • The UMTS network is based on GSM and GPRS • UMTS specifications and features are grouped into releases • This enables vendors to build interoperable networks
Background (2/2) • Release 4 introduced the layered network architecture • The Mobile services Switching Centre (MSC) was divided into the MSC server and the Circuit-Switched Media Gateway (CS-MGW) • The MSC server handles call control • The Media Gateway (MGW) handles media and bearer control
Problem Description • The Media Gateway is a real-time multiprocessor system • A common problem in complex systems is how to verify and measure software performance • Performance monitor counters offer a way to monitor code efficiency at the processor level • This thesis addresses the following questions: • What kinds of efficiency problems can be found by using the performance monitor counters? • What kinds of programming methods should be used to reach better results than before?
Objectives • The purpose is to get results that can be used to find efficiency problems in the MGW’s software • Find ways to improve the system performance
Scope • The MGW’s software will be introduced • The software development tools used in the MGW software development will be presented • Overall software performance issues will be discussed • The Performance Monitor Counters measurement method will be explained
Performance Monitor Counters (1/2) • Performance Monitor Counters (included in many PowerPC family processors) are special registers used for performance measurement. • The measurements are performed at runtime. The processor increments the registers when monitored events occur. • Because the method uses dedicated resources built into the processor that operate in parallel with the others, it does not affect system performance, which is why it can provide very realistic results.
Performance Monitor Counters (2/2) The following events can be measured: • Completed instructions per processor clock cycle • Memory hierarchy behavior (e.g. cache misses) • Usage of different execution units • Types of instructions dispatched • Branch predictions • etc.
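As a minimal sketch of how the first event in the list could be turned into an IPC figure, the snippet below computes IPC from two snapshots of a completed-instructions counter and a cycle counter. The function names, the sample values, and the assumption of 32-bit counters that wrap around are illustrative and not taken from the thesis.

```python
COUNTER_BITS = 32  # assumption: counters are 32 bits wide and wrap to zero


def counter_delta(start, end, bits=COUNTER_BITS):
    """Return events counted between two snapshots, tolerating one wraparound."""
    return (end - start) % (1 << bits)


def instructions_per_cycle(instr_start, instr_end, cycles_start, cycles_end):
    """Compute IPC for the code executed between the two snapshots."""
    cycles = counter_delta(cycles_start, cycles_end)
    if cycles == 0:
        raise ValueError("no cycles elapsed between snapshots")
    return counter_delta(instr_start, instr_end) / cycles


# Hypothetical snapshot values; the cycle counter wraps during the measurement.
ipc = instructions_per_cycle(1_000, 4_000, 0xFFFFF000, 0x00000770)
print(f"IPC = {ipc:.2f}")
```

Taking the difference modulo the counter width keeps the result correct when a counter overflows once between the two snapshots; longer measurement windows would need overflow handling beyond this sketch.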
Memory Hierarchy • Fetching data from different parts of the memory system takes a different number of cycles • Estimated access times: registers 0 cycles, L1 cache 7 cycles, L2 cache 18 cycles, main memory 70 cycles • Source for estimations: IBM PowerPC 740 / PowerPC 750 RISC Microprocessor User’s Manual.
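The effect of these latencies can be illustrated with a rough expected-cost calculation. The sketch below uses the cycle estimates from the slide; the hit rates are hypothetical, and treating each figure as the full cost of an access served by that level is a simplifying assumption.

```python
# Cycle estimates taken from the slide.
L1_CYCLES = 7
L2_CYCLES = 18
MEMORY_CYCLES = 70


def average_access_cycles(l1_hit_rate, l2_hit_rate):
    """Expected cycles per access; l2_hit_rate applies to accesses that miss L1."""
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return (l1_hit_rate * L1_CYCLES
            + l1_miss * l2_hit_rate * L2_CYCLES
            + l1_miss * l2_miss * MEMORY_CYCLES)


# Hypothetical hit rates, for illustration only.
for l1, l2 in [(0.99, 0.90), (0.95, 0.80), (0.90, 0.50)]:
    print(f"L1 {l1:.0%}, L2 {l2:.0%}: {average_access_cycles(l1, l2):.2f} cycles")
```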
Measurement Environment (1/2) • The first M-MGW (a complete node) is the System Under Test (SUT). The second M-MGW is a dummy one, not connected to any access networks. It just answers the SUT’s requests. • Several Catapult DCT2000s initiate all the traffic (act as UTRAN/GERAN simulators). • UPLoad generates user plane traffic according to Q.AAL2 signaling received from Catapults. • The MSC server is a real node, which is controlled by the Catapults. It manages both of the M-MGWs. • TTCN is used to initialize the PMC measurement procedure by activating the PMC registers and specifying the measured events.
Results (1/3) • L2 instruction cache misses reduce IPC (Instructions Per Clock cycle) quite severely • Most probably the main reason for the large delay is that when an L2 instruction cache miss occurs, the processor cannot execute the following instructions, because the missing instruction can affect them. The processor has to wait until the missing instruction is available.
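A simple stall model makes this reasoning concrete: if the processor stalls for the full main memory latency on every L2 instruction cache miss, the penalty is added directly to the cycles spent per instruction. The sketch below is not the model used in the thesis; the base IPC and the miss rates are hypothetical, and the 70-cycle penalty is the main memory estimate from the Memory Hierarchy slide.

```python
MISS_PENALTY_CYCLES = 70  # main memory estimate from the Memory Hierarchy slide


def ipc_with_misses(base_ipc, misses_per_instruction, penalty=MISS_PENALTY_CYCLES):
    """IPC when every miss stalls the pipeline for `penalty` cycles."""
    cycles_per_instruction = 1.0 / base_ipc + misses_per_instruction * penalty
    return 1.0 / cycles_per_instruction


# Hypothetical base IPC of 1.0 and hypothetical miss rates.
for misses_per_1000 in (0, 1, 5, 10):
    ipc = ipc_with_misses(1.0, misses_per_1000 / 1000)
    print(f"{misses_per_1000:>2} misses per 1000 instructions: IPC = {ipc:.2f}")
```

Under these assumptions, ten misses per thousand instructions already reduce IPC from 1.0 to about 0.59.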
Results (2/3) • In general, the amount of load has quite a small effect on the IPC values. However, some measurement points are strongly affected when the load increases. • What do the points that achieved much better IPC values during high load have in common? They all contain data structure operations, such as searches, adds and removes. When the system is under high load, the number of elements in these data structures is considerable, and managing the data structures can be done efficiently from the processor’s point of view.
Results (3/3) • The amount of code in an operation has an effect on the IPC value. The lengths of the measured pieces of code differ quite a lot. • The use of complicated state machines is the main reason for low IPC values in short operations. When code is generated from a state machine with small pieces of code, the program is very fragmented (contains numerous small blocks).
Future Work • Topics for future work: • Comparing the results to other pieces of software that are implemented using different development tools. • The comparison can also be done using different processors. For instance, if a similar processor had L1 and L2 caches of double the size, the results would most likely be different.
Thank you! Questions or comments?