
MBUF Problems and solutions on VxWorks

Presentation Transcript


  1. MBUF Problems and solutions on VxWorks Dave Thompson and cast of many.

  2. MBUF Problems
This is usually how it lands in my inbox:

On Tue, 2003-05-06 at 20:38, Kay-Uwe Kasemir wrote:
> Hi:
>
> Neither ics-accl-srv1 nor the CA gateway were able to get to dtl-hprf-ioc3.
>
> Via "cu", the IOC looked fine except for error messages:
> (CA_TCP): CAS: Client accept error was "S_errno_ENOBUFS"
> (CA_online): ../online_notify.c: CA beacon error was "S_errno_ENOBUFS"

• This has been a problem since before our front-end commissioning, even though we are using PowerPC IOCs and a fully switched, full-duplex, 100 Mbps Cisco-based network infrastructure.
• The error is coming from the Channel Access server.

  3. Contributing Circumstances (according to Jeff Hill)
• The total number of connected clients is high.
• The server's sustained (data) production rate is higher than the clients' sustained consumption rate.
• Clients subscribe for monitor events but do not call ca_pend_event() or ca_poll() to process their CA input queue (a minimal well-behaved client is sketched after this list).
• The server does not get a chance to run.
• The server has multiple stale connections.
And also probably:
• tNetTask does not get to run.
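
As an illustration of the monitor-queue point above, here is a minimal sketch of a CA monitor client that does service its input queue. It uses the EPICS 3.14-style C client API from cadef.h; the PV name "demo:pv" is made up for the example and is not one of the SNS applications mentioned in this talk.

    #include <stdio.h>
    #include "cadef.h"   /* EPICS Channel Access client API */

    /* Monitor callback: runs inside the CA client library whenever an
       update is delivered; keeping it short lets the input queue drain. */
    static void valueChanged(struct event_handler_args args)
    {
        if (args.status == ECA_NORMAL)
            printf("%s = %g\n", ca_name(args.chid), *(const double *) args.dbr);
    }

    int main(void)
    {
        chid chan;
        SEVCHK(ca_context_create(ca_disable_preemptive_callback), "context");
        /* "demo:pv" is a hypothetical PV name, used only for illustration */
        SEVCHK(ca_create_channel("demo:pv", NULL, NULL, 0, &chan), "create");
        SEVCHK(ca_pend_io(5.0), "connect");
        SEVCHK(ca_create_subscription(DBR_DOUBLE, 1, chan, DBE_VALUE,
                                      valueChanged, NULL, NULL), "subscribe");
        /* The part that matters here: call ca_pend_event() regularly so
           monitor updates are consumed and the server cannot back up.   */
        for (;;)
            ca_pend_event(1.0);
        return 0;
    }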

  4. Contributing Circumstances
SNS now has a number of different IOCs:
• 21 VxWorks IOCs
• 21 (+/-) Windows IOCs
• 1 Linux IOC
• 4 OPIs in the control room and many others on site
• Servers running CA clients such as the archiver
• Users remotely logged in, running edm through ssh's X tunnel
• The CA Gateway
• Other IP clients and services running on VxWorks and on servers
• Other IP applications running on IOCs, such as log tasks, EtherIP, and serial devices running over IP

  5. Our experience to date
At SNS we have seen all of the contributing circumstances that Jeff mentions.
• At BNL, Larry Hoff saw the problem on an IOC where the network tasks were being starved.
• Many of our IOCs have heavy connection loads.
• There are some CA client and Java CA client applications which need to be checked.
• IOCs get hard reboots to fix problems and thus leave stale connections.
• Other network problems have existed and been "fixed", including CA gateway loopback.

  6. Late breaking: Jeff Hill was at ORNL last week
• One of the things he suspected was that noise on the Ethernet wiring causes the link to re-negotiate speed and full/half duplex operation.
• He confirmed that the combination of the MV2100 and the Cisco switches is prone to frequent auto-negotiation, shutting down Ethernet I/O on the IOC.
• This is not JUST a boot-up problem.

  7. What is an mbuf anyway?
VxWorks uses this structure to avoid calls to the heap functions malloc() and free() from within the network driver.
• mBlks are the nodes that make up a linked list of clusters.
• The clusters store the data while it is in the network stack.
• There is a fixed number of clusters of each of several sizes.
• Since a given cluster block can exist on more than one list, you need 2X as many mBlks as clusters.
A simplified sketch of these structures follows.
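
The structures below are a simplified, illustrative version of the relationship described above. The type and field names are chosen for readability and are not the actual VxWorks netBufLib definitions (the real types are M_BLK, CL_BLK, and the raw cluster memory).

    /* Simplified sketch of the netBufLib building blocks; names are
       illustrative only, not the real VxWorks headers.               */

    typedef struct clBlkSketch {
        char               *pCluster;  /* fixed-size data buffer (64..8192 bytes) */
        int                 refCount;  /* how many mBlks currently reference it   */
    } CL_BLK_SKETCH;

    typedef struct mBlkSketch {
        struct mBlkSketch  *pNext;     /* next mBlk in this chain                 */
        CL_BLK_SKETCH      *pClBlk;    /* cluster block holding the actual data   */
        char               *pData;     /* current position within the cluster     */
        int                 len;       /* bytes of valid data                     */
    } M_BLK_SKETCH;

    /* Because two chains (for example a socket send queue and a driver
       transmit queue) can reference the same cluster block at the same
       time, the pool is built with roughly 2X as many mBlks as clusters. */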

  8. Mbuf and cluster pools
• Each network interface has its own mbuf pool: netStackDataPoolShow() (aka mbufShow).
• The system has a separate mbuf/cluster pool used for routing, socket information, and the ARP table: netStackSysPoolShow().
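
Both show routines can be run straight from the VxWorks target shell (console or telnet) when the ENOBUFS messages appear, assuming the netShow facilities are included in the kernel; the next slide shows what the data-pool output looks like.

    -> netStackDataPoolShow     (data pool of the network interfaces, aka mbufShow)
    -> netStackSysPoolShow      (system pool: routes, sockets, ARP table)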

  9. Output from mbufShow
number of mbufs: 400
number of times failed to find space: 0
number of times waited for space: 0
number of times drained protocols for space: 0

size    clusters    free     usage
----------------------------------------
  64       200       199      1746
 128       400       400    190088
 256        80        80       337
 512        80        80         0
1024        50        50         1
2048        50        50         0
4096        50        50         0
8192        50        50         0

Slide callouts: the 128-byte row illustrates the high turnover rate, the 4096- and 8192-byte pools were added at SNS, and one entry is flagged "This one is mis-configured. Why?"

  10. Our Default Net Pool Sizes
You should add these lines to config.h (or perhaps configAll.h); they will override the default definitions in usrNetwork.c.

#define NUM_64        100   /* no. 64 byte clusters   */
#define NUM_128       200   /* no. 128 byte clusters  */
#define NUM_256        40   /* no. 256 byte clusters  */
#define NUM_512        40   /* no. 512 byte clusters  */
#define NUM_1024       25   /* no. 1024 byte clusters */
#define NUM_2048       25   /* no. 2048 byte clusters */
/* NUM_4096 and NUM_8192 are the pools added at SNS; the slide used them
   below without showing their values, so 25 each is only an assumption. */
#define NUM_4096       25   /* no. 4096 byte clusters */
#define NUM_8192       25   /* no. 8192 byte clusters */

#define NUM_CL_BLKS   (NUM_64 + NUM_128 + NUM_256 + \
                       NUM_512 + NUM_1024 + NUM_2048 + \
                       NUM_4096 + NUM_8192)
#define NUM_NET_MBLKS (2 * (NUM_CL_BLKS))
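
As a rough sanity check, the cluster payloads defined above (excluding the 4096/8192-byte pools and the per-buffer mBlk and cluster-block headers) work out to:

    64*100 + 128*200 + 256*40 + 512*40 + 1024*25 + 2048*25
      = 6,400 + 25,600 + 10,240 + 20,480 + 25,600 + 51,200
      = 139,520 bytes (about 136 KB) of cluster data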

  11. What we are doing at SNS
• We are using a kernel addition that provides for setting the network stack pool sizes on the boot line.
• 4X the VxWorks default sizes are working well.
• We see high use rates for the 128-byte clusters, so that allocation is set extra high.
• Use huge numbers only when trying to diagnose a problem such as a resource leak.
• We are configuring the network interfaces to disable auto-negotiation of speed and duplex.
Code for the kernel addition is available at http://ics-web1.sns.ornl.gov/EPICS-S2003
