Parallelizing Network Applications on Commodity Multi-core Architectures

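Parallelizing Network Applications on Commodity Multi-core Architectures
Advisor: Guo Yan (郭燕)
Team members: Wang Jinlei (王金雷), Zhou Yuefeng (周岳峰), Yu Long (禹龙), Ma Xuguang (马旭光), Chen Long (陈龙)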

Outline
• Project background
• Design overview
• Lock-free queue study
• Parallelizing Libnids
• Division of work
• Conclusion
• References

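Background
• The Internet is built on TCP/IP and the World Wide Web on HTTP. Existing implementations of these protocols mostly run on a single core, so they cannot exploit multi-core hardware and fall short of what high-speed network processing requires.
• Using multi-core technology to accelerate TCP/IP, HTTP and other network protocols raises the throughput of the Internet and the Web as far as possible and benefits network applications. For example, deep inspection of packets on the network can be applied to information security and data-theft prevention in enterprises, banks, education, healthcare and government.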

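Building parallel network programs on a multi-core platform
• Use general-purpose multi-core processors to build a high-speed network processing platform and process network programs in parallel.
• Exploit the multi-core architecture by parallelizing an existing serial network program rather than writing a parallel network program from scratch.
• The experimental platform is a multi-core processor provided by the college.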

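Design overview
• Four cores are organized into a two-stage functional pipeline: the first stage (IP) fetches packets, and the second stage (APP) does the remaining processing.
• Organization of the parallel program:
  • Because the protocol stack is layered, packet processing maps naturally onto a pipeline.
  • Because packets of different flows are usually independent, several cores can execute the same stage.
• The IP core applies the connection-affinity principle: packets that belong to the same flow (identified by a pair of IP addresses) are dispatched to the same APP core.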

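Design overview (continued)
• The platform provides no hardware FIFO, so to share data between cores at high speed we designed a concurrent lock-free FIFO queue that gives efficient communication between adjacent cores.
• A symmetric hash function is used to dispatch packets.
• Global variables and explicit or implicit lock operations in the program are eliminated wherever possible. For example, buffers are pre-allocated so that malloc() calls, and the locks hidden inside them, disappear.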
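The pre-allocation idea in the last bullet can be sketched as follows. This is a minimal illustration, assuming each core owns a private pool of fixed-size buffers; the names `buf_pool`, `pool_get`, `pool_put` and the sizes are hypothetical and not taken from the project.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BUFS 64     /* hypothetical pool depth */
#define BUF_SIZE  2048   /* hypothetical: room for one Ethernet frame */

/* One pool per core: only its owner touches it, so no lock is needed. */
typedef struct {
    unsigned char storage[POOL_BUFS][BUF_SIZE];
    void *free_list[POOL_BUFS];   /* stack of currently free buffers */
    size_t n_free;
} buf_pool;

/* Called once at start-up, before any packet is processed. */
static void pool_init(buf_pool *p)
{
    for (size_t i = 0; i < POOL_BUFS; i++)
        p->free_list[i] = p->storage[i];
    p->n_free = POOL_BUFS;
}

/* Replaces malloc(): pops a buffer, or returns NULL when the pool is empty. */
static void *pool_get(buf_pool *p)
{
    return p->n_free ? p->free_list[--p->n_free] : NULL;
}

/* Replaces free(): pushes the buffer back onto the owner's free stack. */
static void pool_put(buf_pool *p, void *buf)
{
    assert(p->n_free < POOL_BUFS);
    p->free_list[p->n_free++] = buf;
}
```

In a pipeline a buffer is often released on a different core than the one that allocated it; in that case the buffer would have to travel back to its owner, for example through another lock-free queue, before `pool_put` is called.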

Task mapping [figure: panels (a) and (b)]

Lock-free queue study
• Lock-free FIFO
• Performance analysis
• Algorithm improvements
• Performance comparison

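FIFO implementation: locking queue
• This queue couples the producer and the consumer tightly. The two can end up waiting for each other, so it is inefficient and not a good choice.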

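FIFO implementation: Lamport queue
• Lamport proved that, under the sequential-consistency memory model, the lock of a single-producer/single-consumer queue can be removed, which yields a concurrent lock-free (CLF) queue.
• This removes the coupling between producer and consumer and eliminates explicit synchronization.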
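A minimal sketch of such a single-producer/single-consumer queue, assuming C11 atomics with their default sequentially consistent ordering to stand in for the memory model Lamport assumes. The type and function names and the capacity are illustrative, not the project's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_SIZE 8   /* hypothetical capacity; one slot is always left empty */
static_assert(QUEUE_SIZE > 1, "need at least two slots");

typedef struct {
    void *buf[QUEUE_SIZE];
    atomic_size_t head;   /* next slot to write; stored only by the producer */
    atomic_size_t tail;   /* next slot to read; stored only by the consumer */
} lamport_queue;

/* Producer side: returns 1 on success, 0 if the queue is full. */
static int lamport_enqueue(lamport_queue *q, void *data)
{
    size_t head = atomic_load(&q->head);
    size_t next = (head + 1) % QUEUE_SIZE;
    if (next == atomic_load(&q->tail))
        return 0;                     /* full */
    q->buf[head] = data;
    atomic_store(&q->head, next);     /* publish only after the slot is written */
    return 1;
}

/* Consumer side: returns 1 on success, 0 if the queue is empty. */
static int lamport_dequeue(lamport_queue *q, void **data)
{
    size_t tail = atomic_load(&q->tail);
    if (tail == atomic_load(&q->head))
        return 0;                     /* empty */
    *data = q->buf[tail];
    atomic_store(&q->tail, (tail + 1) % QUEUE_SIZE);
    return 1;
}
```

Both sides still read the other side's index on every operation, so `head` and `tail` keep moving between the two cores' caches.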

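FIFO implementation: FastForward CLF queue
• Decouples head and tail. NULL marks an empty slot, and the queue state is determined by checking whether the slot that head or tail points to is empty, which avoids comparing head and tail directly.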
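The slot-based test described above can be sketched like this. It is an illustration under assumptions: the names, the capacity and the use of C11 acquire/release atomics are choices made here, not details from the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_SIZE 8                       /* hypothetical capacity */
#define NEXT(i) (((i) + 1) % QUEUE_SIZE)

typedef struct {
    _Atomic(void *) slot[QUEUE_SIZE];      /* NULL marks an empty slot */
    size_t head;                           /* private to the producer */
    size_t tail;                           /* private to the consumer */
} ff_queue;

/* Producer: looks only at the slot under head, never at tail. */
static int ff_enqueue(ff_queue *q, void *data)
{
    assert(data != NULL);                  /* NULL is reserved as the empty marker */
    if (atomic_load_explicit(&q->slot[q->head], memory_order_acquire) != NULL)
        return 0;                          /* slot still occupied: queue is full */
    atomic_store_explicit(&q->slot[q->head], data, memory_order_release);
    q->head = NEXT(q->head);
    return 1;
}

/* Consumer: looks only at the slot under tail, never at head. */
static int ff_dequeue(ff_queue *q, void **data)
{
    void *v = atomic_load_explicit(&q->slot[q->tail], memory_order_acquire);
    if (v == NULL)
        return 0;                          /* slot empty: queue is empty */
    *data = v;
    atomic_store_explicit(&q->slot[q->tail], NULL, memory_order_release);
    q->tail = NEXT(q->tail);
    return 1;
}
```

For brevity `head` and `tail` sit next to each other in one struct here; a real implementation would keep them on separate cache lines so that the two indices themselves do not cause false sharing.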

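Slip maintenance
• The algorithm above runs at initialization.
• If the processing time of a stage fluctuates with a non-zero mean, or the stages are unbalanced, the algorithm has to be run periodically to readjust.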

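False sharing [figure: Thread0 on core1 and Thread1 on core2, each holding a copy of the same cache line from memory]
• When threads on different cores write to different data that happen to sit in the same cache line, the cores' caches must keep synchronizing that line, which causes serious performance problems. This is the so-called false sharing problem.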
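A common remedy is to give each independently written field its own cache line. The sketch below assumes a 64-byte line, which is typical but should be checked for the target processor; the struct names are made up for the illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumption: cache-line size of the target CPU */

/* Both counters share one cache line: a writer on core1 and a writer
 * on core2 invalidate each other's copy on every update. */
struct counters_shared {
    long produced;   /* written by the producer thread */
    long consumed;   /* written by the consumer thread */
};

/* Each counter starts on its own cache line, so the two writers no
 * longer contend for the same line. */
struct counters_padded {
    alignas(CACHE_LINE_SIZE) long produced;
    alignas(CACHE_LINE_SIZE) long consumed;
};

static_assert(offsetof(struct counters_padded, consumed) % CACHE_LINE_SIZE == 0,
              "consumed must start on a cache-line boundary");
static_assert(sizeof(struct counters_padded) == 2 * CACHE_LINE_SIZE,
              "each counter occupies a full line");
```

The cost is memory: the padded struct takes two full cache lines instead of a few bytes.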

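FIFO implementation: Read/Write Aggregate queue
    FIFO_ELEM tmp[ELEM_PER_CACHELINE];    /* producer-private staging buffer */

    int enqueue_aggregate(FIFO_ELEM *data)
    {
        if (NULL != queue[head])          /* cache line at head not consumed yet */
            return FALSE;
        tmp[current] = *data;
        current++;
        if (current == ELEM_PER_CACHELINE) {
            /* publish a whole cache line of elements with one copy */
            memcpy(queue[head], tmp, sizeof(FIFO_ELEM) * ELEM_PER_CACHELINE);
            head = NEXT(head);
            current = 0;
        }
        return TRUE;
    }
• This effectively avoids the false sharing problem.
• However, the staging buffer tmp introduces an extra copy and extra memory accesses.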

FIFO implementation: Align-Shift queue (initialization)
    init_queue()
    {
        /* over-allocate by CACHE_LINE_SIZE - 1 so the adjusted start always fits */
        p = pre_malloc(QUEUE_SIZE * sizeof(FIFO_ELEM) + CACHE_LINE_SIZE - 1);
        if (!(p mod CACHE_LINE_SIZE))
            queue = p + CACHE_LINE_SIZE - 1;
        else
            queue = p + CACHE_LINE_SIZE - 1 - (p mod CACHE_LINE_SIZE);
        return TRUE;
    }
• Queue initialization: performs the "alignment" step.

[Figure: position of p and of the adjusted queue pointer relative to the cache-line boundaries 64n and 64(n + 1)]

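FIFO implementation: Align-Shift queue (enqueue)
    enqueue_align_shift(FIFO_ELEM *data)
    {
        /* current line is full and the next line has not been consumed yet */
        if (count == ELEM_PER_CACHE_LINE && queue[NEXT(head)] != NULL)
            return FALSE;
        if (count == ELEM_PER_CACHE_LINE) {
            head = NEXT(head);            /* move on to the next cache line */
            count = 1;
        } else {
            count++;
        }
        if (count == ELEM_PER_CACHE_LINE)
            queue[head] = data;           /* last write of a line fills its first slot */
        else
            queue[head + count] = data;
        return TRUE;
    }
• Reads and writes alternate, and the marker slot is written.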

Demo [figure: Thread0 (reader) on core1 and Thread1 (writer) on core2 work on different cache lines, located at 64(n + 1) and 64(n + 2) in memory]

Performance comparison: FastForward vs. Read/Write Aggregate vs. Align-Shift

Performance comparison: different FastForward implementations
• The modulo operation is very expensive.
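Index wrap-around is where the modulo usually appears in a ring queue. The sketch below shows two common ways to avoid a division, assuming the queue size is only known at run time; the function names are hypothetical, and whether either variant is actually faster should be measured on the target platform.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: general modulo, which compiles to a division when size
 * is not a compile-time constant. */
static inline size_t next_mod(size_t i, size_t size)
{
    return (i + 1) % size;
}

/* Power-of-two sizes only: wrap with a bit mask instead of dividing. */
static inline size_t next_mask(size_t i, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (i + 1) & (size - 1);
}

/* Any size: wrap with a compare, which is usually well predicted
 * because the wrap happens only once per pass over the ring. */
static inline size_t next_wrap(size_t i, size_t size)
{
    return (i + 1 == size) ? 0 : i + 1;
}
```

When the size is a compile-time power of two, compilers typically turn the unsigned `%` into the same mask on their own, so the difference mainly shows up for run-time or non-power-of-two sizes.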

Performance comparison: optimized FastForward vs. Align-Shift

Parallelizing Libnids and performance testing
• Parallelization design
• Performance testing
• Performance comparison

Detailed Libnids packet-capture flow [call diagram; function names listed in the order they appear]
nids_init(), nids_exit(), open_live(), raw_init(), init_procs(), test_malloc(), test_malloc(), tcp_init(), init_hash(), getrnd(), ip_frag_init(), scan_init(),
nids_register_tcp(), register_callback(), test_malloc(),
nids_run(), nids_pcap_handler(), call_ip_frag_procs(), gen_ip_frag_proc(), nids_ip_filter(), ip_fast_csum(), nids_register_chksum_ctl(), ip_defrag_stub(), hostfrag_find(), frag_index(), gen_ip_proc(), my_udp_check(), nids_register_chksum_ctl(), csum_partial(), csum_tcpudp_magic(), csum_fold(),
nids_pcap_handler(), call_ip_frag_procs(), nids_ip_filter()

Our parallelization

串行并行串行 。注意： front: modified by CP core(c:collect) current: modified by AP core(a:application) rear: modified by IP core(i:input) [front, current] = unsettled [current, rear] = settled uncollect

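Choosing the hash function
• An asymmetric hash function can spread packets more evenly across the cores, but it cannot guarantee connection affinity.
• Connection affinity has two main properties:
  • 1) Packets that belong to the same connection are processed in order.
  • 2) Packets that belong to different connections can be processed in parallel.
• On a multi-core platform the hit rate of the hash function (packets of one connection always reaching the same core) matters more than load balance, so we chose a symmetric hash function.
    int val = hash(ip)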
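A symmetric hash gives the same result when source and destination are swapped, so both directions of a connection reach the same APP core. The sketch below is one simple way to get that property; the function name `pick_app_core` and the mixing constants are illustrative assumptions, not the hash used in the project.

```c
#include <assert.h>
#include <stdint.h>

/* Maps a flow, identified by its pair of IPv4 addresses, to an APP core.
 * XOR is commutative, so (src, dst) and (dst, src) hash identically. */
static unsigned pick_app_core(uint32_t src_ip, uint32_t dst_ip,
                              unsigned n_app_cores)
{
    assert(n_app_cores > 0);
    uint32_t h = src_ip ^ dst_ip;   /* symmetric combination of the pair */
    h ^= h >> 16;                   /* mix high bits into low bits */
    h *= 0x45d9f3bu;                /* arbitrary odd multiplier for mixing */
    h ^= h >> 16;
    return h % n_app_cores;
}
```

For example, a packet from 10.0.0.1 to 10.0.0.2 and the reply from 10.0.0.2 to 10.0.0.1 are assigned to the same core, which keeps the per-connection state on one core without any lock.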

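Reentrancy considerations
• A program or subroutine is called reentrant if it can be "safely executed in parallel": while it is running, it can be entered and executed again, and under parallel execution every individual result still matches what the design expects.
• A reentrant function:
  • must not hold static (global) non-constant data;
  • must not return the address of static (global) non-constant data;
  • may only work on data supplied by the caller;
  • must not call non-reentrant functions.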

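Approach
• A global variable can be eliminated when it satisfies connection affinity, that is, when it represents the "state" of one particular connection. Such a variable is handled as follows, starting from code like this:
    static int timenow = 0;   /* timenow can't be used in parallelized code directly */

    static int jiffies()
    {
        if (timenow)
            return timenow;
        timenow = ...;
        return timenow;
    }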

Modified code
• An example of reentrant code:
    static int jiffies(IP_THREAD_LOCAL_P ip_thread_local_p)
    {
        /* timenow is private data of each thread */
        if (ip_thread_local_p->timenow)
            return ip_thread_local_p->timenow;
        ip_thread_local_p->timenow = ...;
        return ip_thread_local_p->timenow;
    }

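Global variables that "cannot" be eliminated
• Such a global belongs to the whole runtime environment and holds state shared by all connections, for example a log.
    static int n_tickets = 100;

    void *pthread_sell(THREAD_LOCAL_P tickets_local_p)
    {
        int n = tickets_local_p->n_tickets;   /* per-thread copy of the counter */
        sell(n);
        return NULL;
    }
With three threads, this is like three ticket windows each selling the same hundred tickets, which defeats the original intent.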
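When state really is shared by all threads, one option is to keep a single copy and update it atomically instead of copying it into thread-local data. The sketch below applies that to the ticket example using C11 atomics; `sell_ticket` and `tickets_left` are hypothetical helpers, and this is only one possible remedy, not the project's code.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int n_tickets = 100;   /* one shared counter for all threads */

/* Sells one ticket. Returns 1 on success, 0 when sold out.
 * Safe to call concurrently from any number of threads. */
static int sell_ticket(void)
{
    int left = atomic_load(&n_tickets);
    while (left > 0) {
        /* Succeeds only if nobody changed the counter in between;
         * on failure, left is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&n_tickets, &left, left - 1))
            return 1;
    }
    return 0;
}

/* Number of tickets still available. */
static int tickets_left(void)
{
    int left = atomic_load(&n_tickets);
    assert(left >= 0);   /* the loop above never decrements below zero */
    return left;
}
```

However many threads call `sell_ticket()`, exactly one hundred calls succeed in total. The price is that every sale touches the same cache line, so such shared counters should stay off the per-packet fast path where possible.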

Libnids performance test method
1. Capture packets from the network with Wireshark. Install it with: apt-get install wireshark, then run it.
2. Collect packets and save them in pcap format.
3. Install tcpreplay: apt-get install tcpreplay
4. Generate the file used for replay from the saved packets.
6. Build and run the test program:
    gcc -o z test.c -lnids -lpcap -lnet -pthread -O0
    ./z
7. Replay the packets.

Performance comparison: serial vs. parallel

Amdahl's law
• If f is the fraction of a computation that is inherently sequential and n is the number of processors, then the overall speedup is $S(n) = \frac{1}{f + \frac{1 - f}{n}}$.
• Note: n is a single-parameter characterization of the hardware (it speeds up the parallelizable fraction 1 - f of the program), f (the inherently sequential fraction) is a single-parameter characterization of the software, and S(n) denotes the overall speedup of the computation.
Observe:
• The assumption is that the parallelizable part of the program is infinitely parallelizable without scheduling or communication overhead, and that the size of the input remains constant.
• When f is large, optimizations will have little effect.
• As n approaches infinity, the speedup is bounded by 1/f.
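To make the bound concrete, the formula can be evaluated directly. The helper below is a small illustration; the function name `amdahl_speedup` is made up for this sketch.

```c
#include <assert.h>

/* Amdahl's law: overall speedup on n processors when a fraction f of
 * the computation is inherently sequential. */
static double amdahl_speedup(double f, double n)
{
    assert(f >= 0.0 && f <= 1.0 && n >= 1.0);
    return 1.0 / (f + (1.0 - f) / n);
}
```

On the four-core pipeline used here, a program that is half sequential (f = 0.5) can gain at most 1 / (0.5 + 0.5 / 4) = 1.6 times, and with f = 0.1 no number of cores gets past a factor of 10.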

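Division of work
• Zhou Yuefeng and Wang Jinlei: tuning and implementing the FIFO algorithms.
• Ma Xuguang and Chen Long: performance testing of libnids with Wireshark and other tools.
• Ma Xuguang and Yu Long: replaying packets against libnids.
• Documentation: written jointly by Yu Long, Chen Long and others.
• System testing: carried out by the members in turn.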

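Conclusion
• Summary: the libnids source code still needs further study to find its most time-consuming parts and parallelize them. Only then will overall efficiency improve.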

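Main references
[1] Practice of Parallelizing Network Applications on Multi-core Architecture.
[2] IP Router Architectures: An Overview.
[3] 高速网络环境下的入侵检测技术 (Intrusion detection techniques in high-speed network environments).
[4] http://zh.wikipedia.org/wiki/
[5] http://www.ibm.com/developerworks/cn/linux/l-oprof/index.html
[6] An Intro to Lock-free Programming.
[7] 多线程并发访问无锁队列的算法研究 (A study of algorithms for multithreaded concurrent access to lock-free queues).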

Thank You!
