
Lecture 18: Case Study of SoC Design



  1. ECE 412: Microcomputer Laboratory, Lecture 18: Case Study of SoC Design

  2. Outline • Web server example • MP3 example

  3. Example: Embedded web server application • A basic web server capable of responding to simple HTTP requests • Simple CGI requests for dynamic HTML • A timer peripheral is read before, during, and after servicing an HTTP request to log throughput calculations, which are then displayed on a dynamically generated web page (as sketched below) • A simple read-only file system was implemented using flash memory to store static web pages and JPEG images
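
A minimal sketch of that timer instrumentation, in C, assuming a hypothetical memory-mapped timer register (the base address and register layout are placeholders, not the actual Nios timer map):

```c
#include <stdint.h>

/* Hypothetical memory-mapped timer; base address and register layout
 * are placeholders, not the actual Nios timer peripheral map. */
#define TIMER_BASE      0x00920800u
#define TIMER_SNAP_REG  (*(volatile uint32_t *)(TIMER_BASE + 0x4))

/* Timer snapshots taken around one HTTP request. */
uint32_t t_request, t_first_packet, t_done;

void serve_http_request(void)
{
    t_request = TIMER_SNAP_REG;       /* incoming connection accepted   */
    /* ... parse request, open the TCP connection, locate the file ... */
    t_first_packet = TIMER_SNAP_REG;  /* first TCP response packet out  */
    /* ... send the file contents packet by packet ... */
    t_done = TIMER_SNAP_REG;          /* file completely sent           */
}
```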

  4. Throughput calculations • Transmission throughput • Reflects the latency from the start of sending the first TCP packet containing the HTTP response until the file was completely sent • Could theoretically reach a maximum of 10 Mbps • The raw network speed that the CPU and TCP/IP stack are capable of sustaining • HTTP server throughput • Takes into account all delay between the incoming HTTP connection request and the completion of the file send • Includes the transmission latency above • Also measures the time the HTTP server took to open a TCP connection to the host
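
Both metrics fall out of the three snapshots directly. A sketch, assuming a free-running 32-bit up-counter at an assumed 33 MHz system clock and the snapshot variables from the sketch above:

```c
#include <stdint.h>

#define CLK_HZ 33000000u   /* assumed 33 MHz system clock */

/* Snapshots captured in the previous sketch. */
extern uint32_t t_request, t_first_packet, t_done;

/* Bits per second scaled to Mbps; (end - start) wraps correctly
 * for a free-running 32-bit up-counter. */
static double mbps(uint32_t bytes, uint32_t start, uint32_t end)
{
    double seconds = (double)(end - start) / (double)CLK_HZ;
    return (bytes * 8.0) / (seconds * 1.0e6);
}

void log_throughput(uint32_t file_bytes)
{
    double tx_mbps   = mbps(file_bytes, t_first_packet, t_done); /* transmission */
    double http_mbps = mbps(file_bytes, t_request,      t_done); /* HTTP server  */
    /* ... embed both figures in the dynamically generated page ... */
    (void)tx_mbps; (void)http_mbps;
}
```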

  5. Baseline system • The web server was put to the test serving JPEG images of varying sizes across the LAN to a host PC • During each transfer, several snapshots of the timer peripheral were taken

  6. Baseline system dataflow [Diagram: Nios CPU (instruction master, data master) on the Avalon bus, connected to FLASH, Ethernet MAC, UART/IO/timer, and SRAM] The Nios CPU's data master port is used to read data memory (SRAM) and write to the Ethernet MAC. This occurs for each packet transmitted in the baseline system.
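
From software, that baseline dataflow amounts to a programmed-I/O copy loop; a hypothetical sketch (the MAC transmit-FIFO register and its address are placeholders):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical Ethernet MAC transmit FIFO; the address is a placeholder. */
#define MAC_TX_FIFO (*(volatile uint32_t *)0x00900000u)

/* Baseline dataflow: the Nios data master reads each word of the packet
 * from SRAM and writes it to the MAC, once per packet, burning CPU cycles. */
void send_packet_pio(const uint32_t *payload, size_t words)
{
    for (size_t i = 0; i < words; i++)
        MAC_TX_FIFO = payload[i];
}
```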

  7. Performance optimization • Use a DMA to transfer data from incoming packets into memory without the intervention of the microprocessor • Use a custom peripheral to perform the checksum calculation • Combine the two • Optimize the slave-arbitration priority for the memories to provide maximum data throughput

  8. Dataflow enhancement with DMA • Use DMA to transfer packets between the Ethernet MAC and data memory • The CPU gets higher priority for any conflicts with the DMA • During a DMA transfer, the CPU is free to access other peripherals • For access to the shared SRAM, arbitration is performed (a driver sketch follows below) [Diagram: Nios CPU (instruction master, data master) and DMA controller (read master, write master) on the Avalon bus; an arbitrator shares SRAM between them; FLASH, Ethernet MAC, and UART/IO/timer also sit on the bus]
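
The driver sketch mentioned above, with a hypothetical DMA register map (names, offsets, and addresses are illustrative, not the actual Avalon DMA core interface):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical DMA controller register map; names and offsets are
 * illustrative, not the actual Avalon DMA core interface. */
#define DMA_BASE    0x00920000u
#define DMA_SRC     (*(volatile uint32_t *)(DMA_BASE + 0x0))
#define DMA_DST     (*(volatile uint32_t *)(DMA_BASE + 0x4))
#define DMA_LEN     (*(volatile uint32_t *)(DMA_BASE + 0x8))
#define DMA_CTRL    (*(volatile uint32_t *)(DMA_BASE + 0xC))
#define DMA_STATUS  (*(volatile uint32_t *)(DMA_BASE + 0x10))
#define DMA_GO      0x1u
#define DMA_BUSY    0x1u

/* Hand the packet copy to the DMA engine; the CPU is then free to
 * touch other peripherals while the Avalon arbitrator interleaves
 * the DMA's SRAM accesses with any CPU accesses. */
void send_packet_dma(const uint32_t *payload, size_t bytes)
{
    DMA_SRC  = (uint32_t)(uintptr_t)payload;  /* SRAM source           */
    DMA_DST  = 0x00900000u;                   /* MAC TX port (assumed) */
    DMA_LEN  = (uint32_t)bytes;
    DMA_CTRL = DMA_GO;                        /* kick off the transfer */

    while (DMA_STATUS & DMA_BUSY)
        ;  /* or do useful work / take an interrupt instead of spinning */
}
```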

  9. Performance improvement • Transmission throughput is doubled compared to the baseline • Overall HTTP server throughput is about 2.5x that of the baseline • 36% increase in logic resource usage (3,600 logic elements)

  10. TCP checksum • Checksum calculations can be regarded as a necessary evil in dataflow-sensitive applications • For a 1300-byte payload, the calculation takes about 33,000 clock cycles • At a 33 MHz clock speed that is 1 ms of latency per maximum-size packet • In the benchmark, the largest file (60 KB) breaks down into 46 maximum-size packets • That accounts for 46 ms of the 156 ms transmission latency in the baseline • The inner loop of the TCP/IP stack's checksum performs a 16-bit one's complement checksum calculation (shown below) • Adding up data repeatedly is a simple task for hardware, so a Verilog implementation can be designed • The checksum peripheral operates by reading the payload contents directly out of data memory, performing the checksum calculation, and storing the result in a CPU-addressable register • It now takes 386 clock cycles, a speedup of about 90x over the software version
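
The software inner loop in question is the standard 16-bit one's complement sum; a representative C version in the textbook RFC 1071 style (not necessarily the stack's exact code):

```c
#include <stdint.h>
#include <stddef.h>

/* 16-bit one's complement checksum over a payload, RFC 1071 style.
 * On a 33 MHz soft core, this word-at-a-time loop is what costs
 * roughly 33,000 cycles for a 1300-byte payload. */
uint16_t tcp_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {                 /* add 16-bit words */
        sum += (uint32_t)(data[0] << 8 | data[1]);
        data += 2;
        len  -= 2;
    }
    if (len == 1)                     /* pad a trailing odd byte */
        sum += (uint32_t)(data[0] << 8);

    while (sum >> 16)                 /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```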

  11. Checksum peripheral • Again, for access to the shared SRAM, arbitration is performed (a usage sketch follows below) [Diagram: Nios CPU (instruction master, data master) and checksum peripheral (read master) on the Avalon bus; an arbitrator shares SRAM between them; FLASH and UART/IO/timer also sit on the bus]
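
The usage sketch mentioned above: from the CPU's point of view, the peripheral reduces to a handful of registers. The register map here is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical register map for the custom checksum peripheral. */
#define CKSUM_BASE   0x00920400u
#define CKSUM_ADDR   (*(volatile uint32_t *)(CKSUM_BASE + 0x0))
#define CKSUM_LEN    (*(volatile uint32_t *)(CKSUM_BASE + 0x4))
#define CKSUM_CTRL   (*(volatile uint32_t *)(CKSUM_BASE + 0x8))
#define CKSUM_RESULT (*(volatile uint32_t *)(CKSUM_BASE + 0xC))
#define CKSUM_START  0x1u
#define CKSUM_DONE   0x2u

/* The peripheral's read master pulls the payload straight out of SRAM
 * (arbitrated against the CPU), checksums it, and latches the result
 * in a CPU-addressable register: ~386 cycles vs. ~33,000 in software. */
uint16_t tcp_checksum_hw(const uint8_t *payload, size_t len)
{
    CKSUM_ADDR = (uint32_t)(uintptr_t)payload;
    CKSUM_LEN  = (uint32_t)len;
    CKSUM_CTRL = CKSUM_START;
    while (!(CKSUM_CTRL & CKSUM_DONE))
        ;                              /* a few hundred cycles at most */
    return (uint16_t)CKSUM_RESULT;
}
```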

  12. Performance boost • Transmission latency decreased by 44 ms • Average transmission throughput increased 40% and average HTTP throughput increased 25% over the baseline • Resource usage increased 22% over the baseline (3,250 logic elements)

  13. Putting it all together

  14. Embedded uP systems in Xilinx FPGAs • A traditional embedded microprocessor system as implemented on a platform FPGA • A co-processor architecture with multiple hardware accelerators • 1. Start by developing for the first architecture • 2. Automatically generate the second architecture under the control of the user

  15. Profiling results • DCT32 and IMDCT36 perform the discrete cosine transform and inverse discrete cosine transform, respectively • The other functions are multiply-accumulate functions of various sizes (a generic sketch follows below) • These functions account for over 90% of the total application execution time on the host
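
The generic sketch mentioned above: the multiply-accumulate hot spots are tight loops of this shape (the function name and types are illustrative, not the actual MP3 decoder source):

```c
#include <stdint.h>

/* Generic fixed-point multiply-accumulate kernel of the kind the
 * profiler flags: a tight loop of multiplies feeding one accumulator.
 * Loops like this, and the DCT32/IMDCT36 transforms built from them,
 * account for over 90% of the decoder's execution time. */
int64_t mac_n(const int32_t *x, const int32_t *c, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int64_t)x[i] * c[i];
    return acc;
}
```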

  16. Design automation • Implement co-processor accelerators to meet performance requirements • Use the tagging facilities in the Xilinx design environment to mark functions for hardware acceleration • 'Compile for target': the tool chain creates an implementation that includes a MicroBlaze processor and the same interfaces as before, augmented with three hardware accelerators that implement the multiplications, the DCT, and the inverse DCT • The creation of the hardware accelerator blocks is done automatically: an advanced C-to-hardware compiler optimized for platform FPGAs generates the blocks, 'stitches' the accelerators into the new co-processing architecture, and handles the movement of the appropriate data to and from the accelerators

  17. New architecture

  18. Final results • Enables the MP3 application to run in real time at a system clock rate of 67.5 MHz

  19. A simple summary • Platform-based design involves hardware/software codesign • The right design decisions can provide a significant performance improvement • A careful tradeoff among performance, resource usage, cost, and design time is needed • Platform FPGAs are a convenient, low-cost platform for such a task

  20. Overview of the Rest of the Semester • This is the last formal lecture • If we haven't covered it already, we can't really expect you to use it on your projects • Quiz 2 is next Thursday; no class next Tuesday • Final project proposals are 4/13 and 4/15 • Two teams each day; each team has 20 minutes • Proposal presentations can be sent to me by email before class or brought in on a flash drive • Initial report due 4/20 (new due date) • Three pages (four at most) • May contain: introduction, background, motivation, impact, block diagram, and workload partition among team members • Goal: give us enough information to provide feedback about project complexity and suggestions • From now on, I'll hold office hours during class meeting times to discuss final-project issues • Final project presentation: 5/12 • Final project report/demo: due 5/14 • See Lecture 14 for details

  21. Next time • Quiz 2 (next Thursday, 4/8)
