SmartDIMM Application Examples

Application Example 1: Image Processing

While the field of digital media processing contains a wide variety of applications, most of them share two features: (i) they require a great deal of memory bandwidth, and (ii) their low-level operations are usually simple and operate on 8- or 16-bit integer data. These features are well matched to the capabilities of SmartDIMM: the close integration of FPGA and DRAM will allow maximum memory throughput, while the simple, small-data-size operations map well to the Virtex FPGA logic.
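
As a software illustration of the kind of operation meant here, the following Python sketch shows two such kernels on 8-bit pixel data: a threshold and a 3x3 averaging filter. The function names, the border handling, and the integer division by 9 are assumptions made for the example; they are not taken from the VHDL implementations discussed in this proposal.

```python
def threshold(image, level):
    """Binarize an 8-bit image: pixels >= level become 255, all others 0."""
    return [[255 if p >= level else 0 for p in row] for row in image]


def average3x3(image):
    """3x3 box filter on interior pixels; border pixels are copied unchanged.

    The integer division by 9 is an assumption of this sketch; a hardware
    version might normalize differently.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the borders keep their values
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            total = sum(
                image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            )
            out[y][x] = total // 9
    return out


if __name__ == "__main__":
    img = [[0, 0, 0], [0, 90, 0], [0, 0, 0]]
    print(threshold(img, 50))
    print(average3x3(img))
```

Each output pixel depends only on a small, fixed neighborhood of input pixels, so the per-pixel computations are independent of one another. That independence is what allows such kernels to be replicated in parallel FPGA logic.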

As part of our preliminary investigation into the performance of CIMA designs, we implemented a variety of image processing kernels in VHDL. Using commercial CAD tools, we then synthesized these kernels onto several families of Xilinx FPGAs in order to obtain detailed performance predictions. A synopsis of the results that are relevant to SmartDIMM appears in the table below. It lists execution times for the following three hardware configurations: (i) a conventional Complex Instruction Set Computer (CISC) CPU (a 500 MHz Intel Celeron), (ii) a Virtex XCV300 using the full I/O signaling bandwidth of the 240-pin PQFP package, and (iii) SmartDIMM, which includes the bandwidth restrictions of memory communication over the PC100 frontside bus. As can be seen in the table, implementing parallelizable operations on the virtual hardware of the FPGA results in tremendous application speedup over a CISC CPU. Also note that although the PC100 memory data transfer rates limit the SmartDIMM performance somewhat (to roughly 70% to 80% of the theoretical FPGA speedup for the operations listed), there is still a 6X improvement over that obtained when conventional PCI bus transfer rates limit performance.

| Image Processing Operation | CISC CPU processing time (500 MHz Celeron) | XCV300PQ240 Virtex FPGA (incl. I/O BW limitations) | SmartDIMM (incl. PC100 memory BW limitations) |
|---|---|---|---|
| Threshold | 19.65 ns/pixel | 1.84 ns/pixel | 2.62 ns/pixel |
| Averaging Filter | 37.24 ns/pixel | 1.89 ns/pixel | 2.62 ns/pixel |
| Edge Enhance | 26.54 ns/pixel | 2.42 ns/pixel | 3.02 ns/pixel |

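The ratios implied by these measurements can be checked with a few lines of arithmetic. The sketch below copies the ns/pixel figures from the table; the dictionary layout and the function name are illustrative choices, not part of the original study.

```python
# ns/pixel figures copied from the table: (CISC CPU, Virtex FPGA, SmartDIMM)
TIMES_NS_PER_PIXEL = {
    "Threshold": (19.65, 1.84, 2.62),
    "Averaging Filter": (37.24, 1.89, 2.62),
    "Edge Enhance": (26.54, 2.42, 3.02),
}


def speedups(cpu, fpga, smartdimm):
    """Return speedup ratios derived from three per-pixel execution times."""
    return {
        "fpga_vs_cpu": cpu / fpga,
        "smartdimm_vs_cpu": cpu / smartdimm,
        # Fraction of the I/O-limited FPGA rate that SmartDIMM retains
        "smartdimm_fraction_of_fpga": fpga / smartdimm,
    }


if __name__ == "__main__":
    for name, row in TIMES_NS_PER_PIXEL.items():
        s = speedups(*row)
        print(
            f"{name}: FPGA {s['fpga_vs_cpu']:.1f}X, "
            f"SmartDIMM {s['smartdimm_vs_cpu']:.1f}X, "
            f"retained {s['smartdimm_fraction_of_fpga']:.0%}"
        )
```

Read this way, the SmartDIMM column corresponds to a speedup of roughly 7.5X to 14X over the CPU, and to about 70% to 80% of the FPGA rate measured without the PC100 restriction.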
 

Application Example 2: Java Acceleration

Translation of Java bytecodes to native code has become a popular alternative for improving the performance of Java interpretation [11]. Another leap in Java execution speed is possible by providing custom hardware acceleration for specific methods (or sets of methods), mapping performance-critical (hot spot) sections of a Java application directly to FPGA hardware [12]. Significant gains are anticipated, depending on the underlying parallelism available in the application. For example, the DES encryption core implemented on the proposed SmartDIMM is estimated to achieve a 100X speedup over native code execution on our test PC.

This proposal will investigate the following two dynamic configuration techniques: (1) dynamic synthesis, from bytecode sequences, of the hardware to be configured on a Field Programmable Gate Array (FPGA), and (2) use of pre-compiled hardware cores that correspond to commonly used Java methods; these cores will be downloaded to the FPGA on-the-fly. The first approach is similar to how the JIT compiler dynamically opts to compile and execute rather than interpret, and has been shown in [13] to be promising for a subset of Java bytecodes. In that work, however, limited memory bandwidth prevented any overall performance acceleration. The SmartDIMM architecture will help overcome this obstacle.

The second approach can be beneficial for two reasons. First, it avoids the hardware compilation cost by using pre-compiled hardware cores for the specific class of applications. Configuring the hardware remains expensive, however, so only methods that run long enough or often enough can amortize this cost during their execution on specialized hardware; one can therefore expect only a few specific functions to benefit from this capability. Second, application-specific cores usually perform much better than synthesized circuits. Our preliminary observations regarding the distribution of time spent by the dynamic compiler also motivate the use of application-specific hardware cores [11].
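
The amortization argument can be made concrete with a simple break-even model. This is a sketch under stated assumptions: a one-time configuration cost, a fixed per-call time in software and in hardware, and hypothetical numbers chosen only for illustration (none of them are measurements from this work).

```python
def break_even_calls(config_cost, sw_time, hw_time):
    """Smallest call count n for which hardware execution pays off.

    Hardware wins when  n * sw_time > config_cost + n * hw_time,
    i.e. when           n > config_cost / (sw_time - hw_time).
    Returns None if the hardware is not faster per call.
    All arguments are integers in the same time unit (e.g. microseconds).
    """
    saving = sw_time - hw_time
    if saving <= 0:
        return None
    return config_cost // saving + 1


if __name__ == "__main__":
    # Hypothetical values: 10 ms to configure the FPGA, 100 us per call in
    # software, 1 us per call in hardware (a 100X per-call speedup).
    print(break_even_calls(10_000, 100, 1))  # -> 102
```

Under these assumed values a method must be invoked more than a hundred times before configuring the core is worthwhile, which is consistent with the expectation that only a few hot methods will benefit.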

[Figure: Mapping of Java executable code to FPGA hardware]

While FPGA accelerator cards are commercially available, they are limited by the comparatively slow PCI peripheral interface bus. Interfacing through the SDRAM bus provides low-latency access between the host CPU and the accelerator, which will allow this mapping and execution to take place with much lower overhead than in a PCI-based FPGA accelerator and thus provide a much greater level of acceleration.

Application Example 3: Network Interface

Network interfaces have become considerably smarter, with many (such as the Myrinet interface) providing a programmable processor. This feature makes it possible to support User-Level Networking (ULN) such as the Virtual Interface Architecture (VIA) [14]. With ULN, user processes can send messages to and receive messages from the network directly, without the involvement of the operating system and without compromising protection. The interface polls regions of memory for outgoing messages, and incoming messages are examined and transferred directly to the appropriate user-level buffers. This approach results in end-to-end latencies (from a user process on one machine to a user process on another machine) of less than 10 microseconds. By comparison, the same communications take hundreds of microseconds using traditional kernel-based mechanisms (sockets, RPCs, etc.). These factors have popularized the use of off-the-shelf workstations connected by high-speed interconnects and ULNs.
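
The polling mechanism described above can be illustrated with a small ring buffer shared between a user process and the interface. This is a simplified sketch, not the queue format defined by VIA or used by any particular NIC; the class and method names are hypothetical, and a real implementation would also need memory-ordering guarantees between the two sides.

```python
class SendQueue:
    """Fixed-size ring of message slots shared by a user process and the NIC.

    One slot is always left empty so that head == tail means "empty".
    """

    def __init__(self, slots):
        self.slots = [None] * slots
        self.head = 0  # next slot the interface will read
        self.tail = 0  # next slot the user process will write

    def post(self, message):
        """User side: enqueue without a system call. False if the ring is full."""
        nxt = (self.tail + 1) % len(self.slots)
        if nxt == self.head:
            return False
        self.slots[self.tail] = message
        self.tail = nxt
        return True

    def poll(self):
        """Interface side: return the next outgoing message, or None if empty."""
        if self.head == self.tail:
            return None
        message = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        return message


if __name__ == "__main__":
    q = SendQueue(4)
    q.post(b"hello")
    q.post(b"world")
    print(q.poll(), q.poll(), q.poll())
```

The user process only writes the tail and the interface only writes the head, so neither side needs the operating system to mediate an individual send.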

Despite these performance-enhancing innovations, current interface hardware still resides on the PCI bus, which results in several performance limitations. First, transfer bandwidth is limited to what the PCI bus offers. Second, additional DMA transfers are needed to move data back and forth between main memory and the buffers on the Network Interface Card (NIC), which adds latency as well. One can think of our SmartDIMM memory as the equivalent of a (very large) buffer on the NIC. We thus eliminate a DMA transfer in both the send and receive operations, leading to potentially even lower latencies (and higher bandwidths) than those offered by current interfaces. While this issue has been briefly discussed in the literature [15], it has not been seriously explored because of the unattractiveness of introducing peripherals on the memory bus. However, the high-speed SmartDIMM interface makes our approach a viable alternative that achieves the low-latency/high-bandwidth goals for data transfers. The scope of this project with respect to the NIC design will be limited to exploring FPGA configurations and the software interface to the SmartDIMM, which transfers data between user data structures and the memory on the card. Exploration of hardware and software support for interfacing to external links/switches is beyond the scope of the proposed project.

[Figure: A traditional NIC vs. a Smart Memory Card NIC]