Breakout the Zynq Ultrascale+ GEMs with Ethernet FMC

June 16, 2016, 6:28 pm

Did you know that the Zynq UltraScale+ has 4 built-in Gigabit Ethernet MACs (GEMs)? That makes it awesome for Ethernet applications, which is why I’ve just developed and shared an example design for the Zynq UltraScale+ ZCU102 Evaluation board, armed with an Ethernet FMC to break out those handy GEMs. The ZCU102 board has two FMC connectors, both high-pin-count (HPC), so I’ve created one basic design with two sets of constraints to choose from, depending on which FMC connector you want to use.

These scripts will build the Vivado project and block diagram for you: Zynq GEM Ethernet FMC example design

The scripts rely on the ZCU102 board definition files which don’t come built into Vivado 2016.1. I’m guessing that they will in the near future, but for now, to be able to build the project you’ll need to request access to the ZCU102 HeadStart Lounge and properly install the board definition files.

Want to know more about the Zynq UltraScale+ MPSoC? Check out the video from Xilinx below. By the way, the image above comes from 0:58 of the video.

↧

Bye bye Platform Cable USB II, Hello JTAG HS3

June 22, 2016, 10:30 am

Now that I think about it, I’ve been using my Xilinx Platform Cable USB II for 10 years now! That’s a terrific run in my opinion. I got it in a kit for the Virtex-5 ML505 board in 2006 and I would have kept using it if I hadn’t started getting strange error messages recently. So on a recommendation, I got myself a JTAG HS3 from Digilent and it is just ridiculously better. As you can see from the photo, it’s much smaller, although some people might see that as a downside because it’s easier to lose. For me, the real advantage is that it is so much faster than the Platform Cable. I like tools that don’t make me wait, because my time is important and I have no patience for that moment when I’m waiting for the bitstream to download and I need to know whether my design changes are going to work or not. This tool rocks!

↧

At last! Affordable and fast, non-volatile storage for FPGAs

July 1, 2016, 10:15 am

Let me introduce you to Opsero’s latest offering: FPGA Drive FMC, a new FPGA Mezzanine Card that allows you to connect an NVMe PCIe solid-state drive to your FPGA.

There’s got to be a better way. In the past, if you were developing an FPGA-based product that needed a large amount of fast non-volatile storage, the best solution was to connect a SATA drive. Physical interfacing was pretty simple because all you needed was one gigabit transceiver. The downside with SATA drives, however, is that they require an IP core to implement the protocol layers between the host processor and the gigabit transceivers. This IP core can cost thousands of dollars and it uses up a lot of the FPGA resources, all of which pushes up the total system cost.

The better solution is the new SSD technology based on the NVM Express interface. NVMe is a new technology that is set to replace SATA as the most common way of connecting SSDs in personal computers. The new M.2 NVMe SSDs use a 4-lane PCI Express interface, which can connect directly to one of the integrated blocks for PCI Express that are available in most of the Series-7 FPGAs and the Zynq-7000.

fpga-drive-fmc-10

Benefits. The most significant benefit of using this solution versus a SATA or SAS drive is that you don’t need the SATA or SAS IP core. This greatly simplifies the FPGA design, reduces development time and saves the customer thousands of dollars in IP costs. The second benefit is that NVMe is dramatically faster than SATA and SAS. This is mainly due to two things: a faster physical interface and a more efficient protocol stack. A 4-lane Gen2 PCI Express link has more than 3 times the bandwidth of SATA 3.0. What’s more, the NVMe protocol stack was designed from the ground up to better exploit the potential of modern SSDs. Intel has shown that NVMe reduces latency overhead by more than 50%. And the knockout punch is this: all major Linux distributions now have in-box NVMe driver support, and that includes PetaLinux from version 2015.4 and up. This means that NVMe drives can be used in a Linux OS running on a MicroBlaze in an FPGA, or on the Zynq ARM, without even having to write custom drivers. If you want to see how, just check out these tutorials on the subject.
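
The “more than 3 times” figure can be checked with a quick calculation. The sketch below is not taken from any design files. It assumes the standard line rates (5 GT/s per lane for PCIe Gen2, 6 Gb/s for SATA 3.0) and the 8b/10b encoding that both links use, and it ignores protocol overhead above the physical layer. The function name link_mbps is made up for this example.

```shell
#!/bin/sh
# link_mbps LANES GBPS_PER_LANE
# Usable bandwidth in MB/s after 8b/10b encoding (8 data bits per 10 line bits).
link_mbps() {
  awk -v lanes="$1" -v rate="$2" 'BEGIN {
    # Gb/s -> Mb/s (x1000), remove encoding overhead (x8/10), bits -> bytes (/8)
    printf "%d\n", lanes * rate * 1000 * 8 / 10 / 8
  }'
}

pcie=$(link_mbps 4 5)   # PCIe Gen2 x4: 4 lanes at 5 GT/s
sata=$(link_mbps 1 6)   # SATA 3.0: 1 lane at 6 Gb/s
ratio=$(awk -v a="$pcie" -v b="$sata" 'BEGIN { printf "%.1f", a / b }')
echo "PCIe Gen2 x4: ${pcie} MB/s, SATA 3.0: ${sata} MB/s, ratio: ${ratio}x"
```

Under those assumptions the link works out to 2000 MB/s against 600 MB/s for SATA 3.0, a ratio of about 3.3.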

fpga-drive-fmc-6

Features. So let’s get back to the FPGA Drive FMC and take a look at what’s on it:

1x M.2 socket M-keyed for PCIe SSDs
1x High-pin count FMC connector (mates with both LPC and HPC connectors)
1x 100MHz clock oscillator to supply a clock to the FPGA and the SSD
1x EEPROM to store the board’s FRU data (eg. serial number and power supply information)
2x LEDs: one for power good, one for the SSD activity

The board gets all of its power from the FMC connector, so there’s no power supply circuitry as there is on the PCIe edge connector version of the product. The board also has 4 mounting holes so that it can support all 4 sizes of M.2 NVMe SSDs. The FMC uses two I/O signals, one for PERST (PCIe reset) driven by the FPGA, and one for PEDET (PCIe detect) driven by the SSD.

The images below show how the board interfaces with the PicoZed FMC Carrier Card V2 and the KC705.

fpga-drive-fmc-picozed-5

fpga-drive-fmc-kc705-6

If you want more information about this product, please visit the product website at fpgadrive.com.

and Happy Canada Day!

↧

Measuring the speed of an NVMe PCIe SSD in PetaLinux

July 2, 2016, 12:59 pm

With FPGA Drive we can connect an NVM Express SSD to an FPGA, but what kind of real-world read and write speeds can we achieve with an FPGA? The answer is: it depends. The R/W speed of an SSD depends as much on the SSD as it does on the system it’s connected to. If I connect my SSD to a 286, I can’t expect to get the same performance as when it’s connected to a Xeon. And depending on how it’s configured, the FPGA can be performing more like a Xeon or more like a 286. To get the highest performance from the SSD, the FPGA must be a pure hardware design, implementing the NVMe protocol in RTL to minimize latency and maximize throughput. But that’s hard work, and not very flexible, which is why most people will opt for the less efficient configuration whereby the FPGA implements a microprocessor running an operating system. In this configuration, we typically won’t be able to exploit the full bandwidth of NVMe SSDs because our processor is just not powerful enough.

But we still want to know, what speeds do we get from an FPGA running PetaLinux? To answer this question, I’ve done tests on two platforms: one on the KC705 board, running PetaLinux on the MicroBlaze soft processor, and another on the PicoZed 7030, running PetaLinux on the ARM Cortex-A9 processor.

How to measure the speed of an SSD in Linux. There are many ways to measure the read and write speed of an SSD in Linux, but the only one available to us in PetaLinux is the dd command. But that won’t suffice on its own. Normally the dd command alone gives us the read/write speed, but PetaLinux is built with a leaner version of dd that does not make this calculation for us. So we have to use the time command as well, and make the calculation ourselves.

The dd command lets us specify an input device, an output device and the number of bytes to transfer between them. The commands below transfer 1000 blocks of 2 megabytes each, or about 2 Gigabytes of data, between the input device and the output device. Be aware that the write test overwrites whatever is stored on the target partition.

Write test: time dd if=/dev/zero of=/dev/nvme0n1p1 bs=2M count=1000
Read test: time dd if=/dev/nvme0n1p1 of=/dev/null bs=2M count=1000

KC705 Test

kc705-nvme-ssd-speed-test-in-petalinux

From the screenshot above, we can see that the write test transfers 2 Gigabytes of data in 7 minutes and 45 seconds. The read test transfers 2 Gigabytes in 2 minutes and 21 seconds.
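
Because the PetaLinux build of dd does not report a transfer rate, the speed has to be worked out from the block size, the block count and the elapsed time printed by the time command. Here is a minimal sketch of that calculation, using the KC705 times above. The function name throughput_mbps is hypothetical, and “MB” here simply means the 2M blocks that dd counts.

```shell
#!/bin/sh
# throughput_mbps BLOCK_MB COUNT MINUTES SECONDS
# Average transfer rate in MB/s, to one decimal place.
throughput_mbps() {
  awk -v bs="$1" -v count="$2" -v mins="$3" -v secs="$4" 'BEGIN {
    total_mb = bs * count          # data moved by dd (bs=2M count=1000 -> 2000 MB)
    elapsed  = mins * 60 + secs    # "real" time reported by the time command, in seconds
    printf "%.1f\n", total_mb / elapsed
  }'
}

kc705_write=$(throughput_mbps 2 1000 7 45)   # write test: 7 min 45 s
kc705_read=$(throughput_mbps 2 1000 2 21)    # read test: 2 min 21 s
echo "KC705 write: ${kc705_write} MB/s, read: ${kc705_read} MB/s"
```

This gives 4.3 MB/s for the write test and 14.2 MB/s for the read test.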

The video was taken during the write test to show the SSD activity LED. Notice that the SSD has a couple of seconds of inactivity at regular intervals. I’m not sure exactly what is happening during those couple seconds, but something is holding up the show. Although not shown in the video, during the read test, the SSD doesn’t seem to go through these same periods of inactivity.

PicoZed Test

picozed-nvme-ssd-speed-test-in-petalinux

You can see in the screenshot above that I used the lsblk command to get the name of a partition on the SSD (nvme0n1p1). I use this device name as the output device in the write test, and the input device in the read test. The transfers went a lot faster in the PicoZed test, so I used a larger transfer size just to make the test last a bit longer and improve the accuracy of the result. The write test transfers 16 Gigabytes of data in 3 minutes and 9 seconds. The read test transfers 16 Gigabytes in 2 minutes and 12 seconds.

The video was taken during the write test to show the SSD activity LED. Notice that there are no breaks in SSD activity, in contrast to the Microblaze design. During the read test, same thing.

Results

Kintex-7 KC705 MicroBlaze processor clocked at 125MHz

Write speed: 4.3 MBps
Read speed: 14.2 MBps

Zynq-7000 PicoZed 7030 ARM Cortex-A9 clocked at 667MHz

Write speed: 84.7 MBps
Read speed: 121.2 MBps

So there’s a massive difference between the performance of the PicoZed and that of the KC705. The Zynq gets almost 20 times faster write speeds, and about 8.5 times faster read speeds. This isn’t only due to the faster processor. In the Zynq design, the AXI Memory Mapped to PCIe IP connects to the system memory via a high-performance (HP) AXI slave interface, and it doesn’t have to share that interface with the processor. In the MicroBlaze design, both the processor and the PCIe IP have to share access to the MIG through an AXI Interconnect.
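
As a quick check of those ratios, the sketch below divides the measured speeds from the Results section. The function and variable names are only for illustration.

```shell
#!/bin/sh
# speedup FAST SLOW
# How many times faster FAST is than SLOW, to one decimal place.
speedup() {
  awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.1f\n", fast / slow }'
}

write_gain=$(speedup 84.7 4.3)     # PicoZed vs KC705 write speed (MBps)
read_gain=$(speedup 121.2 14.2)    # PicoZed vs KC705 read speed (MBps)
echo "Write: ${write_gain}x faster, read: ${read_gain}x faster"
```

The write ratio comes out at 19.7 and the read ratio at 8.5.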

Neither design comes close to the performance in the Samsung specs or the test results by Ars Technica, which show sequential write speeds of 944MBps and read speeds of 2,299MBps. So this raises the next question: Can a hardware NVMe IP core running on this same hardware actually reach those speeds? If you want to help me find out, please get in touch.

↧

M.2 NGFF Loopback Module

August 2, 2016, 11:14 am

Half the fun of making cool stuff is sharing it with others. The photos I’m sharing in this post are of my new M.2 NGFF loopback module – it’s an M.2 form-factor module with a loopback on each of the 4 PCIe lanes, as well as some electronics to test other connections such as the 3.3V power supply and the 100MHz clock. It allows my assembler to test the FPGA Drive boards that come out of production. The other half of the test jig is of course the FPGA board, which I’ve designed to be driven by the PicoZed 7015 (I’ll share photos of this board in a later post).

m2-loopback-6

The challenge in designing an automated test for the FPGA Drive boards is that they supply a lot of connections to the SSD (which must be checked), but the only connection that the FPGA has with the SSD is via the 4 PCIe lanes (1). So any manufacturing faults that are detected by the M.2 loopback module must be communicated to the FPGA via the 4 PCIe lanes. If I had placed a Zynq on the M.2 module, it would have been easy to communicate any number of faults to the FPGA, but then the module would have cost a hell of a lot more. So my solution was to use a PCIe MUX/DEMUX whose outputs are connected in loopback. One of the outputs of the MUX/DEMUX is looped back with its polarity reversed, while the other is looped back with normal polarity. This way, I can use the SEL pin of the MUX/DEMUX to indicate a manufacturing defect to the FPGA. With 4 lanes, I can signal 4 different types of manufacturing errors. By also using the device’s shutdown pin, which removes the loopback, I can signal 4 more defects.

m2-loopback-5

In the production test, the PicoZed 7015 sends a PRBS signal at 5Gbps on each of the 4 PCIe lanes. If the M.2 loopback module does not detect any manufacturing defects, all 4 PCIe lanes are connected in loopback with normal polarity. Any signal integrity problems should be detected by the PicoZed in the form of bit errors in the received PRBS signal. If the M.2 module detects a problem with the power supply, the 100MHz clock, the LED, the PEDET connection, the reset signal or the DEVSLP connection, it will result in one of the lanes being polarity reversed, or being disconnected – both of which can be detected by the PicoZed. The M.2 loopback module also has power resistors which draw 2.5A of current and make sure that the power supply meets the M.2 standard requirements.
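
The pass/fail decision on the PicoZed side can be sketched as follows. This is only an illustration of the idea, not the actual test code. The three lane states (PRBS received with normal polarity, received with reversed polarity, or not received because the loopback was removed) come from the description above, while the function names and state labels are made up for the example. Bit error counting is left out.

```shell
#!/bin/sh
# classify_lane STATE
# STATE is what the PRBS checker sees on one lane (labels are hypothetical):
#   normal   = PRBS received with normal polarity
#   inverted = PRBS received with polarity reversed (SEL pin used to flag a defect)
#   none     = nothing received (loopback removed via the shutdown pin)
classify_lane() {
  case "$1" in
    normal)   echo "ok" ;;
    inverted) echo "fault-polarity" ;;
    none)     echo "fault-disconnected" ;;
    *)        echo "unknown" ;;
  esac
}

# board_result STATE0 STATE1 STATE2 STATE3
# Prints one line per faulty lane, then PASS only if every lane is ok.
board_result() {
  lane=0
  result="PASS"
  for state in "$@"; do
    status=$(classify_lane "$state")
    [ "$status" = "ok" ] || { echo "lane $lane: $status"; result="FAIL"; }
    lane=$((lane + 1))
  done
  echo "$result"
}

board_result normal normal inverted normal
```

With two detectable fault states on each of the 4 lanes, this scheme distinguishes the 8 defect indications mentioned earlier.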

m2-loopback-7

As the loopback module was designed to be compliant with the PCI Express M.2 specification, it can be used to test any M.2 carrier. If you want more information on these modules, or you’d like to purchase one, please contact me.

Notes: (1) Actually, the FPGA also has the PERST (reset) connection to the SSD, but we can’t use this signal for testing because (a) it’s driven by the FPGA Drive board on the PCIe edge-connector version and (b) it’s an input to the FPGA Drive FMC version.

↧

Micron’s new M.2 Solid-State Drive

August 10, 2016, 6:45 am

Computer memory giant, Micron, sent me a pre-production sample of their brand new M.2 NVMe solid-state drive. I tested it under PetaLinux on the PicoZed FMC Carrier Card V2 and the FPGA Drive adapter, and as expected, it passed all tests with flying colours. Although all of my previous tests were done with the Samsung VNAND 950 Pro SSD, the FPGA Drive adapter is designed to work with all M.2 PCIe compliant SSDs, and this test is confirmation of that.

As it’s pre-production, I can’t post photos of the actual SSD, so the image you see above is a stock image that they sent me. It’s not yet available for purchase, but I’ll be sure to update this post with a link when it does become available.

↧

FPGA Drive now available to purchase

August 16, 2016, 6:33 pm

Orders can now be placed for the FPGA Drive products on the Opsero website. Both the PCIe and FMC versions allow you to connect an M.2 PCIe solid-state drive to an FPGA development board and both can be purchased at the same price of $249 USD (solid-state drive not included).

The PCIe version has an 8-lane PCIe edge connector for interfacing with the PCIe blade (aka. goldfingers) of an FPGA development board. The board is powered by 12VDC so it comes with a power cable which allows you to power the FPGA Drive from the same power adapter that supplies power to the FPGA board.

fpga-drive-bring-up-6

The FMC version has a high pin count (HPC) FPGA Mezzanine Card (FMC) connector for interfacing with the FMC connectors of FPGA development boards. It gets all of its power through the FMC connector and it is compatible with both low pin count (LPC) and high pin count (HPC) FMC connectors (note that only 1 lane PCIe is supported on LPC connectors).

fpga-drive-fmc-kc705-6

Both versions have an M.2 M-key socket for the SSD, a 100MHz oscillator to supply a clock to both FPGA and SSD, and all the required mounting accessories. At the moment, there are example designs available for all of the compatible Xilinx Series-7 evaluation boards, and the PicoZed FMC Carrier Card V2. Please check out the product website for more detailed information.

If you would like to purchase either of these products, follow these links to place your order on the Opsero website: FPGA Drive (PCIe version) or FPGA Drive FMC

↧

NVMe Host IP tested on FPGA Drive

October 23, 2016, 7:03 am

I’ve been totally overloaded with projects in the last couple of months but I’m back with some really exciting news today. A few months back a company called IntelliProp, based in Colorado, released an NVMe Host Accelerator IP core for interfacing FPGAs with NVMe SSDs. This IP core allows reads and writes to be performed directly from the FPGA fabric, without the latency overhead of an operating system (read about the NVMe speed tests I did under PetaLinux). IntelliProp has tested their IP core with an FPGA Drive FMC loaded with a Samsung 950 Pro 256GB SSD and here are the results:

Kintex-7 KC705 Evaluation Board (PCIe Gen2): write speeds of 750MB/s and read speeds of 1,270MB/s
Kintex Ultrascale KCU105 Evaluation Board (PCIe Gen3): write speeds of 1,000MB/s and read speeds of 2,000MB/s

These numbers are impressive, considering that test results on the same SSD by Ars Technica (probably using PCIe Gen3) showed write speeds of 944MBps and read speeds of 2,299MBps. IntelliProp’s IP is a great solution for applications needing large non-volatile storage and a high bandwidth channel to the FPGA fabric. One such application is high speed data acquisition, where you’ve got a lot of data coming in quickly and you need to store it for later processing, like what they’d use in the Large Hadron Collider. Another advantage of this solution is that SSDs typically store much more data per square inch than DDR memory, so some applications currently using DDR for the bulk of their data storage might have a reason to switch over to NVMe SSDs now that the simplicity and throughput of the interface have significantly improved.

For more information on IntelliProp’s NVMe Host Accelerator IP Core:

http://intelliprop.com/hardware-storage-design/ip-cores/nvme-host-accelerator-ip-core-IPC-NV164-HI.htm

For more information on the NVMe SSD to FPGA interface solution:

http://fpgadrive.com

In the next few days I’ll be trying to reproduce IntelliProp’s results on our own hardware and I’ll post the results soon after.

↧

Tcl Automation Tips for Vivado and Xilinx SDK

November 1, 2016, 8:18 am

Tcl automation is one of the most powerful features integrated into the Vivado and Xilinx SDK tools and should be fully exploited to maximize your productivity as an FPGA developer. In this post I’ve put together a “cheat sheet” of some of the most useful commands and tricks that you can use to get more done through Tcl scripting. If you want more things added to the list, please let me know in the comments section at the end.

Vivado Tcl Automation Cheat Sheet

Get the Vivado install path
Useful when you need access to the IP sources.

# Vivado install path (e.g. "C:/Xilinx/Vivado/2016.3")
set vivado_dir $::env(XILINX_VIVADO)

Get the top level module name of a Vivado project
We often need to know the top level module of a Vivado design so that we can appropriately name other things, such as the SDK hardware project. I think the easiest way to get this name is by searching for one of the files in the Vivado project that uses the top level module name. Some of these files are: *.bit, *.hwdef, *.sysdef, *.hdf

set hdf_filename [lindex [glob -dir $vivado_folder/$vivado_folder.sdk *.hdf] 0]
set hdf_filename_only [lindex [split $hdf_filename /] end]
set top_module_name [lindex [split $hdf_filename_only .] 0]

Open a Vivado project
We first open a project in the Tcl script to be able to synthesize, implement and export the design.

# Open project
open_project $origin_dir/$proj_name/$proj_name.xpr

Synthesize a Vivado project
The project has to be opened first.

# Synthesize project
launch_runs synth_1
wait_on_run synth_1

Implement a Vivado project
The project has to be opened first.

# Implement project
launch_runs impl_1 -to_step write_bitstream
wait_on_run impl_1

Export a Vivado project for SDK
The code below works out the top module name from the bitstream file, then creates a .sdk subdirectory and then uses the write_sysdef command to export the design.

# Export project to SDK
set bit_filename [lindex [glob -dir "$origin_dir/$proj_name/${proj_name}.runs/impl_1" *.bit] 0]
set bit_filename_only [lindex [split $bit_filename /] end]
set top_module_name [lindex [split $bit_filename_only .] 0]
set export_dir "$origin_dir/$proj_name/$proj_name.sdk"
file mkdir $export_dir
write_sysdef -force \
  -hwdef "$origin_dir/$proj_name/${proj_name}.runs/impl_1/$top_module_name.hwdef" \
  -bitfile "$origin_dir/$proj_name/${proj_name}.runs/impl_1/$top_module_name.bit" \
  -meminfo "$origin_dir/$proj_name/${proj_name}.runs/impl_1/$top_module_name.mmi" \
  $export_dir/$top_module_name.hdf
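
The snippets above are meant to live in a Tcl script rather than be typed into the console one by one. One way to run such a script without opening the GUI is Vivado’s batch mode. The wrapper below is a sketch: the script name build.tcl is a placeholder, and when vivado is not on the PATH it only prints the command it would have run.

```shell
#!/bin/sh
# run_vivado_script SCRIPT
# Runs a Tcl script in Vivado batch mode, or reports the command if Vivado is not found.
run_vivado_script() {
  script="$1"
  cmd="vivado -mode batch -source $script"
  if command -v vivado >/dev/null 2>&1; then
    $cmd
  else
    echo "vivado not found on PATH; would run: $cmd"
    return 1
  fi
}

# build.tcl is a hypothetical script containing the open/synthesize/implement/export steps
run_vivado_script build.tcl || true
```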

More info on Vivado Tcl
For more information on the Vivado Tcl commands, refer to the Vivado Design Suite Tcl Command Reference Guide (UG835).

Xilinx SDK Tcl Automation Cheat Sheet

Get the Xilinx SDK install path
Useful when you need access to the driver and library sources.

# SDK install path (e.g. "C:/Xilinx/SDK/2016.3")
set sdk_dir $::env(XILINX_SDK)

Create a hardware project in an SDK workspace
To create a hardware project, you need to provide a name for the hardware project, such as ${top_module_name}_hw_platform_0, and the path to the .hdf file, which is the exported Vivado project and is usually located in the $vivado_folder.sdk subdirectory of the Vivado project. You can avoid an error message by first checking whether the hardware project already exists.

if {[file exists "$hw_project_name"] == 0} {
  createhw -name ${hw_project_name} -hwspec $hdf_filename
}

Get the name of the processor in the design
When creating software applications, we have to specify which processor the application will execute on. To find out what processors are available in the hardware project, you can parse the output of the getperipherals procedure. The following function will return the first processor that it finds.

# Below is an example of the output of "getperipherals":
# ================================================================================
# 
#               IP INSTANCE   VERSION                   TYPE           IP TYPE
# ================================================================================
# 
#            axi_ethernet_0       7.0           axi_ethernet        PERIPHERAL
#       axi_ethernet_0_fifo       4.1          axi_fifo_mm_s        PERIPHERAL
#           gmii_to_rgmii_0       4.0          gmii_to_rgmii        PERIPHERAL
#      processing_system7_0       5.5     processing_system7
#          ps7_0_axi_periph       2.1       axi_interconnect               BUS
#              ref_clk_fsel       1.1             xlconstant        PERIPHERAL
#                ref_clk_oe       1.1             xlconstant        PERIPHERAL
#                 ps7_pmu_0    1.00.a                ps7_pmu        PERIPHERAL
#                ps7_qspi_0    1.00.a               ps7_qspi        PERIPHERAL
#         ps7_qspi_linear_0    1.00.a        ps7_qspi_linear      MEMORY_CNTLR
#    ps7_axi_interconnect_0    1.00.a   ps7_axi_interconnect               BUS
#            ps7_cortexa9_0       5.2           ps7_cortexa9         PROCESSOR
#            ps7_cortexa9_1       5.2           ps7_cortexa9         PROCESSOR
#                 ps7_ddr_0    1.00.a                ps7_ddr      MEMORY_CNTLR
#            ps7_ethernet_0    1.00.a           ps7_ethernet        PERIPHERAL
#            ps7_ethernet_1    1.00.a           ps7_ethernet        PERIPHERAL
#                 ps7_usb_0    1.00.a                ps7_usb        PERIPHERAL
#                  ps7_sd_0    1.00.a               ps7_sdio        PERIPHERAL
#                  ps7_sd_1    1.00.a               ps7_sdio        PERIPHERAL
proc get_processor_name {hw_project_name} {
  set periphs [getperipherals $hw_project_name]
  # For each line of the peripherals table
  foreach line [split $periphs "\n"] {
    set values [regexp -all -inline {\S+} $line]
    # If the last column is "PROCESSOR", then get the "IP INSTANCE" name (1st col)
    if {[lindex $values end] == "PROCESSOR"} {
      return [lindex $values 0]
    }
  }
  return ""
}

Create an SDK workspace
We really just target a directory as the SDK workspace, and if it’s empty, Xilinx SDK creates all of the files to describe the workspace.

set sdk_ws_dir "./sdk"
if {[file exists $sdk_ws_dir] == 0} {
  file mkdir $sdk_ws_dir
}
setws $sdk_ws_dir

Add a local repository to the workspace
If your applications have to refer to some custom or modified libraries, then you will typically place them in a separate directory and add this directory as a local SDK repository to your workspace. You have to use the “setws” command first.

set sdk_repo "../repo"
repo -set $sdk_repo

Create a software application from one of the Xilinx SDK templates
Xilinx SDK comes with a few software application templates which are useful for doing basic hardware tests when bringing up new boards. The code below creates the lwIP Echo Server application. Notice that the code refers to the function get_processor_name which was described above.

# Generate the lwIP echo server application
createapp -name echo_server \
  -app {lwIP Echo Server} \
  -proc [get_processor_name $hw_project_name] \
  -hwproject ${hw_project_name} \
  -os standalone

Build the BSPs and software applications

# Build all
projects -build

More info on SDK Tcl
A simple Google search brings up lots of old information about the SDK batch mode which has changed a lot since 2016.1. The best way to get more information about the Xilinx SDK Tcl commands is by going into the Xilinx SDK Help and searching with the keywords “batch mode”:
1. Open Xilinx SDK
2. Select Help->Help Contents
3. Type “batch mode” in the search field and press Enter.
4. Click on “XSCT Commands” in the search results.
5. Click on “SDK Projects” on the XSCT Commands page that opens.

There is also some limited but recent information on this answer record:
https://www.xilinx.com/support/answers/66629.html

↧

A quick look at the Kintex Ultrascale KCU105

January 21, 2017, 11:38 am

I’ve got the Kintex Ultrascale Development Kit on my desk today so it’s a good time to take a look inside and see what’s special about this board. The Ultrascale (20nm) and Ultrascale+ (16nm) FPGAs are taking over from the Series-7 devices (28nm), and I’ve seen more and more customer interest in them in recent months. The Kintex Ultrascale is the little brother of the Ultrascale family, providing the “best price/performance/watt” and “an optimum blend of capability and cost-effectiveness” according to Xilinx.

There’s a Zynq on the board! I didn’t expect to see a Zynq 7Z010 on the KCU105 but there it is. It’s used as a system controller – read more about it in the user guide.

HDMI output port using Analog Devices ADV7511.

2 Gigabytes of 64-bit DDR4 memory. Notice that they didn’t put SODIMMs on the KCU105, VCU108 or VCU118 boards either, only component memory.

PCIe edge connector with 8-lanes of Gen3. On the Series-7 boards only the Virtex ones had PCIe Gen3 so this is a nice step-up.

USB UART, JTAG, Gigabit Ethernet and 2x SFP+.

Power supply.

CAT5e Ethernet cable.

Optical patch cable.

Power cable for when you want to plug this into a PC.

3x USB cables.

2x 850nm optical SFP+ modules.

FMC loopback.

PCIe loopback.

Fedora install CD.

Another nice addition to this board is the Pmod connector. The Series-7 boards didn’t have Pmod connectors but they can really come in handy. Hopefully I’ll get to have a bit of fun with this over the next few weeks.

↧

Connecting an M.2 SSD to FPGA Drive FMC

January 28, 2017, 9:16 am

Just released a video showing how to connect an M.2 SSD to the FPGA Drive FMC.

↧

Demo of Intelliprop’s NVMe Host Accelerator IP core

January 31, 2017, 12:43 pm

I’ve just done a video to demo Intelliprop’s NVMe Host Accelerator IP core on the Xilinx Kintex Ultrascale KCU105 dev board and the Samsung 950 Pro M.2 NVMe SSD. To connect them together I’ve used the FPGA Drive FMC plugged into the HPC connector to give us a 4-lane PCIe Gen3 interface with the SSD. The read/write speeds I got are simply incredible and line up very well with the numbers I wrote about in an earlier post. So here they are:

Write speed:

Sequential: 969 MB/s
Random: 971 MB/s

Read speed:

Sequential: 1922 MB/s
Random: 1922 MB/s

Intelliprop states that they’ve achieved even higher speeds on the Intel DC P3600 SSD and also on their own NVMe Target IP core, so these results are not limited by the core. If you want more information about the core, contact them: Intelliprop. If you want more information about how to set this up with the FPGA Drive FMC, contact me.

↧

Using AXI DMA in Vivado Reloaded

October 10, 2017, 6:44 pm

The DMA is one of the most critical elements of any FPGA or high speed computing design. It allows data to be transferred from source to memory, and memory to consumer, in the most efficient manner and with minimal intervention from the processor. It’s no wonder then that a tutorial I wrote three years ago about using the AXI DMA IP, is still relevant and still getting thousands of visits per month. I decided to remake that tutorial, this time as a video and using Vivado 2017.2 (just today they released Vivado 2017.3, doh!). Although I prefer doing written tutorials, I think that video tutorials can be very useful in their own way, and they’re a hell of a lot easier for me to produce. I hope you find this one useful.

Video transcript:

Hi I’m Jeff. In this video I’m going to show you a simple example of using the AXI DMA in Vivado. This is going to be based on a tutorial that I did in 2014. Now I’m going to refer to this diagram a few times. So to tell you a little bit about DMA, DMA is basically an interface between a data producer or a consumer, and a memory controller, so you’d need a DMA if for example you had data coming in from an ADC and you need to store it very quickly into memory. Or in the other case when you have a DAC and you have data in a memory and you need to send that data as quickly as possible through to your DAC. In both of these cases, you could always use the processor to do this job so the DMA is not the only solution but obviously using the processor to transfer data from one place to another is very time consuming for the processor and it’s a bit of a waste of the processor. The processor should be left to do intelligent things. So the DMA is really a hardware solution for transferring data from one place to another and it is the most efficient way of doing so.

So I’ll get into the example now. In Vivado we start by creating a project. Now I’m going to base this one on the MicroZed 7010 so I’ll call the project mz_7010_dma_test. Next. Now it’s an RTL project and I won’t specify sources at this time because I don’t have sources for this. Here we select the board, so I’m going to select MicroZed 7010 and all that’s going to do is configure this project for the right part depending on the hardware we’re using. So I click finish.

Now I can click Create Block Design, I’ll leave it with the default name. And the first thing I do is add my Zynq processing system to the design. I’m going to click run block automation. Now what the block automation’s going to do in the beginning is apply the board preset on the processing system. So that’s going to configure the Zynq PS for the hardware we’re using. So depending on the board preset that we chose earlier, which means the DDR and whatever other hardware devices we have connected to the Zynq, well the board preset should configure that. So I click OK.

Now in the block diagram I can see that the DDR interface has been externalized, so I know that my Zynq is configured with the DDR to which it is connected on the MicroZed. Now I also have a FIXED_IO port, that’s for all the other devices that are wired to the Zynq on the MicroZed, so if I want to know what they are I could just click on the Zynq PS and have a look at the block diagram here. I can see that I have a UART connected, because I know the MicroZed has a USB UART on there, GPIO probably has some LEDs, it has an SD card, so an SDIO interface for that, a USB port and Ethernet, so that’s one of the Gigabit Ethernet MACs that’s enabled there and connected to the Gigabit Ethernet PHY and RJ45 connector that’s on the MicroZed. So the other thing the board preset is going to do for us is configure a clock for us, so we can see here that we have one of the fabric clocks that has been configured to 100MHz, so I’m going to use that clock for all of this design.

OK so now, because this design has an AXI DMA, there are two things I need. The AXI DMA is going to need access to the DDR memory controller and it’s also going to need a configuration interface which is AXI lite, so the processor is going to need to configure the AXI DMA through an AXI lite port and the AXI DMA is going to need access to the DDR. That’s the important thing to know because I need to configure the Zynq PS for those things. So if I go to my Zynq block design, firstly for the DMA configuration, the Zynq is going to need a general purpose AXI master port so that it can configure the DMA, and when I say configure the DMA, I mean set up DMA transfers and trigger them. That’s what the general purpose AXI master port is for and that’s what I’m going to enable here, so the easy way to do it is I click on this block here and Vivado takes me through to the right setting that I have to enable, so here I can see general purpose master AXI interface and I tick that to enable one, the GP0 interface.
So again, that’s my interface that I’m going to use to configure the DMA from the processor. The other interface that I need is to access the DDR controller, the memory controller, and I can see from this diagram what I need to enable, is one of these high performance AXI slave ports. So that would allow the DMA to read and write from the DDR. So to enable one of these, I have to click on that. Tick one of those, there’s four of them, I only need one. So now I’ve got those two ports. The only other thing that I need to configure here, is the interrupts, so I need to enable fabric interrupts, because I’m going to be receiving interrupts from the DMA IP. So I have to enable this IRQ_F2P which means FPGA to Processor system. I enable that. Click OK. So now my Zynq is properly configured. I can start off by connecting the fabric clock, the 100MHz clock through to these AXI interfaces. My two AXI interfaces, the general purpose master AXI interface, and the high performance slave. So here is my high performance slave and here is my general purpose master. So that’s that, now what I can do is add my DMA. There’s a few DMAs there for different applications, the one that I want though is “AXI Direct Memory Access”. So now what I can see is I’ve got an AXI lite interface, that’s for configuration of the DMA, setting up DMA transfers from the processor, and I’ve got these other interfaces here which … M_AXI or just AXI itself is just an AXI memory mapped interface, whereas AXIS is going to be the AXI streaming interfaces. So these AXI memory mapped interfaces are going to need to go through to the DDR controller, or to the high performance port. Whereas the AXI lite is going to need to go to the general purpose AXI master. 
Then we have the streaming interfaces and in our case, what we want to do is we want to connect the streaming interfaces through an AXI Data FIFO so we can loop back the data, so that way our application is going to just send data from the memory through the DMA, to the FIFO, that’s going to be looped back to the DMA and be written back into the memory, so the processor can just verify that the transfer was successful by comparing the data that was sent and the data that was received. So that only leaves two ports here, which are the AXI streaming status and AXI streaming control ports. We don’t actually need those, those are used in Ethernet applications, so we’re going to disable them. So that gets rid of them, so now I can start connecting my interfaces, so I’m going to run connection automation. I’ll tick on first the AXI lite interface, slave interface. It’s always good when you’re using connection automation to check what Vivado wants to connect things to, but here I can see it wants to connect the AXI lite interface through to the processing system’s general purpose AXI master port – that’s correct, that’s what we want. Now it can’t really make a mistake here because there is only one master AXI interface configured on the Zynq at the moment, the other is a slave interface, the high performance port is a slave interface, so it’s not going to use that. So that’s right, now I can tick on the high performance slave AXI interface of the PS. Vivado wants to connect it to the scatter gather, AXI master interface of the DMA. Now it could’ve chosen any of the other AXI master interfaces, it chose scatter gather, it doesn’t really matter because it’s just going to create either an AXI interconnect, or an AXI smartconnect for this and then the other two ports will be able to go through that as well. So anyway we start things off like that.
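To make the loopback check concrete, here is a minimal sketch in plain C of what the processor does on the software side: fill a transmit buffer with a known pattern, let the data go around the loop, then compare. The hardware path (DMA to FIFO and back) is replaced here by a simple copy, and the names `TEST_WORDS`, `fill_pattern`, `loopback_model`, `first_mismatch` and `loopback_ok` are all made up for illustration, they are not taken from the Xilinx example code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TEST_WORDS 64u  /* hypothetical transfer length, in 32-bit words */

/* Fill the transmit buffer with a known incrementing pattern. */
void fill_pattern(uint32_t *tx, size_t n, uint32_t seed)
{
    assert(tx != NULL);
    for (size_t i = 0; i < n; i++) {
        tx[i] = seed + (uint32_t)i;
    }
}

/* Stand-in for the hardware path (memory -> DMA -> FIFO -> DMA -> memory).
 * On real hardware this would be the two DMA transfers; here it is a copy. */
void loopback_model(const uint32_t *tx, uint32_t *rx, size_t n)
{
    assert(tx != NULL && rx != NULL);
    memcpy(rx, tx, n * sizeof *tx);
}

/* Return the index of the first mismatching word, or -1 if all words match. */
long first_mismatch(const uint32_t *tx, const uint32_t *rx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tx[i] != rx[i]) {
            return (long)i;
        }
    }
    return -1;
}

/* One complete round trip through the model: returns 1 when the received
 * data is identical to the data that was sent, 0 otherwise. */
int loopback_ok(uint32_t seed)
{
    uint32_t tx[TEST_WORDS];
    uint32_t rx[TEST_WORDS];

    fill_pattern(tx, TEST_WORDS, seed);
    memset(rx, 0, sizeof rx);          /* make sure stale data cannot pass */
    loopback_model(tx, rx, TEST_WORDS);

    return first_mismatch(tx, rx, TEST_WORDS) == -1;
}
```

Clearing the receive buffer before the transfer is the one detail worth copying into a real test, because otherwise leftover data from a previous run could make a failed transfer look like a pass.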

OK now if I just run through quickly what that’s done. We just want to make sure that our general purpose AXI master port, that’s going through now to an AXI interconnect, which is called peripheral, so that’s for all of your peripherals. It goes through here. Out here. And it should go through to the AXI lite interface of the DMA, so that’s for configuration of the DMA, and for triggering and setting up DMA transfers. So what about the master interfaces of the AXI DMA. We’ve only connected one of them for now, the scatter gather interface, so that goes through here, through to an AXI smartconnect, and then this should go through to our high performance slave interface, which is basically access to the DDR memory. That’s what that has done. I’m going to run connection automation again to hook up my last two AXI master interfaces of the DMA, that’s going to connect both of them through to the high performance slave port, so click OK. And here they are. And all that’s done is open up two more ports on this AXI smartconnect.

OK so now I’ve hooked up those things, that leaves my AXI streaming interfaces to connect. So if I go back to the diagram, I’m talking about these two interfaces, so I need to add my AXI data FIFO and connect up my AXI streaming interfaces. So I go plus, FIFO, I want an AXI4 Stream data FIFO. OK and I want to connect the AXI streaming master interface of the FIFO through to the AXI streaming slave interface of the DMA. And I want to connect the AXI streaming master interface of the DMA through to the AXI streaming slave interface of the FIFO. So that’s going to be my loopback, so the data’s going to come out of here, memory mapped to streaming, it’s going to go through the FIFO and it’s going to come out of the FIFO and back into the DMA, the streaming to memory mapped interface, and be written again to the DDR memory.
So what about these things here the FIFO needs to be clocked, we’re going to use the same 100MHz clock that everything else uses. So I hook that through to there. And for the reset, I want to use the reset that the rest of my design is using which is generated by the automatically generated processor system reset. The source of which is the fabric clock reset here. So that’s what I’m going to use. So now that’s all connected properly the only thing that I have to connect up now are the two interrupts of the DMA. I need to connect them through to here, the IRQ_F2P. The way to do that is to use a Concat. So the output of my concat has to go to there. And then my two interrupts have to go to there. And my interrupts are now connected. So that’s my design and I can save the block design now. I can click validate design, to make sure that I haven’t made any mistakes or forgotten to connect any clocks or resets. OK so this is an intermittent problem that sometimes happens with MicroZed designs, it’s something that started I think a couple of versions ago, but you can safely ignore these messages, they’re basically coming from the board preset and Vivado’s complaining about them now whereas it didn’t complain about them at all in previous versions. Anyway so we’ll just ignore those. And save the design again.

For more info regarding this issue, checkout this forum post: https://forums.xilinx.com/t5/Design-Entry/Vivado-critical-warning-when-creating-hardware-wrapper/td-p/762938

So now the only thing I have to do is to generate my HDL wrapper. So I click on that and I say let Vivado manage wrapper and auto update. Now the only thing I have to do is generate the bitstream.

OK so my bitstream has been generated. I’m going to tick on view reports because I don’t want to open the implemented design. Now what I have to do is I want to bring this hardware design into the SDK so I can run a software application on it and test out the hardware. So to do that I have to say File-Export-Export Hardware and “include the bitstream”. I’ll export it local to the project. I click OK. So now the hardware’s been exported for SDK, I just have to run SDK, so the easy way to do that is go File-Launch SDK. And I exported it local to the project, my SDK workspace I’ll also leave it local to the project, it doesn’t really change much for me here. So OK.

So the SDK workspace at the moment should have nothing in it except my hardware platform specification. Which is here. It’s got the name of the block design. So I’ve got to add my application to the SDK at this point. So the way I do that is I say File-New-Application project. Now I can call this dma_test like that. If I just look into what’s going on here, here you can choose the processor that the application is going to run on. Now because the Zynq on the MicroZed has a dual-core ARM processor, we can choose which one we want to use, I’ll just choose that one. You obviously have to specify the hardware platform, which is defined here. There’s only one that I’ve got to choose from so that’s why it’s choosing that. Then I click next. What I’m going to do is I’m going to use an empty application for this, that’s going to be an application with no code, I’ve got to supply the code which I’ll do. So I say finish. So in my dma_test application here I’ve got no sources, just the linker script and a readme. So next I’m going to add my software application. What I’m going to use as a software application is an example software application that is provided by Xilinx. It is in this folder here Xilinx SDK version number, data, embeddedsw, Xilinx processor IP lib drivers, AXI DMA, examples (actual folder C:\Xilinx\SDK\2017.2\data\embeddedsw\XilinxProcessorIPLib\drivers\axidma_v9_3\examples). So this is in your Xilinx installation files. So what I want to do is use the example scatter gather poll, to begin with, let’s try that. So if I take this file, maybe I can drag it over here. Copy files. Click OK. So that’s copied that into my software application and now, I’m using the build automatically setting, so it should have built that project automatically. So let’s try and run that application.

So first of all, you’ve got to make sure that you’ve set the jumpers correctly for the configuration of the Zynq. So here I’ve got the jumpers set for configuration by JTAG. Here’s my JTAG programmer that I’m going to use, I’ll just plug it in. And now I can plug in my USB cable. Plug into here. And when I do that, you’ll notice that the LEDs on the MicroZed turn ON, that’s because it’s getting its power from the USB port. So now what I can do is go back into the SDK, click Xilinx Tools, Program FPGA. Now that’s loaded the bitstream into the FPGA on the Zynq. Now all I have to do is run the application, but before I run the application, I’m going to set up a connection to it. So I’ve already set that up earlier, so here is my COM port terminal window, connected through to the MicroZed, so when I run my application, I should see some text coming up on my console window. So to run the test, I’m going to click on dma_test application, click on run configurations, and I want to use the System Debugger for this, so I double click on that. And I can then click Run. And it will run the application on the hardware, I can see here that the application was successful.

Now just one last thing, I’ll create another application, but this time using the example application that has interrupts. I’ll again choose an empty application, and copy the application code into the workspace. OK so now I can see in the application I have no code. So I want to grab this, the scatter gather example with interrupts. I’ll drag it over into my application. OK now it’s going to fail to compile. If you go and look and see why it failed, you’ll find that there are a couple of defines here that the SDK can’t find. So if I hover over that it says to me that this define is undeclared “first use in this function”. So this is a define that should be in xparameters.h in the BSP. So I’m going to open up the BSP and see why that define isn’t there, maybe it’s changed names. So I’ll go into xparameters.h, and see.. let’s search for this name maybe I can find it.. OK so here I can see that the defines that this application is looking for have changed names in this version of the SDK, so all I want to do is take the new names and modify my code with the new names, so this is the MM2S. I’ll change that. And then get the other one, S2MM. And change that one. Then save the file. It builds automatically and I can see that now the SDK knows the interrupt vector IDs. So the application is built, I just have to run it, so again I’m going to click on the application. Click on Run configurations. Double click on system debugger. And then run. OK and when I run that, I go back to my terminal window to see the output. And I see that it says “successfully ran the AXI DMA scatter gather interrupt example” so, that’s my two examples working. At this point I guess I leave you guys to muck around with the example applications, see what you can learn from the code.
So thanks for watching and good luck with your projects.
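For reference, the renamed defines can also be handled with a small preprocessor guard instead of editing the names by hand. The sketch below is only an illustration: the macro names are what I would expect an older and a newer xparameters.h to contain, and the values 61 and 62 are the IDs I would expect for IRQ_F2P bits 0 and 1 on a Zynq-7000, so treat all of them as assumptions and check them against your own generated xparameters.h. `TX_INTR_ID` and `RX_INTR_ID` are used here as the names that the application code refers to.

```c
#include <assert.h>

/* --- Stand-in for the generated xparameters.h -------------------------
 * In a real project you would #include "xparameters.h" instead.
 * Both the names and the values below are assumptions for illustration. */
#define XPAR_FABRIC_AXI_DMA_0_MM2S_INTROUT_INTR 61U
#define XPAR_FABRIC_AXI_DMA_0_S2MM_INTROUT_INTR 62U
/* ---------------------------------------------------------------------- */

/* Transmit (MM2S) interrupt: prefer the older name, fall back to the newer. */
#if defined(XPAR_FABRIC_AXIDMA_0_MM2S_INTROUT_VEC_ID)
#define TX_INTR_ID XPAR_FABRIC_AXIDMA_0_MM2S_INTROUT_VEC_ID
#elif defined(XPAR_FABRIC_AXI_DMA_0_MM2S_INTROUT_INTR)
#define TX_INTR_ID XPAR_FABRIC_AXI_DMA_0_MM2S_INTROUT_INTR
#else
#error "MM2S interrupt ID not found, check xparameters.h"
#endif

/* Receive (S2MM) interrupt: same approach. */
#if defined(XPAR_FABRIC_AXIDMA_0_S2MM_INTROUT_VEC_ID)
#define RX_INTR_ID XPAR_FABRIC_AXIDMA_0_S2MM_INTROUT_VEC_ID
#elif defined(XPAR_FABRIC_AXI_DMA_0_S2MM_INTROUT_INTR)
#define RX_INTR_ID XPAR_FABRIC_AXI_DMA_0_S2MM_INTROUT_INTR
#else
#error "S2MM interrupt ID not found, check xparameters.h"
#endif

/* The two DMA channels must not share an interrupt ID. */
static_assert(TX_INTR_ID != RX_INTR_ID, "TX and RX interrupt IDs must differ");

/* Small accessors so the selected IDs can be inspected at run time. */
unsigned tx_intr_id(void) { return TX_INTR_ID; }
unsigned rx_intr_id(void) { return RX_INTR_ID; }
```

The `#error` branches are the useful part: if the names change again in a later version, the build stops with a message that points at xparameters.h rather than at an undeclared identifier deep in the example code.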

↧

Getting Started with the MYIR Z-turn

October 17, 2017, 6:03 pm

In this video I create a simple Vivado design for the MYIR Z-turn Zynq SoM and we run a hello world application on it, followed by the lwIP echo server. We connect the Z-turn to a network, then we use “ping” and “telnet” to test the echo server from a PC that is connected to the same network.

If you want to try it out yourself, download the SD card boot files here:

The SoM

The Z-turn stands out in the market of Zynq based SoMs because it’s got a few features that the others don’t; of most interest to me being the accelerometer and the HDMI interface. My first impressions of the board were good, it has a clean look, it’s compact and it has most features I’d normally be looking for. But there was one thing I didn’t like: the JTAG header. They’ve chosen the big 100mil pitch header for the old Platform Cable USB. Most Xilinx dev boards nowadays have a smaller JTAG header, so none of my JTAG programmers can actually plug into this. Anyway, if I do anything serious with this board I’ll definitely have to wire up an adapter.

Z-turn JTAG

Board files

Another little issue I found was with the support. I couldn’t find the board preset files anywhere on the MYIR website, nor on the CD that comes with the board. So in the end I found Sergiusz Bazanski’s Github repo and his own hand-coded board files for the Z-turn:

https://github.com/q3k/zturn-stuff

You’ll need to install those board files before going through the example. Thanks Sergiusz!

USB-UART trap

Something I emphasized in the video and I want to re-iterate here; the USB-UART on the Z-turn is connected to the PS UART1 (ONE) peripheral. That’s important to know because the PS UART0 (ZERO) peripheral is also enabled by Sergiusz’ board preset, and it’s this peripheral that the SDK will choose by default for STDIO. This means that when you create a BSP in the SDK, it will select the PS UART0 for your STDIO – not your USB-UART. So you have to manually change it, or you can expect nothing to come up on your UART console window.
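If you want the build to catch this rather than a silent console, a compile-time check is one option. The sketch below assumes the BSP’s xparameters.h defines `STDOUT_BASEADDRESS` and the two PS UART base addresses under the names shown, which is what I would expect for a Zynq-7000 BSP but should be confirmed in your own generated file. To keep the example self-contained, those defines are reproduced as stand-ins, and `stdout_uart_index` is a made-up helper.

```c
#include <assert.h>
#include <stdint.h>

/* --- Stand-ins for values normally generated into xparameters.h ---------
 * In a real project, remove these and #include "xparameters.h" instead. */
#define XPAR_PS7_UART_0_BASEADDR 0xE0000000u
#define XPAR_PS7_UART_1_BASEADDR 0xE0001000u
#define STDOUT_BASEADDRESS       0xE0001000u  /* what the BSP chose for stdout */
/* ------------------------------------------------------------------------ */

/* Fail the build if STDIO is not on PS UART1, the one wired to the USB-UART. */
static_assert(STDOUT_BASEADDRESS == XPAR_PS7_UART_1_BASEADDR,
              "STDIO is not on PS UART1: change stdin/stdout in the BSP settings");

/* Map a base address to the PS UART number, or -1 if it is neither UART. */
int stdout_uart_index(uint32_t base)
{
    if (base == XPAR_PS7_UART_0_BASEADDR) {
        return 0;
    }
    if (base == XPAR_PS7_UART_1_BASEADDR) {
        return 1;
    }
    return -1;
}
```

With the stand-in value changed to the UART0 address, the `static_assert` stops the build, which is the behaviour you would want after regenerating a BSP that has fallen back to the default.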

Ethernet PHY issue

When trying to get the lwIP echo server running, be aware that the Z-turn has an AR8035 Atheros Ethernet PHY. The lwIP driver doesn’t contain code for properly configuring that PHY, instead it’s designed for TI and Marvell PHYs. In this video, I show you how to modify the lwIP driver so that it does properly configure the PHY. Here is the code snippet for that:

	// Enable RGMII TX clock delay in the AR8035 PHY
	// (0x1D selects the debug register, 0x1E carries the value written to it)
	XEmacPs_PhyWrite(xemacpsp,phy_addr, 0x1D, 0x05);
	XEmacPs_PhyWrite(xemacpsp,phy_addr, 0x1E, 0x0100);
	// Enable RGMII RX clock delay in the AR8035 PHY
	XEmacPs_PhyWrite(xemacpsp,phy_addr, 0x1D, 0x0);
	XEmacPs_PhyWrite(xemacpsp,phy_addr, 0x1E, 0x8000);

Here is the name of the file that needs to be modified:

\echo_server_bsp\ps7_cortexa9_0\libsrc\lwip141_v1_9\src\contrib\ports\xilinx\netif\xemacpsif_physpeed.c

As you can see from the code, the main issue is the configuration of the RGMII TX and RX clock delays. The Zynq GEM expects both of those delays to be enabled in the PHY. The lwIP code actually tries to enable those delays, but it’s writing to the wrong registers because it’s expecting a Marvell PHY, not an Atheros PHY. If we don’t use the above code then we get bad timing on the RGMII interface and the echo server won’t work.
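The four writes in the snippet follow a two-step pattern: first write the number of a debug register to PHY register 0x1D, then write the value for that debug register to PHY register 0x1E. Here is a minimal sketch of that pattern wrapped in a helper, run against a small mock PHY so it can be tried without hardware. Everything named here (`mock_phy_t`, `mock_phy_write`, `ar8035_debug_write`, `ar8035_enable_rgmii_delays` and the macro names) is hypothetical; in the real driver the two inner writes would be calls to `XEmacPs_PhyWrite` as shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register numbers and values taken from the snippet above;
 * the macro names themselves are made up for readability. */
#define AR8035_DBG_ADDR_REG 0x1Du   /* selects which debug register to access */
#define AR8035_DBG_DATA_REG 0x1Eu   /* data for the selected debug register   */
#define AR8035_DBG_RX_CTRL  0x00u   /* debug register with the RX delay bit   */
#define AR8035_DBG_TX_CTRL  0x05u   /* debug register with the TX delay bit   */
#define AR8035_RX_DELAY_EN  0x8000u
#define AR8035_TX_DELAY_EN  0x0100u

/* Mock PHY: remembers the selected debug register and the debug contents. */
typedef struct {
    uint16_t dbg_addr;
    uint16_t dbg[64];
} mock_phy_t;

/* Stand-in for the real MDIO write (XEmacPs_PhyWrite in the lwIP port). */
void mock_phy_write(mock_phy_t *phy, uint32_t reg, uint16_t val)
{
    assert(phy != NULL);
    if (reg == AR8035_DBG_ADDR_REG) {
        phy->dbg_addr = (uint16_t)(val & 0x3Fu);
    } else if (reg == AR8035_DBG_DATA_REG) {
        phy->dbg[phy->dbg_addr] = val;
    }
}

/* Indirect write: select the debug register, then write its value. */
void ar8035_debug_write(mock_phy_t *phy, uint16_t dbg_reg, uint16_t val)
{
    mock_phy_write(phy, AR8035_DBG_ADDR_REG, dbg_reg);
    mock_phy_write(phy, AR8035_DBG_DATA_REG, val);
}

/* Same effect as the four writes in the snippet above. */
void ar8035_enable_rgmii_delays(mock_phy_t *phy)
{
    ar8035_debug_write(phy, AR8035_DBG_TX_CTRL, AR8035_TX_DELAY_EN);
    ar8035_debug_write(phy, AR8035_DBG_RX_CTRL, AR8035_RX_DELAY_EN);
}
```

Note that, like the original snippet, this overwrites the whole debug register rather than doing a read-modify-write, so any other bits in those registers end up as written here.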

Great hardware, lacks support

Overall, I like the board but the support you find online is limited. The price is great, so if you particularly need a Zynq SoM with HDMI, then yes I’d recommend this board.

↧

Quick look at the UltraZed-EG SoM

October 24, 2017, 12:19 pm

In this video I take a look at the features of the UltraZed-EG System-on-Module and the Zynq UltraScale+ MPSoC. As is typical for Avnet products, it’s a great deal with a price tag of only $485 USD, when the device alone (XCZU3EG-1SFVA625E) would cost you $354 USD. This SoM can’t be used as an SBC (single board computer), it needs a carrier card such as the UltraZed PCIe Carrier Card; this board will cost you $499 USD and has most of the hardware you need to exploit the Zynq UltraScale+ device to its full potential: Gigabit Ethernet, Display Port, PCIe, USB3 and SATA among others. And of course, all of these peripherals are routed through to the PS (hardened IP) of the ZU+, so you don’t need to use up any programmable logic to take advantage of them – so all of your FPGA can be dedicated to the implementation of your ‘edge’, or whatever it is that makes your product better/faster/leaner than the competition’s.

So at least it looks like some great hardware, but in the next few days I’ll build some designs for it and tell you about my user experience. Looking forward to it!

↧

Creating a custom AXI-Streaming IP in Vivado

November 1, 2017, 7:47 am

The AXI-Streaming interface is important for designs that need to process a stream of data, such as samples coming from an ADC, or images coming from a camera. In this tutorial, we go through the steps to create a custom IP in Vivado with both a slave and master AXI-Streaming interface. The custom IP will be written in Verilog and it will simply buffer the incoming data at the slave interface and make it available at the master interface – in other words, it will be a FIFO. We’ll test the custom IP using a DMA which we’ll use to push streaming data into the IP and pull data out of the IP. We’ll use an SDK application to setup these DMA transfers and compare the sent data with the received data. The hardware we use for testing this will be the MicroZed 7010, so this is a Zynq-7000 design.

The above image is a basic block diagram of our Vivado design, it shows how the DMA connects to the Zynq Processing System, and also how the custom IP connects to the AXI-Streaming interfaces of the DMA. If you are not familiar with the DMA IP, you should checkout this tutorial on using the DMA.

Source code for the custom IP

The Verilog code for our custom IP is based on an asynchronous AXI-Streaming FIFO written by Alex Forencich. You can find the original code on his Github repo, as well as a bunch of other useful modules. I’ve had to slightly modify the code for this project and you’ll be able to copy and paste it from below:

/*

Copyright (c) 2014-2017 Alex Forencich

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

*/

/*

Modified by Jeff Johnson http://www.fpgadeveloper.com

- Renamed ports to match Vivado's naming for AXI-Streaming slave and master
- Removed the async reset input to the module
- Added separate resets for slave and master interfaces
- Removed the tuser signals (not used by Vivado)

*/

// Language: Verilog 2001

`timescale 1ns / 1ps

/*
 * AXI4-Stream asynchronous FIFO
 */
module axis_fifo_v1_0 #
(
    parameter ADDR_WIDTH = 12,
    parameter C_AXIS_TDATA_WIDTH = 32
)
(
    /*
     * AXI slave interface (input to the FIFO)
     */
    input  wire                   s00_axis_aclk,
    input  wire                   s00_axis_aresetn,
    input  wire [C_AXIS_TDATA_WIDTH-1:0]  s00_axis_tdata,
    input  wire [(C_AXIS_TDATA_WIDTH/8)-1 : 0] s00_axis_tstrb,
    input  wire                   s00_axis_tvalid,
    output wire                   s00_axis_tready,
    input  wire                   s00_axis_tlast,
    
    /*
     * AXI master interface (output of the FIFO)
     */
    input  wire                   m00_axis_aclk,
    input  wire                   m00_axis_aresetn,
    output wire [C_AXIS_TDATA_WIDTH-1:0]  m00_axis_tdata,
    output wire [(C_AXIS_TDATA_WIDTH/8)-1 : 0] m00_axis_tstrb,
    output wire                   m00_axis_tvalid,
    input  wire                   m00_axis_tready,
    output wire                   m00_axis_tlast
);

reg [ADDR_WIDTH:0] wr_ptr_reg = {ADDR_WIDTH+1{1'b0}}, wr_ptr_next;
reg [ADDR_WIDTH:0] wr_ptr_gray_reg = {ADDR_WIDTH+1{1'b0}}, wr_ptr_gray_next;
reg [ADDR_WIDTH:0] wr_addr_reg = {ADDR_WIDTH+1{1'b0}};
reg [ADDR_WIDTH:0] rd_ptr_reg = {ADDR_WIDTH+1{1'b0}}, rd_ptr_next;
reg [ADDR_WIDTH:0] rd_ptr_gray_reg = {ADDR_WIDTH+1{1'b0}}, rd_ptr_gray_next;
reg [ADDR_WIDTH:0] rd_addr_reg = {ADDR_WIDTH+1{1'b0}};

reg [ADDR_WIDTH:0] wr_ptr_gray_sync1_reg = {ADDR_WIDTH+1{1'b0}};
reg [ADDR_WIDTH:0] wr_ptr_gray_sync2_reg = {ADDR_WIDTH+1{1'b0}};
reg [ADDR_WIDTH:0] rd_ptr_gray_sync1_reg = {ADDR_WIDTH+1{1'b0}};
reg [ADDR_WIDTH:0] rd_ptr_gray_sync2_reg = {ADDR_WIDTH+1{1'b0}};

reg s00_rst_sync1_reg = 1'b1;
reg s00_rst_sync2_reg = 1'b1;
reg s00_rst_sync3_reg = 1'b1;
reg m00_rst_sync1_reg = 1'b1;
reg m00_rst_sync2_reg = 1'b1;
reg m00_rst_sync3_reg = 1'b1;

reg [C_AXIS_TDATA_WIDTH+2-1:0] mem[(2**ADDR_WIDTH)-1:0];
reg [C_AXIS_TDATA_WIDTH+2-1:0] mem_read_data_reg = {C_AXIS_TDATA_WIDTH+2{1'b0}};
reg mem_read_data_valid_reg = 1'b0, mem_read_data_valid_next;
wire [C_AXIS_TDATA_WIDTH+2-1:0] mem_write_data;

reg [C_AXIS_TDATA_WIDTH+2-1:0] m00_data_reg = {C_AXIS_TDATA_WIDTH+2{1'b0}};

reg m00_axis_tvalid_reg = 1'b0, m00_axis_tvalid_next;

// full when first TWO MSBs do NOT match, but rest matches
// (gray code equivalent of first MSB different but rest same)
wire full = ((wr_ptr_gray_reg[ADDR_WIDTH] != rd_ptr_gray_sync2_reg[ADDR_WIDTH]) &&
             (wr_ptr_gray_reg[ADDR_WIDTH-1] != rd_ptr_gray_sync2_reg[ADDR_WIDTH-1]) &&
             (wr_ptr_gray_reg[ADDR_WIDTH-2:0] == rd_ptr_gray_sync2_reg[ADDR_WIDTH-2:0]));
// empty when pointers match exactly
wire empty = rd_ptr_gray_reg == wr_ptr_gray_sync2_reg;

// control signals
reg write;
reg read;
reg store_output;

assign s00_axis_tready = ~full & ~s00_rst_sync3_reg;

assign m00_axis_tvalid = m00_axis_tvalid_reg;

assign mem_write_data = {s00_axis_tlast, s00_axis_tdata};
assign {m00_axis_tlast, m00_axis_tdata} = m00_data_reg;

// reset synchronization
always @(posedge s00_axis_aclk) begin
    if (!s00_axis_aresetn) begin
        s00_rst_sync1_reg <= 1'b1;
        s00_rst_sync2_reg <= 1'b1;
        s00_rst_sync3_reg <= 1'b1;
    end else begin
        s00_rst_sync1_reg <= 1'b0;
        s00_rst_sync2_reg <= s00_rst_sync1_reg | m00_rst_sync1_reg;
        s00_rst_sync3_reg <= s00_rst_sync2_reg;
    end
end

always @(posedge m00_axis_aclk) begin
    if (!m00_axis_aresetn) begin
        m00_rst_sync1_reg <= 1'b1;
        m00_rst_sync2_reg <= 1'b1;
        m00_rst_sync3_reg <= 1'b1;
    end else begin
        m00_rst_sync1_reg <= 1'b0;
        m00_rst_sync2_reg <= s00_rst_sync1_reg | m00_rst_sync1_reg;
        m00_rst_sync3_reg <= m00_rst_sync2_reg;
    end
end

// Write logic
always @* begin
    write = 1'b0;

    wr_ptr_next = wr_ptr_reg;
    wr_ptr_gray_next = wr_ptr_gray_reg;

    if (s00_axis_tvalid) begin
        // input data valid
        if (~full) begin
            // not full, perform write
            write = 1'b1;
            wr_ptr_next = wr_ptr_reg + 1;
            wr_ptr_gray_next = wr_ptr_next ^ (wr_ptr_next >> 1);
        end
    end
end

always @(posedge s00_axis_aclk) begin
    if (s00_rst_sync3_reg) begin
        wr_ptr_reg <= {ADDR_WIDTH+1{1'b0}};
        wr_ptr_gray_reg <= {ADDR_WIDTH+1{1'b0}};
    end else begin
        wr_ptr_reg <= wr_ptr_next;
        wr_ptr_gray_reg <= wr_ptr_gray_next;
    end

    wr_addr_reg <= wr_ptr_next;

    if (write) begin
        mem[wr_addr_reg[ADDR_WIDTH-1:0]] <= mem_write_data;
    end
end

// pointer synchronization
always @(posedge s00_axis_aclk) begin
    if (s00_rst_sync3_reg) begin
        rd_ptr_gray_sync1_reg <= {ADDR_WIDTH+1{1'b0}};
        rd_ptr_gray_sync2_reg <= {ADDR_WIDTH+1{1'b0}};
    end else begin
        rd_ptr_gray_sync1_reg <= rd_ptr_gray_reg;
        rd_ptr_gray_sync2_reg <= rd_ptr_gray_sync1_reg;
    end
end

always @(posedge m00_axis_aclk) begin
    if (m00_rst_sync3_reg) begin
        wr_ptr_gray_sync1_reg <= {ADDR_WIDTH+1{1'b0}};
        wr_ptr_gray_sync2_reg <= {ADDR_WIDTH+1{1'b0}};
    end else begin
        wr_ptr_gray_sync1_reg <= wr_ptr_gray_reg;
        wr_ptr_gray_sync2_reg <= wr_ptr_gray_sync1_reg;
    end
end

// Read logic
always @* begin
    read = 1'b0;

    rd_ptr_next = rd_ptr_reg;
    rd_ptr_gray_next = rd_ptr_gray_reg;

    mem_read_data_valid_next = mem_read_data_valid_reg;

    if (store_output | ~mem_read_data_valid_reg) begin
        // output data not valid OR currently being transferred
        if (~empty) begin
            // not empty, perform read
            read = 1'b1;
            mem_read_data_valid_next = 1'b1;
            rd_ptr_next = rd_ptr_reg + 1;
            rd_ptr_gray_next = rd_ptr_next ^ (rd_ptr_next >> 1);
        end else begin
            // empty, invalidate
            mem_read_data_valid_next = 1'b0;
        end
    end
end

always @(posedge m00_axis_aclk) begin
    if (m00_rst_sync3_reg) begin
        rd_ptr_reg <= {ADDR_WIDTH+1{1'b0}};
        rd_ptr_gray_reg <= {ADDR_WIDTH+1{1'b0}};
        mem_read_data_valid_reg <= 1'b0;
    end else begin
        rd_ptr_reg <= rd_ptr_next;
        rd_ptr_gray_reg <= rd_ptr_gray_next;
        mem_read_data_valid_reg <= mem_read_data_valid_next;
    end

    rd_addr_reg <= rd_ptr_next;

    if (read) begin
        mem_read_data_reg <= mem[rd_addr_reg[ADDR_WIDTH-1:0]];
    end
end

// Output register
always @* begin
    store_output = 1'b0;

    m00_axis_tvalid_next = m00_axis_tvalid_reg;

    if (m00_axis_tready | ~m00_axis_tvalid) begin
        store_output = 1'b1;
        m00_axis_tvalid_next = mem_read_data_valid_reg;
    end
end

always @(posedge m00_axis_aclk) begin
    if (m00_rst_sync3_reg) begin
        m00_axis_tvalid_reg <= 1'b0;
    end else begin
        m00_axis_tvalid_reg <= m00_axis_tvalid_next;
    end

    if (store_output) begin
        m00_data_reg <= mem_read_data_reg;
    end
end

endmodule

Remember, when you create the custom IP, Vivado will auto-generate a top level wrapper (filename is axis_fifo_v1_0.v) and some code to drive the slave and master AXI-Streaming interfaces. You’ll have to paste the above code over the top module source code (axis_fifo_v1_0.v) of the auto-generated IP. The other two auto-generated source files can be left as they are – they will be removed from the hierarchy as soon as you replace and save the top module code, because they will no longer be instantiated by the top module.
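If the full and empty tests in the Verilog look a bit cryptic, the following C model may help. It is a sketch of the pointer arithmetic only, not of the FIFO itself: pointers are ADDR_WIDTH+1 bits wide, they are converted to Gray code with `b ^ (b >> 1)` just like `wr_ptr_gray_next` and `rd_ptr_gray_next`, and the flags are derived from the Gray values. The model uses a deliberately small `ADDR_WIDTH` of 3 and ignores the two-stage synchronizers, so both pointers are assumed to be up to date. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define ADDR_WIDTH 3u                               /* model depth: 8 entries */
#define PTR_MASK   ((1u << (ADDR_WIDTH + 1u)) - 1u) /* pointers have one extra bit */

/* Binary to Gray code, same expression as the Verilog pointer logic. */
uint32_t bin_to_gray(uint32_t b)
{
    b &= PTR_MASK;
    return b ^ (b >> 1);
}

/* Full: the two MSBs of the Gray pointers differ and all lower bits match. */
int fifo_full(uint32_t wr_ptr, uint32_t rd_ptr)
{
    uint32_t diff = bin_to_gray(wr_ptr) ^ bin_to_gray(rd_ptr);
    return diff == (3u << (ADDR_WIDTH - 1u));
}

/* Empty: the Gray pointers match exactly. */
int fifo_empty(uint32_t wr_ptr, uint32_t rd_ptr)
{
    return bin_to_gray(wr_ptr) == bin_to_gray(rd_ptr);
}

/* Number of stored words according to the plain binary pointers. */
uint32_t fifo_count(uint32_t wr_ptr, uint32_t rd_ptr)
{
    return (wr_ptr - rd_ptr) & PTR_MASK;
}

/* Check every pointer pair: the Gray-code flags must agree with the
 * binary word count (full at 2^ADDR_WIDTH words, empty at 0 words). */
int gray_flags_match_binary(void)
{
    for (uint32_t wr = 0; wr <= PTR_MASK; wr++) {
        for (uint32_t rd = 0; rd <= PTR_MASK; rd++) {
            uint32_t count = fifo_count(wr, rd);
            if (fifo_full(wr, rd) != (count == (1u << ADDR_WIDTH))) {
                return 0;
            }
            if (fifo_empty(wr, rd) != (count == 0u)) {
                return 0;
            }
        }
    }
    return 1;
}
```

The extra pointer bit is what lets the FIFO tell full from empty: in both cases the lower address bits are equal, and only the wrap bit says whether the writer is a whole lap ahead of the reader.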

MicroZed Board Preset issue

When building our Vivado design, just after generating an HDL wrapper for the block design, you will see some critical warnings related to timing of the DDR interface. These critical warnings can be ignored; they are related to some values in the board files. See this forum post for more information:

https://forums.xilinx.com/t5/Design-Entry/Vivado-critical-warning-when-creating-hardware-wrapper/td-p/762938

The test application for SDK

We test the custom IP by making the DMA push data into the AXI-Streaming slave interface and pull data out of the AXI-Streaming master interface of our custom IP. The application we will use for this is one of the example applications for the DMA that can be found in the Xilinx SDK installation files. You will find it on this path:

C:\Xilinx\SDK\2017.3\data\embeddedsw\XilinxProcessorIPLib\drivers\axidma_v9_4\examples

In this tutorial, we use the scatter gather poll example (xaxidma_example_sg_poll.c), but as we hooked up the interrupts in the Vivado design, we could have also used the interrupt based one (xaxidma_example_sg_intr.c).

What to try

Once you’ve gotten this working, I suggest you try modifying the test application in the SDK to print out what is actually being sent and received. You could then modify your Verilog code to do some kind of manipulation of the incoming data, rebuild everything and verify with your test application that the data coming out is what you expected. Another useful thing to do when building custom IP blocks like this is to write a test bench and simulate the custom IP; this will be the topic of a future tutorial.
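As a starting point for that experiment, here is a minimal sketch of a helper that prints each sent and received word and checks the received word against a reference function describing what the IP is supposed to do. It is written in plain C with `printf` so it runs anywhere; in the SDK you would probably swap in `xil_printf`. The names `word_fn`, `identity`, `invert_bits` and `dump_and_check` are made up for this example, and bit inversion is just an arbitrary stand-in for whatever manipulation you add to the Verilog.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Reference model of the custom IP: maps a sent word to the expected word. */
typedef uint32_t (*word_fn)(uint32_t);

/* Unmodified IP: a FIFO returns exactly what it was given. */
uint32_t identity(uint32_t w) { return w; }

/* Hypothetical manipulation: the IP inverts every bit of the data. */
uint32_t invert_bits(uint32_t w) { return ~w; }

/* Print every sent/received pair and return the number of mismatches. */
unsigned dump_and_check(const uint32_t *tx, const uint32_t *rx, size_t n,
                        word_fn expected)
{
    unsigned mismatches = 0;

    assert(tx != NULL && rx != NULL && expected != NULL);

    for (size_t i = 0; i < n; i++) {
        int ok = (rx[i] == expected(tx[i]));
        printf("[%3u] sent 0x%08lX  received 0x%08lX%s\n",
               (unsigned)i, (unsigned long)tx[i], (unsigned long)rx[i],
               ok ? "" : "  <-- unexpected");
        if (!ok) {
            mismatches++;
        }
    }
    return mismatches;
}
```

With the FIFO as published, you would call `dump_and_check(tx, rx, n, identity)` and expect zero mismatches; after changing the Verilog, only the reference function needs to change.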

↧

Artix-7 Arty Base Project

November 7, 2017, 6:15 pm

Here’s a base project for the Arty board based on the Artix-7 FPGA. The Arty is a nice little dev board because it’s low cost ($99 USD) but it’s still got enough power and connectivity to make it very useful. I really like the fact that the JTAG and UART are both accessed through the same USB connector, so I only need to connect one USB cable. I also like the fact that I can power it from the USB connector alone – provided I don’t connect too many power hungry PMods or an Arduino shield.

In this project, we leverage the Arty’s board files and Vivado’s automation features to quickly put together a base design to exploit most of the hardware on the board. Then in the second video, we shift to the Xilinx SDK and test our design on hardware by running a “hello world” application and then the lwIP echo server application. In future Microblaze tutorials we’ll build on this design.

Board files

Before you can run through this tutorial, you’ll need to install the Arty’s board files to your Vivado installation. You can download the board files here, and follow Digilent’s instructions for installing them.

Clocking

The Arty has an on-board oscillator to generate a 100MHz clock. We need to feed this clock into a Clock Wizard to generate three clocks: two for the MIG (DDR) and one for the Ethernet PHY.

166.667MHz: For the MIG’s sys_clk_i input
200MHz: For the MIG’s clk_ref_i input
25MHz: For the Ethernet PHY reference clock

The rest of our design will run off the MIG’s ui_clk output (83.333MHz).
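
As a rough illustration of how the three Clock Wizard outputs can all come from the single 100MHz input, here is a minimal Python sketch that searches for one VCO frequency that divides evenly into every target. The VCO limits (600 to 1200MHz), the multiplier and divider ranges, and the function name are assumptions made for this example, so check them against the datasheet for your device. The values that the Clock Wizard actually picks may differ.

```python
from fractions import Fraction

# Assumed MMCM VCO limits in MHz (illustrative, check the device datasheet)
VCO_MIN_MHZ = 600
VCO_MAX_MHZ = 1200

def find_mmcm_settings(f_in_mhz, targets_mhz, max_mult=64, max_div=128):
    """Return (multiplier, vco_mhz, [output dividers]) for the first integer
    multiplier whose VCO divides evenly into every target, or None."""
    for mult in range(2, max_mult + 1):
        vco = Fraction(f_in_mhz) * mult
        if not (VCO_MIN_MHZ <= vco <= VCO_MAX_MHZ):
            continue
        dividers = []
        for target in targets_mhz:
            div = vco / target
            # The output divider must be a whole number within range
            if div.denominator != 1 or not (1 <= div <= max_div):
                break
            dividers.append(int(div))
        else:
            return mult, vco, dividers
    return None

# 166.667MHz is really 500/3 MHz, so we use exact fractions
targets = [Fraction(500, 3), Fraction(200), Fraction(25)]
result = find_mmcm_settings(100, targets)
print(result)
```

Under these assumptions the search settles on a multiplier of 10 (a 1000MHz VCO) with output dividers of 6, 5 and 40, which gives 166.667MHz, 200MHz and 25MHz respectively.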

Ethernet reference clock

On the Arty schematics, you’ll see that the Ethernet PHY has provisions for a 25MHz crystal to generate its own 25MHz reference clock. However, the crystal is not loaded on the board, probably to help get that price down to $99! For this reason, the FPGA needs to generate and feed a clock to the Ethernet PHY, and this is why we generate the 25MHz clock from the Clock Wizard. The FPGA pin that connects to the Ethernet reference clock input on the PHY is G18, and we have to provide a LOC constraint for this in our design. Here are the constraints to add to the design for this purpose:

# Arty Ethernet reference clock
set_property IOSTANDARD LVCMOS33 [get_ports eth_ref_clk]
set_property PACKAGE_PIN G18 [get_ports eth_ref_clk]

AXI Timer

I’ve included the AXI Timer IP in the base design, because it’s needed by the lwIP echo server application AND PetaLinux. We’ll build PetaLinux for the Arty in a future tutorial.

UART settings

To read the Arty’s console output, you’ll have to use a UART console such as Putty and connect to the COM port that is assigned to your Arty when you plug it in to the PC. To find the right COM port, just go into the Device Manager after connecting the Arty to your PC via USB. Once you’ve got that, just remember to use a baud rate of 9600 and you’ll be in business.

↧

PetaLinux for Artix-7 Arty Base Project

November 15, 2017, 7:58 am

In the final part of the Arty base project tutorial, we build a PetaLinux project that’s tailored to our Arty base design. Then we boot PetaLinux on our hardware and verify that we have network connectivity by checking the Arty’s DHCP assigned IP address and then pinging it from a PC.

Tools used

I used the following setup to do this project:

Vivado 2017.3 on a Windows 10 machine
PetaLinux 2017.3 on an Ubuntu 16.04 LTS machine

Vivado project modifications

Before we get started with PetaLinux, we have to make sure that our Vivado design satisfies the minimum requirements for running PetaLinux:

Microblaze must use configuration “Linux with MMU” or “Low-end Linux with MMU”
At least 32MB of external memory
Dual channel timer with interrupt connected
UART IP with interrupt connected
Ethernet IP with interrupt connected

Our original base design satisfies all but one of those requirements – the first one. So the first thing we have to do in this tutorial is to select the “Linux with MMU” configuration for the Microblaze. The next thing we do is to enable the GPIO interrupts and connect them through to the Microblaze – this isn’t a requirement, but it’s useful. We then have to save our block design, re-generate a bitstream for the project and export it.

PetaLinux tool commands

To build the PetaLinux project, we transfer our entire Vivado project to a Linux machine with the PetaLinux tools installed. These are the PetaLinux tool commands that we use in the tutorial, in the order that we use them:

  # Launch PetaLinux tools (note that you'll have to specify your own PetaLinux install path)
  source ./PetaLinux-2017-3/settings.sh
  # Cd to the working directory (where the arty_base directory has been copied to)
  cd /media/opsero/arty
  # Create the PetaLinux project, using the "microblaze" template
  petalinux-create --type project --template microblaze --name arty_petalinux
  # Cd to the PetaLinux project
  cd arty_petalinux
  # Import the hardware description into our PetaLinux project
  petalinux-config --get-hw-description ../arty_base/arty_base.sdk --oldconfig
  # Optional: Configure the kernel
  petalinux-config -c kernel
  # Optional: Configure the root filesystem
  petalinux-config -c rootfs
  # Build the PetaLinux project
  petalinux-build

Kernel configuration

The Linux driver for the AXI Ethernetlite IP requires certain options to be enabled in the PetaLinux kernel. Fortunately, the PetaLinux tools are pretty good at enabling the drivers for the IP that they find in your exported Vivado design. So the required drivers are already enabled and we don’t have to run the kernel configuration (petalinux-config -c kernel), but for completeness, here is a list of the required kernel configurations:

CONFIG_ETHERNET
CONFIG_NET_VENDOR_XILINX
CONFIG_XILINX_EMACLITE
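
If you’d like to double check that these options really are enabled, a few lines of Python can scan the text of a kernel .config file for them. This is only a sketch: the function name and the sample text are made up for the example, and you would need to read in the .config from your own build (its location is not covered here).

```python
REQUIRED = (
    "CONFIG_ETHERNET",
    "CONFIG_NET_VENDOR_XILINX",
    "CONFIG_XILINX_EMACLITE",
)

def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to y or m."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set"
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]

# Hypothetical sample with one option deliberately left disabled
sample = """\
CONFIG_ETHERNET=y
CONFIG_NET_VENDOR_XILINX=y
# CONFIG_XILINX_EMACLITE is not set
"""
print(missing_options(sample))  # ['CONFIG_XILINX_EMACLITE']
```

An empty list means that all three options are enabled.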

Device tree modification

We have to make an addition to the device tree in order to specify the Ethernet PHY’s address with respect to the MDIO bus. This address depends on how the PHY is physically wired. For any particular board, it is usually mentioned in the user guide, and if not, we can usually figure it out from the schematics. In the case of the Arty, the PHY address is 1 (one) and we need to specify this in the device tree so that the Ethernet driver can communicate with the PHY. Below is the device tree code that we need to add to the system-user.dtsi file.

arty_petalinux/project-spec/meta-user/recipes-bsp/device-tree/files/system-user.dtsi

&axi_ethernetlite_0 {
  local-mac-address = [00 0a 35 00 01 22];
  phy-handle = <&phy0>;
  xlnx,has-mdio = <0x1>;
  mdio {
    #address-cells = <1>;
    #size-cells = <0>;
    phy0: phy@1 {
      device_type = "ethernet-phy";
      reg = <1>;
    };
  };
};

Launching it on the Arty

Once the PetaLinux project is built, we launch the Putty UART console, program the FPGA with the bitstream and then load the kernel. Here are the commands we used:

  # Launch Putty, the UART console
  sudo putty &
  # Program the FPGA with the bitstream
  petalinux-boot --jtag --fpga
  # Load the kernel into memory and run it
  petalinux-boot --jtag --kernel

How to package the PetaLinux project

It’s useful to be able to program the flash with our bitstream and Linux kernel so that it boots up automatically when we power up the board. To be able to do this, we need to package the PetaLinux project and generate a .mcs file. We don’t go through this in the video, but if you’re interested, here’s how to do it:

In the Linux command terminal, type:

petalinux-config

In the menu, enable the following option:

Subsystem AUTO Hardware Settings->Advanced bootable images storage Settings

Set the flash partition sizes as follows:

Subsystem AUTO Hardware Settings->Flash Settings

fpga partition size    0x300000
boot partition size    0x100000
bootenv partition size 0x100000
kernel partition size  0xA40000

Build the PetaLinux project:

petalinux-build

Package the PetaLinux project:

petalinux-package --boot --force --fpga ../arty_base/arty_base.runs/impl_1/design_1_wrapper.bit --u-boot --kernel --flash-size 16 --flash-intf SPIx1

You’ll find the boot.mcs file under arty_petalinux/images/linux.
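
As a sanity check on the flash partition sizes set above, the short Python sketch below adds them up and compares the total against the 16MB flash size passed to petalinux-package. It assumes that the partitions are laid out back to back from offset 0 in the order listed, which is an assumption made for illustration and not something taken from the PetaLinux documentation.

```python
FLASH_SIZE = 16 * 1024 * 1024  # 16MB, matching --flash-size 16

# Partition sizes as entered in petalinux-config
partitions = [
    ("fpga", 0x300000),
    ("boot", 0x100000),
    ("bootenv", 0x100000),
    ("kernel", 0xA40000),
]

def layout(parts, start=0):
    """Return ([(name, offset, size), ...], end_offset), assuming the
    partitions are placed contiguously in the order given."""
    offset = start
    table = []
    for name, size in parts:
        table.append((name, offset, size))
        offset += size
    return table, offset

table, end = layout(partitions)
for name, offset, size in table:
    print(f"{name:8s} offset 0x{offset:06X}  size 0x{size:06X}")

free = FLASH_SIZE - end
print(f"used 0x{end:06X} of 0x{FLASH_SIZE:06X}, free 0x{free:06X}")
```

With these sizes the four partitions add up to 0xF40000 bytes, which leaves 0xC0000 bytes (768KB) of the 16MB flash unused.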

Try it yourself

If you want to run this project on your Arty board, just download the boot files that I’ve provided here: Arty PetaLinux boot files

JTAG instructions

In the compressed file, you’ll find a bitstream and .elf file (the PetaLinux kernel) that can be downloaded to your Arty via JTAG using the XMD tool. Launch XMD and type these commands:

  
  fpga -f design_1_wrapper.bit
  connect mb mdm
  dow image.elf
  run

Flash instructions

Also in the compressed file, you’ll find a .mcs file that you can program into the Arty’s flash memory so that PetaLinux boots up every time you power up the board. To program the Arty’s flash memory:

launch the Hardware Manager in Vivado
make a connection with the FPGA
add configuration memory device “n25q128-3.3v-spi-x1_x2_x4”
program the configuration memory device with the .mcs file

Digilent has a good tutorial on this here: Programming the Arty using Quad SPI Flash

Make sure to open a UART terminal for a baud rate of 9600, so that you don’t miss the boot log. Also, remember to connect the Arty to your network router so that the IP address gets automatically assigned during the boot sequence.

↧

IntelliProp Demos NVMe Host Accelerator on FPGA Drive

February 26, 2018, 7:14 am

Early this year IntelliProp released a demo video of their NVMe Host Accelerator IP core running on the Intel Arria 10 GX FPGA Development board. As you can see in the video, they are using Opsero’s FPGA Drive product with the PCIe slot connector to interface the NVMe SSD to the FPGA board. They measured an impressive performance of around 2300MBps sequential write speed and 3200MBps sequential read speed. The FPGA Drive adapter was designed to fully handle Gen3 speeds precisely because these high throughputs are only possible with a Gen3 interface (note that M.2 SSDs have a 4-lane PCIe interface).
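
To see why those numbers need Gen3, here is a back-of-the-envelope calculation in Python. It uses the standard PCIe line rates and encodings (5 GT/s with 8b/10b for Gen2, 8 GT/s with 128b/130b for Gen3) and ignores all packet and protocol overhead, so the real usable throughput will be somewhat lower than these figures. The function name is just for this example.

```python
def pcie_bandwidth_mb_per_s(gt_per_s, payload_bits, coded_bits, lanes):
    """Link bandwidth in MB/s after line encoding only
    (packet and protocol overhead are ignored)."""
    bits_per_s = gt_per_s * 1e9 * payload_bits / coded_bits
    return bits_per_s / 8 / 1e6 * lanes

gen2_x4 = pcie_bandwidth_mb_per_s(5.0, 8, 10, lanes=4)     # 8b/10b encoding
gen3_x4 = pcie_bandwidth_mb_per_s(8.0, 128, 130, lanes=4)  # 128b/130b encoding

print(f"Gen2 x4: {gen2_x4:.0f} MB/s")
print(f"Gen3 x4: {gen3_x4:.0f} MB/s")

# Measured speeds quoted in the post
for name, speed in (("write", 2300), ("read", 3200)):
    print(f"{name}: fits Gen2 x4? {speed <= gen2_x4}  fits Gen3 x4? {speed <= gen3_x4}")
```

A 4-lane Gen2 link tops out at 2000 MB/s before any protocol overhead, which is already below both measured speeds, while a 4-lane Gen3 link offers roughly 3938 MB/s.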

↧

Python for the Zynq and the PYNQ-Z1

February 27, 2018, 8:34 am

Being a big fan of Python, for ages I’ve wanted to explore the possibilities of running Python on the Zynq. Thankfully Xilinx and Digilent saw the value in this too and they developed the PYNQ-Z1 and, more importantly, the PYNQ libraries for Python. The PYNQ-Z1 is basically a single board computer based on the Zynq-7020 device from Xilinx, so it’s got a dual core ARM plus integrated FPGA or programmable logic. The board runs Ubuntu Linux, it’s got Python installed and it has a file system on the micro SD card. The board’s got Gigabit Ethernet so you can connect it to your network and to the Internet. That’s useful for adding packages to Linux and the like, but we also use the network interface to develop and run Python applications on the board.

You see, the board runs the Jupyter web application. Jupyter allows us to program and run Python scripts through a web interface using a web browser (see screenshot below). This is pretty handy when you’re developing code for a single board computer because you typically don’t have a screen. With Jupyter, you’ve got an interactive web interface, so it’s got features like code completion, you can step through code blocks, display images and heaps of other things.

I can run Python on the Raspberry Pi! What’s so special about this?

Yes you can, but the Zynq has FPGA programmable logic that you can use to accelerate your programs and the PYNQ Python libraries allow you to interface with the FPGA from within your Python code. So the PYNQ-Z1 is able to do things that the Raspberry Pi can’t do because it’s able to offload compute intensive tasks to the FPGA. Think of trying to run image filters or a neural network on a Raspberry Pi: it would run so slowly that it wouldn’t be very practical.

So let’s look at an example design flow. Let’s say you want the PYNQ-Z1 to read video frames from the HDMI input, run them through a neural network that’s trained to detect people, highlight any detected people and send the new frames to the HDMI output. First you’d write this code in Python without using any acceleration. For the reading and writing of HDMI video frames, the PYNQ-Z1 comes with code examples that you can just copy. For the neural network, you’d use one of the many existing Python libraries for machine learning. Once your Python code is working and the hardware is doing what you want, then you’d look at how you can use the FPGA to make things run faster. In this example, obviously most of the processor’s time would be spent running the neural network, so that’s what we’d try to accelerate first. To do this, we’d develop IP to implement the neural network on the FPGA fabric. In PYNQ terminology, this is called a PYNQ overlay – just another way of describing the FPGA configuration or the bitstream. Once our PYNQ overlay has been designed, we can upload it to our PYNQ board and test it out. We’d first have to write functions that interface with our IP, then we’d want to write functions to replace the machine learning library functions that we used before. If everything is done right, the accelerated functions should run a lot faster than the non-accelerated functions, and we should achieve a much higher frame rate, and hopefully one that is practical.
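
Before deciding what to move into the FPGA, it helps to measure where the time actually goes. Here’s a minimal sketch of that step in plain Python, timing each stage of the pipeline described above. The three stage functions are made-up stand-ins (they don’t touch HDMI or run a real neural network, and they don’t use the PYNQ libraries), so treat this as an outline of the approach and not as working PYNQ code.

```python
import time
from collections import defaultdict

def read_frame(i):
    # Stand-in for grabbing a frame from the HDMI input (hypothetical)
    return [i] * 64

def detect_people(frame):
    # Stand-in for neural network inference: deliberately heavy
    total = 0
    for _ in range(2000):
        for px in frame:
            total += px * px
    return total

def write_frame(frame, detections):
    # Stand-in for sending the annotated frame to the HDMI output
    return len(frame)

def profile_pipeline(num_frames=5):
    """Accumulate the time spent in each stage over a few frames."""
    totals = defaultdict(float)
    for i in range(num_frames):
        t0 = time.perf_counter()
        frame = read_frame(i)
        t1 = time.perf_counter()
        detections = detect_people(frame)
        t2 = time.perf_counter()
        write_frame(frame, detections)
        t3 = time.perf_counter()
        totals["read_frame"] += t1 - t0
        totals["detect_people"] += t2 - t1
        totals["write_frame"] += t3 - t2
    return dict(totals)

totals = profile_pipeline()
slowest = max(totals, key=totals.get)
print("Accelerate first:", slowest)
```

The stage with the largest total is the first candidate for an overlay. Once an accelerated version exists, the same timing loop can be rerun to confirm that the frame rate actually improved.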

So in my opinion, two things make this board a great dev platform: one, you can prototype quickly in Python and leverage all of the existing Python packages, and two, you can accelerate algorithms by offloading them to the FPGA, creating compute intensive designs that are not possible on other embedded devices.

↧