my tech blog » FPGA

Xilinx’ MiG memory controller’s init process reverse engineered

eli — Mon, 07 Dec 2009 14:32:25 +0000

Introduction

I’m using Xilinx’ MiG 1.7.3 for running DDR2 memories on a Virtex-4 FPGA. It didn’t take me long to realize that the controller never finishes initialization. The problem is that I had no idea of why, and as far as I know, no documentation to refer to in my attempts to understand where the controller got stuck, which is an essential stage in getting it unstuck.

Since Xilinx are wise enough to release the IP core with its source, I was able to reverse engineer the initialization process to the level necessary for my own purpose. This is a memo of the details, just in case I’ll need to do this again some time. I sure hope that won’t be necessary…

In my case, the problem seems to have been overheating of the FPGA. I’m not 100% sure about this, but with 90 degrees centigrade measured on the case, and everything starting to work OK when a descent heatsink (with fan) was put in place, it looks pretty much like good old heat.

Overview

The initialization process consists of several stages. During the entire process, the controller is governed by the init_state one-hot state machine in the ddr2_controller module. The end of this process is marked by init_done_int going high, which goes out as init_done, hence marking the end to the IP core’s user.

The initialization consists of roughly three stages:

Setting up the memory device
Setting up the IDELAYs taps so that the DQ inputs are samples with good timing.
Learning the correct latency for reading data from DQs during read cycles.

Throughout the init process, active and precharge operations take place as required by standard. These operations are not mentioned here, since they don’t add anything to understanding the principle.

Setting up the memory device

This is the normal JEDEC procedure, which includes a preknown sequence of peculiar operations, as defined in the DDR2 standard. This includes writing to the memory’s mode registers. During this phase, the controller will not care if it’s talking to a memory or not, since it never reads anything back from the memory.

Setting up the IDELAYs taps

The importance of this stage is to make sure that data is sampled from DQ lines at the best possible timing. Each DQ input is calibrated separately.

This stage begins with a single write command to column zero. The write data FIFO has already been written some data to it, so that the rising edge contains all ones, and the falling edge is all zeros. For example, for a memory with 16 DQ lines, the FIFO has been fed with 0xFFFF0000 twice for memories with burst length of 4, and four times if the burst length is 8.

This can be seen in the backend_fifos module. In that module, one can see that data is written to the write data FIFO immediately after reset. Also, there is another set of words written to the FIFO, which are intended for the next stage.

All in all, this single write commands drains the FIFO with the words containing all ones or all zeros, so that column zero contains this data. Next the controller reads column zero continuously while adjusting the delay taps to achieve proper input timing for the DQs.

The logic for moving the taps is outside the ddr2_controller module. The latter merely helps by performing reads. When the tap logic finishes, it signals it’s done by raising the signal known as phy_Dly_Slct_Done in the ddr2_controller module, and carries many other names such as SEL_DONE. In the tap_logic module (from which it origins) it’s called tap_sel_done.

The tap calibrator increments the tap delay until the data on that line shifts, or until 55 increments has taken place. Whenever this happens, it’s considered to be the data edge. The tap delay is then decremented by the number of times defined by the tby4tapvalue parameter (17 in my case).

Note that even if no edge is found at all, the tap delay calibrator will consider the calibration of that tap OK.

Here is a short list of lines I found useful to look at with a scope (using the FPGA Editor):

ddr2_ctrl_tandem/data_path_00/tap_logic_00/data_tap_inc_0/calib_done_int
ddr2_ctrl_tandem/data_path_00/tap_logic_00/tap_sel_done
ddr2_ctrl_tandem/data_path_00/tap_logic_00/data_tap_inc_done
ddr2_ctrl_tandem/data_path_00/tap_logic_00/dlyce_dqs[0]
ddr2_ctrl_tandem/data_path_00/tap_logic_00/dlyinc_dqs[0]

CHAN_DONE is the most interesting signal, because it goes high briefly every time a data line has finished its tap calibration. Unfortunately, the synthesizer messes up the identification of this signal, so the only way to tell it, is by finding what causes ddr2_ctrl_tandem/data_path_00/tap_logic_00/data_tap_inc_0/chan_sel_int to change state. In my case it was

ddr2_ctrl_tandem/data_path_00/tap_logic_00/data_tap_inc_0/chan_sel_int_not0002

This signal should go high 8 times (or the number of data lines per DQ you have). If it does a fewer numbers and then nothing happens, you can tell which of this data lines is problematic simply by counting these strobes.

Latency for reading data

The purpose of this stage is to tell when, in terms of semiclocks, to sample the data read from the memory. I’m not 100% clear on why this stage is necessary at all, but that won’t change the fact that it exists.

This stage starts with a single write operation again. This time the written data is slightly more sophisticated (keep in mind that it was loaded to the write data FIFO immediately after wakeup from reset). The first column will have the data 0xA written to it, duplicated to occupy all DQs. For example, on a memory with 16 DQs, the first column will be 0xAAAA. The second column is 0x5 duplicated, the third 0x9, and the fourth 0x6, all duplicated. If the burst length is 8, this four word sequence is repeated.

After writing this, the controller reads column zero continously, until COMP_DONE goes high. This signal origins from the pattern_compare8 module, which tells the controller it has recovered the correct input data alignment. More precisely, the rd_data send the ddr2_controller a logical AND of all pattern_compare8′s comp_done signals.

These pattern_compare8 modules simply looks for an 0xAA pattern followed by a 0x99 pattern in the input during rising edges only, or an 0x55 followed by 0x66 on the rising edge. So it will catch the reads of the first and third column, or the second or forth, but either way this solves the alignment ambiguity completely.

As the pattern_compare8 module tries to match the data, it increments (among others) its clk_cnt_rise register (not to be confused with the clk_count_rise wire, which contains the final result). Monitoring clk_cnt_rise[0] (using FPGA Editor, for example) can give a positive feedback that the initialization is at this phase. It should give a nice square wave at half the DDR2 controller’s clk0 frequency, and then stop when this phase is done.

Summary.

The initialization process is not the simplest in the world, and it’s likely to fail if you got anything wrong with your memory, in particular if you have as little as one data wire line miswired. This is not really good news, but understanding the process may help at least understand what went wrong, and hopefully fixing it too.

Using Perl to map FPGA pins from a board design to UCF pin constraints

eli — Thu, 26 Mar 2009 23:42:03 +0000

One of the things I try to avoid as an FPGA engineer, is to manually configure the pin constraints (in the UCF file) in order to tell the tools which FPGA pin is connected to what. Not only is this extremely boring, but I also think that getting it done right (at the first go) is more or less a miracle.

If you insist on working with the Orcad schematics, you’re doomed. Yes, this graphical representation of the board is useful for getting an idea of what’s going on. But when I want to know for sure what is connected to what on the PCB, I read the netlist file. It’s that text file, which the board designer sends to the PCB manufacturing plant. So even if schematics is convincing, what counts is what the netlist file says. Sometimes reading the netlist reveals connections which were not obvious at all from looking at the schematics. But I’m diverting from the point, which is how to generate the UCF file sort-of automatically. Or at least spare most of the work.

So let’s have a look on what a netlist file looks like. This is a snippet from the middle, where the precious information is:

SIN_D8  = U43/V5 U48/42 ;
CLK_SEL  = R393/1 U52/36 ;
N14463023  = U29/7 R154/1 R326/2 ;
N16132557  = U64/1 C589/1 L90/1 L91/1
            C594/1 U64/2 ;
SOUT_A_A9  = U38/34 U43/AB21 ;
SENSOR_B_D3  = U43/N8 U49/11 ;
SIN_D9  = U43/U4 U48/44 ;

So the structure is very simple. A statement begins with the net’s name, an equation sign and then a listing of the connected pins. As you can see, each statement is terminated by a semicolon, and may consist of several lines.

The nets’ names are given by the board designer manually. There is no assurance that this name has anything to do with what the net is connected to, so if you’re really pedantic, you should verify that as well. Checking the schematics is fairly efficient, or you could verify the pin connections in the netlist, which may be fairly easy with some scripting skills.

Also, if no name was given to the net, but it’s a result of just connecting two pins, Orcad will make up a name, which usually looks something like N16132557. A common, and annoying case is when there is a resistor between two chips’ pins (say, for debouncing). Because of the resistor in the middle, two nets make the connection. If the board designer gives the name to the net between the FPGA and the resistor, we get the name for free. If not, we need to be smarter.

And that brings us to the pin listing in the netlist. If we take the SOUT_A_A9 net for example, we can see that it’s connecting between pin 34 of device marked U38, and pin AB21 of U43. On this specific board, U43 happens to be the FPGA.

This leaves us with two possible strategies. One is to trust the net’s name, and make an entry in the UCF file, which binds pin AB21 to a Verilog/VHDL toplevel I/O port with a similar name, say “sout_a_a[9]“. I’ll show an example script for this below.

The second strategy would try to find out what pin 34 of U38 stands for, and give the port’s name accordingly. This is trickier, of course, but given a reliable pin mapping of the other chip, this neutralizes any mistake possibly made by the board designer. This is where I’d like to mention, that certain board design tools create a “chip file”, which is a text file as well. This text file contains meaningful names for each chip included in the design (these are, in fact, the names that appear on the schematics). This file’s name is typically pstchip.dat. Unfortunately, the information in this file is commonly fed manually from a datasheet at some stage of the board design, so if an error was made during this stage, both the board and the FPGA pinout will get it wrong.

But let’s leave the pessimism for a while, and assume that we can rely on the nets’ names. Here’s a script, which finds the FPGA’s connections, an attempts to create a UCF file. Please keep in mind that it’s just an example, and that it doesn’t cover all nets even in the design I wrote it for. If you want to use this technique in your own designs, you’ll have to adapt it to the quirks of the netlist you’re facing.

Anyhow, here it is:

#!/usr/bin/perl
use warnings;
use strict;

undef $/; # Slurp mode. (Sane people use "local" instead)
my $file = <>;

my @chunks = ($file =~ /^([^ %].+?);/gsm);
my @out;

foreach my $chunk (@chunks) {
  my ($var, $rest) = ($chunk =~  /^([^ ]+)[ ]*=[ ]*(.*)/s);
  die("Failed to read line: $chunk\n")
    unless (defined $var);
  next if (grep { $_ eq lc $var }
	   qw[vcc_int vcc_aux v3_3 gnd]);
  my @pins = ($rest =~ /U43\/([^ \n\r\t]+)/gi);
  push @out, "NET \"".(lc $var)."\" LOC = \"$_\";\n" foreach (@pins);
}
print sort @out;

Just a few clarifications: U43 is the FPGA in my netlist, right? So that’s why I filter out anything else. And now a few Perl clarifications.

I begin with undeffing $/. This makes the single ‘$file=<>’ statement read the entire file at once (that’s why they call it slurp mode). The Perl manpage encourages to use “local” instead of “undef”, and it explains why, but it’s not relevant for a short script.
The first regular expression (feeding @chunks) cuts the file into pieces between semicolons.
The second regular expression splits each chunk into the string before ‘=’ (in $var) and everything that comes afterwards (in $rest, possibly longer than a single line).
The grep sentence checks if we’re not on a power net, which should be skipped.
The last regular expression looks for a ‘U43/’-something, and if that is found, the pins are stored in the @pins list (in a sane case, this will be only one pin, or we’re messed up)

As I said before, I don’t really expect this script to work out of the box for you, but I hope I made the point about using a Perl script on a netlist to make life easier. And in case you’re an FPGA Engineer, and don’t know Perl, I hope this gave an idea of why you should start learning…

Xilinx’ XST synthesizer bug: ROM generation using case

eli — Sat, 14 Mar 2009 16:46:49 +0000

Take a close look on the Verilog code below. This is a plainly-written synchronous ROM. Do you see anything wrong with it? (Spoiler: There is nothing wrong with it. Not that I know of)

module coeffs
  (
   clk, en,
   addr, data
   );

   input clk, en;
   input [9:0] addr;
   output [15:0] data;

   reg [15:0]      data;

   always @(posedge clk)
     if (en)
       case (addr)
     0: data <= 16'h101a;
     1: data <= 16'h115b;
     2: data <= 16'h0f1c;
     3: data <= 16'h0f6d;
     4: data <= 16'hffa4;

... and counting up ...

     249: data <= 16'h0031;
     250: data <= 16'hfffa;
     251: data <= 16'hffee;
     default: data <= 0;
   endcase
endmodule

But it so happens, that Xilinx’ XST synthesizer failed to get this one right. XST J.39, release 9.2.03i, if you insist.

And when I say it didn’t get it right, I mean that what I got on the hardware didn’t implement what the Verilog says it should.

First, what it should have done: Since the address space consists of 10 bits, and there are a lot of, but less than 1024 data elements, the synthesizer should have matched this with a 1k x 18 block RAM, set the values as INIT parameters, and not allow any writes. And so it did. Almost.

The problem, it seems, lies in the fact that only 252 data slots are assigned, leaving 3/4 of the ROM with zeroes. This is where the synthesizer tried to be smarter, for no practical reason. Based upon what I saw with the FPGA Editor, the synthesizer detected, that if any of addr[9] or addr[8] are nonzero, then the output is zero anyhow. Since the block RAM has a synchronous reset input, which affects only the output, the synthesizer decided to feed this reset with (addr[9] || addr[8]). This doesn’t change anything: If any of these lines is high, the output should be zero. It would be anyhow, since the block RAM itself contains zeros on the relevant addresses, but this reset logic doesn’t hurt. As long as you get it right, that is. Which wasn’t the case this time.

What really happened, was that the synthesizer mistakenly reversed the polarity of the logic of the reset line, so it got (!addr[8] && !addr[9]) instead. That made the memory array produce zeros for any address. And the design didn’t work.

It looks like the idea was to reverse the polarity at the block RAM’s reset input as well (which costs nothing in terms of logic resources) but somehow this didn’t come about.

Workaround: It looks like the “default” statement triggered this bug. Since the Verilog file was generated by a computer program anyhow, I let it go on running all the way to 1023, explicitly assigning zeros to each address. This is completely equivalent, of course, but made the design work in real life.

One of these bugs you wouldn’t expect.