<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>my tech blog &#187; GTX</title>
	<atom:link href="http://billauer.se/blog/category/gtx/feed/" rel="self" type="application/rss+xml" />
	<link>https://billauer.se/blog</link>
	<description>Anything I found worthy to write down.</description>
	<lastBuildDate>Thu, 12 Mar 2026 11:36:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.2</generator>
		<item>
		<title>Using MGTs in FPGA designs: Why the data is organized in packets</title>
		<link>https://billauer.se/blog/2026/02/mgt-gtx-fpga-packets/</link>
		<comments>https://billauer.se/blog/2026/02/mgt-gtx-fpga-packets/#comments</comments>
		<pubDate>Sat, 07 Feb 2026 13:47:49 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[FPGA]]></category>
		<category><![CDATA[GTX]]></category>
		<category><![CDATA[PCI express]]></category>
		<category><![CDATA[USB]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=7194</guid>
		<description><![CDATA[Introduction I&#8217;ll start with a correction: Indeed, application logic transmitting data from one FPGA to another is required to organize the data in some kind of packets or frames, but there&#8217;s one exception, which I&#8217;ll discuss later on: Xillyp2p. Anyhow, let&#8217;s take it from the beginning. Multi-Gigabit Transceivers (MGTs, sometimes also referred to as RocketIO, [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>I&#8217;ll start with a correction: Indeed, application logic transmitting data from one FPGA to another is required to organize the data in some kind of packets or frames, but there&#8217;s one exception, which I&#8217;ll discuss later on: Xillyp2p. Anyhow, let&#8217;s take it from the beginning.</p>
<p><a rel="noopener" href="https://en.wikipedia.org/wiki/Multi-gigabit_transceiver" target="_blank">Multi-Gigabit Transceivers</a> (MGTs, sometimes also referred to as RocketIO, GTX, GTH, GTY, GTP, GTM, etc.) have long since become the de facto standard for serialized data communication between digital components. The most famous use cases connect a computer with its peripherals (often between the CPU&#8217;s companion chip and the peripheral): for example, PCIe, SuperSpeed USB (a.k.a. USB 3.x), and SATA. Also related to computers, Gigabit Ethernet (as well as 10GbE) is based upon MGTs, and the DisplayPort protocol can be used for connecting a graphics card with a monitor.</p>
<p>Many FPGAs are equipped with MGTs. These are often used for turning the FPGA into a computer peripheral (with the PCIe protocol, possibly using Xillybus, or with the SuperSpeed USB protocol, possibly using XillyUSB, or as a storage device with SATA). Gigabit Ethernet also comes into play, allowing the FPGA to communicate with a computer over this protocol. Another use of MGTs is for connecting to electronic components, in particular ADC/DAC devices with a very high sampling frequency, hence requiring a high data rate.</p>
<p>But what about communication between FPGAs? At times, there are several FPGAs on a PCB that need to exchange information among themselves, possibly at high rates. In other usage scenarios, there&#8217;s a physical distance between the FPGAs. For example, test equipment often has a hand-held probe containing one FPGA that collects information, and a second FPGA that resides inside the table-top unit. If the data rate is high, MGTs on both sides make it possible to avoid heavy, cumbersome and error-prone cabling. In fact, a thin fiber-optic cable is a simple solution when MGTs are used anyhow, and besides being lightweight, it offers an extra benefit in some scenarios: electrical isolation. This is particularly important in some medical applications (for electrical safety) or when long cables need to be drawn outdoors (to avoid lightning strikes).</p>
<p>Among the annoying things about MGT communication there&#8217;s the fact that the data flow somehow always gets organized in packets (or frames, bursts, pick your name for it), and these packets don&#8217;t necessarily align properly with the application data&#8217;s natural boundaries. Why is that so?</p>
<p>This post attempts to explain why virtually all protocols (e.g. <a rel="noopener" href="https://en.wikipedia.org/wiki/Interlaken_(networking)" target="_blank">Interlaken</a>, <a rel="noopener" href="https://en.wikipedia.org/wiki/RapidIO" target="_blank">RapidIO</a>, AMD&#8217;s <a rel="noopener" href="https://en.wikipedia.org/wiki/Aurora_(protocol)" target="_blank">Aurora</a>, and Altera&#8217;s <a rel="noopener" href="https://www.altera.com/products/ip/a1jui0000049uv3mam/serial-lite-ip" target="_blank">SerialLite</a>) require the application data to be arranged in some kind of packets that are enforced by the protocol. The only exception is <a rel="noopener" href="https://xillybus.com/xillyp2p" target="_blank">Xillyp2p</a>, which presents error-free continuous channels from one FPGA to another (or with packets that are sensible for the application data). This is not to say that packets aren&#8217;t used under the hood; it&#8217;s just that this packet mechanism is transparent to the application logic.</p>
<p>I&#8217;ll discuss a few reasons for the use of packets:</p>
<ul>
<li>Word alignment</li>
<li>Error detection and retransmission</li>
<li>Clock frequency differences</li>
</ul>
<h3>Reason #1: Word alignment</h3>
<p>When working with an MGT, it&#8217;s easy to forget that the transmitted data is sent as a serial data stream of bits. The fact that both the transmitting and receiving side have the same data word width might give the false impression that the MGT has some magic way of aligning the word correctly at the receiver side. In reality, there is no such magic. There is no hidden trick allowing the receiver to know which bit is the first or last in a transmitted word. This is something that the protocol needs to take care of, possibly with some help from the MGT&#8217;s features.</p>
<p>When 8b/10b encoding is used, the common solution is to transmit a synchronization word, often referred to as a comma, which is known as the K28.5 symbol. This method takes advantage of the fact that 8b/10b encoding uses 10 bits on the wire for each 8 bits of payload data, which leaves room for a small number of extra codes that can&#8217;t be confused with regular data. These extra codes are called K-symbols, and K28.5 is one of them.</p>
<p>Hence if the bit sequence for a K28.5 symbol is encountered on the raw data link, it can&#8217;t be a data word. Most MGTs in FPGAs have a feature allowing them to automatically align the K28.5 word to the beginning of a word boundary. So word alignment can be ensured by transmitting a comma symbol. The comma symbol is often used to reset the scrambler as well, if such is used.</p>
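<p>As a toy illustration (my own sketch, not from any particular MGT&#8217;s implementation), here&#8217;s how a receiver could recover word alignment by scanning the raw bit stream for K28.5. The two &#8220;data&#8221; symbols in the example stream are illustrative bit patterns, not verified 8b/10b codes:</p>

```python
# Sketch of comma-based word alignment: scan the raw serial bit stream for
# the K28.5 symbol, which can't occur inside properly encoded data, and
# derive the 10-bit symbol boundary from where it was found.

K28_5_RD_NEG = "0011111010"  # K28.5 with running disparity -
K28_5_RD_POS = "1100000101"  # K28.5 with running disparity +

def find_alignment(bits: str) -> int:
    """Return the bit offset (mod 10) at which symbols start, or -1 if no comma."""
    for offset in range(len(bits) - 9):
        if bits[offset:offset + 10] in (K28_5_RD_NEG, K28_5_RD_POS):
            return offset % 10
    return -1

# Three junk bits, then a comma, then two "data" symbols:
stream = "101" + K28_5_RD_NEG + "0101010101" + "1001110100"
offset = find_alignment(stream)                                  # -> 3
symbols = [stream[i:i + 10] for i in range(offset, len(stream) - 9, 10)]
```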
<p>Each protocol defines when the comma is transmitted. There are many variations on this topic, but they all boil down to two alternatives:</p>
<ul>
<li>Transmitting comma symbols periodically, or possibly as part of the marker for the beginning of a packet.</li>
<li>Transmitting comma symbols only as part of an initialization of the channel. This alternative is adopted by protocols like SuperSpeed USB and PCIe, which have specific patterns for initializing the channel, referred to as Ordered Sets for Training and Recovery. These patterns include comma symbols, among others.</li>
</ul>
<p>Truth be told, if the second approach is taken, the need for word alignment isn&#8217;t a reason by itself for dividing the data into packets, as the alignment takes place once and is preserved afterwards. But the concept of initializing the channel is quite complicated, and is not commonly adopted.</p>
<p>There are other methods for achieving word alignment, in particular when 8b/10b encoding isn&#8217;t used. The principles remain the same, though.</p>
<h3>Reason #2: Error detection and retransmission</h3>
<p>When working with an MGT, bit errors must be taken into account. These errors simply mean that a &#8216;0&#8217; is received for a bit that was transmitted as a &#8216;1&#8217;, or vice versa. In some hardware setups such errors may occur relatively often (with a rate of, say, 10<sup>-9</sup>, which usually means more than once per second), and with other setups they may practically never occur. If an error in the application data can&#8217;t be tolerated, a detection mechanism for these bit errors must be in place at the very least, in order to prevent delivery of incorrectly received data to the application logic. Even if a link appears to be completely error-free judging by long-term experience, this can&#8217;t be guaranteed in the long run, in particular as electronic components from different manufacturing batches are used.</p>
<p>In order to detect errors, some kind of CRC (or other redundant data) must be inserted occasionally in order to allow the receiver to check if the data has arrived correctly. As the CRC is always calculated on a segment (whether it has a fixed length or not), the information must be divided into packets, even if just for the purpose of attaching a CRC to each.</p>
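<p>As a toy illustration in Python (zlib&#8217;s CRC-32 stands in here for whatever polynomial a real link protocol would define):</p>

```python
import struct
import zlib

# Why error detection implies packets: a CRC is computed over a bounded
# segment, so the stream must be chopped into packets just to have
# something to attach each CRC to.

def make_packet(payload: bytes) -> bytes:
    """Append a little-endian CRC-32 to the payload."""
    return payload + struct.pack("<I", zlib.crc32(payload))

def check_packet(packet: bytes):
    """Return the payload if the CRC matches, or None if a bit error is detected."""
    payload, crc = packet[:-4], struct.unpack("<I", packet[-4:])[0]
    return payload if zlib.crc32(payload) == crc else None

pkt = make_packet(b"application data")
assert check_packet(pkt) == b"application data"

corrupted = bytes([pkt[0] ^ 0x01]) + pkt[1:]  # flip a single bit
assert check_packet(corrupted) is None        # detected: drop or request retransmission
```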
<p>And then we have the question of what to do if an error is detected. There are mainly two possibilities:</p>
<ul>
<li>Requesting a retransmission of the faulty packet. This ensures that an error-free channel is presented to the application logic.</li>
<li>Informing the application logic about the error, possibly halting the data flow so that faulty data isn&#8217;t delivered. This requires the application logic to somehow recover from this state and restart its operation.</li>
</ul>
<p>High-end protocols like PCIe, SATA and SuperSpeed USB take the first approach, and ensure that all packets arrive correctly by virtue of a retransmission mechanism.</p>
<p>Gigabit Ethernet takes the second approach — there&#8217;s a CRC on the Ethernet packets, but the Ethernet protocol itself doesn&#8217;t intervene much if a packet arrives with an incorrect CRC. Such a packet is simply discarded (either by the hardware implementing the protocol or by software), so faulty data doesn&#8217;t go further. Even the IP protocol, which is usually one level above, does nothing special about the CRC error and the packet loss that occurred as a result of it. It&#8217;s only the TCP protocol that eventually detects the packet loss by virtue of a timeout, and requests retransmission.</p>
<p>What about FPGA-to-FPGA protocols, then? Well, each protocol takes its own approach. Xillyp2p is special in that it requests retransmissions when the physical link is bidirectional, but if the link is unidirectional it only discards the faulty data and halts everything until the application logic resumes operation — a retransmission request is impossible in the latter case.</p>
<h3>Reason #3: Clock frequency differences</h3>
<p>Clock frequency differences should have been the first topic, because this is the subtle detail that rules out the solution most FPGA engineers would consider first for communication between two FPGAs: One FPGA sends a stream of data words at a regular pace, and the other FPGA receives and processes it. Simple and clean.</p>
<p>But I put it third and last, because it&#8217;s the most difficult to deal with, and the explanations became really long. So try to hang on. And if you don&#8217;t, here&#8217;s the short version: The transmission of data can&#8217;t be continuous, because the receiver&#8217;s clock might be just a few ppm slower. Hence the rate at which the receiver can process arriving data might be slightly lower than the rate at which the transmitter sends it, if the latter keeps sending non-stop. So to keep the receiver from overflowing with data, the transmitter must pause the flow of application data every now and then to let the receiver catch up. And if there are pauses, the segments between these pauses are some kind of packets.</p>
<p>And now, to the long explanation, starting with the common case: The data link is bidirectional, and the data content in both directions is tightly related. Even if application data goes in one direction primarily, there is often some kind of acknowledgement and/or status information going the other way. All &#8220;classic&#8221; protocols for computers (PCIe, USB 3.x and SATA) are bidirectional, for bidirectional data as well as acknowledge packets, and there is usually a similar need when connecting two FPGAs.</p>
<h3>The local and CDR clocks</h3>
<p>I&#8217;ll need to make a small detour now and discuss clocks. Tedious, but necessary.</p>
<p>In most applications, each of the two involved FPGAs uses a different reference clock to drive its MGT, and the same reference clock is often used to drive the logic around it. These reference clocks of the two FPGAs have the same frequency, except for a small tolerance. Small, but it causes big trouble.</p>
<p>Each MGT transmits data based upon its own reference clock (I&#8217;ll explain below why it&#8217;s always this way). The logic in the logic fabric that produces the data for transmission is usually driven by a clock derived from the same reference clock. In other words, the entire transmission chain is derived from the local reference clock.</p>
<p>The natural consequence is that the data which the MGT receives is based upon the other side&#8217;s reference clock. The MGT receiving this data stream locks a local clock oscillator on the data rate of the arriving data stream. This mechanism is referred to as clock data recovery, CDR. The MGT&#8217;s logic that handles the arriving data stream is clocked by the CDR clock, and is hence synchronized with this data stream&#8217;s bits.</p>
<p>Unlike most other IP blocks in an FPGA, the clocks that are used to interface with the MGT are <strong>outputs</strong> from the MGT block. In other words, the MGT supplies the clock to the logic fabric, and not the other way around. This is a necessary arrangement, not only because the MGT generates the CDR clock: The main reason is that the MGT is responsible for handling the clocks that run at the bit rate, having a frequency of several GHz, which is far above what the logic fabric can handle. Also, the reference clock used to generate these GHz clocks must be very &#8220;clean&#8221; (low jitter), so the FPGA&#8217;s regular clock resources can&#8217;t be used. Frequency dividers inside the MGT generate the clock or clocks used to interface with the logic fabric.</p>
<p>In particular, the data words that are transferred from the logic fabric into the MGT for transmission, as well as data words from the MGT to the logic fabric (received data), are clocked by the outputs of these frequency dividers. The fact that these clocks are used in the interface with the logic fabric makes it possible to apply timing constraints on paths between the MGT&#8217;s internal logic and the logic fabric.</p>
<p>For the purpose of this discussion, let&#8217;s forget about the clocks inside the MGT, and focus only on those accessible by the logic fabric. It&#8217;s already clear that there are two clocks involved, one generated from the local oscillator, based upon the local reference clock (&#8220;local&#8221; clock), and the CDR clock, which is derived from the arriving data stream. Two clocks, two clock domains.</p>
<h3>Clock or clocks used for implementing the protocol</h3>
<p>As there are two clocks involved, the question is which clock is used by the logic that processes the data. This is the logic that implements the protocol. The answer is obviously one of the two clocks supplied by the MGT. It&#8217;s quite pointless to implement the protocol in a foreign clock domain.</p>
<p>In principle, the logic (in the logic fabric) implementing the protocol could be clocked by both clocks, but in practice the vast majority of it is clocked by only one of them: It&#8217;s difficult to implement a protocol across two clock domains, so even if both clocks are used, the actual protocol implementation is always clocked by one of the clocks, and the other clock is used by a minimal amount of logic.</p>
<p>In all practical implementations, the protocol is implemented on the local clock&#8217;s domain (the clock used for transmission). The choice is almost obvious: Given that one needs to choose one of the two clocks, the choice is naturally inclined towards the local clock, which is always present and always stable.</p>
<p>The logic running on the CDR clock usually does some minimal processing on the arriving data, and then pushes it into the local clock domain. And this brings us naturally to the next topic.</p>
<h3>Crossing clock domains</h3>
<p>Every FPGA engineer knows (or should know) that a dual-clock FIFO is the first solution to consider when a clock domain crossing is required. And indeed, this is the most common solution for crossing the clock domain from the CDR clock towards the local clock. It&#8217;s the natural choice when the only need is to hand over the arriving data to the local clock domain.</p>
<p>Consequently, many protocol implementations are clocked only by the local clock, which is the only clock exposed by the MGT. The dual-clock FIFO is implemented inside the MGT, and is usually called an &#8220;elastic buffer&#8221;. This way, all interaction with the MGT is done in one clock domain, which simplifies the implementation.</p>
<p>It&#8217;s also possible to implement the protocol with both clocks, and perform the clock domain crossing in the logic fabric, most likely with the help of a FIFO IP provided by the FPGA tools.</p>
<p>To reiterate, it boils down to two options:</p>
<ul>
<li>Doing the clock domain crossing inside the MGT with an &#8220;elastic buffer&#8221;, and clocking the logic fabric only with the local clock.</li>
<li>Using both clocks in the logic fabric, and accordingly doing the clock domain crossing in the logic fabric.</li>
</ul>
<h3>Preventing overflow / underflow</h3>
<p>As mentioned earlier, the two clocks usually have almost the same frequency, with a difference that results from the oscillators&#8217; frequency tolerance. To illustrate the problem, let&#8217;s take an example with a bidirectional link of 1 Gbit/s, and the clock oscillators have a tolerance of 10 ppm each, which is considered pretty good. If the transmitter&#8217;s clock frequency is 10 ppm above, and the receiver&#8217;s frequency is 10 ppm below, there is a 20 ppm difference in the 1 Gbit/s data rate. In other words, the receiver gets 20,000 bits more than it can handle every second: No matter which of the two options mentioned above for clock domain crossing is chosen, there&#8217;s a FIFO whose write clock runs 20 ppm faster than the read clock. And soon enough, it overflows.</p>
<p>It can also be the other way around: If the write clock is slower than the read clock, this FIFO becomes empty every now and then. This scenario needs to be addressed as well.</p>
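<p>The arithmetic of the overflow example, as a quick Python sketch (the FIFO depth below is a made-up number, just to show how quickly even a generous buffer fills up):</p>

```python
# With a 1 Gbit/s link and the two ±10 ppm oscillators at opposite
# extremes, the FIFO's write clock outpaces its read clock by 20,000 bits
# every second, so a transmitter that never pauses overflows the FIFO
# within a fraction of a second.

bit_rate = 1e9                # 1 Gbit/s
ppm_difference = 20e-6        # +10 ppm transmitter vs. -10 ppm receiver

excess_bits_per_second = bit_rate * ppm_difference   # 20,000 bits/s

fifo_depth_bits = 64 * 32     # hypothetical 64-word FIFO, 32 bits per word
seconds_to_overflow = fifo_depth_bits / excess_bits_per_second  # ~0.1 s
```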
<p>There are several solutions to this problem, and they all boil down to the transmitter pausing the flow of application data at regular intervals, and inserting some kind of stuffing in between to indicate these pauses. There is no way to stop the physical data stream; it&#8217;s only possible to send data words that are discarded by the receiver instead of ending up in the FIFO. Recall that the protocol is almost always clocked by the local clock, which is the clock reading from the FIFO. So, for example, just inserting some idle time between transmitted packets is not a solution in the vast majority of cases: The packets&#8217; boundaries are detected by the logic that reads from the FIFO, not on the side writing to it. Hence most protocols resort to much simpler ways to mark these pauses.</p>
<p>The most famous mechanism is called skip ordered sets, or skip symbols. It&#8217;s the common choice when 8b/10b encoding is used. It takes advantage of the fact mentioned above, that when 8b/10b is used, it&#8217;s possible to send K-symbols that are distinguishable from the regular data flow. For example, a SuperSpeed USB transmitter emits two K28.1 symbols at regular intervals. The logic before the FIFO at the receiver discards K28.1 symbols rather than writing them into the FIFO.</p>
<p>It&#8217;s also common that the logic reading from the FIFO injects K28.1 symbols when the FIFO is empty. This allows a continuous stream of data towards the protocol logic, even if the local clock is faster than the CDR clock. It&#8217;s then up to the protocol logic to discard K28.1 symbols.</p>
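<p>Here&#8217;s a toy Python model of this mechanism. The skip interval and the data words are made up for the example; in a real link, skip insertion is governed by the protocol&#8217;s spec:</p>

```python
from collections import deque

# Toy model of skip symbols: SKP stands in for K28.1. The transmitter
# inserts a SKP after every few data words; the receiver's write-side
# logic drops SKPs instead of writing them into the FIFO, and the read
# side injects SKPs when the FIFO runs empty, so the protocol logic sees
# a continuous symbol stream and simply ignores SKPs.

SKP = "K28.1"
SKP_INTERVAL = 4  # one SKP per 4 data words (illustrative)

def transmit(words):
    """Emit the words with a SKP inserted after every SKP_INTERVAL words."""
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if (i + 1) % SKP_INTERVAL == 0:
            out.append(SKP)
    return out

fifo = deque()

def write_side(symbol):
    """Logic clocked by the CDR clock: discard SKPs, store everything else."""
    if symbol != SKP:
        fifo.append(symbol)

def read_side():
    """Logic clocked by the local clock: inject a SKP when the FIFO is empty."""
    return fifo.popleft() if fifo else SKP

for symbol in transmit(list(range(8))):
    write_side(symbol)

received = [read_side() for _ in range(10)]  # last two reads hit an empty FIFO
```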
<p>There are of course other solutions, in particular when 8b/10b isn&#8217;t used. The main point is however that the transmitting side can&#8217;t just transmit data continuously. At the very least, there must be some kind of pauses. And as already said, when there are pauses, there are packets between them, even if they don&#8217;t have headers and CRCs.</p>
<h3>But why not transmit with the CDR clock?</h3>
<p>This can sound like an appealing solution, and it&#8217;s possible at least in theory: Let one side (&#8220;master&#8221;) transmit data based upon its local clock, just as described above, and let the other side (&#8220;slave&#8221;) transmit data based upon the CDR clock. In other words, the slave&#8217;s transmission clock follows the master&#8217;s clock, so they have exactly the same frequency.</p>
<p>First, why it&#8217;s a bad idea to use the CDR clock directly for transmission: Jitter. I&#8217;ve already used the word jitter above, but now it deserves an explanation: In theory, a clock signal has a fixed time period between each transition. In practice, the time between these transitions varies randomly. It&#8217;s a slight variation, but it can have a devastating effect on the data link&#8217;s reliability: As each clock transition sets the time point at which a new bit is presented on the physical link, by virtue of changing the voltage between the wires, randomness in this timing has an effect similar to adding noise.</p>
<p>This is why MGTs should always be driven by &#8220;clean&#8221; reference clocks, meaning oscillators that are a bit more expensive, a bit more carefully placed on the PCB, and designed with a focus on low jitter.</p>
<p>So what happens if the slave side uses the CDR clock to transmit data? Well, the transmitter&#8217;s clock already has a certain amount of jitter, which is the result of the reference clock&#8217;s own jitter, plus the jitter added by the PLL that created the bit-rate clock from it. The CDR creates a clock based upon the arriving data stream, which usually adds a lot of jitter. That too has the same effect as adding noise to the receiver&#8217;s input, because the receiver samples the analog signal using the CDR clock. However, this effect is inevitable. In order to mitigate it, the PLL that generates the CDR clock is often tuned to produce as little jitter as possible, while still being able to lock on the master&#8217;s frequency.</p>
<p>As the CDR clock has a relatively high jitter due to how it&#8217;s created, using it directly to transmit data is equivalent to adding noise to the physical channel, and is therefore a bad idea.</p>
<p>It&#8217;s however possible to take a divided version of the CDR clock (most likely the CDR clock as it appears on the MGT&#8217;s output port) and drive one of the FPGA&#8217;s output pins with it. That output goes to a &#8220;jitter cleaner&#8221; component on the PCB, which returns the same clock, but with much less jitter. And the latter clock can then be used as a reference clock to transmit data.</p>
<p>I&#8217;ve never heard of anyone attempting the trick with a &#8220;jitter cleaner&#8221;, let alone tried this myself. I suppose a few skip symbols are much easier than playing around with clocks.</p>
<h3>But if the link is unidirectional?</h3>
<p>If there&#8217;s a physical data link only in one direction, the CDR clock can be used on the receiving side to clock the protocol logic without any direct penalty. But it&#8217;s still a foreign clock. The MGT at the receiving side still needs a local reference clock in order to lock the CDR on the arriving data stream.</p>
<p>And as things usually turn out, the same local reference clock becomes the reference for all logic on the FPGA. So using the local clock for receiving data often saves a clock domain crossing between the protocol logic and the rest of the design. It becomes a question of where the clock domain crossing occurs.</p>
<h3>Conclusion</h3>
<p>If data is transmitted through an MGT, it will most likely end up divided into packets. At least one of the reasons mentioned above will apply.</p>
<p>It&#8217;s possible to avoid the encapsulation, stripping, multiplexing and error checking of packets by using Xillyp2p. Unlike other protocol cores, this IP core takes care of these tasks, and presents the application logic with error-free and continuous application data channels. The packet-related tasks aren&#8217;t avoided, but rather taken care of by the IP core instead of the application logic.</p>
<p>This is comparable with using raw Ethernet frames vs. TCP/IP: There is no way around using packets for getting information across a network. Choosing raw Ethernet frames requires the application to chop up the data into frames and ensure that they arrive correctly. If TCP/IP is chosen, all this is taken care of.</p>
<p>One way or another, there will be packets on the wire.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2026/02/mgt-gtx-fpga-packets/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reverse engineering Cyclone 10 transceiver&#8217;s attributes</title>
		<link>https://billauer.se/blog/2021/10/arria-cyclone-10-signal-detect-oob/</link>
		<comments>https://billauer.se/blog/2021/10/arria-cyclone-10-signal-detect-oob/#comments</comments>
		<pubDate>Thu, 14 Oct 2021 15:01:55 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[FPGA]]></category>
		<category><![CDATA[GTX]]></category>
		<category><![CDATA[Intel FPGA (Altera)]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6415</guid>
		<description><![CDATA[Introduction This post summarizes some scattered findings I made while trying to make a Cyclone 10&#8242;s signal detect feature work properly for detecting a SuperSpeed USB LFPS signal. As it turned out, Cyclone 10&#8242;s transceiver isn&#8217;t capable of this, as explained below. But since the documentation on this issue was lacking, I resorted to reverse [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>This post summarizes some scattered findings I made while trying to make a Cyclone 10&#8217;s signal detect feature work properly for detecting a SuperSpeed USB LFPS signal. As it turned out, Cyclone 10&#8217;s transceiver isn&#8217;t capable of this, as explained below.</p>
<p>But since the documentation on this issue was lacking, I resorted to reverse engineering Quartus in the attempt to find a solution. So this post is a bit about the transceiver and more about the reverse engineering efforts, which might be relevant in completely different contexts.</p>
<p>I should mention that everything on this page relates to Cyclone 10, even though the output from the tools keeps naming different logic elements with &#8220;a10&#8221;, as if it were Arria 10. Clearly, the transceivers for the two FPGA families are the same.</p>
<p>Software: Quartus Pro 17.1 running on a 64-bit Linux machine (Mint 19).</p>
<h3>Cyclone 10&#8217;s signal detect is rubbish</h3>
<p>The purpose of the signal detector is to tell whether the differential wires are in an electrical idle state, or if there&#8217;s some activity on them. This is used by several protocols to wake up the link partners from a low power state: A PCIe link can be awakened from an L2 state by an upstream facing link partner (typically the device waking up the host) by virtue of a beacon, which consists of toggling the polarity at a rate of 30 kHz &#8211; 500 MHz. A SATA link can be awakened by one of the link partners transmitting a special data pattern. The USB 3.x protocol also uses out-of-band (OOB) signals of this sort, for various purposes, and calls them LFPS (Low Frequency Periodic Signaling). The toggling rate is defined between 10 and 50 MHz.</p>
<p>The first, relatively simple obstacle was to turn on the signal detector. The <a href="https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/cyclone-10/ug_cyclone10_xcvr_phy.pdf" target="_blank">Cyclone 10 GX Transceiver PHY User Guide</a> says in Table 67, regarding rx_std_signaldetect:</p>
<blockquote><p>When enabled, the signal threshold detection circuitry senses whether the signal level present at the RX input buffer is above the signal detect threshold voltage. You can specify the signal detect threshold using a Quartus Prime Settings File (.qsf) assignment. This signal is required for the PCI Express, SATA and SAS protocols.</p></blockquote>
<p>Similar notes are made in other places in that guide. However, it doesn&#8217;t mention that if the transceiver is configured in &#8220;basic&#8221; mode (as opposed to SATA mode, as well as PCIe mode, I suppose), rx_std_signaldetect is stuck at logic &#8216;1&#8217;, so enabling this signal alone isn&#8217;t enough.</p>
<p>But the real problem is that the signal detector is probably not good for anything but detecting SATA&#8217;s OOB: When I selected SATA mode, I did get some response on rx_std_signaldetect, but it was clearly not detecting the LFPS activity in a useful way. Unlike Cyclone V&#8217;s signal detector, Cyclone 10&#8217;s detector barely responded at all to a 31.25 MHz LFPS, and the detections occurred with pretty arbitrary timing, often with a pulse when the LFPS signal stopped, and some other random pulses as the wires went into electrical idle. In short, far from the desired assertion when the LFPS signal starts and deassertion when it stops.</p>
<p>Things got better as the toggling frequency increased, and around 125 MHz the assertion of the signal detect was steadily aligned with the onset of the LFPS toggling, however the deassertion was often delayed after the LFPS stopped. So even if the LFPS signal could be guaranteed to be at this frequency (it can&#8217;t, as it&#8217;s produced by the USB 3.x link partner, and 125 MHz is above the maximum), the issue with the deassertion makes it impossible to use it with LFPS, which is extremely sensitive to the timing of onset and release of the toggling.</p>
<p>In fact, it&#8217;s probably useless for PCIe as well, as a PCIe beacon is allowed between 30 kHz &#8211; 500 MHz. This might explain why recent versions of the user guides for the PCIe block for Cyclone V, Arria V, Cyclone 10 and Arria 10 had this sentence added:</p>
<blockquote><p>These IP cores also do not support the in-band beacon or sideband WAKE# signal, which are mechanisms to signal a wake-up event to the upstream device.</p></blockquote>
<p>The problem was probably not spotted for a while because the beacon is rarely used: The PCIe spec utilizes beacon transmission only from a device towards the host (upstream) for the sake of bringing up the link from a low power state. So signal detection by an FPGA for the sake of PCIe is only required when the FPGA acts as a host, and low-power modes are supported. In short, practically never.</p>
<p>What&#8217;s left? SATA. That will probably work, because the differential wires toggle rapidly, and it doesn&#8217;t matter so much if the detection is a bit off-beat.</p>
<p>So I resorted to detecting the LFPS bursts directly from the uncoded received data, rather than using the signal detect feature. The rest of this post describes my attempts before I gave up on the signal detector.</p>
<h3>The options are limited</h3>
<p>Quartus goes a long way to be &#8220;helpful&#8221; by verifying that the parameter assignments make sense with regard to the intended protocol (e.g. SATA, PCIe etc.), as reflected by the &#8220;prot_mode&#8221; parameter. This often means that the fitter throws an error when one tries to alter a parameter from its default. It&#8217;s like someone said: nope, if you&#8217;re using the transceiver for SATA, such-and-such are the correct analog parameters for the PMA, and if you try otherwise, the fitter will kick your bottom for your own protection.</p>
<p>Or maybe it&#8217;s a gentle way of telling us users not to try anything but the protocols for which the transceiver is directly intended.</p>
<p>The fitter may also ignore assignments because they were assigned to an unrelated entity (e.g. to gtx_tx instead of the positive signal&#8217;s name, gtx_tx<span style="color: #ff0000;"><strong>p</strong></span>). <strong>So always be sure to look in the fitter report&#8217;s &#8220;Ignored Assignments&#8221; section</strong>.</p>
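<p>For example, assuming the receiver&#8217;s positive pin is named gtx_rxp (as in the examples further down; adapt to your own design), the first assignment below lands in the &#8220;Ignored Assignments&#8221; section, and only the second one takes effect:</p>
<pre># Ignored: assigned to an unrelated entity, not the positive-signal pin
set_instance_assignment -name XCVR_C10_RX_SD_THRESHOLD SDLV_3 -to gtx_rx
# Effective: assigned to the positive-signal pin itself
set_instance_assignment -name XCVR_C10_RX_SD_THRESHOLD SDLV_3 -to gtx_rxp</pre>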
<p>One could speculate that this nanny-like behavior can be disabled by setting one of the pma_*_sup_mode parameters to &#8220;engineering_mode&#8221; rather than the default &#8220;user_mode&#8221;, but see below on this.</p>
<h3>QSF</h3>
<p>I expected to solve this by tuning the parameters of the signal detector, like I&#8217;ve previously done with Cyclone V: By virtue of assignments in the QSF file.</p>
<p>So here&#8217;s the catch: The <a href="https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/cyclone-10/ug_cyclone10_xcvr_phy.pdf" target="_blank">User Guide</a> also says that the signal detect threshold can be set by virtue of .qsf assignments, but no such assignments are documented in it (as of the version for Quartus 20.1), and the Assignment Editor offers no parameter of this sort.</p>
<p>For Cyclone V, it&#8217;s documented in the <a href="https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/xcvr_user_guide.pdf" target="_blank">V-Series Transceiver PHY IP Core User Guide</a> around page 20-31, and there are recommended values on <a href="https://www.intel.com/content/www/us/en/programmable/support/support-resources/knowledge-base/solutions/rd05082013_57.html" target="_blank">this page</a>. My anecdotal experiments seem to indicate that assigning XCVR_* attributes (without the C10 part) to a Cyclone 10 transceiver is accepted but ignored by Quartus. In other words, trying to use Cyclone-V QSF assignments won&#8217;t cut it.</p>
<p>So let&#8217;s start the guesswork on the names of the QSF parameters.</p>
<h3>Hint source I: The fitter report</h3>
<p>The fitter report has a section called &#8220;Receiver Channel&#8221;, which shows the attributes of the transceiver&#8217;s components as applied de facto. Among others, there&#8217;s a part saying</p>
<pre>;             -- Name                                           ; frontend_ins|xcvr_inst|xcvr_native_a10_0|xcvr_native_a10_0|g_xcvr_native_insts[0].twentynm_xcvr_native_inst|twentynm_xcvr_native_inst|inst_twentynm_pma|gen_twentynm_hssi_pma_rx_sd.inst_twentynm_hssi_pma_rx_sd                               ;
;             -- Location                                       ; HSSIPMARXSD_1D4                                                                                                                                                                                                                                                       ;
;         -- Advanced Parameters                                ;                                                                                                                                                                                                                                                                       ;
;             -- link                                           ; mr                                                                                                                                                                                                                                                                    ;
;             -- power_mode                                     ; mid_power                                                                                                                                                                                                                                                             ;
;             -- prot_mode                                      ; sata_rx                                                                                                                                                                                                                                                               ;
;             -- sd_output_off                                  ; 1                                                                                                                                                                                                                                                                     ;
;             -- sd_output_on                                   ; 1                                                                                                                                                                                                                                                                     ;
;             -- sd_pdb                                         ; sd_on                                                                                                                                                                                                                                                                 ;
;             -- sd_threshold                                   ; sdlv_3</pre>
<p>It&#8217;s actually recommended to go through this part of the fitter report in any case, to make sure the transceiver was set up as desired.</p>
<p>But this part allows guessing the names of the parameters for the QSF file. For example, the following assignments are perfectly legal (and match the setting shown above):</p>
<pre>set_instance_assignment -name XCVR_C10_RX_SD_OUTPUT_OFF 1 -to gtx_rxp
set_instance_assignment -name XCVR_C10_RX_SD_OUTPUT_ON 1 -to gtx_rxp
set_instance_assignment -name XCVR_C10_RX_SD_THRESHOLD SDLV_3 -to gtx_rxp</pre>
<p>It doesn&#8217;t take a cyber hacker to see the connection between the QSF parameter names and those appearing in the report. This works for some parameters, and not for others. But it&#8217;s the easiest way to guess parameter names.</p>
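<p>To illustrate the hit-and-miss nature of this recipe (speculatively; I haven&#8217;t verified the exact outcome of the second line): sd_threshold from the report maps to a QSF name that really exists, but applying the same recipe to sd_pdb yields a name that is absent from the harvested list of XCVR_C10_* strings shown further down, so an assignment of that kind presumably just gets ignored:</p>
<pre># sd_threshold in the report maps to a QSF name that actually exists:
set_instance_assignment -name XCVR_C10_RX_SD_THRESHOLD SDLV_3 -to gtx_rxp
# sd_pdb doesn't: XCVR_C10_RX_SD_PDB is missing from the strings found
# in libdb_acf.so, so this presumably ends up ignored:
set_instance_assignment -name XCVR_C10_RX_SD_PDB SD_ON -to gtx_rxp</pre>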
<h3>Hint source II: Read the sources</h3>
<p>But what are the allowed values that can be assigned to these parameters? Hints on that can be obtained from the SystemVerilog files generated for the IP, in particular the one named xcvr_xcvr_native_a10_0_altera_xcvr_native_a10_171_ev4uzpa.sv (the &#8220;ev4uzpa&#8221; suffix varies), which has a section going:</p>
<pre>parameter pma_rx_sd_prot_mode = "basic_rx",//basic_kr_rx basic_rx cpri_rx gpon_rx pcie_gen1_rx pcie_gen2_rx pcie_gen3_rx pcie_gen4_rx qpi_rx sata_rx unused
parameter pma_rx_sd_sd_output_off = 1,//0:28
parameter pma_rx_sd_sd_output_on = 1,//0:15
parameter pma_rx_sd_sd_pdb = "sd_off",//sd_off sd_on
parameter pma_rx_sd_sd_threshold = 3,//0:15
parameter pma_rx_sd_sup_mode = "user_mode",//engineering_mode user_mode</pre>
<p>Note the comments, saying which values are allowed for each parameter. On a good day, staying within these value ranges makes the tools accept the assignments, and on an even better day, the fitter won&#8217;t throw an error because it considers the values unsuitable.</p>
<p>It&#8217;s worth taking a look at the other modules as well, even though they&#8217;re likely to have the same comments.</p>
<p>As far as I&#8217;ve seen, these parameters are set by the toplevel module for the transceiver IP. QSF assignments, if present, override the former.</p>
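<p>To make this concrete (a speculative sketch: SDLV_4 merely extrapolates the SDLV_n pattern that appeared in the fitter report, and I haven&#8217;t verified this exact value): the generated toplevel sets pma_rx_sd_sd_threshold to 3 by default, and a QSF assignment of this kind should override it:</p>
<pre># Overrides the pma_rx_sd_sd_threshold parameter that the IP's toplevel
# module sets (3 by default, per the generated SystemVerilog above):
set_instance_assignment -name XCVR_C10_RX_SD_THRESHOLD SDLV_4 -to gtx_rxp</pre>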
<h3>Hint source III: What do these assignments mean?</h3>
<p>For this I suggest looking at ip/altera/alt_xcvr/alt_xcvr_core/nd/doc/PMA_RegMap.csv (or a similar file name) under the root directory of the Quartus installation. Yes, we&#8217;re digging in Quartus&#8217; backyard now. I found these files by searching for strings in all files of the Quartus installation. Reverse engineering, after all.</p>
<p>In fact, I&#8217;m not sure if this is the correct file to look at, or maybe CR2_PMA_RegMap.csv or whatever. Neither do I know what they mean exactly. It&#8217;s however quite evident that these CSV files (which open with your favorite spreadsheet application) were intended to document the register space of the PMA. The table has an &#8220;Attribute Description&#8221; column with a few meaningful words on each attribute, as well as a column named &#8220;Attribute Encoding&#8221;, which may happen to be the value to use in a QSF assignment (or may not work at all).</p>
<p>There&#8217;s also an official register map from Intel available for download, named <a href="https://www.intel.com/content/dam/www/programmable/us/en/others/literature/hb/cyclone-10/c10-registermap-official.xlsx" target="_blank">c10-registermap-official.xlsx</a>, which apparently contains complementary information. But it&#8217;s not possible to deduce QSF names from this file.</p>
<h3>Hint source IV: What assignments are legal, then?</h3>
<p>I mentioned earlier that the fitter rejects certain value assignments because they apparently don&#8217;t make sense. The rules seem to be written as a Tcl script in ip/altera/alt_xcvr/alt_xcvr_core/nd/tcl/ct2_pma_rx_sd_simple.tcl (and similar files). Once again, under the Quartus installation root. And yet again, sometimes this helps, and sometimes it doesn&#8217;t.</p>
<h3>Hint source V: The names of the QSF parameters</h3>
<p>Up to this point, the names of the parameters to assign in the QSF file were a matter of speculation, based upon similar names in other contexts.</p>
<p>It&#8217;s possible to harvest all possible names by searching for strings in one of Quartus&#8217; installed binaries, as shown on <a title="Quartus 17.1 (non-pro): List of QSF parameter names" href="https://billauer.se/blog/2021/10/quartus-qsf-list/" target="_blank">this post</a> for non-Pro Quartus 17.1, and <a title="Quartus Pro 19.2: List of QSF parameter names" href="https://billauer.se/blog/2021/10/quartus-pro-qsf-list/" target="_blank">this post</a> for Quartus Pro 19.2.</p>
<p>For a complete list of allowed QSF assignments that relate to Cyclone 10 transceivers (or so I groundlessly believe), search for strings in libdb_acf.so, e.g.</p>
<pre>$ <strong>strings ./quartus/linux64/libdb_acf.so | grep XCVR_C10 | sort</strong>
XCVR_C10_CDR_PLL_ANALOG_MODE
XCVR_C10_CDR_PLL_POWER_MODE
XCVR_C10_CDR_PLL_REQUIRES_GT_CAPABLE_CHANNEL
XCVR_C10_CDR_PLL_UC_RO_CAL
XCVR_C10_CMU_FPLL_ANALOG_MODE
XCVR_C10_CMU_FPLL_PLL_DPRIO_CLK_VREG_BOOST
XCVR_C10_CMU_FPLL_PLL_DPRIO_FPLL_VREG1_BOOST
XCVR_C10_CMU_FPLL_PLL_DPRIO_FPLL_VREG_BOOST
XCVR_C10_CMU_FPLL_PLL_DPRIO_STATUS_SELECT
XCVR_C10_CMU_FPLL_POWER_MODE
XCVR_C10_LC_PLL_ANALOG_MODE
XCVR_C10_LC_PLL_POWER_MODE
XCVR_C10_PM_UC_CLKDIV_SEL
XCVR_C10_PM_UC_CLKSEL_CORE
XCVR_C10_PM_UC_CLKSEL_OSC
XCVR_C10_REFCLK_TERM_TRISTATE
XCVR_C10_RX_ADAPT_DFE_CONTROL_SEL
XCVR_C10_RX_ADAPT_DFE_SEL
XCVR_C10_RX_ADAPT_VGA_SEL
XCVR_C10_RX_ADAPT_VREF_SEL
XCVR_C10_RX_ADP_CTLE_ACGAIN_4S
XCVR_C10_RX_ADP_CTLE_EQZ_1S_SEL
XCVR_C10_RX_ADP_DFE_FLTAP_POSITION
XCVR_C10_RX_ADP_DFE_FXTAP1
XCVR_C10_RX_ADP_DFE_FXTAP10
XCVR_C10_RX_ADP_DFE_FXTAP10_SGN
XCVR_C10_RX_ADP_DFE_FXTAP11
XCVR_C10_RX_ADP_DFE_FXTAP11_SGN
XCVR_C10_RX_ADP_DFE_FXTAP2
XCVR_C10_RX_ADP_DFE_FXTAP2_SGN
XCVR_C10_RX_ADP_DFE_FXTAP3
XCVR_C10_RX_ADP_DFE_FXTAP3_SGN
XCVR_C10_RX_ADP_DFE_FXTAP4
XCVR_C10_RX_ADP_DFE_FXTAP4_SGN
XCVR_C10_RX_ADP_DFE_FXTAP5
XCVR_C10_RX_ADP_DFE_FXTAP5_SGN
XCVR_C10_RX_ADP_DFE_FXTAP6
XCVR_C10_RX_ADP_DFE_FXTAP6_SGN
XCVR_C10_RX_ADP_DFE_FXTAP7
XCVR_C10_RX_ADP_DFE_FXTAP7_SGN
XCVR_C10_RX_ADP_DFE_FXTAP8
XCVR_C10_RX_ADP_DFE_FXTAP8_SGN
XCVR_C10_RX_ADP_DFE_FXTAP9
XCVR_C10_RX_ADP_DFE_FXTAP9_SGN
XCVR_C10_RX_ADP_LFEQ_FB_SEL
XCVR_C10_RX_ADP_ONETIME_DFE
XCVR_C10_RX_ADP_VGA_SEL
XCVR_C10_RX_ADP_VREF_SEL
XCVR_C10_RX_BYPASS_EQZ_STAGES_234
XCVR_C10_RX_EQ_BW_SEL
XCVR_C10_RX_EQ_DC_GAIN_TRIM
XCVR_C10_RX_INPUT_VCM_SEL
XCVR_C10_RX_LINK
XCVR_C10_RX_OFFSET_CANCELLATION_CTRL
XCVR_C10_RX_ONE_STAGE_ENABLE
XCVR_C10_RX_POWER_MODE
XCVR_C10_RX_QPI_ENABLE
XCVR_C10_RX_RX_SEL_BIAS_SOURCE
XCVR_C10_RX_SD_OUTPUT_OFF
XCVR_C10_RX_SD_OUTPUT_ON
XCVR_C10_RX_SD_THRESHOLD
XCVR_C10_RX_TERM_SEL
XCVR_C10_RX_TERM_TRI_ENABLE
XCVR_C10_RX_UC_RX_DFE_CAL
XCVR_C10_RX_VCCELA_SUPPLY_VOLTAGE
XCVR_C10_RX_VCM_CURRENT_ADD
XCVR_C10_RX_VCM_SEL
XCVR_C10_RX_XRX_PATH_ANALOG_MODE
XCVR_C10_TX_COMPENSATION_EN
XCVR_C10_TX_DCD_DETECTION_EN
XCVR_C10_TX_DPRIO_CGB_VREG_BOOST
XCVR_C10_TX_LINK
XCVR_C10_TX_LOW_POWER_EN
XCVR_C10_TX_POWER_MODE
XCVR_C10_TX_PRE_EMP_SIGN_1ST_POST_TAP
XCVR_C10_TX_PRE_EMP_SIGN_2ND_POST_TAP
XCVR_C10_TX_PRE_EMP_SIGN_PRE_TAP_1T
XCVR_C10_TX_PRE_EMP_SIGN_PRE_TAP_2T
XCVR_C10_TX_PRE_EMP_SWITCHING_CTRL_1ST_POST_TAP
XCVR_C10_TX_PRE_EMP_SWITCHING_CTRL_2ND_POST_TAP
XCVR_C10_TX_PRE_EMP_SWITCHING_CTRL_PRE_TAP_1T
XCVR_C10_TX_PRE_EMP_SWITCHING_CTRL_PRE_TAP_2T
XCVR_C10_TX_RES_CAL_LOCAL
XCVR_C10_TX_RX_DET
XCVR_C10_TX_RX_DET_OUTPUT_SEL
XCVR_C10_TX_RX_DET_PDB
XCVR_C10_TX_SLEW_RATE_CTRL
XCVR_C10_TX_TERM_CODE
XCVR_C10_TX_TERM_SEL
XCVR_C10_TX_UC_DCD_CAL
XCVR_C10_TX_UC_GEN3
XCVR_C10_TX_UC_GEN4
XCVR_C10_TX_UC_SKEW_CAL
XCVR_C10_TX_UC_TXVOD_CAL
XCVR_C10_TX_UC_TXVOD_CAL_CONT
XCVR_C10_TX_UC_VCC_SETTING
XCVR_C10_TX_USER_FIR_COEFF_CTRL_SEL
XCVR_C10_TX_VOD_OUTPUT_SWING_CTRL
XCVR_C10_TX_XTX_PATH_ANALOG_MODE</pre>
<p>So yes, now we&#8217;re looking for strings in a binary file.</p>
<h3>And yet, all this doesn&#8217;t necessarily help</h3>
<p>With all these hints, there&#8217;s still some pure guesswork. For example, I tried</p>
<pre>set_instance_assignment -name XCVR_C10_RX_SD_OUTPUT_ON 3 -to gtx_rxp</pre>
<p>and the fitter gave me</p>
<pre>    Error (15744): The settings must match one or more of these conditions:
    Error (15744): ( sup_mode == ENGINEERING_MODE ) OR ( prot_mode != SATA_RX ) OR ( sd_output_on == DATA_PULSE_6 )
    Error (15744): But the following assignments violate the above conditions:
    Error (15744): sup_mode = USER_MODE
    Error (15744): prot_mode = SATA_RX
    Error (15744): sd_output_on = DATA_PULSE_10 -- Set by Pin Assignment "XCVR_<span style="color: #ff0000;"><strong>A10</strong></span>_RX_SD_OUTPUT_ON" (QSF Name "XCVR_<span style="color: #ff0000;"><strong>A10</strong></span>_RX_SD_OUTPUT_ON")</pre>
<p>So first, let&#8217;s notice that it blames an XCVR_A10_* assignment, even though I used an XCVR_C10_* assignment in the QSF file. Really.</p>
<p>Also note the hint that setting sup_mode to ENGINEERING_MODE would have let us off the hook. More on that below (however don&#8217;t expect much).</p>
<p>But how did assigning XCVR_C10_RX_SD_OUTPUT_ON the integer 3 turn into DATA_PULSE_10? Maybe look in PMA_RegMap.csv, mentioned above? But no, DATA_PULSE_10 is assigned 5&#8242;b01110 as a value to write to a register, and it&#8217;s the 5th value listed, so no matter if you count from zero or one, 3 is not the answer.</p>
<p>Maybe ct2_pma_rx_sd_simple.tcl, also mentioned above? That helps even less, as there&#8217;s no sign there that DATA_PULSE_10 would be special. In short, just play with the integer value until hitting gold. Or even better, don&#8217;t assign anything, and use the default.</p>
<p>Likewise, setting</p>
<pre>set_instance_assignment -name XCVR_C10_RX_SD_OUTPUT_OFF 6 -to gtx_rxp</pre>
<p>yields</p>
<pre>    Error (15744): In atom 'frontend_ins|xcvr_inst|xcvr_native_a10_0|xcvr_native_a10_0|g_xcvr_native_insts[0].twentynm_xcvr_native_inst|twentynm_xcvr_native_inst|inst_twentynm_pma|gen_twentynm_hssi_pma_rx_sd.inst_twentynm_hssi_pma_rx_sd'
    Error (15744): The settings must match one or more of these conditions:
    Error (15744): ( sup_mode == ENGINEERING_MODE ) OR ( prot_mode != SATA_RX ) OR ( sd_output_off == CLK_DIVRX_2 )
    Error (15744): But the following assignments violate the above conditions:
    Error (15744): sup_mode = USER_MODE
    Error (15744): prot_mode = SATA_RX
    Error (15744): sd_output_off = CLK_DIVRX_7 -- Set by Pin Assignment "XCVR_A10_RX_SD_OUTPUT_OFF" (QSF Name "XCVR_A10_RX_SD_OUTPUT_OFF")</pre>
<h3>Attempting to enable Engineering Mode</h3>
<p>Since ENGINEERING_MODE is often mentioned in the fitter&#8217;s error messages, I thought maybe enabling it could silence these errors and allow wider options. For example, I attempted to enable the Electrical Idle state on the transmission wires of a non-PCIe transceiver by editing one of the files generated by the transceiver IP tools (xcvr_xcvr_native_a10_0.v), changing the line saying</p>
<pre>.hssi_tx_pcs_pma_interface_bypass_pma_txelecidle("true"),</pre>
<p>to</p>
<pre>.hssi_tx_pcs_pma_interface_bypass_pma_txelecidle("false"),</pre>
<p>but the fitter threw the following error:</p>
<pre>    Error (15744): In atom 'xcvr_inst|xcvr_native_a10_0|xcvr_native_a10_0|g_xcvr_native_insts[0].twentynm_xcvr_native_inst|twentynm_xcvr_native_inst|inst_twentynm_pcs|gen_twentynm_hssi_tx_pcs_pma_interface.inst_twentynm_hssi_tx_pcs_pma_interface'
    Error (15744): The settings must match one or more of these conditions:
    Error (15744): ( sup_mode == ENGINEERING_MODE ) OR ( bypass_pma_txelecidle == TRUE ) OR ( pcie_sub_prot_mode_tx != OTHER_PROT_MODE )
    Error (15744): But the following assignments violate the above conditions:
    Error (15744): sup_mode = USER_MODE
    Error (15744): bypass_pma_txelecidle = FALSE
    Error (15744): pcie_sub_prot_mode_tx = OTHER_PROT_MODE</pre>
<p>So it tells me that if I want bypass_pma_txelecidle as &#8220;false&#8221;, I have to either set pcie_sub_prot_mode_tx to one of the PCIe modes, or set sup_mode to ENGINEERING_MODE. Changing pcie_sub_prot_mode_tx is out of the question, because the only way to settle the conflicts reported by the fitter is to make the entire transceiver follow the predefined PCIe settings. Had I been able to go that path, I would have done that long ago.</p>
<p>So I tried switching to Engineering Mode, whatever that means, by editing the same file, changing</p>
<pre>.hssi_tx_pcs_pma_interface_sup_mode("user_mode"),</pre>
<p>to</p>
<pre>.hssi_tx_pcs_pma_interface_sup_mode("engineering_mode"),</pre>
<p>but the fitter really didn&#8217;t like that:</p>
<pre>    Error (15744): In atom 'xcvr_inst|xcvr_native_a10_0|xcvr_native_a10_0|g_xcvr_native_insts[0].twentynm_xcvr_native_inst|twentynm_xcvr_native_inst|inst_twentynm_pcs|gen_twentynm_hssi_tx_pcs_pma_interface.inst_twentynm_hssi_tx_pcs_pma_interface'
    Error (15744): The settings must match one or more of these conditions:
    Error (15744): ( sup_mode OR ( sup_mode == USER_MODE )
    Error (15744): But the following assignments violate the above conditions:
    Error (15744): sup_mode = ENGINEERING_MODE</pre>
<p>This is somewhat cryptic, because it implies that sup_mode could just evaluate as &#8220;true&#8221; in some way. Anyhow, selecting ENGINEERING_MODE was rejected flat out, so that&#8217;s not an option for us regular people. There&#8217;s probably some secret-sauce method to allow this, but that goes beyond what&#8217;s sensible for working around the tools&#8217; restrictions.</p>
<h3>Conclusion</h3>
<p>Setting up a Cyclone 10 transceiver for a use other than specifically intended by the FPGA&#8217;s vendor is a visit to no man&#8217;s land. Reverse engineering Quartus does help to some extent, but some issues are left to guesswork.</p>
<p>And the transceiver itself appears to be a step backwards compared with the V-series FPGAs. It may reach higher rates, but that comes at a cost. Or maybe it&#8217;s related to the different silicon process. One way or another, it&#8217;s not all that impressive.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2021/10/arria-cyclone-10-signal-detect-oob/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>FPGA + USB 3.0: Cypress EZ-USB FX3 or XillyUSB?</title>
		<link>https://billauer.se/blog/2020/11/fpga-cypress-fx3-xillyusb/</link>
		<comments>https://billauer.se/blog/2020/11/fpga-cypress-fx3-xillyusb/#comments</comments>
		<pubDate>Wed, 25 Nov 2020 05:11:25 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[FPGA]]></category>
		<category><![CDATA[GTX]]></category>
		<category><![CDATA[USB]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6182</guid>
		<description><![CDATA[Introduction As the title implies, this post compares two solutions for connecting an FPGA to a host via USB 3.0: Cypress&#8217; FX3 chipset, which has been around since around 2010, and the XillyUSB IP core, which was released in November 2020. Cypress has been acquired by Infineon, but I&#8217;ll stick with Cypress. It&#8217;s not clear [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>As the title implies, this post compares two solutions for connecting an FPGA to a host via USB 3.0: <a href="https://www.cypress.com/products/ez-usb-fx3-superspeed-usb-30-peripheral-controller" target="_blank">Cypress&#8217; FX3 chipset</a>, which has been around since around 2010, and the <a href="http://xillybus.com/xillyusb/" target="_blank">XillyUSB IP core</a>, which was released in November 2020.</p>
<p>Cypress has been acquired by Infineon, but I&#8217;ll stick with Cypress. It&#8217;s not clear if the products are going to be re-branded (like Intel did with Altera, for example).</p>
<p>Since I&#8217;m openly biased towards XillyUSB, let&#8217;s be fair enough and start with its disadvantages. The first and obvious one is how long it&#8217;s been around compared with the FX3. Another thing is that XillyUSB won&#8217;t fall back to USB 2.0 if a USB 3.0 link fails to establish. This fallback option is important in particular because computers&#8217; USB 3.x ports are sometimes of low quality, so even though the user expected to benefit from USB 3.x speed, the possibility of plugging the device into a non-USB 3.x port can save the day.</p>
<p>This is however relevant only for applications that are still useful with USB 2.0, e.g. hard disks, USB sticks and Ethernet adapters &#8212; these still work, but do benefit from a faster connection when possible. If the application inherently needs payload speeds above 25 MBytes/s, it&#8217;s USB 3.0 or perish.</p>
<p>Thirdly, XillyUSB requires an FPGA with an MGT supporting 5 Gb/s. Low-cost FPGAs don&#8217;t have one. But from a BOM cost point of view, odds are that upgrading the FPGA costs less than adding the FX3 device along with its supporting components.</p>
<p>Finally, a not completely related comment: USB is good for hotpluggable, temporary connections. If a fixed link is required between an FPGA and some kind of computer, <a href="http://xillybus.com/tutorials/usb-compared-pcie" target="_blank">PCIe is most likely a better choice</a>, possibly using <a href="http://xillybus.com/doc/xilinx-pcie-principle-of-operation" target="_blank">Xillybus&#8217; IP core for PCIe</a>. Compared with USB 2.0, it might sound like a scary option, and PCIe isn&#8217;t always supported by embedded devices. But if USB 3.x is an option, odds are that PCIe is too. And a better one, unless hotplugging is a must.</p>
<h3>FX3: Another device, another processor, another API and SDK</h3>
<p>XillyUSB is an IP core, and hence resides side-by-side with the application logic on the FPGA. It requires a small number of pins for its functionality: Two differential wire pairs to the USB connector, and an additional pair of wires to a low-jitter reference clock. A few GPIO LEDs are recommended for status indications, but are not mandatory. The chances for mistakes in the PCB design are therefore relatively slim.</p>
<p>By contrast, using the FX3 requires following a 30+ pages hardware design application note (Cypress&#8217; <a href="https://www.cypress.com/documentation/application-notes/an70707-ez-usb-fx3-fx3s-hardware-design-guidelines-and-schematic" target="_blank">AN70707</a>) to ensure proper operation of that device. As for FPGA pin consumption, a minimum of 40 pins is required to attain 400 MB/s of data exchange through a slave FIFO (e.g. 200 MB/s in each direction, half the link capacity), since the parallel data clock is limited to 100 MHz.</p>
<p>It doesn&#8217;t end there: The FX3 contains an ARM9 processor for which firmware must be developed. This firmware may produce USB traffic by itself, or configure the device to expose a slave FIFO interface for streaming data from and to the FPGA. One way or another, code for the ARM processor needs to be developed in order to carry out the desired configuration, at a minimum.</p>
<p>This is done with Cypress&#8217; SDK and based upon coding examples, but there&#8217;s no way around this extra firmware task, which requires detailed knowledge on how the device works. For example, to turn off the FX3&#8242;s LPM capability (which is a good idea in general), the CyU3PUsbLPMDisable() API function should be called. And there are many more of this sort.</p>
<h3>Interface with application logic in the FPGA</h3>
<p>XillyUSB follows <a href="http://xillybus.com/doc/xilinx-pcie-principle-of-operation" target="_blank">Xillybus&#8217; paradigm</a> regarding interface with application logic: There&#8217;s a standard synchronous FIFO between the application logic and the XillyUSB IP core for each data stream, and the application logic uses it mindlessly: For an FPGA-to-host stream, the application logic just pushes the data into the FIFO (checking that it&#8217;s not full), knowing it will reach the host in a timely manner. For the opposite direction, it reads from the FIFO when it&#8217;s non-empty.</p>
<p>In other words, the application logic interfaces with these FIFOs like FPGA designers are used to, for the sake of streaming data between different functional modules in a design. There is no special attention required because the destination or source of the data is a USB data link.</p>
<p>The FX3&#8242;s slave FIFO interface may sound innocent, but it&#8217;s a parallel data and control signal interface, allowing the FPGA to issue read and write commands on buffers inside the FX3. This requires developing logic for a controller that interfaces with the slave FIFO interface: Selection of the FX3 buffer to work with, sense its full or empty status (depending on the direction) and transfer data with this synchronous interface.  If more than one data stream is required between the FPGA and the host, this controller also needs to perform scheduling and multiplexing. State machines, buffering of data, arbitration, the whole thing.</p>
<p>Even though a controller of this sort may seem trivial, it&#8217;s often this type of logic that is exposed to corner cases regarding flow of data: The typical randomness of data availability on one side and the ability to receive it on the other, creates scenarios that are difficult to predict, simulate and test. Obtaining a bulletproof controller of this sort is therefore often significantly more difficult than designing one for a demo.</p>
<p>When working with XillyUSB (or any other Xillybus IP core), the multiplexing is done inside the IP core: Designed, tested and fine polished once and for all. And this opens for another advantage: Making changes to the data stream setting, and adding streams to an existing design is simple and doesn&#8217;t jeopardize the stability of the already existing logic. Thanks to Xillybus&#8217; IP Core Factory, this only requires some simple operations on the website and downloading the new IP core. Its deployment in the FPGA design merely consists of replacing files, making trivial changes in the HDL following a template, and adding a standard FPGA FIFO for the new stream. Nothing else in the logic design changes, so there are no side effects.</p>
<h3>Host software design</h3>
<p>The FX3&#8242;s scope in the project ends at presenting a USB device; the driver has to be written more or less from scratch. So the host software, whether as a kernel driver or a libusb user-space implementation, must be written with USB transfers as the main building block. For a reasonable data rate (or else why USB 3.0?), the software design must be asynchronous: Requests are queued for submission, and completer functions are called when these requests are completed. The simple wait-until-done method doesn&#8217;t work, because it leads to long time gaps of no communication on the USB link. Aside from the obvious impact on bandwidth utilization, this is likely to cause overflows or underflows in the FPGA&#8217;s buffers.</p>
<p>With XillyUSB (and once again, with other Xillybus IP cores too), a single, catch-all driver presents pipe-like device files. Plain command-line  utilities like &#8220;cat&#8221; and &#8220;dd&#8221; can be used to implement reliable and  practical data acquisition and playback. The XillyUSB IP core and the  dedicated driver use the transfer-based USB protocol for creating an  abstraction of a simple, UNIX-like data stream.</p>
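<p>For example (the device file names below are placeholders; the actual names depend on the streams that were defined for the IP core), data acquisition and playback boil down to plain shell commands:</p>
<pre>$ <strong>cat /dev/xillyusb_read_32 > capture.bin</strong>
$ <strong>dd if=samples.bin of=/dev/xillyusb_write_32 bs=65536</strong></pre>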
<h3>FPGA application logic: USB transfers or continuous data?</h3>
<p>The USB specification was written with well-defined transfers in mind. The underlying idea was that the host allocates a buffer and queues a data transfer request, related to a certain USB endpoint, to or from that buffer. For continuous communication, several transfers can be queued. Yet, there are data buffers of fixed size, each waiting for its turn.</p>
<p>Some data sinks and sources are naturally organized in defined chunks of data, and fit USB&#8217;s concept well. From a software design&#8217;s point of view, it&#8217;s simpler to comprehend a mechanism that relies on fixed-sized buffers, requests and fulfillments.</p>
<p>But then, what is natural in an FPGA design? In most applications, continuous, non-packeted data is the common way. Even video applications, where there&#8217;s a clear boundary between frames, are usually implemented with regular FIFOs between the internal logic block. With XillyUSB, this is the way the data flows: FIFOs on the FPGA and pipe-like device files on the host side.</p>
<p>With FX3, on the other hand,  the USB machinery needs direct attention. For example: When transmitting data towards the host, FX3&#8242;s slave FIFO interface requires asserting PKTEND# in order to commit the data to the host, which may also issue a zero-length packet instead. This complication is necessary to maintain USB&#8217;s concept of a transfer: Sending a USB DATA packet shorter than the maximal allowed length tells the host that the transfer is finished, even if the buffer that was allocated for the transfer isn&#8217;t filled. Therefore, the FX3 can&#8217;t just send whatever data it has in the buffer because it has nothing better to do. Doing so would terminate the transfer, which can mean something in the protocol between the driver and its device.</p>
<p>But then, if the transfer request buffer&#8217;s size isn&#8217;t a multiple of the maximal USB DATA packet size (1024 bytes for USB 3.0), PKTEND# must be asserted before this buffer fills, or a USB protocol error occurs, as the device sends more data than can be stored. The USB protocol doesn&#8217;t allow the leftovers to be stored in the next queued transfer&#8217;s buffer, and it&#8217;s not even clear if such transfer is queued.</p>
<p>If this example wasn&#8217;t clear because of too much new terminology, no problem, that was exactly the point: The USB machinery one needs to be aware of.</p>
<h3>Physical link diagnostics</h3>
<p>As a USB device can be connected to a wide range of USB host controllers, on various motherboards, through a wide range of USB cables, the quality of the bitstream link may vary. On a good day it&#8217;s completely error-free, but sometimes it&#8217;s a complete mess.</p>
<p>Low-level errors don&#8217;t necessarily cause immediate problems, and sometimes the visible problems don&#8217;t look like a low-level link issue. The USB protocol is designed to keep the show running to the extent possible (retransmits and whatnot), so what appear to be occasional problems with a USB device could actually be a bad link all along, with random clusters of mishaps that make the problem visible every now and then.</p>
<p>Monitoring the link&#8217;s health is therefore beneficial, both in a lab situation and in a deployed product. The application software can collect error event information, and warn the user that even though all seems well, it&#8217;s advisable to try a different USB port or cable. Sometimes, that&#8217;s all it takes.</p>
<p>XillyUSB <a href="http://xillybus.com/tutorials/xillyusb-leds-diagnostics" target="_blank">provides</a> a simple means for telling that something is wrong. There&#8217;s an output from the IP core, intended for a plain LED, that flashes briefly for each error event that is detected. There are more detailed LEDs as well. Also, the XillyUSB driver <a href="http://xillybus.com/tutorials/xillyusb-leds-diagnostics" target="_blank">creates a dedicated device file</a>, from which diagnostic data can be read with a simple file operation. This diagnostic data chunk mainly consists of event counters for different error situations, which can be viewed with a utility that is downloaded along with XillyUSB&#8217;s driver for Linux. Likewise, a simple routine in an application suite can perform this monitoring for the sake of informing users about a problematic hardware setup.</p>
<p>Cypress&#8217; FX3 does provide some error information of this sort, however it&#8217;s exposed to the ARM processor inside the device itself. The SDK supplies functions such as CyU3PUsbInitEventLog() for enabling event logging and CyU3PUsbGetErrorCounts() for obtaining error counts, but it&#8217;s the duty of the ARM&#8217;s firmware to transfer this data to the host. And then some kind of driver and utility are needed on the host as well.</p>
<p>The documentation for error counting is <a href="https://community.cypress.com/docs/DOC-16273" target="_blank">somewhat minimal</a>, but looking at the definition of <a href="https://www.cypress.com/file/134661/download" target="_blank">LNK_PHY_ERROR_CONF in the EZ-USB FX3 Technical Reference Manual</a> helps.</p>
<h3>Bugs and Errata</h3>
<p>As always when evaluating a component for use, it&#8217;s suggested to read through the errata section in <a href="https://www.cypress.com/file/140296/download" target="_blank">FX3&#8217;s datasheet</a>. In particular, there&#8217;s a known problem causing errors in payload data towards the host, for which there is no planned fix. It occurs when a Zero Length Packet is followed by data &#8220;very quickly&#8221;, i.e. within a microframe of 125 μs.</p>
<p>So first, 125 μs isn&#8217;t &#8220;very quickly&#8221; in USB 3.0 terms. It&#8217;s the time corresponding to 62.5 kBytes of the link&#8217;s raw bandwidth, which is a few dozen DATA IN packets. Second, a zero length packet is something that is sent to finish a USB transfer. One can avoid it in some situations, but not in others. For example, if the transfer&#8217;s length is a multiple of 1024 bytes, the only way to finish it explicitly is with a zero length packet. This erratum requires not sending any data for 125 μs after such an event, or there will be data errors.</p>
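<p>For a sanity check, here&#8217;s the arithmetic behind these figures (my own back-of-the-envelope calculation): 5 Gb/s on wire, minus the 8b/10b encoding overhead, leaves 500 MB/s, so a 125 μs microframe corresponds to 62.5 kB, or about 61 maximal-length DATA packets.</p>

```python
# Back-of-the-envelope arithmetic (mine, not from Cypress' docs) for
# why 125 us isn't "very quickly" on a USB 3.0 link.

wire_rate = 5e9                        # USB 3.0 raw bit rate, bits/s
payload_rate = wire_rate * 8 / 10 / 8  # bytes/s after 8b/10b encoding
microframe = 125e-6                    # seconds

bytes_per_microframe = payload_rate * microframe
packets = bytes_per_microframe / 1024  # maximal (1024-byte) DATA packets

print(bytes_per_microframe)            # 62500 bytes, i.e. 62.5 kB
print(packets)                         # about 61 packets
```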
<p>This doesn&#8217;t just make the controller more complicated; it also carries a significant bandwidth penalty.</p>
<p>It may not be worth much to point out that XillyUSB has no bug of this sort, as it has been extensively tested with randomized data sources and sinks. It&#8217;s in fact quite odd that Cypress evidently didn&#8217;t perform tests of this sort (or they would have caught that bug easily).</p>
<p>The crucial difference is however that bugs in an IP core can be fixed and deployed quickly. There is no new silicon device to release, and no need to replace a physical device on the PCB.</p>
<p>No design is born perfect. The question is to what extent the issues that arise are fixed.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2020/11/fpga-cypress-fx3-xillyusb/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Ultrascale GTH transceivers: Advanced doesn&#8217;t necessarily mean better</title>
		<link>https://billauer.se/blog/2020/09/xilinx-gth-gtx-gtp-eye-scan/</link>
		<comments>https://billauer.se/blog/2020/09/xilinx-gth-gtx-gtp-eye-scan/#comments</comments>
		<pubDate>Wed, 09 Sep 2020 17:11:41 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[FPGA]]></category>
		<category><![CDATA[GTX]]></category>
		<category><![CDATA[USB]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6136</guid>
		<description><![CDATA[Introduction I tend to naturally assume that newer FPGAs will perform better in basically everything, and that the heavier hammers are always better. Specifically, I expect the GTX / GTH / GT-whatever to perform better with the newer FPGAs (not just higher rates, but simply work better) and that their equalizers will be able to [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>I tend to naturally assume that newer FPGAs will perform better in basically everything, and that the heavier hammers are always better. Specifically, I expect the GTX / GTH / GT-whatever to perform better with the newer FPGAs (not just higher rates, but simply work better) and that their equalizers will be able to handle lousier input signals. And that the DFE equalizer will perform better than its little brother, LPM, in particular when the signal has been through some stuff.</p>
<p>And then there&#8217;s reality. This post summarizes my own findings with a USB 3.0 (SuperSpeed) link from the host to the FPGA, running at 5 Gb/s raw data rate on wire, with scrambler enabled. There is no official support for USB 3.0 by Xilinx&#8217; transceivers, however the link parameters resemble those of SATA (in particular the SSC clocking without access to the clock), so I used the recommended settings for SATA, except for changing the data rate and reference clock frequency.</p>
<p>I&#8217;ll focus on Ultrascale&#8217;s GTH transceiver as well as the DFE equalizer, neither of which performed as I expected.</p>
<p>There&#8217;s a brief explanation on equalizers and related issues at the bottom of this post, for those who need some introduction.</p>
<p>And ah, not directly related, but if a complete design example with an Ultrascale GTH would help, <a href="https://www.xillybus.com/xillyp2p/optical-fiber-ultrascale-gth" target="_blank">here&#8217;s one</a>. Actually, there&#8217;s also <a href="https://www.xillybus.com/xillyp2p/optical-fiber-7-series-fpga-gtx" target="_blank">the same for earlier FPGAs</a> (7-series).</p>
<h3>Choosing insertion loss on Ultrascale</h3>
<p>The Transceiver IP Wizard for Ultrascale and Ultrascale+ has a crucial difference regarding the receiver: under the &#8220;Advanced&#8221; section, which is hidden by default, the physical characteristics of the channel can be set. Among others, the equalizer can be selected between &#8220;Auto&#8221; (default), &#8220;LPM&#8221; and &#8220;DFE&#8221;. This selection can be made with the Wizard for Kintex-7 and Virtex-7 FPGAs as well, but there&#8217;s a new setting in the Ultrascale Wizard: the insertion loss at the Nyquist frequency.</p>
<p>The default for Ultrascale, with the SATA preset, is 14 dB insertion loss, with the equalizer set to Auto. The actual result is that the GTH is configured automatically by the Wizard to use the LPM equalizer. The insertion loss is quite pessimistic for a SATA link (and USB 3.0 as well), but that doesn&#8217;t matter so much, given that LPM was chosen. And it works fine.</p>
<p>But knowing that I&#8217;m going to have reflections on the signal, I changed the equalizer from &#8220;Auto&#8221; to &#8220;DFE&#8221;. I was under the wrong impression that the insertion loss was only a hint for the automatic selection between LPM and DFE, so I didn&#8217;t give it any further attention. The result was really poor channel performance. Lots of bit errors.</p>
<p>Investigating this, I found out that while the insertion loss setting doesn&#8217;t make any difference with the LPM equalizer (at least not in the range between 0 and 14 dB), it does influence the behavior of DFE. Namely, if the insertion loss is 10 dB and below, the DFE&#8217;s AGC component is disabled, and a fixed gain is assigned instead. More precisely, the GTHE3_CHANNEL primitive&#8217;s RXDFEAGCOVRDEN port is assigned a logic &#8217;1&#8242;, and the RXDFE_GC_CFG2 instantiation parameter is set to 16&#8242;b1010000 instead of all zeros.</p>
<p>So apparently, the DFE&#8217;s AGC doesn&#8217;t function properly unless the signal arrives with significant attenuation. This isn&#8217;t problematic when the physical link is fixed, and the insertion loss can be calculated from the PCB&#8217;s simulation. However when the link involves a user-supplied cable, such as the cases of USB 3.0 and SATA, this is an unknown figure.</p>
<p>Given that the insertion loss of cables is typically quite low, it makes sense to pick an insertion loss of 10 dB or less if DFE is selected. Or just go for LPM, which is configured exactly the same by the Wizard, regardless of the insertion loss setting (for the 0 dB to 14 dB range, at least). As the eye scans below show, the DFE wasn&#8217;t such a star anyhow.</p>
<p>In this context, it&#8217;s interesting that the Wizard for 7-series FPGAs (Kintex-7, Virtex-7 and Artix-7) doesn&#8217;t ask about insertion loss. You may select DFE or LPM, but there&#8217;s no need to be specific on that figure. So it seems like this is a workaround for a problem with the DFE on Ultrascale&#8217;s transceivers.</p>
<h3>DFE vs. LPM on Ultrascale</h3>
<p>As the eye scans shown below reveal, it turns out that DFE isn&#8217;t necessarily better than LPM on an Ultrascale FPGA. This is somewhat surprising, since LPM consists of a frequency response correction filter only, while the transceiver&#8217;s DFE option includes that on top of the DFE equalizer (according to the user guide). One could therefore expect that DFE would have a better result, in particular with a link that clearly produces reflections.</p>
<p>This, along with the Wizard&#8217;s mechanism for turning off the AGC for stronger signals, seems to indicate that the DFE didn&#8217;t turn out all that well on Ultrascale devices, and that it&#8217;s better avoided. Given that it gave no benefit with a 5 Gb/s signal that goes through quite a few discontinuities, it&#8217;s questionable whether there is a scenario for which it&#8217;s actually the preferred choice.</p>
<h3>Eye scans: General</h3>
<p>I&#8217;ve made a lot of statistical eye scans for the said USB channel. This mechanism is implemented by Xilinx&#8217; transceivers themselves, and is described in the respective user guides. In a nutshell, these scans show how the bit error rate is affected by moving the sampling time away from the point that the CDR locks on, as well as by adding a voltage offset to the detection threshold. The cyan-colored spot in the middle of the plots shows the region of zero bit errors, and hence its size displays the margins in time and voltage for retaining zero BER.</p>
<p>The important part is the margin in time. In the plots below, UI is the time unit used. One UI corresponds to a bit&#8217;s period (i.e. the following bit appears at UI = 1). The vertical axis is less well defined and less important, since voltage is meaningless: It can be amplified as needed. The shape of the eye plot can however give a hint sometimes about certain problems.</p>
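<p>For illustration only, here&#8217;s a sketch (hypothetical code, not the transceivers&#8217; actual interface) of how the time margin can be read off such a scan: take the error counts measured along the zero-voltage-offset line, and measure the contiguous zero-error run around the center sampling point.</p>

```python
# A hypothetical sketch (not the transceivers' actual interface) of
# extracting the time margin from an eye scan: measure the width of
# the contiguous zero-error region around the center sampling point.

def time_margin_ui(errors_at_zero_volt, ui_offsets):
    """errors_at_zero_volt[i] is the error count measured at sampling
    offset ui_offsets[i] (in UI, center = 0.0). Returns the width, in
    UI, of the contiguous zero-error region containing the center."""
    center = min(range(len(ui_offsets)), key=lambda i: abs(ui_offsets[i]))
    if errors_at_zero_volt[center] != 0:
        return 0.0                     # no margin at all
    lo = hi = center
    while lo > 0 and errors_at_zero_volt[lo - 1] == 0:
        lo -= 1
    while hi < len(ui_offsets) - 1 and errors_at_zero_volt[hi + 1] == 0:
        hi += 1
    return ui_offsets[hi] - ui_offsets[lo]

# A scan with errors outside +/- 0.25 UI from the sampling point:
offsets = [i / 16 for i in range(-8, 9)]     # -0.5 .. 0.5 UI
errors = [0 if abs(o) <= 0.25 else 5 for o in offsets]
print(time_margin_ui(errors, offsets))       # 0.5 (UI)
```

<p>The wider this run, the more timing slack the sampling point has before bit errors appear.</p>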
<p>The plots in this post were all made on a USB 3.0 data stream (running at the standard 5 Gb/s with scrambler applied), created by a Renesas uPD720202 USB controller (PCI ID 1912:0015), received by the FPGA&#8217;s transceiver.</p>
<p>The physical connection, except for PCB traces, involved a Type A connector, connected to a Micro B connector with a high-quality 1 meter USB cable. The Micro-B connector is part of an sfp2usb adapter, which physically connects the signal to the SFP+ connector inside an SFP+ cage, which in turn is connected directly to the FPGA. The signal traces of the sfp2usb adapter are about the length of the SFP+ cage.</p>
<p>So overall, it&#8217;s the USB controller chip, PCB trace, USB type A connector mating, 1 meter of cable, Micro B connector mating, a short PCB trace on the sfp2usb adapter, an SFP+ connector mating, PCB trace on the FPGA board reaching the FPGA&#8217;s transceiver.</p>
<p>The Renesas USB controller was selected over other options because it showed relatively low signal quality compared with other USB signal sources. The differences between the equalization options are more apparent with this source; the other sources nevertheless all gave similar results.</p>
<p>Needless to say, testing at a specific rate with specific equipment doesn&#8217;t prove anything about the general quality of the transceivers, and yet 5 Gb/s represents a medium-rate channel quite well.</p>
<p>The FPGA boards used:</p>
<ul>
<li>Xilinx KCU105 for Kintex Ultrascale</li>
<li>Xilinx KC705 for Kintex-7</li>
<li>Trenz TE0714 for Artix-7 with carrier board having an SFP+ cage</li>
</ul>
<p>I used some home-cooked logic for making the eye scans and Octave to produce the plots, so if the format doesn&#8217;t look familiar, that&#8217;s why.</p>
<h3>LPM vs. DFE with Ultrascale GTH</h3>
<p>This is the eye scan plot for LPM (click to enlarge the plots):</p>
<div id="attachment_6137" class="wp-caption aligncenter" style="width: 310px"><a href="https://billauer.se/blog/wp-content/uploads/2020/09/ultrascale-lpm-0dB.png"><img class="size-medium wp-image-6137" title="Eye scan with Ultrascale GTH, LPM equalizer, 5 Gb/s" src="https://billauer.se/blog/wp-content/uploads/2020/09/ultrascale-lpm-0dB-300x225.png" alt="Eye scan with Ultrascale GTH, LPM equalizer, 5 Gb/s" width="300" height="225" /></a><p class="wp-caption-text">Eye scan with Ultrascale GTH, LPM equalizer, 5 Gb/s</p></div>
<p>And this is for DFE, with insertion loss set below the 10 dB threshold:</p>
<div id="attachment_6138" class="wp-caption aligncenter" style="width: 310px"><a href="https://billauer.se/blog/wp-content/uploads/2020/09/ultrascale-dfe-0dB.png"><img class="size-medium wp-image-6138" title="Eye scan with Ultrascale GTH, DFE equalizer, 5 Gb/s, low insertion loss" src="https://billauer.se/blog/wp-content/uploads/2020/09/ultrascale-dfe-0dB-300x225.png" alt="Eye scan with Ultrascale GTH, DFE equalizer, 5 Gb/s, low insertion loss" width="300" height="225" /></a><p class="wp-caption-text">Eye scan with Ultrascale GTH, DFE equalizer, 5 Gb/s, low insertion loss</p></div>
<p>And this is DFE again, with insertion loss set to 14 dB:</p>
<div id="attachment_6139" class="wp-caption aligncenter" style="width: 310px"><a href="https://billauer.se/blog/wp-content/uploads/2020/09/ultrascale-dfe-14dB.png"><img class="size-medium wp-image-6139" title="Eye scan with Ultrascale GTH, DFE equalizer, 5 Gb/s, 14 dB insertion loss" src="https://billauer.se/blog/wp-content/uploads/2020/09/ultrascale-dfe-14dB-300x225.png" alt="Eye scan with Ultrascale GTH, DFE equalizer, 5 Gb/s, 14 dB insertion loss" width="300" height="225" /></a><p class="wp-caption-text">Eye scan with Ultrascale GTH, DFE equalizer, 5 Gb/s, 14 dB insertion loss</p></div>
<p>It&#8217;s quite evident that something went horribly wrong when the insertion loss was set to 14 dB, and hence the AGC was enabled, as explained above. But what is even more surprising is that even with the AGC issue out of the way, the eye scan for DFE is slightly worse than LPM&#8217;s. There are three connectors on the signal path, each making its own reflections. DFE should have done better.</p>
<h3>Comparing DFE scans with Kintex-7&#8242;s GTX</h3>
<p>Here&#8217;s the proper DFE eye scan for Ultrascale&#8217;s GTH again (click to enlarge):</p>
<div id="attachment_6138b" class="wp-caption aligncenter" style="width: 310px"><a href="https://billauer.se/blog/wp-content/uploads/2020/09/ultrascale-dfe-0dB.png"><img class="size-medium wp-image-6138" title="Eye scan with Ultrascale GTH, DFE equalizer, 5 Gb/s, low insertion loss" src="https://billauer.se/blog/wp-content/uploads/2020/09/ultrascale-dfe-0dB-300x225.png" alt="Eye scan with Ultrascale GTH, DFE equalizer, 5 Gb/s, low insertion loss" width="300" height="225" /></a><p class="wp-caption-text">Eye scan with Ultrascale GTH, DFE equalizer, 5 Gb/s, low insertion loss</p></div>
<p>And this is Kintex-7, same channel but with a GTX, which has considerably fewer equalizer taps:</p>
<div id="attachment_6140" class="wp-caption aligncenter" style="width: 310px"><a href="https://billauer.se/blog/wp-content/uploads/2020/09/kintex7-dfe.png"><img class="size-medium wp-image-6140" title="Eye scan with Kintex-7 GTX, DFE equalizer, 5 Gb/s" src="https://billauer.se/blog/wp-content/uploads/2020/09/kintex7-dfe-300x225.png" alt="Eye scan with Kintex-7 GTX, DFE equalizer, 5 Gb/s" width="300" height="225" /></a><p class="wp-caption-text">Eye scan with Kintex-7 GTX, DFE equalizer, 5 Gb/s</p></div>
<p>It&#8217;s quite clear that the zero-BER region is considerably larger on the Kintex-7 eye scan. Never mind the y-axis of the plot, it&#8217;s the time axis that matters, and it&#8217;s clearly wider. Kintex-7 did better than Ultrascale.</p>
<h3>Comparing LPM scans with GTX / GTP</h3>
<p>This is the LPM eye scan for Ultrascale&#8217;s GTH again:</p>
<div id="attachment_6137b" class="wp-caption aligncenter" style="width: 310px"><a href="https://billauer.se/blog/wp-content/uploads/2020/09/ultrascale-lpm-0dB.png"><img class="size-medium wp-image-6137" title="Eye scan with Ultrascale GTH, LPM equalizer, 5 Gb/s" src="https://billauer.se/blog/wp-content/uploads/2020/09/ultrascale-lpm-0dB-300x225.png" alt="Eye scan with Ultrascale GTH, LPM equalizer, 5 Gb/s" width="300" height="225" /></a><p class="wp-caption-text">Eye scan with Ultrascale GTH, LPM equalizer, 5 Gb/s</p></div>
<p>And Kintex-7&#8242;s counterpart:</p>
<div id="attachment_6141" class="wp-caption aligncenter" style="width: 310px"><a href="https://billauer.se/blog/wp-content/uploads/2020/09/kintex7-lpm.png"><img class="size-medium wp-image-6141" title="Eye scan with Kintex-7 GTX, LPM equalizer, 5 Gb/s" src="https://billauer.se/blog/wp-content/uploads/2020/09/kintex7-lpm-300x225.png" alt="Eye scan with Kintex-7 GTX, LPM equalizer, 5 Gb/s" width="300" height="225" /></a><p class="wp-caption-text">Eye scan with Kintex-7 GTX, LPM equalizer, 5 Gb/s</p></div>
<p>It&#8217;s clearly better than Ultrascale&#8217;s scan. Once again, never mind that the zero-BER part looks bigger: Compare the margin in the time axis. Also note that Kintex-7&#8242;s DFE did slightly better than LPM, as expected.</p>
<p>And since Artix-7 is also capable of LPM, here&#8217;s its scan:</p>
<div id="attachment_6142" class="wp-caption aligncenter" style="width: 310px"><a href="https://billauer.se/blog/wp-content/uploads/2020/09/artix7-lpm.png"><img class="size-medium wp-image-6142" title="Eye scan with Artix-7 GTP (LPM equalizer), 5 Gb/s" src="https://billauer.se/blog/wp-content/uploads/2020/09/artix7-lpm-300x225.png" alt="Eye scan with Artix-7 GTP (LPM equalizer), 5 Gb/s" width="300" height="225" /></a><p class="wp-caption-text">Eye scan with Artix-7 GTP (LPM equalizer), 5 Gb/s</p></div>
<p>Surprise, surprise: Artix-7&#8217;s eye scan was the best of all options. The low-cost, low-power device took first prize. And it did so with an extra connector to the carrier board.</p>
<p>Maybe this was pure luck. Maybe it&#8217;s because the scan was obtained with a much smaller board, with possibly less PCB trace congestion. And maybe the LPM on Artix-7 is better because there&#8217;s no DFE on this device, so an extra effort was put into LPM.</p>
<h3>Conclusion</h3>
<p>The main takeaway from this experience of mine is that advanced doesn&#8217;t necessarily mean better. Judging by the results, it seems to be the other way around: Ultrascale&#8217;s GTH is fussier about the signal, losing to Kintex-7&#8242;s GTX, and both lose to Artix-7.</p>
<p>And also: take the insertion loss setting in the Wizard seriously.</p>
<p>As I&#8217;ve already said above, this is just a specific case with specific equipment. And yet, the results turned out anything but intuitive.</p>
<hr />
<h3>Appendix: Equalizers, ISI and Nyquist frequency, really briefly</h3>
<p>First, the Nyquist frequency: it&#8217;s just half the raw bit rate on wire. For example, it&#8217;s 2.5 GHz for a USB SuperSpeed link with a 5 Gb/s raw data rate. The idea behind this term is that the receiver makes one analog sample per bit period, and Nyquist&#8217;s theorem does the rest. But it&#8217;s also typically the frequency at which one can low-pass filter the channel without any significant effect.</p>
<p>Next, what&#8217;s this insertion loss? For those who haven&#8217;t played with RF network analyzers for fun, insertion loss is, for a given frequency, the ratio between the inserted signal power on one side of the cable and/or PCB trace and the power that arrives at the other end. You could call it the frequency-dependent attenuation of the signal. As the frequency rises, this ratio tends to rise (more loss of energy) as this energy turns into radio transmission and heat. Had this power loss been uniform across frequency, it would have been just a plain attenuation, which is simple to correct with an amplifier. The frequency-varying insertion loss results in a distortion of the signal, typically making the originally sharp transitions in time between &#8217;0&#8242; and &#8217;1&#8242; rounder and smeared along possibly several symbol periods.</p>
<p>This smearing effect results in Intersymbol Interference (ISI), which means that when the bit detector samples the analog voltage for determining whether it&#8217;s a &#8217;0&#8242; or &#8217;1&#8242;, this analog voltage is composed of a sum of voltages, depending on several bits. These additional voltage components act a bit like noise and increase the bit error rate (BER). However, this isn&#8217;t really noise (such as the kind picked up by crosstalk or created by the electronics), but rather the effect of a bit&#8217;s analog signal being spread out over a longer time period.</p>
<p>Another, unrelated reason for ISI is reflections: as the analog signal travels as an electromagnetic wave through the PCB copper trace or cable, reflections are created when the medium changes or makes sharp turns. This could be a via on the PCB (or just a poorly designed place on the layout), or a connector, which involves several medium transitions: from the copper trace on the PCB to the connector&#8217;s leg, from one side of the connector to the connector mating with it, and then from the connector&#8217;s leg to the medium that carries the signal further. This is assuming the connector doesn&#8217;t have any internal irregularities.</p>
<p>So ISI is what equalizers attempt to fix. There&#8217;s a relatively simple approach, employed by the linear equalizer. It merely attempts to insert a filter with a frequency response that compensates for the channel&#8217;s insertion loss pattern: the equalizer amplifies high frequencies and attenuates low frequencies. By doing so, some of the insertion loss&#8217; effect is canceled out. This reverse filter is tuned by the equalizer for optimal results, and when this works well, the result is an improvement of the ISI. The linear equalizer doesn&#8217;t help at all regarding reflections, however.</p>
<p>The transmitter can help with this matter by shaping its signal to contain more high-frequency components &#8212; this is called pre-emphasis &#8212; but that&#8217;s a different story.</p>
<p>The DFE (Decision Feedback Equalizer) attempts to fix the ISI directly. It&#8217;s designed with the assumption that the transmitted bits in the channel are completely random (which is often ensured by applying a scrambler to the bit stream). Hence the voltages that are sampled by the bit detector should be linearly uncorrelated, and when there is such correlation, it&#8217;s attributed to ISI.</p>
<p>This equalizer cancels the correlation between the bit that is currently detected and the analog voltages of a limited number of bits that will be detected after it. This is done by adding or subtracting a certain voltage for each of the signal samples that are used for detecting the bits after the current one. The magnitude of this operation (which can be negative, of course) depends on the time distance between the current bit and the one manipulated. Whether it&#8217;s an addition or subtraction depends on whether the current bit was detected as a &#8217;0&#8242; or &#8217;1&#8242;.</p>
<p>The result is hence that when the sample arrives at the bit detector, it&#8217;s linearly uncorrelated with the bits that were detected before it. Or more precisely, uncorrelated with a number of bits detected before it, depending on the number of taps that the DFE has.</p>
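<p>The principle can be demonstrated with a toy single-tap DFE (a Python illustration of the concept only, nothing like the transceivers&#8217; analog implementation): each sample carries a postcursor echo of the previous symbol, and the equalizer subtracts that echo based on the previous decision.</p>

```python
import random

# A toy single-tap DFE, illustrating the principle described above
# (a conceptual sketch only, nothing like the transceivers' actual
# analog implementation).

def detect(samples, postcursor, use_dfe):
    """Slice a stream of analog samples into bits, optionally
    cancelling the previous symbol's postcursor contribution."""
    bits, prev_level = [], 0.0
    for s in samples:
        if use_dfe:
            s -= postcursor * prev_level   # decision feedback
        bit = 1 if s > 0 else 0
        prev_level = 1.0 if bit else -1.0
        bits.append(bit)
    return bits

random.seed(42)
tx = [random.randint(0, 1) for _ in range(1000)]   # random (scrambled) bits
levels = [1.0 if b else -1.0 for b in tx]

# Channel: each sample = symbol + 90% of the previous symbol + noise.
h1 = 0.9
rx = [levels[i] + (h1 * levels[i - 1] if i else 0.0)
      + random.gauss(0, 0.2) for i in range(len(levels))]

raw_errors = sum(a != b for a, b in zip(detect(rx, h1, False), tx))
dfe_errors = sum(a != b for a, b in zip(detect(rx, h1, True), tx))
print(raw_errors, dfe_errors)   # the DFE leaves far fewer errors
```

<p>Without the feedback, the echo eats almost all of the noise margin on symbol transitions; with it, the margin is restored, which is exactly the correlation cancellation described above.</p>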
<p>This method is more power consuming and has a strong adverse effect if the bits aren&#8217;t random. It&#8217;s however considered better, in particular for canceling the effect of signal reflections, which is a common problem as the analog signal travels on the PCB and/or cable and reaches discontinuities (vias, connectors etc.).</p>
<p>Having said that, one should remember that the analog signal typically travels at about half the speed of light on PCB traces (i.e. 1.5 x 10^8 m/s), so e.g. at 5 Gb/s each symbol period corresponds to 3 cm. Accordingly, an equalizer with e.g. 8 taps is able to cancel reflections that have traveled 24 cm (typically 12 cm in each direction). So DFE may help with reflections on PCBs, but not if the reflection has gone back and forth through a longer cable. Which may not be an issue, since the cable itself typically attenuates the reflection&#8217;s signal as it goes back and forth.</p>
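<p>The distances quoted here follow directly from these numbers; a quick check of the arithmetic:</p>

```python
# The arithmetic behind the figures above: at roughly half the speed
# of light on a PCB trace, one 5 Gb/s symbol period spans 3 cm.

v_cm = 1.5e10        # propagation speed on a PCB trace, cm/s (about c/2)
bit_rate = 5e9       # bits/s
taps = 8             # DFE taps, as in the example above

symbol_cm = v_cm / bit_rate   # trace length of one symbol period
reach_cm = taps * symbol_cm   # round-trip distance the DFE covers

print(symbol_cm)     # 3.0 cm per symbol
print(reach_cm)      # 24.0 cm round trip
print(reach_cm / 2)  # 12.0 cm each way
```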
<p>According to the user guides, when a Xilinx transceiver is set to LPM (Low Power Mode), only a linear equalizer is employed. When DFE is selected, a linear equalizer followed by a DFE equalizer is employed.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2020/09/xilinx-gth-gtx-gtp-eye-scan/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Setting up Si5324/Si5328 on Xilinx development boards</title>
		<link>https://billauer.se/blog/2020/08/silicon-labs-si532x-xilinx/</link>
		<comments>https://billauer.se/blog/2020/08/silicon-labs-si532x-xilinx/#comments</comments>
		<pubDate>Sun, 30 Aug 2020 03:12:31 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[FPGA]]></category>
		<category><![CDATA[GTX]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6126</guid>
		<description><![CDATA[General These are my notes as I set up the jitter attenuator devices (Silicon Labs&#8217; Si5324 and Si5328) on Xilinx development boards as clean clock sources for use with MGTs. As such, this post isn&#8217;t all that organized. I suppose that the reason Xilinx put a jitter attenuator to drive the SFP+ related reference clock, [...]]]></description>
			<content:encoded><![CDATA[<h3>General</h3>
<p>These are my notes as I set up the jitter attenuator devices (Silicon Labs&#8217; Si5324 and Si5328) on Xilinx development boards as clean clock sources for use with MGTs. As such, this post isn&#8217;t all that organized.</p>
<p>I suppose that the reason Xilinx put a jitter attenuator to drive the SFP+ related reference clock, rather than just a plain clock generator, is first and foremost because it can be used as a high-quality clock source anyhow. On top of that, it may sync on a reference clock that might be derived from some other source. In particular, this allows syncing the SFP+ module&#8217;s data rate with the data rate of an external data stream.</p>
<p>It&#8217;s a bit of a headache to set up, though. But if you&#8217;re into FPGAs, you should be used to this already.</p>
<h3>Jots on random topics</h3>
<ul>
<li>To get the register settings, run DSPLLsim, a tool issued by Silicon Labs (part of Precision Clock EVB Software). There is no need for any evaluation board to run this tool for the sake of obtaining register values, and neither is there any need to install the drivers it suggests to install during the software&#8217;s installation.</li>
<li>Note that there&#8217;s an outline for the settings to make with DSPLLsim in a separate section below, which somewhat overlaps these jots.</li>
<li>Enable Free Run Mode = the crystal, which is connected to XA/XB, is routed to CKIN2. So there&#8217;s no need to feed the device with a clock; instead, set the reference clock to CKIN2. That works as a 114.285 MHz (± 20 ppm) reference clock on all Xilinx boards I&#8217;ve seen so far (KC705, ZC706, KCU105, ZCU106, KCU116, VC707, VC709, VCU108 and VCU118). This is also the frequency recommended in Silicon Labs&#8217; docs for a crystal reference.</li>
<li>The clock outputs should be set to LVDS. This is how they&#8217;re treated on Xilinx&#8217; boards.</li>
<li>f3 is the frequency at the input of the phase detector (i.e. the reference clock after division). The higher f3, the lower the jitter.</li>
<li>The chip&#8217;s CMODE is tied to GND, so the interface is I2C.</li>
<li>RATE0 and RATE1 are set to MM by not connecting them, so the internal pull-up-and-down sets them to this mode. This setting requires a 3rd-overtone crystal with a 114.285 MHz frequency, which is exactly what&#8217;s on the board (see Si53xx Reference Manual, Appendix A, Table 50).</li>
<li>The device&#8217;s I2C address depends on the A2-A0 pins, which are all tied to GND on the boards I&#8217;ve bothered to check. The 7-bit I2C address is { 4&#8217;b1101, A[2:0] }. Since A=0, it&#8217;s 7-bit address 0x68, or 0xD0 / 0xD1 in 8-bit representation.</li>
<li>Except for SCL, SDA, INT_C1B and RST_B, there are no control pins connected to the FPGA. Hence all other outputs can be configured to be off. RST_B is an input that resets the device. INT_C1B indicates problems with CKIN1 (only), which is useless in Free Run Mode, or may indicate an interrupt condition, which is more useful (but not really helpful due to the catch explained a few bullets down).</li>
<li>Hence INT_PIN and CK1_BAD_PIN (bits 0 and 2 in register 20) should be &#8217;1&#8242;, so that INT_C1B functions as an interrupt pin. The polarity of this pin is set by bit 0 of register 22 (INT_POL). Also set LOL_MSK to &#8217;1&#8242; (bit 0 of register 24), so that INT_C1B is asserted until the PLL is locked. This might take several seconds if the loop bandwidth is narrow (which is good for jitter), so the FPGA should monitor this pin and hold the CPLL in reset until the reference clock is stable.</li>
<li>There is a catch, though: The INT_C1B pin reflects the value of the internal register  LOL_FLG (given the outlined register setting), which latches a Loss Of Lock, and resets it only when zero is written to bit 0 of register 132. Or so I understand from the datasheet&#8217;s description for register 132, saying &#8220;Held version of LOL_INT.<em> [ ... ]</em> Flag cleared by writing 0 to this bit.&#8221; Also from Si53xx Reference Manual, section 6.11.9: &#8220;Once an interrupt flag bit is set, it will remain high until the register location is written with a &#8217;0&#8242; to clear the flag&#8221;. Hence the FPGA must continuously write to this register, or INT_C1B will be asserted forever. This could have been avoided by sensing the LOL output, however it&#8217;s not connected to the FPGA.</li>
<li>Alternatively, one can poll the LOL_INT flag directly, as bit 0 in register 130. If the I2C bus is going to be continuously busy until lock is achieved, one might as well read the non-latched version directly.</li>
<li>To get an idea of lock times, take a look at Silicon Labs&#8217; AN803, where lock times of minutes are discussed as a normal thing. It also mentions the possibility that the lock indicator (LOL) might go on and off while acquiring lock, in particular with old devices.</li>
<li>Even if the LOL signal wobbles, the output clock&#8217;s frequency is already correct at that point, so it&#8217;s a matter of slight phase adjustments, made with a narrow bandwidth loop. So the clock is good enough to work with as soon as LOL has gone low for the first time, in particular in a case like mine, where the CPLL should be able to tolerate an SSC running between 0 and -5000 ppm at a 33 kHz rate.</li>
<li>It gets better: Before accessing any I2C device on the board, the I2C switch (PCA9548A) must be programmed to let the signal through to the jitter attenuator: It wakes up with all paths off. The I2C switch&#8217;s own 7-bit address is { 4'b1110, A[2:0] }, where A[2:0] are its address pins. Check the respective board&#8217;s user guide for the actual address that has been set up.</li>
<li>Checking the time to lock on an Si5324 with the crystal as reference, it ranged from 600 ms to 2000 ms. It seems to depend on the temperature of something; the warmer, the longer the lock time: Turning off the board for a minute brings back the lower end of lock times. But I&#8217;ve also seen quite random lock times, so don&#8217;t rely on any specific figure, nor on lock being quicker right after powerup. Just check LOL and be ready to wait.</li>
<li>As for the Si5328, lock times are around 20 seconds (!). Changing the loop bandwidth (with BWSEL) didn&#8217;t make any dramatic difference.</li>
<li>I recommend looking at the last page of Silicon Lab&#8217;s guides for the &#8220;We make things simple&#8221; slogan.</li>
</ul>
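<p>As a sanity check on the address arithmetic in the bullets above, here&#8217;s a minimal Python sketch (the helper names are mine, not from any library):</p>

```python
# Sketch of the 7-bit address arithmetic described above. The bit patterns
# are the ones quoted from the datasheets; the helper names are hypothetical.

def si53xx_addr(a):
    """7-bit I2C address of the Si5324/Si5328: { 4'b1101, A[2:0] }."""
    return (0b1101 << 3) | (a & 0x7)

def pca9548a_addr(a):
    """7-bit I2C address of the PCA9548A mux: { 4'b1110, A[2:0] }."""
    return (0b1110 << 3) | (a & 0x7)

def to_8bit(addr7, read):
    """8-bit on-the-wire representation: address shifted left, R/W in bit 0."""
    return (addr7 << 1) | (1 if read else 0)

# With A[2:0] = 0, as on the boards discussed above:
assert si53xx_addr(0) == 0x68                       # 7-bit
assert to_8bit(si53xx_addr(0), read=False) == 0xD0  # write
assert to_8bit(si53xx_addr(0), read=True) == 0xD1   # read
assert pca9548a_addr(0) == 0x70
```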
<h3>Clock squelch during self calibration</h3>
<p>According to the reference manual, section 6.2.1, after writing new PLL settings through the I2C registers, the user must set ICAL = 1 to initiate a self-calibration.</p>
<p>To avoid an output clock that is significantly off the desired frequency, SQ_ICAL (bit 4 in register 4) should be set to '1', and CKOUT_ALWAYS_ON (bit 5 in register 0) should be '0'. &#8220;SQ_ICAL&#8221; stands for squelching the clock until ICAL is completed.</p>
<p>The documentation is vague on whether the clock is squelched until the internal calibration is done, or until the lock is complete (i.e. LOL goes low). The reference manual goes &#8220;if SQ_ICAL = 1, the output clocks are disabled during self-calibration and will appear after the self-calibration routine is completed&#8221;, so this sounds like it doesn&#8217;t wait for LOL to go low. However the truth table for CKOUT_ALWAYS_ON and SQ_ICAL (Table 28 in the reference manual, and it also appears in the data sheets) goes &#8220;CKOUT OFF until after the first successful ICAL (i.e., when LOL is low)&#8221;. Nice, huh?</p>
<p>And if there is no clock when LOL is high, what happens if LOL wobbles a bit after the initial lock acquisition?</p>
<p>So I checked this with a scope on an Si5324 locking on XA/XB, with SQ_ICAL=1 and CKOUT_ALWAYS_ON=0: The output clock was squelched for about 400 μs, and then became active throughout 1.7 seconds of locking, during which LOL was high. So it seems like there&#8217;s no risk of the clock going on and off if LOL wobbles.</p>
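<p>For reference, this squelch configuration boils down to two read-modify-write accesses. Just a sketch, with hypothetical <code>read_reg</code> / <code>write_reg</code> stand-ins for whatever I2C access routines are actually available:</p>

```python
# Set SQ_ICAL (bit 4 of register 4) and clear CKOUT_ALWAYS_ON (bit 5 of
# register 0), as described above. read_reg/write_reg are hypothetical
# placeholders for the design's actual I2C access routines.

def configure_squelch(read_reg, write_reg):
    write_reg(4, read_reg(4) | (1 << 4))    # SQ_ICAL = 1
    write_reg(0, read_reg(0) & ~(1 << 5))   # CKOUT_ALWAYS_ON = 0

# Exercise it against a dictionary-backed fake register file:
regs = {0: 0x34, 4: 0x02}                   # arbitrary starting values
configure_squelch(regs.get, regs.__setitem__)
assert regs[4] & (1 << 4)                   # SQ_ICAL now set
assert not (regs[0] & (1 << 5))             # CKOUT_ALWAYS_ON now clear
```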
<h3>LOCKT</h3>
<p>This parameter controls the lock detector, and wins the prize for the most confusing parameter.</p>
<p>It goes like this: The lock detector monitors the phase difference with the reference clock. If the phase difference looks fine for a certain period of time (determined by the LOCKT parameter), LOL goes low to indicate a successful lock. As a side note, since LOCKT is the time it takes to determine that nothing bad has  happened with the phase, it&#8217;s the minimum amount of time that LOL will  be high, as stated in the reference manual.</p>
<p>But then there&#8217;s Table 1 in AN803, showing how LOCKT influences the locking time. Like, really dramatically. From several seconds to a split second. How could a lock detector make such a difference? Note that LOCKT doesn&#8217;t have anything to do with the allowed phase difference, but just how long it needs to be fine until the lock detector is happy.</p>
<p>The answer lies in the fast lock mechanism (enabled by FAST_LOCK). It&#8217;s a wide bandwidth loop used to make the initial acquisition. When the lock detector indicates a lock, the loop is narrowed for the sake of achieving low jitter. So if the lock detector is too fast, switching to the low bandwidth loop occurs too early, and it will take a longer time to reach a proper lock. If it&#8217;s too slow (and hence fussy), it wastes time on the wide bandwidth phase.</p>
<p>In the end, the datasheet recommends a value of 0x1 (53 ms). This seems to be the correct fit-all balance.</p>
<p>Another thing that AN803 mentions briefly, is that LOL is forced high during the acquisition phase. Hence the lock detection, which switches the loop from acquisition (wide loop bandwidth) to low-jitter lock (narrow bandwidth) is invisible to the user, and for a good reason: The lock detector is likely to wobble at this initial stage, and there&#8217;s really no lock. The change made in January 2013, which is mentioned in this application note, was partly for producing a LOL signal that would make sense to the user.</p>
<h3>DSPLLsim settings</h3>
<p>Step by step for getting the register assignments to use with the device.</p>
<ul>
<li>After starting DSPLLsim, select &#8220;Create a new frequency plan with free running mode enabled&#8221;</li>
<li>In the next step, select the desired device (5328 or 5324).</li>
<li>In the following screen, set CKIN1 to 114.285 MHz and CKOUT1 to the desired output frequency (125 MHz in my case). Also set the XA-XB input to 114.285 MHz. Never mind that we&#8217;re not going to work with CKOUT1 &#8212; this is just for the sake of calculating the dividers&#8217; values.</li>
<li>Click &#8220;Calculate ratio&#8221; and hope for some pretty figures.</li>
<li>The number of outputs can remain 2.</li>
<li>Click &#8220;Next&#8221;.</li>
<li>Now there&#8217;s the minimal f3 to set. 16 kHz is a bit too optimistic. Reducing it to 14 kHz was enough for the 125 MHz case. Click &#8220;Next&#8221; again.</li>
<li>Now a list of frequency plans appears. If there are several options, odds are that you want the one with the highest f3. If there&#8217;s just one, fine. Go for it: Select that line and click &#8220;Next&#8221; twice.</li>
<li>This brings us to the &#8220;Other NCn_LS Values&#8221; screen, which allows setting the output divisor of the second clock output. For the same clock on both outputs, click &#8220;Next&#8221;.</li>
<li>A summary of the plan is given. Review and click &#8220;OK&#8221; to accept it. It will appear on the next window as well.</li>
</ul>
<p>And this brings us to the complicated part: The actual register settings. This tool is intended to run against an evaluation module, and update its parameters from the computer. So some features won&#8217;t work, but that doesn&#8217;t matter: It&#8217;s the registers&#8217; values we&#8217;re after.</p>
<p>There are a few tabs:</p>
<ul>
<li>&#8220;General&#8221; tab:  The defaults are OK: In particular, enable FASTLOCK, pick the narrowest bandwidth in range with BWSEL. The RATE attribute doesn&#8217;t make any difference (it&#8217;s set to MM by virtue of pins on the relevant devices). The Digital Hold feature is irrelevant, and so are its parameters (HIST_DEL and HIST_AVG). It&#8217;s only useful when the reference clock is shaky.</li>
<li>&#8220;Input Clocks&#8221; tab: Here we need changes. Set AUTOSEL_REG to 0 (&#8220;Manual&#8221;) and CKSEL_REG to 1 for selecting CKIN2 (which is where the crystal is routed to). CLKSEL_PIN is set to 0, so that the selection is made by registers and not by any pins. We want to stick to the crystal, no matter what happens on the other clock input. Other than that, leave the defaults: CK_ACTV_POL doesn&#8217;t matter, because the pin is unused, and same goes for CS_CA. BYPASS_REG is correctly '0' (otherwise the input clock goes directly to the output). LOCKT should remain 1, and VALT isn&#8217;t so important in a crystal reference scenario (they set the lock detector&#8217;s behavior). The FOS feature isn&#8217;t used, so its parameters are irrelevant as well. Automatic clock priority is irrelevant, as manual mode is applied. If you want to be extra pedantic, set CLKINnRATE to 0x3 (&#8220;95 to 215 MHz&#8221;), which is the expected frequency, at least on CKIN2. And set FOSREFSEL to 0 for XA/XB. It won&#8217;t make any difference, as the FOS detector&#8217;s output is ignored.</li>
<li>&#8220;Output Clocks&#8221; tab: Set the SFOUTn_REG to LVDS (0x7). Keep the defaults on the other settings. In particular, CKOUT_ALWAYS_ON unset and SQ_ICAL set ensures that a clock output is generated only after calibration. No hold logic is desired either.</li>
<li>&#8220;Status&#8221; tab: Start with unchecking everything, and set LOSn_EN to 0 (disable). Then check only INT_PIN,  CK1_BAD_PIN and LOL_MSK. This is discussed in the jots above. In particular, it&#8217;s important that none of the other *_MSK are checked, or the interrupt pin will be contaminated by other alarms. Or it&#8217;s unimportant, if you&#8217;re going to ignore this pin altogether, and poll LOL through the I2C interface.</li>
</ul>
<p>The other two tabs (&#8220;Register Map&#8221; and &#8220;Pin Out&#8221;) are just for information. No settings there.</p>
<p>So now, the finale: Write the settings to a file. On the main menu go to Options &gt; Save Register Map File&#8230; and, well. For extra confusion, the addresses are written in decimal, but the values in hex.</p>
<p>The somewhat surprising result is that the register settings for Si5324 and Si5328 are identical, even though the loop bandwidth is dramatically different: Both have BWSEL_REG set to 0x3, which is the narrowest bandwidth possible for either device, but for the Si5324 it means 8 Hz, and for the Si5328 it&#8217;s 0.086 Hz. So completely different jitter performance, as well as lock time (much slower on the Si5328), is expected despite, once again, exactly the same register settings.</p>
<p>There is a slight mistake in the register map: There is no need to write to the register at address 131, as it contains statuses. And if anything, write zeros to bits [2:0], so that these flags are possibly cleared, and surely not set. The value in the register map is just the default value.</p>
<p>Same goes for register 132, but it&#8217;s actually useful to write to it for the sake of clearing LOL_FLG. Only the value in the register map is 0x02, which doesn&#8217;t clear this flag (and once again, it&#8217;s simply the default).</p>
<p>It&#8217;s worth noting that the last assignment in the register map writes 0x40 to address 136, which initiates a self-calibration.</p>
<h3>I2C interface</h3>
<p>Silicon Labs&#8217; documentation on the I2C interface is rather minimal (see Si53xx Reference Manual, section 6.13), and refers to the standard for the specifics (with a dead link). However, quite obviously, the devices work with the common scheme for register access with a single byte address: The first byte is the I2C address, the second is the register address, and the third is the data to be written. All acknowledged.</p>
<p>For a read access, it&#8217;s also the common method: A two-byte write sequence sets the register address, followed by a restart (no stop condition), then one byte carrying the I2C address (with the read bit set), after which the data is read.</p>
<p>If there&#8217;s more than one byte of data, the address is auto-incremented, pretty much as usual.</p>
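<p>Spelled out as raw byte sequences, this looks as follows. A sketch with hypothetical helper names; START / repeated START / STOP handling is left to the actual I2C master:</p>

```python
# The register-access format described above, as the bytes that go on the bus.
# write_regs/read_setup are hypothetical helpers, not library functions.

def write_regs(addr7, reg, data):
    """One write transaction: address byte, register address, data bytes
    (the register address auto-increments for each extra data byte)."""
    return [(addr7 << 1) | 0, reg] + list(data)

def read_setup(addr7, reg):
    """First half of a read: a write that sets the register address. The
    master then issues a repeated START plus one address byte with R/W = 1,
    and clocks the data in."""
    return [(addr7 << 1) | 0, reg], (addr7 << 1) | 1

# Writing 0x40 to register 136 of an Si53xx at 7-bit address 0x68:
assert write_regs(0x68, 136, [0x40]) == [0xD0, 136, 0x40]
# Preparing to read LOL_INT from register 130:
setup, read_addr = read_setup(0x68, 130)
assert setup == [0xD0, 130] and read_addr == 0xD1
```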
<p>According to the data sheet, the RST pin must be asserted (held low) at least 1 μs, after which the I2C interface can be used no sooner than 10 ms (tREADY in the datasheets). It&#8217;s also possible to use the I2C interface to induce a reset with the RST_REG bit of register 136. The device performs a power-up reset after powering up (section 5.10 in the manual) however resetting is a good idea to clear previous register settings.</p>
<p>As for the I2C mux, there are two devices used in Xilinx&#8217; boards: TCA9548A and PCA9548A, which are functionally equivalent but with different voltage ranges, <a href="https://www.ti.com/product/PCA9548A" target="_blank">according to TI</a>: &#8220;Same functionality and pinout but is not an equivalent to the compared device&#8221;.</p>
<p>The I2C interface of the I2C mux consists of a single register. Hence there is no address byte. It&#8217;s just the address of the device followed by the setting of the register. Two bytes, old school style. Bit n corresponds to channel n. Setting a bit to &#8217;1&#8242; enables the channel, and setting it to &#8217;0&#8242; disconnects it. Plain and simple.</p>
<p>The I2C mux has a reset pin which must be high during operation. On KCU105, it&#8217;s connected to the FPGA with a pull-up resistor, so it&#8217;s not clear why UG917 says it &#8220;must be driven High&#8221;, but it clearly mustn&#8217;t be driven low. Also on KCU105, the I2C bus is shared between the FPGA and a &#8220;System Controller&#8221;, which is a small Zynq device with some secret sauce software on it. So there are two masters on a single I2C bus by plain wiring.</p>
<p>According to UG917&#8217;s Appendix C, the system controller does some initialization on power-up, hence generating traffic on the I2C bus. Even though it&#8217;s not said explicitly, it&#8217;s plausible to assume that it&#8217;s engineered to finish its business before the FPGA has had a chance to load its bitstream and try its own stunts. Unless the user requests some operations via the UART interface with the controller, but that&#8217;s a different story.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2020/08/silicon-labs-si532x-xilinx/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Xilinx Ultrascale / Ultrascale+ GTH/GTY CPLL calibration</title>
		<link>https://billauer.se/blog/2020/08/xilinx-ultrascale-cpll-calibration/</link>
		<comments>https://billauer.se/blog/2020/08/xilinx-ultrascale-cpll-calibration/#comments</comments>
		<pubDate>Sun, 23 Aug 2020 16:36:42 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[FPGA]]></category>
		<category><![CDATA[GTX]]></category>
		<category><![CDATA[Vivado]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6119</guid>
		<description><![CDATA[&#8230; or why does my GTH/GTY not come out of reset? Why are those reset_rx_done / reset_tx_done never asserted after a reset_all or a reset involving the CPLLs? What&#8217;s this CPLL calibration thing about? It turns out that some GTH/GTY&#8217;s on Ultrascale and Ultrascale+ FPGAs have problems with getting the CPLL to work reliably. I&#8217;ll [...]]]></description>
			<content:encoded><![CDATA[<p>&#8230; or why does my GTH/GTY not come out of reset? Why are those reset_rx_done / reset_tx_done never asserted after a reset_all or a reset involving the CPLLs?</p>
<h3>What&#8217;s this CPLL calibration thing about?</h3>
<p>It turns out that some GTH/GTYs on Ultrascale and Ultrascale+ FPGAs have problems with getting the CPLL to work reliably. I&#8217;ll leave it to <a href="https://www.xilinx.com/support/documentation/ip_documentation/gtwizard_ultrascale/v1_7/pg182-gtwizard-ultrascale.pdf" target="_blank">PG182</a> for the details on which ones. So CPLL calibration is a cute name for some workaround logic, based upon the well-known principle that if something doesn&#8217;t work, turn it off, turn it on, and then check again. Repeat.</p>
<p>Well, not quite that simple. There&#8217;s also some playing with a bit (they call it FBOOST, jump start or not?) in the secret-sauce CPLL_CFG0 setting.</p>
<p>One way or another, this extra piece of logic simply checks whether the CPLL is at its correct frequency, and if so, it does nothing. If the CPLL&#8217;s frequency isn&#8217;t as expected, within a certain tolerance, it powers the CPLL off and on (with CPLLPD), resets it (CPLLRESET) and also plays with that magic FBOOST bit. And then it tries again, up to 15 times.</p>
<p>The need for access to the GT&#8217;s DRPs is not just for that magic register&#8217;s sake, though. One can&#8217;t just measure the CPLL&#8217;s frequency directly, as it&#8217;s a few GHz. An FPGA can only work with a divided version of this clock. As there are several possibilities for routing and division of clocks inside the GT to its clock outputs, and the clock dividers depend on the application&#8217;s configuration, there&#8217;s a need to bring one such clock output to give the CPLL&#8217;s output divided by a known number. TXOUTCLK was chosen for this purpose.</p>
<p>So much of the calibration logic does exactly that: It sets some DRP registers to set up a certain relation between the CPLL&#8217;s output and TXOUTCLK (divided by 20, in fact), it does its thing, and then returns those register&#8217;s values to what they were before.</p>
<h3>A word of warning</h3>
<p>My initial take on this CPLL calibration thing was to enable it for all targets (see below for how). Can&#8217;t hurt, can it? An extra check that the CPLL is fine before kicking off. What could possibly go wrong?</p>
<p>I did this on Vivado 2015.2, and all was fine. And then I tried a later Vivado version. Boom. The GTH didn&#8217;t come out of reset. More precisely, the CPLL calibration clearly failed.</p>
<p>I can&#8217;t say that I know exactly why, but I caught the change that makes the difference: Somewhere between 2015.2 and 2018.3, the Wizard started to set the GTH&#8217;s CPLL_INIT_CFG0 instantiation parameter to 16'b0000001010110010. Generating the IP with this parameter set to its old value, 16'b0000000000011110, made the GTH work properly again.</p>
<p>I compared the reset logic as well as the CPLL calibration logic, and even though there I found a few changes, they were pretty minor (and I  also tried to revert some of them, but that didn&#8217;t make any difference).</p>
<p>So the conclusion is that the change in CPLL_INIT_CFG0 failed the CPLL calibration. Why? I have no idea. The meaning of this parameter is unknown. And the CPLL calibration just checks that the frequency is OK. So maybe it slows down the lock, so the CPLL isn&#8217;t ready when it&#8217;s checked? Possibly, but this info wouldn&#8217;t help very much anyhow.</p>
<p>Now, CPLL calibration is supposed to be enabled only for FPGA targets that are known to need it. The question is whether the Transceiver IP&#8217;s Wizard is clever enough to set CPLL_INIT_CFG0 to a value that won&#8217;t make the calibration fail on those. I have no idea.</p>
<p>By enabling CPLL calibration for a target that doesn&#8217;t need it, I selected an exotic option, but the result should have worked nevertheless. Surely it shouldn&#8217;t break from one Vivado version to another.</p>
<p>So the bottom line is: Don&#8217;t fiddle with this option, and if your GTH/GTY doesn&#8217;t come out of reset, consider turning CPLL calibration off, and see if that changes anything. And if so, I have no clear advice what to do. But at least the mystery will be resolved.</p>
<h3>Note that&#8230;</h3>
<ul>
<li>The CPLL calibration is triggered by the GT&#8217;s reset_all assertion, as well as with reset_*_pll_and_datapath, if the CPLL is used in the relevant data path. The &#8220;reset done&#8221; signal for a data path that depends on the CPLL is asserted only if and when the CPLL calibration was successful and the CPLL is locked.</li>
<li>If cplllock_out (if exposed) is never asserted, this could indicate that the CPLL calibration failed. So it makes sense to wait indefinitely for it &#8212; better fail loudly than work with a wobbling clock.</li>
<li>Because the DRP clock is used to measure the period of time for counting the number of cycles of the divided CPLL clock, its frequency must be set accurately in the Wizard. Otherwise, the CPLL calibration will most certainly fail, even if the CPLL is perfectly fine.</li>
<li>The calibration state machine takes control of some GT ports (listed below) from when cpllreset_in is deasserted, and until the  calibration state machine has finished, with success or failure.</li>
<li>While the calibration takes place, and if the calibration ends up  failing, the cplllock_out signal presented to the user logic is held low.  Only when the calibration is finished successfully, is the GT&#8217;s CPLLLOCK connected to the user logic (after a slight delay, and synchronized  with the DRP clock).</li>
</ul>
<h3>Activating the CPLL calibration feature</h3>
<p><em>See &#8220;A word of warning&#8221; above. You probably don&#8217;t want to activate this feature for all FPGA targets.</em></p>
<p>There are three possible choices for whether the CPLL calibration module is activated in the Wizard&#8217;s transceiver. This can&#8217;t be set from the GUI, but by editing the XCI file manually. There are two parameters in that file,  PARAM_VALUE.INCLUDE_CPLL_CAL and MODELPARAM_VALUE.C_INCLUDE_CPLL_CAL, which should have the same value as follows:</p>
<ul>
<li>0 &#8212; Don&#8217;t activate.</li>
<li>1 &#8212; Do activate.</li>
<li>2 &#8212; Activate only for devices which the Wizard deems have a problem (default).</li>
</ul>
<p>Changing it from the default 2 to 1 makes Vivado lock the core, saying it &#8220;contains stale content&#8221;. To resolve this, &#8220;upgrade&#8221; the IP, which triggers a warning that user intervention is necessary.</p>
<p>And indeed, three new ports are added, and this addition of ports is also reflected in the XCI file (but nothing else should change): gtwiz_gthe3_cpll_cal_txoutclk_period_in, gtwiz_gthe3_cpll_cal_cnt_tol_in and gtwiz_gthe3_cpll_cal_bufg_ce_in.</p>
<p>These are three input ports, so they have to be assigned values. <a href="https://www.xilinx.com/support/documentation/ip_documentation/gtwizard_ultrascale/v1_7/pg182-gtwizard-ultrascale.pdf" target="_blank">PG182</a>&#8217;s Table 3-1 gives the formulas for that (good luck with those), and the dissection notes below explain these formulas. But the TL;DR version is:</p>
<ul>
<li>gtwiz_gthe3_cpll_cal_bufg_ce_in should be assigned with a constant 1&#8242;b1.</li>
<li>gtwiz_gthe3_cpll_cal_txoutclk_period_in should be assigned with the constant value of P_CPLL_CAL_TXOUTCLK_PERIOD, as found in the transceiver IP&#8217;s synthesis report (e.g. mytransceiver_synth_1/runme.log).</li>
<li>gtwiz_gthe3_cpll_cal_cnt_tol_in should be assigned with the constant value of P_CPLL_CAL_TXOUTCLK_PERIOD, divided by 100.</li>
</ul>
<p>The description here relates to a single transceiver in the IP.</p>
<p>The meaning of gtwiz_gthe3_cpll_cal_txoutclk_period_in is as follows: Take the CPLL clock and divide it by 80. Count the number of clock cycles in a time period corresponding to 16000 DRP clock cycles. That&#8217;s the value to assign, as this is what the CPLL calibration logic expects to get.</p>
<p>gtwiz_gthe3_cpll_cal_cnt_tol_in is the number of counts that the result can be higher or lower than expected, and still the CPLL will be considered fine. As this is taken as the number of expected counts, divided by 100, this results in a ±1% clock frequency tolerance. Which is a good idea, given that common SSC clocking (PCIe, SATA, USB 3.0) might drag down the clock frequency by -5000 ppm, i.e. -0.5%.</p>
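<p>The arithmetic can be sketched in a few lines of Python, using the localparam formulas quoted further down in this post. The figures assume the 5 Gb/s example, with the CPLL VCO at 2500 MHz (my reading of that example&#8217;s settings) and a 125 MHz free-running DRP clock:</p>

```python
# Expected count: (VCO/20) clock cycles counted during window/4 free-running
# DRP clock cycles; the tolerance is 1% of that, per the text above.
# Integer division throughout, frequencies in MHz.

def cal_period(vco_mhz, freerun_mhz, window=16000):
    return (vco_mhz // 20) * (window // (4 * freerun_mhz))

def cal_tolerance(period):
    return period // 100    # +/- 1% frequency tolerance

p = cal_period(2500, 125)
assert p == 4000
assert cal_tolerance(p) == 40
```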
<p>The possibly tricky thing with setting these correctly is that they depend directly on the CPLL frequency. Given the data rate, there might be more than one possibility for a CPLL frequency, however it&#8217;s not expected that the Wizard will change it from run to run unless something fundamental is changed in the parameters (e.g. changing the data rate of one of the directions or both).</p>
<p>Besides, the CPLL frequency appears in the XCI file as MODELPARAM_VALUE.C_CPLL_VCO_FREQUENCY.</p>
<p>If the CPLL is activated deliberately, it&#8217;s recommended to verify that it actually takes place by setting a wrong value for gtwiz_gthe3_cpll_cal_txoutclk_period_in, and check that the calibration fails (cplllock_out remains low).</p>
<h3>Which ports are affected?</h3>
<p>Looking at ultragth_gtwizard_gthe3.v gives the  list of ports that the CPLL calibration logic fiddles with. Within the CPLL  calibration generate clause they&#8217;re assigned with certain values, and in  the &#8220;else&#8221; clause, with the plain bypass:</p>
<pre>    // Assign signals as appropriate to bypass the CPLL calibration block when it is not instantiated
    else begin : gen_no_cpll_cal
      assign txprgdivresetdone_out = txprgdivresetdone_int;
      assign cplllock_int          = cplllock_ch_int;
      assign drprdy_out            = drprdy_int;
      assign drpdo_out             = drpdo_int;
      assign cpllreset_ch_int      = cpllreset_int;
      assign cpllpd_ch_int         = cpllpd_int;
      assign txprogdivreset_ch_int = txprogdivreset_int;
      assign txoutclksel_ch_int    = txoutclksel_int;
      assign drpaddr_ch_int        = drpaddr_int;
      assign drpdi_ch_int          = drpdi_int;
      assign drpen_ch_int          = drpen_int;
      assign drpwe_ch_int          = drpwe_int;
    end</pre>
<h3>Dissection of Wizard&#8217;s output</h3>
<p>The name of the IP was ultragth in my case. That&#8217;s the significance of this name appearing all over this part.</p>
<p>The impact of changing the XCI file: In the Verilog files that are produced by the Wizard, MODELPARAM_VALUE.C_INCLUDE_CPLL_CAL is used directly when instantiating  the ultragth_gtwizard_top, as  the C_INCLUDE_CPLL_CAL instantiation parameter.</p>
<p>Also, the three new input ports are passed on to ultragth_gtwizard_top.v, rather than getting all zero assignments when they&#8217;re not exposed to the user application logic.</p>
<p>When activating the CPLL calibration (setting INCLUDE_CPLL_CAL to 1)  additional constraints are also added to the constraint file for the IP, adding a few new false paths as well as making sure that the timing calculations for the TXOUTCLK is set according to the requested clock source. The latter is necessary, because the calibration logic fiddles with TXOUTCLKSEL during the calibration phase.</p>
<p>In ultragth_gtwizard_top.v the instantiation parameters and the three ports are just passed on to ultragth_gtwizard_gthe3.v, where the action happens.</p>
<p>First, the following defines are made (they like short names in Xilinx):</p>
<pre>`define ultragth_gtwizard_gthe3_INCLUDE_CPLL_CAL__EXCLUDE 0
`define ultragth_gtwizard_gthe3_INCLUDE_CPLL_CAL__INCLUDE 1
`define ultragth_gtwizard_gthe3_INCLUDE_CPLL_CAL__DEPENDENT 2</pre>
<p>and further down, we have this short and concise condition for enabling CPLL calibration:</p>
<pre>    if ((C_INCLUDE_CPLL_CAL         == `ultragth_gtwizard_gthe3_INCLUDE_CPLL_CAL__INCLUDE) ||
        (((C_INCLUDE_CPLL_CAL       == `ultragth_gtwizard_gthe3_INCLUDE_CPLL_CAL__DEPENDENT) &amp;&amp;
         ((C_GT_REV                 == 11) ||
          (C_GT_REV                 == 12) ||
          (C_GT_REV                 == 14))) &amp;&amp;
         (((C_TX_ENABLE             == `ultragth_gtwizard_gthe3_TX_ENABLE__ENABLED) &amp;&amp;
           (C_TX_PLL_TYPE           == `ultragth_gtwizard_gthe3_TX_PLL_TYPE__CPLL)) ||
          ((C_RX_ENABLE             == `ultragth_gtwizard_gthe3_RX_ENABLE__ENABLED) &amp;&amp;
           (C_RX_PLL_TYPE           == `ultragth_gtwizard_gthe3_RX_PLL_TYPE__CPLL)) ||
          ((C_TXPROGDIV_FREQ_ENABLE == `ultragth_gtwizard_gthe3_TXPROGDIV_FREQ_ENABLE__ENABLED) &amp;&amp;
           (C_TXPROGDIV_FREQ_SOURCE == `ultragth_gtwizard_gthe3_TXPROGDIV_FREQ_SOURCE__CPLL))))) begin : gen_cpll_cal</pre>
<p>which simply means that the CPLL calibration module should be generated if C_INCLUDE_CPLL_CAL is 1 (as I changed it to), or if it&#8217;s 2 (the default) and some conditions for enabling it automatically are met.</p>
<p>Further down, the hint for how to assign those three new ports is given: Namely, if the CPLL calibration was added automatically due to the default assignment and a specific target FPGA, the values calculated by the Wizard itself are used:</p>
<pre>      // The TXOUTCLK_PERIOD_IN and CNT_TOL_IN ports are normally driven by an internally-calculated value. When INCLUDE_CPLL_CAL is 1,
      // they are driven as inputs for PLL-switching and rate change special cases, and the BUFG_GT CE input is provided by the user.
      wire [(`ultragth_gtwizard_gthe3_N_CH* 18)-1:0] cpll_cal_txoutclk_period_int;
      wire [(`ultragth_gtwizard_gthe3_N_CH* 18)-1:0] cpll_cal_cnt_tol_int;
      wire [(`ultragth_gtwizard_gthe3_N_CH*  1)-1:0] cpll_cal_bufg_ce_int;
      if (C_INCLUDE_CPLL_CAL == `ultragth_gtwizard_gthe3_INCLUDE_CPLL_CAL__INCLUDE) begin : gen_txoutclk_pd_input
        assign cpll_cal_txoutclk_period_int = {`ultragth_gtwizard_gthe3_N_CH{gtwiz_gthe3_cpll_cal_txoutclk_period_in}};
        assign cpll_cal_cnt_tol_int         = {`ultragth_gtwizard_gthe3_N_CH{gtwiz_gthe3_cpll_cal_cnt_tol_in}};
        assign cpll_cal_bufg_ce_int         = {`ultragth_gtwizard_gthe3_N_CH{gtwiz_gthe3_cpll_cal_bufg_ce_in}};
      end
      else begin : gen_txoutclk_pd_internal
<strong>        assign cpll_cal_txoutclk_period_int = {`ultragth_gtwizard_gthe3_N_CH{p_cpll_cal_txoutclk_period_int}};
        assign cpll_cal_cnt_tol_int         = {`ultragth_gtwizard_gthe3_N_CH{p_cpll_cal_txoutclk_period_div100_int}};
        assign cpll_cal_bufg_ce_int         = {`ultragth_gtwizard_gthe3_N_CH{1'b1}};
</strong>      end</pre>
<p>These `ultragth_gtwizard_gthe3_N_CH things are just duplication of the same vector, in case there are multiple channels for the same IP.</p>
<p>First, note that cpll_cal_bufg_ce is assigned a constant 1. It&#8217;s not clear why this port is exposed at all.</p>
<p>And now to the calculated values. Given that it says</p>
<pre>      wire [15:0] p_cpll_cal_freq_count_window_int      = P_CPLL_CAL_FREQ_COUNT_WINDOW;
<strong>      wire [17:0] p_cpll_cal_txoutclk_period_int        = P_CPLL_CAL_TXOUTCLK_PERIOD;
</strong>      wire [15:0] p_cpll_cal_wait_deassert_cpllpd_int   = P_CPLL_CAL_WAIT_DEASSERT_CPLLPD;
<strong>      wire [17:0] p_cpll_cal_txoutclk_period_div100_int = P_CPLL_CAL_TXOUTCLK_PERIOD_DIV100;
</strong></pre>
<p>a few rows above, and</p>
<pre>  localparam [15:0] P_CPLL_CAL_FREQ_COUNT_WINDOW      = 16'd16000;
<strong>  localparam [17:0] P_CPLL_CAL_TXOUTCLK_PERIOD        = (C_CPLL_VCO_FREQUENCY/20) * (P_CPLL_CAL_FREQ_COUNT_WINDOW/(4*C_FREERUN_FREQUENCY));
</strong>  localparam [15:0] P_CPLL_CAL_WAIT_DEASSERT_CPLLPD   = 16'd256;
<strong>  localparam [17:0] P_CPLL_CAL_TXOUTCLK_PERIOD_DIV100 = (C_CPLL_VCO_FREQUENCY/20) * (P_CPLL_CAL_FREQ_COUNT_WINDOW/(400*C_FREERUN_FREQUENCY));
</strong>  localparam [25:0] P_CDR_TIMEOUT_FREERUN_CYC         = (37000 * C_FREERUN_FREQUENCY) / C_RX_LINE_RATE;</pre>
<p>it&#8217;s not all that difficult to do the math. And looking at Table 3-1 of <a href="https://www.xilinx.com/support/documentation/ip_documentation/gtwizard_ultrascale/v1_7/pg182-gtwizard-ultrascale.pdf" target="_blank">PG182</a>, the formulas match perfectly, but I didn&#8217;t feel very reassured by those.</p>
<p>So why bother? Much easier to use the values calculated by the tools, as they appear in ultragth_synth_1/runme.log (for a 5 Gb/s rate and reference clock of 125 MHz, but YMMV as there&#8217;s more than one way to achieve a line rate):</p>
<pre>	Parameter P_CPLL_CAL_FREQ_COUNT_WINDOW bound to: 16'b0011111010000000
	Parameter P_CPLL_CAL_TXOUTCLK_PERIOD bound to: 18'b000000111110100000
	Parameter P_CPLL_CAL_WAIT_DEASSERT_CPLLPD bound to: 16'b0000000100000000
	Parameter P_CPLL_CAL_TXOUTCLK_PERIOD_DIV100 bound to: 18'b000000000000101000
	Parameter P_CDR_TIMEOUT_FREERUN_CYC bound to: 26'b00000011100001110101001000</pre>
<p>The bottom line is hence to set gtwiz_gthe3_cpll_cal_txoutclk_period_in to 18'b000000111110100000, and gtwiz_gthe3_cpll_cal_cnt_tol_in to 18'b000000000000101000. Which is 4000 and 40 in plain decimal, respectively.</p>
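<p>As a sanity check, the Table 3-1 formulas can be evaluated directly. Below is a minimal sketch, assuming the frequencies are expressed in Hz and real-valued arithmetic is used (a 5 Gb/s line rate implies a 2.5 GHz CPLL VCO here, along with the 125 MHz reference clock mentioned above):</p>

```python
# Sanity check of the CPLL calibration parameters against Table 3-1 of
# PG182. Frequencies assumed to be in Hz; real-valued arithmetic.
f_vco = 2.5e9      # CPLL VCO frequency for a 5 Gb/s line rate
f_freerun = 125e6  # free-running (DRP) clock frequency
freq_count_window = 16000  # P_CPLL_CAL_FREQ_COUNT_WINDOW, hardcoded

txoutclk_period = (f_vco / 20) * (freq_count_window / (4 * f_freerun))
txoutclk_period_div100 = (f_vco / 20) * (freq_count_window / (400 * f_freerun))

print(round(txoutclk_period))         # 4000 = 18'b000000111110100000
print(round(txoutclk_period_div100))  # 40   = 18'b000000000000101000
```

<p>Which agrees with the values that the tools printed in runme.log.</p>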
<h3>Dissection of CPLL Calibration module (specifically)</h3>
<p>The CPLL calibrator is implemented in gtwizard_ultrascale_v1_5/hdl/verilog/gtwizard_ultrascale_v1_5_gthe3_cpll_cal.v.</p>
<p>Some basic reverse engineering. This may be inaccurate, as I wasn&#8217;t very careful about the gory details on this matter. Also, when I say below that a register is modified, it&#8217;s modified to the values listed after the outline of the state machine (further below).</p>
<p>So just to get an idea:</p>
<ul>
<li>TXOUTCLKSEL starts with the value 0.</li>
<li>Using the DRP ports, it fetches the existing values of the PROGCLK_SEL and PROGDIV registers, and modifies their values.</li>
<li>It changes TXOUTCLKSEL to 3'b101, i.e. TXPROGDIVCLK is routed to TXOUTCLK. TXPROGDIVCLK can have more than one clock source, but here it&#8217;s the CPLL directly, divided by PROGDIV (judging by the value assigned to PROGCLK_SEL).</li>
<li>CPLLRESET is asserted for 32 clock cycles, and then deasserted.</li>
</ul>
<div>The state machine now enters a loop as follows.</div>
<div>
<ul>
<li>The state machine waits 16384 clock cycles. This is essentially waiting for the CPLL to lock; however, the CPLL&#8217;s lock detector isn&#8217;t monitored. Rather, the state machine waits this fixed amount of time.</li>
<li>txprogdivreset is asserted for 32 clock cycles.</li>
<li>The state machine waits for the assertion of the GT&#8217;s txprgdivresetdone (possibly indefinitely).</li>
<li>The state machine checks that the frequency counter&#8217;s output (more on this below) is in the range of TXOUTCLK_PERIOD_IN  ± CNT_TOL_IN. If so, it exits this loop (think C &#8220;break&#8221; here), with the intention of declaring success. If not, and this is the 15th failed attempt, it exits the loop as well, but with the intention of declaring failure. Otherwise, it continues as follows.</li>
<li>The FBOOST DRP register is read and then modified.</li>
<li>32 clock cycles later, CPLLRESET is asserted.</li>
<li>32 clock cycles later, CPLLPD is asserted for a number of clock cycles (determined by the module&#8217;s WAIT_DEASSERT_CPLLPD_IN input), and then deasserted (the CPLL is powered down and up!).</li>
<li>32 clock cycles later, CPLLRESET is deasserted.</li>
<li>The FBOOST DRP register is restored to its original value.</li>
<li>The state machine continues at the beginning of this loop.</li>
</ul>
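<p>The loop above can be sketched behaviorally in Python (not cycle-accurate, and my own abstraction of the Verilog, not anything from the wizard&#8217;s sources): measure_count stands in for the frequency counter, and all DRP accesses, resets and fixed waits are reduced to comments.</p>

```python
def cpll_calibrate(measure_count, txoutclk_period, cnt_tol, max_attempts=15):
    """Behavioral sketch of the CPLL calibration loop.

    measure_count() stands in for the frequency counter's output;
    DRP accesses, resets and fixed waits are abstracted away.
    """
    for attempt in range(max_attempts):
        # (wait ~16384 cycles for the CPLL, pulse txprogdivreset,
        #  wait for txprgdivresetdone -- all abstracted here)
        count = measure_count()
        if abs(count - txoutclk_period) <= cnt_tol:
            return True   # exit loop intending to assert CPLL_CAL_DONE
        # Otherwise: read and modify FBOOST, pulse CPLLRESET,
        # power-cycle the CPLL via CPLLPD, restore FBOOST, retry.
    return False          # exit loop intending to assert CPLL_CAL_FAIL

# A CPLL that settles on the correct frequency at the third attempt:
readings = iter([3200, 3700, 4005])
print(cpll_calibrate(lambda: next(readings), 4000, 40))  # True
```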
<div>And the final sequence, after exiting the loop:</div>
<div>
<ul>
<li>PROGDIV and PROGCLK_SEL are restored to their original values.</li>
<li>CPLLRESET is asserted for 32 clock cycles, and then deasserted.</li>
<li>The state machine waits for the assertion of the GT&#8217;s cplllock, possibly indefinitely.</li>
<li>txprogdivreset is asserted for 32 clock cycles.</li>
<li>The state machine waits for the assertion of the GT&#8217;s txprgdivresetdone (possibly indefinitely).</li>
<li>The state machine finishes. At this point one of the module&#8217;s CPLL_CAL_FAIL or CPLL_CAL_DONE is asserted, depending on the reason for exiting the loop.</li>
</ul>
</div>
<p>As for the values assigned when I said &#8220;modified&#8221; above, I won&#8217;t get into that in detail, but just put a related snippet of code. Note that these values are often shifted to their correct place in the DRP registers in order to fulfill their purpose:</p>
<pre>  localparam [1:0]  MOD_PROGCLK_SEL = 2'b10;
  localparam [15:0] MOD_PROGDIV_CFG = 16'hA1A2; //divider 20
  localparam [2:0]  MOD_TXOUTCLK_SEL = 3'b101;
  localparam        MOD_FBOOST = 1'b1;</pre>
<p>Now, a word about the frequency counter: It&#8217;s a bit complicated because of clock domain issues, but what it does is to divide the clock under test by 4, and then count how many cycles the divided clock has during a period of FREQ_COUNT_WINDOW_IN DRP clocks. Which is hardcoded as 16000 clocks.</p>
<p>If we trust the comment saying that PROGDIV is set to 20, the frequency counter gets the CPLL clock divided by 20. It then divides this further by 4, and counts the result for 16000 DRP clocks. Which is exactly the formula given in Table 3-1 of <a href="https://www.xilinx.com/support/documentation/ip_documentation/gtwizard_ultrascale/v1_7/pg182-gtwizard-ultrascale.pdf" target="_blank">PG182</a>.</p>
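<p>As a plausibility check of this reading, here&#8217;s the counter&#8217;s expected output computed from its own perspective (a sketch assuming a 2.5 GHz CPLL VCO and a 125 MHz DRP clock, as in the 5 Gb/s example):</p>

```python
# Expected frequency counter output: the clock under test (CPLL VCO
# divided by PROGDIV = 20) is divided by 4 and counted over a window
# of 16000 DRP clock cycles. Frequencies assumed to be in Hz.
f_vco = 2.5e9   # CPLL VCO frequency (5 Gb/s line rate)
f_drp = 125e6   # DRP clock frequency
progdiv = 20
window_cycles = 16000

f_under_test = f_vco / progdiv / 4        # 31.25 MHz after both dividers
window_seconds = window_cycles / f_drp    # 128 us counting window
print(round(f_under_test * window_seconds))  # 4000
```

<p>which indeed lands on the TXOUTCLK_PERIOD value of 4000.</p>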
<p>Are we having fun?</p>
</div>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2020/08/xilinx-ultrascale-cpll-calibration/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Cyclone V and some transceiver CDR/PLL parameters</title>
		<link>https://billauer.se/blog/2018/05/altera-intel-fpga-pma-cdr-pll/</link>
		<comments>https://billauer.se/blog/2018/05/altera-intel-fpga-pma-cdr-pll/#comments</comments>
		<pubDate>Mon, 21 May 2018 06:42:57 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[FPGA]]></category>
		<category><![CDATA[GTX]]></category>
		<category><![CDATA[Intel FPGA (Altera)]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=5440</guid>
		<description><![CDATA[Introduction Connecting an Intel FPGA (Altera) Cyclone V&#8217;s Native Transceiver IP to a USB 3.0 channel (which involves a -5000 ppm Spread Spectrum modulation), I got a significant bit error rate and what appeared to be occasional losses of lock. Suspecting that the CDR didn&#8217;t catch up with the frequency modulation, I wanted to try [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>Connecting an Intel FPGA (Altera) Cyclone V&#8217;s Native Transceiver IP to a USB 3.0 channel (which involves a -5000 ppm Spread Spectrum modulation), I got a significant bit error rate and what appeared to be occasional losses of lock. Suspecting that the CDR didn&#8217;t catch up with the frequency modulation, I wanted to try out a larger PLL bandwidth = track more aggressively at the expense of higher jitter. That turned out to be not so trivial.</p>
<p>This post sums up my findings related to Quartus. As for solving the original problem (bit errors and that), changing the bandwidth made no difference.</p>
<p>Toolset: Quartus Lite 15.1 on Linux.</p>
<p>And by the way, the problem turned out to be unrelated to the PLL; rather, it was the lack of an equalizer on Cyclone V&#8217;s receiver, and hence no canceling of the low-pass filtering effect of the USB 3.0 cable. I worked around this by setting XCVR_RX_LINEAR_EQUALIZER_CONTROL to 2 in the QSF file, and the errors were gone. However, this just activates a constant compensating high-pass filter on the receiver&#8217;s input (see the <a href="https://www.altera.com/en_US/pdfs/literature/hb/cyclone-v/cv_51002.pdf" target="_blank">Cyclone V Device Datasheet</a>, CV-51002, 2018.05.07, Figure 4), and consequently works around the problem for a specific cable, not more.</p>
<h3>Assignments in the QSF file</h3>
<p>In order to change the CDR&#8217;s bandwidth, assignments in the QSF are due, as detailed in V-Series Transceiver PHY IP Core User  Guide (<a href="https://www.altera.com/en_US/pdfs/literature/ug/xcvr_user_guide.pdf" target="_blank">UG-01080</a>, 2017.07.06) in the section &#8220;Analog Settings for Cyclone V Devices&#8221; and on page 20-28. In principle, CDR_BANDWIDTH_PRESET should be set to High instead of its default &#8220;Auto&#8221;. In this post, I&#8217;ll also set PLL_BANDWIDTH_PRESET to High, even though I&#8217;m quite confident it has nothing to do with locking to data (rather, it controls locking to the reference clock). But it causes quite some confusion, as shown below.</p>
<p>So all that is left is to nail down the CDR&#8217;s instance name, and assign it these parameters.</p>
<p>Now first, what <strong>not to do</strong>: Using wildcards. This is quite tempting because the path to the CDR is very long. So at first, I went for this, which is wrong:</p>
<pre><span style="text-decoration: line-through;">set_instance_assignment -name CDR_BANDWIDTH_PRESET High -to *|xcvr_inst|*rx_pma.rx_cdr
set_instance_assignment -name PLL_BANDWIDTH_PRESET High -to *|xcvr_inst|*rx_pma.rx_cdr</span></pre>
<p>And nothing happened, except a small notice in some very important place of the fitter report:</p>
<pre>+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
; Ignored Assignments                                                                                                                                                                                   ;
+--------------------------------------------------+---------------------------+--------------+------------------------------------------------------------+---------------+----------------------------+
; Name                                             ; Ignored Entity            ; Ignored From ; Ignored To                                                 ; Ignored Value ; Ignored Source             ;
+--------------------------------------------------+---------------------------+--------------+------------------------------------------------------------+---------------+----------------------------+
; Merge TX PLL driven by registers with same clear ; altera_xcvr_reset_control ;              ; alt_xcvr_reset_counter:g_pll.counter_pll_powerdown|r_reset ; ON            ; Compiler or HDL Assignment ;
; CDR Bandwidth Preset                             ; myproj                    ;              ; *|xcvr_inst|*rx_pma.rx_cdr                                 ; HIGH          ; QSF Assignment             ;
; PLL Bandwidth Preset                             ; myproj                    ;              ; *|xcvr_inst|*rx_pma.rx_cdr                                 ; HIGH          ; QSF Assignment             ;
+--------------------------------------------------+---------------------------+--------------+------------------------------------------------------------+---------------+----------------------------+</pre>
<p>Ayeee. So it seems like there&#8217;s no choice but to spell out the entire path. I haven&#8217;t investigated this thoroughly, though. Maybe there is some form of wildcards that would work. I also discuss this topic briefly in <a href="https://billauer.se/blog/2013/11/sdc-tcl-quartus-wildcards/" target="_blank">another post</a> of mine.</p>
<p>So this is more like it:</p>
<pre>set_instance_assignment -name CDR_BANDWIDTH_PRESET <strong>High</strong> -to frontend_ins|xcvr_inst|xcvr_inst|gen_native_inst.av_xcvr_native_insts[0].gen_bonded_group_native.av_xcvr_native_inst|inst_av_pma|av_rx_pma|rx_pmas[0].rx_pma.rx_cdr
set_instance_assignment -name PLL_BANDWIDTH_PRESET <strong>High</strong> -to frontend_ins|xcvr_inst|xcvr_inst|gen_native_inst.av_xcvr_native_insts[0].gen_bonded_group_native.av_xcvr_native_inst|inst_av_pma|av_rx_pma|rx_pmas[0].rx_pma.rx_cdr</pre>
<p>I guess this clarifies why wildcards are tempting.</p>
<h3>Verifying something happened</h3>
<p>This is where things get confusing. Looking at the fitter report, in the part on transceivers, this was the output <strong>before</strong> adding the QSF assignments above (pardon the wide line, this is what the Fitter produced):</p>
<pre>;         -- Name                                                                                           ; frontend:frontend_ins|xcvr:xcvr_inst|altera_xcvr_native_av:xcvr_inst|av_xcvr_native:gen_native_inst.av_xcvr_native_insts[0].gen_bonded_group_native.av_xcvr_native_inst|av_pma:inst_av_pma|av_rx_pma:av_rx_pma|rx_pmas[0].rx_pma.rx_cdr                                                                                                                                                                                                                                                     ;
;         -- PLL Location                                                                                   ; CHANNELPLL_X0_Y49_N32                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ;
;         -- PLL Type                                                                                       ; CDR PLL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ;
;         -- PLL Bandwidth Type                                                                             ; <strong>Auto (Medium)</strong>                                                                                                                                                                                                                                                                                                                                                                                                                                                                               ;
;         -- PLL Bandwidth Range                                                                            ; <strong>2 to 4 MHz</strong></pre>
<p>And <strong>after</strong> adding the QSF assignments:</p>
<pre>;         -- Name                                                                                           ; frontend:frontend_ins|xcvr:xcvr_inst|altera_xcvr_native_av:xcvr_inst|av_xcvr_native:gen_native_inst.av_xcvr_native_insts[0].gen_bonded_group_native.av_xcvr_native_inst|av_pma:inst_av_pma|av_rx_pma:av_rx_pma|rx_pmas[0].rx_pma.rx_cdr                                                                                                                                                                                                                                                     ;
;         -- PLL Location                                                                                   ; CHANNELPLL_X0_Y49_N32                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ;
;         -- PLL Type                                                                                       ; CDR PLL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ;
;         -- PLL Bandwidth Type                                                                             ; <strong>High</strong>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ;
;         -- PLL Bandwidth Range                                                                            ; <strong>4 to 8 MHz</strong></pre>
<p>Bingo, huh? Well, not really. Which of these two assignments made this happen? CDR_BANDWIDTH_PRESET or PLL_BANDWIDTH_PRESET? In other words: Does the fitter report tell us about the bandwidth of the PLL on the reference clock or the data?</p>
<p>The answer is PLL_BANDWIDTH_PRESET. Setting CDR_BANDWIDTH_PRESET doesn&#8217;t change anything in the Fitter report at all. I know it all too well (after spending some pleasant quality time trying to figure out why, before realizing it&#8217;s about PLL_BANDWIDTH_PRESET).</p>
<h3>So where does CDR_BANDWIDTH_PRESET do its trick?</h3>
<p>To find that, one needs to get down to the post-fitting properties of the rx_cdr instance. The following sequence applies to Quartus 15.1&#8217;s GUI:</p>
<p>After fitting, select Tools &gt; Netlist Viewers &gt; Technology Map  Viewer (Post-Fitting). Locate the instance in the Find tab (to the left; it&#8217;s a plain substring search on the instance name given in the QSF assignment). Once found, click on the block in the  graphics display so its bounding box becomes red, and then right-click  this block. On the menu that shows up, select Locate in Resource  Property Editor.</p>
<p>And that displays a list of properties (which can be exported into a CSV file). One of which is rxpll_pd_bw_ctrl. Changing CDR_BANDWIDTH_PRESET to High altered this property&#8217;s value from 300 to 600. Changing it to Low sets it to 240.</p>
<p>And by the way, a change in PLL_BANDWIDTH_PRESET to High has no impact on any of the properties listed in the Resource Property Editor for the said instance, but making it Low takes pfd_charge_pump_current_ctrl from 30 to 20, and rxpll_pfd_bw_ctrl from 4800 to 3200. Whatever that means.</p>
<p>It&#8217;s worth mentioning that the CDR is instantiated as an arriav_channel_pll primitive (yes, an Arria V primitive on a Cyclone V FPGA) in the av_rx_pma.sv module (generated automatically for the Transceiver Native PHY IP). One of the instantiation parameters is rxpll_pd_bw_ctrl, which is assigned 300 by default. The source file doesn&#8217;t change as a result of the said change in the QSF file. So the tools somehow change something post-synthesis. I guess.</p>
<p>There are, however, no instantiation parameters for either pfd_charge_pump_current_ctrl or rxpll_pfd_bw_ctrl. So the rxpll_pd_bw_ctrl naming match is probably more of a coincidence. Once again, I guess.</p>
<h3>A closer look on the PLL</h3>
<p>It&#8217;s quite clear from the above that CDR_BANDWIDTH_PRESET influenced rxpll_<span style="color: #ff0000;"><strong>pd</strong></span>_bw_ctrl (note the _pd_ part) and that PLL_BANDWIDTH_PRESET is related to a couple of parameters with <span style="color: #ff0000;"><strong>pfd</strong></span> in them. This terminology goes along with the one used in the documentation (see e.g. Figure 1-17, &#8220;Channel PLL Block Diagram&#8221; in <a href="https://www.altera.com/en_US/pdfs/literature/hb/cyclone-v/cv_5v3.pdf" target="_blank">Cyclone V Device Handbook Volume 2: Transceivers</a>, cv_5v3.pdf, 2016.01.28): PFD relates to the Lock-To-Reference loop, which locks on the reference clock, and PD relates to the Lock-To-Data loop, which is the CDR.</p>
<p>This isn&#8217;t just a curiosity, because the VCO&#8217;s output dividers, L, are assigned separately for the PD and PFD loops (see the fitter report as well as Table 1-9).</p>
<p>As for the numbers in the fitter report, they match the doc&#8217;s as shown in the two relevant segments below. The first relates to a Native PHY IP, and the second to a PCIe PHY, both on the same design, both targeted at 5 Gb/s (and hence having the same &#8220;Output Clock Frequency&#8221;).</p>
<pre>;         -- Reference Clock Frequency                                                                      ; <span style="color: #ff0000;"><strong>100.0 MHz</strong></span>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ;
;         -- Output Clock Frequency                                                                         ; <span style="color: #ff0000;"><strong>2500.0 MHz </strong></span>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ;
;         -- L Counter PD Clock Disable                                                                     ; Off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ;
;         -- M Counter                                                                                      ; <span style="color: #ff0000;"><strong>25</strong></span>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ;
;         -- PCIE Frequency Control                                                                         ; pcie_100mhz                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ;
;         -- PD L Counter                                                                                   ; <span style="color: #ff0000;"><strong>2</strong></span>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ;
;         -- PFD L Counter                                                                                  ; 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ;
;         -- Powerdown                                                                                      ; Off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ;
;         -- Reference Clock Divider                                                                        ; <strong><span style="color: #ff0000;">1</span></strong>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ;</pre>
<p>versus</p>
<pre>;         -- Reference Clock Frequency                                                                      ; <span style="color: #ff0000;"><strong>100.0 MHz</strong></span>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ;
;         -- Output Clock Frequency                                                                         ; <span style="color: #ff0000;"><strong>2500.0 MHz</strong></span>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ;
;         -- L Counter PD Clock Disable                                                                     ; Off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ;
;         -- M Counter                                                                                      ; <span style="color: #ff0000;"><strong>25</strong></span>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ;
;         -- PCIE Frequency Control                                                                         ; pcie_100mhz                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ;
;         -- PD L Counter                                                                                   ; <span style="color: #ff0000;"><strong>1</strong></span>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ;
;         -- PFD L Counter                                                                                  ; 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ;
;         -- Powerdown                                                                                      ; Off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ;
;         -- Reference Clock Divider                                                                        ; <span style="color: #ff0000;"><strong>2</strong></span>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ;</pre>
<p>In both transceivers, a 2500 MHz clock is generated from a 100 MHz reference clock. It seems like the trick to understanding what&#8217;s going on is noting footnote (2) of Table 1-17, saying that the output of L_PD is the one that applies when the PLL is configured as a CDR.</p>
<p>In the first case, the reference clock is fed into the phase detector without division, so 100 MHz reaches one input of the phase detector. As the feedback is divided by PFD_L = 2 and then by M = 25, the VCO has to run at 5000 MHz so that its output divided by 50 matches the 100 MHz reference. That doesn&#8217;t seem very clever to me (why not pick L = 1, and avoid 5 GHz, which I&#8217;m not even sure is possible on that silicon?). But at least the math adds up: The output is divided by PD_L = 2, and we have 2500 MHz.</p>
<p>Now to the second case (PCIe): The reference clock is divided by 2, so the phase detector is fed with a 50 MHz reference. The VCO&#8217;s clock is divided by PFD_L = 2 and then by M = 25, and hence the VCO runs at 2500 MHz. This way, the total division by 50 (again) matches the 50 MHz reference on the phase detector. PD_L = 1, so the VCO&#8217;s output is used undivided, hence an output clock of 2500 MHz, again.</p>
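<p>For what it&#8217;s worth, the arithmetic of both scenarios can be sanity-checked with a short script. The divider names (refclk divider, PFD_L, M, PD_L) follow my reading of the compilation reports above, not any official Intel naming:</p>

```python
def pll_freqs(refclk_mhz, ref_div, pfd_l, m, pd_l):
    """Return (VCO frequency, output frequency) in MHz for the CDR PLL,
    given the dividers as they appear in the compilation report."""
    pfd_mhz = refclk_mhz / ref_div       # reference side of the phase detector
    vco_mhz = pfd_mhz * pfd_l * m        # feedback: VCO / (PFD_L * M) must match PFD
    out_mhz = vco_mhz / pd_l             # output divider
    return vco_mhz, out_mhz

# Native PHY case: reference undivided, VCO at 5000 MHz, output divided by 2
print(pll_freqs(100, ref_div=1, pfd_l=2, m=25, pd_l=2))   # (5000.0, 2500.0)

# PCIe case: reference divided by 2, VCO at 2500 MHz, output undivided
print(pll_freqs(100, ref_div=2, pfd_l=2, m=25, pd_l=1))   # (2500.0, 2500.0)
```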
<p>I&#8217;m not sure that I&#8217;m buying this explanation myself, actually, but it&#8217;s the only way I found to make sense of these figures. At some point I tried to convince the tools to divide the reference clock by 2 on the Native PHY (first case above) by adding</p>
<pre>set_instance_assignment -name <strong>PLL_PFD_CLOCK_FREQUENCY "50 MHz"</strong> -to "frontend:frontend_ins|xcvr:xcvr_inst|altera_xcvr_native_av:xcvr_inst|av_xcvr_native:gen_native_inst.av_xcvr_native_insts[0].gen_bonded_group_native.av_xcvr_native_inst|av_pma:inst_av_pma|av_rx_pma:av_rx_pma|rx_pmas[0].rx_pma.rx_cdr"</pre>
<p>to the QSF file. This assignment was silently ignored: It wasn&#8217;t mentioned anywhere in the reports (not even in the Ignored Assignments part), but the Divider remained at 1. I should mention that this assignment isn&#8217;t documented for Cyclone V, but Quartus&#8217; Assignment Editor nevertheless agreed to generate it. And Quartus usually refuses to load a project if anything is fishy in the QSF file.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2018/05/altera-intel-fpga-pma-cdr-pll/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HTG&#8217;s USB 3.0 FMC module: Things you surely want to know</title>
		<link>https://billauer.se/blog/2018/02/hitech-global-usb3-fmc-tusb1310a/</link>
		<comments>https://billauer.se/blog/2018/02/hitech-global-usb3-fmc-tusb1310a/#comments</comments>
		<pubDate>Sat, 10 Feb 2018 11:41:36 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[FPGA]]></category>
		<category><![CDATA[GTX]]></category>
		<category><![CDATA[USB]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=5377</guid>
<description><![CDATA[Introduction I purchased a HiTech Global 3-Port USB 3.0 SuperSpeed FMC Module (also known as USB3 FMC), which is an FPGA Mezzanine Card, primarily based upon TI&#8217;s SuperSpeed USB 3.0 Transceiver, TUSB1310A. Even though this board works fine at the end of the day, my experience with it was full of surprises, not all of [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>I purchased a <a href="http://www.hitechglobal.com/FMCModules/FMC_USB3.htm" target="_blank">HiTech Global 3-Port USB 3.0 SuperSpeed FMC Module</a> (also known as <a href="https://www.xilinx.com/products/boards-and-kits/1-hki2j5.html" target="_blank">USB3 FMC</a>), which is an FPGA Mezzanine Card, primarily based upon TI&#8217;s SuperSpeed USB 3.0 Transceiver, <a href="http://www.ti.com/product/TUSB1310A">TUSB1310A</a>. Even though this board works fine at the end of the day, my experience with it was full of surprises, not all of which were helpful. But first:</p>
<p><span style="color: #ff0000;"><strong>Don&#8217;t connect the USB/FMC board to the FPGA board before reading through this post.</strong></span> There&#8217;s a good chance that the FPGA board will feed the board&#8217;s 1.8V power supply with 2.5V, which can potentially damage the USB board permanently. More about that below.</p>
<p>The board has three USB 3.0 connectors, two of which are connected to TUSB1310A transceivers, and one going to a GTX transceiver on the FPGA board. The two photos below show the front and back side of the board. I&#8217;ve made the red and green dot markings on the USB Micro-B connectors. The red dot marks the useless connector (explained below), the green one is the one that can be used, and goes to the TUSB1310A. The one not marked is connected directly to the GTX (and is useful, albeit with a few issues).</p>
<p><em>Click to enlarge photos:</em></p>
<p style="text-align: center;"><a href="https://billauer.se/blog/wp-content/uploads/2018/02/htg-usb3-fmc-front.jpg"><img class="aligncenter size-medium wp-image-5379" title="HiTech Global HTG-FMC-USB3.0 Rev 1.0, front side" src="https://billauer.se/blog/wp-content/uploads/2018/02/htg-usb3-fmc-front-300x278.jpg" alt="HiTech Global HTG-FMC-USB3.0 Rev 1.0, front side" width="300" height="278" /></a><a href="https://billauer.se/blog/wp-content/uploads/2018/02/htg-usb3-fmc-back.jpg"><img class="aligncenter size-medium wp-image-5383" title="HiTech Global HTG-FMC-USB3.0 Rev 1.0, back side" src="https://billauer.se/blog/wp-content/uploads/2018/02/htg-usb3-fmc-back-300x262.jpg" alt="HiTech Global HTG-FMC-USB3.0 Rev 1.0, back side" width="300" height="262" /></a></p>
<h3>A few minor irritations</h3>
<p>These are a few issues that aren&#8217;t all that important each by itself, but they mount up. So let&#8217;s have them listed.</p>
<ul>
<li>The schematics aren&#8217;t available until you purchase the board.</li>
<li>When the package arrived, there was a note telling me to contact support for getting the &#8220;product&#8217;s documentation for electronic delivery&#8221;, as this is an &#8220;easy and efficient mechanism for updating the reference designs, user manuals, and other related documents&#8221;. What I got was a pdf file with the schematics. That&#8217;s it. Meaning that I had to get the FMC&#8217;s pinout from the schematics file. No text-based file (let alone an XDC file for KC705, for example), no user guide, nothing. It&#8217;s not just a waste of time to do this manually, but also a source of human errors.</li>
<li><a href="https://www.xilinx.com/products/boards-and-kits/1-hki2j5.html" target="_blank">It says</a> (as of writing this) that a USB3 cable is included, but none was. USB 3.0 Micro-B connectors are different from USB 2.0, so if you&#8217;re ordering the board, be sure to acquire a cable as well.</li>
<li>Their rate for shipping &amp; handling was $100. This is just annoyingly high, being roughly twice the commonly required amount. And to be fair about it, there was an alternative method for reducing the shipping costs, which required taking care of the logistics myself. Not really important, but nevertheless annoying.</li>
</ul>
<h3>The FMC VADJ supply voltage</h3>
<p>The board&#8217;s main power supply is VADJ_FMC, which is connected to the 1.8V power net through a 0 Ohm resistor. VADJ_FMC, as its name implies, is an adjustable voltage, generated by the FPGA board. All Xilinx boards I&#8217;m aware of have this voltage set to 2.5V by default. So whoever doesn&#8217;t pay attention to this issue connects several components&#8217; 1.8V VDD to 2.5V: those of the USB transceivers and clock oscillators, that is.</p>
<p>It would, of course, have been much safer to use the standard 3.3V FMC power supply pins (3P3V, the voltage is standardized), and convert it to 1.8V by means of a DC/DC converter on the board, exactly as the board&#8217;s 1.1V supply is generated. It wouldn&#8217;t solve the voltage incompatibilities of the I/O connections, but there&#8217;s a difference between 2.5V on the power supply and 2.5V on the I/O wires.</p>
<p>Either way, there&#8217;s no user manual to warn about this, and no warning note in the schematics. Given the risk of literally blowing $800, a piece of paper with a warning written in red, placed on top when you open the package, is common practice. There was nothing of this sort.</p>
<p>Even though &#8220;VADJ&#8221; is an &#8220;adjustable voltage&#8221;, it doesn&#8217;t mean it&#8217;s easy to adjust it. As this voltage is generated by a sophisticated power supply controller on all Xilinx&#8217; FPGA boards, it requires reprogramming one of these controllers via PMBUS (which is an I2C-like interface). There are, in principle, two ways for doing this:</p>
<ul>
<li>Attaching a USB/PMBUS adapter (<a href="http://www.ti.com/tool/USB-TO-GPIO" target="_blank">available from TI</a>) and programming the power supply with TI&#8217;s GUI tool for Windows, <a href="http://www.ti.com/tool/fusion_digital_power_designer" target="_blank">Fusion Digital Power</a>. The adapter isn&#8217;t expensive ($75 as of writing this), so you probably want to purchase one along with the HTG board.</li>
<li>Accessing the power supply controller from the FPGA itself. I&#8217;ve written <a href="https://billauer.se/blog/2018/01/pmbus-vadj-fmc/" target="_blank">a post</a> on this. Doesn&#8217;t require purchasing anything, but it may take quite some effort to set it up. Not recommended.</li>
</ul>
<p>Regardless of which way is chosen, it involves changing one of the many voltages generated on the FPGA board. Depending on how bad you consider blowing the FPGA board to be, you probably want to spend some time getting acquainted with the power supply controllers: which one controls which voltage etc. No matter how you twist and turn it, you&#8217;re one little human error away from feeding the FPGA&#8217;s core voltage with 12V.</p>
<p><a href="https://billauer.se/blog/2018/01/pmbus-vadj-fmc/" target="_blank">My post</a> on this will probably help, even though it contains a lot of details that aren&#8217;t relevant for the recommended GUI tool + adapter route.</p>
<p>To sum this up: VADJ must be set to 1.8V on the FPGA board before the HTG is attached to the FPGA board with the FMC connector. It takes 5 minutes to do this, once you have the USB adapter at hand, the GUI tool installed and running, and the knowledge of exactly which power rail of which controller, at what PMBUS address it answers to.</p>
<h3>One TUSB1310A path is (probably) useless</h3>
<p>The board has two ports that are connected to TUSB1310A devices, which are in turn connected to the FMC interface. However, if the board is attached to any of Xilinx&#8217; primary development kits, KC705, KCU105 or ML605, only one of these ports can be used. I haven&#8217;t checked with other FPGA boards, but I&#8217;d expect the situation to be the same. Or maybe the second port can be used for USB 2.0 only. I had no interest in the USB 2.0 part, so I didn&#8217;t bother to check.</p>
<p>The problem is that not all FMC pins are connected to the FPGA on these boards: Neither KC705 nor KCU105 has the FMC connector&#8217;s E21-E37 and F21-F37 pins connected to the FPGA. These pins carry the PIPE signals of the USB port connected to J1 (with prefix P2_* on the relevant signal names in the schematics).</p>
<p>As for ML605, it has almost all FMC connections wired, except for four: FMC pins F37, F38, E36 and E37, which are HB20/HB21_P/N. HB20_P/N are assigned to P2_POWER_DOWN0/1 on the HTG board, and are disconnected (floating) on ML605. As they lack pulldown resistors, both physically on the board and inside the chip itself, these wires, which control the overall power state of the MGT, are left floating. So ML605 can&#8217;t be used either.</p>
<p>Maybe there is an FPGA board out there that can use both USB ports. I have yet to find it.</p>
<h3>No GTX reference clock</h3>
<p>Even though there is a pair of signals intended as a GTX reference clock, GB_CLK_P/N, this clock pair carries no signal. The reason is that the oscillator that produces this 156.25 MHz reference clock, U9, is a TI LMK61E2 with LVPECL output. Unfortunately, the mandatory termination resistors between the LVPECL output and the capacitors are missing. As a result, the LVPECL outputs (BJT transistors&#8217; emitters, with current going only one way) just charge the capacitors, with no route to discharge them, so there&#8217;s no voltage swing, and hence no clock signal.</p>
<p>The obvious workaround is to use another reference clock, hopefully available on the FPGA board (on KC705 I went for the SGMII ref clock at 125 MHz, pins G7/G8).</p>
<p>By the way, the other identical clock oscillator, U5, generates a clock which is available as four differential clocks on the FMC interface, none of which is classified as an MGT clock on the FMC pinout, so odds are that these aren&#8217;t wired to the GTX reference pins on any FPGA board. U5 feeds a clock distributor, U9, which is a Microchip SY89833L without any capacitors in the middle. The latter chip has LVPECL-compatible inputs and LVDS outputs, so there is no problem there; it&#8217;s just not helpful for the GTX case. For general-purpose use, these clocks are available as CLK0/1_M2C_P/N.</p>
<h3>Design errors with the GTX data path</h3>
<p>The board&#8217;s third USB port, J3, is intended for a direct connection with the FPGA&#8217;s GTX via one of the FMC&#8217;s dedicated gigabit pin pairs. There are a few issues to be aware of, however:</p>
<p>First, the wire pairs are flipped in polarity in both directions (TX and RX), something that is quite apparent when looking at the FMC connector&#8217;s wiring. For example, P3_USB30_SSTX_<strong>N</strong> is connected to DP0_C2M_<strong>P</strong>. This is quite harmless, since the GTX has RXPOLARITY and TXPOLARITY ports, which can be asserted to compensate for this error. Besides, the USB 3.0 spec requires that both sides tolerate a P/N flip. And yet, it&#8217;s confusing to see the data stream received by the GTX without knowing about this.</p>
<p>Second, capacitors: There are 100 nF capacitors on the receiving wires of J3 (e.g. P3_USB30_SSRX_N) which shouldn&#8217;t be there. The standard for USB 3.0, as well as several other protocols, is that the capacitors are on the transmitting side only. Whoever designed the board knew that, because the capacitors of J1 and J2 are placed correctly (on P{1,2}_USB30_SSTX_{N,P} only).</p>
<p>There is a similar issue with the reference clock that is generated for the gigabit transceiver, GB_CLK_P/N: In this case, there are capacitors on the HTG board as well as the FPGA board. This isn&#8217;t really a mistake, because there is no standard on which side should put the capacitors, so both sides played safe. And this doesn&#8217;t matter, as this reference clock is dead anyhow, as mentioned above.</p>
<p>Putting two 100 nF capacitors in series yields an equivalent capacitance of 50 nF. For the P3_USB30_SSRX_{N,P} wires, this takes the capacitance below the minimum allowed per spec, which is 75 nF. This will hardly have any effect on the gigabit data transmission, but it may influence the receiver detection mechanism, which measures the current response to a voltage step (even though a properly designed USB link partner shouldn&#8217;t be this fussy).</p>
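<p>For the record, the series capacitance arithmetic behind the 50 nF figure (the helper function is just for illustration):</p>

```python
def series_capacitance_nf(*caps_nf):
    """Equivalent capacitance of series capacitors: 1/C = sum of 1/Ci."""
    return 1.0 / sum(1.0 / c for c in caps_nf)

c_eq = series_capacitance_nf(100, 100)   # the two 100 nF caps in series
print(c_eq)          # 50.0 nF
print(c_eq < 75)     # True: below the spec's 75 nF minimum
```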
<p>And one can always fetch the stereo microscope and soldering iron, and replace C10 and C11 with 0 Ohm resistors.</p>
<p>By the way, on KC705, KCU105 (and probably all other Xilinx development kits) they&#8217;ve placed 100 nF capacitors on the <strong>receiving side only</strong> of the SMA GTX connectors (the PCIe fingers are done properly, of course). So trying to connect the USB 3.0 wires to the SMA connectors will not work, unless 100 nF capacitors are added in series with the FPGA&#8217;s transmission signals. Go figure.</p>
<h3>The XDC constraints for KC705</h3>
<p>Since I have it, these are the placement constraints for a KC705, as I figured them out from the schematics of the USB board and Xilinx&#8217; reference XDC for the board. I have tested P1 and P3 as USB 3.0, but that doesn&#8217;t guarantee that everything below is correct. The constraints for P2 aren&#8217;t given, because they are useless, as explained above. All single-ended pins are LVCMOS18.</p>
<pre>set_property PACKAGE_PIN C27 [get_ports CLK0_M2C_N ]
set_property PACKAGE_PIN D27 [get_ports CLK0_M2C_P ]
set_property PACKAGE_PIN D18 [get_ports CLK1_M2C_N ]
set_property PACKAGE_PIN D17 [get_ports CLK1_M2C_P ]
<del>set_property PACKAGE_PIN C8 [get_ports GB_CLK_N ]</del> # No signal
<del>set_property PACKAGE_PIN C7 [get_ports GB_CLK_P ]</del> # No signal
set_property PACKAGE_PIN C25 [get_ports P1_CLKOUT ]
set_property PACKAGE_PIN B24 [get_ports P1_ELAS_BUF_MODE ]
set_property PACKAGE_PIN H21 [get_ports P1_GPIO ]
set_property PACKAGE_PIN C29 [get_ports P1_PHY_RESET_N ]
set_property PACKAGE_PIN B27 [get_ports P1_PHY_STATUS ]
set_property PACKAGE_PIN C30 [get_ports P1_PIPE_RX[0] ]
set_property PACKAGE_PIN D29 [get_ports P1_PIPE_RX[1] ]
set_property PACKAGE_PIN A30 [get_ports P1_PIPE_RX[2] ]
set_property PACKAGE_PIN B30 [get_ports P1_PIPE_RX[3] ]
set_property PACKAGE_PIN D28 [get_ports P1_PIPE_RX[4] ]
set_property PACKAGE_PIN E30 [get_ports P1_PIPE_RX[5] ]
set_property PACKAGE_PIN F30 [get_ports P1_PIPE_RX[6] ]
set_property PACKAGE_PIN H27 [get_ports P1_PIPE_RX[7] ]
set_property PACKAGE_PIN G30 [get_ports P1_PIPE_RX[8] ]
set_property PACKAGE_PIN H24 [get_ports P1_PIPE_RX[9] ]
set_property PACKAGE_PIN H30 [get_ports P1_PIPE_RX[10] ]
set_property PACKAGE_PIN G28 [get_ports P1_PIPE_RX[11] ]
set_property PACKAGE_PIN H26 [get_ports P1_PIPE_RX[12] ]
set_property PACKAGE_PIN E29 [get_ports P1_PIPE_RX[13] ]
set_property PACKAGE_PIN E28 [get_ports P1_PIPE_RX[14] ]
set_property PACKAGE_PIN F28 [get_ports P1_PIPE_RX[15] ]
set_property PACKAGE_PIN D26 [get_ports P1_PIPE_RX_CLK ]
set_property PACKAGE_PIN H25 [get_ports P1_PIPE_RX_K[0] ]
set_property PACKAGE_PIN G29 [get_ports P1_PIPE_RX_K[1] ]
set_property PACKAGE_PIN G27 [get_ports P1_PIPE_RX_VALID ]
set_property PACKAGE_PIN A17 [get_ports P1_PIPE_TX[0] ]
set_property PACKAGE_PIN A18 [get_ports P1_PIPE_TX[1] ]
set_property PACKAGE_PIN A16 [get_ports P1_PIPE_TX[2] ]
set_property PACKAGE_PIN B18 [get_ports P1_PIPE_TX[3] ]
set_property PACKAGE_PIN F17 [get_ports P1_PIPE_TX[4] ]
set_property PACKAGE_PIN A21 [get_ports P1_PIPE_TX[5] ]
set_property PACKAGE_PIN G17 [get_ports P1_PIPE_TX[6] ]
set_property PACKAGE_PIN A20 [get_ports P1_PIPE_TX[7] ]
set_property PACKAGE_PIN C20 [get_ports P1_PIPE_TX[8] ]
set_property PACKAGE_PIN B20 [get_ports P1_PIPE_TX[9] ]
set_property PACKAGE_PIN F18 [get_ports P1_PIPE_TX[10] ]
set_property PACKAGE_PIN A22 [get_ports P1_PIPE_TX[11] ]
set_property PACKAGE_PIN B22 [get_ports P1_PIPE_TX[12] ]
set_property PACKAGE_PIN F21 [get_ports P1_PIPE_TX[13] ]
set_property PACKAGE_PIN G18 [get_ports P1_PIPE_TX[14] ]
set_property PACKAGE_PIN D19 [get_ports P1_PIPE_TX[15] ]
set_property PACKAGE_PIN E19 [get_ports P1_PIPE_TX_CLK ]
set_property PACKAGE_PIN E21 [get_ports P1_PIPE_TX_K[0] ]
set_property PACKAGE_PIN F20 [get_ports P1_PIPE_TX_K[1] ]
set_property PACKAGE_PIN B29 [get_ports P1_POWER_DOWN[0] ]
set_property PACKAGE_PIN A25 [get_ports P1_POWER_DOWN[1] ]
set_property PACKAGE_PIN D21 [get_ports P1_PWRPRESENT ]
set_property PACKAGE_PIN D16 [get_ports P1_RATE ]
set_property PACKAGE_PIN C21 [get_ports P1_RESET_N ]
set_property PACKAGE_PIN F27 [get_ports P1_RX_ELECIDLE ]
set_property PACKAGE_PIN A27 [get_ports P1_RX_POLARITY ]
set_property PACKAGE_PIN B28 [get_ports P1_RX_STATUS[0] ]
set_property PACKAGE_PIN C24 [get_ports P1_RX_STATUS[1] ]
set_property PACKAGE_PIN A28 [get_ports P1_RX_STATUS[2] ]
set_property PACKAGE_PIN A26 [get_ports P1_RX_TERMINATION ]
set_property PACKAGE_PIN G22 [get_ports P1_TX_DEEMPH[0] ]
set_property PACKAGE_PIN C16 [get_ports P1_TX_DEEMPH[1] ]
set_property PACKAGE_PIN B17 [get_ports P1_TX_DETRX_LPBK ]
set_property PACKAGE_PIN C19 [get_ports P1_TX_ELECIDLE ]
set_property PACKAGE_PIN F22 [get_ports P1_TX_MARGIN[0] ]
set_property PACKAGE_PIN D22 [get_ports P1_TX_MARGIN[1] ]
set_property PACKAGE_PIN C22 [get_ports P1_TX_MARGIN[2] ]
set_property PACKAGE_PIN B19 [get_ports P1_TX_ONESZEROS ]
set_property PACKAGE_PIN C17 [get_ports P1_TX_SWING ]
set_property PACKAGE_PIN H14 [get_ports P1_ULPI_CLK ]
set_property PACKAGE_PIN A13 [get_ports P1_ULPI_D[0] ]
set_property PACKAGE_PIN K16 [get_ports P1_ULPI_D[1] ]
set_property PACKAGE_PIN G15 [get_ports P1_ULPI_D[2] ]
set_property PACKAGE_PIN B15 [get_ports P1_ULPI_D[3] ]
set_property PACKAGE_PIN H16 [get_ports P1_ULPI_D[4] ]
set_property PACKAGE_PIN H15 [get_ports P1_ULPI_D[5] ]
set_property PACKAGE_PIN L15 [get_ports P1_ULPI_D[6] ]
set_property PACKAGE_PIN C15 [get_ports P1_ULPI_D[7] ]
set_property PACKAGE_PIN J16 [get_ports P1_ULPI_DIR ]
set_property PACKAGE_PIN L16 [get_ports P1_ULPI_NXT ]
set_property PACKAGE_PIN K15 [get_ports P1_ULPI_STP ]
set_property PACKAGE_PIN E4 [get_ports P3_USB30_SSRX_N ]
set_property PACKAGE_PIN E3 [get_ports P3_USB30_SSRX_P ]
set_property PACKAGE_PIN D2 [get_ports P3_USB30_SSTX_N ]
set_property PACKAGE_PIN D1 [get_ports P3_USB30_SSTX_P ]</pre>
<h3>Summary</h3>
<p>An FMC board for USB 3.0 probably doesn&#8217;t sell in large quantities, and it&#8217;s quite understandable that its vendor is not interested in spending too much effort on it. Its design flaws and errors are fairly acceptable once on the table, but given the lack of documentation and supplementary data, the overall picture is not what one would expect from a vendor that has been around for quite a while.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2018/02/hitech-global-usb3-fmc-tusb1310a/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
<title>Gigabit transceivers on FPGA: Selected topics</title>
		<link>https://billauer.se/blog/2016/03/mgt-gtx-fpga/</link>
		<comments>https://billauer.se/blog/2016/03/mgt-gtx-fpga/#comments</comments>
		<pubDate>Sat, 26 Mar 2016 07:18:39 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[FPGA]]></category>
		<category><![CDATA[GTX]]></category>
		<category><![CDATA[PCI express]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=4948</guid>
<description><![CDATA[Introduction This is a summary of a few topics that should be kept in mind when a Multi-Gigabit Transceiver (MGT) is employed in an FPGA design. It&#8217;s not a substitute for reading the relevant user guide, nor a tutorial. Rather, it&#8217;s here to point at issues that may not be obvious at first glance. [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>This is a summary of a few topics that should be kept in mind when a Multi-Gigabit Transceiver (MGT) is employed in an FPGA design. It&#8217;s not a substitute for reading the relevant user guide, nor a tutorial. Rather, it&#8217;s here to point at issues that may not be obvious at first glance.</p>
<p>It&#8217;s worth noting that a simple and yet <a href="https://www.xillybus.com/xillyp2p/optical-fiber-ultrascale-gth" target="_blank">complete design example is available</a>, based upon <a href="https://www.xillybus.com/xillyp2p" target="_blank">Xillyp2p</a>.</p>
<p>The terminology and signal names are those used with Xilinx FPGAs. The transceiver is referred to as GTX (Gigabit Transceiver), but other variants of transceivers, e.g. GTH and GTZ, are to a large extent the same components with different bandwidth capabilities.</p>
<h3>Overview</h3>
<p>GTXs, which are the basic building blocks for common interface protocols (e.g. PCIe and SATA), are becoming an increasingly popular solution for communication between FPGAs. As the GTX instance consists of a clock and parallel data interface, it&#8217;s easy to mistake it for a simple channel that moves the data to the other end in a failsafe manner. A more realistic view of the GTX is as the front end of a modem, with possible bit errors and a need to synchronize the serial-to-parallel data alignment at the receiver. Designing with the GTX also requires attention to classic communication-related topics, e.g. the use of data encoding, equalizers and scramblers.</p>
<p>As a result, there are a few application-dependent pieces of logic that need to be developed to support the channel:</p>
<ul>
<li>The possibility of bit errors on the channel must be handled</li>
<li>The alignment from a bit stream to a parallel word must be taken care of (which bit is the LSB of the parallel word in the serial stream?)</li>
<li>If the transmitter and receiver aren&#8217;t based on a common clock, a protocol that injects and tolerates idle periods in the data stream must be used, or the clock difference will cause data underflows or overflows. Sending the data in packets is a common solution. In the pauses between these packets, special skip symbols must be inserted into the data stream, so that the GTX receiver&#8217;s clock correction mechanism can remove or add such symbols in the stream presented to the application logic, which runs at a clock slightly different from the received data stream.</li>
<li>Odds are that a scrambler needs to be applied on the channel. This requires logic that creates the scrambling sequence as well as synchronizes the receiver. The reason is that an equalizer assumes that the bit stream is uncorrelated on the average. Any average correlation between bit positions is considered ISI and is &#8220;fixed&#8221;. See <a href="http://en.wikipedia.org/wiki/Linear_feedback_shift_register" target="_blank">Wikipedia</a></li>
</ul>
<p>Having said the above, it&#8217;s not uncommon that no bit errors are ever observed on a GTX channel, even at very high rates, and possibly with no equalization enabled. This can&#8217;t be relied upon, however, as there is in fact no express guarantee for the actual error probability of the channel.</p>
<h3>Clocking</h3>
<p>The clocking of the GTXs is an issue in itself. Unlike the logic fabric, each GTX has a limited number of possible sources for its reference clock. It&#8217;s mandatory to ensure that the reference clock(s) are present on one of the allowed dedicated inputs. Each clock pin can function as the reference clock of up to 12 particular GTXs.</p>
<p>It&#8217;s also important to pay attention to the generation of the serial data clocks for each GTX from the reference clock(s). It&#8217;s not only a matter of what multiplication ratios are allowed, but also how to allocate PLL resources and their access to the required reference clocks.</p>
<h3>QPLL vs. CPLL</h3>
<p>Two types of PLLs are available for producing the serial data clock, typically running at several GHz: QPLLs and CPLLs.</p>
<p>The GTXs are organized in groups of four (“quads”). Each quad shares a single QPLL (Quad PLL), which is instantiated separately (as a GTXE2_COMMON). In addition, each GTX has a dedicated CPLL (Channel PLL), which can generate the serial clock for that GTX only.</p>
<p>Each GTX may select its clock source from either the (common) QPLL or its dedicated CPLL. The main difference between these is that the QPLL covers higher frequencies. High-rate applications are hence forced to use the QPLL. The downside is that all GTXs sharing the same QPLL must have the same data rate (except that each GTX may divide the QPLL&#8217;s clock by a different ratio). The CPLLs allow for greater flexibility of the clock rates, as each GTX can pick its clock independently, but within a limited frequency range.</p>
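<p>To illustrate the constraint with made-up numbers (not taken from any datasheet): if the QPLL is set up for a 10 Gb/s base line rate, the GTXs sharing it can only pick integer divisions of that rate, whereas a CPLL-driven GTX could run at an unrelated rate within the CPLL&#8217;s range:</p>

```python
def qpll_line_rates(base_rate_gbps, dividers=(1, 2, 4, 8)):
    """Line rates available to GTXs sharing one QPLL: only the common
    base rate divided by per-GTX integer dividers (values illustrative)."""
    return [base_rate_gbps / d for d in dividers]

print(qpll_line_rates(10.0))   # [10.0, 5.0, 2.5, 1.25]
```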
<h3>Jitter</h3>
<p>Jitter on the reference clock(s) is the silent killer of GTX links. It&#8217;s often neglected by designers because &#8220;it works anyhow&#8221;, but jitter on the reference clock has a disastrous effect on the channel&#8217;s quality, which can be by far worse than a poor PCB layout. As both jitter and poor PCB layout (and/or cabling) contribute to the bit error rate and the channel&#8217;s instability, the PCB design is often blamed when things go bad. And indeed, playing with the termination resistors or similar black-magic actions sometimes &#8220;fix it&#8221;. This makes people believe that GTX links are extremely sensitive to every via or curve in the PCB trace, which is not the case at all. It is, on the other hand, very sensitive to the reference clock&#8217;s jitter. And with some luck, a poorly chosen reference clock can be compensated for with a very clean PCB trace.</p>
<p>Jitter is commonly modeled as a noise component which is added to the timing of the clock transition, i.e. t=kT+n (n is the noise). Consequently, it is often specified in terms of the RMS of this noise component, or a maximal value which is crossed at a sufficiently low probability. The treatment of a GTX&#8217;s reference clock requires a slightly different approach; the RMS figures are not necessarily a relevant measure. In particular, clock sources with excellent RMS jitter may turn out inadequate, while other sources, with less impressive RMS figures, may work better.</p>
<p>Since the QPLL or CPLL locks on this reference clock, jitter on the reference clock results in jitter in the serial data clock. The prevailing effect is on the transmitter, which relies on this serial data clock; the receiver is mainly based on the clock it recovers from the incoming data stream, and is therefore less sensitive to jitter.</p>
<p>Some of the jitter, in particular “slow” jitter (consisting of low-frequency components), is fairly harmless, as the other side&#8217;s receiver clock synchronization loop will cancel its effect by tracking the random shifts of the clock. On the other hand, very fast jitter in the reference clock may not be picked up by the QPLL/CPLL, and is hence harmless as well.</p>
<p>All in all, there&#8217;s a certain band of frequency components in the clock&#8217;s timing noise spectrum which remains relevant: The band that causes jitter components which are slow enough for the QPLL/CPLL to track, and hence present on the serial data clock, yet too fast for the receiver&#8217;s tracking loop to follow. The measurable expression for this selective jitter requirement is given in terms of phase noise frequency masks, or sometimes as the RMS jitter in bandwidth segments (e.g. PCIe Base spec 2.1, section 4.3.7, or <a href="http://www.xilinx.com/support/answers/44549.html" target="_blank">Xilinx&#8217; AR 44549</a>). Such spectrum masks, as required for the GTX, are published by the hardware vendors. The spectral behavior of clock sources is often more difficult to predict: Even when noise spectra are published in datasheets, they are commonly given only for certain scenarios, as typical figures.</p>
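<p>The band-selective requirement can be put in numbers: integrating the phase noise density L(f) over the relevant offset band only, and converting the result from phase to time, yields the RMS jitter that actually matters. A rough sketch, with a made-up phase noise profile rather than any specific oscillator&#8217;s figures:</p>

```python
import math

def rms_jitter_ps(phase_noise_points, f_carrier_hz):
    """RMS jitter (in ps) from single-sideband phase noise samples,
    integrated with the trapezoid rule over the given offset band.
    phase_noise_points: list of (offset frequency in Hz, L(f) in dBc/Hz)."""
    total = 0.0
    for (f1, l1), (f2, l2) in zip(phase_noise_points, phase_noise_points[1:]):
        p1 = 10.0 ** (l1 / 10.0)     # dBc/Hz -> linear power density
        p2 = 10.0 ** (l2 / 10.0)
        total += 0.5 * (p1 + p2) * (f2 - f1)
    # Factor 2 for both sidebands; divide by 2*pi*f0 to turn radians into time
    return math.sqrt(2.0 * total) / (2.0 * math.pi * f_carrier_hz) * 1e12

# Made-up mask, integrated over a 10 kHz - 20 MHz band of interest only
mask = [(10e3, -110.0), (100e3, -120.0), (1e6, -130.0), (20e6, -140.0)]
print(rms_jitter_ps(mask, 156.25e6))   # about 2 ps RMS in this band
```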
<h3>8b/10b encoding</h3>
<p>Several standardized uses of MGT channels (SATA, PCIe, DisplayPort etc.) involve a specific encoding scheme between payload bytes for transmission and the actual bit sequence on the channel. Each (8-bit) byte is mapped to a 10-bit word, based upon a rather peculiar encoding table. The purpose of this encoding is to ensure a balance between the number of 0&#8242;s and 1&#8242;s on the physical channel, allowing AC-coupling of the electrical signal. This encoding also ensures frequent toggling between 0&#8242;s and 1&#8242;s, which ensures proper bit synchronization at the receiver by virtue of the clock recovery loop (&#8220;CDR&#8221;). Other things that are worth noting about this encoding:</p>
<ul>
<li>As there are 1024 possible code words covering 256 possible input bytes, some of the excess code words are allocated as control characters. In particular, a control character designated K.28.5 is often referred to as &#8220;comma&#8221;, and is used for synchronization.</li>
<li>The 8b/10b encoding is <strong>not</strong> an error correction code despite its redundancy, but it does detect <strong>some</strong> errors, if the received code word is not decodable. On the other hand, a single bit error may lead to a completely different decoded word, without any indication that an error occurred.</li>
</ul>
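<p>The comma&#8217;s role can be demonstrated: K.28.5 encodes to 0011111010 or 1100000101 (depending on running disparity), and its first seven bits, 0011111 or 1100000, form a sequence that a valid 8b/10b stream can&#8217;t produce at any other position, including across word boundaries. Scanning the raw bit stream for it therefore reveals the word alignment. A toy sketch (bit ordering simplified relative to real hardware, which does this inside the transceiver):</p>

```python
COMMA_PATTERNS = ("0011111", "1100000")   # first 7 bits of K.28.5, either disparity

def find_alignment(bitstream):
    """Return the bit offset (modulo 10) at which 10-bit word boundaries
    should be placed in a raw bit string, or None if no comma is found."""
    for offset in range(len(bitstream) - 6):
        if bitstream[offset:offset + 7] in COMMA_PATTERNS:
            return offset % 10
    return None

# Three 10-bit words, preceded by 3 stray bits; K.28.5 (RD-) in the middle
stream = "101" + "0101010101" + "0011111010" + "0101010101"
print(find_alignment(stream))   # 3: word boundaries sit at offsets 3, 13, 23, ...
```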
<h3>Scrambling</h3>
<p>To put it short and concise: If an equalizer is applied, the user-supplied data stream must be random. If the data payload can&#8217;t be ensured to be random itself (this is almost always the case), a scrambler must be defined in the communication protocol, and applied in the logic design.</p>
<p>Applying a scrambler on the channel is a tedious task, as it requires a synchronization mechanism between the transmitter and receiver. It&#8217;s often quite tempting to skip it, as the channel will work quite well even in the absence of a scrambler, even where it&#8217;s needed. However, in the long run, occasional channel errors are typically experienced.</p>
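<p>For a concrete (if simplified) picture of what such logic computes, here&#8217;s an additive LFSR scrambler in Python, using the x<sup>16</sup> + x<sup>5</sup> + x<sup>4</sup> + x<sup>3</sup> + 1 polynomial known from PCIe Gen1/Gen2 as an example. Bit ordering and the synchronization of the LFSR&#8217;s state between the two sides (the tedious part) are left out, so this is a sketch of the principle, not any specific protocol&#8217;s scrambler:</p>

```python
def lfsr_stream(taps, state, nbytes):
    """Scrambling bytes from a Fibonacci LFSR.
    taps: feedback bit positions (16, 5, 4, 3 for x^16+x^5+x^4+x^3+1).
    state: initial nonzero 16-bit LFSR state."""
    out = []
    for _ in range(nbytes):
        byte = 0
        for bit in range(8):
            feedback = 0
            for t in taps:
                feedback ^= (state >> (t - 1)) & 1
            byte |= (state & 1) << bit          # LSB is the scrambling bit
            state = (state >> 1) | (feedback << 15)
        out.append(byte)
    return out

def scramble(data, seed=0xFFFF):
    """Additive scrambler: XOR data with the LFSR stream. Running the
    scrambled data through the same function recovers the original."""
    keystream = lfsr_stream((16, 5, 4, 3), seed, len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))

payload = b"highly non-random payload: aaaaaaaaaaaaaaaa"
scrambled = scramble(payload)
print(scramble(scrambled) == payload)   # True: descrambling is the same XOR
```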
<p>The rest of this paragraph attempts to explain the connection between the equalizer and scrambler. It&#8217;s not the easiest piece of reading, so it&#8217;s fine to skip it, if my word on this is enough for you.</p>
<p>In order to understand why scrambling is probably required, it&#8217;s first necessary to understand what an equalizer does.</p>
<p>The problem equalizers solve is the filtering effect of the electrical media (the &#8220;channel&#8221;) through which the bit stream travels. Both cables and PCBs reduce the strength of the signal, but even worse: The attenuation depends on the frequency, and reflections occur along the metal trace. As a result, the signal doesn&#8217;t just get smaller in magnitude, but it&#8217;s also smeared over time. A perfect, sharp, step-like transition from -1200 mV to +1200 mV at the transmitter&#8217;s pins may end up as a slow, rounded rise from -100 mV to +100 mV. Because of these slow transitions at the receiver, the clear boundaries between the bits are broken. Each transmitted bit keeps leaving its traces way after its time period. This is called Inter-Symbol Interference (ISI): The received voltage at the sampling time for the bit at t=0 depends on the bits at t=-T, t=-2T and so on. Each bit effectively produces noise for the bits coming after it.</p>
<p>This is where the equalizer comes in. Its inputs are the voltage sample of the bit at t=0, along with a number of measured voltage samples of the bits before and after it. By making a weighted sum of these inputs, the equalizer manages, to a large extent, to cancel the Inter-Symbol Interference. In a way, it implements a reverse filter of the channel.</p>
<p>So how does the equalizer acquire the coefficients for each of the samples? There are different techniques for training an equalizer to work effectively against the channel&#8217;s filtering. For example, cellular phones do their training based upon a sequence of bits on each burst, which is known in advance. But when the data stream runs continuously, and the channel may change slightly over time (e.g. a cable is being bent) the training has to be continuous as well. The chosen method for the equalizers in GTXs is therefore continuous.</p>
<p>The Decision Feedback Equalizer, for example, starts with making a decision on whether each input bit is a &#8217;0&#8242; or &#8217;1&#8242;. It then calculates the noise signal for this bit, by subtracting the expected voltage for a &#8217;0&#8242; or &#8217;1&#8242; (whichever was decided upon) from the measured voltage. The algorithm then slightly alters the weights in a way that removes any statistical correlation between the noise and the previous samples. This works well when the bit sequence is completely random: There is no expected correlation between the input samples, and if such correlation exists, it&#8217;s rightfully removed. Also, the adaptation converges into a compromise that works best on average for all bit sequences.</p>
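<p>This adaptation loop can be sketched in miniature. The following Python toy model is not the GTX&#8217; actual DFE, but a simple linear equalizer trained with the LMS rule against an assumed two-tap ISI channel; it shows how the weights converge when the bits are random.</p>

```python
import random
random.seed(1)

# Random +/-1 bits; the channel smears each bit into the next one (ISI)
bits = [random.choice([-1.0, 1.0]) for _ in range(4000)]
rx = [bits[i] + 0.5 * (bits[i - 1] if i > 0 else 0.0) for i in range(len(bits))]

# Two-tap equalizer, adapted with the LMS rule
w = [0.0, 0.0]
mu = 0.01
sq_errors, decisions = [], []
for i in range(1, len(rx)):
    x = (rx[i], rx[i - 1])            # current and previous received samples
    y = w[0] * x[0] + w[1] * x[1]     # weighted sum: the equalizer's output
    decisions.append(1.0 if y >= 0 else -1.0)
    e = bits[i] - y                   # error against the known bit
    sq_errors.append(e * e)
    # Nudge the weights so the error decorrelates from the input samples
    w = [w[k] + mu * e * x[k] for k in range(2)]

late = sq_errors[-1000:]
print('converged MSE: %.3f' % (sum(late) / len(late)))
```

<p>After convergence, the decisions on the last samples all match the transmitted bits despite the ISI. Feeding the same loop with strongly correlated bits instead would drive the weights toward that specific pattern, rather than toward the channel&#8217;s inverse.</p>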
<p>But what happens if there is a certain statistical correlation between the bits in the data itself? The equalizer will specialize in reducing the ISI for the bit patterns occurring more often, possibly doing very badly on the rarer patterns. The equalizer&#8217;s role is to compensate for the channel&#8217;s filtering effect, but instead, it adds an element of filtering of its own, based upon the common bit patterns. In particular, note that if a constant pattern runs through the channel when there&#8217;s no data for transmission (zeros, idle packets etc.) the equalizer will specialize in getting that no-data through, and mess up the actual data.</p>
<p>One could be led to think that the 8b/10b encoding plays a role in this context, but it doesn&#8217;t. Even though it cancels out DC on the channel, it does nothing about the correlation between the bits. For example, if the payload for transmission consists of zeros only, the encoded words on the channel will be either 1001110100 or 0110001011. The DC on the channel will remain zero, but the statistical correlation between the bits is far from being zero.</p>
<p>So unless the data is inherently random (e.g. an encrypted stream), using an equalizer means that the data which is supplied by the application to the transmitter must be randomized.</p>
<p>The common solution is a scrambler: XORing the payload data by a pseudo-random sequence of bits, generated by a simple state machine. The receiver must XOR the incoming data with the same sequence in order to retrieve the payload data. The comma (K28.5) symbol is often used to synchronize both state machines.</p>
<p>In GTX applications, the (by far) most commonly used scrambler is the <a href="https://www.01signal.com/other/lfsr-galois-fibonacci/" target="_blank">G(X)=X^16+X^5+X^4+X^3+1 LFSR</a>, which is defined in a friendly manner in the PCIe standard (e.g. the PCI Express Base Specification, rev. 1.1, section 4.2.3 and in particular Appendix C).</p>
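<p>As a sketch, here&#8217;s a plain Galois-style implementation of this LFSR in Python. The exact bit ordering, seeding and reset rules of a real protocol (PCIe, for instance) follow its spec; this only illustrates the principle, including the scrambler&#8217;s convenient self-inverse property.</p>

```python
# G(X) = X^16 + X^5 + X^4 + X^3 + 1, implemented as a Galois LFSR.
# When the MSB shifts out, the remaining terms (bits 5, 4, 3, 0) are XORed in.
POLY_MASK = 0x0039

def lfsr_byte(state):
    """Advance the 16-bit LFSR by 8 bits; return (scramble byte, new state)."""
    out = 0
    for bit in range(8):
        msb = (state >> 15) & 1
        out |= msb << bit
        state = ((state << 1) & 0xFFFF) ^ (POLY_MASK if msb else 0)
    return out, state

def scramble(data, seed=0xFFFF):
    """XOR each payload byte with the LFSR output. The operation is its own
    inverse: applying it twice with the same seed recovers the data."""
    state = seed
    out = bytearray()
    for byte in data:
        key, state = lfsr_byte(state)
        out.append(byte ^ key)
    return bytes(out)

payload = bytes(64)                     # all-zeros: worst case without scrambling
scrambled = scramble(payload)
assert scramble(scrambled) == payload   # the receiver recovers the payload
```

<p>Note that both sides must start from the same seed at the same point in the stream, which is why a synchronization event (such as a comma) is needed to reset the two LFSRs together.</p>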
<h3>TX/RXUSRCLK and TX/RXUSRCLK2</h3>
<p>Almost all signals between the FPGA logic fabric and the GTX are clocked with TXUSRCLK2 (for transmission) and RXUSRCLK2 (for reception). These signals are supplied by the user application logic, without any special restriction, except that the frequency must match the GTX&#8217; data rate so as to avoid overflows or underflows. A common solution for generating these clocks is therefore to route the GTX&#8217; TX/RXOUTCLK through a BUFG.</p>
<p>The logic fabric is required to supply a second clock in each direction, TXUSRCLK and RXUSRCLK (without the “2” suffix). These two clocks are the parallel data clocks in a deeper position of the GTX.</p>
<p>The rationale is that sometimes it&#8217;s desired to let the logic fabric work with a word which is twice as wide as usual. For example, in a high-end data rate application, the GTX&#8217; word width may be set to 40 bits with 8b/10b, so the logic fabric would interface with the GTX through a 32-bit data vector. But because of the high rate, the clock frequency may still be too high for the logic fabric, in which case the GTX allows halving the clock, and applying the data through a 64-bit word. In this case, the logic fabric supplies the 64-bit word clocked with TXUSRCLK2, and is also required to supply a second clock, TXUSRCLK, having twice the frequency and being phase aligned with TXUSRCLK2. TXUSRCLK is for the GTX&#8217; internal use.</p>
<p>A similar arrangement applies for reception.</p>
<p>Unless the required data clock rate is too high for the logic fabric (which is usually not the case), this dual-clock arrangement is best avoided, as it requires an MMCM or PLL to generate two phase-aligned clocks; its only benefit is the lower clock frequency presented to the logic fabric.</p>
<h3>Word alignment</h3>
<p>On the transmitting side, the GTX receives a vector of bits, which forms a word for transmission. The width of this word is one of the parameters that are set when the GTX is instantiated, and so is whether 8b/10b encoding is applied. Either way, some format of parallel words is transmitted over the channel in a serialized manner, bit after bit. Unless explicitly arranged for, there is nothing in this serial bitstream to indicate the words&#8217; boundaries. Hence the receiver has no way, a priori, to recover the word alignment.</p>
<p>The receiver&#8217;s GTX&#8217; output consists of a parallel vector of bits, typically with the same width as the transmitter&#8217;s. Unless a mechanism is employed by the user logic, the GTX has no way to recover the correct alignment. Without such alignment, the organization into parallel words arrives wrong at the receiver, possibly as complete garbage, as an incorrect alignment prevents 8b/10b decoding (if employed).</p>
<p>It&#8217;s up to the application logic to implement a mechanism for synchronizing the receiver&#8217;s word alignment. There are two methodologies for this: Moving the alignment one bit at a time at the receiver&#8217;s side (&#8220;bit slipping&#8221;) until the data arrives properly, or transmitting a predefined pattern (a &#8220;comma&#8221;) periodically, and synchronizing the receiver when this pattern is detected.</p>
<p>Bit slipping is the less recommended practice, even though it&#8217;s simpler to understand. It keeps most of the responsibility in the application logic&#8217;s domain: The application logic monitors the arriving data, and issues a bit slip request when it has gathered enough errors to conclude that the alignment is out of sync.</p>
<p>However, most well-established GTX-based protocols use commas for alignment. This method is easier in that the GTX aligns the word automatically when a comma is detected (if the GTX is configured to do so). If injecting comma characters periodically into the data stream fits well in the protocol, this is probably the preferred solution. The comma character can also be used to synchronize other mechanisms, in particular the scrambler (if employed).</p>
<p>Comma detection may also have false positives, resulting from errors in the raw data channel. As these data channels usually have a very low bit error rate (BER), this possibility can be overlooked in applications where a short-term false alignment resulting from a falsely detected comma is acceptable. When this is not acceptable, the application logic should monitor the incoming data, and disable the GTX&#8217; automatic comma alignment through the rxpcommaalignen and/or rxmcommaalignen inputs of the GTX.</p>
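<p>The principle of comma-based alignment can be demonstrated with a toy model in Python (the real mechanism is in the GTX hardware, of course): scan the serial bit stream for either of K.28.5&#8217;s standard codewords, and derive the 10-bit word boundaries from its position.</p>

```python
# Toy comma-based word alignment: find K.28.5 in a serial bit stream and use
# its position to recover the 10-bit word boundaries.
COMMA_NEG = '0011111010'  # K.28.5, negative running disparity
COMMA_POS = '1100000101'  # K.28.5, positive running disparity

def align(serial):
    """Find the first comma; return (bit offset, list of 10-bit words)."""
    for offset in range(len(serial) - 9):
        window = serial[offset:offset + 10]
        if window in (COMMA_NEG, COMMA_POS):
            words = [serial[i:i + 10]
                     for i in range(offset, len(serial) - 9, 10)]
            return offset, words
    return None, []

# Build a stream: 3 junk bits, then a comma followed by two data words
stream = '101' + COMMA_NEG + '0110001011' + '1001110100'
offset, words = align(stream)
print(offset, words[0] == COMMA_NEG)  # prints: 3 True
```

<p>As the prose above notes, a corrupted bit could make the comma pattern appear at a wrong offset; in a real design this is handled by monitoring the decoded data and disabling the automatic alignment once locked.</p>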
<h3>Tx buffer, to use or not to use</h3>
<p>The Tx buffer is a small dual-clock (&#8220;asynchronous&#8221;) FIFO in the transmitter&#8217;s data path, plus some logic that makes sure it starts off half full.</p>
<p>The underlying problem, which the Tx buffer potentially solves, is that the serializer inside the GTX runs on a certain clock (XCLK) while the application logic is exposed to another clock (TXUSRCLK). The frequency of these clocks must be exactly the same to prevent overflow or underflow inside the GTX. This is fairly simple to achieve. Ensuring proper timing relationships between these two clocks is however less trivial.</p>
<p>There are hence two possibilities:</p>
<ul>
<li>Not requiring a timing relationship between these clocks (just the same frequency). Instead, use a dual-clock FIFO, which interfaces between these two clock domains. This small FIFO is referred to as the “Tx buffer”. Since it&#8217;s part of the GTX&#8217; internal logic, going this path doesn&#8217;t require any additional resources from the logic fabric.</li>
<li>Make sure that the clocks are aligned, by virtue of a state machine. This state machine is implemented in the logic fabric.</li>
</ul>
<p>The first solution is simpler and requires fewer resources from the FPGA&#8217;s logic fabric. Its main drawback is the latency of the Tx buffer, which is typically around 30 TXUSRCLK cycles. While this delay is usually negligible from a functional point of view, it&#8217;s not possible to predict its exact magnitude. It&#8217;s therefore not possible to use the Tx buffer on several parallel lanes of data if the protocol requires a known alignment between the data in these lanes, or when an extremely low latency is required.</p>
<p>The second solution requires some extra logic, but there is no significant design effort: The logic that aligns the clocks is included automatically by the IP core generator on Vivado 2014.1 and later, when &#8220;Tx/Rx buffer off&#8221; mode is chosen.</p>
<p>Xilinx&#8217; GTX documentation is somewhat misleading in that it spells out the requirements of the state machine in painful detail: There&#8217;s no need to read through that long saga in the user guide, as this logic is included automatically by the IP core generator, so there&#8217;s really no reason to dive into this issue. Only note that gtN_tx_fsm_reset_done_out may take a bit longer to assert after a reset (something like 1 ms on a 10 Gb/s lane).</p>
<h3>Rx buffer</h3>
<p>The Rx buffer (also called &#8220;Rx elastic buffer&#8221;) is also a dual-clock FIFO, which is placed in the same clock domain gap as the Tx buffer, and has the same function. Bypassing it requires the same kind of alignment mechanism in the logic fabric.</p>
<p>As with its Tx counterpart, bypassing the Rx buffer makes the latency short and deterministic. It&#8217;s however less common that such a bypass is practically justified: While a deterministic Tx latency may be required to ensure data alignment between parallel lanes in order to meet certain standard protocol requirements, there are almost always fairly easy methods to compensate for an unknown Rx latency in user logic. Either way, it&#8217;s preferred not to rely on the transmitter to meet requirements on data alignment, but rather to align the data, if required, by virtue of user logic.</p>
<h3>Leftover notes</h3>
<ul>
<li>sysclk_in must be stable when the FPGA wakes up from configuration. A state machine that brings up the transceivers is based upon this clock. It&#8217;s referred to as the DRP clock in the wizard.</li>
<li>It&#8217;s important to declare the DRP clock&#8217;s frequency correctly, as certain required delays which are measured in nanoseconds are implemented by dwelling for a number of clocks, which is calculated from this frequency.</li>
<li>In order to transmit a comma, set txcharisk to 1 (since it&#8217;s a vector, this sets the LSB) and set the 8 LSBs of the data to 0xBC, which is the code for K.28.5.</li>
</ul>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2016/03/mgt-gtx-fpga/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
