
Using MGTs in FPGA designs: Why the data is organized in packets

eli — Sat, 07 Feb 2026 13:47:49 +0000

Introduction

I’ll start with a correction: Indeed, application logic transmitting data from one FPGA to another is required to organize the data in some kind of packets or frames, but there’s one exception, which I’ll discuss later on: Xillyp2p. Anyhow, let’s take it from the beginning.

Multi-Gigabit Transceivers (MGTs, sometimes also referred to as RocketIO, GTX, GTH, GTY, GTP, GTM, etc.) long ago became the de facto standard for serialized data communication between digital components. The most famous use cases are links between a computer and its peripherals (often between the CPU’s companion chip and a peripheral), for example, PCIe, SuperSpeed USB (a.k.a. USB 3.x), and SATA. Also related to computers, Gigabit Ethernet (as well as 10GbE) is based upon MGTs, and the DisplayPort protocol can be used for connecting a graphics card with the monitor.

Many FPGAs are equipped with MGTs. These are often used for turning the FPGA into a computer peripheral (with the PCIe protocol, possibly using Xillybus, or with the SuperSpeed USB protocol, possibly using XillyUSB, or as a storage device with SATA). Gigabit Ethernet can also play in, allowing the FPGA to communicate with a computer with this protocol. Another use of MGTs is for connecting to electronic components, in particular ADC/DAC devices with a very high sampling frequency, hence requiring a high data rate.

But what about communication between FPGAs? At times, there are several FPGAs on a PCB that need to exchange information among themselves, possibly at high rates. In other usage scenarios, there’s a physical distance between the FPGAs. For example, test equipment often has a hand-held probe containing one FPGA that collects information, and a second FPGA that resides inside the table-top unit. If the data rate is high, MGTs on both sides make it possible to avoid heavy, cumbersome and error-prone cabling. In fact, a thin fiber-optic cable is a simple solution when MGTs are used anyhow, and in some scenarios it also offers an extra benefit, besides being lightweight: Electrical isolation. This is in particular important in some medical applications (for electrical safety) or when long cables need to be drawn outdoors (avoiding being hit by lightning).

Among the annoying things about MGT communication there’s the fact that the data flow somehow always gets organized in packets (or frames, bursts, pick your name for it), and these packets don’t necessarily align properly with the application data’s natural boundaries. Why is that so?

This post attempts to explain why virtually all protocols (e.g. Interlaken, RapidIO, AMD’s Aurora, and Altera’s SeriaLite) require the application data to be arranged in some kind of packets that are enforced by the protocol. The only exception is Xillyp2p, which presents error-free continuous channels from one FPGA to another (or with packets that are sensible for the application data). This is not to say that packets aren’t used under the hood; it’s just that this packet mechanism is transparent to the application logic.

I’ll discuss a few reasons for the use of packets:

Word alignment
Error detection and retransmission
Clock frequency differences

Reason #1: Word alignment

When working with an MGT, it’s easy to forget that the transmitted data is sent as a serial data stream of bits. The fact that both the transmitting and receiving side have the same data word width might give the false impression that the MGT has some magic way of aligning the word correctly at the receiver side. In reality, there is no such magic. There is no hidden trick allowing the receiver to know which bit is the first or last in a transmitted word. This is something that the protocol needs to take care of, possibly with some help from the MGT’s features.

When 8b/10b encoding is used, the common solution is to transmit a synchronization word, often referred to as a comma, which is usually the K28.5 symbol. This method takes advantage of the fact that the 8b/10b encoding uses 10 bits on the wire for each 8 bits of payload data. This leaves room for a small number of extra codes that can’t be mistaken for regular data. These extra codes are called K-symbols, and K28.5 is one of them.

Hence if the bit sequence for a K28.5 symbol is encountered on the raw data link, it can’t be a data word. Most MGTs in FPGAs have a feature allowing them to automatically align the K28.5 word to the beginning of a word boundary. So word alignment can be ensured by transmitting a comma symbol. The comma symbol is often used to reset the scrambler as well, if such is used.
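To make the alignment idea concrete, here’s a minimal Python sketch of what the comma detection boils down to. In a real design this is done by the MGT’s hardware, not by software; the function names and the test stream are made up for this illustration, and running disparity is ignored altogether. The bit patterns are the two wire representations of K28.5 (first transmitted bit on the left), both of which begin with a 7-bit comma sequence.

```python
from typing import List, Optional

# K28.5 as it appears on the wire, for the two running disparities.
K28_5_RD_NEG = "0011111010"
K28_5_RD_POS = "1100000101"

# The 7-bit comma sequences that these two patterns begin with.
COMMAS = ("0011111", "1100000")


def find_word_offset(bits: str, word_len: int = 10) -> Optional[int]:
    """Return the bit offset (0 .. word_len-1) of the word boundaries,
    based on the first comma found, or None if there is no comma."""
    hits = [pos for pos in (bits.find(c) for c in COMMAS) if pos >= 0]
    if not hits:
        return None
    return min(hits) % word_len


def split_words(bits: str, offset: int, word_len: int = 10) -> List[str]:
    """Chop the bit stream into complete words, starting at the offset."""
    last_start = len(bits) - word_len
    return [bits[i:i + word_len] for i in range(offset, last_start + 1, word_len)]


# Illustration: the receiver starts listening 3 bits before a word boundary.
D21_5 = "1010101010"  # An ordinary data symbol
stream = "101" + D21_5 * 2 + K28_5_RD_NEG + D21_5 * 3

offset = find_word_offset(stream)
words = split_words(stream, offset)
```

Without the K28.5 symbol in the stream, find_word_offset() returns None: A sequence of plain data symbols gives the receiver nothing to hold on to.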

Each protocol defines when the comma is transmitted. There are many variations on this topic, but they all boil down to two alternatives:

Transmitting comma symbols occasionally and periodically. Or possibly, as part of the marker for the beginning of a packet.
Transmitting comma symbols only as part of an initialization of the channel. This alternative is adopted by protocols like SuperSpeed USB and PCIe, which have specific patterns for initializing the channel, referred to as Ordered Sets for Training and Recovery. These patterns include comma symbols, among others.

Truth be told, if the second approach is taken, the need for word alignment isn’t a reason by itself for dividing the data into packets, as the alignment takes place once and is preserved afterwards. But the concept of initializing the channel is quite complicated, and is not commonly adopted.

There are other methods for achieving word alignment, in particular when 8b/10b encoding isn’t used. The principles remain the same, though.

Reason #2: Error detection and retransmission

When working with an MGT, bit errors must be taken into account. These errors mean simply that a '0' is received for a bit that was transmitted as a '1', or vice versa. In some hardware setups such errors may occur relatively often (with a bit error rate of, say, 10^-9, which at multi-gigabit line rates means more than one error per second), and with other setups they may practically never occur. If an error in the application data can’t be tolerated, a detection mechanism for these bit errors must be in place at the very least, in order to prevent delivery of incorrectly received data to the application logic. Even if a link appears to be completely error free judging by long-term experience, this can’t be guaranteed in the long run, in particular as electronic components from different manufacturing batches are used.
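The arithmetic behind the error rate is simple enough to put in a few lines of Python. This is just a back-of-the-envelope sketch: The 5 Gbit/s line rate is an assumed figure for the sake of the example, and the function names are mine.

```python
def expected_errors_per_second(bit_error_rate: float, line_rate_bps: float) -> float:
    """Average number of bit errors per second on the link."""
    return bit_error_rate * line_rate_bps


def mean_seconds_between_errors(bit_error_rate: float, line_rate_bps: float) -> float:
    """Average time between two bit errors, in seconds."""
    return 1.0 / expected_errors_per_second(bit_error_rate, line_rate_bps)


# A bit error rate of 10^-9 on an assumed 5 Gbit/s link
errors = expected_errors_per_second(1e-9, 5e9)    # About 5 errors per second
interval = mean_seconds_between_errors(1e-9, 5e9) # About 0.2 seconds
```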

In order to detect errors, some kind of CRC (or other redundant data) must be inserted occasionally in order to allow the receiver to check if the data has arrived correctly. As the CRC is always calculated on a segment (whether it has a fixed length or not), the information must be divided into packets, even if just for the purpose of attaching a CRC to each.
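As an illustration of attaching a CRC to each segment, here’s a small Python sketch. It uses the CRC-32 from Python’s zlib module only because it’s readily available; that choice, as well as the 4-byte little-endian trailer and the function names, are assumptions for this example and don’t reflect any specific MGT protocol.

```python
import struct
import zlib
from typing import Optional


def add_crc(payload: bytes) -> bytes:
    """Transmitter side: Append a CRC-32 to a segment of data."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def check_crc(packet: bytes) -> Optional[bytes]:
    """Receiver side: Return the payload if the CRC matches, else None."""
    if len(packet) < 4:
        return None
    payload = packet[:-4]
    (received_crc,) = struct.unpack("<I", packet[-4:])
    return payload if zlib.crc32(payload) == received_crc else None


packet = add_crc(b"application data")

# Flip a single bit, as a bit error on the link would
corrupted = bytes([packet[0] ^ 0x01]) + packet[1:]
```

The point to note is that add_crc() needs to know where the segment begins and ends. That’s exactly what turns a continuous stream into packets.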

And then we have the question of what to do if an error is detected. There are mainly two possibilities:

Requesting a retransmission of the faulty packet. This ensures that an error-free channel is presented to the application logic.
Informing the application logic about the error, possibly halting the data flow so that faulty data isn’t delivered. This requires the application logic to somehow recover from this state and restart its operation.
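The difference between these two possibilities can be sketched in Python as follows. This is a toy model with several assumptions: The crc_passes() callback stands in for the receiver’s CRC check, packets are handled one at a time (real protocols use sequence numbers, acknowledgements and replay buffers), and all names are hypothetical.

```python
from typing import Callable, List, Tuple


def receive_all(packets: List[bytes],
                crc_passes: Callable[[int, int], bool],
                retransmit: bool,
                max_attempts: int = 8) -> Tuple[List[bytes], bool]:
    """Deliver packets to the application logic.

    crc_passes(index, attempt) tells whether packet number `index`
    arrived intact on its attempt number `attempt` (counting from 0).
    Returns the delivered packets and whether the link is still running.
    """
    delivered = []
    for index, packet in enumerate(packets):
        for attempt in range(max_attempts):
            if crc_passes(index, attempt):
                delivered.append(packet)
                break
            if not retransmit:
                # Second possibility: Halt, the application must recover
                return delivered, False
            # First possibility: Loop again, i.e. ask for the packet again
        else:
            return delivered, False  # Gave up after max_attempts
    return delivered, True


# The second packet is hit by a bit error on its first transmission only
def flaky(index: int, attempt: int) -> bool:
    return not (index == 1 and attempt == 0)


packets = [b"first", b"second", b"third"]
with_retransmit = receive_all(packets, flaky, retransmit=True)
without_retransmit = receive_all(packets, flaky, retransmit=False)
```

With retransmission, the application logic gets all three packets and never knows that something went wrong. Without it, the flow stops after the first packet.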

High-end protocols like PCIe, SATA and SuperSpeed USB take the first approach, and ensure that all packets arrive correctly by virtue of a retransmission mechanism.

Gigabit Ethernet takes the second approach — there’s a CRC on the Ethernet packets, but the Ethernet protocol itself doesn’t intervene much if a packet arrives with an incorrect CRC. Such a packet is simply discarded (either by the hardware implementing the protocol or by software), so faulty data doesn’t go further. Even the IP protocol, which is usually one level above, does nothing special about the CRC error and the packet loss that occurred as a result of it. It’s only the TCP protocol that eventually detects the packet loss by virtue of a timeout, and requests retransmission.

What about FPGA-to-FPGA protocols, then? Well, each protocol takes its own approach. Xillyp2p is special in that it requests retransmissions when the physical link is bidirectional, but if the link is unidirectional it only discards the faulty data and halts everything until the application logic resumes operation — a retransmission request is impossible in the latter case.

Reason #3: Clock frequency differences

Clock frequency differences should have been the first topic, because it’s the subtle detail that prevents the solution that most FPGA engineers would consider at first for communication between two FPGAs: One FPGA sends a stream of data words at a regular pace, and the other FPGA receives and processes it. Simple and clean.

But I put it third and last, because it’s the most difficult to deal with, and the explanations became really long. So try to hang on. And if you don’t, here’s the short version: The transmission of data can’t be continuous, because the receiver’s clock might be just a few ppm slower. Hence the rate at which the receiver can process arriving data might be slightly lower than the transmitter’s rate, if it keeps sending data non-stop. So to prevent the receiver from being overflowed with data, the transmitter must pause the flow of application data every now and then to let the receiver catch up. And if there are pauses, the segments between these pauses are some kind of packets.

And now, to the long explanation, starting with the common case: The data link is bidirectional, and the data content in both directions is tightly related. Even if application data goes in one direction primarily, there is often some kind of acknowledgement and/or status information going the other way. All “classic” protocols for computers (PCIe, USB 3.x and SATA) are bidirectional, for bidirectional data as well as acknowledge packets, and there is usually a similar need when connecting two FPGAs.

The local and CDR clocks

I’ll need to make a small detour now and discuss clocks. Tedious, but necessary.

In most applications, each of the two involved FPGAs uses a different reference clock to drive its MGT, and the same reference clock is often used to drive the logic around it. These reference clocks of the two FPGAs have the same frequency, except for a small tolerance. Small, but causes big trouble.

Each MGT transmits data based upon its own reference clock (I’ll explain below why it’s always this way). The logic in the logic fabric that produces the data for transmission is usually driven by a clock derived from the same reference clock. In other words, the entire transmission chain is derived from the local reference clock.

The natural consequence is that the data which the MGT receives is based upon the other side’s reference clock. The MGT receiving this data stream locks a local clock oscillator on the data rate of the arriving data stream. This mechanism is referred to as clock data recovery, CDR. The MGT’s logic that handles the arriving data stream is clocked by the CDR clock, and is hence synchronized with this data stream’s bits.

Unlike most other IP blocks in an FPGA, the clocks that are used to interface with the MGT are outputs from the MGT block. In other words, the MGT supplies the clock to the logic fabric, and not the other way around. This is a necessary arrangement, not only because the MGT generates the CDR clock: The main reason is that the MGT is responsible for handling the clocks that run at the bit rate, having a frequency of several GHz, which is far above what the logic fabric can handle. Also, the reference clock used to generate these GHz clocks must be very “clean” (low jitter), so the FPGA’s regular clock resources can’t be used. Frequency dividers inside the MGT generate the clock or clocks used to interface with the logic fabric.

In particular, the data words that are transferred from the logic fabric into the MGT for transmission, as well as data words from the MGT to the logic fabric (received data), are clocked by the outputs of these frequency dividers. The fact that these clocks are used in the interface with the logic fabric makes it possible to apply timing constraints on paths between the MGT’s internal logic and the logic fabric.

For the purpose of this discussion, let’s forget about the clocks inside the MGT, and focus only on those accessible by the logic fabric. It’s already clear that there are two clocks involved, one generated from the local oscillator, based upon the local reference clock (“local” clock), and the CDR clock, which is derived from the arriving data stream. Two clocks, two clock domains.

Clock or clocks used for implementing the protocol

As there are two clocks involved, the question is which clock is used by the logic that processes the data. This is the logic that implements the protocol. The answer is obviously one of the two clocks supplied by the MGT. It’s quite pointless to implement the protocol in a foreign clock domain.

In principle, the logic (in the logic fabric) implementing the protocol could be clocked by both clocks, however the vast majority is usually clocked only by one of them: It’s difficult to implement a protocol across two clock domains, so even if both clocks are used, the actual protocol implementation is always clocked by one of the clocks, and the other clock is used by a minimal amount of logic.

In all practical implementations, the protocol is implemented on the local clock’s domain (the clock used for transmission). The choice is almost obvious: Given that one needs to choose one of the two clocks, the choice is naturally inclined towards the local clock, which is always present and always stable.

The logic running on the CDR clock usually does some minimal processing on the arriving data, and then pushes it into the local clock domain. And this brings us naturally to the next topic.

Crossing clock domains

Every FPGA engineer knows (or should know) that a dual-clock FIFO is the first solution to consider when a clock domain crossing is required. And indeed, this is the most common solution for crossing the clock domain from the CDR clock towards the local clock. It’s the natural choice when the only need is to hand over the arriving data to the local clock domain.

Therefore, several protocol implementations are clocked only by the local clock, and only this clock is exposed by the MGT. The dual-clock FIFO is implemented inside the MGT, and is usually called an “elastic buffer”. This way, all interaction with the MGT is done in one clock domain, which simplifies the implementation.

It’s also possible to implement the protocol with both clocks, and perform the clock domain crossing in the logic fabric, most likely with the help of a FIFO IP provided by the FPGA tools.

To reiterate, it boils down to two options:

Doing the clock domain crossing inside the MGT with an “elastic buffer”, and clock the logic fabric only with the local clock.
Using both clocks in the logic fabric, and accordingly do the clock domain crossing in the logic fabric.

Preventing overflow / underflow

As mentioned earlier, the two clocks usually have almost the same frequency, with a difference that results from the oscillators’ frequency tolerance. To illustrate the problem, let’s take an example with a bidirectional link of 1 Gbit/s, and the clock oscillators have a tolerance of 10 ppm each, which is considered pretty good. If the transmitter’s clock frequency is 10 ppm above, and the receiver’s frequency is 10 ppm below, there is a 20 ppm difference in the 1 Gbit/s data rate. In other words, the receiver gets 20,000 bits more than it can handle every second: No matter which of the two options mentioned above for clock domain crossing is chosen, there’s a FIFO whose write clock runs 20 ppm faster than the read clock. And soon enough, it overflows.
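The numbers in this example are easy to reproduce with a few lines of Python. The FIFO headroom of 8192 bits in the second calculation is an arbitrary figure I picked for illustration, and the function names are mine.

```python
def excess_bits_per_second(line_rate_bps: int, ppm_difference: int) -> int:
    """Bits per second arriving beyond what the slower side can consume."""
    return line_rate_bps * ppm_difference // 1_000_000


def seconds_until_overflow(free_bits: int, line_rate_bps: int,
                           ppm_difference: int) -> float:
    """Time until a FIFO with `free_bits` of headroom overflows.
    Assumes a nonzero frequency difference."""
    return free_bits / excess_bits_per_second(line_rate_bps, ppm_difference)


# 1 Gbit/s link, transmitter at +10 ppm, receiver at -10 ppm
excess = excess_bits_per_second(1_000_000_000, 20)

# A FIFO with (hypothetically) 8192 bits of free space
time_left = seconds_until_overflow(8192, 1_000_000_000, 20)
```

So with these figures, the FIFO overflows in less than half a second. A deeper FIFO only postpones the problem.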

It can also be the other way around: If the write clock is slower than the read clock, this FIFO becomes empty every now and then. This scenario needs to be addressed as well.

There are several solutions to this problem, and they all boil down to the transmitter pausing the flow of application data at regular intervals, and inserting some kind of stuffing in between to indicate these pauses. There is no possibility to stop the physical data stream, only to send data words that are discarded by the receiver instead of ending up in the FIFO. Recall that the protocol is almost always clocked by the local clock, which is the clock reading from the FIFO. So for example, just inserting some idle time between transmitted packets is not a solution in the vast majority of cases: The packets’ boundaries are detected by the logic that reads from the FIFO, not on the side writing to it. Hence most protocols resort to much simpler ways to mark these pauses.

The most famous mechanism is called skip ordered sets, or skip symbols. It’s the common choice when 8b/10b encoding is used. It takes advantage of the fact mentioned above, that when 8b/10b is used, it’s possible to send K-symbols that are distinguishable from the regular data flow. For example, a SuperSpeed USB transmitter emits two K28.1 symbols at regular intervals. The logic before the FIFO at the receiver discards K28.1 symbols rather than writing them into the FIFO.

It’s also common that the logic reading from the FIFO injects K28.1 symbols when the FIFO is empty. This allows a continuous stream of data towards the protocol logic, even if the local clock is faster than the CDR clock. It’s then up to the protocol logic to discard K28.1 symbols.
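The skip symbol mechanism can be modeled with a short Python sketch. It’s only a model: Symbols are represented as list items, the skip symbol is a plain string, and the interval of four symbols is far shorter than anything a real protocol would use. The names are made up for this example.

```python
from typing import List

SKP = "K28.1"  # Stand-in for the skip symbol


def insert_skips(data: List, interval: int, count: int = 2) -> List:
    """Transmitter side: Pause the data flow every `interval` data
    symbols, and send `count` skip symbols during the pause."""
    out = []
    for i, symbol in enumerate(data):
        if i > 0 and i % interval == 0:
            out.extend([SKP] * count)
        out.append(symbol)
    return out


def drop_skips(stream: List) -> List:
    """Receiver side, before the FIFO: Discard skip symbols, so they
    don't take up space in the FIFO."""
    return [symbol for symbol in stream if symbol != SKP]


data = list(range(10))
on_the_wire = insert_skips(data, interval=4)
into_fifo = drop_skips(on_the_wire)
```

The transmitter sends 14 symbols in this example, but only 10 of them are written into the FIFO. The four symbol periods that are spent on skip symbols are what allows a slower read clock to catch up.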

There are of course other solutions, in particular when 8b/10b isn’t used. The main point is however that the transmitting side can’t just transmit data continuously. At the very least, there must be some kind of pauses. And as already said, when there are pauses, there are packets between them, even if they don’t have headers and CRCs.

But why not transmit with the CDR clock?

This can sound like an appealing solution, and it’s possible at least in theory: Let one side (“master”) transmit data based upon its local clock, just as described above, and let the other side (“slave”) transmit data based upon the CDR clock. In other words, the slave’s transmission clock follows the master’s clock, so they have exactly the same frequency.

First, why it’s a bad idea to use the CDR clock directly for transmission: Jitter. I’ve already used the word jitter above, but now it deserves an explanation: In theory, a clock signal has a fixed time period between each transition. In practice, the time between each such transition varies randomly. It’s a slight variation, but it can have a devastating effect on the data link’s reliability: As each clock transition sets the time point at which a new bit is presented on the physical link, by virtue of changing the voltage between the wires, a randomness of the timing has an effect similar to adding noise.

This is why MGTs should always be driven by “clean” reference clocks, meaning oscillators that are a bit more expensive, a bit more carefully placed on the PCB, and have been designed with focus on low jitter.

So what happens if the slave side uses the CDR clock to transmit data? Well, the transmitter’s clock already has a certain amount of jitter, which is the result of the reference clock’s own jitter, plus the jitter added by the PLL that created the bit-rate clock from it. The CDR creates a clock based upon the arriving data stream, which usually adds a lot of jitter. That too has the same effect as adding noise to its input, because the receiver samples the analog signal using the CDR clock. However, this effect is inevitable. In order to mitigate this effect, the PLL that generates the CDR clock is often tuned to produce as little jitter as possible, while still being able to lock on the master’s frequency.

As the CDR clock has a relatively high jitter due to how it’s created, using it directly to transmit data is equivalent to adding noise to the physical channel, and is therefore a bad idea.

It’s however possible to take a divided version of the CDR clock (most likely the CDR clock as it appears on the MGT’s output port) and drive one of the FPGA’s output pins with it. That output goes to a “jitter cleaner” component on the PCB, which returns the same clock, but with much less jitter. And the latter clock can then be used as a reference clock to transmit data.

I’ve never heard of anyone attempting the trick with a “jitter cleaner”, let alone tried this myself. I suppose a few skip symbols are much easier than playing around with clocks.

But if the link is unidirectional?

If there’s a physical data link only in one direction, the CDR clock can be used on the receiving side to clock the protocol logic without any direct penalty. But it’s still a foreign clock. The MGT at the receiving side still needs a local reference clock in order to lock the CDR on the arriving data stream.

And as things usually turn out, the same local reference clock becomes the reference for all logic on the FPGA. So using the local clock for receiving data often saves a clock domain crossing between the protocol logic and the rest of it. It becomes a question of where the clock domain crossing occurs.

Conclusion

If data is transmitted through an MGT, it will most likely end up divided into packets. At least one of the reasons mentioned above will apply.

It’s possible to avoid the encapsulation, stripping, multiplexing and error checking of packets by using Xillyp2p. Unlike other protocol cores, this IP core takes care of these tasks, and presents the application logic with error-free and continuous application data channels. The packet-related tasks aren’t avoided, but rather taken care of by the IP core instead of the application logic.

This is comparable with using raw Ethernet frames vs TCP/IP: There is no way around using packets for getting information across a network. Choosing raw Ethernet frames requires the application to chop up the data into frames and ensure that they arrive correctly. If TCP/IP is chosen, all this is done and taken care of.

One way or another, there will be packets on the wire.

udev, the “authorized” attribute and other failed attempts to ban a bogus USB keyboard

eli — Sun, 09 Apr 2023 11:04:40 +0000

Introduction

This is a spin-off post about failed attempts to fix the problem with a webcam’s keyboard buttons. Namely, that a shaky physical connection caused the USB device to go on and off the bus rapidly, and consequently crash X windows. The background story is in this post.

There is really nothing to learn from this post regarding how to accomplish something. The only reason I don’t trash this is that there’s some possibly useful information about udev.

What I tried to do

There is a possibility to ban a USB device from being accessed by Linux, by virtue of the “authorized” attribute. Something like this:

# cd /sys/devices/pci0000:00/0000:00:14.0/usb2/2-5/
# echo 0 > authorized
^C^Z
# echo 1 > authorized
bash: echo: write error: Invalid argument

The ^C^Z after the first command is not a mistake. The first command got stuck for several seconds.

And this can be done with udev rules as well.

But surprisingly enough, there doesn’t seem to be a way to avoid the generation of the /dev/input/event* file without ignoring the USB device completely. It’s possible to delete it early enough, but that doesn’t really help, it turns out.

ATTRS{authorized} can be set to 0 only for the entire USB device. There is no such parameter for a udev event with the “input” subsystem.

Some udev queries

While trying to figure out the ATTRS{authorized} thing, these are my little play-arounds. Nothing really useful here:

$ sudo udevadm monitor --udev --property

I got

UDEV  [5662716.427855] add      /devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1 (usb)
ACTION=add
BUSNUM=001
DEVNAME=/dev/bus/usb/001/098
DEVNUM=098
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1
DEVTYPE=usb_device
DRIVER=usb
ID_BUS=usb
ID_MODEL=USB2.0_PC_CAMERA
ID_MODEL_ENC=USB2.0\x20PC\x20CAMERA
ID_MODEL_ID=2311
ID_REVISION=0100
ID_SERIAL=Generic_USB2.0_PC_CAMERA
ID_USB_INTERFACES=:0e0100:0e0200:
ID_VENDOR=Generic
ID_VENDOR_ENC=Generic
ID_VENDOR_FROM_DATABASE=GEMBIRD
ID_VENDOR_ID=1908
MAJOR=189
MINOR=97
PRODUCT=1908/2311/100
SEQNUM=24413
SUBSYSTEM=usb
TYPE=239/2/1
USEC_INITIALIZED=5662716427506

UDEV  [5662716.430744] add      /devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.1 (usb)
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.1
DEVTYPE=usb_interface
DRIVER=uvcvideo
ID_USB_CLASS_FROM_DATABASE=Miscellaneous Device
ID_USB_PROTOCOL_FROM_DATABASE=Interface Association
ID_VENDOR_FROM_DATABASE=GEMBIRD
INTERFACE=14/2/0
MODALIAS=usb:v1908p2311d0100dcEFdsc02dp01ic0Eisc02ip00in01
PRODUCT=1908/2311/100
SEQNUM=24420
SUBSYSTEM=usb
TYPE=239/2/1
USEC_INITIALIZED=5662716430425

UDEV  [5662716.430935] add      /devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0 (usb)
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0
DEVTYPE=usb_interface
DRIVER=uvcvideo
ID_USB_CLASS_FROM_DATABASE=Miscellaneous Device
ID_USB_PROTOCOL_FROM_DATABASE=Interface Association
ID_VENDOR_FROM_DATABASE=GEMBIRD
INTERFACE=14/1/0
MODALIAS=usb:v1908p2311d0100dcEFdsc02dp01ic0Eisc01ip00in00
PRODUCT=1908/2311/100
SEQNUM=24414
SUBSYSTEM=usb
TYPE=239/2/1
USEC_INITIALIZED=5662716430396

UDEV  [5662716.433265] add      /devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/media5 (media)
ACTION=add
DEVNAME=/dev/media5
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/media5
MAJOR=509
MINOR=5
SEQNUM=24416
SUBSYSTEM=media
USEC_INITIALIZED=5662716433110

UDEV  [5662716.435400] bind     /devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.1 (usb)
ACTION=bind
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.1
DEVTYPE=usb_interface
DRIVER=uvcvideo
ID_USB_CLASS_FROM_DATABASE=Miscellaneous Device
ID_USB_PROTOCOL_FROM_DATABASE=Interface Association
ID_VENDOR_FROM_DATABASE=GEMBIRD
INTERFACE=14/2/0
MODALIAS=usb:v1908p2311d0100dcEFdsc02dp01ic0Eisc02ip00in01
PRODUCT=1908/2311/100
SEQNUM=24421
SUBSYSTEM=usb
TYPE=239/2/1
USEC_INITIALIZED=5662716430425

UDEV  [5662716.436539] add      /devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/video4linux/video0 (video4linux)
ACTION=add
COLORD_DEVICE=1
COLORD_KIND=camera
DEVLINKS=/dev/v4l/by-id/usb-Generic_USB2.0_PC_CAMERA-video-index0 /dev/v4l/by-path/pci-0000:00:14.0-usb-0:5.2.1:1.0-video-index0
DEVNAME=/dev/video0
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/video4linux/video0
ID_BUS=usb
ID_FOR_SEAT=video4linux-pci-0000_00_14_0-usb-0_5_2_1_1_0
ID_MODEL=USB2.0_PC_CAMERA
ID_MODEL_ENC=USB2.0\x20PC\x20CAMERA
ID_MODEL_ID=2311
ID_PATH=pci-0000:00:14.0-usb-0:5.2.1:1.0
ID_PATH_TAG=pci-0000_00_14_0-usb-0_5_2_1_1_0
ID_REVISION=0100
ID_SERIAL=Generic_USB2.0_PC_CAMERA
ID_TYPE=video
ID_USB_DRIVER=uvcvideo
ID_USB_INTERFACES=:0e0100:0e0200:
ID_USB_INTERFACE_NUM=00
ID_V4L_CAPABILITIES=:capture:
ID_V4L_PRODUCT=USB2.0 PC CAMERA: USB2.0 PC CAM
ID_V4L_VERSION=2
ID_VENDOR=Generic
ID_VENDOR_ENC=Generic
ID_VENDOR_ID=1908
MAJOR=81
MINOR=0
SEQNUM=24415
SUBSYSTEM=video4linux
TAGS=:seat:uaccess:
USEC_INITIALIZED=5662716436054

UDEV  [5662716.436956] add      /devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/input/input121 (input)
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/input/input121
EV=3
ID_BUS=usb
ID_FOR_SEAT=input-pci-0000_00_14_0-usb-0_5_2_1_1_0
ID_INPUT=1
ID_INPUT_KEY=1
ID_MODEL=USB2.0_PC_CAMERA
ID_MODEL_ENC=USB2.0\x20PC\x20CAMERA
ID_MODEL_ID=2311
ID_PATH=pci-0000:00:14.0-usb-0:5.2.1:1.0
ID_PATH_TAG=pci-0000_00_14_0-usb-0_5_2_1_1_0
ID_REVISION=0100
ID_SERIAL=Generic_USB2.0_PC_CAMERA
ID_TYPE=video
ID_USB_DRIVER=uvcvideo
ID_USB_INTERFACES=:0e0100:0e0200:
ID_USB_INTERFACE_NUM=00
ID_VENDOR=Generic
ID_VENDOR_ENC=Generic
ID_VENDOR_ID=1908
KEY=100000 0 0 0
MODALIAS=input:b0003v1908p2311e0100-e0,1,kD4,ramlsfw
NAME="USB2.0 PC CAMERA: USB2.0 PC CAM"
PHYS="usb-0000:00:14.0-5.2.1/button"
PRODUCT=3/1908/2311/100
PROP=0
SEQNUM=24417
SUBSYSTEM=input
TAGS=:seat:
USEC_INITIALIZED=5662716436500

UDEV  [5662716.591160] add      /devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/input/input121/event22 (input)
ACTION=add
BACKSPACE=guess
DEVLINKS=/dev/input/by-path/pci-0000:00:14.0-usb-0:5.2.1:1.0-event /dev/input/by-id/usb-Generic_USB2.0_PC_CAMERA-event-if00
DEVNAME=/dev/input/event22
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/input/input121/event22
ID_BUS=usb
ID_INPUT=1
ID_INPUT_KEY=1
ID_MODEL=USB2.0_PC_CAMERA
ID_MODEL_ENC=USB2.0\x20PC\x20CAMERA
ID_MODEL_ID=2311
ID_PATH=pci-0000:00:14.0-usb-0:5.2.1:1.0
ID_PATH_TAG=pci-0000_00_14_0-usb-0_5_2_1_1_0
ID_REVISION=0100
ID_SERIAL=Generic_USB2.0_PC_CAMERA
ID_TYPE=video
ID_USB_DRIVER=uvcvideo
ID_USB_INTERFACES=:0e0100:0e0200:
ID_USB_INTERFACE_NUM=00
ID_VENDOR=Generic
ID_VENDOR_ENC=Generic
ID_VENDOR_ID=1908
LIBINPUT_DEVICE_GROUP=3/1908/2311:usb-0000:00:14.0-5.2
MAJOR=13
MINOR=86
SEQNUM=24418
SUBSYSTEM=input
TAGS=:power-switch:
USEC_INITIALIZED=5662716590816
XKBLAYOUT=us,il
XKBMODEL=pc105
XKBOPTIONS=grp:alt_shift_toggle,grp_led:scroll
XKBVARIANT=,

UDEV  [5662716.593390] bind     /devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0 (usb)
ACTION=bind
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0
DEVTYPE=usb_interface
DRIVER=uvcvideo
ID_USB_CLASS_FROM_DATABASE=Miscellaneous Device
ID_USB_PROTOCOL_FROM_DATABASE=Interface Association
ID_VENDOR_FROM_DATABASE=GEMBIRD
INTERFACE=14/1/0
MODALIAS=usb:v1908p2311d0100dcEFdsc02dp01ic0Eisc01ip00in00
PRODUCT=1908/2311/100
SEQNUM=24419
SUBSYSTEM=usb
TYPE=239/2/1
USEC_INITIALIZED=5662716430396

UDEV  [5662716.595836] bind     /devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1 (usb)
ACTION=bind
BUSNUM=001
DEVNAME=/dev/bus/usb/001/098
DEVNUM=098
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1
DEVTYPE=usb_device
DRIVER=usb
ID_BUS=usb
ID_MODEL=USB2.0_PC_CAMERA
ID_MODEL_ENC=USB2.0\x20PC\x20CAMERA
ID_MODEL_ID=2311
ID_REVISION=0100
ID_SERIAL=Generic_USB2.0_PC_CAMERA
ID_USB_INTERFACES=:0e0100:0e0200:
ID_VENDOR=Generic
ID_VENDOR_ENC=Generic
ID_VENDOR_FROM_DATABASE=GEMBIRD
ID_VENDOR_ID=1908
MAJOR=189
MINOR=97
PRODUCT=1908/2311/100
SEQNUM=24422
SUBSYSTEM=usb
TYPE=239/2/1
USEC_INITIALIZED=5662716427506
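
A listing like this is tedious to read by eye. The following Python sketch digests such output and picks out the device files that were created by the “input” subsystem. It relies only on the format shown above (a header line starting with UDEV or KERNEL, followed by KEY=VALUE lines); the function names and the shortened sample text are made up for this example.

```python
from typing import Dict, List


def parse_udev_events(text: str) -> List[Dict[str, str]]:
    """Turn the output of "udevadm monitor --property" into a list of
    dictionaries, one per event, holding the event's properties."""
    events = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("UDEV ") or line.startswith("KERNEL["):
            current = {}  # A header line starts a new event
            events.append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key] = value
    return events


def input_device_nodes(events: List[Dict[str, str]]) -> List[str]:
    """Device files that were added by the input subsystem."""
    return [event["DEVNAME"] for event in events
            if event.get("SUBSYSTEM") == "input"
            and event.get("ACTION") == "add"
            and "DEVNAME" in event]


# A shortened, made-up sample in the same format as the listing above
sample = """\
UDEV  [1.000000] add      /devices/x/input/input121 (input)
ACTION=add
SUBSYSTEM=input

UDEV  [1.100000] add      /devices/x/input/input121/event22 (input)
ACTION=add
DEVNAME=/dev/input/event22
SUBSYSTEM=input
"""

events = parse_udev_events(sample)
nodes = input_device_nodes(events)
```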

So the device I want to avoid was /dev/input/event22 this time. What are its attributes?

$ sudo udevadm info -a -n /dev/input/event22 

Udevadm info starts with the device specified by the devpath and then
walks up the chain of parent devices. It prints for every device
found, all possible attributes in the udev rules key format.
A rule to match, can be composed by the attributes of the device
and the attributes from one single parent device.

  looking at device '/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/input/input121/event22':
    KERNEL=="event22"
    SUBSYSTEM=="input"
    DRIVER==""

  looking at parent device '/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/input/input121':
    KERNELS=="input121"
    SUBSYSTEMS=="input"
    DRIVERS==""
    ATTRS{name}=="USB2.0 PC CAMERA: USB2.0 PC CAM"
    ATTRS{phys}=="usb-0000:00:14.0-5.2.1/button"
    ATTRS{properties}=="0"
    ATTRS{uniq}==""

  looking at parent device '/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0':
    KERNELS=="1-5.2.1:1.0"
    SUBSYSTEMS=="usb"
    DRIVERS=="uvcvideo"
    ATTRS{authorized}=="1"
    ATTRS{bAlternateSetting}==" 0"
    ATTRS{bInterfaceClass}=="0e"
    ATTRS{bInterfaceNumber}=="00"
    ATTRS{bInterfaceProtocol}=="00"
    ATTRS{bInterfaceSubClass}=="01"
    ATTRS{bNumEndpoints}=="01"
    ATTRS{iad_bFirstInterface}=="00"
    ATTRS{iad_bFunctionClass}=="0e"
    ATTRS{iad_bFunctionProtocol}=="00"
    ATTRS{iad_bFunctionSubClass}=="03"
    ATTRS{iad_bInterfaceCount}=="02"
    ATTRS{interface}=="USB2.0 PC CAMERA"
    ATTRS{supports_autosuspend}=="1"

  looking at parent device '/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1':
    KERNELS=="1-5.2.1"
    SUBSYSTEMS=="usb"
    DRIVERS=="usb"
    ATTRS{authorized}=="1"
    ATTRS{avoid_reset_quirk}=="0"
    ATTRS{bConfigurationValue}=="1"
    ATTRS{bDeviceClass}=="ef"
    ATTRS{bDeviceProtocol}=="01"
    ATTRS{bDeviceSubClass}=="02"
    ATTRS{bMaxPacketSize0}=="64"
    ATTRS{bMaxPower}=="256mA"
    ATTRS{bNumConfigurations}=="1"
    ATTRS{bNumInterfaces}==" 2"
    ATTRS{bcdDevice}=="0100"
    ATTRS{bmAttributes}=="80"
    ATTRS{busnum}=="1"
    ATTRS{configuration}==""
    ATTRS{devnum}=="98"
    ATTRS{devpath}=="5.2.1"
    ATTRS{idProduct}=="2311"
    ATTRS{idVendor}=="1908"
    ATTRS{ltm_capable}=="no"
    ATTRS{manufacturer}=="Generic"
    ATTRS{maxchild}=="0"
    ATTRS{product}=="USB2.0 PC CAMERA"
    ATTRS{quirks}=="0x0"
    ATTRS{removable}=="unknown"
    ATTRS{speed}=="480"
    ATTRS{urbnum}=="16"
    ATTRS{version}==" 2.00"

  looking at parent device '/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2':
    KERNELS=="1-5.2"
    SUBSYSTEMS=="usb"
    DRIVERS=="usb"
    ATTRS{authorized}=="1"
    ATTRS{avoid_reset_quirk}=="0"
    ATTRS{bConfigurationValue}=="1"
    ATTRS{bDeviceClass}=="09"
    ATTRS{bDeviceProtocol}=="01"
    ATTRS{bDeviceSubClass}=="00"
    ATTRS{bMaxPacketSize0}=="64"
    ATTRS{bMaxPower}=="100mA"
    ATTRS{bNumConfigurations}=="1"
    ATTRS{bNumInterfaces}==" 1"
    ATTRS{bcdDevice}=="0100"
    ATTRS{bmAttributes}=="e0"
    ATTRS{busnum}=="1"
    ATTRS{configuration}==""
    ATTRS{devnum}=="75"
    ATTRS{devpath}=="5.2"
    ATTRS{idProduct}=="7250"
    ATTRS{idVendor}=="214b"
    ATTRS{ltm_capable}=="no"
    ATTRS{maxchild}=="4"
    ATTRS{product}=="USB2.0 HUB"
    ATTRS{quirks}=="0x0"
    ATTRS{removable}=="unknown"
    ATTRS{speed}=="480"
    ATTRS{urbnum}=="409"
    ATTRS{version}==" 2.00"

  looking at parent device '/devices/pci0000:00/0000:00:14.0/usb1/1-5':
    KERNELS=="1-5"
    SUBSYSTEMS=="usb"
    DRIVERS=="usb"
    ATTRS{authorized}=="1"
    ATTRS{avoid_reset_quirk}=="0"
    ATTRS{bConfigurationValue}=="1"
    ATTRS{bDeviceClass}=="09"
    ATTRS{bDeviceProtocol}=="02"
    ATTRS{bDeviceSubClass}=="00"
    ATTRS{bMaxPacketSize0}=="64"
    ATTRS{bMaxPower}=="0mA"
    ATTRS{bNumConfigurations}=="1"
    ATTRS{bNumInterfaces}==" 1"
    ATTRS{bcdDevice}=="0123"
    ATTRS{bmAttributes}=="e0"
    ATTRS{busnum}=="1"
    ATTRS{configuration}==""
    ATTRS{devnum}=="73"
    ATTRS{devpath}=="5"
    ATTRS{idProduct}=="5411"
    ATTRS{idVendor}=="0bda"
    ATTRS{ltm_capable}=="no"
    ATTRS{manufacturer}=="Generic"
    ATTRS{maxchild}=="4"
    ATTRS{product}=="4-Port USB 2.0 Hub"
    ATTRS{quirks}=="0x0"
    ATTRS{removable}=="removable"
    ATTRS{speed}=="480"
    ATTRS{urbnum}=="69"
    ATTRS{version}==" 2.10"

  looking at parent device '/devices/pci0000:00/0000:00:14.0/usb1':
    KERNELS=="usb1"
    SUBSYSTEMS=="usb"
    DRIVERS=="usb"
    ATTRS{authorized}=="1"
    ATTRS{authorized_default}=="1"
    ATTRS{avoid_reset_quirk}=="0"
    ATTRS{bConfigurationValue}=="1"
    ATTRS{bDeviceClass}=="09"
    ATTRS{bDeviceProtocol}=="01"
    ATTRS{bDeviceSubClass}=="00"
    ATTRS{bMaxPacketSize0}=="64"
    ATTRS{bMaxPower}=="0mA"
    ATTRS{bNumConfigurations}=="1"
    ATTRS{bNumInterfaces}==" 1"
    ATTRS{bcdDevice}=="0415"
    ATTRS{bmAttributes}=="e0"
    ATTRS{busnum}=="1"
    ATTRS{configuration}==""
    ATTRS{devnum}=="1"
    ATTRS{devpath}=="0"
    ATTRS{idProduct}=="0002"
    ATTRS{idVendor}=="1d6b"
    ATTRS{interface_authorized_default}=="1"
    ATTRS{ltm_capable}=="no"
    ATTRS{manufacturer}=="Linux 4.15.0-20-generic xhci-hcd"
    ATTRS{maxchild}=="16"
    ATTRS{product}=="xHCI Host Controller"
    ATTRS{quirks}=="0x0"
    ATTRS{removable}=="unknown"
    ATTRS{serial}=="0000:00:14.0"
    ATTRS{speed}=="480"
    ATTRS{urbnum}=="454"
    ATTRS{version}==" 2.00"

  looking at parent device '/devices/pci0000:00/0000:00:14.0':
    KERNELS=="0000:00:14.0"
    SUBSYSTEMS=="pci"
    DRIVERS=="xhci_hcd"
    ATTRS{broken_parity_status}=="0"
    ATTRS{class}=="0x0c0330"
    ATTRS{consistent_dma_mask_bits}=="64"
    ATTRS{d3cold_allowed}=="1"
    ATTRS{dbc}=="disabled"
    ATTRS{device}=="0xa2af"
    ATTRS{dma_mask_bits}=="64"
    ATTRS{driver_override}=="(null)"
    ATTRS{enable}=="1"
    ATTRS{irq}=="33"
    ATTRS{local_cpulist}=="0-11"
    ATTRS{local_cpus}=="0,00000000,00000fff"
    ATTRS{msi_bus}=="1"
    ATTRS{numa_node}=="0"
    ATTRS{revision}=="0x00"
    ATTRS{subsystem_device}=="0x5007"
    ATTRS{subsystem_vendor}=="0x1458"
    ATTRS{vendor}=="0x8086"

  looking at parent device '/devices/pci0000:00':
    KERNELS=="pci0000:00"
    SUBSYSTEMS==""
    DRIVERS==""

And what udev rules are currently in effect for this? Note that this doesn’t require root, and nothing really happens to the system:

$ udevadm test -a add $(udevadm info -q path -n /dev/input/event22)
calling: test
version 237
This program is for debugging only, it does not run any program
specified by a RUN key. It may show incorrect results, because
some values may be different, or not available at a simulation run.

Load module index
Parsed configuration file /etc/systemd/network/eth1.link
Skipping empty file: /etc/systemd/network/99-default.link
Created link configuration context.

[ ... reading a lot of files ... ]

rules contain 393216 bytes tokens (32768 * 12 bytes), 39371 bytes strings
25632 strings (220044 bytes), 22252 de-duplicated (184054 bytes), 3381 trie nodes used
GROUP 104 /lib/udev/rules.d/50-udev-default.rules:29
IMPORT builtin 'hwdb' /lib/udev/rules.d/60-evdev.rules:8
IMPORT builtin 'hwdb' returned non-zero
IMPORT builtin 'hwdb' /lib/udev/rules.d/60-evdev.rules:17
IMPORT builtin 'hwdb' returned non-zero
IMPORT builtin 'hwdb' /lib/udev/rules.d/60-evdev.rules:21
IMPORT builtin 'hwdb' returned non-zero
IMPORT builtin 'input_id' /lib/udev/rules.d/60-input-id.rules:5
capabilities/ev raw kernel attribute: 3
capabilities/abs raw kernel attribute: 0
capabilities/rel raw kernel attribute: 0
capabilities/key raw kernel attribute: 100000 0 0 0
properties raw kernel attribute: 0
test_key: checking bit block 0 for any keys; found=0
test_key: checking bit block 64 for any keys; found=0
test_key: checking bit block 128 for any keys; found=0
test_key: checking bit block 192 for any keys; found=1
IMPORT builtin 'hwdb' /lib/udev/rules.d/60-input-id.rules:6
IMPORT builtin 'hwdb' returned non-zero
IMPORT builtin 'usb_id' /lib/udev/rules.d/60-persistent-input.rules:11
/sys/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0: if_class 14 protocol 0
LINK 'input/by-id/usb-Generic_USB2.0_PC_CAMERA-event-if00' /lib/udev/rules.d/60-persistent-input.rules:32
IMPORT builtin 'path_id' /lib/udev/rules.d/60-persistent-input.rules:35
LINK 'input/by-path/pci-0000:00:14.0-usb-0:5.2.1:1.0-event' /lib/udev/rules.d/60-persistent-input.rules:40
PROGRAM 'libinput-device-group /sys/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/input/input121/event22' /lib/udev/rules.d/80-libinput-device-groups.rules:7
starting 'libinput-device-group /sys/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/input/input121/event22'
'libinput-device-group /sys/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/input/input121/event22'(out) '3/1908/2311:usb-0000:00:14.0-5.2'
Process 'libinput-device-group /sys/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/input/input121/event22' succeeded.
IMPORT builtin 'hwdb' /lib/udev/rules.d/90-libinput-model-quirks.rules:46
IMPORT builtin 'hwdb' returned non-zero
IMPORT builtin 'hwdb' /lib/udev/rules.d/90-libinput-model-quirks.rules:50
IMPORT builtin 'hwdb' returned non-zero
handling device node '/dev/input/event22', devnum=c13:86, mode=0660, uid=0, gid=104
preserve permissions /dev/input/event22, 020660, uid=0, gid=104
preserve already existing symlink '/dev/char/13:86' to '../input/event22'
found 'c13:86' claiming '/run/udev/links/\x2finput\x2fby-id\x2fusb-Generic_USB2.0_PC_CAMERA-event-if00'
found 'c13:85' claiming '/run/udev/links/\x2finput\x2fby-id\x2fusb-Generic_USB2.0_PC_CAMERA-event-if00'
found 'c13:84' claiming '/run/udev/links/\x2finput\x2fby-id\x2fusb-Generic_USB2.0_PC_CAMERA-event-if00'
found 'c13:83' claiming '/run/udev/links/\x2finput\x2fby-id\x2fusb-Generic_USB2.0_PC_CAMERA-event-if00'
creating link '/dev/input/by-id/usb-Generic_USB2.0_PC_CAMERA-event-if00' to '/dev/input/event22'
preserve already existing symlink '/dev/input/by-id/usb-Generic_USB2.0_PC_CAMERA-event-if00' to '../event22'
found 'c13:86' claiming '/run/udev/links/\x2finput\x2fby-path\x2fpci-0000:00:14.0-usb-0:5.2.1:1.0-event'
found 'c13:85' claiming '/run/udev/links/\x2finput\x2fby-path\x2fpci-0000:00:14.0-usb-0:5.2.1:1.0-event'
found 'c13:84' claiming '/run/udev/links/\x2finput\x2fby-path\x2fpci-0000:00:14.0-usb-0:5.2.1:1.0-event'
found 'c13:83' claiming '/run/udev/links/\x2finput\x2fby-path\x2fpci-0000:00:14.0-usb-0:5.2.1:1.0-event'
creating link '/dev/input/by-path/pci-0000:00:14.0-usb-0:5.2.1:1.0-event' to '/dev/input/event22'
preserve already existing symlink '/dev/input/by-path/pci-0000:00:14.0-usb-0:5.2.1:1.0-event' to '../event22'
ACTION=add
BACKSPACE=guess
DEVLINKS=/dev/input/by-path/pci-0000:00:14.0-usb-0:5.2.1:1.0-event /dev/input/by-id/usb-Generic_USB2.0_PC_CAMERA-event-if00
DEVNAME=/dev/input/event22
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5.2/1-5.2.1/1-5.2.1:1.0/input/input121/event22
ID_BUS=usb
ID_INPUT=1
ID_INPUT_KEY=1
ID_MODEL=USB2.0_PC_CAMERA
ID_MODEL_ENC=USB2.0\x20PC\x20CAMERA
ID_MODEL_ID=2311
ID_PATH=pci-0000:00:14.0-usb-0:5.2.1:1.0
ID_PATH_TAG=pci-0000_00_14_0-usb-0_5_2_1_1_0
ID_REVISION=0100
ID_SERIAL=Generic_USB2.0_PC_CAMERA
ID_TYPE=video
ID_USB_DRIVER=uvcvideo
ID_USB_INTERFACES=:0e0100:0e0200:
ID_USB_INTERFACE_NUM=00
ID_VENDOR=Generic
ID_VENDOR_ENC=Generic
ID_VENDOR_ID=1908
LIBINPUT_DEVICE_GROUP=3/1908/2311:usb-0000:00:14.0-5.2
MAJOR=13
MINOR=86
SUBSYSTEM=input
TAGS=:power-switch:
USEC_INITIALIZED=5662716590816
XKBLAYOUT=us,il
XKBMODEL=pc105
XKBOPTIONS=grp:alt_shift_toggle,grp_led:scroll
XKBVARIANT=,
Unload module index
Unloaded link configuration context.

Other failed attempts

I tried the following:

# Rule for disabling bogus keyboard on webcam. It causes X-Windows to
# crash if it goes on and off too much

SUBSYSTEM=="input", ENV{ID_VENDOR_ID}=="1908", ENV{ID_MODEL_ID}=="2311", MODE:="000"
SUBSYSTEM=="input", ATTRS{name}=="USB2.0 PC CAMERA:*", ENV{LIBINPUT_IGNORE_DEVICE}="1"

(the := operator makes the assignment final).

However, neither of these two rules managed to stop X from reacting.

Setting the mode to 000 made the device file inaccessible, and yet the device was still registered. As for the second rule, it doesn’t help: It did set LIBINPUT_IGNORE_DEVICE correctly, but for the wrong udev event. That’s because the udev event that triggers libinput is the one for which the KERNEL attribute matches event[0-9]*, and that event is handled earlier (see 80-libinput-device-groups.rules), but ATTRS{name} isn’t defined for that specific udev event (see output of udevadm info above).

I also tried RUN+="/bin/rm /dev/input/event%n", and that indeed removed the device node, but X still reacted, and complained with “libinput: USB2.0 PC CAMERA: USB2.0 PC CAM: Failed to create a device for /dev/input/event28”. Which makes sense, because the device node had indeed been deleted.

But since it appears like X.org accesses keyboards through libinput, maybe use the example for ignoring a device, as given on this page, even though it’s quite similar to what I’ve already attempted?

So I saved this file as /etc/udev/rules.d/79-no-camera-keyboard.rules:

# Make libinput ignore webcam's button as a keyboard. As a result there's
# no event to X-Windows

ACTION=="add|change", KERNEL=="event[0-9]*", \
   ENV{ID_VENDOR_ID}=="1908", \
   ENV{ID_MODEL_ID}=="2311", \
   ENV{LIBINPUT_IGNORE_DEVICE}="1"

And then reload:

# udevadm control --reload

but that didn’t make any apparent difference (I verified that the rule was matched).
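
Whether a rule like this took effect can be checked by looking at the properties that udev has stored for the device. The following is a minimal sketch of such a check, and not necessarily how the verification above was done. has_udev_prop is a made-up helper name, and event22 is just the device node from the example above.

```shell
# has_udev_prop KEY VALUE
# Reads KEY=VALUE lines (as printed by "udevadm info -q property") from
# stdin, and succeeds only if one of the lines is exactly KEY=VALUE.
# The function name is hypothetical; it is not part of udev.
has_udev_prop() {
  grep -qxF -- "$1=$2"
}

# Intended use (not executed here):
#   udevadm info -q property -n /dev/input/event22 |
#     has_udev_prop LIBINPUT_IGNORE_DEVICE 1 && echo "property is set"
```

Note that the stored properties reflect the last event that udev processed for the device. So after reloading the rules, the camera most likely needs to be replugged (or the event re-issued with udevadm trigger) before a new property shows up.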

And that’s all, folks. Recall that I didn’t promise a happy end.

Linux + webcam: Poor man’s DIY surveillance camera

eli — Sat, 20 Aug 2022 14:28:43 +0000

Introduction

Due to an incident that is beyond the scope of this blog, I wanted to set up a 24/7 camera that watched a certain something, just in case that incident repeated itself.

Having a laptop that I barely use, and a cheap eBay web camera, I thought I’d set something up and let ffmpeg do the job.

I’m not sure if a Raspberry Pi would be up for this job, even when connected to an external hard disk through USB. It depends much on how well ffmpeg performs on that platform. Haven’t tried. The laptop’s clear advantage shows when there’s a brief power outage: It just carries on, running on its battery.

Overall verdict: It’s as good as the stability of the USB connection with the camera.

Note to self: I keep this in the misc/utils git repo, under surveillance-cam/.

Warming up

Show the webcam’s image on screen, the ffmpeg way:

$ ffplay -f video4linux2 /dev/video0

Let ffmpeg list the formats:

$ ffplay -f video4linux2 -list_formats all /dev/video0

Or with a dedicated tool:

# apt install v4l-utils

and then

$ v4l2-ctl --list-formats-ext -d /dev/video0

Possibly also use “lsusb -v” on the device: It lists the format information, not necessarily in a user-friendly way, but that’s the actual source of information.

Get all parameters that can be tweaked:

$ v4l2-ctl --all

See an example output for this command at the bottom of this post.

If control over the exposure time is available, it will be listed as “exposure_absolute” (none of the webcams I tried had this). The exposure time is given in units of 100µs (see e.g. the definition of V4L2_CID_EXPOSURE_ABSOLUTE).
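
Since the unit is 100µs, a value of 10 means 1 ms. As a small sketch of the conversion (the function name is made up, and the v4l2-ctl line only makes sense on a camera that actually has this control):

```shell
# exposure_ctrl_from_ms MILLISECONDS
# Prints the control value for an exposure time given in whole
# milliseconds, assuming the 100 µs unit mentioned above (1 ms = 10 units).
# The function name is made up for this example.
exposure_ctrl_from_ms() {
  echo $(( $1 * 10 ))
}

# Intended use on a camera that has the control (not executed here):
#   v4l2-ctl --set-ctrl=exposure_absolute=$(exposure_ctrl_from_ms 20)
```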

Get a specific parameter, e.g. brightness:

$ v4l2-ctl --get-ctrl=brightness
brightness: 137

Set the control (can be done while the camera is capturing video):

$ v4l2-ctl --set-ctrl=brightness=255

Continuous capturing

This is a simple bash script that creates .mp4 files from the captured video:

#!/bin/bash

OUTDIR=/extra/videos
SRC=/dev/v4l/by-id/usb-Generic*
DURATION=3600 # In seconds

while [ 1 ]; do
  TIME=`date +%F-%H%M%S`
  if ! ffmpeg -f video4linux2 -i $SRC -t $DURATION -r 10 $OUTDIR/video-$TIME.mp4 < /dev/null ; then
    echo 2-2 | sudo tee /sys/bus/usb/drivers/usb/unbind
    echo 2-2 | sudo tee /sys/bus/usb/drivers/usb/bind
    sleep 5;
  fi
done
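
The unbind and bind lines in the script refer to the USB port by a fixed name (2-2). As a sketch of how that name could be looked up instead: On Linux, /sys/class/video4linux/videoN/device is a symbolic link to the USB interface that the camera’s driver is bound to, and the directory above that interface is named after the USB port. The function names below are made up, and this should be taken as an illustration of the idea rather than a tested replacement for the hardcoded name.

```shell
# usb_port_from_iface_path PATH
# PATH is the sysfs directory of a USB interface, e.g.
# /sys/devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0
# Prints the name of the directory above it ("2-2" in this example).
usb_port_from_iface_path() {
  basename "$(dirname "$1")"
}

# usb_port_of_video DEVICE_FILE
# DEVICE_FILE is a video device node or a symlink to one, such as a
# /dev/v4l/by-id/ path. Assumes that the "device" link in sysfs points
# at the USB interface, which is the case for uvcvideo cameras.
usb_port_of_video() {
  node=$(readlink -f "$1")
  iface=$(readlink -f "/sys/class/video4linux/$(basename "$node")/device")
  usb_port_from_iface_path "$iface"
}

# Intended use inside the script (not executed here):
#   PORT=$(usb_port_of_video $SRC)
#   echo "$PORT" | sudo tee /sys/bus/usb/drivers/usb/unbind
```

The lookup only works while the camera is connected, so it would have to be done before the loop starts, and not at the moment ffmpeg has just failed.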

Comments on the script:

To make this a real surveillance application, there must be another script that deletes old files, so that the disk isn’t full. My script on this matter is so hacky, that I left it out here.
The real problem I encountered was occasional USB errors. They happened every now and then, without any specific pattern. Sometimes the camera disconnected briefly and reconnected right away, sometimes it failed to come back for a few minutes. Once in a week or so, it didn’t come back at all, and only a lot of USB errors appeared in the kernel log, so a reboot was required. This is most likely some kind of combination of cheap hardware, a long and not so good USB cable and maybe hardware + kernel driver issues. I don’t know. This wasn’t important enough to solve in a bulletproof way.
Because of these USB errors, those two “echo 2-2” commands attempt to reset the USB port if ffmpeg fails, and then sleep 5 seconds. The “2-2” is the physical position of the USB port to which the USB camera was connected. Ugly hardcoding, yes. I know for sure that these commands were called occasionally, but whether this helped, I’m not sure.
Also because of these disconnections, the length of the videos wasn’t always 60 minutes as requested. But this doesn’t matter all that much, as long as the time between the clips is short. Which it usually was (less than 5 seconds, the result of a brief disconnection).
Note that the device file for the camera is found using a /dev/v4l/by-id/ path rather than /dev/video0, and not just to avoid mixing up the external and built-in webcams: There were sporadic USB disconnections after which the external webcam ended up as /dev/video2. And then back to /dev/video1 after the next disconnection. The by-id path remained constant in the sense that it could be found with the * wildcard.
Frame rate is always a dilemma, as it ends up influencing the file’s size, and hence how far back videos are stored. At 5 fps, an hour-long .mp4 took about 800 MB for daytime footage, and much less than that at night. At 10 fps, it got up to 1.1 GB, i.e. less than 40% more disk space for twice as many frames, so by all means, 10 fps is better.
Run the recording on a text console, rather than inside a terminal window inside X-Windows (i.e. use Ctrl-Alt-F1 and Ctrl-Alt-F7 to go back to X). This is because the graphical desktop crashed at some point — see below on why. So if this happens again, the recording will keep going.
For the purpose of running ffmpeg without a console (i.e. run in the background with an “&” and then log out), note that the ffmpeg command has a “< /dev/null”. Otherwise ffmpeg expects to be interactive, meaning it does nothing if it runs in the background. There’s supposed to be a -nostdin flag for this, and ffmpeg recognized it on my machine, but expected a console nevertheless. So I went for the old method.
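
As for deleting old files, which the first comment above mentions: The following is not the script that was used for this, only a minimal sketch of one possible approach. It relies on the file names produced by the recording script (video-, then date and time, then .mp4), which sort chronologically, and it keeps the newest clips. The function names and the 500 GB disk budget are assumptions for the sake of the example; the 1100 MB per clip comes from the 10 fps figure above.

```shell
# clips_that_fit BUDGET_MB CLIP_MB
# Prints how many clips of CLIP_MB megabytes fit into BUDGET_MB.
clips_that_fit() {
  echo $(( $1 / $2 ))
}

# prune_videos DIR KEEP
# Deletes the oldest video-*.mp4 files in DIR, so that at most KEEP
# files remain. The names contain the date and time in a sortable
# format, so sorting by name is sorting by age.
prune_videos() {
  dir=$1
  keep=$2
  total=$(find "$dir" -maxdepth 1 -type f -name 'video-*.mp4' | wc -l)
  excess=$(( total - keep ))
  [ "$excess" -gt 0 ] || return 0
  find "$dir" -maxdepth 1 -type f -name 'video-*.mp4' | sort |
    head -n "$excess" |
    while IFS= read -r f; do
      rm -f -- "$f"
    done
}

# Intended use, e.g. once an hour from cron (not executed here):
#   prune_videos /extra/videos "$(clips_that_fit 500000 1100)"
```

The clip that is currently being written is always the newest one, so it survives as long as the number of clips to keep is at least one.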

How a wobbling USB camera crashes X-Windows

First, the spoiler: I solved this problem by putting a physical weight on the USB cable, close to the plug. This held the connector steady in place, and the vast majority of the problems were gone.

I also have a separate post about how I tried to make Linux ignore the offending bogus keyboard. Needless to say, that failed (because either you ban the entire USB device or you don’t ban at all).

This is the smoking gun in /var/log/Xorg.0.log: Lots of

[1194182.076] (II) config/udev: Adding input device USB2.0 PC CAMERA: USB2.0 PC CAM (/dev/input/event421)
[1194182.076] (**) USB2.0 PC CAMERA: USB2.0 PC CAM: Applying InputClass "evdev keyboard catchall"
[1194182.076] (II) Using input driver 'evdev' for 'USB2.0 PC CAMERA: USB2.0 PC CAM'
[1194182.076] (**) USB2.0 PC CAMERA: USB2.0 PC CAM: always reports core events
[1194182.076] (**) evdev: USB2.0 PC CAMERA: USB2.0 PC CAM: Device: "/dev/input/event421"
[1194182.076] (--) evdev: USB2.0 PC CAMERA: USB2.0 PC CAM: Vendor 0x1908 Product 0x2311
[1194182.076] (--) evdev: USB2.0 PC CAMERA: USB2.0 PC CAM: Found keys
[1194182.076] (II) evdev: USB2.0 PC CAMERA: USB2.0 PC CAM: Configuring as keyboard
[1194182.076] (EE) Too many input devices. Ignoring USB2.0 PC CAMERA: USB2.0 PC CAM
[1194182.076] (II) UnloadModule: "evdev"

and at some point the sad end:

[1194192.408] (II) config/udev: Adding input device USB2.0 PC CAMERA: USB2.0 PC CAM (/dev/input/event423)
[1194192.408] (**) USB2.0 PC CAMERA: USB2.0 PC CAM: Applying InputClass "evdev keyboard catchall"
[1194192.408] (II) Using input driver 'evdev' for 'USB2.0 PC CAMERA: USB2.0 PC CAM'
[1194192.408] (**) USB2.0 PC CAMERA: USB2.0 PC CAM: always reports core events
[1194192.408] (**) evdev: USB2.0 PC CAMERA: USB2.0 PC CAM: Device: "/dev/input/event423"
[1194192.445] (EE)
[1194192.445] (EE) Backtrace:
[1194192.445] (EE) 0: /usr/bin/X (xorg_backtrace+0x48) [0x564128416d28]
[1194192.445] (EE) 1: /usr/bin/X (0x56412826e000+0x1aca19) [0x56412841aa19]
[1194192.445] (EE) 2: /lib/x86_64-linux-gnu/libpthread.so.0 (0x7f6e4d8b4000+0x10340) [0x7f6e4d8c4340]
[1194192.445] (EE) 3: /usr/lib/xorg/modules/input/evdev_drv.so (0x7f6e45c4c000+0x39f5) [0x7f6e45c4f9f5]
[1194192.445] (EE) 4: /usr/lib/xorg/modules/input/evdev_drv.so (0x7f6e45c4c000+0x68df) [0x7f6e45c528df]
[1194192.445] (EE) 5: /usr/bin/X (0x56412826e000+0xa1721) [0x56412830f721]
[1194192.446] (EE) 6: /usr/bin/X (0x56412826e000+0xb731b) [0x56412832531b]
[1194192.446] (EE) 7: /usr/bin/X (0x56412826e000+0xb7658) [0x564128325658]
[1194192.446] (EE) 8: /usr/bin/X (WakeupHandler+0x6d) [0x5641282c839d]
[1194192.446] (EE) 9: /usr/bin/X (WaitForSomething+0x1bf) [0x5641284142df]
[1194192.446] (EE) 10: /usr/bin/X (0x56412826e000+0x55771) [0x5641282c3771]
[1194192.446] (EE) 11: /usr/bin/X (0x56412826e000+0x598aa) [0x5641282c78aa]
[1194192.446] (EE) 12: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xf5) [0x7f6e4c2f3ec5]
[1194192.446] (EE) 13: /usr/bin/X (0x56412826e000+0x44dde) [0x5641282b2dde]
[1194192.446] (EE)
[1194192.446] (EE) Segmentation fault at address 0x10200000adb
[1194192.446] (EE)
Fatal server error:
[1194192.446] (EE) Caught signal 11 (Segmentation fault). Server aborting
[1194192.446] (EE)

The thing is that the webcam presents itself as a keyboard, among others. I guess the chipset has inputs for control buttons (which this specific webcam doesn’t have), so as the USB device goes on and off, X windows registers and unregisters the nonexistent keyboard, and eventually some bug causes it to crash (note that the number of the event device is 423, so there were quite a few on and offs). It might very well be that the camera connected and started some kind of connection event handler, which didn’t finish its job before the camera disconnected. Somewhere in the code, the handler fetched information that didn’t exist, got a bad pointer instead (NULL?) and used it. Boom. Just a wild guess, but this is the typical scenario.

The crash can be avoided by making X windows ignore this “keyboard”. I did this by adding a new file named /usr/share/X11/xorg.conf.d/10-nocamera.conf as follows:

# Ignore bogus button on webcam
Section "InputClass"
 Identifier "Blacklist USB webcam button as keyboard"
 MatchUSBID "1908:2311"
 Option "Ignore" "on"
EndSection

This way, X windows didn’t fiddle with the bogus buttons, and hence didn’t care if they suddenly went away.

Anyhow, it’s a really old OS (Ubuntu 14.04.1) so this bug might have been solved long ago.

Accumulation of /dev/input/event files

Another problem with this wobbling is that /dev/input/ becomes crowded with a lot of eventN files:

$ ls /dev/input/event*
/dev/input/event0    /dev/input/event267  /dev/input/event295
/dev/input/event1    /dev/input/event268  /dev/input/event296
/dev/input/event10   /dev/input/event269  /dev/input/event297
/dev/input/event11   /dev/input/event27   /dev/input/event298
/dev/input/event12   /dev/input/event270  /dev/input/event299
/dev/input/event13   /dev/input/event271  /dev/input/event3
/dev/input/event14   /dev/input/event272  /dev/input/event30
/dev/input/event15   /dev/input/event273  /dev/input/event300
/dev/input/event16   /dev/input/event274  /dev/input/event301
/dev/input/event17   /dev/input/event275  /dev/input/event302
/dev/input/event18   /dev/input/event276  /dev/input/event303
/dev/input/event19   /dev/input/event277  /dev/input/event304
/dev/input/event2    /dev/input/event278  /dev/input/event305
/dev/input/event20   /dev/input/event279  /dev/input/event306
/dev/input/event21   /dev/input/event28   /dev/input/event307
/dev/input/event22   /dev/input/event280  /dev/input/event308
/dev/input/event23   /dev/input/event281  /dev/input/event309
/dev/input/event24   /dev/input/event282  /dev/input/event31
/dev/input/event25   /dev/input/event283  /dev/input/event310
/dev/input/event256  /dev/input/event284  /dev/input/event311
/dev/input/event257  /dev/input/event285  /dev/input/event312
/dev/input/event258  /dev/input/event286  /dev/input/event313
/dev/input/event259  /dev/input/event287  /dev/input/event314
/dev/input/event26   /dev/input/event288  /dev/input/event315
/dev/input/event260  /dev/input/event289  /dev/input/event316
/dev/input/event261  /dev/input/event29   /dev/input/event4
/dev/input/event262  /dev/input/event290  /dev/input/event5
/dev/input/event263  /dev/input/event291  /dev/input/event6
/dev/input/event264  /dev/input/event292  /dev/input/event7
/dev/input/event265  /dev/input/event293  /dev/input/event8
/dev/input/event266  /dev/input/event294  /dev/input/event9

Cute, huh? And this is even before there was a problem. So what does X windows make of this?

$ xinput list
⎡ Virtual core pointer                    	id=2	[master pointer  (3)]
⎜   ↳ Virtual core XTEST pointer              	id=4	[slave  pointer  (2)]
⎜   ↳ ELAN Touchscreen                        	id=9	[slave  pointer  (2)]
⎜   ↳ SynPS/2 Synaptics TouchPad              	id=13	[slave  pointer  (2)]
⎣ Virtual core keyboard                   	id=3	[master keyboard (2)]
    ↳ Virtual core XTEST keyboard             	id=5	[slave  keyboard (3)]
    ↳ Power Button                            	id=6	[slave  keyboard (3)]
    ↳ Video Bus                               	id=7	[slave  keyboard (3)]
    ↳ Power Button                            	id=8	[slave  keyboard (3)]
    ↳ Lenovo EasyCamera: Lenovo EasyC         	id=10	[slave  keyboard (3)]
    ↳ Ideapad extra buttons                   	id=11	[slave  keyboard (3)]
    ↳ AT Translated Set 2 keyboard            	id=12	[slave  keyboard (3)]
    ↳ USB 2.0 PC Cam                          	id=14	[slave  keyboard (3)]
    ↳ USB 2.0 PC Cam                          	id=15	[slave  keyboard (3)]
    ↳ USB 2.0 PC Cam                          	id=16	[slave  keyboard (3)]
    ↳ USB 2.0 PC Cam                          	id=17	[slave  keyboard (3)]
    ↳ USB 2.0 PC Cam                          	id=18	[slave  keyboard (3)]
    ↳ USB 2.0 PC Cam                          	id=19	[slave  keyboard (3)]

Now, let me assure you that there were not six webcams connected when I did this. Actually, not a single one.
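
Counting such leftover entries by hand gets old quickly. A minimal sketch for counting them, assuming that "xinput list --name-only" prints one device name per line (count_named_devices is a made-up helper name):

```shell
# count_named_devices NAME
# Reads device names from stdin, one per line (as printed by
# "xinput list --name-only"), and prints how many are exactly NAME.
# The function name is made up for this example.
count_named_devices() {
  # grep -c prints 0 but fails when nothing matches, hence the "|| true"
  grep -cxF -- "$1" || true
}

# Intended use (not executed here):
#   xinput list --name-only | count_named_devices "USB 2.0 PC Cam"
```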

Anyhow, I didn’t dig further into this. The real problem is that all of these /dev/input/event files have the same major. Which means that when there are really a lot of them, the system runs out of minors. So if the normal kernel log for plugging in the webcam was this,

usb 2-2: new high-speed USB device number 22 using xhci_hcd
usb 2-2: New USB device found, idVendor=1908, idProduct=2311
usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 2-2: Product: USB2.0 PC CAMERA
usb 2-2: Manufacturer: Generic
uvcvideo: Found UVC 1.00 device USB2.0 PC CAMERA (1908:2311)
uvcvideo 2-2:1.0: Entity type for entity Processing 2 was not initialized!
uvcvideo 2-2:1.0: Entity type for entity Camera 1 was not initialized!
input: USB2.0 PC CAMERA: USB2.0 PC CAM as /devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/input/input274

after all minors ran out, I got this:

usb 2-2: new high-speed USB device number 24 using xhci_hcd
usb 2-2: New USB device found, idVendor=1908, idProduct=2311
usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 2-2: Product: USB2.0 PC CAMERA
usb 2-2: Manufacturer: Generic
uvcvideo: Found UVC 1.00 device USB2.0 PC CAMERA (1908:2311)
uvcvideo 2-2:1.0: Entity type for entity Processing 2 was not initialized!
uvcvideo 2-2:1.0: Entity type for entity Camera 1 was not initialized!
media: could not get a free minor

And then immediately after:

systemd-udevd[4487]: Failed to apply ACL on /dev/video2: No such file or directory
systemd-udevd[4487]: Failed to apply ACL on /dev/video2: No such file or directory

Why these eventN files aren’t removed is unclear. The kernel is pretty old, v4.14, so maybe this has been fixed since.
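
One thing that could narrow this down is whether the kernel still has these input devices registered, or whether it’s only the device files in /dev that linger. Below is a sketch of such a comparison, assuming the usual sysfs layout, in which each registered event device has a /sys/class/input/eventN entry. The function name is made up.

```shell
# stale_event_nodes DEV_DIR SYSFS_DIR
# Prints the names of eventN entries that exist in DEV_DIR but have no
# counterpart in SYSFS_DIR, i.e. device files without a kernel device.
stale_event_nodes() {
  for f in "$1"/event*; do
    [ -e "$f" ] || continue        # no match at all: skip the raw pattern
    name=$(basename "$f")
    [ -e "$2/$name" ] || echo "$name"
  done
}

# Intended use (not executed here):
#   stale_event_nodes /dev/input /sys/class/input
```

If nothing is printed, the devices are still registered in the kernel, and the question becomes what keeps them there.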

Sample output of v4l2-ctl --all

This is a small & junky webcam. Clearly no control over exposure time.

$ v4l2-ctl --all -d /dev/v4l/by-id/usb-Generic_USB2.0_PC_CAMERA-video-index0
Driver Info (not using libv4l2):
	Driver name   : uvcvideo
	Card type     : USB2.0 PC CAMERA: USB2.0 PC CAM
	Bus info      : usb-0000:00:14.0-2
	Driver version: 4.14.0
	Capabilities  : 0x84200001
		Video Capture
		Streaming
		Device Capabilities
	Device Caps   : 0x04200001
		Video Capture
		Streaming
Priority: 2
Video input : 0 (Camera 1: ok)
Format Video Capture:
	Width/Height  : 640/480
	Pixel Format  : 'YUYV'
	Field         : None
	Bytes per Line: 1280
	Size Image    : 614400
	Colorspace    : Unknown (00000000)
	Custom Info   : feedcafe
Crop Capability Video Capture:
	Bounds      : Left 0, Top 0, Width 640, Height 480
	Default     : Left 0, Top 0, Width 640, Height 480
	Pixel Aspect: 1/1
Selection: crop_default, Left 0, Top 0, Width 640, Height 480
Selection: crop_bounds, Left 0, Top 0, Width 640, Height 480
Streaming Parameters Video Capture:
	Capabilities     : timeperframe
	Frames per second: 30.000 (30/1)
	Read buffers     : 0
                     brightness (int)    : min=0 max=255 step=1 default=128 value=128
                       contrast (int)    : min=0 max=255 step=1 default=130 value=130
                     saturation (int)    : min=0 max=255 step=1 default=64 value=64
                            hue (int)    : min=-127 max=127 step=1 default=0 value=0
                          gamma (int)    : min=1 max=8 step=1 default=4 value=4
           power_line_frequency (menu)   : min=0 max=2 default=1 value=1
                      sharpness (int)    : min=0 max=15 step=1 default=13 value=13
         backlight_compensation (int)    : min=1 max=5 step=1 default=1 value=1

FPGA + USB 3.0: Cypress EZ-USB FX3 or XillyUSB?

eli — Wed, 25 Nov 2020 05:11:25 +0000

Introduction

As the title implies, this post compares two solutions for connecting an FPGA to a host via USB 3.0: Cypress’ FX3 chipset, which has been around since about 2010, and the XillyUSB IP core, which was released in November 2020.

Cypress has been acquired by Infineon, but I’ll stick with Cypress. It’s not clear if the products are going to be re-branded (like Intel did with Altera, for example).

Since I’m openly biased towards XillyUSB, let’s be fair and start with its disadvantages. The first and obvious one is how long it’s been around compared with the FX3. Another thing is that XillyUSB won’t fall back to USB 2.0 if a USB 3.0 link fails to establish. This fallback option is important in particular because computers’ USB 3.x ports are sometimes of low quality, so even though the user expected to benefit from USB 3.x speed, the possibility to plug the device into a non-USB 3.x port can save the day.

This is however relevant only for applications that are still useful with USB 2.0, e.g. hard disks, USB sticks and Ethernet adapters: These still work, but do benefit from a faster connection when possible. If the application inherently needs payload speeds above 25 MBytes/s, it’s USB 3.0 or perish.

Thirdly, XillyUSB requires an FPGA with an MGT supporting 5 Gb/s. Low-cost FPGAs don’t. But from a BOM cost point of view, odds are that upgrading the FPGA costs less than adding the FX3 device along with its supporting components.

Finally, a not completely related comment: USB is good for hotpluggable, temporary connections. If a fixed link is required between an FPGA and some kind of computer, PCIe is most likely a better choice, possibly using Xillybus’ IP core for PCIe. Compared with USB 2.0, it might sound like a scary option, and PCIe isn’t always supported by embedded devices. But if USB 3.x is an option, odds are that PCIe is too. And a better one, unless hotplugging is a must.

FX3: Another device, another processor, another API and SDK

XillyUSB is an IP core, and hence resides side-by-side with the application logic on the FPGA. It requires a small number of pins for its functionality: Two differential wire pairs to the USB connector, and an additional pair of wires to a low-jitter reference clock. A few GPIO LEDs are recommended for status indications, but are not mandatory. The chances for mistakes in the PCB design are therefore relatively slim.

By contrast, using the FX3 requires following a hardware design application note of more than 30 pages (Cypress’ AN70707) to ensure proper operation of that device. As for FPGA pin consumption, a minimum of 40 pins is required to attain 400 MB/s of data exchange through a slave FIFO (e.g. 200 MB/s in each direction, half the link capacity), since the parallel data clock is limited to 100 MHz.
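
The 400 MB/s figure follows from simple arithmetic, assuming that the slave FIFO is used with a 32-bit data bus: 32 bits are 4 bytes per clock cycle, and at 100 MHz that makes 400 MB/s. The 32 data lines alone account for most of those 40 pins. As a sketch of the calculation (the function name is made up):

```shell
# bus_mbytes_per_s WIDTH_BITS CLOCK_MHZ
# Prints the raw throughput of a parallel bus in MB/s, assuming one
# transfer of WIDTH_BITS per clock cycle.
bus_mbytes_per_s() {
  echo $(( $1 * $2 / 8 ))
}

bus_mbytes_per_s 32 100   # prints 400
bus_mbytes_per_s 16 100   # prints 200
```

So a narrower bus saves pins, but at the cost of a proportionally lower data rate.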

It doesn’t end there: The FX3 contains an ARM9 processor for which firmware must be developed. This firmware may produce USB traffic by itself, or configure the device to expose a slave FIFO interface for streaming data from and to the FPGA. This way or another, code for the ARM processor needs to be developed in order to carry out the desired configuration, at a minimum.

This is done with Cypress’ SDK and based upon coding examples, but there’s no way around this extra firmware task, which requires detailed knowledge of how the device works. For example, to turn off the FX3’s LPM capability (which is a good idea in general), the CyU3PUsbLPMDisable() API function should be called. And there are many more of this sort.

Interface with application logic in the FPGA

XillyUSB follows Xillybus’ paradigm regarding interface with application logic: There’s a standard synchronous FIFO between the application logic and the XillyUSB IP core for each data stream, and the application logic uses it mindlessly: For an FPGA-to-host stream, the application logic just pushes the data into the FIFO (checking that it’s not full), knowing it will reach the host in a timely manner. For the opposite direction, it reads from the FIFO when it’s non-empty.

In other words, the application logic interfaces with these FIFOs like FPGA designers are used to, for the sake of streaming data between different functional modules in a design. There is no special attention required because the destination or source of the data is a USB data link.

The FX3’s slave FIFO interface may sound innocent, but it’s a parallel data and control signal interface, allowing the FPGA to issue read and write commands on buffers inside the FX3. This requires developing logic for a controller that interfaces with the slave FIFO interface: Selecting the FX3 buffer to work with, sensing its full or empty status (depending on the direction) and transferring data with this synchronous interface. If more than one data stream is required between the FPGA and the host, this controller also needs to perform scheduling and multiplexing. State machines, buffering of data, arbitration, the whole thing.

Even though a controller of this sort may seem trivial, it’s often this type of logic that is exposed to corner cases regarding flow of data: The typical randomness of data availability on one side and the ability to receive it on the other, creates scenarios that are difficult to predict, simulate and test. Obtaining a bulletproof controller of this sort is therefore often significantly more difficult than designing one for a demo.

When working with XillyUSB (or any other Xillybus IP core), the multiplexing is done inside the IP core: Designed, tested and polished once and for all. And this brings another advantage: Making changes to the data stream settings and adding streams to an existing design is simple, and doesn’t jeopardize the stability of the already existing logic. Thanks to Xillybus’ IP Core Factory, this only requires some simple operations on the website and downloading the new IP core. Its deployment in the FPGA design merely consists of replacing files, making trivial changes in the HDL following a template, and adding a standard FPGA FIFO for the new stream. Nothing else in the logic design changes, so there are no side effects.

Host software design

The FX3’s scope in the project is to present a USB device. The driver has to be written more or less from scratch. So the host software, whether as a kernel driver or a libusb user-space implementation, must be written with USB transfers as the main building block. For a reasonable data rate (or else why USB 3.0?), the software design must be asynchronous: Requests are queued for submission, and completion functions are called when these requests are completed. The simple wait-until-done method doesn’t work, because this leads to long time gaps of no communication on the USB link. Aside from the obvious impact on bandwidth utilization, this is likely to cause overflows or underflows in the FPGA’s buffers.

With XillyUSB (and once again, with other Xillybus IP cores too), a single, catch-all driver presents pipe-like device files. Plain command-line utilities like “cat” and “dd” can be used to implement reliable and practical data acquisition and playback. The XillyUSB IP core and the dedicated driver use the transfer-based USB protocol for creating an abstraction of a simple, UNIX-like data stream.
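Since the device files behave like pipes, an application’s data acquisition code boils down to plain file I/O. The following is a minimal sketch of a helper that a program might use after calling open() on such a device file. The name read_full() is made up for this example and nothing here is specific to XillyUSB: It would work on any file descriptor.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read exactly len bytes from fd, unless EOF arrives first.
 * Returns the number of bytes actually read, or -1 on error.
 */
ssize_t read_full(int fd, void *buf, size_t len)
{
  size_t done = 0;

  while (done < len) {
    ssize_t rc = read(fd, (char *)buf + done, len - done);

    if (rc < 0) {
      if (errno == EINTR)
        continue; /* Interrupted by a signal: Just try again */
      return -1;
    }

    if (rc == 0)
      break; /* EOF */

    done += (size_t)rc;
  }

  return (ssize_t)done;
}
```

The loop is there because read() is allowed to return fewer bytes than requested, exactly as with a pipe or a network socket. There are no transfers, endpoints or completion callbacks in sight.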

FPGA application logic: USB transfers or continuous data?

The USB specification was written with well-defined transfers in mind. The underlying idea was that the host allocates a buffer and queues a data transfer request, related to a certain USB endpoint, to or from that buffer. For continuous communication, several transfers can be queued. Yet, there are data buffers of fixed size, each waiting for its turn.

Some data sinks and sources are naturally organized in defined chunks of data, and fit USB’s concept well. From a software design’s point of view, it’s simpler to comprehend a mechanism that relies on fixed-sized buffers, requests and fulfillments.

But then, what is natural in an FPGA design? In most applications, continuous, non-packeted data is the common way. Even video applications, where there’s a clear boundary between frames, are usually implemented with regular FIFOs between the internal logic blocks. With XillyUSB, this is the way the data flows: FIFOs on the FPGA and pipe-like device files on the host side.

With FX3, on the other hand, the USB machinery needs direct attention. For example: When transmitting data towards the host, the FX3’s slave FIFO interface requires asserting PKTEND# in order to commit the data to the host, and this may result in a zero-length packet being sent instead. This complication is necessary to maintain USB’s concept of a transfer: Sending a USB DATA packet shorter than the maximal allowed length tells the host that the transfer is finished, even if the buffer that was allocated for the transfer isn’t filled. Therefore, the FX3 can’t just send whatever data it has in the buffer because it has nothing better to do. Doing so would terminate the transfer, which can mean something in the protocol between the driver and its device.

But then, if the transfer request buffer’s size isn’t a multiple of the maximal USB DATA packet size (1024 bytes for USB 3.0), PKTEND# must be asserted before this buffer fills, or a USB protocol error occurs, as the device sends more data than can be stored. The USB protocol doesn’t allow the leftovers to be stored in the next queued transfer’s buffer, and it’s not even clear if such transfer is queued.

If this example wasn’t clear because of too much new terminology, no problem, that was exactly the point: The USB machinery one needs to be aware of.
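For those who want the rules of the example above spelled out, here is a minimal sketch of the bookkeeping, assuming the 1024-byte maximal DATA packet size mentioned above. All names are made up for illustration. This isn’t FX3 firmware or any real API, just the arithmetic that the controller logic and the host driver need to agree on.

```c
#include <stddef.h>

#define MAX_PACKET 1024 /* Maximal DATA packet payload with USB 3.0 */

/* Number of maximal-length DATA packets in a transfer of len bytes */
size_t full_packets(size_t len)
{
  return len / MAX_PACKET;
}

/* Length of the short packet that ends the transfer, 0 if there is none */
size_t short_packet_len(size_t len)
{
  return len % MAX_PACKET;
}

/*
 * Nonzero if a zero-length packet is needed to tell the host that the
 * transfer is finished: No short packet ends it, and the host's buffer
 * isn't full yet.
 */
int needs_zlp(size_t len, size_t host_buf_len)
{
  return len < host_buf_len && short_packet_len(len) == 0;
}

/*
 * Nonzero if the device may keep sending maximal-length packets until
 * the host's buffer is full, without overflowing it.
 */
int full_packets_fit(size_t host_buf_len)
{
  return host_buf_len % MAX_PACKET == 0;
}
```

For example, 2500 bytes go out as two full packets and a short packet of 452 bytes, which ends the transfer by itself. 2048 bytes into a 4096-byte buffer need a zero-length packet to end the transfer. And a 5000-byte buffer is the troublesome case: The device must end the transfer on its own initiative before the buffer overflows.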

Physical link diagnostics

As a USB device can be connected to a wide range of USB host controllers, on various motherboards, through a wide range of USB cables, the quality of the bitstream link may vary. On a good day it’s completely error-free, but sometimes it’s a complete mess.

Low-level errors don’t necessarily cause immediate problems, and sometimes the visible problems don’t look like a low-level link issue. The USB protocol is designed to keep the show running to the extent possible (retransmits and whatnot), so what appears to be occasional problems with a USB device could actually be a bad link all the time, with random clusters of mishaps that make the problem become visible, every now and then.

Monitoring the link’s health is therefore beneficial, not only in a lab situation, but in a product as well. The application software can collect error event information, and warn the user that even though all seems well, it’s advisable to try a different USB port or cable. Sometimes, that’s all it takes.

XillyUSB provides a simple means for telling something is wrong. There’s an output from the IP core, intended for a plain LED that flashes briefly for each error event that is detected. There are more detailed LEDs as well. Also, the XillyUSB driver creates a dedicated device file, from which diagnostic data can be read with a simple file operation. This diagnostic data chunk mainly consists of event counters for different error situations, which can be viewed with a utility that is downloaded along with XillyUSB’s driver for Linux. Likewise, a simple routine in an application suite can perform this monitoring for the sake of informing users about a problematic hardware setting.

Cypress’ FX3 does provide some error information of this sort, however this is exposed to the ARM processor inside the device itself. The SDK supplies functions such as CyU3PUsbInitEventLog() for enabling event logging and CyU3PUsbGetErrorCounts() for obtaining error count, but it’s the duty of the ARM’s firmware to transfer this data to the host. And then some kind of driver and utility are needed on the host as well.

The documentation for error counting is somewhat minimal, but looking at the definition of LNK_PHY_ERROR_CONF in the EZ-USB FX3 Technical Reference Manual helps.

Bugs and Errata

As always when evaluating a component for use, it’s suggested to read through the errata section in the FX3’s datasheet. In particular, there’s a known problem causing errors in payload data towards the host, for which there is no planned fix. It occurs when a Zero Length Packet is followed by data “very quickly”, i.e. within a microframe of 125 μs.

So first, 125 μs isn’t “very quickly” in USB 3.0 terms. It’s the time corresponding to 62.5 kBytes of raw bandwidth of the link (5 Gb/s on wire is 500 MBytes/s after 8b/10b decoding), which is a few dozen DATA IN packets. Second, a zero length packet is something that is sent to finish a USB transfer. One can avoid it in some situations, but not in others. For example, if the transfer’s length is a multiple of 1024 bytes, the only way to finish it explicitly is with a zero length packet. The said erratum requires not sending any data for 125 μs after such an event, or there will be data errors.
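The 62.5 kBytes figure is easy to reproduce. A minimal sketch, assuming 8b/10b encoding (10 bits on wire for each byte, as used by USB 3.0 at 5 Gb/s) and ignoring all protocol overhead. The function names are made up for illustration.

```c
/* Bytes per second carried by a link with 8b/10b encoding */
unsigned long long bytes_per_sec_8b10b(unsigned long long line_rate_bps)
{
  return line_rate_bps / 10; /* 10 bits on wire for each byte */
}

/* Number of bytes that fit into a time interval given in microseconds */
unsigned long long bytes_in_interval(unsigned long long bytes_per_sec,
                                     unsigned int interval_us)
{
  return bytes_per_sec * interval_us / 1000000ULL;
}
```

For a line rate of 5 Gb/s and 125 μs, this yields 62500 bytes, or 61 DATA packets of 1024 bytes that could have been sent during the silence that follows each zero length packet.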

This doesn’t just make the controller more complicated, but there’s a significant bandwidth penalty.

It probably goes without saying that XillyUSB doesn’t have any bug of this sort, as it has been extensively tested with randomized data sources and sinks. It’s in fact quite odd that Cypress apparently didn’t perform tests of this sort (or they would have caught that bug easily).

The crucial difference is however that bugs in an IP core can be fixed and deployed quickly. There is no new silicon device to release, and no need to replace a physical device on the PCB.

No design is born perfect. The question is to what extent the issues that arise are fixed.

Ultrascale GTH transceivers: Advanced doesn’t necessarily mean better

eli — Wed, 09 Sep 2020 17:11:41 +0000

Introduction

I tend to naturally assume that newer FPGAs will perform better in basically everything, and that the heavier hammers are always better. Specifically, I expect the GTX / GTH / GT-whatever to perform better with the newer FPGAs (not just higher rates, but simply work better) and that their equalizers will be able to handle lousier input signals. And that the DFE equalizer will perform better than its little brother, LPM, in particular when the signal has been through some stuff.

And then there’s reality. This post summarizes my own findings with a USB 3.0 (SuperSpeed) link from the host to the FPGA, running at 5 Gb/s raw data rate on wire, with scrambler enabled. There is no official support for USB 3.0 by Xilinx’ transceivers, however the link parameters resemble those of SATA (in particular the SSC clocking without access to the clock), so I used the recommended settings for SATA, except for changing the data rate and reference clock frequency.

I’ll focus on Ultrascale’s GTH transceiver as well as the DFE equalizer, neither of which performed as I expected.

There’s a brief explanation on equalizers and related issues at the bottom of this post, for those who need some introduction.

And ah, not directly related, but if a complete design example with an Ultrascale GTH would help, here’s one. Actually, there’s also the same for earlier FPGAs (7-series).

Choosing insertion loss on Ultrascale

The setting of Transceiver IP Wizard for Ultrascale and Ultrascale+ has a crucial difference regarding the receiver: Under the “Advanced” section, which is hidden by default, the physical characteristics of the channel can be set. Among others, the equalizer can be selected between “Auto” (default), “LPM” and “DFE”. This selection can be done with the Wizard for Kintex-7 and Virtex-7 FPGAs as well, but there’s a new setting in the Ultrascale Wizard: The insertion loss at the Nyquist frequency.

The default for Ultrascale, with SATA preset, is 14 dB insertion loss, with the equalizer set to Auto. The actual result is that the GTH is configured automatically by the Wizard to use the LPM equalizer. The insertion loss is quite pessimistic for a SATA link (and USB 3.0 as well), but that doesn’t matter so much, given that LPM was chosen. And it works fine.

But knowing that I’m going to have reflections on the signal, I changed the equalizer from “Auto” to “DFE”. I was under the wrong impression that the insertion loss was only a hint for the automatic selection between LPM and DFE, so I didn’t give it any further attention. The result was really poor channel performance. Lots of bit errors.

Investigating this, I found out that while the insertion loss setting doesn’t make any difference with the LPM equalizer (at least not in the range between 0 and 14 dB), it does influence the behavior of DFE. Namely, if the insertion loss is 10 dB and below, the DFE’s AGC component is disabled, and a fixed gain is assigned instead. More precisely, the GTHE3_CHANNEL primitive’s RXDFEAGCOVRDEN port is assigned a logic '1', and the RXDFE_GC_CFG2 instantiation parameter is set to 16'b1010000 instead of all zeros.

So apparently, the DFE’s AGC doesn’t function properly unless the signal arrives with significant attenuation. This isn’t problematic when the physical link is fixed, and the insertion loss can be calculated from the PCB’s simulation. However when the link involves a user-supplied cable, such as the cases of USB 3.0 and SATA, this is an unknown figure.

Given that the insertion loss of cables is typically quite low, it makes sense to pick an insertion loss of 10 dB or less if DFE is selected. Or just go for LPM, which is configured exactly the same by the Wizard, regardless of the insertion loss setting (for the 0 dB to 14 dB range, at least). As the eye scans below show, the DFE wasn’t such a star anyhow.

In this context, it’s interesting that the Wizard for 7-series FPGAs (Kintex-7, Virtex-7 and Artix-7) doesn’t ask about insertion loss. You may select DFE or LPM, but there’s no need to be specific on that figure. So it seems like this is a workaround for a problem with the DFE on Ultrascale’s transceivers.

DFE vs. LPM on Ultrascale

As the eye scans shown below reveal, it turns out that DFE isn’t necessarily better than LPM on an Ultrascale FPGA. This is somewhat surprising, since LPM consists of a frequency response correction filter only, while the transceiver’s DFE option includes that on top of the DFE equalizer (according to the user guide). One could therefore expect that DFE would have a better result, in particular with a link that clearly produces reflections.

This, along with the Wizard’s mechanism for turning off the AGC for stronger signals, seems to indicate that the DFE didn’t turn out all that well on Ultrascale devices, and that it’s better avoided. Given that it gave no benefit with a 5 Gb/s signal that goes through quite a few discontinuities, it’s questionable whether there is a scenario for which it’s actually the preferred choice.

Eye scans: General

I’ve made a lot of statistical eye scans for the said USB channel. These scans are performed by Xilinx’ transceivers themselves, and the mechanism is described in the respective user guides. In a nutshell, these scans show how the bit error rate is affected by moving the sampling time from the point that the CDR locks on, as well as adding a voltage offset to the detection threshold. The cyan-colored spot in the middle of the plots shows the region of zero bit errors, and hence its size displays the margins in time and voltage for retaining zero BER.

The important part is the margin in time. In the plots below, UI is the time unit used. One UI corresponds to a bit’s period (i.e. the following bit appears at UI = 1). The vertical axis is less well defined and less important, since voltage is meaningless: It can be amplified as needed. The shape of the eye plot can however give a hint sometimes about certain problems.

The plots in this post were all made on a USB 3.0 data stream (running at the standard 5 Gb/s with scrambler applied), created by a Renesas uPD720202 USB controller (PCI ID 1912:0015), received by the FPGA’s transceiver.

The physical connection, except for PCB traces, involved a Type A connector, connected to a Micro B connector with a high-quality 1 meter USB cable. The Micro-B connector is part of an sfp2usb adapter, which physically connects the signal to the SFP+ connector inside an SFP+ cage, which in turn is connected directly to the FPGA. The signal traces of the sfp2usb adapter are about the length of the SFP+ cage.

So overall, it’s the USB controller chip, PCB trace, USB type A connector mating, 1 meter of cable, Micro B connector mating, a short PCB trace on the sfp2usb adapter, an SFP+ connector mating, PCB trace on the FPGA board reaching the FPGA’s transceiver.

The Renesas USB controller was selected over other options because it showed relatively low signal quality compared with other USB signal sources. The differences are more apparent with this source, however the other sources all gave similar results.

Needless to say, testing at a specific rate with specific equipment doesn’t prove anything on the general quality of the transceivers, and yet the 5 Gb/s represents a medium rate channel quite well.

The FPGA boards used:

Xilinx KCU105 for Kintex Ultrascale
Xilinx KC705 for Kintex-7
Trenz TE0714 for Artix-7 with carrier board having an SFP+ cage

I used some home-cooked logic for making the eye scans and Octave to produce the plots, so if the format doesn’t look familiar, that’s why.

LPM vs. DFE with Ultrascale GTH

This is the eye scan plot for LPM:

Eye scan with Ultrascale GTH, LPM equalizer, 5 Gb/s

And this is for DFE, with insertion loss set below the 10 dB threshold:

Eye scan with Ultrascale GTH, DFE equalizer, 5 Gb/s, low insertion loss

And this is DFE again, with insertion loss set to 14 dB:

Eye scan with Ultrascale GTH, DFE equalizer, 5 Gb/s, 14 dB insertion loss

It’s quite evident that something went horribly wrong when the insertion loss was set to 14 dB, and hence the AGC was enabled, as explained above. But what is even more surprising is that even with the AGC issue away, the eye scan for DFE is slightly worse than LPM. There are three connectors on the signal paths, each making its reflections. DFE should have done better.

Comparing DFE scans with Kintex-7’s GTX

Here’s the proper DFE eye scan for Ultrascale’s GTH again:

Eye scan with Ultrascale GTH, DFE equalizer, 5 Gb/s, low insertion loss

And this is Kintex-7, same channel but with a GTX, having considerably fewer equalizer taps:

Eye scan with Kintex-7 GTX, DFE equalizer, 5 Gb/s

It’s quite clear that the zero-BER region is considerably larger on the Kintex-7 eye scan. Never mind the y-axis of the plot, it’s the time axis that matters, and it’s clearly wider. Kintex-7 did better than Ultrascale.

Comparing LPM scans with GTX / GTP

This is the LPM eye scan for Ultrascale’s GTH again:

Eye scan with Ultrascale GTH, LPM equalizer, 5 Gb/s

And Kintex-7’s counterpart:

Eye scan with Kintex-7 GTX, LPM equalizer, 5 Gb/s

It’s clearly better than Ultrascale’s scan. Once again, never mind that the zero-BER part looks bigger: Compare the margin in the time axis. Also note that Kintex-7’s DFE did slightly better than LPM, as expected.

And since Artix-7 is also capable of LPM, here’s its scan:

Eye scan with Artix-7 GTP (LPM equalizer), 5 Gb/s

Surprise, surprise: Artix-7’s eye scan was the best of all options. The low-cost, low-power device took first prize. And it did so with an extra connector with the carrier board.

Maybe this was pure luck. Maybe it’s because the scan was obtained with a much smaller board, with possibly less PCB trace congestion. And maybe the LPM on Artix-7 is better because there’s no DFE on this device, so they put an extra effort on LPM.

Conclusion

The main takeaway from this experience of mine is that advanced doesn’t necessarily mean better. Judging by the results, it seems to be the other way around: Ultrascale’s GTH being more fussy about the signal, and losing to Kintex-7’s GTX, and both losing to Artix-7.

And also, to take the insertion loss setting in the Wizard seriously.

As I’ve already said above, this is just a specific case with specific equipment. And yet, the results turned out anything but intuitive.

Appendix: Equalizers, ISI and Nyquist frequency, really briefly

First, the Nyquist frequency: It’s just half the raw bit rate on wire. For example, it’s 2.5 GHz for a USB SuperSpeed link with 5 Gb/s raw data rate. The idea behind this term is that the receiver makes one analog sample per bit period, and Nyquist’s Theorem does the rest. But it’s also typically the frequency at which one can low-pass filter the channel without any significant effect.

Next, what’s this insertion loss? For those who haven’t played with RF network analyzers for fun, insertion loss is, for a given frequency, the ratio between the inserted signal power on one side of the cable and/or PCB trace and the power that arrives at the other end. You could call it the frequency-dependent attenuation of the signal. As the frequency rises, this ratio tends to rise (more loss of energy) as this energy turns into radio transmission and heat. Had this power loss been uniform across frequency, it would have been just a plain attenuation, which is simple to correct with an amplifier. The frequency-varying insertion loss results in a distortion of the signal, typically making the originally sharp transitions in time between '0' and '1' rounder and smeared along possibly several symbol periods.

This smearing effect results in Intersymbol Interference (ISI), which means that when the bit detector samples the analog voltage for determining whether it’s a '0' or '1', this analog voltage is composed of a sum of voltages, depending on several bits. These additional voltage components act a bit like noise and increase the bit error rate (BER), however this isn’t really noise (such as the one picked up by crosstalk or created by the electronics), but rather the effect of a bit’s analog signal being spread out over a longer time period.

Another, unrelated reason for ISI is reflections: As the analog signal travels as an electromagnetic wave through the PCB copper trace or cable, reflections are created when the medium changes or makes sharp turns. This could be a via on the PCB board (or just a poorly designed place on the layout), or a connector, which involves several medium transitions: From the copper trace on the PCB to the connector’s leg, from one side of the connector to the connector mating with it, and then from the connector’s leg to the medium that carries the signal further. This assuming the connector doesn’t have any internal irregularities.

So ISI is what equalizers attempt to fix. There’s a relatively simple approach, employed by the linear equalizer. It merely attempts to insert a filter with a frequency response that compensates for the channel’s insertion loss pattern: The equalizer amplifies high frequencies and attenuates low frequencies. By doing so, some of the insertion loss’ effect is canceled out. This reverse filter is tuned by the equalizer for optimal results, and when this works well, the result is an improvement of the ISI. The linear equalizer doesn’t help at all regarding reflections however.

The transmitter can help with this matter by shaping its signal to contain more high-frequency components — this is called pre-emphasis — but that’s a different story.

The DFE (Decision Feedback Equalizer) attempts to fix the ISI directly. It’s designed with the assumption that the transmitted bits in the channel are completely random (which is often ensured by applying a scrambler to the bit stream). Hence the voltages that are sampled by the bit detector should be linearly uncorrelated, and when there is such correlation, it’s attributed to ISI.

This equalizer cancels the correlation between the bit that is currently detected and the analog voltages of a limited number of bits that will be detected after it. This is done by adding or subtracting a certain voltage for each of the signal samples that are used for detecting the bits after the current one. The magnitude of this operation (which can be negative, of course) depends on the time distance between the current bit and the one manipulated. Whether it’s an addition or subtraction depends on whether the current bit was detected as a '0' or '1'.

The result is hence that when the sample arrives at the bit detector, it’s linearly uncorrelated with the bits that were detected before it. Or more precisely, uncorrelated with a number of bits detected before it, depending on the number of taps that the DFE has.
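The mechanism described above can be imitated in software. This is only a toy model, under several assumptions of mine: Bits are represented as +1 and -1, the tap values are known and fixed (a real DFE adapts them continuously), and there are two taps. It has nothing to do with how the transceiver is actually implemented, and all names are made up.

```c
#include <stddef.h>

#define DFE_TAPS 2

/*
 * Detect n bits from analog samples. Bits are represented as +1 / -1.
 * taps[k] is the voltage that a bit adds to the sample that is taken
 * k+1 bit periods after it, i.e. its contribution to ISI.
 */
void dfe_detect(const double *samples, int *bits, size_t n,
                const double taps[DFE_TAPS])
{
  int history[DFE_TAPS] = { 0 }; /* history[k]: Bit detected k+1 periods ago */

  for (size_t i = 0; i < n; i++) {
    double v = samples[i];

    /* Remove the contribution of the bits that were already detected */
    for (size_t k = 0; k < DFE_TAPS; k++)
      v -= taps[k] * history[k];

    bits[i] = (v >= 0) ? 1 : -1; /* The bit detector */

    /* Shift the decision into the history */
    for (size_t k = DFE_TAPS - 1; k > 0; k--)
      history[k] = history[k - 1];
    history[0] = bits[i];
  }
}
```

Note that the correction relies on the previous decisions being correct. If a bit is detected wrongly, the wrong voltage is applied to the samples that follow it, which is why errors tend to arrive in bursts with this kind of equalizer.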

This method is more power consuming and has a strong adverse effect if the bits aren’t random. It’s however considered better, in particular for canceling the effect of signal reflections, which is a common problem as the analog signal travels on the PCB and/or cable and reaches discontinuities (vias, connectors etc.).

Having said that, one should remember that the analog signal typically travels at about half the speed of light on PCB traces (i.e. 1.5 x 10^8 m/s), so e.g. at 5 Gb/s each symbol period corresponds to 3 cm. Accordingly, an equalizer with e.g. 8 taps is able to cancel reflections that have traveled 24 cm (typically 12 cm in each direction). So DFE may help with reflections on PCBs, but not if the reflection has gone back and forth through a longer cable. Which may not be an issue, since the cable itself typically attenuates the reflection’s signal as it goes back and forth.
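These figures can be calculated for other bit rates and numbers of taps. A minimal sketch, with function names made up for illustration, and with the propagation velocity given as a parameter (the 1.5 x 10^8 m/s above is a rule of thumb, not a measured figure):

```c
/* Distance that the signal travels during one symbol period, in cm */
double symbol_length_cm(double bit_rate_bps, double velocity_m_per_s)
{
  return velocity_m_per_s / bit_rate_bps * 100.0;
}

/*
 * The longest path, in cm, that a reflection can travel and still
 * be canceled by a DFE with the given number of taps. Divide by two
 * for the distance to the discontinuity if the reflection goes back
 * and forth.
 */
double dfe_reach_cm(unsigned int taps, double bit_rate_bps,
                    double velocity_m_per_s)
{
  return taps * symbol_length_cm(bit_rate_bps, velocity_m_per_s);
}
```

So at 5 Gb/s and 1.5 x 10^8 m/s, symbol_length_cm() gives about 3 cm, and dfe_reach_cm() gives about 24 cm for 8 taps. At 10 Gb/s the reach is halved, since each symbol period is half as long.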

According to the user guides, when a Xilinx transceiver is set to LPM (Low Power Mode), only a linear equalizer is employed. When DFE is selected, a linear equalizer followed by a DFE equalizer is employed.

Linux driver: Creating device files for a hotpluggable device

eli — Mon, 13 Apr 2020 09:31:31 +0000

Introduction

Most device drivers hook onto an already existing subsystem: Audio, networking, I2C, whatever. If it’s a specialized piece of hardware, a single device file is typically enough, and you get away with a misc device or, as shown in the kernel’s USB example (drivers/usb/usb-skeleton.c), with the USB framework’s usb_register_dev() call.

However in some cases, a number of dedicated device files are required, belonging to a dedicated class. Only in these cases does it make sense to generate the device files explicitly. A driver that does this for no good reason will be hard to get into the Linux kernel tree.

Hotpluggable devices are tricky in particular because of the two following reasons:

They may vanish at any time, so the driver must be sure to handle that properly. Namely, not to free any resources as long as they may be accessed by some other thread.
It’s impossible to initialize the device driver once and for all against a known set of devices when the driver loads.

This post focuses on USB devices, and in particular on deleting the device files and releasing the kernel resources that support them when the USB device is unplugged.

Spoiler: There’s no problem deleting the device files themselves, but handling the cdev struct requires some attention.

Everything in this post relates to Linux kernel v5.3.

A reminder on device files

This post takes the down-to-details approach to allocating majors and minors, which isn’t necessarily recommended: It’s by far easier to use register_chrdev() than to muck about with alloc_chrdev_region() and setting up the cdev struct directly. However the good old Linux Device Drivers book suggests using the latter, and deems register_chrdev() to be “The Older Way”, which is about to be removed from the kernel. That was the 2005 edition, and fast forward to 2020, none other than drivers/char/misc.c uses register_chrdev(). Some drivers go one way, others go the other.

So first, let’s look at the usual (by the book, literally) way to do it, which is in fact unsuitable for a hotpluggable device. But we have to start with something.

There are three players in this game:

The cdev struct: Makes the connection between a set of major/minors and a pointer to a struct file_operations, containing the pointers to the functions (methods) that implement open, release, read, write etc. The fops, in short.
The device files: Those files in /dev that are accessible by user space programs.
The class: Something that must be assigned to device files that are created, and assists in identifying their nature (in particular for the purpose of udev).

The class is typically a global variable in the driver module, and is created in its init routine:

static struct class *example_class;

static int __init example_init(void)
{
  example_class = class_create(THIS_MODULE, examplename);
  if (IS_ERR(example_class))
    return PTR_ERR(example_class);

  return 0;
}

Because a class is typically global in the module, and hence not accessible elsewhere, it’s impossible to create device files on behalf of another class without using their API and restrictions (for example, misc devices). On the other hand, if you try to push a driver which creates a new class into the kernel tree, odds are that you’ll have to explain why you need to add yet another class. Don’t expect a lot of sympathy on this matter.

The next player is the cdev struct. Its role is to connect between a major + a range of minors and file operations. It’s typically part of a larger struct which is allocated for each physical device. So it usually goes something like

struct example_device {
  struct device *dev;

  struct cdev cdev;

  int major;
  int lowest_minor; /* Highest minor = lowest_minor + num_devs - 1 */

  int num_devs;

  struct kref kref;

  /* Just some device related stuff */
  struct list_head my_list;
  __iomem void *registers;
  int fatal_error;
  wait_queue_head_t my_wait;
};

The only parts that are relevant for this post are the struct cdev and the others marked in bold, but I left a few others that often appear in a memory-mapped I/O device.

Note that the example_device struct contains the cdev struct itself, and not a pointer to it. This is the usual way, but that isn’t the correct way for a USB device. More on that below.

As mentioned above, the purpose of the cdev struct is to bind a major/minor set to a struct of fops. Something like

static const struct file_operations example_fops = {
  .owner      = THIS_MODULE,
  .read       = example_read,
  .write      = example_write,
  .open       = example_open,
  .flush      = example_flush,
  .release    = example_release,
  .llseek     = example_llseek,
  .poll       = example_poll,
};

The cdev is typically initialized and brought to life with something like

  struct example_device *mydevice;
  dev_t dev;
  int major, minor, rc;

  rc = alloc_chrdev_region(&dev, 0, /* minor start */
			   mydevice->num_devs,
			   examplename);
  if (rc) {
    dev_warn(mydevice->dev, "Failed to obtain major/minors\n");
    return rc;
  }

  mydevice->major = major = MAJOR(dev);
  mydevice->lowest_minor = minor = MINOR(dev);

  cdev_init(&mydevice->cdev, &example_fops);
  mydevice->cdev.owner = THIS_MODULE;

  rc = cdev_add(&mydevice->cdev, MKDEV(major, minor),
		mydevice->num_channels);
  if (rc) {
    dev_warn(mydevice->dev, "Failed to add cdev. Aborting.\n");
    goto bummer;
  }

So there are a number of steps here: First, a major and a range of minors are allocated with the call to alloc_chrdev_region(), and the result is stored in the dev_t variable (dev_t is a plain integer type that packs the major and the lowest minor, not a struct).

Then the cdev is initialized and assigned a pointer to the fops struct (i.e. the struct that assigns open, read, write, release etc.). It’s the call to cdev_add() that makes the module “live” as a device file handler by binding the fops to the major/minor set that was just allocated.

If files with the relevant major and minor happen to exist in the file system, they can be used immediately to execute the methods in example_fops. This is however not likely in this case, since the major and minors were allocated dynamically. The very common procedure is hence to create the device files in the driver (which also triggers udev events, if such are defined). So there can be several calls to something like:

    device = device_create(example_class,
			   NULL,
			   MKDEV(major, i),
			   NULL,
			   "%s", devname);

Note that this is the only use of example_class. The class has nothing to do with the cdev.

And of course, all this must be reverted when the USB device is unplugged. So we’re finally getting to business.

Is it fine to remove device files on hot-unplugging?

Yes, but remember that the file operation methods may very well be called on behalf of file descriptors that were already opened.

So it’s completely OK to call device_destroy() on a device that is still opened by a process. There is no problem creating device files with the same names again, even while the said process still has the old file opened. It’s exactly like any deleted inode, which remains visible only to the process(es) that hold a file handle on it. A device file is just a file. Remove it, and it’s really gone only when there are no more references to it.

In fact, for a simple “cat” process that held a deleted device, the entry in /proc went

# ls -l /proc/756/fd
total 0
lrwx------. 1 root root 64 Mar  3 14:12 0 -> /dev/pts/0
lrwx------. 1 root root 64 Mar  3 14:12 1 -> /dev/pts/0
lrwx------. 1 root root 64 Mar  3 14:12 2 -> /dev/pts/0
lr-x------. 1 root root 64 Mar  3 14:12 3 -> /dev/example_03 (deleted)

So no drama here. Really easy.
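The same “deleted but still open” behavior is easy to reproduce in userspace with a regular file. The following is a minimal sketch, not related to any driver: the function name and the /tmp path template are made up for the example. It removes a file’s name while holding an open file descriptor on it, and then keeps using the descriptor.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Create a temporary file, unlink it while it is still open, and
 * verify that the open descriptor keeps working.
 * Returns 0 on success, -1 on failure. */
static int unlinked_file_still_usable(void)
{
  char path[] = "/tmp/example_XXXXXX"; /* Hypothetical name template */
  char buf[8] = "";
  int ok;
  int fd = mkstemp(path);

  if (fd < 0)
    return -1;

  /* Remove the name. The inode lives on as long as fd is open. */
  if (unlink(path) < 0) {
    close(fd);
    return -1;
  }
  assert(access(path, F_OK) < 0); /* The name is really gone */

  ok = write(fd, "hello", 5) == 5 &&
       lseek(fd, 0, SEEK_SET) == 0 &&
       read(fd, buf, 5) == 5 &&
       !strcmp(buf, "hello");

  close(fd); /* Last reference dropped: now the inode goes away */
  return ok ? 0 : -1;
}
```

While the file is open and unlinked, /proc/<pid>/fd shows it with the “(deleted)” suffix, just like the device file in the listing above.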

Also recall that removing the device files doesn’t mean all that much: It’s perfectly possible (albeit weird) to generate extra device files with mknod, and use them regardless. The call to device_destroy() won’t make any difference in this case. It just removes those convenience device files in /dev.

When to release the cdev struct

Or more precisely, the question is when to release the struct that contains the cdev struct. The kernel example’s suggestion (drivers/usb/usb-skeleton.c) is to maintain a reference counter on the enclosing struct (a kref). Then increment the reference count for each file opened, decrement for each file release, and also decrement it when the device is disconnected. This way, the device information (e.g. example_device struct above) sticks around until the device is disconnected and there are no open files. There is also an issue with locking, discussed at the end of this post.

But when cdev is part of this struct, that is not enough. cdev_del(), which is normally called in the device’s disconnect method, disables the accessibility of the fops for opening new file descriptors. But there’s much to the comment from fs/char_dev.c, above the definition of cdev_del(): “This guarantees that cdev device will no longer be able to be opened, however any cdevs already open will remain and their fops will still be callable even after cdev_del returns.”

So what’s the problem, one may ask. The kref keeps the cdev until the last release! (hopefully with proper locking, as discussed at the end of this post)

Well, that’s not good enough: It turns out that the struct cdev is accessed after the fops release method has been called, even for the last open file descriptor.

Namely, the issue is with __fput() (defined in fs/file_table.c), which is the function that calls the fops release method, and does a lot of other things that are related to the release of a file descriptor (getting it off the epoll lists, for example): If the released inode is a character device, it calls cdev_put() with the cdev struct after the release fops method has returned.

Which makes sense, after all. The cdev’s reference count must be reduced sometime, and it can’t be before calling the release, can it?

So cdev_put calls kobject_put() on the cdev’s kobject to reduce its reference count. And then module_put() on the owner of the cdev entry (the owning module, that is) as given in the @owner entry of struct cdev.

Therefore, there’s a nasty OOPS or a kernel panic if the struct cdev is on a memory segment that has been freed. Ironically enough, the call to cdev_put() brings the cdev’s reference count to zero if cdev_del() has been called previously. That, in turn, leads to a call to the kobject’s release method, which is cdev_default_release(). In other words, the oops is caused by the very mechanism that is supposed to protect the cdev (and the module) from being released while still in use, which is the exact problem that it ends up creating.

Ironic, but also the hint to the solution.

The lame solution

The simplest way is to have the cdev as a static global variable of the relevant module. Is this accepted practice? Most likely, as Greg K-H himself manipulated a whole lot of these in kernel commit 7e7654a. If this went through his hands, who am I to argue. However this goes along with allocating a fixed pool of minors for the cdev: The number of allocated minors is set when calling cdev_add().

The downside is that cdev_add() can only be called once, so the range of minors must be fixed. This is commonly solved by setting up a pool of minors in the module’s init routine (256 of them in usb-skeleton.c, but there are many other examples).

Even though it’s a common solution in the kernel tree, I always get a slight allergy to this concept. How many times have we heard that “when it was designed, it was considered a lot” thing?

The elegant (but lengthy) solution

As hinted above, register_chrdev() is better used instead of this solution. If you read through the function’s source in fs/char_dev.c, it’s quite apparent that it does pretty much the same as described here. It has a peculiarity, though: It’s hardcoded to allocate exactly 256 minors, in the range of 0-255. alloc_chrdev_region(), on the other hand, allows choosing the number of requested minors. But then, there’s __register_chrdev(), which is exported and allows setting the range accurately, if 256 minors isn’t enough, or the range needs to be set explicitly.

That said, let’s go back to the lengthy-elegant solution. In short: Allocate the cdev dynamically. Instead of

struct example_device {
  struct device *dev;

  struct cdev cdev;
  [ ... ]
};

go

struct example_device {
  struct device *dev;

  struct cdev *cdev;
  [ ... ]
};

so the cdev struct is referred to with a pointer instead. And instead of the call to cdev_init(), go:

  mydevice->cdev = cdev_alloc();
  if (!mydevice->cdev)
    goto bummer;

  mydevice->cdev->ops = &example_fops;
  mydevice->cdev->owner = THIS_MODULE;

And from there go ahead as usual. The good part is that there’s no need to free a cdev that has been allocated this way. The kernel frees it automatically when its reference count goes down to zero (it starts at one, of course). So all in all, the kernel counts the references to cdev as files are opened and closed. In particular, it decrements it when cdev_del() is called. So it really vanishes only when it’s not needed anymore.

Note that cdev_init() isn’t called. Doing this will cause a kernel memory leak (which won’t affect the allocation of major and minors, by the way). See “Read the Source” below, which also shows the details on how this solves the problem.

Only note that if cdev_add() fails, the correct unwinding is:

  rc = cdev_add(mydevice->cdev, MKDEV(major, minor), /* cdev is a pointer now */
		mydevice->num_devs);
  if (rc) {
    dev_warn(mydevice->dev, "Failed to add cdev. Aborting.\n");
    kobject_put(&mydevice->cdev->kobj);
    goto bummer2;
  }

In other words, don’t call cdev_del() if cdev_add() fails. It can’t be deleted if it hasn’t been added. Decrementing its reference count is the reverse operation of cdev_alloc(). This is how it’s done by __register_chrdev(), defined in fs/char_dev.c. That’s where cdev_add() and cdev_del() are defined, so they should know…

Know cdev’s reference count rules

Since the cdev is freed by the kernel, it’s important to know how the kernel maintains its reference count. So these are the rules:

cdev is assigned a reference count of 1 by the call to cdev_alloc() (by virtue of kobject_init). Same goes for cdev_init(), but that’s irrelevant (see code snippets below).
cdev’s reference count is not incremented by the call to cdev_add(). So it stays on 1, which is sensible.
cdev’s reference count is decremented on a call to cdev_del(). This makes sense, even though it kind-of breaks the symmetry with cdev_add(). But the latter takes a free ride on the ref count of cdev_alloc(), so that’s how it comes together.
A reference increment is done for each opened related file, and decremented on file release.

The bottom line is that if cdev_del() is called when there is no currently opened relevant device file, it will go away immediately.
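These rules can be played with in a toy userspace model. The sketch below is not kernel code, and all names in it are made up for illustration: a plain integer stands in for the kobject’s reference count, and a flag marks the point where the kernel would have called the release method.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the cdev reference count rules. NOT kernel code. */
struct model_cdev {
  int refcount;
  bool added; /* New opens are possible */
  bool freed; /* The release method has run */
};

static void model_put(struct model_cdev *c)
{
  assert(!c->freed && c->refcount > 0);
  if (--c->refcount == 0)
    c->freed = true; /* cdev_dynamic_release() would kfree() here */
}

/* Like cdev_alloc(): The count starts at 1 */
static void model_alloc(struct model_cdev *c)
{
  c->refcount = 1;
  c->added = false;
  c->freed = false;
}

/* Like cdev_add(): The count is not touched */
static void model_add(struct model_cdev *c)
{
  c->added = true;
}

/* Opening a device file: The count is incremented */
static bool model_open(struct model_cdev *c)
{
  if (!c->added)
    return false; /* After cdev_del(), no new opens */
  c->refcount++;
  return true;
}

/* Releasing a device file: The count is decremented */
static void model_release(struct model_cdev *c)
{
  model_put(c);
}

/* Like cdev_del(): No new opens, and the count is decremented */
static void model_del(struct model_cdev *c)
{
  c->added = false;
  model_put(c);
}
```

Running alloc, add, open, del and release in this order leaves the model cdev alive after the del step, and marks it freed only at the last release. Without the open and release steps, it’s marked freed by the del step itself.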

For the extra pedantic, it may seem necessary to call kobject_get(&mydevice->cdev->kobj) immediately after cdev_alloc(), and then kobject_put() only after freeing the example_device struct, because it contains the pointer to the cdev. This is what reference counting means: Count the pointers to the resource. However since the pointer to the cdev struct is typically only used for the cdev_del() call, nothing bad is likely to happen because of this pointer to nowhere after the cdev has been freed. It’s more a matter of formality.

This extra reference count manipulation can also be done with cdev_get() and cdev_put(), but will add an unnecessary and possibly confusing (albeit practically harmless) reference count to the module itself. Just be sure to set the cdev’s @owner entry before calling cdev_get() or things will get messy.

Read the Source

Finally, I’ll explain why using cdev_alloc() really helps. The answer lies in the kernel’s fs/char_dev.c.

Let’s start with cdev_init(). It’s short:

void cdev_init(struct cdev *cdev, const struct file_operations *fops)
{
  memset(cdev, 0, sizeof *cdev);
  INIT_LIST_HEAD(&cdev->list);
  kobject_init(&cdev->kobj, &ktype_cdev_default);
  cdev->ops = fops;
}

Noticed that kobject_init()? It initializes a kernel object, which is used for reference counting. And it’s of type ktype_cdev_default, which in this case only means that the release function is defined as

static struct kobj_type ktype_cdev_default = {
  .release	= cdev_default_release,
};

So when cdev->kobj’s reference count goes to zero, cdev_default_release() is called. Which is:

static void cdev_default_release(struct kobject *kobj)
{
  struct cdev *p = container_of(kobj, struct cdev, kobj);
  struct kobject *parent = kobj->parent;

  cdev_purge(p);
  kobject_put(parent);
}

Arrgghh! So there’s a release function! Why can’t it free the memory as well? Wouldn’t that have been perfect? Well, a catastrophe, in fact. How could it free a memory segment within another enclosing struct?

But in fact, there is such a release function, with a not-so-surprising name:

static void cdev_dynamic_release(struct kobject *kobj)
{
  struct cdev *p = container_of(kobj, struct cdev, kobj);
  struct kobject *parent = kobj->parent;

  cdev_purge(p);
  kfree(p);
  kobject_put(parent);
}

Exactly the same, just with the kfree() in exactly the right spot. Backed up by

static struct kobj_type ktype_cdev_dynamic = {
  .release	= cdev_dynamic_release,
};

and guess which function uses it:

struct cdev *cdev_alloc(void)
{
  struct cdev *p = kzalloc(sizeof(struct cdev), GFP_KERNEL);
  if (p) {
    INIT_LIST_HEAD(&p->list);
    kobject_init(&p->kobj, &ktype_cdev_dynamic);
  }
  return p;
}

Now let’s compare it with cdev_init():

It allocates the cdev instead of using an existing one. Well, that’s the point, isn’t it?
It doesn’t call memset(), because the segment is already zero by virtue of kzalloc.
It doesn’t assign cdev->ops, because it doesn’t have that info. The driver is responsible for this now.
It sets the kernel object to have a release method that includes the kfree() part, of course.

This is why cdev_init() must not be called after cdev_alloc(): Even though it apparently does nothing harmful, it re-initializes the kernel object with ktype_cdev_default. That easily goes unnoticed, since the only thing that will happen is that kfree() won’t be called, causing a very small, barely noticeable kernel memory leak. No disaster, but people go to kernel-hell for less.
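The difference between the two release methods, and the leak caused by mixing them up, can be demonstrated with a small userspace model. Once again, this is only a sketch with made-up names: malloc-family calls stand in for kzalloc() and kfree(), and a counter tracks how many times the memory was actually freed.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a kobject: a count and a release method */
struct model_kobj {
  int refcount;
  void (*release)(struct model_kobj *);
};

static int frees; /* Number of times the memory was really freed */

/* Like cdev_default_release(): Doesn't free the memory */
static void default_release(struct model_kobj *k)
{
  (void)k;
}

/* Like cdev_dynamic_release(): Frees the memory as well */
static void dynamic_release(struct model_kobj *k)
{
  free(k);
  frees++;
}

/* Like cdev_alloc(): Allocate, and select the freeing release method */
static struct model_kobj *model_alloc(void)
{
  struct model_kobj *k = calloc(1, sizeof(*k));

  if (k) {
    k->refcount = 1;
    k->release = dynamic_release;
  }
  return k;
}

/* Like cdev_init(): Select the non-freeing release method */
static void model_init(struct model_kobj *k)
{
  k->refcount = 1;
  k->release = default_release;
}

static void model_put(struct model_kobj *k)
{
  assert(k->refcount > 0);
  if (--k->refcount == 0)
    k->release(k);
}
```

Calling model_put() on an object from model_alloc() frees it. Calling model_init() on that object first silently swaps the release method, so the same model_put() leaves the memory allocated. That’s the leak.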

When and how to free example_device

Now back to the topic of maintaining a reference count on the device’s information (e.g. struct example_device). It should contain a struct kref, which allows keeping track of when the struct itself should be kept in memory, and when it can be deleted. As mentioned earlier, the kref is initialized with a reference count of 1 (by kref_init()), and is then incremented every time the open method is called for a related device file, decremented for every release of such, and once again decremented when the device itself is disconnected.

On the face of it, easy peasy: The struct goes away when there are no related open device files, and the device itself is away too. But what if there’s a race condition? What if a file is opened at the same time that the device is disconnected? This requires a mutex.

The practice for using kref is to decrement the struct’s reference count with something like

kref_put(&mydevice->kref, cleanup_dev);

where cleanup_dev is a function that is called if the reference count reached zero, with a pointer to the kref. The function then uses container_of to find the address of the structure containing the kref, and frees that structure. Something like

static void cleanup_dev(struct kref *kref)
{
  struct example_device *dev =
    container_of(kref, struct example_device, kref);

  kfree(dev);
}

The locking mechanism is relatively simple. All it needs to ensure is that the open method doesn’t try to access the example_device struct after it has been freed. But since the open method must do some kind of lookup to find which example_device struct is relevant, by checking if it covers the major/minor of the opened device file, the name of the game is to unlist the example_device before freeing its memory.

So if the driver implements a list of example_device structs, one for each connected USB device, all that is necessary is to protect the access to this list with a mutex, and to hold that mutex while kref_put() is called. Likewise, this mutex is taken by the open method before looking in the list, and is released only after incrementing the reference count with kref_get().

And then make sure that the list entry is removed in cleanup_dev.
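The locking scheme described above can be sketched as a userspace model with a pthread mutex. This is not driver code: a plain integer stands in for the struct kref, the “disconnected” flag stands in for the effect of cdev_del() (no new opens), and all function names are made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct model_dev {
  struct model_dev *next;
  int lowest_minor, num_devs;
  int refcount;     /* Stands in for struct kref */
  int disconnected; /* Stands in for the effect of cdev_del() */
};

static pthread_mutex_t list_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct model_dev *dev_list;

/* Reference count hit zero: Unlist, then free. list_mutex is held. */
static void cleanup_dev(struct model_dev *dev)
{
  struct model_dev **p;

  for (p = &dev_list; *p; p = &(*p)->next)
    if (*p == dev) {
      *p = dev->next;
      break;
    }
  free(dev);
}

/* Drop a reference while holding the mutex */
static void dev_put(struct model_dev *dev)
{
  pthread_mutex_lock(&list_mutex);
  assert(dev->refcount > 0);
  if (--dev->refcount == 0)
    cleanup_dev(dev);
  pthread_mutex_unlock(&list_mutex);
}

/* Probe: The struct is born with a reference count of 1 */
static struct model_dev *dev_probe(int lowest_minor, int num_devs)
{
  struct model_dev *dev = calloc(1, sizeof(*dev));

  if (!dev)
    return NULL;
  dev->lowest_minor = lowest_minor;
  dev->num_devs = num_devs;
  dev->refcount = 1;

  pthread_mutex_lock(&list_mutex);
  dev->next = dev_list;
  dev_list = dev;
  pthread_mutex_unlock(&list_mutex);
  return dev;
}

/* Open: Look up and take a reference under the same mutex */
static struct model_dev *dev_open(int minor)
{
  struct model_dev *dev;

  pthread_mutex_lock(&list_mutex);
  for (dev = dev_list; dev; dev = dev->next)
    if (!dev->disconnected && minor >= dev->lowest_minor &&
        minor < dev->lowest_minor + dev->num_devs) {
      dev->refcount++;
      break;
    }
  pthread_mutex_unlock(&list_mutex);
  return dev;
}

/* Disconnect: No new opens, then drop the initial reference */
static void dev_disconnect(struct model_dev *dev)
{
  pthread_mutex_lock(&list_mutex);
  dev->disconnected = 1;
  pthread_mutex_unlock(&list_mutex);
  dev_put(dev);
}
```

The point to take home is that the lookup and the reference increment are one atomic step with respect to the decrement and the unlisting, so dev_open() can never return a pointer to a struct that is just about to be freed.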

The major / minor space waste

Reading through fs/char_dev.c, one gets the impression that Linus intended to allocate majors and minors in an efficient manner when he wrote it back in 1991, but then it never happened: There’s an explicit implementation of a hash for holding the ranges of majors and minors, and the relevant routines insert entries into it, and remove them as allocations are made and dropped.

But then it seems like alloc_chrdev_region(), which allocates major / minor space dynamically, sets the minor base to zero in its call to __register_chrdev_region(). The latter calls find_dynamic_major(), which as its name implies, looks up a major that isn’t used at all. In no way is there an attempt to re-use a major by subsequent alloc_chrdev_region() calls.

The truth is that there’s no practical reason to. Looking at /proc/devices, there aren’t so many majors allocated on a typical system, so there’s no drive to optimize.
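To get a feeling for the numbers, here is a userspace mirror of how the kernel packs a major and a minor into its internal 32-bit dev_t (as defined in include/linux/kdev_t.h): The lower 20 bits are the minor, and the bits above them are the major. The names with the ex_ prefix are made up, in order to avoid clashing with the macros in the system’s headers.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace mirror of the kernel's internal dev_t layout: The low
 * 20 bits hold the minor, the bits above them hold the major. */
#define EX_MINORBITS 20
#define EX_MINORMASK ((1U << EX_MINORBITS) - 1)

typedef uint32_t ex_dev_t;

/* Like the kernel's MKDEV() */
static ex_dev_t ex_mkdev(unsigned int major, unsigned int minor)
{
  assert(minor <= EX_MINORMASK);
  return (major << EX_MINORBITS) | minor;
}

/* Like the kernel's MAJOR() */
static unsigned int ex_major(ex_dev_t dev)
{
  return dev >> EX_MINORBITS;
}

/* Like the kernel's MINOR() */
static unsigned int ex_minor(ex_dev_t dev)
{
  return dev & EX_MINORMASK;
}

/* Number of minors that a single major can hold */
static unsigned int ex_minors_per_major(void)
{
  return EX_MINORMASK + 1;
}
```

So each major that alloc_chrdev_region() grabs carries room for more than a million minors, of which a typical driver uses a handful.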

Bonus: When is it OK to access the USB API’s structs?

Not directly related, but definitely worth mentioning: The memory chunk to which struct usb_interface *interface points (and which is given to both probe and disconnect) is released after the call to the disconnect method returns. This means that if any other method holds a copy of the pointer and uses it, there must be some kind of lock that prevents the disconnect call from returning as long as this pointer may be in use. And of course, prevents any other thread from starting to use this pointer after that. Otherwise even something as innocent as

dev_info(&interface->dev, "Everything is going just great!\n");

may cause a nasty crash. Sleeping briefly in the disconnect method is OK, and it solves this issue. Just be sure no other thread sleeps forever with that lock taken. Should not be an issue, because asynchronous operations on the USB API have no reason to block.

This is demonstrated well in the kernel’s own usb-skeleton.c, by virtue of io_mutex. In the disconnection method, it goes

mutex_lock(&dev->io_mutex);
dev->interface = NULL;
mutex_unlock(&dev->io_mutex);

and then, whenever the driver wants to touch anything related to USB, it goes

mutex_lock(&dev->io_mutex);
if (!dev->interface) {
  mutex_unlock(&dev->io_mutex);
  retval = -ENODEV;
  goto error;
}

and keeps holding that mutex during all business with the kernel’s USB API. Once again, this is reasonable when using the asynchronous API, so no call blocks.

It’s however not possible to hold this mutex in URB completer callbacks, since they are executed in atomic context (an interrupt handler or tasklet). These callback routines are allowed to assume that the interface data is legit throughout their own execution, because by default the kernel’s USB subsystem makes sure to complete all URBs (with an -ENOENT status), and to prevent submitting new ones, before calling the device’s disconnect method (for example, in usb-skeleton.c, dev->interface->dev is used for an error message in the completion callbacks).
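The io_mutex pattern for the non-atomic code paths can be modeled in userspace like this. It’s just a sketch under obvious assumptions: a pthread mutex replaces the kernel mutex, a string pointer stands in for the struct usb_interface pointer, and the struct and function names are made up.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

struct model_skel {
  pthread_mutex_t io_mutex;
  const char *interface; /* Stands in for struct usb_interface * */
  int io_count;          /* Number of I/O operations carried out */
};

/* Disconnect: Once this returns, nobody uses the interface pointer */
static void model_disconnect(struct model_skel *dev)
{
  pthread_mutex_lock(&dev->io_mutex);
  dev->interface = NULL;
  pthread_mutex_unlock(&dev->io_mutex);
}

/* Any operation that touches the interface */
static int model_do_io(struct model_skel *dev)
{
  int retval = 0;

  pthread_mutex_lock(&dev->io_mutex);
  if (!dev->interface) {
    retval = -ENODEV;
    goto out;
  }

  /* The interface pointer is safe to use for as long as the mutex
   * is held, because model_disconnect() can't get past its lock. */
  dev->io_count++;

out:
  pthread_mutex_unlock(&dev->io_mutex);
  return retval;
}
```

Since model_disconnect() waits for the mutex, it can’t return in the middle of an I/O operation, and any operation that starts afterwards bails out with -ENODEV.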

The soft_unbind flag

This default behavior makes a lot of sense when the device is physically unplugged, but it also applies when the driver is about to be unloaded (e.g. with rmmod). In the latter case, there is no need to cut off communication abruptly, and sometimes it’s desired to wrap up cleanly with some final URBs. To facilitate that, the driver can set the soft_unbind flag, which means “if set to 1, the USB core will not kill URBs and disable endpoints before calling the driver’s disconnect method”. When this flag is set, it’s the driver’s responsibility to make sure there are no outstanding URBs when the disconnect method returns as well as when the probe method returns with error, and that none are queued later on. Or even stricter, there must be no outstanding URBs when dev->interface is nullified before returning. But that’s it. There are no other implications.

It’s worth saying this again: If the probe method returns with error, the USB framework normally kills all outstanding URBs. But it won’t do that if soft_unbind is set (as of kernel v5.12). The fix for this should have been that the framework kills all outstanding URBs after probe returns with an error (as well as when disconnect returns, actually) because any proper driver should have done that anyhow. I would submit a patch, but last time I did that the relevant maintainers played silly (and time consuming) games with me, so I made sure my own driver gets this right and called it a day.

The soft_unbind flag affects the behavior of usb_unbind_interface() (in drivers/usb/core/driver.c), which sets intf->condition to USB_INTERFACE_UNBINDING and then checks soft_unbind. If false, it calls usb_disable_interface() to terminate all URBs before calling the disconnect method. This ensures no new URBs are queued and the old ones are completed. So once again, it boils down to whether the USB framework kills the URBs before calling the disconnect method, or the driver does the same before returning from it.

usb-skeleton.c is misleading in this matter (as of kernel v5.12): skel_disconnect() calls usb_kill_urb() and usb_kill_anchored_urbs() even though soft_unbind isn’t set. Hence there are no URBs to kill by the time these calls are made, and they do nothing. It’s likewise questionable if setting dev->disconnected to prevent I/O from starting is necessary, but I haven’t dived into that issue.

usbpiper: A single-threaded /dev/cuse and libusb-based endpoint to device file translator

eli — Fri, 28 Feb 2020 15:59:35 +0000

Introduction

Based upon CUSE, libusb and the kernel’s epoll capability, this is a single-threaded utility which generates one /dev/usbpiper_* device file for each bulk / interrupt endpoint on a USB device. For example, /dev/usbpiper_bulk_in_01 and /dev/usbpiper_bulk_out_03.

It’s an unfinished project, that was stopped before a lot of obvious tasks in the TODO list were done. This is why several parameters are hardcoded and some memory allocations aren’t freed. Plus several other implications listed below.

It’s available at Github: https://github.com/billauer/usbpiper

I eventually went for a good old kernel driver instead. This post explains why, and you probably want to read it if you have plans on this utility or want to use FUSE or CUSE otherwise. That post also explains why I went right on to /dev/cuse rather than using libfuse.

Nevertheless, the project may very well be useful for development of USB projects, as a boilerplate or a getting-started utility. It also shows how to implement epoll-based asynchronous USB transfers, as well as implementing a CUSE-based device file driver in userspace, implementing the protocol of /dev/cuse directly (i.e. without relying on libfuse). And all this as a single-threaded program.

But what was the utility meant to do in the first place?

The underlying idea is simple: With a single-threaded userspace program, create a plain character device for each BULK (or INTERRUPT) endpoint that is found on a selected USB device, and allow data to be sent to each OUT endpoint by opening a device file and just writing data to it. With “cat” for example. And the other way around, read data from each IN endpoint by reading data from another device file. This description is simplistic, however it may work quite well when working on a USB device project. Just be sure to read the details below on setting up usbpiper. Doing that pretty much covers the necessary gory details.

What usbpiper definitely isn’t: It’s NOT a user-space driver for XillyUSB (a generic FPGA IP core for SuperSpeed USB 3.0, based upon the FPGA’s Gigabit transceivers). XillyUSB requires a dedicated driver, which implements a specific protocol with the IP core.

Confusing usbpiper with XillyUSB’s driver is easy, because both share the idea of plain device files for I/O with a USB device. In fact, usbpiper started off as a user-space driver for XillyUSB, but never got to the point of covering XillyUSB’s protocol.

Another possible source of confusion is usbfs. It’s a USB filesystem, so what is there to add? So yes, usbfs is used by libusb to allow a low-level driver for a USB device to be written in user space (usbpiper uses this interface, of course). But it doesn’t allow simple access to the data.

It’s recommended to look at this post on the protocol with /dev/cuse before diving into this one.

What works and in what ways it’s unfinished

usbpiper is executed with no arguments. It takes control of the selected USB device’s interface (which one is explained below) and creates a /dev/usbpiper_* device file for each bulk or interrupt endpoint that it finds. The file’s name reflects the endpoint’s number, direction and bulk vs. interrupt.

It has however only been tested on bulk endpoints. Interrupt endpoints may work, but have not been tested, and isochronous endpoints are ignored. Also, usbpiper doesn’t free memory properly, in particular not buffers and other memory-consuming stuff related to libusb.

Several parameters would normally be set through command-line parameters, but they are hardcoded.

The verbosity level can be set by editing some defines in usbpiper.h. In particular, a lot of messages are reduced by replacing

#define DEBUG(...) { fprintf(stderr, __VA_ARGS__); }

with

#define DEBUG(...)

In usbpiper.c, max_size defines the largest number of bytes that can be handled in a CUSE READ or WRITE request.

In usb.c, the following parameters are hardcoded:

FIFOSIZE: The effective number of bytes in the FIFO between the CUSE and USB units. The actual FIFO size for OUT endpoints is larger by max_size, for reasons explained in the “Basic data flow principle” section below.
vendorID and prodID define the device to be targeted. Note that the find_device() function in usb.c explicitly finds the device from the list of devices on the bus, so it can be altered to select the device based upon other criteria.
int_idx and alt_idx are the Interface and Alternate Setting indexes for selection on the device. More on this issue below.
td_bufsize is the size of the buffer that goes with each transfer. Set to 64 kiB, which is probably overkill for most devices, but reasonable for proper bandwidth utilization with SuperSpeed devices. Also see below why it should be large when working with just some device.
numtd: The maximal number of outstanding transfers for each endpoint. A large number is good for high-bandwidth applications (with SuperSpeed) since it gives the hardware controller several transfers in a row before software intervention is required. Make it too big, and libusb_submit_transfer() may fail (the controller got more than it could accept).

Features that were meant to be added, but I never got there:

Array size of epoll should be dynamic (number of held file descriptors). Currently it’s ARRAYSIZE in usbpiper.c.
A file was supposed to be bidirectional. Makes no sense in this usage scenario, and bidirectional was never tested.
Non-blocking OPEN not supported
Was intended to support USB hotplugging
Adaption to XillyUSB’s protocol

USB Transfers and why you should care about them

There is a good reason why there isn’t any pipe-like plain device file interface for any USB device by default: usbpiper overlooks several details in the communication of a USB device.

The most important issue is that USB communication is divided into transfers, and is generally not treated as a continuous stream of data. The underlying model in the USB spec is that the host software initiates a transfer of a given number of bytes (in or out), the USB framework carries it out against the device, and then informs the software that it has been finished. The USB spec’s authors seem to have thought that the mainline usage of the USB bus would be done with a function call saying something like “send this packet of data to the device”. Or another function saying “receive X bytes from the device”, which returns with a pointer to the data buffer.

The USB framework supports asynchronous transfers, of course, but that doesn’t change the notion that the host’s software explicitly requests each transfer with a given number of bytes. All communication is cut into packet-like chunks of data with clear boundaries. The device is allowed to deviate from the host’s transfer requests only in one way: On IN endpoints, it’s allowed to terminate a transfer with fewer bytes than the host expected, and this is not considered an error.

However generally speaking, any software that communicates with a device directly (i.e. a device driver) is expected to know when the device expects transfers and of what size. usbpiper ignores this completely. Therefore, it may very well not work properly with just any device. This is less of an issue if the device is developed along with using usbpiper.

The three points to note are hence:

usbpiper sets the byte count of OUT transfers according to the momentary buffer fill, up to a certain limit (td_bufsize). If the device expects a certain number of bytes in the transfer (which is legit) or the transfers are longer than it can take, things will break, of course. A device may also be sensitive to transfer boundaries, which usbpiper pays no attention to. If the device expects a fixed length for all transfers, this issue can be worked around by modifying try_queue_bulkout() to never send a partially filled transfer, and to set the desired length instead of td_bufsize.
usbpiper sets td_bufsize as the length of IN transfers, however the host doesn’t inform the device on how long the transfer is expected to be. The device driver is supposed to know the maximal length of an IN transfer that the device will respond with, and prepare a buffer long enough. Otherwise, a babbling error results (libusb returns LIBUSB_ERROR_OVERFLOW). td_bufsize is set to 64 kiB which is unlikely to be exceeded by USB devices — but this isn’t guaranteed.
Another issue with IN endpoints is that the information on where the boundaries of the transfers were is lost: usbpiper just copies the data into a FIFO, which is read continuously on the other side. If the protocol of an IN endpoint relies on the driver knowing where a transfer started, usbpiper won’t be useful. This can be the case if the transfers are packets with a header, but without a data length field. Such a protocol makes sense against a driver that receives the transfers directly.
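The first point, on how the length of an OUT transfer is chosen, boils down to something like the sketch below. This is not the actual code of try_queue_bulkout(), only the rule as described above; both function names are made up. The second function shows the suggested workaround for a device that insists on fixed-length transfers.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the next OUT transfer, given the FIFO's momentary fill:
 * Whatever is available, up to td_bufsize. Zero means that there is
 * nothing to send. */
static size_t out_transfer_len(size_t fifo_fill, size_t td_bufsize)
{
  return fifo_fill < td_bufsize ? fifo_fill : td_bufsize;
}

/* Workaround variant for a device that expects a fixed length:
 * Never send a partially filled transfer. Zero means "wait for
 * more data". */
static size_t out_transfer_len_fixed(size_t fifo_fill, size_t fixed_len)
{
  assert(fixed_len > 0);
  return fifo_fill >= fixed_len ? fixed_len : 0;
}
```

Note that with the fixed-length variant, data that doesn’t fill up a complete transfer remains in the FIFO until more data is written to the device file.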

Interfaces and alternate settings

A USB device may present several interfaces, and each interface may have alternate settings. This isn’t a gory technical detail, but can be the difference between getting your device working with usbpiper or not, in particular if it’s not something you designed yourself.

Even though a device is assigned an address on the USB bus, any USB driver claims the control of an interface of that device. In other words, it’s perfectly normal that several, possibly independent drivers control a single physical device. For example, a keyboard / mouse combo device, or a sound card with MIDI and joystick interfaces (not so common today). Or a scanner / printer, which also acts as a card reader:

$ usb-devices
T:  Bus=01 Lev=03 Prnt=44 Port=03 Cnt=01 Dev#= 45 Spd=480 MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=03f0 ProdID=7a11 Rev=01.00
S:  Manufacturer=HP
S:  Product=Photosmart B109a-m
S:  SerialNumber=MY5687428T02D2
C:  #Ifs= 4 Cfg#= 1 Atr=c0 MxPwr=2mA
I:  If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=cc Prot=00 Driver=(none)
I:  If#= 1 Alt= 0 #EPs= 2 Cls=07(print) Sub=01 Prot=02 Driver=usblp
I:  If#= 2 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none)
I:  If#= 3 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=usb-storage

Note that the device effectively behaves like two independent devices: A scanner / printer and a USB disk.

It’s therefore important to not just set the Vendor / Product IDs correctly, but also the interface. usb-devices and lsusb -vv may help making the correct selection.

Alternate setting is less common, but a single interface may have different usage modes. If present, this must be set correctly as well.

Basic data flow principle

The purpose of the utility is to move data from a USB endpoint to a CUSE device file or vice versa. To accomplish this, there is a plain RAM FIFO allocated for each such data stream.

For an IN endpoint, the USB frontend queues asynchronous transfer requests using libusb. For each IN transfer that is finished, the data is copied into the relevant FIFO. On the FIFO’s other side, the read() calls on the device file (i.e. CUSE READ requests) are fulfilled, as necessary, by submitting data that is fetched from the FIFO. Overflow of the FIFO is prevented by queuing IN transfer requests only when there’s enough room in the FIFO to accept the data that all outstanding requests may carry, if they all return with a full buffer. Underflow is not an issue, but the read() call isn’t completed if there is no data to submit, in which case read() blocks.
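The overflow prevention rule for IN endpoints can be written as a simple condition. The function below is a sketch of the rule as described, not a quote from usbpiper’s sources, and its name and arguments are made up: A new transfer is queued only if the FIFO can take a full buffer from each of the already outstanding transfers, plus one from the new transfer.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if it's safe to queue one more IN transfer:
 * Even if all outstanding transfers and the new one return with a
 * full buffer of td_bufsize bytes, the FIFO won't overflow. */
static int may_queue_in_transfer(size_t fifo_size, size_t fifo_fill,
                                 unsigned int outstanding,
                                 size_t td_bufsize)
{
  size_t vacant;

  assert(fifo_fill <= fifo_size);
  vacant = fifo_size - fifo_fill;

  return vacant >= ((size_t)outstanding + 1) * td_bufsize;
}
```

For example, with a 256 kiB FIFO and 64 kiB buffers, an empty FIFO allows four outstanding transfers, and a FIFO that already holds 64 kiB allows three.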

For an OUT endpoint, the handler of a write() call (i.e. CUSE WRITE requests) copies the data into the relevant FIFO. As a result of the FIFO containing data, the USB frontend may queue new OUT transfers with the data available. It may also not do so, in particular if the number of outstanding transfers has already reached the maximum allowed. The FIFO is protected from overflow by blocking the write() call until there is enough room in the FIFO. The exact condition relates to the fact that the length of the data buffer of each CUSE WRITE request is limited by a number (max_size in the code) that is set during CUSE initialization. A WRITE request is hence not completed (which prevents another one from arriving) until there is room for max_size additional bytes in the FIFO, after writing the current request’s data to the FIFO. This ensures that the usbpiper process always has room for the data of the next request, and doesn’t need to block, which it’s not allowed to, being a single-threaded utility.

The requirement of always having max_size bytes vacant in the FIFO gets slightly trickier when a WRITE request is interrupted (i.e. an INTERRUPT request arrives on its behalf). This forces usbpiper to complete the request immediately. In order to keep the requirement on the FIFO, usbpiper possibly unwinds the FIFO, throwing away data so that at least max_size bytes are vacant in the FIFO. This doesn’t break the data stream’s integrity or continuity, because the write() call returns with the number of bytes actually written (or -EINTR, if none). If the FIFO was unwound, the number of bytes that were discarded is subtracted from write()’s return value, giving the caller of write() the correct picture of how much data was consumed.
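The arithmetic of unwinding can be sketched as follows. This is an illustration under my own assumptions on how the FIFO’s state is kept, and the function name is hypothetical. It’s not taken from usbpiper’s sources.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: A WRITE request that has copied "written" bytes
 * into the FIFO is interrupted. fifo_size and *fifo_fill describe the
 * FIFO after that copy, and max_size is the limit on the length of a
 * single WRITE request. Returns what write() should return, and
 * updates *fifo_fill to reflect the discarded data.
 */
long complete_interrupted_write(size_t fifo_size, size_t *fifo_fill,
                                size_t written, size_t max_size)
{
  size_t limit, discard = 0;

  assert(max_size <= fifo_size);
  assert(written <= *fifo_fill && *fifo_fill <= fifo_size);

  limit = fifo_size - max_size; /* Highest fill that leaves enough room */

  if (*fifo_fill > limit)
    discard = *fifo_fill - limit;

  /* Never discard more than this request wrote. This can't happen
     if the rule was kept when the previous request was completed. */
  if (discard > written)
    discard = written;

  *fifo_fill -= discard;
  written -= discard;

  return written ? (long) written : -EINTR;
}
```

For example, with a 1024-byte FIFO, max_size = 128 and a fill of 1000 bytes after a 300-byte write, 104 bytes are thrown away and write() returns 196.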

Execution flow

Recall from above that usbpiper doesn’t rely on libfuse, but rather communicates with the CUSE framework directly through /dev/cuse.

As the utility’s single thread needs to divide attention between the USB tasks and those related to CUSE, a single epoll file descriptor is allocated for all open /dev/cuse files as well as the file descriptors supplied by the libusb framework. An epoll_wait() event loop is implemented in usbpiper.c: Each entry in the epoll_event array contains a pointer to a small structure, which contains a function to call and a pointer to private data to pass to that function.
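This kind of dispatcher can be sketched like this. The struct and function names are my own for this example, and usbpiper.c surely differs in the details. The sample handler just drains the file descriptor, where the real ones would handle CUSE requests or libusb events.

```c
#include <assert.h>
#include <unistd.h>
#include <sys/epoll.h>

/* Hypothetical handler record; epoll_event.data.ptr points at one */
struct fd_handler {
  void (*func)(void *priv); /* Function to call when the fd is ready */
  void *priv;               /* Private data to pass to func */
};

/* Register fd for read events, with h as its handler. 0 on success. */
int watch_fd(int epfd, int fd, struct fd_handler *h)
{
  struct epoll_event ev;

  assert(h && h->func);

  ev.events = EPOLLIN;
  ev.data.ptr = h;

  return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/*
 * One round of the event loop: Wait up to timeout_ms, and call the
 * handler of each ready fd. Returns the number of handlers called,
 * or -1 on error.
 */
int dispatch_once(int epfd, int timeout_ms)
{
  struct epoll_event events[16];
  int i, n;

  n = epoll_wait(epfd, events, 16, timeout_ms);

  for (i = 0; i < n; i++) {
    struct fd_handler *h = events[i].data.ptr;
    h->func(h->priv);
  }

  return n;
}

/* Sample private data and handler: Reads from the fd, counts the bytes */
struct drain_ctx {
  int fd;
  long total;
};

void drain_fd(void *priv)
{
  struct drain_ctx *ctx = priv;
  char buf[256];
  ssize_t rc = read(ctx->fd, buf, sizeof(buf));

  if (rc > 0)
    ctx->total += rc;
}
```

The main loop is then nothing but calling dispatch_once() with a timeout of -1 repeatedly, on a file descriptor obtained from epoll_create1(0).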

The communication protocol with /dev/cuse is discussed in another post. For the purpose of the current topic, the CUSE kernel framework creates a device file in /dev/ each time /dev/cuse is opened and a simple read-write handshake is completed. After this, for each operation on the related device file (e.g. open(), read(), write() etc.) a request packet is passed to the server (i.e. usbpiper in this case) by virtue of read() calls on the /dev/cuse file handle. The operation blocks until the server responds by writing a buffer to the same file handle, which contains a status header and possibly data. Responses to requests are not necessarily written in the same order as the requests. A unique ID number in the said status header ensures the pairing between requests and their responses.

read() calls from /dev/cuse block when there’s nothing to do, and are therefore subject to epoll in usbpiper. write() calls never block.

However this is not enough: For example, an epoll entry may indicate a new WRITE request on a CUSE file descriptor, which fills one of the FIFOs with data. As a result, there might be a new opportunity to queue new USB transfers. There are many software design approaches for how to make one action trigger others — the one taken in usbpiper is the simplest and messiest: Letting the performer of the action call the functions that may benefit from the opportunity directly. In the given example, this means that process_write() calls try_queue_bulkout() directly. The latter calls try_complete_write() in turn.

The function nomenclature in this utility is consistent in that several functions have a try_*() prefix to mark that they are opportunity oriented. It would have been equally functional, cleaner and more elegant (however less efficient) to call all try_*() functions on behalf of all endpoints and device files. Or alternatively, maintain some queue of try_*() function calls, however this wouldn’t take away the need for awareness of which actions may open what opportunity.

Delays and timeouts

There are a couple of situations where a timer is required. A timerfd is allocated for each device file, serving the following two scenarios:

Related to IN endpoints: When a READ request can’t be completed with the full number of bytes that are required, usbpiper waits up to 10 ms for data from the IN endpoint to fill the relevant FIFO. After this timeout, try_complete_read() completes the request as soon as there is any data in the FIFO. The rationale is to avoid a flood of READ requests and responses if the data arrives frequently and in small chunks.
Related to OUT endpoints: When a RELEASE request arrives, and there is still data in the relevant FIFO, try_complete_release() waits up to 1000 ms for the FIFO to be drained by the OUT endpoint. After this, try_complete_release() completes the request, hence closing the related device file (not /dev/cuse) after emptying the FIFO.

A single timer can be used for both tasks, because a RELEASE can’t occur before all outstanding requests have been completed on the related device file (Linux’ device file API ensures that). Besides, each device file can be related only to either an IN or OUT endpoint, so once again, the timer won’t be necessary for both uses at the same time.

A similar 10 ms timeout could have been implemented for OUT endpoints, i.e. generate an OUT transfer only if the FIFO contains enough data for a full transfer buffer. This wouldn’t require another timer, for the first reason given above. However this possibility was dropped in favor of another mechanism for preventing unnecessary I/O: try_queue_bulkout() submits a transfer with less than a full buffer only if there is no other outstanding transfer on the same endpoint. The reason for opting out of the 10 ms timer for this purpose has to do with the original purpose of usbpiper, as a driver for XillyUSB (which didn’t materialize).
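The decision rule for OUT transfers, as described above, can be written down as a small function. This is a sketch of the rule only: The function and parameter names are illustrative, and the actual try_queue_bulkout() surely looks different.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the number of bytes to submit in a new OUT transfer, or 0 if
 * no transfer should be queued now. All names are illustrative.
 */
size_t bulkout_bytes_to_queue(size_t fifo_fill, size_t xfer_size,
                              unsigned int outstanding,
                              unsigned int max_outstanding)
{
  assert(xfer_size > 0);

  if (outstanding >= max_outstanding)
    return 0; /* No free transfer slot */

  if (fifo_fill >= xfer_size)
    return xfer_size; /* A full buffer is always worth sending */

  if (outstanding == 0)
    return fifo_fill; /* Partial buffer, but the endpoint is idle */

  return 0; /* Partial buffer: Wait for more data or for a completion */
}
```

Since the completion of an outstanding transfer is an opportunity to call this again, data never gets stuck in the FIFO, and no timer is needed.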

Recovering from a BULK IN overflow on USB 3.0

eli — Sat, 07 Dec 2019 19:02:11 +0000

Introduction

At times, an attempt to get data from a BULK IN endpoint may result in an overflow error. In other words,

rc = libusb_bulk_transfer(dev_handle, (1 | LIBUSB_ENDPOINT_IN),
                          buf, bufsize, &count, (unsigned int) 1000);

may fail with rc having the value of LIBUSB_ERROR_OVERFLOW. Subsequent attempts to access the same endpoint, even after re-initializing the libusb interface, result in rc = LIBUSB_ERROR_IO, which is just “the I/O failed”. Terminating and rerunning the program doesn’t help. Apparently, the only thing that gets the endpoint out of this condition is physically unplugging the device and plugging it back into the computer (or hub).

Why this happens

The term “transfer” in the “libusb_bulk_transfer” function name refers to a USB transfer cycle. In essence, calling this function ends up with a request to the xHCI hardware controller to receive a given number of bytes (“bufsize” above) from the endpoint, and to write them into a buffer (“buf” above). The host controller starts requesting DATA packets from the device, and fills the buffer with the data in those packets. Note that the communication isn’t a continuum of data packets. Rather, it’s a session which fulfills a transfer request. Someone who wrote the USB spec probably thought of the data transmission in terms of a function call of the libusb_bulk_transfer sort: A function is called to request some data, communication takes place, the function returns. This principle holds for all USB versions.

Here’s the crux, and it’s related to the hardware protocol, namely with data packets: The host controller can give the device go-ahead to transmit packets, but it can’t control the number of bytes in each packet. This is completely up to the device. The rule is that if the number of bytes in a packet is less than the maximum allowed for the specific USB version (1024 bytes for USB 3.0), it’s considered a “short packet”. When a short packet arrives at the host controller, it should conclude that the transfer is done (including the data in this last packet).

So the device may very well supply less data than requested for the transfer. This is perfectly normal, and this is why it’s mandatory to check how many bytes have actually arrived when the function returns.

But what if the transfer request was for 10 bytes, and the device sent 20?

An overflow on USB 3.0

On USB 3.0, the xHCI controller requests data from the device indirectly by issuing an ACK Transaction Packet to the device. It may or may not acknowledge packets it has already acknowledged, but this isn’t the point at the moment. Right now, it’s more interesting that all ACK packets also carry the number of DATA packets that the host is ready to receive at the moment that the ACK packet was formed, in a field called NumP. This is how the host controls the flow of data.

When there’s no data flowing, it can be because the device has no data to send, but also because the last ACK for the relevant endpoint was a packet with NumP=0. To resume the flow, the host then issues an ACK packet with a non-zero NumP, giving the go-ahead to transmit more data.

So a packet capture can look like this for a transfer request of 1500 bytes (the first column is time in microseconds):

       0.000 ACK  seq=0 nump=2
       0.040 DATA seq=0 len=1024
       0.032 DATA seq=1 len=1024
       2.832 ACK  seq=1 nump=1
       2.104 ACK  seq=2 nump=0

Note that the device sent 1024 bytes twice, exceeding the 1500 bytes requested. This causes an overflow error. So far so good. But what about all those LIBUSB_ERROR_IO afterwards?
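The way a transfer ends can be modeled with a few lines of C. This is a simplified model for illustration (not how any host controller is actually implemented), with names of my own choice.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PACKET 1024 /* Maximal DATA packet payload, USB 3.0 bulk */

enum xfer_result { XFER_INCOMPLETE, XFER_DONE, XFER_OVERFLOW };

/*
 * Model of how a transfer request for "requested" bytes ends, given the
 * lengths of the DATA packets sent by the device. *received is set to
 * the number of bytes that went into the buffer.
 */
enum xfer_result run_transfer(size_t requested, const size_t *pkt_len,
                              size_t n_pkts, size_t *received)
{
  size_t i;

  *received = 0;

  for (i = 0; i < n_pkts; i++) {
    assert(pkt_len[i] <= MAX_PACKET);

    if (pkt_len[i] > requested - *received)
      return XFER_OVERFLOW; /* Device sent more than was asked for */

    *received += pkt_len[i];

    if (pkt_len[i] < MAX_PACKET || *received == requested)
      return XFER_DONE; /* Short packet, or the buffer is full */
  }

  return XFER_INCOMPLETE; /* More packets are expected */
}
```

With requested = 1500 and two packets of 1024 bytes, as in the capture above, the second packet doesn’t fit into the 476 bytes that are left, hence the overflow.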

So this is the time to pay attention to the “seq” numbers. All DATA packets carry this sequence number, which runs cyclically from 0 to 31. The ACK’s sequence number is the one that the host expects next. Always. For example, if the host requests a retransmission, it repeats the sequence number of the DATA packet it failed to receive properly, basically saying, “I expect this packet again” (and then ignores DATA packets that follow until the retransmission).

Now, this is what appears on the bus as a result of a libusb_bulk_transfer() call after the one that returned with an overflow condition:

21286605.320 ACK  seq=0 nump=2

This is not a mistake: The sequence number is zero. Note that the ACK that closed the previous sequence with a nump=0 had seq=2. Hence the ACK that follows it to re-initiate the data flow should have seq=2 as well. But it’s zero. In section 8.11.1, the USB 3.0 spec says that an ACK is considered invalid if it doesn’t have the expected sequence number, and that it should be ignored in this case. So the device ignores this ACK and sends no DATA packet in response. The host times out on tDPResponse (400 ns per USB 3.0 spec) and reports a LIBUSB_ERROR_IO. So the forensic explanation is established. Now to the motive.
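To make the sequence number mismatch concrete, here’s a toy model of the device’s side. It’s a deliberate simplification: Retransmissions and NumP are left out, and the names are mine, not the spec’s.

```c
#include <assert.h>

#define SEQ_MODULO 32 /* Sequence numbers run cyclically from 0 to 31 */

/* The sequence number that follows seq */
unsigned int next_seq(unsigned int seq)
{
  assert(seq < SEQ_MODULO);
  return (seq + 1) % SEQ_MODULO;
}

/* Device-side state of an IN endpoint (illustrative model) */
struct in_endpoint {
  unsigned int send_seq; /* Sequence number of the next new DATA packet */
};

/* The device sends a new DATA packet. Returns its sequence number. */
unsigned int send_data(struct in_endpoint *ep)
{
  unsigned int seq = ep->send_seq;

  ep->send_seq = next_seq(seq);
  return seq;
}

/*
 * An ACK with a nonzero NumP arrives while the flow is stopped. Returns
 * 1 if the device accepts it as a go-ahead to send DATA packets, and 0
 * if the ACK is ignored because of an unexpected sequence number.
 */
int ack_resumes_flow(const struct in_endpoint *ep, unsigned int ack_seq)
{
  return ack_seq == ep->send_seq;
}
```

After the two DATA packets with seq=0 and seq=1, an ACK with seq=2 would resume the flow in this model, but the ACK with seq=0 that actually appeared on the bus is ignored.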

Handling a babble error

The resetting of the sequence number has been observed with more than one xHCI controller (an Intel chipset as well as Renesas’ uPD720202), so it’s not a bug.

This brings us to the xHCI spec, revision X, section 4.10.2.4: “When a device transmits more data on the USB than the host controller is expecting for a transaction, it is defined to be babbling. In general, this is called a Babble Error. When a device sends more data than the TD transfer size, … the host controller shall set the Babble Detected Error in the Completion Code field of the TRB, generate an Error Event, and halt the endpoint (refer to section 4.10.2.1).”

So the first thing to note is that the endpoint wasn’t halted after the overflow. In fact, there was no significant traffic at all. Quite interestingly, Linux’ host controller didn’t fulfill its duty in this respect.

But still, why was the sequence number reset? Section 8.12.1.2 in the USB 3.0 spec sheds some light: “The host expects the first DP to have a sequence number set to zero when it starts the first transfer from an endpoint after the endpoint has been initialized (via a Set Configuration, Set Interface, or a ClearFeature (ENDPOINT_HALT) command)”.

So had the endpoint been halted, as it’s required per xHCI spec, it would just have returned STALL packets until it was taken out of this condition. At which point the sequence number should be reset to zero per USB spec.

So apparently, whoever designed the xHCI hardware controller assumed that no meaningful communication would take place after the overflow (babble) error, and that the endpoint must be halted anyhow, so reset the sequence number and be done with it. It’s easier doing it this way than detecting the SETUP sequence that clears the ENDPOINT HALT feature.

Given that the xHCI driver doesn’t halt the endpoint, it’s up to the application software to do it.

Halt and unhalt

This is sample libusb-based C code for halting and then unhalting BULK IN endpoint 1.

      if (rc == LIBUSB_ERROR_OVERFLOW) {
	rc = libusb_control_transfer(dev_handle,
				     0x02, // bmRequestType, endpoint
				     0x03, // bRequest = SET_FEATURE
				     0x00, // wValue = ENDPOINT_HALT
				     (1 | LIBUSB_ENDPOINT_IN), // wIndex = ep
				     NULL, // Data (no data)
				     0, // wLength = 0
				     100); // Timeout, ms

	if (rc) {
	  print_usberr(rc, "Failed to halt endpoint");
	  break;
	}

	rc = libusb_control_transfer(dev_handle,
				     0x02, // bmRequestType, endpoint
				     0x01, // bRequest = CLEAR_FEATURE
				     0x00, // wValue = ENDPOINT_HALT
				     (1 | LIBUSB_ENDPOINT_IN), // wIndex = ep
				     NULL, // Data (no data)
				     0, // wLength = 0
				     100); // Timeout, ms

	if (rc) {
	  print_usberr(rc, "Failed to unhalt endpoint");
	  break;
	}

	continue;
      }

The second control transfer can be replaced with

	rc = libusb_clear_halt(dev_handle, (1 | LIBUSB_ENDPOINT_IN));

however there is no API function for halting the endpoint, so one might as well do them both with control transfers.
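As the two control transfers differ only in bRequest, it makes sense to generate their parameters in one place. This is a sketch of such a helper, which deliberately doesn’t depend on libusb’s headers: The struct, the macros and the function name are made up for this example, and only the numeric values come from the USB spec.

```c
#include <assert.h>
#include <stdint.h>

/* Field values for standard requests, as defined in the USB spec */
#define REQTYPE_OUT_STD_ENDPOINT 0x02 /* Host to device, standard, endpoint */
#define REQ_CLEAR_FEATURE        0x01
#define REQ_SET_FEATURE          0x03
#define FEATURE_ENDPOINT_HALT    0x00
#define DIR_IN                   0x80 /* Same value as LIBUSB_ENDPOINT_IN */

/* Hypothetical container for the parameters of a control transfer */
struct halt_request {
  uint8_t bmRequestType;
  uint8_t bRequest;
  uint16_t wValue;
  uint16_t wIndex;
  uint16_t wLength;
};

/*
 * Fill in the request for halting (halt != 0) or unhalting (halt == 0)
 * an endpoint. ep_num is the endpoint number (1-15), and is_in is
 * nonzero for an IN endpoint.
 */
struct halt_request make_halt_request(unsigned int ep_num, int is_in,
                                      int halt)
{
  struct halt_request r;

  assert(ep_num >= 1 && ep_num <= 15);

  r.bmRequestType = REQTYPE_OUT_STD_ENDPOINT;
  r.bRequest = halt ? REQ_SET_FEATURE : REQ_CLEAR_FEATURE;
  r.wValue = FEATURE_ENDPOINT_HALT;
  r.wIndex = (uint16_t) (ep_num | (is_in ? DIR_IN : 0));
  r.wLength = 0; /* No data stage */

  return r;
}
```

The fields are then passed to libusb_control_transfer() in the order they appear in the struct, with NULL as the data pointer between wIndex and wLength, and the timeout as the last argument.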

Resetting the entire device

For those who like the big hammer, it’s possible to reset the device completely. This is one of the conditions for resetting the sequence numbers on all endpoints, so there’s no room for confusion.

      if (rc == LIBUSB_ERROR_OVERFLOW) {
	rc = libusb_reset_device(dev_handle);

	if (rc) {
	  print_usberr(rc, "Failed to reset device");
	  break;
	}
	continue;
      }

This causes a Hot Reset, which is an invocation of Recovery with the Hot Reset bit set, followed by a return to U0. This in itself typically takes ~150 μs. However as a result of this call, the device is reconfigured: Its descriptors are read and configuration commands are sent to it. It keeps its bus number, and the entire process takes about 100 ms. This rather extensive sequence of actions is hidden in this simple function call.

Also, a line appears in the kernel log, e.g.:

usb 4-1: reset SuperSpeed USB device number 3 using xhci_hcd

So all in all, this is a noisy overkill, and is not recommended. It’s given here mainly because this is probably how some people eventually resolve this kind of problem.

systemd: Reacting to USB NIC hotplugging (post-up scripting)

eli — Mon, 11 Nov 2019 17:17:57 +0000

The problem

Using Linux Mint 19, I have a network device that needs DHCP address allocation connected to a USB network dongle. When I plugged the dongle in, the network interface appeared, but the DHCP daemon ignored eth2 (the assigned network device name) and didn’t respond to the DHCP discovery packets that arrived through it. Restarting the DHCP server well after plugging in the USB network card solved the issue.

I should mention that I use a vintage DHCP server for one reason or another (not necessarily a good one). There’s a good chance that a systemd-aware DHCP daemon will resynchronize itself following a network hotplug event. It’s evident that avahi-daemon, hostapd, systemd-timesyncd and vmnet-natd trigger some activity as a result of the new network device.

Most notable is systemd-timesyncd, which goes

Nov 11 11:25:59 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.

twice, once when the new device appears, and a second time when it is configured. See sample kernel log below.

It’s not clear to me how these daemons get their notification on the new network device. I could have dug deeper into this, but ended up with a rather ugly solution. I’m sure this can be done better, but I’ve wasted enough time on this — please comment below if you know how.

Setting up a systemd service

The very systemd way to run a script when a networking device appears is to add a service. Namely, add this file as /etc/systemd/system/eth2-up.service:

[Unit]
Description=Restart dhcp when eth2 is up

[Service]
ExecStart=/bin/sleep 10 ; /bin/systemctl restart my-dhcpd
Type=oneshot

[Install]
WantedBy=sys-subsystem-net-devices-eth2.device

And then activate the service:

# systemctl daemon-reload
# systemctl enable eth2-up

The concept is simple: A one-shot service depends on the relevant device. When it’s up, what’s on ExecStart is run, the DHCP server is restarted, end of story.

I promised ugly, didn’t I: Note the 10 second sleep before kicking off the daemon restart. This is required because the service is launched when the networking device appears, and not when it’s fully configured. So starting the DHCP daemon right away misses the point (or simply put: It doesn’t work).

I guess the DHCP daemon will be restarted one extra time on boot due to this extra service. In that sense, the 10 seconds delay is possibly better than restarting it soon after, or while, it’s being started by systemd in general.

So with the service activated, this is what the log looks like (the restarting of the DHCP server not included):

Nov 11 11:25:54 kernel: usb 1-12: new high-speed USB device number 125 using xhci_hcd
Nov 11 11:25:54 kernel: usb 1-12: New USB device found, idVendor=0bda, idProduct=8153
Nov 11 11:25:54 kernel: usb 1-12: New USB device strings: Mfr=1, Product=2, SerialNumber=6
Nov 11 11:25:54 kernel: usb 1-12: Product: USB 10/100/1000 LAN
Nov 11 11:25:54 kernel: usb 1-12: Manufacturer: Realtek
Nov 11 11:25:54 kernel: usb 1-12: SerialNumber: 001000001
Nov 11 11:25:55 kernel: usb 1-12: reset high-speed USB device number 125 using xhci_hcd
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001002
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001002
Nov 11 11:25:55 kernel: r8152 1-12:1.0 eth2: v1.09.9
Nov 11 11:25:55 mtp-probe[59372]: checking bus 1, device 125: "/sys/devices/pci0000:00/0000:00:14.0/usb1/1-12"
Nov 11 11:25:55 mtp-probe[59372]: bus: 1, device: 125 was not an MTP device
Nov 11 11:25:55 upowerd[2203]: unhandled action 'bind' on /sys/devices/pci0000:00/0000:00:14.0/usb1/1-12
Nov 11 11:25:55 systemd-networkd[65515]: ppp0: Link is not managed by us
Nov 11 11:25:55 systemd-networkd[65515]: vmnet8: Link is not managed by us
Nov 11 11:25:55 systemd-networkd[65515]: vmnet1: Link is not managed by us
Nov 11 11:25:55 networkd-dispatcher[1140]: WARNING:Unknown index 848 seen, reloading interface list
Nov 11 11:25:55 systemd-networkd[65515]: lo: Link is not managed by us
Nov 11 11:25:55 systemd-networkd[65515]: eth2: IPv6 successfully enabled
Nov 11 11:25:55 systemd[1]: Starting Restart dhcp when eth2 is up...
Nov 11 11:25:55 kernel: IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 vmnetBridge[1620]: Adding interface eth2 index:848
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001003
Nov 11 11:25:55 vmnetBridge[1620]: Removing interface eth2 index:848
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001003
Nov 11 11:25:55 upowerd[2203]: unhandled action 'bind' on /sys/devices/pci0000:00/0000:00:14.0/usb1/1-12/1-12:1.0
Nov 11 11:25:55 kernel: IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
Nov 11 11:25:55 systemd-timesyncd[1101]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Nov 11 11:25:55 kernel: userif-3: sent link down event.
Nov 11 11:25:55 kernel: userif-3: sent link up event.
Nov 11 11:25:57 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00011043
Nov 11 11:25:57 vmnetBridge[1620]: Adding interface eth2 index:848
Nov 11 11:25:57 systemd-networkd[65515]: eth2: Gained carrier
Nov 11 11:25:57 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.
Nov 11 11:25:57 avahi-daemon[1115]: Joining mDNS multicast group on interface eth2.IPv4 with address 10.20.30.1.
Nov 11 11:25:57 avahi-daemon[1115]: New relevant interface eth2.IPv4 for mDNS.
Nov 11 11:25:57 avahi-daemon[1115]: Registering new address record for 10.20.30.1 on eth2.IPv4.
Nov 11 11:25:57 kernel: r8152 1-12:1.0 eth2: carrier on
Nov 11 11:25:57 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Nov 11 11:25:57 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00011043
Nov 11 11:25:57 vmnet-natd[1845]: RTM_NEWADDR: index:848, addr:10.20.30.1
Nov 11 11:25:57 systemd-timesyncd[1101]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Nov 11 11:25:58 kernel: userif-3: sent link down event.
Nov 11 11:25:58 kernel: userif-3: sent link up event.
Nov 11 11:25:59 avahi-daemon[1115]: Joining mDNS multicast group on interface eth2.IPv6 with address fe80::2e0:4cff:fe68:71d.
Nov 11 11:25:59 avahi-daemon[1115]: New relevant interface eth2.IPv6 for mDNS.
Nov 11 11:25:59 systemd-networkd[65515]: eth2: Gained IPv6LL
Nov 11 11:25:59 avahi-daemon[1115]: Registering new address record for fe80::2e0:4cff:fe68:71d on eth2.*.
Nov 11 11:25:59 systemd-networkd[65515]: eth2: Configured
Nov 11 11:25:59 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.
Nov 11 11:25:59 systemd-timesyncd[1101]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).

As the log above shows, there are 4 seconds between the activation of the script (“Starting Restart dhcp when eth2 is up...” at 11:25:55) and systemd-networkd’s declaration that it’s finished with the interface (“eth2: Configured” at 11:25:59).

It would have been much nicer to kick off the script where systemd-timesyncd detects the change for the second time. It would have been much wiser had WantedBy=sys-subsystem-net-devices-eth2.device meant that the target is reached when it’s actually configured. Once again, if someone has an idea, please comment below.

A udev rule instead

The truth is that I started off with a udev rule first, ran into the problem with the DHCP server being restarted too early, and tried to solve it with systemd as shown above, hoping that it would work better. The bottom line is that it’s effectively the same. So here’s the udev rule, which I kept as /etc/udev/rules.d/99-network-dongle.rules:

SUBSYSTEM=="net", ACTION=="add", KERNEL=="eth2", RUN+="/bin/sleep 10 ; /bin/systemctl restart my-dhcpd"

Note that I nail down the device by its name (eth2). It would have been nicer to do it based upon the USB device’s Vendor / Product IDs, however that failed for me: Somehow, the rule didn’t match when using SUBSYSTEM=="usb". How I ensure the repeatable eth2 name is explained in this post.

Also worth noting are these two commands, which help in putting the udev rule together:

# udevadm test -a add /sys/class/net/eth2
# udevadm info -a /sys/class/net/eth2

xhci_hcd WARN Event TRB for slot x ep y with no TDs queued

eli — Sat, 09 Nov 2019 18:57:11 +0000

What’s this?

There’s a chance that you’re reading this because the message in the title appeared in (or flooded) your kernel log. This post attempts to clarify what to do about it, depending on how much you want to get involved in the matter.

So the short answer: The said warning message is a bug related to the xHCI USB controller, but a rather harmless one: Except for the message in the kernel log, everything is fine. This was fixed in Linux kernel v4.15, so upgrading the kernel is one way out. Alternatively, patch your kernel with commit e4ec40ec4b from December 2017, which is the fix made on v4.15. Even editing away the said xhci_warn() call is fairly sensible, if the patch doesn’t apply.

Note that the xHCI controller is used for any port that is USB 3.x capable, even when a lower version USB device is connected to it (e.g. USB 2.0). So if your computer has a few USB 2.0-only ports, moving the device to such a port might be your quick fix. Look for USB sockets with black plastic instead of blue.

The rest of this post dives deeply into the whereabouts of this accident. It matters if you’re developing a USB device. Note that this post is written in increasing detail level, which makes it a bit disorganized. Apologies if it’s confusing.

Why this?

The explanation is a bit tangled, and is related to the way the xHCI driver organizes the buffers for a USB transfer.

To make a long story short(er), a software transfer request (URB in Linux) is conveyed to the hardware controller as a TD (Transfer Descriptor), which is presented to the hardware in the form of one or more Transfer Request Blocks (TRBs). How many? There’s no advantage in chopping the TD into more than one TRB for a transfer request, however there are certain constraints on a TRB. Among others, each TRB points at a chunk of the buffer in contiguous physical memory. The xHCI specification also requires in its section 6.4.1 that the data buffers shall not cross 64 kiB boundaries in physical memory.

So if the transfer request’s data buffer does cross 64 kiB boundaries, the TD is split into several TRBs, each pointing at another part of the buffer supplied in the software’s transfer request. For example, see xhci_queue_bulk_tx() in the Linux kernel’s drivers/usb/host/xhci-ring.c. This is fine, and should cause no problems.

But as it turns out, if a BULK IN transfer request is terminated by the device with a short packet, and the TD consists of more than one TRB, the xHCI hardware produces more than one notification event on this matter. This is also fine and legal, however it causes the kernel (before the patch) to issue the said warning. As already said, the warning is a bug, but otherwise there’s no problem.

So for this bug to show up, there needs to be a combination of a pre-4.15 kernel, short packets and transfer requests that have data buffers that span 64 kiB boundaries.
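Whether a given buffer is exposed to this can be checked with simple address arithmetic. The following sketch assumes a physically contiguous buffer and considers only the 64 kiB rule, so it’s a lower bound on what the xHCI driver actually does. The function names are mine.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define BOUNDARY 0x10000ULL /* 64 kiB */

/*
 * Minimal number of TRBs needed for a physically contiguous buffer of
 * len bytes starting at physical address addr, given that the data of
 * a TRB may not cross a 64 kiB boundary.
 */
unsigned int min_trbs(uint64_t addr, size_t len)
{
  uint64_t first, last;

  assert(len > 0);

  first = addr / BOUNDARY;             /* 64 kiB block of the first byte */
  last = (addr + len - 1) / BOUNDARY;  /* 64 kiB block of the last byte */

  return (unsigned int) (last - first + 1);
}

/* Returns 1 if a short packet on this buffer may trigger the warning */
int crosses_64k(uint64_t addr, size_t len)
{
  return min_trbs(addr, len) > 1;
}
```

For example, a 32-byte buffer at physical address 0xfff0 already needs two TRBs, while a full 64 kiB buffer at 0x10000 gets away with one.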

Huh? Short packets?

A “short packet” is a USB packet that is shorter than the maximal size that is allowed for the relevant USB version (that is, 512 bytes for USB 2.0 and 1024 bytes on USB 3.x).

The idea is that the software can request a transfer of a data chunk of any size (with some maximum, but surely several kilobytes). On a BULK IN endpoint, the device is in principle expected to send that amount of data, divided into packets, which all have the maximal size, except for the last one.

However it’s perfectly legal and fine that the device sends less than requested by the software in the transfer request. In that case, the device sends a packet that is shorter than the maximal packet size, and by doing so, it marks the end of the transfer. It can even be a packet with zero bytes, if the previous packet was of maximal length. This is called a short packet, and once again, it’s no reason for alarm. It just means that the software must check how much data it actually got when the transfer is completed.
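The rules on how a device divides its data into packets can be summarized in a short function. It’s a sketch for the sake of illustration, with names of my own choice, and it assumes that the device has no more data than what was requested.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of DATA packets a device sends in order to deliver len bytes
 * in response to a transfer request for "requested" bytes, with
 * max_packet being 512 (USB 2.0) or 1024 (USB 3.x) for bulk endpoints.
 * *last_len gets the length of the last packet.
 */
size_t packets_for(size_t len, size_t requested, size_t max_packet,
                   size_t *last_len)
{
  size_t full;

  assert(max_packet > 0 && requested > 0 && len <= requested);

  full = len / max_packet; /* Packets of maximal size */
  *last_len = len % max_packet;

  if (len == requested && *last_len == 0) {
    /* The request is fulfilled exactly: No short packet is needed */
    *last_len = max_packet;
    return full;
  }

  /* A short packet (possibly with zero bytes) ends the transfer */
  return full + 1;
}
```

So delivering 2048 bytes for a 4096-byte request on USB 3.x takes three packets, the last one being a zero-length packet, while the same 2048 bytes for a 2048-byte request take only two.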

Since the hardware and its driver often have some dedicated protocol to coordinate their data exchange, this short packet mechanism is often unnecessary and not used.

Avoiding these warnings

Linux kernel v4.15 was released in January 2018, so when writing this it’s quite expected that older kernels are still ubiquitous.

If short packets are expected from the device, and pre-patch kernels are expected to be in use, it’s wise to make sure that the buffers don’t cross the said boundaries. Keeping them small, surely below 4 kiB, is a good start, as larger buffers surely span more than one page. Also, different pages can be anywhere in physical memory, causing the need to divide the buffer into several TRBs. And then there’s the 64 kiB boundaries issue. However this isn’t practical in a high-bandwidth application, as a 4 kiB buffer is exhausted in about 10 μs at a rate of 400 MB/s. The software has no chance to continuously generate TDs fast enough to keep up with their completion, so the data transmission will stall repeatedly, waiting for TDs to be queued.

When writing a device driver in the kernel, it’s relatively easy to control the position of the buffer in physical memory. In particular, the kernel’s __get_free_pages() function allows this. This is however not the common practice, in particular as the buffers are typically much smaller than a page, so using __get_free_pages() would have seemed to be a waste of memory. So existing drivers are subject to this problem, and there’s not much to do about it (except for applying the patch, as suggested above).

When libusb is used to access the device (through usbfs), there is a solution, assuming that the kernel is v4.6 (released May 2016) or later, and that libusb is version 1.0.21 or later. The trick is to use the libusb_dev_mem_alloc() function to allocate memory (i.e. the zerocopy feature), which implements a shared memory mapping with the usbfs driver, so that the data buffer that is accessed from user space is directly accessed by the xHCI hardware controller. It also speeds up things slightly by avoiding memory allocation and copying of buffers on x86 platforms, which ensure cache coherency on DMA transfers. I’m not sure what happens on ARM platforms, for which cache coherency needs to be maintained explicitly.

Note that physical memory contiguity is only ensured within 4 kiB pages, as the mmap() call doesn’t ensure physical memory contiguity. Since virtual to physical address translation never touches the lower 12 bits of the address, a buffer that doesn’t cross a 4 kiB boundary in virtual address space doesn’t cross one in physical memory either. But this doesn’t help regarding 64 kiB boundaries.

Without libusb_dev_mem_alloc(), the usbfs framework allocates a buffer with kmalloc() for each transfer, and eventually copies the data from the DMA buffer into it. No control whatsoever on how the buffer is aligned. The position of the buffer that is supplied by the user-space software makes no difference.

Zero-copy buffers: Gory details (too much information)

Introduced in Linux kernel v4.6 (commit f7d34b445, February 2016), mmap() on the usbfs file descriptor allows the memory buffer provided by the user-space program to be used directly by the hardware xHCI controller. The patch refers to this as “zerocopy”, as there is no copying of the data, however the more important part is that the buffers don’t need to be allocated and freed each time.

The idea is quite simple: The user-space software allocates a memory buffer by calling mmap() on the file descriptor of the relevant USB device. When a URB is submitted with the data buffer pointer directed to inside the mmap’ed region, the usbfs driver detects this fact, and skips memory allocation and copying of data. Instead, it uses the memory buffer directly.

This is implemented in proc_submiturb() (drivers/usb/core/devio.c) by first checking if the buffer is in an mmap’ed segment:

as->usbm = find_memory_area(ps, uurb);

“uurb” in the context of this code is the user-space copy of the URB. find_memory_area scans the list of mmap’ed regions for one that contains the address uurb->buffer (and checks that uurb->buffer_length doesn’t exceed the region). It returns NULL if no such buffer was found (which is the common case, when mmap() isn’t used at all). Moving on, we have

if (as->usbm) {
  unsigned long uurb_start = (unsigned long)uurb->buffer;

  as->urb->transfer_buffer = as->usbm->mem + (uurb_start - as->usbm->vm_start);
} else {
  as->urb->transfer_buffer = kmalloc(uurb->buffer_length, GFP_KERNEL);
  if (!as->urb->transfer_buffer) {
    ret = -ENOMEM;
    goto error;
  }
  if (!is_in) {
    if (copy_from_user(as->urb->transfer_buffer,
                       uurb->buffer,
                       uurb->buffer_length)) {
      ret = -EFAULT;
      goto error;
    }
  }
  /* (the rest of this branch is omitted here) */
}

So if a memory mapped buffer exists, the buffer that is used against hardware is calculated from the relative position in the mmap’ed buffer.

Otherwise, it’s kmalloc’ed, and in the case of an OUT transaction, there’s a copy_from_user() call following to populate the buffer with data.
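The lookup and the address calculation can be modeled in plain user-space C. This is only a sketch of the idea, not the kernel's code: struct region, find_region() and translate() are hypothetical names, and the list is a simplified stand-in for the kernel's bookkeeping. In particular, the sketch returns NULL when a buffer starts inside a region but extends beyond it, which is not necessarily how the kernel reacts to that situation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one mmap'ed region */
struct region {
	uintptr_t vm_start;    /* Start address as seen by user space */
	size_t size;           /* Size of the region in bytes */
	unsigned char *mem;    /* Kernel-side address of the same memory */
	struct region *next;   /* Next region in the list, or NULL */
};

/* Return the region that contains all of [buf, buf + len), or NULL */
struct region *find_region(struct region *list, uintptr_t buf, size_t len)
{
	struct region *r;

	for (r = list; r; r = r->next) {
		/* Written this way to avoid overflow in buf + len */
		if (buf >= r->vm_start && len <= r->size &&
		    buf - r->vm_start <= r->size - len)
			return r;
	}

	return NULL;
}

/* Same offset inside the region, but relative to the kernel-side base */
unsigned char *translate(const struct region *r, uintptr_t buf)
{
	return r->mem + (buf - r->vm_start);
}
```

The point is that no data moves anywhere: The buffer handed to the hardware is the same memory that the user-space program wrote to, just accessed through a different address.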

So to use this feature, the user-space software should mmap() a memory segment that is large enough to contain all data buffers, and then manage this segment, so that each transfer URB has its own chunk of memory in this segment.
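Managing the segment can be as simple as dividing it into equally sized slots, one per transfer. Below is a minimal sketch of such a pool. All names in it (struct buf_pool, pool_init(), pool_get(), pool_put()) are made up for this example, and in real use the base pointer would be the address returned by the mmap() call rather than any ordinary array.

```c
#include <stddef.h>

#define POOL_MAX_SLOTS 64

/* Hypothetical pool of fixed-size buffers inside one large segment */
struct buf_pool {
	unsigned char *base;                   /* Start of the segment */
	size_t slot_size;                      /* Bytes per buffer */
	size_t num_slots;                      /* Number of usable slots */
	unsigned char in_use[POOL_MAX_SLOTS];  /* 1 if the slot is taken */
};

/* Divide the segment into slots. Returns 0 on success, -1 on failure. */
int pool_init(struct buf_pool *p, unsigned char *base,
	      size_t seg_size, size_t slot_size)
{
	size_t i;

	if (!base || slot_size == 0 || seg_size < slot_size)
		return -1;

	p->base = base;
	p->slot_size = slot_size;
	p->num_slots = seg_size / slot_size;
	if (p->num_slots > POOL_MAX_SLOTS)
		p->num_slots = POOL_MAX_SLOTS;

	for (i = 0; i < POOL_MAX_SLOTS; i++)
		p->in_use[i] = 0;

	return 0;
}

/* Hand out the first free slot, or NULL if all slots are taken */
unsigned char *pool_get(struct buf_pool *p)
{
	size_t i;

	for (i = 0; i < p->num_slots; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = 1;
			return p->base + i * p->slot_size;
		}
	}

	return NULL;
}

/* Return a slot to the pool. Returns 0 on success, -1 on a bad pointer. */
int pool_put(struct buf_pool *p, unsigned char *buf)
{
	size_t off, i;

	if (buf < p->base)
		return -1;

	off = (size_t)(buf - p->base);
	i = off / p->slot_size;

	if (off % p->slot_size || i >= p->num_slots || !p->in_use[i])
		return -1;

	p->in_use[i] = 0;
	return 0;
}
```

A slot is taken with pool_get() before submitting a transfer, and given back with pool_put() once the transfer has completed.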

In order to do this from libusb, there’s the libusb_dev_mem_alloc() API function call, defined in core.c. It calls usbi_backend.dev_mem_alloc(), which for Linux is op_dev_mem_alloc() in os/linux_usbfs.c. It’s short and concise, so here it is:

static unsigned char *op_dev_mem_alloc(struct libusb_device_handle *handle,
	size_t len)
{
	struct linux_device_handle_priv *hpriv = _device_handle_priv(handle);
	unsigned char *buffer = (unsigned char *)mmap(NULL, len,
		PROT_READ | PROT_WRITE, MAP_SHARED, hpriv->fd, 0);
	if (buffer == MAP_FAILED) {
		usbi_err(HANDLE_CTX(handle), "alloc dev mem failed errno %d",
			errno);
		return NULL;
	}
	return buffer;
}

This capability was added to libusb in its version 1.0.21 (as the docs say) with git commit a283c3b5a, also in February 2016, and by coincidence, by Steinar H. Gunderson, who also submitted the Linux kernel patch. Conspiracy at its best.

libusb: From API call to ioctl()

These are some notes from dissecting the libusb sources. It’s not clear why this should interest anyone.

From the libusb sources: A simple submitted bulk transfer (libusb_bulk_transfer(), defined in sync.c) calls do_sync_bulk_transfer(). The latter function wraps an async transfer, and then calls the API’s function for that purpose, libusb_submit_transfer() (defined in io.c), which in turn calls usbi_backend.submit_transfer(). For Linux, this data structure is populated in os/linux_usbfs.c, with a pointer to the function op_submit_transfer(), which calls submit_bulk_transfer().

submit_bulk_transfer() ends up making an ioctl() call to submit one or more URBs to Linux’ usbfs interface. The code used for this ioctl is IOCTL_USBFS_SUBMITURB, which is a libusb-specific macro, defined as

#define IOCTL_USBFS_SUBMITURB	_IOR('U', 10, struct usbfs_urb)

in linux_usbfs.h, which matches

#define USBDEVFS_SUBMITURB         _IOR('U', 10, struct usbdevfs_urb)

in the Linux kernel source’s include/uapi/linux/usbdevice_fs.h.
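The two definitions yield the same number only if the two structs have the same size, because the struct's size is part of the ioctl number. The sketch below spells out the encoding by hand, assuming the generic ioctl layout used on x86 and ARM, among others (some architectures use a different one). The macro and function names are mine, and the struct size of 56 bytes used in the example is what I would expect on a 64-bit x86 machine.

```c
#include <stdint.h>

/* Bit positions in the generic ioctl number layout (an assumption:
 * this is not correct for all architectures) */
#define GEN_IOC_NRSHIFT    0
#define GEN_IOC_TYPESHIFT  8
#define GEN_IOC_SIZESHIFT  16
#define GEN_IOC_DIRSHIFT   30
#define GEN_IOC_READ       2u  /* Direction code used by _IOR() */

/* Hand-made equivalent of _IOR(type, nr, <a struct of `size` bytes>) */
uint32_t ioc_read_number(unsigned type, unsigned nr, unsigned size)
{
	return ((uint32_t)GEN_IOC_READ << GEN_IOC_DIRSHIFT) |
	       ((uint32_t)size << GEN_IOC_SIZESHIFT) |
	       ((uint32_t)type << GEN_IOC_TYPESHIFT) |
	       ((uint32_t)nr << GEN_IOC_NRSHIFT);
}
```

With these assumptions, ioc_read_number('U', 10, 56) evaluates to 0x8038550a.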

The ioctl() call ends up with proc_submiturb() defined in drivers/usb/core/devio.c, which moves the call on to proc_do_submiturb() in the same file. The latter function splits the transfer into scatter-gather buffers if it’s larger than 16 kiB (where did that number come from?).
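As for the arithmetic of the split, here is a rough sketch that only follows the rule as stated above (chunks of 16 kiB, and no splitting when the transfer fits in one chunk). The names are made up, and any other condition that the kernel may apply before choosing scatter-gather is ignored.

```c
#include <stddef.h>

#define SG_CHUNK_SIZE 16384u  /* 16 kiB, the threshold mentioned above */

/* Number of scatter-gather chunks for a transfer of `len` bytes, or 0
 * if the transfer fits in a single buffer and no splitting is done */
size_t sg_num_chunks(size_t len)
{
	if (len <= SG_CHUNK_SIZE)
		return 0;

	/* Round up, so that a partial last chunk is counted too */
	return (len + SG_CHUNK_SIZE - 1) / SG_CHUNK_SIZE;
}

/* Length of chunk number `idx` (counting from 0), or 0 if out of range */
size_t sg_chunk_len(size_t len, size_t idx)
{
	size_t start = idx * SG_CHUNK_SIZE;

	if (start >= len)
		return 0;

	return (len - start < SG_CHUNK_SIZE) ? len - start : SG_CHUNK_SIZE;
}
```

For example, a transfer of 40000 bytes would end up as three chunks: Two of 16384 bytes and one of 7232 bytes.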

And then we have the issue with memory mapping, as mentioned above.

Extra kernel gory details: URB handling

This is the traversal of a URB that is submitted via usbfs in the kernel, with details and error handling dropped for the sake of the larger picture.

proc_do_submiturb() sets up a usbfs-specific struct async container for housekeeping data and the URB (as a struct urb), puts it on a locally accessible list and then submits the URB into the USB framework with a

ret = usb_submit_urb(as->urb, GFP_KERNEL);

This finishes the usbfs-specific handling of the URB submission request. usb_submit_urb() (defined in drivers/usb/core/urb.c) makes a lot of sanity checks, and eventually calls usb_hcd_submit_urb() (defined in drivers/usb/core/hcd.c). Unless the URB is directed to the root hub itself, the function registered as hcd->driver->urb_enqueue is called with the URB. The hcd is obtained with

hcd = bus_to_hcd(udev->bus);

which merely fetches the usb_hcd structure from the usb_bus structure.

Anyhow, for xHCI the list of methods is mapped in drivers/usb/host/xhci.c, where urb_enqueue is assigned with xhci_urb_enqueue() (no surprises here). This function allocates a private data structure with kzalloc() and assigns urb->hcpriv with a pointer to it, and then calls xhci_queue_bulk_tx() (defined in drivers/usb/host/xhci-ring.c) for a bulk transfer URB.

xhci_queue_bulk_tx() calls queue_trb() which actually puts 16 bytes of TRB data into the Transfer Ring (per section 6.4.1 in the xHCI spec), and calls inc_enq() to move the enqueue pointer.
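To get a feeling for what these 16 bytes look like, here is a simplified user-space model of writing a Normal TRB into a ring and advancing the enqueue pointer. The struct and function names are hypothetical, and the sketch takes a shortcut: It wraps around directly at the end of the array and flips the cycle bit there, whereas a real Transfer Ring is chained with Link TRBs. The bit positions (transfer length in the lower 17 bits of the third word, TRB type at bit 10 and the cycle bit at bit 0 of the fourth word) are the ones I know from the xHCI spec.

```c
#include <assert.h>
#include <stdint.h>

#define TRB_LEN_MASK     0x1ffffu  /* Transfer length, bits 16:0 */
#define TRB_TYPE_SHIFT   10        /* TRB type field starts at bit 10 */
#define TRB_TYPE_NORMAL  1u        /* Type code of a Normal TRB */

#define RING_SIZE 8                /* Arbitrary size for this model */

/* A TRB is four 32-bit words, i.e. 16 bytes */
struct trb {
	uint32_t field[4];
};

/* Hypothetical model of a ring, without Link TRBs */
struct ring {
	struct trb trbs[RING_SIZE];
	unsigned enqueue;  /* Index of the next TRB to write */
	uint32_t cycle;    /* Current producer cycle bit, 0 or 1 */
};

/* Write a Normal TRB for a data buffer and advance the enqueue pointer */
void ring_queue_normal(struct ring *r, uint64_t addr, uint32_t len)
{
	struct trb *t = &r->trbs[r->enqueue];

	assert(len <= TRB_LEN_MASK);

	t->field[0] = (uint32_t)addr;          /* Buffer address, low part */
	t->field[1] = (uint32_t)(addr >> 32);  /* Buffer address, high part */
	t->field[2] = len & TRB_LEN_MASK;      /* Number of bytes */
	t->field[3] = (TRB_TYPE_NORMAL << TRB_TYPE_SHIFT) | r->cycle;

	/* Move on, wrapping around and toggling the cycle bit at the end */
	if (++r->enqueue == RING_SIZE) {
		r->enqueue = 0;
		r->cycle ^= 1u;
	}
}
```

The cycle bit is what tells the hardware which TRBs are fresh: An entry is considered valid only if its cycle bit matches the value the consumer currently expects.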

Once the hardware xHCI controller finishes handling the TRB (for better or worse), it queues an entry in the event ring, and issues an interrupt, which is handled by xhci_irq() (also in xhci-ring.c). After several sanity checks, this ISR calls xhci_handle_event() until all events have been handled, and then does the housekeeping tasks for confirming the events with the hardware.

The interesting part in xhci_handle_event() is that it calls handle_tx_event() if the event was a finished transmission URB. This happens to be the function that emits the warning in the title under some conditions. After a lot of complicated stuff, it calls process_bulk_intr_td() for a BULK endpoint TD, which in turn calls finish_td(). The latter returns the TD back to the USB subsystem: It calls xhci_td_cleanup(), which checks if all TDs of the relevant URB have been finished. If so, xhci_giveback_urb_in_irq() is called, which in turn calls usb_hcd_giveback_urb() (defined in drivers/usb/core/hcd.c). That function launches a tasklet that completes the URB.
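The "have all TDs been finished" check boils down to counting. A toy model of this logic, with names of my own choosing and without any of the kernel's actual data structures, could look like this:

```c
#include <stddef.h>

/* Hypothetical per-URB bookkeeping */
struct urb_state {
	unsigned num_tds;       /* TDs that the URB was split into */
	unsigned num_tds_done;  /* TDs finished so far */
	int given_back;         /* Set to 1 when the URB is handed back */
};

/* Called once for each finished TD. Returns 1 if this was the URB's
 * last TD, in which case the URB is marked as given back. */
int td_finished(struct urb_state *u)
{
	u->num_tds_done++;

	if (u->num_tds_done < u->num_tds)
		return 0; /* More TDs are still pending */

	u->given_back = 1;
	return 1;
}
```

So a URB that was split into three TDs is handed back only when the third TD is reported as finished.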