my tech blog » NXP (Freescale)

NXP / Freescale SDMA and the art of accessing peripheral registers

eli — Thu, 10 Aug 2017 17:58:37 +0000

Preface

While writing a custom SDMA script for copying data arriving from an eCSPI peripheral into memory, it occurred to me that there is more than one way to fetch the data from the peripheral. This post summarizes my rather decisive findings in this matter. Spoiler: The Linux driver (Freescale’s v4.1.15) could have done better.

I’ve written a tutorial on SDMA scripts in general, by the way, which is recommended before diving into this one.

Using the Peripheral DMA Unit

This is the method used by the official eCSPI driver for Linux. That is, the one obtained from Freescale’s / NXP’s Linux git repository. Specifically, spi_imx_sdma_init() in drivers/spi/spi-imx.c sets up the DMA transaction with

	spi_imx->rx_config.direction = DMA_DEV_TO_MEM;
	spi_imx->rx_config.src_addr = res->start + MXC_CSPIRXDATA;
	spi_imx->rx_config.src_addr_width = DMA_SLAVE_BUSWIDTH_1_BYTE;
	spi_imx->rx_config.src_maxburst = spi_imx_get_fifosize(spi_imx) / 2;
	ret = dmaengine_slave_config(master->dma_rx, &spi_imx->rx_config);
	if (ret) {
		dev_err(dev, "error in RX dma configuration.\n");
		goto err;
	}

Since res->start points at the address resource obtained from the device tree (0x2008000 for eCSPI1), this is the very same address used for accessing the peripheral registers (only the software uses the virtual address mapped to the relevant region).

In essence, it means issuing an stf command to set the PSA (Peripheral Source Address), and then reading the data with an ldf command on the PD register. For example, if the physical address (e.g. 0x2008000) is in register r1:

69c3 (0110100111000011) | 	stf	r1, 0xc3	# PSA = r1 for 32-bit frozen peripheral read
62c8 (0110001011001000) | 	ldf	r2, 0xc8	# Read peripheral register into r2

One would expect this to be the correct way, or why does this unit exist? Or why does the Linux driver use it? On the other hand, if this is the right way, why is there a “DMA mapping”?

Using the Burst DMA Unit

This might sound like a bizarre idea: Use the DMA unit intended for accessing RAM for peripheral registers. I wasn’t sure this would work at all, but it does: If the same address that was fed into PSA for accessing a peripheral goes into MSA instead, the data can be read correctly from MD. After all, the same address space is used by the processor, Peripheral DMA unit and Burst DMA unit, and it turns out that the buses are interconnected (which isn’t obvious).

So the example above changes into

6910 (0110100100010000) | 	stf	r1, 0x10    # To MSA, NO prefetch, address is frozen
620b (0110001000001011) | 	ldf	r2, 0x0b    # Read peripheral register into r2

The motivation for this type of access is using copy mode — a burst of up to 8 read/write operations in a single SDMA command. This is possible only from PSA to PDA, or from MSA to MDA. But there is no burst mode from PSA to MDA. So treating the peripheral register as a memory element works around this.

Spoiler: It’s not such a good idea. The speed results below tell why.

Using the SDMA internal bus mapping

The concept is surprisingly simple: It’s possible to access some peripherals’ registers directly in the SDMA assembly code’s memory space. In other words, to access eCSPI1, one can simply go

5201 (0101001000000001) | 	ld	r2, (r1, 0) # Read peripheral register from plain SDMA address space

and achieve the equivalent result of the examples above. But r1 needs to be set to a different address. And this is where it gets a bit confusing.

The base address is fairly easy to obtain. For example, i.MX6’s reference manual lists the address for eCSPI1 as 0x2000 in section 2.4 (“DMA memory map”), where it also says that the relevant section spans 4 kB. Table 55-14 (“SDMA Data Memory Space”) in the same document assigns the region 0x2000-0x2fff to “per2”, declares its size as 16 kB, and in the description it says “peripheral 2 memory space (4 Kbyte peripheral’s address space)”. So what is it? 4 kB or 16 kB?

The answer is both: The address 0x2000 is given in SDMA data address format, meaning that each address points at a 32-bit word. Therefore, the SDMA map region of 0x2000-0x2fff indeed spans 16 kB. But the mapping to the peripheral registers was done in a somewhat creative way: The address offsets of the registers apply directly on the SDMA mapping’s addresses.

For example, let’s consider the ECSPI1_STATREG, which is placed at “Base address + 18h offset”. In the Application Processor’s address space, it’s quite clear that it’s 0x2008000 + 0x18 = 0x2008018. The 0x18 offset means 0x18 (24 in decimal) bytes away from the base.

In the SDMA mapping, the same register is accessed at 0x2000 + 0x18 = 0x2018. At first glance, this might seem obvious, but an 0x18 offset means 24 x 4 = 96 bytes away from the base address. A bit odd, but that’s the way it’s implemented.

So even though each address increment in SDMA data address space moves 4 bytes, they mapped the multiply-by-4 offsets directly, placing the registers 16 bytes apart. Attempting to access addresses like 0x2001 yields nothing noteworthy (in my experiments, they all read zero). I believe that the SDMA submodule was designed in France, by the way.
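To make the two address calculations concrete, here is a minimal sketch in C of the eCSPI1 example above. The macro and function names are made up for this illustration and don’t appear in any kernel header; only the two base addresses and the 0x18 offset come from the text.

```c
#include <stdint.h>

/* Base addresses for eCSPI1 as quoted above (names are illustrative) */
#define ECSPI1_AP_BASE   0x02008000u /* Application Processor, byte address */
#define ECSPI1_SDMA_BASE 0x2000u     /* SDMA data space, 32-bit word address */

/* Byte address of a register as seen by the Application Processor */
uint32_t ap_reg_addr(uint32_t reg_offset)
{
	return ECSPI1_AP_BASE + reg_offset;
}

/* Address to put in r1 for "ld r2, (r1, 0)": the offset is added as is */
uint32_t sdma_reg_addr(uint32_t reg_offset)
{
	return ECSPI1_SDMA_BASE + reg_offset;
}

/* Distance in bytes from the SDMA base, since each SDMA address is 4 bytes */
uint32_t sdma_byte_distance(uint32_t reg_offset)
{
	return 4 * reg_offset;
}
```

With reg_offset = 0x18 (ECSPI1_STATREG), this gives 0x2008018 for the processor, 0x2018 for the SDMA script, and a distance of 96 bytes from the SDMA base.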

Almost needless to say, these addresses (e.g. 0x2000) can’t be used to access peripherals with Peripheral / Burst DMA units — these units work with the Application Processor’s bus infrastructure and memory map.

Speed tests

As all three methods work, the question is how fast each is. So I ran a speed test. I only tested the peripheral read operation (my application didn’t involve writes), but I would expect more or less the same results for writes. The speed tests were carried out by starting the SDMA script from a Linux kernel module, and issuing a printk when the SDMA script was kicked off. When the interrupt arrived at the completion of the script (resulting from a “done 3” opcode, not shown in the table below), another printk was issued. The timestamps in dmesg’s output were used to measure the time difference.

In order to keep the influence of the Linux overhead delays low, the tested command was executed within a hardware loop, so that the overall execution would take a few seconds. A few milliseconds of printk delay hence became fairly negligible.

The results are given in the following table:

	Peripheral DMA Unit	Burst DMA Unit	Internal bus mapping	Non-IO command
Assembly code	`stf r1, 0xc3 loop endloop, 0 ldf r2, 0xc8 endloop:`	`stf r1, 0x10 loop endloop, 0 ldf r2, 0x0b endloop:`	`loop endloop, 0 ld r2, (r1, 0) endloop:`	`loop endloop, 0 addi r5, 2 endloop:`
Execution rate	7.74 Mops/s	3.88 Mops/s	32.95 Mops/s	65.97 Mops/s

Before drawing conclusions from the results, a word on the rightmost one, which tested the speed of a basic command. The execution rate, almost 66 Mops/s, shows the SDMA machine’s upper limit. Where this came from isn’t all that clear, as I couldn’t find a matching clock rate in any of the three clocks enabled by Linux’ SDMA driver: clk_ahb, clk_ipg and clk_per.

The reference manual’s section 55.4.6 claims that the SDMA core’s frequency is limited to 104 MHz, but calling clk_get_rate() for clk_ahb returned 132 MHz (which is 2 x 66 MHz…). For the two other clocks that the imx-sdma.c driver declares that it uses, clk_ipg and clk_per (the same clock, I believe), clk_get_rate() returned 60 MHz, so it’s not that one. In short, it’s not 100% clear what’s going on, except that the figure is max 66 Mops/s.

By the way, I verified that the hardware loop doesn’t add extra cycles by duplicating the addi command, so it ran 10 times for each loop. The execution rate dropped to exactly 1/10, so there’s definitely no loop overhead.

OK, so now to the conclusions:

The clear winner is using the internal bus. Note that the result isn’t all that impressive, after all. With 33 Mops/s, 4 bytes each, there’s a theoretical limit of 132 MB/s for just reading. That doesn’t include doing something with the data. More about that below.
Note that reading from the internal bus takes just 2 execution cycles.
There is a reason for using the Peripheral DMA Unit, after all: It’s twice as fast compared with the Burst DMA Unit.
It probably doesn’t pay off to use the Burst DMA Unit for burst copying from a peripheral to memory, even though I didn’t give it a go: The read is twice as slow, and writing to memory with autoflush is rather quick (see below).
The use of the Peripheral DMA Unit in the Linux kernel driver is quite questionable, given the results above. On the other hand, the standard set of scripts isn’t really designed for efficiency anyhow.
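The arithmetic behind these conclusions is simple enough to write down. The following C sketch assumes that the 65.97 Mops/s measured for addi is the rate of single-cycle execution, which is an interpretation of the table and not something the reference manual states. The macro and function names are arbitrary.

```c
/* Rate measured for the plain addi loop, in Mops/s; taken here as the
 * rate of one-cycle instructions (an assumption, see above) */
#define SDMA_BASE_RATE_MOPS 65.97

/* Estimated number of SDMA cycles that one operation takes */
double cycles_per_op(double rate_mops)
{
	return SDMA_BASE_RATE_MOPS / rate_mops;
}

/* Upper limit of the data throughput in MB/s, at 4 bytes per operation */
double max_throughput_mb(double rate_mops)
{
	return rate_mops * 4.0;
}
```

cycles_per_op() comes out at about 2.0 for the internal bus mapping (32.95 Mops/s), about 8.5 for the Peripheral DMA Unit (7.74 Mops/s) and about 17 for the Burst DMA Unit (3.88 Mops/s).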

Copying data from peripheral to RAM

In this last pair of speed tests, the loop reads one value from the peripheral with Internal bus mapping (the fastest way found) and writes it to the general RAM with an stf command, using autoincrement. This is hence a realistic scenario for bulk copying of data from a peripheral data register into memory that is available to the Application Processor.

The test code had to be modified slightly, so the destination address is brought back to the beginning of the buffer every 1,000,000 write operations, since the buffer size is limited, quite naturally. So when the script begins, r7 contains the number of times to loop until resetting the destination address (that is, r7 = 1000000) and r3 contains the number of such sessions to run (was set to 200). The overhead of this larger loop is literally one in a million.

The assembly code used was:

                             | bigloop:
0000 008f (0000000010001111) | 	mov	r0, r7
0001 6e04 (0110111000000100) | 	stf	r6, 0x04	# MDA = r6, incremental write
                             |
0002 7802 (0111100000000010) | 	loop endloop, 0
0003 5201 (0101001000000001) | 	ld	r2, (r1, 0)
0004 6a0b (0110101000001011) | 	stf	r2, 0x0b	# Write 32-bit word, no flush
                             | endloop:
0005 2301 (0010001100000001) | 	subi	r3, 1		# Decrement big loop counter
0006 7cf9 (0111110011111001) | 	bf	bigloop		# Loop until r3 == 0
                             | quit:
0007 0300 (0000001100000000) | 	done 3			# Quit MCU execution

The result was 20.70 Mops/s, that is 20.7 million pairs of read-writes per second. This sets the realistic hard upper limit for reading from a peripheral to 82.8 MB/s. Note that by deducting the known time it takes to execute the peripheral read, one can estimate that the stf command runs at ~55.5 Mops/s. In other words, it’s a single cycle instruction until an autoflush is forced every 8 writes. However dropping the peripheral read command (leaving only the stf command) yields only 35.11 Mops/s. So it seems like the DMA burst unit takes advantage of the small pauses between accesses to it.
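The estimate for the stf command can be reproduced by subtracting times per iteration rather than rates. A small C sketch, assuming that the read and the write simply execute one after the other with no overlap (which, given the 35.11 Mops/s figure, is clearly only an approximation). The function name is arbitrary.

```c
/* Given the rate of a loop with two commands and the rate of a loop with
 * only the first one (both in Mops/s), estimate the second command's rate */
double second_command_rate(double pair_rate, double first_rate)
{
	double pair_time = 1.0 / pair_rate;   /* time per iteration, both commands */
	double first_time = 1.0 / first_rate; /* time of the first command alone */

	return 1.0 / (pair_time - first_time);
}
```

second_command_rate(20.70, 32.95) gives roughly 55.7 Mops/s with the rounded figures quoted here, which is in the same ballpark as the estimate above.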

I should mention that the Linux system was overall idle while performing these tests, so there was little or no congestion on the physical RAM. The results were repeatable within 0.1% of the execution time.

Note that automatic flush was enabled during this test, so the DMA burst unit received 8 writes (32 bytes) before flushing the data into RAM. When reattempting this test, with explicit flush on each write to RAM (exactly the same assembly code as listed above, with a peripheral read and then stf r7, 0x2b instead of 0x0b), the result dropped to 6.83 Mops/s. Which is tantalizingly similar to the 7.74 Mops/s result obtained for reading from the Peripheral DMA Unit.

Comparing with non-DMA

Even though not directly related, it’s worth comparing how fast the host accesses the same registers. For example, how much time will this take (in Linux kernel code, of course)?

  for (i=0; i<10000000; i++)
    rc += readl(ecspi_regs + MX51_ECSPI_STAT);

So the results are as follows:

Reading from an eCSPI register (as shown above): 4.10 Mops/s
The same, but from RAM (non-cacheable, allocated with dma_alloc_coherent): 6.93 Mops/s
The same, reading with readl() from a region handled by RAM cache (so it’s considered volatile): 58.14 Mops/s
Writing to an eCSPI register (with writel(), loop similar to above): 3.8696 Mops/s

This was carried out on an i.MX6 processor with a clock frequency of 996 MHz.

The figures echo well with those found in the SDMA tests, so it seems like the dominant delays come from i.MX6’s bus bridges. It’s also worth noting the surprisingly slow performance of readl() from cacheable memory, maybe because of the memory barriers.

NXP / Freescale i.MX6 as an SPI slave

eli — Thu, 10 Aug 2017 17:56:58 +0000

Motivation

Even though SPI is commonly used for controlling rather low-speed peripherals on an embedded system, it can also come in handy for communicating data with an FPGA.

When using the official Linux driver, the host can only be the SPI master. It means, among other things, that transactions are initiated by the host: When the bursts take place is completely decided by software, and so is how long they are. It’s not just about who drives which lines, but also the fact that the FPGA is on the responding side. This may not be a good solution when the data rates are anything but really slow: If the FPGA is slave, it must wait for the host to poll it for data (a bit like a USB peripheral). That can become a bit tricky at the higher end of data rates.

For example, if the FPGA’s FIFO is 16 kbit deep, and is filled at 16 Mbit/s, it takes 1 ms for it to overflow, unless drained by the host. This can be a difficult real-time task for a user-space Linux program (based upon spidev, for example). Not to mention how twisted such a solution will end up, having the processor constantly spinning in a loop collecting data, whether there is data to collect or not.

Another point is that the SPI clock is always driven by the SPI master, and it’s usually not a free-running one. Rather, bursts of clock edges are presented on the clock wire to advance the data transaction.

Handling a gated clock correctly on an FPGA isn’t easy when it’s controlled by an external device (unless its frequency is quite low). From an FPGA design point of view, it’s by far simpler to drive the SPI clock and handle the timing of the MOSI/MISO signals with respect to it.

And finally: If a good utilization of the upstream (FPGA to host) SPI channel is desired, putting the FPGA as master has another advantage. For example, on i.MX6 Dual/Quad, the SPI clock cycle is limited to a cycle of 15 ns for write transactions, but to 40 ns or 55 ns on read transactions, depending on the pins used. The same figures are true, regardless of whether the host is master or slave (compare sections 4.11.2.1 and 4.11.2.2 in the relevant datasheet, IMX6DQCEC.pdf). So if the FPGA needs to send data faster than 25 Mbps, it can only use write cycles, hence it has to be the SPI master.

CS is useless…

This is the “Chip Select” signal, or “Slave Select” (SS) in Freescale / NXP terminology.

The reference manual, along with NXP’s official errata ERR009535, clearly state that deasserting the SPI’s CS wire is not a valid way to end a burst. Citing the description for the SS_CTL field of ECSPIx_CONFIGREG, section 21.7.4 in the i.MX6 Reference Manual:

In slave mode – an SPI burst is completed when the number of bits received in the shift register is equal to (BURST_LENGTH + 1). Only the n least-significant bits (n = BURST_LENGTH[4:0] + 1) of the first received word are valid. All bits subsequent to the first received word in RXFIFO are valid.

So the burst length is fixed. The question is what value to pick. Short answer: 32 bits (set BURST_LENGTH to 31).

Why 32? First, let’s recall that RXFIFO is 32 bits wide. So what is more natural than packing the incoming data into full 32 bits entries in the RXFIFO, fully utilizing its storage capacity? Well, maybe the natural data alignment isn’t 32 bits, so another packing scheme could have been better. In theory.

That’s where the second sentence in the citation above comes in. What it effectively says is that if BURST_LENGTH + 1 is chosen as anything other than a multiple of 32, the first word ever pushed into RXFIFO since the SPI module’s reset will contain less than 32 received bits. All the rest, no matter what BURST_LENGTH is set to, will contain 32 bits of received data. This is really what happens. So in the long run, data is packed into 32 bit words no matter what. Choosing BURST_LENGTH + 1 other than a multiple of 32 will just mess things up on the first word the RXFIFO receives after waking up from reset. Nothing else.

So why not set BURST_LENGTH to anything other than 31? Simply because there’s no reason to do so. We’re going to end up with an SPI slave that shifts bits into RXFIFO as 32 bit words anyhow. The term “burst” has no significance, since deassertions of CS are ignored anyhow. In fact, I’m not sure whether there’s any difference between the different values that satisfy the multiple-of-32 rule.

Note that since CS doesn’t function as a frame for bursts, it’s important that the eCSPI module is brought out of reset while there’s no traffic (i.e. clock edges), or it will pack the data in an unaligned and unpredictable manner. Also, if the FPGA accidentally toggles the clock (due to a bug), alignment is lost until the eCSPI is reset and reinitialized.

Bottom line: The SPI slave receiver just counts 32 clock edges, and packs the received data into RXFIFO. Forever. There is no other useful alternative when the host is slave.

… but must be taken care of properly

Since the burst length doesn’t depend on the CS signal, it might as well be kept asserted all the time. With the register setting given below, that means holding the pin constantly low. It’s however important to select the correct pin in the CHANNEL_SELECT field of ECSPIx_CONREG: The host will ignore the activity on the SPI bus unless CS is selected. In other words, you can’t terminate a burst with CS, but if it isn’t asserted, bits aren’t sampled.

Another important thing to note, is that the CS pin must be IOMUXed as a CS signal. In the typical device tree for the mainstream Linux SPI master driver, it’s assigned as a GPIO pin. That’s no good for an SPI slave.

So, for example, if the ECSPI entry in the device tree says:

&ecspi1 {
[ ... ]
	pinctrl-names = "default";
	pinctrl-0 = <&pinctrl_ecspi1_1>;
	status = "okay";
 };

then the IOMUX settings given in pinctrl_ecspi1_1 are applied when the Linux driver related to ecspi1 is probed. That group should say something like

&iomuxc {
	imx6qdl-var-som-mx6 {
[ ... ]

		pinctrl_ecspi1_1: ecspi1grp {
			fsl,pins = <
				MX6QDL_PAD_DISP0_DAT22__ECSPI1_MISO	0x1f0b1
				MX6QDL_PAD_DISP0_DAT21__ECSPI1_MOSI	0x1f0b1
				MX6QDL_PAD_DISP0_DAT20__ECSPI1_SCLK	0x130b1
				MX6QDL_PAD_DISP0_DAT23__ECSPI1_SS0	0x1f0b1
			>;
		};
[ ... ]

The actual labels differ depending on the processor’s variant, which pins were chosen etc. The point is that the _SS0 usage was selected for the pin, and not the GPIO alternative (in which case it would say MX6QDL_PAD_DISP0_DAT23__GPIO5_IO17). The list of IOMUX defines for the i.MX6 DL variant can be found in arch/arm/boot/dts/imx6dl-pinfunc.h.

Endianness

The timing diagrams for SPI communication in the Reference Manual show only 8 bit examples, with MSB received first. This applies to 32 bit words as well. But what happens if 4 bytes are sent with the intention of being treated as a string of bytes?

Because the first byte is treated as the MSB of a 32-bit word, it’s going to end up as the last byte when the 32-bit word is copied (by virtue of a single 32-bit read and write) into RAM, whether done by the processor or by SDMA. This ensures that a 32-bit integer is interpreted correctly by the Little Endian processor when transmitted over the SPI bus, but messes up single bytes transmitted.

Where exactly this flipping takes place, I’m not sure, but it doesn’t really matter. Just be aware that if a sequence of bytes is sent over the SPI link, the bytes need to be swapped in groups of 4 to appear in the correct order in the processor’s memory.
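The swap itself is trivial. Here is a minimal sketch in plain C of what it takes, with function names invented for this example. In actual kernel code, be32_to_cpu() or swab32() would do the same job on each word.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the order of the four bytes within one 32-bit word */
uint32_t swap_bytes32(uint32_t w)
{
	return (w >> 24) | ((w >> 8) & 0x0000ff00u) |
	       ((w << 8) & 0x00ff0000u) | (w << 24);
}

/* Swap a buffer of received words in place, so that the bytes appear
 * in memory in the order they were sent on the SPI link */
void swap_rx_words(uint32_t *buf, size_t nwords)
{
	size_t i;

	for (i = 0; i < nwords; i++)
		buf[i] = swap_bytes32(buf[i]);
}
```

Note that this should only be applied to data that is meant as a string of bytes. Words that were sent as 32-bit integers already arrive in the right format.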

Register setting

In terms of a Linux kernel driver, the probe of an SPI slave is pretty much the same as the SPI master, with a few obvious differences. For example, the SPI clock’s frequency isn’t controlled by the host, so it probably doesn’t matter so much how the dividers are set (but it’s probably wise to set these dividers to 1, in case the internal clock is used for something).

  ctrl = MX51_ECSPI_CTRL_ENABLE | /* Enable module */
    /* MX51_ECSPI_CTRL_MODE_MASK not set, so it's slave mode */
    /* Both clock dividers set to 1 => 60 MHz, not clear if this matters */
    MX51_ECSPI_CTRL_CS(which_cs) | /* Select CSn */
    (31 << MX51_ECSPI_CTRL_BL_OFFSET); /* Burst len = 32 bits */

  cfg = 0; /* All defaults, in particular, no clock phase / polarity change */

  /* The CTRL register always goes first, to bring the controller out of reset */
  writel(ctrl, regs + MX51_ECSPI_CTRL);

  writel(cfg, regs + MX51_ECSPI_CONFIG);

  /*
   * Wait until the changes in the configuration register CONFIGREG
   * propagate into the hardware. It takes exactly one tick of the
   * SCLK clock, but we will wait 10 us to be sure (SCLK is 60 MHz)
   */

  udelay(10);

  /*
    Turn off DMA requests (revert the register to its defaults)
    But set the RXFIFO watermark as required by device tree.
  */
  writel(MX51_ECSPI_DMA_RX_WML(rx_watermark),
	 regs + MX51_ECSPI_DMA);

  /* Enable interrupt when RXFIFO reaches watermark */
  writel(MX51_ECSPI_INT_RDREN, regs + MX51_ECSPI_INT);

The example above shows the settings that apply when the host reads from the RXFIFO directly. Given the measurements I present in another post of mine, showing ~4 Mops/s with a plain readl() call, it means that at the maximal bus rate of 66 Mbit/s, which is ~2.06 Mops/s (32 bits per read), we have a processor core 50% busy just on readl() calls.
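The 50% figure follows from a short calculation, sketched here in C. The function names are arbitrary, and the model is deliberately naive: it counts nothing but the time spent inside readl().

```c
/* Number of 32-bit RXFIFO reads per second (in Mops/s) needed to keep up
 * with an SPI link running at the given bit rate (in Mbit/s) */
double reads_needed_mops(double link_mbit)
{
	return link_mbit / 32.0;
}

/* Fraction of one core's time spent in readl(), given the rate of
 * reads that readl() can sustain (in Mops/s) */
double readl_core_load(double link_mbit, double readl_rate_mops)
{
	return reads_needed_mops(link_mbit) / readl_rate_mops;
}
```

readl_core_load(66.0, 4.10) gives about 0.5, using the 4.10 Mops/s measured for reading an eCSPI register.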

So for higher data rates, SDMA is pretty much a must.

The speed test

Eventually, I ran a test. With a dedicated SDMA script, SPI clock running at 112 MHz, 108.6 Mbit/s actual throughput:

# time dd if=/dev/myspi of=/dev/null bs=64k count=500
500+0 records in
500+0 records out
32768000 bytes (33 MB, 31 MiB) copied, 2.41444 s, 13.6 MB/s

real	0m2.434s
user	0m0.000s
sys	0m1.610s

This data rate is, of course, way above the allowed SPI clock frequency of 66 MHz, but it’s not uncommon that real-life results are so much better. I didn’t bother pushing the clock higher.

I ran a long and rigorous test looking for errors on the data transmission line (~ 1 TB of data) and it was completely clean at 112 MHz, so the SPI slave is reliable. For a production system, I wouldn’t consider exceeding 66 MHz, despite this result. Just to have that said.

But the bottom line is that the SPI slave mode can be used as a simple transmission link of 32-bit words. Often that’s good enough.

i.MX: SDMA not working? Strange things happen? Maybe it’s all about power management.

eli — Sun, 06 Jul 2014 10:06:47 +0000

I ran into a weird problem while attempting to enable SDMA for UARTs on an i.MX53 processor running Freescale’s 2.6.35.3 Linux kernel: To begin with, the UART would only transmit 48 bytes, which is probably a result of only one watermark event arriving (the initial kickoff filled the UART’s FIFO with 32 bytes, and then one SDMA event occurred when the FIFO reached 16 bytes’ fill, so another 16 bytes were sent).

So it seemed like the SDMA core misses the UART’s watermark events. Closer experiments with my own test scripts revealed a variety of weird behaviors, including what appeared to be preemption of the SDMA script’s process, even though the reference manual is quite clear about it: Context switching of SDMA scripts is voluntary. And still, the flow of data on the UART’s tx lines was stopped for 5-6 ms periods randomly, even when I ran a busy-wait loop in the SDMA script, polling the “not full” flag of the UART’s transmission FIFO.

So it looked like something stopped the SDMA script from running in the middle of the loop (which included no “yield” nor “done” command). Or maybe a completely different issue? Maybe the peripheral bus wasn’t completely coherent? Anything seemed possible at some point.

As the title implies, the problem was power management, and poor settings of the SDMA’s behavior during low power modes.

It goes like this: Every time the Linux kernel’s scheduler has no process to run, it executes a WFI ARM processor command, halting the processor until an interrupt arrives (from a peripheral or just the scheduler’s tick clock). But before doing that, the kernel calls an architecture-dependent function, arch_idle(), which possibly shuts down or slows down clocks in order to increase power savings.

The kernel I used didn’t configure the SDMA’s behavior in the lower-power WAIT mode correctly, causing it to halt and miss events while the processor was in this mode. The word is that to overcome this, the CCM_CCGR bits for SDMA clocks should be set to 11 (bits 31-30 in CCM_CCGR4). There is probably also a need to enable aips_tz1_clk to keep the SDMA and aips_tz1 clocks running. But since the application I worked on didn’t have any power restrictions, I decided to avoid these power mode switches altogether.
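For reference, this is what setting those bits amounts to, as a C sketch. It only computes the new register value: reading and writing CCM_CCGR4 itself is left out, and the macro and function names are invented for this example, not taken from the kernel sources. It merely follows the hearsay above, and is not a tested fix.

```c
#include <stdint.h>

/* Bits 31-30 of CCM_CCGR4, the SDMA clock gating field per the text above */
#define CCGR4_SDMA_SHIFT 30
#define CCGR4_SDMA_MASK  (3u << CCGR4_SDMA_SHIFT)

/* Return the CCM_CCGR4 value with the SDMA field set to binary 11,
 * leaving all other bits untouched */
uint32_t ccgr4_keep_sdma_clock(uint32_t ccgr4)
{
	return ccgr4 | CCGR4_SDMA_MASK;
}
```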

This was done by editing arch/arm/mach-mx5/system.c in the kernel tree, where it said:

void arch_idle(void)
{
 if (likely(!mxc_jtag_enabled)) {
   if (ddr_clk == NULL)
     ddr_clk = clk_get(NULL, "ddr_clk");
   if (gpc_dvfs_clk == NULL)
     gpc_dvfs_clk = clk_get(NULL, "gpc_dvfs_clk");
   /* gpc clock is needed for SRPG */
   clk_enable(gpc_dvfs_clk);
   mxc_cpu_lp_set(arch_idle_mode);

and deleting the last line in the listing above: the call to mxc_cpu_lp_set(), which changes the processor’s power mode.

This solved the SDMA problem for me.

As a matter of fact, I would suggest commenting out this line during the development phase of any i.MX-based system, and restoring it once everything works. True, this shouldn’t be an issue if the clocks are properly configured. But if they’re not, something will fail, and the natural tendency is to focus on the drivers of the failing functionality, rather than to look for power management issues.

When the power reduction function is re-enabled at some later point, it’s quite evident what the problem is, if something fails then. So even if the target product is battery-driven, do yourself a favor, and drop that line in system.c until you’re finished struggling with other things.

Linux kernel platform device food chain example

eli — Fri, 14 Feb 2014 12:31:03 +0000

Since the device tree is the new way to set up hardware devices on embedded platforms, I hoped that I could avoid the “platform” API for picking which driver is going to take control over what. But it looks like the /arch/arm disaster is here to stay for a while, so I need to at least understand how it works.

So for reference, here’s an example walkthrough of the SPI driver for i.MX51, declared and matched with a hardware device.

The idea is simple: The driver, which is enabled by .config (and hence the Makefile in its directory includes it for compilation) binds itself to a string during its initialization. On the other side, initialization code requests a device matching that string, and also supplies some information along with that. The example tells the story better.

The platform API is documented in the kernel tree’s Documentation/driver-model/platform.txt. There’s also a nice LWN article by Jonathan Corbet.

So let’s assume we have Freescale’s 3-stack board at hand. In arch/arm/mach-mx5/mx51_3stack.c, at the bottom, it says

MACHINE_START(MX51_3DS, "Freescale MX51 3-Stack Board")
 .fixup = fixup_mxc_board,
 .map_io = mx5_map_io,
 .init_irq = mx5_init_irq,
 .init_machine = mxc_board_init,
 .timer = &mxc_timer,
MACHINE_END

mxc_board_init() is defined in the same file, which among many other calls goes

mxc_register_device(&mxcspi1_device, &mxcspi1_data);

with the extra info structure mxcspi1_data defined as

static struct mxc_spi_master mxcspi1_data = {
 .maxchipselect = 4,
 .spi_version = 23,
 .chipselect_active = mx51_3ds_gpio_spi_chipselect_active,
 .chipselect_inactive = mx51_3ds_gpio_spi_chipselect_inactive,
};

Now to the declaration of mxcspi1_device: In arch/arm/mach-mx5/devices.c we have

struct platform_device mxcspi1_device = {
	.name = "mxc_spi",
	.id = 0,
	.num_resources = ARRAY_SIZE(mxcspi1_resources),
	.resource = mxcspi1_resources,
	.dev = {
		.dma_mask = &spi_dma_mask,
		.coherent_dma_mask = DMA_BIT_MASK(32),
	},
};

and before that, in the same file there was:

static struct resource mxcspi1_resources[] = {
	{
		.start = CSPI1_BASE_ADDR,
		.end = CSPI1_BASE_ADDR + SZ_4K - 1,
		.flags = IORESOURCE_MEM,
	},
	{
		.start = MXC_INT_CSPI1,
		.end = MXC_INT_CSPI1,
		.flags = IORESOURCE_IRQ,
	},
	{
		.start = MXC_DMA_CSPI1_TX,
		.end = MXC_DMA_CSPI1_TX,
		.flags = IORESOURCE_DMA,
	},
};

So that defines the magic driver string and the resources that are allocated to this device.

It’s worth noting that devices.c ends with

postcore_initcall(mxc_init_devices);

which causes a call to mxc_init_devices(), a function that messes up the addresses of the resources for some architectures. Just to add some confusion. Always watch out for those little traps!

Meanwhile, in drivers/spi/mxc_spi.c

static struct platform_driver mxc_spi_driver = {
	.driver = {
		   .name = "mxc_spi",
		   .owner = THIS_MODULE,
		   },
	.probe = mxc_spi_probe,
	.remove = mxc_spi_remove,
	.suspend = mxc_spi_suspend,
	.resume = mxc_spi_resume,
};

followed by:

static int __init mxc_spi_init(void)
{
	pr_debug("Registering the SPI Controller Driver\n");
	return platform_driver_register(&mxc_spi_driver);
}

static void __exit mxc_spi_exit(void)
{
	pr_debug("Unregistering the SPI Controller Driver\n");
	platform_driver_unregister(&mxc_spi_driver);
}

subsys_initcall(mxc_spi_init);
module_exit(mxc_spi_exit);

So this is how the driver tells Linux that it’s responsible for devices marked with the “mxc_spi” string.
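The matching itself boils down to a string comparison. The following is a toy user-space model in C, not kernel code: the struct and function names are invented for illustration, and the real platform bus has more matching methods than the plain name comparison shown here.

```c
#include <string.h>

/* Toy counterparts of struct platform_device and struct platform_driver */
struct toy_device {
	const char *name;
	int id;
};

struct toy_driver {
	const char *name;
};

/* Returns nonzero if the driver claims the device, based on the name only */
int toy_match(const struct toy_device *dev, const struct toy_driver *drv)
{
	return strcmp(dev->name, drv->name) == 0;
}
```

So a device declared with the name “mxc_spi” is handed to the driver that registered under the same name, and to no other.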

As for some interaction with the device data (also in mxc_spi.c), there’s stuff like

mxc_platform_info = (struct mxc_spi_master *)pdev->dev.platform_data;

and

master_drv_data->res = platform_get_resource(pdev, IORESOURCE_MEM, 0);

going on with

if (!request_mem_region(master_drv_data->res->start,
			master_drv_data->res->end -
			master_drv_data->res->start + 1, pdev->name)) { /* Ayee! */ }

and

if (pdev->dev.dma_mask == NULL) { /* No DMA for you! */ }

and it goes on…

Cache coherency on i.MX25 running Linux

eli — Tue, 04 Feb 2014 14:37:31 +0000

What this blob is all about

Running some home-cooked SDMA scripts on Freescale’s Linux 2.6.28 kernel on an i.MX25 processor, I’m puzzled by the fact that cache flushing with dma_map_single(…, DMA_TO_DEVICE) doesn’t hurt, but nothing breaks if the calls are removed. On the other hand, attempting to remove cache invalidation calls, as in dma_map_single(…, DMA_FROM_DEVICE) does cause data corruption, as one would expect.

The de-facto lack of need for cache flushing could be explained by the small size of the cache: The sequence of events is typically preparing the data in the buffer, then some stuff in the middle, and only then is the SDMA script kicked off. If the cache lines are evicted naturally as a result of that “some stuff” activity, one gets away with not flushing the cache explicitly.

I’m by no means saying that cache flushing shouldn’t be done. On the contrary, I’m surprised that things don’t break when it’s removed.

So why doesn’t one get away with not invalidating the cache? In my tests, I saw 32-byte segments going wrong when I dropped the invalidation. That is, some segments, typically after a handful of successful data transactions of less than 1 kB of data.

Why does dropping the invalidation break things, and dropping the flushing doesn’t? As I said above, I’m still puzzled by this.

So I went down to the details of what these calls to dma_map_single() do. Spoiler: I didn’t find an explanation. At the end of the food chain, there are several MCR assembly instructions, as one should expect. Both flushing and invalidation apparently do something useful.

The rest of this post is the dissection of Linux’ kernel code in this respect.

The gory details

DMA mappings and sync functions practically wrap the dma_cache_maint() function, e.g. in arch/arm/include/asm/dma-mapping.h:

static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr,
		size_t size, enum dma_data_direction dir)
{
	BUG_ON(!valid_dma_direction(dir));

	if (!arch_is_coherent())
		dma_cache_maint(cpu_addr, size, dir);

	return virt_to_dma(dev, cpu_addr);
}

It was verified with disassembly that dma_map_single() was implemented with a call to dma_cache_maint().

This function can be found in arch/arm/mm/dma-mapping.c as follows

/*
 * Make an area consistent for devices.
 * Note: Drivers should NOT use this function directly, as it will break
 * platforms with CONFIG_DMABOUNCE.
 * Use the driver DMA support - see dma-mapping.h (dma_sync_*)
 */
void dma_cache_maint(const void *start, size_t size, int direction)
{
	const void *end = start + size;

	BUG_ON(!virt_addr_valid(start) || !virt_addr_valid(end - 1));

	switch (direction) {
	case DMA_FROM_DEVICE:		/* invalidate only */
		dmac_inv_range(start, end);
		outer_inv_range(__pa(start), __pa(end));
		break;
	case DMA_TO_DEVICE:		/* writeback only */
		dmac_clean_range(start, end);
		outer_clean_range(__pa(start), __pa(end));
		break;
	case DMA_BIDIRECTIONAL:		/* writeback and invalidate */
		dmac_flush_range(start, end);
		outer_flush_range(__pa(start), __pa(end));
		break;
	default:
		BUG();
	}
}
EXPORT_SYMBOL(dma_cache_maint);

The outer_* calls are defined as null functions in arch/arm/include/asm/cacheflush.h, since the CONFIG_OUTER_CACHE kernel configuration flag isn’t set.

The dmac_* macros are defined in arch/arm/include/asm/cacheflush.h as follows:

#define dmac_inv_range			__glue(_CACHE,_dma_inv_range)
#define dmac_clean_range		__glue(_CACHE,_dma_clean_range)
#define dmac_flush_range		__glue(_CACHE,_dma_flush_range)

where __glue() simply glues the two strings together (see arch/arm/include/asm/glue.h) and _CACHE equals “arm926” for the i.MX25, so e.g. dmac_clean_range becomes arm926_dma_clean_range.
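The gluing can be tried out in user space. This is a minimal sketch of the token pasting trick, not a copy of glue.h: the macro names GLUE and GLUE_ are made up, and the function is a stand-in for the real assembly routine.

```c
/* Two-level macro, so that _CACHE is expanded before the pasting */
#define GLUE_(name, fn)	name##fn
#define GLUE(name, fn)	GLUE_(name, fn)

#define _CACHE arm926	/* Chosen by the kernel configuration in real life */

#define dmac_clean_range GLUE(_CACHE, _dma_clean_range)

/* Stand-in for the assembly routine in proc-arm926.S */
static const char *arm926_dma_clean_range(void)
{
	return "arm926_dma_clean_range";
}

static const char *which_clean(void)
{
	return dmac_clean_range(); /* Resolves to arm926_dma_clean_range() */
}
```

The extra macro level matters: pasting directly with name##fn would produce _CACHE_dma_clean_range, because the operands of ## are not macro-expanded.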

These actual functions are implemented in assembler in arch/arm/mm/proc-arm926.S:

/*
 *	dma_inv_range(start, end)
 *
 *	Invalidate (discard) the specified virtual address range.
 *	May not write back any entries.  If 'start' or 'end'
 *	are not cache line aligned, those lines must be written
 *	back.
 *
 *	- start	- virtual start address
 *	- end	- virtual end address
 *
 * (same as v4wb)
 */
ENTRY(arm926_dma_inv_range)
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
	tst	r0, #CACHE_DLINESIZE - 1
	mcrne	p15, 0, r0, c7, c10, 1		@ clean D entry
	tst	r1, #CACHE_DLINESIZE - 1
	mcrne	p15, 0, r1, c7, c10, 1		@ clean D entry
#endif
	bic	r0, r0, #CACHE_DLINESIZE - 1
1:	mcr	p15, 0, r0, c7, c6, 1		@ invalidate D entry
	add	r0, r0, #CACHE_DLINESIZE
	cmp	r0, r1
	blo	1b
	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
	mov	pc, lr

/*
 *	dma_clean_range(start, end)
 *
 *	Clean the specified virtual address range.
 *
 *	- start	- virtual start address
 *	- end	- virtual end address
 *
 * (same as v4wb)
 */
ENTRY(arm926_dma_clean_range)
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
	bic	r0, r0, #CACHE_DLINESIZE - 1
1:	mcr	p15, 0, r0, c7, c10, 1		@ clean D entry
	add	r0, r0, #CACHE_DLINESIZE
	cmp	r0, r1
	blo	1b
#endif
	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
	mov	pc, lr

/*
 *	dma_flush_range(start, end)
 *
 *	Clean and invalidate the specified virtual address range.
 *
 *	- start	- virtual start address
 *	- end	- virtual end address
 */
ENTRY(arm926_dma_flush_range)
	bic	r0, r0, #CACHE_DLINESIZE - 1
1:
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
	mcr	p15, 0, r0, c7, c14, 1		@ clean+invalidate D entry
#else
	mcr	p15, 0, r0, c7, c6, 1		@ invalidate D entry
#endif
	add	r0, r0, #CACHE_DLINESIZE
	cmp	r0, r1
	blo	1b
	mcr	p15, 0, r0, c7, c10, 4		@ drain WB
	mov	pc, lr

The CONFIG_CPU_DCACHE_WRITETHROUGH kernel configuration flag is not set, so there are no shortcuts.
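The loop structure of these routines is easier to reason about in C. The following is a sketch that only models the address arithmetic of the assembly above (the function names are made up, and address wraparound is ignored):

```c
#include <stdint.h>

#define CACHE_DLINESIZE 32

/*
 * Model of the loop in arm926_dma_clean_range(): returns the number of
 * cache lines on which the per-line MCR operation is issued.
 */
static unsigned int lines_touched(uint32_t start, uint32_t end)
{
	uint32_t addr = start & ~(uint32_t)(CACHE_DLINESIZE - 1); /* bic */
	unsigned int count = 0;

	do {
		count++;			/* mcr on the line at addr */
		addr += CACHE_DLINESIZE;	/* add */
	} while (addr < end);			/* cmp + blo */

	return count;
}

/*
 * In arm926_dma_inv_range(), an unaligned start or end address causes
 * the line it falls in to be cleaned before the invalidation.
 */
static int edge_needs_clean(uint32_t addr)
{
	return (addr & (CACHE_DLINESIZE - 1)) != 0;	/* tst + mcrne */
}
```

For example, a 32-byte buffer starting at 0x1004 spans two cache lines, so both are operated on, including bytes that belong to whatever lives next to the buffer.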

Exactly the same snippet, only disassembled from the object file (using objdump -d):

000004d4 <arm926_dma_inv_range>:
 4d4:	e310001f 	tst	r0, #31
 4d8:	1e070f3a 	mcrne	15, 0, r0, cr7, cr10, {1}
 4dc:	e311001f 	tst	r1, #31
 4e0:	1e071f3a 	mcrne	15, 0, r1, cr7, cr10, {1}
 4e4:	e3c0001f 	bic	r0, r0, #31
 4e8:	ee070f36 	mcr	15, 0, r0, cr7, cr6, {1}
 4ec:	e2800020 	add	r0, r0, #32
 4f0:	e1500001 	cmp	r0, r1
 4f4:	3afffffb 	bcc	4e8 
 4f8:	ee070f9a 	mcr	15, 0, r0, cr7, cr10, {4}
 4fc:	e1a0f00e 	mov	pc, lr

00000500 <arm926_dma_clean_range>:
 500:	e3c0001f 	bic	r0, r0, #31
 504:	ee070f3a 	mcr	15, 0, r0, cr7, cr10, {1}
 508:	e2800020 	add	r0, r0, #32
 50c:	e1500001 	cmp	r0, r1
 510:	3afffffb 	bcc	504 
 514:	ee070f9a 	mcr	15, 0, r0, cr7, cr10, {4}
 518:	e1a0f00e 	mov	pc, lr

0000051c <arm926_dma_flush_range>:
 51c:	e3c0001f 	bic	r0, r0, #31
 520:	ee070f3e 	mcr	15, 0, r0, cr7, cr14, {1}
 524:	e2800020 	add	r0, r0, #32
 528:	e1500001 	cmp	r0, r1
 52c:	3afffffb 	bcc	520 
 530:	ee070f9a 	mcr	15, 0, r0, cr7, cr10, {4}
 534:	e1a0f00e 	mov	pc, lr

So there’s actually little to learn from the disassembly. Or at all…

Examples of SDMA-assembler for Freescale i.MX51

eli — Sat, 05 Nov 2011 15:13:17 +0000

These are a couple of examples of SDMA assembly code, which perform data copy using the DMA functional unit. The first one shows how to copy data from application memory space to SDMA memory. The second example copies data from one application memory chunk to another, and hence works as an offload memcpy().

To actually use this code and generally understand what’s going on here, I’d warmly suggest reading a previous post of mine about SDMA assembly code, which also explains how to compile the code and gives the context for the C functions given below.

Gotchas

Never let either the source or the destination address cross a 32-byte boundary during a burst from or to the internal FIFO. Even though I haven’t seen this restriction in the official documentation, several unexplained misbehaviors have surfaced when allowing this to happen, in particular when accessing EIM. So just don’t.
When accessing EIM, the EIM’s maximal burst length must be set to allow 32 bytes in one burst with the BL parameter, or data gets corrupted.
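The first rule can be checked on the application processor's side before an address is handed over to a script. A minimal sketch, with made-up helper names:

```c
#include <stdint.h>

#define BURST_BOUNDARY 32

/* Nonzero if a burst of 'len' bytes (len > 0) at 'addr' crosses a 32-byte boundary */
static int crosses_boundary(uint32_t addr, uint32_t len)
{
	return (addr / BURST_BOUNDARY) != ((addr + len - 1) / BURST_BOUNDARY);
}

/* Largest burst, in bytes, that starts at 'addr' and stays within its 32-byte block */
static uint32_t max_burst(uint32_t addr, uint32_t len)
{
	uint32_t room = BURST_BOUNDARY - (addr % BURST_BOUNDARY);

	return (len < room) ? len : room;
}
```

So a 32-byte burst is safe only from a 32-byte aligned address. Starting at 0x1010, for example, the first burst should be limited to 16 bytes.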

Application space memory to SDMA space

The assembly code goes

$ ./sdma_asm.pl app2sdma.asm
 | # Always in context (not altered by script):
 | #
 | # r4 : Physical address to source in AP memory space
 | # r6 : Address in SDMA space to copy to
 | # r7 : Number of DWs to copy   
 | #
 | # Both r4 and r5 must be DW aligned.
 | # Note that prefetching is allowed, so up to 8 useless DWs may be read.
 |
 | # First, load the status registers into SDMA space
                             | start:
0000 6c20 (0110110000100000) |     stf r4, 0x20 # To MSA, prefetch on, address is nonfrozen
0001 008f (0000000010001111) |     mov r0, r7
0002 018e (0000000110001110) |     mov r1, r6
0003 7803 (0111100000000011) |     loop postloop, 0
0004 622b (0110001000101011) |     ldf r2, 0x2b # Read 32 bits from MD with prefetch
0005 5a01 (0101101000000001) |     st r2, (r1, 0) # Address in r1
0006 1901 (0001100100000001) |     addi r1, 1
                             | postloop:
0007 0300 (0000001100000000) |     done 3
0008 0b00 (0000101100000000) |     ldi r3, 0
0009 4b00 (0100101100000000) |     cmpeqi r3, 0 # Always true
000a 7df5 (0111110111110101) |     bt start # Always branches

------------ CUT HERE -----------

static const int sdma_code_length = 6;
static const u32 sdma_code[6] = {
 0x6c20008f, 0x018e7803, 0x622b5a01, 0x19010300, 0x0b004b00, 0x7df50000,
};

Note that the arguments for stf and ldf are given as numbers, and not following the not-so-helpful notation used in the Reference Manual.

The basic idea behind the assembly code is that each DW (Double Word, 32 bits) is read automatically by the functional unit from application space memory, and then fetched from the FIFO into r2. Then the register is written to SDMA memory with a plain “st” opcode.
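As for the C array in the assembler's output: judging by the listing, it's simply the 16-bit opcodes packed in pairs into 32-bit words, the earlier opcode in the upper half, with a zero padding the last word when the number of opcodes is odd. A sketch of that packing (pack_opcodes() is a made-up name, this is not taken from sdma_asm.pl):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Pack 16-bit SDMA opcodes into 32-bit words, the earlier opcode in the
 * upper half. An odd trailing opcode is padded with 0x0000.
 * Returns the number of 32-bit words written to 'out'.
 */
static size_t pack_opcodes(const uint16_t *ops, size_t n, uint32_t *out)
{
	size_t i, words = (n + 1) / 2;

	for (i = 0; i < words; i++) {
		uint32_t hi = ops[2 * i];
		uint32_t lo = (2 * i + 1 < n) ? ops[2 * i + 1] : 0;

		out[i] = (hi << 16) | lo;
	}

	return words;
}
```

Feeding it the 11 opcodes of the listing above gives the six words of sdma_code, ending with 0x7df50000.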

The relevant tryrun() function to test this is:

static int tryrun(struct sdma_engine *sdma)
{
 dma_addr_t src_phys;
 void *src_virt;

 const int channel = 1;
 struct sdma_channel *sdmac = &sdma->channel[channel];
 static const u32 sdma_code[6] = {
   0x6c20008f, 0x018e7803, 0x622b5a01, 0x19010300, 0x0b004b00, 0x7df50000,
 };

 static const u32 sample_data[8] = {
   0x12345678, 0x11223344, 0xdeadbeef, 0xbabecafe,
   0xebeb0000, 0, 0xffffffff, 0xabcdef00 };

 const int origin = 0xe00; // In data space terms (32 bits/address)

 struct sdma_context_data *context = sdma->context;

 int ret;

 src_virt = dma_alloc_coherent(NULL,
                               4096, // 4096 bytes, just any buffer size
                               &src_phys, GFP_KERNEL);
 if (!src_virt) {
   printk(KERN_ERR "Failed to allocate source buffer memory\n");
   return -ENOMEM;
 }

 memset(src_virt, 0, 4096);

 memcpy(src_virt, sample_data, sizeof(sample_data));

 sdma_write_datamem(sdma, (void *) sdma_code, sizeof(sdma_code), origin);

 ret = sdma_request_channel(sdmac);

 if (ret) {
   printk(KERN_ERR "Failed to request channel\n");
   return ret;
 }

 sdma_disable_channel(sdmac);
 sdma_config_ownership(sdmac, false, true, false);

 memset(context, 0, sizeof(*context));

 context->channel_state.pc = origin * 2; // In program space addressing...
 context->gReg[4] = src_phys;
 context->gReg[6] = 0xe80;
 context->gReg[7] = 3; // Number of DWs to copy

 ret = sdma_write_datamem(sdma, (void *) context, sizeof(*context),
 0x800 + (sizeof(*context) / 4) * channel);

 if (ret) {
   printk(KERN_ERR "Failed to load context\n");
   return ret;
 }

 ret = sdma_run_channel(&sdma->channel[1]);

 sdma_print_mem(sdma, 0xe80, 128);

 if (ret) {
   printk(KERN_ERR "Failed to run script!\n");
   return ret;
 }

 return 0; /* Success! */
}

Note that the C code snippet, which is part of the output of the assembler compilation, actually appears in the tryrun() function.

Fast memcpy()

Assembly goes

$ ./sdma_asm.pl copydma.asm
 | # Should be set up at invocation
 | #
 | # r0 : Number of DWs to copy (is altered as script runs)
 | # r1 : Source address (DW aligned)
 | # r2 : Destination address (DW aligned)
 |
0000 6920 (0110100100100000) |     stf r1, 0x20 # To MSA, prefetch on, address is nonfrozen
0001 6a04 (0110101000000100) |     stf r2, 0x04 # To MDA, address is nonfrozen
0002 0c08 (0000110000001000) |     ldi r4, 8 # Number of DWs to copy each round
                             | copyloop:
0003 04d8 (0000010011011000) |     cmphs r4, r0 # Is 8 larger or equal to the number of DWs left to copy?
0004 7d03 (0111110100000011) |     bt lastcopy  # If so, jump to last transfer label
0005 6c18 (0110110000011000) |     stf r4, 0x18 # Copy 8 words from MSA to MDA address.
0006 2008 (0010000000001000) |     subi r0, 8   # Decrement counter
0007 7cfb (0111110011111011) |     bf copyloop  # Always branches, because r0 > 0
                             | lastcopy:
0008 6818 (0110100000011000) |     stf r0, 0x18 # Copy 8 or less DWs (r0 is always > 0)
                             | exit:
0009 0300 (0000001100000000) |     done 3
000a 0b00 (0000101100000000) |     ldi r3, 0
000b 4b00 (0100101100000000) |     cmpeqi r3, 0 # Always true
000c 7dfc (0111110111111100) |     bt exit # Endless loop, just to be safe

------------ CUT HERE -----------

static const int sdma_code_length = 7;
static const u32 sdma_code[7] = {
 0x69206a04, 0x0c0804d8, 0x7d036c18, 0x20087cfb, 0x68180300, 0x0b004b00, 0x7dfc0000,
};

For a frozen (constant) source address (e.g. when reading from a FIFO) the first stf should be done with argument 0x30 rather than 0x20. For a frozen destination address, the second stf has the argument 0x14 instead of 0x04.
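When patching an already assembled script for these variants, it helps to know how the opcode is built. The bit layout in this sketch is an assumption, inferred only from the listings in this post (e.g. 6920 for "stf r1, 0x20" and 622b for "ldf r2, 0x2b") and not from the Reference Manual. The function names are made up.

```c
#include <stdint.h>

/* stf: 01101 in the top bits, then the register, then the 8-bit argument */
static uint16_t encode_stf(unsigned int reg, unsigned int arg)
{
	return (uint16_t)(0x6800 | ((reg & 7) << 8) | (arg & 0xff));
}

/* ldf: same layout, with 01100 in the top bits */
static uint16_t encode_ldf(unsigned int reg, unsigned int arg)
{
	return (uint16_t)(0x6000 | ((reg & 7) << 8) | (arg & 0xff));
}
```

If this layout holds, the frozen variants come out as 0x6930 for the source and 0x6a14 for the destination, instead of 0x6920 and 0x6a04.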

This script should be started with r0 > 0. It may be OK to have r0=0, but I’m not sure about that, nor whether there’s an issue with not reading any data after a prefetch (possibly related to section 52.22.1 in the Reference Manual).
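The way the script splits a transfer into bursts can be modeled in C. This is a sketch of the control flow only (model_bursts() is a made-up name), handy for checking the burst sizes for a given r0:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * C model of the copy loop: fills 'bursts' with the DW count of each
 * "stf ..., 0x18" issued for a transfer of 'dws' DWs, and returns the
 * number of bursts. Like the script, it always issues a last burst,
 * even if dws is zero.
 */
static size_t model_bursts(uint32_t dws, uint32_t *bursts, size_t max)
{
	size_t n = 0;

	while (dws > 8 && n < max) {	/* cmphs r4, r0 + bt lastcopy */
		bursts[n++] = 8;	/* stf r4, 0x18 */
		dws -= 8;		/* subi r0, 8 */
	}

	if (n < max)
		bursts[n++] = dws;	/* lastcopy: stf r0, 0x18 */

	return n;
}
```

With r0 = 18 the bursts are 8, 8 and 2 DWs. With r0 = 0, the model shows a single burst request of zero DWs, which is exactly the case left open above.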

The endless loop to “exit” should never be needed. It’s there just in case the script is rerun by mistake, so it responds with a “done” right away. And the example above is not really optimal: To make a for-sure branch, I could have gone “bt exit” and “bf exit” immediately after it, making this in two opcodes instead of three. Wasteful me.

The tryrun() function for this case then goes

static int tryrun(struct sdma_engine *sdma)
{
 dma_addr_t buf_phys;
 u8 *buf_virt;

 const int channel = 1;
 struct sdma_channel *sdmac = &sdma->channel[channel];

 static const u32 sdma_code[7] = {
   0x69206a04, 0x0c0804d8, 0x7d036c18, 0x20087cfb, 0x68180300, 0x0b004b00, 0x7dfc0000,
 };

 static const u32 sample_data[8] = {
                                    0x12345678, 0x11223344, 0xdeadbeef, 0xbabecafe,
                                    0xebeb0000, 0, 0xffffffff, 0xabcdef00 };

 const int origin = 0xe00; // In data space terms (32 bits/address)

 struct sdma_context_data *context = sdma->context;

 int ret;

 buf_virt = dma_alloc_coherent(NULL, 4096,
                               &buf_phys, GFP_KERNEL);
 if (!buf_virt) {
   printk(KERN_ERR "Failed to allocate source buffer memory\n");
   return -ENOMEM;
 }

 memset(buf_virt, 0, 4096);

 memcpy(buf_virt, sample_data, sizeof(sample_data));

 sdma_write_datamem(sdma, (void *) sdma_code, sizeof(sdma_code), origin);

 ret = sdma_request_channel(sdmac);

 if (ret) {
   printk(KERN_ERR "Failed to request channel\n");
   return ret;
 }

 sdma_disable_channel(sdmac);
 sdma_config_ownership(sdmac, false, true, false);

 memset(context, 0, sizeof(*context));

 context->channel_state.pc = origin * 2; // In program space addressing...
 context->gReg[0] = 18; // Number of DWs to copy
 context->gReg[1] = buf_phys;
 context->gReg[2] = buf_phys + 0x40;

 ret = sdma_write_datamem(sdma, (void *) context, sizeof(*context),
                          0x800 + (sizeof(*context) / 4) * channel);

 if (ret) {
   printk(KERN_ERR "Failed to load context\n");
   return ret;
 }

 ret = sdma_run_channel(&sdma->channel[1]);

do {
 int i;
 const int len = 0xa0;

 unsigned char line[128];
 int pos = 0;

 for (i=0; i

The memory’s content is printed out here from tryrun() directly, since the dumped memory is in application space.

Freescale i.MX SDMA tutorial (part IV)

eli — Wed, 26 Oct 2011 17:57:36 +0000

This is part IV of a brief tutorial about the i.MX51’s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6 is exactly the same, with changes in the registers’ addresses and different chapters in the Reference Manual.

This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer. The division into parts goes as follows:

Part I: Introduction, addressing and the memory map
Part II: Contexts, Channels, Scripts and their execution
Part III: Events and Interrupts
Part IV: Running custom SDMA scripts in Linux (this page)

Running custom scripts

I’ll try to show the basics of getting a simple custom script to run on the SDMA core. Since there’s a lot of supporting infrastructure involved, I’ll show my example as a hack on the drivers/dma/imx-sdma.c Linux kernel module per version 2.6.38. I’m not going to explain the details of kernel hacking, so without experience in that field, it will be pretty difficult to try this out yourself.

The process of running an application-driven custom script consists of the following steps:

Initialize the SDMA module
Initialize the SDMA channel and clear its HE flag
Copy the SDMA assembly code from application space memory to SDMA memory space RAM.
Set up the channel’s context
Enable the channel’s HE flag (so the script runs pretty soon)
Wait for interrupt (assuming that the script ends with a “DONE 3”)
Possibly copy back the context to application processor space, to inspect the registers upon termination, and verify that their values are as expected.
Possibly copy SDMA memory to application processor space in order to inspect if the script worked as expected (if the script writes to SDMA RAM)

The first two steps are handled by the imx-sdma.c kernel module, so I won’t cover them. I’ll start with the assembly code, which has to be generated first.

The assembler

Freescale offers their assembler, but I decided to write my own in Perl. It’s simple and useful for writing short routines, and its output is snippets of C code, which can be inserted directly into the source, as I’ll show later. It’s released under GPLv2, and you can download it from this link.

The sample code below does nothing useful. For a couple of memory related examples, please see another post of mine.

To try it out quickly, just untar it on some UNIX system (Linux included, of course), change directory to sdma_asm, and go

$ ./sdma_asm.pl looptry.asm
                             | start:
0000 0804 (0000100000000100) |     ldi r0, 4
0001 7803 (0111100000000011) |     loop exit, 0
0002 5c05 (0101110000000101) |     st r4, (r5, 0) # Address r5
0003 1d01 (0001110100000001) |     addi r5, 1
0004 1c10 (0001110000010000) |     addi r4, 0x10
                             | exit:
0005 0300 (0000001100000000) |     done 3
0006 1c40 (0001110001000000) |     addi r4, 0x40
0007 0b00 (0000101100000000) |     ldi r3, 0
0008 4b00 (0100101100000000) |     cmpeqi r3, 0 # Always true
0009 7df6 (0111110111110110) |     bt start # Always branches

------------ CUT HERE -----------

static const int sdma_code_length = 5;
static const u32 sdma_code[5] = {
 0x08047803, 0x5c051d01, 0x1c100300, 0x1c400b00, 0x4b007df6,
};

The output should be pretty obvious. In particular, note that there’s a C declaration of a const array called sdma_code, which I’ll show how to use below. The first part of the output is a plain assembly listing, with the address, hex code and binary representation of the opcodes. There are a few simple syntax rules to observe:

Anything after a ‘;’ or ‘#’ sign is ignored (comments)
Empty lines are ignored, of course
A label starts the line, and is followed by a colon sign, ‘:’
Everything is case-insensitive, including labels (all code is lowercased internally)
The first alphanumeric string is considered the opcode, unless it’s a label
Everything following an opcode (comments excluded) is considered the arguments
All registers are noted as r0, r1, … r7 in the argument fields, and not as plain numbers, unlike the way shown in the reference manual. This makes a clear distinction between registers and values. It’s “st r7, (r0,9)” and not “st 7, (0,9)”.
Immediate arguments can be represented as decimal numbers (digits only), possibly negative (with a plain ‘-’ prefix). Positive hexadecimal numbers are allowed with the classic C “0x” prefix.
Labels are allowed for loops, as the first argument. The label is understood to be the first statement after the loop, so the label is the point reached when the loop is finished. See the example above. The second argument may not be omitted.
Other than loops, labels are accepted only for branch instructions, where the jump is relative. Absolute jump addresses can’t be generated automatically for jmp and jsr because the absolute address is not known during assembly.

A few words about why labels are not allowed for absolute jumps: It would be pretty simple to tell the Perl script the origin address, and allow absolute addressed jumps. I believe absolute jumps within a custom script should be avoided at any cost, so that the object code can be stored and run anywhere vacant. This is why I wasn’t keen on implementing this.
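To see why relative branches keep the code position independent, consider how the branch opcodes in the listing come about. The encoding in this sketch is an assumption inferred from the listings only (e.g. "bt start" at address 0009 assembling to 7df6); the function name is made up.

```c
#include <stdint.h>

/*
 * Encodings inferred from the listings: bt is 0x7dxx and bf is 0x7cxx,
 * where xx is the signed displacement from the instruction following
 * the branch. Addresses are in program space (16-bit units), relative
 * to the beginning of the script.
 */
static uint16_t encode_branch(int taken_if_true, int pc, int target)
{
	int disp = target - (pc + 1);

	return (uint16_t)((taken_if_true ? 0x7d00 : 0x7c00) | (disp & 0xff));
}
```

Only the distance between the branch and its target enters the opcode, so the same object code works no matter where in SDMA RAM it is loaded.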

A simple test function

This is a simple function, which loads a custom script and runs it a few times. I added it, and a few additional functions (detailed later) to the Linux kernel’s SDMA driver, imx-sdma.c, and called it at the end of sdma_probe(). This is the simplest, yet not most efficient way to try things out: The operation takes place once when the module is inserted into the kernel, and then a reboot is necessary, since the module can’t be removed from the kernel. But with the reboot being fairly quick on an embedded system, it’s pretty OK.

So here’s the tryrun() function. Mind you, it’s called after the SDMA subsystem has been initialized, with one argument, the pointer to the sdma_engine structure (there’s only one for the entire system).

static int tryrun(struct sdma_engine *sdma)
{
 const int channel = 1;
 struct sdma_channel *sdmac = &sdma->channel[channel];
 static const u32 sdma_code[5] = {
  0x08047803, 0x5c051d01, 0x1c100300, 0x1c400b00, 0x4b007df6,
 };

 const int origin = 0xe00; /* In data space terms (32 bits/address) */

 struct sdma_context_data *context = sdma->context;

 int ret;
 int i;

 sdma_write_datamem(sdma, (void *) sdma_code, sizeof(sdma_code), origin);

 ret = sdma_request_channel(sdmac);

 if (ret) {
   printk(KERN_ERR "Failed to request channel\n");
   return ret;
 }

 sdma_disable_channel(sdmac);
 sdma_config_ownership(sdmac, false, true, false);

 memset(context, 0, sizeof(*context));

 context->channel_state.pc = origin * 2; /* In program space addressing... */
 context->gReg[4] = 0x12345678;
 context->gReg[5] = 0xe80;

 ret = sdma_write_datamem(sdma, (void *) context, sizeof(*context),
                          0x800 + (sizeof(*context) / 4) * channel);
 if (ret) {
   printk(KERN_ERR "Failed to load context\n");
   return ret;
 }

 for (i=0; i<4; i++) {
   ret = sdma_run_channel(&sdma->channel[1]);
   printk(KERN_WARNING "*****************************\n");
   sdma_print_mem(sdma, 0xe80, 128);

   if (ret) {
     printk(KERN_ERR "Failed to run script!\n");
     return ret;
   }
 }
 return 0; /* Success! */
}

Copying the code into SDMA memory

First, note that sdma_code is indeed copied from the output of the assembler, when it’s executed on looptry.asm as shown above. The assembler adds the “static” modifier as well as an sdma_code_length variable which were omitted, but otherwise it’s an exact copy.

The first thing the function actually does, is calling sdma_write_datamem() to copy the code into SDMA space (and I don’t check the return value, sloppy me). This is a function I’ve added, but it’s clearly derived from sdma_load_context(), which is part of imx-sdma.c:

static int sdma_write_datamem(struct sdma_engine *sdma, void *buf,
                              int size, u32 address)
{
 struct sdma_buffer_descriptor *bd0 = sdma->channel[0].bd;
 void *buf_virt;
 dma_addr_t buf_phys;
 int ret;

 buf_virt = dma_alloc_coherent(NULL, size, &buf_phys, GFP_KERNEL);
 if (!buf_virt)
   return -ENOMEM;

 bd0->mode.command = C0_SETDM;
 bd0->mode.count = size / 4;
 bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
 bd0->buffer_addr = buf_phys;
 bd0->ext_buffer_addr = address;

 memcpy(buf_virt, buf, size);

 ret = sdma_run_channel(&sdma->channel[0]);

 dma_free_coherent(NULL, size, buf_virt, buf_phys);

 return ret;
}

The sdma_write_datamem()’s principle of operation is pretty simple: First a buffer is allocated, with its address in virtual space given in buf_virt and its physical address is buf_phys. Both addresses are related to the application processor, of course.

Then the buffer descriptor is set up. This piece of memory is preallocated globally for the entire sdma engine (in application processor’s memory space), which isn’t the cleanest way to do it, but since these operations aren’t expected to happen in parallel processes, this is OK. The sdma_buffer_descriptor structure is defined in imx-sdma.c itself, and is initialized according to section 52.23.1 in the Reference Manual. Note that this calling convention interfaces with the script running on channel 0, and not with any hardware interface. This chunk is merely telling the script what to do. In particular, the C0_SETDM command tells it to copy from application memory space to SDMA data memory space (see section 53.23.1.2).

Note that in the function’s arguments, “size” is given in bytes, but address in SDMA data address space (that is, in 32-bit quanta). This is why “size” is divided by four to become the element count (mode.count).

Just before kicking off, the input buffer’s data is copied into the dedicated buffer with a plain memcpy() command.

And then sdma_run_channel() (part of imx-sdma.c) is called to make channel 0 runnable. This function merely sets the HE bit of channel 0, and waits (sleeping) for the interrupt to arrive, or errors on timeout after a second.

At this point we have the script loaded into SDMA RAM (at data address 0xe00).

Some housekeeping calls on channel 1

Up to this point, nothing was done on the channel we’re going to use, which is channel #1. Three calls to functions defined in imx-sdma.c prepare the channel for use:

sdma_request_channel() sets up the channel’s buffer descriptor and data structure, and enables the clock global to the entire sdma engine, actions which I’m not sure are necessary. It also sets up the channel’s priority and the Linux’ wait queue (used when waiting for interrupt).
sdma_disable_channel() clears the channel’s HE flag
sdma_config_ownership() clears HO, sets EO and DO for the channel, so the channel is driven (“owned”) by the processor (as opposed to driven by external events).

Setting up the context

Even though imx-sdma.c has a sdma_load_context() function, it’s written for setting up the context as suitable for running the channel 0 script. To keep things simpler, we’ll set up the context directly.

After zeroing the entire structure, three registers are set in tryrun(): The program counter, r4 and r5. Note that the program counter is given the address to which the code was copied, multiplied by 2, since the program counter is given in program memory space. The two other registers are set merely as an initial state for the script. The structure is then copied into the per-channel designated slot with sdma_write_datamem().
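The two address calculations in tryrun() can be spelled out as follows. This is a sketch with made-up function names, and it assumes that sizeof(struct sdma_context_data) is 128 bytes, i.e. 32 words per channel. Check this against the struct in your imx-sdma.c.

```c
#include <stdint.h>

#define CONTEXT_BASE	0x800	/* Start of the context area, data space */
#define CONTEXT_WORDS	32	/* Assumed: sizeof(struct sdma_context_data) / 4 */

/* Data space address (32-bit units) of a channel's context */
static uint32_t context_addr(unsigned int channel)
{
	return CONTEXT_BASE + CONTEXT_WORDS * channel;
}

/* Program space address (16-bit units) of a data space address */
static uint32_t data_to_program_addr(uint32_t data_addr)
{
	return data_addr * 2;
}
```

Under that assumption, channel 1’s context is written to data address 0x820, and the script loaded at data address 0xe00 starts at program counter 0x1c00.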

Again, note that the “context” data structure, which is used as a source buffer from which the context is copied into SDMA memory, is allocated globally for the entire SDMA engine. It’s not even protected by a mutex, so in a real project you should allocate your own piece of memory to hold the sdma_context structure.

Running the script

In the end, we have a loop of four subsequent runs of the script, without updating the context, so from the second time and on, the script continues after the “done 3” instruction. This is possible, because the script jumps to the beginning upon resumption (the three last lines in the assembly code, see above).

Each call to sdma_run_channel() sets channel 1’s HE flag, making it do its thing and then trigger an interrupt with the DONE instruction, which in turn wakes up the process, telling it the script has finished. sdma_print_mem() merely makes a series of printk’s, consisting of hex dumps of data from the SDMA memory. As used, it’s aimed at the region which the script is expected to alter, but the same function can be used to verify that the script is indeed in its place, or look at the memory. The function goes

static int sdma_print_mem(struct sdma_engine *sdma, int start, int len)
{
 int i;
 u8 *buf;
 unsigned char line[128];
 int pos = 0;

 len = (len + 15) & 0xfff0;

 buf = kzalloc(len, GFP_KERNEL);

 if (!buf)
   return -ENOMEM;

 sdma_fetch_datamem(sdma, buf, len, start);

 for (i=0; i

and it uses this function (note that the instruction is C0_GETDM):

static int sdma_fetch_datamem(struct sdma_engine *sdma, void *buf,
                              int size, u32 address)
{
 struct sdma_buffer_descriptor *bd0 = sdma->channel[0].bd;
 void *buf_virt;
 dma_addr_t buf_phys;
 int ret;

 buf_virt = dma_alloc_coherent(NULL, size,
                               &buf_phys, GFP_KERNEL);
 if (!buf_virt)
   return -ENOMEM;

 bd0->mode.command = C0_GETDM;
 bd0->mode.count = size / 4;
 bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
 bd0->buffer_addr = buf_phys;
 bd0->ext_buffer_addr = address;

 ret = sdma_run_channel(&sdma->channel[0]);

 memcpy(buf, buf_virt, size);

 dma_free_coherent(NULL, size, buf_virt, buf_phys);

 return ret;
}
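
Back to sdma_print_mem(), whose listing is cut short at its loop. The following is not the original code, but a minimal user-space sketch of how the formatting part of such a hex dump might be written, with made-up function names. In the kernel, each formatted row would go out with a printk() rather than being returned.

```c
#include <stdio.h>
#include <stddef.h>

/* Round a length up to a multiple of 16, like (len + 15) & 0xfff0 */
static int round_up_16(int len)
{
	return (len + 15) & 0xfff0;
}

/*
 * Format up to 16 bytes of 'buf', starting at 'offset', as one line of
 * hex digits. 'line' must hold at least 16 * 3 + 1 characters.
 * Returns the number of characters written.
 */
static int format_hex_row(const unsigned char *buf, size_t len,
			  size_t offset, char *line)
{
	int pos = 0;
	size_t i;

	line[0] = 0;

	for (i = offset; i < len && i < offset + 16; i++)
		pos += sprintf(&line[pos], "%02x ", buf[i]);

	return pos;
}
```

A caller would loop over the buffer in steps of 16 bytes, formatting and printing one row at a time.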

Dumping context

This is the poor man’s debugger, but it’s pretty useful. A “done 3” instruction can be seen as a breakpoint, and the context dumped to the kernel log with this function:

static int sdma_print_context(struct sdma_engine *sdma, int channel)
{
 int i;
 struct sdma_context_data *context;
 u32 *reg;
 unsigned char line[128];
 int pos = 0;
 int start = 0x800 + (sizeof(*context) / 4) * channel;
 int len = sizeof(*context);
 const char *regnames[22] = { "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
                              "mda", "msa", "ms", "md",
                              "pda", "psa", "ps", "pd",
                              "ca", "cs", "dda", "dsa", "ds", "dd" };

 context = kzalloc(len, GFP_KERNEL);

 if (!context)
   return -ENOMEM;

 sdma_fetch_datamem(sdma, context, len, start);

 printk(KERN_WARNING "pc=%04x rpc=%04x spc=%04x epc=%04x\n",
   context->channel_state.pc,
   context->channel_state.rpc,
   context->channel_state.spc,
   context->channel_state.epc
 );

 printk(KERN_WARNING "Flags: t=%d sf=%d df=%d lm=%d\n",
   context->channel_state.t,
   context->channel_state.sf,
   context->channel_state.df,
   context->channel_state.lm
 );       

 reg = &context->gReg[0];

 for (i=0; i<22; i++) {
   if ((i % 4) == 0)
     pos = 0;

   pos += sprintf(&line[pos], "%s=%08x ", regnames[i], *reg++);

   if (((i % 4) == 3) || (i == 21))
     printk(KERN_WARNING "%s\n", line);
 }

 kfree(context);

 return 0;
}

Clashes with Linux’ SDMA driver

Playing around with the SDMA subsystem directly is inherently problematic, since the assigned driver may take contradicting actions, possibly leading to a system lockup. Running custom scripts using the existing driver isn’t possible, since it has no support for that as of kernel 2.6.38. On the other hand, there’s a good chance that the SDMA driver wasn’t enabled at all when the kernel was compiled, in which case there is no chance for collisions.

The simplest way to verify if the SDMA driver is currently present in the kernel, is to check in /proc/interrupts whether interrupt #6 is taken (it’s the SDMA interrupt).

The “imx-sdma” pseudodevice is always registered on the platform pseudobus (I suppose that will remain in the transition to Open Firmware), no matter the configuration. It’s the driver which may not be present. The “i.MX SDMA support” kernel option (CONFIG_IMX_SDMA) may not be enabled (it can be a module). Note that it depends on the general “DMA Engine Support” (CONFIG_DMADEVICES), which may not be enabled to begin with.

Anyhow, for playing with the SDMA module, it’s actually better when these are not enabled. In the long run, maybe there’s a need to expand imx-sdma.c, so it supports custom SDMA scripting. The question remaining is to what extent it should manage the SDMA RAM. Well, the real question is if there’s enough community interest in custom SDMA scripting at all.

Freescale i.MX51 SDMA tutorial (part III)

eli — Wed, 26 Oct 2011 15:35:46 +0000

This is part III of a brief tutorial about the i.MX51’s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6 is exactly the same, with changes in the registers’ addresses and different chapters in the Reference Manual.

This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer. The division into parts goes as follows:

Part I: Introduction, addressing and the memory map
Part II: Contexts, Channels, Scripts and their execution
Part III: Events and Interrupts (this page)
Part IV: Running custom SDMA scripts in Linux

Events

Even though an SDMA script can be kicked off (or made eligible for running, to be precise) by the application processor, regardless of any external events, there’s a lot of sense in letting the peripheral kick off the script(s) directly, so the application processor doesn’t have to be bothered with an interrupt every time.

So the system has 48 predefined SDMA events, listed in section 3.3 of the Reference Manual. Each of these events can turn one or several channels eligible for execution by automatically setting their EP flag. Which of the channels will have its EP flag set is determined by the SDMA event’s CHNENBL register. There are 48 such registers, one for each SDMA event, with each of its 32 bits corresponding to an SDMA channel: If bit i is set, the event linked with the register will set EP[i]. Note that these registers have unknown values on powerup, so if event driven SDMA is enabled, all registers must be initialized, or hell breaks loose.
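The event to channel mapping can be pictured with a small software model. This is a sketch only: the array stands in for the hardware's CHNENBL registers, and the struct and function names are made up.

```c
#include <stdint.h>

#define NUM_EVENTS 48

/* Software model of the CHNENBL registers, one per SDMA event */
struct chnenbl_model {
	uint32_t chnenbl[NUM_EVENTS];
};

/* Clear all mappings; the real registers hold unknown values on powerup */
static void chnenbl_init(struct chnenbl_model *m)
{
	int i;

	for (i = 0; i < NUM_EVENTS; i++)
		m->chnenbl[i] = 0;
}

/* Make 'event' set EP[channel] */
static void chnenbl_link(struct chnenbl_model *m, int event, int channel)
{
	m->chnenbl[event] |= (uint32_t)1 << channel;
}

/* Returns the EP flags after 'event' fires, given the flags before it */
static uint32_t event_fires(const struct chnenbl_model *m, int event,
			    uint32_t ep)
{
	return ep | m->chnenbl[event];
}
```

Linking event 14 to channels 3 and 5, for example, makes that event set EP[3] and EP[5] at once. Without the initialization loop, any event could kick off any channel.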

In a normal flow, EP[i] is zero when an event is about to set this flag: If it was set by a previous event, the respective SDMA script should have finished, and hence cleared the flag, before the next event occurred. Since attempting to set EP[i] when it’s already set may indicate that the event came too early (or the script is too late), there’s a CHNERR[i] flag, which latches such errors, so that the application processor can find out about such a condition. This can also trigger an interrupt, if the respective bit in INTRMASK is set. The application processor can read these flags (and reset them at the same time) in the EVTERR register.
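The relation between events, CHNENBL, EP[i] and CHNERR[i] can be sketched as a small software model. This is only an illustration of the behavior described above, not driver code: the struct and the function names are made up for this example, and nothing here touches real hardware registers.

```c
#include <stdint.h>

#define SDMA_NUM_EVENTS 48

/* Software model only: these fields stand in for hardware state. */
struct sdma_event_model {
	uint32_t chnenbl[SDMA_NUM_EVENTS]; /* one channel mask per event */
	uint32_t ep;     /* EP[i] flags, bit i = channel i */
	uint32_t chnerr; /* CHNERR[i] flags, latched until read */
};

/* Clear every CHNENBL entry, as the real registers need after powerup. */
static inline void model_init(struct sdma_event_model *m)
{
	for (int ev = 0; ev < SDMA_NUM_EVENTS; ev++)
		m->chnenbl[ev] = 0;
	m->ep = 0;
	m->chnerr = 0;
}

/* An event arrives: latch CHNERR where EP was already set, then set EP. */
static inline void model_raise_event(struct sdma_event_model *m,
				     unsigned int ev)
{
	uint32_t mask = m->chnenbl[ev];

	m->chnerr |= m->ep & mask;
	m->ep |= mask;
}

/* Reading EVTERR returns the latched flags and resets them. */
static inline uint32_t model_read_evterr(struct sdma_event_model *m)
{
	uint32_t value = m->chnerr;

	m->chnerr = 0;
	return value;
}
```

Raising the same event twice without the script clearing EP[i] in between leaves the channel’s bit set in the value returned by model_read_evterr(), which is the “event came too early” condition.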

I’d like to draw special attention to events #14 and #15, which are driven by external pins, namely GPIO1_4 and GPIO1_5. These two make it possible for an external chip (e.g. an FPGA) to request service without involving the application processor. A rising edge on these lines creates an event when the IOMUX is set to ALT1 (SDMA_EXT_EVENT) on the relevant pins. Note that setting the IOMUX to just GPIO won’t do it.

It’s important to note that the combination of the EP[i] flag being cleared by the script itself and the edge-triggered nature of the event signal creates an inevitable risk of a race condition: There is no rigorous way for the script to make sure that a “DONE 4” instruction, which was intended to clear a previous event, won’t clear one that has just arrived to request another round of service. The CHNERR[i] flag will indicate that the event arrived before the previous one was cleared, but in some implementations, that can actually be a legal condition. This can be solved by emulating a level-triggered event with a constantly toggling event line for as long as the external hardware wants servicing. This will make CHNERR[i] go high for sure, but otherwise it’s fine.

This possible race condition is not a design bug of the SDMA subsystem. Rather, it was designed with SDMA scripts which finish before the next event arrives in mind. The “I need service” kind of design was not considered.

Interrupts

By executing a “DONE 3” command, the SDMA scripts can generate interrupts on the application processor by setting the HI[i] flag, where i is the channel number of the currently running script. This will assert interrupt #6 on the application processor, which handles it like any other interrupt.

The HI[i] flags can be read by the application processor in the INTR register (see section 52.12.3.2 in the Reference Manual). An interrupt handler should scan this register to determine which channel requests an interrupt. There is no masking mechanism for individual HI[i]’s. The global interrupt #6 can be disabled, but an individual channel can’t be masked from generating interrupts.

If any of the INTRMASK bits is set, the EVTERR register should also be scanned, or at least cleared, since CHNERR[i] conditions generate interrupts which are indistinguishable from HI[i] interrupts.
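Scanning the INTR register boils down to walking the set bits of a 32-bit word. The following is a minimal sketch of that loop, working on a snapshot of the register value held in a plain variable. The function name is made up for this example, and how the real register is read and acknowledged is deliberately left out.

```c
#include <stdint.h>

/*
 * Take the lowest pending channel out of a snapshot of INTR.
 * Returns the channel number, or -1 when no HI[i] bit is left.
 * Only the snapshot is modified, not any hardware register.
 */
static inline int sdma_next_pending(uint32_t *intr)
{
	for (int ch = 0; ch < 32; ch++) {
		uint32_t bit = UINT32_C(1) << ch;

		if (*intr & bit) {
			*intr &= ~bit;
			return ch;
		}
	}
	return -1;
}
```

An interrupt handler would call this repeatedly on its copy of INTR, serving each returned channel until it gets -1, and then turn to EVTERR if any INTRMASK bit is set.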

“DONE 3”, which is the only instruction available for setting HI[i], also clears HE[i], so it was clearly designed to work with scripts kicked off directly by the application processor. In order to issue an interrupt from a script which is kicked off by an event, a little trick can be used: According to section 52.21.2 in the Reference Manual (the detail for the DONE instruction), “DONE 3” means “clear HE, set HI for the current channel and reschedule”. In other words, make the current channel ineligible for execution unless HO[i] is set, and set HI[i] so an interrupt is issued. But event-driven channels do have HO[i] set, so clearing HE[i] has no significance whatsoever. According to table 52-4, the context will be saved, and then restored immediately. So there will be a slight waste of time with context writes and reads, but since the most likely instruction following this “DONE 3” is a “DONE 4” (that is, clear EP[i], the event-driven script has finished), the impact is rather minimal. Anyhow, I still haven’t tried this for real, but I will soon.

So much for part III. You may want to go on with Part IV: Running custom SDMA scripts in Linux

Freescale i.MX51 SDMA tutorial (part II)

eli — Tue, 25 Oct 2011 17:53:18 +0000

This is part II of a brief tutorial about the i.MX51’s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6 is exactly the same, with changes in the registers’ addresses and different chapters in the Reference Manual.

This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer. The division into parts goes as follows:

Part I: Introduction, addressing and the memory map
Part II: Contexts, Channels, Scripts and their execution (this page)
Part III: Events and Interrupts
Part IV: Running custom SDMA scripts in Linux

Contexts and channels

The SDMA’s purpose is to service requests from hardware or from the application processor. In a way, it’s like a processor with no idle task, just interrupts. But the way the service is performed is different from interrupt handling.

Let’s assume that all scripts (those SDMA programs) are already present in the SDMA’s memory space. They may reside in the on-chip ROM or they’ve been loaded into RAM. How are they executed?

The answer lies in the contexts: Some of the SDMA’s RAM space is allocated for containing an array of structures. There are 32 such structures, each occupying 128 bytes (or 32 32-bit words), so all in all this block takes up 4 kB of memory (there’s a 96-byte variant as well, but we’ll leave it for now).
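The arithmetic is easy to put in code. This sketch assumes the 128-byte variant and takes the context block’s base address, 0x0800 in data-mode addressing, from the memory map in part I. The macro and function names are made up for this example.

```c
#include <stdint.h>

#define SDMA_NUM_CHANNELS  32
#define SDMA_CONTEXT_WORDS 32     /* 32-bit words per context, 128-byte variant */
#define SDMA_CONTEXT_BASE  0x0800 /* data-mode address, from part I's memory map */

/* Data-mode address of the first word of a channel's context. */
static inline uint32_t sdma_context_addr(unsigned int channel)
{
	return SDMA_CONTEXT_BASE + channel * SDMA_CONTEXT_WORDS;
}

/* Size of the whole context array in bytes. */
static inline uint32_t sdma_context_block_bytes(void)
{
	return SDMA_NUM_CHANNELS * SDMA_CONTEXT_WORDS * 4;
}
```

Channel 0’s context hence starts at 0x0800 and channel 31’s at 0x0be0, with the whole array taking up 4096 bytes.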

These structures do what their name implies: They contain the context of a certain execution thread. In other words, they contain everything that needs to be stored to resume execution at some point, as if it was never stopped. Since the SDMA core doesn’t have a stack, this information has to go to a fixed place. This includes the program counter, the registers and flags. Section 52.13.4 in the Reference Manual describes this structure in detail.

As mentioned, there’s an array of 32 of these structures. It means that the SDMA subsystem can maintain 32 contexts, or if you like, resemble a multitasking system with 32 independent threads. Or in SDMA terms: The SDMA core supports 32 DMA channels. This kinda connects with the common concept of DMA channels: Each channel has a certain purpose and particular flow.

The method to kick off a channel, so it will execute a certain script, is to write directly to the channel’s context structure, and then set up some flags to make it runnable. This is demonstrated in part IV. Since the context includes the program counter register, this controls where the execution starts. Other registers can be used to pass information to the script (that is, the SDMA “program”). What each register means upon such an invocation is up to the script’s API.

A script’s life cycle (scheduling)

So there are 32 contexts, each corresponding to one of the 32 channels. What makes a context load into the registers, making its channel’s script execute? It’s time to talk about the scheduler. It’s described in painstaking detail in the Reference Manual, so let’s stick to the main points.

The scheduler’s main function is to decide which channel is the most eligible to spend time on the processor core. This decision is relevant only when the SDMA core isn’t running anything at all (a.k.a. “sleeping”) or when the currently running script voluntarily yields the processor. The SDMA core’s execution is non-preemptive, so the scheduler can’t force any script to stop running. In other words, if any script is (mistakenly) caught in an infinite loop, all DMA activity is as good as dead, most likely leading to a complete system hangup. Nothing can force a script to stop running (except for a reset or the debugger). Just a small thing to bear in mind when writing those scripts.

The SDMA core has a special instruction for yielding the processor, with the mnemonic “done”, which takes a parameter for choosing its variant. Two variants of this instruction have earned their own mnemonics, “yield” and “yieldge”. While “done” variant #3 (usually called just “done”) always yields the processor, the two others yield it if there are other channels ready for execution with higher priority (or higher-or-equal priority for “yieldge”). But never mind the details. The overall picture is that the script runs until it issues a command saying “you must stop me now” (as in “done”) or “you may stop me now” (as in the two other variants).

Yielding only means that the registers are stored back into the context structure (with optimizations to speed this process up) and that another context may be loaded instead of it. Depending on which variant of “done” was used, plus some other factors, the scheduler may or may not reschedule the same channel automatically at a later time. That is, the context may be reloaded into the registers. So unless designed otherwise, the opcode directly after the “done” instruction will be executed at some later time. Hence a carefully written script never “ends”, it just gives up the processor until the next time the relevant channel is scheduled.

Channel eligibility

Now let’s look at what makes a channel eligible for execution. Leaving priority issues aside, let’s ask what makes a certain channel a candidate for having its context pushed into the SDMA core.

In some cases, the setup is that the channel becomes eligible for execution without any other condition. This is the case for offload memory copy, for example. In other cases, the channel’s eligibility depends on some hardware event, typically some peripheral requesting service. The latter scenario resembles old-school interrupt handlers, only the interrupt isn’t serviced by the application processor, but wakes up a service thread (channel) in the SDMA core. And exactly as waking up a thread in a modern operating system doesn’t cause immediate execution, but rather sets some flag to make the thread eligible for getting a processor time slice, so does the SDMA channel wakeup work: It’s just a flag telling the scheduler to push the channel’s context into the SDMA’s core when it sees fit.

The Reference Manual sums this up in section 52.4.3.5, saying that channel i is eligible to run if and only if the following expression is logical ‘1’:

(HE[i] or HO[i]) and (EP[i] or EO[i])

where HE[i], HO[i], EP[i], and EO[i] are flags belonging to the i’th channel. Let’s take them one by one:

HE[i] stands for “Host Enable”, and is set and reset by the application processor by writing to registers. It’s also cleared by the “done” instruction, so it’s suitable for a scenario where the host kicks off a channel, and the script quits it.
EP[i] stands for “External Peripheral”, and is set when an external peripheral wants service (more about that mechanism later on). It’s cleared by one of the “done” variants, so this is the flag used when a peripheral kicks off a channel, and the script quits.
HO[i] stands for “Host override”, and is controlled solely by a register written to by the application processor. Its purpose is to make the left hand of the expression always true, when we want the channel’s eligibility to be controlled by the peripheral only.
EO[i] stands for “External override”, and is like HO[i] in the way it’s handled. This flag is set when we want the channel’s eligibility controlled by the host only.
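The eligibility expression translates directly into bitwise operations, if each flag group is kept as a 32-bit word with bit i belonging to channel i. That representation, the struct and the function name are assumptions made for this sketch, which only models the logic and doesn’t access any register.

```c
#include <stdbool.h>
#include <stdint.h>

/* One 32-bit word per flag group, bit i = channel i (a software model). */
struct sdma_flags {
	uint32_t he; /* Host Enable */
	uint32_t ho; /* Host override */
	uint32_t ep; /* External Peripheral */
	uint32_t eo; /* External override */
};

/* (HE[i] or HO[i]) and (EP[i] or EO[i]) */
static inline bool sdma_channel_eligible(const struct sdma_flags *f,
					 unsigned int i)
{
	uint32_t bit = UINT32_C(1) << i;

	return ((f->he | f->ho) & (f->ep | f->eo) & bit) != 0;
}
```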

There are four registers in the application processor’s memory space, which are used to alter these flags: STOP_STAT, HSTART, EVTOVR and HOSTOVR. They are outlined in sections 52.12.3.3-52.12.3.7 in the Reference Manual.

The full truth is that there’s also a DO[i] flag mentioned (controlled by the DSPOVR register), but it must be held ‘1’ on i.MX51 devices, so let’s ignore it.

So if our case is the application processor controlling the i’th SDMA channel for offload operation, it sets EO[i], clears HO[i], and then sets HE[i] whenever it wants to have the script running. The script may clear HE[i] with a “done” instruction, or the application processor may clear it when appropriate. For example, the script can trigger an interrupt on the application processor, which clears the flag (even though I can’t see when this would be the right way to do it).

In the case of channels being started by a peripheral, the application processor sets HO[i] and clears EO[i]. Certain events (as discussed next) set the EP[i] flag directly, and the script’s “done” instruction clears it.
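The two setups can be walked through with a toy model of a single channel’s flags. This is a sketch for following the life cycle on paper, with all names invented for the purpose. On real hardware the flags are changed through the registers listed above and by the script’s “done” instruction.

```c
#include <stdbool.h>

/* The four flags of a single channel, as a plain software model. */
struct chan_flags {
	bool he, ho, ep, eo;
};

/* (HE or HO) and (EP or EO) for this one channel. */
static inline bool chan_eligible(const struct chan_flags *c)
{
	return (c->he || c->ho) && (c->ep || c->eo);
}

/* Host-driven (offload) channel: EO set, HO clear, HE does the gating. */
static inline void chan_setup_host_driven(struct chan_flags *c)
{
	c->eo = true;
	c->ho = false;
	c->he = false;
	c->ep = false;
}

/* Event-driven channel: HO set, EO clear, EP does the gating. */
static inline void chan_setup_event_driven(struct chan_flags *c)
{
	c->ho = true;
	c->eo = false;
	c->he = false;
	c->ep = false;
}

/* The application processor kicks off the channel. */
static inline void chan_host_start(struct chan_flags *c) { c->he = true; }

/* A peripheral event enabled for this channel arrives. */
static inline void chan_event(struct chan_flags *c) { c->ep = true; }

/* The script's "done" variant that clears HE. */
static inline void chan_done_host(struct chan_flags *c) { c->he = false; }

/* The script's "done" variant that clears EP. */
static inline void chan_done_event(struct chan_flags *c) { c->ep = false; }
```

In both setups the channel is ineligible right after configuration, becomes eligible with chan_host_start() or chan_event() respectively, and drops back when the matching “done” model function is called.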

Keep in mind that the script doesn’t necessarily run in one go: It should execute “yield” instructions every now and then to give other channels a chance to use the SDMA core, but since neither HE[i] nor EP[i] is affected by yields, the channel remains eligible, and the script will keep running until it’s, well, done.

There is a possibility to reset the SDMA core or force a reschedule with the SDMA’s RESET register, but that’s really something for emergencies (e.g. a runaway script).

So much for part II. You may want to go on with Part III: Events and Interrupts

Freescale i.MX SDMA tutorial (part I)

eli — Tue, 25 Oct 2011 15:52:55 +0000

This is part I of a brief tutorial about the i.MX51’s SDMA core. The SDMA for other i.MX devices, e.g. i.MX25, i.MX53 and i.MX6 is exactly the same, with changes in the registers’ addresses and different chapters in the Reference Manual.

Freescale’s Linux drivers for DMA also vary significantly across different kernel releases. It looks like they had two competing sets of code, and couldn’t make up their minds which one to publish.

This is by no means a replacement for reading the Reference Manual, but rather an introduction to make the landing softer. The division into parts goes as follows:

Part I: Introduction, addressing and the memory map (this page)
Part II: Contexts, Channels, Scripts and their execution
Part III: Events and Interrupts
Part IV: Running custom SDMA scripts in Linux

NOTE: For more information, in particular on SDMA for i.MX6 and i.MX7, there’s a follow-up post written by Jonah Petri.

Introduction

Behind all the nice words, the SDMA subsystem is just a small and simple RISC processor core, with its private memory space and some specialized functional units. It works side-by-side with the main ARM processor (the application processor henceforth), and pretty much detached from it. Special registers allow the application processor to control the SDMA’s core, and special commands on the SDMA’s core allow it to access the application processor’s memory space and send it interrupts. But in their natural flow, the two don’t interact.

The underlying idea behind the SDMA core is that instead of hardwiring the DMA subsystem’s capabilities and possible behaviors, why not write small programs (scripts henceforth), which perform the necessary memory operations? By doing so, the possible DMA operations and variants are not predefined by the chip’s vendor; the classic DMA operations are still possible and available with vendor-supplied scripts, but the DMA subsystem can be literally programmed to do a lot of other things. Offload RAID xoring is an example of something that can be taken off the main processor, as the data is being copied from disk buffers to the peripherals with DMA.

Scripts are kicked off either by some internal event (say, some peripheral has data to offer) or directly by the main processor’s software (e.g. an offload memcpy). The SDMA processor’s instruction set is simple, all opcodes occupying exactly 16 bits in program memory. Its assembler can be acquired from Freescale, or you can download my mini-assembler, which is suitable for small projects (in part IV).

Chapter 52 in the Reference Manual is dedicated to the SDMA, but unfortunately it’s not easy reading. In the hope to clarify a few things, I’ve written down the basics. Please keep in mind that the purpose of my own project was to perform memory-to-memory transfers triggered autonomously by an external device, so I’ve given very little attention to the built-in scripts and handling DMA from built-in peripherals.

Quirky memory issues

I wouldn’t usually start the presentation of a processor with its memory map and addressing, but in this case it’s necessary, as it’s a major source of confusion.

The SDMA core processor has its own memory space, which is completely detached from the application processor’s. There are two modes of access to the memory space: Instruction mode and data mode.

Instruction mode is used in the context of jumps, branches and when calling built-in subroutines which were written with program memory in mind. In this mode, the address points at a 16-bit word (which matches the size of an opcode), so the program counter is incremented (by one) between each instruction (except for jumps, of course).

Data mode is used when reading from the SDMA’s memory (e.g. loading registers) or writing to it. This should not be confused with the application processor’s memory (the one Linux sees, for example), which is not directly accessible by the SDMA core. In data mode, addressing works on 32-bit words, so incrementing the data mode address (by one) means moving forward four bytes.

Instruction mode and data mode addressing point at exactly the same physical memory space. It’s possible to write data to RAM in data mode, and then execute it as a script, the latter essentially reading from RAM in instruction mode. It’s important to note that a different address is used for each. This is best explained with a simple example:

Suppose that we want to run a routine (script) written by ourselves. To do so, it has to be copied into the internal RAM first. How to do that is explained in part IV, but let’s assume that we want to execute our script with a JMP instruction to 0x1800. This is 12 kB from the zero-address of the memory map, since the 0x1800 address is given in 16-bit quanta (2 bytes per address count). After the script is loaded in its correct place, we’ll be able to read the first instruction (as a piece of data) as follows: Set one of the SDMA’s processor’s registers to the value 0x0c00, and then load from the address pointed by that register. The address, 0x0c00, is given in 32-bit quanta (4 bytes per address count), so it hits exactly the same place: 12 kB from zero-address. And since we’re reading 32 bits, we’ll read the first instruction as well as the second at the same time.

Let’s say it loud and clear:

Instruction mode addresses are always double their data mode equivalents.
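For those who prefer it in code, here are the conversions as a few C helpers. The function names are made up for this example. The arithmetic follows from instruction-mode addresses counting 16-bit words and data-mode addresses counting 32-bit words.

```c
#include <stdint.h>

/* Instruction-mode address of the first opcode in a data-mode word. */
static inline uint32_t sdma_data_to_instr(uint32_t data_addr)
{
	return data_addr * 2;
}

/* Data-mode address of the 32-bit word containing an opcode. */
static inline uint32_t sdma_instr_to_data(uint32_t instr_addr)
{
	return instr_addr / 2;
}

/* Byte offset from the zero-address, given an instruction-mode address. */
static inline uint32_t sdma_instr_byte_offset(uint32_t instr_addr)
{
	return instr_addr * 2;
}

/* Byte offset from the zero-address, given a data-mode address. */
static inline uint32_t sdma_data_byte_offset(uint32_t data_addr)
{
	return data_addr * 4;
}
```

With the example above, sdma_data_to_instr(0x0c00) gives 0x1800, and both byte offset functions give 0x3000 (12 kB) for their respective addresses.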

As for endianness, the SDMA core thinks Big Endian all the way through. That means that when reading two assembly opcodes from memory in data mode, we get a 32-bit word, for which the first instruction is on bits [31:16] and the instruction following it on bits [15:0].
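As a sketch of what this means when handling script words in C on the host side, these helpers split and join a 32-bit data-mode word. The function names are invented for this example. The word is treated as a plain number, so how it’s laid out in the application processor’s memory is a separate matter.

```c
#include <stdint.h>

/* The opcode executed first sits on bits [31:16]. */
static inline uint16_t sdma_first_opcode(uint32_t word)
{
	return (uint16_t)(word >> 16);
}

/* The opcode following it sits on bits [15:0]. */
static inline uint16_t sdma_second_opcode(uint32_t word)
{
	return (uint16_t)(word & 0xffff);
}

/* Join two consecutive opcodes into one data-mode word. */
static inline uint32_t sdma_pack_opcodes(uint16_t first, uint16_t second)
{
	return ((uint32_t)first << 16) | second;
}
```

For example, the opcodes 0x69c3 followed by 0x62c8 pack into the word 0x69c362c8.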

The memory map

Since we’re at it, and since the Reference Manual has this information spread all over, here’s a short outline of what’s mapped where, in data addresses.

0x0000-0x03ff: 4 kB of internal ROM with boot code and standard routines
0x0400-0x07ff: 4 kB of reserved space. No access at all should take place here
0x0800-0x0bff: 4 kB of internal RAM, containing the 32 channels’ contexts (each context is 32 words of 4 bytes each, when SMSZ is set in the CHN0ADDR register). More about this in part II. For the details, see Section 52.13.4 in the Reference Manual. When SMSZ is clear, this segment is 3 kB only (see 52.4.4).
0x0c00-0x0fff: 4 kB of internal RAM, free for end-user application scripts and data.
0x1000-0x6fff: Peripherals 1-6 memory space
0x7000-0x7fff: SDMA registers, as accessed directly by the SDMA core (as detailed in section 52.14 of the reference manual)
0x8000-0xffff: Peripherals 7-14 memory space (not accessible in program memory space)

The two regions of peripheral memory space are the preferred way to access peripherals (unlike the implementation in Linux drivers that use SDMA scripts), as discussed in another post of mine.

And once again: The memory map above is given in data addresses. The memory map in program memory space is the same, only all addresses are double.
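The memory map can be written down as a small lookup function, which is handy as a sanity check when calculating addresses by hand. The enum and function names are made up for this example, and the ranges are copied from the list above, in data-mode addresses.

```c
#include <stdint.h>

enum sdma_region {
	SDMA_ROM,         /* 0x0000-0x03ff */
	SDMA_RESERVED,    /* 0x0400-0x07ff */
	SDMA_CONTEXT_RAM, /* 0x0800-0x0bff */
	SDMA_USER_RAM,    /* 0x0c00-0x0fff */
	SDMA_PERIPH_1_6,  /* 0x1000-0x6fff */
	SDMA_REGISTERS,   /* 0x7000-0x7fff */
	SDMA_PERIPH_7_14, /* 0x8000-0xffff */
	SDMA_OUT_OF_RANGE
};

/* Classify a data-mode address according to the memory map. */
static inline enum sdma_region sdma_region_of(uint32_t data_addr)
{
	if (data_addr <= 0x03ff)
		return SDMA_ROM;
	if (data_addr <= 0x07ff)
		return SDMA_RESERVED;
	if (data_addr <= 0x0bff)
		return SDMA_CONTEXT_RAM;
	if (data_addr <= 0x0fff)
		return SDMA_USER_RAM;
	if (data_addr <= 0x6fff)
		return SDMA_PERIPH_1_6;
	if (data_addr <= 0x7fff)
		return SDMA_REGISTERS;
	if (data_addr <= 0xffff)
		return SDMA_PERIPH_7_14;
	return SDMA_OUT_OF_RANGE;
}
```

To classify a program memory address, halve it first: The JMP target 0x1800 from the example above becomes 0x0c00, which lands in the RAM for user scripts.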

So much for part I. You may want to go on with Part II: Contexts, Channels, Scripts and their execution