Allwinner Development Linux

A noob's journey to mainline: Part 2

MasterR3C0RD

Last time, I finished by mentioning that something was still missing to get U-Boot running on the Allwinner A100 series of SoCs. Luckily for me, that meant I could dust off my Binary Ninja install for the first time in a few months and perform one of my most beloved acts.

My better-known expertise is in software development. I've been doing that since I was in Grade 5 (~11 years old), something that followed naturally from my interest in computers. However, a few years later, I pulled myself into a completely different project that required me to learn how to reverse engineer code.

And that.
That was exhilarating.

Since even before I was programming, I was taking apart everything I could get my hands on, just to see how it worked. I didn't have a screwdriver, so this led to some... casualties. My parents tended to be quite displeased when they saw the carcasses of various toys and devices from around the house hidden under my bed. For me, the most sacred of arts is figuring out how something works, and figuring out the reason behind it. And reverse engineering combines this sacred art with the pattern-matching itch my autism begs me to scratch and my passion for cybersecurity and software development.

So what did I need to reverse engineer?

Just the code for a proprietary DDR3/DDR3L/DDR4/LPDDR3/LPDDR4 SDRAM PHY and controller, with absolutely zero documentation or source code to work off of, and its configuration hidden away in values that, in isolation, make absolutely zero sense.


Just about anyone who is slightly technically inclined will know about RAM. It's the magical thing that Chrome/(insert triple-A game) munches on like crazy, and when it's all used up it grinds your PC to a halt. The more technically inclined will know that RAM acts as "short term memory" for their computer, where the software you're running stores its information as it's running. Overclockers might know about RAM timings and the impact they can have on latency. But most systems and embedded developers will likely think of RAM as a long 1D array. This RAM can be cut and sliced around by an MMU to provide "virtual" memory, but under the hood it's just a long string of tape.

This couldn't be further from the truth, at least for DDR SDRAM. Instead, it acts more like an entire city of 2D arrays of memory "cells". The illusion of a linear array of memory is provided by mapping this organization of memory into a linear space. This is done by the DRAM controller, normally built into the SoC, alongside some additional functionality like burst reads.

Even lower than this, there's the PHY itself. As with many other interfaces, the controller doesn't handle the raw electrical signaling with the DRAM chips; instead, another block of the SoC handles the physical layer, driving each address line, compensating for the picosecond-scale delays caused by differing trace lengths, and figuring out what voltages work best for moving data in and out on sub-nanosecond intervals.

This is a lot. On my first read through the decompiled code, I had no idea what I was even reading. It turned out the controller was a well-known IP core (the Synopsys DesignWare uMCTL2, also found in the Zynq UltraScale+), but I didn't understand what most of the memory-mapped registers being written to were even doing. Although I was working off of an object file that provided function names, that didn't help much. What's the actual use of things like "read calibration" and "write training" and "DX eye delay compensation"? And why is there a function that purportedly does "address remapping" when we have another function that does "address mapping" earlier on? And what does "MR" stand for? What about "DBI"? "DFS"? What is "geardown mode" and "burst length" and...

It would have been easy for me to get overloaded at this point, but I decided to break things down into smaller bits, and try to understand things piece by piece. Eventually, I started to understand the bigger picture, and things slowly became more clear. Let's go through what the code does at a low level, and break it down into processes we can understand on a higher level.


Before that, though, let's expand on my offhand debunking of how DRAM is organized internally. SDRAM, as previously mentioned, is not a 1D array of memory "cells", but a set of 2D arrays of capacitors, each storing a logic 0 or logic 1. These arrays are organized into banks, and (in non-LP DDR4 and newer) bank groups. And the whole arrangement can be replicated across multiple sets of RAM chips, known as "ranks".

In other words, each bit of memory in DDR4 memory is identified by a combination of up to 5 different parameters, from smallest to largest:

  • Column
  • Row
  • Bank
  • Bank group
  • Rank

Each DRAM chip will have different numbers of these, but they almost always correspond to powers of 2; this is because you need log2(n) lines to address n separate blocks. Non-power-of-two densities like 6 Gbit (768 MiB) and 12 Gbit (1.5 GiB) do exist, but these usually work by only permitting certain values in one of the address fields, rather than by abandoning power-of-two address lines.

Additionally, each DRAM chip will have a certain "bus width"; that is, how many bits will be pulled by a DRAM access at once. For example, an x32 DRAM chip will return 32 bits (4 bytes) at once, while an x16 chip will return 16 bits (2 bytes).
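
To make the arithmetic concrete, here's a back-of-the-envelope sketch in C. The geometry is a hypothetical (but typical) 8 Gbit x16 DDR4 chip; none of this is specific to the A100:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical but typical 8 Gbit x16 DDR4 chip: 10 column
         * bits, 16 row bits, 2 bank bits, 1 bank group bit, 1 rank. */
        unsigned col_bits = 10, row_bits = 16;
        unsigned bank_bits = 2, bg_bits = 1, rank_bits = 0;
        unsigned bus_width = 16;           /* an x16 chip */

        /* log2(n) address lines select between n blocks, so the bit
         * counts simply add up; each location holds bus_width bits. */
        uint64_t locations = 1ULL << (col_bits + row_bits + bank_bits +
                                      bg_bits + rank_bits);
        uint64_t bytes = locations * (bus_width / 8);

        printf("capacity: %llu MiB\n",
               (unsigned long long)(bytes >> 20));  /* 1024 MiB = 8 Gbit */
        return 0;
    }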

The first steps are quite simple; anyone who's written embedded code knows what we need to do first, and that's to start up some clocks and bring the peripheral blocks out of reset. We also need to set the DRAM voltage; this is stored in one of many "TPRs". We don't actually know what TPR stands for; it seems to be an Allwinner invention meant to obfuscate the values used for DRAM configuration. ChatGPT proposes that it may mean "Timing Parameter Register", but if that's the case, AW has an incredibly broad definition of "timing". It doesn't really matter what it means, though, and I'll ignore the concept of TPRs unless one is critical to understanding the behavior of the underlying code.
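
For a taste of what this configuration looks like in code, here's a sketch of an Allwinner-style parameter block, loosely modeled on the dram_para structs in U-Boot's existing sunxi drivers. The field names and layout here are illustrative, not the A100's actual struct:

    #include <stdint.h>

    /* Sketch of an Allwinner-style DRAM parameter block, loosely
     * modeled on the dram_para structs in U-Boot's sunxi drivers.
     * Field names and counts are illustrative, not the real layout. */
    struct dram_para {
        uint32_t clk;            /* DRAM clock in MHz */
        uint32_t type;           /* DDR3/DDR4/LPDDR3/LPDDR4 selector */
        uint32_t dx_odt;         /* per-byte-lane ODT configuration */
        uint32_t dx_dri;         /* per-byte-lane drive strength */
        uint32_t ca_dri;         /* command/address drive strength */
        uint32_t mr0, mr1, mr2;  /* mode register values to program */
        uint32_t tpr6;           /* opaque "TPR" words; on some SoCs */
        uint32_t tpr13;          /* these hold feature/training flags */
    };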

Okay, the blocks are running, are we good to go? Nope; we first need to configure the controller. In case you didn't catch it earlier, the controller in these SoCs supports 4 different, mutually incompatible variants of DDR memory:

  • DDR3
  • DDR4
  • LPDDR3
  • LPDDR4

These all run at different voltages and clock speeds, and have different parameters that need to be set up. So we need to let the controller know about everything our RAM chips support. We'll shorten "DRAM controller" to "DRAMC"; this is how the single-page blurb about the controller's supported functionality in the user manual refers to it.

DRAMC needs to be told the following about the RAM on the board:

Basic parameters
What type of DRAM is this? What's its maximum supported burst length? Does it support DBI? Does it support geardown mode?

The most important thing, of course, is the type of DRAM: whether it's DDR3, DDR4, LPDDR3 or LPDDR4. DDR3L is actually just DDR3 that also supports running at a slightly lower voltage (1.35 V instead of the standard 1.5 V).

Burst length is hardcoded to the maximum supported by the specific type of DDR SDRAM in use. Bursting lumps a number of consecutive data transfers (reads/writes) together under a single command, amortizing the command overhead; with DDR4's burst length of 8 on a 32-bit bus, for example, one read command moves 8 × 32 bits = 32 bytes.

DBI, or Data Bus Inversion, is a little more complicated. Here's how it's described by people smarter than me:

If DBI is enabled, then when the driver (controller during a write or DRAM during a read) is sending out data on a lane, it counts the number of “0” (logic low) bits.  If the number of bits driving “0” in the lane is five or more, then the entire byte is inverted, and a ninth bit indicating DBI is asserted low.  This ensures that out of the 8 DQ bits and the 9th DBI bit, at least five bits are “1” during any given transaction.  This also ensures that out of the entire data lane, the maximum total number of signals transitioning is either five 1’s to 9 1’s or vice-versa.  There can never be a situation where all bits go from 0 to 1 or from 1 to 0.

Since this reduces the number of voltage transitions, it can apparently reduce signal noise and power consumption. But I don't really understand it that well.
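
To check my own understanding, here's that rule restated as a minimal C sketch of the write path; this is illustrative, not code from the real firmware:

    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal sketch of DBI on the write path: if 5 or more of the
     * 8 DQ bits would drive 0, invert the byte and assert the
     * (active-low) DBI line. Not from the real firmware. */
    static uint8_t dbi_encode(uint8_t byte, bool *dbi_n)
    {
        /* Count how many of the 8 DQ bits would drive a logic 0. */
        int zeros = 8 - __builtin_popcount(byte);

        if (zeros >= 5) {
            *dbi_n = false;    /* DBI asserted (low): data inverted */
            return (uint8_t)~byte;
        }
        *dbi_n = true;         /* DBI left high: data sent as-is */
        return byte;
    }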

Finally, geardown mode essentially lets the DRAM latch commands and addresses at half the clock rate, which can improve stability when running at higher clock speeds.

ODT and address mapping
How do we set up ODT? How do we map columns/rows/banks/bank groups/ranks into linear memory addresses?

ODT, or On-Die Termination, is fairly critical; I'll be honest and say I don't understand the theory too well, but what it tries to solve is signal reflections. Essentially, at the high frequencies used by RAM (well into the MHz/GHz range), slight differences in impedance along the path electricity flows can make the signal literally reflect back, increasing the amount of noise. This is a bit of a problem, since we need a reasonable margin between a logic 0 and a logic 1 for the controller to figure out what the RAM says is stored at a certain location, and for the RAM to know what data to store.
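
For a bit of intuition on the reflection part: the textbook transmission-line result is that the fraction of a signal reflected at an impedance discontinuity is Γ = (Z_term − Z_line) / (Z_term + Z_line). ODT lets the DRAM (or controller) switch in a termination resistance so that Z_term ≈ Z_line, driving Γ toward zero instead of letting part of every edge bounce back down the trace.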

Address mapping describes how to map the bits of a linear memory address onto the column, row, bank, bank group, and rank used to access RAM. This has some complexity; you could do a straightforward mapping, but that wouldn't account for the realities of RAM addressing, like the latency cost of activating a new row; a clever mapping interleaves banks so one can be busy opening a row while another transfers data. I'm not well-versed on best practices here.
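
To make that concrete, here's one hypothetical (and deliberately naive) mapping in C, carving a 32-bit linear address into fields for a single-rank, 32-bit-bus setup; real controllers like the uMCTL2 let you place each address bit individually:

    #include <stdint.h>

    /* A deliberately naive mapping: low bits select the column, then
     * bank group, bank, and row. Purely illustrative. */
    struct dram_addr {
        uint32_t col, bg, bank, row;
    };

    static struct dram_addr map_addr(uint32_t lin)
    {
        struct dram_addr a;

        lin >>= 2;                /* 4 bytes per access on a 32-bit bus */
        a.col  = lin & 0x3ff;     /* 10 column bits */
        lin >>= 10;
        a.bg   = lin & 0x1;       /* 1 bank group bit */
        lin >>= 1;
        a.bank = lin & 0x3;       /* 2 bank bits */
        lin >>= 2;
        a.row  = lin;             /* remaining bits select the row */
        return a;
    }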

Timings and Mode Registers
When should I send certain data? What should the mode registers be set to?

I'm not going to get too into the weeds on timings; the name is fairly self-explanatory.

DRAM, as stated before, is essentially a bunch of capacitors that hold a charge. But there's more: the RAM also needs to know when to expect refreshes, which timings have been selected for commands, and a whole host of other things that make up a large segment of the JEDEC specifications. These settings live in special places on the RAM called "mode registers" (MRs). Each DDR specification defines how to set these registers, and what their values mean. On a random DIMM you'd throw into your computer, a little EEPROM (the SPD chip) stores the timings to default to (along with XMP profiles and such), and the firmware derives the MR values from that. There is no such chip on most embedded boards, so this is hardcoded into firmware in some cases, and stored in compile-time configuration in others.
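
As a small illustration of where some of those MR values end up: on the uMCTL2 side, a few of them get parked in controller registers before initialization even starts. INIT3/INIT4 and their field layout come from public uMCTL2 documentation (e.g. the Zynq UltraScale+ register reference); the base address below is made up, and writel() is a stand-in for the usual MMIO helper:

    #include <stdint.h>

    /* Minimal stand-in for U-Boot's writel(). */
    static inline void writel(uint32_t val, uintptr_t addr)
    {
        *(volatile uint32_t *)addr = val;
    }

    #define DRAMC_BASE   0x04810000u          /* made-up base address */
    #define DRAMC_INIT3  (DRAMC_BASE + 0x0dc) /* [31:16] MR0, [15:0] MR1 */
    #define DRAMC_INIT4  (DRAMC_BASE + 0x0e0) /* [31:16] MR2, [15:0] MR3 */

    /* Seed the controller with the MR values it will later write out. */
    static void dramc_seed_mrs(uint32_t mr0, uint32_t mr1,
                               uint32_t mr2, uint32_t mr3)
    {
        writel((mr0 << 16) | mr1, DRAMC_INIT3);
        writel((mr2 << 16) | mr3, DRAMC_INIT4);
    }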


Phew. That's a lot to absorb. We're done now, right? We've set up DRAMC with all this timing junk, so we can start using the RAM normally, right?

Nope. You forgot the PHY.

We now have to bring up the PHY, configure it, and get it talking with DRAMC. Unlike DRAMC, which is a known IP core, nobody knows exactly what PHY IP is used on the A100 series of SoCs. This means a lot of the code ends up being magic values we don't understand. However, we luckily have some debug symbols that shed a little light on what we're doing here.

One critical thing we do here is set up "swizzling". Basically, the RAM signals might be wired to different physical pins on the SoC for impedance/length-matching reasons, and we have to tell the PHY which logical signal each physical lane actually carries. In the case of the A100, there are 28 different lanes that have to be configured, and the exact swizzling used depends on some variant checks and the type of RAM in use.
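
In code, a swizzle ends up as nothing more exotic than a lookup table programmed into per-lane PHY registers. Here's a sketch; the table, register names, and base address are all invented, since the real per-variant tables live in the code linked at the end:

    #include <stdint.h>

    /* Minimal stand-in for U-Boot's writel(). */
    static inline void writel(uint32_t val, uintptr_t addr)
    {
        *(volatile uint32_t *)addr = val;
    }

    #define PHY_BASE         0x04820000u                  /* made up */
    #define PHY_LANE_MAP(n)  (PHY_BASE + 0xc0 + 4 * (n))  /* made up */

    /* Invented example table: entry i says which logical signal the
     * i-th physical lane carries. The real A100 tables cover 28 lanes
     * and vary by SoC variant and DRAM type. */
    static const uint8_t ca_swizzle[] = {
        0, 2, 1, 3, 5, 4, 6, 7,
    };

    static void phy_program_swizzle(void)
    {
        for (unsigned i = 0; i < sizeof(ca_swizzle); i++)
            writel(ca_swizzle[i], PHY_LANE_MAP(i));
    }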

Then we set up "V_ref", the reference voltage for signalling, along with drive strength (how hard to push a signal) and ODT settings for the PHY. Finally, we tell DRAMC to start the PHY, and actually issue the mode register writes to the DRAM (we handed a few of them to DRAMC earlier, but that was before we could submit any writes to the RAM itself).

But we're not done. We have one last thing to do, and that's some training, baby!


I promise, we're almost done. We just need to get through the trainings. Implementing these was tricky; there's a lot of writing of random values everywhere that we still don't actually understand, and not all boards require all trainings, but here's what we know we could be doing (most of which boils down to the sweep-and-center pattern sketched in code after this list).

  • Command/Address per-bit deskew
    As mentioned before, board designers try to length match traces as much as possible. However, there may still be slight differences, which could cause one bit to arrive at its destination earlier/later than the rest. Deskewing trains the PHY and RAM to send data with slight per-lane delays so everything arrives at its destination at very near the same time. In this case, we're doing these on the CA (Command/Address) lines; these are the physical column/row/... bits we were talking about earlier. We actually do this right before writing MRs to the DRAM; MR writes don't use any data bits, but are controlled through special patterns on the CA lines.
  • Read DQ/Write DQ deskew
    Like above, this deskews data lines, but these are specifically the data (DQ) lanes. We train separately for read/writes.
  • Read gate training
    This is more complicated. The "read gate" in question is the window of time where the memory controller actually samples data coming from the DRAM. This window can be different depending on temperature, manufacturing differences, the alignment of the planets on the first Sunday of the month, etc. This calibration is one of the most critical, and essentially every board's configuration requests this calibration.
  • DX bit delay compensation
    More deskew, sort of. Mainly this is trying to align the DQ (data) and DQS (data strobe) signals for each DX block (a group of data lines; in DesignWare-style PHYs, a byte lane). This is also usually required by every board.
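
Nearly all of these trainings share the same skeleton: sweep a delay knob across its range, note which settings read back correctly, and park in the middle of the passing window. Here's a generic sketch of that loop, with invented helpers standing in for the actual hardware access:

    /* Invented stand-ins for hardware access. */
    void phy_set_delay(int lane, int delay);
    int  lane_read_ok(int lane);

    #define DELAY_STEPS 64

    /* Generic sweep-and-center training for one lane. */
    static int train_lane_delay(int lane)
    {
        int first = -1, last = -1;

        for (int d = 0; d < DELAY_STEPS; d++) {
            phy_set_delay(lane, d);       /* move the sampling point */
            if (lane_read_ok(lane)) {     /* test pattern read back OK? */
                if (first < 0)
                    first = d;
                last = d;
            }
        }
        if (first < 0)
            return -1;                    /* no passing window at all */

        /* Centering maximizes margin against temperature/voltage drift. */
        phy_set_delay(lane, (first + last) / 2);
        return 0;
    }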

Finally, we're done. But are we?

On some boards, there may be settings configured for something called "DFS", or Dynamic Frequency Scaling. I bet you could never guess what this does; it dynamically scales the frequency depending on the workload. To support it, we essentially do all of this work again at each frequency point, leaving behind sets of parameters that DRAMC can switch between, which can reduce power usage without sacrificing stability.

But once that completes, we're finally done. We have RAM.


I'll note again that I'm no expert, but I haven't found much explanation of any of these processes before, so I thought even an amateur could take part in this. If you're smarter than me, more experienced with DRAM, and can inform me better on this subject, please do so!

Additionally, you can see the code I wrote here. If you happen to have industry knowledge about DRAM PHY IP cores, please reach out and let me know if you recognize this core based on the read/write patterns! I've cross-compared these registers and their values with common PUBL blocks, but nothing that I've seen has matched the register layouts here. It'd be super interesting to learn what IP this is, especially since similar PHYs have been used on at least 2 of Allwinner's dies at this point (H616, A523).

In any case, I'll have to continue telling this story later; I'll talk about submitting my first patches to the kernel and what I learned about that process. For now, peace out!