New AMD CPU line, Ryzen (using the "zen" architecture)
Chris Katko

Has anyone been following the new Ryzen? I've been eagerly waiting for 6+ months.

As I said in another thread, I'm a bit of an AMD fanboy, and I love when they release their CPUs because the comparable Intel CPUs "suddenly" drop 2-3X in price. (They're not stupid. They've got ONE competitor for desktops and servers.)

Anyhow, the new CPU architecture has:

- 20+ CPU sensors including millivolt and milliwatt power monitoring across... I don't know what granularity. Whether it's only core level, or sub-core level into the individual compute units.

- DDR4 (wooooo)

- A freakin' NEURAL NET cache system that learns your program's access patterns instead of following a pre-set algorithm from the CPU designers. (Wow!)

- Up to 8 cores for enthusiast market and up to 32 cores for servers.

- Hyper-threading, so up to 64 virtual cores! (Granted, hyper-threading in general SLOWS many benchmarks because the threads still share the same memory bus and contend with the other cores.)

- Much higher energy efficiency than the last generation. (Obviously.) The TDP is less than half of my current AMD FX-8370 CPU (~60 W vs ~150 W in mine). So there's plenty of room to do what they did with the last generation and just dial the frequency up.

- NO FIXED OVERCLOCK LIMIT. It'll actually scale to whatever your heatsink setup can handle. But I have no idea how it'll manage to detect and recover from / avoid crashing when it's overclocked too far. Maybe it's as simple as an automatic burn-in test through the BIOS. It will use those 20+ sensors I mentioned earlier.

- AM4 socket. I've heard the old socket was one of the biggest bottlenecks for last-gen AMD CPUs, including being limited to DDR3.

- HBM "high bandwidth memory", which is on-chip, 3-D stacked RAM for things like internal videocards. Per here, it runs at 128 GB/s bandwidth. If it were me, I'd have the CPU directly use that RAM and call that stuff L4 cache. But I'm no designer, obviously. HBM 3 is slated for 512 GB/s. HBM 2 has up to 4 GB per "Stack." That's a huge amount of nearby memory. No details (AFAIK) on what HBM the Zen will use yet.

- Hardware encrypted memory pages using an internal ARM Cortex-A5.

- NVM Express. A PCI Express bus designed specifically for blazing-fast disk/memory expansion. It targets SSDs, whereas the SATA 3 standard has plenty of overhead because it supports physical drives (as well as, perhaps, legacy standards). It also supports super long command queues.

- Supports SATA Express. That's pretty neat, check out the wiki. It's completely different and exposes two PCI Express lanes. Apparently it's designed that way because current SATA standards can't keep up with rising SSD speeds. However, almost nobody is using it yet. So let's hope the standard doesn't flop.

And these are all just the chips that are launching. Who knows how far this architecture will scale in later releases. (It's slated to run to at least 2020.)


To immediately segue into a tangent:

- A freakin' NEURAL NET cache system that learns your program's access patterns instead of following a pre-set algorithm from the CPU designers. (Wow!)

As I've been saying for a few years now, I'm pretty sure CPU cache is a ubiquitous bad idea. Its purpose is solely to make poorly written programs performant, while hamstringing those who know how to write performant programs by placing a guessing game between the engineer and the metal.
Accessing RAM is slow. It's been slow effectively forever. I don't think it's possible to be a valuable engineer and not know accessing RAM is slow. Since it is a fact, and it is a universally known fact, why are CPU manufacturers still trying to pull off this con with this invisible man-in-the-middle?
Sure, there is a rather large contingent of very poorly written software which relies on cache existing and which people in turn rely on, but that seems like the sort of problem you could solve with a flag on your program at startup.
From an engineering standpoint, a neural net cache system in hardware is impressive, but I'd be much happier with the ability to just turn the whole thing off and use the memory as work RAM.
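For what it's worth, x86 already exposes a narrow version of "turning the cache off": non-temporal stores, which write straight to memory without polluting the cache. A minimal sketch using SSE2 intrinsics (GCC/Clang on x86; the function name is mine):

```c
#include <emmintrin.h>  // SSE2 intrinsics (x86-specific)

// Fill a buffer with non-temporal stores. These bypass the cache
// hierarchy entirely, so writing a large buffer you won't re-read
// soon doesn't evict your actual working set.
static void fill_streaming(int *dst, int n, int value)
{
    for (int i = 0; i < n; ++i)
        _mm_stream_si32(dst + i, value);  // non-temporal 32-bit store
    _mm_sfence();  // order the streamed writes before any later reads
}
```

It's opt-out per store rather than a global switch, but it's about the closest thing shipping hardware offers to managing that memory yourself.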

For my next crusade against bullshit CPU abstractions: Registers--When the CPU has over 100, why do I have to pretend there are only 16?


I hope there will be an 8 cores / 8 threads variant available, as various sources on the internet claim.
Then I will finally be replacing my old Athlon X2 5600+. :)


Accessing RAM has always been relatively slow, but nevertheless cache misses are unavoidable. And working from the hardware side to minimize them makes as much sense as working from the software development side. RAM is Random Access Memory because, you know, your access patterns have random elements in them.

I agree that in certain cases software developers may get a free ride on this, getting away with poorer code. But then, this may drive down software development costs.


Cache misses aren't unavoidable. If you're writing your code such that you don't know what memory you'll need soon, you're writing bad code. It's just resource management, like all other engineering.
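To illustrate the point: even a pointer-chasing structure, where the address of the next load is data-dependent, can be written so you tell the hardware what you'll need soon. A sketch using the GCC/Clang prefetch builtin (structure and function names are mine):

```c
#include <stddef.h>

// Sum a linked list, prefetching the next node while processing the
// current one -- "knowing what memory you'll need soon" in practice.
struct node { int value; struct node *next; };

static long sum_with_prefetch(const struct node *n)
{
    long total = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next);  // GCC/Clang builtin: start the load early
        total += n->value;
        n = n->next;
    }
    return total;
}
```

The prefetch overlaps the next node's memory latency with the current node's work, so the miss cost is (partly) hidden rather than paid in full on every hop.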

Chris Katko

You can be the greatest programmer on the planet. Good for you. Now enjoy sharing that same processor with 100 other programs running at the same time with no guarantee of ordering, none of which are written by you. Likewise, any CPU with out-of-order scheduling (read: 99.9% of them) can and does reorder any set of instructions (including the ones you wrote "correctly") as it sees fit. Read that again. The instructions you write are merely suggestions and get re-ordered as the CPU sees fit.

I'm not going to spend any more time refuting the idea that "caches are unnecessary" because it's absurd.


Out of order execution is just another hack to compensate for poorly written code.

I know the idea of caches being unnecessary is absurd. My point is that the absurdity is an artifact of needless abstractions. Work RAM solves the same problem as cache without the needless abstraction. It's the operating system's job to allow your program to think it's the only one running.


Cache misses aren't unavoidable.

They can be minimized, but they can't be avoided entirely, due to the nature of data access. You sometimes need to pull some hashed value from somewhere in your RAM, and you're sometimes clueless as to where until you've actually run the hash calculation. This applies, for example, to cryptography.

In many cases you can very much nullify them: for example, in certain machine learning algorithms you can precalculate all the tables and then read them linearly. It will almost never miss.

But if you're compressing some huge chunk of data with a dictionary larger than your cache, you'll be running a regular and massive miss rate.


Then you kick off your data load and perform whatever calculations are available. If there aren't any, the cost of an unforeseen load to work RAM is still (potentially/slightly) lower than a cache miss, because you may have an area which you know doesn't need to go back to RAM and can be discarded.


I guess it's implemented in optimized applications, e.g. the latest WinRAR. Not that this doesn't produce its own overhead, nor does it answer the problem completely.

Another thing is that part of this job can be (and IS) done by the hardware. We go higher and higher, encapsulating more and more stuff, exporting work away from our own lines of code to the standard libraries of programming languages, the OS, the hardware... That's the way of progress.

Something tells me if you were coding in the 80-90s, you'd refuse to move to C, sticking with asm. :)
And, I'm sure your code would work faster.. The code you'd have the time to write.


There's a rather direct mapping from C to machine code, and the optimizer does a pretty good job. There are a few instances where hand rolled ASM would be faster, but those are very marginal gains for a very high cost. With work ram as opposed to cache, I feel there are rather significant gains for a very low cost, and that's just in a single threaded environment. Once you have to worry about cache coherency, cache really starts to cost you more than it ever bought.


The only thing I have ever bothered with that is connected to this "level" is trying to access things sequentially wherever possible. Not sure if I ever thought any deeper.

Chris Katko

I think it's interesting how some optimizations of the 90's are anti-optimizations today. Memory access is so slow (the "CPU/memory gap" is widening so drastically) that it's now better to do very many CPU instructions on a single cacheline read (64 bytes on most CPUs) than it is to pre-compute AT COMPILE TIME a set of values like a sin/cos table.

With 100 to 200 CPU cycles per memory read, how much could YOU accomplish (especially when instructions can be chained/pipelined/vectorized/whatever to increase throughput) before the next RAM cacheline even shows up? The answer is a lot!

Back in the 80's and 90's, however, it was common to use pre-computed look-up tables.

Also, relpatseht, you mentioned "Work RAM." Do you have any actual experience with it? Are you talking about NUMA setups where each core has their own "working space" RAM?

That reminds me of an interesting development coming soon! (And I already wrote about it in the first post.) HBM. High-bandwidth memory (and whatever competitors' equivalents are called) is 3-D stacked DRAM soldered right onto the die. We're talking insanely fast memory channels. AMD is planning on using it for on-the-chip VRAM for their on-die "videocards".

HBM 1.0 was up to 4 GB of RAM (not bad even by today's standards for an internal videocard) at 128 GB/s speed.

In 2016, Samsung was producing HBM 2.0. 8 GB per package at 256 GB/s.

HBM 3.0 is slated for ~2020 and will support 512 GB/s at unknown size.

That's a CRAP TON of memory to have sitting so freaking close to the CPU, I would call it "L4 cache." Now, there's no indication that this memory will be code accessible. But imagine if it was! Imagine having 4/8/16+ GB of insanely fast DRAM right there, for your cores to play with. Even if you divided it evenly per core, with 8 cores and 8 GB of RAM, each core has an entire 1 GB of cache! You could write a GIG OF DATA before having to slow down and flush to normal system RAM (even longer if you're running a circular buffer and the system is constantly trying to move the L4 back to system RAM). I would absolutely love to play with that and optimize it for my software/games.
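The circular-buffer idea in that last parenthetical can be sketched in a few lines. Everything here is hypothetical: an ordinary static array stands in for the fast local memory, and a memcpy stands in for the background flush to system RAM.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical per-core fast buffer: write locally, flush the whole
// window to (slow) system RAM only when the writer wraps around.
#define RING_BYTES 4096

struct ring {
    uint8_t fast[RING_BYTES];   // stand-in for fast local memory
    size_t  head;               // next write position
    uint8_t *backing;           // stand-in for slow system RAM
};

static void ring_write(struct ring *r, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (r->head == RING_BYTES) {             // buffer full: flush and wrap
            memcpy(r->backing, r->fast, RING_BYTES);
            r->head = 0;
        }
        r->fast[r->head++] = src[i];
    }
}
```

A real implementation would flush incrementally in the background (ideally via DMA) instead of stalling at the wrap, but the bookkeeping is this simple.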


How fast are sin/cos done now?
Regarding tables, they're still not useless. Probably for such simple things, though. I did one a couple of months ago. I didn't compare it to realtime compute, but it's orders of magnitude of a difference.

Chris Katko

I used "useless" as a general term. Useless compared to what it used to be used for, and for what most people would need it for. (How many times can I use... the letters "use" in this sentence? ;) )

Pre-computing will always have its place. But it's--of course, as you already know--only applicable for data that doesn't change, and ideally, can fit inside a cacheline of 64 bytes. It's a trade-off directly connected to the CPU-Memory Speed Gap. The larger the gap, the larger the pre-computation that needs to be done before it becomes cost-effective. And computations that are that large are rarer and more complex. (Square root used to be one. Tangent functions used to be one. Color conversions between HSL/RGB/etc used to be one, but nowadays a few multiplies are practically free compared to the memory access.)

Graph of CPU/Memory gap for those who haven't seen it

So clearly you still see it in niche cases. But those niche cases keep getting thinner as the cpu/memory gap increases. However, on systems where that gap is still small (many embedded systems), pre-computation still has a wider array of applicable situations. But I'm talking mostly about desktops here.

I would absolutely LOVE to have a gig of "work RAM." You could pre-compute HUGE swaths of things (no longer limited to cacheline size, or cache size) and have them sitting there right next to the CPU, ready for faster-than-RAM access. The CPU/memory gap has grown so large, it's really time we got a huge "L4" DRAM cache.

Of course, my use of the term "work RAM" is predicated on the assumption that I'm using the term correctly! I've almost never heard the term before today, and Google shows nothing. So I'm taking "work RAM" to mean RAM that is local to a core or CPU, faster than system RAM, and smaller.


I call it work RAM because that's what it is. RAM you do all your actual work in. Local to a particular core and very fast. Preferably with a DMA unit attached.

I may have made the term up because it makes more sense than "locked cache", which is what it was called on the only processors I've ever seen it on (PPC Wii/WiiU). There you had the option of setting up half your data L1 to be a CPU addressable chunk of memory you could DMA into and out of. Of course, they also had a hardware bug forcing you to stall on the DMA transfer or suffer I-cache corruption, but even so it was faster than using cache as intended.

You don't need gigabytes of it. It might be nice, but every level of cache, if recomposed as work RAM, is another level of resource management. If your data is set up properly, you only need enough work RAM for the processing time to cover the streaming cost. Having more increases flexibility some, though.

From my perspective, the more you get to thinking about it, the more you realize cache is just a stupid idea. I didn't even mention associativity in my earlier post, but that isn't free and also solved with work ram.

Chris Katko

Oh yeah, I didn't mean that I would "need" gigabytes of it. But gigs of it basically come for free. That is, once you've got HBM, you've probably already got a few gigs or so.

Even a couple dozen MB would probably be huge for most people. (Say, one order-of-a-magnitude larger than the L3 cache. So L3 20MB, 200MB L4.) And while yes, it's one more layer of memory management, that simply boils down to: "Do I want to pay the complexity in exchange for the performance?" And if not, you simply ignore the memory.

I mean, why would anyone use CUDA / GPUs to do calculations? There, the barrier-to-entry is much higher because the cores are so tiny and rigid compared to a general-purpose CPU that can access any position in memory. But people still find it worthwhile.

HBM with a CPU seems like a much smaller barrier-to-entry for non-experts, yet with huge room for performance improvements. You don't HAVE to convert your problem into a massively-parallel algorithm. You could simply have... faster RAM, with your normal, sequential X[i+1] = X[i]+1 algorithm, and all you have to do is keep your memory inside this "buffer" or "static pool". It could be exposed as simply as a linear array in C/C++.

__hbm[204] = __hbm[203] + 1; // each value is the previous one, plus one

int output = __hbm[0];

That's orders-of-magnitude simpler to understand than managing CUDA cores, and as we know, CUDA cores are widely used in spite of their complexity and constraints.

And it can be debugged inside a normal compiler, instead of a CUDA one. Breakpoints and all.
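Since no shipping hardware exposes such a region yet, here's a purely hypothetical sketch of how simple the management could be: a bump allocator over a fixed window, with a plain static array standing in for the fast memory (all names are made up).

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical "work RAM" manager: if a fast region were exposed as a
// plain addressable array, a bump allocator is all the bookkeeping a
// simple program would need. A static array stands in for the region.
#define WORK_RAM_BYTES (1 << 16)
static _Alignas(16) uint8_t work_ram[WORK_RAM_BYTES];
static size_t work_ram_used = 0;

static void *work_alloc(size_t bytes)
{
    bytes = (bytes + 15) & ~(size_t)15;  // keep allocations 16-byte aligned
    if (bytes > WORK_RAM_BYTES - work_ram_used)
        return NULL;  // out of fast memory: caller falls back to normal RAM
    void *p = &work_ram[work_ram_used];
    work_ram_used += bytes;
    return p;
}
```

A program would then run its normal sequential X[i+1] = X[i]+1 loop directly inside that window, with no special compiler or debugger required.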


The larger the gap, the larger the pre-computation that needs to be done before it becomes cost-effective. And computations that are that large are rarer and more complex.

That depends on another factor: sequentiality. If you process a large chunk of data which you either calculate in realtime or read from a huge array (while reading all of the array), the delays stop mattering at all, and the bandwidth... well, it's in the tens of GB/s nowadays on a simple system.

Let's say the data chunks are 64 bytes, and let's say you grab them at a rate of 16 GB/s (my DDR3 is faster, FYI).

That allows you to grab ~268M entries in a second.
My CPU runs at 4 GHz, so that's about 15 cycles to process each entry in a single thread.
60 with perfect utilization of four threads. (It's a quad-core i5.)

Now the question is also how much of the other data you need to load from RAM in order to do your calculation. I guess, as is, it's a very general question.
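That arithmetic is easy to parameterize; a quick helper to sanity-check the numbers quoted above (function name is mine):

```c
// Cycles of compute available per cache line, given memory bandwidth
// and clock speed: at 16 GiB/s delivering 64-byte lines, a 4 GHz core
// gets roughly 15 cycles per line before the next one can even arrive.
static double cycles_per_cacheline(double bytes_per_sec, double line_bytes,
                                   double clock_hz)
{
    double lines_per_sec = bytes_per_sec / line_bytes;  // ~268M at 16 GiB/s
    return clock_hz / lines_per_sec;
}
```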


My thesis, though, is that it's a one to two orders of magnitude performance gain if the data you access is sequential, not the three or more that the chart shows. But then again, one would need a more specific task and deeper understanding.

Kitty Cat

Out of order execution is just another hack to compensate for poorly written code.

Or for the CPUs themselves. The assembly you write is not what the actual core processes. Back in the early days, x86 used CISC cores. This was known to be slow, but it was convenient since you could tell the CPU to do a complex instruction and wait for the result, instead of manually supplying each cheaper/"smaller" instruction that comprises the complex one (e.g. just tell it you want the square root, instead of supplying individual instructions to calculate it; back when you didn't have reliably tested libraries and compilers to implement a sqrt() function correctly, it's nicer to let the CPU do it). But even if you only use the smaller/cheaper instructions, the CISC design caused unavoidable overhead. Motorola with PPC, and nowadays ARM, instead employ RISC cores. This puts extra work on the developer to provide more instruction calls to do the task at hand, but it ends up being more efficient since it can cut out some unnecessary work, and can be better parallelized.

Seeing this, along with a rise in performance-critical applications, Intel redesigned their chips to have RISC cores, with a CISC front-end that "breaks up" the complex instructions into smaller ones (to retain compatibility with existing applications). This alone wouldn't do much, but it can also be seen that multiple complex instructions may break down to include some common smaller instructions, or it may see that it's better to move some of the generated smaller instructions to occur earlier or later in a way that the application couldn't otherwise specify with the complex instruction set. So as a result, those instructions can be reordered to minimize the overall amount of work.

Similarly, CPUs themselves and the systems they're in are constantly changing. Where it may have once been most efficient to do A then B, it may end up that it's now more efficient to do B then A. It's not bad code, it's just code that couldn't predict (and wait for) the future, so why not help them along by automatically reordering the instructions as long as the end result is still the same?
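One way to see why "same end result" reordering helps: a reduction written as a single chain serializes on its accumulator, while independent chains let the core overlap work. Splitting the accumulators hands it that freedom explicitly (a common optimization; the function name is mine):

```c
// Sum an array with two independent accumulator chains. Since the
// "a +=" and "b +=" additions don't depend on each other, an
// out-of-order core can execute them in parallel, and the final
// result is identical either way.
static long sum_two_accumulators(const int *data, int n)
{
    long a = 0, b = 0;
    for (int i = 0; i + 1 < n; i += 2) {
        a += data[i];      // chain 1
        b += data[i + 1];  // chain 2, independent of chain 1
    }
    if (n & 1)
        a += data[n - 1];  // leftover element for odd lengths
    return a + b;
}
```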

Chris Katko


Slated for 2019? The Zen+ cores, and a freakin' 48-core (+ 2x hyperthreading) CPU using a process half the size (7nm FinFET vs 14nm).

That's also very close to "the end" of Moore's Law, which experts think to be at 5nm.

Since I just upgraded to an FX-8370 last year, I'll probably hold out at least till 2018 for my next CPU/motherboard/RAM upgrade. I want to see the next (as opposed to "initial release") CPUs that come out using the Zen cores.

And the Zen+ cores have me really intrigued. (As I might have said earlier). I'm curious to see how far AMD can take this Zen (and Zen+) architecture and improve it.

AMD is releasing a server Zen CPU with up to 32 cores; however, the leaked specs show a 2.8 GHz maximum clock speed, as opposed to 4 GHz on the 8-core desktop CPUs. I'd love more cores (I use all eight of mine! Virtual machines, anyone?), but I don't want to drop to almost half the clock speed.

[edit] - [New Post]

Newest Ryzen 1800X (top line) benchmarks show it beats an Intel CPU that costs twice as much. (i7 6900K)

$500 vs $1000 is a pretty dang big deal for most of us. That's the difference between "almost affordable upgrade" and "no way in hell I'm wasting that kind of money".

[edit] - [New Post]


Price's comin' down, baby!


In order to saturate such a massive core count, won't one need at least some 100 GB/s mem bandwidth?

It can badly affect performance of even four cores... Also, AFAIK (or is it just my poor memory/invalid sources?), AMD lagged behind Intel in tech for predicting further memory requests, i.e. it caused more misses.

Chris Katko

That's possible. But Intel is running quite a few 24 core server CPUs right now and nobody is complaining about that (AFAIK).

And we have no idea what socket the server-class Zen(+) CPUs may run on, and what memory controller(s) it'll use.

Now granted, IIRC, AMD did make a mistake with their previous generation's memory controllers (the ones you're alluding to), which weren't up to the task compared to Intel's. With billions (tens? hundreds?) on the line, one would assume they've learned their lesson.

So you may be right. But my only point at this time is that the same physical constraints affect the Intel lines and everyone seems content. So if it's a design decision, I'm pretty sure AMD will correct the design accordingly for their new server line.

Thread #616747.