Core Question

[This post is about machines used by people. I realize things are different in the server room.]

We had one core per socket for a long time. When multi-cores came along, dual core seemed pretty awkward: real concurrency was possible, but with speedup bounded above by two, there wasn’t much point doing anything trickier than “make -j2”. Except in low-end machines, two cores seems to have been a passing phase. Now, several years later, it is possible to buy desktop processors with six or eight cores, but they do not seem to be very common or popular. However, I will definitely spend some time working for a 4x speedup, so stalling there may not be such a shame. Even some inexpensive tablets are quad core now. But are we just pausing at four cores for another year or two, or is this going to be a stable sweet spot? If we are stuck at four, there should be a reason. A few random guesses:

1. Desktop workloads seldom benefit much from more than four cores.
2. Going past four cores puts too much of a squeeze on the number of transistors available for cache memory.
3. Above four cores, DRAM becomes a significant bottleneck.
4. Above four cores, operating systems run into scalability problems.

None of these limitations is fundamental, so perhaps in a few years four cores will be low-end and most workstations will be 16 or 32?
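
To make the "speedup bounded above by two" point a bit more concrete, here is a minimal sketch using Amdahl's law. The post does not invoke that model, so treat it as my assumption about how to frame the bound; the function name and the 10% serial fraction are purely illustrative, not measurements of any real desktop workload.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Ideal speedup on `cores` cores when `serial_fraction` of the
    work cannot be parallelized (Amdahl's law)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    # Assumed, illustrative serial fraction: 10% of the work is sequential.
    serial_fraction = 0.10
    for cores in (1, 2, 4, 8, 16, 32):
        speedup = amdahl_speedup(serial_fraction, cores)
        print(f"{cores:2d} cores: {speedup:.2f}x")
```

With a perfectly parallel job (serial fraction 0) the speedup equals the core count, which is where the bound of two for dual core comes from. Under the assumed 10% serial fraction, the step from four cores to 32 buys far less than the 8x the core count suggests, which is one way guess 1 could play out.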

August 10, 2012

regehr

Computer Science, Futurist

12 responses to “Core Question”

msalib says:

August 10, 2012 at 3:19 pm

Guesses 2 and 3 do seem to be pretty fundamental. Although process improvements allow us to put more transistors on a single die over time, the wires that bond to the die are not increasing in density at nearly the same rate. So even as we pack more cores onto a die, we’re not getting much more throughput, which means that each core is seeing lower memory bandwidth.
LarsBerg says:

August 10, 2012 at 4:04 pm

I can rule out DRAM. I did some pretty exhaustive testing of the DDR connections and processor interconnects (in a UofC tech report; rejected from MSPC, but available online), and it’s tough to saturate modern DDR3 even with cache-antagonistic algorithms with 16 threads per processor.

If we stay at 4, my guess is more mundane — lack of demand for more cores and huge demand for better graphics hardware and lower power consumption.

Back when we started Manticore (in 2007), our Intel contacts were telling us they had a roadmap and the tech all laid out for 8 full (non-SMT) cores in a normal user computer by 2010, which is a lot of what drove our work on getting scalability at 32/64/128 cores. Instead, there has been quite a bit of very good work on the integrated graphics and dropping power consumption.

But, that’s pure speculation.
bcs says:

August 10, 2012 at 7:12 pm

If you haven’t already read it, go take a look at Herb Sutter’s “Welcome to the Jungle”.

tl;dr: we are very close to being able to put more CPU in a tablet than anyone needs closely attached to their screen/UI/etc. Once we hit that, there will be very little reason to pack around more powerful CPUs or even put them under your desk as opposed to using a relatively weak CPU there and offloading any real computation into a cloud (i.e. warehouse scale computing) with several orders of magnitude more CPU than you can get the electrical power to run.

Another thought to consider: the number of doping atoms in a modern transistor has already dropped low enough to not need scientific notation. Processes might quit shrinking in the near future.
Yossi Kreinin says:

August 11, 2012 at 9:33 am

Not awfully “fundamental” in the sense that things obviously do work with >4 cores, but: 4 is a nice number in a 2D world – inter-core communication logic can be concentrated in the spot in the middle of the 4 squares making up the core (or non-square regions if you don’t use a single hard macro for the core). AFAIK ARM Cortex-A15 8-core clusters are actually 2 4-core clusters cooperating somehow and my guess is because 4 is nice geometrically.
pm215 says:

August 13, 2012 at 5:11 am

Yossi: yes, a Cortex-A15 processor cluster has 4 cores maximum; for more cores than that you connect multiple clusters together (typically via a cache coherent interconnect). One of the advantages of doing it this way is that you don’t have to have all your cores identical: a big.LITTLE config might have 4xCortex-A15 + 4xCortex-A7, for example.
regehr says:

August 13, 2012 at 10:36 am

Thanks for the thoughts, folks.

msalib, the reason I didn’t consider memory bandwidth to be a fundamental limit is that NUMA seems like a pretty reasonable option. I’ve never programmed a NUMA, however.

Lars, I guess I don’t have a good mental model for DRAM. If I can make time I’ll run some experiments. Can you send a link to your TR?

bcs, I think we’ve already succeeded in putting more CPU basically everywhere than any non-power-user needs. On the other hand, Siri does the heavy lifting off of the phone in some server room, right?

Yossi, I hadn’t thought about the geometrical factor, but that sounds very reasonable.
Wes Felter says:

August 14, 2012 at 4:04 pm

I think they just decided to reduce system cost and use transistors to integrate GPUs instead of more cores. Right now the integrated GPUs are slow enough that it’s worth putting all the transistor growth into them. It will be interesting to see what happens after the chips become “balanced” (Skylake?); will core count increase again?

Yossi: Take a look at Sandy Bridge; it uses a ring bus so the cores are all in a row. A crossbar would probably be faster for 4C but I guess they want to be scalable. Although even for a crossbar I could imagine a row of cores on one side and the uncore on the other side.
bcs says:

August 15, 2012 at 8:42 am

@Yossi: So CPU macros might start looking like pie slices?
Anders says:

August 16, 2012 at 6:10 am

Putting on my “grumpy old man” hat:
I think it is telling that I experience about the same real/productive performance today from my quad-core 3GHz machine as I did 20 years ago on a 25MHz SUN workstation. True, I have much smoother graphics and can watch a full-HD movie if I want to, and can even do some computational stuff that was impossible back then, but for my daily work I do not see much of a difference. Most of the performance increase has been eaten up by “more of the same” already existing functionality, i.e. more visual eye candy…

It seems to me that multicore (x>4) in consumer and traditional office machines is a solution looking for a problem. Until we start doing something really revolutionary with all the GHz CPU cycles we already have available I can’t really see a need for even more transistors creating waste heat without contributing to world peace… 😉

The exception might be gaming, but as already noted, in the short term a transistor increase is likely to go into the GPU anyway.

I think it would be really cool if software vendors and open source projects from time to time made it a top priority to increase the performance of their software *on existing hardware*.
regehr says:

August 17, 2012 at 11:33 am

Hi Anders, I generally agree about the 25 MHz workstation feeling pretty much the same as a modern machine, but some things have changed for real:

– make -j4 makes a noticeable difference in my life as a developer

– scripting languages can be used guilt-free in a much wider variety of circumstances now

– heavyweight compression/decompression (e.g. bzip2 vs. gzip vs. compress) is enabled by the extra compute cycles

– crypto is free for most client-side purposes now — for example I don’t see any slowdown due to encrypted filesystems

– digital image processing (at multi-megapixel resolutions) is made possible by fast machines and in particular benefits from multicore and SIMD; a modern DSLR would be useless combined with a SPARCstation or similar
paul says:

August 18, 2012 at 6:30 pm

Surely there is CPU-intensive, parallelizable stuff that can’t be done in the cloud. Think of video recording which involves h264 or equivalent compression. You can’t upload the uncompressed video and compress it on a server. You have to compress first, upload after, and compression is designed to use about the most computation resources feasible in commodity devices. If the devices could compute more, they’d choose even more aggressive compression algorithms that used the extra capability.
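
A rough back-of-the-envelope sketch of why the raw video can't just be shipped to a server. The 1080p/30fps/24-bit format and the 5 Mbit/s uplink are assumed figures chosen for illustration, and the function and variable names are made up for this example.

```python
def raw_video_bitrate_bps(width: int, height: int, fps: int,
                          bytes_per_pixel: int = 3) -> int:
    """Bits per second needed to carry uncompressed video."""
    bytes_per_frame = width * height * bytes_per_pixel
    return bytes_per_frame * fps * 8


if __name__ == "__main__":
    # Assumed format: 1920x1080, 30 frames/s, 24-bit color.
    raw_bps = raw_video_bitrate_bps(1920, 1080, 30)
    # Assumed (hypothetical) home uplink of 5 Mbit/s.
    uplink_bps = 5_000_000
    print(f"raw stream: {raw_bps / 1e9:.2f} Gbit/s")
    print(f"exceeds the uplink by a factor of {raw_bps / uplink_bps:.0f}")
```

Under those assumptions the raw stream is about 1.5 Gbit/s, a few hundred times more than the uplink can carry, so the compression has to happen on the device before anything is uploaded.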

There is also the issue of the “utilization wall” that says power consumption with everything running full speed increases quadratically with shrinking feature size. So more and more of the chip has to stay unused at any moment to keep temperature manageable. That’s part of why CPUs are mostly cache now.

This page has more info and a bunch of links to papers, etc.: http://cseweb.ucsd.edu/~mbtaylor/
Anders says:

August 21, 2012 at 6:37 am

John, you’re absolutely right, especially about the image processing. The only way to do some serious image processing back then was on a Thinking Machines CM 2 or similar… 🙂

But I think there is an important distinction between typical developer machines and consumer/office machines and how they are used. Scripting languages and other very high-level languages are part of the problem. The fact that one can use scripting languages transparently to solve certain problems has had a huge impact on *developer productivity*, but as a user of the resultant software I do not care at all how it was crafted. I would very much prefer it to be blindingly fast instead.

OTOH – If the developer could not use the more productive tools, I might not get to use the software at all…