<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Discovering New Instructions</title>
	<atom:link href="http://blog.regehr.org/archives/669/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.regehr.org/archives/669</link>
	<description></description>
	<lastBuildDate>Thu, 23 May 2013 17:10:41 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
	<item>
		<title>By: Alastair Reid</title>
		<link>http://blog.regehr.org/archives/669/comment-page-1#comment-3870</link>
		<dc:creator>Alastair Reid</dc:creator>
		<pubDate>Thu, 26 Jan 2012 07:18:51 +0000</pubDate>
		<guid isPermaLink="false">http://blog.regehr.org/?p=669#comment-3870</guid>
		<description><![CDATA[John: I think you&#039;re right that we need to look at instruction sequences, dataflow, etc. but there are a lot of tradeoffs here.

Your intended microarchitecture matters: If you plan to use register renaming, you want to avoid instructions with large numbers of destination registers and favor conditional move instructions over conditional instructions.
(I also don&#039;t know which processor added conditional move - but there&#039;s a good chance it was a processor with register renaming.)

If you&#039;re more focused on energy efficiency, you might choose more complex instructions requiring greater logic depth - but that&#039;s going to hurt your high frequency implementations.

How long you&#039;re going to have to live with your design choices also matters. If you&#039;re commercially successful, you&#039;ll be living with them for 20-30 years, so decisions which make a huge amount of sense for today&#039;s microarchitecture may be things you regret in 15 years&#039; time.  So you&#039;ve got to anticipate all the various scaling trends (Moore&#039;s law, memory wall, power wall, dark silicon, ...) but also produce a design that you can make well today.  You&#039;ve also got to leave room to add new instructions in the future.

Do you want to define subsets of your architecture for use in cost/area-constrained niches like microcontrollers?  (Microcontrollers are often implemented on older processes like 180nm or 130nm instead of the 45nm or 32nm used for high end cores so, while Moore&#039;s law applies, they&#039;re behind the curve.)
And if you do aim to service multiple niches, then you have to consider a wider range of workloads - embedded code is very different from running a web browser.

Do you want to support high performance computing? How are you going to support the massive parallelism and energy efficiency required for exascale machines? With 1,000,000 or more processors, reliability is an issue - what will you handle in software and what in hardware? Do you handle it in user-level code or only at the OS layer?

And then you have to produce compilers - some degree of completeness and regularity is a good idea. In a regular architecture you can exploit this by first generating a direct version of the code (using simple instructions) and then applying a series of peephole rewrites to replace them with more complex compound instructions.  If the ISA lacked the simple instructions, you would have to leap straight for the complex instructions.  (I&#039;m especially thinking of vectorization here - you have a complex transformation with many side-conditions which prevent it from being applied, so you want to start by targeting a simple ISA, and then be able to apply a series of transformations to improve on it.  Too many optimality islands will be hard to target.)

Do you even have the workloads you need to produce the traces? SPEC is hopelessly unrepresentative of real programs. JITs are changing rapidly, and runtime code generation seems to be making a comeback. Languages are becoming more dynamic (more branches, more indirect branches) - but will that trend reverse as designers accept that Moore&#039;s law is ending (if it is) and programmers have to take an ax to today&#039;s bloatware? In the time it takes to launch a new processor, get compilers, OSes, etc. fully up to speed, get enough design wins to reach high volume, etc. all the key parts of a web browser will be completely rewritten. So just as you have to anticipate the hardware trends, you have to predict some of the software trends in the absence of actual workloads.]]></description>
		<content:encoded><![CDATA[<p>John: I think you&#8217;re right that we need to look at instruction sequences, dataflow, etc. but there are a lot of tradeoffs here.</p>
<p>Your intended microarchitecture matters: If you plan to use register renaming, you want to avoid instructions with large numbers of destination registers and favor conditional move instructions over conditional instructions.<br />
(I also don&#8217;t know which processor added conditional move &#8211; but there&#8217;s a good chance it was a processor with register renaming.)</p>
<p>If you&#8217;re more focused on energy efficiency, you might choose more complex instructions requiring greater logic depth &#8211; but that&#8217;s going to hurt your high frequency implementations.</p>
<p>How long you&#8217;re going to have to live with your design choices also matters. If you&#8217;re commercially successful, you&#8217;ll be living with them for 20-30 years, so decisions which make a huge amount of sense for today&#8217;s microarchitecture may be things you regret in 15 years&#8217; time.  So you&#8217;ve got to anticipate all the various scaling trends (Moore&#8217;s law, memory wall, power wall, dark silicon, &#8230;) but also produce a design that you can make well today.  You&#8217;ve also got to leave room to add new instructions in the future.</p>
<p>Do you want to define subsets of your architecture for use in cost/area-constrained niches like microcontrollers?  (Microcontrollers are often implemented on older processes like 180nm or 130nm instead of the 45nm or 32nm used for high end cores so, while Moore&#8217;s law applies, they&#8217;re behind the curve.)<br />
And if you do aim to service multiple niches, then you have to consider a wider range of workloads &#8211; embedded code is very different from running a web browser.</p>
<p>Do you want to support high performance computing? How are you going to support the massive parallelism and energy efficiency required for exascale machines? With 1,000,000 or more processors, reliability is an issue &#8211; what will you handle in software and what in hardware? Do you handle it in user-level code or only at the OS layer?</p>
<p>And then you have to produce compilers &#8211; some degree of completeness and regularity is a good idea. In a regular architecture you can exploit this by first generating a direct version of the code (using simple instructions) and then applying a series of peephole rewrites to replace them with more complex compound instructions.  If the ISA lacked the simple instructions, you would have to leap straight for the complex instructions.  (I&#8217;m especially thinking of vectorization here &#8211; you have a complex transformation with many side-conditions which prevent it from being applied, so you want to start by targeting a simple ISA, and then be able to apply a series of transformations to improve on it.  Too many optimality islands will be hard to target.)</p>
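<p>A minimal sketch of that two-stage approach, under loud assumptions: the tuple instruction format, the mul/add/fma opcodes and the helper names fuse_mul_add and _read_later are all hypothetical, invented for illustration and not taken from any real compiler or ISA. Stage one is assumed to have emitted only simple instructions; the peephole then fuses a multiply followed by a dependent add, but only when a side-condition holds (the temporary is not read again).</p>

```python
# Toy instruction format (hypothetical): (opcode, dest, src1, src2[, src3])

def _read_later(code, start, reg):
    """True if reg is read at or after index start before being overwritten."""
    for ins in code[start:]:
        if reg in ins[2:]:
            return True
        if ins[1] == reg:
            return False
    return False

def fuse_mul_add(code):
    """Peephole: rewrite 'mul t,a,b ; add d,t,c' as 'fma d,a,b,c'.

    The rewrite fires only when the temporary t has no later reader,
    the kind of side-condition that can block a transformation.
    """
    out = []
    i = 0
    while i < len(code):
        ins = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        if (nxt is not None
                and ins[0] == "mul" and nxt[0] == "add"
                and nxt[2] == ins[1]      # the add consumes the mul result
                and nxt[3] != ins[1]
                and not _read_later(code, i + 2, ins[1])):
            out.append(("fma", nxt[1], ins[2], ins[3], nxt[3]))
            i += 2
        else:
            out.append(ins)               # keep the simple instruction
            i += 1
    return out

naive = [
    ("mul", "t0", "a", "b"),
    ("add", "d", "t0", "c"),   # fusable: t0 is dead afterwards
    ("mul", "t1", "x", "y"),
    ("add", "e", "t1", "z"),   # not fusable: t1 is read again below
    ("add", "f", "t1", "e"),
]
optimised = fuse_mul_add(naive)
```

<p>When the side-condition fails, the simple instructions are still valid output, which is the fallback a regular ISA provides.</p>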
<p>Do you even have the workloads you need to produce the traces? SPEC is hopelessly unrepresentative of real programs. JITs are changing rapidly, and runtime code generation seems to be making a comeback. Languages are becoming more dynamic (more branches, more indirect branches) &#8211; but will that trend reverse as designers accept that Moore&#8217;s law is ending (if it is) and programmers have to take an ax to today&#8217;s bloatware? In the time it takes to launch a new processor, get compilers, OSes, etc. fully up to speed, get enough design wins to reach high volume, etc. all the key parts of a web browser will be completely rewritten. So just as you have to anticipate the hardware trends, you have to predict some of the software trends in the absence of actual workloads.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ben Ylvisaker</title>
		<link>http://blog.regehr.org/archives/669/comment-page-1#comment-3854</link>
		<dc:creator>Ben Ylvisaker</dc:creator>
		<pubDate>Tue, 24 Jan 2012 13:26:18 +0000</pubDate>
		<guid isPermaLink="false">http://blog.regehr.org/?p=669#comment-3854</guid>
		<description><![CDATA[Another line of similar work is Bracy&#039;s Mini-Graphs (e.g., Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth, MICRO, 2004).  I was surprised that it doesn&#039;t seem to be mentioned in the Galuzzi &amp; Bertels survey.

Scott Mahlke has also done a raft of related work.]]></description>
		<content:encoded><![CDATA[<p>Another line of similar work is Bracy&#8217;s Mini-Graphs (e.g., Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth, MICRO, 2004).  I was surprised that it doesn&#8217;t seem to be mentioned in the Galuzzi &amp; Bertels survey.</p>
<p>Scott Mahlke has also done a raft of related work.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Harris</title>
		<link>http://blog.regehr.org/archives/669/comment-page-1#comment-3853</link>
		<dc:creator>David Harris</dc:creator>
		<pubDate>Tue, 24 Jan 2012 13:22:19 +0000</pubDate>
		<guid isPermaLink="false">http://blog.regehr.org/?p=669#comment-3853</guid>
		<description><![CDATA[@Daniel Lemire 

(2) Galois multiplication is not nearly as common as integer multiplication. If the divisibility property is important, then you should work modulo a large prime close to 2^n. However, the number-theoretic characterization of division is usually more important than its algebraic properties.

By the way, the AES-NI instruction set does allow faster multiplication over a Galois field.]]></description>
		<content:encoded><![CDATA[<p>@Daniel Lemire </p>
<p>(2) Galois multiplication is not nearly as common as integer multiplication. If the divisibility property is important, then you should work modulo a large prime close to 2^n. However, the number-theoretic characterization of division is usually more important than its algebraic properties.</p>
<p>By the way, the AES-NI instruction set does allow faster multiplication over a Galois field.</p>
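<p>For readers unfamiliar with the operation, a minimal software sketch of multiplication in GF(2^8). The assumptions are mine: the function name gf256_mul is made up, and the reduction polynomial 0x11B (x^8 + x^4 + x^3 + x + 1) is the one used by AES, chosen only as a familiar example. A hardware carry-less multiply performs the shift-and-XOR product step in one instruction; the reduction step is still separate.</p>

```python
def gf256_mul(a, b, poly=0x11B):
    """Multiply a and b in GF(2^8): shift-and-XOR product, reduced mod poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a    # addition in GF(2) is XOR, so there are no carries
        a <<= 1            # multiply a by x
        if a & 0x100:      # degree reached 8: subtract (XOR) the polynomial
            a ^= poly
        b >>= 1
    return result

product = gf256_mul(0x57, 0x83)
```

<p>Unlike integer multiplication, every nonzero element here has a multiplicative inverse, which is the divisibility property under discussion.</p>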
]]></content:encoded>
	</item>
	<item>
		<title>By: gsg</title>
		<link>http://blog.regehr.org/archives/669/comment-page-1#comment-3852</link>
		<dc:creator>gsg</dc:creator>
		<pubDate>Tue, 24 Jan 2012 10:09:53 +0000</pubDate>
		<guid isPermaLink="false">http://blog.regehr.org/?p=669#comment-3852</guid>
		<description><![CDATA[For some purposes it might be more worthwhile to optimise for code size than to try to cherry-pick individual instructions.

Modern OOO processors tend to play pretty fast and loose with the information exposed in an ISA, caching it, rewriting it into microcode, etc. Thus a highly specialised instruction stands at least some chance of becoming obsolete over time as the processor implementation evolves, whereas compactness doesn&#039;t seem likely to ever become undesirable so long as caches are still a thing.]]></description>
		<content:encoded><![CDATA[<p>For some purposes it might be more worthwhile to optimise for code size than to try to cherry-pick individual instructions.</p>
<p>Modern OOO processors tend to play pretty fast and loose with the information exposed in an ISA, caching it, rewriting it into microcode, etc. Thus a highly specialised instruction stands at least some chance of becoming obsolete over time as the processor implementation evolves, whereas compactness doesn&#8217;t seem likely to ever become undesirable so long as caches are still a thing.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: reno</title>
		<link>http://blog.regehr.org/archives/669/comment-page-1#comment-3850</link>
		<dc:creator>reno</dc:creator>
		<pubDate>Tue, 24 Jan 2012 09:11:47 +0000</pubDate>
		<guid isPermaLink="false">http://blog.regehr.org/?p=669#comment-3850</guid>
		<description><![CDATA[@Pascal Cuo, I disagree with some of the examples given:
-I&#039;m not especially impressed with the PPC ISA: the Alpha had no flag registers, as the designers thought that they could become a bottleneck.
-the register window, while innovative, is a bit like the branch delay slot: it works well on one implementation, but on other implementations with different technology it can become a liability as it makes every register access indirect.

One thing you didn&#039;t list that I would add is MIPS&#039; two versions of its integer arithmetic instructions, one which traps on integer overflow and one which does not: very useful for implementing sane integer semantics like Ada&#039;s.]]></description>
		<content:encoded><![CDATA[<p>@Pascal Cuo, I disagree with some of the examples given:<br />
-I&#8217;m not especially impressed with the PPC ISA: the Alpha had no flag registers, as the designers thought that they could become a bottleneck.<br />
-the register window, while innovative, is a bit like the branch delay slot: it works well on one implementation, but on other implementations with different technology it can become a liability as it makes every register access indirect.</p>
<p>One thing you didn&#8217;t list that I would add is MIPS&#8217; two versions of its integer arithmetic instructions, one which traps on integer overflow and one which does not: very useful for implementing sane integer semantics like Ada&#8217;s.</p>
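<p>A minimal sketch of the difference between a trapping and a wrapping 32-bit add, modeled only loosely on the MIPS add/addu pair. The function names add_wrapping and add_trapping are hypothetical, and a Python exception stands in for the hardware trap.</p>

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def add_wrapping(a, b):
    """Two's-complement 32-bit add that silently wraps on overflow."""
    s = (a + b) & 0xFFFFFFFF
    return s - 2**32 if s >= 2**31 else s

def add_trapping(a, b):
    """32-bit add that raises instead of returning a wrapped result."""
    s = a + b
    if not (INT32_MIN <= s <= INT32_MAX):
        raise OverflowError(f"signed overflow: {a} + {b}")
    return s

wrapped = add_wrapping(INT32_MAX, 1)   # wraps around to INT32_MIN
```

<p>A language with checked integer semantics can compile its additions to the trapping form and get the overflow check without extra instructions.</p>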
]]></content:encoded>
	</item>
</channel>
</rss>
