Request from Intel's Museum

From: Ross Archer <archer_at_topnow.com>
Date: Wed Oct 9 20:14:01 2002

"Dwight K. Elvey" wrote:
>
> >From: "Ross Archer" <archer_at_topnow.com>
> >
> >"Dwight K. Elvey" wrote:
> >>
> >> >From: "Ross Archer" <dogbert_at_mindless.com>
> >> >
> >> >Jerome H. Fine wrote:
> >> >
> >> >>>Jim Kearney wrote:
> >> >>>
> >> >>>I just had an email exchange with someone at Intel's Museum
> >> >>>(http://www.intel.com/intel/intelis/museum/index.htm)
> >> >>
> >> >>Jerome Fine replies:
> >> >>
> >> >>I am not sure why the information is so blatant in its
> >> >>stupid attempt to ignore anything but Intel hardware
> >> >>as far as anything that even looks like a CPU chip, but
> >> >>I guess it is an "Intel" museum.
> >> >>
> >> >>Of course, even now, Intel, in my opinion, is so far
> >> >>behind from a technical point of view that it is a sad
> >> >>comment just to read about the products that were
> >> >>way behind, and still are, the excellence of other
> >> >>products. No question that if the Pentium 4 had been
> >> >>produced 10 years ago, it would have been a major
> >> >>accomplishment.
> >> >>
> >> >Harsh! :)
> >> >
> >> >Guess it depends on what you mean by "far behind from a
> >> >technical point of view."
> >> >
> >> >If you mean that x86 is an ugly legacy architecture, with
> >> >not nearly enough registers, an instruction set which
> >> >doesn't fit any reasonable pipeline, that's ugly to decode
> >> >and not particularly orthogonal, and that for purely
> >> >technical reasons ought to have died a timely death in
> >> >1990, I'd have to agree.
> >> >
> >> >However, look at the performance. P4 is up near the
> >> >top of the tree with the best RISC CPUs, which have
> >> >the advantage of clean design and careful evolution.
> >> >
> >> >It surely takes a great deal of inspiration, creativity,
> >> >and engineering talent to take something as ill-suited
> >> >as the x86 architecture and get this kind of performance
> >> >out of it. IMHO.
> >> >
> >> >In other words, making x86 fast must be a lot like
> >> >getting Dumbo off the ground. That ought to count as
> >> >some kind of technical achievement. :)
> >>
> >> ---snip---
> >>
> >> It is all done with smoke and mirrors.
> >
> >Anything that results in a net faster CPU isn't, in my book,
> >akin to smoke and mirrors.
> >
> >If anyone's guilty of "smoke and mirrors", it's probably
> >Intel, by making a ridiculously long (20-24 stage) pipeline
> >just to allow the wayupcrankinzee of clock rates so they can
> >be the first CPU to X GHz. Why not a 50-stage pipeline that
> >hits 8 GHz, never mind the hideous branch-misprediction
> >penalties and exception overhead?
> >
> >
> >> We do the same
> >> here at AMD. The trick is to trade immediate execution
> >> for known execution. The x86 code is translated to run
> >> on a normal RISC engine.
> >
> >Yes, and this in and of itself must be rather tricky, no?
> >X86 instructions are variable-length, far from load/store,
> >have gobs of complexity in protected nonflat mode, etc.
> >I'd bet a significant portion of the Athlon or P4 is devoted
> >just to figuring out how to translate/align/schedule/dispatch
> >such a mess with a RISC core under the hood. :)
>
> It doesn't take as much as one would think but it is a hit
> on speed and space. Still, the overall hit is really quite
> small.

Based on what you're saying, it follows that a multi-level
instruction-set implementation (a lower-level microarchitecture
plus a higher-level user-visible architecture) is not only
feasible, but might even be superior in some cases to a
one-level implementation tuned either for CPU speed or
compiler convenience.

What follows is that the user-level instruction set ought to
be organized for compiler code-generation efficiency (less
code, fewer instructions, a smaller semantic gap between the
compiler and the compiler-visible CPU to make optimizations
more obvious, etc.). The microarchitecture is then designed
to keep the execution units and pipelines as busy as possible,
without regard to the semantic gap from the outside world.

The hybrid might eventually surpass the best purely RISC or
CISC approaches simply because there are two optimization
points: at the compiler/assembler and at the internal hardware.
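
To make the two-level idea concrete, here is a toy sketch in C of
how a CISC-style "add memory to register" instruction might get
cracked into load/add micro-ops for an internal RISC-like engine.
All the names and encodings are invented for illustration; the
real decoders in the Athlon or P4 are obviously far more involved.

#include <stdio.h>

typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } uop_kind;

typedef struct {
    uop_kind kind;
    int dst, src, addr;  /* register numbers; addr is an address register */
} uop;

/* One user-visible CISC instruction, "ADD r1, [r2]", becomes two
 * micro-ops the internal engine can schedule and pipeline freely. */
static int crack_add_reg_mem(int dst, int addr_reg, uop out[])
{
    out[0] = (uop){ UOP_LOAD, 7, 0, addr_reg };  /* load into temp register 7 */
    out[1] = (uop){ UOP_ADD, dst, 7, 0 };        /* dst += temp */
    return 2;  /* number of micro-ops emitted */
}

int main(void)
{
    uop buf[4];
    int n = crack_add_reg_mem(1, 2, buf);
    for (int i = 0; i < n; i++)
        printf("uop %d: kind=%d dst=%d src=%d addr=%d\n",
               i, buf[i].kind, buf[i].dst, buf[i].src, buf[i].addr);
    return 0;
}

The two optimization points fall out naturally: the compiler only
ever sees the one compact CISC instruction, while the scheduler
only ever sees simple micro-ops.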

>
> >
> >> This means that the same tricks
> >> on a normal RISC engine would most likely only buy about
> >> a couple percent. It would only show up on the initial
> >> load of the local cache. Once that is done, there is
> >> really little difference.
> >> Choices of pipeline depth, out of order execution, multiple
> >> execution engines and such are just the fine tuning.
> >> Intel, like us, is just closer to the fine edge of what
> >> the silicon process can do than anything tricky that
> >> people like MIPS don't know about.
> >
> >Well, why isn't something elegant like Alpha, HP-PA, or MIPS
> >at the top of the performance tree then? (Or are they, and
> >I'm just not aware of the latest new products?)
> >
> >My pet theory is that the higher code density of x86
> >vs. mainline RISC helps utilize the memory subsystem
> >more efficiently, or at least overtaxes it less often.
> >The decoding for RISC is a lot simpler, but if the caching
> >systems can't completely compensate for the higher memory
> >bandwidth requirements, you're stalling more often or
> >limiting the maximum internal CPU speed indirectly due to
> >the mismatch. And decoding on-chip can go much faster than
> >any sort of external memory these days.
>
> This is why the newer processor chips are really a memory
> chip with some processor attached, rather than a processor
> with some memory attached. We and Intel are turning into
> RAM makers. Memory bandwidth is on the increase but it
> isn't keeping up with chip speed.

And unless you go with 1024+ bit wide SDRAM buses or some
such, it's hard to see how external memory could ever keep up.
The "happy" (well, carefree) days of 1000 ns instruction
cycles are long gone. :)
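
A quick back-of-envelope sketch of the gap, in C. The clock rate,
IPC, and DDR figures below are rough illustrative guesses for the
current (2002-ish) era, not measurements of any particular part:

#include <stdio.h>

int main(void)
{
    double clock_hz   = 2.0e9;  /* assume a 2 GHz core */
    double ipc        = 1.0;    /* assume one instruction per cycle */
    double bytes_insn = 4.0;    /* fixed 32-bit RISC encoding */

    /* Fetch bandwidth needed if every instruction came from RAM. */
    double need = clock_hz * ipc * bytes_insn;
    printf("Fetch bandwidth needed: %.1f GB/s\n", need / 1e9);

    /* One 64-bit DDR-266 channel (roughly PC2100) supplies: */
    double have = 133e6 * 2 /* DDR */ * 8 /* bytes wide */;
    printf("One DDR channel supplies: %.1f GB/s\n", have / 1e9);
    return 0;
}

That's roughly 8 GB/s wanted against 2 GB/s available, before a
single data reference is counted, which is exactly why the caches
have to catch nearly everything.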

> Still, I don't understand why many are not going to more
> efficient memory optimization than apparent execution speed.
> The compiler writers have a ways to go. The day is gone
> when peephole optimization buys much.

For RISC targets, the semantic gap between an HLL statement
in "C", for example, and the target code is wider. Intuitively,
anyway, this means more instructions are output and fewer
optimizations are found for a given level of effort in the
code-generation logic. And since it's an "all things being
equal" deal, you can bet the optimization will be better with
a friendly target.

Peephole optimization would be particularly difficult where
there is basically only one way to do something in the target.

Perversely, all this argues for a RISC engine optimized for
internal speed and a CISC engine optimized to be
compiler-friendly, or in other words: "Q: Which technology is
better, RISC or CISC? A: Both are better than either." :)

At last, a possible explanation for why x86, which is so ugly
from a performance-theory point of view, really does work so
well in practice?
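
The code-density point is easy to see with one C statement. The
assembly in the comments below is schematic, hand-written for
illustration rather than actual compiler output:

/* One HLL statement, two very different target costs. */
void bump(int *a, int i)
{
    a[i] += 1;
    /* CISC (x86-ish): one instruction, about 3 bytes:
     *     inc dword ptr [eax + ebx*4]
     *
     * Load/store RISC (MIPS-ish): five fixed 4-byte
     * instructions, 20 bytes:
     *     sll   $t0, $a1, 2      # i * 4
     *     addu  $t0, $a0, $t0    # &a[i]
     *     lw    $t1, 0($t0)
     *     addiu $t1, $t1, 1
     *     sw    $t1, 0($t0)
     */
}

int main(void)
{
    int a[4] = { 0, 0, 0, 0 };
    bump(a, 2);
    return a[2];  /* 1 */
}

Same semantics, several times the instruction-fetch traffic on
the RISC side: the bandwidth argument above in miniature.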

> Keeping the process in on-chip cache is really the important
> thing. There isn't an application out there that, if one
> removed the large data arrays and image bit tables, couldn't
> fit completely in the caches that are being used today.

The problem is that cache is generally implemented as "n"
parallel direct-mapped caches (n-way set-associative) rather
than fully associative, because associative memory is
impossibly complex and expensive at any decent size.

So if you have a 2-way cache, you can only hold two data items
whose addresses happen to hash to the same cache slot,
regardless of how big your cache is. For an "n"-way cache, all
you need is "n+1" frequently-used data items that map to the
same cache index, and musical chairs tosses out vital data on
every load. :(
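
A minimal sketch of the mapping, in C. The geometry here
(32-byte lines, 128 sets, 2 ways) is made up for the example:

#include <stdio.h>

#define LINE_BYTES 32
#define NUM_SETS   128

/* The "hash" is nothing fancy: just a slice of the address bits. */
static unsigned set_index(unsigned long addr)
{
    return (addr / LINE_BYTES) % NUM_SETS;
}

int main(void)
{
    /* Three addresses exactly one way-size (4 KB) apart all land
     * in the same set, so a 2-way cache can never hold all three. */
    unsigned long stride = (unsigned long)LINE_BYTES * NUM_SETS;
    for (int i = 0; i < 3; i++) {
        unsigned long a = 0x10000 + i * stride;
        printf("addr %#lx -> set %u\n", a, set_index(a));
    }
    return 0;
}

Touch those three addresses round-robin and every access evicts
the line the next access needs, no matter how large the cache is.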

To prevent ugly *stuff* like this from happening, a
compiler/linker would not only have to know which data was
dynamically most frequently used at any given time, but also
how an item's address maps to a cache index and how many cache
ways there are, so it could lay data out such that
frequently-used data always shares its set with
infrequently-used data.

One thing a compiler *could* do is set a "hint" bit in the
load and store instructions (provided the CPU provides a bit
in load/store for this purpose) when the code generator thinks
the data just loaded/stored will be used again especially
often in the *near* future. The CPU could let that bit stay
set for, say, one million CPU cycles before clearing it, and
try its damnedest not to toss out a data item with this bit
set if there's an alternative in the other n-1 ways that has
no such bit set. That might help quite a bit. The actual
implementation would undoubtedly be very different (a
timestamp?), but the idea is to "hint" the CPU toward a better
choice of whom to toss into the street vs. keep in the
shelter. :)
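
Here's a sketch of that victim-selection idea in C. The hint bit,
the one-million-cycle lifetime, and the whole policy are
hypothetical, exactly as speculated above:

#include <stdio.h>

#define WAYS 4
#define HINT_LIFETIME 1000000UL  /* cycles before a hint goes stale */

typedef struct {
    unsigned long tag;
    unsigned long hinted_at;  /* cycle count when the hint was set */
    int hinted;
    int valid;
} line;

/* Prefer an invalid or un-hinted (or stale-hinted) way as the
 * victim; only evict a freshly hinted line as a last resort. */
static int pick_victim(const line set[WAYS], unsigned long now)
{
    int fallback = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;                      /* free slot wins outright */
        int hint_live = set[w].hinted &&
                        now - set[w].hinted_at < HINT_LIFETIME;
        if (!hint_live)
            return w;                      /* nothing vital here: evict */
        fallback = w;                      /* everything hinted: give up */
    }
    return fallback;
}

int main(void)
{
    line set[WAYS] = {
        { 0x100, 0,       1, 1 },  /* hinted long ago: now stale */
        { 0x200, 1990000, 1, 1 },  /* recently hinted */
        { 0x300, 1995000, 1, 1 },
        { 0x400, 1999000, 1, 1 },
    };
    printf("victim way: %d\n", pick_victim(set, 2000000));
    return 0;
}

A real implementation would bury this in the replacement logic
next to the LRU bits, but the effect is the same: the compiler's
guess biases who gets tossed.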


> The compilers just don't
> write code well enough to keep the size down. It is just
> that we've gotten into the poor choice of languages and poor
> connection of software writers to the actual machine code
> that is run.

I'd have to agree with the poor-connection and code-size parts.

I think it's a bit unfair to blame the high-level languages
for this problem, though. It seems to me that the
code-generation phase is where things are broken. And since
most compilers have a lexical view of the world rather than a
run-time view of the world, it is also kind of difficult to
predict what needs to be optimized without some fancy
simulation technology that AFAIK isn't used as a rule as part
of code generation, but perhaps should be. :)

> Just my opinion.
> Dwight
>

I've learned some great things here so far.
It's safe to say I already think about this quite differently
than I did just yesterday.

-- Ross

> >
> >This isn't really a discussion for classiccmp, but I
> >couldn't resist, since I'm sure at least some folks enjoy
> >speculationalism on such topics. :)
> >
> >
> >>
> >> On a separate subject, I was very disappointed in the
> >> Intel Museum. I'd thought it might be a good place to
> >> research early software or early ICs. They have very
> >> little to offer to someone looking into this level of
> >> stuff. Any local library has better references on this
> >> kind of stuff (and that isn't saying much).
> >> Dwight

Yup. Even corporate boosterism shouldn't keep one from a
graceful acknowledgement of the contributions of others. :|

-- Ross
