Assembly on a Apple IIc+ from Jim Keohane on 2003-02-08 (2003-February)

From: Jim Keohane <jimkeo_at_multi-platforms.com>
Date: Sat Feb 8 15:15:01 2003

Pete,

    I'll type more slowly:

    Question #1:

    If an instruction takes two cpu cycles (which Sellam Ismail cited as the
minimum) and there are 2 cpu cycles per clock cycle, then how many clock
cycles does this one instruction take?

   Answer: one. OK. OK. I'm being cute. But the 6502, sans Woz's Apple ][
sneakiness for video, can do two memory fetches per clock cycle. I also
recall a discussion where RAM was faster than on-chip registers back when
6502's were born. The original ex-Motorola engineers optimized the 6502 for
memory access, not register access. If you search for "6502 = RISC" on the
Internet you'll find some folks even more inventive than I with math.
{chuckle}

   Question #2:

    If someone writes "pipelining" and encloses it within quotes does that
indicate to you that the term is being used, well, advisedly?

    Answer: Visit groups.google.com and search for "pipelining" and 6502 (or
related processor). You'll get 340 hits, ranging from Rockwell's use of
"pipelining" as a feature of their version of the 6502 chip to some folks
referring to "older pipelining" and "new pipelining" to ....

   . . .

   I enclosed "pipelining" in quotes because there is friendly disagreement
over the 6502 and the use of that term. I guess you were not aware of that.
In any event, in a discussion of cycles per instruction (clock cycles, cpu
cycles, whatever) the "prefetch" of next instruction's opcode is termed
pipelining by some including the manufacturer. Of course, you will now
respond back that "prefetch" really means ....

Here are some google excerpts:

=====excerpt=1====================

I don't like the "one instruction per cycle" definition of RISC - for a
start, what is a cycle? I prefer to think of RISC as an "every cycle is
sacred" philosophy - you don't waste cycles. I'd try to get _memory cycles_
as often as the hardware permits them - on the 6502, for example, one per
cycle (and it almost manages it!).

=====excerpt=2======================

A 6502 task context
would therefore require moving about 1KB, which would take about 4,500
instructions (at one instruction per cycle.) On a circa-1980's machine,
with a 1MHz clock, that would take about 4.5 msec.

=====excerpt 3=======================

The 6502 _IS_ pipelined, but in ways that are not very dramatic or even
obvious unless you look at the CPU's internal operation in detail. Rockwell
touted the pipelining in their 6502 user's guide years ago, it is
essentially this:

When you do a ADC of something, the last cycle of the instruction is when
the actual data byte is read in, right? Immediately after that the next
opcode is read so the next instruction has started, right? So when did the
6502 add?

It added while the next opcode was being read. The accumulator does not
actually hold the new value until sometime during the second half (forget
exactly where) of the opcode cycle of the next instruction.

That's pipelining. It saves you a cycle on every instruction that does an
ALU operation. It may not be as spectacular as what's being done on the
monster RISCs these days but it is essentially pipelining.

========excerpt 4===========================

The two things that in _my_ mind distinguish the 6502 are the pipelining and
the price ($20 for a 6501, $25 for a 6502, quantity 1, when Intel
was asking over a hundred for the 8080 and Moto wanted over $50
for the 6800).

=========excerpt=5==========================

- The 6502 should get the honor of being the first microprocessor to
  use pipelining, which explains much of its speed.

=========excerpt=6===========================

In a more recent example, the 6502 microprocessor had a through-put similar
to the 8080 processor running at a clock rate four times faster. This was
due to the pipelined architecture of the 6502 versus the non-pipelined 8080.

=========excerpt=7==========================

At the same clock rate, the 6502's are demonstrably faster. Their speed
comes from simplicity. Hard-wired instructions, short execution times,
and an early form of pipelining. Anyways, without any mental effort
expended, here is very simple 6502 code that will move 2x as many bytes as
the z80 code above but in the same # of cycles!

     ldx #49
ldir lda tab1+50,x
     sta tab2+50,x
     lda tab1,x
     sta tab2,x
     dex
     bpl ldir

2 bytes moved in 21 cycles or 10.5 cycles average. The overhead is 1 cycle.
See, already I have beat Z80 code. In fact the 128 has a special feature
to do even better: a relocatable zero page and stack.

I don't know exactly how to move the zero page, but it's something like
this:

     lda #>tab1
     sta stackpagehigh
     lda #>tab2
     sta zeropagehigh
     ldx #49
     txs
ldir pla ;read tab1,sp and decrement sp
     tsx
     sta <tab2,x
     lda tab1+50,x
     sta <tab2+50,x
     bpl ldir

18 cycles for two bytes, so 9 cycles per byte. Tab1 must start at a page
boundary, and tab2 must be contained within a page.

I can't remember which is 3 cycles; pla or pha, but the routine can be
arranged to use the faster one, so it can be made to run in 9 cycles
anyhow. To be fair, LDIR is actually not the fastest way to move memory.
Maybe setting the stack to tab1 and using POP BC then storing a word at a
time to tab2.. hmm.. I don't remember enough z80 to know if this will end
up faster. I was thinking ld (HL),bc and sub l, #1 and jp nz loop.
Even the fastest Z80 instructions can be beat. RRCA takes 4 cycles while
ROR takes 2 cycles.

    . . .

The difference for the cycle times is due to pipelining. This is a
technique to reduce the cycles in a cpu, and is used in new RISC
processors. The 6502 has pipelining making it very efficient. There are
3 stages which are executed in parallel for each instruction.
Instruction decode, alu, and something else. I need to look this stuff up.
But the 6502 is always fetching the next instruction at the same time as
it's figuring out what to do with the current one, so when the current
one is finished, it's part way through working on the next one.
Something like an assembly line. :)

=======end=of=excerpts==========================
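
As a rough check on the cycle counts quoted in excerpt 7, here is a minimal
Python sketch that tallies the first copy loop using the commonly documented
NMOS 6502 base timings. The table and the function name loop_cycles are
illustrative assumptions, not part of any excerpt. It assumes no
page-crossing penalties and a taken branch, and it counts the store to tab2
both as absolute (4 cycles) and as absolute,X (5 cycles).

```python
# Base NMOS 6502 cycle counts for the addressing modes used in the loop.
# Assumptions: no page-crossing penalty on LDA abs,X or on the branch,
# and BPL is taken (3 cycles) on every pass.
CYCLES = {
    ("lda", "abs,x"): 4,
    ("sta", "abs,x"): 5,
    ("sta", "abs"): 4,
    ("dex", "impl"): 2,
    ("bpl", "taken"): 3,
}


def loop_cycles(second_store_mode):
    """Cycles for one pass of the two-byte copy loop from excerpt 7."""
    body = [
        ("lda", "abs,x"),            # lda tab1+50,x
        ("sta", "abs,x"),            # sta tab2+50,x
        ("lda", "abs,x"),            # lda tab1,x
        ("sta", second_store_mode),  # store to tab2, absolute or absolute,X
        ("dex", "impl"),             # dex
        ("bpl", "taken"),            # bpl ldir
    ]
    return sum(CYCLES[op] for op in body)


for mode in ("abs", "abs,x"):
    total = loop_cycles(mode)
    print(f"second store {mode}: {total} cycles per pass, {total / 2} per byte")
```

Under those assumptions one pass comes to 22 or 23 cycles rather than the 21
quoted, so roughly 11 to 11.5 cycles per byte. On the pla/pha question in the
same excerpt, PHA is the 3-cycle instruction and PLA takes 4.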

Pete,

    I'd say you got me on the "one cycle per instruction" but you jumped the
gun on the pipelining issue. OK?

   - Jim

Jim Keohane, Multi-Platforms, Inc.

   "It's not whether you win or lose. It's whether you win!"
----- Original Message -----
From: <pete_at_dunnington.u-net.com>
To: <cctalk_at_classiccmp.org>
Sent: Saturday, February 08, 2003 14:06
Subject: Re: Assembly on a Apple IIc+

> On Feb 8, 11:06, Jim Keohane wrote:
>
> >     The reference to "one cycle" instruction may have been referring
> > to there being 2 cpu cycles per clock cycle. Also, there's the
> > "pipelining" some say the 6502 does when the last (or only) byte of
> > an instruction is acted upon simultaneous to next instruction's 1st
> > byte (opcode) being fetched
> >
> >     So perhaps "one instruction per clock cycle" may be awfully close
> > with pipelining and with use of zero page.
>
> You must be thinking of some different 6502 to the rest of us :-) As
> Sellam said, no 6502 opcode takes less than two clock cycles to
> execute, and most take more (up to 7): the only 2-cycle instructions
> are the ones with implied addressing, like RTS, CLI, TAX, ... This is
> why a 6502 running typical well-written code, running on a 2MHz clock,
> manages at best around 0.7 MIPS.
>
> There's no pipelining at all in a 6502. No overlap of instructions
> whatsoever.
>
> Zero-page instructions like LDA $12 take three clock cycles.
>
> There aren't two CPU cycles per clock cycle. Perhaps you're thinking
> of the fact that the 6502 uses a two-phase clock, and does part of the
> CPU cycle during phi-1, and part during phi-2?
>
> >     Of course, we're talking Apple ]['s which, if I can trust my
> > memory, steal every other clock cycle to refresh memory.
>
> I believe you're thinking of how it uses part of the clock cycle when
> the CPU isn't accessing memory, not alternate clock cycles.
>
> > > > p.s. I also did quite well with 6502 asm code in cpu speed tests
> > > > vs 80x86 and Z80 programmers. The zero page, for all intents and
> > > > purposes, is 256 registers.
>
> That was the designers' intention, but you have to remember that it
> takes an extra clock cycle to access a zero-page location rather than a
> register.
>
> --
> Pete Peter Turnbull
> Network Manager
> University of York
Received on Sat Feb 08 2003 - 15:15:01 GMT
