Tag - xor

on recent disk reading results
re-examining XOR data checksum used on amiga floppies
my checksum algorithm
error correction using hamming distances
SlackFest 2007

on recent disk reading results

(this was posted to Classic Computer mailing list, please disregard if you’re on the list.  I think this is an important topic)

The last two nights I’ve been busy archiving some of my Amiga floppy collection.  Most disks were written over 20 years ago.

On a sample size of about 150 floppies, most of them were perfectly readable by my homegrown usb external amiga floppy drive controller.

I paid very close attention to the failures or ones where my controller struggled.

Without sounding too obvious here, the time between the pulses (which more or less define the data) were grossly out of spec.  The DD pulses should nominally be 4us, 6us, and 8us apart before pre-write compensation.  Most good disks are slightly faster, and normal times for these ranges are:

4us: 3.2-4.2us.  Many around 3.75us
6us: 5.5-6.2us.
8us: 7.5-8.2us

(notice margins around 1-1.3us)

My original microcontroller implementation was 3.2-4.2, 5.2-6.2, and 7.2-8.2.

When my current FPGA controller would have a problem, I’d notice that there were problems right on a boundary.  So maybe pulses were coming in at 3.1us apart instead of 3.2.  Or maybe 4.3 instead of 4.2.  So I kept bumping the intervals apart, making a larger range of pulse times acceptable — the XOR sector checksums were passing, so I was likely making the right choices.  The bits were ending up in the right buckets.

But as I went through some of these disks, I ended up with the difference between ranges(and basically my noise margin) being reduced smaller and smaller.  Some to the point where an incoming pulse time might fall darn smack in the middle of the noise margin.  Which bucket does THAT one go into?

My approach has been very successful(easily 95%+), but it makes me wonder about Phil’s DiscFerret dynamic adaptive approach where a sample of the incoming data defines the ranges.

Some disk drives and controllers might be faster or slower than others, and if you create custom ranges for each disk (each track?), perhaps you’ll have better luck.

re-examining XOR data checksum used on amiga floppies

So I was wondering exactly how effective the XOR checksum that Commodore used for the amiga sectors was.  If I read a sector, and perform the checksum, and the checksum matches the stored checksum, then the data is the same, right?  Not always.  I had expected it to be better, maybe not MD5 or SHA-1 level, but I expected it to be decent.

I had run into this paper a year ago when I was looking at my transfer checksums.  But this certainly also applies to the amiga sectors, too.

Some good excerpts:

  • the percentage of undetected two-bit
    errors becomes approximately 1/k (k being checksum size), which is rather poor performance for
    a checksum (i.e., 12.5% undetected errors for an 8-bit checksum, and 3.125% for a 32-bit checksum).
  • The XOR checksum has the highest probability
    of undetected errors for all checksum algorithms in this study…..
  • There is generally no reason to continue the common practice of
    using an XOR checksum in new designs, because it has the same software computational cost as an
    addition-based checksum but is only about half as effective at detecting errors.

So I wanted to actually prove myself that XOR is that bad, looking at it from the amiga sector perspective, so I wrote a quick program to:

  1. Create a random block of data 512 bytes long
  2. Calculate the 32-bit XOR checksum based on 32-bit chunks at a time and store it as the last byte (as does the amiga)
  3. Select a number of bit inversions(basically corrupt the data in a specific way) which can affect data and/or stored checksum
  4. Recalculate the checksum of the modified block
  5. If the checksums MATCH, meaning that two different sets of data yield the same checksum, then this is an undetected error.
  6. Count the number of undetected errors vs total errors.

That paper lists the properties of an XOR checksum, and I wanted to compare results:

  • Detects all single-bit errors.  (my testing confirms)
  • Fails to detect some two-bit errors. (my testing confirms, see below)
  • If total number of bit errors is odd, XOR checksum detects it. (confirmed with 1,3,5,7 bit errors)
  • Fails to detect any even number of bit errors that occur in the same bit position of the checksum computational block. (confirmed with 2,4,6,8)

The two-bit errors is really the case that I worry about.  If two bit errors occur in the same bit position, and the inverted oppositely(1—>0 and 0—>1), then it won’t be detected.  So how often does this happen with random data?  Paper’s author, Maxino, says 3.125% of the time.  I can definitely confirm this.  My testing shows 1.8 million hits over 56 million tries.   There are some differences with Amiga data, but I think the results can be the same.  I might load some example amiga sectors and try them.

Once the number of bit errors increase, the probability of them happening in the same bit position, in the same direction, and occurring evenly on each position goes down —- and therefore the overall chance of undetection decreases as well.

4 bit undetected errors happen 0.3% of the time.

6 bit undetected errors happen 0.04% of the time.

8 bit undetected errors happen 0.009% of the time.

I have other testing to do, including running burst error tests.

So for each sector, you really don’t want to see only two bit errors occur.  A single bit error, sure.  Multiple bit errors, ok, because we can detect them.  You don’t want to think you have a good copy when it fact the checksum passing was a false positive.

my checksum algorithm

So I’m using an XOR checksum to detect errors in the transfer between the SX and the PC.  I always thought it was a reasonably strong checksum, even though it’s still an 8 bit checksum.

I found a neat article today here.

“The XOR checksum has the highest probability of undetected errors for all checksum algorithms in this study, and is not as effective as addition-based checksums for general purpose error detection uses.”

There were around 6 or 7 different checksums presented.

More later.

error correction using hamming distances

I’ve been thinking for quite some time about error correction.  About using the structure of the MFM to our advantage.  I’ve come up with something although I haven’t implemented the idea yet.

As you know from my many references to it, http://www.techtravels.org/?p=62 , this post shows the 32 possible byte combinations from legal raw MFM.

I’ve run some numbers today.  First, there are illegal values, and then you have illegal PAIRS of MFM bytes.  For instance, while 0xAA is valid by itself, and 0x11 is valid by itself, 0xAA11 is illegal, because the combination of those would produce 4 consecutive zeros, and as we know, this is illegal.  Just like 0x55 is legal and 0x88 is legal, but 0x5588 is illegal because that would produce back-to-back 1’s, also illegal.

So as far as PAIRS go, roughly 1/3 of the total possible pairs (32*32=1024), are illegal.  That works out to be 341 pairs or so.  I was always interested in the percentage of bad pairs so we could use that info to detect bad bits/bytes.

So here’s my logic:

We need to know the data is bad, and that we fixed the data once we’ve attempted some error correction stuff.  The checksum works fine here.  The checksum uses XOR, and while its not perfect, it provides us with a decent check.

Next, we need to know which bytes are bad so we know which ones to fix.  This is also easy.  Because we know which raw values are ok, we can simply check the bytes against a table (when the checksum fails), and then we know roughly which ones are bad.  The byte before or after might be screwy, but we have the general location.

So now that we know which bytes need fixed, we need to figure out what the correct values for the bytes would be.  Enter hamming distance.  If you don’t know what that is, look here.  Long and short of it is that it provides a measurement of how many bits need flipped/changed to go from one byte to another.  In our case, it tells us that if we have a single-bit error (and this we have to guess on), and we give it the bad byte, it will tell us which LEGAL mfm bytes the byte could have been.  I wrote a small c program today that calculates the hamming distance between BAD BYTES, and legal mfm bytes.  Note this produced ~7200 lines of results.

So then what, we now have a small subset of guesses to make at the byte value.  Now, if there is just one bad byte, then its easy enough to swap in some choices, and then re-checksum the track to see if that fixes it.  If there are multiple values, this becomes much harder for me to wrap my head around programmatically wise.  I think the problem wreaks of recursion, which I’ve never been very good at.  I need to test all the possible (or probable) values in each of the appropriate locations.  This is not simple iteration, as far as I know.

So here’s some examples: (these are in decimal,not my normal hex)

Let’s say our bad byte is DEC 83.  Once again, this is illegal because of the “3” ended in binary 11.

So, the output from my program tells me that a single-bit error could have cause an 81 to look like an 83, that means bit 2 in that byte got mangled from a 0 to a 1.

But it could been an 82 as well.  So bit 1 might have been set, when it should have been cleared.

But that’s it.  Assuming that we just had a single-bit error, then those are the only two possibilities.  I think single-bit errors are the most common, but I’ll need to work on this.

OK, so now let’s look at 2-bit errors in the same byte.  If bad byte is still 83, then 17, 18, and 85 are possible choices.

What I’m really doing is intelligently narrowing the search space to valid possible choices.  And then, I could apply the PAIR logic from above, and potentially eliminate more of the choices.

I think it might make sense to check upwards of 3-bad bits of error…… After that it seems like maybe brute force might just be better/simpler…..

While this stuff is nice, if we have a byte get mangled in such a way that it produces another legal different raw mfm byte, then it might be impossible to check.  I can still apply my pair-logic to it which might help identify which byte is the problem byte.

Definitely different and excited stuff.  I’ve never seen this idea used before, at least not in the amiga floppy world.

SlackFest 2007

Ok, I’ll admit it.  I’ve been slacking like the best of them.

I’m not horribly unsatisified with my progress, especially this year.  The first four months of this year I achieved the following:

* Whole first ADF written January 29th
* Much cleaner hardware, single circuit board February 14th

* Data is now byte-sync’d, transferred sectors in order Februrary 17th

* Working Java client March 26th
As far as next steps go, I’ve got to

1> get error correction and retries working in the java client

2> clean up the gui and set up a “save config file” option, selecting serial ports, etc

3> clean up the java code

4> start testing it under Ubuntu

I’m very interested in error recovery, and I have been putting a decent amount of thought into it.  I’m really fascinated by the fact that Amiga floppy disks have tons and tons of structure.  First, the MFM is output in such a precise and predictable way, ie we know that no two 1’s are found back to back, no more than three consecutive zeros, etc.  We also know about the way clock bits are inserted.  And because of this fact, only certain bytes are possible when made up of these certain bit patterns.  In fact, only 32 are possible.  With the help of a checksum, I think it would be possible to brute-force a certain number of bad bytes.  Now I do think that the checksum is particularly unhelpful for this (XOR, I think), something like MD5 would be da bomb.  No false chance of duplicates yielding the same checksum.  I don’t understand or appreciate the (math-based) ramifications of using XOR though.  It’s certainly used everyplace within encryption, so this is might be better than I think.

My read routines are no longer raw, although I’m probably going to go back and add a raw read function.

I’ve tossed around the idea of simplifying this project even further by eliminating the FRAM memory and adding real-time functionality which I know the emulator crowd wants.  This is further down the line, and honestly, I don’t even know if its possible (or at least, not sure how I would do it)

I’ve still been managing between 12,000 and 20,000 hits per month (about 10,000 unique visitors), and so I really appreciate the continued interest people have shown.  This is really a fun project, and I’m looking forward to expanding it and making it better and more useful.