Tag - checksum

1
on recent disk reading results
2
re-examining XOR data checksum used on amiga floppies
3
my checksum algorithm
4
upgraded Tivo HD serial 652
5
redone transfer routine
6
sleep inversely proportional to the AFP and no more tokens
7
error correction using hamming distances
8
feasability of writing disks
9
problem with high density disks
10
SlackFest 2007

on recent disk reading results

(this was posted to Classic Computer mailing list, please disregard if you’re on the list.  I think this is an important topic)

The last two nights I’ve been busy archiving some of my Amiga floppy collection.  Most disks were written over 20 years ago.

On a sample size of about 150 floppies, most of them were perfectly readable by my homegrown usb external amiga floppy drive controller.

I paid very close attention to the failures or ones where my controller struggled.

Without sounding too obvious here, the time between the pulses (which more or less define the data) were grossly out of spec.  The DD pulses should nominally be 4us, 6us, and 8us apart before pre-write compensation.  Most good disks are slightly faster, and normal times for these ranges are:

4us: 3.2-4.2us.  Many around 3.75us
6us: 5.5-6.2us.
8us: 7.5-8.2us

(notice margins around 1-1.3us)

My original microcontroller implementation was 3.2-4.2, 5.2-6.2, and 7.2-8.2.

When my current FPGA controller would have a problem, I’d notice that there were problems right on a boundary.  So maybe pulses were coming in at 3.1us apart instead of 3.2.  Or maybe 4.3 instead of 4.2.  So I kept bumping the intervals apart, making a larger range of pulse times acceptable — the XOR sector checksums were passing, so I was likely making the right choices.  The bits were ending up in the right buckets.

But as I went through some of these disks, I ended up with the difference between ranges(and basically my noise margin) being reduced smaller and smaller.  Some to the point where an incoming pulse time might fall darn smack in the middle of the noise margin.  Which bucket does THAT one go into?

My approach has been very successful(easily 95%+), but it makes me wonder about Phil’s DiscFerret dynamic adaptive approach where a sample of the incoming data defines the ranges.

Some disk drives and controllers might be faster or slower than others, and if you create custom ranges for each disk (each track?), perhaps you’ll have better luck.

re-examining XOR data checksum used on amiga floppies

So I was wondering exactly how effective the XOR checksum that Commodore used for the amiga sectors was.  If I read a sector, and perform the checksum, and the checksum matches the stored checksum, then the data is the same, right?  Not always.  I had expected it to be better, maybe not MD5 or SHA-1 level, but I expected it to be decent.

I had run into this paper a year ago when I was looking at my transfer checksums.  But this certainly also applies to the amiga sectors, too.

Some good excerpts:

  • the percentage of undetected two-bit
    errors becomes approximately 1/k (k being checksum size), which is rather poor performance for
    a checksum (i.e., 12.5% undetected errors for an 8-bit checksum, and 3.125% for a 32-bit checksum).
  • The XOR checksum has the highest probability
    of undetected errors for all checksum algorithms in this study…..
  • There is generally no reason to continue the common practice of
    using an XOR checksum in new designs, because it has the same software computational cost as an
    addition-based checksum but is only about half as effective at detecting errors.

So I wanted to actually prove myself that XOR is that bad, looking at it from the amiga sector perspective, so I wrote a quick program to:

  1. Create a random block of data 512 bytes long
  2. Calculate the 32-bit XOR checksum based on 32-bit chunks at a time and store it as the last byte (as does the amiga)
  3. Select a number of bit inversions(basically corrupt the data in a specific way) which can affect data and/or stored checksum
  4. Recalculate the checksum of the modified block
  5. If the checksums MATCH, meaning that two different sets of data yield the same checksum, then this is an undetected error.
  6. Count the number of undetected errors vs total errors.

That paper lists the properties of an XOR checksum, and I wanted to compare results:

  • Detects all single-bit errors.  (my testing confirms)
  • Fails to detect some two-bit errors. (my testing confirms, see below)
  • If total number of bit errors is odd, XOR checksum detects it. (confirmed with 1,3,5,7 bit errors)
  • Fails to detect any even number of bit errors that occur in the same bit position of the checksum computational block. (confirmed with 2,4,6,8)

The two-bit errors is really the case that I worry about.  If two bit errors occur in the same bit position, and the inverted oppositely(1—>0 and 0—>1), then it won’t be detected.  So how often does this happen with random data?  Paper’s author, Maxino, says 3.125% of the time.  I can definitely confirm this.  My testing shows 1.8 million hits over 56 million tries.   There are some differences with Amiga data, but I think the results can be the same.  I might load some example amiga sectors and try them.

Once the number of bit errors increase, the probability of them happening in the same bit position, in the same direction, and occurring evenly on each position goes down —- and therefore the overall chance of undetection decreases as well.

4 bit undetected errors happen 0.3% of the time.

6 bit undetected errors happen 0.04% of the time.

8 bit undetected errors happen 0.009% of the time.

I have other testing to do, including running burst error tests.

So for each sector, you really don’t want to see only two bit errors occur.  A single bit error, sure.  Multiple bit errors, ok, because we can detect them.  You don’t want to think you have a good copy when it fact the checksum passing was a false positive.


my checksum algorithm

So I’m using an XOR checksum to detect errors in the transfer between the SX and the PC.  I always thought it was a reasonably strong checksum, even though it’s still an 8 bit checksum.

I found a neat article today here.

“The XOR checksum has the highest probability of undetected errors for all checksum algorithms in this study, and is not as effective as addition-based checksums for general purpose error detection uses.”

There were around 6 or 7 different checksums presented.

More later.

upgraded Tivo HD serial 652

I followed the instructions at http://www.bumwine.com/tivo.html but when I tried to expand the partition after the copy, it failed with the following error:

Primary volume header corrupt, trying backup.
Secondary volume header corrupt, giving up.
mfs_load_volume_header: Bad checksum.
Unable to open MFS drives.

So I downloaded the MFSLive Linux Boot CD 1.4 from http://www.mfslive.org/

and issued the command mfsadd -x /dev/sda -r 4.  Where /dev/sda is the name of the destination copied drive that needs expanding, from cat /proc/partitions.

Everything (edit: did not — see below) work(ed) perfectly.  I would probably just suggest following the FAQ/guide for MFSLive Boot CD, using the backup/restore commands.

The drive I bought was the Western Digital 1.5TB green drive from newegg.

redone transfer routine

So my transfer routine has been a little flaky lately. I’m not sure exactly why but I think it’s related to the number of processes I’m running. While it’s a P4 2.4, I think USB scheduling is dead last on the priority list, because it sure seems that way. My transfer protocol is pretty simple right now, or almost completely non-existent. After the PC sends a R for ReadDisk, the SX starts spewing data. There is no handshake, no acknowledgement of data — but there is a checksum. And that checksum has been failing. While I do initiate auto-retransmit, its slow and clumsy.

So tonight, I implemented an XMODEM-like protocol. Sender and receiver synchronize the start, each block is acknowledged, a NAK causes a retransmit of a block, instead of the whole track, and so on. and overall it works pretty good except for one thing. IT’S SLOW. How slow? About 2.1s per track. Yuck. At my high point with the other transfer, I was around 313-328ms per track. So like 6 times faster.

That’s way too slow. There’s a lot of back and forth with this protocol, with forced acknowledgment before the next block is transferred. The wikipedia page on Xmodem said that was one of the reasons for its replacement with other protocols like Ymodem and Zmodem.

Incidentally, I grew up on these serial based protocols, and used everything from Xmodem to Kermit to Ymodem, CIS B+, etc on BBS’s. ZModem with it’s resume feature was really tits on a stick back in the day.

Part of my problem is block size. I’m actually using 32-byte blocks because I don’t have enough memory on my SX. So that’s 374 blocks per track. 374 headers, 374 ACK’s.

34 blocks per sector, 172ms per sector. Just way way way too slow.

Since before I read from FRAM directly to USB, there was effectively no easy way to retransmit, because you can’t just backup using serial memory. And I don’t actually keep track, use, seek, of any fram byte locations. I don’t need to — I write one bit at a time, and I read back one byte at a time — but always in order, and always from start to end. The way I retransmit now is to re-read the track into memory again, and start the process all over again. In the past this worked for the few times I needed, but it seems that for whatever reason (maybe new FTDI drivers?) I’m dropping much more regularly now.

This xmodem method isn’t really 100% polished yet, but given these times I think it’s unlikely I’m going to. Gotta come up with a better method. Some in-between. Maybe some FRAM re-positioning routines?

Dunno.

I’ve love to hear what YOU think.

Thanks

sleep inversely proportional to the AFP and no more tokens

So I have figured one thing out.

The more I work on the AFP, the less I sleep.  At least that’s how it’s been over the last two nights.  Two nights ago it was 3am, last night it was after 2am.

Ideas come and go with me.

I’ve completely scrapped the token idea.  After a couple dry runs with some amiga floppies, it seems like the density of “10”s seem to be much more popular than 100, or 1000’s.  I guess I could have told you that looking at a scope.  But anyways, I had a lot of trouble getting the idea to work, and suffice to say, I quickly abandoned it.  The point is that the tokens were saving me about 400 bytes or so on average(about 3%), not the grand 50% or whatever I had imagined.

This also avoids unnecessary complexity.

So the current read routine times the difference between the pulses and then writes a ’10’, ‘100’, or ‘1000’ to the FRAM.  I manually write out the bits for each grouping…… no loops or anything.  And plus I take advantage of the fact that the output bit, once changed to 0 really doesn’t need to change.  So basically write a 1, clock, write a 0, clock, clock, clock for the 1000.

I successfully read a floppy last night using this method, so at least my concept is stable.  This still eliminates interrupts completely (edge and timed) and eliminates the ISR, etc.  Definitely getting a little more robust.  Now, will it actually help me read more floppies?  I don’t know.  I think the next step is to start documenting HOW these floppies are failing if I can tell.

I was having a few goofy problems last night, which I think were mostly related to flow-control.  I was using putty without f.c. and my SX appeared to lockup — it wouldn’t echo commands or the actual track.  Turned off flow control completely, and it started working.  I don’t know if putty handles the dtr/dsr flow control properly.  or should I say I’m not sure if the FTDI VCP drivers support DTR/DSR flow control.  I think they do because I think I tested it in Windows under C++ earlier.

The other goofy problem I was having was related to my code spontaneously restarting.  I think it was related to using JMP’s and not using the @ before the address.  The @ causes a PAGE to be inserted before the jump.  Now I don’t know if this is true for other CJxx commands etc.

I also saw a bunch of transfer checksum errors, but I think that’s related to the number of bytes being received.  I think I have to use the beginning delimiter FF FF effectively, and stop using an absolute offset for start of frame etc.

Ahh well.

error correction using hamming distances

I’ve been thinking for quite some time about error correction.  About using the structure of the MFM to our advantage.  I’ve come up with something although I haven’t implemented the idea yet.

As you know from my many references to it, http://www.techtravels.org/?p=62 , this post shows the 32 possible byte combinations from legal raw MFM.

I’ve run some numbers today.  First, there are illegal values, and then you have illegal PAIRS of MFM bytes.  For instance, while 0xAA is valid by itself, and 0x11 is valid by itself, 0xAA11 is illegal, because the combination of those would produce 4 consecutive zeros, and as we know, this is illegal.  Just like 0x55 is legal and 0x88 is legal, but 0x5588 is illegal because that would produce back-to-back 1’s, also illegal.

So as far as PAIRS go, roughly 1/3 of the total possible pairs (32*32=1024), are illegal.  That works out to be 341 pairs or so.  I was always interested in the percentage of bad pairs so we could use that info to detect bad bits/bytes.

So here’s my logic:

We need to know the data is bad, and that we fixed the data once we’ve attempted some error correction stuff.  The checksum works fine here.  The checksum uses XOR, and while its not perfect, it provides us with a decent check.

Next, we need to know which bytes are bad so we know which ones to fix.  This is also easy.  Because we know which raw values are ok, we can simply check the bytes against a table (when the checksum fails), and then we know roughly which ones are bad.  The byte before or after might be screwy, but we have the general location.

So now that we know which bytes need fixed, we need to figure out what the correct values for the bytes would be.  Enter hamming distance.  If you don’t know what that is, look here.  Long and short of it is that it provides a measurement of how many bits need flipped/changed to go from one byte to another.  In our case, it tells us that if we have a single-bit error (and this we have to guess on), and we give it the bad byte, it will tell us which LEGAL mfm bytes the byte could have been.  I wrote a small c program today that calculates the hamming distance between BAD BYTES, and legal mfm bytes.  Note this produced ~7200 lines of results.

So then what, we now have a small subset of guesses to make at the byte value.  Now, if there is just one bad byte, then its easy enough to swap in some choices, and then re-checksum the track to see if that fixes it.  If there are multiple values, this becomes much harder for me to wrap my head around programmatically wise.  I think the problem wreaks of recursion, which I’ve never been very good at.  I need to test all the possible (or probable) values in each of the appropriate locations.  This is not simple iteration, as far as I know.

So here’s some examples: (these are in decimal,not my normal hex)

Let’s say our bad byte is DEC 83.  Once again, this is illegal because of the “3” ended in binary 11.

So, the output from my program tells me that a single-bit error could have cause an 81 to look like an 83, that means bit 2 in that byte got mangled from a 0 to a 1.

But it could been an 82 as well.  So bit 1 might have been set, when it should have been cleared.

But that’s it.  Assuming that we just had a single-bit error, then those are the only two possibilities.  I think single-bit errors are the most common, but I’ll need to work on this.

OK, so now let’s look at 2-bit errors in the same byte.  If bad byte is still 83, then 17, 18, and 85 are possible choices.

What I’m really doing is intelligently narrowing the search space to valid possible choices.  And then, I could apply the PAIR logic from above, and potentially eliminate more of the choices.

I think it might make sense to check upwards of 3-bad bits of error…… After that it seems like maybe brute force might just be better/simpler…..

While this stuff is nice, if we have a byte get mangled in such a way that it produces another legal different raw mfm byte, then it might be impossible to check.  I can still apply my pair-logic to it which might help identify which byte is the problem byte.

Definitely different and excited stuff.  I’ve never seen this idea used before, at least not in the amiga floppy world.

feasability of writing disks

While there are some other bug fixes and documentation to be done on what’s already been implemented, I started thinking about writing ADF’s to disk over the last few days.

While the hardware part of it is already in place, there are some things that would need done:

  • The interrupt service routine would need modified to not just read data by reacting to edges, but now extract a bit from memory and put an appropriate pulse on the writedata pin. Floppy drive specs say the pulse should be 0.1us to 1.1us wide.
  • Write an SX routine that would receive data from the PC and store it in the fram. This would need to checksum the received data and either error out or request a retransmit.
  • Write PC routines that would create the full MFM track: I’d need to create a header (easy enough, use sector numbers, 11-sector number, track number, and then checksum everything), then MFM encode the data. I’m already doing much of this in the decode portion, so I can basically do the opposite for encoding.
  • Of course there’ll need to be a “controlling” pc routine, much like my current readtrack() routine.

So the whole process of writing a disk would go something like this:

  1. Have user browse for the filename, select the file, load it into a “floppydisk” structure/object that currently exists.
  2. Rewind and make sure I’m at track 0.
  3. Create the first track’s data.
  4. Send a command to the SX to tell it to receive a track’s worth of data.
  5. Send the data making sure it’s been received correctly.
  6. Tell the SX to write the track.
  7. SX enables the write gate by lowering it, and starts pulsing out data, pausing the appropriate times for 0-bits.

I don’t see any major hangups, although there are a few timing related things to get right. I’ve got make sure 18ms has passed after each step pulse. And I’ve got to make sure 100us has passed since I’ve selected the drive(this is mainly during setup for the first trak). For the last track, I need to make sure I pause for 650us after the last track is written. I also have to make sure that the time from the write gate dropping to the first bit MUST be 8us or less. Same with the final bit, I have to raise that write-gate within 8us after the last pulse.

I’ve got to look into creating a gap, sending SYNC words, learning wtf pre-write compensation is, etc.

problem with high density disks

I’m running into an issue with HD, high density, floppies.  The output on the scope looks just fine, but I get a consistent bad checksum on all the tracks.  I’ve tried with and without covering the HD hole.

My floppy spec clearly says that it uses a mechanical switch to detect whether it is high density and stays in “1.0 MB” mode if a low density disk is inserted.  But the floppy drive isn’t going to high density mode anyways.  The output is “normal-sized”, so my groups are 4,6, and 8us wide.  Just as normal.

I’ve never used a high density drive to write any of these floppies.

Amiga, with a low density drive of course, reads these floppies fine.

What the heck is going on here?  The other disks I’ve tested have been all DD disks…..and they read perfect.

I’d like to get my hands on a normal original double density floppy drive……………

SlackFest 2007

Ok, I’ll admit it.  I’ve been slacking like the best of them.

I’m not horribly unsatisified with my progress, especially this year.  The first four months of this year I achieved the following:

* Whole first ADF written January 29th
* Much cleaner hardware, single circuit board February 14th

* Data is now byte-sync’d, transferred sectors in order Februrary 17th

* Working Java client March 26th
As far as next steps go, I’ve got to

1> get error correction and retries working in the java client

2> clean up the gui and set up a “save config file” option, selecting serial ports, etc

3> clean up the java code

4> start testing it under Ubuntu

I’m very interested in error recovery, and I have been putting a decent amount of thought into it.  I’m really fascinated by the fact that Amiga floppy disks have tons and tons of structure.  First, the MFM is output in such a precise and predictable way, ie we know that no two 1’s are found back to back, no more than three consecutive zeros, etc.  We also know about the way clock bits are inserted.  And because of this fact, only certain bytes are possible when made up of these certain bit patterns.  In fact, only 32 are possible.  With the help of a checksum, I think it would be possible to brute-force a certain number of bad bytes.  Now I do think that the checksum is particularly unhelpful for this (XOR, I think), something like MD5 would be da bomb.  No false chance of duplicates yielding the same checksum.  I don’t understand or appreciate the (math-based) ramifications of using XOR though.  It’s certainly used everyplace within encryption, so this is might be better than I think.

My read routines are no longer raw, although I’m probably going to go back and add a raw read function.

I’ve tossed around the idea of simplifying this project even further by eliminating the FRAM memory and adding real-time functionality which I know the emulator crowd wants.  This is further down the line, and honestly, I don’t even know if its possible (or at least, not sure how I would do it)

I’ve still been managing between 12,000 and 20,000 hits per month (about 10,000 unique visitors), and so I really appreciate the continued interest people have shown.  This is really a fun project, and I’m looking forward to expanding it and making it better and more useful.

Thanks