RE: Lossy data and incorrect data

August 21, 2013 at 4:50 pm

Anyone who did what is suggested in the last sentence of the editorial ought to be given a good thrashing with a clue stick. (My comments here are relevant to nothing other than that last sentence.)

Numerics are far better compressed by OCR than by image compression, and the rule should be to use OCR to get the numeric (and alphabetic) components BEFORE any compression is done (since those areas of the image can then usually be thrown away without losing anything useful).

Note the positions and sizes of the alphanumeric chunks in the image, replace them with neutral background in the image, compress the revised image, then store the positions/sizes and text along with it. This gives better compression than compressing the image with the text/numeric data in it, and better accuracy too, since the accuracy of the text/numeric data is then limited only by the OCR (plus whatever checking is done on the OCR output) and not by the image compression. So you get better compression and better accuracy at the same time, in all cases where the background of the text/numeric data is not important, which means in just about every case where the text/numeric content has any legal implication.

Of course "in just about every case" doesn't mean always; there are cases where this technique is not good enough. Even then, wherever the accuracy of the numeric/text data matters, there is an option that beats compressing the image and using OCR on the result: compress the image with the original text/numeric zones included, and also keep a record of the text/numeric data and its positions in the image. That costs only a small compression penalty in exchange for an enormous improvement in the accuracy of the alphanumeric data compared to compressing before applying OCR.
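To make the steps concrete, here is a rough sketch in Python. It is only an illustration under assumptions of mine: the image is a flat 8-bit grayscale buffer, the OCR has already been run and checked (the `regions` list with its `x`, `y`, `w`, `h` and `text` keys is a made-up format), and `zlib` stands in for whatever image codec is really used. Since `zlib` is lossless it only shows the size side of the argument, not the accuracy side. The names `blank_regions`, `pack` and `keep_zones` are hypothetical.

```python
import json
import random
import zlib

def blank_regions(pixels, width, regions, background=255):
    """Return a copy of the pixel buffer with each text zone filled with background."""
    out = bytearray(pixels)
    for r in regions:
        for row in range(r["y"], r["y"] + r["h"]):
            start = row * width + r["x"]
            out[start:start + r["w"]] = bytes([background]) * r["w"]
    return bytes(out)

def pack(pixels, width, height, regions, keep_zones=False):
    """Compress the image and store the OCR text and positions alongside it.

    keep_zones=False: text zones are blanked before compression.
    keep_zones=True:  zones stay in the image (for when the background matters).
    """
    image = pixels if keep_zones else blank_regions(pixels, width, regions)
    return {
        "width": width,
        "height": height,
        "image": zlib.compress(image, 9),               # stand-in for the real codec
        "text": json.dumps(regions).encode("utf-8"),    # OCR result kept exactly
    }

# Synthetic example: white page with one noisy 20x8 zone standing in for printed digits.
WIDTH, HEIGHT = 64, 32
rng = random.Random(0)
page = bytearray([255]) * (WIDTH * HEIGHT)
zone = {"x": 10, "y": 5, "w": 20, "h": 8, "text": "1,234.56"}  # assumed OCR output
for row in range(zone["y"], zone["y"] + zone["h"]):
    for col in range(zone["x"], zone["x"] + zone["w"]):
        page[row * WIDTH + col] = rng.randrange(256)
page = bytes(page)

blanked = pack(page, WIDTH, HEIGHT, [zone])
kept = pack(page, WIDTH, HEIGHT, [zone], keep_zones=True)
print(len(blanked["image"]), len(kept["image"]), len(zlib.compress(page, 9)))
```

With `keep_zones=True` the compressed image is the same size as compressing the page on its own, and the only extra cost is the small text record. That is the fallback for the cases where the background of the text matters.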

Tom