6.3.25 Integrity and Checksums

6.3.25.1  A checksum is a calculated value used to verify that stored, transmitted or replicated data is free of error. The value is calculated according to an appropriate algorithm and transmitted or stored with the data. When the data is subsequently accessed, a new checksum is calculated and compared with the original; if the two match, no error is indicated. Checksum algorithms come in many types and versions, and their use is recommended, and standard, practice for detecting accidental or intentional errors in archival files.
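The calculate, store and compare cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names file_checksum and verify_checksum, the choice of SHA-256 as the default, and the 1 MiB read size are all assumptions made for the example.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex checksum of a file, reading it in chunks so that
    large audio files need not be loaded into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Recalculate the checksum and compare it with the stored value.
    True means no error is indicated; False means the data has changed."""
    return file_checksum(path, algorithm) == expected.lower()
```

In use, the value returned by file_checksum would be stored alongside the file at ingest, and verify_checksum would be called with that stored value whenever the file is later accessed or copied.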

6.3.25.2  The cryptographic versions are the only type with a proven record of trust in protecting against intentional damage to data, and even the simplest of these are now compromised. It has recently been shown that there are ways of creating meaningless bits that will calculate to a given MD5 checksum. This means that an external or internal intruder may replace digital content with meaningless data, and that the attack will go unnoticed by the error-checking management system until the files are required for use and opened. MD5, although still useful for transmission purposes, produces a 128 bit value and should not be used where security is at issue. SHA-1 is another cryptographic algorithm under threat: it has already been shown that it can, in theory, be circumvented. SHA-1 produces a 160 bit value; SHA-2 comes in versions with 224, 256, 384 and 512 bit lengths, which are algorithmically similar to SHA-1. The steady growth of computational power means that these checksums may, in the long run, be compromised as well.
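The checksum lengths quoted above can be confirmed with Python's standard hashlib module. The helper name digest_bits is an assumption made for this sketch, and the usedforsecurity flag (available from Python 3.9) is passed only so that MD5 can still be queried on systems that restrict it for security use.

```python
import hashlib


def digest_bits(algorithm):
    """Return the length, in bits, of the checksum an algorithm produces."""
    # digest_size is reported in bytes; multiply by 8 for bits.
    return hashlib.new(algorithm, usedforsecurity=False).digest_size * 8


# MD5, SHA-1 and the four SHA-2 variants named in the text.
for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512"):
    print(name, digest_bits(name), "bit")
```

A longer checksum does not by itself guarantee security, but it is one factor in how much computation an attack requires.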

6.3.25.3  Even with these compromises, a checksum remains a valid approach to detecting accidental errors and, if incorporated into a trusted digital repository, may well be sufficient to uncover intentional damage to data files in low risk scenarios. However, where risks exist, and perhaps even where they do not, monitoring checksums and their viability must be part of preservation planning.
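One way to monitor both the checksums and their viability is to record the algorithm alongside each stored value, so that a routine audit can report damaged files and also files whose checksum algorithm is no longer considered adequate. The following is a sketch under stated assumptions: the manifest layout, the function name audit_manifest and the contents of the DEPRECATED set are hypothetical, and the set would in practice be a matter of local policy.

```python
import hashlib
from pathlib import Path

# Assumption: algorithms that local policy regards as no longer adequate.
DEPRECATED = {"md5", "sha1"}


def audit_manifest(manifest):
    """Check every file in a manifest of {path: (algorithm, expected_hex)}.

    Returns two lists: files whose checksum no longer matches, and intact
    files whose checksum should be recalculated with a stronger algorithm.
    """
    damaged, needs_rehash = [], []
    for path, (algorithm, expected) in manifest.items():
        # Reads the whole file at once for brevity; large files would
        # normally be read in chunks.
        data = Path(path).read_bytes()
        actual = hashlib.new(algorithm, data, usedforsecurity=False).hexdigest()
        if actual != expected.lower():
            damaged.append(path)
        elif algorithm in DEPRECATED:
            needs_rehash.append(path)
    return damaged, needs_rehash
```

Files reported in the second list are intact under their current checksum, which makes the audit a suitable moment to calculate and store a replacement value with a stronger algorithm.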