12. Data management: archiving principles in a file-based environment

The core actions in file-based archiving pertain to bit preservation, i.e., a set of actions that maintain the integrity of the digital data (“bitstreams”) that are being managed by the responsible institution.

Actions beyond bit preservation will ultimately be needed when the formatting of the content is obsolescent. The most common action will be format migration, although (as noted in section 10 comments) there may be contexts in which system emulation is required. While bit preservation decisions may be left to information technology specialists and appropriate software and hardware applications, the actions beyond bit preservation will benefit from the involvement of people with curatorial responsibilities. What is at stake requires consideration of the significant properties of the content, the makeup of the research community being served, and an assessment of format obsolescence and the options for the new target formats.

Data management must observe the following core principles:

  • Files are generally placed in storage systems by copying. This process must produce duplicates that are verifiably identical to the originals. This process of data integrity checking can be achieved through the prior creation of a checksum, also known as a hash or digest. The process of verification should take place immediately after the creation of the copy, ideally as an automated procedure.
  • The ongoing data integrity of file-based content must be checked at regular intervals to ensure that it can be read exactly as it was written, with no errors or changes.
  • Depending on the original file format, however, it may be desirable to transcode to a new target format rather than simply copy from the original file (see sections 10 & 11). This process is known as format migration.
  • Digital content, whether file- or carrier-based, must be copied to a new physical carrier before uncorrectable errors occur. When the original and target formats are the same, this process is known as refreshment or media migration.
  • It is essential to keep at least two digital preservation copies, ideally more, and to use further dedicated copies for access as appropriate. The preservation copies should be kept in different geographic locations whenever possible. Additional security may also be provided by the use of different storage technologies for each set of preservation copies. When choosing which technologies to use, it should be borne in mind that a strategy will only be as strong as its weakest link.
  • Access copies should be made whenever possible. Unlike archival master files, however, such access or distribution copies may be subjectively modified, depending on the requirements of users. Data reduction may also be employed when compatible with user requirements. As with the creation of archival masters, careful documentation of all parameters and procedures employed is essential.
  • Checks to ensure data integrity should be automated wherever possible, as they can be with equipment within trusted digital repositories. Where automation is not possible, manual checks will need to be undertaken on a statistically significant sample.
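The copy-and-verify step named in the first principle can be illustrated with a minimal sketch. It assumes SHA-256 as the checksum algorithm and uses hypothetical function names (sha256_of, copy_and_verify); neither is prescribed by these principles, and an institution's storage system may already provide an equivalent function.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that large audiovisual files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, target):
    """Copy source to target and confirm that the copy is bit-identical.

    The checksum of the original is created before copying; the copy is
    then read back and compared immediately afterwards.
    """
    expected = sha256_of(source)       # prior creation of the checksum
    shutil.copyfile(source, target)    # place the file in storage by copying
    actual = sha256_of(target)         # verify immediately after the copy
    if actual != expected:
        raise IOError(f"Checksum mismatch for {target}")
    return expected                    # keep this value for later checks
```

The returned checksum would normally be recorded alongside the file so that later integrity checks can refer to it. One caveat: a read-back performed straight after writing may, on some systems, be served from an operating-system cache rather than from the storage medium itself, so a later re-check remains advisable.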

Comment: While these principles apply equally to any form of file-based preservation, the relatively large file sizes and time-based nature of audiovisual content demand that storage and bandwidth capacities be considered carefully. Essentially, these principles are the same as those recommended for the analogue world. One fundamental difference, however, is the qualitative dimension of the file-based digital world, which permits objective validation of the integrity of recordings. Regular data integrity monitoring is amongst the core obligations of digital preservation routines. Digital carriers and systems can and do fail, without warning, at any time. Strategies for minimising risks to digital archives are greatly supported by networking between the primary collection, the user and backup archives.
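Regular integrity monitoring of this kind might be sketched as an audit against a stored list of checksums. The sketch assumes a manifest held as a simple mapping from relative file path to SHA-256 digest; the manifest layout and the names file_digest and audit are illustrative assumptions, not part of any particular repository system.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest, root):
    """Compare every file listed in the manifest with its stored checksum.

    manifest: dict mapping relative path -> expected SHA-256 hex digest
              (assumed layout, for illustration only).
    root:     directory holding one set of preservation copies.
    Returns a list of (relative_path, problem) tuples; an empty list
    means every listed file was found and read exactly as written.
    """
    problems = []
    for relative_path, expected in sorted(manifest.items()):
        path = Path(root) / relative_path
        if not path.is_file():
            problems.append((relative_path, "missing"))
        elif file_digest(path) != expected:
            problems.append((relative_path, "checksum mismatch"))
    return problems
```

Run at regular intervals against each set of preservation copies, an audit of this kind reports which files need to be restored from another copy before errors accumulate.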