Understanding Digital Data

Digital data can represent most media types, whether text, sound, image, moving image, or newer forms such as hypertext or relational databases, in a unified way. In the end, everything is just a bit stream.

In a broad sense, digital data can be defined as anything recorded on a medium using a symbol-based code. Such a code uses a finite set, S, of symbols, for example the Latin alphabet or Egyptian hieroglyphics.

S = {S1, S2, ..., Sn}, n ≥ 2

If n = 2, so that the code uses only two symbols, it is called a binary code. Binary codes are the simplest codes and can be implemented easily by computing machinery (e.g., S = {0, 1}, S = {true, false}, or S = {+5V, -5V}).

The meaning of a symbol often depends on its position within the sequence of symbols. Frequently symbols are combined in groups to form new symbols (commonly known as “words”) which themselves are combined into higher units (“sentences”).
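
As a rough illustration of grouping binary symbols into larger units, the sketch below writes a short text as a stream of the symbols 0 and 1 and then reads it back by cutting the stream into 8-bit "words". The function names `to_bits` and `from_bits`, and the choice of UTF-8 with 8 bits per word, are assumptions made for this example only.

```python
def to_bits(text: str) -> str:
    """Encode text as a stream of '0'/'1' symbols, 8 symbols per byte (UTF-8)."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def from_bits(bits: str) -> str:
    """Group the stream into 8-symbol words and decode the words as UTF-8."""
    if len(bits) % 8 != 0:
        raise ValueError("bit stream length must be a multiple of 8")
    words = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return bytes(int(word, 2) for word in words).decode("utf-8")


bits = to_bits("Hi")
print(bits)             # 0100100001101001
print(from_bits(bits))  # Hi
```

The same 16 symbols would mean something else entirely if they were grouped into 4-bit or 16-bit words, which is why the grouping rule has to be known to the reader.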

Reading and understanding a digital code requires two distinct steps:

  1. The file format has to be identified.

  2. The appropriate syntactic and semantic rules of the file format have to be applied to interpret the digital code.

If a data file is identified as being a TIFF image file, but the specification of the TIFF format is not known, the image represented by the data cannot be extracted. Therefore, if the semantic system of a digital code cannot be identified or is not known, the information contained in the digital data cannot be extracted.
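
To make the second step concrete, here is a minimal sketch of applying one small syntactic rule of the TIFF format: the first two bytes of the header state the byte order, and the following fields (the number 42 and the offset of the first image directory) must be read using that order. The function name `read_tiff_header` is hypothetical, and the sketch handles only the 8-byte header, not the rest of the format.

```python
import struct


def read_tiff_header(header: bytes) -> tuple[str, int]:
    """Return (byte order, offset of first image directory) from a TIFF header."""
    order = header[:2]
    if order == b"II":      # Intel, little endian
        prefix = "<"
    elif order == b"MM":    # Motorola, big endian
        prefix = ">"
    else:
        raise ValueError("not a TIFF header")
    # A 16-bit magic number followed by a 32-bit offset, in the declared order.
    magic, ifd_offset = struct.unpack(prefix + "HI", header[2:8])
    if magic != 42:
        raise ValueError("TIFF magic number 42 not found")
    return ("little" if prefix == "<" else "big", ifd_offset)


print(read_tiff_header(b"II*\x00\x08\x00\x00\x00"))  # ('little', 8)
print(read_tiff_header(b"MM\x00*\x00\x00\x00\x08"))  # ('big', 8)

# The same four bytes read with the wrong rule give a very different number.
print(int.from_bytes(b"\x08\x00\x00\x00", "big"))    # 134217728
```

Without knowing this rule, a reader has no way to tell whether the offset is 8 or 134217728, even though every bit was recovered correctly.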

This leads to the following prerequisites for retrieving digital data:

  1. The physical property used to create the marks has to be known. For current media types the physical property used to create the marks is usually known. We know that a floppy disk has magnetic marks and that a compact disc has optically detectable marks. However, future digital archeologists might have problems determining which physical property has been used to create the marks, especially for new, emerging recording technologies.

  2. The physical marks on the medium must be detectable and convertible into symbols. If, as a result of damage and aging, this is no longer possible, the medium has to be considered “destroyed” and unreadable.

  3. The syntactic and semantic system (file format) has to be identified and known.

If any of these prerequisites is not met, the digital data will no longer be readable and the recorded information is lost.

Digital data is independent of the medium it is recorded on as long as the symbols can be deciphered. For example, a binary computer file representing an image in the JPEG format could be engraved into a stone; it would not be very handy to work with, but it would be feasible. Thus, digital data can be copied from one medium to any other medium without loss.

Digital data can be copied without any loss by reproducing the same sequence of symbols as the “original” sequence. The two copies are indistinguishable from each other, so it is not possible to determine which one is the “original.” However, the physical representation of a digital code always has an analog nature, which may introduce errors. The copy process is therefore complete only once the two copies have been verified to be identical, either by a symbol-wise (for binary data, bit-wise) comparison or by comparing checksums. Once verified, digital data can be copied an unlimited number of times with no generational loss.
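
A verified copy along these lines might look like the following sketch. It uses SHA-256 as the checksum and then a byte-wise comparison from Python's standard library; the helper names `sha256_of` and `verified_copy` are made up for this example, and the choice of SHA-256 is an assumption rather than something the text prescribes.

```python
import filecmp
import hashlib
import shutil


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(src: str, dst: str) -> str:
    """Copy src to dst and treat the copy as complete only after verification."""
    shutil.copyfile(src, dst)
    src_digest = sha256_of(src)
    if src_digest != sha256_of(dst):
        raise OSError(f"checksum mismatch after copying {src} to {dst}")
    # Symbol-wise check: shallow=False forces a comparison of file contents.
    if not filecmp.cmp(src, dst, shallow=False):
        raise OSError(f"byte-wise comparison failed for {src} and {dst}")
    return src_digest
```

In practice one of the two checks is usually enough. The checksum has the advantage that it can be stored alongside the data and used to verify later generations of copies without access to the source.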

Digital data can be transmitted as a signal at up to the speed of light, without any atoms or matter having to be moved. This property allows digital data to be copied over a distance (tele-copied) without loss.

Table of common image file formats with “magic numbers”

File Type                    Typical Extension  Hex Digits (xx = variable)     ASCII Digits
---------------------------  -----------------  -----------------------------  ------------
GIF                          .gif               47 49 46 38                    GIF8
FITS                         .fits              53 49 4d 50 4c 45              SIMPLE
Bitmap                       .bmp               42 4d                          BM
Graphics Kernel System       .gks               47 4b 53 4d                    GKSM
IRIS rgb                     .rgb               01 da                          ..
ITC (CMU WM)                 .itc               f1 00 40 bb                    ....
JPEG File Interchange        .jpg               ff d8 ff e0                    ....
NIFF (Navy TIFF)             .nif               49 49 4e 31                    IIN1
PM                           .pm                56 49 45 57                    VIEW
PNG                          .png               89 50 4e 47                    .PNG
Postscript                   .[e]ps             25 21                          %!
Sun Rasterfile               .ras               59 a6 6a 95                    Y.j.
Targa                        .tga               xx xx xx                       ...
TIFF (Motorola, big endian)  .tif               4d 4d 00 2a                    MM.*
TIFF (Intel, little endian)  .tif               49 49 2a 00                    II*.
X11 Bitmap                   .xbm               xx xx
XCF Gimp file structure      .xcf               67 69 6d 70 20 78 63 66 20 76  gimp xcf v
Xfig                         .fig               23 46 49 47                    #FIG
XPM                          .xpm               2f 2a 20 58 50 4d 20 2a 2f     /* XPM */
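
The first step of reading a file, identifying its format, can be sketched by comparing the leading bytes of the data against the fixed signatures in the table above. This is a minimal example, not a complete format detector: the names `SIGNATURES` and `identify` are hypothetical, and formats with variable leading bytes (Targa, X11 Bitmap) are left out because they cannot be matched this way.

```python
from typing import Optional

# Fixed signatures copied from the table above (hex digits of the leading bytes).
SIGNATURES = {
    "GIF": "47 49 46 38",
    "FITS": "53 49 4d 50 4c 45",
    "Bitmap": "42 4d",
    "JPEG File Interchange": "ff d8 ff e0",
    "PNG": "89 50 4e 47",
    "Postscript": "25 21",
    "TIFF (big endian)": "4d 4d 00 2a",
    "TIFF (little endian)": "49 49 2a 00",
    "XPM": "2f 2a 20 58 50 4d 20 2a 2f",
}

# Try longer signatures first so that a short one cannot shadow a longer one.
_MAGIC = sorted(
    ((bytes.fromhex(hex_digits), name) for name, hex_digits in SIGNATURES.items()),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def identify(data: bytes) -> Optional[str]:
    """Return the file type whose magic number starts the data, or None."""
    for magic, name in _MAGIC:
        if data.startswith(magic):
            return name
    return None


print(identify(b"GIF89a"))                           # GIF
print(identify(bytes.fromhex("89504e470d0a1a0a")))   # PNG
print(identify(b"plain text"))                       # None
```

Short signatures such as the two-byte "BM" are weak evidence, since unrelated data can begin with the same bytes, so a match is better treated as a hint to be confirmed by parsing the rest of the file.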
