Wyyme Technology Junction: File Formats and Structure

A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free and may be either unpublished or open.

Some file formats are designed for very particular types of data: PNG files, for example, store bitmapped images using lossless data compression. Other file formats, however, are designed for storage of several different types of data: the OGG format can act as a container for different types of multimedia including any combination of audio and video, with or without text (such as subtitles), and metadata. A text file can contain any stream of characters, including possible control characters, and is encoded in one of various character encoding schemes. Some file formats, such as HTML, scalable vector graphics, and the source code of computer software are text files with defined syntaxes that allow them to be used for specific purposes.

Identifying the file type

Different operating systems have traditionally taken different approaches to determining a particular file's format, with each approach having its own advantages and disadvantages. Most modern operating systems and individual applications need to use all of the following approaches to read "foreign" file formats, if not work with them completely.

Filename extension

One popular method used by many operating systems, including Windows, Mac OS X, CP/M, DOS, VMS, and VM/CMS, is to determine the format of a file from the end of its name: the letters following the final period. This portion of the filename is known as the filename extension. For example, HTML documents are identified by names that end with .html (or .htm), and GIF images by .gif. In the original FAT file system, file names were limited to an eight-character identifier and a three-character extension, known as an 8.3 filename. Because the number of possible three-letter extensions is limited, a given extension is often linked to more than one program. Many formats still use three-character extensions even though modern operating systems and application programs no longer have this limitation. Since there is no standard list of extensions, more than one format can use the same extension, which can confuse both the operating system and users.
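As a rough illustration of extension-based lookup, the sketch below takes the letters after the final period and looks them up in a small table. The table and the function names are hypothetical and exist only for this example; a real operating system keeps a much larger registry.

```python
from pathlib import PurePath

# Hypothetical lookup table for illustration only. Values are lists because
# nothing stops several formats from sharing one extension.
KNOWN_EXTENSIONS = {
    "html": ["HTML document"],
    "htm": ["HTML document"],
    "gif": ["GIF image"],
}

def extension_of(filename: str) -> str:
    """Return the lowercase letters after the final period, or '' if none."""
    return PurePath(filename).suffix.lstrip(".").lower()

def guess_formats(filename: str) -> list[str]:
    """Return every format registered for the file's extension."""
    return KNOWN_EXTENSIONS.get(extension_of(filename), [])

if __name__ == "__main__":
    for name in ("index.HTML", "logo.gif", "README"):
        print(name, "->", guess_formats(name))
```

Only the final extension is considered, so a name with two periods is judged by its last part alone.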

Internal metadata

A second way to identify a file format is to use information about the format stored inside the file itself, either information meant for this purpose or binary strings that happen to always be in specific locations in files of some formats. Since the easiest place to locate them is at the beginning, this area is usually called a file header when it is larger than a few bytes, or a magic number if it is just a few bytes long.

File header
The metadata contained in a file header are usually stored at the start of the file, but might be present in other areas too, often including the end, depending on the file format or the type of data contained. Character-based (text) files usually have character-based headers, whereas binary formats usually have binary headers, although this is not a rule. Text-based file headers usually take up more space, but being human-readable, they can easily be examined by using simple software such as a text editor or a hexadecimal editor.

As well as identifying the file format, file headers may contain metadata about the file and its contents. For example, most image files store information about image format, size, resolution and color space, and optionally authoring information such as who made the image, when and where it was made, what camera model and photographic settings were used (Exif), and so on. Such metadata may be used by software reading or interpreting the file during the loading process and afterwards.

Magic number or Shebang identifier in file contents

One way to incorporate file type metadata, often associated with Unix and its derivatives, is just to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginnings of files, but since any binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate document type definition that starts with <!DOCTYPE HTML>, or, for XHTML, the XML identifier, which begins with <?xml. The files can also begin with HTML comments, random text, or several empty lines, but still be usable HTML.

The magic number approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. Since reasonably reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is relatively inefficient, especially for displaying large lists of files (in contrast, file name and metadata-based methods need check only one piece of data, and match it against a sorted index). On the other hand, a valid magic number does not guarantee that the file is not corrupt or is of a correct type. So-called shebang lines in script files are a special case of magic numbers. Here, the magic number is human-readable text that identifies a specific command interpreter and options to be passed to the command interpreter.
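A minimal sketch of magic-number sniffing follows. The GIF signatures and the shebang prefix come from the text above; the PNG signature is the standard eight-byte one. The table layout and function names are assumptions made for this example, and a real "magic database" holds far more entries and more complex tests.

```python
from typing import Optional

# Each entry is (byte offset, expected bytes, label). Entries are tried in order.
SIGNATURES = [
    (0, b"GIF87a", "GIF image (87a)"),
    (0, b"GIF89a", "GIF image (89a)"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"#!", "script with shebang line"),
]

def sniff(data: bytes) -> Optional[str]:
    """Return the label of the first matching signature, or None."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None

def sniff_file(path: str) -> Optional[str]:
    """Read only the first few bytes of a file and sniff them."""
    with open(path, "rb") as handle:
        return sniff(handle.read(16))

if __name__ == "__main__":
    print(sniff(b"GIF89a\x01\x00\x01\x00"))
    print(sniff(b"#!/bin/sh\necho hi\n"))
    print(sniff(b"<html><body></body></html>"))  # no fixed signature
```

The HTML sample returns no match, which reflects the point made above: plain-text formats without a fixed prefix are hard to identify this way.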

External metadata

A final way of storing the format of a file is to explicitly store information about the format in the file system, rather than within the file itself. This approach keeps the metadata separate from both the main data and the name, but is also less portable than either file extensions or "magic numbers", since the format has to be converted from file system to file system. While this is also true to an extent with filename extensions—for instance, for compatibility with MS-DOS's three character limit—most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata.

Note that zip files or archive files solve the problem of handling metadata. A utility program collects multiple files together along with metadata about each file and the folders/directories they came from all within one new file (e.g. a zip file with extension .zip). The new file is also compressed and possibly encrypted, but now is transmissible as a single file across operating systems by FTP systems or attached to email. At the destination, it must be unzipped by a compatible utility to be useful, but the problems of transmission are solved this way.

MIME (Multipurpose Internet Mail Extensions) types

MIME types are widely used in many Internet-related applications, and increasingly elsewhere, although their usage for on-disc type information is rare. These consist of a standardized system of identifiers (managed by IANA) consisting of a type and a sub-type, separated by a slash—for instance, text/html or image/gif. These were originally intended as a way of identifying what type of file was attached to an e-mail, independent of the source and target operating systems. There are problems with the MIME types though; several organizations and people have created their own MIME types without registering them properly with IANA, which makes the use of this standard awkward in some cases.
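The type/sub-type structure can be sketched with Python's standard mimetypes module, which maps filename extensions to MIME types. The helper names here are made up for the example, and the result of the lookup depends on the mapping tables available on the machine.

```python
import mimetypes
from typing import Optional

def mime_for(filename: str) -> Optional[str]:
    """Guess a MIME type from the filename; returns None if unknown."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed

def split_mime(mime_type: str) -> tuple[str, str]:
    """Split 'type/subtype' at the slash."""
    major, _, minor = mime_type.partition("/")
    return major, minor

if __name__ == "__main__":
    for name in ("page.html", "logo.gif"):
        mime = mime_for(name)
        print(name, mime, split_mime(mime) if mime else None)
```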

File format identifiers (FFIDs)

File format identifiers are another, not widely used, way to identify file formats according to their origin and their file category. The scheme was created for the Description Explorer suite of software. An identifier has the form NNNNNNNNN-XX-YYYYYYY. The first part indicates the originating or maintaining organization (the number represents a value in a company/standards organization database). The two following digits categorize the type of file in hexadecimal. The final part is the usual file extension of the file or the international standard number of the file, padded on the left with zeros. For example, the PNG file specification has the FFID 000000001-31-0015948, where 31 indicates an image file, 0015948 is the standard number and 000000001 indicates the ISO organization.
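The sketch below splits an FFID into its three parts. The field widths (9, 2 and 7 characters) are inferred from the pattern and the example above rather than from a published specification, and the class and function names are hypothetical.

```python
import re
from typing import NamedTuple

class FFID(NamedTuple):
    organization: int  # value in the organization database
    category: int      # file category, written as two hexadecimal digits
    identifier: str    # extension or standard number, left padding removed

# Assumed layout: 9 decimal digits, 2 hex digits, 7 letters or digits.
_FFID_RE = re.compile(r"(\d{9})-([0-9A-Fa-f]{2})-([0-9A-Za-z]{7})")

def parse_ffid(text: str) -> FFID:
    """Parse 'NNNNNNNNN-XX-YYYYYYY' or raise ValueError."""
    match = _FFID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not an FFID: {text!r}")
    org, cat, ident = match.groups()
    return FFID(int(org), int(cat, 16), ident.lstrip("0"))

if __name__ == "__main__":
    print(parse_ffid("000000001-31-0015948"))
```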

File content based format identification

Another, less popular way to identify the file format is to examine the file contents for patterns that distinguish file types. The contents of a file are a sequence of bytes, and a byte can take 256 distinct values (0–255). Counting how often each byte value occurs, often referred to as the byte frequency distribution, therefore gives distinguishable patterns that identify file types. Many content-based file type identification schemes use the byte frequency distribution to build representative models for each file type and then apply statistical and data mining techniques to identify file types.
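A byte frequency distribution is simple to compute. The sketch below builds the 256-entry distribution and derives one crude feature from it, the share of printable bytes. That feature is chosen for illustration only; real schemes compare the whole distribution against trained models.

```python
from collections import Counter

def byte_frequency(data: bytes) -> list[float]:
    """Return 256 fractions: how often each byte value 0-255 occurs."""
    counts = Counter(data)
    total = len(data) or 1  # avoid dividing by zero on empty input
    return [counts.get(value, 0) / total for value in range(256)]

def printable_ratio(data: bytes) -> float:
    """Share of bytes that are printable ASCII, tab, LF or CR."""
    dist = byte_frequency(data)
    printable = list(range(0x20, 0x7F)) + [0x09, 0x0A, 0x0D]
    return sum(dist[value] for value in printable)

if __name__ == "__main__":
    print(printable_ratio(b"<html>hello</html>\n"))  # text-like input
    print(printable_ratio(bytes(range(256))))        # binary-like input
```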

File structure and Types

File structure describes how a file type lays out its encoded content on disk. There are several ways to structure data in a file. The most usual ones are described below.

Unstructured formats (Raw memory dumps)

Earlier file formats used raw data formats that consisted of directly dumping the memory images of one or more structures into the file. This has several drawbacks. Unless the memory images also have reserved spaces for future extensions, extending and improving this type of structured file is very difficult. It also creates files that might be specific to one platform or programming language (for example a structure containing a Pascal string is not recognized as such in C). On the other hand, developing tools for reading and writing these types of files is very simple.
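To make the idea concrete, here is a sketch of a fixed-layout record written and read with Python's struct module. The record (a 16-bit id, a 32-bit size and an 8-byte name) is invented for this example. The "<" prefix pins the byte order and disables padding; a true memory dump would use the machine's native layout, which is what makes such files platform-specific.

```python
import struct

# Hypothetical record: 16-bit id, 32-bit size, 8-byte name.
# "<" means little-endian with no padding, so the layout is always 14 bytes.
# The native layout ("@HI8s") may insert padding and differ between machines.
RECORD = struct.Struct("<HI8s")

def dump_record(rec_id: int, size: int, name: bytes) -> bytes:
    """Write the record exactly as laid out, with no tags or length fields."""
    return RECORD.pack(rec_id, size, name)

def load_record(raw: bytes) -> tuple[int, int, bytes]:
    """Read the record back; the reader must know the layout in advance."""
    rec_id, size, name = RECORD.unpack(raw)
    return rec_id, size, name.rstrip(b"\x00")

if __name__ == "__main__":
    raw = dump_record(7, 1024, b"logo")
    print(len(raw), load_record(raw))
```

Reading and writing take one call each, which is the simplicity noted above. Adding a field later changes the record size and breaks every existing reader, which is the extensibility problem.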

Chunk-based formats

In this kind of file structure, each piece of data is embedded in a container that somehow identifies the data. The container's scope can be identified by start- and end-markers of some kind, by an explicit length field somewhere, or by fixed requirements of the file format's definition. Throughout the 1970s, many programs used formats of this general kind, for example word processors such as troff, Script, and Scribe, and database export files such as CSV. Electronic Arts and Commodore-Amiga also used this type of file format in 1985, with their IFF (Interchange File Format). A container is sometimes called a "chunk", although "chunk" may also imply that each piece is small, and/or that chunks do not contain other chunks; many formats do not impose those requirements.
The information that identifies a particular "chunk" may be called many different things, often terms including "field name", "identifier", "label", or "tag". The identifiers are often human-readable, and classify parts of the data: for example, as a "surname", "address", "rectangle", "font name", etc. These are not the same thing as identifiers in the sense of a database key or serial number (although an identifier may well identify its associated data as such a key).

MIME headers do this with a colon-separated label at the start of each logical line. MIME headers cannot contain other MIME headers, though the data content of some headers has sub-parts that can be extracted by other conventions. CSV and similar files often do this using a header record with field names, and with commas to mark the field boundaries. Like MIME, CSV has no provision for structures with more than one level. XML and its kin can be loosely considered a kind of chunk-based format, since data elements are identified by markup that is akin to chunk identifiers. JSON is similar to XML without schemas, cross-references, or a definition for the meaning of repeated field-names, and is often convenient for programmers.
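The explicit-length-field variant of a chunk can be sketched as follows. The layout is a simplified, IFF-style one: a 4-byte tag, a 4-byte big-endian length, the payload, and a pad byte when the payload length is odd. The tag names used in the demo are made up, and nesting of chunks is left out.

```python
import struct
from typing import Iterator

def write_chunk(tag: bytes, payload: bytes) -> bytes:
    """Serialize one chunk: 4-byte tag, 4-byte big-endian length, payload."""
    if len(tag) != 4:
        raise ValueError("tag must be exactly 4 bytes")
    pad = b"\x00" if len(payload) % 2 else b""  # keep chunks 2-byte aligned
    return tag + struct.pack(">I", len(payload)) + payload + pad

def read_chunks(data: bytes) -> Iterator[tuple[bytes, bytes]]:
    """Yield (tag, payload) pairs from a flat sequence of chunks."""
    pos = 0
    while pos + 8 <= len(data):
        tag = data[pos:pos + 4]
        (length,) = struct.unpack_from(">I", data, pos + 4)
        start = pos + 8
        yield tag, data[start:start + length]
        pos = start + length + (length % 2)  # skip the pad byte, if any

if __name__ == "__main__":
    blob = write_chunk(b"NAME", b"abc") + write_chunk(b"BODY", b"1234")
    for tag, payload in read_chunks(blob):
        print(tag, payload)
```

Because every chunk carries its own length, a reader can skip tags it does not recognize, which is what makes this family of formats extensible.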

Directory-based formats

This is another extensible format, that closely resembles a file system (OLE Documents are actual file systems), where the file is composed of 'directory entries' that contain the location of the data within the file itself as well as its signatures (and in certain cases its type). Good examples of these types of file structures are disk images, OLE documents and TIFF images.
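A toy directory-based container might look like the sketch below. The format is entirely hypothetical (a "DIR0" magic, an entry count, then fixed-size directory entries holding a name, an absolute offset and a length) and is not the layout of TIFF or OLE; it only shows the idea of locating data through directory entries.

```python
import struct

HEADER = struct.Struct("<4sI")   # hypothetical magic b"DIR0" + entry count
ENTRY = struct.Struct("<8sII")   # name (max 8 bytes), absolute offset, length

def build_container(files: dict[str, bytes]) -> bytes:
    """Write header, then the directory, then the data blobs back to back."""
    offset = HEADER.size + ENTRY.size * len(files)
    entries, blobs = [], []
    for name, payload in files.items():
        entries.append(ENTRY.pack(name.encode("ascii"), offset, len(payload)))
        blobs.append(payload)
        offset += len(payload)
    return HEADER.pack(b"DIR0", len(files)) + b"".join(entries) + b"".join(blobs)

def read_entry(container: bytes, wanted: str) -> bytes:
    """Scan the directory for a name and slice its data out of the file."""
    magic, count = HEADER.unpack_from(container, 0)
    if magic != b"DIR0":
        raise ValueError("not a DIR0 container")
    for index in range(count):
        position = HEADER.size + index * ENTRY.size
        name, offset, length = ENTRY.unpack_from(container, position)
        if name.rstrip(b"\x00").decode("ascii") == wanted:
            return container[offset:offset + length]
    raise KeyError(wanted)

if __name__ == "__main__":
    blob = build_container({"meta": b"v1", "pixels": b"\x00\x01\x02"})
    print(read_entry(blob, "pixels"))
```

A reader only touches the directory and the one entry it wants, so new entry names can be added without disturbing older software.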

Few Popular File Types

Text File Types

Data File Types

Audio File Types

Note:
MIDI (Musical Instrument Digital Interface) is a technical standard that describes a communications protocol, digital interface and electrical connectors and allows a wide variety of electronic musical instruments, computers and other related music and audio devices to connect and communicate with one another.[1] A single MIDI link can carry up to sixteen channels of information, each of which can be routed to a separate device.

Video File Types

Note:
MP4 is an abbreviated term for MPEG-4 Part 14, a standard developed by the Moving Picture Experts Group (MPEG), which is responsible for setting industry standards for digital audio and video. MP4 is commonly used for sharing video files on the Web.

Executable File Types

Web File Types

Compressed File Types

System File Types

Settings File Types

Encoded File Types

Font File Types

Plugin File Types

Disk Image File Types

Developer File Types

Backup File Types

Image (Photo / Graphics / Animation) File Types

Basics of Digital Images

Our digital images are dimensioned in pixels (not bytes, and definitely not inches). And a pixel is simply a color definition, the color that this tiny dot of sampled image area ought to be. Put all those colored dots together, and our brain sees the image. The loss of image data we are speaking about is the altered color of the pixels.

Image data consists of pixels, and pixels are "colors", simply the storage of the three RGB data components. Any 24-bit RGB image will use three bytes per pixel. So, for example, any 10 megapixel camera image data will occupy 3x10 = 30 million bytes, by definition of RGB color. This number is the "data size" (when opened into computer memory for use). A TIF file will be near that size (and is lossless), but JPG is normally compressed very heavily (lossy, not lossless) to store in a JPG file of perhaps 1/10 this size (variable with JPG Quality setting), which is "file size" (not image size and not data size). This example image size is still 10 megapixels (dimensioned in pixels, width x height), and the data size is 30 million bytes, but the JPG file size might be 3 MB (lossy compression takes a few liberties). The image will still come out of the JPG file as the same 10 megapixels and the same 30 million bytes when the 3 MB JPG file is opened. We hope its quality also comes out about the same (the JPG losses are altered color values of some of the pixels).

Image size (pixels) determines how we can use the image - everything is about the pixels. All photo editor programs will support these file formats, which will generally support and store images in the following color modes:

Color data mode of File Types, bits per pixel

JPG

RGB - 24-bits (8-bit color), or Grayscale - 8-bits
Always uses lossy JPG compression, but its degree is selectable, for higher quality and larger files, or lower quality and smaller files. JPG is for photo images, and is the worst choice for most graphics or text data.

TIF
Versatile, many formats supported.
Mode: RGB or CMYK or LAB, and others, almost anything.
8 or 16-bits per color channel, called 8 or 16-bit "color" (24 or 48-bit RGB files).
Grayscale - 8 or 16-bits,
Indexed color - 1 to 8-bits,
Line Art (bilevel)- 1-bit
For TIF files, most programs allow either no compression or LZW compression (LZW is lossless, but is less effective for color images). Adobe Photoshop also provides JPG or ZIP compression in TIF files (but this greatly reduces third party compatibility of TIF files). "Document programs" allow CCITT G3 or G4 compression for 1-bit text (Fax is G3 or G4 TIF files), which is lossless and tremendously effective (small). Many specialized image file types (like camera RAW files) are TIF file format, but using special proprietary data tags.
24-bits is called 8-bit color, three 8-bit bytes for RGB (256x256x256 = 16.7 million colors maximum.)
Or 48-bits is called 16-bit color, three 16-bit words (65536x65536x65536 = trillions of colors conceptually)

PNG
RGB - 24 or 48-bits (called 8-bit or 16-bit "color"),
Alpha channel for RGB transparency - 32 bits
Grayscale - 8 or 16-bits,
Indexed color - 1 to 8-bits,
Line Art (bilevel) - 1-bit
Supports transparency in regular indexed color, and also there can be a fourth channel (called Alpha) which can map RGB graduated transparency (by pixel location, instead of only one color, and graduated, instead of only on or off).
The APNG version also supports animation (like GIF), showing several sequential frames fast to simulate motion.
PNG uses ZIP (Deflate) compression, which is lossless and somewhat more effective for color than GIF or TIF LZW. For photo data, PNG files are somewhat smaller than TIF LZW files, but larger than JPG files (however PNG is lossless, and JPG is not). PNG is a newer format than the others, designed to be both versatile and royalty free, back when the patent for LZW compression was disputed for GIF and TIF files.

GIF
Indexed color - 1 to 8-bits (8-bit indexes, limiting to only 256 colors maximum.) Color is 24-bit color, but only 256 colors.
One color in indexed color can be marked transparent, allowing the underlying background to be seen (very important for text, for example). GIF is an image format for on-screen (video display) use; the file contains no dpi information for printing. It was designed by CompuServe for online images in the days of dialup and 8-bit indexed computer video, whereas other file formats can be 24-bits now. However, GIF is still great for web use of graphics containing only a few colors, when it is a small lossless file, much smaller and better than JPG for this.
GIF uses lossless LZW compression.
GIF also supports animation, showing several sequential frames fast to simulate motion.
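What "lossless" means in the list above can be shown with Python's zlib module, which implements Deflate, the same algorithm family used by ZIP and PNG. This is a stand-in for illustration and is not LZW. The sample data is synthetic: one buffer of identical RGB pixels, as in a flat graphic, and one buffer of random bytes.

```python
import os
import zlib

flat = bytes([30, 144, 255]) * 10_000  # 10,000 identical RGB pixels
noisy = os.urandom(30_000)             # random bytes of the same length

def roundtrip(data: bytes) -> tuple[int, bool]:
    """Compress, decompress, and report (compressed size, identical?)."""
    packed = zlib.compress(data, level=9)
    return len(packed), zlib.decompress(packed) == data

if __name__ == "__main__":
    print("flat :", len(flat), "->", roundtrip(flat))
    print("noisy:", len(noisy), "->", roundtrip(noisy))
```

Both buffers come back byte for byte identical, but only the flat one shrinks much, which is why lossless formats do well on graphics with few colors and less well on photos.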

Note that if your image size is say 3000x2000 pixels, then this is 3000x2000 = 6 million pixels (6 megapixels). Assuming this 6 megapixel image data is RGB color and 24-bits (or 3 bytes per pixel of RGB color information), then the size of this image data is 6 million x 3 bytes RGB = 18 million bytes. That is simply how large your image data is. Then file compression like JPG or LZW can make the file smaller, but when you open the image in computer memory for use, the JPG may not still have the same image quality, but it is always still 3000x2000 pixels and 18 million bytes. This is simply how large your 6 megapixel RGB image data is (megapixels x 3 bytes per pixel).
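The arithmetic above, together with the color counts quoted for 24-bit and 48-bit RGB, fits in two small functions. The function names are chosen for this sketch, and the size ignores any row padding a particular format might add.

```python
def data_size_bytes(width: int, height: int, bits_per_pixel: int = 24) -> int:
    """Uncompressed size in memory: pixel count times bits per pixel, in bytes."""
    total_bits = width * height * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes

def color_count(bits_per_pixel: int) -> int:
    """Number of distinct values a pixel of this depth can encode."""
    return 2 ** bits_per_pixel

if __name__ == "__main__":
    print(data_size_bytes(3000, 2000))      # 24-bit RGB, 6 megapixels
    print(data_size_bytes(3000, 2000, 48))  # 48-bit RGB, same image size
    print(color_count(8), color_count(24))
```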

Type of Images

There are two types of image in the computer world: 1) raster images and 2) vector images.

Raster Image:
In computer graphics, a raster graphics or bitmap image is a dot matrix data structure, representing a generally rectangular grid of pixels, or points of color, viewable via a monitor, paper, or other display medium. Raster images are stored in image files with varying formats. Raster graphics are best used for non-line art images; specifically digitized photographs, scanned artwork or detailed graphics. Non-line art images are best represented in raster form because these typically include subtle chromatic gradations, undefined lines and shapes, and complex composition.
To maximize the quality of a raster image, you must keep in mind that the raster format is resolution-specific — meaning that raster images are defined and displayed at one specific resolution. Resolution in raster graphics is measured in dpi, or dots per inch. The higher the dpi, the better the resolution. Remember also that the resolution you actually observe on any output device is not a function of the file’s own internal specifications, but the output capacity of the device itself. Thus, high resolution images should only be used if your equipment has the capability to display them at high resolution.
Better resolution, however, comes at a price. Just as raster files are significantly larger than comparable vector files, high resolution raster files are significantly larger than low resolution raster files. Overall, as compared to vector graphics, raster graphics are less economical, slower to display and print, less versatile and more unwieldy to work with. Remember though that some images, like photographs, are still best displayed in raster format. Common raster formats include TIFF, JPEG, GIF, PCX and BMP files. Despite its shortcomings, raster format is still the Web standard — within a few years, however, vector graphics will likely surpass raster graphics in both prevalence and popularity.

Vector Image:

Unlike pixel-based raster images, vector graphics are based on mathematical formulas that define geometric primitives such as polygons, lines, curves, circles and rectangles. Because vector graphics are composed of true geometric primitives, they are best used to represent more structured images, like line art graphics with flat, uniform colors. Most created images (as opposed to natural images) meet these specifications, including logos, letterhead, and fonts.
Inherently, vector-based graphics are more malleable than raster images — thus, they are much more versatile, flexible and easy to use. The most obvious advantage of vector images over raster graphics is that vector images are quickly and perfectly scalable. There is no upper or lower limit for sizing vector images. Just as the rules of mathematics apply identically to computations involving two-digit numbers or two-hundred-digit numbers, the formulas that govern the rendering of vector images apply identically to graphics of any size.
Further, unlike raster graphics, vector images are not resolution-dependent. Vector images have no fixed intrinsic resolution, rather they display at the resolution capability of whatever output device (monitor, printer) is rendering them. Also, because vector graphics need not memorize the contents of millions of tiny pixels, these files tend to be considerably smaller than their raster counterparts. Overall, vector graphics are more efficient and versatile. Common vector formats include AI, EPS, CGM, WMF and PICT (Mac).

Difference in photo (Raster) and Graphics (Vector) images
Photo images have continuous tones, meaning that adjacent pixels often have very similar colors, for example, a blue sky might have many shades of blue in it. Normally this is 24-bit RGB color, or 8-bit grayscale, and a typical color photo may contain perhaps a hundred thousand RGB colors, out of the possible set of 16 million colors in 24-bit RGB color.
Graphic images are normally not continuous tone (gradients are possible in graphics, but are seen less often). Graphics are drawings, not photos, and they use relatively few colors, maybe only two or three, often less than 16 colors in the entire image. In a color graphic cartoon, the entire sky will be only one shade of blue where a photo might have dozens of shades. A map for example is graphics, maybe 4 or 5 map colors plus 2 or 3 colors of text, plus blue water and white paper, often less than 16 colors overall. Line art is a special case, only two colors (black or white, with no gray), for example clip art, fax, and of course text. Low resolution line art (like cartoons on the web) is often better as grayscale, to add anti-aliasing to hide the jaggies.
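The practical consequence of "few colors" is that a graphic fits in an indexed-color palette. The sketch below treats an image as a plain list of RGB tuples (an assumption made to keep the example self-contained) and converts it to a palette plus one-byte indices when it has 256 colors or fewer.

```python
from typing import Optional

RGB = tuple[int, int, int]

def to_indexed(pixels: list[RGB]) -> Optional[tuple[list[RGB], bytes]]:
    """Return (palette, indices) if the image has at most 256 colors, else None."""
    palette = list(dict.fromkeys(pixels))  # unique colors in first-seen order
    if len(palette) > 256:
        return None                        # too many colors for 8-bit indexes
    lookup = {color: index for index, color in enumerate(palette)}
    return palette, bytes(lookup[pixel] for pixel in pixels)

if __name__ == "__main__":
    sky, paper = (135, 206, 235), (255, 255, 255)
    print(to_indexed([sky, sky, paper, sky]))
```

Each pixel then costs one byte instead of three, plus the small palette, while a photo with many thousands of colors returns None here and would need its colors reduced first.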

Below are a few popular Picture/Image File Formats and Types

The most common image file formats, the most important for cameras, printing, scanning, and internet use, are JPG, TIF, PNG, and GIF.

3D Image Types

Raster Image Types

Vector Images Types

JPG is the most used image file format. JPG is the file extension for JPEG files (Joint Photographic Experts Group, committee of ISO and ITU). Digital cameras and web pages normally use JPG files - because JPG heroically compresses the data to be very much smaller in the file. However JPG uses lossy compression to accomplish this feat, which is a strong downside. A smaller file, yes, there is nothing like JPG for small, but this is at the cost of image quality. This degree is selectable (with an option setting named JPG Quality), to be lower quality smaller files, or to be higher quality larger files. In general today, JPG is rather unique in this regard, using lossy compression allowing very small files of lower quality, whereas almost any other file type uses lossless compression (and is larger). The meaning of Lossy is discussed below.
Frankly, JPG is used when small file size is more important than maximum image quality (web pages, email, memory cards, etc). But JPG is good enough in many cases, if we don't overdo the compression. Perhaps good enough for some uses even if we do overdo it (web pages, etc). But if you are concerned with maximum quality for archiving your important images, then you do need to know two things: 1) always choose higher JPG Quality and a larger file, and 2) do NOT keep editing and saving your JPG images repeatedly, because more quality is lost every time you save it as JPG (in the form of added JPG artifacts... pixels become colors they ought not to be - lossy).
TIF is lossless (including LZW compression option), which is considered the highest quality format for commercial work. The TIF format is not necessarily any "higher quality" per se (the same RGB image pixels, they are what they are), and most formats other than JPG are lossless too. TIF simply has no JPG artifacts, no additional losses or JPG artifacts to degrade and detract from the original. And TIF is the most versatile, except that web pages don't show TIF files. For other purposes however, TIF does most of anything you might want, from 1-bit to 48-bit color, RGB, CMYK, LAB, or Indexed color. Most any of the "special" file types (for example, camera RAW files, fax files, or multipage documents) are based on TIF format, but with unique proprietary data tags - making these incompatible unless expected by their special software.
GIF was designed by CompuServe in the early days of computer 8-bit video, before JPG, for video display at dial up modem speeds. GIF discards all Exif data, and while GIF is fine for video screen purposes, GIF does not retain printing resolution values. GIF always uses lossless LZW compression, but it is always an indexed color file (1 to 8-bits per pixel). GIF can have a palette of 24-bit colors, but only 256 of them maximum (which colors depend on your image colors). GIF is rather limited in colors for color photos, but is generally great for graphics. Repeating, don't use indexed color for color photos today, the color is too limited. GIF offers transparency and animation. PNG and TIF files can also optionally handle the same indexed color mode that GIF uses, but they are more versatile with other choices too (can be RGB or 16 bits, etc). But GIF is still very good for web graphics (i.e., with a limited number of colors). For graphics of only a few colors, GIF can be much smaller than JPG, with more clear pure colors than JPG.
PNG can replace GIF today (web browsers show both), and PNG also offers many options of TIF too (indexed or RGB, 1 to 48-bits, etc). PNG was invented more recently than the others, designed to bypass possible LZW compression patent issues with GIF, and since it was more modern, it offers other options too (RGB color modes, 16 bits, etc). One additional feature of PNG is transparency for 24 bit RGB images. Normally PNG files are a little smaller than LZW compression in TIF or GIF (all of these use lossless compression, of different types), but PNG is slower to read or write. That patent situation has gone away now, but PNG remains excellent lossless compression. Less used than TIF or JPG, but PNG is another good choice for lossless quality work.
Camera RAW files are very important of course, but RAW files must be processed to regular formats (JPG, TIF, etc) to be viewable and usable in any way. However, the point is that RAW offers substantial benefit in doing that, one of which is we can choose our settings AFTER we can see the image, and what it needs, and what helps it. The debate goes on, some cannot imagine NOT taking advantage of the greater opportunities of RAW. Others think any extra step is too much trouble, and are satisfied with JPG - my own biased opinion is they just don't know yet. :)
We could argue that there really is no concept of RAW files from the scanner. Vuescan does offer an output called RAW, which is 16 bit, but RGB, not raw. It includes the fourth Infrared noise correction channel data if any, and defers gamma correction. Vuescan itself is the only post-processor for these. But scanner color images are already RGB color, instead of Bayer pattern raw data like from cameras. Camera RAW images are not RGB, and must be converted to RGB for any use.

Major considerations to choose the necessary file type include:

Compression quality - Lossy for smallest files (JPG), or Lossless for best quality images (TIF, PNG).
Full RGB color for photos (TIF, PNG, JPG), or Indexed Color for graphics (PNG, GIF, TIF).
16-bit color (48-bit RGB data) is sometimes desired (TIF and PNG).
Transparency or Animation is used in graphics (GIF and PNG).
Documents - line art, multi-page, text, fax, etc - this will be TIF.
CMYK color is certainly important for commercial prepress (TIF).

We select the file type that supports the options we need.

The only reason for using lossy compression is for smaller file size, usually due to internet transmission speed or storage space. Web pages require JPG or GIF or PNG image types, because some browsers do not show TIF files. On the web, JPG is the clear choice for photo images (smallest file, with image quality being less important than file size), and GIF is common for graphic images, but indexed color is not normally used for color photos (PNG can do either on the web).

Other than the web, TIF file format is the undisputed leader when best quality is desired, largely because TIF is so important in commercial printing environments. High Quality JPG can be pretty good too, but don't ruin them by making the files too small. If the goal is high quality, you don't want small. Only consider making JPG large instead, and plan your work so that you save them as JPG only one or two times. Adobe RGB color space may be OK for your home printer and profiles, but if you send your pictures out to be printed, the mass market printing labs normally only accept JPG files, and only process sRGB color space.

Saturday, 14 October 2017
