Portable Filenames

From Gramps
Revision as of 19:37, 20 July 2008 by Duncan (talk | contribs) (Test files)


In order to be able to move our media files from one computer to another it is critical that the names of our files can be understood by the different file systems and encodings they meet.

This article looks for a set of characters that meets all these criteria. It was originally based on content from the Wikipedia online encyclopedia, especially the articles Filenames, Comparison of file systems and ASCII character encoding. Please add other references to improve this article.

File system issues

For genealogy purposes you will need to decide in how many different situations you want your files to be usable. To see what file systems you have on your computer !!NEEDS TO BE WRITTEN!! If you sometimes use a USB key, remember that these typically use the FAT32 file system, which does not accept the same file names as (for example) the file systems typically used by Ubuntu Linux.

This page assumes you want to support the situations listed below and don't use more than 255 characters in the file names.

  • Windows accessing hard drives with the
    • NTFS file system
    • FAT32 file system
  • Linux accessing hard drives with the
    • NTFS file system
    • FAT32 file system (VFAT)
    • EXT2 and EXT3 file systems

Note that FAT12 and FAT16 originally used 8.3 filenames, which limit the base name to 8 characters and the extension to 3. Long Filenames (LFN) were introduced in 1994.
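The two length limits mentioned above can be checked with a few lines of Python. This is only a sketch: the function names fits_length_limits and fits_8_3 are made up for illustration, the UTF-8 byte check is an added assumption (some file systems count bytes rather than characters), and the 8.3 check looks only at lengths, not at which characters are allowed.

```python
def fits_length_limits(name, max_len=255):
    """Return True if name is within max_len as characters and as UTF-8 bytes."""
    # Assumption: counting UTF-8 bytes as well is the cautious choice,
    # because a non-ASCII character takes more than one byte.
    return len(name) <= max_len and len(name.encode("utf-8")) <= max_len


def fits_8_3(name):
    """Return True if name has a base of 1-8 characters and an extension of 0-3.

    Only the lengths are checked, not which characters are allowed.
    """
    base, dot, ext = name.rpartition(".")
    if not dot:  # no period at all: the whole name is the base
        base, ext = name, ""
    return 1 <= len(base) <= 8 and len(ext) <= 3 and "." not in base


print(fits_8_3("photo.jpg"))         # True
print(fits_8_3("family_tree.jpeg"))  # False
```

A name that passes fits_8_3 should survive even on old FAT12 and FAT16 media, but most users will only need the 255 character limit.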

Reserved characters and words

(Copied directly from Filename 20th July 2008, 16:14 CET)

Many operating systems prohibit control characters from appearing in file names. For example, DOS and early Windows systems require files to follow the 8.3 filename convention. Unix-like systems are an exception: the only control character they forbid in file names is the null character, because it is the end-of-string indicator in C. Unix also excludes the path separator / from file names.

Some operating systems prohibit some particular characters from appearing in file names:

Character  Name              Reason
/          slash             used as a path name component separator in Unix-like, Windows, and Amiga systems. (The MS-DOS command.com shell would consume it as a switch character, but Windows itself always accepts it as a separator [1])
\          backslash         also used as a path name component separator in MS-DOS, OS/2 and Windows (there is no difference between slash and backslash); allowed in Unix filenames, see Note 1
?          question mark     used as a wildcard in Unix, Windows and AmigaOS; marks a single character. Allowed in Unix filenames, see Note 1
%          percent sign      used as a wildcard in RT-11; marks a single character.
*          asterisk          used as a wildcard in Unix, MS-DOS, RT-11, VMS and Windows. Marks any sequence of characters (Unix, Windows, later versions of MS-DOS) or any sequence of characters in either the basename or extension (thus "*.*" in early versions of MS-DOS means "all files"). Allowed in Unix filenames, see Note 1
:          colon             used to determine the mount point / drive on Windows; used to determine the virtual device or physical device such as a drive on AmigaOS, RT-11 and VMS; used as a pathname separator in classic Mac OS. Doubled after a name on VMS, indicates the DECnet nodename (equivalent to a NetBIOS (Windows networking) hostname preceded by "\\".)
|          vertical bar      designates software pipelining in Unix and Windows; allowed in Unix filenames, see Note 1
"          quotation mark    used to mark beginning and end of filenames containing spaces in Windows, see Note 1
<          less than         used to redirect input, allowed in Unix filenames, see Note 1
>          greater than      used to redirect output, allowed in Unix filenames, see Note 1
.          full stop/period  allowed, but the last occurrence will be interpreted as the extension separator in VMS, MS-DOS and Windows. In other OSes, usually considered part of the filename, and more than one full stop may be allowed.

Note 1: Most Unix shells require certain characters such as spaces, <, >, |, \, and sometimes :, (, ), &, ;, as well as wildcards such as ? and *, to be quoted or escaped:

five\ and\ six\<seven (example of escaping)
'five and six<seven' or "five and six<seven" (examples of quoting)
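When file names are passed to a Unix shell from a Python program, the quoting described in Note 1 does not have to be done by hand. The short sketch below uses shlex.quote from the Python standard library; the sample names are only illustrations.

```python
import shlex

name = "five and six<seven"

# shlex.quote wraps the name in single quotes when it contains
# characters that the shell would otherwise interpret.
print(shlex.quote(name))         # 'five and six<seven'

# A name made only of harmless characters is returned unchanged.
print(shlex.quote("plain.txt"))  # plain.txt
```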

In Windows the space and the period are not allowed as the final character of a filename. The period is allowed as the first character, but certain Windows applications, such as Windows Explorer, forbid creating or renaming such files (despite this convention being used in Unix-like systems to mark hidden files and directories). Workarounds include using a different file manager or saving the file under the desired name from within an application.

Some file systems on a given operating system (especially file systems originally implemented on other operating systems), and particular applications on that operating system, may apply further restrictions and interpretations. See comparison of file systems for more details on restrictions imposed by particular file systems.

In Unix-like systems, MS-DOS, and Windows, the file names "." and ".." have special meanings (current and parent directory respectively).

In addition, in Windows and DOS, some words are reserved and cannot be used as filenames, for example the DOS device file names:

CON, PRN, AUX, CLOCK$, NUL
COM0, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9
LPT0, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.

Operating systems that have these restrictions cause incompatibilities with some other filesystems. For example, Windows will fail to handle, or raise error reports for, these legal UNIX filenames: aux.c, q"uote"s.txt, or NUL.txt.
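The naming rules above (reserved device names, the special names "." and "..", and the ban on a trailing space or period) can be checked before a file is copied to a Windows machine. The following is a minimal sketch, not a complete validator: the names RESERVED_NAMES and windows_name_problems are invented for this example, and it does not look at reserved characters at all.

```python
# Device names from the list above. Windows compares them without regard
# to case and also treats them as reserved when an extension follows,
# which is why aux.c and NUL.txt cause trouble.
RESERVED_NAMES = {"CON", "PRN", "AUX", "CLOCK$", "NUL"}
RESERVED_NAMES |= {f"COM{i}" for i in range(10)}
RESERVED_NAMES |= {f"LPT{i}" for i in range(10)}


def windows_name_problems(name):
    """Return a list of reasons why Windows may reject this file name."""
    problems = []
    if name in (".", ".."):
        problems.append("special directory name")
    elif name.endswith((" ", ".")):
        problems.append("ends with a space or period")
    # The part before the first period decides whether a device is meant.
    stem = name.split(".", 1)[0].upper()
    if stem in RESERVED_NAMES:
        problems.append("reserved device name")
    return problems


print(windows_name_problems("aux.c"))       # ['reserved device name']
print(windows_name_problems("family.jpg"))  # []
```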

If you put your files onto any other file system you risk losing part of a file's name, the capitalisation of the name, or any characters that the file system does not accept in a file name. All these problems can make it hard to find and recover your files.

Encoding Issues

Please read the relevant Wikipedia article for background. Many of us use computers with Unicode as the default character encoding, which supports more than 100,000 characters. It is now standard for Linux systems (citation needed). However, many systems still in use default to ASCII encoding, which supports only 128 characters, not all of which are printable.

The first clear choice then is to use only characters which are in the ASCII character set. If we use other Unicode characters we can easily end up with something which won't be understood by an operating system using ASCII. For example, Åse (a girl's name in Danish) simply cannot be written in ASCII, because Å is not an ASCII character.
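Whether a name stays inside the ASCII set can be tested directly in Python. The sketch below uses two hypothetical helper functions, is_ascii and strip_accents. The second one is lossy: it changes the spelling of the name, and letters that have no accent-free decomposition (such as ø or æ) are dropped entirely, so treat it as a starting point rather than a recommendation.

```python
import unicodedata


def is_ascii(name):
    """Return True if every character of name is in the ASCII set."""
    try:
        name.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def strip_accents(name):
    """Decompose accented letters and drop whatever is still not ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(is_ascii("Åse"))       # False
print(strip_accents("Åse"))  # Ase
```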

Safe characters

Keeping our sights firmly on the ASCII character set, specifically the ASCII printable characters (and skipping the control characters), we get the list below of available characters.

Glyph    Name                             Safe?  Remarks
a-z      English letters a through to z   Yes    Not reserved
A-Z      English letters A through to Z   Yes    Not reserved
0-9      Digits 0 through to 9            Yes    Not reserved
space    Space                            Yes    Not reserved, but not allowed as the final character of a filename in Windows
!        Exclamation mark                 Yes    Not reserved
"        Quotation mark                   No     Used to mark beginning and end of filenames containing spaces in Windows
#        Number sign                      Yes    Not reserved
$        Dollar sign                      Yes    Not reserved. Used to start variables in many programming languages
%        Percent sign                     No     Used as a wildcard in RT-11; marks a single character
&        Ampersand                        Yes    Not reserved
'        Apostrophe                       Yes    Not reserved
( and )  Parentheses                      Yes    Not reserved
*        Asterisk                         No     Used as a wildcard in Unix, MS-DOS, RT-11, VMS and Windows. Marks any sequence of characters (Unix, Windows, later versions of MS-DOS) or any sequence of characters in either the basename or extension (thus "*.*" in early versions of MS-DOS means "all files"). Allowed in Unix filenames.
+        Plus                             Yes    Not reserved
,        Comma                            Yes    Not reserved
-        Hyphen                           Yes    Not reserved
.        Period / Full stop               Yes    Allowed, but the last occurrence will be interpreted as the extension separator in VMS, MS-DOS and Windows. In other OSes, usually considered part of the filename, and more than one full stop may be allowed.
/ and \  Slash and Backslash              No     Slash is used as a path name component separator in Unix-like, Windows, and Amiga systems. Backslash is also used as a path name component separator in MS-DOS, OS/2 and Windows (there is no difference between slash and backslash); allowed in Unix filenames.
:        Colon                            No     Used to determine the mount point / drive on Windows; used as a pathname separator in classic Mac OS
;        Semicolon                        Yes    Not reserved
<        Less than sign                   No     Used to redirect input, allowed in Unix filenames
=        Equals sign                      Yes    Not reserved
>        Greater than sign                No     Used to redirect output, allowed in Unix filenames
?        Question mark                    No     Used as a wildcard in Unix, Windows and AmigaOS; marks a single character. Allowed in Unix filenames
@        At sign                          Yes    Not reserved
[ and ]  Square brackets or box brackets  Yes    Not reserved
^        Caret                            Yes    Not reserved
_        Underscore                       Yes    Not reserved
`        Grave accent                     No     Not reserved, but outside the US often replaced by the local currency symbol. Many older UK computers, such as the ZX Spectrum and BBC Micro, have the £ symbol in its place.
{ and }  Curly brackets                   Yes    Not reserved
|        Vertical bar                     No     Designates software pipelining in Unix and Windows; allowed in Unix filenames
~        Tilde                            Yes    Not reserved
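A list like this can be turned into a small clean-up routine. The sketch below is one possible approach and not part of GRAMPS: the names UNSAFE and make_portable are invented here, and the unsafe set is an assumption, namely every character that the table of reserved characters earlier on this page lists as reserved on Windows, plus the percent sign and the grave accent.

```python
# Assumption: characters reserved on Windows according to the table of
# reserved characters, plus % and the grave accent.
UNSAFE = set('"%*/\\:<>?|`')


def make_portable(name, replacement="_"):
    """Replace unsafe, control and non-ASCII characters with replacement."""
    cleaned = "".join(
        # Keep only printable ASCII (space to tilde) that is not unsafe.
        ch if " " <= ch <= "~" and ch not in UNSAFE else replacement
        for ch in name
    )
    # Windows does not allow a trailing space or period.
    return cleaned.rstrip(" .")


print(make_portable('q"uote"s.txt'))   # q_uote_s.txt
print(make_portable("1900: census?"))  # 1900_ census_
```

Note that two different original names can end up with the same cleaned name, and a name consisting only of periods and spaces becomes empty, so the result should be checked before files are renamed.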

Test files

Here are some filenames which GRAMPS users have tested to make sure they are okay.

Filename   Operating system  File system  Description                                                     User name
~test.txt  Windows Vista     NTFS         Simply created on a Vista machine's desktop, not moved around.  Duncan