Portable Filenames

From Gramps
Revision as of 16:24, 20 July 2008 by Duncan (talk | contribs) (Safe characters)

In order to be able to move our media files from one computer to another it is critical that the names of our files can be understood by the different file systems and encodings they meet.

This article tries to find a set of characters that meets all these criteria. It was originally based on content from Wikipedia, especially the articles Filenames, Comparison of file systems and ASCII character encoding. Please add other references to improve this article.

File system issues

For genealogy purposes you will need to decide how many different situations you want your files to be usable in. To see what file systems you have on your computer !!NEEDS TO BE WRITTEN!! If you sometimes use a USB key you should remember that they typically use the FAT32 file system, which does not accept the same file names as (for example) Ubuntu Linux.

This page assumes you want to support the situations listed below and don't use more than 255 characters in the file names.

  • Windows accessing hard drives with the
    • NTFS file system
    • FAT32 file system
  • Linux accessing hard drives with the
    • NTFS file system
    • FAT32 file system (VFAT)
    • EXT2 and EXT3 file systems

Note that FAT12 and FAT16 originally used 8.3 filenames, which limited the useful file name length to 8 characters (plus three for the extension) until Long Filenames (LFN) were introduced in 1994.
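The two length limits mentioned above can be checked mechanically. The following is a minimal sketch in Python; the function names are hypothetical, and the 8.3 check only looks at lengths, not at which characters are allowed.

```python
import re

# Limit assumed by this page: no more than 255 characters per file name.
MAX_NAME_LENGTH = 255

# Up to 8 characters, optionally followed by a period and up to 3 characters.
# This only checks lengths; it does not check which characters are legal.
_EIGHT_DOT_THREE = re.compile(r"[^.]{1,8}(\.[^.]{1,3})?")


def fits_length_limit(name: str) -> bool:
    """Return True if the name is within the 255-character limit."""
    return 0 < len(name) <= MAX_NAME_LENGTH


def fits_8_3(name: str) -> bool:
    """Return True if the name has the shape of an 8.3 filename."""
    return _EIGHT_DOT_THREE.fullmatch(name) is not None
```

For example, `fits_8_3("photo.jpg")` is true, while `fits_8_3("grandmother.jpeg")` is false because both the name and the extension are too long.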

Reserved characters and words

(Copied directly from Filename 20th July 2008, 16:14 CET)

Many operating systems prohibit control characters from appearing in file names. For example, DOS and early Windows systems require files to follow the 8.3 filename convention. Unix-like systems are an exception, as the only control character forbidden in file names is the null character, as that's the end-of-string indicator in C. Trivially, Unix also excludes the path separator / from appearing in filenames.

Some operating systems prohibit some particular characters from appearing in file names:

Character  Name              Reason
/          slash             used as a path name component separator in Unix-like, Windows, and Amiga systems. (The MS-DOS command.com shell would consume it as a switch character, but Windows itself always accepts it as a separator [1])
\          backslash         also used as a path name component separator in MS-DOS, OS/2 and Windows (there is no difference between slash and backslash); allowed in Unix filenames, see Note 1
?          question mark     used as a wildcard in Unix, Windows and AmigaOS; marks a single character. Allowed in Unix filenames, see Note 1
%          percent sign      used as a wildcard in RT-11; marks a single character.
*          asterisk          used as a wildcard in Unix, MS-DOS, RT-11, VMS and Windows. Marks any sequence of characters (Unix, Windows, later versions of MS-DOS) or any sequence of characters in either the basename or extension (thus "*.*" in early versions of MS-DOS means "all files"). Allowed in Unix filenames, see Note 1
:          colon             used to determine the mount point / drive on Windows; used to determine the virtual device or physical device such as a drive on AmigaOS, RT-11 and VMS; used as a pathname separator in classic Mac OS. Doubled after a name on VMS, indicates the DECnet nodename (equivalent to a NetBIOS (Windows networking) hostname preceded by "\\".)
|          vertical bar      designates software pipelining in Unix and Windows; allowed in Unix filenames, see Note 1
"          quotation mark    used to mark the beginning and end of filenames containing spaces in Windows, see Note 1
<          less than         used to redirect input; allowed in Unix filenames, see Note 1
>          greater than      used to redirect output; allowed in Unix filenames, see Note 1
.          full stop/period  allowed, but the last occurrence will be interpreted as the extension separator in VMS, MS-DOS and Windows. In other OSes, usually considered part of the filename, and more than one full stop may be allowed.
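As an illustration, the characters in the table above can be collected into a set and used to test a candidate file name. This is only a sketch: the names `RESERVED_CHARACTERS` and `reserved_characters_in` are hypothetical, and the period is left out because the table says it is allowed.

```python
# Characters from the table above, except the period, which is allowed.
RESERVED_CHARACTERS = set('/\\?%*:|"<>')


def reserved_characters_in(name: str) -> list:
    """Return the reserved characters found in the name, sorted."""
    return sorted(set(name) & RESERVED_CHARACTERS)
```

A name is free of the characters in the table when the returned list is empty.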

Note 1: Most Unix shells require certain characters such as spaces, <, >, |, \, and sometimes :, (, ), &, ;, as well as wildcards such as ? and *, to be quoted or escaped:

five\ and\ six\<seven (example of escaping)
'five and six<seven' or "five and six<seven" (examples of quoting)
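When a file name has to be passed to a Unix shell from a program, the quoting described in Note 1 does not need to be done by hand. The sketch below uses `shlex.quote` from the Python standard library; the wrapper name `quote_for_shell` is hypothetical.

```python
import shlex


def quote_for_shell(name: str) -> str:
    """Return the name quoted so that a POSIX shell reads it as one word."""
    return shlex.quote(name)


# Names containing spaces or < are wrapped in single quotes;
# names made only of harmless characters are returned unchanged.
print(quote_for_shell("five and six<seven"))
print(quote_for_shell("plain.txt"))
```

Note that this follows POSIX shell rules only; it is not meant for the Windows command prompt.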

In Windows the space and the period are not allowed as the final character of a filename. The period is allowed as the first character, but certain Windows applications, such as Windows Explorer, forbid creating or renaming such files (despite this convention being used in Unix-like systems to describe hidden files and directories). Among workarounds are using different explorer applications or saving a file from an application with the desired name.
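The Windows rules on the first and last character can be tested with two small checks. This is a minimal sketch with hypothetical function names.

```python
def has_windows_unsafe_ending(name: str) -> bool:
    """Return True if the name ends in a space or a period."""
    return name.endswith((" ", "."))


def starts_with_period(name: str) -> bool:
    """Return True if the name starts with a period.

    Such names are allowed, but some Windows applications refuse to
    create or rename them.
    """
    return name.startswith(".")
```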

Some file systems on a given operating system (especially file systems originally implemented on other operating systems), and particular applications on that operating system, may apply further restrictions and interpretations. See comparison of file systems for more details on restrictions imposed by particular file systems.

In Unix-like systems, MS-DOS, and Windows, the file names "." and ".." have special meanings (current and parent directory respectively).

In addition, in Windows and DOS some words are reserved and cannot be used as filenames. For example, the DOS device file names:

CON, PRN, AUX, CLOCK$, NUL
COM0, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9
LPT0, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.

Operating systems that have these restrictions cause incompatibilities with some other filesystems. For example, Windows will fail to handle, or raise error reports for, these legal UNIX filenames: aux.c, q"uote"s.txt, or NUL.txt.
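A check for the reserved words listed above might look like the following sketch. It assumes that the comparison ignores case and that only the part before the first period matters, which matches the aux.c and NUL.txt examples; the names used are hypothetical.

```python
# Reserved device names as listed on this page.
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "CLOCK$", "NUL"}
    | {f"COM{n}" for n in range(10)}
    | {f"LPT{n}" for n in range(10)}
)


def is_reserved_device_name(name: str) -> bool:
    """Return True if the part before the first period is a reserved word."""
    stem = name.split(".", 1)[0]
    return stem.upper() in RESERVED_NAMES
```

With this check both `aux.c` and `NUL.txt` are reported as reserved, while `auxiliary.c` is not.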

If you put your files onto any other file system you risk losing part of a file's name, the capitalisation of the name, or characters that are not accepted in a file name. All these problems can make it hard to find and recover your files.
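Losing capitalisation matters most when two names differ only in case, because they may then refer to the same file on a file system that ignores case. The sketch below groups such names; the function name `case_collisions` is hypothetical, and lower-casing is assumed to be a good enough comparison for ASCII names.

```python
from collections import defaultdict


def case_collisions(names):
    """Return groups of names that are equal when case is ignored."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    # Keep only the groups where more than one name collides.
    return [group for group in groups.values() if len(group) > 1]
```

For example, `Photo.jpg` and `photo.JPG` end up in the same group, so one of them should be renamed before the files are copied.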

Encoding Issues

Please read the Wikipedia article. Many of us use computers with Unicode as the default character encoding, which supports about 100,000 characters. It is now standard for Linux systems [citation requested]. But many systems still in use default to ASCII encoding, which supports only 128 characters, not all of which can be used in text.

The first clear choice then is to use characters which are in the ASCII character set. If we use other Unicode characters we can easily end up with a name that won't be understood by an operating system using ASCII. For example, Åse (a girl's name in Danish) simply cannot be represented in ASCII, because Å is not in the ASCII character set.
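Whether a name stays inside the ASCII character set can be tested by trying to encode it. This is a minimal sketch; the function name `is_ascii` is hypothetical.

```python
def is_ascii(name: str) -> bool:
    """Return True if every character of the name is in the ASCII set."""
    try:
        name.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True
```

Here `is_ascii("Ase")` is true and `is_ascii("Åse")` is false.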

Safe characters

Keeping our sights firmly on the ASCII character set, specifically the ASCII printable characters (and skipping the control characters), we get the list below:

Glyph  Name              Safe character?  Remarks
       space             Yes              Not reserved
!      exclamation mark  Yes              Not reserved
"      quotation mark    No               used to mark beginning and end of filenames containing spaces in Windows
#      number sign       Yes              Not reserved
$      dollar sign       Yes              Not reserved. Used to start variables in many programming languages
%      percent sign      No               used as a wildcard in RT-11; marks a single character
&      ampersand         No               Not reserved
'      apostrophe        Yes              Not reserved
(

Binary    Oct  Dec  Hex  Glyph
010 0111  047  39   27   '
010 1000  050  40   28   (
010 1001  051  41   29   )
010 1010  052  42   2A   *
010 1011  053  43   2B   +
010 1100  054  44   2C   ,
010 1101  055  45   2D   -
010 1110  056  46   2E   .
010 1111  057  47   2F   /
011 0000  060  48   30   0
011 0001  061  49   31   1
011 0010  062  50   32   2
011 0011  063  51   33   3
011 0100  064  52   34   4
011 0101  065  53   35   5
011 0110  066  54   36   6
011 0111  067  55   37   7
011 1000  070  56   38   8
011 1001  071  57   39   9
011 1010  072  58   3A   :
011 1011  073  59   3B   ;
011 1100  074  60   3C   <
011 1101  075  61   3D   =
011 1110  076  62   3E   >
011 1111  077  63   3F   ?
100 0000  100  64   40   @
100 0001  101  65   41   A
100 0010  102  66   42   B
100 0011  103  67   43   C
100 0100  104  68   44   D
100 0101  105  69   45   E
100 0110  106  70   46   F
100 0111  107  71   47   G
100 1000  110  72   48   H
100 1001  111  73   49   I
100 1010  112  74   4A   J
100 1011  113  75   4B   K
100 1100  114  76   4C   L
100 1101  115  77   4D   M
100 1110  116  78   4E   N
100 1111  117  79   4F   O
101 0000  120  80   50   P
101 0001  121  81   51   Q
101 0010  122  82   52   R
101 0011  123  83   53   S
101 0100  124  84   54   T
101 0101  125  85   55   U
101 0110  126  86   56   V
101 0111  127  87   57   W
101 1000  130  88   58   X
101 1001  131  89   59   Y
101 1010  132  90   5A   Z
101 1011  133  91   5B   [
101 1100  134  92   5C   \
101 1101  135  93   5D   ]
101 1110  136  94   5E   ^
101 1111  137  95   5F   _
110 0000  140  96   60   `
110 0001  141  97   61   a
110 0010  142  98   62   b
110 0011  143  99   63   c
110 0100  144  100  64   d
110 0101  145  101  65   e
110 0110  146  102  66   f
110 0111  147  103  67   g
110 1000  150  104  68   h
110 1001  151  105  69   i
110 1010  152  106  6A   j
110 1011  153  107  6B   k
110 1100  154  108  6C   l
110 1101  155  109  6D   m
110 1110  156  110  6E   n
110 1111  157  111  6F   o
111 0000  160  112  70   p
111 0001  161  113  71   q
111 0010  162  114  72   r
111 0011  163  115  73   s
111 0100  164  116  74   t
111 0101  165  117  75   u
111 0110  166  118  76   v
111 0111  167  119  77   w
111 1000  170  120  78   x
111 1001  171  121  79   y
111 1010  172  122  7A   z
111 1011  173  123  7B   {
111 1100  174  124  7C   |
111 1101  175  125  7D   }
111 1110  176  126  7E   ~
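The printable ASCII characters considered in this section run from the space (decimal 32) to the tilde (decimal 126). As a sketch, the same list, with each character's binary, octal, decimal and hexadecimal code, can be generated in Python; the names `PRINTABLE_ASCII` and `ascii_row` are hypothetical.

```python
# All printable ASCII characters, from space (32) to tilde (126).
PRINTABLE_ASCII = [chr(code) for code in range(0x20, 0x7F)]


def ascii_row(character: str) -> str:
    """Return 'binary octal decimal hex glyph' for one character."""
    code = ord(character)
    return f"{code:07b} {code:03o} {code:d} {code:02X} {character}"


for character in PRINTABLE_ASCII:
    print(ascii_row(character))
```

This only enumerates the candidates; whether each character is safe still has to be decided as in the table of safe characters.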
