Table of contents
=================

I.   The problem of charsets and filenames on UNIX
II.  How LC_FSCTYPE tries to solve the problem
III. Status of current implementation
IV.  Pitfalls & possible improvements
V.   Configuration suggestions
VI.  Contact info


I. The problem of charsets and filenames on UNIX
================================================

UNIX does not impose any interpretation on filenames except that ASCII '/'
is used as the directory separator. As a result, any charset can be used to
encode filenames as long as the ASCII '/' byte (0x2F) does not appear in its
encoding of any other character.

As long as one sticks to one such charset to encode all filenames,
everything works. However, depending on the language used, it's not uncommon
to use more than one charset on a single system, and the spreading adoption
of Unicode aggravates the situation. An increasing number of applications
are standardizing on Unicode, yet traditional charsets are far from
disappearing.

When files are shared among applications using different charsets, problems
arise. As virtually all charsets used to encode filenames are ASCII
compatible, filenames consisting solely of ASCII characters can be listed,
specified and created regardless of the charset in use, but filenames with
charset-dependent characters don't make any sense in charsets other than the
original one. Applications using different charsets wouldn't be able to
list, specify or create such files.

Let's say a user uses ko_KR.UTF-8 as her primary locale. As most Korean
Internet services are built upon the traditional EUC-KR charset, she would
also have to use EUC-KR quite often. For example, virtually all Korean FTP
servers use EUC-KR to encode filenames containing Korean characters, so
whenever she downloads a file from such an FTP server, she creates a file
with a filename which is meaningless in her primary locale.
Or, as most console based services and text files are also encoded in
EUC-KR, it's highly probable that she uses an EUC-KR console emulator as her
main console; then she wouldn't be able to handle files whose names contain
Korean characters from her console.

It's evident that filenames should be translated back and forth between the
charsets used by filesystems and applications.


II. How LC_FSCTYPE tries to solve the problem
=============================================

This problem can be handled in several places. Applications can perform
explicit charset translation on filenames, but such an approach is
inefficient and inevitably brings differing semantics and confusion.

Another place would be inside the operating system kernel. Each application
informs the kernel which charset it's willing to use for filenames and the
kernel translates accordingly from and to the charsets used by filesystems.
An in-kernel implementation would be the most efficient and transparent, but
incorporating full charset translation and the subtleties of filename
translation into the kernel isn't really a good idea (Linus won't like it!),
especially when the C library already has all the needed features and
handles all locale related tasks.

LC_FSCTYPE tries to solve the problem in the C library using the usual
locale framework. LC_FSCTYPE is designed with the following goals in mind.

 - Integration with the rest of locale handling
 - Non-intrusive and intuitive usage
 - Consistent behavior
 - Ability to handle most files even when filenames contain characters
   outside the current charset
 - Modest performance penalty

A pseudo locale variable LC_FSCTYPE is defined which designates the charset
used by filenames. The syntax of LC_FSCTYPE is identical to that of other
locale variables, but only the charset part of the locale is used. The
language and territory parts are ignored. If the charsets specified by
LC_CTYPE and LC_FSCTYPE are different, the C library translates all
filenames passed in C library calls back and forth appropriately.
Characters which cannot be represented in a charset are byte encoded so
that, even though the user might not be able to recognize the characters, he
or she can still handle such files through their byte encoded filenames.

C library functions which deal with filenames can be classified into several
types depending on how filenames are used. Each type requires different
translation rules. Terminology and common rules are described first;
descriptions of each type and its respective rules follow.


Terminology and Common rules
----------------------------

1. Terminology

   filename : a filename as we know it (such as /usr/bin/bash or tmp/test)
   basename : filename without any directory specification (such as bash
              or test)
   dirname  : directory part of a filename (such as /usr/bin/ or tmp/)
   FSCTYPE  : charset used by the filesystem
   CTYPE    : charset used by the application and the user
   XLATED   : filename translated using iconv(3) and byte encoding
   DECODED  : byte decoded filename

2. Byte encoding

   There are several choices when an unconvertible character, which can be
   either an invalid character in the source charset or a character missing
   from the destination charset, is encountered during translation. Easy
   choices would be skipping the offending character, or emitting the
   character or the whole filename verbatim. However, skipping the offending
   character is likely to produce duplicate names, and emitting verbatim may
   mess up the terminal; moreover, neither gives a way to specify the
   offending filename using the converted result.

   So, when an unconvertible character is met, each byte of the character is
   emitted as two hex digits prefixed with '^'. For example, let's assume
   there's a UTF-8 encoded filename containing U+C00D BBWELG and it's to be
   displayed in an EUC-KR terminal. The UTF-8 byte representation of the
   character is "0xEC 0x80 0x8D". Because EUC-KR cannot encode BBWELG, the
   character is byte encoded as "^EC^80^8D".
   So if the original filename was "asdf/{U+C00D}/fdsa", where {U+C00D} is
   the single character BBWELG, the XLATED would be "asdf/^EC^80^8D/fdsa".

   The rationales for choosing '^' prefixed hex encoding are

   - It can be viewed and typed on any ASCII-compatible console.

   - '^' is hardly used in filenames, so it won't cause confusion or
     ambiguity. If there's any better prefix character, or you know of an
     environment where '^' is commonly used in filenames, please don't
     hesitate to tell.

   - More peculiar forms such as "^{0xEC}^{0x80}^{0x8d}" were considered.
     Those forms would cause less confusion and less ambiguity but they take
     up more space. As the lengths of filenames and of each component of a
     filename are limited, taking up too much space didn't seem like a good
     idea.

   There is an issue regarding how many bytes should be byte encoded when an
   unconvertible character is met. The ideal solution seems to be

   - If the character is a valid character in FSCTYPE, all the bytes which
     make up the whole character starting at the unconvertible byte.

   - If the character is not a valid character in FSCTYPE, just one byte.

   The current implementation does not implement these semantics; please
   refer to the 'Status of current implementation' section for details.

3. E2BIG and ENOMEM

   Any C library function which is affected by LC_FSCTYPE translation can
   fail for two additional reasons. The length of a filename changes as it
   is translated from one charset to another, and it's possible that the
   original fits in the length limit but the translated filename doesn't. In
   this case, the function fails with error code E2BIG.

   Also, as translation requires a certain amount of resources, it may fail
   when resource pressure is high. In this case, the function fails with
   error code ENOMEM.

4. DECODED and XLATED

   When XLATED doesn't contain any byte encoded characters, DECODED and
   XLATED are identical. In such cases, the operation is tried only on
   DECODED.
   In the following descriptions of rules, this isn't mentioned explicitly,
   but you can assume that whenever DECODED equals XLATED, only one of them
   is tried.

5. Same translation rule for all components of a filename

   All directory components and the final component of a filename are
   subject to the same translation rule. In no case are some components
   translated and byte encoded while others are only translated; i.e. the
   whole filename is dealt with as a single entity when applying
   translation. The purposes of this restriction are to avoid confusion and
   to keep the number of variants of translated filenames to a handful.


Types & Rules
-------------

TYPE-1. Functions which read filenames from the filesystem and pass them to
        the caller

 The filenames should be translated (including byte encoding if necessary)
 from FSCTYPE to CTYPE and the associated data structures should be adjusted
 accordingly. For example, the d_reclen field of struct dirent must be
 adjusted to reflect the length of the translated filename. This rule
 applies to the following functions.

   int getdents(int fd, struct dirent *dirp, unsigned int count)
   int getdents64(int fd, struct dirent64 *dirp, unsigned int count)
   ssize_t getdirentries(int fd, char *buf, size_t nbytes, off_t *basep)
   ssize_t getdirentries64(int fd, char *buf, size_t nbytes,
                           off64_t *basep)
   struct dirent *readdir(DIR *dir)
   struct dirent *readdir64(DIR *dir)
   int readlink(const char *file, char *buf, size_t bufsiz)
   char *getcwd(char *buf, size_t size)

 readlink() and getcwd() return a full filename which may contain multiple
 directory components. One moot point is whether it is required to verify
 that the length of each component is equal to or less than NAME_MAX after
 translation. The current implementation checks neither the PATH_MAX nor the
 NAME_MAX limit.

TYPE-2. Functions which take one filename and fail without side effects
        when the file does not exist

 As these functions fail without side effects when the specified file does
 not exist, we can safely try the operation multiple times until we succeed
 or there's nothing left to try. Three versions of a filename are tried in
 the following order.

 step-1. try the operation on DECODED. if it fails with ENOENT, continue to
         the next step; otherwise, pass the result to the caller.

 step-2. try the operation on XLATED. if it fails with ENOENT, continue to
         the next step; otherwise, pass the result to the caller.

 step-3. try the operation on the original filename and pass the result to
         the caller.

 For example, consider a UTF-8 encoded filename "asdf/{U+C00D}{U+BA4D}". As
 before, brace-enclosed values are not verbatim - "{U+....}" stands for the
 UTF-8 encoded byte sequence representing the Unicode character and
 "{0x.. 0x..}" means a two byte sequence given in hexadecimal. In the
 filename, {U+C00D} is BBWELG, which cannot be represented in EUC-KR, and
 {U+BA4D} is MEONG, which is {0xB8 0xDB} in EUC-KR. When a user in EUC-KR
 CTYPE calls unlink(2) for this file and the directory "asdf" exists but the
 file does not exist, the following happens.

 a. The user calls unlink on "asdf/^EC^80^8D{0xB8 0xDB}"

    filename in hex: 61(a) 73(s) 64(d) 66(f) 2f(/) 5e(^) 45(E) 43(C) 5e(^)
                     38(8) 30(0) 5e(^) 38(8) 44(D) b8 db

 b. unlink is tried on "asdf/{U+C00D}{U+BA4D}" (DECODED) and fails with
    ENOENT

    filename in hex: 61(a) 73(s) 64(d) 66(f) 2f(/) ec 80 8d(U+C00D)
                     eb a9 8d(U+BA4D)

 c. unlink is tried on "asdf/^EC^80^8D{U+BA4D}" (XLATED) and fails with
    ENOENT

    filename in hex: 61(a) 73(s) 64(d) 66(f) 2f(/) 5e(^) 45(E) 43(C) 5e(^)
                     38(8) 30(0) 5e(^) 38(8) 44(D) eb a9 8d(U+BA4D)

 d. unlink is tried on "asdf/^EC^80^8D{0xB8 0xDB}" (original) and fails with
    ENOENT

    filename in hex: 61(a) 73(s) 64(d) 66(f) 2f(/) 5e(^) 45(E) 43(C) 5e(^)
                     38(8) 30(0) 5e(^) 38(8) 44(D) b8 db

 e. Because all attempts failed, unlink(2) returns -1 with the last errno,
    ENOENT.
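 The TYPE-2 retry order can be sketched in C as follows. This is only an
 illustration under stated assumptions, not the code in locale/fsctype.c:
 try_type2(), path_op_fn, fake_op(), existing_name and op_calls are
 hypothetical names, and the caller is assumed to have already produced the
 DECODED, XLATED and original spellings, in that order.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: performs one operation on one spelling of a
   filename and returns 0 on success or -1 with errno set, like unlink(2). */
typedef int (*path_op_fn)(const char *path);

/* Try the candidate spellings in order (DECODED, XLATED, original).  Move
   on to the next candidate only when the operation fails with ENOENT; any
   other outcome is passed straight to the caller.  A spelling identical to
   the previous one is skipped, which covers the common rule that DECODED
   and XLATED are tried only once when they are equal. */
int
try_type2(path_op_fn op, const char *const names[], size_t n)
{
    int ret = -1;

    errno = ENOENT;             /* result if there is nothing to try */
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && strcmp(names[i], names[i - 1]) == 0)
            continue;
        ret = op(names[i]);
        if (ret == 0 || errno != ENOENT)
            return ret;
    }
    return ret;                 /* every attempt failed with ENOENT */
}

/* Hypothetical stand-in for a real call such as unlink(2): pretends that
   only the file named by existing_name exists and counts its invocations. */
const char *existing_name;
int op_calls;

int
fake_op(const char *path)
{
    op_calls++;
    if (existing_name != NULL && strcmp(path, existing_name) == 0)
        return 0;
    errno = ENOENT;
    return -1;
}
```

 With the three spellings from the example above and no matching file,
 fake_op() is called three times and the caller sees -1 with errno set to
 ENOENT, as in steps b to e. Passing a real function such as unlink in place
 of fake_op would work the same way, since it has the same signature.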

 Functions which follow this rule are

   int open(const char *file, int oflags, ...)    /* without O_CREAT */
   int open64(const char *file, int oflags, ...)  /* without O_CREAT */
   int unlink(const char *file)
   int stat(const char *file, struct stat *buf)
   int stat64(const char *file, struct stat64 *buf)
   int lstat(const char *file, struct stat *buf)
   int lstat64(const char *file, struct stat64 *buf)
   int setxattr(const char *file, const char *name, const void *value,
                size_t size, int flags)
   int lsetxattr(const char *file, const char *name, const void *value,
                 size_t size, int flags)
   int getxattr(const char *file, const char *name, void *value,
                size_t size)
   int lgetxattr(const char *file, const char *name, void *value,
                 size_t size)
   ssize_t listxattr(const char *file, char *list, size_t size)
   ssize_t llistxattr(const char *file, char *list, size_t size)
   int removexattr(const char *file, const char *name)
   int lremovexattr(const char *file, const char *name)
   int mkdir(const char *file, mode_t mode)
   int chdir(const char *file)
   int rmdir(const char *file)
   int chown(const char *file, uid_t owner, gid_t group)
   int lchown(const char *file, uid_t owner, gid_t group)
   int chmod(const char *file, mode_t mode)
   int access(const char *file, int mode)
   int utime(const char *file, const struct utimbuf *buf)
   int utimes(char *filename, struct timeval *tvp)
   int lutimes(char *filename, struct timeval *tvp)
   int truncate(const char *file, off_t length)
   int truncate64(const char *file, off64_t length)

TYPE-3. Functions which take one filename and create the file if it does
        not exist

 The following functions are in this category.

   int open(const char *file, int oflags, ...)    /* with O_CREAT */
   int open64(const char *file, int oflags, ...)  /* with O_CREAT */
   int creat(const char *file, mode_t mode)
   int creat64(const char *file, mode_t mode)

 As these functions create the file if it does not exist, we need to
 determine which filename to use before actually trying the operation.
 access(2) is used to determine which filename to use.

 step-1.  if the filename contains no '/', go to step-a1; otherwise, go to
          step-b1.

 step-a1. perform access(2) on DECODED. if it succeeds, use it. if it fails
          with ENOENT, continue to the next step; otherwise, fail with the
          error code.

 step-a2. perform access(2) on XLATED. if it succeeds, use it. if it fails
          with ENOENT, continue to the next step; otherwise, fail with the
          error code.

 step-a3. perform access(2) on the original filename. if it succeeds, use
          it; otherwise, use DECODED.

 step-b1. perform access(2) on the dirname of DECODED. if it succeeds, use
          DECODED. if it fails with ENOENT, continue to the next step;
          otherwise, fail with the error code.

 step-b2. perform access(2) on the dirname of XLATED. if it succeeds, use
          XLATED. if it fails with ENOENT, continue to the next step;
          otherwise, fail with the error code.

 step-b3. perform access(2) on the dirname of the original filename. if it
          succeeds, use the original; otherwise, fail with the error code.

 When the filename does not contain any directory component, the above rules
 guarantee that if any version of the filename already exists, it will be
 used; otherwise, DECODED is used. This prevents creating ambiguous files
 accidentally. Assume there exists a file named "^EC^80^8D{U+BA4D}" in
 UTF-8, and the user executes the following command in an EUC-KR terminal.

   cat > ^EC^80^8D{0xB8 0xDB}

 It's highly unlikely that the user wanted to create a new file named
 "{U+C00D}{U+BA4D}" instead of writing to the existing file
 "^EC^80^8D{U+BA4D}", so LC_FSCTYPE looks for an existing file which matches
 any translation of the supplied filename before creating a new file.

 When the filename contains directory components, the above rules match
 dirnames in the usual order (DECODED, XLATED, then original) and, when a
 matching directory is found, use the corresponding filename.
 This observes the "Same translation rule for all components of a filename"
 rule described previously, and the semantics should be clear in most cases.

TYPE-4. Functions which take two filenames

 Two functions belong to this category.

   int link(const char *oldfile, const char *newfile)
   int rename(const char *oldfile, const char *newfile)

 As the above functions fail without side effects when oldfile does not
 exist and create a new file (link) named newfile on success, we can combine
 the rules from TYPE-2 and TYPE-3 to handle oldfile and newfile
 respectively.

 step-1. apply TYPE-3 rules on newfile. if this succeeds, proceed to the
         next step; otherwise, fail with the error code.

 step-2. apply TYPE-2 rules on oldfile and pass the result to the caller.

TYPE-5. symlink

   int symlink(const char *oldfile, const char *newfile)

 symlink(2) differs from TYPE-4 functions in that it succeeds regardless of
 the actual existence or accessibility of oldfile. TYPE-3 rules are applied
 to newfile and modified TYPE-3 rules are used for oldfile.

 step-1. apply TYPE-3 rules on newfile. if this succeeds, proceed to the
         next step; otherwise, fail with the error code.

 step-2. perform access(2) on DECODED of oldfile. if it succeeds, use it;
         otherwise, continue to the next step.

 step-3. perform access(2) on XLATED of oldfile. if it succeeds, use it;
         otherwise, continue to the next step.

 step-4. perform access(2) on the original oldfile. if it succeeds, use it;
         otherwise, use DECODED.

 The above rules guarantee that if there's a file whose filename matches a
 version of oldfile, it's used, and when none matches, the default DECODED
 is used.


III. Status of current implementation
=====================================

1. Implemented OS/Architecture

   Currently, LC_FSCTYPE is implemented only for Linux x86 w/ linuxthreads.
   As LC_FSCTYPE translation is compiled in only when --enable-fsctype-xlate
   is specified when configuring, libc can still be compiled for other
   environments without LC_FSCTYPE translation.
   Please note that even when --enable-fsctype-xlate is not specified,
   support for the pseudo locale LC_FSCTYPE is compiled in. Only the actual
   translation part is left out.

2. Character boundary when byte encoding

   When an unconvertible character is met, the current implementation
   repeatedly byte encodes a single byte and restarts iconv. A correct
   implementation will require internal information from iconv to determine
   how many bytes make up the unconvertible character. However, as long as
   UTF-8 is used for FSCTYPE and CTYPE is a subset of UTF-8, the current
   implementation shows the same result because all characters are
   expressible in UTF-8 and character boundaries in UTF-8 are
   self-synchronizing.

3. Locale/conv caching

   To reduce the overhead, locale data and conversion resources, including
   iconv descriptors and buffers, are cached, so as long as the same CTYPE
   and LC_FSCTYPE are used, resource allocation and free overheads should be
   negligible. Currently, the maximum cache size is 15 for each way
   (CTYPE -> FSCTYPE and FSCTYPE -> CTYPE).

4. Cancellation handling

   Because fsctype conversion requires resource allocation, for functions
   which are not cancellation points, fsctype sets deferred cancellation
   mode until the operation has finished and the allocated resources are
   freed. For cancellation points, fsctype registers cleanup routines before
   invoking the actual operations.

5. fsctype in rtld

   Currently, LC_FSCTYPE support is not compiled in when building rtld. The
   whole iconv machinery and a lot of other things would need to be
   statically linked into rtld to support LC_FSCTYPE. What should be done?

6. libattr

   libattr directly calls syscall(2) instead of invoking the individual
   xattr functions (getxattr...). This causes libattr operations to bypass
   fsctype translation, resulting in ENOENT errors for translated files. As
   the ls command uses libattr, 'ls -l' will print an ENOENT error message
   for every translated file with an LC_FSCTYPE enabled libc. I think this
   should be fixed on the libattr side.


IV. Pitfalls & possible improvements
====================================

1. Characters outside FSCTYPE in a user-supplied filename

   If a character which cannot be represented in FSCTYPE exists in a
   user-supplied filename, the character will be byte encoded and then
   decoded again as it's translated into DECODED form. The situation gets
   worse if the original filename contains verbatim byte encoded characters.

2. Are the rules OK?

   I don't know. I've spent quite some time on the rules but I'm still not
   really sure. Fortunately, changing the rules is relatively easy in the
   current implementation, so if you have any better idea, please don't
   hesitate to tell me. The function you want to look at is
   __nl_fsctype_next_file() in locale/fsctype.c.

3. Is ENAMETOOLONG better than E2BIG?

   Currently, all translated functions fail with E2BIG when the filename is
   too long. Is ENAMETOOLONG more proper?

4. Preiniting

   Currently, LC_FSCTYPE translation is enabled when the locale is
   initialized (w/ setlocale(3) or newlocale(3)/uselocale(3)). I think it
   would be helpful to enable LC_FSCTYPE translation on libc initialization
   regardless of other locale stuff. Maybe supply some way to specify a
   preinit option? (LC_FSCTYPE=ko_KR.UTF-8,preinit ?)

5. Renaming filenames from one charset to another

   It would be nice if rename(1) could translate filenames from one charset
   to another.

6. The following functions are intentionally excluded from translation.

   int mount(const char *source, const char *target,
             const char *filesystemtype, unsigned long mountflags,
             const void *data)
   int umount(const char *target)
   int chroot(const char *path)
   int statfs(const char *path, struct statfs *buf);

   Should these functions do fsctype translation too?

7. ASCII CTYPE

   Currently, if either FSCTYPE or CTYPE is ASCII, fsctype translation is
   disabled. I think it would be helpful to perform just byte
   encoding/decoding when FSCTYPE != ASCII and CTYPE == ASCII.


V. Configuration suggestions
============================

Use UTF-8 exclusively for all filesystems and make the environment variable
LC_FSCTYPE=whatever.UTF-8 global to every session. I believe that's how
every system should be configured in the future.


VI. Contact info
================

If you have any questions or suggestions, please contact me at where _AT_ is
'@' and _DOT_ is '.' :-)


 LocalWords: charsets LC FSCTYPE charset ko UTF EUC filesystems CTYPE tmp EC
 LocalWords: basename dirname filesystem XLATED iconv BBWELG asdf fdsa URL int
 LocalWords: ENOMEM reclen struct dirent getdents fd dirp ssize getdirentries
 LocalWords: buf nbytes basep readdir DIR dir readlink const bufsiz getcwd bb
 LocalWords: ENOENT hexa MEONG db ec eb errno oflags CREAT stat lstat setxattr
 LocalWords: lsetxattr getxattr lgetxattr listxattr llistxattr removexattr uid
 LocalWords: lremovexattr mkdir chdir rmdir chown gid lchown chmod utime tvp
 LocalWords: utimbuf utimes timeval lutimes creat dirnames oldfile newfile nl
 LocalWords: symlink linuxthreads fsctype xlate libc conv rtld libattr syscall
 LocalWords: xattr ENAMETOOLONG Preiniting setlocale newlocale uselocale tj
 LocalWords: preinit filesystemtype mountflags umount chroot statfs org