More Notes on Filesystem and Charset Portability

Tue May 18 15:39:40 EDT 2021

Tags: java

Nov 07 2018 - Java Hiccups
Apr 06 2019 - Bitwise Operators
May 03 2019 - Java Grab Bag 2
Feb 14 2021 - Java Travelogue: The Care and Feeding of Locales
May 18 2021 - More Notes on Filesystem and Charset Portability

A few months back, I talked about some localization troubles in the NSF ODP Tooling and how it's important to be explicit in your handling of this sort of thing to make sure your code will work in an environment that isn't specifically "Linux or macOS in an en-US environment".

Well, after making a bunch of little tweaks over the last few days, I have two additional tips in this arena! Specifically, my foes this round came from three sources: Windows, my use of a ZIP file filesystem, and the old reliable charset.

Path Separators

The first bit of trouble had to do with how the first two of those, Windows and the ZIP filesystem, interact. For a long time, I've been in the (commonly-held) habit of using File.separator and File.separatorChar to get the default path separator for the system - that is, \ on Windows and / on most other platforms. Those work well enough when you're only dealing with the local filesystem - no real trouble there.

However, my problem came from using the Java NIO ZIP filesystem on Windows. Take this bit of code:

public static String toJavaClassName(Path path) {
	String name = path.toString();
	if(name.endsWith(".java")) {
		return name.substring(0, name.length()-".java".length()).replace(File.separatorChar, '.');
	}
	/* Other conditions here */
}

When path is a path on the local filesystem, that works just fine, taking a path like "com/example/Foo.java" and turning it into "com.example.Foo". It also works splendidly in all cases on macOS and Linux, the two systems I actually use. However, when path represents a path within a ZIP file and you're working on Windows, it fails, returning a "class name" like "com/example/Foo".

This is exactly what happens when compiling an ODP using a remote Domino server running on Windows. For the portability reasons mentioned in my previous post, the client sends a ZIP of the ODP to the server and then the compilation pulls directly out of that ZIP instead of writing it out to the filesystem. The way the ZIP filesystem driver in Java is written, it uses / for its path separator on all platforms, which is consistent with dealing with ZIP files generally. But, when mixed with the native filesystem separator, that line resolved to:

return "com/example/Foo".replace('\\', '.');

...and there's the problem. The fix is to change the code to instead get the directory separator from the contextual filesystem in question:

public static String toJavaClassName(Path path) {
	String name = path.toString();
	if(name.endsWith(".java")) {
		return name.substring(0, name.length()-".java".length()).replace(path.getFileSystem().getSeparator(), ".");
	}
	/* Other conditions here */
}

A little more verbose, sure, but it has the advantage of functioning consistently in all environments.
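
To see the difference in isolation, here's a minimal, self-contained sketch. The class name ZipSeparatorDemo and the throwaway temp ZIP are just assumptions for illustration, not code from NSF ODP Tooling; the conversion method is the same idea as above, with a fallback return added so it compiles on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

// Hypothetical demo class, not part of NSF ODP Tooling
class ZipSeparatorDemo {
	// Asks the path's own filesystem for its separator instead of using File.separatorChar
	static String toJavaClassName(Path path) {
		String name = path.toString();
		if(name.endsWith(".java")) {
			String sep = path.getFileSystem().getSeparator();
			return name.substring(0, name.length()-".java".length()).replace(sep, ".");
		}
		return name; // placeholder for the other conditions
	}

	// Creates a throwaway ZIP filesystem and converts a path that lives inside it
	static String classNameInsideZip() {
		try {
			Path dir = Files.createTempDirectory("zipdemo");
			Path zip = dir.resolve("demo.zip");
			URI uri = URI.create("jar:" + zip.toUri());
			try(FileSystem zipFs = FileSystems.newFileSystem(uri, Collections.singletonMap("create", "true"))) {
				return toJavaClassName(zipFs.getPath("com", "example", "Foo.java"));
			} finally {
				// Clean up whatever the ZIP provider wrote on close
				Files.deleteIfExists(zip);
				Files.deleteIfExists(dir);
			}
		} catch(IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	public static void main(String[] args) {
		// Local filesystem: separator is \ on Windows and / elsewhere
		System.out.println(toJavaClassName(Paths.get("com", "example", "Foo.java")));
		// ZIP filesystem: separator is / regardless of the host OS
		System.out.println(classNameInsideZip());
	}
}
```

Both lines should print com.example.Foo on any platform, since each path is converted using the separator of the filesystem it belongs to.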

This also has significant implications if you use static properties to store filesystem-dependent elements. This came into play in my OnDiskProject class, which contains a bunch of path matchers to find design elements to import from the ODP. Originally, I kept these in a static property that was generated by writing them Unix-style, then running them through a generator to use the platform-native separator character. This had to change, since the actual ODP store may or may not be the platform-native filesystem. This sort of thing is pervasive, and it'll take me a bit to get over my long-standing habit.
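
As a rough sketch of what "not static" can look like, the matchers can be built from whichever FileSystem the ODP actually lives on. Everything named here (OdpMatchers, toGlob, the "Code/Java/**.java" pattern) is hypothetical and only meant to illustrate the shape, not the real OnDiskProject code. One wrinkle worth noting: backslash is the escape character in glob syntax, so a \ separator has to be doubled when it's spliced into a pattern.

```java
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

// Hypothetical holder: matchers are built per filesystem rather than stored statically
class OdpMatchers {
	private final PathMatcher javaSources;

	OdpMatchers(FileSystem fs) {
		// Illustrative pattern, written Unix-style and adapted to the given filesystem
		this.javaSources = fs.getPathMatcher(toGlob(fs, "Code/Java/**.java"));
	}

	// Converts a Unix-style pattern into a glob using the filesystem's own separator
	static String toGlob(FileSystem fs, String unixStyle) {
		String sep = fs.getSeparator();
		// Backslash is the glob escape character, so a literal one must be doubled
		String escaped = sep.replace("\\", "\\\\");
		return "glob:" + unixStyle.replace("/", escaped);
	}

	boolean isJavaSource(Path path) {
		return javaSources.matches(path);
	}

	public static void main(String[] args) {
		// The default filesystem is used here; a ZIP FileSystem could be passed the same way
		FileSystem fs = FileSystems.getDefault();
		OdpMatchers matchers = new OdpMatchers(fs);
		System.out.println(matchers.isJavaSource(fs.getPath("Code", "Java", "com", "example", "Foo.java")));
		System.out.println(matchers.isJavaSource(fs.getPath("Forms", "Foo.form")));
	}
}
```

The important part is that paths are only ever tested against a matcher created from the same filesystem they came from.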

Over-Interpreting Character Sets

This one is similar to the charset troubles in my previous post, but this time it caused subtle trouble in the ODP compiler. Here was the sequence of events:

The ODP Compiler reads the XSP source of a page or custom control using ODPUtil, which reads in the string as UTF-8
It then passes that string to the Bazaar's DynamicXPageBean
That method uses StringReader and an IBM Commons ReaderInputStream to read the content
That content is then read in by FacesReader, which uses the default DOM parser to read the XML

In general, that flow worked just fine, but only because I generally write US-ASCII markup. When the page contains, say, Czech diacritics, this goes off the rails. Somewhere in the interpretation and re-interpretation of the file, the UTF-8-iness of it breaks.
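
I didn't pin down exactly which hop did the damage, so the following is only an illustration of the general failure mode, not a reproduction of the actual bug: if any step encodes a string with one charset and a later step decodes the bytes with another, ASCII survives untouched while anything beyond it gets mangled. The class name CharsetMismatchDemo and the choice of ISO-8859-1 as the "wrong" charset are assumptions for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of a charset mismatch between two processing steps
class CharsetMismatchDemo {
	// Encodes text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1
	static String misdecode(String text) {
		byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
		return new String(utf8, StandardCharsets.ISO_8859_1);
	}

	public static void main(String[] args) {
		// Unicode escapes keep this source file itself independent of compiler encoding
		String czech = "k\u016f\u0148"; // "horse" in Czech, with two diacritics
		String ascii = "horse";

		// ASCII round-trips cleanly, which is why the problem hides for so long
		System.out.println(ascii.equals(misdecode(ascii)));
		// Each two-byte UTF-8 sequence turns into two unrelated characters
		System.out.println(czech.length() + " chars in, " + misdecode(czech).length() + " chars out");
	}
}
```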

Fortunately, this one was a clean one: XML has its own mechanism for declaring its encoding (and it's almost always UTF-8 anyway), so my code doesn't actually need to be responsible for interpreting the bytes of the file before it gets to the DOM parser. So I added a version of the Bazaar method that takes an InputStream directly and modified NSF ODP to use it, with no extra interpretation in between.
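
The idea of handing the bytes straight to the parser can be sketched with the standard JAXP DOM API alone. This is not the Bazaar or NSF ODP code; XmlStreamDemo and its methods are made-up names for the example. The parser works out the encoding from the byte order mark and the XML declaration, so the same method handles a UTF-8 document and a UTF-16 one without my code ever choosing a charset for decoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo: let the XML parser interpret the bytes itself
class XmlStreamDemo {
	// Parses raw bytes; no Reader or String conversion happens on our side
	static String rootText(InputStream is) {
		try {
			DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
			Document doc = factory.newDocumentBuilder().parse(is);
			return doc.getDocumentElement().getTextContent();
		} catch(ParserConfigurationException | SAXException | IOException e) {
			throw new IllegalStateException(e);
		}
	}

	// Builds a tiny document declaring the given charset, encodes it, and parses it back
	static String roundTrip(String text, Charset charset) {
		String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
			+ "<label>" + text + "</label>";
		return rootText(new ByteArrayInputStream(xml.getBytes(charset)));
	}

	public static void main(String[] args) {
		// Czech pangram fragment, written with escapes to avoid source-encoding issues
		String czech = "P\u0159\u00edli\u0161 \u017elu\u0165ou\u010dk\u00fd k\u016f\u0148";
		System.out.println(czech.equals(roundTrip(czech, StandardCharsets.UTF_8)));
		System.out.println(czech.equals(roundTrip(czech, StandardCharsets.UTF_16)));
	}
}
```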

Ben Langhinrichs - Tue May 18 16:05:07 EDT 2021

I am curious why you bother with getting the "appropriate" separator in this instance. While I understand your logic, why not simply replace both slashes, as neither can be used for a different purpose in a directory name? There are a lot of funky exceptions, but those are the only two characters used. I love the theory of getting it right, but sometimes the practical thing is to do it wrong but comprehensively and simply.

Jesse Gallagher - Tue May 18 16:26:37 EDT 2021

That's reasonable - it's unlikely that any system this code will run on would use a different character, and the case of e.g. using a \ in a file name in an ODP on Unix isn't likely to come up since it's encoded when exported anyway. It mattered more in the "this had to change" example above, where I was using the path separator to create new path globs, where I had to use specifically the right one.

Ben Langhinrichs - Tue May 18 17:57:49 EDT 2021

That matches up with most of my use cases where I need to generate the correct slash for the OS or situation. I have found it better to switch from either to the correct one. Since content is sometimes generated in Windows and sometimes elsewhere, it is not uncommon to find paths with the "wrong" character, so I simply make sure both right and wrong go to the right one for the OS.
