Java’s HttpURLConnection is broken!


Continuing from the last post, I was experimenting on some prototypes for attaching documents (PDF, CSV, XML etc.) using the JavaMail 1.4 API. The basic modus operandi of the whole process is quite simple:

1. Obtain the URL of the document as served out by Tomcat server.

2. Use this URL as the DataSource for the JavaMail API.

3. Munge the URL for the file name and insert the URL as a link in the message body and, finally

4. Attach the document (if below the user-configured attachment limit) to the mail and e-mail it to the recipients.

All in all pretty simple, right? Wrong. The code worked smoothly until the document was UTF-8 encoded and/or the document name contained whitespace as well as a smorgasbord of other characters (as allowed by the Windows file system) such as the tilde character (~), the backquote character (`) or even something such as the addition symbol (+). This caused the whole mechanism to come crashing down…and badly! This interesting development led me to a whole afternoon of researching and hacking which was quite tiring but enjoyable. I can honestly say that I learnt more about encoding and the way Java’s APIs conform (or not) to the standards, through this simple error, than I could have by devouring the RFCs and API documentation!

So, first off, the problem description –

1. The URI specification (which is basically a superset of the URL specification) is drafted in a variety of RFCs: RFC 2396, 3986, 5785 et al. However, most modern browsers conform to the newer IRI (Internationalized Resource Identifier) specification, RFC 3987. This means that a whole variety of previously disallowed characters are now kosher for URIs/URLs/IRIs. However, Java has failed to keep up with the trend and its IRI support is broken. This opens a whole can of worms.

2. In my code, the main “context path” of the URL that my module receives for a specified document is basically localized/internationalized. For instance, consider a URL of the form:

http://localhost:9999/examples/servlets/`~!@^()_-+={[ }] ‘,. test 1 サンプルファイル 999.CSV.PDF

In the URL above, the context portion is “http://localhost:9999/examples/servlets/”. So I just had to resolve the encoding problems with the file name portion (i.e., “`~!@^()_-+={[ }] ‘,. test 1 サンプルファイル 999.CSV.PDF”). In this respect, a variety of tools are purportedly provided by the JDK suite itself – URLEncoder, URI and the URL class. For instance, the following were the options that I tried, in order:

a). Use the java.net.URLEncoder class’ encode(String s, String enc) method to try and encode the complete URL. Bad idea. For the sample URL above, it returns a wholly encoded form as follows:

http%3A%2F%2Flocalhost%3A9999%2Fexamples%2Fservlets%2F%60%7E%21%40%5E%28%29_-%2B%3D%7B%5B+++%7D%5D+%27%2C.+test+1+%E3%82%B5%E3%83%B3%E3%83%97%E3%83%AB%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB++++++999.CSV.PDF

Whoa! This cannot be parsed by either the browser or by JavaMail’s URLDataSource handler (or even by JDK’s HttpURLConnection/HttpsURLConnection classes). Trashed.

b). Okay, so now I just try and encode the filename portion (since the context path is guaranteed by contract to be properly encoded). The output?

http://localhost:9999/examples/servlets/%60%7E%21%40%5E%28%29_-%2B%3D%7B%5B+++%7D%5D+%27%2C.+test+1+%E3%82%B5%E3%83%B3%E3%83%97%E3%83%AB%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB++++++999.CSV.PDF

Hmmm. Slightly better, but still completely non-functional. At this point, I decided to dig into the JDK source code (using the OpenJDK Java 7 codebase). There was the problem! URLEncoder implements the application/x-www-form-urlencoded format, so it encodes each whitespace character as ‘+’ instead of ‘%20’ as I expected. Plus a few other characters (such as ‘_’, ‘-’ and ‘.’ in the output above) were being left alone when I wanted them encoded. At this point I had almost made up my mind to implement my own UTF-8 encoder.

c). Upon suggestion by an acquaintance, I decided to try the java.net.URI class constructor to generate a proper URI and then invoke the toURL() method to obtain a properly encoded URL. Sounds exquisitely standards-conformant, but this totally bombed when it encountered UTF-8 (non-ASCII) characters (though it handled whitespace just fine). Firefox and Opera seemed to like the result just fine. However, the URL generated was not usable by JavaMail’s URLDataSource handler (or the HttpURLConnection/HttpsURLConnection classes). Plus, on some e-mail clients (Gmail), the links were pretty much broken and the user would have to manually copy-and-paste the link into the browser to access the document.
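The behaviour described in options (a) to (c) can be reproduced with a few lines of plain JDK code. What follows is only a minimal sketch, not the code from my module: the class name EncodingPitfalls, the helper names formEncode and buildUri, and the shortened sample file name are all made up for illustration. The non-ASCII characters are written as Unicode escapes so that the source file compiles regardless of its encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

public class EncodingPitfalls {

    // Options (a) and (b): URLEncoder produces form encoding
    // (application/x-www-form-urlencoded), so a space becomes '+'.
    public static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Option (c): the multi-argument URI constructor quotes illegal
    // characters (such as spaces) in the path, but keeps non-ASCII text.
    public static URI buildUri(String path) {
        try {
            return new URI("http", null, "localhost", 9999, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, shortened file name: "test 1 " + katakana + ".PDF"
        String name = "test 1 \u30B5\u30F3\u30D7\u30EB.PDF";

        // (a) Encoding the whole URL also encodes the scheme and the slashes.
        System.out.println(formEncode("http://localhost:9999/examples/servlets/" + name));

        // (b) Encoding only the file name still turns spaces into '+'.
        System.out.println(formEncode(name));

        // (c) Spaces become %20, but the katakana stays unencoded in toString().
        URI uri = buildUri("/examples/servlets/" + name);
        System.out.println(uri.toString());

        // toASCIIString() additionally percent-encodes the non-ASCII characters.
        System.out.println(uri.toASCIIString());
    }
}
```

Note that URI.toASCIIString() is a standard JDK method, but I did not try it against JavaMail or the e-mail clients mentioned in this post, so treat it as something to experiment with rather than a verified fix.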

The solution?

I implemented a simple UTF-8 encoder which basically percent-encodes the UTF-8 bytes of every non-alphanumeric character, shown below:

package com.z0ltan.mail.encoder;

import java.nio.charset.Charset;

/**
 * Percent-encodes every byte of the UTF-8 form of a string,
 * except for the ASCII letters and digits.
 */
public class MyUTFEncoder {
	private static final String HEX_CHARS = "0123456789ABCDEF";
	private static final Charset UTF_8 = Charset.forName("UTF-8");

	public static String encode(String data) {
		if (data == null) {
			return null;
		}

		StringBuilder buffer = new StringBuilder();

		for (byte b : data.getBytes(UTF_8)) {
			// Mask to 0..255 so that the bytes of multi-byte
			// sequences are not sign-extended.
			int c = b & 0xFF;

			if ((c >= 'a' && c <= 'z')
				|| (c >= 'A' && c <= 'Z')
				|| (c >= '0' && c <= '9')) {
				// ASCII letters and digits pass through unchanged.
				buffer.append((char) c);
			} else {
				// Everything else becomes %XX, high nibble first.
				buffer.append('%');
				buffer.append(HEX_CHARS.charAt(c >> 4));
				buffer.append(HEX_CHARS.charAt(c & 0x0F));
			}
		}

		return buffer.toString();
	}
}

The code is pretty self-explanatory. The URLs generated may not be pretty to look at, but they are equally liked by the browser, Java’s URLConnection APIs, JavaMail’s URLDataSource handler as well as e-mail clients such as MS Outlook and Gmail (no more broken links, yay!).

For the representative URL shown at the beginning of this blog, the output is as follows:

http://localhost:9999/examples/servlets/%60%7E%21%40%5E%28%29%5F%2D%2B%3D%7B%5B%20%20%20%7D%5D%20%27%2C%2E%20test%201%20%E3%82%B5%E3%83%B3%E3%83%97%E3%83%AB%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%20%20%20%20%20%20999%2ECSV%2EPDF

Yes, pretty ugly, but most browsers that I tested it on automatically display it in a more user-friendly format (with no encoding). The best part? This URL works well with the JDK, JavaMail, e-mail clients as well as the browser! Plus, to rid it of some ugliness, I actually insert the attached document’s name without the encoding, such as `~!@^()_-+={[ }] ‘,. test 1 サンプルファイル 999.CSV.PDF. This saves the user some confusion about the veracity of the attached document (Note: the file name may still be encoded into weird forms by the e-mail client if it is not properly configured, but that is beyond the purview of my application).
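To make the "encode only the file name" step concrete, here is a minimal sketch under a few assumptions: the class FileNameEncodingSketch and the helpers percentEncode, encodeFileNameOnly and decode are hypothetical names, percentEncode is a condensed stand-in for the encoder shown earlier, and the split simply assumes that the file name is everything after the last ‘/’. The round trip through java.net.URLDecoder is just a quick sanity check that the encoded name decodes back to the original.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;

public class FileNameEncodingSketch {
    private static final String HEX = "0123456789ABCDEF";
    private static final Charset UTF_8 = Charset.forName("UTF-8");

    // Condensed stand-in for the encoder in this post: every UTF-8 byte
    // that is not an ASCII letter or digit becomes %XX.
    public static String percentEncode(String data) {
        StringBuilder out = new StringBuilder();
        for (byte b : data.getBytes(UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alnum) {
                out.append((char) c);
            } else {
                out.append('%').append(HEX.charAt(c >> 4)).append(HEX.charAt(c & 0x0F));
            }
        }
        return out.toString();
    }

    // Assumption: the context path ends at the last '/' and is already
    // properly encoded, so only the trailing file name is touched.
    public static String encodeFileNameOnly(String url) {
        int slash = url.lastIndexOf('/');
        return url.substring(0, slash + 1) + percentEncode(url.substring(slash + 1));
    }

    // Decodes a percent-encoded string back to text, for checking only.
    public static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, shortened file name: "test 1 " + katakana + ".PDF"
        String name = "test 1 \u30B5\u30F3\u30D7\u30EB.PDF";
        String url = "http://localhost:9999/examples/servlets/" + name;

        String encoded = encodeFileNameOnly(url);
        System.out.println(encoded);

        // The encoded file name should decode back to the original name.
        String encodedName = encoded.substring(encoded.lastIndexOf('/') + 1);
        System.out.println(decode(encodedName).equals(name));
    }
}
```

Because the encoder never emits a literal ‘+’, the form-style decoding done by URLDecoder (which turns ‘+’ into a space) does not get in the way of this check.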

All in all, an afternoon well spent. Slainte!

PostScript

JDK 1.6, JavaMail 1.4, Firefox 8.0.1, IE 7 and Opera 11.52, MS Outlook and Gmail client used for testing.

Now I am poring over the RFCs to gain a wholesome understanding of the whole wonderfully twisted domain of encoding/decoding!
