Prefer UTF-8 in all layers

If an application displays text with strange, unexpected characters, the likely cause is an incorrect character encoding.

Character encodings control how tools translate raw bytes into text. UTF-8 is likely the best default character encoding: it can represent characters from almost all languages, and it does so efficiently. The W3C itself recommends such a practice.
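
To make the translation between bytes and text concrete, here is a minimal Java sketch (the class and method names are made up for illustration). It encodes a string as UTF-8 bytes, then decodes those same bytes twice: once as UTF-8 and once as ISO-8859-1. Only the first round trip gives back the original text; the second produces the kind of strange, unexpected characters described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class EncodingMismatchDemo {

    // Encodes the text as UTF-8 bytes, then decodes those bytes using the given charset.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The word "cafe" with an acute accent on the e, written with a Unicode
        // escape so that the encoding of this source file doesn't matter.
        String original = "caf\u00e9";

        String right = decodeUtf8BytesAs(original, StandardCharsets.UTF_8);
        String wrong = decodeUtf8BytesAs(original, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false

        // The accented letter takes two bytes in UTF-8, and ISO-8859-1 treats
        // each byte as its own character: 4 characters in, 5 characters out.
        System.out.println(original.length() + " -> " + wrong.length()); // 4 -> 5
    }
}
```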

Plain ASCII should usually be avoided, since it's so limited. Likewise, the default encoding used by the Servlet specification is ISO-8859-1, which is restricted to Western European languages.

In a web application, character encodings are used in three separate areas: the browser, the server, and the database. To work together correctly, the same character encoding must be used in each of these areas. See this excellent article by John O'Connor for further discussion.

Browser
The browser uses an encoding to present text to the user, and to send requests (often with parameters) to the server. The request parameter encoding will be the same as the page encoding, unless instructed otherwise. A JSP can instruct the browser on the desired encoding by using a page directive, such as:

<%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>

A META tag may be used instead. Either of these forms works (the first is the shorter HTML5 syntax):

<meta charset="UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

As usual, such a policy should be defined in one place, if possible (for example, in a template page).

Server
In principle, the browser should tell the server which character encoding it used for a request, which is the encoding it previously received from the server. In practice, however, browsers do not do this reliably. So, even though a JSP has indicated the character encoding, it's likely good practice to explicitly set the character encoding of the request, using, for example:

request.setCharacterEncoding("UTF-8");

A Controller could perform this for every incoming request, perhaps using a value configured in web.xml. This method must be called early in processing, before any parameter values are retrieved; otherwise, it has no effect.
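
The effect of decoding request parameters with the wrong encoding can be reproduced outside a servlet container. The sketch below is only a simulation, under the assumption that the browser percent-encodes form values as UTF-8. It uses java.net.URLEncoder and java.net.URLDecoder to stand in for the browser and the container, and its class and method names are hypothetical. Decoding with ISO-8859-1 (the servlet default) garbles the value, while decoding with UTF-8 recovers it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical simulation; a real container decodes parameters internally.
class ParameterDecodingDemo {

    // Stands in for the browser: percent-encodes a form value using UTF-8.
    static String encodeAsBrowser(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Stands in for the server: decodes the raw parameter using the given encoding.
    static String decodeAsServer(String rawValue, String encoding) {
        try {
            return URLDecoder.decode(rawValue, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        // The word "cafe" with an acute accent on the e, as typed into a form.
        String typed = "caf\u00e9";
        String raw = encodeAsBrowser(typed);

        System.out.println(raw);                                             // caf%C3%A9
        System.out.println(decodeAsServer(raw, "UTF-8").equals(typed));      // true
        System.out.println(decodeAsServer(raw, "ISO-8859-1").equals(typed)); // false
    }
}
```

In a servlet, calling request.setCharacterEncoding("UTF-8") before reading any parameter plays roughly the role of the UTF-8 decode in this simulation.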

For reference, the servlet API has these methods for managing character encoding:

ServletRequest.getCharacterEncoding
ServletRequest.setCharacterEncoding
ServletResponse.setLocale
ServletResponse.setContentType

Database
The database has a character encoding as well. Please consult your database documentation for further information.

Stated Encoding Must Match Actual Encoding
It's important to note that such encoding settings don't define the encoding as such, in the sense that they don't define or change how the text is actually represented as bytes. Rather, they act simply as advice to tools, by stating what the encoding is supposed to be. Of course, it's an error to advise a tool that the encoding is X when it's actually Y.

A good example of this is the encoding of JavaServer Pages. If a JSP is saved as ISO-8859-1, but its page directive states that its encoding is UTF-8, then that's a mistake. Such mistakes are very easy to make, since the error is not detected until later, when you see weird characters in your web page. Thus, you need to pay attention to which encoding is in effect when you save such files on your system. Unfortunately, many systems are not set up to use UTF-8 as the default.
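
One way to catch such a mismatch early is to decode a file's bytes strictly as UTF-8 and see whether the decoder complains. The following is a minimal sketch using the standard java.nio.charset API; the class and method names are hypothetical. Note the limits of the check: bytes that fail cannot be UTF-8, but bytes that pass are not guaranteed to have been saved as UTF-8 (pure ASCII text, for example, is valid in both UTF-8 and ISO-8859-1).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
class Utf8Check {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw on bad input, instead of silently
        // substituting a replacement character.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The word "cafe" with an acute accent on the e.
        String text = "caf\u00e9";
        byte[] savedAsUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] savedAsLatin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(savedAsUtf8));   // true
        System.out.println(isValidUtf8(savedAsLatin1)); // false
    }
}
```

For a real file, the bytes could be obtained with java.nio.file.Files.readAllBytes and passed to the same method.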