There are two fundamentally different ways of comparing strings: simple Unicode ordering, using String methods such as String.equalsIgnoreCase(String) and String.compareTo(String), and localized ordering, using a Collator.
The fundamental difference is that localized comparison depends on Locale, while String is largely ignorant of Locale.
Here is a quote from The Java Programming Language by Arnold, Gosling, and Holmes:
"You should be aware that internationalization and localization issues of full Unicode strings are not addressed with [String] methods. For example, when you're comparing two strings to determine which is 'greater', characters in strings are compared numerically by their Unicode values, not by their localized notion of order."
The only robust way of doing localized comparison or sorting of Strings, in the manner expected by an end user, is to use a Collator, not the methods of the String class.
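To make the difference concrete, here is a minimal sketch contrasting the two approaches. The class name UnicodeVersusCollator, its helper methods, and the sample words are illustrative assumptions, not part of the original examples. Accented characters are written as Unicode escapes so that the source compiles the same way under any file encoding.

```java
import java.text.Collator;
import java.util.Locale;

/** Illustrative sketch: String.compareTo versus a locale-aware Collator. */
public final class UnicodeVersusCollator {

  /** Compares by UTF-16 code unit values, with no knowledge of any Locale. */
  static int unicodeCompare(String a, String b) {
    return a.compareTo(b);
  }

  /** Compares using the collation rules of the given Locale. */
  static int localizedCompare(String a, String b, Locale locale) {
    return Collator.getInstance(locale).compare(a, b);
  }

  public static void main(String... args) {
    // 'Z' (U+005A) has a smaller value than 'a' (U+0061),
    // so compareTo places "Zebra" before "apple".
    System.out.println(unicodeCompare("Zebra", "apple") < 0);
    // A Collator places "apple" before "Zebra", as an end user would expect.
    System.out.println(localizedCompare("apple", "Zebra", Locale.ENGLISH) < 0);
    // U+00E9 (e with acute accent) has a larger value than 'z' (U+007A),
    // so compareTo places the accented word after "zoo".
    System.out.println(unicodeCompare("\u00e9cole", "zoo") > 0);
    // A French Collator places it before "zoo".
    System.out.println(localizedCompare("\u00e9cole", "zoo", Locale.FRANCE) < 0);
  }
}
```

Each line printed by main is true. The first and third results are the ones that tend to surprise end users.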
Example 1 - Unicode Ordering
Here's an example of simple Unicode ordering of Strings. Note the use of String.CASE_INSENSITIVE_ORDER, an implementation of Comparator.
Reminder - the following items are important with any form of comparison or sorting:
- the Comparator and Comparable interfaces
- the sort methods of Collections and Arrays
import java.util.*;

/** Sorting Strings in Unicode order. */
public final class SortStringsNoLocale {

  public static void main(String... args){
    List<String> insects = Arrays.asList("Wasp", "ant", "", "Bee");
    log("Original:");
    log(insects);
    log("Sorted:");
    sortList(insects);
    log(insects);
    log("");

    Map<String,String> capitals = new LinkedHashMap<>();
    capitals.put("finland", "Helsinki");
    capitals.put("United States", "Washington");
    capitals.put("Mongolia", "Ulan Bator");
    capitals.put("Canada", "Ottawa");
    log("Original:");
    log(capitals);
    log("Sorted:");
    log(sortMapByKey(capitals));
  }

  private static void sortList(List<String> items){
    Collections.sort(items, String.CASE_INSENSITIVE_ORDER);
  }

  private static void log(Object thing){
    System.out.println(Objects.toString(thing));
  }

  private static Map<String, String> sortMapByKey(Map<String, String> items){
    TreeMap<String, String> result =
      new TreeMap<>(String.CASE_INSENSITIVE_ORDER)
    ;
    result.putAll(items);
    return result;
  }
}

The class outputs the following:
Original:
[Wasp, ant, , Bee]
Sorted:
[, ant, Bee, Wasp]

Original:
{finland=Helsinki, United States=Washington, Mongolia=Ulan Bator, Canada=Ottawa}
Sorted:
{Canada=Ottawa, finland=Helsinki, Mongolia=Ulan Bator, United States=Washington}
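As a short sketch of why Example 1 passes String.CASE_INSENSITIVE_ORDER explicitly: String implements Comparable, so the one-argument sort methods fall back to its natural order, which compares raw character values and places every uppercase ASCII letter before every lowercase one. The class name SortingReminder and its helper methods are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch: natural (Comparable) order versus an explicit Comparator. */
public final class SortingReminder {

  /** Sorts a copy using String's natural order, defined by Comparable. */
  static List<String> sortNaturally(List<String> items) {
    List<String> copy = new ArrayList<>(items);
    Collections.sort(copy);
    return copy;
  }

  /** Sorts a copy of the array using whatever Comparator the caller supplies. */
  static String[] sortWith(String[] items, Comparator<? super String> order) {
    String[] copy = items.clone();
    Arrays.sort(copy, order);
    return copy;
  }

  public static void main(String... args) {
    List<String> insects = Arrays.asList("Wasp", "ant", "Bee");

    // Natural order: 'B' (66) < 'W' (87) < 'a' (97).
    System.out.println(sortNaturally(insects)); // [Bee, Wasp, ant]

    // Same data, with the Comparator used in Example 1.
    String[] sorted = sortWith(
      insects.toArray(new String[0]), String.CASE_INSENSITIVE_ORDER
    );
    System.out.println(Arrays.toString(sorted)); // [ant, Bee, Wasp]
  }
}
```

Because a Collator is itself a Comparator, it can be passed to the same two-argument sort methods.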
Example 2 - Localized Ordering
Here's an example of using a Collator to perform localized sorting and comparison of Strings. Note the importance of Collator 'strength' for fine-tuning the comparison. To ignore case, for example, either PRIMARY or SECONDARY strength can be used. PRIMARY also ignores accents, while SECONDARY still distinguishes them.
package hirondelle.jp.util;

import java.text.Collator;
import java.util.*;

/** Use Collator to sort and compare text. */
public final class SimpleCollator {

  /** Simple harness to exercise the code. */
  public static void main (String... aArguments) {
    //This data is based on an example in Java Class Libraries,
    //by Chan, Lee, and Kramer
    List<String> words = Arrays.asList(
      "Äbc", "äbc", "Àbc", "àbc", "Abc", "abc", "ABC"
    );

    log("Different 'Collation Strength' values give different sort results: ");
    log(words + " - Original Data");
    sort(words, Strength.Primary);
    sort(words, Strength.Secondary);
    sort(words, Strength.Tertiary);

    log(EMPTY_LINE);
    log("Case kicks in only with Tertiary Collation Strength : ");
    List<String> wordsForCase = Arrays.asList("cache", "CACHE", "Cache");
    log(wordsForCase + " - Original Data");
    sort(wordsForCase, Strength.Primary);
    sort(wordsForCase, Strength.Secondary);
    sort(wordsForCase, Strength.Tertiary);

    log(EMPTY_LINE);
    log("Accents kick in with Secondary Collation Strength.");
    log("Compare with no accents present: ");
    compare("abc", "ABC", Strength.Primary);
    compare("abc", "ABC", Strength.Secondary);
    compare("abc", "ABC", Strength.Tertiary);

    log(EMPTY_LINE);
    log("Compare with accents present: ");
    compare("abc", "ÀBC", Strength.Primary);
    compare("abc", "ÀBC", Strength.Secondary);
    compare("abc", "ÀBC", Strength.Tertiary);
  }

  // PRIVATE //
  private static final String EMPTY_LINE = "";
  private static final Locale TEST_LOCALE = Locale.FRANCE;

  /** Transform some Collator 'int' consts into an equivalent enum. */
  private enum Strength {
    Primary(Collator.PRIMARY), //base char
    Secondary(Collator.SECONDARY), //base char + accent
    Tertiary(Collator.TERTIARY), // base char + accent + case
    Identical(Collator.IDENTICAL); //base char + accent + case + bits
    int getStrength() { return fStrength; }
    private int fStrength;
    private Strength(int aStrength){
      fStrength = aStrength;
    }
  }

  private static void sort(List<String> aWords, Strength aStrength){
    Collator collator = Collator.getInstance(TEST_LOCALE);
    collator.setStrength(aStrength.getStrength());
    Collections.sort(aWords, collator);
    log(aWords.toString() + " " + aStrength);
  }

  private static void compare(String aThis, String aThat, Strength aStrength){
    Collator collator = Collator.getInstance(TEST_LOCALE);
    collator.setStrength(aStrength.getStrength());
    int comparison = collator.compare(aThis, aThat);
    if ( comparison == 0 ) {
      log("Collator sees them as the same : " + aThis + ", " + aThat + " - " + aStrength);
    }
    else {
      log("Collator sees them as DIFFERENT : " + aThis + ", " + aThat + " - " + aStrength);
    }
  }

  private static void log(String aMessage){
    System.out.println(aMessage);
  }
}

This class outputs the following:
Different 'Collation Strength' values give different sort results:
[Äbc, äbc, Àbc, àbc, Abc, abc, ABC] - Original Data
[Äbc, äbc, Àbc, àbc, Abc, abc, ABC] Primary
[Abc, abc, ABC, Àbc, àbc, Äbc, äbc] Secondary
[abc, Abc, ABC, àbc, Àbc, äbc, Äbc] Tertiary

Case kicks in only with Tertiary Collation Strength :
[cache, CACHE, Cache] - Original Data
[cache, CACHE, Cache] Primary
[cache, CACHE, Cache] Secondary
[cache, Cache, CACHE] Tertiary

Accents kick in with Secondary Collation Strength.
Compare with no accents present:
Collator sees them as the same : abc, ABC - Primary
Collator sees them as the same : abc, ABC - Secondary
Collator sees them as DIFFERENT: abc, ABC - Tertiary

Compare with accents present:
Collator sees them as the same : abc, ÀBC - Primary
Collator sees them as DIFFERENT: abc, ÀBC - Secondary
Collator sees them as DIFFERENT: abc, ÀBC - Tertiary
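The comparison results above suggest a locale-aware alternative to String.equalsIgnoreCase. The following is a minimal sketch under that assumption. The class name CaseInsensitiveMatch and both method names are hypothetical, and Locale.FRANCE is used only to mirror TEST_LOCALE in Example 2. The accented capital A is written as the escape \u00c0 to avoid any dependence on source file encoding.

```java
import java.text.Collator;
import java.util.Locale;

/** Illustrative sketch: locale-aware equality checks using Collator strength. */
public final class CaseInsensitiveMatch {

  /** True if the texts differ at most by case. Accents still count. */
  static boolean sameIgnoringCase(String a, String b, Locale locale) {
    Collator collator = Collator.getInstance(locale);
    collator.setStrength(Collator.SECONDARY);
    return collator.compare(a, b) == 0;
  }

  /** True if the texts differ at most by case or accents. */
  static boolean sameIgnoringCaseAndAccents(String a, String b, Locale locale) {
    Collator collator = Collator.getInstance(locale);
    collator.setStrength(Collator.PRIMARY);
    return collator.compare(a, b) == 0;
  }

  public static void main(String... args) {
    Locale locale = Locale.FRANCE;
    // Differ by case only: equal at SECONDARY strength.
    System.out.println(sameIgnoringCase("abc", "ABC", locale)); // true
    // Differ by an accent as well: not equal at SECONDARY strength.
    System.out.println(sameIgnoringCase("abc", "\u00c0BC", locale)); // false
    // PRIMARY strength looks at base characters only.
    System.out.println(sameIgnoringCaseAndAccents("abc", "\u00c0BC", locale)); // true
  }
}
```

Creating a Collator on every call keeps the sketch simple. Code that compares many strings would more likely obtain one instance and reuse it.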