Sorting strings with the Java platform can be thought of as an easy task, but there is much more thought that should be put into it when developing programs for an international market. If you're stuck in the English-only mindset, and you think your program works fine because it shows that the string tomorrow comes after today, you might think all is great. But, once you have a Spanish user who wants mañana to be sorted properly, if all you use is the default compare() method of String for sorting, the ñ character will come after the z character and will not be the natural Spanish ordering, between the n character and o character. That's where the Collator class of the java.text package comes into play.
Imagine a list of words
* first
* mañana
* man
* many
* maxi
* next
Using the default sorting mechanism of String, its compare() method, this will result in a sorted list of:
* first
* man
* many
* maxi
* mañana
* next
Here, mañana comes between maxi and next. In the Spanish world, what should happen is mañana should come between many and maxi as the ñ character (pronounced eñe) comes after the n in that alphabet. While you could write your own custom sort routine to handle the ñ, what happens to your program when a German user comes around and wants to use their own diacritical marks, or what about just a list of design patterns with façade? Do you want façade before or after factory? (Essentially treating the ç with the little cedilla hook the same as c or different.)
That's where the Collator class comes in handy. The Collator class takes into account language-sensitive sorting issues and doesn't just try to sort words based upon their ASCII/Unicode character values. Using Collator requires understanding one additional property before you can fully utilize its features, and that is something called strength. The strength setting of the Collator determines how strong (or weak) a match is used for ordering. There are four possible values for the property: PRIMARY, SECONDARY, TERTIARY, and IDENTICAL. What actually happens with each is dependent on the locale. Typically what happens is as follows. In reverse order, IDENTICAL strength means just that, the characters must be identical for them to be treated the same. TERTIARY typically is for ignoring case differences. SECONDARY is for ignoring diacritical marks, like n vs. ñ. PRIMARY is like IDENTICAL for base letter differences, but has some differences when handling control characters and accents. See the Collator javadoc for more information on these differences and decomposition mode rules.
To work with Collator, you need to start by getting one. You can either call getInstance() to get one for the default locale, or pass the specific Locale to the getInstance() method to get a locale for the one provided. For instance, to get one for the Spanish language, you would create a Spanish Locale with new Locale("es") and then pass that into getInstance():
Collator esCollator = Collator.getInstance(new Locale("es"));
Assuming the default Collator strength for the locale is sufficient, which happens to be SECONDARY for Spanish, you would then pass the Collator like any Comparator into the sort() routine of Collections to get your sorted List:
Collections.sort(list, esCollator);
Working with the earlier list, that now gives you a proper sorting with the Spanish alphabet:
* first
* man
* many
* mañana
* maxi
* next
Had you instead used the US Locale for the Collator, mañana would appear between man and many since the ñ is not its own letter.
Here's a quick example that shows off the differences.
import java.awt.; import java.text.; import java.util.; import java.util.List; // Explicit import required import javax.swing.;
public class Sort { public static void main(String args[]) { Runnable runner = new Runnable() { public void run() { String words[] = {"first", "mañana", "man", "many", "maxi", "next"}; List list = Arrays.asList(words); JFrame frame = new JFrame("Sorting"); frame.setDefaultCloseOperation (JFrame.EXIT_ON_CLOSE); Box box = Box.createVerticalBox(); frame.setContentPane(box); JLabel label = new JLabel("Word List:"); box.add(label); JTextArea textArea = new JTextArea( list.toString()); box.add(textArea); Collections.sort(list); label = new JLabel("Sorted Word List:"); box.add(label); textArea = new JTextArea(list.toString ()); box.add(textArea); Collator esCollator = Collator.getInstance(new Locale("es")); Collections.sort(list, esCollator); label = new JLabel("Collated Word List:"); box.add(label); textArea = new JTextArea(list.toString()); box.add(textArea); frame.setSize(400, 200); frame.setVisible(true); } }; EventQueue.invokeLater (runner); } }
[http://blogs.sun.com/CoreJavaTechTips/entry/sorting_strings]