-
Notifications
You must be signed in to change notification settings - Fork 10.9k
StringsExplained
Joining together a sequence of strings with a separator can be unnecessarily tricky -- but it shouldn't be. If your sequence contains nulls, it can be even harder. The fluent style of Joiner
makes it simple.
Joiner joiner = Joiner.on("; ").skipNulls();
return joiner.join("Harry", null, "Ron", "Hermione");
returns the string "Harry; Ron; Hermione". Alternately, instead of using skipNulls
, you may specify a string to use instead of null with useForNull(String)
.
You may also use Joiner
on objects, which will be converted using their toString()
and then joined.
Joiner.on(",").join(Arrays.asList(1, 5, 7)); // returns "1,5,7"
Warning: joiner instances are always immutable. The joiner configuration methods will always return a new Joiner
, which you must use to get the desired semantics. This makes any Joiner
thread safe, and usable as a static final
constant.
The built in Java utilities for splitting strings can have some quirky behaviors. For example, String.split
silently discards trailing separators, and StringTokenizer
respects exactly five whitespace characters and nothing else.
Quiz: ",a,,b,".split(",")
returns...
"", "a", "", "b", ""
null, "a", null, "b", null
"a", null, "b"
"a", "b"
- None of the above
The correct answer is none of the above: "", "a", "", "b"
. Only trailing empty strings are skipped. What is this I don't even.
Splitter
allows complete control over all this confusing behavior using a reassuringly straightforward fluent pattern.
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split("foo,bar,, qux");
returns an Iterable<String>
containing "foo", "bar", "qux". A Splitter
may be set to split on any Pattern
, char
, String
, or CharMatcher
.
Method | Description | Example |
---|---|---|
Splitter.on(char) |
Split on occurrences of a specific, individual character. | Splitter.on(';') |
Splitter.on(CharMatcher) |
Split on occurrences of any character in some category. |
Splitter.on(CharMatcher.BREAKING_WHITESPACE) Splitter.on(CharMatcher.anyOf(";,."))
|
Splitter.on(String)
Split on a literal String
.
Splitter.on(", ")
Splitter.on(Pattern)
Splitter.onPattern(String)
Split on a regular expression.
Splitter.onPattern("\r?\n")
Splitter.fixedLength(int)
Splits strings into substrings of the specified fixed length. The last piece can be smaller than length
, but will never be empty.
Splitter.fixedLength(3)
Method | Description | Example |
---|---|---|
omitEmptyStrings() |
Automatically omits empty strings from the result. |
Splitter.on(',').omitEmptyStrings().split("a,,c,d") returns "a", "c", "d"
|
trimResults() |
Trims whitespace from the results; equivalent to trimResults(CharMatcher.WHITESPACE) . |
Splitter.on(',').trimResults().split("a, b, c, d") returns "a", "b", "c", "d"
|
trimResults(CharMatcher) |
Trims characters matching the specified CharMatcher from results. |
Splitter.on(',').trimResults(CharMatcher.is('_')).split("_a ,_b_ ,c__") returns "a ", "b_ ", "c" . |
limit(int) |
Stops splitting after the specified number of strings have been returned. |
Splitter.on(',').limit(3).split("a,b,c,d") returns "a", "b", "c,d"
|
TODO: Map splitters
If you wish to get a List
, just use Lists.newArrayList(splitter.split(string))
or the like.
Warning: splitter instances are always immutable. The splitter configuration methods will always return a new Splitter
, which you must use to get the desired semantics. This makes any Splitter
thread safe, and usable as a static final
constant.
In olden times, our StringUtil
class grew unchecked, and had
many methods like these:
allAscii |
collapse |
collapseControlChars |
collapseWhitespace |
indexOfChars |
lastIndexNotOf |
numSharedChars |
removeChars |
removeCrLf |
replaceChars |
retainAllChars |
strip |
stripAndCollapse |
stripNonDigits |
They represent a partial cross product of two notions:
- what constitutes a "matching" character?
- what to do with those "matching" characters?
To simplify this morass, we developed CharMatcher
.
Intuitively, you can think of a CharMatcher
as representing a particular class of characters, like digits or whitespace. Practically speaking, a CharMatcher
is just a boolean predicate on characters -- indeed, CharMatcher
implements Predicate<Character> -- but because it is so common to refer to "all whitespace characters" or "all lowercase letters," Guava provides this specialized syntax and API for characters.
But the utility of a CharMatcher
is in the operations it lets you perform on occurrences of the specified class of characters: trimming, collapsing, removing, retaining, and much more. An object of type CharMatcher
represents notion 1: what constitutes a matching character? It then provides many operations answering notion 2: what to do with those matching characters? The result is that API complexity increases linearly for quadratically increasing flexibility and power. Yay!
String noControl = CharMatcher.JAVA_ISO_CONTROL.removeFrom(string); // remove control characters
String theDigits = CharMatcher.DIGIT.retainFrom(string); // only the digits
String spaced = CharMatcher.WHITESPACE.trimAndCollapseFrom(string, ' ');
// trim whitespace at ends, and replace/collapse whitespace into single spaces
String noDigits = CharMatcher.JAVA_DIGIT.replaceFrom(string, "*"); // star out all digits
String lowerAndDigit = CharMatcher.JAVA_DIGIT.or(CharMatcher.JAVA_LOWER_CASE).retainFrom(string);
// eliminate all characters that aren't digits or lowercase
Note: CharMatcher
deals only with char
values; it does not understand supplementary Unicode code points in the range 0x10000 to 0x10FFFF. Such logical characters are encoded into a String
using surrogate pairs, and a CharMatcher
treats these just as two separate characters.
Many needs can be satisfied by the provided CharMatcher
constants:
ANY |
NONE |
WHITESPACE |
BREAKING_WHITESPACE |
INVISIBLE |
DIGIT |
JAVA_LETTER |
JAVA_DIGIT |
JAVA_LETTER_OR_DIGIT |
JAVA_ISO_CONTROL |
JAVA_LOWER_CASE |
JAVA_UPPER_CASE |
ASCII |
SINGLE_WIDTH |
Other common ways to obtain a CharMatcher
include:
Method | Description |
---|---|
anyOf(CharSequence) |
Specify all the characters you wish matched. For example, CharMatcher.anyOf("aeiou") matches lowercase English vowels. |
is(char) |
Specify exactly one character to match. |
inRange(char, char) |
Specify a range of characters to match, e.g. CharMatcher.inRange('a', 'z') . |
Additionally, CharMatcher
has negate()
, and(CharMatcher)
, and or(CharMatcher)
. These provide simple boolean operations on CharMatcher
.
CharMatcher
provides a wide variety of methods to operate on occurrences of the specified characters in any CharSequence
. There are more methods provided than we can list here, but some of the most commonly used are:
Method | Description |
---|---|
collapseFrom(CharSequence, char) |
Replace each group of consecutive matched characters with the specified character. For example, WHITESPACE.collapseFrom(string, ' ') collapses whitespaces down to a single space. |
matchesAllOf(CharSequence) |
Test if this matcher matches all characters in the sequence. For example, ASCII.matchesAllOf(string) tests if all characters in the string are ASCII. |
removeFrom(CharSequence) |
Removes matching characters from the sequence. |
retainFrom(CharSequence) |
Removes all non-matching characters from the sequence. |
trimFrom(CharSequence) |
Removes leading and trailing matching characters. |
replaceFrom(CharSequence, CharSequence) |
Replace matching characters with a given sequence. |
(Note: all of these methods return a String
, except for matchesAllOf
, which returns a boolean
.)
Don't do this:
try {
bytes = string.getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
// how can this possibly happen?
throw new AssertionError(e);
}
Do this instead:
bytes = string.getBytes(Charsets.UTF_8);
Charsets
provides constant references to the six standard Charset
implementations guaranteed to be supported by all Java platform implementations. Use them instead of referring to charsets by their names.
TODO: an explanation of charsets and when to use them
(Note: If you're using JDK7, you should use the constants in StandardCharsets
instead!)
CaseFormat
is a handy little class for converting between ASCII case conventions -- like, for example, naming conventions for programming languages. Supported formats include:
Format | Example |
---|---|
LOWER_CAMEL |
lowerCamel |
LOWER_HYPHEN |
lower-hyphen |
LOWER_UNDERSCORE |
lower_underscore |
UPPER_CAMEL |
UpperCamel |
UPPER_UNDERSCORE |
UPPER_UNDERSCORE |
Using it is relatively straightforward:
CaseFormat.UPPER_UNDERSCORE.to(CaseFormat.LOWER_CAMEL, "CONSTANT_NAME")); // returns "constantName"
We find this especially useful, for example, when writing programs that generate other programs.
- Introduction
- Basic Utilities
- Collections
- Graphs
- Caches
- Functional Idioms
- Concurrency
- Strings
- Networking
- Primitives
- Ranges
- I/O
- Hashing
- EventBus
- Math
- Reflection
- Releases
- Tips
- Glossary
- Mailing List
- Stack Overflow
- Android Overview
- Footprint of JDK/Guava data structures