Friday, February 18, 2011

Chapter 37: Searching, Parsing, Tokenizing and Formatting

Parsing, Tokenizing and Formatting are concepts you would use invariably in any application that is handling string data and displaying stuff on screen. The purposes of these are to help the programmer read/understand and represent data in a format that is understandable to all concerned parties.

So, lets get started!!!

Searching

No matter what language you’re using, sooner or later you’ll probably be faced with the need to search through large amounts of textual data, looking for some specific stuff.

Regular expressions (regex for short) are a kind of language within a language, designed to help programmers with these searching tasks. Every language that provides regex capabilities uses one or more regex engines. Regex engines search through textual data using instructions that are coded into expressions. A regex expression is like a very short program or script. When you invoke a regex engine, you’ll pass it the chunk of textual data you want it to process, and you pass it the expression you want it to use to search through the data.

It’s fair to think of regex as a language, and we will refer to it that way throughout this chapter. The regex language is used to create expressions, and as we work through this section, whenever we talk about expressions or expression syntax, we’re talking about syntax for the regex “language.”

Simple Searches

For our first example, we’d like to search through the following source String
xyxxxyx

for all occurrences (or matches) of the expression
xy

In all of these discussions we’ll assume that our data sources use zero-based indexes, so if we apply an index to our source string we get
source: xyxxxyx
index: 0123456

We can see that we have two occurrences of the expression xy: one starting at position 0 and the second starting at position 4. If we sent the previous source data and expression to a regex engine, it would reply by telling us that it found matches at positions 0 and 4:

import java.util.regex.*;
class RegexTest {
public static void main(String [] args) {
Pattern p = Pattern.compile("xy"); // the expression
Matcher m = p.matcher("xyxxxyx"); // the source
while(m.find()) {
System.out.print(m.start() + " ");
}
}
}
This produces
0 4

We’re not going to explain this code right now. In a few pages we’re going to show you a lot more regex code, but first we want to go over some more regex syntax. Once you understand a little more regex, the code samples will make a lot more sense. Here’s a more complicated example of a source and an expression:

source: xyxyxyx
index: 0123456
expression: xyx

How many occurrences do we get in this case? Well, there is clearly an occurrence starting at position 0, and another starting at position 4. But how about starting at position 2? In general in the world of regex, the xyx string that starts at position 2 will not be considered a valid occurrence. The first general regex search rule is

In general, a regex search runs from left to right, and once a source’s character has been used in a match, it cannot be reused.

So in our previous example, the first match used positions 0, 1, and 2 to match the expression. Because the character in position 2 was consumed in the first match, it couldn’t be used again. So the engine moved on, and didn’t find another occurrence of xyx until it reached position 4. This is the typical way that a regex matching engine works. However, a little later, we’ll look at an exception to the first rule we stated above.

So we’ve matched a couple of exact strings, but what would we do if we wanted to find something a little more dynamic? For instance, what if we wanted to find all of the occurrences of hex numbers or phone numbers or ZIP codes?

Searches Using Metacharacters

As luck would have it, regex has a powerful mechanism for dealing with the cases we described above. At the heart of this mechanism is the idea of a metacharacter. As an easy example, let’s say that we want to search through some source data looking for all occurrences of numeric digits. In regex, the following expression is used to look for numeric digits:
\d

If we change the previous program to apply the expression \d to the following source string
source: a12c3e456f
index: 0123456789

regex will tell us that it found digits at positions 1, 2, 4, 6, 7, and 8.

Regex provides a rich set of metacharacters that you can find described in the API documentation for java.util.regex.Pattern. We won’t discuss them all here, but we will describe the ones you’ll need for the exam:

\d A digit
\s A whitespace character
\w A word character (letters, digits, or “_” (underscore))

So for example, given
source: "a 1 56 _Z"
index: 012345678
pattern: \w

regex will return positions 0, 2, 4, 5, 7, and 8. The only characters in this source that don’t match the definition of a word character are the whitespaces.

You can also specify sets of characters to search for using square brackets and ranges of characters to search for using square brackets and a dash:

[abc] Searches only for a’s, b’s or c’s
[a-f] Searches only for a, b, c, d, e, or f characters

In addition, you can search across several ranges at once. The following expression is looking for occurrences of the letters a - f or A - F, it’s NOT looking for an fA combination:

[a-fA-F] Searches for the first six letters of the alphabet, both cases.

So for instance,
source: "cafeBABE"
index: 01234567
pattern: [a-cA-C]

returns positions 0, 1, 4, 5, 6.

Searches Using Quantifiers

Let’s say that we want to create a regex pattern to search for hexadecimal literals. As a first step, let’s solve the problem for one-digit hexadecimal numbers:

0[xX][0-9a-fA-F]

The preceding expression could be stated: “Find a set of characters in which the first character is a “0”, the second character is either an “x” or an “X”, and the third character is either a digit from “0” to “9”, a letter from “a” to “f” or an uppercase letter from “A” to “F” “. Using the preceding expression, and the following data

source: "12 0x 0x12 0Xf 0xg"
index: 012345678901234567

regex would return 6 and 11. (Note: 0x and 0xg are not valid hex numbers.)

As a second step, let’s think about an easier problem. What if we just wanted regex to find occurrences of integers? Integers can be one or more digits long, so it would be great if we could say “one or more” in an expression. There is a set of regex constructs called quantifiers that let us specify concepts such as “one or more.” In fact, the quantifier that represents “one or more” is the “+” character. We’ll see the others shortly.

The other issue this raises is that when we’re searching for something whose length is variable, getting only a starting position as a return value has limited value. So, in addition to returning starting positions, another bit of information that a regex engine can return is the entire match or group that it finds. We’re going to change the way we talk about what regex returns by specifying each return on its own line, remembering that now for each return we’re going to get back the starting position AND then the group:

source: "1 a44 555b"
pattern: \d+

You can read this expression as saying: “Find one or more digits in a row.” This expression produces the regex output
0 1
3 44
6 555

You can read this as “At position 0 there’s an integer with a value of 1, then at position 3 there’s an integer with a value of 44, then at position 6 there’s an integer with a value of 555.” Returning now to our hexadecimal problem, the last thing we need to know is how to specify the use of a quantifier for only part of an expression. In this case we must have exactly one occurrence of 0x or 0X but we can have from one to many occurrences of the hex “digits” that follow. The following expression adds parentheses to limit the “+” quantifier to only the hex digits:

0[xX]([0-9a-fA-F])+

The parentheses and “+” augment the previous find-the-hex expression by saying in effect: “Once we’ve found our 0x or 0X, you can find from one to many occurrences of hex digits.” Notice that we put the “+” quantifier at the end of the expression. It’s useful to think of quantifiers as always quantifying the part of the expression that precedes them.

The other two quantifiers we’re going to look at are
* Zero or more occurrences
? Zero or one occurrence

Let’s say you have a text file containing a comma-delimited list of all the file names in a directory that contains several very important projects. You want to create a list of all the files whose names start with car. You might discover .txt files, .java files, .pdf files. What kind of regex expression could we create to find these various car files? First let’s take a look at what a part of this text might look like:

..."bike.txt,carsched.pdf,car,bus,car.java"...

To solve this problem we’re going to use the regex ^ (carat) operator, which we mentioned earlier. The regex ^ operator isn’t on the exam, but it will help us create a fairly clean solution to our problem. The ^ is the negation symbol in regex. For instance, if you want to find anything but a’s, b’s, or c’s in a file you could say

[^abc]

So, armed with the ^ operator and the * (zero or more) quantifier we can create the following:
car([^,])*

If we apply this expression to just the portion of the text file we listed above, regex returns

10 carsched.pdf
25 car
37 car.java

The key part of this expression is the “give me zero or more characters that aren’t a comma.”
The last quantifier example we’ll look at is the ? (zero or one) quantifier. Let’s say that our job this time is to search a text file and find anything that might be a local, 7-digit phone number. We’re going to say, arbitrarily, that if we find either seven digits in a row, or three digits followed by a dash or a space followed by 4 digits, that we have a candidate. Here are examples of “valid” phone numbers:

9876543
987 6543
987-6543

The key to creating this expression is to see that we need “zero or one instance of either a space or a dash” in the middle of our digits:

\d\d\d([-\s])?\d\d\d\d

The Predefined Dot

In addition to the \s, \d, and \w metacharacters that we discussed, you also have to understand the “.” (dot) metacharacter. When you see this character in a regex expression, it means “any character can serve here.” For instance, the following source and pattern

source: "xy xyz x z"
pattern: x.z

will produce the output
3 xyz
7 x z

The “.” was able to match both the “y” and the “ ” in the source data.

Special Quantifiers

When you use the *, +, and ? quantifiers, you can fine tune them a bit to produce behavior that’s known as “greedy”, “reluctant”, or “possessive”. Although you need to understand only the greedy quantifier for the exam, we’re also going to discuss the reluctant quantifier to serve as a basis for comparison. First the syntax:

? is greedy, ?? is reluctant, for zero or once
* is greedy, *? is reluctant, for zero or more
+ is greedy, +? is reluctant, for one or more
What happens when we have the following source and pattern?
source: yyxxxyxx
pattern: .*xx

First off, we’re doing something a bit different here by looking for characters that prefix the static (xx) portion of the expression. We think we’re saying something like: “Find sets of characters that ends with xx”. Before we tell what happens, we at least want you to consider that there are two plausible results...can you find them? Remember we said earlier that in general, regex engines worked from left to right, and consumed characters as they went. So, working from left to right, we might predict that the engine would search the first 4 characters (0–3), find xx starting in position 2, and have its first match. Then it would proceed and find the second xx starting in position 6. This would lead us to a result like this:

0 yyxx
4 xyxx

A plausible second argument is that since we asked for a set of characters that ends with xx we might get a result like this:

0 yyxxxyxx

The way to think about this is to consider the name greedy. In order for the second answer to be correct, the regex engine would have to look (greedily) at the entire source data before it could determine that there was an xx at the end. So in fact, the second result is the correct result because in the original example we used the greedy quantifier *. The result that finds two different sets can be generated by using the reluctant quantifier *?. Let’s review:

source: yyxxxyxx
pattern: .*xx

is using the greedy quantifier * and produces
0 yyxxxyxx

If we change the pattern to
source: yyxxxyxx
pattern: .*?xx

we’re now using the reluctant qualifier *?, and we get the following:
0 yyxx
4 xyxx

The greedy quantifier does in fact read the entire source data, and then it works backward (from the right) until it finds the rightmost match. At that point, it includes everything from earlier in the source data up to and including the data that is part of the rightmost match.

Note:
There are a lot more aspects to regex quantifiers than we’ve discussed here, but we’ve covered more than enough for the exam. Sun has several tutorials that will help you learn more about quantifiers, and turn you into the go-to person at your job.


When Metacharacters and Strings Collide

So far we’ve been talking about regex from a theoretical perspective. Before we can put regex to work we have to discuss one more topic. When it’s time to implement regex in our code, it will be quite common that our source data and/or our expressions will be stored in Strings. The problem is that metacharacters and Strings don’t mix too well. For instance. let’s say we just want to do a simple regex pattern that looks for digits. We might try something like

String pattern = "\d"; // compiler error!

This line of code won’t compile! The compiler sees the \ and thinks, “Ok, here comes an escape sequence, maybe it’ll be a new line!” But no, next comes the d and the compiler says “I’ve never heard of the \d escape sequence.” The way to satisfy the compiler is to add another backslash in front of the \d

String pattern = "\\d"; // a metacharacter that will compile

The first backslash tells the compiler that whatever comes next should be taken literally, not as an escape sequence. How about the dot (.) metacharacter? If we want a dot in our expression to be used as a metacharacter, then no problem, but what if we’re reading some source data that happens to use dots as delimiters? Here’s another way to look at our options:

String p = "."; // regex sees this as the "." metacharacter
String p = "\.";
// the compiler sees this as an illegal Java escape sequence
String p = "\\.";
// the compiler is happy, and regex sees a dot, not a metacharacter

A similar problem can occur when you hand metacharacters to a Java program via command-line arguments. If we want to pass the \d metacharacter into our Java program, our JVM does the right thing if we say

% java DoRegex "\d"

But your JVM might not. If you have problems running the following examples, you might try adding a backslash (i.e., \\d) to your command-line metacharacters. Don’t worry, you won’t see any command-line metacharacters on the exam!

Exam Tip:
The Java language defines several escape sequences, including
\n = linefeed (which you might see on the exam)
\b = backspace
\t = tab
And others, which you can find in the Java Language Specification. Other than perhaps seeing a \n inside a String, you won’t have to worry about Java’s escape sequences on the exam.

Locating Data via Pattern Matching

Once you know a little regex, using the java.util.regex.Pattern (Pattern) and java.util.regex.Matcher (Matcher) classes is pretty straightforward. The Pattern class is used to hold a representation of a regex expression, so that it can be used and reused by instances of the Matcher class. The Matcher class is used to invoke the regex engine with the intention of performing match operations. The following program shows Pattern and Matcher in action, and it’s not a bad way for you to do your own regex experiments.

Note: You might want to modify the following class by adding some functionality from the Console class. That way you’ll get some practice with the Console class, and it’ll be easier to run multiple regex experiments.

import java.util.regex.*;
class RegexTest {
public static void main(String [] args) {
Pattern p = Pattern.compile(args[0]);
Matcher m = p.matcher(args[1]);
System.out.println("Pattern is " + m.pattern());
while(m.find()) {
System.out.println(m.start() + " " + m.group());
}
}
}

This program uses the first command-line argument (args[0]) to represent the regex expression you want to use, and it uses the second argument (args[1]) to represent the source data you want to search. Here’s a test run:

% java RegexTest "\d\w" "ab4 56_7ab"

Produces the output
Pattern is \d\w
4 56
7 7a

(Remember, if you want this expression to be represented in a String, you’d use \\d\\w). Because you’ll often have special characters or whitespace as part of your arguments, you’ll probably want to get in the habit of always enclosing your argument in quotes. Let’s take a look at this code in more detail. First off, notice that we aren’t using new to create a Pattern; if you check the API, you’ll find no constructors are listed. You’ll use the overloaded, static compile() method (that takes String expression) to create an instance of Pattern. For the exam, all you’ll need to know to create a Matcher, is to use the Pattern.matcher() method (that takes String sourceData).

The important method in this program is the find() method. This is the method that actually cranks up the regex engine and does some searching. The find() method returns true if it gets a match, and remembers the start position of the match. If find() returns true, you can call the start() method to get the starting position of the match, and you can call the group() method to get the string that represents the actual bit of source data that was matched.

Exam Tip:
To provide the most flexibility, Matcher.find(), when coupled with the greedy quantifiers ? or *, allow for (somewhat unintuitively) the idea of a zero-length match. As an experiment, modify the previous Regex.java class and add an invocation of m.end() to the S.O.P. in the while loop. With that modification in place, the invocation
java Regex "a?" "xyx"

should produce something very similar to this:
Pattern is a?
0 1 a
1 1
2 3 a
3 3

The lines of output 1 1 and 3 3 are examples of zero-length matches. Zero-length matches can occur in several places:
• After the last character of source data (the 3 3 example)
• In between characters after a match has been found (the 1 1 example)
• At the beginning of source data (try java Regex "a?" "bxyx")
• At the beginning of zero-length source data

Searching Using the Scanner Class

Although the java.util.Scanner class is primarily intended for tokenizing data (which we’ll cover next), it can also be used to find stuff, just like the Pattern and Matcher classes. While Scanner doesn’t provide location information or search and replace functionality, you can use it to apply regex expressions to source data to tell you how many instances of an expression exist in a given piece of source data. The following program uses the first command-line argument as a regex expression, then asks for input using System.in. It outputs a message every time a match is found:

import java.util.*;
class ScannerTest {
public static void main(String[] args) {
System.out.print("input: ");
System.out.flush();
try {
Scanner s = new Scanner(System.in);
String token;
do {
token = s.findInLine(args[0]);
System.out.println("found " + token);
} while (token != null);
} catch (Exception e) { System.out.println("scan exc"); }
}
}

The invocation and input

java ScannerTest "\d\d"
input: 1b2c335f456

produce the following:
found 33
found 45
found null

Tokenizing

Tokenizing is the process of taking big pieces of source data, breaking them into little pieces, and storing the little pieces in variables. Probably the most common tokenizing situation is reading a delimited file in order to get the contents of the file moved into useful places like objects, arrays or collections. We’ll look at two classes in the API that provide tokenizing capabilities: String (using the split() method) and Scanner, which has many methods that are useful for tokenizing.

Tokens and Delimiters

When we talk about tokenizing, we’re talking about data that starts out composed of two things: tokens and delimiters. Tokens are the actual pieces of data, and delimiters are the expressions that separate the tokens from each other. When most people think of delimiters, they think of single characters, like commas or backslashes or maybe a single whitespace. These are indeed very common delimiters, but strictly speaking, delimiters can be much more dynamic. In fact, delimiters can be anything that qualifies as a regex expression. Let’s take a single piece of source data and tokenize it using a couple of different delimiters:

source: "one,two,three,four"

If we say that our delimiter is a comma, then our four tokens would be

one
two
three
four

In general, when we tokenize source data, the delimiters themselves are discarded, and all that we are left with are the tokens.

Tokenizing with String.split()

The String class’s split() method takes a regex expression as its argument, and returns a String array populated with the tokens produced by the split process. This is a handy way to tokenize relatively small pieces of data. The following program uses args[0] to hold a source string, and args[1] to hold the regex pattern to use as a delimiter:

import java.util.*;

class StringSplitTest {

public static void main(String[] args) {

String[] tokens = args[0].split(args[1]);

System.out.println("count " + tokens.length);

for(String s : tokens)

System.out.println(">" + s + "<");

}

}


Everything happens all at once when the split() method is invoked. The source string is split into pieces, and the pieces are all loaded into the tokens String array. All the code after that is just there to verify what the split operation generated. The following invocation


% java StringSplitTest "ab5 ccc 45 @" "\d"


produces

count 4

>ab<

> ccc <

><

> @<


Note: Remember that to represent “\” in a string you may need to use the escape sequence “\\”. Because of this, and depending on your OS, your second argument might have to be “\\d” or even “\\\\d”.


We put the tokens inside “> <” characters to show whitespace. Notice that every digit was used as a delimiter, and that contiguous digits created an empty token.

One drawback to using the String.split() method is that often you’ll want to look at tokens as they are produced, and possibly quit a tokenization operation early when you’ve created the tokens you need. For instance, you might be searching a large file for a phone number. If the phone number occurs early in the file, you’d like to quit the tokenization process as soon as you’ve got your number. The Scanner class provides a rich API for doing just such on-the-fly tokenization operations.


Tokenizing with Scanner

The java.util.Scanner class is the master of tokenizing. When you need to do some serious tokenizing, look no further than Scanner. In addition to the basic tokenizing capabilities provided by String.split(), the Scanner class offers the following features:

• Scanners can be constructed using files, streams, or Strings as a source.
• Tokenizing is performed within a loop so that you can exit the process at any point.
• Tokens can be converted to their appropriate primitive types automatically.

Let’s look at a program that demonstrates several of Scanner’s methods and capabilities.

Scanner’s default delimiter is whitespace, which this program uses. The program makes two Scanner objects: s1 is iterated over with the more generic next() method, which returns every token as a String, while s2 is analyzed with several of the specialized nextXxx() methods (where Xxx is a primitive type):

import java.util.Scanner;
class TestScanner {
public static void main(String [] args) {
boolean b2, b;
int i;
String s, hits = " ";
Scanner s1 = new Scanner(args[0]);
Scanner s2 = new Scanner(args[0]);
while(b = s1.hasNext()) {
s = s1.next(); hits += "x";
}
while(b = s2.hasNext()) {
if (s2.hasNextInt()) {
i = s2.nextInt(); hits += "y";
} else if (s2.hasNextBoolean()) {
b2 = s2.nextBoolean(); hits += "z";
} else {
s2.next(); hits += "s2";
}
}
System.out.println("test " + hits);
}
}

If this program is invoked with
% java TestScanner "9 true 99 rocky"

it produces
test xxxxyzys2

Of course we’re not doing anything with the tokens once we’ve got them, but you can see that s2’s tokens are converted to their respective primitives. A key point here is that the methods named hasNextXxx() test the value of the next token but do not actually get the token, nor do they move to the next token in the source data. The nextXxx() methods all perform two functions: they get the next token, and then they move to the next token.

The Scanner class has nextXxx() (for instance nextLong()) and hasNextXxx() (for instance hasNextDouble()) methods for every primitive type except char. In addition, the Scanner class has a useDelimiter() method that allows you to set the delimiter to be any valid regex expression.

Formatting with printf() and format()

Any application that generates reports would have to take care of the arduous task of formatting the data. How pissed off would you be if your bank statement comes and all the details reg. the transactions in your account are in random fashion? I would tear up the statement immediately and close the darn account without a second thought. That is exactly why the statement is formatted in such a way that the customer (you and me) can read it and understand it properly and without going crazy (of course)

The format() and printf() methods were added to java.io.PrintStream in Java 5. These two methods behave exactly the same way, so anything we say about one of these methods applies to both of them.

Behind the scenes, the format() method uses the java.util.Formatter class to do the heavy formatting work. You can use the Formatter class directly if you choose, but for the exam all you have to know is the basic syntax of the arguments you pass to the format() method. The documentation for these formatting arguments can be found in the Formatter API.

Let’s start by paraphrasing the API documentation for format strings
printf("format string", argument(s));

The format string can contain both normal string literal information that isn’t associated with any arguments, and argument-specific formatting data. The clue to determining whether you’re looking at formatting data, is that formatting data will always start with a percent sign (%). Let’s look at an example:

System.out.printf("%2$d + %1$d", 123, 456);

This produces
456 + 123

Don't scold me just yet. I would let you figure this all by yourself and end up cursing me. So, let’s look at what just happened. Inside the double quotes there is a format string, then a +, and then a second format string. Notice that we mixed literals in with the format strings. Now let’s dive in a little deeper and look at the construction of format strings:

%[arg_index$][flags][width][.precision]conversion char

The values within [ ] are optional. In other words, the only required elements of a format string are the % and a conversion character. In the example above the only optional values we used were for argument indexing. The 2$ represents the second argument, and the 1$ represents the first argument. The d after the arguments is a conversion character. Here’s a rundown of the format string elements you’ll need to know for the exam:

arg_index An integer followed directly by a $, this indicates which argument should be printed in this position.

flags While many flags are available, for the exam you’ll need to know:
• “-” Left justify this argument
• “+” Include a sign (+ or -) with this argument
• “0” Pad this argument with zeroes
• “,” Use locale-specific grouping separators (i.e., the comma in 123,456)
• “(” Enclose negative numbers in parentheses

Width - This value indicates the minimum number of characters to print. (If you want nice even columns, you’ll use this value extensively.)
Precision - For the exam you’ll only need this when formatting a floating-point number, and in the case of floating point numbers, precision indicates the number of digits to print after the decimal point.
conversion - The type of argument you’ll be formatting. You’ll need to know:
• b boolean
• c char
• d integer
• f floating point
• s string

Let’s see some of these formatting strings in action:

int i1 = -555;
int i2 = 99999;
System.out.printf(">%1$(7d< \n", i1); System.out.printf(">%0,7d< \n", i2); System.out.format(">%+-7d< \n", i2); System.out.printf(">%2$b + %1$5d< \n", i1, false); This produces: “ (555)” “099,999” “+99999 “ >false + -555”

(We added the " and " symbols to help show how minimum widths, and zero padding and alignments work.) Finally, it’s important to remember that if you have a mismatch between the type specified in your conversion character and your argument, you’ll get a runtime exception:
System.out.format("%d", 12.3);

This produces something like

Exception in thread "main" java.util.IllegalFormatConversionException:
d != java.lang.Double

Previous Chapter: Chapter 36 - Using Dates, Numbers and Currencies

Next Chapter: Quick Review - Chapters 32 to 37

No comments:

Post a Comment

© 2013 by www.inheritingjava.blogspot.com. All rights reserved. No part of this blog or its contents may be reproduced or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission of the Author.

ShareThis

Google+ Followers

Followers