Saturday, April 25, 2009

Ways to extract data from a space-delimited string

Ways to tokenize a string when the space character cannot serve as the field delimiter because the fields themselves can contain spaces:

1. Create the input with some other character as the delimiter. The delimiter character should be non-printable. This reduces its chances of occurring in the data and thus reduces the handling of special cases.

2. If the input was already generated with space as the delimiter, then we have a problem at hand. For such cases there are two approaches, and both require knowledge of the format of the input.

If the format of the input is known, then one can use regular expressions to search for the tokens in the string. (Typical scripting languages, such as Perl and Tcl, support regular expressions. C++ users can use the Boost library for regular expression support.)

However, using regular expressions can be expensive if the number of searches during program execution is large, so they are best used only when the number of searches is small.
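As a rough sketch of the regular-expression approach, assume the input has the form "integer, free text that may contain spaces, integer". The code below uses std::regex from C++11 rather than the Boost library mentioned above. The Record struct and the parseWithRegex function are hypothetical names introduced only for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical container for the three fields (not from the original post).
struct Record {
    int first;
    std::string middle;
    int last;
};

// Returns true and fills `out` when `line` has the assumed form
// "<integer> <text> <integer>"; returns false otherwise.
bool parseWithRegex(const std::string& line, Record& out) {
    // Built once and reused: constructing a std::regex is relatively costly.
    static const std::regex pattern(R"((\d+) (.*) (\d+))");
    std::smatch m;
    // regex_match requires the whole line to match the pattern.
    if (!std::regex_match(line, m, pattern)) {
        return false;
    }
    // std::stoi throws std::out_of_range if the digits do not fit in an int.
    out.first = std::stoi(m[1].str());
    out.middle = m[2].str();  // greedy (.*) keeps the inner spaces
    out.last = std::stoi(m[3].str());
    return true;
}
```

Because (.*) is greedy and the pattern must end in a space followed by digits, the middle field absorbs every inner space and only the final number is split off.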

For programs that perform such searches more often, let us understand the other approach using an example:

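// input format: <integer> <text that may contain spaces> <integer>

#include <stdlib.h>   // atoi
#include <string.h>   // strchr, strrchr
#include <string>
using std::string;

int main() {
    // Use a writable array: the code below overwrites spaces with '\0',
    // which is undefined behavior on a string literal.
    char inputStr[] = "12 abc def 14";
    char* firstSpaceChar = strchr(inputStr, ' ');
    int firstInt = 0;
    int lastInt = 0;
    string midStr = "";
    if (firstSpaceChar != NULL) {
        // Terminate the first field at the first space and parse it.
        *firstSpaceChar = '\0';
        firstInt = atoi(inputStr);
        // The last space in the remainder marks the start of the final field.
        char* lastSpaceChar = strrchr(firstSpaceChar + 1, ' ');
        if (lastSpaceChar != NULL) {
            lastInt = atoi(lastSpaceChar + 1);
            *lastSpaceChar = '\0';
        }
        // Whatever remains between the two cuts is the middle field.
        midStr = firstSpaceChar + 1;
    } else {
        // No space at all: treat the whole input as the middle field.
        midStr = inputStr;
    }
    return 0;
}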
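
For comparison, the first option (writing the input with a non-printable delimiter) can be sketched as follows. The choice of the ASCII Unit Separator (0x1F) is an assumption made for illustration, since the text above only asks for some non-printable character. The names kFieldDelimiter, joinFields and splitFields are likewise hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: ASCII Unit Separator (0x1F) as the delimiter. Any character
// that cannot occur in the field data would serve equally well.
const char kFieldDelimiter = '\x1f';

// Hypothetical helper: joins fields with the delimiter between them.
std::string joinFields(const std::vector<std::string>& fields) {
    std::string out;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i != 0) {
            out += kFieldDelimiter;
        }
        out += fields[i];
    }
    return out;
}

// Hypothetical helper: splits on the delimiter, keeping empty fields.
std::vector<std::string> splitFields(const std::string& line) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        std::size_t pos = line.find(kFieldDelimiter, start);
        if (pos == std::string::npos) {
            // No more delimiters: the rest of the line is the last field.
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

With such a delimiter, a field like "abc def" passes through unchanged and no knowledge of the field layout is needed to split the line. This only holds as long as the delimiter character never appears inside a field.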