split_line [Clemens Wacha]

Overview

split_line is a clean STL string tokenizer written in C++ in fewer than 100 lines of code. In its simplest form it fills a vector of strings with the tokens of a line of text, split at space, tab, carriage return and newline. In its most complex form it supports user-provided delimiters, a user-provided quote character, a user-provided escape character, a special character for comments, and a limited ability to resume tokenization with another part of the string.

Features

splits a line of text into words delimited by one or more delimiters
user can provide delimiters (defaults to \t\r\n and space)
user can provide one special character for quoted text (defaults to ")
user can provide one special escape character (defaults to \)
user can provide one special character for comments (disabled by default)
limited support to resume at another part of the string
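
The first feature, splitting at runs of one or more delimiters, can be illustrated without split_line itself. The following is only a minimal sketch built on std::string; split_simple is a hypothetical helper name, not part of the library, and it handles neither quoting, escaping nor comments.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, NOT part of split_line: splits `line` at any character
// contained in `delimiters`, treating a run of delimiters as one separator.
std::vector<std::string> split_simple(const std::string& line,
                                      const std::string& delimiters = " \t\r\n") {
    std::vector<std::string> tokens;
    // Skip leading delimiters to find the start of the first token.
    std::string::size_type start = line.find_first_not_of(delimiters);
    while (start != std::string::npos) {
        // The token ends at the next delimiter (or at the end of the string).
        std::string::size_type end = line.find_first_of(delimiters, start);
        tokens.push_back(line.substr(start, end - start));
        // Skip the run of delimiters that follows the token.
        start = line.find_first_not_of(delimiters, end);
    }
    return tokens;
}
```

Calling split_simple("a,,b;c", ",;") yields the three tokens a, b and c, because the doubled comma counts as a single separator.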

Download

split_line-1.0.zip

Code Example

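#include <iostream>
#include <string>
#include <vector>
// split_line() itself comes from the download above; include its header here.

using namespace std;

int main(int argc, char *argv[]) {
    vector<string> tokens;
    // The trailing backslash joins the two source lines into one string literal.
    string line = "Writing    programs     \"in C++\"  \tis   \
     Fun!!";

    // Split at the default delimiters; the quoted part stays one token.
    split_line(tokens, line);

    cout << "Tokens:" << endl;
    for(unsigned int i = 0; i < tokens.size(); i++)
        cout << "'" << tokens[i] << "'" << endl;

    return 0;
}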

Output:

Tokens:
'Writing'
'programs'
'in C++'
'is'
'Fun!!'

Documentation

A more complex example can be found in cfg_parser, in the function readFile(). split_line resembles a state machine with 5 states (see enum SPLIT_LINE_STATE). The starting state of the machine can be passed in, which makes it possible to resume tokenization of a string in some cases. In resuming mode (start_state != SL_NORMAL), the characters read are appended to the last string in the string vector ret until the state switches back to SL_NORMAL. cfg_parser uses this behaviour to read multiline values. However, this feature does not let you split a string at an arbitrary position yourself and then hand the remainder to split_line (using the returned state as the new start_state). In most cases the outcome will differ from what you might expect!

// States of the split_line state machine.
enum SPLIT_LINE_STATE {
	SL_NORMAL,
	SL_ESCAPE,
	SL_SAFEMODE,
	SL_SAFEESCAPE,
	SL_COMMENT
};
 
// Splits line into tokens and stores them in ret. Supports delimiters and escape characters.
// Special characters are ignored between a pair of safemode_char and between comment_char
// and the end of the line ('\n').
// Returns the SPLIT_LINE_STATE the parser was in when it returned.
int split_line(std::vector<std::string>& ret, std::string& line, const std::string& delimiters = " \t\r\n", char escape_char = '\\', char safemode_char = '"', char comment_char = '\0', int start_state = SL_NORMAL);
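
To make the five states and the resuming behaviour more concrete, here is a minimal sketch of how such a state machine could look. It is an illustrative re-implementation under assumptions, not the code shipped in split_line-1.0.zip: the name split_line_sketch, the helper finish_token, the const reference for line and the exact transitions (for example, that a comment ends the current token) are all assumed for this example.

```cpp
#include <string>
#include <vector>

enum SPLIT_LINE_STATE { SL_NORMAL, SL_ESCAPE, SL_SAFEMODE, SL_SAFEESCAPE, SL_COMMENT };

// "finish": append the current token to the list and start a new one.
static void finish_token(std::vector<std::string>& ret, std::string& token, bool& open) {
    if (open) { ret.push_back(token); token.clear(); open = false; }
}

// Illustrative sketch only (assumed behaviour, not the original implementation).
int split_line_sketch(std::vector<std::string>& ret, const std::string& line,
                      const std::string& delimiters = " \t\r\n",
                      char escape_char = '\\', char safemode_char = '"',
                      char comment_char = '\0', int start_state = SL_NORMAL) {
    int state = start_state;
    std::string token;
    bool open = false;  // true while a token is being built (even an empty quoted one)

    // Resuming: keep appending to the last token instead of starting a new one.
    if (state != SL_NORMAL && state != SL_COMMENT && !ret.empty()) {
        token = ret.back();
        ret.pop_back();
        open = true;
    }

    for (std::string::size_type i = 0; i < line.size(); ++i) {
        const char c = line[i];
        switch (state) {
        case SL_NORMAL:
            if (c == escape_char) {
                state = SL_ESCAPE;
            } else if (c == safemode_char) {
                state = SL_SAFEMODE;
                open = true;
            } else if (comment_char != '\0' && c == comment_char) {
                finish_token(ret, token, open);
                state = SL_COMMENT;
            } else if (delimiters.find(c) != std::string::npos) {
                finish_token(ret, token, open);
            } else {
                token += c;  // "eat"
                open = true;
            }
            break;
        case SL_ESCAPE:      // the escaped character is taken literally
            token += c;
            open = true;
            state = SL_NORMAL;
            break;
        case SL_SAFEMODE:    // inside quotes: delimiters and comment_char are plain text
            if (c == escape_char) state = SL_SAFEESCAPE;
            else if (c == safemode_char) state = SL_NORMAL;
            else token += c;
            break;
        case SL_SAFEESCAPE:  // escaped character inside quotes
            token += c;
            state = SL_SAFEMODE;
            break;
        case SL_COMMENT:     // skip everything up to the end of the line
            if (c == '\n') state = SL_NORMAL;
            break;
        }
    }
    finish_token(ret, token, open);  // store whatever was collected so far
    return state;
}
```

In this sketch, tokenizing a chunk that ends inside quotes returns SL_SAFEMODE. Passing that value as start_state together with the same vector when tokenizing the next chunk continues the last token until the closing quote is seen. As with the real function, this only gives sensible results when the chunks are cut at suitable places, such as line ends.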

State Diagram

Legend

character read in / action
eat: append the character to the current token
finish: append token to token list and start with a new token

License

This software is licensed under the CC-GNU GPL: http://creativecommons.org/licenses/GPL/2.0/

Split Line - A Clean and Small String Tokenizer