split_line

Overview

split_line is a compact string tokenizer for C++, built on the STL and written in fewer than 100 lines of code. In its simplest form it fills a vector of strings with the tokens of a line of text, split at spaces, tabs, carriage returns and newlines. In its most complex form it supports user-provided delimiters, a user-provided quote character, a user-provided escape character, a special character for comments and a limited ability to resume tokenization with another part of the string.

Features

  • splits a line of text into words delimited by one or more delimiters
  • user can provide delimiters (defaults to \t\r\n and space)
  • user can provide one special character for quoted text (defaults to ")
  • user can provide one special escape character (defaults to \)
  • user can provide one special character for comments (disabled by default)
  • limited support to resume at another part of the string

Download

Code Example

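#include <iostream>
#include <string>
#include <vector>
// split_line() must also be declared here, see the declaration under Documentation

using namespace std;

int main(int argc, char *argv[]) {
    vector<string> tokens;
    // the trailing backslash continues the string literal on the next line
    string line = "Writing    programs     \"in C++\"  	is   \
     Fun!!";

    split_line(tokens, line);

    // print every token in single quotes, one per line
    cout << "Tokens:" << endl;
    for (size_t i = 0; i < tokens.size(); i++)
        cout << "'" << tokens[i] << "'" << endl;

    return 0;
}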

Output:

Tokens:
'Writing'
'programs'
'in C++'
'is'
'Fun!!'

Documentation

A more complex example can be found in cfg_parser, in the function readFile(). split_line works like a state machine with 5 states (see enum SPLIT_LINE_STATE). You can pass in the starting state of the machine, which makes it possible to resume tokenization of a string in some cases. In resuming mode (start_state != SL_NORMAL) the characters read are appended to the last string in the string vector ret until the state switches back to SL_NORMAL. cfg_parser uses this behaviour to read multi-line values. However, this feature does not let you split a string at an arbitrary position yourself and then pass the rest to split_line (using the returned state as the new start_state). In most cases the outcome will differ from what you expect!

enum SPLIT_LINE_STATE {
	SL_NORMAL,
	SL_ESCAPE,
	SL_SAFEMODE,
	SL_SAFEESCAPE,
	SL_COMMENT,
};
 
// Splits line into tokens and stores them in ret. Supports delimiters and escape characters.
// Special characters are ignored between two safemode_char and between comment_char and the line end '\n'.
// Returns the SPLIT_LINE_STATE the parser was in when it returned.
int split_line(std::vector<std::string>& ret, std::string& line, const std::string& delimiters = " \t\r\n", char escape_char = '\\', char safemode_char = '"', char comment_char = '\0', int start_state = SL_NORMAL);

State Diagram

Legend

  • character read in / action
  • eat: append the character to the current token
  • finish: append token to token list and start with a new token
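
Since the diagram itself is not reproduced here, the following is a minimal sketch of how a five-state machine with the eat and finish actions from the legend could look. It is not the original split_line source: the names sketch_split_line, SketchState and the SK_* states are hypothetical, and the exact transitions (including how an unterminated quote and resuming are handled) are assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical names: this is an illustrative sketch, not the original split_line code.
enum SketchState { SK_NORMAL, SK_ESCAPE, SK_SAFEMODE, SK_SAFEESCAPE, SK_COMMENT };

// Tokenizes line into ret and returns the state reached at the end of the input.
// Assumption: when start_state is an escape or quote state, the last token
// already stored in ret is reopened and new characters are appended to it.
SketchState sketch_split_line(std::vector<std::string>& ret, const std::string& line,
                              const std::string& delimiters = " \t\r\n",
                              char escape_char = '\\', char safemode_char = '"',
                              char comment_char = '\0',
                              SketchState start_state = SK_NORMAL) {
    SketchState state = start_state;
    std::string token;
    bool in_token = false;  // true while a token is being built

    // Resuming: continue the last stored token instead of starting a new one.
    if (state != SK_NORMAL && state != SK_COMMENT && !ret.empty()) {
        token = ret.back();
        ret.pop_back();
        in_token = true;
    }

    // finish: append the token to the token list and start a new token
    auto finish = [&]() {
        if (in_token) {
            ret.push_back(token);
            token.clear();
            in_token = false;
        }
    };
    // eat: append the character to the current token
    auto eat = [&](char c) {
        token += c;
        in_token = true;
    };

    for (char c : line) {
        switch (state) {
        case SK_NORMAL:
            if (c == escape_char) state = SK_ESCAPE;
            else if (c == safemode_char) { state = SK_SAFEMODE; in_token = true; }
            else if (comment_char != '\0' && c == comment_char) { finish(); state = SK_COMMENT; }
            else if (delimiters.find(c) != std::string::npos) finish();
            else eat(c);
            break;
        case SK_ESCAPE:      // the escaped character is taken literally
            eat(c);
            state = SK_NORMAL;
            break;
        case SK_SAFEMODE:    // inside quotes: delimiters and comments are plain text
            if (c == escape_char) state = SK_SAFEESCAPE;
            else if (c == safemode_char) state = SK_NORMAL;
            else eat(c);
            break;
        case SK_SAFEESCAPE:  // escaped character inside quotes
            eat(c);
            state = SK_SAFEMODE;
            break;
        case SK_COMMENT:     // skip everything up to the end of the line
            if (c == '\n') state = SK_NORMAL;
            break;
        }
    }
    finish();
    return state;
}
```

In this sketch, tokenizing the text key "first (with the quote left open) returns SK_SAFEMODE. Passing that state as start_state together with the text second" tail appends to the open token, so ret ends up holding key, first second and tail. The real split_line may treat these cases differently.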

License

This software is licensed under the CC-GNU GPL: http://creativecommons.org/licenses/GPL/2.0/

  • split_line.txt
  • Last modified: 16.11.2016 23:18 (8 years ago)
  • by 127.0.0.1