Next: table2sus Up: The Current Tools Previous: labmex

text2sus

Purpose:

Part of the Sus Filter Tools: Scans a text for keywords and outputs a sus file.

Usage:

text2sus [<option>] [<label>|<label>=<regexp>|<regexp>] ...

Options:
-i, --input=<FILENAME>
read from <FILENAME> (default: '-')
-o, --output=<FILENAME>
write to <FILENAME> (default: '-')
-n, --next=<REGEXP>
Defines when to start a new record (default: first label). For example, to make each line a separate record, use --next=\n
With @FILE or @ FILE, (some) command-line options are read from FILE (see section [*]).
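
Because most shells treat the backslash as an escape character, an option such as --next=\n usually has to be quoted when it is typed on the command line. The following Python sketch only approximates POSIX shell word splitting with the standard shlex module; it does not run text2sus, and the command lines shown are illustrative.

```python
import shlex

# Unquoted: the shell-style splitter consumes the backslash,
# so the tool would receive "--next=n" instead of "--next=\n".
unquoted = shlex.split(r"text2sus --next=\n")

# Single-quoted: the backslash survives and reaches the tool intact.
quoted = shlex.split(r"text2sus '--next=\n'")

print(unquoted)  # ['text2sus', '--next=n']
print(quoted)    # ['text2sus', '--next=\\n']
```

Options read from a command-line file (the @FILE form) do not pass through the shell, so this concern applies only to options typed directly.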
Defining Labels:

There are three ways to define a label:

<LABEL>
Take the first word or number after any occurrence of the <LABEL>.
<LABEL>=<REGEXP>
Take the first group defined in <REGEXP>. If there are no parentheses in <REGEXP>, again take the next word or number.
<REGEXP> (containing groups with labels, e.g. (?P<label>...))
Take every labeled group defined in <REGEXP>.
Thus, "label", "label=label", "label=label[:= \t]*(\w+)", and "label[:= \t]*(?P<label>\w+)" all do the same thing. In fact, the regular expression actually used is a bit more complicated, so that it captures a floating-point number if that seems to be the next word: "label[:= ]*(?P<label>[+-]?+?([eE][+-]?+)?|+)"
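
The equivalence of the three forms can be illustrated with Python's re module. This is a minimal sketch, not the text2sus implementation: the helper names (expand_bare_label, first_group, named_groups), the label "temp", and the sample text are all hypothetical, and the simple "\w+" pattern is used rather than the floating-point variant.

```python
import re

# Hypothetical input text; "temp" stands in for a label name.
text = "step 12  temp: 300  pressure=2"

def expand_bare_label(label):
    """Assumed expansion of a bare <LABEL> into a regexp with a labeled group."""
    return rf"{re.escape(label)}[:= \t]*(?P<{label}>\w+)"

def first_group(regexp, text):
    """<LABEL>=<REGEXP> form: the value is the first group of the match."""
    m = re.search(regexp, text)
    return m.group(1) if m else None

def named_groups(regexp, text):
    """<REGEXP> form: every labeled group yields a label/value pair."""
    m = re.search(regexp, text)
    return m.groupdict() if m else {}

from_bare = named_groups(expand_bare_label("temp"), text)
from_pair = first_group(r"temp[:= \t]*(\w+)", text)
from_regexp = named_groups(r"temp[:= \t]*(?P<temp>\w+)", text)

print(from_bare)    # {'temp': '300'}
print(from_pair)    # 300
print(from_regexp)  # {'temp': '300'}
```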

Regular Expressions:

A regular expression (or REGEXP) specifies a set of strings that matches it.

Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. If a string p matches A and another string q matches B, the string pq will match AB.
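
The concatenation rule can be checked directly in Python. The expressions A and B and the strings p and q below are arbitrary examples chosen for illustration.

```python
import re

A = r"ab+"    # 'a' followed by one or more 'b'
B = r"\d\d"   # exactly two digits
p = "abbb"    # a string matching A
q = "42"      # a string matching B

p_matches_A = re.fullmatch(A, p) is not None
q_matches_B = re.fullmatch(B, q) is not None

# Concatenating the expressions matches the concatenated strings.
pq_matches_AB = re.fullmatch(A + B, p + q) is not None

print(p_matches_A, q_matches_B, pq_matches_AB)  # True True True
```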

A brief explanation of part of the regular expression format follows. For further information and a gentler presentation, consult the Regular Expression HOWTO, accessible from http://www.python.org/doc/howto/.

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like `A', `a', or `0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so "last" matches the string `last'. (In the rest of this section, we'll write REGEXPs in "this special style", usually without quotes, and strings to be matched `in single quotes'.)

Some characters, like `|' or `(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.

The special characters are:

`.'
(Dot.) In the default mode, this matches any character except a newline.

`^'
(Caret.) Matches the start of the string and immediately after each newline.

`$'
Matches the end of the string and before a newline. "foo" matches both 'foo' and 'foobar', while the regular expression "foo$" matches only 'foo'.

`*'
Causes the resulting REGEXP to match 0 or more repetitions of the preceding REGEXP, as many repetitions as are possible. "ab*" will match 'a', 'ab', or 'a' followed by any number of 'b's.

`+'
Causes the resulting REGEXP to match 1 or more repetitions of the preceding REGEXP. "ab+" will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

`?'
Causes the resulting REGEXP to match 0 or 1 repetitions of the preceding REGEXP. "ab?" will match either 'a' or 'ab'.

`*?', `+?', `??'
The `*', `+', and `?' qualifiers are all "greedy"; they match as much text as possible. Sometimes this behaviour isn't desired; if the REGEXP "<.*>" is matched against `<H1>title</H1>', it will match the entire string, and not just `<H1>'. Adding `?' after the qualifier makes it perform the match in "non-greedy" or "minimal" fashion; as few characters as possible will be matched. Using ".*?" in the previous expression will match only `<H1>'.

`\'
Either escapes special characters (permitting you to match characters like `*', `?', and so forth), or signals a special sequence; special sequences are discussed below.

If you're not using a command-line file (see section [*]), remember that most shells also use the backslash as an escape character on the command line; you therefore have to enclose the regular expression in single quotes to keep the shell from interpreting it.

`[ ]'
Used to indicate a set of characters. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a `-'. Special characters are not active inside sets. For example, "[!akm]" will match any of the characters `a', `k', `m', or `!'; "[a-z]" will match any lowercase letter, and "[a-zA-Z0-9]" matches any letter or digit. Character classes such as `\w' or `\S' (defined below) are also acceptable inside a set. If you want to include a `]' or a `-' inside a set, precede it with a backslash, or place it as the first character. The pattern "[]]" will match `]', for example.

You can match the characters not within a range by "complementing" the set. This is indicated by including a `^' as the first character of the set; `^' elsewhere will simply match the `^' character. For example, "[^5]" will match any character except `5'.

`|'
`A|B', where A and B can be arbitrary REGEXPs, creates a regular expression that will match either A or B. This can be used inside groups (see below) as well. To match a literal `|', use "\|", or enclose it inside a character class, as in "[|]".

`(...)'
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group. To match the literals `(' or `)', use "\(" or "\)", or enclose them inside a character class: "[(] [)]".

`(?...)'
This is an extension notation (a `?' following a `(' is not meaningful otherwise). The first character after the `?' determines the meaning and further syntax of the construct. Extensions usually do not create a new group; "(?P<NAME>...)" is the only exception to this rule. The following are some of the currently supported extensions.

`(?P<NAME>...)'
Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name NAME.

`(?P=NAME)'
Matches whatever text was matched by the earlier group named NAME.

`(?=...)'
Matches if "..." matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, "Isaac (?=Asimov)" will match `Isaac ' only if it's followed by `Asimov'.

`(?!...)'
Matches if "..." doesn't match next. This is a negative lookahead assertion. For example, "Isaac (?!Asimov)" will match `Isaac ' only if it's not followed by `Asimov'.
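
Several of the special characters described above can be tried out in Python, whose re module uses this syntax. The snippet is only an illustrative sketch; the sample strings (apart from those quoted in the descriptions) are made up.

```python
import re

html = "<H1>title</H1>"
greedy = re.match(r"<.*>", html).group()   # '*' takes as much as possible
lazy = re.match(r"<.*?>", html).group()    # '*?' takes as little as possible

in_set = re.findall(r"[!akm]", "make!")    # '!' is ordinary inside a set
not_five = re.findall(r"[^5]", "1525")     # complemented set

either = re.fullmatch(r"cat|dog", "dog") is not None

# Named groups, and a back-reference to an earlier named group.
pair = re.search(r"(?P<key>\w+)=(?P<value>\w+)", "mass=12").groupdict()
doubled = re.search(r"(?P<word>\w+) (?P=word)", "it is is fine").group()

# Lookahead assertions do not consume the text they inspect.
ahead = re.search(r"Isaac (?=Asimov)", "Isaac Asimov").group()
blocked = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")
allowed = re.search(r"Isaac (?!Asimov)", "Isaac Newton").group()

print(greedy)    # <H1>title</H1>
print(lazy)      # <H1>
print(in_set)    # ['m', 'a', 'k', '!']
print(not_five)  # ['1', '2']
print(pair)      # {'key': 'mass', 'value': '12'}
print(doubled)   # is is
```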

The special sequences consist of `\' and a character from the list below. If the ordinary character is not on the list, then the resulting REGEXP will match the second character. For example, "\$" matches the character `$'.

`\A' Matches only at the start of the string.

`\b' Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.

`\B' Matches the empty string, but only when it is not at the beginning or end of a word.

`\d' Matches any decimal digit; this is equivalent to the set "[0-9]".

`\D' Matches any non-digit character; this is equivalent to the set "[^0-9]".

`\s' Matches any whitespace character; this is equivalent to the set "[ \t\n\r\f\v]".

`\S' Matches any non-whitespace character; this is equivalent to the set "[^ \t\n\r\f\v]".

`\w' This is equivalent to the set "[a-zA-Z0-9_]", the alphanumeric characters.

`\W' This is equivalent to the set "[^a-zA-Z0-9_]", the non-alphanumeric characters.

`\Z' Matches only at the end of the string.

`\\' Matches a literal backslash.
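
The special sequences can be explored the same way. This is a small sketch with made-up sample strings. Note that in Python 3 the classes `\d', `\w', and `\s' are Unicode-aware for text patterns, so the re.ASCII flag is passed here to get exactly the sets listed above.

```python
import re

line = "run_3 took 42.5 s"

digits = re.findall(r"\d+", line, re.ASCII)   # runs of [0-9]
words = re.findall(r"\w+", line, re.ASCII)    # runs of [a-zA-Z0-9_]
fields = re.split(r"\s+", line)               # split on whitespace

# \b matches only at a word boundary, \B only where there is none.
whole = re.findall(r"\bcat\b", "cat concat cats cat")
inner = re.findall(r"\Bcat", "cat concat")

# \A and \Z anchor to the very start and end of the string.
at_start = re.search(r"\Arun", line) is not None
at_end = re.search(r"s\Z", line) is not None

# "\\" matches one literal backslash.
backslashes = re.findall(r"\\", r"C:\tmp\x")

print(digits)  # ['3', '42', '5']
print(words)   # ['run_3', 'took', '42', '5', 's']
print(fields)  # ['run_3', 'took', '42.5', 's']
print(whole)   # ['cat', 'cat']
print(inner)   # ['cat']
```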

Result:

A sus file containing data from the text.

Examples:

See Section [*].


Susan Hert
2002-08-29