Sed resources

2004-02-12

Index

Introduction
From ed to sed
sed, the only esoteric language actually used?
Cheap-sed
Features
License
Details
Limitations
Known bugs
Download
sed scripts
An unlambda interpreter
Sedcheck, a sed verifier
Links

Introduction

From `ed` to `sed`

sed is a stream editor, dating back to the time when computers were operated using typewriters. At that time, editing a file was a complex task, because there was no screen to display the file on, and text editors did not really look like the modern editors with windows and a mouse (because none existed). For instance, to fix a typo on line 23 of a file, one had to give the editor commands such as: go to line 23 and print that line; replace "ediotr" by "editor"; save the file; quit.

Back around 1970, you would have had to type the following commands to fix the file using ed, the (only?) text editor available on UNIX at that time (or "the" editor, as it was described):

                           (comments included in brackets)
67                         (ed says there are 67 lines)
23                         (go to line 23)
and text ediotrs did not   (the line printed)
s/ediotr/editor/           (fix the line)
p                          (print the line)
and text editors did not   (the line printed)
w                          (save the file)
67                         (still 67 lines)
q                          (quit ed)

As you can see, editing files was quite like programming, and in fact one could write the ed commands in a file once and for all, and then execute these commands on other files. Someone realised that for the great majority of such applications it was possible to avoid ed's limitations (such as the need to load the entire file in memory), and this gave rise to sed, the stream editor.
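As a rough illustration, the ed session above boils down to a single sed command. The sketch below assumes a made-up 67-line input produced by a hypothetical `make_input` helper; `fix_line23` is likewise just a name chosen for this example. Unlike ed, sed writes the result to standard output and never holds the whole file in memory.

```shell
# make_input: hypothetical stand-in for the file being edited
# (67 lines, with the typo on line 23).
make_input() {
    awk 'BEGIN {
        for (i = 1; i <= 67; i++)
            print (i == 23 ? "and text ediotrs did not" : "line " i)
    }'
}

# fix_line23: the address "23" restricts the substitution to line 23;
# all other lines are copied through unchanged.
fix_line23() {
    sed '23s/ediotr/editor/'
}

# Print only the corrected line.
make_input | fix_line23 | sed -n '23p'
```

On a real file, the same command would be redirected to a new file, since standard sed does not edit in place.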

`sed`, the only esoteric language actually used?

I like sed because it is small, powerful, and present everywhere (almost). But I also like it because the language is rather funny. It has a simple syntax which is not designed to be esoteric, but which actually looks like random characters to the layman. For instance, here is a moderately complicated way to increment a number in sed:

s/$/i/
:inc
s/.i/&0123456789i0_/
s/^i/1/
s/\(.\)i[^_]*\1\(i*.\)[^_]*_/\2/
t inc

and here is a really nice way to increment a number:

s/$/+012345678910999000990090/
s/\(.\)\(9*\)+[^9]*\1\(.0*\).*\2\(0*\).*/\3\4/

Clear, huh? It's so simple when compared to any 'i++' in other programming languages :-)
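To try both scripts, they can be wrapped in shell functions as sketched below. The function names `inc_loop` and `inc_table` are only illustrative; each command is passed with its own -e option so that the label ends cleanly at the end of its chunk.

```shell
# inc_loop: the looping version. A marker "i" walks leftwards through
# the digits, carrying as long as it meets a 9.
inc_loop() {
    printf '%s\n' "$1" | sed \
        -e 's/$/i/' \
        -e ':inc' \
        -e 's/.i/&0123456789i0_/' \
        -e 's/^i/1/' \
        -e 's/\(.\)i[^_]*\1\(i*.\)[^_]*_/\2/' \
        -e 't inc'
}

# inc_table: the two-substitution version. A lookup table is appended
# to the line and back-references pick the successor digit and zeros.
inc_table() {
    printf '%s\n' "$1" | sed \
        -e 's/$/+012345678910999000990090/' \
        -e 's/\(.\)\(9*\)+[^9]*\1\(.0*\).*\2\(0*\).*/\3\4/'
}

inc_loop 129     # prints 130
inc_table 999    # prints 1000
```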

sed operates by executing commands (essentially, search/replace) on a file which is read line by line; another buffer, called the hold buffer, can keep information between lines. With these two buffers (the current line and the hold buffer) and a handful of commands, sed has remained a very common tool for more than thirty years.
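A classic way to see the two buffers at work is to reverse the order of the lines of a file. This is only a small sketch, and the name `reverse_lines` is made up for the example.

```shell
# reverse_lines: print the input with its lines in reverse order.
#   1!G  on every line but the first, append the hold buffer to the
#        current line
#   h    copy the result back to the hold buffer
#   $p   on the last line, print what has been accumulated
reverse_lines() {
    sed -n '1!G;h;$p'
}

printf '%s\n' one two three | reverse_lines
# prints:
# three
# two
# one
```

The hold buffer ends up containing the whole file here, so this trick gives up the "stream" advantage; it is meant to show the mechanism, not to be efficient.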

This page holds a sed implementation I'm maintaining, and various sed scripts I wrote. I try to stick to standard POSIX sed (see the links).

Cheap-sed

Cheap-sed is based on HHsed (1991, by Howard L. Helman and David Kirschbaum), itself based on the small sed written around 1988 by Eric S. Raymond (whose last version was called sed-v1.3).

Here is how the original author described the ancestor of this sed:

This is a smaller, cheaper, faster SED utility. Minix uses it. GNU used to use it, until they built their own sed around an extended (some would say over-extended) regexp package.

I chose to maintain Cheap-sed because I was seduced by its small size and the clarity of its source code, and impressed by its speed, especially when compared to GNU sed. However, good old sed v1.3 was rather obsolete, and a number of non-standard extensions had been added to HHsed. I decided to start from HHsed, remove all misfeatures and add everything that was missing for the best POSIX compliance achievable. I named the result "cheap-sed" from Eric S. Raymond's description, because "small-sed" abbreviates to ssed, which is also the abbreviation of super-sed, another sed implementation.

Features

Cheap-sed inherited its small size and speed from its ancestors, and I strove to add the maximum POSIX compliance on top of that. There are also no more size limitations (well, almost none at the current time; ultimately there should be virtually none).

Thanks to these features Cheap-sed is a good sed implementation to run complex and demanding scripts, especially those sed scripts written more for the sake of writing them in sed than for any practical purpose :-). As an example, esoteric sed scripts such as dc.sed or unlambda.sed run between 10 and 60 times faster on cheap-sed than on recent GNU sed.

License

cheap-sed is distributed under the GPL version 2 or later at your option.

It seems to me that doing this is consistent with:

the fact that sed-1.3 itself was released under the GPL in 1998;
the wishes of authors of HHsed:

You, Dear Reader, may do *anything* you wish with it except steal it.
Copyright (c) 1991 Eric S. Raymond, David P Kirschbaum & Howard L. Helman All Rights Reserved

Details

The following extensions to POSIX are currently supported:

'\a', '\b', '\e', '\f', '\n', '\r', '\t', '\v' are synonyms of the corresponding characters in the C language ('\e' meaning ASCII 27, escape) and are recognised anywhere in regular expressions (including in bracket expressions), in the right-hand side of substitutions, on both sides of the transliteration command ('y///') and in the text argument of the 'a\', 'c\' and 'i\' commands
(but note that other backslashes in bracket expressions are treated as literal backslashes; as an example, '[\z]' matches a backslash or a 'z', and '[n\]' and '[[.\.]n]' both match a backslash or an 'n');
'\xdd', where dd are two hexadecimal digits, is a synonym of the verbatim ASCII character number 0xdd;
'\+' and '\?' are synonyms of '\{1,\}' and '\{0,1\}' respectively;
'\<' and '\>' match the beginning and end of words respectively.
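Scripts that must also run on other sed implementations can usually avoid these extensions. The sketch below uses only POSIX constructs; the three function names are hypothetical and exist only for the example.

```shell
# squeeze_a: '\{1,\}' is the portable spelling of '\+'.
squeeze_a() {
    sed 's/a\{1,\}/A/g'
}

# match_colour: '\{0,1\}' is the portable spelling of '\?'.
match_colour() {
    sed -n '/^colou\{0,1\}r$/p'
}

# tab_to_space: instead of '\t', let the shell produce a literal tab
# and splice it into the sed script.
tab_to_space() {
    tab=$(printf '\t')
    sed "s/$tab/ /g"
}

printf 'baaad\n' | squeeze_a                          # prints bAd
printf '%s\n' color colour colouur | match_colour     # prints color, colour
printf 'a\tb\n' | tab_to_space                        # prints a b
```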

POSIX-related implementation notes:

collating symbols '[.x.]' and equivalence classes '[=x=]' are only supported for single ASCII characters x, and in the POSIX locale only;
character class expressions '[:name:]' are implemented in the POSIX locale only;
anchors are not recognised inside sub-expressions. For instance, '\(^a\)' is equivalent to '\(\^a\)', not '^\(a\)';
in case of multiple '*' and intervals (undefined behaviour according to POSIX), the second multiplier is interpreted as literal characters. For instance, 'a**' is interpreted as 'a*\*' and 'a\{2\}\{3\}' as 'aa{3}'.

Limitations

I'm currently removing as many limitations as I can. Still, the current version has the following ones:

no more than 20 levels of { ... } nesting;
fewer than 10 output files (those written by the 'w' command);
numbers n and m in multipliers '\{n\}', '\{n,\}' and '\{n,m\}' must be lower than 32767;
matched \( ... \), when repeated, must be less than 32767 bytes away in compiled form.

Known bugs

Only up to nine subexpressions \(...\) are currently supported. An indefinite number of them should be supported (but note that only the first nine of them may be recalled using "\1" to "\9").
A back-reference "\n" (n from 1 to 9) will have a wrong value if it refers to a subexpression for which backtracking occurred. For example:
echo abcada | sed 's/\(\(a\([cd]\)b\)*\)\{2\}/'
reports "d" instead of the correct answer "c".
csed does not strictly implement the leftmost, longest matching rule mandated by POSIX. Instead, each part of the regular expression is tried from left to right for the longest match. This is a bit hard to explain in words, but here is an example:
echo "aaabaaa" | sed 's/a*\(a*\)b\1/<&>/'
outputs "<aaab>aaa" in csed (as most other sed implementations do?), whereas according to POSIX it should output "<aaabaaa>" (as GNU sed 4.0 does).
(Actually I've not yet decided if it is a bug or a feature: I don't currently know if it can be implemented at a reasonable cost.)

Bug reports are welcome by mail at my email address (at the bottom of this page).

Download

cheap-sed is still in active development. Archives named with a version number "csed-x.y.z.tgz" are stable, official versions. Archives simply dated "csed-yymmdd.tgz" contain interim or beta versions.

The first stable version will probably be version 1.6.0 (to represent the progress made since HHsed, also known as sed 1.5).

csed-dos16-040212.zip (29 kb), an experimental DOS 16bit build of csed (using TC2.01). The test suite seems to run fine, but I am not able to test it on a real 16bit computer. Please report any success or failure with this version;
csed-030913.tgz (69 kb), most recent source-code archive, and csed-030913.zip (51 kb), a Win32 binary distribution;
csed-030816.zip (44 kb), a windows binary distribution (with source code, but without the test suite, as it needs so many unix tools to run that whoever could run the test suite could also compile the source code);
csed-030815.tgz (60 kb), the source-code archive of the first published beta cheap-sed.

`sed` scripts

An unlambda interpreter

This is an interpreter for unlambda version 1. Unlambda is an esoteric language invented by David Madore in 1999. You should really refer to the unlambda page for the complete description of the language, including a tutorial and other things; nevertheless, there is also a sketchy description of unlambda in the comments of the sed script itself.

The sed script itself is here: unlambda.sed, and a colorized version can be found at http://sed.sf.net.

Sedcheck, a sed verifier

Have you ever written a sed script and given it to someone who came back later complaining about the dreadful "sed: command garbled" error? If so, sedcheck is for you :-)

Sedcheck is a utility that checks sed scripts for various portability issues (mostly POSIX conformance). Sedcheck is itself written in sed.

The sed script is here: sedcheck.sed (9 kb), and the full beta 1.0.0 distribution (including a kind of test suite) is here: sedcheck-030912.tgz (4 kb) or sedcheck-030912.zip (4 kb).

As an example, here is what the current version of sedcheck reports on add_decs.sed (from http://sed.sf.net/):

line 13: <blank>s not recommended here
: loop 
 ^
line 13: avoid <blank>s in labels
: loop 
      ^
line 14: no <blank>s allowed after "!"
/^--[^a]/ ! {
           ^
line 25: "\?" undefined (use "\{0,1\}" instead)
  s/-\(aaaaaaaaa\(a\)\)\?\(a*\)\([0-9b]*;.*\([0-9]\)\3\5\)/-\2\5\4/
                       ^
4 issues reported.

As can be seen in this example, the bad news is that Sedcheck is rather pedantic (the first three warnings reported may not seem very useful or relevant), but this is because I applied (rather strictly) the following sources:

the recent POSIX spec (HTML version).
Sed and Awk, by Dale Dougherty and Arnold Robbins (O'Reilly).

Now the good news is that the overall framework and sed parser work correctly, and it is much easier to remove warnings than it was to add them.

Sedcheck was tested successfully on the whole GNU sed testsuite and on many scripts from the grabbag. Nevertheless some bugs may still be there. Please report bugs and suggestions to me (see my email address below). I'd particularly appreciate it if sed programmers running sed implementations other than GNU sed could tell me whether some of the warnings currently in sedcheck are useless, or whether some other ones should be added.
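For reference, the kinds of constructs flagged in the report above can be written in a form that should raise none of those four warnings. The sketch below is a made-up example, not an excerpt of add_decs.sed, and the function names are hypothetical: the label directly follows the colon and has no trailing blank, "!" is directly followed by "{", and "\{0,1\}" replaces "\?".

```shell
# collapse_a: on lines that do not start with '#', repeatedly replace
# "aa" by "a" until no pair is left.
collapse_a() {
    sed -e '/^#/!{' \
        -e ':loop' \
        -e 's/aa/a/' \
        -e 't loop' \
        -e '}'
}

# strip_sign: remove an optional leading minus sign before a digit.
strip_sign() {
    sed 's/^-\{0,1\}\([0-9]\)/\1/'
}

printf '%s\n' 'baaad' '#aaa' | collapse_a    # prints bad, then #aaa
printf '%s\n' '-5' '7' | strip_sign          # prints 5, then 7
```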