Index
Introduction
From ed to sed
sed, the only esoteric language actually used?
Cheap-sed
Features
License
Details
Limitations
Known bugs
Download
sed scripts
An unlambda interpreter
Sedcheck, a sed verifier
Links
Introduction
From ed to sed
sed is a stream editor, dating back to the time when
computers were operated using typewriters. At that time, editing
a file was a complex task, because there was no screen to display
the file on, and text editors did not really look like the modern
editors with windows and a mouse (because none existed). For instance,
to fix a typo on line 23 of a file, one had to give the editor
commands such as: go to line 23 and print that line;
replace "ediotr" by "editor"; save the file; quit.
Back around 1970, you would have had to type the following commands
to fix the file using ed, the (only?) text editor available on UNIX at that
time (or "the" editor, as it was described). In this session the user types
23, s/ediotr/editor/, p, w and q; the other lines are printed by ed
(comments are in brackets):
67                         (ed says there are 67 lines)
23                         (go to line 23)
and text ediotrs did not   (the line printed)
s/ediotr/editor/           (fix the line)
p                          (print the line)
and text editors did not   (the line printed)
w                          (save the file)
67                         (still 67 lines)
q                          (quit ed)
As you can see, editing files was quite like programming, and in fact
one could write the ed commands in a file once and for all, and
then execute these commands on other files. Someone realised that
for the great majority of such applications it was possible to avoid
the limitations of ed (such as the need to load the entire file in memory),
and this gave sed, the stream editor.
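For comparison, the same fix can be sketched as a single sed command. This is only an illustration: the three-line sample text is made up, so the typo sits on line 3 rather than line 23, and the helper name fix_line3 is hypothetical.

```shell
# Hypothetical helper: repair the typo on line 3 of whatever arrives on stdin.
# The address "3" plays the role of "go to line 23" in the ed session,
# and the result is written to stdout instead of being saved with "w".
fix_line3() {
    sed '3s/ediotr/editor/'
}

# Made-up sample input; only the third line contains the typo.
printf '%s\n' 'first line' 'second line' 'and text ediotrs did not' | fix_line3
```

Unlike ed, nothing is loaded as a whole: each line is read, possibly edited, and printed before the next one is read.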
sed, the only esoteric language actually used?
I like sed because it is small, powerful,
and present everywhere (almost).
But I also like it because the language is rather funny. It
has a simple syntax which is not designed to be esoteric, but which
actually looks like random characters to the layman. For instance,
here is a moderately complicated way to increment a number in sed:
s/$/i/
:inc
s/.i/&0123456789i0_/
s/^i/1/
s/\(.\)i[^_]*\1\(i*.\)[^_]*_/\2/
t inc
and here is a really nice way to increment a number:
s/$/+012345678910999000990090/
s/\(.\)\(9*\)+[^9]*\1\(.0*\).*\2\(0*\).*/\3\4/
Clear, huh? It's so simple when compared to any 'i++'
in other programming languages :-)
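To try the second script from a shell, one possible wrapping is sketched below. The function name incr is made up for this example; the two sed commands are the ones given above, each passed with -e.

```shell
# Hypothetical wrapper around the two-command increment script.
# First command: append "+" and a lookup string to the number.
# Second command: use back-references into that lookup string to replace
# the last non-9 digit by its successor and the trailing 9s by 0s.
incr() {
    printf '%s\n' "$1" | sed \
        -e 's/$/+012345678910999000990090/' \
        -e 's/\(.\)\(9*\)+[^9]*\1\(.0*\).*\2\(0*\).*/\3\4/'
}

incr 41     # prints 42
incr 199    # prints 200
```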
sed operates by executing commands (essentially, search/replace)
on a file which is read line by line; another buffer, called the hold buffer,
can keep information between lines. With these two buffers (the current line
and the hold buffer) and a handful of commands, sed has remained a
very common tool for more than thirty years.
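As a small illustration of how the two buffers cooperate, here is a classic sketch (not taken from the scripts on this page) that prints its input with the lines in reverse order. The function name reverse_lines is hypothetical.

```shell
# Hypothetical helper: print stdin with its lines in reverse order.
#   1!G  on every line except the first, append the hold buffer
#        to the current line (after a newline)
#   h    copy the current line (now "this line + all previous ones")
#        back into the hold buffer
#   $p   on the last line, print what has been accumulated
# -n suppresses the automatic printing of each line.
reverse_lines() {
    sed -n '1!G;h;$p'
}

printf '%s\n' one two three | reverse_lines
```

The current line only ever holds the text being worked on, while the hold buffer carries the accumulated result from one input line to the next.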
This page holds a sed implementation I'm maintaining, and various
sed scripts I wrote. I try to stick to standard, POSIX sed
(see the links).
Cheap-sed
Cheap-sed is based on
HHsed (1991, by Howard L. Helman and David Kirschbaum),
itself based on the small
sed written around 1988 by Eric S. Raymond (whose last
version was called sed -v1.3).
Here is how the original author described the ancestor of this
sed :
This is a smaller, cheaper, faster SED utility. Minix uses it. GNU used
to use it, until they built their own sed around an extended (some would
say over-extended) regexp package.
I chose to maintain Cheap-sed because I was seduced by its small size
and the clarity of its source code, and impressed by its speed,
especially when compared to GNU sed. However, good old sed v1.3 was
rather obsolete, and a number of non-standard extensions had been added to
HHsed. I decided to start from HHsed, remove all
misfeatures and add everything that was missing for the best POSIX compliance
achievable.
I named the result "cheap-sed" after Eric S. Raymond's description, because
"small-sed" would abbreviate to ssed, which is already the abbreviation of
super-sed, another sed implementation.
Features
Cheap-sed inherited its small size and speed from its ancestors,
and I strove to add the maximum POSIX compliance
on top of that. Size limitations are also almost gone (a few remain
at the current time; ultimately there should be virtually none).
Thanks to these features Cheap-sed is a good sed implementation to
run complex and demanding scripts, especially those sed scripts
written more for the sake of writing them in sed than for any practical
purpose :-). As an example, esoteric sed scripts such as dc.sed or
unlambda.sed run between 10 and 60 times faster on
cheap-sed than on recent GNU sed.
Licence
cheap-sed is distributed under the
GPL
version 2 or later at your option.
It seems to me that doing this is consistent with:
- the fact that sed -1.3 itself was released under the GPL in 1998;
- the wishes of the authors of HHsed:
You, Dear Reader, may do *anything* you wish with it except steal it.
Copyright (c) 1991 Eric S. Raymond, David P Kirschbaum & Howard L. Helman
All Rights Reserved
Details
The following extensions to POSIX are currently supported:
- '\a', '\b', '\e', '\f', '\n', '\r', '\t', '\v'
are synonyms of the corresponding
characters in the C language ('\e' meaning ASCII 27, escape) and
are recognised anywhere in regular expressions (including in bracket
expressions), in the right-hand side of substitutions, on both sides of the
transliteration command ('y///') and in the text argument of the
'a\', 'c\' and 'i\' commands
(but note that other backslashes in bracket expressions are treated as
literal backslashes; as an example, '[\z]' matches a backslash
or a 'z', and '[n\]' and '[[.\.]n]' both match a
backslash or an 'n');
- '\xdd', where dd are two hexadecimal digits, is a
synonym of the verbatim ASCII character number 0xdd;
- '\+' and '\?' are synonyms of '\{1,\}' and
'\{0,1\}' respectively;
- '\<' and '\>' match the beginning and end of words
respectively.
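Scripts that must also run on other sed implementations can avoid the '\+' and '\?' extensions by spelling out the POSIX intervals. A minimal sketch follows; the function names and sample strings are made up for the example.

```shell
# POSIX spelling of "one or more": \{1,\} instead of the \+ extension.
squeeze_a() {
    sed 's/a\{1,\}/X/'
}

# POSIX spelling of "zero or one": \{0,1\} instead of the \? extension.
mark_colour() {
    sed 's/colou\{0,1\}r/C/g'
}

echo 'baaad' | squeeze_a            # prints bXd
echo 'color colour' | mark_colour   # prints C C
```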
POSIX-related implementation notes:
- collating symbols '[.x.]' and equivalence classes
'[=x=]' are only supported for single ASCII
characters x, and in the POSIX locale only;
- character class expressions '[:name:]' are
implemented in the POSIX locale only;
- sub-expressions are not anchored. For instance, '/\(^a\)/' is
equivalent to '\(\^a\)', not '^\(a\)';
- in case of multiple '*' and intervals (undefined behaviour
according to POSIX), the second multiplier is interpreted as literal characters.
For instance, 'a**' is interpreted as 'a*\*' and
'a\{2\}\{3\}' as 'aa{3}'.
Limitations
I'm currently removing as many limitations as I can. Still, the current
version has the following limits:
- no more than 20 levels of { ... } nesting;
- fewer than 10 writeout files;
- numbers n and m in multipliers '\{n\}', '\{n,\}' and '\{n,m\}'
must be lower than 32767;
- matched \( ... \), when repeated, must be less than
32767 bytes away in compiled form.
Known bugs
- Only up to nine subexpressions \( ... \) are
currently supported. An indefinite number of them should be supported
(but note that only the first nine of them may be recalled using
"\1" to "\9").
- A back-reference \n (where n is a digit) will have a wrong value if it
refers to a subexpression for which backtracking occurred. For example:
echo abcada | sed 's/\(\(a\([cd]\)b\)*\)\{2\}/'
reports "d" instead of the correct answer "c".
- csed does not strictly implement the leftmost, longest
matching rule mandated by POSIX. Instead, each part of the regular
expression is tried from left to right for the longest match. This
is a bit hard to explain in words, but here is an example:
echo "aaabaaa" | sed 's/a*\(a*\)b\1/<&>/'
outputs "<aaab>aaa" in csed (as most other sed
implementations do?), whereas according to POSIX
it should output "<aaabaaa>" (as GNU sed 4.0 does).
(Actually I've not yet decided if it is a bug or a feature:
I don't currently know if it can be implemented at a reasonable cost.)
Bug reports are welcome by mail at my email address (at the bottom of this
page).
Download
cheap-sed is still in active development.
Archives named with a version number,
"csed-x.y.z.tgz", are stable, official versions. Archives simply dated
"csed-yymmdd.tgz" contain interim or beta versions.
The first stable version will probably be version 1.6.0 (to represent the
progress made since HHsed, also known as sed 1.5).
- csed-dos16-040212.zip (29 kb),
an experimental DOS 16-bit build of csed (using TC2.01). The test suite
seems to run fine, but I am not able to test it on a real
16-bit computer. Please report any success or failure with this version.
- csed-030913.tgz (69 kb), the most recent
source-code archive, and csed-030913.zip (51 kb), a Win32 binary distribution;
- csed-030816.zip (44 kb), a Windows binary
distribution (with source code, but without the test suite, as it needs
so many unix tools to run that whoever could run the test suite could
also compile the source code);
- csed-030815.tgz (60 kb), the source-code
archive of the first published beta of cheap-sed.
sed scripts
An unlambda interpreter
This is an interpreter for unlambda version 1.
Unlambda is an esoteric language invented by David Madore in 1999.
You should really refer to the unlambda page for the complete description
of the language, including a tutorial and other things;
nevertheless, there is also a sketchy description of unlambda in the
comments of the sed script itself.
The sed script itself is here: unlambda.sed,
and a colorized version can be found at http://sed.sf.net.
Sedcheck, a sed verifier
Have you ever written a sed script and given it to someone who came
back later complaining about the dreadful "sed: command garbled" error?
If so, sedcheck is for you :-)
Sedcheck is a utility checking various portability issues (mostly POSIX
conformance) in sed scripts. Sedcheck is itself written in sed.
The sed script is here: sedcheck.sed (9 kb), and the full beta 1.0.0 distribution (including a kind of
test suite) is here: sedcheck-030912.tgz (4 kb) or sedcheck-030912.zip (4 kb).
As an example, here is what the current version of sedcheck reports
on add_decs.sed (from http://sed.sf.net/):
line 13: <blank>s not recommended here
: loop
^
line 13: avoid <blank>s in labels
: loop
^
line 14: no <blank>s allowed after "!"
/^--[^a]/ ! {
^
line 25: "\?" undefined (use "\{0,1\}" instead)
s/-\(aaaaaaaaa\(a\)\)\?\(a*\)\([0-9b]*;.*\([0-9]\)\3\5\)/-\2\5\4/
^
4 issues reported.
As can be seen in this example, the bad news is that Sedcheck
is rather pedantic (the first three warnings reported may not seem
very useful or relevant), but this is because I applied (rather
strictly) the following sources:
- the recent POSIX spec (HTML version).
- Sed and Awk, by Dale Dougherty and Arnold Robbins
(O'Reilly).
Now the good news is that the overall framework and sed parser
work correctly, and it is much easier to remove warnings than it
was to add them.
Sedcheck was tested successfully on the whole GNU sed testsuite
and on many scripts from the grabbag. Nevertheless some bugs may still be there. Please report
bugs and suggestions to me (see my email address below). I'd
appreciate it particularly if sed programmers running other
sed implementations than GNU sed could tell me whether some of the
warnings currently in sedcheck are useless, or whether some other
ones should be added.
Links
- sed.sf.net, a hoard of sed
resources. Be sure to check the grabbag on the same site. These sites really contain a lot
(all?) of sed resources.
- http://catb.org/~esr/software.html,
Eric S. Raymond's software page containing
sed -v1.3 under GPL.
- the sed and regular expressions pages from
the Single UNIX Specification (POSIX). Thanks go to the Open Group for kindly
making them available in HTML form free of charge.
- the mailing list <sed-users@yahoo-groups.com>