AwkChannelWiki: Sed FAQ

Content

I have a line like "abdcgfjeuPATTERNfjfhghj", I want to get the PATTERN part, why isn't sed 's/\(PATTERN\)/\1/' working? I get the input line unchanged!
I'm doing echo 'foobar' | sed 's/b*/ZZ/' but I'm not getting the expected result! (I want fooZZar, but I get ZZfoobar)
But I've read that regular expressions are greedy!
Why shouldn't I use sed to parse *ML?
Can sed edit a file "in place"?
How do I extract or "pull out" all the occurrences of PATTERN?
How do I pull out the first N occurrences of PATTERN in the line/in the file?
How do I extract/delete everything between PATTERN1 and PATTERN2?
How do I extract/delete everything between occurrences N and M of PATTERN?
How do I replace arbitrary occurrences of a pattern?
How do I do any of the above things on the file as a whole?
How do I reverse the characters in a line?
How do I have sed act on a file depending on the contents of another file?
I have lots of slashes in my pattern and/or replacement!
How do I do replacements in blocks of unknown length, that could potentially span multiple lines?
I have a list of strings, how do I find the longest common prefix?

I have a line like "abdcgfjeuPATTERNfjfhghj", I want to get the PATTERN part, why isn't sed 's/\(PATTERN\)/\1/' working? I get the input line unchanged!

If you had the string "foo123bar", and did sed 's/123/123/', would this give you only the 123? No, right? Now, to extract your PATTERN, you are using code that produces an equivalent result, and of course it does not work. It just replaces, within the line, PATTERN with itself. Then, sed prints what's in the pattern space, which is still the whole line. See the pulling out FAQ to learn how to do what you want.
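A quick way to see the difference, as a small sketch (the variable names are only for illustration):

```shell
line='abdcgfjeuPATTERNfjfhghj'

# Replacing PATTERN with itself leaves the whole line in the pattern space.
unchanged=$(printf '%s\n' "$line" | sed 's/\(PATTERN\)/\1/')

# Matching the text around PATTERN as well, and replacing all of it
# with the captured group, is what actually isolates PATTERN.
extracted=$(printf '%s\n' "$line" | sed -n 's/.*\(PATTERN\).*/\1/p')

printf '%s\n%s\n' "$unchanged" "$extracted"
```

The first command prints the input line as it was; the second prints only PATTERN.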

I'm doing echo 'foobar' | sed 's/b*/ZZ/' but I'm not getting the expected result! (I want fooZZar, but I get ZZfoobar)

Regular expressions that match 0 or more occurrences of something can match the empty string "". Since the empty string has length 0, conventionally it matches "before" the beginning of the string, "after" the end of the string, and between any two characters in the string (if it cannot match a longer string, of course). /b*/ just happens to be one such expression, and as such it matches at the very beginning of the string. Since the regex engine does find a match, it's happy with that and sed goes ahead with the replacement. Try running the same command with the /g switch and you'll see all the places where sed thinks your expression is matching.

But I've read that regular expressions are greedy!

They surely are greedy, but generally they also stop at the first match they are able to find, moving from left to right. Once a matching position is found, greediness comes into play in case many matches of different lengths are possible starting at that same position.

Had the input been "bbboobar", then greediness would have mattered, because at the same position (beginning of the string), four matches would be possible for /b*/: the empty string "" (length 0), "b" (length 1), "bb" (length 2), "bbb" (length 3). Here greediness causes the longest match (bbb) to be used. But since "foobar" does not start with b, then /b*/ just matches the empty string at the beginning, and there is no alternative between different match lengths: only a match of length 0 is possible, and that match is used, without looking further in the string.

In all these cases, what you probably wanted is

sed 's/b/ZZ/'

sed 's/b\{1,\}/ZZ/'    # some seds also support s/b\+/ZZ/
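The three situations discussed above can be put side by side like this (a sketch; the variable names are made up for the example):

```shell
# "foobar" does not start with b, so b* matches the empty string at position 0.
empty_match=$(echo 'foobar' | sed 's/b*/ZZ/')

# "bbboobar" starts with b's, so greediness picks the longest match at position 0.
greedy=$(echo 'bbboobar' | sed 's/b*/ZZ/')

# Requiring at least one b forces sed to look further into the line.
one_or_more=$(echo 'foobar' | sed 's/b\{1,\}/ZZ/')

printf '%s\n' "$empty_match" "$greedy" "$one_or_more"
```

This should print ZZfoobar, ZZoobar and fooZZar respectively.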

Why shouldn't I use sed to parse *ML?

Short answer: you need a real parser to parse *ML effectively.

Long answer: sed (and awk, and grep, any other non-parsing tool that processes text) can only work based on the lexical structure of the input. Interesting bits are recognized using patterns (regular expressions) only. *ML documents, on the other hand, can be formatted rather freely, and recognizing the interesting bits using pattern matching only can rapidly become very hard or next to impossible. The simplest example is a tag like <tag>....</tag>, which can appear in the input like this:

<tag>.....</tag>

or this:

<tag>.....
</tag>

or this:

<  tag >
.....</tag  >

or this:

<TAG>
.....
</tag>

or this:

<outertag><  tAg  attribute1="foobar"   attribute2="blahblah"    >
<subtag>
  ...
..
</subtag>....</TaG></outertag>

or yet some other variation. You see that it's very difficult (to say the least) to write a regular expression that can take care of all the above formats to match the tag. A true parser, on the other hand, will be able to recognize the tag regardless of its textual format. Well, one could argue that in the end a parser is just a tool that uses patterns in a very sophisticated way. Ok, so if you want to go ahead with pattern matching, then do that, but be aware that (to be really thorough) you will end up writing something very similar to a real parser anyway. So, since parsers for any language already exist and work just fine, it may be better to use one of the existing ones (unless you want to undergo a learning experience and write things from scratch, which is fine too).

However, a scenario where using normal regex-based text-processing tools like sed, awk or grep to process *ML could work fine is when you know that the format of the input is fixed, and you can be sure that the interesting bits always are in the same place, and have the same format. This may be true, for example, for database-like records of XML data, or some other computer-generated markup.

Can sed edit a file "in place"?

Certain versions of sed can edit "in place" (well, that's not really what they do, although that's the impression they should give to the user). For those seds that support it, this feature is enabled using the switch -i, optionally followed by a string that will be used as extension to create a backup copy of the file before modification. For example,

sed -i.bak 's/foo/bar/g' *.txt

will edit all the files that match *.txt, but after sed is done you'll find an equal number of .txt.bak files created by sed that are copies of the original unchanged files. Note that some versions of sed require that an extension be specified with -i.

As stated, sed does not really edit in place. See what GNU sed does (taken from the info page):

-i
     This option specifies that files are to be edited in-place.  GNU
     `sed' does this by creating a temporary file and sending output to
     this file rather than to the standard output.

     ...

     When the end of the file is reached, the temporary file is renamed
     to the output file's original name.  The extension, if supplied,
     is used to modify the name of the old file before renaming the
     temporary file, thereby making a backup copy).
     
     ...

     If no extension is supplied, the original file is overwritten
     without making a backup.

That implies that you should have some free space on disk, at least the size of the original file. Use -i with caution, and in any case always supply a backup extension. Hardware or OS are not the only things that can go wrong; you might write incorrect sed code by mistake (shit happens, you know), and end up with 1000 changed files and no backup.
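As a small illustration of why the backup extension is worth it, here is a sketch that works in a throwaway directory (the directory and file names are invented for the example, and a sed that supports -i with an attached extension is assumed):

```shell
tmpdir=$(mktemp -d)
printf 'foo and foo\n' > "$tmpdir/a.txt"

# Edit "in place", keeping the original as a.txt.bak
sed -i.bak 's/foo/bar/g' "$tmpdir"/*.txt

edited=$(cat "$tmpdir/a.txt")        # the modified file
backup=$(cat "$tmpdir/a.txt.bak")    # the untouched copy made by sed

# If the sed code turned out to be wrong, the original could be restored with:
#   mv "$tmpdir/a.txt.bak" "$tmpdir/a.txt"
printf '%s\n%s\n' "$edited" "$backup"
rm -rf "$tmpdir"
```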

TODO add something about ed

How do I extract or "pull out" all the occurrences of PATTERN?

This problem can have different degrees of complexity depending on what PATTERN looks like and how many times it can appear on a line.

FOREWORD: when pulling out things, you may want to add a check in your program to make sure that the line you're acting on does indeed have at least one occurrence of whatever you want to pull out. Otherwise, lines with no occurrences of PATTERN will be left unchanged and printed, which is probably not what you want. Alternatively, you can modify the code below such that sed -n is used, and interesting bits are explicitly printed after the replacement (adding some /p switches at the end of the right "s" commands).

PATTERN appears only once

Let's start from the simplest case, where PATTERN can appear at most once per line. So, pulling out the matching substring is easily accomplished with this code:

sed 's/.*\(PATTERN\).*/\1/'

# or, if the pattern isn't on all lines

sed -n 's/.*\(PATTERN\).*/\1/p'

Now, you could have a situation where PATTERN appears more than once, but you are interested at most in a specific occurrence of it. For example, this text:

blah blah blah and he said: "blah" blah blah blah

but you are only interested in getting what's between the double quotes. So, if you know the surrounding context of the occurrence you want, you can still use the above code, but supplying context so sed knows what you mean:

sed 's/.*"\(blah\)".*/\1/'

# or, if the pattern isn't on all lines
sed -n 's/.*"\(blah\)".*/\1/p'

sed 's/.*"\([^"]*\)".*/\1/'

# or, if the pattern isn't on all lines
sed -n 's/.*"\([^"]*\)".*/\1/p'

or even (if you know more context)

sed 's/.*he said: "\([^"]*\)".*/\1/'

# or, if the pattern isn't on all lines
sed -n 's/.*he said: "\([^"]*\)".*/\1/p'

Each of these techniques helps sed match exactly what you want, and not something else in the same line that might also match but is not the string you want.

However, we are still in the case of at most ONE occurrence of PATTERN per line. "Occurrence" here means "identifiable occurrence" (either because it's truly unique, or because context can be supplied to uniquely identify it).

PATTERN occurs more than once (with some constraints)

If you have multiple occurrences with no context to identify them, and you want to extract them all, then again you have to check the various cases and see if the format of the input can be exploited to your advantage, since generally speaking the problem is not trivial in sed.

Let's start from the easy case: if PATTERN can occur multiple times, but PATTERN is a single character (assuming that makes sense for your problem), you can do something like

sed 's/[^c]//g'

where "c" is the character, and remove all non-c characters, resulting only in the "c" being pulled out (although that will most likely mean a string of "c"s is what's left).

If PATTERN is an arbitrary regexp, but occurs a known, fixed number of times per line, then you can do this (example for a pattern that occurs three times):

sed 's/.*\(PATTERN\).*\(PATTERN\).*\(PATTERN\).*/\1\2\3/'

but that becomes rapidly unmanageable, and furthermore you can't do that if PATTERN occurs more than nine times since many seds cannot handle backreferences greater than \9.

PATTERN occurs more than once (no constraints)

So now for the general solution. If PATTERN is an arbitrary regexp and occurs an unknown number of times, then, to solve the problem with sed only, you need to do the following:

* choose a character that does NOT appear in your input, let's say "_" (an underscore is likely to appear in real inputs, but we'll use it here for clarity; for real problems, pick some symbol or unusual character that is guaranteed not to occur in the input. A good choice is \n, since sed guarantees that it never appears in a line it has just read);

* change each occurrence of PATTERN to _PATTERN_;

* delete everything before the very first "_" and after the last "_";

* delete everything between each pair of "_" and "_" (that is, between occurrences of PATTERN)

In sed code, that would be something like this:

s/PATTERN/_&_/g   # change each PATTERN to _PATTERN_
s/^[^_]*_//       # delete what is before the first PATTERN
s/_[^_]*$//       # delete what is after the last PATTERN
s/_[^_]*_//g      # remove everything "between" PATTERNs

That will leave all the occurrences concatenated together on a single line; if you want to separate them (for example, with a newline character or an underscore), you can modify the last line as follows:

s/_[^_]*_/\n/g  # remove everything "between" PATTERNs and replace it with \n

Another way to solve the problem is as follows:

s/PATTERN/_&/g           # change each PATTERN to _PATTERN
s/^[^_]*_//              # delete what is before the first PATTERN
s/\(PATTERN\)[^_]*/\1/g  # leave only "_" between PATTERNs: PATTERN_PATTERN_PATTERN
s/_//g                   # remove the "_"s (or change them to whatever you want)

However, my opinion is that if you find that you need to use the above code too often, you probably want to turn to other tools like awk or Perl.
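To make the first recipe concrete, here is a sketch that pulls all the numbers out of a sample line. The sample line and the choice of [0-9][0-9]* as PATTERN are assumptions made for the example; "_" is used as the marker because it does not occur in this particular input, and the occurrences are separated with a space instead of being glued together.

```shell
line='foo 12 bar 345 baz 6'

# 1. wrap each number in underscores
# 2. drop what precedes the first marker
# 3. drop what follows the last marker
# 4. replace each "between" stretch with a single space
numbers=$(printf '%s\n' "$line" | sed \
  -e 's/[0-9][0-9]*/_&_/g' \
  -e 's/^[^_]*_//' \
  -e 's/_[^_]*$//' \
  -e 's/_[^_]*_/ /g')

printf '%s\n' "$numbers"
```

With the sample line this should print "12 345 6".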

How do I pull out the first N occurrences of PATTERN in the line/in the file?

First N occurrences of PATTERN in the line

This can be done by deleting everything from the (N+1)th occurrence to end, and then using the technique described in the previous section to pull out the remaining occurrences (thus effectively pulling out the first N occurrences). An example with N=3:

s/PATTERN/&/3           # make sure we have at least 3 occurrences
t ok
b                       # less than 3 occurrences
:ok
s/PATTERN/\n/4          # turn 4th occurrence into \n
s/\n.*//                # delete everything from \n to end of line

# from here proceed as in the above section
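Putting the truncation step and the pull-out step together, a complete run might look like the sketch below. It assumes GNU sed (for \n in the replacement), uses [0-9] as a stand-in for PATTERN and "_" as the pull-out marker, and the sample line is invented.

```shell
line='a1b2c3d4e5'

first_three=$(printf '%s\n' "$line" | sed \
  -e 's/[0-9]/&/3' -e 't ok' -e 'b' -e ':ok' \
  -e 's/[0-9]/\n/4' -e 's/\n.*//' \
  -e 's/[0-9]/_&_/g' -e 's/^[^_]*_//' -e 's/_[^_]*$//' -e 's/_[^_]*_//g')

printf '%s\n' "$first_three"
```

The first row of expressions checks for at least three occurrences, the second cuts the line at the fourth one, and the third is the pull-out code from the previous section. For the sample line the result should be 123.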

First N occurrences of PATTERN in the file

This is about a way to emulate the output of

grep -o PATTERN file | head -n N

with sed. A way is to use the technique described above but applying it to the whole file; this requires slurping the file in memory beforehand and choosing a suitable marker (you can't use \n in that case; a good choice is some ASCII control character like \x1; see this other FAQ for more information).

However, here is another method that does not require slurping the file in memory, that is, it operates line-by-line. As above, we assume N=3 for the example. The idea is to pull out one occurrence at a time from lines that have matches, and print it. Then, record in the hold space that we have pulled out another occurrence, by appending a "1" there. When we have a string of N "1"s in the hold space, that means we have pulled out the required number of occurrences, and we quit. The code is meant to be run with sed -n; otherwise the final "q" would also print the string of "1"s that is in the pattern space at that point. It's easier to look at the code:

:start
s/PATTERN/\n&\n/   # try to isolate an occurrence of PATTERN
t ok
d                  # if failure, delete the line
:ok
s/^[^\n]*\n//      # remove garbage before the occurrence
P                  # print occurrence
s/^[^\n]*\n//      # delete it

x                  # record the fact in the hold space
s/$/1/
/111/q             # exit if we have three "1"s
x

t start            # try to pull out next occurrence in this line, if any

Note that in the last line we use "t" to jump to the beginning instead of "b". This is because of the way "t" is defined:

     Branch to LABEL only if there has been a successful `s'ubstitution
     since the last input line was read or conditional branch was taken.

In our case, if we used "b", the "t ok" at the beginning would succeed even if the "s" itself did not replace anything, because the above conditions would be satisfied. Using "t" to branch to the beginning, on the other hand (which will always succeed because we just did a successful replacement, so it's effectively like having "b" there), ensures that the "t ok" at the beginning fails if the "s" does not replace anything, thus correctly executing the "d".
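Here is the same program applied to a small sample input, as a sketch. GNU sed is assumed (for \n in the replacement and in bracket expressions), [0-9] stands in for PATTERN, and the input lines are made up. Note the -n, so that only what "P" prints shows up in the output.

```shell
first3=$(printf 'x1y2\nnone\n3z4\n5\n' | sed -n \
  -e ':start' -e 's/[0-9]/\n&\n/' -e 't ok' -e 'd' -e ':ok' \
  -e 's/^[^\n]*\n//' -e 'P' -e 's/^[^\n]*\n//' \
  -e 'x' -e 's/$/1/' -e '/111/q' -e 'x' \
  -e 't start')

printf '%s\n' "$first3"
```

The first line contributes 1 and 2, the second line has no match and is deleted, and the third line contributes 3, at which point the hold space contains "111" and sed quits without reading further.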

How do I extract/delete everything between PATTERN1 and PATTERN2?

99.9% of the time, this question comes up when trying to parse some kind of markup language with sed (which should be done sparingly; use a proper parser to parse *ml. See this FAQ).

Suppose you have a series of <tag>....</tag> elements, and you want to either:

Pull out all the elements (including the enclosing tags)
Pull out all the elements (contents only, not including the enclosing tags)
Remove all the elements (including the enclosing tags)
Remove all the elements (contents only, not including the enclosing tags)

All these problems can be solved using the same technique, which is similar to that used in the general case of the pulling out all occurrences problem. You can also choose to not act on all PATTERN1/PATTERN2 pairs, but only on specific occurrences.

Basically, we first delimit the interesting parts of the line (where "interesting" depends on the specific variation of the problem), and then remove what's not needed. The overall goal is to produce a line like this:

xxxxxxx_xxxxxxxxxxxx_xxxxxxxxxxxxxxx_xxxxxxxxxxxxx_xxxxxxx

Once we have this, this code removes all the substrings in even positions (2nd, 4th, etc.):

s/_[^_]*_//g

If we first remove the very first substring (and the last), the same code will then remove all the odd substrings:

s/^[^_]*_//
s/_[^_]*$//
s/_[^_]*_//g

Let's solve all four problems one by one. For this example, we assume an example input line like this:

blahblah<tag>inside the tag</tag>blahblahblah<tag>inside the tag</tag>blah

Each solution will describe how to delimit the line according to the rule above, and whether the odd-numbered or even-numbered substrings need to be removed to solve the problem.

Pull out all the elements (including the enclosing tags)

The line should be delimited like this:

blahblah_<tag>inside the tag</tag>_blahblahblah_<tag>inside the tag</tag>_blah

and then the odd-numbered substrings removed.

Pull out all the elements (contents only, not including the enclosing tags)

The line should be delimited like this:

blahblah<tag>_inside the tag_</tag>blahblahblah<tag>_inside the tag_</tag>blah

and then the odd-numbered substrings removed.

Remove all the elements (including the enclosing tags)

The line should be delimited like this:

blahblah_<tag>inside the tag</tag>_blahblahblah_<tag>inside the tag</tag>_blah

and then the even-numbered substrings removed.

Remove all the elements (contents only, not including the enclosing tags)

The line should be delimited like this:

blahblah<tag>_inside the tag_</tag>blahblahblah<tag>_inside the tag_</tag>blah

and then the even-numbered substrings removed.

In the end, the sed code to use will be something like this (remember that any character can be used instead of "_"; the ideal is to use "\n"):

# mark lines; use this to pull out/remove including PATTERNs
s/PATTERN1/_&/g
s/PATTERN2/&_/g

# use this to pull out/remove, not including PATTERNs
s/PATTERN1/&_/g
s/PATTERN2/_&/g

# use the following to pull out content
s/^[^_]*_//
s/_[^_]*$//
s/_[^_]*_//g

# use the following to delete content
s/_[^_]*_//g

The last "s" command can of course replace with some character instead of just removing the matched patterns. The lines where PATTERN1 and PATTERN2 are marked can be modified to mark only the Nth pattern pair by using /N (where N is the occurrence number) instead of /g.
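As a sketch of two of the four variations, using <tag> and </tag> as PATTERN1 and PATTERN2 on a shortened sample line (the sample line and the comma separator are choices made for this example; "_" is safe here only because it does not occur in the input):

```shell
line='blah<tag>one</tag>blahblah<tag>two</tag>blah'

# Pull out the contents only: mark inside the tags, then keep the marked parts.
contents=$(printf '%s\n' "$line" | sed \
  -e 's/<tag>/&_/g' -e 's/<\/tag>/_&/g' \
  -e 's/^[^_]*_//' -e 's/_[^_]*$//' -e 's/_[^_]*_/,/g')

# Remove the elements, tags included: mark outside the tags, then delete the marked parts.
stripped=$(printf '%s\n' "$line" | sed \
  -e 's/<tag>/_&/g' -e 's/<\/tag>/&_/g' \
  -e 's/_[^_]*_//g')

printf '%s\n%s\n' "$contents" "$stripped"
```

The first command should give "one,two" and the second "blahblahblahblah".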

Finally, we can note that, if we are not interested in the PATTERNs, the problem can be simplified.

Delete all elements, including PATTERNs

s/PATTERN1/_/g
s/PATTERN2/_/g
s/_[^_]*_//g

We can take this a step further, by observing that if we replace PATTERN2 first, then we can delete everything matching "PATTERN1[^_]*_", so here's the new code:

s/PATTERN2/_/g
s/PATTERN1[^_]*_//g

Pull out all elements, contents only (not including PATTERNs)

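s/PATTERN1/_/g          # each PATTERN1 becomes a marker
s/^[^_]*_//             # delete what is before the first element
s/PATTERN2[^_]*_//g     # delete from each PATTERN2 up to the next marker
s/PATTERN2[^_]*$//      # delete from the last PATTERN2 to end of line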

Ok, this isn't really simpler than the normal way, but it's mentioned for completeness.

How do I extract/delete everything between occurrences N and M of PATTERN?

This is just a special case of the above scenario. Assuming 2nd and 4th occurrence of PATTERN, you just have to mark lines in one of the following ways, depending on whether you care about the PATTERNs:

xxxxPATTERNxxxxxPATTERN_xxxxxPATTERNxxxxxx_PATTERNxxxxxxPATTERNxxxxxx
# or
xxxxPATTERNxxxxx_PATTERNxxxxxPATTERNxxxxxxPATTERN_xxxxxxPATTERNxxxxxx

and then either keep or remove what's between the "_"s, depending on your requirements.
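One possible way to produce the first kind of marking (PATTERNs not included) is to use the numeric flag of "s" twice. This is only a sketch: X plays the role of PATTERN, the sample line is invented, and "_" must not occur in the input.

```shell
line='aXbXcXdXeXf'

# Keep only what lies between the 2nd and the 4th X
between=$(printf '%s\n' "$line" | sed \
  -e 's/X/_&/4' -e 's/X/&_/2' \
  -e 's/^[^_]*_//' -e 's/_[^_]*$//')

# Delete what lies between the 2nd and the 4th X
removed=$(printf '%s\n' "$line" | sed \
  -e 's/X/_&/4' -e 's/X/&_/2' \
  -e 's/_[^_]*_//')

printf '%s\n%s\n' "$between" "$removed"
```

After the two marking commands the line reads aXbX_cXd_XeXf, so the results should be cXd and aXbXXeXf.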

How do I replace arbitrary occurrences of a pattern?

This is something that, depending on what exactly you need, can range from trivial to very complex. Let's try to summarize some of the most common cases.

How do I replace the Nth occurrence of PATTERN/from the Nth occurrence to the end?

This is easy. The "s" command accepts an optional number to specify which occurrence to replace:

sed 's/PATTERN/replacement/4'  # replaces 4th occurrence of PATTERN only

This is documented, but not so well-known, since many people only seem to know the /g option to replace all occurrences.

GNU sed has a special syntax to replace all occurrences starting from the Nth to the last:

echo 'foobarfoobazfooblah' | sed 's/foo/XXX/2g'  # replace from 2nd occurrence to end

For non-GNU seds, you can do the same by using the "splitting" technique (see below for a detailed explanation):

s/PATTERN/\n&/2   # put a \n before the 2nd occurrence
tok
b                 # do nothing if less than 2 occurrences
:ok
h
s/^[^\n]*\n//      # remove everything before
s/PATTERN/replacement/g
x
s/\n.*//
G
s/\n//
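Running the numeric flag and the splitting code on the same sample line shows the expected results. This is a sketch: the splitting code is written with -e options, and it still needs a sed that accepts \n in the replacement and in bracket expressions.

```shell
line='foobarfoobazfooblah'

# Only the 2nd occurrence
only_second=$(printf '%s\n' "$line" | sed 's/foo/XXX/2')

# From the 2nd occurrence to the end, without relying on the 2g flag
from_second=$(printf '%s\n' "$line" | sed \
  -e 's/foo/\n&/2' -e 't ok' -e 'b' -e ':ok' \
  -e 'h' -e 's/^[^\n]*\n//' -e 's/foo/XXX/g' \
  -e 'x' -e 's/\n.*//' -e 'G' -e 's/\n//')

printf '%s\n%s\n' "$only_second" "$from_second"
```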

How do I replace all but the Nth occurrence of PATTERN?

This is a bit involved, but still it can be done. The idea is that the Nth occurrence is pulled apart and replaced with \n (which cannot appear otherwise in a line; this is so we know where to reinsert the string); on this resulting string a normal global replacement is performed, and finally the original Nth occurrence of PATTERN is reinserted. In sed code, something like this (in this example it's the 4th occurrence of PATTERN):

h                              # duplicate line to hold space
s/PATTERN/\n&\n/4              # "isolate" 4th occurrence
tok                            # check that line has at least 4 occurrences
b                              # if not, branch to end
:ok
s/^[^\n]*\n//                  # remove what's before...
s/\n[^\n]*$//                  # remove what's after...
x                              # switch to the copy of the original line
s/PATTERN/\n/4                 # remove the 4th occurrence (replace with \n)
s/PATTERN/replacement/g        # replace all remaining occurrences
G                              # append hold space (naked 4th occurrence) to pattern space
s/\n\([^\n]*\)\n\(.*\)/\2\1/   # reinsert 4th occurrence where it belongs

Note the conditional branch after the first substitution is attempted. That is to catch the case where the line has fewer than N occurrences of PATTERN. Also note that the above code (as often in this FAQ) assumes a sufficiently modern sed that is able to handle \n in character classes and in the right hand side of the "s" command.
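For instance, with "-" as PATTERN and "+" as replacement (both chosen only for this sketch, as is the sample line), the program leaves the 4th dash alone and changes all the others:

```shell
line='a-b-c-d-e-f'

all_but_fourth=$(printf '%s\n' "$line" | sed \
  -e 'h' -e 's/-/\n&\n/4' -e 't ok' -e 'b' -e ':ok' \
  -e 's/^[^\n]*\n//' -e 's/\n[^\n]*$//' \
  -e 'x' -e 's/-/\n/4' -e 's/-/+/g' \
  -e 'G' -e 's/\n\([^\n]*\)\n\(.*\)/\2\1/')

printf '%s\n' "$all_but_fourth"
```

The result should be a+b+c+d-e+f.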

How do I replace the Xth, Yth and Zth occurrence of PATTERN?

This can be done by executing all replacements from the highest-numbered occurrence to the lowest. For example, assuming X=2, Y=4, Z=7:

s/PATTERN/replacement/7
s/PATTERN/replacement/4
s/PATTERN/replacement/2

It's important to go backwards because otherwise the replacement of a low-numbered occurrence could affect the result of the replacement of another higher-numbered occurrence (in the simplest case, once occurrence 2 is replaced, occurrence 4 is not occurrence 4 anymore; might now be 3, might be something else, depending on the exact PATTERN used; in the general case, it's not predictable).
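The effect of the ordering can be seen on a small sample line (a sketch; "-" and "+" are stand-ins for PATTERN and replacement):

```shell
line='a-b-c-d-e-f-g-h'

# Highest-numbered occurrence first: the 2nd, 4th and 7th dashes are replaced.
backwards=$(printf '%s\n' "$line" | sed -e 's/-/+/7' -e 's/-/+/4' -e 's/-/+/2')

# Lowest first: after the 2nd dash is gone, "4" now means the original 5th dash,
# and there is no 7th dash left at all.
forwards=$(printf '%s\n' "$line" | sed -e 's/-/+/2' -e 's/-/+/4' -e 's/-/+/7')

printf '%s\n%s\n' "$backwards" "$forwards"
```

The first command gives a-b+c-d+e-f-g+h, while the second gives a-b+c-d-e+f-g-h.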

How do I replace the last occurrence of PATTERN (total number of occurrences unknown)?

To do this, we can exploit the greediness of regular expressions:

sed 's/\(.*\)PATTERN/\1replacement/'

This works because the .* eats as much as possible of the string while still allowing the overall regular expression to match, so PATTERN must refer to the last occurrence of it.
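A short check on a made-up line, with foo as PATTERN and BAR as replacement:

```shell
last=$(echo 'foo1foo2foo3' | sed 's/\(.*\)foo/\1BAR/')
printf '%s\n' "$last"    # only the third foo is changed
```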

How do I replace the first N occurrences of PATTERN, the last N occurrences of PATTERN, the Nth to last occurrence of PATTERN?

For all these replacements, it is useful to introduce a technique that I will call "splitting". Basically, the idea is to split the line in two parts P1 and P2 such that the concatenation of P1 and P2 yields the original line back. P1 is kept in the pattern buffer, while P2 is kept in the hold buffer. When the necessary substitutions have been made, P2 can be appended to P1 using this code:

G                 # append P2 to P1
s/\n//            # remove the \n added by G

So, depending on how we calculate P1 and P2, different things can be done. For example, if we put the first N occurrences of PATTERN in P1, and the rest of the line in P2, we can easily replace the first N occurrences of PATTERN by just doing a s/PATTERN/replacement/g on P1. Similarly, if the total number of occurrences is unknown, we can put the last N occurrences of PATTERN in P2, and change them all by doing a s/PATTERN/replacement/g on P2. With the same split, we can also replace the last occurrence of PATTERN in P1, thus yielding replacement of the (N+1)th to last occurrence, or only the first occurrence in P2, thus replacing the Nth to last occurrence.

So the only thing left to do is to write some code that leaves us in the situation described above (ie, P1 in pattern buffer and P2 in hold buffer). This can be done in two ways, depending on whether we are counting from the beginning (ie, the first N occurrences) or the end (the last N occurrences) of the line. Depending on our problem and the information we have, we may be forced to use one or the other method. All the examples will use N=4.

First of all, we need to check that there are enough occurrences of PATTERN in the line to do what we want:

s/PATTERN/&/4            # try to replace the 4th occurrence with itself
tok                      # if successful, branch to label ok
b                        # otherwise, not enough occurrences, branch to end and do nothing
:ok                      # rest of code follows here...

Counting from the start is straightforward:

:ok
s/PATTERN/&\n/4   # mark the end of 4th occurrence  ..PATTERN..PATTERN..PATTERN..PATTERN\n...
h                 # do a copy of the line
s/.*\n//          # this is P2
x                 # put P2 in hold space
s/\n.*//          # this is P1 in pattern space
                  
# now do whatever we want with P1 and P2... (x to exchange them)
# be sure to end with P1 in pattern space and P2 in hold space

G                 # re-join code: append P2 to P1
s/\n//            # remove the \n added by G
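As a sketch of the whole thing assembled (occurrence check, split counting from the start, a replacement on P1, re-join), here is a run that changes the first 4 dashes of a sample line into "+". The sample line and the choice of "-" and "+" are assumptions for the example; GNU sed is assumed for \n in the replacement.

```shell
line='a-b-c-d-e-f'

first_four=$(printf '%s\n' "$line" | sed \
  -e 's/-/&/4' -e 't ok' -e 'b' -e ':ok' \
  -e 's/-/&\n/4' -e 'h' -e 's/.*\n//' -e 'x' -e 's/\n.*//' \
  -e 's/-/+/g' \
  -e 'G' -e 's/\n//')

printf '%s\n' "$first_four"
```

Here P1 is "a-b-c-d-" and P2 is "e-f", so the output should be a+b+c+d+e-f.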

Counting from the end is a bit more involved, but still it can be done:

:ok
s/^\([^\n]*\)\(PATTERN\)/\1\n\2/   # add another separator \n
tok1                               # this is just to reset the "t" status
:ok1
s/\n/&/4                           # check that we have isolated 4 occurrences
tok2                               # if yes, go to the main part
bok                                # if not, go up and repeat
:ok2                               # we have .....\nPATTERN..\nPATTERN..\nPATTERN..\nPATTERN..
h                                  # make a copy of the line in hold space
s/^[^\n]*\n//                      # everything after the first \n is P2
s/\n//g                            # remove all separator \n characters; this is P2
x                                  # switch to the copy
s/\n.*//                           # everything up to the first \n is P1

# again, we have P1 in pattern space and P2 in hold space.
# do whatever we want with P1 and P2... (x to exchange them)
# be sure to end with P1 in pattern space and P2 in hold space

G                 # re-join code: append P2 to P1
s/\n//            # remove the \n added by G

Some examples of what can be done in the "do whatever we want" part:

# replace the first N occurrences (if counting from the start)
# or the first TOT-N occurrences (if counting from the end) of PATTERN

s/PATTERN/replacement/g     # we act on P1 here

# replace the last TOT-N occurrences (if counting from the start)
# or the last N occurrences (if counting from the end) of PATTERN

x
s/PATTERN/replacement/g     # we act on P2 here
x

# replace the Nth occurrence (if counting from the start)
# or the (N+1)th to last occurrence (if counting from the end) of PATTERN

s/\(.*\)PATTERN/\1replacement/g     # last occurrence of P1

# replace the (N+M)th occurrence (if counting from the start)
# or the (N-M+1)th to last occurrence (if counting from the end) of PATTERN

x
s/PATTERN/replacement/M     # Mth occurrence of PATTERN in P2
x

etc. etc.
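To see the counting-from-the-end split at work, the sketch below replaces the last 4 dashes of a sample line with "+". Again, "-", "+" and the sample line are only stand-ins, and GNU sed is assumed.

```shell
line='a-b-c-d-e-f-g'

last_four=$(printf '%s\n' "$line" | sed \
  -e 's/-/&/4' -e 't ok' -e 'b' -e ':ok' \
  -e 's/^\([^\n]*\)\(-\)/\1\n\2/' -e 't ok1' -e ':ok1' \
  -e 's/\n/&/4' -e 't ok2' -e 'b ok' -e ':ok2' \
  -e 'h' -e 's/^[^\n]*\n//' -e 's/\n//g' -e 'x' -e 's/\n.*//' \
  -e 'x' -e 's/-/+/g' -e 'x' \
  -e 'G' -e 's/\n//')

printf '%s\n' "$last_four"
```

After the split, P1 is "a-b-c" and P2 is "-d-e-f-g"; the replacement acts on P2 only, so the result should be a-b-c+d+e+f+g.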

By the way, besides the general method described here, there are specific optimizations that can be used under specific circumstances, for example the replacement of the first N occurrences can also be done using the Xth, Yth and Zth technique described above. Also, in the particular case of the penultimate occurrence, we can exploit the fact that we can recognize and isolate the last occurrence of PATTERN to our advantage. Much code can be cut off and this is what we get:

s/\(.*\)\(PATTERN\)/\1\n\2/     # insert \n before last occurrence
h                               # make a copy of the line in hold space
s/^[^\n]*\n//                   # remove everything before
x                               # switch to the copy
s/\n.*//                        # remove the last occurrence and what follows it
s/\(.*\)PATTERN/\1replacement/  # replace last (but really penultimate) occurrence
G                               # append back the missing part
s/\n//                          # remove \n added by "G"

How do I replace all occurrences of PATTERN before/after character N?

This can again be done using the "isolation" technique, ie leaving only the interesting part for sed to work on, and then adding back what was removed. Here is an example that has sed replace all PATTERNs between characters 20 and 40 of the line (inclusive):

/.\{40\}/!b                             # check that we have at least 40 characters!
h
s/^\(.\{19\}\)\(.\{21\}\)/\1\n\2\n/
s/^[^\n]*\n//
s/\n[^\n]*$//
s/PATTERN/replacement/g
x
s/^\(.\{19\}\).\{21\}/\1\n/
G
s/\n\([^\n]*\)\n\(.*\)/\2\1/

If you just need replacement before a given character or after a given character, adapt the above code (basically, it's just a matter of creating P1 and P2 with the appropriate number of characters).
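The same code with smaller numbers is easier to follow by eye. The sketch below replaces every "a" with "X" between characters 3 and 5 (inclusive) of a sample line; the numbers 2, 3 and 5 play the roles that 19, 21 and 40 have above, and GNU sed is assumed.

```shell
line='aaaaaaaa'

middle=$(printf '%s\n' "$line" | sed \
  -e '/.\{5\}/!b' \
  -e 'h' \
  -e 's/^\(.\{2\}\)\(.\{3\}\)/\1\n\2\n/' \
  -e 's/^[^\n]*\n//' -e 's/\n[^\n]*$//' \
  -e 's/a/X/g' \
  -e 'x' -e 's/^\(.\{2\}\).\{3\}/\1\n/' \
  -e 'G' -e 's/\n\([^\n]*\)\n\(.*\)/\2\1/')

printf '%s\n' "$middle"
```

Only the third, fourth and fifth characters are touched, so the output should be aaXXXaaa.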

How do I replace all occurrences of PATTERN before/after PATTERN2?

This can again be solved using the P1/P2 technique, splitting at PATTERN2. If you want to replace all PATTERNs before PATTERN2, then you split like this:

P1 = ^.................
P2 = PATTERN2.........$

and do the replacement of PATTERN on P1.

If you want to replace all PATTERNs after PATTERN2, then you split like this:

P1 = ^........PATTERN2
P2 = ................$

and do the replacement of PATTERN on P2.
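One way to write the two splits is sketched below, with "-" as PATTERN, "=" as PATTERN2 and "+" as replacement (all three are assumptions made for the example). The line is split at the first "="; lines without it are left alone. GNU sed is assumed for \n in the replacement.

```shell
line='a-b-c=d-e'

# Replace before PATTERN2: P1 is what precedes "=", P2 starts at "="
before=$(printf '%s\n' "$line" | sed \
  -e '/=/!b' -e 's/=/\n&/' -e 'h' -e 's/.*\n//' -e 'x' -e 's/\n.*//' \
  -e 's/-/+/g' \
  -e 'G' -e 's/\n//')

# Replace after PATTERN2: P1 ends with "=", P2 is what follows
after=$(printf '%s\n' "$line" | sed \
  -e '/=/!b' -e 's/=/&\n/' -e 'h' -e 's/.*\n//' \
  -e 's/-/+/g' \
  -e 'x' -e 's/\n.*//' -e 'G' -e 's/\n//')

printf '%s\n%s\n' "$before" "$after"
```

The results should be a+b+c=d-e and a-b-c=d+e.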

How do I do any of the above things on the file as a whole?

Generally speaking, the simplest way is probably to read the whole file in the pattern space, and then use the same approaches explained above. For example if you want to replace the 4th occurrence of PATTERN in the whole file:

:a                        # this is the typical way
$!{N;ba;}                 # to read the whole file in pattern space
s/PATTERN/replacement/4   # the pattern space here contains the whole file

However, for solutions that require marking, it's not possible to use \n to mark (while in a single line it's guaranteed that \n is not present, the same cannot be said for the whole file, of course). In these cases, if your sed supports that, you can use some control character like ASCII 1 or 2 (\x1 and \x2 with GNU sed) as markers, since those will always be deleted after the replacements.
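A small run of the slurping idiom on a five-line sample input (a sketch; the loop is split over several -e options so that it does not depend on how a given sed parses ";" after labels and braces):

```shell
fourth=$(printf 'foo\nfoo\nfoo\nfoo\nfoo\n' | sed \
  -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
  -e 's/foo/BAR/4')

printf '%s\n' "$fourth"
```

Since the whole input is in the pattern space when the "s" command runs, the 4th occurrence is the one on the fourth line, even though each line contains only one foo.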

How do I reverse the characters in a line?

You may have seen the solution proposed by the SED oneliners:

# reverse each character on the line (emulates "rev")
sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'

Well, here is another (clearer?) way. It just uses a \n as a marker to separate the part of the line that has been reversed so far from the part that hasn't. Finally, the \n is removed.

sed 's/^/\n/;:a; s/^\([^\n]*\n\)\(.\)/\2\1/;/\n$/!ba;s/\n//'
# Or, if your sed doesn't support [^\n], possibly less efficient:
sed 's/^/\n/;:a; s/^\(.*\n\)\(.\)/\2\1/;/\n$/!ba;s/\n//'

How do I have sed act on a file depending on the contents of another file?

Ok, it seems difficult but it isn't. Suppose you have a file (DATA) like this:

001 abc
002 def
003 abc
004 ghi
005 jkl

and another file (MAP) like this:

abc something
def foobar
ghi blah
jkl xxxxxx

And you want to use sed to get this:

001 something
002 foobar
003 something
004 blah
005 xxxxxx

(Use awk. Seriously.) If you're still reading, here's a way to do that with sed. The idea is to run sed on the MAP file first, transforming it into a sed script that does what we want. This script is fed to another sed that works on the DATA file.

sed 's|^\([^ ]\{1,\}\) \([^ ]\{1,\}\)$|s/\1/\2/;t|' MAP | sed -f- DATA

The first sed invocation outputs something like this:

s/abc/something/;t
s/def/foobar/;t
s/ghi/blah/;t
s/jkl/xxxxxx/;t

This sed program is read and used by the second sed, which uses it to change our data. (Next time, use awk.)
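
One thing to watch with the generated script: s/abc/something/ replaces the first "abc" anywhere on the line, so a key that also shows up in the first column would be changed there. A possible variation, shown here as a sketch only, anchors each generated command to the last field. It assumes GNU sed (for -f -), keys and values free of characters that are special to sed, and throwaway copies of DATA and MAP created in a temporary directory just for the demonstration (tmpdir and map_data are names invented here).

```shell
# Throwaway copies of the DATA and MAP files from the example above.
tmpdir=$(mktemp -d)
printf '%s\n' '001 abc' '002 def' '003 abc' '004 ghi' '005 jkl' > "$tmpdir/DATA"
printf '%s\n' 'abc something' 'def foobar' 'ghi blah' 'jkl xxxxxx' > "$tmpdir/MAP"

# Turn each "key value" line of MAP into:  s/ key$/ value/;t
# and feed the generated script to a second sed that reads DATA.
map_data() {
  sed 's|^\([^ ]\{1,\}\) \([^ ]\{1,\}\)$|s/ \1$/ \2/;t|' "$tmpdir/MAP" |
    sed -f - "$tmpdir/DATA"
}

map_data
```

The leading space and the trailing $ in the generated pattern restrict the match to a whole last field, at the cost of assuming that DATA always has the key in that position.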

I have lots of slashes in my pattern and/or replacement!

You can escape them all (the so-called toothsaw effect):

sed 's/\/a\/b\/c\//\/d\/e\/f\//'      # change "a/b/c/" to "d/e/f/"

but that is ugly and unreadable. It's a little-known fact that sed can use any character as the separator for the "s" command. Basically, sed takes whatever character follows the "s" as the separator (anything except a backslash or a newline). So, our example can be rewritten, for example, as follows:

sed 's_/a/b/c/_/d/e/f/_'
sed 's;/a/b/c/;/d/e/f/;'
sed 's#/a/b/c/#/d/e/f/#'
sed 's|/a/b/c/|/d/e/f/|'
sed 's%/a/b/c/%/d/e/f/%'
# etc.

An even less-known fact is that you can use a different delimiter for patterns used in addresses too, using a special syntax (a backslash before the opening delimiter):

# do this (ugly)...
sed '/\/a\/b\/c\//{do something;}'

# ...or these (better)
sed '\#/a/b/c/#{do something;}'
sed '\_/a/b/c/_{do something;}'
sed '\%/a/b/c/%{do something;}'
# etc.
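
Both tricks can be combined. The following sketch uses # as the delimiter for the address and for the "s" command, so no slash needs escaping. The sample paths and the name swap_prefix are made up for the example.

```shell
# Address and s command both use # as the delimiter.
# Only lines containing /a/b/c/ are touched.
swap_prefix() {
  sed '\#/a/b/c/#s#/a/b/c/#/d/e/f/#'
}

printf '%s\n' '/a/b/c/file' '/x/y/z/file' | swap_prefix
# prints:
# /d/e/f/file
# /x/y/z/file
```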

How do I do replacements in blocks of unknown length, that could potentially span multiple lines?

It can be done, but blocks must be recognizable. In this example, we assume blocks start when a line with "block(" is encountered, and end when a line with ");" is found. Of course, adapt to your needs. In the example, we'll change all FOOs to XXXs in such a block. Example input:

line1; FOO; end of line1
line2; BAR; end of line2
block( this is a block; FOO; content; FOO; );
line4; FOO FOO FOO FOO;
block(
  this is FOO
  a multiline FOO
  block FOO
);
line10 FOOBAR;

The idea is: when the start of a block is found, read ahead in the file until we find the end. That brings the whole block into the pattern space. Once we have that, we can do our substitutions as usual.

/^block(/{     # found start of a block
  :a
  /);$/!{      # while pattern space doesn't end with ");"...
    N          # ...add lines to it
    ba
  }
  s/FOO/XXX/g  # do our replacement
}
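
Here is the same script run from the shell on a shortened version of the example input, to show which FOOs are touched. It is only a usage sketch: it assumes a sed that accepts this multi-line layout (GNU sed does), and fix_blocks is a name chosen for the example.

```shell
# The block-replacement script from above, wrapped in a function.
fix_blocks() {
  sed '/^block(/{
    :a
    /);$/!{
      N
      ba
    }
    s/FOO/XXX/g
  }'
}

printf '%s\n' \
  'line1; FOO; end of line1' \
  'block( this is a block; FOO; content; FOO; );' \
  'line4; FOO FOO FOO FOO;' \
  'block(' \
  '  this is FOO' \
  '  a multiline FOO' \
  ');' \
  'line10 FOOBAR;' | fix_blocks
```

Only the FOOs inside the one-line block and the multiline block become XXX. The lines outside blocks come out unchanged. What happens to a block that is never closed before the end of the file depends on how your sed handles N on the last line, so test that case if it can occur in your data.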

I have a list of strings, how do I find the longest common prefix?

So you have for example

abcdefg
abcdejkliu
abchhitooyu
abcdtuyiu
abzzzzzzz

and want to find the longest common prefix ("ab" in this case). This might seem silly, but it can have practical applications. Generally speaking, this is not very easy to solve. But Marlon Berlin suggested a clever sed solution (thanks):

sed ':a;$!N;s/^\(.*\).*\n\1.*/\1/;ta'

In essence, what it does is to compare each line with the following one, and replace them with their longest common prefix. This is then what remains in the pattern space, and the next line is read, and the comparison repeated. Note that the loop in the program reads in all the input lines, since the replacement can never fail (if two lines have no common prefix, \1 becomes the empty string, but the overall replacement does succeed).
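
For the list of strings above, the pattern space shrinks step by step as sketched in the comments below. This is just the same one-liner wrapped in a function (common_prefix is a name made up here), assuming a sed that accepts it in this one-line form, as GNU sed does.

```shell
# Marlon's one-liner, wrapped in a function for convenience.
common_prefix() {
  sed ':a;$!N;s/^\(.*\).*\n\1.*/\1/;ta'
}

# Pattern space before and after each substitution:
#   abcdefg + abcdejkliu   -> abcde
#   abcde   + abchhitooyu  -> abc
#   abc     + abcdtuyiu    -> abc
#   abc     + abzzzzzzz    -> ab
printf '%s\n' abcdefg abcdejkliu abchhitooyu abcdtuyiu abzzzzzzz | common_prefix
# prints: ab
```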

So the above code can be useful, for example, if you have a list of pathnames like

/foo/bar/baz/blah
/foo/bar/xxx/yyy
/foo/bar/baz/zzz/kkk
/foo/bar

and you want to find the longest common path. However, caution must be used because you could have this:

/foo/bar/baz/blah
/foo/bak/baz/blah
/foo/bar/baz/
/foo/bar

and here the result would be "/foo/ba" which is not meaningful for the problem. Nonetheless, many thanks to Marlon for the tip.
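
If whole path components are what you are after, one possible variation (not part of Marlon's tip, and only a sketch) is to give every line a trailing slash and force the captured prefix to end with a slash too. It assumes absolute pathnames, so that "/" is always a common prefix and the substitution cannot fail, and a sed that accepts this multi-line layout (GNU sed does). common_path is a name invented for the example.

```shell
# Longest common leading directory of a list of absolute pathnames.
# Assumes every input line starts with a slash.
common_path() {
  sed '
    s|/*$|/|                    # first line: end it with exactly one slash
    :a
    $!{
      N
      s|/*$|/|                  # same for the line just appended
      s|^\(.*/\).*\n\1.*|\1|    # keep the longest prefix that ends with a slash
      ba
    }
    s|\(.\)/$|\1|               # drop the trailing slash, unless the result is just /
  '
}

printf '%s\n' /foo/bar/baz/blah /foo/bak/baz/blah /foo/bar/baz/ /foo/bar | common_path
# prints: /foo
```

With the first list of pathnames above, this should give "/foo/bar", the same answer as the original one-liner.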
