GeneralizedTextReplacement

Sometimes people need to recode in awk something they were previously doing with sed or another tool. More specifically, one may need to emulate most of sed's features with awk.

First, we will discuss how to emulate sed with awk (specifically, the s/pattern/replacement/ command), establishing a basic framework to emulate the "s" command (and do some things that sed cannot do as well); then, we'll see how to emulate backreferences with awk, and how that fits in the framework.

So, the only sed command worth discussing is s/pattern/replacement/, because everything else you can do with sed can be easily done in awk (and sometimes more clearly). You can emulate sed's 's/foo/bar/' command quite faithfully in awk using sub() (to replace the first occurrence of the regexp, like sed 's/foo/bar/') or gsub() (to replace all the occurrences, like sed 's/foo/bar/g'). If you need to replace only the nth occurrence (where n!=1), or you want to use backreferences in the replacement text, then you cannot use sub()/gsub().
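As a quick illustration of the sub()/gsub() equivalence just described, here is a minimal sketch (the input line and the shell variable names first, all and amp are made up for the example). It also shows that "&" in the replacement stands for the matched text, as it does in sed:

```shell
# like sed 's/foo/bar/': only the first occurrence is replaced
first=$(echo 'foo foo foo' | awk '{ sub(/foo/, "bar"); print }')

# like sed 's/foo/bar/g': every occurrence is replaced
all=$(echo 'foo foo foo' | awk '{ gsub(/foo/, "bar"); print }')

# like sed 's/foo/[&]/g': "&" is the text that matched
amp=$(echo 'foo foo foo' | awk '{ gsub(/foo/, "[&]"); print }')

printf '%s\n%s\n%s\n' "$first" "$all" "$amp"
```

This prints "bar foo foo", "bar bar bar" and "[foo] [foo] [foo]". With no third argument, sub() and gsub() act on $0, which is what makes them behave like sed operating on the current line.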

When sub()/gsub() are not enough, you can still emulate sed's replacement features using the following code (which, btw, also allows you to do things that cannot be easily done with sed, like replacing only, say, the 3rd and 7th occurrences of a pattern, or performing complex operations, including arithmetic, on the text to be replaced). The idea is to find all matching substrings in the original text (what the lhs of sed's s/pattern/replacement/ command would match), loop over them to compute a replacement for each one, storing the replacements in a second array of the same size, and finally rebuild the original line with the matched strings replaced by the replacement strings.

# Builds a new string starting from orgstr with all the occurrences of 
# the strings in mtch[] replaced by the corresponding strings in rep[]
function BuildNew(n,orgstr,mtch,start,rep,      newstr,last,j,psep) {

  newstr=""; last=1

  for(j=1;j<=n;j++) {
    # find out what's in orgstr between match j-1 (or beginning of string) and j
    psep=substr(orgstr,last,start[j]-last)
    last=start[j]+length(mtch[j])

    # build newstr = newstr + psep (part of orgstr before start[j]) + rep[j]
    newstr = newstr psep rep[j]
  }

  # add trailer (what is after mtch[n])
  newstr=newstr substr(orgstr,last)
  return newstr
}

# main body of the program; here we just turn the matched text (foo) to
# uppercase, but you can do almost anything

{ str=$0    # copy of the line handed to FindAllMatches; $0 is kept intact for BuildNew

  # find all matches of "foo"; fill mtch[] with the matched strings
  # and start[] with their positions in the line
  n=FindAllMatches(str,"foo",mtch,start)
  
  # here we can build the rep[] array with the corresponding
  # replacement strings. You can implement any kind of
  # replacement logic, not just string replacement.

  for (i=1;i<=n;i++) {
    # here we just use the uppercase version of the matched
    # strings as a replacement, but you can do virtually anything
    # you want here. Some commented examples follow
    rep[i]=toupper(mtch[i])

    # rep[i] = ">>"mtch[i]"<<"                      # prepend/append other text
    # rep[i] = sprintf("%05d", mtch[i]*4)           # arithmetic (like perl)
    # rep[i] = mtch[i]; gsub(/foo/, "bar", rep[i])  # text substitution
    # rep[i] = ((i==3)||(i==7))?"BAR":mtch[i]       # act only on 3rd and 7th match
    # etc. see below for an example where backreferences are emulated
  }

  # here we build the new line with the replacement text
  newstring=BuildNew(n,$0,mtch,start,rep)

  print newstring    
}
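The program above calls FindAllMatches(), which is not shown in this section. What follows is only a minimal sketch of what such a function could look like, not necessarily the original one. It assumes the call signature used above (string, regex, mtch[], start[]), 1-based start positions as BuildNew expects, and a return value equal to the number of matches. The local names n and off, the sample input and the shell variable out are made up for the example.

```shell
out=$(echo 'foo bar foofoo' | awk '
# Hypothetical sketch of FindAllMatches: repeatedly match() against the
# not-yet-scanned tail of str, recording each matched substring in mtch[]
# and its 1-based position within the ORIGINAL string in start[].
# Limitations of this sketch: it stops at the first empty match, and an
# anchor like ^ would match again at the start of every remaining tail.
function FindAllMatches(str, re, mtch, start,    n, off) {
  n = 0      # number of matches found so far
  off = 0    # number of characters already chopped off the front of str
  while (match(str, re) > 0 && RLENGTH > 0) {
    mtch[++n] = substr(str, RSTART, RLENGTH)
    start[n]  = off + RSTART
    off += RSTART + RLENGTH - 1
    str = substr(str, RSTART + RLENGTH)   # local copy only; caller is unaffected
  }
  return n
}

{
  n = FindAllMatches($0, "foo", mtch, start)
  for (i = 1; i <= n; i++) printf "%d:%s\n", start[i], mtch[i]
}')

printf '%s\n' "$out"
```

For the sample line this prints 1:foo, 9:foo and 12:foo, one per line. Since awk passes scalar arguments by value, chopping str inside the function does not touch the caller's string, while the mtch[] and start[] arrays are passed by reference and are therefore visible to the caller afterwards.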

Now that a generic framework for text replacement is in place (more powerful than sed), let's see how we can emulate backreferences with it.

The way to emulate sed's backreferences in the replacement text with awk relies on the fact that you know the structure of the matched text, and thus you can always act on the elements of the array mtch[] to break them into the substrings corresponding to the capture groups you would use in sed. For example, let's see how to use the above code to emulate this sed command:

# reverse letter and digit and insert "+" if letter is "a" or "c"
$ echo 'a1-b2-c3-a5-s6-a7-f8-e9-a0' | sed 's/\([ac]\)\([0-9]\)/\2+\1/g'
1+a-b2-3+c-5+a-s6-7+a-f8-e9-0+a

First, we run FindAllMatches to get all matches of /[ac][0-9]/. This is the same expression we used in sed, but without capturing groups. That will fill the mtch[] array with "a1", "c3", "a5", "a7", "a0". Once we have all the matching substrings in the mtch[] array, since we know what they must look like, we can extract what in sed would have been the first and second capture groups:

  # in the loop where we build the rep[] array
  ...
  match(mtch[i],/^[ac]/); g1=substr(mtch[i],RSTART,RLENGTH)
  match(mtch[i],/[0-9]$/); g2=substr(mtch[i],RSTART,RLENGTH)

  # or, here, even:
  # g1=substr(mtch[i],1,1)
  # g2=substr(mtch[i],2,1)
  # the more you know about the format of your data, the better

  rep[i]=g2 "+" g1    # this is like our \2+\1 in sed

  ...
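Putting the pieces together, the following is a sketch of a complete program for the sed command above. BuildNew is a condensed version of the function shown earlier; FindAllMatches is a hypothetical implementation (it is an assumption, written to match the way the function is called in this page), and the shell variable out is only there for the example.

```shell
out=$(echo 'a1-b2-c3-a5-s6-a7-f8-e9-a0' | awk '
# Hypothetical FindAllMatches: fills mtch[] with the matched strings and
# start[] with their 1-based positions in str; returns the match count.
# Assumes the regex never matches the empty string.
function FindAllMatches(str, re, mtch, start,    n, off) {
  n = 0; off = 0
  while (match(str, re) > 0 && RLENGTH > 0) {
    mtch[++n] = substr(str, RSTART, RLENGTH)
    start[n]  = off + RSTART
    off += RSTART + RLENGTH - 1
    str = substr(str, RSTART + RLENGTH)
  }
  return n
}

# Condensed BuildNew: text between matches is copied, matches are replaced
function BuildNew(n, orgstr, mtch, start, rep,    newstr, last, j) {
  newstr = ""; last = 1
  for (j = 1; j <= n; j++) {
    newstr = newstr substr(orgstr, last, start[j] - last) rep[j]
    last = start[j] + length(mtch[j])
  }
  return newstr substr(orgstr, last)
}

{
  n = FindAllMatches($0, "[ac][0-9]", mtch, start)
  for (i = 1; i <= n; i++) {
    g1 = substr(mtch[i], 1, 1)   # what sed would capture as \1 (the letter)
    g2 = substr(mtch[i], 2, 1)   # what sed would capture as \2 (the digit)
    rep[i] = g2 "+" g1           # like \2+\1 in sed
  }
  print BuildNew(n, $0, mtch, start, rep)
}')

echo "$out"
```

This prints 1+a-b2-3+c-5+a-s6-7+a-f8-e9-0+a, the same result as the sed command.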

To recap, the key to emulating backreferences is that you must know what the text from which you want to capture the groups looks like (and indeed you most likely do, since you wrote the regex with which those strings were extracted in the first place). With that knowledge, you can almost always postprocess the text to extract the substrings corresponding to sed's capture groups, and build the replacement text accordingly.

Finally, it would of course be possible to put all the above code in a single function and use that one to emulate sed (using a single loop over the original string), but imho it makes more sense to keep the three stages (find all matches, build the array of replacement strings, build the new line) separate, since it is clearer (although slightly less efficient). Another advantage is that the first and third phases do not need to be touched if you want to implement a different replacement logic; only the second phase needs to be changed to do what you want. Furthermore, the FindAllMatches function can also be used as is for different purposes.