RangeOfFields

Printing a range of fields - all fields but the first, for example, or fields 3 through 8 - is a surprisingly fiddly little problem.

No field offsets are stored

Although awk performs field splitting, it does not maintain a record (or at least not one that is accessible to user code) of the offsets into the line where the splitting actually took place.

One workaround is to assign an empty string to the fields preceding and following the range that you want ( {$1 = $2 = ""; for (i = 9; i <= NF; i++) $i = ""; print} ).

However, this causes awk to rebuild the line, inserting OFS between all fields, including the now-empty ones at the beginning and at the end. In the above example, with the default FS and OFS, runs of whitespace between the remaining fields are squeezed to a single space, two spaces appear at the front of every line, and one trailing space is left for each blanked field at the end. You can remove the leading spaces with $0=substr($0, 1+length(OFS) * 2) . Another possibility is to shift the fields and adjust NF, e.g. to keep fields 3 to 8:

for (i=3;i<=8;i+=1) $(i-2)=$i # shift $1=$3 $2=$4 ...
NF=8-3+1 # only keep the first six fields
print
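
To make the difference between the two tactics concrete, here is a minimal sketch that runs both on one sample line. The helper names blank_outside and keep_fields are made up for this illustration, and the sketch assumes an awk that rebuilds $0 when NF is assigned (gawk and mawk do).

```shell
# blank_outside: blank fields 1-2 and 9..NF, as in the first tactic
blank_outside() {
  awk '{$1 = $2 = ""; for (i = 9; i <= NF; i++) $i = ""; print}'
}

# keep_fields START END: shift the wanted range to the front, then shrink NF
keep_fields() {
  awk -v s="$1" -v e="$2" '{
    last = (e < NF) ? e : NF                       # do not run past the last field
    for (i = s; i <= last; i++) $(i - s + 1) = $i  # shift the range to the front
    NF = (last >= s) ? last - s + 1 : 0            # drop everything after the range
    print                                          # $0 has been rebuilt with OFS
  }'
}

printf 'a b c d e f g h i j\n' | blank_outside    # two leading and two trailing spaces remain
printf 'a b c d e f g h i j\n' | keep_fields 3 8  # prints: c d e f g h
```

The second form leaves no stray separators, but both rebuild the record, so the original spacing between the kept fields is lost either way.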

Drawbacks of using a loop

In some cases, the desired behaviour is to remove everything before a certain field and then print out the rest of the line. The typical tactic is then to use a for loop, such as this:

awk '{sep="";for (i=2;i<=NF;i++) {printf "%s%s",sep, $i;sep=" "}; printf "\n"}' file

# or, avoids using sep
awk '{for (i=2;i<=NF;i++) {printf "%s%s",(i>2?" ":""), $i}; printf "\n"}' file

# uses OFS and ORS
awk '{for (i=2; i<=NF; i++) printf("%s%s", $i, i==NF ? ORS : OFS)}' file

A loop to select individual fields will also cause anything that appears between the fields to be replaced with " ". This is often not the desired behaviour.
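
A quick way to see this effect is to feed the loop a line with uneven spacing. The sample input below is made up for the illustration.

```shell
# The input has three spaces, a tab and two spaces between its fields.
out=$(printf 'a   b\tc  d\n' |
  awk '{for (i = 2; i <= NF; i++) printf("%s%s", $i, i == NF ? ORS : OFS)}')
printf '%s\n' "$out"   # prints: b c d
```

Note also that this OFS/ORS variant prints nothing at all, not even a newline, for records with fewer than two fields.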

Using sub()

If the separator is the default, you can use a direct sub() on $0 to remove the fields that you don't want. This has the advantage that original spacing between fields will be preserved. For example:

awk '{sub(/^[[:blank:]]*([^[:blank:]]+[[:blank:]]+){n}/,""); print}' file

will remove the first "n" fields from the line ("n" must be replaced by the actual number of fields that you want to remove). If you want to remove the last n fields with the same technique, then

awk '{sub(/([[:blank:]]+[^[:blank:]]+){n}[[:blank:]]*$/,""); print}' file

should do the job. If you want to keep from field n to field m (meaning "remove from field 1 to n-1 and from m+1 to NF"), just combine the above two techniques, using the appropriate values for the repetition operator.

Keep in mind that the {n} repetition operator in regexes, while specified by POSIX, is not supported by all implementations of awk. With GNU awk versions before 4.0, you have to use the --re-interval command line switch (or --posix to get full POSIX compatibility, which includes repetition operators); from version 4.0 on, repetition operators are enabled by default.
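
Where the {n} operator is unavailable, one possible workaround is to build the regex by repeating the per-field pattern in a loop. The following is only a sketch of that idea, not a drop-in replacement: keep_range_ws is a made-up helper name, and it uses [ \t] in place of [[:blank:]] on the assumption that only spaces and tabs separate fields. It keeps fields n through m and preserves the spacing between them.

```shell
# keep_range_ws N M: keep fields N..M of each line, original spacing intact
keep_range_ws() {
  awk -v n="$1" -v m="$2" '
  BEGIN {
    # pattern for leading blanks plus the first n-1 fields and their separators
    head = "^[ \t]*"
    for (i = 1; i < n; i++) head = head "[^ \t]+[ \t]+"
  }
  {
    # remove fields m+1..NF first, while NF still describes the whole line
    k = NF - m
    if (k > 0) {
      tail = ""
      for (i = 1; i <= k; i++) tail = tail "[ \t]+[^ \t]+"
      sub(tail "[ \t]*$", "")
    }
    sub(head, "")   # then remove fields 1..n-1
    print
  }'
}

printf 'a  b   c d    e f\n' | keep_range_ws 3 5   # prints: c d    e
```

The trailing fields are removed before the leading ones because the number of fields to drop at the end depends on NF, which is computed from the unmodified line.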

If FS is not the default, but is still a single character (for example, "#"), the regex is simpler:

awk '{sub(/^([^#]*#){n}/,""); print}' file

The example that removes the last n fields can be adapted similarly from the code that uses the default FS.
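
As a sketch of how both adaptations might look for "#", the following removes the first n and the last k fields. The helper name strip_hash_fields is made up, and the patterns are again built by repetition so that the example does not depend on {n} support.

```shell
# strip_hash_fields N K: drop the first N and the last K "#"-separated fields
strip_hash_fields() {
  awk -v n="$1" -v k="$2" '
  BEGIN {
    for (i = 1; i <= n; i++) head = head "[^#]*#"   # one field plus its separator
    for (i = 1; i <= k; i++) tail = tail "#[^#]*"   # one separator plus its field
  }
  {
    if (n > 0) sub("^" head, "")
    if (k > 0) sub(tail "$", "")
    print
  }'
}

printf 'a#b##d#e\n' | strip_hash_fields 1 1   # prints: b##d
```

Because each "#" delimits exactly one field, empty fields (such as the third one in the sample) are counted and preserved correctly.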

Finally, if FS is a full regular expression, then the problem is not trivial and it's better to use some other technique among those described here.

Using cut

If the field separator is a single character, the cut utility may be used to select a range of fields. There is documentation for the GNU version of cut, which appears in the coreutils package, but which leaves out mention of the range feature. The Open Group spec describes cut in more detail, and the Examples section offers a more detailed guide to the syntax.

cut cannot use field delimiters that are longer than a single character. By default, the delimiter is the tab character. To select fields one, three, and from field 9 to the end of the line, one could write: cut -f1,3,9- . Fields cannot be reordered in this way: cut -f3,1,9- is equivalent to the previous example.

If the single-character delimiter limitation is not a problem, the fields do not need to be reordered, and no other computation needs to take place, consider using cut instead of awk: it is small, simple and fast, and this simplicity makes its purpose immediately clear to anyone reading the command line.
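
For instance, with ":" as the delimiter (the sample line is made up for this illustration):

```shell
line='a:b:c:d:e:f:g:h:i:j'
printf '%s\n' "$line" | cut -d: -f1,3,9-   # prints: a:c:i:j
printf '%s\n' "$line" | cut -d: -f3,1,9-   # same output: cut does not reorder fields
```

The selected fields are always written in input order, joined by the same delimiter.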

With gawk-devel's optional fourth parameter of split()

To print from the third to the last field with the fourth argument of split() (introduced in gawk-devel and part of released gawk since version 4.0), when the field separator is a full regular expression:

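awk '
 # fs_regex is assumed to be set by the caller, e.g. with -v fs_regex=...
 {
   nf = split($0, fld, fs_regex, delim)
   # loop up to nf (the count returned by split), not NF, which reflects FS
   for (i = 3; i <= nf; ++i)
     printf "%s%s", fld[i], ((i < nf) ? delim[i] : "\n")
 }'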

The fourth (optional) argument delim of split() is an array where delim[i] gets filled with the delimiter string between fld[i] and fld[i+1]. If the field separator is " " (a single space), delim[0] holds the leading whitespace and delim[nf] the trailing whitespace; for a regular expression field separator delim[0] and delim[nf] don't exist.
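
A small run on sample data may make the role of the delimiter array clearer. This sketch assumes a gawk that supports the fourth argument of split() is installed under the name gawk; the helper name fields_from and the separator regex are made up for the illustration.

```shell
# fields_from START REGEX: print fields START..end with their original delimiters
fields_from() {
  gawk -v start="$1" -v fs_regex="$2" '{
    nf = split($0, fld, fs_regex, delim)
    for (i = start; i <= nf; i++)
      printf "%s%s", fld[i], (i < nf ? delim[i] : "\n")
  }'
}

# only run the demonstration where gawk is actually available
if command -v gawk >/dev/null 2>&1; then
  printf 'a, b;c ,d\n' | fields_from 3 '[,;] *'   # prints: c ,d
fi
```

Here the separators ", ", ";" and "," all match the same regex, yet each one is reproduced as it appeared in the input.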

Function using match() and substr()

This approach uses match() to get the position and length of each field separator, then substr() to either trim it from the beginning or add that field and its succeeding separator to an output string. Afterwards, it appends the last field without the separator to the output string and returns said string.

# usage: extract_range(string, start, stop)
# extracts fields "start" through "stop" from "string", based on FS, with the
# original field separators intact. returns the extracted fields.
function extract_range(str, start, stop,     i, re, out) {
  # if FS is the default, trim leading and trailing spaces from "string" and
  # set "re" to the appropriate regex
  if (FS == " ") {
    gsub(/^[[:space:]]+|[[:space:]]+$/, "", str);
    re = "[[:space:]]+";
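  } else {
    # note: match() treats FS as a regex, so a literal single-character FS
    # that is a regex metacharacter (e.g. "|") needs escaping here
    re = FS;
  }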

  # remove fields 1 through start - 1 from the beginning
  for (i=1; i<start; i++) {
    if (match(str, re)) {
      str = substr(str, RSTART + RLENGTH);

    # there's no FS left, therefore the range is empty
    } else {
      return "";
    }
  }

  # add fields start through stop - 1 to the output var
  for (i=start; i<stop; i++) {
    if (match(str, re)) {
      # append the field to the output
      out = out substr(str, 1, RSTART + RLENGTH - 1);

      # remove the field from the line
      str = substr(str, RSTART + RLENGTH);

    # no FS left, just append the rest of the line and return
    } else {
      return out str;
    }
  }

  # append the last field and return
  if (match(str, re)) {
    return out substr(str, 1, RSTART - 1);
  } else {
    return out str;
  }
}

# example use to print $3 through the end: awk '{print extract_range($0, 3, NF)}'

Using index(), substr(), length() and printf()

This approach finds the length of each field, starting from the end of the last field (meaning that any whitespace is included). It then uses printf to space-pad the field to that length, therefore preserving the original whitespace. It works best when the fields are separated by spaces, although it could be adapted for other field separators. Note that tabs will be treated as one character, and replaced with a single space.

# You could hardcode the numbers, replacing "s" and "e" in the code.
# If you wanted "3 to the end", as above, simply replace "e" with "NF".
awk -v s="$start" -v e="$end" '
{
  # the ending offset of the last field, from the beginning, is stored in "prev"
  prev = 0;
  # first, just add the lengths of fields 1 through s - 1 to prev
  for (i=1; i<s; i++) {
    prev += index(substr($0, prev + 1), $i) + length($i) - 1;
  }

  # add the space between s-1 and s to prev
  prev += index(substr($0, prev + 1), $s) - 1;

  # loop over the fields we want to print
  for (i=s; i<=e; i++) {
    # get the length, from the end of the last field to the end of the current
    len = index(substr($0, prev + 1), $i) + length($i) - 1;

    # print the field, padded to that length
    printf("%*s%s", len, $i, i==e ? "\n" : "");

    # add the length to "prev"
    prev += len;
  }
}'

# as a (granted, long) one-liner, printing 3 through the end
awk '{p=0; for (i=1;i<3;i++) p+=index(substr($0,p+1),$i)+length($i)-1; p+=index(substr($0,p+1),$3)-1; for (i=3;i<=NF;i++) {p+=l=index(substr($0,p+1),$i)+length($i)-1; printf("%*s%s",l,$i,i==NF?"\n":"")}}'

An advantage to this method is that you could also use it to process/change fields in a table, and keep the format pretty much the same. The following script doesn't actually remove any fields from the output, but allows you to change the second field and still pad everything the same way:

#!/usr/bin/awk -f

# store the length of each field from the end of the previous
{
  prev = 0;
  for (i=1; i<=NF; i++) {
    lens[i] = index(substr($0, prev + 1), $i) + length($i) - 1;
    # advance past this field so the next search starts after it
    prev += lens[i];
  }
}

# do processing, reassignments, whatever here
{
  $2 = "new";
}

# print fields, padded appropriately
{
  for (i=1; i<=NF; i++) {
    printf("%*s%s", lens[i], $i, i==NF ? "\n" : "");
  }
}
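
To illustrate the effect on an actual table, here is a condensed sketch of the same idea wrapped in a shell function. The name retitle_col2 and the sample rows are made up, and the sketch assumes an awk whose printf accepts the * width, as the scripts above do.

```shell
# retitle_col2: replace field 2 with "new" while keeping the column layout
retitle_col2() {
  awk '{
    prev = 0
    for (i = 1; i <= NF; i++) {
      # width of field i, measured from the end of field i-1 (separator included)
      lens[i] = index(substr($0, prev + 1), $i) + length($i) - 1
      prev += lens[i]
    }
    $2 = "new"
    # right-justify every field in its original width
    for (i = 1; i <= NF; i++)
      printf("%*s%s", lens[i], $i, i == NF ? "\n" : "")
  }'
}

printf '1   apple   10\n22  kiwi     7\n' | retitle_col2
# prints:
# 1     new   10
# 22   new     7
```

Since the padding right-justifies each field, a shorter replacement keeps the right edge of its column; a replacement longer than the recorded width would push the rest of the line to the right.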

Some other approaches

Eric Pement put together a short list of tactics.