AwkChannelWiki: Frequently Asked Questions

Some entries of this page have been copied from the comp.lang.awk_FAQ(Credits)

Content

How do I print a RangeOfFields, eg from field 2 to the end?
How do I print the LastField or the n'th field in a record?
I'm trying to print a number, why do I get 1e+06 instead of 1000001.10?
How do I edit a file in place with awk?
How do I use a variable as a regular expression?
How do I pass a shell variable to awk?
How do I pass an array to awk?
Why "print $variable shows" nothing? why "print "hello $name"" doesn't work?
Variables are dynamic
Initialization within a begin block is possible
How do I find the length of an array?
How do I remove the newlines?
How do I use backreferences in awk?
How do I PrintASingleQuote character?
How do I find the LargestAccurateNumber that my awk can use?
Why would anyone still use awk instead of perl?
Why does SunOS?/Solaris awk behave oddly?
What is a PasteBin?

How do I print a RangeOfFields, eg from field 2 to the end?

Printing a range of fields - all fields but the first, for examples, or fields 3 through 8 - is a surprisingly fiddly little problem.

No field offsets are stored

Although awk performs field splitting, it does not maintain a record (or at least not that is accessible to user code) of the offsets into the line where the splitting actually took place.

This means that you can, for example, assign an empty string to the fields preceding and following the range that you want ( {$1 = $2 = ""; for (i = 9; i <= NF; i++) $i = ""; print} ).

This will however cause awk to recompute the line adding OFS between the field, including the now empty fields at the beginning and at the end, in the case of the above example with a default FS and OFS, the spaces will be squeezed and 2 space will be present in front of all the line. You can remove the front space with $0=substr($0, 1+length(OFS) * 2) , another possibility is to shift the fields and adjust NF eg to keep the fields 3 to 8:

for (i=3;i<=8;i+=1)) $(i-2)=$i # shift $1=$3 $2=$4 ...
NF=8-3+1 # only keep the first six fields
print

Drawbacks of using a loop

In some cases, the desired behaviour is to remove everything before a certain field and then print out the rest of the line. The typical tactic is then to use a for loop, such as this:

awk '{sep="";for (i=2;i<=NF;i++) {printf "%s%s",sep, $i;sep=" "}; printf "\n"}' file

# or, avoids using sep
awk '{for (i=2;i<=NF;i++) {printf "%s%s",(i>2?" ":""), $i}; printf "\n"}' file

# uses OFS and ORS
awk '{for (i=2; i<=NF; i++) printf("%s%s", $i, i==NF ? ORS : OFS)}' file

A loop to select individual fields will also cause anything that appears between the fields to be replaced with " ". This is often not the desired behaviour.

Using sub()

If the separator is the default, you can use a direct sub() on $0 to remove the fields that you don't want. This has the advantage that original spacing between fields will be preserved. For example:

awk '{sub(/^[[:blank:]]*([^[:blank:]]+[[:blank:]]+){n}/,"")}' file

will remove the first "n" fields from the line ("n" must be replaced by the actual number of fields that you want to remove). If you want to remove the last n fields with the same technique, then

awk '{sub(/([[:blank:]]+[^[:blank:]]+){n}[[:blank:]]*$/,"")}' file

should do the job. If you want to keep from field n to field m (meaning "remove from field 1 to n-1 and from m+1 to NF"), just combine the above two techniques, using the appropriate values for the repetition operator.

Keep in mind that the {n} repetition operator in regexes, while specified by POSIX, is not supported by all implementations of awk. With GNU awk, you have to use the --re-interval command line switch (or --posix to get full POSIX compatibility, which includes repetition operators).

If FS is not the default, but it's still a single character (for example, "#"), it is simpler and you can do something like

awk '{sub(/^([^#]*#){n}/,"")}' file

the example that removes the last n fields can be adapted similarly from the code that uses the default FS.

Finally, if FS is a full regular expression, then the problem is not trivial and it's better to use some other technique among those described here.

Using cut

If the field separator is a single character, the cut utility may be used to select a range of fields. There is documentation for the GNU version of cut, which appears in the coreutils package, but which leaves out mention of the range feature. The Open Group spec describes cut in more detail, and the Examples section offers a more detailed guide to the syntax.

cut cannot use field delimiters that are longer than a single character. By default, the delimiter is the tab character. To select fields one, three, and from field 9 to the end of the line, one could write: cut -f1,3,9- . Fields cannot be reordered in this way: cut -f3,1,9- is equivalent to the previous example.

If a single-character delimiter limitation is not a restriction, and if the fields do not need to be reordered, and if no other computation needs to take place, consider using cut instead of awk: it is small, simple and fast, and this simplicity makes its purpose immediately clear to anyone reading the command line.

With gawk-devel's optional fourth parameter of split()

To print from the third to the last field with gawk-devel's split(), when the field separator is a full regular expression:

awk '
 {
   nf = split($0, fld, fs_regex, delim)
   for (i = 3; i <= NF; ++i)
     printf "%s%s", fld[i], ((i < NF) ? delim[i] : "\n")
 }'

The fourth (optional) argument delim of split() is an array where delim[i] gets filled with the delimiter string between fld[i] and fld[i+1]. fld[0] is the delimiter prefix and fld[nf] the delimiter suffix, if the field separator is " " (a blank); for a regular expression field separator fld[0] and fld[nf] don't exist.

Function using match() and substr()

This approach uses match() to get the position and length of each field separator, then substr() to either trim it from the beginning or add that field and it's succeeding separator to an output string. Afterwards, it appends the last field without the separator to the output string and returns said string.

# usage: extract_range(string, start, stop)
# extracts fields "start" through "stop" from "string", based on FS, with the
# original field separators intact. returns the extracted fields.
function extract_range(str, start, stop,     i, re, out) {
  # if FS is the default, trim leading and trailing spaces from "string" and
  # set "re" to the appropriate regex
  if (FS == " ") {
    gsub(/^[[:space:]]+|[[:space:]]+$/, "", str);
    re = "[[:space:]]+";
  } else {
    re = FS;
  }

  # remove fields 1 through start - 1 from the beginning
  for (i=1; i<start; i++) {
    if (match(str, re)) {
      str = substr(str, RSTART + RLENGTH);

    # there's no FS left, therefore the range is empty
    } else {
      return "";
    }
  }

  # add fields start through stop - 1 to the output var
  for (i=start; i<stop; i++) {
    if (match(str, re)) {
      # append the field to the output
      out = out substr(str, 1, RSTART + RLENGTH - 1);

      # remove the field from the line
      str = substr(str, RSTART + RLENGTH);

    # no FS left, just append the rest of the line and return
    } else {
      return out str;
    }
  }

  # append the last field and return
  if (match(str, re)) {
    return out substr(str, 1, RSTART - 1);
  } else {
    return out str;
  }
}

# example use to print $3 through the end: awk '{print extract_range($0, 3, NF)}'

Using index(), substr(), length() and printf()

This approach finds the length of each field, starting from the end of the last field (meaning that any whitespace is included). It then uses printf to space-pad the field to that length, therefore preserving the original whitespace. It works best when the fields are separated by spaces, although it could be adapted for other field separators. Note that tabs will be treated as one character, and replaced with a single space.

# You could hardcode the numbers, replacing "s" and "e" in the code.
# If you wanted "3 to the end", as above, simply replace "e" with "NF".
awk -v s="$start" -v e="$end" '
{
  # the ending offset of the last field, from the beginning, is stored in "prev"
  prev = 0;
  # first, just add the lengths of fields 1 through s - 1 to prev
  for (i=1; i<s; i++) {
    prev += index(substr($0, prev + 1), $i) + length($i) - 1;
  }

  # add the space between s-1 and s to prev
  prev += index(substr($0, prev + 1), $s) - 1;

 # loop over the fields we want to print
  for (i=s; i<=e; i++) {
    # get the length, from the end of the last field to the end of the current
    len = index(substr($0, prev + 1), $i) + length($i) - 1;

    # print the field, padded to that length
    printf("%*s%s", len, $i, i==e ? "\n" : "");

    # add the length to "prev"
    prev += len;
  }
}'

# as a (granted, long) one-liner, printing 3 through the end
awk '{p=0; for (i=1;i<3;i++) p+=index(substr($0,p+1),$i)+length($i)-1; p+=index(substr($0,p+1),$3)-1; for (i=3;i<=NF;i++) {p+=l=index(substr($0,p+1),$i)+length($i)-1; printf("%*s%s",l,$i,i==NF?"\n":"")}}'

An advantage to this method is that you could also use it to process/change fields in a table, and keep the format pretty much the same. The following script doesn't actually remove any fields from the output, but allows you to change the second field and still pad everything the same way:

#!/usr/bin/awk -f

# store the length of each field from the end of the previous
{
  prev = 0;
  for (i=1; i<=NF; i++) {
    lens[i] = index(substr($0, prev + 1), $i) + length($i) - 1;
  }
}

# do processing, reassignments, whatever here
{
  $2 = "new";
}

# print fields, padded appropriately
{
  for (i=1; i<=NF; i++) {
    printf("%*s%s", lens[i], $i, i==NF ? "\n" : "");
  }
}

Some other approaches

Eric Pement put together a short list of tactics.

value	sign+exponent	fraction
2^51	000	8000000000000
2^52	001	0000000000000
2^53-1	001	FFFFFFFFFFFFF
2^53	002	0000000000000
2^53+1	002	0000000000000

Frequently Asked Questions

Content

How do I print a RangeOfFields, eg from field 2 to the end?

No field offsets are stored

Drawbacks of using a loop

Using sub()

Using cut

With gawk-devel's optional fourth parameter of split()

Function using match() and substr()

Using index(), substr(), length() and printf()

Some other approaches

How do I print the LastField or the n'th field in a record?

I'm trying to print a number, why do I get 1e+06 instead of 1000001.10?

How do I edit a file in place with awk?

How do I use a variable as a regular expression?

How do I pass a shell variable to awk?

How do I pass an array to awk?

Why "print $variable shows" nothing? why "print "hello $name"" doesn't work?

Variables are dynamic

Variables do not need predefinition prior to use

Initialization within a begin block is possible

Variable names

Special variables

Variables in awk do not need a sigil

Variable names inside string constants are not expanded in awk

How do I find the length of an array?

How do I remove the newlines?

How do I use backreferences in awk?

How do I PrintASingleQuote character?

The Short Story

The Rambling Tale

Hex Escapes: Bad Juju

Octal Escapes: Great Personality, but...

Uses and Abuses of printf

Explicit Concatenation (oh my!)

Being Creative

Do The Right Thing

Feed the quote as a variable to awk

Using bash's quoting($'string')

How do I find the LargestAccurateNumber that my awk can use?

Technical mumbo-jumbo

See also

Why would anyone still use awk instead of perl?

Why does SunOS?/Solaris awk behave oddly?

What is a PasteBin?

List of Channel PasteBins

List of Channel `PasteBins`