The sort utility sorts together the lines of all specified files and writes the result to the standard output. The name `-' means the standard input. If no input files are specified, the standard input is sorted.
The sort command has the form:
sort [ options ] [ +pos1 [ -pos2 ] ] [ file... ]
The default sort key is an entire line. Default ordering is done according to the ASCII machine collating sequence. ASCII is short for American Standard Code for Information Interchange.
The notation +pos1 -pos2 restricts a sort key to a field beginning at pos1 and ending just before pos2. Pos1 and pos2 each have the form m.n, optionally followed by one or more of the options discussed below. m indicates the number of fields to skip from the beginning of the line and n indicates the number of characters to skip further. If any options are present they override all the global ordering options for this key. If the b option is in effect n is counted from the first nonblank in the field; b is attached independently to pos2.
A missing .n means .0; a missing -pos2 means the end of the line. Under the -t x option, fields are strings separated by x; otherwise fields are non-empty nonblank strings separated by blanks.
When there are multiple sort keys, later keys are compared only after all earlier keys compare equal. Lines that otherwise compare equal are ordered with all bytes significant.
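The separator option and the multiple-key rule can be tried on a small scratch file. This is a sketch under two assumptions that are not part of the text above: the data is hypothetical, and the keys are written in the POSIX -k form (first field is 1, `-k 2,2` means "field 2 only"), because many current sort implementations no longer accept the +pos1 -pos2 notation.

```shell
# Hypothetical name:color records, separated by a colon.
tmp=$(mktemp -d)
printf '%s\n' 'pear:green' 'apple:red' 'fig:green' 'cherry:red' > "$tmp/fruit"

# First key: field 2 (color). Second key: field 1 (name), which is
# consulted only when two colors compare equal.
# LC_ALL=C asks for plain byte (ASCII) ordering whatever the locale.
by_color=$(LC_ALL=C sort -t : -k 2,2 -k 1,1 "$tmp/fruit")
printf '%s\n' "$by_color"
```

The two green records come first, ordered by name, followed by the two red ones.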
One or more of the following options may be used to globally affect the ordering.
Symbol         Description of Option
-------------  -------------------------------------------------------
-b             Ignores leading blanks (spaces and tabs) in field
               comparisons.
-c             Checks sorting order and displays output only if out of
               order.
-d             Sorts data according to dictionary ordering: letters,
               digits and blanks only.
-f             Folds uppercase to lowercase while sorting, i.e., ignores
               case when sorting.
-i             Ignores characters outside the ASCII range 040-0176 in
               non-numeric comparisons.
-m             Merges previously sorted data.
-n             Sorts fields with numbers numerically. An initial numeric
               string, consisting of optional blanks, optional minus
               sign, and zero or more digits with optional decimal
               point, is sorted by arithmetic value. (Note that -0 is
               taken to be equal to 0.) Option n implies option b.
-o name        Uses specified file as output file. This file may be the
               same as one of the input files.
-r             Reverses the sense of comparisons.
-T dir         Uses specified directory to build temporary files.
-t x           Uses specified character as field separator.
-u             Suppresses all duplicate entries. Ignored bytes and bytes
               outside keys do not participate in this comparison.
-------------  -------------------------------------------------------
The first set of examples is based on the files hot and cool, which follow.
File hot:
orange
yellow
scarlet
magenta
fuchsia
File cool:
turquoise
aqua
violet
teal
chartreuse
A basic sort of hot and cool.
% sort hot cool
aqua
chartreuse
fuchsia
magenta
orange
scarlet
teal
turquoise
violet
yellow
The sort command reorders text based upon the ASCII values of the characters. Numbers and special symbols precede letters; all uppercase letters precede all lowercase letters. Every character is associated with a standard numeric code and ordering is based upon these codes.
Sort uppercase and lowercase letters (assume that hot is composed entirely of capital letters).
% sort hot cool
FUCHSIA
MAGENTA
ORANGE
SCARLET
YELLOW
aqua
chartreuse
teal
turquoise
violet
Sort uppercase and lowercase letters without regard for case with the -f (fold) option (assume hot is composed entirely of capital letters).
% sort -f hot cool
aqua
chartreuse
FUCHSIA
MAGENTA
ORANGE
SCARLET
teal
turquoise
violet
YELLOW
Sort in reverse order with the -r (reverse) option.
% sort -r cool
violet
turquoise
teal
chartreuse
aqua
You can use multiple options on a single sort command. The following sorts in reverse order without regard for case by including both the -r and -f options.
% sort -r -f hot cool
YELLOW
violet
turquoise
teal
SCARLET
ORANGE
MAGENTA
FUCHSIA
chartreuse
aqua
When you are working with several files, you may find they contain duplicate items. Use the -u option to ensure that each item in the output is unique.
The sort command can also be used with the -m option to merge sorted files. If you are combining large files that are already in order, using the -m option is much faster than issuing an ordinary sort command. Remember that the -m option only works on sorted files.
You can redirect the output of sort into a file. For example:
Output redirection
% sort hot > cool
However, you need to be careful when doing this. The following command will not do what you might expect, because the shell empties the output file before sort reads it:
Incorrect use of output redirection with sort
% sort hot > hot
Instead, you should do the following:
Sorting back into the input file with the -o (output) option.
% sort -o hot hot
% sort -o cool cool
Now the sorted files can be merged using the - m option to the sort command:
Using the -m and -u options to merge the sorted files from Example 7 and remove duplicate items.
% sort -m -u hot cool
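The redirection pitfall, the -o option, and the merge can all be tried safely on scratch copies. This is a sketch only: the file contents are shortened and hypothetical (an extra "orange" is placed in cool so that -u has a duplicate to remove), and LC_ALL=C is an assumption added to get ASCII ordering on current systems.

```shell
tmp=$(mktemp -d)
printf '%s\n' orange yellow scarlet > "$tmp/hot"
printf '%s\n' violet orange aqua > "$tmp/cool"
cp "$tmp/hot" "$tmp/scratch"

# Pitfall: the shell truncates scratch before sort starts reading it,
# so the file ends up empty.
sort "$tmp/scratch" > "$tmp/scratch"
if [ -s "$tmp/scratch" ]; then emptied=no; else emptied=yes; fi

# Safe: with -o, sort reads all of its input before writing the file.
LC_ALL=C sort -o "$tmp/hot" "$tmp/hot"
LC_ALL=C sort -o "$tmp/cool" "$tmp/cool"

# Merge the two sorted files and drop the duplicate line.
merged=$(LC_ALL=C sort -m -u "$tmp/hot" "$tmp/cool")
printf 'scratch emptied: %s\n' "$emptied"
printf '%s\n' "$merged"
```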
The next set of examples uses the family file shown following.
File family:
Mary Miller 555-1011
Tom Miller 555-3110
Rick Miller 555-0107
Emily Miller 555-7200
Sherry Miller 555-0912
Peter Parker 201-4019
Barbara Parker 101-2040
Scott Brown 555-3131
Christi Brown 702-0625
Each line in a file is called a record. Each record can contain any number of fields. A field is a group of nonblank characters. The end of a field is usually indicated by a blank character (SPACE, TAB or RETURN). The fields in a record are numbered from left to right, beginning at zero--the first field is field number 0, the second is field number 1 and so on.
The term sort key is used to refer to the portion of a record that is compared to other records when the file is sorted. The standard sort command treats the entire record as a sort key. Often, however, you want to sort a file according to a specific field.
Using the entire record as the sort key.
% sort family
Barbara Parker 101-2040
Christi Brown 702-0625
Emily Miller 555-7200
Mary Miller 555-1011
Peter Parker 201-4019
Rick Miller 555-0107
Scott Brown 555-3131
Sherry Miller 555-0912
Tom Miller 555-3110
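Sorting by the second field (last name, field number 1) in the family file. Because no -pos2 is given, the key runs from the last name to the end of the line, so records with the same last name are ordered by telephone number.
% sort +1 family
Scott Brown 555-3131
Christi Brown 702-0625
Rick Miller 555-0107
Sherry Miller 555-0912
Mary Miller 555-1011
Tom Miller 555-3110
Emily Miller 555-7200
Barbara Parker 101-2040
Peter Parker 201-4019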
Two-level sort by last name and then first name. When multiple sort keys are specified, later keys are compared only after all earlier keys have compared equal. Here +1 -2 limits the key to the last name; records whose last names compare equal are then ordered by the whole line, which begins with the first name. The result is a sort by first name within last name.
% sort +1 -2 family
Christi Brown 702-0625
Scott Brown 555-3131
Emily Miller 555-7200
Mary Miller 555-1011
Rick Miller 555-0107
Sherry Miller 555-0912
Tom Miller 555-3110
Barbara Parker 101-2040
Peter Parker 201-4019
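On systems whose sort accepts only the POSIX key notation, the two commands above can be approximated with -k. This is a sketch under that assumption: `+1` is taken to correspond to `-k 2` and `+1 -2` to `-k 2,2`, and LC_ALL=C is added to request ASCII ordering.

```shell
tmp=$(mktemp -d)
cat > "$tmp/family" <<'EOF'
Mary Miller 555-1011
Tom Miller 555-3110
Rick Miller 555-0107
Emily Miller 555-7200
Sherry Miller 555-0912
Peter Parker 201-4019
Barbara Parker 101-2040
Scott Brown 555-3131
Christi Brown 702-0625
EOF

# Key starts at the second field and runs to the end of the line.
to_end_of_line=$(LC_ALL=C sort -k 2 "$tmp/family")

# Key is the second field only; ties fall back to the whole line.
last_name_only=$(LC_ALL=C sort -k 2,2 "$tmp/family")

printf '%s\n' "$to_end_of_line" | head -n 1
printf '%s\n' "$last_name_only" | head -n 1
```

The first command should list Scott Brown first (555 sorts before 702), the second Christi Brown (Christi sorts before Scott).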
The last example uses the following baseball file.
File baseball:
Babe Ruth .342 714
Reggie Jackson .268 490
Hank Aaron .305 755
Mickey Mantle .298 536
Lou Gehrig .340 493
Ty Cobb .367 118
Rod Carew .331 87
Willie Mays .302 660
Joe DiMaggio .325 361
Roberto Clemente .317 240
Jackie Robinson .311 137
Carl Yastremski .285 452
Pete Rose .306 158
Willie Stargell .282 475
Ted Williams .344 521
Hank Greenberg .313 331
Unless the -n option is applied to the desired sort key, the sort is made according to ASCII codes. Thus, to get the desired order for the number of home runs from the file baseball:
Sorting numerically on a numeric field with the -n (numeric) option.
% sort +3n baseball
Rod Carew .331 87
Ty Cobb .367 118
Jackie Robinson .311 137
Pete Rose .306 158
Roberto Clemente .317 240
Hank Greenberg .313 331
Joe DiMaggio .325 361
Carl Yastremski .285 452
Willie Stargell .282 475
Reggie Jackson .268 490
Lou Gehrig .340 493
Ted Williams .344 521
Mickey Mantle .298 536
Willie Mays .302 660
Babe Ruth .342 714
Hank Aaron .305 755
If this is not done, Rod Carew appears as the last item in the sort, whereas numerically, by ascending order, he should be the first item as shown.
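The difference between the ASCII and numeric orderings of the home-run column can be seen on three of the records. This is a sketch that assumes the POSIX -k notation, in which the home-run column is the fourth field, and adds LC_ALL=C for ASCII ordering.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Rod Carew .331 87' 'Ty Cobb .367 118' 'Hank Aaron .305 755' > "$tmp/baseball"

# Character comparison: "118" < "755" < "87", so Rod Carew comes last.
as_text=$(LC_ALL=C sort -k 4 "$tmp/baseball")

# Numeric comparison: 87 < 118 < 755, so Rod Carew comes first.
as_number=$(LC_ALL=C sort -k 4n "$tmp/baseball")

printf '%s\n' "$as_text"
printf '%s\n' "$as_number"
```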
There are three forms of the grep command:
The grep command has the syntax:
grep [ option... ] expression [ file... ]
egrep [ option... ] [ expression ] [ file... ]
fgrep [ option... ] [ strings ] [ file ]
Table 8.1 describes the options available with the grep commands:
Option         Description
-------------  -------------------------------------------------------
-b             Precedes each output line with its block number. This is
               sometimes useful in locating disk block numbers by
               context.
-c             Produces count of matching lines only.
-e expression  Uses next argument as the expression; useful when the
               expression begins with a minus (-).
-f file        Takes regular expression (egrep) or string list (fgrep)
               from file.
-i             Considers uppercase and lowercase letters identical in
               making comparisons (grep and fgrep only).
-l             Lists the names of files with matching lines only once,
               separated by a new line.
-n             Precedes each matching line with its line number.
-s             Silent mode; nothing is printed (except error messages).
               This is useful for checking the error status.
-v             Displays all lines that do not match specified expression.
-w             Searches for an expression as for a word (as if
               surrounded by `\<' and `\>').
-x             Prints only lines matched in their entirety (fgrep only).
-------------  -------------------------------------------------------
Table 8.1: Options available with the grep Command
A \ followed by a single character other than a new line character matches that character.
The character ^ matches the beginning of a line.
The character $ matches the end of a line.
A . (dot) matches any character.
A single character not otherwise endowed with special meaning matches that character.
A string enclosed in brackets [...] matches any single character from the string. Ranges of ASCII character codes may be abbreviated as in `a-z0-9'. The ] character may occur only as the first character of the string. A literal - must be placed where it can't be mistaken as a range indicator. The ^ character may be used as a not operator with a string enclosed in brackets.
A regular expression followed by an * (asterisk) matches a sequence of 0 or more matches of the regular expression. A regular expression followed by a + (plus) matches a sequence of 1 or more matches of the regular expression. A regular expression followed by a ? (question mark) matches a sequence of 0 or 1 matches of the regular expression.
Two regular expressions concatenated match a match of the first followed by a match of the second.
Two regular expressions separated by | or a new line character match either a match for the first or a match for the second.
A regular expression enclosed in parentheses matches a match for the regular expression.
The order of precedence of operators at the same parenthesis level is as follows: [ ], then * + ?, then concatenation, then | and new line.
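The repetition and alternation operators described above can be tried on a few words. This is a sketch with hypothetical data; it assumes that the extended (egrep) syntax is available as `grep -E`, which is the POSIX spelling on current systems.

```shell
tmp=$(mktemp -d)
printf '%s\n' ab aab b cat dog > "$tmp/words"

# + : one or more a's before the b.
one_or_more=$(grep -E '^a+b$' "$tmp/words")

# ? : zero or one a before the b.
zero_or_one=$(grep -E '^a?b$' "$tmp/words")

# | : either expression may match.
either=$(grep -E 'cat|dog' "$tmp/words")

printf '%s\n' "$one_or_more" "$zero_or_one" "$either"
```

The three commands select ab and aab, then ab and b, then cat and dog.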
The examples which follow are based on the price.fruit file which is listed below.
apples .49
oranges .59
grapefruit 0.59
lemons .25
limes 0.33
cantaloupe 1.19
watermelon 0.25
kiwi 3.89
bananas .87
pineapples 1.49
grapes 2.19
raspberries 3.39
strawberries 2.38
blueberries 1.89
cranberries 1.59
peaches 0.79
nectarines .79
plums .69
Find the expression in any position.
% grep 'apples' price.fruit
apples .49
pineapples 1.49
Anchor a pattern to the beginning of a line.
% grep '^apples' price.fruit
apples .49
Anchor a pattern to the end of a line.
% grep '79$' price.fruit
peaches 0.79
nectarines .79
Specify a choice of characters.
% grep '^[kls]' price.fruit
lemons .25              (Every line in file beginning
limes 0.33               with k, l, or s)
kiwi 3.89
strawberries 2.38
Select a range of characters.
% grep '^[a-g]' price.fruit
apples .49              (Every line in file beginning
grapefruit 0.59          with letters a through g)
cantaloupe 1.19
bananas .87
grapes 2.19
blueberries 1.89
cranberries 1.59
Negate using either -v or [^ ].
% grep -v '9$' price.fruit
or
% grep '[^9]$' price.fruit
lemons .25              (Every line that
limes 0.33               does not end in a 9)
watermelon 0.25
bananas .87
strawberries 2.38
Note: a caret (^) inside the `[ ]' means negation; a caret outside refers to the beginning of the line.
Negate using the -v option.
% grep -v 'es' price.fruit
grapefruit 0.59         (Every line that does not
lemons .25               contain the pattern "es")
cantaloupe 1.19
watermelon 0.25
kiwi 3.89
bananas .87
plums .69
Find all lines of file at least ten characters long.
Expression is '..........$'
Find all lines of file starting with four-letter words beginning with the letter 'R'.
Expression is '^R... '
Find all lines of file ending with six-letter words.
Expression is ' ......$'
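The three expressions above can be checked by counting matches with the -c option. This is a sketch on hypothetical sample lines, not on a file from the text. Note that a dot matches any character, so "six-letter word" here really means "a blank followed by exactly six characters at the end of the line".

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Rick Miller' 'Tom Brown' 'Rosa Parker' 'Ruth Lee' 'Al Gehrig' > "$tmp/names"

# Lines at least ten characters long.
long_lines=$(grep -c '..........$' "$tmp/names")

# Lines starting with a four-letter word that begins with R.
r_words=$(grep -c '^R... ' "$tmp/names")

# Lines ending with a six-letter word.
six_at_end=$(grep -c ' ......$' "$tmp/names")

printf '%s %s %s\n' "$long_lines" "$r_words" "$six_at_end"
```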
Match five characters from the price.fruit file.
% grep '^.....TAB' price.fruit
limes 0.33              (Every line beginning with
plums .69                a five character word)
To Find:       Enter:
-------------  -----------
[              \[
]              \]
.              \.
*              \*
$              \$
?              \?
|              \|
^              \^
\              \\
-------------  -----------
Match a literal '.' character by escaping it.
% grep '\.59' price.fruit
oranges .59             (Every line that contains
grapefruit 0.59          .59 in it)
cranberries 1.59
Use '.' both as a literal character and as a metacharacter.
% grep '[0TAB]\...' price.fruit
apples .49              (Every line where price
oranges .59              is under one dollar)
grapefruit 0.59
lemons .25
limes 0.33
watermelon 0.25
bananas .87
peaches 0.79
nectarines .79
plums .69
Use an * to specify repeating characters.
% grep '^[a-g].*[0TAB]\...' price.fruit
apples .49              (Every line which
grapefruit 0.59          starts with a through g and
bananas .87              the price is under one dollar)
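In the patterns above, TAB stands for a literal tab character. One way to put a tab into a pattern from a script is to build it with printf. This is a sketch which assumes that the two columns of price.fruit are separated by a single tab; only a few of the records are reproduced.

```shell
tmp=$(mktemp -d)
# Each name/price pair becomes one tab-separated line.
printf '%s\t%s\n' apples .49 limes 0.33 cantaloupe 1.19 kiwi 3.89 plums .69 > "$tmp/price.fruit"

tab=$(printf '\t')

# Five characters followed by a tab: five-character names.
five=$(grep "^.....${tab}" "$tmp/price.fruit")

# A zero or a tab, then a literal dot and two more characters:
# prices under one dollar.
cheap=$(grep "[0${tab}]\\..." "$tmp/price.fruit")

printf '%s\n' "$five" "$cheap"
```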
Once the information is retrieved, awk can print it, alter it, reformat it, or ignore it. One of the most frequent uses of awk is in generating tables and reports from files of raw data.
To use awk effectively, you should be able to select records by field, string function, pattern comparison, arithmetic comparison, or by using variables and special operators.
With awk, as in most of the UNIX filters, the term record refers to a line of input. The different parts of a record are known as fields. The fields are separated by a field separator which, by default, is a blank or a tab.
The way fields are numbered with awk is different than the way they are numbered in the sort filter. Numbering of awk fields begins with 1 for the first field in the record, 2 for the second field, 3 for the third field, up to n for the nth field. Field 0 refers to the entire record. To reference the nth field in a record, the notation is $n.
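The field numbering can be checked on a single record. This is a minimal sketch using one line from the family file; NF is the predefined count of fields in the current record.

```shell
# $1 is the first field, $3 the third, $0 the whole record.
fields=$(printf 'Mary Miller 555-1011\n' |
    awk '{ print $1; print $3; print NF; print $0 }')
printf '%s\n' "$fields"
```

The four printed lines are the first name, the telephone number, the field count (3), and the unchanged record.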
An awk program is defined in terms of patterns and actions. The format of an awk program is:
pattern_1 { action_1 }
pattern_2 { action_2 }
    .
    .
    .
pattern_n { action_n }
Awk uses patterns to select the specified records. The input file is processed one line at a time, and as each record is read, it is compared to pattern_1. If pattern_1 matches the record, action_1 is executed. This select and execute process is repeated for every pattern/action pair in the awk program. Then the next record is read and processed, beginning with the first pattern/action pair. This procedure continues until all of the records in the input file(s) have been processed.
If any pattern is missing, the associated action is performed for all of the records in the input file(s). If any action is missing, awk prints every record that matches the associated pattern. The awk utility does not automatically print every line in the original file. If no action is given, the default action is to print the record. If an action is supplied, however, the record is not printed unless a print statement is included in the action. Actions may consist of a single command or several commands. The left brace { of an action must appear on the same line as the associated pattern.
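The defaults described above can be seen with three one-line programs. This is a sketch on three records taken from the cities file used later; the variable n in the last program is a hypothetical counter.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Boston, MA 02210' 'Troy, MI 48099' 'Lowell, MA 01853' > "$tmp/cities"

# Pattern without action: matching records are printed.
no_action=$(awk '$2 == "MA"' "$tmp/cities")

# Action without pattern: the action runs for every record.
no_pattern=$(awk '{ print $2 }' "$tmp/cities")

# Action without a print statement: nothing is printed.
silent=$(awk '$2 == "MI" { n++ }' "$tmp/cities")

printf '%s\n' "$no_action" "$no_pattern"
```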
There are two ways to invoke awk. One way is to include the awk program on the awk command line:
% awk 'program' filename_1 filename_2 ... filename_n
where program is an awk program enclosed in single quotes and filename_1, filename_2 ... filename_n are input files.
The other way to invoke awk is to put the program in a file and use the -f option on the awk command. This method is preferred for awk programs that are complex or longer than a few commands. The format of the awk command using the -f option is:
% awk -f program_file filename_1 filename_2 ... filename_n
where program_file is the file containing the awk program and filename_1, filename_2 ... filename_n are input files.
Since awk is a filter, standard input is used if no input file is named when you invoke it. Like all filters, awk also writes to standard output unless you use the output redirection symbol (>) to send the output to a file.
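The -f form, standard input, and output redirection can be combined as follows. This is a sketch: the program file name ny.awk and the output file names are hypothetical, and only three records of the cities file are used.

```shell
tmp=$(mktemp -d)

# The quoted 'EOF' keeps the shell from expanding $1 and $2.
cat > "$tmp/ny.awk" <<'EOF'
# ny.awk -- print the city field of every New York record
$2 == "NY" { print $1 }
EOF

printf '%s\n' 'Amawalk, NY 10501' 'Boston, MA 02210' 'Brooklyn, NY 11223' > "$tmp/cities"

# Input named on the command line, output redirected to a file.
awk -f "$tmp/ny.awk" "$tmp/cities" > "$tmp/from_file"

# No input file named: awk reads standard input instead.
awk -f "$tmp/ny.awk" < "$tmp/cities" > "$tmp/from_stdin"

cat "$tmp/from_file"
```

Both invocations should produce the same two lines. The city field keeps its trailing comma because the default field separator is a blank or a tab.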
Table 8.2 provides a summary of some of the special symbols, functions, and expressions used by awk. Those elements predefined by the system are noted by =>.
Table 8.2 Special Symbols, Functions, and Expressions with awk

   Expression                    Significance
   ----------------------------  ----------------------------------------
   $0                            Refers to the entire record.
   $n                            Refers to field n.
   BEGIN                         Pattern that matches the beginning of
                                 the file.
   END                           Pattern that matches the end of the file.
=> NR                            The number of records that have been
                                 read at the time of evaluation.
=> NF                            The number of fields in the current
                                 record.
=> FS                            The field separator in the input.
=> OFS                           The field separator in the output.
   length                        Returns the length of the current record.
   length(argument)              Returns the length of the argument.
   substr(string, start, num)    Returns the portion of string that
                                 starts at position start and extends
                                 for num positions.
   index(string_1, string_2)     Returns the position of the first
                                 occurrence of string_2 in string_1;
                                 returns 0 if string_2 is not found.
   string_1 string_2             Concatenates string_1 with string_2.
   expression_1 && expression_2  Evaluates to true only if both
                                 expression_1 and expression_2 evaluate
                                 to true.
   expression_1 || expression_2  Evaluates to true if expression_1 or
                                 expression_2 or both evaluate to true.
   ----------------------------  ----------------------------------------
The following table provides a summary of the statements used in awk actions:
Statement               Effect
----------------------  ----------------------------------------------
variable = expression   Assigns the value of the expression to the
                        variable. The expression can be numeric or
                        string; the variable can be an ordinary awk
                        variable or a field variable.
variable += expression  Increases the value of the variable by the
                        (numeric) value of the expression.
variable ++             Increases the value of the variable by 1.
variable -= expression  Decreases the value of the variable by the
                        (numeric) value of the expression.
variable --             Decreases the value of the variable by 1.
print                   Prints the entire current input record.
print parameters        Prints the values of all evaluated parameters.
                        If the parameters are separated by commas in
                        the print statement, they will be separated by
                        the OFS in the output.
# comment               Text between # and end-of-line is ignored. Used
                        for commenting programs.
----------------------  ----------------------------------------------
The special patterns BEGIN and END match the beginning and end of the input, respectively.
Regular expressions are used to define patterns of characters. When regular expressions are used in a pattern, the notation is the same as the regular expression notation used by grep. The expressions are enclosed in slashes (/), and the associated action is executed for any record in the input file that matches the expression.
The awk utility allows you to select records by means of comparisons using comparison patterns. A comparison pattern is a logical condition that evaluates to true or false. Comparison patterns have the format:
expression_1 operator expression_2
where expression_1 and expression_2 are both arithmetic expressions or string expressions, and operator is one of the comparison operators listed below.
Operator       Definition
-------------  ----------------------------
==             Is equal to
!=             Is not equal to
>              Is greater than
<              Is less than
>=             Is greater than or equal to
<=             Is less than or equal to
-------------  ----------------------------
Arithmetic expressions are expressions that evaluate to a number. These include numeric constants; field variables that contain numbers; predefined numeric references (such as NF and NR); functions that return numbers; and any operation that uses the symbols shown below:
Symbol         Definition
-------------  ------------------------------------------
+              Addition
-              Subtraction
*              Multiplication
/              Division
%              Modulus (integer remainder of a division)
-------------  ------------------------------------------
The syntax for combined comparisons is:
comparison_1 logical_operator comparison_2
where comparison_1 and comparison_2 are any comparisons that evaluate to true or false, and logical_operator is either the AND symbol (&&) or the OR symbol (||).
You can select a set of records from the input by referring to ranges of patterns in the pattern section. The format for this is:
pattern_1, pattern_2 { action }
Variables in awk do not require declarations. They may contain either numeric or string values; awk interprets the type of information a variable is supposed to contain from the context of the reference.
If a variable is given as an argument to a string function, it is assumed to be a string. If a variable is given as an operand in an arithmetic statement, it is assumed to be numeric.
The following examples use the cities file that is listed below.
Forestville, CA 95436
Amawalk, NY 10501
Boston, MA 02210
Westchester, PA 19380
Troy, MI 48099
Agoura, CA 91301
Westport, CT 06880
Minneapolis, MN 55421
Lowell, MA 01853
Deerfield, IL 60015
Mankato, MN 56002
Brooklyn, NY 11223
Huntington, NY 11743
Austin, TX 78744
Dallas, TX 75240-6728
Dallas, TX 75240-3145
Lexington, MA 02173
Medford, MA 02156
Minneapolis, MN 55415
Minneapolis, MN 55414
Braintree, MA 02184
Boston, MA 02107
Berkeley, CA 94704
Dallas, TX 75230
Cambridge, MA 02142
Cambridge, MA 02138-0043
Providence, RI 02912
Cambridge, MA 02138-0057
Chicago, IL 60606
Inglewood, CA 90301
Glenview, IL 60025
Print the state and zip code of every record in the cities file. Put a dash between the two fields.
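One possible program for this task, offered as a sketch rather than as the original solution, concatenates the two fields with a literal dash between them. Two records from the cities file are piped in for illustration.

```shell
# Adjacent strings are concatenated, so no OFS is inserted here.
state_zip=$(printf '%s\n' 'Forestville, CA 95436' 'Dallas, TX 75240-6728' |
    awk '{ print $2 "-" $3 }')
printf '%s\n' "$state_zip"
```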
Print the records in the cities file with the zip code first, followed by a TAB, followed by the city, a comma, and the state.
The output records should look like:
95436        Forestville, CA
56002        Mankato, MN
75240-6728   Dallas, TX
02912        Providence, RI
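A sketch of one way to produce this layout follows; it is an illustration, not the original solution. The city field already ends with its comma, so only the TAB has to be supplied explicitly, and the comma in the print statement inserts the default OFS (a blank) before the state.

```shell
# "\t" inside an awk string is a tab character.
zip_first=$(printf '%s\n' 'Forestville, CA 95436' 'Providence, RI 02912' |
    awk '{ print $3 "\t" $1, $2 }')
printf '%s\n' "$zip_first"
```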
Print the sentence: ``The length of the record is n'' for all records in cities file.
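A sketch of one program that prints this sentence follows, again as an illustration rather than the original solution. It uses the length function from Table 8.2 on the whole record, and is shown on two records from the cities file.

```shell
# length($0) counts every character in the record, blanks included.
lengths=$(printf '%s\n' 'Troy, MI 48099' 'Boston, MA 02210' |
    awk '{ print "The length of the record is", length($0) }')
printf '%s\n' "$lengths"
```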
All the remaining examples represent awk programs developed to do the defined tasks, and executed as:
% awk -f program.awk cities
Print underlined heading that says: ``Nine-Digit Zip Codes'' followed by all records in cities file with nine-digit zip codes.
BEGIN {
    print "Nine-Digit Zip Codes"
    print "--------------------"
}
length($3) == 10 {print}
Print all records from cities file where the zip code ends with a zero (0). Use a regular expression to select the records.
/0$/ {print}
Print all records where the zip code starts with a zero. Use a string function for this program.
substr($3, 1, 1) == "0" {print}
or
index ($3, "0") == 1 {print}
Print all records where the zip code contains a zero. Use a string function for this program.
index ($3, "0") != 0 {print}
Print all records where the state is either Texas (TX) or Michigan (MI).
($2 == "TX") || ($2 == "MI") {print}
Print all records where the state is Massachusetts (MA) and the zip code ends with the digit `3'.
($2 == "MA") && (substr($3, length($3)) == "3") {print}    # last character, so nine-digit zip codes are handled too
Print the 15th through the 25th record in the list.
NR == 15, NR == 25 {print}
Write an awk program to count and print the number of times the states of California (CA), Massachusetts (MA), and Texas (TX) appear in the cities file. Your output should look like: ``There are 4 cities from California.''.
#
# count.awk -- Count and print the number of cities from
#              California, Massachusetts and Texas.
#
BEGIN {                     # initialize counters
    cal = 0
    mas = 0
    tex = 0
}
($2 == "CA") {cal ++}       # increment approp. counter
($2 == "MA") {mas ++}
($2 == "TX") {tex ++}
END {                       # print the summary
    print "There are", cal, "cities from California."
    print "There are", mas, "cities from Massachusetts."
    print "There are", tex, "cities from Texas."
}