Topics:
Filters
head, tail
cut, paste
uniq, wc
diff, cmp
tr
Filters
Filters are a special group of utilities that work the following way:
They accept multiple lines of input and return an output that is a subset of the input lines
The output is a subset of the input lines because the filter has ‘filtered out’ the lines that you specify
Features that apply to all filters
They accept multiple lines of input, which can come from:
multiple lines of a text file, or
multiple lines of output that are piped in from another command
They return multiple lines of output. The output lines can be as many as the input, or none, or somewhere in between, depending on what you specify to be filtered out
Filters read in the lines of input one at a time, filter each line as you specify, and send the output to standard output (the screen, by default)
This means that if the input comes from a file, the input file itself is not modified by the filter. It also means that if you want the output of the filter as a file, you have to redirect the output to a file.
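The features above can be sketched with a short session. This is only an illustration, assuming a POSIX shell; the file names lines.txt and first_two.txt are made up, and head (covered next) stands in for any filter.

```shell
# Work in a scratch directory so no real files are touched
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A small input file (hypothetical name and contents)
printf 'one\ntwo\nthree\n' > lines.txt

# Input from a file: the result goes to standard output (the screen)
head -n 2 lines.txt

# Input piped in from another command
cat lines.txt | head -n 2

# To keep the result, redirect standard output to a new file
head -n 2 lines.txt > first_two.txt

# The input file itself is unchanged: it still has all 3 lines
wc -l < lines.txt
```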
Filters that you’ve worked with:
cat: not much filtering, the output and input lines are the same
more, less: the output is one screen of the input lines
head and tail
head: displays the beginning part of a file
Format:
head filename shows the first 10 lines of the file
head -n filename n is a number, shows the first n lines (for example, head -5 filename; the equivalent POSIX form is head -n 5 filename)
If the file has fewer lines than 10 or n, head prints all of the file
tail: displays the end part of a file
Format:
tail filename shows on screen the last 10 lines of the file
tail -n filename n is a number, shows on screen the last n lines of the file (for example, tail -5 filename; the equivalent POSIX form is tail -n 5 filename)
If the file has fewer lines than 10 or n, tail prints all of the file
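A quick way to try head and tail, assuming a POSIX shell. The 12-line file twelve.txt is made up for this sketch, and the -n 3 form is used because it is the POSIX spelling of -3.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a 12-line file: "line 1", "line 2", ..., "line 12"
i=1
while [ "$i" -le 12 ]; do
    echo "line $i"
    i=$((i + 1))
done > twelve.txt

head twelve.txt          # first 10 lines: line 1 through line 10
tail twelve.txt          # last 10 lines: line 3 through line 12
head -n 3 twelve.txt     # first 3 lines (same as: head -3 twelve.txt)
tail -n 3 twelve.txt     # last 3 lines (same as: tail -3 twelve.txt)
head -n 50 twelve.txt    # the file has only 12 lines, so all 12 are printed
```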
cut and paste
cut and paste are designed to work with input lines that have fields and regular delimiters
Fields are columns of data
Delimiters are the symbols that separate 2 fields. A regular delimiter means that only one symbol is used throughout the file as delimiter
Example: The lines in the following file have 4 fields, with a regular delimiter of comma:
cis18a,Intro to Linux,fall,2011
cis18b,Advanced Linux,winter,2012
cis18c,Shell Scripting,spring,2012
cut: prints to screen the specified fields in a file
Format: cut -fn -d'c' filename
n is the specified field number, explained below
c is the character used as the delimiter. The character needs single quotes if it is a metacharacter of the shell.
The default delimiter is tab. You don’t need to use the -d option if tab is the delimiter used in the input lines
To select the field(s) that follow the -f option:
n to select field number n (n is a number)
n,m,k to select fields number n, m, and k (n, m, k are numbers)
n-k to select from field number n to field number k (n, k are numbers)
field numbering starts at 1
From the previous example file: cut -f1,3 -d',' exampleFile will result in:
cis18a,fall
cis18b,winter
cis18c,spring
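The field selections above can be tried on the same example file. This is a small sketch, assuming a POSIX shell; the file is recreated in a scratch directory.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the example file from the text
cat > exampleFile <<'EOF'
cis18a,Intro to Linux,fall,2011
cis18b,Advanced Linux,winter,2012
cis18c,Shell Scripting,spring,2012
EOF

cut -f2 -d',' exampleFile      # one field: the course titles
cut -f1,3 -d',' exampleFile    # fields 1 and 3: cis18a,fall and so on
cut -f2-4 -d',' exampleFile    # a range: fields 2 through 4

# With tab-separated input, -d is not needed
printf 'a\tb\tc\n' | cut -f2   # prints: b
```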
paste: pastes together input lines side by side, resulting in appending fields together from left to right
Format: paste -d'c' file_list
c is the character used as the delimiter. The character needs single quotes if it is a metacharacter of the shell
The default delimiter is tab. You don’t need the -d option if you want to use tab as a delimiter between the files that are pasted
file_list is the list of file names that are pasted in order from left to right
Example:
FileA is:    FileB is:
x            1 a
y            2 b
z
paste FileA FileB will result in:
x    1 a
y    2 b
z
paste puts the default tab between the first and second files when it pastes the files together because the delimiter is not specified
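The same example can be run as follows, assuming a POSIX shell. The second command is an extra illustration of the -d option with a comma, which makes the delimiter easier to see than a tab.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'x\ny\nz\n' > FileA
printf '1 a\n2 b\n' > FileB

# Default delimiter: a tab between the line from FileA and the line from FileB
paste FileA FileB

# Same files with a comma as the delimiter
paste -d',' FileA FileB
# x,1 a
# y,2 b
# z,
```

FileB has no third line, so the last output line is z followed by the delimiter and nothing else.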
uniq, wc
uniq: (for unique) filters out non-unique consecutive lines.
If consecutive input lines are identical, only one line will remain and the rest of the lines are filtered out
Format: uniq filename
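A short sketch of the "consecutive" rule, assuming a POSIX shell; fruit.txt is a made-up file.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple\napple\nbanana\napple\n' > fruit.txt

# The two adjacent "apple" lines collapse into one. The last "apple"
# stays because it is not next to another "apple" line.
uniq fruit.txt
# apple
# banana
# apple
```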
wc: (for word count) shows the number of lines, words, and characters in a file
A word is a series of non-space characters separated from other words by whitespace (spaces, tabs, or newlines)
Each space is counted as a character, and so is the newline at the end of each line
Format:
wc filename shows number of lines, words, and characters
wc -l filename (l for line) shows number of lines
wc -w filename (w for word) shows number of words
wc -c filename (c for character) shows number of characters, counted as bytes (the same thing for plain ASCII text)
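A small worked count, assuming a POSIX shell; greeting.txt is a made-up file. Reading the file through < makes wc print only the number, without the file name.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two lines, three words; 11 + 1 newline + 3 + 1 newline = 16 characters
printf 'hello world\nbye\n' > greeting.txt

wc greeting.txt          # the three counts 2, 3, 16, then the file name
wc -l < greeting.txt     # 2
wc -w < greeting.txt     # 3
wc -c < greeting.txt     # 16
```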
diff, cmp
diff and cmp are both used to check whether there is any difference between 2 files
diff (for difference): diff file1 file2
If there is no output, the files are identical
If the files are different, diff shows the action (add, delete, change) that can be done on each differing line in file1 so that it becomes identical to file2
cmp (for compare): cmp file1 file2
If there is no output, the files are identical
If the files are different, cmp shows the location (byte number and line number) of the first character that’s different
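The two commands can be compared side by side. This is a sketch assuming a POSIX shell and two made-up three-line files; the exact wording of the cmp message can vary between systems.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'red\ngreen\nblue\n' > file1
printf 'red\nblue\nwhite\n' > file2

# Identical files: neither command prints anything
diff file1 file1
cmp file1 file1

# Different files. Both commands exit with a non-zero status in this case;
# "|| true" only keeps a strict (set -e) script from stopping here.
diff file1 file2 || true
# 2d1        delete line 2 of file1 ...
# < green
# 3a3        ... and add a line after line 3 of file1
# > white

cmp file1 file2 || true   # reports the first difference: byte 5, line 2
```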
tr
tr: (for translate) accepts a source set of characters and a destination set of characters. tr searches each input line for each character in the source set, and changes it to the corresponding character in the destination set
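tr does not accept a filename as an input. It reads standard input only, so the input must be piped in from another command or redirected from a file with <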
Format: tr -options 'source chars' 'destination chars'
options can be c, d, or s (described under Options below)
'source chars' is the set of characters that will be replaced
'destination chars' is the set of characters used for replacement
There is a one-to-one correspondence between the 2 sets
The first character of the source is replaced by the first character of the destination, the second source character by the second destination character, etc.
if the source set is shorter than destination set: extra characters in destination set are ignored
if the source set is longer than destination set: the last character of destination set is used for each extra character of source set (this is how GNU tr behaves; POSIX leaves this case unspecified)
The characters in the source set and destination set can be any text character
This means any space or comma in the set will be used for replacement. Don’t add these in the set if you don’t want to replace them
Even if the set of characters looks like a word, tr still looks at each individual character. For example: tr 'linux' 'LINUX' means that every l, i, n, u, and x character will be replaced with its uppercase equivalent, not just the ones in the word linux
You can specify a range of characters for the set:
'a-z' or 'A-Z' or '0-9' complete set of letters or numbers
'a-f' or '5-9' partial set of letters or numbers
'a-f0-9' combined sets of letters or numbers
put the - character at the end of the set if you want to include it as a character in the set
Example: tr 'a-d' 'xyz'
a becomes x, b becomes y, c becomes z, d becomes z in the output lines
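The points above can be tried directly. This is a sketch assuming a POSIX shell; note.txt is a made-up file, and the 'a-d' 'xyz' result relies on the padding behavior of GNU tr described earlier.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'linux is fun\n' > note.txt

# tr takes no filename, so redirect the file into it with <
tr 'linux' 'LINUX' < note.txt           # LINUX Is fUN

# ... or pipe in the output of another command
echo 'hello 123' | tr 'a-z' 'A-Z'       # HELLO 123

# Source set longer than destination set (GNU tr reuses the last character)
echo 'abcd' | tr 'a-d' 'xyz'            # xyzz

# A - at the end of the set is an ordinary character, not a range
echo 'a-b' | tr 'ab-' 'xy_'             # x_y
```

The first command shows that tr works on single characters: the i in "is" and the u and n in "fun" are changed too, not only the word linux.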
Options
c : (for complement) all characters that are not in the source set will get changed to characters in the destination set
Example: tr -c 'a-z' '*'
all characters that are not a lowercase letter become a * character (this includes spaces and the newline at the end of each line)
d : (for delete) this option requires only the source set as an argument. All characters in the source set get deleted
Example: tr -d 'a-z'
all lowercase letters get deleted
s : (for squeeze or squash) after the characters in the source set are replaced, any run of identical consecutive characters from the destination set is squeezed into 1 instance of that character. Characters that are not in the destination set are not squeezed.
Example: tr -s 'ab' 'xy' on the input line aaabc a
results in an output of: xyc x
(first 3 a’s become 3 x’s, and then squeezed into 1 x)
You can combine options:
tr -dc 'a-z' means all characters that are not a lowercase letter will be deleted
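Each option can be checked on a one-line input. This is a sketch assuming a POSIX shell; the input strings are made up, apart from the -s example taken from the text.

```shell
# -s: replace, then squeeze runs of the replacement characters
echo 'aaabc a' | tr -s 'ab' 'xy'          # xyc x

# -d: delete every lowercase letter
echo 'Hello, World 42' | tr -d 'a-z'      # H, W 42

# -c: everything that is NOT a lowercase letter becomes *
# (the space, the digits, and the trailing newline: 5 characters)
echo 'abc 123' | tr -c 'a-z' '*'          # abc*****

# Adding \n to the source set protects the newline from replacement
echo 'abc 123' | tr -c 'a-z\n' '*'        # abc****

# -dc: delete everything that is NOT a lowercase letter
echo 'Hello, World 42' | tr -dc 'a-z'     # elloorld
```

With -c and -dc the newline is also outside 'a-z', so the output of those commands does not end with a newline unless \n is added to the set.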