09. Filters (Part 1)

Topics:

  • Filters
  • head, tail
  • cut, paste
  • uniq, wc
  • diff, cmp
  • tr

Filters

  • Filters are a special group of utilities that work the following way:
    • They accept multiple lines of input and return output that is derived from those input lines
    • The output is typically a subset of the input lines (or of the content of each line), because the filter has ‘filtered out’ whatever you specify
  • Features that apply to all filters
    • They accept multiple lines of input, which can come from:
      • multiple lines of a text file, or
      • multiple lines of output piped in from another command
    • They return multiple lines of output. There can be as many output lines as input lines, no lines at all, or anything in between, depending on what you specify to be filtered out
    • Filters read the input lines one at a time, filter each line as you specify, and send the result to standard output
    • This means that if the input comes from a file, the input file itself is not modified by the filter. It also means that if you want to keep the output of the filter, you have to redirect the output to a file (see the example after this list)
  • Filters that you’ve worked with:
    • cat: not much filtering, the output and input lines are the same
    • more, less: the output is one screen of the input lines
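  • A quick example (a sketch, using the head filter covered below and a hypothetical file named notes.txt):
    • head notes.txt the input comes from a file; notes.txt itself is not changed
    • ls | head the input is piped in from the output of ls
    • head notes.txt > first10.txt the filtered output is saved by redirecting it to the file first10.txt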

head and tail

  • head: displays the beginning part of a file
    • Format:
      • head filename shows the first 10 lines of the file
      • head -n filename (n is a number) shows the first n lines of the file
    • If the file has fewer lines than 10 or n, head prints all of the file
  • tail: displays the end part of a file
    • Format:
      • tail filename shows the last 10 lines of the file
      • tail -n filename (n is a number) shows the last n lines of the file
    • If the file has fewer lines than 10 or n, tail prints all of the file
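  • Example (assuming a hypothetical file log.txt that has 100 lines):
    • head log.txt shows lines 1 through 10
    • head -3 log.txt shows lines 1 through 3
    • tail log.txt shows lines 91 through 100
    • tail -3 log.txt shows lines 98 through 100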

cut and paste

  • cut and paste are designed to work with input lines that have fields and regular delimiters
  • Fields are columns of data
  • Delimiters are the symbols that separate 2 fields. A regular delimiter means that only one symbol is used as the delimiter throughout the file
  • Example: The lines in the following file have 4 fields, with a regular delimiter of comma:
    • cis18a,Intro to Linux,fall,2011
    • cis18b,Advanced Linux,winter,2012
    • cis18c,Shell Scripting,spring,2012
  • cut: prints to screen the specified fields in a file
    • Format: cut -fn -d'c' filename
      • n is the specified field number, explained below
      • c is the character used as the delimiter. The character needs single quotes if it is a metacharacter of the shell.
      • The default delimiter is tab. You don’t need the -d option if tab is the delimiter used in the input lines
    • To specify the field(s) that follow the -f option:
      • n to select field number n (n is a number)
      • n,m,k to select fields number n, m, and k (n, m, k are numbers)
      • n-k to select from field number n to field number k (n, k are numbers)
    • field numbering starts at 1
    • From the previous example file: cut -f1,3 -d',' exampleFile will result in:
      • cis18a,fall
      • cis18b,winter
      • cis18c,spring
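  • Another example, using the system file /etc/passwd, where the delimiter is a colon:
    • cut -f1 -d':' /etc/passwd shows the first field of every line, which is the login name of each account
    • cut -f1,7 -d':' /etc/passwd shows the login name and login shell, still separated by the colon delimiter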
  • paste: pastes input lines together side by side, appending the corresponding lines of each file from left to right
    • Format: paste -d'c' file_list
      • c is the character used as the delimiter. The character needs single quotes if it is a metacharacter of the shell
      • The default delimiter is tab. You don’t need the -d option if you want to use tab as a delimiter between the files that are pasted
      • file_list is the list of file names that are pasted in order from left to right
    • Example:
      • FileA contains the lines: x, y, z
      • FileB contains the lines: 1 a, 2 b
    • paste FileA FileB will result in:
      • x    1 a
      • y    2 b
      • z
    • paste puts the default tab between the first and second files when it pastes the files together because the delimiter is not specified
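    • To use a different delimiter: paste -d',' FileA FileB will result in:
      • x,1 a
      • y,2 b
      • z,
    • (the comma still appears after z because FileB has no third line to supply)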

uniq, wc

  • uniq: (for unique) filters out non-unique consecutive lines.
    • If consecutive input lines are identical, only one line will remain and the rest of the lines are filtered out
  • Format: uniq filename
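  • Example (assuming a hypothetical file colors.txt containing the lines red, red, blue, red, one per line):
    • uniq colors.txt results in:
      • red
      • blue
      • red
    • The last red remains because it is not consecutive with the first two identical lines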
  • wc: (for word count) shows the number of lines, words, and characters in a file
  • A word is a sequence of non-whitespace characters separated by whitespace
  • Each space is counted as a character
  • Format:
    • wc filename shows number of lines, words, and characters
    • wc -l filename (l for line) shows number of lines
    • wc -w filename (w for word) shows number of words
    • wc -c filename (c for character) shows number of characters
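  • Example (reusing the hypothetical colors.txt above, which has 4 lines, 4 words, and 17 characters counting the newlines):
    • wc colors.txt shows something like: 4 4 17 colors.txt
    • wc -l colors.txt shows something like: 4 colors.txt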

diff, cmp

  • diff and cmp are both used to check whether there is any difference between 2 files
  • diff (for difference): diff file1 file2
    • If there is no output, the files are identical
    • If the files are different, diff shows the actions (add, delete, change) that can be applied to the lines of file1 that differ, so that file1 becomes identical to file2
  • cmp (for compare): cmp file1 file2
    • If there is no output, the files are identical
    • If the files are different, cmp shows the location (byte and line number) of the first character that’s different
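  • Example: suppose the hypothetical files old.txt and new.txt both contain apple on line 1, but line 2 is banana in old.txt and cherry in new.txt:
    • diff old.txt new.txt prints output similar to:
      • 2c2
      • < banana
      • ---
      • > cherry
    • (2c2 means: change line 2 of old.txt so it matches line 2 of new.txt)
    • cmp old.txt new.txt prints output similar to: old.txt new.txt differ: byte 7, line 2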

tr

  • tr: (for transliteration) accepts a source set of characters and a destination set of characters. tr searches each input line for each character in the source set, and changes it to a corresponding character in the destination set
  • tr does not accept a filename as an argument; its input must be piped in from another command or redirected from a file
  • Format: tr -options 'source chars' 'destination chars'
    • options can be c, d, or s (explained below)
    • ‘source chars’ is the set of characters that will be replaced
    • ‘destination chars’ is the set of characters used for replacement
  • There is a one-to-one correspondence between the 2 sets
    • The first character of the source is replaced by the first character of the destination, the second source character by the second destination character, etc.
    • if the source set is shorter than the destination set: the extra characters in the destination set are ignored
    • if the source set is longer than the destination set: the last character of the destination set is used for each extra character of the source set
  • The characters in the source set and destination set can be any text character
    • This means any space or comma in the set will be used for replacement. Don’t include them in the set if you don’t want them replaced
    • Even if the set of characters looks like a word, tr still works on each individual character. For example: tr 'linux' 'LINUX' means that every l, i, n, u, and x character will be replaced with its uppercase equivalent, not just the word linux (see the example below)
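  • Example: echo 'linux is fun' | tr 'linux' 'LINUX' results in: LINUX Is fUN
    • every l, i, n, u, and x is replaced wherever it appears, not just inside the word linux
    • the text is piped in with echo because tr does not read a filename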
  • You can specify a range of characters for the set:
    • 'a-z' or 'A-Z' or '0-9' complete set of letters or digits
    • 'a-f' or '5-9' partial set of letters or digits
    • 'a-f0-9' combined sets of letters and digits
  • put the - character at the end of the set if you want to include it as a character in the set
  • Example: tr 'a-d' 'xyz'
    • a becomes x, b becomes y, c becomes z, d becomes z in the output lines
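    • To try it from the shell: echo abcde | tr 'a-d' 'xyz' results in: xyzze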
  • Options
    • c : (for complement) all characters that are not in the source set will get changed to characters in the destination set
      • Example: tr -c 'a-z' '*'
      • all characters that are not a lowercase letter become a * character
    • d : (for delete) this option requires only the source set as an argument. All characters in the source set get deleted
      • Example: tr -d 'a-z'
      • all lowercase letters get deleted
    • s : (for squeeze or squash) after the characters in the source set are replaced, consecutive repeats of a destination-set character are squeezed into 1 instance of that character.
      • Example: tr -s 'ab' 'xy' on the input line aaabc a
      • results in an output of: xyc x
      • (first 3 a’s become 3 x’s, and then squeezed into 1 x)
    • You can combine options:
      • tr -dc 'a-z' means all characters that are not a lowercase letter will be deleted
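    • Examples of the options, piping in a hypothetical input line with echo:
      • echo 'Hello, World! 123' | tr -d 'a-z' results in: H, W! 123
      • echo 'Hello, World! 123' | tr -dc 'a-z' results in: elloorld (the newline is deleted too, since it is not in the source set)
      • echo 'aaabc a' | tr -s 'ab' 'xy' results in: xyc x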