09. Filters (Part 2)

  • sort
  • egrep
  • fgrep
  • grep

  • sort : puts the input lines in a specified order
  • The default order is based on ASCII value
    • ASCII value: every text character is stored as a specific number in memory. This number is the ASCII value of the character. The one-to-one correspondence between each text character and its number representation is stored in the ASCII table
    • In the ASCII table, the digits 0-9 are in numeric order, and the letters a-z and A-Z are in alphabetical order. The digits come first, then the uppercase letters, then the lowercase letters.
    • You can run  man ascii     to see the complete ASCII table
  • Default format:    sort  filename
  • How sort works by default:
    • Start by comparing the first character of the input lines
    • Keep comparing the corresponding characters of the two lines until it finds 2 characters that are not the same. The character with the smaller ASCII value comes first, so the line containing that character comes first
  • Set the locale environment variable before using sort, so that sort uses the ASCII order described above:   export  LC_ALL="POSIX"
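The default ordering described above can be tried with a short sketch. The five input words are made-up sample data, not taken from these notes; they are chosen so that one line starts with a digit, two with uppercase letters, and two with lowercase letters.

```shell
# Force plain ASCII (byte value) ordering, as recommended above.
export LC_ALL=POSIX

# Hypothetical sample input, fed to sort through a pipe instead of a file.
default_sorted=$(printf '%s\n' banana Cherry 42 apple Apple | sort)

# Digits come first, then uppercase letters, then lowercase letters:
# 42, Apple, Cherry, apple, banana
printf '%s\n' "$default_sorted"
```

Note that Cherry comes before apple: the uppercase C has a smaller ASCII value than the lowercase a.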

sort  Options
  • c : (for check) sort does not print the sorted lines. It prints a “disorder” message to screen if the input lines are not sorted, and prints nothing if they are already sorted. This option is useful if you want to check whether the file is sorted but don’t want to see the file printed to screen
  • r : (for reverse) the lines will be sorted in reverse ASCII order
  • f : (for fold over) sort “folds” all lowercase letters into uppercase first, then does the sorting. sort will then see all uppercase letters while sorting, resulting in a case insensitive sort. This means ‘a’ is considered to be the same as ‘A’, for example. The output lines will remain in the same case as the input lines and will not be in all uppercase. 
  • d : (for dictionary) sort uses only letters, digits, and blanks to determine the sort order, and ignores all other characters. A line that has none of these characters compares as an empty line and is pushed up to the top of the sorted list
  • n : (for numeric) sort looks at 123 as the number 123 rather than the character 1, character 2, character 3. This option is useful if you have numbers that need to be sorted. 
    • For example, without the -n option, 123 will come before 45 since the character 1 comes before the character 4. With the -n option, the number 45 is less than the number 123, so 45 will come before 123.
  • M : (for Month) sort treats the first 3 characters as a month-name abbreviation (Jan, Feb, Mar, etc.) and sorts in calendar order. If the first 3 characters (regardless of case) don’t match a month abbreviation, sort treats the line as coming before January and pushes it up to the top of the sorted list
  • t : (for field delimiter) the default delimiter is space or tab. If the fields are separated by any other character, use this option followed by that character (for example, -t: for colon-separated fields)
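A few of the options above can be compared side by side in a small sketch. All input values are invented for illustration, and the -M line assumes a sort that supports that option (GNU sort does).

```shell
export LC_ALL=POSIX

# Without -n the lines are compared character by character: 123, 45, 9
text_order=$(printf '%s\n' 123 45 9 | sort)

# With -n the lines are compared as numbers: 9, 45, 123
numeric_order=$(printf '%s\n' 123 45 9 | sort -n)

# -r reverses whatever order is in effect: 123, 45, 9
reverse_order=$(printf '%s\n' 123 45 9 | sort -n -r)

# -f ignores case while comparing but keeps the original case in the output:
# apple, banana, Cherry
folded_order=$(printf '%s\n' banana Cherry apple | sort -f)

# -M sorts month abbreviations in calendar order: Jan, Feb, Mar
month_order=$(printf '%s\n' Mar Jan Feb | sort -M)

# -c only checks: the exit status is non-zero when the input is not sorted.
# The "disorder" message goes to standard error, discarded here.
if printf '%s\n' b a | sort -c 2>/dev/null; then
    check_result=sorted
else
    check_result=unsorted
fi
printf '%s\n' "$check_result"
```

Testing the exit status of sort -c, as in the last step, is a convenient way to use the check inside a script.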

sort by Fields
  • If the input lines have fields with delimiters, you can tell sort to sort by a specific field in the line, instead of sorting from the beginning of the line
  • Field numbering starts at 1
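  • To sort by field(s):  sort  +n1  -n2  filename
    • where n1, n2 are numbers
    • sort will skip n1 fields, start comparing characters at field n1+1, and stop comparing characters when it reaches field n2+1
    • This +n1 -n2 form is the traditional syntax. Current versions of sort write the same key as  -k n1+1,n2  (for example,  sort -k 3,4 filename  in place of  sort +2 -4 filename), and may no longer accept the +n1 form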
  • Examples:   
    • sort +2  -4 filename  
      •  sort by fields 3 and 4 only (start sorting at field 3 and stop sorting when reaching field 5)
    • sort +1  -2 filename  
      •  sort by field 2 only (start sorting at field 2 and stop sorting when reaching field 3)
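The field examples above can be sketched as follows. Because many current sort builds reject the +n1 -n2 form, the sketch uses the equivalent -k form (+1 -2 becomes -k 2,2, and +2 -4 becomes -k 3,4). The colon-delimited records and the variable names are hypothetical sample data.

```shell
export LC_ALL=POSIX

# Hypothetical records with four colon-separated fields: id:last:first:year
records=$(mktemp)
printf '%s\n' '3:smith:anna:2001' '1:jones:bob:1999' '2:smith:al:2000' > "$records"

# Field 2 only (traditional form: sort -t: +1 -2).
# The two smith lines tie on field 2; sort then falls back to comparing
# the whole line, so 2:smith... comes before 3:smith...
by_field_2=$(sort -t: -k 2,2 "$records")

# Fields 3 through 4 (traditional form: sort -t: +2 -4).
by_fields_3_4=$(sort -t: -k 3,4 "$records")

printf '%s\n' "$by_field_2" '---' "$by_fields_3_4"
rm -f "$records"
```

In the second command the key runs from the start of field 3 to the end of field 4, so al:2000 sorts before anna:2001, which sorts before bob:1999.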

sort with Multiple Passes
  • By default sort will sort the input lines one time, and then print the sorted lines on screen. This is one pass of the sort.
  • When sorting by fields, it may be useful to sort with several passes. 
  • For example, suppose you sort by field 1 (the month) in the first pass, and 8 input lines have the same month in field 1. You can ask sort to make a second pass that sorts those 8 lines by field 5 (the day):    sort  +0  -1  +4  -5   filename
    • The command above means that sort will sort by field 1 only (the month field) in the first pass. Then, for all lines that have the same month in field 1, sort makes a second pass that sorts these lines by field 5 (the day field)
  • You can have as many passes as needed
  • The passes don’t have to be in field order
  • For example,  sort  +3  -4  +0  -1  +4  -5  filename
    • The command above will sort by field 4 in the first pass, and for all lines that have the same value in field 4, run a second pass to sort by field 1, and for the lines that have the same values in both field 4 and field 1, run a third pass to sort by field 5
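A minimal sketch of multiple passes, again written with the -k form that current sort versions accept (each +n1 -n2 pair becomes one -k n1+1,n2 key). The region:product:quantity records are made up for illustration.

```shell
export LC_ALL=POSIX

# Hypothetical records: region:product:quantity
sales=$(mktemp)
printf '%s\n' 'east:widget:30' 'west:gadget:10' 'east:gadget:20' 'west:widget:05' > "$sales"

# First pass on field 1; lines with the same field 1 are ordered by field 2.
# Traditional form: sort -t: +0 -1 +1 -2
region_then_product=$(sort -t: -k 1,1 -k 2,2 "$sales")

# The passes do not have to be in field order: field 2 first, then field 1.
# Traditional form: sort -t: +1 -2 +0 -1
product_then_region=$(sort -t: -k 2,2 -k 1,1 "$sales")

printf '%s\n' "$region_then_product" '---' "$product_then_region"
rm -f "$sales"
```

The first result groups the lines by region (east, east, west, west); the second groups them by product (gadget, gadget, widget, widget), with the region deciding the order inside each group.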

Multiple passes with options
  • If an option applies to all passes, put the option as a separate option on the command line:    
    • sort  -r  +3 -4  +1 -2   filename
      • sort will do a reverse sort of field 4 in the first pass. Then for all lines with identical data in field 4, do a reverse sort of field 2 in the second pass
  • If an option applies to only 1 pass, combine the option with the +n option:     
    • sort  +5r  -6  +0n  -1   filename                                  
      • sort will do a reverse sort of field 6 in the first pass. Then for all lines with identical data in field 6, do a numeric sort of field 1 in the second pass
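The difference between an option for all passes and an option for one pass can be sketched as below. As before, the -k form stands in for +n1 -n2, with the option letters attached to the key they apply to. The region:name:score records are hypothetical.

```shell
export LC_ALL=POSIX

# Hypothetical records: region:name:score
scores=$(mktemp)
printf '%s\n' 'north:ann:7' 'south:bob:12' 'north:cy:12' 'south:dee:7' > "$scores"

# -r as a separate option applies to every pass:
# reverse order on field 1, then reverse order on field 2.
all_reversed=$(sort -t: -r -k 1,1 -k 2,2 "$scores")

# Options attached to one key apply to that pass only:
# field 3 is sorted numerically in reverse, field 1 in normal order.
score_desc_then_region=$(sort -t: -k 3,3nr -k 1,1 "$scores")

printf '%s\n' "$all_reversed" '---' "$score_desc_then_region"
rm -f "$scores"
```

In the second result both lines with score 12 come first, and within each score the north line comes before the south line because the second pass is not reversed.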

  • grep : (for global regular expression print) searches through each input line for a specified pattern, and prints the input lines that match the pattern
  • The pattern is specified by a regular expression (covered next in the Regular Expression module).
  • Without using regular expressions, grep is very commonly used for selecting lines that match a specific text string 
  • There are 3 utilities in the grep family:
    • fgrep : (for fixed-string grep, sometimes read as fast grep) typically the fastest, used with text strings only and not with regular expressions
    • grep : the oldest, works with the standard set of regular expressions
    • egrep : (for extended grep) works with the extended set of regular expressions
  • Format:     
    • fgrep   'text string'   filename
    • grep    'regular expression'   filename
    • egrep  'regular expression'  filename
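The three members of the grep family can be compared on the same input. This sketch uses grep -F and grep -E, the option spellings of fgrep and egrep, since some recent grep releases print a warning when the older names are used. The log lines and variable names are invented sample data.

```shell
export LC_ALL=POSIX

# Hypothetical input file with four lines.
log=$(mktemp)
printf '%s\n' 'error: disk full' 'warning: a.b mismatch' 'info: axb ok' 'error: retry' > "$log"

# fgrep behaviour (grep -F): the pattern is a literal text string,
# so the dot matches only a real dot. Selects the warning line only.
fixed_hits=$(grep -F 'a.b' "$log")

# grep: in a regular expression the dot matches any single character,
# so both a.b and axb are selected.
basic_hits=$(grep 'a.b' "$log")

# egrep behaviour (grep -E): the extended set adds alternation with |.
# Selects lines that start with error: or info:.
extended_hits=$(grep -E '^(error|info):' "$log")

printf '%s\n' "$fixed_hits" '---' "$basic_hits" '---' "$extended_hits"
rm -f "$log"
```

The first two commands use the same pattern text, which shows why a literal search is the safer choice when the string contains characters such as a dot.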