10. Regular Expression (Part 2)

  • Metacharacters that are Operators
  • Metacharacters used for literal meaning
    • Escaping metacharacters
  • Useful Tips for Regular Expression

Metacharacters that are operators
  •  | means or
    • The regex  ‘abc|ABC’ means either abc  or  ABC can be a match 
  • ( ) means grouping
    • Useful with the repetition metacharacters
    • A repetition metacharacter applies only to the single item just before it (one character or one [ ] set). To repeat a sequence of characters, enclose the sequence in ( )
  • egrep 'linux|LINUX' inputFile
    • Any line containing linux or LINUX will match
  • egrep 'abc{3}' inputFile
    • Any line containing a, followed by b, followed by 3 c's (abccc) will match
  • egrep '(abc){3}' inputFile
    • Any line containing abcabcabc (3 abc's in a row) will match
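
The three commands above can be tried on a small throwaway file. This is only a sketch: the sample lines and the variable names (sample, alt, single, group) are made up for illustration, and grep -E is used as the standard spelling of egrep.

```shell
# Build a small sample file (contents are invented for this demo).
sample=$(mktemp)
printf '%s\n' 'linux rocks' 'LINUX rocks' 'Linux rocks' \
              'abccc' 'abcabcabc' 'abcabc' > "$sample"

# Alternation: lines containing linux or LINUX (Linux does not qualify).
alt=$(grep -E 'linux|LINUX' "$sample")

# {3} applies only to the c: a, then b, then three c's.
single=$(grep -E 'abc{3}' "$sample")

# {3} applies to the whole group: abc three times in a row.
group=$(grep -E '(abc){3}' "$sample")

printf 'linux|LINUX:\n%s\n' "$alt"
printf 'abc{3}:\n%s\n' "$single"
printf '(abc){3}:\n%s\n' "$group"
rm -f "$sample"
```

Note that abcabc (only 2 repetitions) is not selected by '(abc){3}', and abcabcabc is not selected by 'abc{3}'.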

Metacharacters used for literal meaning
  • When the regex engine sees a metacharacter, it uses the special meaning of the character
  • If you want to use the metacharacter for its literal meaning, you need to escape from the meta meaning
  • 2 ways for a metacharacter to take its literal meaning:
    • \               the next character takes its literal meaning
    • [characters]    most metacharacters inside [ ] have their literal meaning (exceptions: ^ as the first character, - between two characters, and ])
  • egrep '2.5' inputFile
    • Match any line with 2, followed by any single character, followed by 5
    • Matching lines can have: 2a5 or 2 5 or 2.5 or 215
  • egrep '2\.5' inputFile   or   egrep '2[.]5' inputFile
    • Match any line containing the literal text 2.5
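
The difference between the unescaped and escaped dot can be checked with a short experiment. This is a sketch under assumptions: the sample lines and the variable names (sample, any, backslash, bracket) are invented for the demo, and grep -E stands in for egrep.

```shell
# Sample lines chosen for illustration; 25 has nothing between 2 and 5.
sample=$(mktemp)
printf '%s\n' '2a5' '2 5' '2.5' '215' '25' > "$sample"

# Unescaped dot: any single character between 2 and 5.
any=$(grep -E '2.5' "$sample")

# Escaped with a backslash: only a literal dot qualifies.
backslash=$(grep -E '2\.5' "$sample")

# Escaped by placing it inside [ ]: same result as the backslash form.
bracket=$(grep -E '2[.]5' "$sample")

printf 'unescaped:\n%s\n' "$any"
printf 'backslash: %s\n' "$backslash"
printf 'bracket:   %s\n' "$bracket"
rm -f "$sample"
```

The unescaped pattern selects four of the five lines (everything except 25), while both escaped forms select only 2.5.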

Useful Tips for Regular Expression
  • (1) For a regular expression to be flexible (and therefore more useful), it most likely will include both literal characters and metacharacters
  • (2) Make your regular expression as simple (as few characters) as you can
    • Examples of simple thinking:
      • Without anchors, ‘a+’ and ‘a’ both match any line with at least 1 a. Use ‘a’
      • ‘a{1}’ and ‘a’ both describe 1 a. Use ‘a’
      • ‘aaaaaaaaaa’ and ‘a{10}’ both describe 10 a’s. Use ‘a{10}’
      • ‘^.*$’ and  ‘.*’ both match everything in the line. Use ‘.*’
      • ‘linux|Linux’ and ‘[lL]inux’ both match linux or Linux. Use ‘[lL]inux’
      • ‘^A’ and ‘^A.*$’ both describe a line that starts with A. Use ‘^A’
  • (3) Pay attention to what the repetition metacharacters will match
    • Examples of non-intuitive match of repetition:
      • ‘a*’ will match aaaaaaaa (the obvious case), but it will also match bcd (the not so obvious case), because * allows zero a’s
      • ‘^a+$’ means the line must contain only a’s and at least 1 of them, but
      • ‘^a*$’ also matches an empty line (no characters), because * allows zero a’s
  • (4) Don’t forget the anchors ^ and $ when you need to describe the entire line. This typically happens when you’re looking for:
    • exactly n a’s and nothing else: ‘^a{n}$’
    • only a’s (at least 1) and nothing else: ‘^a+$’
    • no a’s: ‘^[^a]+$’ (use ‘^[^a]*$’ if empty lines should also match)
      • If the 3 regexes above don’t have both anchors, then the text string aaaabc will match all 3 of them (for any n up to 4)
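
Tips (3) and (4) are easy to verify by counting matching lines. The sketch below assumes a made-up 4-line sample (aaaabc, aaa, bcd, and an empty line), uses n = 3 for the ‘a{n}’ case, and defines a hypothetical helper named count; grep -E stands in for egrep.

```shell
# Sample file: aaaabc, aaa, bcd, and one empty line (invented for this demo).
sample=$(mktemp)
printf '%s\n' 'aaaabc' 'aaa' 'bcd' '' > "$sample"

# Hypothetical helper: how many lines of the sample match the regex in $1.
# "|| true" hides grep's non-zero exit status when the count is 0.
count() { grep -E -c -e "$1" "$sample" || true; }

# Tip (3): * allows zero repetitions, so a* matches every line.
n_star=$(count 'a*')            # 4
n_plus_line=$(count '^a+$')     # 1 (aaa)
n_star_line=$(count '^a*$')     # 2 (aaa and the empty line)

# Tip (4): without anchors, aaaabc slips through each pattern.
n_exact_open=$(count 'a{3}')    # 2 (aaaabc, aaa)
n_exact=$(count '^a{3}$')       # 1 (aaa)
n_none_open=$(count '[^a]+')    # 2 (aaaabc, bcd)
n_none=$(count '^[^a]+$')       # 1 (bcd)

printf '%-10s %s\n' 'a*' "$n_star" '^a+$' "$n_plus_line" '^a*$' "$n_star_line" \
       'a{3}' "$n_exact_open" '^a{3}$' "$n_exact" \
       '[^a]+' "$n_none_open" '^[^a]+$' "$n_none"
rm -f "$sample"
```

Each anchored pattern selects exactly the line it is meant to describe, while its unanchored counterpart also selects aaaabc.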