10. Regular Expression (Part 2)

Topics:

  • Metacharacters that are Operators
  • Metacharacters used for literal meaning
    • Escaping metacharacters
  • Useful Tips for Regular Expression

Metacharacters that are operators

  • | means or
    • The regex ‘abc|ABC’ means either abc or ABC can be a match
  • ( ) means grouping
    • Useful with the repetition metacharacters
    • A repetition metacharacter applies only to the single character (or bracket expression) immediately before it. To repeat a sequence of characters, enclose the sequence in ( )

Examples

  • egrep 'linux|LINUX' inputFile
    • Any line with linux or LINUX will match
  • egrep 'abc{3}' inputFile
    • Any line containing a, followed by b, followed by 3 c’s (abccc) will match
  • egrep '(abc){3}' inputFile
    • Any line with abcabcabc (3 abc’s in a row) will match
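
The three examples above can be tried with a short script. This is a minimal sketch, not part of the original notes: the sample lines and the variable names are made up for illustration, and it uses grep -E, which is equivalent to egrep.

```shell
# Hypothetical sample input; the contents are invented for this sketch.
input=$(mktemp)
printf '%s\n' 'I use linux' 'I use LINUX' 'I use Linux' \
              'abccc' 'abcabcabc' 'abcabc' > "$input"

# Alternation: lines containing linux or LINUX (but not Linux).
alt=$(grep -E 'linux|LINUX' "$input")

# Without grouping, {3} repeats only the c, so this looks for abccc.
rep_char=$(grep -E 'abc{3}' "$input")

# With grouping, {3} repeats the whole group, so this looks for abcabcabc.
rep_group=$(grep -E '(abc){3}' "$input")

printf '%s\n' "$alt" "$rep_char" "$rep_group"
rm -f "$input"
```

Note that the line abcabc matches neither repetition pattern: it has only 1 c after ab, and only 2 copies of abc.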

Metacharacters used for literal meaning

  • When the regex engine sees a metacharacter, it applies the character’s special meaning
  • To use a metacharacter for its literal meaning, you need to escape it
  • 2 ways to give a metacharacter its literal meaning:
    • \ gives the next character its literal meaning
    • [characters] most metacharacters inside [ ] have their literal meaning (exceptions: ^ as the first character, - between two characters, and ])

Examples:

  • egrep '2.5' inputFile
    • Match any line with 2, followed by any single character, followed by 5
    • Matching lines can have: 2a5 or 2 5 or 2.5 or 215
  • egrep '2\.5' inputFile or egrep '2[.]5' inputFile
    • Match any line containing the literal text 2.5
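
The difference between the unescaped and escaped dot can be checked with a small script. This is a sketch under assumptions: the sample lines and variable names are hypothetical, and grep -E stands in for egrep.

```shell
# Hypothetical sample input; the contents are invented for this sketch.
input=$(mktemp)
printf '%s\n' 'version 2.5' 'code 2a5' 'room 215' 'ratio 2 5' 'price 25' > "$input"

# Unescaped dot: 2, then any single character, then 5.
dot_any=$(grep -E '2.5' "$input")

# Escaped dot: only the literal text 2.5.
dot_escaped=$(grep -E '2\.5' "$input")

# Dot inside [ ]: also only the literal text 2.5.
dot_bracket=$(grep -E '2[.]5' "$input")

printf 'unescaped:\n%s\nescaped:\n%s\nbracket:\n%s\n' \
       "$dot_any" "$dot_escaped" "$dot_bracket"
rm -f "$input"
```

The line price 25 matches none of the patterns, because the unescaped dot still requires exactly one character between 2 and 5.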

Useful Tips for Regular Expression

  • (1) For a regular expression to be flexible (and therefore more useful), it most likely will include both literal characters and metacharacters
  • (2) Make your regular expression as simple (as few characters) as you can
    • Examples of simple thinking:
      • ‘a+’ and ‘a’ both match any line with at least 1 a. Use ‘a’
      • ‘a{1}’ and ‘a’ both describe 1 a. Use ‘a’
      • ‘aaaaaaaaaa’ and ‘a{10}’ both describe 10 a’s. Use ‘a{10}’
      • ‘^.*$’ and ‘.*’ both match everything in the line. Use ‘.*’
      • ‘linux|Linux’ and ‘[lL]inux’ both match linux or Linux. Use ‘[lL]inux’
      • ‘^A’ and ‘^A.*$’ both describe a line that starts with A. Use ‘^A’
  • (3) Pay attention to what the repetition metacharacters will match
    • Examples of non-intuitive match of repetition:
      • ‘a*’ will match aaaaaaaa (the obvious case), but it will also match bcd (the not so obvious case), because * allows zero a’s
      • ‘^a+$’ means the line must contain at least 1 a and nothing else, but ‘^a*$’ also matches an empty line (no characters)
  • (4) Don’t forget the anchors ^ and $ when you need to describe the entire line. This typically happens when you’re looking for:
    • exactly n a’s and nothing else: ‘^a{n}$’
    • only a’s and nothing else: ‘^a+$’
    • no a’s: ‘^[^a]+$’ (use ‘^[^a]*$’ if empty lines should also match)
      • If the 3 regexes above are written without both anchors, the string aaaabc will match all 3 of them (for any n up to 4)
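
Tips (3) and (4) can be verified with a short script. This is a minimal sketch, not the notes' own material: the sample lines and variable names are hypothetical, n is taken to be 3, and grep -E stands in for egrep.

```shell
# Hypothetical sample input: three non-empty lines and one empty line.
input=$(mktemp)
printf '%s\n' 'aaa' 'aaaabc' 'bcd' '' > "$input"

# Tip (3): a* allows zero a's, so all 4 lines match, even bcd and the empty line.
star_count=$(grep -E -c 'a*' "$input")

# ^a+$ needs at least 1 a and nothing else; ^a*$ also accepts the empty line.
only_a=$(grep -E '^a+$' "$input")
a_or_empty_count=$(grep -E -c '^a*$' "$input")

# Tip (4): count how many of the 3 patterns match aaaabc, without and with anchors.
line='aaaabc'
unanchored=0
for re in 'a{3}' 'a+' '[^a]+'; do
  if printf '%s\n' "$line" | grep -E -q "$re"; then
    unanchored=$((unanchored + 1))
  fi
done
anchored=0
for re in '^a{3}$' '^a+$' '^[^a]+$'; do
  if printf '%s\n' "$line" | grep -E -q "$re"; then
    anchored=$((anchored + 1))
  fi
done

echo "a* matched $star_count lines; ^a+\$ matched: $only_a; ^a*\$ matched $a_or_empty_count lines"
echo "aaaabc matched $unanchored unanchored and $anchored anchored patterns"
rm -f "$input"
```

Without anchors, each pattern only needs to match some part of aaaabc; with both anchors, the pattern must account for the whole line, so none of them match.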