Regular Expression

Pattern matching

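Wildcard: "." (matches any single character except a newline)
Repetition operators: "+", "?", "*"
A repetition count can also be given explicitly: {}, e.g. {3} or {2,5}
Repetition operators are greedy by default: they match as much text as possible, so the match runs to the last possible closing boundary [1]
Appending "?" to a repetition operator makes it non-greedy, so it matches as few times as possible [2]
Character sets: [], a set of characters enclosed in square brackets
- '[a-zA-Z0-9]' a hyphen marks the start and end of a range
- '[^abc]' negation (any character not in the set)
Subpatterns: ()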
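
The syntax elements listed above can be tried out with a small sketch like the following. The sample strings and variable names are made up for illustration only.

```python
import re

# "." matches any single character except a newline (by default).
dot = re.findall(r'b.t', 'bat bit b\nt but')          # ['bat', 'bit', 'but']

# {m,n} bounds the repetition count: runs of 2 to 3 digits.
counted = re.findall(r'\d{2,3}', '7 42 123 12345')    # ['42', '123', '123', '45']

# Greedy "+" runs to the last ">"; non-greedy "+?" stops at the first one.
greedy = re.findall(r'<.+>', '<a><b>')                # ['<a><b>']
lazy = re.findall(r'<.+?>', '<a><b>')                 # ['<a>', '<b>']

# A character set with a range, and its negation.
in_set = re.findall(r'[a-c]+', 'abcdcba')             # ['abc', 'cba']
not_in_set = re.findall(r'[^a-c]+', 'abcdcba')        # ['d']

# A parenthesized subpattern can be repeated as a unit.
repeated = re.fullmatch(r'(ab)+', 'ababab') is not None   # True

print(dot, counted, greedy, lazy, in_set, not_in_set, repeated)
```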

Commonly used functions in the re module

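Function	Description
compile(pattern[, flags])	Creates a pattern object from a string containing a regular expression
search(pattern, string[, flags])	Searches for the pattern anywhere in string
match(pattern, string[, flags])	Matches the pattern at the beginning of string
split(pattern, string[, maxsplit=0])	Splits string at each occurrence of the pattern
findall(pattern, string)	Returns a list of all occurrences of the pattern in string
sub(pat, repl, string[, count=0])	Replaces occurrences of pat in string with repl
escape(string)	Escapes all special regular expression characters in string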

>>> import re
>>> string = "hello world"
>>> pat = r' '
>>> if re.search(pat, string): print('Found it!')
...
Found it!
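
The difference between search and match, and the use of a compiled pattern object, may be easier to see side by side. This is a minimal sketch; the variable names are illustrative only.

```python
import re

text = 'hello world'
word = re.compile(r'world')          # compile once, reuse the pattern object

found_anywhere = word.search(text)   # scans the whole string for a match
found_at_start = word.match(text)    # only tries to match at position 0

print(found_anywhere is not None)    # True: 'world' occurs in the string
print(found_at_start is None)        # True: the string does not start with 'world'

starts_with_hello = re.match(r'hello', text)
print(starts_with_hello.group(0))    # hello
```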

>>> some_text = 'alpha, beta,,,,gamma    delta'
>>> re.split('[, ]+', some_text)
['alpha', 'beta', 'gamma', 'delta']
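
The maxsplit argument from the table limits how many splits are performed; the remainder of the string is returned as the last item. A short sketch using the same sample text:

```python
import re

some_text = 'alpha, beta,,,,gamma    delta'

# At most one split: everything after the first separator stays together.
one_split = re.split('[, ]+', some_text, maxsplit=1)
# At most two splits.
two_splits = re.split('[, ]+', some_text, maxsplit=2)

print(one_split)    # ['alpha', 'beta,,,,gamma    delta']
print(two_splits)   # ['alpha', 'beta', 'gamma    delta']
```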

>>> pat = '[a-zA-Z]+'
>>> text = '"Hm... Err --- are you sure?" he said, sounding insecure.'
>>> re.findall(pat, text)
['Hm', 'Err', 'are', 'you', 'sure', 'he', 'said', 'sounding', 'insecure']
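
One detail worth knowing: when the pattern contains groups, findall returns the group contents instead of the whole match. The sketch below uses a made-up key=value string to show the three cases.

```python
import re

text = 'width=80 height=24'

# No groups: each item is the whole match.
whole = re.findall(r'\w+=\d+', text)
# One group: each item is that group's text.
single = re.findall(r'\w+=(\d+)', text)
# Several groups: each item is a tuple of the groups.
pairs = re.findall(r'(\w+)=(\d+)', text)

print(whole)    # ['width=80', 'height=24']
print(single)   # ['80', '24']
print(pairs)    # [('width', '80'), ('height', '24')]
```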

>>> pat = r'\{name\}'
>>> text = 'Dear {name}'
>>> re.sub(pat, 'Chunkai', text)
'Dear Chunkai'
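
The count argument from the table caps the number of replacements. A small sketch, with an invented sample string; the braces are escaped so they are always read as literal characters:

```python
import re

pat = r'\{name\}'
text = 'Dear {name}, goodbye {name}'

replace_all = re.sub(pat, 'Chunkai', text)             # count=0 means "replace all"
replace_first = re.sub(pat, 'Chunkai', text, count=1)  # only the first occurrence

print(replace_all)     # Dear Chunkai, goodbye Chunkai
print(replace_first)   # Dear Chunkai, goodbye {name}
```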

>>> re.escape('www.python.org')
'www\\.python\\.org'
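
Why escaping matters: without it, each "." in the literal text acts as a wildcard. The sketch below compares the two on an invented string that should not match.

```python
import re

literal = 'www.python.org'
lookalike = 'wwwXpythonYorg'

# Unescaped: "." matches any character, so the lookalike matches too.
unescaped_hit = re.search(literal, lookalike) is not None
# Escaped: only a real "." is accepted.
escaped_hit = re.search(re.escape(literal), lookalike) is not None
exact_hit = re.search(re.escape(literal), 'visit www.python.org').group(0)

print(unescaped_hit)   # True
print(escaped_hit)     # False
print(exact_hit)       # www.python.org
```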

Match objects and group extraction

Group extraction

>>> m = re.match(r'www\.(.*)\..{3}', 'www.python.org')
>>> m.group(1)
'python'
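
Besides group(1), a match object exposes the whole match and the positions of each group. A brief sketch using the same pattern:

```python
import re

m = re.match(r'www\.(.*)\..{3}', 'www.python.org')

whole = m.group(0)     # the entire matched text
groups = m.groups()    # all captured groups as a tuple
start = m.start(1)     # index where group 1 begins
end = m.end(1)         # index just past the end of group 1
span = m.span(1)       # (start, end) as a pair

print(whole)    # www.python.org
print(groups)   # ('python',)
print(span)     # (4, 10)
```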

Substitution

>>> emphasis_pattern = re.compile(r'''
... \*                  # Beginning emphasis tag -- an asterisk
... (                   # Beginning group for capturing phrase
... [^\*]+              # Capture anything except asterisks
... )                   # End group
... \*                  # Ending emphasis tag
... ''', re.VERBOSE)
>>> re.sub(emphasis_pattern, r'<em>\1</em>', 'hello, *world*!')
'hello, <em>world</em>!'

>>> emphasis_pattern = r'\*(.+)\*'          #[1]
>>> re.sub(emphasis_pattern, r'<em>\1</em>', 'hello, *world*! *my son*')
'hello, <em>world*! *my son</em>'

>>> emphasis_pattern = r'\*(.+?)\*'         #[2]
>>> re.sub(emphasis_pattern, r'<em>\1</em>', 'hello, *world*! *my son*')
'hello, <em>world</em>! <em>my son</em>'

Template substitution

# templates.py
import fileinput, re

# Matches fields enclosed in square brackets:
field_pat = re.compile(r'\[(.+?)\]')

# We'll collect variables in this:
scope = {}

# This is used in re.sub:
def replacement(match):
    code = match.group(1)
    try:
        # If the field can be evaluated, return it:
        return str(eval(code, scope))
    except SyntaxError:
        # Otherwise, treat the field as a statement (e.g. an assignment)
        # and execute it in the same scope ...
        exec(code, scope)
        # ... and return an empty string:
        return ''

# Get all the text as a single string
# (there are other ways of doing this; see Chapter 11):
lines = []
for line in fileinput.input():
    lines.append(line)
text = ''.join(lines)

# Substitute all the occurrences of the field pattern:
print(field_pat.sub(replacement, text))
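
The mechanism the script relies on is that re.sub accepts a function as the replacement: it is called once per match and its return value is substituted. The sketch below isolates that idea with a plain dictionary lookup instead of eval/exec. The `values` dictionary and the `lookup` function are hypothetical names, not part of templates.py.

```python
import re

field_pat = re.compile(r'\[(.+?)\]')

# Hypothetical sample data standing in for the evaluated scope.
values = {'name': 'Magnus', 'language': 'python'}

def lookup(match):
    key = match.group(1)
    # Unknown fields are left untouched by returning the original text.
    return values.get(key, match.group(0))

result = field_pat.sub(lookup, 'Dear [name], you use [language]; is [email] right?')
print(result)   # Dear Magnus, you use python; is [email] right?
```

A dictionary lookup like this only substitutes known names; the eval/exec version above is more flexible but runs whatever code the input files contain.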

template.txt

[import time]
Dear [name],

I would like to learn how to program.
I hear you use the [language] language a lot -- is it something I should consider?

And, by the way, is [email] your correct email address?

Fooville,
[time.asctime()]
Oscar Frozzbozz

magnus.txt

[name = 'Magnus Lie Hetland' ]
[email = 'magnus@foo.bar' ]
[language = 'python' ]

python templates.py magnus.txt template.txt
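
To see what that command does without creating files, the same substitution can be run on in-memory strings. This is only a sketch: `definitions` and `template` are shortened, made-up stand-ins for magnus.txt and template.txt, and the time fields are left out so the result is reproducible.

```python
import re

field_pat = re.compile(r'\[(.+?)\]')
scope = {}

def replacement(match):
    code = match.group(1)
    try:
        # Expressions such as "name" evaluate to a value.
        return str(eval(code, scope))
    except SyntaxError:
        # Statements such as "name = '...'" are executed and produce no text.
        exec(code, scope)
        return ''

# Stand-ins for the two input files; definitions must come first.
definitions = "[name = 'Magnus Lie Hetland' ]\n[email = 'magnus@foo.bar' ]\n"
template = "Dear [name],\nis [email] your correct email address?\n"

output = field_pat.sub(replacement, definitions + template)
print(output)
```

Each definition field is replaced by an empty string, so the output starts with two blank lines followed by the filled-in template. The order matters: a field like [name] that is used before it is defined raises a NameError, which the script does not catch.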