Python - Regular Expression

Regular Expression

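Pattern matching

  • Wildcard: "." matches any single character except a newline
  • Repetition operators: "+" (one or more), "?" (zero or one), "*" (zero or more)
  • An explicit repetition count can also be given: {}
  • Repetition operators are greedy by default: they match as much as possible, extending to the last position where the rest of the pattern can still match [1]
  • Appending "?" to a repetition operator makes it non-greedy, so it matches as few characters as possible [2]
  • Character sets: [] matches any one of the characters enclosed in the square brackets
    • '[a-zA-Z0-9]': a hyphen marks the start and end of a range
    • '[^abc]': a leading caret negates the set
  • Subpatterns: ()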
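
The operators listed above can be tried out with re.findall and re.search. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not part of the original notes.

```python
import re

# "." matches any single character except a newline.
wildcard = re.findall(r'h.t', 'hat hot hit')            # ['hat', 'hot', 'hit']

# "+" means one or more, "*" zero or more, "?" zero or one.
plus = re.findall(r'ab+', 'a ab abb')                   # ['ab', 'abb']
star = re.findall(r'ab*', 'a ab abb')                   # ['a', 'ab', 'abb']
optional = re.findall(r'ab?', 'a ab abb')               # ['a', 'ab', 'ab']

# {m,n} bounds the number of repetitions.
bounded = re.findall(r'[0-9]{2,3}', '1 22 333 4444')    # ['22', '333', '444']

# Character sets and negated character sets.
in_set = re.findall(r'[a-zA-Z0-9]+', 'foo_bar-42')      # ['foo', 'bar', '42']
not_in_set = re.findall(r'[^abc]+', 'aXbYYc')           # ['X', 'YY']

# Greedy repetition runs to the last ">", non-greedy stops at the first one.
greedy = re.findall(r'<.+>', '<a><b>')                  # ['<a><b>']
lazy = re.findall(r'<.+?>', '<a><b>')                   # ['<a>', '<b>']

# A subpattern in parentheses can be repeated as a unit.
repeated = re.search(r'(ab)+', 'xababx').group(0)       # 'abab'
```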

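Common functions in the re module

Function                                Description
compile(pattern[, flags])               Create a pattern object from a regular expression string
search(pattern, string[, flags])        Search for the pattern anywhere in string
match(pattern, string[, flags])         Match the pattern at the beginning of string
split(pattern, string[, maxsplit=0])    Split string at occurrences of the pattern
findall(pattern, string)                Return a list of all occurrences of the pattern in string
sub(pat, repl, string[, count=0])       Replace occurrences of pat in string with repl
escape(string)                          Escape all special regular expression characters in string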
>>> import re
>>> string = "hello world"
>>> pat = r' '
>>> if re.search(pat, string): print('Found it!')
...
Found it!

>>> some_text = 'alpha, beta,,,,gamma delta'
>>> re.split('[, ]+', some_text)
['alpha', 'beta', 'gamma', 'delta']

>>> pat = '[a-zA-Z]+'
>>> text = '"Hm... Err --- are you sure?" he said, sounding insecure.'
>>> re.findall(pat, text)
['Hm', 'Err', 'are', 'you', 'sure', 'he', 'said', 'sounding', 'insecure']

>>> pat = r'\{name\}'
>>> text = 'Dear {name}'
>>> re.sub(pat, 'Chunkai', text)
'Dear Chunkai'

>>> re.escape('www.python.org')
'www\\.python\\.org'
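
The session above does not exercise compile, match, or the optional maxsplit, count and flags arguments. The sketch below fills in those cases; the sample strings and variable names are illustrative assumptions, not taken from the original notes.

```python
import re

# compile() builds a reusable pattern object with the same methods as the module.
word = re.compile(r'[a-z]+')

# match() only succeeds at the start of the string; search() scans the whole string.
at_start = word.match('123 abc')            # None, the string starts with digits
anywhere = word.search('123 abc').group(0)  # 'abc'

# maxsplit limits the number of splits, count limits the number of replacements.
limited_split = re.split(r'[, ]+', 'alpha, beta,,,,gamma delta', maxsplit=1)
# ['alpha', 'beta,,,,gamma delta']
limited_sub = re.sub(r'a', 'A', 'banana', count=2)   # 'bAnAna'

# flags change how the pattern is interpreted, for example ignoring case.
any_case = re.findall(r'python', 'Python python PYTHON', flags=re.IGNORECASE)
# ['Python', 'python', 'PYTHON']
```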

Match objects and group extraction

  • Group extraction
    >>> m = re.match(r'www\.(.*)\..{3}', 'www.python.org')
    >>> m.group(1)
    'python'
  • Substitution
    >>> emphasis_pattern = re.compile(r'''
    ... \* # Beginning emphasis tag -- an asterisk
    ... ( # Beginning group for capturing phrase
    ... [^\*]+ # Capture anything except asterisks
    ... ) # End group
    ... \* # Ending emphasis tag
    ... ''', re.VERBOSE)
    >>> re.sub(emphasis_pattern, r'<em>\1</em>', 'hello, *world*!')
    'hello, <em>world</em>!'

    >>> emphasis_pattern = r'\*(.+)\*' # [1] greedy
    >>> re.sub(emphasis_pattern, r'<em>\1</em>', 'hello, *world*! *my son*')
    'hello, <em>world*! *my son</em>'

    >>> emphasis_pattern = r'\*(.+?)\*' # [2] non-greedy
    >>> re.sub(emphasis_pattern, r'<em>\1</em>', 'hello, *world*! *my son*')
    'hello, <em>world</em>! <em>my son</em>'
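
Besides group(1), a match object offers a few other accessors, and re.sub also accepts a function in place of a replacement string. The sketch below illustrates both; the second group in the pattern and the variable names are additions made for this example only.

```python
import re

# Same URL as in the group extraction example, with a second group for the suffix.
m = re.match(r'www\.(.*)\.(.{3})', 'www.python.org')

whole = m.group(0)        # 'www.python.org', group 0 is the entire match
first = m.group(1)        # 'python'
second = m.group(2)       # 'org'
all_groups = m.groups()   # ('python', 'org')
first_span = m.span(1)    # (4, 10), the same values as m.start(1) and m.end(1)

# When repl is a function, it receives each match object and returns the replacement.
shout = re.sub(r'\*(.+?)\*', lambda match: match.group(1).upper(), 'hello, *world*!')
# 'hello, WORLD!'
```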

Template substitution

# templates.py
import fileinput, re

# Matches fields enclosed in square brackets:
field_pat = re.compile(r'\[(.+?)\]')

# We'll collect variables in this:
scope = {}

# This is used in re.sub:
def replacement(match):
    code = match.group(1)
    try:
        # If the field can be evaluated, return it:
        return str(eval(code, scope))
    except SyntaxError:
        # Otherwise, execute the assignment in the same scope ...
        exec(code, scope)
        # ... and return an empty string:
        return ''

# Get all the text as a single string:
# (There are other ways of doing this; see Chapter 11)
lines = []
for line in fileinput.input():
    lines.append(line)
text = ''.join(lines)

# Substitute all the occurrences of the field pattern:
print(field_pat.sub(replacement, text))
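
To try the same idea without input files, the logic can be wrapped in a function that takes the text directly. This is only a sketch: render and the sample strings are hypothetical names introduced here, not part of templates.py. Because fields are passed to eval and exec, it should only be run on trusted templates.

```python
import re

# Same field pattern as in templates.py: text enclosed in square brackets.
field_pat = re.compile(r'\[(.+?)\]')

def render(text, scope=None):
    """Hypothetical helper: expand every [field] in text using a shared scope."""
    scope = {} if scope is None else scope

    def replacement(match):
        code = match.group(1)
        try:
            # Expressions are evaluated and their value is inserted.
            return str(eval(code, scope))
        except SyntaxError:
            # Statements (assignments, imports) are executed and leave no text.
            exec(code, scope)
            return ''

    return field_pat.sub(replacement, text)

# Definitions come first, then the text that uses them.
definitions = "[name = 'Magnus Lie Hetland' ][language = 'python' ]"
letter = "Dear [name], I hear you use the [language] language."
result = render(definitions + letter)
print(result)   # Dear Magnus Lie Hetland, I hear you use the python language.
```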

template.txt

[import time]
Dear [name],

I would like to learn how to program.
I hear you use the [language] language a lot -- is it something I should consider?

And, by the way, is [email] your correct email address?

Fooville,
[time.asctime()]
Oscar Frozzbozz

magnus.txt

[name = 'Magnus Lie Hetland' ]
[email = 'magnus@foo.bar' ]
[language = 'python' ]
python templates.py magnus.txt template.txt