Python - Regular Expressions

1. Matching with and without regular expressions
Matching a phone number without regular expressions:

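# Find a phone number in a piece of text, e.g. 331-896-7854
def isPhoneNumber(number):
    if len(number) != 12:  # a phone number must be exactly 12 characters long
        return False
    for i in range(0,3):
        if not number[i].isdecimal():   # the first three characters must be digits
            return False
    if number[3] != '-':     # the fourth character must be a hyphen
        return False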
        return False
    for i in range(4,7):
        if not number[i].isdecimal():
            return False
    if number[7] != '-':
        return False
    for i in range(8,12):
        if not number[i].isdecimal():
            return False
    return True

text = 'you can contact me at 331-896-7854. 400-400-8888 is my office'  # the text to search
for i in range(len(text)):
    check = text[i:i+12]      # a phone number is 12 characters long, so test every 12-character slice
    if isPhoneNumber(check):
        print('the phone number found:' + check)
print('Done')

Output:

the phone number found:331-896-7854
the phone number found:400-400-8888
Done

Matching with a regular expression:

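import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
# re.compile takes a string holding the regular expression; a raw string avoids having to double the backslashes

ret = phoneNumRegex.search('My number is 415-555-4242.').group()
# search() looks for the compiled pattern and returns a Match object; the Match object's group() method returns the text that was actually matched
print('The phone number found :' + ret)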

Output:

The phone number found :415-555-4242
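
The snippet above calls group() directly on the result of search(), which raises an AttributeError when the text contains no match, because search() then returns None. A minimal sketch of a safer lookup follows; the helper name find_phone and the sample strings are assumptions made for illustration.

```python
import re

phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    match = phone_regex.search(text)
    # search() returns None when nothing matches, so check before calling group()
    return match.group() if match else None

found = find_phone('My number is 415-555-4242.')
missing = find_phone('No number here.')
print(found)    # 415-555-4242
print(missing)  # None
```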

2. Grouping with parentheses

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
# the first pair of parentheses is group 1, the second pair is group 2

ret = phoneNumRegex.search('My number is 415-555-4242.')
# passing 1 or 2 to group() returns the corresponding group; passing 0 or nothing returns the whole match

print('The phone number found :' + ret.group(1)) #>> The phone number found :415
print('The phone number found :' + ret.group(2)) #>> The phone number found :555-4242
print('The phone number found :' + ret.group())  #>> The phone number found :415-555-4242

To get all the groups at once, use the groups() method:

print(ret.groups())   #>> ('415', '555-4242')

If the phone number is written as (415)-555-4242 and the pattern should match the parentheses themselves, escape them with a backslash:

phoneNumRegex = re.compile(r'(\(\d\d\d\))-(\d\d\d-\d\d\d\d)')
……
print(ret.groups()) #>> ('(415)', '555-4242')
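
For reference, here is a self-contained version of the escaped-parentheses match. The input string and the variable names are assumptions, since the snippet above does not show its search call.

```python
import re

# \( and \) match literal parentheses; the unescaped ( ) still create groups
paren_regex = re.compile(r'(\(\d\d\d\))-(\d\d\d-\d\d\d\d)')
paren_match = paren_regex.search('My number is (415)-555-4242.')

area_code, main_number = paren_match.groups()
print(area_code)    # (415)
print(main_number)  # 555-4242
```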

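3. Matching alternatives with the pipe
The | character is called a pipe and means "or" in a regular expression. The regular expression r'superman|batman' matches either superman or batman.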

import re

heroRegex = re.compile(r'superman|batman')   
ret = heroRegex.search('superman and batman')   # search returns as soon as it finds the first match (use findall to get every match)
print(ret.group())
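
As a rough illustration of the difference between search and findall with a pipe, and of a pipe inside a group, consider the sketch below. The bat_regex pattern and its sample sentence are made up for this example.

```python
import re

hero_regex = re.compile(r'superman|batman')

# search stops at the first alternative it finds; findall collects all of them
first_hero = hero_regex.search('superman and batman').group()
all_heroes = hero_regex.findall('superman and batman')

# A pipe inside parentheses shares the common prefix 'bat'
bat_regex = re.compile(r'bat(man|mobile|copter)')
bat_match = bat_regex.search('the batmobile lost a wheel')
whole = bat_match.group()     # the full match
suffix = bat_match.group(1)   # only the part matched inside the parentheses

print(first_hero, all_heroes, whole, suffix)
```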

4. Other pattern syntax

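?   marks the preceding group as optional: it matches zero or one time
\?  matches a literal question mark
*   matches the preceding group zero or more times, so the group may be absent or repeated any number of times
\*  matches a literal asterisk
+   matches the preceding group one or more times; if the group never appears there is no match and search() returns None
\+  matches a literal plus sign
^   anchors the match to the start of the string
$   anchors the match to the end of the string
.   matches exactly one character, which can be any character except a newline
.*  matches any text except newlines, greedily (the longest possible match); passing re.DOTALL as the second argument to re.compile makes the dot match newlines as well
.*? matches any text except newlines, non-greedily (the shortest possible match)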
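
The symbols in the list above can be tried out with a short sketch like the following. All patterns and sample strings here are illustrative assumptions and do not come from the original post.

```python
import re

# ? makes the preceding group optional (zero or one occurrence)
optional = re.compile(r'bat(wo)?man')
optional_hits = [optional.search(s).group() for s in ('batman', 'batwoman')]

# * allows zero or more occurrences of the group
star_hit = re.compile(r'bat(wo)*man').search('batwowowoman').group()

# + needs at least one occurrence, so plain 'batman' does not match
plus_miss = re.compile(r'bat(wo)+man').search('batman')

# ^ and $ together force the whole string to fit the pattern
digits_only = re.compile(r'^\d+$')
all_digits = digits_only.search('1234567890') is not None
mixed = digits_only.search('12345xyz67890') is not None

# .* is greedy, .*? is non-greedy
sample = '<to serve man> for dinner.>'
greedy_hit = re.compile(r'<.*>').search(sample).group()
lazy_hit = re.compile(r'<.*?>').search(sample).group()

# re.DOTALL lets the dot match newlines too
two_lines = 'line one\nline two'
first_line = re.compile(r'.*').search(two_lines).group()
both_lines = re.compile(r'.*', re.DOTALL).search(two_lines).group()

print(optional_hits, star_hit, plus_miss, all_digits, mixed)
print(greedy_hit, lazy_hit, repr(first_line), repr(both_lines))
```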

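5. Curly braces

To repeat a group a specific number of times, follow the group with a number in curly braces; for example, (ha){3} matches 'hahaha'.
Braces can also hold a range: (ha){3,5} matches 'hahaha', 'hahahaha' and 'hahahahaha'.
Either number may be left out to leave the minimum or maximum unbounded: (ha){3,} matches three or more repetitions, and (ha){,5} matches zero to five.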
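
A small sketch of the three brace forms described above; the sample strings are assumptions chosen to show the bounds.

```python
import re

# Exactly three repetitions
exactly_three = re.compile(r'(ha){3}')
three_hit = exactly_three.search('hahaha').group()
three_miss = exactly_three.search('ha')

# Three or more repetitions: greedy, so all six are consumed
three_or_more = re.compile(r'(ha){3,}')
many_hit = three_or_more.search('hahahahahaha').group()

# Zero to five repetitions: stops after five, and can match the empty string
up_to_five = re.compile(r'(ha){,5}')
capped_hit = up_to_five.search('hahahahahahaha').group()
empty_hit = up_to_five.search('xyz').group()

print(three_hit, three_miss, many_hit, capped_hit, repr(empty_hit))
```

Because (ha){,5} allows zero repetitions, it succeeds with an empty match on text that contains no 'ha' at all, which is worth keeping in mind when testing for a match.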

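6. Greedy and non-greedy matching

Python regular expressions are greedy by default: when several matches are possible, they match the longest string they can.
To get the non-greedy version of a brace quantifier, put a ? after the closing brace; the match is then the shortest string possible. This is a second use of ? in regular expressions, separate from marking a group as optional.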

# greedy matching
>>> tx = re.compile(r'(ha){3,5}')
>>> tx.search('hahahahaha').group()
'hahahahaha'

# non-greedy matching
>>> tx = re.compile(r'(ha){3,5}?')
>>> tx.search('hahahahaha').group()
'hahaha'

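7. findall

findall:
- with no groups in the pattern, returns a list of strings
- with groups in the pattern, returns a list of tuples, one tuple per match (with exactly one group, it returns a list of that group's strings instead)
- returns every match of the pattern in the text

search:
- returns a Match object (or None if nothing matches)
- matches only the first occurrence of the pattern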

# no groups
>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> phoneNumRegex.findall('cell:415-555-9999 work:212-555-1234')
['415-555-9999', '212-555-1234']

# with groups
>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
>>> phoneNumRegex.findall('cell:415-555-9999 work:212-555-1234')
[('415', '555', '9999'), ('212', '555', '1234')]
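
One case the two examples above do not cover is a pattern with exactly one group, where findall returns plain strings rather than tuples. The sketch below shows this and, as an alternative not mentioned in the original post, uses finditer to get Match objects when the whole match is still needed.

```python
import re

text = 'cell:415-555-9999 work:212-555-1234'

# One group only: findall returns a list of the group's text, not tuples
one_group = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
area_codes = one_group.findall(text)

# finditer yields a Match object per match, so group() gives the full text
full_numbers = [m.group() for m in one_group.finditer(text)]

print(area_codes)    # ['415', '212']
print(full_numbers)  # ['415-555-9999', '212-555-1234']
```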

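8. Character classes

\d #any digit from 0 to 9
\D #any character that is not a digit from 0 to 9
\w #any letter, digit or underscore (think of it as matching "word" characters)
\W #any character that is not a letter, digit or underscore
\s #any space, tab or newline (think of it as matching "whitespace" characters)
\S #any character that is not a space, tab or newline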
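
The shorthand classes combine naturally with the quantifiers from the earlier sections. A minimal sketch, with a sample sentence invented for the example:

```python
import re

# One or more digits, one whitespace character, then one or more word characters
count_regex = re.compile(r'\d+\s\w+')
gifts = count_regex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies')

# \D+ picks up the runs of non-digit characters
non_digits = re.findall(r'\D+', 'a1b22c')

print(gifts)       # ['12 drummers', '11 pipers', '10 lords', '9 ladies']
print(non_digits)  # ['a', 'b', 'c']
```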

Defining your own character class:

>>> vowelRegex = re.compile(r'[aeiouAEIOU]')
>>> vowelRegex.findall('RoboCop ests baby food. BABY FOOD.')
['o', 'o', 'o', 'e', 'a', 'o', 'o', 'A', 'O', 'O']

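A hyphen can express a range of letters or digits; for example, [a-zA-Z0-9] matches all lowercase letters, uppercase letters and digits.
Inside [ ] the usual regular expression symbols are not interpreted, so characters such as . * ? need no escaping there.
Placing a ^ right after the opening bracket creates a negated character class, which matches every character that is not in the class.

>>> vowelRegex = re.compile(r'[^aeiouAEIOU]')
>>> vowelRegex.findall('RoboCop ests baby food. BABY FOOD.')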
['R', 'b', 'C', 'p', ' ', 's', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']
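
The range syntax and the unescaped symbols inside brackets can be checked with a short sketch; the sample strings here are assumptions.

```python
import re

# A range-based class: runs of ASCII letters and digits
alnum_regex = re.compile(r'[a-zA-Z0-9]+')
words = alnum_regex.findall('user_42 logged-in!')

# Inside [ ] the characters . * ? are literals and need no backslash
punct_regex = re.compile(r'[.*?]')
marks = punct_regex.findall('Really? Yes. 2*3')

print(words)  # ['user', '42', 'logged', 'in']
print(marks)  # ['?', '.', '*']
```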

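9. Case-insensitive matching
Normally a regular expression matches text with exactly the casing you specify. For example, the following regular expressions match completely different strings: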

import re

# regex1 = re.compile('RoboCop')
# regex2 = re.compile('ROBOCOP')
# regex3 = re.compile('robOcop')
# regex4 = re.compile('RobocOp')

robocop = re.compile(r'robocop',re.I)  # to match case-insensitively, pass re.IGNORECASE or re.I as the second argument to re.compile()

ret = robocop.search('RobOCop protects the innocent').group()

print(ret)  #>> RobOCop

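# Passing re.VERBOSE as the second argument makes the pattern ignore whitespace and any comment that runs from # to the end of the line, so a regular expression can be spread over several lines
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?    #area code
(\s|-|\.)?            #separator
\d{3}                 #first 3 digits
(\s|-|\.)             #separator
\d{4}                 #last 4 digits
(\s*(ext|x|ext\.)\s*\d{2,5})?  #extension
)''',re.VERBOSE)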
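
To see a verbose pattern at work, here is a self-contained sketch. It assumes the second separator is meant to be a single group (\s|-|\.) and that the dot in the extension alternative is meant literally (ext\.); the sample text is made up.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code, with or without parentheses
    (\s|-|\.)?                      # separator
    \d{3}                           # first 3 digits
    (\s|-|\.)                       # separator
    \d{4}                           # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?   # optional extension
)''', re.VERBOSE)

text = 'Call 415-555-4242 or (212) 555-1234 ext 99.'

# With several groups findall returns tuples; index 0 is the outermost group,
# which wraps the entire phone number
numbers = [groups[0] for groups in phone_regex.findall(text)]
print(numbers)  # ['415-555-4242', '(212) 555-1234 ext 99']
```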

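10. Replacing strings with sub()
Regular expressions can not only find text, they can also replace the matched patterns with new text. The sub() method of a Regex object takes two arguments: the first is the replacement string that takes the place of each match, and the second is the string to search.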

import re
# Example 1
s = 'my name is alex durand'
nameRegex = re.compile(r'alex \w+')
new_s = nameRegex.sub('kevin',s)
print(new_s) #>>: my name is kevin

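# Example 2
s = 'alex durand told alex kevin that alex jin knew alex liang was a double agent'
nameRegex = re.compile(r'alex (\w\w)\w*')
new_s = nameRegex.sub(r'\1*****',s)  # \1 stands for the text captured by the first group (\w\w); the rest of the match, including whatever \w* matched, is replaced by *****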
print(new_s) #>>: du***** told ke***** that ji***** knew li***** was a double agent

11. The match method and groups

import re

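# re.match() only matches at the beginning of the string
origin = 'hello alex bcd alex 1ge alex acd 19'
r = re.match(r"h\w+",origin)
print(r.group())  #>> hello  # group() returns the whole match; for match, with or without groups, group() is the text matched by the full pattern

# With groups: first match the whole pattern, then extract the parts captured by the groups
r = re.match(r"h(\w+)",origin)
print(r.groups())  #>>('ello',)  # the captured groups of the match

r = re.match(r"h(?P<key>\w+)",origin)  # writing ?P<key> inside the parentheses names the group "key"; the P must be uppercase
print(r.groupdict()) #>> {'key': 'ello'}  # the named groups of the match, keyed by group name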

12. The search method

# re.search() scans the whole string and matches the first substring that fits the pattern
origin = 'hello alex bcd alex 1ge alex acd 19'
r = re.search(r"a(\w+)",origin)   # scan the string until the pattern matches
print(r.group())   #>> alex

r = re.search(r"a(\w+).*(?P<name>\d)$",origin)
print(r.groups())  #>> ('lex', '9')
print(r.groupdict())  #>> {'name': '9'}
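
The practical difference between match and search shows up when the pattern does not sit at the start of the string. A minimal sketch using the same origin string; the group names rest and last_digit are chosen here for illustration.

```python
import re

origin = 'hello alex bcd alex 1ge alex acd 19'

# match only tries position 0, so 'alex' is not found
at_start = re.match(r'alex', origin)

# search scans forward and finds the first 'alex'
anywhere = re.search(r'alex', origin)
first_index = anywhere.start()

# Named groups can be read individually with group('name')
named = re.search(r'a(?P<rest>\w+).*(?P<last_digit>\d)$', origin)
rest = named.group('rest')
last_digit = named.group('last_digit')

print(at_start, first_index, rest, last_digit)
```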
