re模块函数模式详解

技术分享 5小时前 0 999+

关注

re模块

python爬虫过程中,实现页面元素解析的方法很多,正则解析只是其中之一,常见的还有BeautifulSoup和lxml,它们都支持网页HTML元素解析,re模块提供了强大的正则表达式功能

re模块常用方法

compile(pattern,flags=0) :用于编译一个正则表达式字符串,生成一个re.pattern对象
- 参数: pattern:正则表达式字符串,flags:可选标志,用于修改匹配行为(下面会详细讲解)
```
pattern = re.compile(r'a.*e') 
```
search(pattern,string,flag=0):搜索字符串中第一个匹配,返回一个匹配对象,否则返回None
- 参数:pattern:正则表达式字符串或编译后的re.pattern对象,string:要搜索的字符串,flags:用于修改匹配行为（如 re.IGNORECASE、re.MULTILINE 等)
- 返回值:返回一个re.Match对象或None
```
result = re.search(r'bcatb', 'The cat sat on the mat') #输出<re.Match object; span=(4, 7), match='cat'> 接下来会解释 
```
match(pattern,string,flags=0):从字符串的开头匹配,如果匹配成功返回一个匹配对象,否则返回None
- 参数pattern:正则表达式字符串或编译后的re.Pattern 对象,string:要匹配的字符串,flags:可选的标志,用于修改匹配行为
- 返回值:返回一个re.Match对象或None
```
result = re.search(r'bcatb', 'The cat sat on the mat') 
```
findall(pattern,string,flags=0):返回所有匹配的子串列表
- 返回值:一个包含所有匹配子串的列表
```
result = re.findall(r'bcatb', 'The cat sat on the cat mat') #输出['cat', 'cat'] 
```

finditer(pattern,string,flags=0) :返回一个迭代器,产生所有匹配的 Match对象

返回值:一个迭代器，产生 re.Match 对象

result = re.finditer(r'bcatb', 'The cat sat on the cat mat') for match in result:     print(match) #<re.Match object; span=(4, 7), match='cat'> #<re.Match object; span=(19, 22), match='cat'>

sub(pattern,repl,string,count=0,flags=0):替换字符串中所有匹配的子串
- 参数:repl:替换字符串或替换函数,count:可选参数,指定最大替换次数,默认为0(替换所有匹配)
- 返回值:替换后的字符串
```
result = re.sub(r'bcatb', 'dog', 'The cat sat on the cat mat') print(result) #The dog sat on the dog mat 
```
split(pattern,string,maxsplit=0,flags=0) :根据匹配的字串分割字符串
- 参数:maxsplit:可选参数,指定最大分割次数,默认为0(不限分割次数)
- 返回值:一个包含分割结果的列表
```
text = "The cat sat on the cat mat" result = re.split(r'b', text) print(result) #['', 'The', ' ', 'cat', ' ', 'sat', ' ', 'on', ' ', 'the', ' ', 'cat', ' ', 'mat', ''] 
```

flags模式标志位

模式匹配支持多种模式标志,用于修改匹配模式行为

`re.IGNORECASE或re.I`

忽略大小写匹配模式

eg:

text = "The cat sat on the cat mat" regex = re.compile("CAT", re.IGNORECASE) result = regex.search(text) if result:     print(result.group()) else:     print("not found") #output-> cat

`re.MULTILINE或re.M`

多行模式,使 ^ 和 $ 匹配每行的开始和结束

eg:

text = """first Line second Line2 third Line3"""  regex = re.compile(r'^w+', re.MULTILINE) result = regex.findall(text) print(result) #output->['first', 'second', 'third']

`re.DOTALL 或 re.S`

可以使得. 匹配包括换行符在内的所有字符

eg:

text = "abcndefnghi"  regex = re.compile(r'.*', re.DOTALL) result = regex.findall(text) print(result) #output->['abcndefnghi', '']

`re.VERBOSE 或 re.X`

允许在正则表达式中使用注释和空白

eg:

	text = "The cat sat on the mat" 	 	regex = re.compile(r""" 	                b #单词边界 	                cat#匹配"cat" 	                b#单词边界 	""", re.VERBOSE) 	result = regex.findall(text) 	print(result) #output->['cat']

内嵌模式标志

内嵌模式的功能与flags标志位的功能一致,与之不同的是使用内嵌模式标志和注释可以增强正则表达式的可读性和灵活性

常用的内嵌模式标志

(?i):忽略大小写

eg:

text = "hello world"  regex = re.compile(r"(?#忽略大小写)(?i)HELLO WORLD") # (?#...) 是一个注释语法 result = regex.match(text).group() print(result)

(?m):多行模式,使 ^ 和 $ 匹配每行的开始和结束，而不仅仅是整个字符串的开始和结束

eg:

text = """first Line second Line third Line"""  regex = re.compile(r"(?#多行模式)(?m)^w+") result = regex.findall(text) print(result) #output->['first', 'second', 'third']

(?s):点匹配所有字符模式

eg:

text = "abcndefnghi"  regex = re.compile(r"(?#.号匹配换行符在内的所有字符)(?s).*") result = regex.findall(text) print(result) #output->['abcndefnghi', '']

(?x):允许注释和空白

eg:

text = "The cat sat on the mat"  regex = re.compile(r"(?#允许注释)(?ix)bCATb") # ix->忽略大小写的注释模式 result = regex.findall(text) print(result) #output->['cat']