I need some help writing a regular expression. My input looks like this:
this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. and there are many other lines in the txt files with<[3> such tags </[3>
The required output is:
this is a paragraph with in between and then there are cases ... where the number ranges from 1-100. and there are many other lines in the txt files with such tags
I have tried this:
    #!/usr/bin/python
    import os, sys, re, glob
    for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
        for line in reader:
            line2 = line.replace('<[1> ', '')
            line = line2.replace('</[1> ', '')
            line2 = line.replace('<[1>', '')
            line = line2.replace('</[1>', '')
            print line
I have also tried this (but it seems I am using the wrong regex syntax):
    line2 = line.replace('<[*> ', '')
    line = line2.replace('</[*> ', '')
    line2 = line.replace('<[*>', '')
    line = line2.replace('</[*>', '')
I don't want to hard-code a replace call for every number from 1 to 99.
This tested snippet should do the trick:
    import re
    line = re.sub(r"</?\[\d+>", "", line)
Edit: Here is a commented version explaining how it works:
    line = re.sub(r"""(?x)  # Use free-spacing mode.
        <    # Match a literal '<'
        /?   # Optionally match a '/'
        \[   # Match a literal '['
        \d+  # Match one or more digits
        >    # Match a literal '>'
        """, "", line)
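To try the pattern on the sample from the question, here is a minimal sketch written for Python 3. The names `TAG_RE`, `strip_tags` and `strip_tags_in_dir` are made up for illustration and are not part of the original snippet; the directory loop assumes the `*.txt` layout described in the question.

```python
import glob
import os
import re

# Matches an opening or closing tag such as <[1> or </[99>.
TAG_RE = re.compile(r"</?\[\d+>")


def strip_tags(text):
    """Return text with every <[N> and </[N> tag removed."""
    return TAG_RE.sub("", text)


def strip_tags_in_dir(directory):
    """Yield the cleaned lines of every .txt file in directory (illustrative helper)."""
    for path in glob.glob(os.path.join(directory, "*.txt")):
        with open(path) as reader:
            for line in reader:
                yield strip_tags(line)


sample = "this is a paragraph with<[1> in between</[1> and the<[99> number</[99>."
print(strip_tags(sample))
# prints: this is a paragraph with in between and the number.
```

Compiling the pattern once is only a convenience when the same regex is applied to many lines; calling `re.sub` directly as in the snippet above gives the same result.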
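Regexes are fun! But I strongly recommend spending an hour or two studying the basics. For starters, you need to learn which characters are special: the "metacharacters", which must be escaped (i.e. preceded by a backslash) to match literally. Note that the rules differ inside and outside character classes. There is an excellent online tutorial at www.regular-expressions.info. The time you spend there will pay for itself many times over. Happy regexing!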
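As a rough illustration of the escaping rules mentioned above, the sketch below contrasts the literal matching done by `str.replace` (which is why the `'<[*>'` attempt in the question changed nothing) with regex matching. The sample string and variable names are invented for the example.

```python
import re

text = "a<[1>b.c*d"

# str.replace matches literally: '<[*>' means exactly those four characters,
# so nothing in the text matches and the string comes back unchanged.
literal_result = text.replace("<[*>", "")

# Outside a character class, '[' opens a class, so a literal '[' needs a backslash.
escaped_result = re.sub(r"<\[\d+>", "", text)

# Leaving '[' unescaped here gives a class with no closing ']', which is an error.
try:
    re.compile(r"<[\d+>")
    unescaped_compiles = True
except re.error:
    unescaped_compiles = False

# Inside a character class, '.' and '*' lose their special meaning.
class_matches = re.findall(r"[.*]", text)

# re.escape adds the backslashes for you when you want a fully literal match.
auto_escaped_result = re.sub(re.escape("<[1>"), "", text)

print(literal_result)       # a<[1>b.c*d
print(escaped_result)       # ab.c*d
print(unescaped_compiles)   # False
print(class_matches)        # ['.', '*']
print(auto_escaped_result)  # ab.c*d
```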