
How to remove hashtags, @user mentions, and links from a tweet using a regular expression


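The example below is only an approximation. Unfortunately, there is no fully correct way to do this with a regular expression alone. The regex below strips URLs (any scheme, not just http), user names, and every other character that is not a letter, digit, space, or tab, which covers all punctuation. The split and join around it then leave the remaining words separated by single spaces. To parse tweets the way you intend, the system needs more intelligence, perhaps some kind of self-learning algorithm, since there is no standard format for a tweet feed.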

Here is my suggestion.

' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())

Here are the results on your examples:

>>> import re
>>> x="@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'I really love that shirt at Macy'
>>> x="@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
>>> x="I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
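For readability, the same expression can be split into its three alternatives. This is only a sketch of the one-liner used here: the names MENTION, NON_ALNUM, URL, TWEET_NOISE, and strip_noise are made up for illustration, and the behavior is meant to be identical to the original expression.

```python
import re

# The three alternatives of the one-liner, named for readability.
# All names in this block are illustrative, not part of any library.
MENTION = r"(@[A-Za-z0-9]+)"       # @user mentions
NON_ALNUM = r"([^0-9A-Za-z \t])"   # any one character that is not a letter, digit, space, or tab
URL = r"(\w+://\S+)"               # scheme://rest, for any scheme (http, https, ftp, ...)

TWEET_NOISE = re.compile("|".join([MENTION, NON_ALNUM, URL]))


def strip_noise(text):
    """Replace every match with a space, then collapse whitespace runs."""
    return " ".join(TWEET_NOISE.sub(" ", text).split())


print(strip_noise("@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"))
```

A URL starts with a letter or digit, so the non-alphanumeric alternative cannot match at its first character and the URL alternative consumes the whole link in one match. Text that looks like a link but has no scheme is not treated as a URL; only its punctuation is removed.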

And here are some examples where the result is less than perfect:

>>> x="I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'I c RT that s my excited face and my regular face The expression never changes'
>>> x="RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> # Though after you add # to the regex filter, results become a bit better
>>> ' '.join(re.sub(r"([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> x="New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'New comment by diego bosca Re Re wrong regular expression'
>>> # See how miserably it performed?
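The choice between keeping the hashtag word and dropping the whole tag can be made a parameter. The following is a minimal sketch under that assumption; the function name clean_tweet and the drop_hashtags flag are hypothetical, and the pattern is otherwise the same approximate one, with the same weaknesses.

```python
import re


def clean_tweet(text, drop_hashtags=False):
    """Approximate tweet cleanup (hypothetical helper, not a library function).

    drop_hashtags=False keeps the tag word ("#Gemini" becomes "Gemini").
    drop_hashtags=True removes the whole tag, like the [@#] variant.
    """
    prefix = "[@#]" if drop_hashtags else "@"
    pattern = "(" + prefix + r"[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)"
    return " ".join(re.sub(pattern, " ", text).split())


tweet = "RT @AstrologyForYou: #Gemini recharges through regular contact"
print(clean_tweet(tweet))                      # keeps the word "Gemini"
print(clean_tweet(tweet, drop_hashtags=True))  # drops "#Gemini" entirely
```

Neither setting fixes the other problems shown here: apostrophes still split words ("that's" becomes "that s"), dotted names are broken apart, and markers such as "RT" are left in place.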


When reposting, please credit the source: reposted from www.mshxw.com
Article URL: https://www.mshxw.com/it/647204.html