I have written a sequence of regexes that converts HTML to plain text, and I use it in 100 threads to clean HTML files. I want to get all the visible information in a given HTML file.
    self.content = re.sub(r'<!--(.|\n)*?-->', '', self.content)
    self.content = re.sub(r'<script(.|\n)*?>(.|\n)*?</script>', '', self.content)
    self.content = re.sub(r'<style(.|\n)*?>(.|\n)*?</style>', '', self.content)
    self.content = re.sub(r'(^[^>]*?>+)', '', self.content)
I'm not really a regex expert. Can I improve the performance of these regexes? I cannot use BeautifulSoup, Degenego, or the html2text C++ distribution. In my tests the regexes are slow. I just need a space-separated string, not a tree, links, etc.
Thank you for your help. I know there are some very smart people on Stack Overflow.
Use a tool like BeautifulSoup or HTB, and do not try to be cleverer than the rest of the world. Parsing HTML with regular expressions is the worst thing you can do! There will always be an HTML file on which your regexes fail.
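If third-party libraries are off the table, one possible direction is the parser that ships with Python. The following is only a minimal sketch, assuming Python 3 and the standard-library html.parser module; the names VisibleTextExtractor and html_to_text are made up for illustration and are not from the question. It skips the contents of script and style elements, drops comments (the default handle_comment does nothing), and returns a space-separated string.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes outside <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script>/<style> arrives here too, so filter it out.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse every run of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    # A fresh parser per call, so threads never share parser state.
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because chunks are joined with a space, text split by inline tags in the middle of a word (for example wor<b>ld</b>) comes out as two words. Whether this is faster than the regexes on your files is something you would have to measure yourself.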