screen scraping - Python - Easiest way to scrape text from list of URLs using BeautifulSoup
What is the easiest way to scrape just the text from a handful of web pages (using a list of URLs) with BeautifulSoup? Is this even possible?
Best, Georgina
import urllib2
import BeautifulSoup
import re

Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # given a URL, get the page content
    data = urllib2.urlopen(url).read()
    # parse as an HTML structured document
    bs = BeautifulSoup.BeautifulSoup(data, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    # kill JavaScript content
    for s in bs.findAll('script'):
        s.replaceWith('')
    # find the body and extract its text
    txt = bs.find('body').getText('\n')
    # collapse multiple line breaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite'  # URL truncated in the original post
    ]
    txt = [getPageText(url) for url in urls]

if __name__ == "__main__":
    main()

It now removes JavaScript and decodes HTML entities.
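The code above relies on urllib2 and BeautifulSoup 3, both of which are Python 2 only. Here is a minimal sketch of the same steps for Python 3, assuming the beautifulsoup4 package (bs4) is installed. The names html_to_text, get_page_text and scrape_all are hypothetical and not part of the original answer, and the fallback for pages without a body element is an addition.

```python
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

# A line break plus any whitespace that follows it collapses to one newline.
NEWLINES = re.compile(r'[\r\n]\s+')


def html_to_text(html):
    """Return the text of an HTML document with script content removed."""
    soup = BeautifulSoup(html, 'html.parser')
    # Remove script elements so JavaScript source does not end up in the text.
    for tag in soup.find_all('script'):
        tag.decompose()
    # Use the whole document when there is no <body> element (added fallback).
    root = soup.body if soup.body is not None else soup
    # bs4 decodes HTML entities on its own, so no convertEntities argument.
    return NEWLINES.sub('\n', root.get_text('\n')).strip()


def get_page_text(url):
    """Download one page and return its extracted text."""
    with urlopen(url) as response:
        return html_to_text(response.read())


def scrape_all(urls):
    """Map each URL in the list to its extracted text."""
    return {url: get_page_text(url) for url in urls}
```

Splitting the parsing step from the download means html_to_text can be tried on a plain string without network access, for example html_to_text('<body><p>Hi</p></body>').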