screen scraping - Python - Easiest way to scrape text from list of URLs using BeautifulSoup


What is the easiest way to scrape just the text from a handful of webpages (using a list of URLs) with BeautifulSoup? Is this even possible?

Best, Georgina

    import urllib2
    import BeautifulSoup
    import re

    # A line break followed by any further whitespace
    Newlines = re.compile(r'[\r\n]\s+')

    def getPageText(url):
        # given a URL, get the page content
        data = urllib2.urlopen(url).read()
        # parse as an HTML structured document, decoding HTML entities
        bs = BeautifulSoup.BeautifulSoup(data, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
        # kill JavaScript content
        for s in bs.findAll('script'):
            s.replaceWith('')
        # find the body and extract its text
        txt = bs.find('body').getText('\n')
        # collapse multiple line breaks and whitespace
        return Newlines.sub('\n', txt)

    def main():
        urls = [
            'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
            'http://stackoverflow.com/questions/5330248/how-to-rewrite'
        ]
        txt = [getPageText(url) for url in urls]

    if __name__ == "__main__":
        main()

It now removes JavaScript and decodes HTML entities.
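The code above targets Python 2 (`urllib2`) and BeautifulSoup 3. On Python 3, where `urllib2` is gone and BeautifulSoup ships as the `bs4` package, the same steps (fetch, parse, drop scripts, take the body text, collapse blank lines) might look roughly like the sketch below. It is not the original answer's code: the function names `html_to_text`, `get_page_text` and `scrape_urls`, the `timeout` default, the use of the built-in `html.parser` backend, and the extra removal of `style` elements are all illustrative assumptions.

```python
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Same pattern as in the answer: a line break followed by more whitespace.
NEWLINES = re.compile(r'[\r\n]\s+')


def html_to_text(html):
    """Return the text of an HTML document (str or bytes) as a string."""
    # bs4 decodes HTML entities such as &amp; on its own while parsing.
    soup = BeautifulSoup(html, "html.parser")
    # Drop JavaScript (and, as an added assumption, CSS) so it is not
    # returned as if it were page text.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # Fall back to the whole document when there is no <body> element.
    root = soup.body if soup.body is not None else soup
    text = root.get_text("\n")
    # Collapse runs of line breaks and indentation into single line breaks.
    return NEWLINES.sub("\n", text).strip()


def get_page_text(url, timeout=10):
    """Download one page and return its text (timeout is an assumed default)."""
    with urlopen(url, timeout=timeout) as response:
        return html_to_text(response.read())


def scrape_urls(urls):
    """Map each URL in the list to its extracted text."""
    return {url: get_page_text(url) for url in urls}
```

Splitting the parsing step (`html_to_text`) from the download step keeps the text extraction usable on HTML you already have on disk, and makes it easy to try on a small string before pointing `scrape_urls` at a real list of URLs.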
