
Getting text from a website using Python and Beautiful Soup: Automated Informative Poetry Part 1

Goodness, staying informed in this modern day and age is sooo much work. What can we possibly do to motivate people to get informed about the things that truly matter, and to remember them?

Well, this week’s project looks to do that through the beauty of POETRY. What devil’s magic could use programming to create informative poems? No devil’s magic necessary, so long as you’re happy with your poem simply being couplets that don’t always make sense. And goodness knows I am. Here’s my favorite example that our end product made for us:

You’ll slip and fall on your face
If you withdraw without holding the base

So with that as motivation, let’s go!

Step One! Scrape your info source!

What do you want to teach people about? My Little Pony? Our current political environment? Cleanliness? For my project I chose Sex Ed, because goodness knows that in the U.S. at least we tend to get a shoddy treatment of that topic.

For your choice you’ll want a text source that is just real wordy. The more sentences the better, because we’re going to try to find rhyming sentences with a similar number of syllables, and as you can imagine those don’t actually show up that often.

Hmm… an excellent source of sex education that is real real wordy? Sounds like Scarleteen to me! They’re pretty great though actually. Read some articles here: www.scarleteen.com (There really is never a wrong time to become more informed, yeah?).

So to start out let’s just get the text from one article. I’m going to go with ‘Condom Basics: A Users Manual’: http://www.scarleteen.com/article/sexual_health/condom_basics_a_users_manual. I imagine I can get some interesting poems from this topic (this may all just be a thinly veiled attempt to get computer generated dick jokes).

Now that you’ve got your site, let’s scrape all that juicy text! For this we’ll need two libraries: urllib, which we’ll use to read the html from the webpage, and BeautifulSoup, which we’ll use to search through that html for what we want and grab it out.

*Semi-related note!: You can check out more info about either of these libraries by googling them. BeautifulSoup is god-damn adorable, and thus it’s kind of fun to read through its documentation. The creator of it seems to have some real personality. If anyone buys the sci-fi novel they’ve written and are selling on the documentation website you let me know how it is.

So let’s import both of those and get that juicy html from our website url with urllib.

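import urllib  # Python 2 standard library: fetches web pages
import bs4     # BeautifulSoup 4: parses and searches html

# Open the article's url; the result is a file-like object holding the page's html
html = urllib.urlopen('http://www.scarleteen.com/article/sexual_health/condom_basics_a_users_manual')
# Uncomment to look at the raw html. read() uses up the response, so fetch it again before parsing.
#print html.read()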

Questions you might be asking: What is this import nonsense? Import is Python’s way of including code from other files (called modules) in our own. So writing urllib.urlopen lets us call the function urlopen, which is defined in the urllib module.
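
A side note on versions: urllib.urlopen only exists in Python 2. On Python 3 the same function lives in urllib.request, and it hands back bytes rather than text. Here’s a minimal sketch of the Python 3 equivalent. The name fetch_html is made up for illustration (it isn’t part of urllib), and falling back to UTF-8 when the server doesn’t declare an encoding is my own assumption.

```python
from urllib.request import urlopen  # Python 3 home of urlopen


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its html as text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset the server declares; fall back to UTF-8 (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (needs a network connection):
# html = fetch_html('http://www.scarleteen.com/article/sexual_health/condom_basics_a_users_manual')
```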

And in those 4 lines of code we can now see all the html from that article’s webpage. So that’s awesome. But all we want is the text from the article, so let’s grab that and ditch the rest.

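# Name the parser explicitly so BeautifulSoup doesn't have to guess which one to use
searchable_html = bs4.BeautifulSoup(html, 'html.parser')
#print searchable_html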

So that looks about the same as before but now we can search for things! But what to search for?

Well the standard thing to do when scraping content from a webpage is to just right click on the thing you want to grab and select ‘inspect element’ from the menu. You’ll then be shown where it shows up in the html, and from there you can craft a clever way of selecting just what you want.

I have it easy. It turns out that the vast majority of ‘p’ html elements are the paragraphs I want. So I’ll just grab those. ‘searchable_html’ is now a BeautifulSoup object, which gives us a bunch of cool functions we can use on it. First we just want to find all of the ‘p’ elements.

article_p_list = searchable_html.find_all('p')
#print article_p_list
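
If your page isn’t as tidy as mine, ‘inspect element’ might show that the paragraphs you want all sit inside one container. In that case you could find the container first and only search inside it. This is just a sketch: the html snippet and the ‘article-body’ class name below are invented for illustration and are not Scarleteen’s actual markup.

```python
import bs4

# Invented stand-in for a real page; the class names are hypothetical.
sample_html = """
<div class="sidebar"><p>Sign up for our newsletter!</p></div>
<div class="article-body">
  <p>First paragraph of the article.</p>
  <p>Second paragraph of the article.</p>
</div>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# Find the one container we care about, then search only inside it.
article = soup.find("div", class_="article-body")
paragraphs = article.find_all("p")

print(len(paragraphs))  # 2: the sidebar paragraph is left out
```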

And that finds them all… but it still looks pretty ugly. I really only want the text, not all that html nonsense. Luckily BeautifulSoup provides a ‘get_text()’ function, so this will be easy!

p_text_list = []
for article_p in article_p_list:
    p_text_list.append(article_p.get_text())
#print p_text_list
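
To see what ‘get_text()’ is doing, here’s a small sketch on a made-up paragraph (the html string is my own, not from the article). It drops the tags, including the nested ones, and keeps the words.

```python
import bs4

# Made-up paragraph with nested tags, for illustration only.
soup = bs4.BeautifulSoup(
    '<p>Always use a <b>new</b> <a href="#">condom</a>.</p>', "html.parser"
)

paragraph_text = soup.p.get_text()
print(paragraph_text)  # Always use a new condom.
```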

And boom, we’ve got all the text we want for our poem. Next time we’ll look into cleaning up the text even more, but until then you’ve already got a webscraper that will grab all the text content from a site. Try not to use your newfound powers for evil!

As a last step for today let’s put all this code into a function so we can have it all nicely bundled together.

def get_text(website_url):
    # Fetch the page, parse it, and return the text of every 'p' element as a list
    html = urllib.urlopen(website_url)
    searchable_html = bs4.BeautifulSoup(html, 'html.parser')
    article_p_list = searchable_html.find_all('p')
    p_text_list = [p.get_text() for p in article_p_list] #changed from the for loop, does the same thing
    return p_text_list

text = get_text('http://www.scarleteen.com/article/sexual_health/condom_basics_a_users_manual')
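
If you’re on Python 3, where urllib.urlopen no longer exists, the same function might look something like this. Treat it as a sketch rather than the one true way: the name get_text_py3 is made up so it doesn’t clash with the function above, and picking ‘html.parser’ is my choice of parser.

```python
from urllib.request import urlopen  # Python 3 location of urlopen

import bs4


def get_text_py3(website_url):
    """Return the text of every 'p' element on the page as a list of strings."""
    with urlopen(website_url) as response:
        html = response.read()  # raw bytes; BeautifulSoup works out the encoding
    searchable_html = bs4.BeautifulSoup(html, "html.parser")
    return [p.get_text() for p in searchable_html.find_all("p")]


# Usage (needs a network connection):
# text = get_text_py3('http://www.scarleteen.com/article/sexual_health/condom_basics_a_users_manual')
```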

And there ya go! See you next time on… THE BLIND LEAD THE BLIND!
