My new love affair with…

Confession time:  My new love affair is with ParseHub.

ParseHub is a web scraping tool. I have used the free version a bunch of times now and I freakin’ love it. You give it a starting web page and you can extract data from pretty much any structured site. I have used it to collect data from simple tables, and also to build more complex workflows that extract data from one web page, click through to another web page to extract additional data, and then loop back to continue the original template. I used this more complex workflow on my first IronViz submission to collect pepper data from over 140 unique URLs. Oh, and it did that in like 15 minutes.

There is a little bit of a learning curve to get started, but there are plenty of helpful videos in ParseHub’s Help Center. Also, I have received emails and chats from their support team offering to help over Skype. So stick with it. It is worth it!

ParseHub project to extract Olympic data.

Here is a partial screenshot of the desktop application.

In this example, I am scraping Olympic Race Walking data (yes, that is a real sport) from Olympic.org.  Notice this is not a nicely structured table.  I am able to select the location of the games and use the relative select tool to extract the medal, result, country, athlete and URLs from the athlete and country hyperlinks.

You can see a sample of the output in the window below the web page.  The default output is JSON, but you can switch to CSV/Excel output too in the preview pane.
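
If you would rather reshape the JSON yourself than switch the preview to CSV, here is a minimal Python sketch of one way to flatten it. The key names (`games`, `location`, `results`, `medal`, `athlete`, `country`, `result`) and the placeholder values are assumptions for illustration only. The real names depend on what you call each selection in your own ParseHub project.

```python
import csv
import io
import json

# Hypothetical sample of nested output. Key names and values are placeholders,
# not real ParseHub output or real Olympic results.
SAMPLE_JSON = """
{
  "games": [
    {
      "location": "Games A",
      "results": [
        {"medal": "Gold", "athlete": "Athlete 1", "country": "Country X", "result": "1:20:00"},
        {"medal": "Silver", "athlete": "Athlete 2", "country": "Country Y", "result": "1:20:30"}
      ]
    },
    {
      "location": "Games B",
      "results": [
        {"medal": "Gold", "athlete": "Athlete 3", "country": "Country Z"}
      ]
    }
  ]
}
"""

FIELDS = ["location", "medal", "athlete", "country", "result"]


def flatten_games(data):
    """Return one flat row per medal result, repeating the games location."""
    rows = []
    for game in data.get("games", []):
        for result in game.get("results", []):
            row = {"location": game.get("location", "")}
            # Missing keys become empty strings so every row has the same columns.
            for field in FIELDS[1:]:
                row[field] = result.get(field, "")
            rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the flat rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten_games(json.loads(SAMPLE_JSON))
csv_text = rows_to_csv(rows)
print(csv_text)
```

Repeating the location on every row gives you one row per medal, which is usually the shape Tableau wants.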

There are not a lot of good sources (that I could find) for Olympics data, so I am sharing the ParseHub project I created.

All you need to do is complete a few simple steps and you can be off visualizing Olympic data from any event.

  1. Download ParseHub, install and register.
  2. Download the project.
  3. Open ParseHub->My Projects->Import Project and select the .phj file you just downloaded. The project should now appear.
  4. In your browser, go to the Olympic.org Results page.  Select a sport and an event and click Results.  Copy the URL.
  5. In ParseHub, click on Settings and paste the URL in the starting site field.
  6. Click Get Data->Run->Save and Run.
  7. You are done.

You can either download the output as CSV or JSON…OR…you can use the Web Data Connector (WDC) that Craig Bloodworth from The Information Lab built. To use the WDC, click on your account in ParseHub and then copy the API key.

Open Tableau. Connect to a new data source and select the Web Data Connector. Enter this URL in the field: https://data.theinformationlab.co.uk/parsehub.html and press Enter.

Paste in your API key and you should see your most recent project.  Click the drop-down for older ones.  I think the limit is 5 projects for the free account, but you can delete them when you are done.
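
The same API key can also be used outside Tableau. Below is a rough Python sketch, assuming ParseHub's v2 REST endpoint for a project's last ready run (`/projects/{PROJECT_TOKEN}/last_ready_run/data` with `api_key` and `format` parameters); check the path against ParseHub's API documentation before relying on it. The token and key values are placeholders, and the helper names are my own, not part of any ParseHub library.

```python
import gzip
import json
import urllib.parse
import urllib.request

# Assumed base URL for ParseHub's v2 API; verify against the official docs.
API_BASE = "https://www.parsehub.com/api/v2"


def last_ready_run_url(project_token, api_key, fmt="json"):
    """Build the URL for the data of a project's most recent completed run."""
    query = urllib.parse.urlencode({"api_key": api_key, "format": fmt})
    token = urllib.parse.quote(project_token, safe="")
    return f"{API_BASE}/projects/{token}/last_ready_run/data?{query}"


def fetch_last_ready_run(project_token, api_key):
    """Download the last ready run as parsed JSON (makes a network call)."""
    url = last_ready_run_url(project_token, api_key)
    with urllib.request.urlopen(url, timeout=30) as response:
        body = response.read()
        # The body may arrive gzip-compressed; urllib does not decompress it.
        if response.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))


# Placeholders only: substitute your own project token and API key.
print(last_ready_run_url("YOUR_PROJECT_TOKEN", "YOUR_API_KEY"))
```

Calling `fetch_last_ready_run(...)` with real values should return the same JSON you see in the preview pane, which is handy if you want to refresh a file on a schedule instead of going through the WDC.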

Race Walking

This is the Race Walking Viz I created from data scraped with ParseHub.

If you want to see some other examples, here is a simple ParseHub Project extracting data from a table on stats.hockeyanalysis.com. Here is the Viz I created using this data.

Here is a more complex ParseHub Project extracting pepper data from cayennediane.com.  This one clicks through to secondary pages and loops back.  It also extracts image URLs.  Here is the Viz I created using this data.

I hope this was helpful. Please leave any comments below and thanks for reading!

6 thoughts on “My new love affair with…”

  1. Thank you for mentioning us Adam! Glad to see you put ParseHub to great use. If you need any more help or ideas on how to use parsehub you can email me at angelina@parsehub.com.

    1. Adam Crahen

      Thanks Angelina!

  2. Thanks, Adam, for introducing me to ParseHub. This is really helpful.

    Just so that you know, I found Pooja and you on Tableau Public. And in my endeavor to build Tableau expertise, I have been “stalking” both of your Twitter accounts, this blog, and your Tableau Public visualizations regularly for the past week.

    -Anonymous Stalker (Nitin Dhawan, Bangalore)

    Cheers and keep them coming 🙂

    1. Adam Crahen

      Haha, great comment. I’m really happy to be doing this with Pooja. She’s constantly impressing me and challenging my skills too!

    2. Pooja Gandhi

      Hi Nitin – That’s so awesome! Glad you find our work useful. Please tag @thedataduo when you publish your Tableau work on twitter. We would love to see! We hope you will continue to find our work beneficial – Pooja!
