Confession time: My new love affair is with ParseHub.
ParseHub is a web scraping tool. I have used the free version a bunch of times now and I freakin’ love it. You just give it a starting web page and you can extract data from pretty much any structured site. I have used it to collect data from simple tables, and also to build more complex workflows that extract data from one web page, click through to another web page to extract additional data, and then loop back to continue the original template. I used this more complex workflow on my first IronViz submission to collect pepper data from over 140 unique URLs. Oh, and it did that in like 15 minutes.
There is a little bit of a learning curve to get started, but there are plenty of helpful videos in ParseHub’s Help Center. Also, I have received emails/chats from their support team offering to help over Skype. So stick with it, it is worth it!
Here is a partial screenshot of the desktop application.
In this example, I am scraping Olympic Race Walking data (yes, that is a real sport) from Olympic.org. Notice this is not a nicely structured table. I am able to select the location of the games and use the relative select tool to extract the medal, result, country, athlete and URLs from the athlete and country hyperlinks.
You can see a sample of the output in the window below the web page. The default output is JSON, but you can switch to CSV/Excel output too in the preview pane.
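If you stick with the JSON output and want a flat table later, a few lines of Python can do the conversion. This is only a minimal sketch: the key names (`games`, `location`, `results`, `medal` and so on) and the sample values are hypothetical placeholders, because the real names depend on what you called your selections in your own ParseHub project.

```python
import csv
import io
import json

# Hypothetical sample. Real key names come from the selection names
# in your own ParseHub project, and the values here are placeholders.
SAMPLE_JSON = """
{
  "games": [
    {
      "location": "Example City 2000",
      "results": [
        {"medal": "G", "athlete": "Athlete A", "country": "AAA", "result": "1:20:00"},
        {"medal": "S", "athlete": "Athlete B", "country": "BBB", "result": "1:20:30"}
      ]
    },
    {
      "location": "Example Town 2004",
      "results": [
        {"medal": "G", "athlete": "Athlete C", "country": "CCC", "result": "1:19:40"}
      ]
    }
  ]
}
"""


def flatten(data):
    """Turn the nested games -> results structure into one row per result."""
    rows = []
    for game in data.get("games", []):
        for result in game.get("results", []):
            row = {"location": game.get("location")}
            row.update(result)  # copy medal, athlete, country, result
            rows.append(row)
    return rows


def to_csv(rows):
    """Write the rows to CSV text, using keys in first-seen order as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten(json.loads(SAMPLE_JSON))
csv_text = to_csv(rows)
print(csv_text)
```

The parent value (the games location) is repeated on every child row, which is usually the shape Tableau wants anyway.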
There are not a lot of good sources (that I could find) for Olympics data, so I am sharing the ParseHub project I created.
All you need to do is complete a few simple steps and you can be off visualizing Olympic data from any event.
- Download ParseHub, install and register.
- Download the project.
- Open ParseHub->My Projects->Import Project and select the .phj file you just downloaded. The project should now appear.
- In your browser, go to the Olympic.org Results page. Select a sport and an event and click Results. Copy the URL.
- In ParseHub, click on Settings and paste the URL in the starting site field.
- Click Get Data->Run->Save and Run.
- You are done!
You can either choose to download the output as CSV or JSON…OR…you can use the Web Data Connector that Craig Bloodworth from the Information Lab built. To use the WDC, click on your account in ParseHub and then copy the API key.
Open Tableau. Connect to a new data source and select the Web Data Connector. Enter this URL in the field: https://data.theinformationlab.co.uk/parsehub.html and press Enter.
Paste in your API key and you should see your most recent project. Click the drop-down for older ones. I think the limit is 5 projects for the free account, but you can delete them when you are done.
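If you are not using Tableau, the same API key should also let you pull the latest results with a short script. The sketch below is based on my understanding of ParseHub's REST API (a `last_ready_run/data` endpoint that takes `api_key` and `format` parameters and may return gzip-compressed data), so check their API documentation before relying on it. The project token and key in the usage comment are placeholders, and the function names are my own.

```python
import gzip
import json
import urllib.parse
import urllib.request

# Assumed base URL for ParseHub's REST API; verify against their API docs.
API_ROOT = "https://www.parsehub.com/api/v2"


def last_ready_run_url(project_token, api_key, fmt="json"):
    """Build the URL for the most recent finished run of a project."""
    query = urllib.parse.urlencode({"api_key": api_key, "format": fmt})
    return f"{API_ROOT}/projects/{project_token}/last_ready_run/data?{query}"


def fetch_last_ready_run(project_token, api_key):
    """Download and parse the latest run's JSON output."""
    url = last_ready_run_url(project_token, api_key)
    with urllib.request.urlopen(url) as response:
        body = response.read()
    # Decompress only if the body starts with the gzip magic bytes.
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))


# Usage (placeholders, not real credentials):
# data = fetch_last_ready_run("YOUR_PROJECT_TOKEN", "YOUR_API_KEY")
# print(list(data.keys()))
```

Nothing here hits the network until you call `fetch_last_ready_run` with your own token and key.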
This is the Race Walking Viz I created from data scraped with ParseHub.
Here is a more complex ParseHub Project extracting pepper data from cayennediane.com. This one clicks through to secondary pages and loops back. It also extracts image URLs. Here is the Viz I created using this data.
I hope this was helpful. Please leave any comments below and thanks for reading!