Scraping the Web with Power Query
Last year I entered the PowerBI video demo contest. Whilst I didn’t win any prizes I did learn a fair bit from going through the exercise of putting together a screencast demo (more on that another time). In this post I’m going to walk-through the web scraping part of my demo.
The website that I choose to use for my demo was the National UFO Reporting Center (more for novelty sake then any serious interest). Below you can see the listing by shape of craft (www.nuforc.org/webreports/ndxshape.html).
If I click on one of these links then details on each individual sighting is listed
The overall goal is to have all of the individual sightings retrieved from each individual page and stored into a Power Pivot table.
One thing which makes the exercise of scraping this site easy is that the index page links to each of the individual pages in a consistent manor:
[http://www.nuforc.org/webreports/ndxs{SHAPE NAME HERE}.html](http://www.nuforc.org/webreports/ndxs{SHAPE NAME HERE}.html “http://www.nuforc.org/webreports/ndxschanged.html")
This means the process of scraping will follow these steps:
- Retrieve the list of shapes from the index page and store these
- Loop through each of the shapes from step 1:
- Dynamically generate the URI based on the shape name
- Retrieve the table of sightings
Retrieving the Shape index table
From Excel select the Power Query tab, then click the From Web button.
This will bring up a dialog asking for the URL to the webpage, enter the URL for the shape index: http://www.nuforc.org/webreports/ndxshape.html
Doing this used to take you directly into the Power Query editor, but now there is an extra step. In the right-hand side of Excel you will see a Navigator appear:
This is showing all the different elements that Power Query discovered on the web page, Table 0 contains a list of all the shapes, which you can see in the preview when you hover your mouse over it. Click Table 0 then click Edit to bring up the Query Editor.
At this point all I did was a few clean up tasks:
- Removed the Count column
- Renamed “Reports” to “Shape”
- Renamed the Query to “Shape List”
I also un-ticked “Load to Worksheet”, as you will see why shortly.
Retrieving individual sightings
The next step is to parse the individual sighting pages, I’ll do this in two steps. In the first step I’ll just take one of the sighting pages and using it perform any transformations or clean up tasks, then in the second step I will show you how to turn this into a formula which will take a shape name, which will be used to dynamically generate a URL for retrieval, this formula will then be used in the next section.
For my example I will be using the Cone shaped sightings (www.nuforc.org/webreports/ndxsCone.html).
Once again use the From Web button and enter the above URL, then in the navigator select Table 0 and then click Edit to open the Query Editor.
I performed the following clean up tasks:
- Removed “Posted” and “Shape” columns
- Used the Split Column transform on the “Date / Time” (delimited by Space, at the left most)
- Renamed the new columns to “Date” and “Time”
The dates are in a format that Power Query is unable to recognize so I had to perform some extra transformation steps to reformat them:
- Used the Split Column transform on the “Date” column (delimited by custom, /, at each occurrence). This will split out each date part.
- Changed the data type of these new columns to Text
- Inserted a custom column with the following formula:
Date.From(Text.Combine({[Date.3],[Date.1],[Date.2]}, “/”))
- Removed the “Date.1”, “Date.2”, “Date.3” columns
- Moved the Date column to be in the first position
The next step is to turn this into a reusable formula. Bring up the Advanced Editor from the View tab in the ribbon.
To turn this into a function replace the first two rows with the following code:
let fnGetByShape = (shapeName as text) => let Source = Web.Page(Web.Contents(Text.Replace("http://www.nuforc.org/webreports/ndxs{shape}.html”, “{shape}”, shapeName))),
and insert the following at the end
in fnGetByShape
These two code snippets will turn the query into a function. Click Done to close the Advanced Editor you will see that the the table of data that was there is now replaced with this:
You can test this out by clicking the Invoke button and entering in a shape name (e.g. Cone).
The last thing left to do is give the query a good name (fnGetSightingsByShape) and ensure that both the Load to Worksheet and Load to Data Model options are un-ticked.
Using the function
Now that both the list of shapes query and the function to retrieve each individual sighting are ready, it is time to combine them together and load the data into Power Pivot.
Open the Shape List query that was created earlier and add a custom column:
This will add a new column which calls the custom function. Expanding the Table in the new column will show the columns from the function (Date, Time, City, State, Duration, Summary). I had to set the data types for the Date and Time columns.
Now the only thing left to do is tick “Load to Data Model” then get a cuppa tea or coffee while Power Query loads the data into Power Pivot.
In Summary
In this post I’ve shown you how you can use Power Query to scrape data from a website, in addition I’ve also shown you how to build reusable functions which can be used to scrape different levels of pages on a website.
I’ve uploaded my finished demo to Dropbox for your enjoyment.
🍪 I use Disqus for comments
Because Disqus requires cookies this site doesn't automatically load comments.
I don't mind about cookies - Show me the comments from now on (and set a cookie to remember my preference)