Downloading and Parsing Met Office Historic Station Data with Alteryx

imageI was having a look at the public data the Met office provide on their website. They have a huge amount freely available on their website. I was interested in the historical data they provided for each of the stations accessible via this web page.The page allows you to download the data one station at a time, but using Alteryx we can do better.

First things first we need to get a list of all the stations and the URLs to the data files. As this is a dynamically created map we can’t just grab the source of the page. However opening the page up within the Chrome Developer Tools (press F12 to access, similar tools are available in Internet Explorer/Edge or Firefox) and looking at the network tab lets us see all of the data files the page is requesting:

image

We can filter the long list down to just those made by JavaScript, using XHR option in the toolbar. Looking down the list I found historic.json. This contained a list of all of the stations on the map, grouped by ‘open’ or ‘closed’:

{
“closed”: {
“Cwmystwyth”: {
“id”: 371423,
“lat”: 52.35817,
“lon”: -3.80198,
“opened”: 1959,
“closed”: 2011,
“url”: “cwmystwythdata.txt”
}, …

The URL within each entry is relative so I needed to find the full URL for one from which I could generalise and download the data from. So I downloaded the Aberporth data set and recorded the URL: http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/aberporthdata.txt

imageNext step was to create a flow in Alteryx which would download the historic.json data file and parse it into a table which I could then use to generate a list of URL to download each of the corresponding txt files. Alteryx has a JSON parse tool which can be used to split the JSON into a field based table. Version 10 also lets you use JSON data as an input or output tool, but as yet I can get it to connect to an http address, instead I used a Download Tool linked into the JSON Parse tool.  This produced a table like:

image

All the data I need was encoded in the JSON_Name and JSON_ValueString. Using a Text to Column tool set to split the JSON_Name field and then a Cross Tab tool to break this up and build a table. Final step was to drop the columns I didn’t need and to create the full URL. Final table looks like:

image

The final flow to download the JSON and parse looked like:

image

Reading the JSON Data via PowerQuery

A small shout out to PowerQuery in Excel for this process as well. You can get to a table of data from a JSON URL in a couple of clicks. Screenshots below are from the Excel 2016 version, but same is true in 2010/13 version I think. Go to Data tab in the ribbon, then From Other Sources, From Web. Enter the URL to the JSON file in the dialog box.

image

After that it is a couple of clicks to convert the records into a table, then expand the value column (unselect ‘Use original column name as prefix’) to produce a column based table like the one below.

image

image

The next step is to un-pivot all the columns (except the first one) and then to expand the resulting value column into a table, again excluding the original column name. Finally we create a couple of custom columns containing the full URL and isClosed flag and finally reorder and rename (very much like in Alteryx). The resulting table is shown below.

image

PowerQuery is a great free extension of Excel, and while no where near the power of Alteryx is a nice option for quick getting some data together. I find it especially easy at handling JSON data.

Downloading and Parsing the Data Files

Having got the list of stations and the URL to each of the data files the next step was to link up another download tool to the generated list of URLs. The download tool will read the data into a string field. The file format looks like:

image

The files are fixed with with some header text at the top. All fields are numbers which makes processing easier. Even though they fixed with, I used the same technique I used to read CSV files. Only difference is the separator is a space rather than a column. As I add a row index as part of the parsing process, I used this to pick the header row out and to skip the first few comment lines. The flow below downloads the files and parses into a table:

image

Final step is a little data tidying and converting types. I used the Find Replace tool to repair the few records which had additional notes on them. I converted the year and month into a date using the DateTimeParse function. The output looked like:

SNAGHTML8b182b6

You can download the workflow and output TDE file.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s