I was having a look at the public data the Met office provide on their website. They have a huge amount freely available on their website. I was interested in the historical data they provided for each of the stations accessible via this web page.The page allows you to download the data one station at a time, but using Alteryx we can do better.
First things first we need to get a list of all the stations and the URLs to the data files. As this is a dynamically created map we can’t just grab the source of the page. However opening the page up within the Chrome Developer Tools (press F12 to access, similar tools are available in Internet Explorer/Edge or Firefox) and looking at the network tab lets us see all of the data files the page is requesting:
The URL within each entry is relative so I needed to find the full URL for one from which I could generalise and download the data from. So I downloaded the Aberporth data set and recorded the URL: http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/aberporthdata.txt
Next step was to create a flow in Alteryx which would download the historic.json data file and parse it into a table which I could then use to generate a list of URL to download each of the corresponding txt files. Alteryx has a JSON parse tool which can be used to split the JSON into a field based table. Version 10 also lets you use JSON data as an input or output tool, but as yet I can get it to connect to an http address, instead I used a Download Tool linked into the JSON Parse tool. This produced a table like:
All the data I need was encoded in the JSON_Name and JSON_ValueString. Using a Text to Column tool set to split the JSON_Name field and then a Cross Tab tool to break this up and build a table. Final step was to drop the columns I didn’t need and to create the full URL. Final table looks like:
The final flow to download the JSON and parse looked like:
Reading the JSON Data via PowerQuery
A small shout out to PowerQuery in Excel for this process as well. You can get to a table of data from a JSON URL in a couple of clicks. Screenshots below are from the Excel 2016 version, but same is true in 2010/13 version I think. Go to Data tab in the ribbon, then From Other Sources, From Web. Enter the URL to the JSON file in the dialog box.
After that it is a couple of clicks to convert the records into a table, then expand the value column (unselect ‘Use original column name as prefix’) to produce a column based table like the one below.
The next step is to un-pivot all the columns (except the first one) and then to expand the resulting value column into a table, again excluding the original column name. Finally we create a couple of custom columns containing the full URL and isClosed flag and finally reorder and rename (very much like in Alteryx). The resulting table is shown below.
PowerQuery is a great free extension of Excel, and while no where near the power of Alteryx is a nice option for quick getting some data together. I find it especially easy at handling JSON data.
Downloading and Parsing the Data Files
Having got the list of stations and the URL to each of the data files the next step was to link up another download tool to the generated list of URLs. The download tool will read the data into a string field. The file format looks like:
The files are fixed with with some header text at the top. All fields are numbers which makes processing easier. Even though they fixed with, I used the same technique I used to read CSV files. Only difference is the separator is a space rather than a column. As I add a row index as part of the parsing process, I used this to pick the header row out and to skip the first few comment lines. The flow below downloads the files and parses into a table:
Final step is a little data tidying and converting types. I used the Find Replace tool to repair the few records which had additional notes on them. I converted the year and month into a date using the DateTimeParse function. The output looked like: