While I had planned to write about CSV parsing next, I decided to cover a couple of other topics first. After that, I plan to carry on and write about my experience parsing CSVs, and then about reuse via Macros and Formula Add-Ons.
First, an improvement to the last post, based on a comment from Adam Riley: there are IN and NOT IN operators. I had been looking in the function list and hadn't considered that there might be an operator for it. It's a much nicer way to do it. It would be nice to see a GUI for this appear in a later version (the basic filter option of the Filter tool is pretty new; I believe it is a v9.5 feature).
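To make the semantics concrete, here is a small Python sketch of what an IN / NOT IN filter does to a set of rows. The field name and values are invented for illustration; in Alteryx this would be a single filter expression rather than code.

```python
# Keep rows whose Region value is IN a list, and split off the rest (NOT IN).
# The rows and the "Region" field are made up for this example.
rows = [
    {"Region": "North", "Sales": 100},
    {"Region": "South", "Sales": 200},
    {"Region": "East", "Sales": 150},
]
wanted = {"North", "East"}

kept = [r for r in rows if r["Region"] in wanted]         # IN
dropped = [r for r in rows if r["Region"] not in wanted]  # NOT IN
```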
One frustration when I was first starting out was that if a workflow was connected to a live database (often a SQL Server instance in my case), each run would refresh the data. As some of the queries were quite complicated and time-consuming, this slowed the process down. The newest release (v9.5) added a cache data option: tick a box in the properties window and the data is cached in a temporary file.
Prior to this, I needed to write the data into Alteryx database files, switch the connection to those, and then reconnect to the live source once I had finished designing the workflow (or, more often, forget to!). It's great to see the continual improvements in the product, speeding things up and solving little issues like this. It also shows the true performance of the engine: with the data localised, it is very impressive how fast the workflows run (though I am running on a very powerful server).
One thing I rely on in my programming world is source control. As the workflow files are actually just XML files, they lend themselves to being diffed and kept in source control. As I inevitably go through a few iterations on a workflow (and muck it up many times), it's nice to have the backing of source control to recover my work. I am using git (as it can work with just a local repository), but use your favourite.
Joins worked differently to how I expected. At first, I guessed they would work like SQL joins, i.e. a Left Join would include the intersection as well as the unmatched left rows, and likewise a Right Join. Instead, Alteryx's Join tool keeps them separate: the intersection comes out of the J output, while the L and R outputs contain only the un-joined rows from each side.
After using the designer for a little while, I now find this very useful. Firstly, and most importantly, it makes it really easy to detect failures in the workflow when you expect a set to join completely.
The other use I find is that it allows you to process the left and right outputs, and then union them with the intersection to produce the blended set you need. The un-joined datasets are returned without any of the column renaming or selecting applied to the joined set. You can use Select and Formula tools to shape the un-joined data as needed, and then the Union tool to add it back into the set. The Union tool will add nulls for columns not present in every input, with a warning message in the Output window. The flow looks something like this:
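The splitting and re-unioning described above can be sketched in Python. This is only an analogue with invented data, not Alteryx itself: J holds the matches, L and R hold the un-joined rows, and the union fills missing columns with None, mirroring the nulls Alteryx adds.

```python
# Two inputs keyed on id; the Join tool would split them three ways.
left = {1: "Alice", 2: "Bob", 3: "Carol"}       # id -> name
right = {2: "London", 3: "Paris", 4: "Berlin"}  # id -> city

# J output: rows whose key appears on both sides.
j = [{"id": k, "name": left[k], "city": right[k]} for k in left if k in right]
# L and R outputs: the un-joined rows from each side.
l_out = [{"id": k, "name": left[k]} for k in left if k not in right]
r_out = [{"id": k, "city": right[k]} for k in right if k not in left]

# Union: columns absent from an input come through as None (nulls in Alteryx).
union = j + [{**row, "city": None} for row in l_out] \
          + [{**row, "name": None} for row in r_out]
```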
One limitation I came across on a recent ad-hoc data request (the kind of work Alteryx is particularly good at speeding up) was that join conditions can only be equality conditions. I had a set of names with start and end dates. I also had a series of dates, and wanted to create a set of dates for each name between its start and end date. This is a pretty straightforward join in SQL, but I could not spot how to do it in Alteryx.
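For clarity, here is the range join I was after, sketched in plain Python with invented names and dates: one output row per name per date falling between that name's start and end (inclusive). It is effectively a Cartesian product followed by a filter, which hints at one possible workaround in Alteryx (an Append Fields tool followed by a Filter tool), though I haven't verified that is the intended approach.

```python
from datetime import date, timedelta

# Invented inputs: names with date ranges, and a series of dates.
people = [
    {"name": "ABC", "start": date(2014, 1, 1), "end": date(2014, 1, 3)},
    {"name": "XYZ", "start": date(2014, 1, 2), "end": date(2014, 1, 2)},
]
dates = [date(2014, 1, 1) + timedelta(days=i) for i in range(5)]

# Non-equi join: cross the two sets, keep rows where the date is in range.
expanded = [
    {"name": p["name"], "date": d}
    for p in people
    for d in dates
    if p["start"] <= d <= p["end"]
]
```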
I am sure this is possible, and probably easy! The more I use the tools, the more I think that if you are finding something hard to do, there is probably a better way to do it. As an example, when I was first starting out I wanted to join a single value from one part of the flow (for example, the total number of rows) onto the main dataset. My first attempt involved adding a Formula tool to both datasets to create a fixed value I could then join on. This resulted in a flow like:
The Append Fields tool is designed to do exactly this. It performs a Cartesian join (i.e. every row of the left set is joined with every row of the right set). Using the Append Fields tool, the above becomes:
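A Python sketch of the same operation, with made-up data: the "right" side is a single summary row (here, a total row count), so the Cartesian join simply appends it to every row of the main set.

```python
from itertools import product

# Main dataset and a single-row summary to append to every row.
main = [{"id": 1}, {"id": 2}, {"id": 3}]
summary = [{"total_rows": 3}]

# Cartesian join: every left row paired with every right row, fields merged.
appended = [{**left, **right} for left, right in product(main, summary)]
```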
I wanted a way to extract some parts of a filename and store the results in new fields. The filenames were of the form:
I wanted to end up with three new fields: Ticker (string), Start (date), and End (date). While I could have constructed the formulae needed via the various standard string functions, I figured it would be easier with regular expressions. The regular expression syntax Alteryx uses is Perl-style regex. While it isn't possible to test an expression as you write it within Alteryx, there are various online tools where you can do this (e.g. debuggex.com).
The REGEX_Match function determines whether a string matches a pattern and only returns true or false. Likewise, REGEX_CountMatches returns the number of matches. I was hence left with REGEX_Replace. First, I wanted to remove the extension from the filename. The following expression finds the last '.' and all the following characters, and replaces them with an empty string:
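The same step can be sketched in Python. The pattern below (match the last '.' and everything after it, replace with an empty string) is my reconstruction of the idea rather than the exact Alteryx expression, and the filename is a hypothetical example consistent with the patterns used later.

```python
import re

# Hypothetical filename in the shape described in the post.
filename = "Ticker_ABC_Start_140101_End_141231.csv"

# Strip the extension: the last '.' plus any following non-'.' characters.
no_ext = re.sub(r"\.[^.]*$", "", filename)
```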
Having got this, I then created an expression which searches for a keyword (Ticker, Start or End), finds the text between the underscores, and returns it. Finally, for Start and End, I convert the string to a date using the DateTimeParse function:
Start=DateTimeParse(REGEX_Replace([NoExt], ".*Start_([^_]*).*", "$1"), "%y%m%d")
End=DateTimeParse(REGEX_Replace([NoExt], ".*End_([^_]*).*", "$1"), "%y%m%d")
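The two expressions above translate almost directly into Python's `re.sub` and `datetime.strptime`, which is a handy way to check the patterns. The filename is a made-up example consistent with them.

```python
import re
from datetime import datetime

# Hypothetical extension-stripped filename matching the patterns above.
no_ext = "Ticker_ABC_Start_140101_End_141231"

# Capture the text between the keyword's underscore and the next underscore.
ticker = re.sub(r".*Ticker_([^_]*).*", r"\1", no_ext)
# For Start and End, also parse the captured text as a %y%m%d date.
start = datetime.strptime(re.sub(r".*Start_([^_]*).*", r"\1", no_ext), "%y%m%d")
end = datetime.strptime(re.sub(r".*End_([^_]*).*", r"\1", no_ext), "%y%m%d")
```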
This produces a dataset like:
I have put the workflow online here.