In this guide we’re going to look at how Tableau Prep, Alteryx and Power Query approach the same data challenge. Each of these tools can be used for data preparation, however they do so in different ways.
We’re going to look at:
- How they import multiple files at the same time
- How they manage basic calculations
- How they manage more advanced data transformation logic
- Options for outputting data
Let’s imagine we are a large business with over 1,000 employees. We have a centralised HR department that manages data related to our employees and they’ve reached out to us for support.
One of the team leads has identified that a team member spends 1 hour per month manually preparing HR data in order to identify promotions. This data is then sent higher up in the business to project workforce spend, ensure equality policies are observed, and understand the spread of roles across business units.
The team lead also mentioned that occasionally there is a risk of human error in the manual data preparation. This leads to additional time being used to fix issues and complaints from senior management about the errors.
Finally, the team member is the only person who fully understands the correct steps to execute this process, which has caused delays when the member has taken annual leave or sick leave. Additionally, if the team member were to leave the business, the process knowledge may be lost.
By speaking with HR we have identified that automation of this process can:
- Save 1 hour per month for the employee (plus variable additional time for fixing issues)
- Remove all risk of human error
- Ensure deadlines are met
- Ensure the longevity of the process independent of staff changes
- Ensure the process is documented
The business is keen to compare different Data Preparation software as part of a broader data transformation strategy, and have requested that we only utilise the native functionality of the tools to ensure a fair comparison. This means we will not be using any scripts (though each of the softwares do offer the ability to add scripts).
Please note all information is made-up.
The HR team has provided us with a sample of the data they use. They informed us that the data is automatically generated as an excel file from the source system, and each file contains a snapshot of the data at the start of each month.
After spending some time understanding the data and considering the business requirements we’ve established that the technical requirements are as follows:
- Consolidate the files into a single historical table
- Extract the date information from the filename
- Flag the date in which a promotion occurred for an employee
- Output the data ready for use with a visualization software
Step 1 – Inputting multiple files (with the same format)
Though we have only a sample of the files, there are many years of monthly data snapshots that require consolidation. We are going to look at how to manage this in an automated way.
Upon opening Tableau Prep, connecting to data is simple. We simply click Connect to Data, and select the file we wish to input.
To combine each of the monthly snapshots into a single table, we have the option of using a Wildcard union. After selecting the Wildcard union we use an * (asterisk) to bring in multiple files based on the filename.
Upon opening Alteryx, we first drag in the Input Data tool. Then we select one of the files to input. Alteryx will automatically populate the Connect a File or Database string in the Input Data tool, we can use an * (asterisk) to bring in multiple files as a wildcard union based on the filename.
Upon opening Power BI we first select Get data, select the Folder and input the File Path of the folder.
After clicking OK Power BI will show the files within the folder that it has found. In this case, these three files are all the files we wish to combine.
By clicking on Combine, and then Combine and Transform Power Bi will ask us to select a sample file from one of the files, and upon clicking OK it will combine the files and load them into Power Query ready for transformation.
Step 2 – Creating new fields based on calculations
The HR data that we have just input is limited in the data that it contains. In our case it only includes the date of the monthly snapshot in the filename and not in the table itself. We require the date to be contained within the table so we can look at the promotions for each month. Let’s look at how to achieve this.
To make any calculation in Tableau Prep we first need to insert a Clean Step into the Flow. Tableau Prep has automatically included a FilePaths column so we can easily parse this with a calculation by right clicking and selecting Custom Calculation.
Within the Custom Calculation we use a Regular Expression to extract the date from the FilePaths string and wrap it with a Date() function to easily change the string to a date format.
In Alteryx, we first ensure we have selected to bring in File Name Only from the input tool. One of the interesting things about Alteryx is that it has many different tools for different functionality, and in this case we need to use the regex tool. By selecting the correct Column to Parse, the right Regular Expression, and making sure we Parse the output, we get the expected output.
However, the output is not a date, it is a string. To fix this we bring in the DateTime tool and adjust the configuration to output a new field that is in fact a date type.
In Power Query, we do not need to type our own Regular Expression. We can instead create a Column from Examples. After selecting the Source.Name column and clicking Column from Examples -> From Selection we can type the pattern we wish to match. By entering 2021-02-01 the rest of the rows autocomplete.
Step 3 – Calculating promotion status based on the previous row
Though the HR data contains information about the Job Title, it’s current format doesn’t allow us to actually present the data in a way that would show monthly promotion trends. To do this we need to compare the Job Title of an Employee to the previous row. Please note; in our company people can only be promoted, so we can assume any Job Title change is a promotion.
There are a number of ways to do calculations that cover multiple rows in Tableau Prep, and this depends on the data you have.
Preppin’ Data has some great blog posts covering this topic if you wish to know more:
The above method doesn’t exactly work for our data as we are missing a Row ID. Hence, we need to take a similar, but alternative approach.
First of all, we need to create a Row ID as none exists. To do this we use a PARTITION formula to group the data by employee and then assign a Row ID based on the ascending order of the date. Following this we assign a second Row ID field that is the initial Row ID +1. The purpose of this is to compare each row (month) with the next row (month).
Then we split the data so we have a table with Row ID and a table with Row Id +1. We use an inner join on the Row IDs and Employee ID fields to get the desired output. We could also do a left inner join and then use calculations to fill in null values, but it can be easier and quicker to add an extra join later on in this flow.
Now we have our inner joined data we can simply compare the Job Title fields. If they are not the same then we flag the promotion.
Finally we add another join that brings in the records that were lost from our earlier inner join, and finish by adding a final clean step.
Alteryx has an inbuilt tool for calculations that reference different rows. This is the Multi-Row Formula tool.
To utilize this for our use case we first sort the date to ensure that the month for each row is ordered correctly, we don’t need to order by Employee ID as we will group by that field in the Multi-Row Formula tool.
After dragging in a Multi-Row Formula tool we Create New Field as our Promotion flag, Group by Employee ID, and use an IF statement to compare the Job Title of the current row (month) with the previous row (month) per Employee ID.
Power Query utilises a similar process to Tableau Prep, as we don’t have a function to iterate through groups of rows like Alteryx. First, we Sort Ascending the Employee ID and then the Date fields.
Then we add an Index Column From 1 and an Index Column From 0 to act as our current and next month Row IDs.
Like the Tableau Prep flow we want to join the Row IDs associated with our current and next month in order to compare the Job Title of each month. To achieve this in Power Query we use the Merge Queries function. Then select our Index columns and join them via a Left Outer Join.
This will add a new column into the table. We now want to expand the column to obtain the Employee ID and Job Title columns, and can do so by clicking in the header of the newly generated column.
Finally, we can use two calculations to flag promotions. A Conditional Column to check if the rows are the same employee, and a Custom Column that runs an IF statement to return our Promotion flag if the Job Titles are different but the employee is the same.
Step 4 – Output the data
Finally all the hard work is done and we simply need to output the data ready for use in the visualization tool. Let’s have a quick look at how to do that before our project debrief.
By inserting an Output step Tableau Prep allows us to output to a file, Tableau Server/Online, or a Database.
By inserting an Output Tool Alteryx allows us to output to a vast number of different outputs, either as a file, or to a database.
Power Query allows us to Apply the query we have just created and utilize the data model in Power BI. Without scripting, Power Query outputs must be consumed within Power BI.
There we have it. In just a few steps and a couple hours of work we have created a process that can be automated. After speaking with HR a few months down the line we found:
- 1 extra hour per month is opened up for the team member to focus on value add tasks rather than repetitive data preparation and error fixing
- There is now no risk of human error and so no complaints from senior management
- The process runs independent of employees, so there are no missed deadlines due to leave or changes in staff
- Understanding the comparisons between Tableau Prep, Alteryx and Power Query has allowed the business to make the best decision on which software to utilise moving forward
- Observing the success of this small-scale project, confidence in process automation has increased and other business units have come forward with some much larger projects that will add significant value across the business
Written by James Fox, with thanks to Jenny Martin for her expertise in Tableau Prep, and Gary Vance for his expertise in Power BI & Power Query.