Building a Better Analytics Organization - Visualization Issues (Part 4)

Tags: analytics, analytics engineering, data, data engineering · May 21, 2022


We’ve arrived at Part 4 of Building a Better Analytics Organization, and in this part we’re going to cover issues around visualizations within applications such as Power BI and Tableau, the two most common tools in this space. This article will highlight the complexity of working with visualization tools and the need for simplification.


Part 1 — Introduction, common tasks, and an introduction to challenges for an individual and for a team
Part 2 — Overview of the common areas where problems can occur and the specific problems that individuals and teams face
Part 3 — Details of specific problems when working with code
Part 4 — Details of specific problems when working with visualizations
Part 5 — Details of specific problems when working with tables and views

Part 6 — How to solve the challenges

Part 7 — Implementation tips


Code Issues

In Part 3, we covered a number of issues encountered when dealing with code, but I saved one more example for this article. Within a visualization, we have two options for connecting to our database. Option 1: writing custom SQL to query the database. Option 2: selecting tables from the UI to have the data imported. Let’s start by discussing Option 1.

Having the ability to write custom SQL within a visualization tool is extremely important (more on why in the next section), but using this feature creates new challenges. Unfortunately, the visualization tools don’t decouple the SQL from the rest of the workbook files, which means that you can’t view the SQL unless you open your visualization application. Not only does this present challenges with maintaining and organizing your code (discussed in Part 3), it’s also a huge pain when attempting to understand or debug your code and visualization. Let’s walk through the process to understand why.

Seriously. You have to go through each of those steps just to look at the code used in a visualization. And if the process didn’t seem long enough already, a few of the steps can make it even more challenging.

Take the Stumble Upon Dashboard step, for instance. I describe it that way because many times you find yourself in situations where you don’t know what you don’t know. If you wanted to search the entire codebase for every instance where a table or view was being used, you’d need to look in every dashboard in your analytics organization, but you can only look into the workbooks that you know about. And once you figure out which dashboard you’d like to work with, you still have to be granted permissions.

This adds further challenges because workbooks carry granular security that may allow you to view a dashboard but not actually download it. When this happens, you’re forced to raise a request for elevated access before you can begin to investigate the datasource. And then we have Power BI, which functions differently from Tableau.

In Tableau, if you inspect the custom SQL datasource, you’ll see the raw SQL, which you can copy, paste, and execute from your IDE. With Power BI, the custom SQL that you originally wrote is wrapped in the Power Query (M) language, which means that you cannot copy, paste, and execute this code from your IDE. To run the code from your IDE, you’ll need to strip out the wrapper syntax, which introduces yet another problem: modifying the original code.
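To make the "strip out the wrapper" step concrete, here's a minimal sketch in Python. It assumes the custom SQL was stored in the common Sql.Database(..., [Query="..."]) form that Power BI generates; the helper function, regex, and sample expression are hypothetical illustrations, not part of any Power BI API.

```python
import re

def extract_sql_from_m(m_expression: str) -> str:
    """Pull the raw SQL string out of a Power Query (M) expression.

    Hypothetical helper: assumes the SQL was supplied via the Query
    option of Sql.Database(...), which is how Power BI typically
    stores custom SQL entered through the UI.
    """
    match = re.search(r'Query\s*=\s*"(.*?)"\s*[\]\),]', m_expression, re.DOTALL)
    if not match:
        raise ValueError("No embedded Query=... found in M expression")
    # M escapes embedded double quotes by doubling them ("")
    return match.group(1).replace('""', '"')

# A made-up M expression of the kind you might find in a workbook.
m_source = '''let
    Source = Sql.Database("myserver", "sales_db",
        [Query="SELECT id, amount FROM sales WHERE amount > 100"])
in
    Source'''

print(extract_sql_from_m(m_source))
```

Even with a helper like this, you've still modified the code path between the workbook and your IDE, which is exactly the risk described below.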

Because you’ve modified the original code, you can’t be certain that you didn’t introduce a new bug, which can result in runtime issues or worse, results that don’t match the dashboard.

Wouldn’t it be great if the process looked like this? Well, it is possible. And in Part 6, I’ll explain how.

Increased Touch-Points

Any time that you have to touch something, you risk breaking that thing. This is one of the many reasons why, as a computer consultant, I was afraid to restart a server that had been running for a year or longer. There’s a chance that when you try to boot the server back up again, it might not come back up. Now, I’m not concerned about what happens when I close a visualization workbook and reopen it. What I am concerned about is an unintended change that occurs even when I don’t think I touched anything at all. This becomes an increasingly important concern when dealing with code and our visualizations.

If you have a visualization workbook and you need to add a new dropdown filter, it’s appropriate (and the only way to accomplish the task) to open the workbook, add the new filter, and save the workbook. However, if all you want to do is add comments to your code (ha, okay, who’s going out of their way to do that, right?!) or update the code to take advantage of a new View that was built to DRY (don’t repeat yourself) up your code, you’re still going to have to modify the workbook.

This entails opening the workbook, updating the code within the datasource, and saving the file. That activity isn’t necessarily a problem. The problems occur when you accidentally click something as you’re browsing around in the UI, out of curiosity or while performing QA. You can end up changing a default setting or, worse, breaking something. Then you save the workbook, publish it, and later find out that you created a new problem.

For this reason, it’s best if you can keep the touch-points to a minimum on the workbook. And while the visualization tools aren’t doing us any favors by coupling the code with the workbook, there are things that we can do to reduce the number of times that we have to touch the workbook for the sole reason of updating the code. I’ll talk more about this when we get to our solutions. Note, I’m not advocating that you don’t perform QA on your workbook after updating code!


Inability to Perform Proper Quality Assurance

Earlier in my career I focused a lot on quality assurance which is probably why this section and the end-to-end process resonates with me. It also frustrates me because these visualization tools do not allow for adequate quality assurance due to how the applications were designed. If you’re new to visualizations, allow me to explain.

In Tableau, there is a hierarchy of filters (an order of operations): Extract filters, Data Source filters, Context filters, Dimension filters, Measure filters, and Table Calculation filters. If the number of places where a filter can live isn’t challenging enough, each one of these filters is configured through a UI setting that you have to manually inspect. Inspection isn’t quick or easy, either. But it doesn’t end there.

You can create custom calculations within the visualization (e.g., net profit margin = (revenue - cost) / revenue), which is a very important and powerful feature of the application. However, the code is embedded in each Measure, so the only way to view the code is to edit the Measure. This causes two problems: 1. it’s time consuming, as you have to open each and every custom calculation to view the code, and 2. you’re touching things and could accidentally introduce bugs. But wait, there’s more! There’s no version control on any of these changes, which we’ll get to in just a moment. Before that topic, though, I want to discuss the actual QA process.


Visualizations as Your Only QA Source


I worked for a company not too long ago where the head of data engineering said, “You don’t need to use SQL. You can just get everything from Tableau.” Needless to say, I don’t work for that company anymore. No, my friends, you cannot get everything you need from Tableau. There are so many layers of UI-driven settings that it is nearly impossible to QA your workbook based solely on the workbook itself. What do I mean?

If you’re going to QA something, you should be able to trace everything through a proper chain of custody. For example, if you ask your neighbor to put 3 apples in a bag, when you get the bag, you can verify that you have 3 apples in it. But what if you don’t know how many apples to expect in the first place? This is how a lot of SQL queries work. When a SQL query is written, we don’t always know what sort of results we will get back. And if you don’t know exactly what you should get back, you can’t perform a 1:1 verification of your expected results against the actual results. So what do we do?

We break things down into small units of work: summing, counting, distinct counting, and more to see if everything looks to be in working order. We take those numbers and jot them down as the results of step 1. Then we take the results of step 1, add more layers of code, and call this the results of step 2. Then we do more checks (summing, counting, etc.) and compare the results of step 1 to the results of step 2. If we get matching results (where we would expect matching results), we assume that we don’t have any bugs. But if you don’t know the results from step 1, what do you have to compare?
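The step-1/step-2 reconciliation described above can be sketched like this, using a throwaway SQLite table with made-up names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER, region TEXT, amount REAL);
    INSERT INTO sales VALUES
        (1, 'east', 100.0), (2, 'east', 250.0),
        (3, 'west', 75.0),  (4, 'west', 125.0);
""")

# Step 1: baseline checks on the raw table.
step1_count, step1_total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM sales").fetchone()

# Step 2: add a layer of logic (aggregate by region), then re-check.
step2_count, step2_total = conn.execute("""
    SELECT SUM(n), SUM(total) FROM (
        SELECT region, COUNT(*) AS n, SUM(amount) AS total
        FROM sales GROUP BY region
    )""").fetchone()

# Where we expect matching results, compare; a mismatch means a bug.
assert step1_count == step2_count and step1_total == step2_total
print(step1_count, step1_total)
```

Inside a visualization tool, there's no equivalent of "jot down step 1": the baseline lives behind UI settings, so you have nothing explicit to compare against.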

This is the issue that I have with that data engineering leader’s comment. In your visualization, you have nothing to compare against. Now, technically speaking, you do have the data. You have the exact data that was extracted from the database and piped over to your visualization (assuming there wasn’t some sort of strange error where only a portion of the data was passed over). But to perform QA on that data, you have to go through a bunch of UI clicks, filters, and more to ensure that your data is correct, and that is the exact thing we’re trying to avoid if we want to perform our QA in the first place.

Also, if you have even more complex situations that include multiple filters, complex calculations, and table joins, you may find it very challenging to derive your baseline. But when you have SQL at your fingertips, you can quickly run a query against the database to double check your results.


Lack of Version Control

As mentioned in Part 3 of this series, the visualization tools don’t offer version control. As of this writing, there’s no way to track or decouple all of the code and settings from the workbook. If there were a way, and if the result were well-structured data, you could place that file under version control, which would allow you to track any changes that were made. And if you could track changes, you could diff your code to see what changed and verify that the changes you made were the only changes you intended to make. Unfortunately, this isn’t possible, which means that any time you save your workbook, you’re at risk of introducing bugs.
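For what it's worth, Tableau's .twb files are XML under the hood (a .twbx is a zip archive containing one), so you can at least script a rough extraction of the custom SQL yourself for diffing purposes. The fragment below only approximates a real workbook's structure; treat the element names and attributes as assumptions rather than a stable format.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a Tableau .twb file; real workbooks are
# far larger, but custom SQL tends to live in relation elements.
twb_fragment = """
<workbook>
  <datasources>
    <datasource name='sales_ds'>
      <relation type='text' name='Custom SQL Query'>
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
      </relation>
    </datasource>
  </datasources>
</workbook>
"""

root = ET.fromstring(twb_fragment)
queries = [
    rel.text.strip()
    for rel in root.iter("relation")
    if rel.get("type") == "text" and rel.text
]
for q in queries:
    print(q)
```

Writing the extracted queries to files and committing them gives you a poor man's diff, but it's still a workaround: the workbook itself remains the source of truth, and nothing stops it from drifting.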


Joined Table Issues

Within your visualization tool, you have the ability to join tables or datasets together. While joins can be helpful and necessary, I find them to be a huge headache in most instances, for two major reasons.

First, if you are performing a join inside of the visualization, you’re telling the visualization to query all of the data for the dataset, then join, then allow you to filter and calculate as you please. While the join works to produce a result set, it often does so very inefficiently. For example, let’s assume that you work for a company that has been in business for over 20 years and you have a Sales table and a Users table. Let’s also assume that you probably don’t need 20 years’ worth of sales data. But when datasets are constructed as individual tables, with a join performed inside the visualization, your dataset pulls all 20 years’ worth of data into itself. This is inefficient on many levels.

  1. You’re pulling a tremendous amount of data from the database, which increases database load
  2. You have to transfer all of that data from the database to the visualization
  3. Your visualization has to process all of that data for a join
  4. Your visualization has to process all of the records for your calculations

This results in increased costs and poor performance, and can sometimes run into software limitations.
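To illustrate the difference, here's a small sketch (hypothetical tables in SQLite) contrasting "pull everything and join in the tool" with pushing the join and the date filter into SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (user_id INTEGER, amount REAL, sale_year INTEGER);
    INSERT INTO sales VALUES
        (1, 100.0, 2003), (1, 200.0, 2021),
        (2, 300.0, 2010), (2, 400.0, 2022);
    CREATE TABLE users (user_id INTEGER, region TEXT);
    INSERT INTO users VALUES (1, 'east'), (2, 'west');
""")

# Anti-pattern: pull every row of each table, then join in the tool.
sales = conn.execute("SELECT * FROM sales").fetchall()
users = conn.execute("SELECT * FROM users").fetchall()
joined_in_tool = [
    (s, u) for s in sales for u in users if s[0] == u[0]
]  # every year of data crossed the wire before any filtering

# Better: push the join AND the date filter into SQL so only the
# rows the dashboard actually needs ever leave the database.
recent_rows = conn.execute("""
    SELECT u.region, s.amount
    FROM sales s JOIN users u USING (user_id)
    WHERE s.sale_year >= 2020
""").fetchall()

print(len(joined_in_tool), len(recent_rows))
```

With a real 20-year Sales table the gap between the two row counts is what drives the costs, transfer time, and processing load listed above.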

The second major issue with joined tables is that the joins are UI based, which leads to my previous comments around QA and version control.


It’s Not Reusable

If you’re at the point of exhaustion with visualization issues, I have one more, and it’s related to code. When you perform all of the datasource creation and set-up, custom calculations, filters, and more, none of it is reusable outside of the visualization. This means that you may find yourself making the same joins or writing the same calculations over and over in multiple workbooks. Now don’t get me wrong, sometimes this is just the way life is. For example, if you have to perform a division calculation (e.g., net profit margin = (revenue - cost) / revenue), you must perform this calculation in your workbook (if you want any sort of filters or dynamic controls) as opposed to performing the calculation in SQL and passing the calculated result over to the workbook. While this is perfectly acceptable, it’s important to keep the calculations inside the visualization to only what is absolutely necessary.

Also, when it comes to reusable code, or the lack thereof, you can’t simply execute a base set (or the full set) of SQL for other use cases. For example, you might have a workbook that summarizes monthly sales data and utilizes the Users table and the Sales table. Now let’s assume that someone asks you to bolt on a few new attributes that exist in the Products table. If all of your code and joins are contained within the workbook, you’re either going to have to add a new datasource, wait for it to load, make all of the joins, hope that you didn’t break anything, go through a challenging QA process, and produce your results. Or you’re going to have to write some SQL from scratch and do the work outside of the visualization, which means that you basically just wrote all of the code that the visualization was doing anyway. This raises the question of why the visualization didn’t just use SQL as much as possible in the first place and avoid doing the heavy lifting on the visualization side.
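As a sketch of what a reusable base layer could look like, here the shared logic is defined once as a database view so that a dashboard and an ad-hoc request both build on the same SQL (made-up schema; a SQLite view standing in for a real database view):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (user_id INTEGER, amount REAL, month TEXT);
    INSERT INTO sales VALUES
        (1, 100.0, '2022-01'), (1, 50.0, '2022-01'),
        (2, 200.0, '2022-02');

    -- Shared base layer: define the aggregation once as a view.
    CREATE VIEW monthly_sales AS
        SELECT month, SUM(amount) AS total_sales
        FROM sales GROUP BY month;
""")

# A workbook and an ad-hoc question can both build on the same view,
# instead of each re-creating the logic inside a visualization.
dashboard_rows = conn.execute(
    "SELECT * FROM monthly_sales ORDER BY month").fetchall()
adhoc_row = conn.execute(
    "SELECT MAX(total_sales) FROM monthly_sales").fetchone()

print(dashboard_rows, adhoc_row)
```

Because the logic lives in the database, bolting on attributes from another table means changing the view once, not rebuilding joins inside every workbook that needs them.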


Real-time Data

Tableau calls it a Live connection and Power BI calls it DirectQuery, but both mean the same thing: a real-time query to the database when you interact with the dashboard. Sometimes it’s perfectly acceptable to have a real-time query to the database, but most of the time you probably won’t need or want to query the database every time you interact with the dashboard. A big reason is that the source data in the database often isn’t being streamed into the tables. Instead, data is frequently loaded into tables by nightly ETL jobs. Why is this worth calling out? Because it’s important to understand the source of the data, how frequently it’s updated, what the real use and business cases are, whether the infrastructure can handle the increased load, and whether your users will experience performance issues.

Throughout my career I’ve seen a number of misconfigurations in this area due to a lack of understanding of the different visualization query options and how frequently the database gets refreshed. When an unnecessary real-time data connection is selected, end-user performance usually suffers.



I hope that I didn’t scare you away from the visualization tools. They are a necessary evil, but if you understand their limitations and the problems that they create, you can take steps to mitigate those problems and create a sound analytics architecture for your organization. Next, we’ll cover a few issues surrounding Tables and Views in Part 5 of Building a Better Analytics Organization: Table Issues.

