Building a Better Analytics Organization - Common Issues (Part 2)

May 07, 2022

Welcome back to my series on Building a Better Analytics Organization.

Part 1 — Introduction, common tasks, and an introduction to challenges for an individual and for a team
Part 2 — Overview of the common areas where problems can occur and the specific problems that individuals and teams face
Part 3 — Details of specific problems when working with code
Part 4 — Details of specific problems when working with visualizations
Part 5 — Details of specific problems when working with tables and views
Part 6 — How to solve the challenges
Part 7 — Implementation tips

In Part 1, I provided an overview of some of the current challenges in most analytics organizations. In Part 2, I’m going to put more visual context behind many of the situations that analysts and organizations face. I’ll walk through common processes and situations, frequently building on the previous example(s), and call out the issues that exist in each. But before we dive into the different processes and flows that organizations currently use, it’s worth understanding the common situations and problems.

 

Problems and Situations


Below are five categories of problems and situations that occur in almost every analytics organization. This list can serve as a standard set of definitions that I will reference multiple times in the various flows throughout this article.

 

Issues with All Analytics Tasks

 

  1. Missing context of the original intent: Without some sort of link to conversations and thoughts, you won’t have an understanding of the original intent of a project or modification. Without this understanding, you are at risk of misusing code and output, and coming to the wrong conclusions.
  2. Don’t know what you don’t know: If you’ve never seen the asset, then it would be logical to assume that it doesn’t exist. Yet in reality, there may be plenty of assets that have already been developed that you can borrow from. Without the ability to find what work was performed, you could be reinventing the wheel.
  3. Lack of knowledge of original asset or point of contact: Many times you’ll have questions about the thing that you’re looking at. However, as documents are passed around and organizational changes occur, it may be very difficult to locate a subject matter expert. Worse, without proper organization, assets might become orphaned, eventually losing all context and ownership.
  4. Duplication / Multiple sources of truth: Building upon the issues above, analysts end up reinventing the wheel or creating variations of similar code, resulting in multiple sources of truth and confusion.

Issues with Working with Code

  1. All of the issues related to Issues with All Analytics Tasks, plus…
  2. A saved copy doesn’t exist: This makes it nearly impossible to re-run the exact same code, and analysts waste time attempting to re-create it.
  3. No version control: It’s impossible to see what was changed, when it was changed, and who actually changed the code.
  4. Inability to quickly see code changes: If you need to perform a side-by-side comparison or see exactly what changed, you can’t do so without an automated diffing tool. The human eye isn’t capable of this.
  5. No ability to quickly find similar code: If similar code was created for a past request, by you or someone else, it can be difficult if not impossible to find. Likewise, if you need to locate every place a piece of code is used (ad-hoc queries, Excel output, visualizations), you can’t, and changing one copy can break views, tables, and dashboards where other people are borrowing or using the same assets.
  6. Improperly saved code introduces runtime bugs: Sometimes code is copy/pasted into email, Excel, or PowerPoint files. This isn’t ideal for all of the reasons described above, but it can also introduce stray whitespace and substituted characters, which cause users to chase red herring issues (see the sketch after this list).
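A quick illustration of how this bites, using a hypothetical users table. The first statement is what often comes back from an email or Word document; the second is what was originally written:

    -- Pasted back from email/Word: the straight quotes were silently
    -- converted to curly "smart" quotes, so the database rejects this
    -- with a syntax error:
    SELECT * FROM users WHERE country = ‘DE’;

    -- The original code, with plain ASCII quotes, runs fine:
    SELECT * FROM users WHERE country = 'DE';

    -- Non-breaking spaces (U+00A0) can be smuggled in the same way and are
    -- visually identical to regular spaces, which is why these bugs feel
    -- like red herrings.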

 

Issues with Working with ETLs and Views

  1. All of the issues related to Issues with All Analytics Tasks, plus…
  2. All of the issues related to Issues with Working with Code, plus…
  3. Potential for disconnection between actual code used and assumed code used: For the sake of this conversation, a View is simply a saved query that you call. For example, if you wanted to produce all of the sales metrics for Germany, you might have to write 100 lines of code, and if you used that code in multiple reports, you’d be duplicating it. To reduce duplication, or DRY (don’t repeat yourself) up your code, you could create a view and simply run “select * from sales_metrics_in_germany”. The issue a View introduces is that even if you save your code to a file, the actual View definition lives in the database, because that’s where it’s executed. It’s entirely possible to modify the View code in the text editor on your laptop but never actually run (save) that same code against the database. This can cause severe frustration when chasing down bugs, because what you think is the real code may not be the code you’re actually running (see the sketch below).
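Here is a minimal sketch of that scenario. The view name comes from the example above; the tables and columns are assumptions for illustration:

    -- The ~100 lines of metric logic live in one place. The definition is
    -- saved *in the database* the moment this statement is executed:
    CREATE OR REPLACE VIEW sales_metrics_in_germany AS
    SELECT u.user_id, s.order_date, s.revenue
    FROM users u
    JOIN sales s ON u.user_id = s.user_id
    WHERE u.country = 'DE';

    -- Every report then calls the view instead of duplicating the logic:
    SELECT * FROM sales_metrics_in_germany;

    -- The trap: editing the CREATE VIEW statement in a local file changes
    -- nothing until it is re-executed against the database, so the file and
    -- the database can silently diverge.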

 

Issues with Working with Visualizations

  1. All of the issues related to Issues with All Analytics Tasks, plus…
  2. All of the issues related to Issues with Working with Code, plus…
  3. Time-consuming process to find SQL from a datasource: When SQL is pasted into a Tableau or Power BI datasource, extracting that code is a slow process. You’ll need to be granted permission to download the workbook, and then you’ll need to open the datasource. In Tableau, this is a slow and painful process. In Power BI, the SQL is modified when saved and cannot be run outside of Power BI without modification.
  4. Risk of introducing bugs when modifying SQL from a datasource: In the case of a Power BI datasource, you’ll have to modify the SQL that you see, because Power BI re-formats the syntax for its Power Query (M) language. This means you’ll have to make changes just to run the SQL outside of Power BI, and any time you modify code, you’re at risk of introducing bugs and deviating from the actual results being served in the visualization.
  5. Additional risk of creating bugs in the UI/filters when only needing to modify the datasource: Any time that you have to open a visualization, you risk introducing new bugs. This is frequently an unnecessary risk when you only need to modify the code that is generating a result set for the datasource.
  6. Inability to properly QA: You can’t properly QA anything solely using a visualization. There. I said it. Yes, it’s possible to QA what you see in a visualization by comparing it to some sort of baseline or expected value. The problem is that if you aren’t using SQL in your IDE to break down the components and perform your QA on that side first, you risk never detecting bugs. Why? Because there are so many different filter types and options inside of a visualization that merely looking at its output isn’t sufficient.
  7. Increased complexity and decreased understanding: Depending on how a visualization receives data and how the dashboard was constructed, the complexity may end up buried inside the dashboard, as opposed to pushing as much logic as possible outside of it and keeping the dashboard logic simple. Why is this important? For the next reason…
  8. No version control: As of this writing, there isn’t a good way to place a workbook under version control. Unfortunately, Tableau and Power BI weren’t built with all of the configuration parameters stored in a .json file or some other non-binary format. This means that if you accidentally change the default filter on a page, there’s no easy way to detect the change. When dealing with formulas, joins, and other logic, this lack of version control is even more reason to keep your logic in files that can be placed under version control.
  9. Severe performance issues within the visualization and on the database: Improperly constructed queries, or the failure to use custom SQL to return the most optimal result set to the visualization, can cause problems such as slow visualization rendering, extremely large files that take a long time to open, out-of-memory errors, increased load on the database, and increased costs (a sketch follows this list).
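To sketch that last point (table and column names are assumptions), compare what the database ships to the visualization tool in each case:

    -- Raw feed: the tool pulls every row and does the heavy lifting itself,
    -- potentially dragging millions of rows across the wire:
    SELECT * FROM sales;

    -- Pre-aggregated custom SQL: the database does the work and returns only
    -- the rows the chart actually needs (Postgres-style date_trunc shown):
    SELECT region,
           date_trunc('month', order_date) AS order_month,
           SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY region, date_trunc('month', order_date);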

 

Issues with Working with Excel Output

  1. All of the issues related to Issues with All Analytics Tasks, plus…
  2. All of the issues related to Issues with Working with Code


Now that we have the common definitions out of the way, let’s dive into the specific use cases where these problems occur.

 

Examples of Process Issues — Working with SQL, Result Sets, and Excel


Ad-Hoc Requests — Example 1

Process: This process is one of the simplest forms. An analyst receives an ad-hoc request from a stakeholder, writes some SQL, generates a result set, and emails the results back to the stakeholder.


Issues:

In this example, the code isn’t being saved. This results in all of the issues that we’ve described in Issues with Working with Code. Also, this results in (either immediately or over time) all of the issues in Issues with All Analytics Tasks.

 

Ad-Hoc Requests — Example 2

Process: This time, let’s assume that we save our file. However, most analysts are saving their files to a personal storage location.


Issues:

While a copy exists, we have issues with: No version control, Inability to quickly see code changes, No ability to quickly find similar code (with the caveat that this is somewhat possible if everyone is saving their files to a shared storage location, but there are still issues with that).

 

Ad-Hoc Requests — Example 3

Process: In this example, the analyst is saving their code inside of Excel as opposed to a stand-alone file.


Issues:

While a copy exists, we have issues with: No version control, Inability to quickly see code changes, No ability to quickly find similar code (with the caveat that this is somewhat possible if everyone is saving their files to a shared storage location, but there are still issues with that). Also, we now have the risk of Improperly saved code introducing runtime bugs.

 

Ad-Hoc Requests — Example 4

Process: In this example, the analyst is saving their file (locally or to shared storage) but the file is being overwritten any time a modification is required.


Issues:

Because the file is overwritten with each modification, any history is destroyed. While a copy exists, we still have issues with: No version control, Inability to quickly see code changes, No ability to quickly find similar code (with the same caveat about shared storage locations as above).

 

Examples of Process Issues — Working with SQL and Visualizations


Anatomy of a Dashboard

Here is the anatomy of a Power BI or Tableau workbook (which tends to be referred to as a dashboard, though strictly speaking the dashboard is a component of a workbook file). A workbook file consists of the visual dashboard or report (tabs/images/pages), but to get the data that is displayed on those pages, we need a connected datasource. There are two options for creating a datasource: 1. “directly” connect the datasource to the database, or 2. write custom SQL inside the datasource.

Allow me to provide more clarity with a simplified example. When directly connecting to the database, the datasource is roughly the equivalent of a “select * from table_name”. In our example, we connect to the Users table and the Sales table, and then create a join between the two inside Power BI or Tableau. In the embedded SQL approach, we would write the SQL directly (such as “select * from users u left join sales s on u.user_id = s.user_id”) and save it inside the datasource. Both options are sketched below.
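In rough SQL terms (column names are illustrative), the two options look like this:

    -- Option 1: "direct" connection. The tool effectively issues a query per
    -- table, and the join between Users and Sales is configured inside
    -- Power BI / Tableau:
    SELECT * FROM users;
    SELECT * FROM sales;

    -- Option 2: custom SQL embedded in the datasource. The analyst writes
    -- the join, and the database does the work:
    SELECT *
    FROM users u
    LEFT JOIN sales s ON u.user_id = s.user_id;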

Note: When I mention a direct connection, I’m not speaking of Power BI’s DirectQuery, where Direct means a “Live” connection. Tableau likewise has “Live” connections and “Extracts”. In the examples below, we’re not talking about the frequency of datasource refreshes or whether a copy of the data is created or cached for the visualization.


Dashboard — Example 1

Process: This process is one of the simplest forms. An analyst receives a request to create a dashboard, directly connects to database tables without writing any SQL, and publishes the dashboard.


Issues:

In this example, there isn’t any code being written or saved. This results in all of the issues that we’ve described in Issues with Working with Visualizations. This can also create severe performance issues in the visualization and on the database.

 

Dashboard — Example 2

Process: Much like Example 1, an analyst receives a request to create a dashboard, but this time the analyst writes custom SQL in the datasource, and publishes the dashboard.

 


Issues:

In this example, code is being written and technically saved, but it isn’t saved outside of the visualization file. This results in all of the issues that we’ve described in Issues with Working with Visualizations, with the potential exception of better performance, depending on how well the SQL was written and whether the visualization tool applies its own modifications to the code before submitting requests to the database.

 

Dashboard — Example 3

Process: In an effort to overcome some of the challenges with locating and using SQL embedded within a datasource, the analyst saves a copy of the code outside of the visualization.


Issues:

In this example, while the code is being saved, many of the issues in Issues with Working with Visualizations still exist. And this time we’ve added the risk of Potential for disconnection between actual code used and assumed code used.

 

Examples of Process Issues — Working with SQL and Views/Tables


View Creation — Example 1

Process: In this process, the analyst needs to create a View. This may be required to simplify the raw SQL that is being used in a dashboard or to avoid duplicating code. Here, the analyst writes her SQL statement as a View, executes the code against the database (which is essentially saving the code in the database) and her work is done.


Issues:

Unfortunately, even though the code is saved in the database, it isn’t being saved to a location that is searchable. This results in almost all of the issues that we’ve described in Issues with Working with Code.
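To be fair, most databases will surrender a view’s definition if you ask the catalog, but only if you already know that the view exists and what it’s called, which is exactly the “don’t know what you don’t know” problem. A sketch, assuming a Postgres-style information_schema:

    -- Recover the definition of a *known* view from the database catalog:
    SELECT table_schema, table_name, view_definition
    FROM information_schema.views
    WHERE table_name = 'sales_metrics_in_germany';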

 

View Creation — Example 2

Process: This process is the same as the example above, but this time the analyst saves a copy of the code to a local file.


Issues:

While the code is saved to a local (or possibly shared) location, many of the issues described in Issues with Working with Code still exist. And this time we’ve added the risk of Potential for disconnection between actual code used and assumed code used.

 

Summary


We’ve covered a number of issues with working with code, tables, views, and visualizations. And we’ve also covered examples of process flows to provide more context on where some of these problems occur. If you’re reading between the lines on some of these processes, you may have a few ideas on what might help to mitigate the risks and problems that we covered in this article. And that’s what we’re going to discuss in Part 6 of Building a Better Analytics Organization — Solutions. But before we get there, it’s worth diving a bit deeper into the code challenges that organizations face in Part 3 of Building a Better Analytics Organization — Code Challenges.

 
