Building a Better Analytics Organization - Table Issues (Part 5)

analytics analytics engineering data data engineering May 28, 2022

Welcome to Part 5 of my series on Building a Better Analytics Organization. In this part, we’ll be covering a few issues surrounding the usage of tables and views.

Part 1 — Introduction, common tasks, and an introduction to challenges for an individual and for a team
Part 2 — Overview of the common areas where problems can occur and the specific problems that individuals and teams face
Part 3 — Details of specific problems when working with code
Part 4 — Details of specific problems when working with visualizations
Part 5 — Details of specific problems when working with tables and views

Part 6 — How to solve the challenges

Part 7 — Implementation tips

 

Tables and Views are really powerful tools at our disposal when building reports and dashboards as well as performing analysis of our data. But as powerful as they are, they can cause us to get into a bit of trouble. Fortunately, unless you’re the person doing the debugging, they aren’t usually as problematic as many of the other areas that we’ve covered up to this point.

 

Overview of Views


For those that might not be familiar with Views, a View is really just a result set that is derived from a stored query in the database. It looks and behaves almost like a database table, however, there is a (just one for the sake of this article) key difference between between a Table and a View. A Table contains stored data where as a View doesn’t actually store any results. A View is simply the results of a query that produces are result set. This means that the results are dynamic in nature (unless we’re talking about materialized views, but, we’re not).

For example, let’s assume that you have two tables; a Users table and a Sales table. These two tables may contain gigabytes of data. If you wanted to query these two tables, you’d return a result set of the data within those two tables. You could then take this query and save it as a View.

This View contains the query, not the data. The View doesn’t make a copy of the Users and Sales table’s result set and save it anywhere. So, you might be asking, why would I ever use this? Why not just run the query? The answer is, to make things reusable.

For example, if you had multiple people that needed to use a common result set of Users Sales data, it would be best if you could avoid having each analyst re-write the same lines of SQL every time they need to pull this data. Instead, they could simply query, “select * from users_sales_view”.

 

The Issue with Not Using Views


I’ve seen a number of cases over the years where analysts aren’t familiar with using views and this can create some pretty serious problems.

I was working for a large tech company a while ago and we frequently needed to pull the results for a “total cost” attribute for our financial data. Unfortunately, there wasn’t a field in the database that summed up the total cost. Instead, the team that “owned” this metric had to calculate the total cost by performing some arithmetic on approximately 7 different fields. It was pretty simple math and not really any sort of issue. But there was an issue with how this information was being shared with teams. How so?

When every my team or any other team wanted to utilize this “total cost” value, we had to replicate the calculation. And in order to replicate this calculation we had to reach out to the team that understood and build this arithmetic logic. Then, we had to copy/paste this logic into our code. This leads to issues of: 1. duplication of code, 2. potential for copy/paste logic errors, and 3. getting out of sync. And #3 is a huge issue because frequently, the team would update this logic. But how would anyone using this logic ever know? They wouldn’t.

There’s no way for anyone else on the team to ever think that they should frequently check in with the original team to see if they updated the logic. This meant that suddenly, one team pulling this “total cost” figure would have a different result from another team. Eventually, everyone is distrusting everyone else and wasting time chasing down issues that could have been avoided in the first place.

This is where tables or views come into play. If the team was using a table or view, this calculation could have been performed on time and either stored in the table or rendered in the result set so that there was no chance of an issue arising out of sharing the code.

 

The Issue with Views


If you’ve been following along since the beginning of the series, I’ve made multiple mentions of the lack of ability to see or locate code that is used in different sources. But this is one time where we don’t have that issue. When a View is created, the code is saved in the database and this means that if you want to see the code that is being used, you can run a show view (differs slightly by database) command. However, this wouldn’t be an article if we didn’t have an issue, which brings us to searching code and staying in sync.

 

Searching Views


As previously discussed in Part 3, the ability to search your entire codebase is a very important aspect to keeping your code bug free and avoiding broken dashboards due to code changes. While it technically possible to search your View code by running a SQL command against the database, this really isn’t a feasible way to search your code.

The reason that this isn’t feasible is because, 1. all of your code isn’t saved in the database; only your Views are and 2. the results that you receive from such a search are completely subpar compared to the search results from a good text editor.

 

Staying in Sync


If you want to keep your code searchable, you’re going to want to save it outside of the database. But this present a new problem because the code also has to be saved in the database. This means that you have duplicate code, which usually isn’t a good practice. And with duplicate code, there’s always the chance for your code to get out of sync. Below is a look at this simple, yet potentially error prone process.


While we’re only dealing with three seemingly simple steps, it’s very easy to omit one of the steps (seen below), which will result in your saved code being out of sync with the actual code being executed. The result of this is usually a fair amount of frustration for the person debugging and issue. What usually occurs is the analyst will run a query, see a result set that doesn’t look accurate, look at the code that is saved in the file storage location, inspect the logic and then proceed to bang her head against the wall trying to figure out how this code could possibly be creating the bug. After a while of running in circles, the analyst will hopefully think to compare the View code saved in the database against the code saved on the file server and realize that the code is not in sync.

 

View Performance Issues


As great as Views are for helping to your team avoid duplicating code, they can suffer from performance issues. Since a View is the result-set of a query, any time you want the result set, you have to run the query. And sometimes, that query can be a costly query (in terms of time, memory, and CPU cycles). Now, this query is going to have to be performed at some point, but if the query is going to be executed dozens or hundreds of times per day when the underlying data isn’t changing, you might not want to use a View. To overcome this issue, a Table may be more appropriate depending on the specific circumstances. However, using a Table introduces another challenge.

 

Table Issues


When we decide to use a Table as opposed to a View, we can overcome performance issues but then we have to worry about how and when the Table will be refreshed with updated data. There are quite a few solutions out there to perform this task so it isn’t a huge concern. We’ll cover how to solve for this in the next part of this series.

 

Summary


Okay, I’m keeping this article short and sweet because for the most part, Views are a wonderful and useful tool in the analytics toolbox if used correctly. So, that’s it for Part 5. Now for the part that you’ve been waiting for: Part 6 of Building a Better Analytics Organization — Solutions.

 

Subscribe to Receive New Articles

Join our mailing list to receive the latest articles and tips to elevate your career.

We hate SPAM. We will never sell your information, for any reason.