Building a Better Analytics Organization - Code Issues (Part 3)

Tags: analytics, analytics engineering, data, data engineering · May 14, 2022
(Image: xkcd.com)

If you’ve been following along, this is Part 3 of my series on Building a Better Analytics Organization — Code Challenges.

Part 1 — Introduction, common tasks, and an introduction to challenges for an individual and for a team
Part 2 — Overview of the common areas where problems can occur and the specific problems that individuals and teams face
Part 3 — Details of specific problems when working with code
Part 4 — Details of specific problems when working with visualizations
Part 5 — Details of specific problems when working with tables and views
Part 6 — How to solve the challenges
Part 7 — Implementation tips

Up to this point, we’ve covered an introduction to analytics organizations and gained some additional context (along with process flows for visuals) on many of the main problems. In this article, we’ll dive a bit deeper into one of those areas, code challenges, from Part 2 of Building a Better Analytics Organization — Common Problems.

I wanted to hone in specifically on the code challenges because this is the main source of almost all of the other problems that we have. Also, when it comes to solving this problem, I’ve found that many analysts and teams haven’t been exposed to the solutions that software developers have been using for years. This is something that I covered in Analysts are not Software Developers (but they should be), and we’ll flesh out that topic a bit more in this article.

 

Introduction


Code. We wouldn’t be here without it, but some tools try to simplify the world by putting a UI over the top. I put the word simplify in italics because sometimes the UI does make things better, but sometimes it makes things worse.

Take a moment to think about your smartphone and how easy it is to find information. You simply tap buttons and almost never have to type a character to get the information you need. For example, if you’re looking for a restaurant to eat at, you can open Google Maps, tap the button for restaurants, and filter and scroll as you like. Now imagine if you had to sit at a command prompt (I feel like I’m getting old because a lot of people avoid, or maybe have never even used, a command-line interface) and type out commands to find a restaurant. If you think of it like writing SQL, you can probably imagine how long that would take to write before you got your answer. And by that time, you might have gone to bed on an empty stomach. In this case, the UI is wonderful. So, where isn’t it so wonderful?

Tableau and Power BI can serve as examples. With these applications, end users get quicker access to data. Analysts can drag, drop, click, and filter all sorts of things to their heart’s content. But behind all those buttons is code and data. And if you aren’t thinking about that code, or even aware of it, then you’re setting your organization up for a host of issues. The worst-case scenario might be a situation where everyone is using the data, assuming it is accurate, when it isn’t accurate at all. For this reason, I always expect everyone in the analytics organization to be able to read, write, and understand the code behind every analysis or dashboard. However, not every organization operates with this mentality.

Not long ago I worked for a very large tech company, and I witnessed first-hand the leader of the data engineering department flat-out say, “Why do you need to query the database? You don’t need to query the database. You can get everything you need out of Tableau.” If you don’t already see why this is one of the most ridiculous sets of comments that could be uttered in the analytics domain, below are a few reasons. I’ll save the details for another article.

  • Visualization software applies many layers of filtering (extract, data source, context, dimension, measure, and table calculation filters)
  • Calculations performed in a workbook can be complex, chained together, and hard to understand when debugging
  • There are limits on how much data can be brought into the software (before you can even begin exploring the data)
  • Working with a UI as opposed to code is frequently slower, adding serious overhead to even the simplest of tasks
  • QA of the data is incredibly difficult, if not impossible, due to the items mentioned above and an added layer of obfuscation


At the end of the day, there’s no getting away from code. But if we don’t work with it correctly, we can run into a number of issues as discussed in the previous article. In the next section, we’re going to explore these challenges.

 

A Mess of Files


Across the last five companies where I’ve worked as an analytics leader, almost none had a good process for dealing with code, if they had a process at all. I’ve literally seen situations like the one described in the image below. Actually, I’ve seen worse, but this example will suffice. In the example, we can see that Analyst 1 has multiple files in his My Documents folder. He probably has a lot of other random things in there too, like .mp3 files, pictures, and anything else that gets dumped into the directory. But assuming that he is somewhat organized and only keeps his code in this folder, he still has other issues, such as duplicate files.

He has three files that look very similar: exchange_rates.sql, backup_exchange_rates.sql, and exchange_rates.txt, which is really SQL code saved under a .txt extension. Just looking at this directory raises a number of questions.

Which file is the right one to work with? Why was the backup file created? Why does he have two files (the .sql and the .txt) that are virtually the same? It’s really anyone’s guess, and there’s a good chance that after enough time passes, even the analyst won’t remember the answers to these questions. Making matters worse, there are copies of these files on the file server, and even more copies in other folders.

We’re dealing with duplicate files, near-duplicate files, backup copies, and everyone doing their own thing across the department. It’s no wonder that there are bugs and issues all the time in such an environment. Imagine how easy it would be to make a mistake when reviewing code that calls the exchange_rates file. By default, you might assume that you’re looking for the exchange_rates.sql file and never notice the exchange_rates.txt file. This could lead to some pretty challenging bugs down the road when you’re trying to understand why a calculation isn’t working properly.
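To make the problem concrete, below is a minimal Python sketch of how you might at least detect byte-for-byte duplicates like exchange_rates.sql and exchange_rates.txt (the folder name is just the example from above). Note what it can’t do: a near duplicate like backup_exchange_rates.sql, differing by a single line, would slip right through.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicate_code(root: str) -> dict[str, list[Path]]:
    """Group files with identical contents, regardless of extension."""
    groups: dict[str, list[Path]] = defaultdict(list)
    # Treat .sql and .txt alike -- remember, exchange_rates.txt is really SQL
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in {".sql", ".txt"}:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only the groups that actually contain duplicates
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_duplicate_code("My Documents").items():
        print(f"Identical contents ({digest[:8]}...):")
        for p in paths:
            print(f"  {p}")
```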

I’d love to tell you that this doesn’t happen very often or only happens on non-critical items. But I saw almost this exact situation (actually, a few degrees worse) where the code was used in executive reporting for a $100B company. Sadly, as bad as this looks, it actually occurred at a company that was “pretty good” compared with many other situations I’ve seen.


Historical Versions


Not only is the directory structure disorganized, with duplicate files floating around, but there’s also no way to look at a historical version of the code unless a backup file is created, such as backup_exchange_rates.sql. But as we’ve established, this can cause confusion. Do we really know that this is a backup file? Just because the name says so doesn’t make it true.

The name suggests it’s a backup copy, but I’ve actually seen “backup” files referenced in production-level code, so I’d caution against that assumption. Also, how many backup copies would you realistically create and keep?

One? Two? Twenty? I’m betting that it wouldn’t be very many, and I’m also betting that a backup copy wouldn’t be created for every single change that will ever be made. Creating backup files doesn’t sound like a realistic or feasible approach to ensuring that you have versions to view. But even if you could do this, there are other issues.

When you create a backup file, there isn’t any metadata passed along with the code change. Sure, you might be able to compare the backup file to the current file to see what changed, but wouldn’t it be nice if someone provided you with a summary of what they changed and why? That information would almost certainly save you the headache of deciphering many lines of code and hoping that you understood them correctly. Unfortunately, there’s nowhere to store this information.
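To illustrate just how little a backup file preserves, here’s a small Python sketch using the hypothetical file names from earlier. A diff recovers the what, but nothing can recover the who, the when, or the why, because that metadata was never captured in the first place.

```python
import difflib

# Hypothetical file names from the example directory above
with open("backup_exchange_rates.sql") as f:
    old = f.readlines()
with open("exchange_rates.sql") as f:
    new = f.readlines()

# A unified diff recovers *what* changed between the two copies...
diff = difflib.unified_diff(old, new,
                            fromfile="backup_exchange_rates.sql",
                            tofile="exchange_rates.sql")
print("".join(diff))

# ...but nothing here can tell you *who* changed it or *why*.
# There is simply nowhere in a bare file for that information to live.
```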

Also, when dealing with files in a shared location, we encounter another issue: uncertainty about who authored a change. Maybe you could find that information in the file properties, but it isn’t always accurate or populated according to a standard convention. Other times the information is wrong because files get passed around, accidentally deleted and re-created, or mangled in other situations that could arise.

But what about deleted files? What if someone deleted that weekly_sales_report.sql file? We don’t see a backup copy in the directory, and it doesn’t look like there’s a local copy floating around on the other analyst’s laptop. We’ve now lost an important piece of code with no way to get it back. If you’re thinking, “hey, we can just go to IT and they’ll have a backup that they can produce,” that’s a nice thought in theory, but probably not something you should rely on. Why do I say this?

Because I started my career over 20 years ago as a computer consultant and systems engineer, and server backups were the number one issue I dealt with. Backup jobs rarely ran perfectly, and even more rarely were the backups tested to see if a copy could actually be restored. Technology has changed a bit over the last 20 years, but backups are still a problem, and sometimes they don’t exist at all.

This was the case at a large tech company where I worked a few years ago. There were no backups for the analytics code stored on the file server. Seriously. The analytics team had been operating for 5–10 years and had a fair amount of code on the server. This code supported numerous daily, weekly, and monthly reports that were referenced daily by the entire department as well as a dozen other departments across the company. These reports were also delivered to the executive leadership team. Yet there were no server-side backups in place, and nobody even knew it. Everyone just assumed that everything was backed up.

 

Comparing Minuscule Changes


If all of the challenges around file organization, versions, and metadata weren’t enough, there’s yet another challenge: understanding what changed within the file itself. Here we’re talking about the contents of the file. A change can be a single character or thousands of lines of code. And if you’ve ever attempted to compare two versions of code side by side, you’ll probably agree this can be a daunting, potentially impossible task for a human. How so?

Think about this article for a moment. Assume that your task is to compare this version to a copy that I created last night. How easily could you spot the differences? Is there even any difference? Sure, you might be able to call out major changes if I had added or removed entire sections. But if I dropped an apostrophe or added a double space after a sentence, that would be extremely hard to detect. And if we can’t easily see these changes, we can’t easily detect potential bugs or areas where we should perform quality assurance checks on our code. Now, some people might say, “I know what I changed” or “I only changed this one line.” I can’t tell you how many times I’ve heard these statements throughout my career, but I can tell you that they almost always go in one ear and out the other.

It’s not that people are lying about what they changed. Rather, they are human and fallible. Most humans don’t have a perfect memory, and even if they did, they’d also have to be perfect at communicating every change to other people. And we have accidents to think about too. There have been plenty of times when I thought my active application was my email client and started typing, only to realize that I was typing in my code editor. I also have a puppy that likes to put her paws on my keyboard, so I have that to worry about as well. If those weren’t enough reasons, there are also those pesky whitespace characters: it looks like nothing changed in the code, but in reality, it did. To catch this change, and any of the others that I’ve mentioned, you’re going to need a computer to perform the task.
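As a toy illustration (the SQL line is made up), here’s how a few lines of Python catch a whitespace change that renders identically to the human eye:

```python
import difflib

# Two "identical-looking" lines: the second hides a tab and a trailing space
version_a = "SELECT rate FROM exchange_rates\n"
version_b = "SELECT rate\tFROM exchange_rates \n"

print(version_a == version_b)   # False, even though they look the same

# repr() exposes the invisible characters a human reviewer would miss
print(repr(version_a))          # 'SELECT rate FROM exchange_rates\n'
print(repr(version_b))          # 'SELECT rate\tFROM exchange_rates \n'

# A character-level diff pinpoints exactly where the versions diverge
print("".join(difflib.ndiff([version_a], [version_b])))
```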

 

Convention


Another code challenge for organizations can be found around syntax, formatting, and naming conventions. These problems show up at the file-name level, in the directory structure, and across the individual lines of code within a file. I wrote about this at length, with quite a few examples of good and bad practices, in SQL Best Practices, so I won’t go into a lot of detail in this series. However, I’ll briefly touch on some key issues when it comes to naming conventions.

The purpose behind convention is to allow you, and other people, to quickly understand your code. Without convention, your brain has to slow down to process a different format for every person on the team. But it’s not just speed that becomes an issue. Without convention, there is an increased chance of bugs being introduced, and in the SQL world (as opposed to web development), there usually aren’t automated tests to help catch them. It’s entirely possible to have bugs in your code even though the code still executes (yes, this can happen in any language, but at least most other environments utilize unit and functional tests). By utilizing proper conventions and best practices, you’ll write fewer bugs and make life easier for everyone in the organization.
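Conventions also have the advantage of being machine-checkable. As a sketch, here’s a Python check against one hypothetical naming convention (lowercase snake_case names with a .sql extension, like exchange_rates.sql). The specific rules matter far less than agreeing on a set of rules and enforcing them:

```python
import re
from pathlib import Path

# One example convention (an assumption, not a universal standard):
# lowercase snake_case file names ending in .sql
NAME_CONVENTION = re.compile(r"^[a-z][a-z0-9_]*\.sql$")

def check_names(root: str) -> list[Path]:
    """Return every file under root that breaks the naming convention."""
    return [p for p in Path(root).rglob("*")
            if p.is_file() and not NAME_CONVENTION.match(p.name)]

# "department_folder" is the shared folder from the earlier example
for offender in check_names("department_folder"):
    print(f"Breaks convention: {offender}")
```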

 

Finding Similar Code


Being able to search for and find usages of code is extremely important, and while proper formatting and conventions are extremely helpful, keeping your code organized is vital if you want to avoid breaking code, dashboards, and reports. When code is scattered across different directories and non-shared locations, you are left in a situation where “you don’t know what you don’t know.” Take a look at the image below. In it we see a very common environment, one where code is scattered throughout the analytics organization and embedded within files.


When dealing with code in this kind of environment, it is impossible to locate all instances of code usage. This leads to duplication of code, multiple sources of truth, and an increased potential for bugs. Now, let’s assume that you were following the best practice of using a View for a common piece of code that the entire organization might find useful. Let’s also assume that many other people loved your new View and started using it in multiple places: in SQL files, copied (for reference) into Excel documents, and embedded directly into dashboard datasources. So, what happens if you want to edit this View that you created?

The common practice should be to:

  • Search the codebase for all instances of this View (see the sketch after this list)
  • Understand who or what could be impacted by your change
  • Communicate/document your proposed change
  • Make your change (if appropriate, and in a controlled environment if possible)
  • Roll out your changes
  • Communicate/document your change


However, in this environment, if you followed these steps and only searched “File Server” -> “department_folder”, you would be unaware of all of the other instances of your code in use. And this leads to a vicious cycle where:

  1. You accidentally break other reports, dashboards, and code
  2. Someone says, “See, this is why we shouldn’t share code”
  3. Other teams revert back to building their own versions, thus creating multiple sources of truth and wasting valuable resources
  4. Someone realizes how messy the environment is, attempts to organize it, and the process starts all over again.

Well, if there’s potential for this cycle to occur, and if we assume that we’ll always end up in a messy state, why bother attempting to fix the problem at all? Because there actually is a solution, and it involves a few different tools but also relies on humans (the end users) to utilize best practices. In Part 6, we’ll start discussing how to solve this problem.

 

Improperly Saved and Communicated Code


Improperly saved code is code that is saved or communicated in a manner that allows it to become modified without the user’s knowledge. It happens when code is copy/pasted between applications and whitespace characters are added or modified. While these changes are invisible to the human eye, they are very real to a computer. When they occur, they can cause runtime bugs and ultimately create very frustrating and time-consuming experiences for you and your team. One of the most common situations is the chasing of red herrings, and this is how it usually plays out.

An analyst writes some code, executes it, and confirms that it works. The result set is saved, and the code is then copy/pasted into another location (Excel or an email) and sent to another user. This new user copy/pastes the code into her IDE and attempts to run the SQL statement, only to see runtime errors thrown.

The code looks identical to the human eye, and the new user can’t figure out what is wrong with it. She contacts the original author, stating that the code doesn’t work. The original author insists that it works, and he may even run his code again to prove it. What he doesn’t mention is that he didn’t actually copy the code from the location he sent; he copied it from a .sql file that he opened in his text editor. His code isn’t actually the same as what he sent her.

After hours of back-and-forth communication and frustration, the new user attempts to rewrite the code from scratch. After her rewrite, she is able to run her code without any errors. From this exercise, she deduces that the original code was modified during the copy/paste, even though it looks identical to the human eye.
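When you find yourself in this standoff, a few lines of Python can settle it by exposing the invisible characters. The query below is a made-up example with a non-breaking space (a character commonly introduced by email clients and rich-text editors) pasted in:

```python
import unicodedata

def show_invisibles(sql: str) -> None:
    """Flag characters that render like spaces but aren't plain ASCII."""
    for i, ch in enumerate(sql):
        if ord(ch) > 127 or (ch.isspace() and ch not in " \n"):
            print(f"position {i}: {unicodedata.name(ch, 'UNKNOWN')} ({ch!r})")

# A pasted query with a non-breaking space (U+00A0) after SELECT --
# visually identical to a normal space, but a different character
pasted = "SELECT\u00a0rate FROM exchange_rates"
show_invisibles(pasted)
# position 6: NO-BREAK SPACE ('\xa0')
```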

So, how does improperly saved code occur in the first place? I’ve seen code saved or communicated improperly in many ways, but here are a few common examples:

  • Code is copy/pasted into an Excel document: While I applaud the user for attempting to make life easier for herself and her co-workers by providing the source code used to produce the output, this frequently creates whitespace-character problems. The formatting also tends to become a bit wonky when copy/pasting the code from Excel to another destination.
  • Code is copy/pasted and emailed: This happens all too often and causes quite a few headaches. Besides whitespace-character issues, many times the code is hard to read in the context of the email due to font, color, and other formatting changes that the email client applies by default.

 

Where You Write Your Code


The last item that I’d like to cover is the applications that are used to write code. Over the years I’ve seen a lot of different applications used to write code, and many of them aren’t text editors. Sure, you can write code in just about any application that you like, but some applications are not well suited for it. For a non-code example, you could write a novel in Microsoft Notepad, and you would succeed in typing out all of the words and paragraphs of your novel. However, I hope you’ll agree that Notepad is probably one of the worst applications for the task. It can’t spell-check, allow for any advanced formatting, track changes, or perform any of the many other functions found in Microsoft Word. The same is true for writing code.

You could write your code in Notepad; however, you won’t have reserved words highlighted automatically, the ability to find and replace with regular expressions, rapid formatting adjustments, or a number of other functions. And if you aren’t using these functions, you’re probably being less efficient than you could be, and you’re at risk of introducing bugs into your code. In Part 6, we’ll cover text editors.
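If regex find-and-replace sounds abstract, here’s the kind of one-pass edit (sketched in Python with made-up column names) that a good editor gives you interactively:

```python
import re

# A toy query; both column names are hypothetical
query = """
SELECT cust_id, order_total
FROM orders
WHERE cust_id IS NOT NULL
GROUP BY cust_id
"""

# \b word boundaries rename cust_id everywhere without touching
# longer identifiers such as cust_id_old
print(re.sub(r"\bcust_id\b", "customer_id", query))
```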

 

Conclusion


We’ve made it through Part 3, where we discussed issues surrounding code. I hope that you’re nodding your head in agreement that these problems do exist and that you’ve seen a few in your day-to-day experiences. But don’t worry. We’re going to solve these issues in a few moments, so let’s get on to the next set of issues in Part 4 of Building a Better Analytics Organization — Visualization Issues.

 
