Building a Better Analytics Organization - Introduction (Part 1)

analytics analytics engineering data data engineering Apr 30, 2022
Photo by Robert Gourley on Unsplash

Welcome to Part 1 of my 7-part series on Building a Better Analytics Organization. In this series we’ll be discussing how to create a connected analytics world to help you move faster, with fewer bugs, and less headaches. In this article, I’ll cover a background of the analytics landscape and common issues that teams and individuals face. In subsequent parts, I’ll dive deeper into what the analytics structure looks like today at most companies, a better architecture, the tools to utilize within the new architecture, and best practices.

Part 1 — Introduction, common tasks, and an introduction to challenges for an individual and for a team
Part 2 — Overview of the common areas where problems can occur and the specific problems that individuals and teams face
Part 3 — Details of specific problems when working with code
Part 4 — Details of specific problems when working with visualizations
Part 5 — Details of specific problems when working with tables and views

Part 6 — How to solve the challenges

Part 7 — Implementation tips

 

Introduction


For my entire analytics career, I’ve been preaching practices of how to stay organized, connect all code, output, and visualizations, reduce bugs, and make teams more efficient. I’ve implemented these practices at five different companies and I’m always a little shocked that analytics teams don’t operate like software development teams, but at times it makes sense as to why this probably occurs, but it doesn’t mean that we can’t change the way teams work. You can read more this topic in my article, Analysts are not Software Developers, But They Should Be.

I started my career in quality assurance, dealing with chain-of-custody situations and looking for anomalies. After that job, I worked in software quality, where I was hunting down bugs in code and hypothesizing as to where mistakes could be made, and tested for those mistakes as well as drafted design modifications and requirements to reduce potential bugs. At the same time, I was also working as a software release manager. A large part of my job was focused on ensuring that we had a shared code base that the entire team utilized. But when managing our code, I also needed to ensure that all code changes were well documented, connected to reasons as to why we made the change, and ensure that only the desired changes were put into our software releases.

For the sake of this article, I’ll avoid discussing the more intricate details of releasing production-level software. But many of the tools and practices that I’m about to discuss are derived from the tools and practices that I used and developed over the last 20 years after seeing bugs, mistakes, and banging my head against the wall in frustration. Before we dive into the tools and practices, it’s important to discuss some of the current challenges and how things typically work in many analytics organizations.

 

A Typical Situation


In just about every analytics team, across every company that I’ve ever seen, I’ve encountered very similar practices where teams could find improvement. But before we talk about those improvements, I’d like to provide you with an example of the most common steps encountered in the analytics domain, in an effort to provide context to the current challenges. In the analytics domain, there are a few common tasks that analysts perform, although a few of these tasks are sometimes assigned to folks with a title other than Analyst. These tasks consist of:

  • Writing SQL code
  • Creating ETLs
  • Creating Tables/Views
  • Creating Output (Excel, PowerPoint)
  • Building Visualizations


After reviewing this list of common tasks, I’d imaging that the some people may think that the analytics process(es) are quite simple. But what actually takes place is anything but simple. Let’s take a very common and simple sounding examples.

 

The Quick Ad-Hoc Request

 

Your manager walks over and says, “Hey, can you quickly pull the last 2 months of sales figures for our shoe sales in Germany for our most loyal customers? It should be pretty easy.” After you’re done rolling your eyes at the words easy and simple, you jump over to whatever application you use to execute your SQL, you type away with your manager watching over your shoulder, export the result set to a .csv file, attach the file in an email, and send it over to your manager. Done. And now on to the next task. You clear your code and start writing some SQL for your next task.

This happens many times throughout the day, weeks, and months. You’re quick, the manager gets what he or she wants, and everyone is happy until reality comes into play. Because in the real world, frequently, this isn’t the end of the story or the task. Possibly two days later, your manager asks for the same result set to be pulled again. You have to start from scratch, writing 10–20 lines of code because you didn’t save the original code. It’s a little annoying, but hey, it’s only a few lines of code so you write it and are you are only slightly less efficient than if you had saved your code.

Or maybe you did save your code, and this time you’ve made some modifications, re-saved the code, exported your results and went about your day. Oh, but there’s a problem this time. Your manager comes back and says, “Hey, how come those Germany numbers don’t match what you sent me the other day?” You’re confused, you assert that you re-wrote the same code or you assert that you only changed one thing and didn’t change anything about Germany. Ahh, famous last words for all of us.

For the next few hours, you’re left wondering what could have gone wrong. You’re looking at the database, suggesting that maybe the data engineers or an ETL process created the problem and after being pressured to solve the problem, you start trying to recreate that code from your first request to see if it will match the result set. Then you try to pick apart what went wrong. At the end of the day, you’ve found it. A bug. Such a simple bug, but it cost you the hours of work that could have been avoided.

 

The Visualization

 

Another common and possibly more frustrating example can be found when dealing with visualizations. Maybe you’ve been asked to build a dashboard that is similar to an existing dashboard. Or maybe you don’t even want to build a dashboard but you just need to produce a result set that adds a few more attributes. At the root of this all that you want to do is find the code that was used to serve data to this existing dashboard. But getting to this code isn’t simple, based on how dashboard is constructed.

In Tableau, the original developer wrote a block of custom SQL and copy/pasted it into a datasource. Easy enough, right? You can just open the datasource and get the SQL and copy it over to your text editor. Oh, but if only things were that simple. In reality, you’re more than likely going to deal with a number of issues. You start by requesting permission from someone to download their workbook. Then after you download it, you have to open the workbook, then you have to open the datasource. And if you’ve ever actually had to edit a datasource in Tableau, you know that this process is painfully slow (I’m talking dial-up internet slow for those of you who are old enough to remember).

In Power BI, the situation is somewhat better and somewhat worse. You still have to have permission to download the workbook, but it’s really fast when opening the datasource. However, Microsoft likes to modify the SQL that you used and and they create a somewhat cryptic (DAX) version of it. Meaning, you can’t simply run this SQL against your database because it has all sorts of new characters and modified syntax in it. You’ll have to modify this code, stripping out the special DAX syntax. But this only creates new problems as you have just increased the potential for new bugs to be created.

 

Other Examples

 

I could write for days about many more situations that analysts face, but to avoid losing your attention, I’ve listed a summary of various challenges that exist. And whether you’re a 1-person team or part of a larger analytics organization, these challenges will arise.

Examples of Current Challenges as an Individual

  • Can’t re-run the exact code used on a previous request because the code wasn’t saved
  • Can’t re-run the exact code used previously because you saved your changes over the top of the previous code
  • Have near-duplicate files copies all over the place because you are saving versions of your code to avoid saving over the top of code
  • Can’t easily or quickly compare versions of code to see if you accidentally introduced a bug
  • Can’t quickly access the code used in visualization
  • Code within the visualization can be modified without any way to track changes
  • Code from Power BI is modified by Power BI and can’t be copied and executed in your IDE without modification, and modifications can introduce bugs
  • Visualizations constructed with much of the join-and-filter logic inside the of the visualization instead of in SQL
  • Poor ability to perform data QA within visualizations
  • No version control on visualizations to prevent/detect accidental errors and bugs
  • In Excel output, no idea what code was used to pull the result set
  • As a work around, code is copy/pasted into the Excel workbook, but isn’t searchable, and frequently will create random errors due to whitespace characters if you copy/paste this code and try to run it

 

Examples of Current Challenges as an Team

  • All of the same challenges above plus…
  • Lack of visibility into what code exists
  • Frequently re-inventing the wheel
  • Multiple sources of truth
  • Inability to search the code base to find all usage of certain tables, views, or fields
  • Increased potential to break code, views, dashboards if tables or views are changed
  • Lack of understanding of the why something was created, updated, or changed

 

Conclusion


While analytics tasks might appear to be simple, many times there is more to the task than might appear. Also, the way that we attempt to accomplish those tasks can create additional problems, confusion, and inefficiencies. Since we’ve only started scratching the surface explaining a few problems, we’ll dive deeper into the more of the specific details about these problems in Part 2 of this series.

Subscribe to Receive New Articles

Join our mailing list to receive the latest articles and tips to elevate your career.

We hate SPAM. We will never sell your information, for any reason.