The Problem with Sub-Queries & Better Options

May 20, 2023
Photo by Didssph on Unsplash

I’ve been writing SQL for about 20 years and while some things have changed, much of the language is still the same, including the overuse of sub-queries within SQL statements. While sub-queries are very helpful and aren’t inherently bad, they tend to be overused and many times a more optimal and preferable solutions exists in the form of a common table expressions (CTE) or a temporary (temp) table. By using a CTE or a temp table, your code become easier to read, more modular, reduces duplicated code (which reduces bugs), and can even increase performance.

If you’re not familiar with sub-queries, a sub-query is a query within a query. Another way to think about this is that it’s a nested query. Here’s an example where the results of the right table (results of table_b) are being derived from another query, hence the sub-query.

   

A sub-query is completely valid to use in SQL, but one of the biggest issues is the lack of modularity. By nesting a query within a query, the code becomes challenging to read. While the example above might not appear terribly hard to read, this is especially true when the code is written with deeply nested subqueries. I’ve seen cases where an analyst prided himself on writing a query that had sub-queries nested 7 levels deep! Code designed in that manner is extremely hard to read and almost always requires someone to go back to add documentation or notes and split out each of those queries to see what’s happening and where a bug might be occurring.

When code is hard to read, it’s easy to misinterpret the original intent and it’s easy to accidentally introduce bugs. This is why object-oriented programming exists in almost all other programming languages and why it’s best practice to create small, usable blocks of code. Also, when the code is not split out into individual modules or blocks, the end-user must re-run the entire set of code each time that they are hunting down a bug or attempting to verify a fix to a bug.  Not only is this inefficient, wasting time for the analyst or engineer, but it also increases load on the database, which increases costs and slows down queries for everyone else using the database. To overcome these issues, the preferred option is to use a CTE or a temporary table. But there are a few important differences to understand to help you make the best choice.

 

Common Table Expressions (CTEs)

Common table expressions are blocks of code that are like a sub-query, except that they are extracted out into their own individual query instead of being nested. This makes the code easier to read, which reduces bugs, and it allows the same block of code to be referenced in multiple places without having to re-write the entire block of code. Here’s what the above code looks like when a CTE is used.

 

 

 

By extracting the sub-query to its own block of code, it can easily be referenced as if it was any other table in the database. No longer is there a need to re-write the same code in multiple places in your SQL script if you needed to reference the results of table_b. While CTE’s are powerful and an improvement over the sub-query, they do have their limitations.

One issue is that if you have a lengthy SQL script with a lot of CTEs (eg. 15), the database may throw an error stating that it can’t compute the proper code execution. This is because behind the scenes the database is trying to compute an optimal code execution and it’s unable to do so due to a complex script.

Another issue is that a CTE isn’t persisted in memory. It only exists for the duration of the running of the entire script. This means that if you run the entire script above, you cannot come back and simply select the bottom select statement. The results of the CTE are not stored in memory like a table in the database. So, if you want to run only the bottom select statement again (eg. Because you wanted to add a filter to the data this time), you’re going to have to re-run all the code again.

While this may not seem like a major issue, it can be in certain situations. If we assume that it takes 15 minutes to return the results of the CTE and you only wanted to change a filter on the main query, you’ll have to wait 15 minutes for the CTE to run, plus the amount of time that it takes for the main query to run.  This can lead to a lot of wasted time for you, the analyst or engineer, and it unnecessarily increases load on the database because it’s going to return the same results as before. A better option would be to use a temporary table.

 

Temporary Tables

A temporary table is like a table the exists the database, but it is temporary in nature. It exists for the duration of your entire database session, not just for the execution of your script. This means that if you were to turn that common table expression into a temp table, you’d be able to re-run only the main query, thus saving a lot of time and load on the database.

  

 

 

The other benefit of the temp table is that it overcomes the issue with having too many CTE’s in your code, which can cause the database to throw an error. Also, with temp tables, you can even place indexes on them if necessary. Because of these benefits, the temp table tends to be a preferred choice, but there is a potential downside, which has to do with speed. A temp table tends to be a little more costly to run than a CTE, so if you’re looking for optimal performance and no longer have a need to re-run individual blocks of code for debugging and QA purposes, you may consider switching back from a temp table to a CTE.

 

Conclusion

While sub-queries can be helpful, you should limit their usage to very simple select statements that do not return complex results sets, where code is not duplicated, and where QA, debugging, and code commenting are easy to complete while using a sub-query. If any of these items become a challenge, you should use a CTE or a temporary table. However, just because it’s easy to do with a subquery today doesn’t mean that your co-workers won’t have to modify or debug your code in the future, so when it doubt, think hard about if a sub-query is the right choice for your code.

 

 

Brandon Southern, MBA, is the founder of Analytics Mentor, specializing in providing analytics advising, consulting, training, and mentorship for organizations and individuals. Brandon has been in tech for 20 years in roles including analytics, software development, release management, quality assurance, six-sigma process improvement, project & product management, and more. He has been an individual contributor as well as a senior leader at start-up companies, GameStop, VMWare, eBay, Amazon, and more. Brandon specializes in building world-class analytics organizations and elevating individuals.

You can learn more about Brandon and Analytics Mentor at http://www.analyticsmentor.io/about

 

 

Subscribe to Receive New Articles

Join our mailing list to receive the latest articles and tips to elevate your career.

We hate SPAM. We will never sell your information, for any reason.