Leveraging Python for Extracting Datasets and Tables from SQL
When crafting SQL queries, it's easy to adopt an "out of sight, out of mind" approach. Once a query has run, especially if it isn't part of a recurring job, there's often little perceived need to keep the SQL text around. Fortunately, many platforms, such as Google Cloud Platform's SQL workspace, store job metadata, so previous queries remain accessible long after execution.
The real advantage of retaining these queries, however, extends beyond their immediate outputs: the query strings themselves are valuable for auditing which datasets and tables your jobs reference.
Use Case
Imagine this scenario: your supervisor requests a comprehensive list of all tables referenced by a particular view. Relying on existing INFORMATION_SCHEMA views alone won't get you there, since you need to identify the tables referenced inside the view's definition, not merely the dataset that houses the view itself.
I faced a similar challenge at work recently and found it surprisingly difficult. Resources on extracting tables from SQL query strings stored in a data frame were scarce, despite extensive searches on StackOverflow and Google. The general consensus pointed towards using third-party libraries like sqlparse, but I prefer steering clear of them in professional contexts due to potential security risks and the possibility of deprecation.
Thus, I set out to devise a pure Python method for extracting and parsing SQL query elements.
Initial Query and Data Frame
To access the metadata containing past queries for my project, I needed to query the INFORMATION_SCHEMA.JOBS_BY_PROJECT view.
SELECT query FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= "2022-06-01"
This query produces an output where the last five entries represent the metadata for the previous query. While these rows can be filtered out, I’ve opted to keep them for now.
After obtaining the query results, I transformed them into a Pandas data frame and removed the 'None' strings from the initial rows.
result = bq_client.query(query_string).to_dataframe()
# Keep only rows whose query text does not contain the 'None' placeholder
result = result[result["query"].str.contains("None") == False].reset_index()
The resulting data frame appears as follows:
Extracting the Datasets and Tables
Through some trial and error, I found that the most effective way to extract datasets and tables is by splitting the string at the periods that distinguish the project from the dataset and table following the FROM clause.
Observe how this structure remains consistent:
project.dataset.table
I referred to a StackOverflow answer for guidance on the correct syntax for the period split. For demonstration purposes, I will start with splitting the initial query string to extract the dataset, INFORMATION_SCHEMA.
print(query_string.split('.', 2)[1].split(' ', 2)[0])  # INFORMATION_SCHEMA
By adjusting the index, I can also retrieve the table name.
print(query_string.split('.', 2)[2].split(' ', 2)[0])
Unfortunately, this approach also captures the WHERE clause that follows the table. While I could filter it out initially, I can also replace the SQL syntax after splitting the datasets and tables into separate columns.
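To see this concretely, printing the repr of the raw split (using the query string above, which breaks to a new line before WHERE) shows the clause still attached to the table name:
print(repr(query_string.split('.', 2)[2].split(' ', 2)[0]))
# Output: 'JOBS_BY_PROJECT\nWHERE'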
It’s important to note that we are actually splitting the SQL query string rather than the column containing all queries. To capture all instances where datasets and tables are referenced, I will iterate through the rows of the ‘query’ column.
First, let's extract the datasets. The steps are as follows:
1. Create an empty list.
2. Loop through each row in the 'query' column.
3. Split each query on '.' as previously described.
4. Append the extracted dataset to the list.
5. Assign the list to a new column.
dataset_lst = []

for index, row in result.iterrows():
    # In project.dataset.table, index 1 of the period split is the dataset
    dataset = row['query'].split('.', 2)[1].split(' ', 2)[0]
    dataset_lst.append(dataset)

result['dataset'] = dataset_lst
This code generates the following output:
Using the same logic, we can now extract the table names. Note that we repeat the previous steps, but adjust the string splits to access the table element.
tab_lst = []

for index, row in result.iterrows():
    # Index 2 of the period split is the table name
    tab = row['query'].split('.', 2)[2].split(' ', 2)[0]
    tab_lst.append(tab)

result['table'] = tab_lst
This process yields the following:
Both the table and dataset columns may contain leftover SQL syntax that needs to be removed. We can apply a loop using the str.replace function to iterate through a list of common SQL commands that might still be concatenated with the strings.
# '\n' strips the newline that rides along with the table name; the
# keywords cover clauses that can directly follow it in the query text
replacers = ['\n', 'WHERE', 'GROUP', 'WHEN', 'END', 'AS']

for r in replacers:
    result['table'] = result['table'].str.replace(r, '')
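As an aside, the same splits can also be expressed as vectorized pandas string operations rather than an iterrows loop. This is a minimal sketch, assuming every query contains at least two periods after the FROM clause (queries that don't will produce NaN):
parts = result['query'].str.split('.', n=2)
result['dataset'] = parts.str[1].str.split(' ', n=2).str[0]
result['table'] = parts.str[2].str.split(' ', n=2).str[0]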
After all elements of the strings have been accurately split, extracted, and filtered, we can encapsulate everything in a function to streamline the script.
from google.cloud import bigquery
import pandas as pd
import os

import config as cfg


def split_dataset_tables():
    """Pull past queries from BigQuery job metadata and extract the
    dataset and table referenced after each FROM clause."""
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = cfg.creds
    bq_client = bigquery.Client()

    query_string = """SELECT query FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time >= "2022-06-01" """

    result = bq_client.query(query_string).to_dataframe()
    # Drop the rows containing the 'None' placeholder strings
    result = result[result["query"].str.contains("None") == False].reset_index()

    tab_lst = []
    dataset_lst = []
    for index, row in result.iterrows():
        # project.dataset.table: index 1 is the dataset, index 2 the table
        tab = row['query'].split('.', 2)[2].split(' ', 2)[0]
        tab_lst.append(tab)
        dataset = row['query'].split('.', 2)[1].split(' ', 2)[0]
        dataset_lst.append(dataset)

    result['dataset'] = dataset_lst
    result['table'] = tab_lst

    # Remove newlines and SQL keywords still concatenated to the table names
    replacers = ['\n', 'WHERE', 'GROUP', 'WHEN', 'END', 'AS']
    for r in replacers:
        result['table'] = result['table'].str.replace(r, '')

    return result


result = split_dataset_tables()
The final output should resemble this:
This method suffices for our needs. If we are only compiling a list of tables from specific queries, the backtick following 'transactions' won't pose an issue. However, if we later compare this output against another dataframe or table, removing the backtick character will ensure that any regex or other string searches match accurately.
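That cleanup is a one-liner per column; this is a sketch assuming the stray character is always a backtick:
# Strip stray backticks so names match plain-text table names elsewhere
result['table'] = result['table'].str.replace('`', '')
result['dataset'] = result['dataset'].str.replace('`', '')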
Takeaway
This script can be expanded to analyze more intricate queries, especially those that reference specific tables. As your analysis needs evolve and queries become more complex, you may find it necessary to extract query strings and convert between various data types like lists, sets, and tuples. In fact, employing set theory (and related operations) can be a powerful strategy for identifying whether certain tables exist within a list.
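For example, a set intersection answers "which of these tables appear in the extracted list?" in a single expression. A minimal sketch, where required_tables is a hypothetical list of tables you expect the queries to reference:
# 'required_tables' is hypothetical; substitute the tables you care about
required_tables = {'transactions', 'customers'}
extracted_tables = set(result['table'])

print(required_tables & extracted_tables)  # tables confirmed in past queries
print(required_tables - extracted_tables)  # tables never referenced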
Although this procedure addresses a specific use case, practicing the extraction of SQL query elements from strings can significantly enhance your regex and string manipulation skills.
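As one exercise in that direction, the period splits above could be swapped for a regular expression from Python's built-in re module. This is a sketch of an alternative, not the method used in this article, that captures the dataset and table names following a FROM or JOIN keyword:
import re

# Matches `project`.dataset.table or `project.dataset.table` after FROM/JOIN
pattern = re.compile(r'(?:FROM|JOIN)\s+`?[\w-]+`?\.(\w+)\.`?(\w+)`?', re.IGNORECASE)

for dataset, table in pattern.findall(query_string):
    print(dataset, table)  # e.g. INFORMATION_SCHEMA JOBS_BY_PROJECT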