Skip to main content

2 posts tagged with "BigQuery"

View All Tags

SQL "Inception": Recursive CTEs, BigQuery + Looker

· 5 min read

Have you ever needed to query a data structure where you didn't know how deep the relationships went? Think of an organizational chart, a complex product assembly (Bill of Materials), or a nested comment thread like Hacker News.

recursive-cte-flow-chart

Standard SQL joins fail here because they require you to know the number of levels upfront. Enter the Recursive Common Table Expression (CTE)—a powerful tool that allows a query to reference itself, iterating through levels until a termination condition is met.

In this guide, we’ll explore why recursive CTEs are a game-changer, how to implement them in BigQuery, and the specific architectural patterns required to make them work seamlessly in Looker.

Download the Code

Want to skip ahead and see the final LookML? Download the full source code here

Why Recursive CTEs?

Recursive CTEs are designed for hierarchical or graph-based data. Without them, you are often forced to write inefficient loops in a scripting language or perform multiple, expensive joins.

Top Use Cases:

  • Organizational Hierarchies: Finding every employee under a specific manager, regardless of level.
  • Network Analysis: Traversing a graph to find paths between nodes (e.g., "Degrees of Separation").
  • Threaded Conversations: Analyzing comment chains, replies, and sub-threads (our Hacker News example).
  • BOM (Bill of Materials): Exploding a product into all its constituent parts and sub-assemblies.

Recursive CTEs in BigQuery

BigQuery uses a standard syntax for recursion, consisting of three parts: the Anchor Member, the Recursive Member, and the Termination Condition.

The Syntax Pattern:

WITH RECURSIVE hierarchy AS (
-- 1. Anchor Member: Where do we start?
SELECT id, parent, 1 as level
FROM my_table
WHERE parent IS NULL

UNION ALL

-- 2. Recursive Member: How do we get to the next level?
SELECT child.id, child.parent, parent.level + 1
FROM my_table child
JOIN hierarchy parent ON child.parent = parent.id

-- 3. Termination: BigQuery has a hard limit of 500 iterations.
-- Ensure your join eventually returns no rows to stop.
)
SELECT * FROM hierarchy
BigQuery Constraint

WITH RECURSIVE MUST be at the very top level of your SQL statement. It cannot be inside a subquery. This leads to a small challenge in Looker we will overcome later.


The "Looker Trap" and the PDT Solution

When you define a SQL Derived Table in Looker, the generated SQL typically looks like this:

WITH my_derived_table (
SELECT ... -- Your Derived Table SQL here

)
SELECT ... FROM my_derived_table GROUP BY ... ORDER BY ...

If your Derived Table SQL starts with WITH RECURSIVE, Looker wraps it in a subquery, pushing the WITH clause down. BigQuery will reject this query.

The Solution: Persistent Derived Tables (PDTs)

To use recursive CTEs in BigQuery with Looker, you must use a Persistent Derived Table. When Looker builds a PDT, it uses a CREATE TABLE AS SELECT (CTAS) statement. Because the PDT build process executes the SQL as a standalone statement, the WITH RECURSIVE stays at the top level, and the query succeeds.

view: hn_hierarchy {
derived_table: {
datagroup_trigger: my_datagroup
# persist_for: "1 hour" # we recommend using a datagroup_trigger, but using persist_for works too
# Using a trigger or persist_for is mandatory to enable PDT behavior
sql:
WITH RECURSIVE hierarchy AS (
...
)
SELECT * FROM hierarchy ;;
}
}

Real-World Implementation: Hacker News

We analyzed the Hacker News dataset (47 million rows) to find the deepest comment threads. Using a recursive CTE joined to the main table, we uncovered fascinating insights:

  • Record Depth: The deepest thread reached 65 levels deep!
  • Pre-Aggregated Stats: By using a window function in our recursive view, we calculated total story comments as a dimension: COUNT(1) OVER(PARTITION BY root_id) as story_total_comments This allows users to sort by a story's total popularity while still viewing individual comment text in the same row.

A premium Looker implementation doesn't just show data; it provides a path to action.

Enable Detail Drills

Aggregated counts should always allow users to see the underlying rows. We enabled this by adding drill_fields to our recursive measures:

measure: count {
type: count
drill_fields: [id, parent, root_id, hacker_news.text]
}

Connect to Reality

Using Looker’s link parameter, we added external links to the original Hacker News articles. With the Google favicon lookup, the UI feels integrated and professional:

dimension: id {
link: {
label: "View on Hacker News"
url: "https://news.ycombinator.com/item?id={{ value }}"
icon_url: "https://www.google.com/s2/favicons?domain=news.ycombinator.com"
}
}

Hackernews Dashboard

Summary: Best Practices for Looker Developers

  • Materialize: Always persist recursive CTEs in BigQuery.
  • Window Functions: Use them within your CTE to pre-calculate hierarchy-wide stats for better Explore performance.
  • Liquid Guards: If using Query-Time Derived Tables (on other dialects), use {% condition %} to filter the anchor member early.

Recursive CTEs turn impossible data challenges into elegant LookML solutions. By understanding the warehouse-specific constraints and Looker's persistence model, you can unlock hierarchical insights that were previously out of reach.

Looker Embed with BigQuery OAuth

· 4 min read

This implementation guide explains how to embed Looker dashboards backed by Google BigQuery with OAuth into your custom application smoothly, eliminating the "double authentication" phase in the iframe. There is a reference example repository, looker_oauth, made by Sam Pitcher. The code samples here are in Python, but can be done in any server-side framework.

alt text

GCP & Looker Setup

GCP OAuth Credentials

  • In the Google Cloud Console, create a single OAuth 2.0 Client ID.
  • Update your Authorized redirect URIs to list BOTH:
    • Your Looker instance native redirect URI (https://example.cloud.looker.com/external_oauth/redirect).
    • Your host application's OAuth callback URL, for example, https://app.example.com/auth/callback

Connect BigQuery to Looker

  • In the Looker Admin panel, proceed to connections and establish a Google BigQuery connection.
  • Select Authentication with OAuth and plug in your client credentials from GCP.
  • Find the generated application ID by running all_external_oauth_applications via the Looker SDK or API. We recommend using the API Explorer if you have it installed; here is a relative link for use in your Looker UI: /extensions/marketplace_extension_api_explorer::api-explorer/4.0/methods/Connection/all_external_oauth_applications

Application Level Login & Token Fetching

When users access your app, you must authorize them through Google OAuth 2.0. In your authorization redirect URL, configure exactly these dimensions:

Scopes & Access Type

Make sure your framework requests:

Sample Callback Logic

Once Google redirects to your server with the temporary authorization code, invoke standard Google OAuth API calls to fetch tokens:

# Capture Code dimension
code = request.args.get("code")

# Prepare token request using your standard Web Application Client
token_response = requests.post(
"https://oauth2.googleapis.com/token",
data={
"code": code,
"client_id": GOOGLE_CLIENT_ID,
"client_secret": GOOGLE_CLIENT_SECRET,
"redirect_uri": REDIRECT_URI,
"grant_type": "authorization_code",
}
)

response_payload = token_response.json()
access_token = response_payload.get("access_token")
refresh_token = response_payload.get("refresh_token")
expires_in = int(response_payload.get("expires_in"))

Token Synchronization via Looker SDK

Immediately after capturing Google tokens (access/refresh), proactively insert them into Looker using an Admin Looker SDK instance:

Locate the Embed User Identity

In this article, we won't go into the details of Signed Embedding or Cookieless Embedding; we assume you already know what they are, and that a user on the Looker side has already been created. In either of these methods, Looker creates unique embed users tied to the external_user_id that you pass in the SSO URLs. Fetch the user's internal Looker identifier using user_for_credential:

looker_user = sdk.user_for_credential(
credential_type='embed',
user_id=current_user.your_user_id # The external_user_id you use in SSO
)
warning

If this is the absolute first time that a user authenticates into Looker with this external_user_id and you're using signed embedding, then user_for_credential will error. You should catch this error, then create an SSO embed URL and fetch the URL to create the user.

Inject OAuth State into Looker

Construct the user state update parameters and pass them to create_oauth_application_user_state:

body = looker_sdk.models40.CreateOAuthApplicationUserStateRequest(
user_id=looker_user.id,
oauth_application_id=LOOKER_OAUTH_APPLICATION_ID,
access_token=access_token,
access_token_expires_at=datetime.datetime.now() + datetime.timedelta(seconds=expires_in),
refresh_token=refresh_token,
refresh_token_expires_at=datetime.datetime.now() + datetime.timedelta(days=180) # Common refresh expiry
)

sdk.create_oauth_application_user_state(body)

Creating the SSO URL

Once the user's access token matches successfully in Looker, issue a standard Signed Embed URL for them to use when loading the iframe. When creating the Signed Embed URL payload via create_sso_embed_url. If you are using cookieless_embedding, see this document on acquiring user attributes

warning

Make sure to provide the same exact user identification (like their email) used above.

When to Trigger

We strongly recommend using the embed-sdk package to kick off this flow with getEmbedSDK().init() & getEmbedSDK().preload() and then display the iframe without any data, and then use getEmbedSDK.loadExplore() or getEmbedSDK.loadDashboard() to load the proper data into the iframe when you need it.

Troubleshooting Checklist

If users are still prompted for authentication within the iframe, verify the following:

  • Missing Refresh Token: Ensure you requested access_type='offline' and prompt='consent' in the Google OAuth redirect. Without this, Google won't return a refresh token, and Looker will be unable to refresh expired access tokens after 60 minutes automatically.
  • Scope Mismatch: Verify that the scopes requested on application login cover exactly the necessary BigQuery scopes (e.g., https://www.googleapis.com/auth/bigquery.readonly).
  • Mismatched External User ID: The external_user_id used in the Signed SSO Embed URL must exactly match the user_id used to pull the user identity in user_for_credential before state injection.
  • Looker User Provisioning: In Signed SSO Embedding, Looker only creates the user profile on the first successful load of an SSO URL. If you try to fetch the user state before they've ever visited, user_for_credential will fail. In your server-side logic, if user_for_credential fails, provision the user by either fetching the URL or properly capturing the error.