Ingest Google Analytics data into Databricks

The customer request was simple: using Databricks, their main data platform, connect to their Google Analytics account and collect data from it. It took me some time to get it right, and I hope this tutorial helps you get it up and running easily. If you want, you can jump to the end of the post for the link to the full notebook on GitHub.

Google Analytics is a web analytics service offered by Google that tracks and reports website traffic. It’s widely used by website owners, marketers, and businesses to gain insights into how users interact with their websites, apps, and other online platforms.

There are 2 ways to extract data from Google Analytics. If you want the raw data, you can only export it to Google BigQuery. The other option I’ll show here is to query aggregated data using the Google Analytics 4 API.

Step 1 – Enable the API

Before anything else, you need to enable the API. This is quite easy, using the big blue button in the link below:

https://developers.google.com/analytics/devguides/reporting/data/v1/quickstart-client-libraries

You will get a JSON file that includes the details and keys needed to connect to the GA API (we will upload this file to Databricks later).

Step 2 – Get permissions

Now extract the email address from the file (see step 2 in the link above) and ask the GA admin in your company to add it as a user and grant it “Viewer” permissions.

You also need to ask them for the property_id. The property_id is a number that identifies a specific site; the Viewer permissions should be granted on that property_id, and you need to save it for later.
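If you'd rather not dig through the JSON file by hand, the service-account email sits in its client_email field (standard for Google service-account key files). A minimal sketch; the file path in the usage comment is an assumption:

```python
import json

def service_account_email(credentials_path):
    """Return the service-account email (client_email) from a GA credentials JSON file."""
    with open(credentials_path) as f:
        return json.load(f)["client_email"]

# Example (hypothetical path):
# service_account_email("/Volumes/mycatalog/myschema/myvolume/ga4.json")
```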

Step 3 – Install the GA library

You need the google-analytics-data library installed to use the code below, so you have 2 options:

  1. Install the library on the cluster. See how here – https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries
  2. Install the library for the current notebook using pip:

%pip install google-analytics-data

Step 4 – Define the environment

To use the file from step 1 to connect to GA, we need to point an environment variable at it. To do that, first upload the file to Databricks storage, either a mount or a volume (if you are using Unity Catalog). Then use os.environ to create an environment variable that points to the file.

import os

# when the file is on a volume
ga_credentials_file = "/Volumes/mycatalog/myschema/myvolume/ga4.json"
# when the file is on a mount
ga_credentials_file = "/mnt/mymount/ga/ga.json"

# set up the Google credentials environment variable
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = ga_credentials_file

You also need to define the GA property_id (the number you got from the GA admin):

# GA property_id
property_id = "123456"

Step 5 – Import libraries

I don’t really have anything smart to say here; you just need these imports:

from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    RunReportRequest
)

Step 6 – Define the report to run

This is where things get interesting. As I said above, the data returned from GA is really an aggregated report, and you need to set 3 things:

  • Date range – the range of dates the report will cover. We need to set the start and end points.
  • Dimensions – the attributes or entities we want to measure by, like pages or locations (country, city, etc.). We can have multiple dimensions in the same report.
  • Metrics – what we are actually measuring, for example, the number of active users, the bounce rate, etc. We can have multiple metrics.

The full list of available dimensions and metrics can be found here: https://developers.google.com/analytics/devguides/reporting/data/v1/api-schema

In the example below, I am measuring active users per page (full page URL and page title), meaning the number of users that visited each page in the time range, which runs from 2024-08-01 until today.

The last line of code actually runs the report and returns a response:

# traffic per page
request = RunReportRequest(
    property=f"properties/{property_id}",
    dimensions=[Dimension(name="fullPageUrl"), Dimension(name="pageTitle")],
    metrics=[Metric(name="activeUsers")],
    date_ranges=[DateRange(start_date="2024-08-01", end_date="today")],
)
response = BetaAnalyticsDataClient().run_report(request)
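One caveat worth knowing: run_report returns a limited number of rows per call (10,000 by default at the time of writing), so large reports need paging via the request's limit and offset fields. A minimal sketch, where run_page is a hypothetical callable you would implement to issue a RunReportRequest with the given offset and limit:

```python
def fetch_all_rows(run_page, page_size=10000):
    """Collect rows across pages.

    `run_page(offset, limit)` is a hypothetical callable that issues one
    report request with those values and returns the response; we stop
    once a page comes back smaller than the page size.
    """
    rows = []
    offset = 0
    while True:
        page = run_page(offset, page_size)
        rows.extend(page.rows)
        if len(page.rows) < page_size:
            break
        offset += page_size
    return rows
```

With the GA client, run_page would be something like lambda offset, limit: client.run_report(RunReportRequest(..., offset=offset, limit=limit)).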

Step 7 – Convert to Spark and use

Now that you have the response object that includes the returned data, you can go ahead and use it in whatever way you want. If you want to use it in Databricks, you probably want to convert it to a Spark dataframe. The function below will convert the response to a dataframe, and then you can display it or write it to a table (see code below).

# function to transform the GA response to a Spark dataframe
def ga_response_to_spark_df(response):
    # collect the column names; metric columns get a "_" prefix so they
    # can be re-created later with the right data type
    columns = []
    for col in response.dimension_headers:
        columns.append(col.name)
    for col in response.metric_headers:
        columns.append("_" + col.name)
    # collect all the rows in the result (all values arrive as strings)
    rows = []
    for row_data in response.rows:
        row = []
        for val in row_data.dimension_values:
            row.append(val.value)
        for val in row_data.metric_values:
            row.append(val.value)
        rows.append(row)
    ga_df = spark.createDataFrame(rows, columns)
    # cast metric columns to the right data type; the GA metric type enum
    # names look like "TYPE_INTEGER", so [5:] strips the "TYPE_" prefix
    for metric in response.metric_headers:
        metric_name = str(metric.name)
        old_metric_name = "_" + metric_name
        metric_data_type = str(metric.type_.name)[5:]
        if metric_data_type == "INTEGER":
            ga_df = ga_df.withColumn(metric_name, ga_df[old_metric_name].cast("int")).drop(old_metric_name)
    return ga_df
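An aside on the cast above: the metric type comes back as an enum whose member names carry a TYPE_ prefix (for example TYPE_INTEGER), so slicing off the first five characters leaves the bare type name. A tiny illustration with plain strings, using a hypothetical helper name:

```python
def strip_type_prefix(enum_name):
    # "TYPE_INTEGER"[5:] -> "INTEGER"; guard against names without the prefix
    return enum_name[5:] if enum_name.startswith("TYPE_") else enum_name
```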

#display the result
display(ga_response_to_spark_df(response))

#write the result to a delta table
ga_df = ga_response_to_spark_df(response)
ga_df.write.saveAsTable("bronze.ga_users_by_page")

I hope you found this tutorial helpful. Leave me a comment if you did, or if you have any questions or improvement ideas.

The full notebook code can be found here (DBC file that you need to import into your Databricks workspace): https://github.com/MadeiraData/microsoft-data-engineers-club/tree/main/Databricks/Ingest%20data%20from%20Google%20Analytics
