Unlocking Time Series and Range Functions in BigQuery
Chapter 1: Introduction to Time Series Analysis
Google has recently introduced Time Series and Range functions for BigQuery SQL. Let’s explore how to utilize these features effectively.
Definition of Time Series Analysis
To begin with, let’s define Time Series Analysis:
In mathematics, a time series is a collection of data points organized in chronological order. Typically, a time series consists of data taken at regular intervals, creating a sequence of discrete-time data. Common examples include ocean tide measurements, sunspot counts, and daily closing figures of the Dow Jones Industrial Average.
Beyond analyzing past data and its influencing factors, the goal of time series analysis is to predict future trends based on historical values. Time series data often reveals patterns tied to specific influences, and these patterns can generally be categorized into distinct components:
- Trend Component: Indicates the overall direction of the time series over time, such as a consistent rise in a country’s GDP.
- Seasonal Component: Accounts for periodic fluctuations within a year; for instance, ice cream sales tend to rise during summer months.
- Cyclical Component: Reflects longer-term fluctuations that extend beyond a year, often linked to cycles of economic growth and decline.
- Residual Component: Encompasses variations not explained by the above categories, such as unexpected weather changes or data collection errors.
Data Structure for Time Series
As mentioned earlier, a time series comprises data points, each associated with a timestamp and a corresponding value. It is also common for a time series to have a unique identifier. In relational databases and BigQuery, time series data can be structured with the following attributes:
- Time column
- Optional partitioning columns (e.g., zip code)
- One or more value columns or a STRUCT type that combines multiple values, such as temperature and AQI.
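The attributes above could map onto a table definition like the following. This is only a sketch: the table name environmental_readings and the column name measurements are hypothetical and chosen for illustration, and the Data dataset is assumed to exist.

```sql
-- Hypothetical table layout for a time series (names are illustrative).
CREATE TABLE IF NOT EXISTS Data.environmental_readings (
  time TIMESTAMP NOT NULL,          -- time column
  zip_code INT64,                   -- optional partitioning column identifying the series
  measurements STRUCT<              -- several values combined in one STRUCT
    temperature INT64,
    aqi INT64
  >
);
```

Individual values can then be read with dot notation, for example measurements.aqi.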
Example of Time Series Data Representation
Using the RANGE Data Type
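When working with time series data, it’s often necessary to define specific time ranges. This is where the newly introduced RANGE data type comes into play. A RANGE value represents a span between a start and an end (of type DATE, DATETIME, or TIMESTAMP), where the start is included and the end is excluded; it can indicate the time span for which a row is valid. GoogleSQL for BigQuery supports several range functions, including GENERATE_RANGE_ARRAY, RANGE, RANGE_CONTAINS, RANGE_END, RANGE_INTERSECT, RANGE_OVERLAPS, RANGE_SESSIONIZE, and RANGE_START.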
To demonstrate, consider the following example:
SELECT GENERATE_RANGE_ARRAY(
RANGE(DATE '2020-01-01', DATE '2020-01-06'),
INTERVAL 1 DAY) AS results;
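This query splits the range into consecutive one-day subranges and returns them as an array of RANGE values. Because the end of a range is exclusive, the result contains five subranges, from [2020-01-01, 2020-01-02) through [2020-01-05, 2020-01-06).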
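A few of the other range functions can be tried on the same range. The following is a minimal sketch rather than an exhaustive tour; the alias r and the output column names are made up for illustration.

```sql
-- Inspect a single RANGE value with several range functions.
SELECT
  RANGE_START(r) AS start_date,                          -- inclusive lower bound
  RANGE_END(r) AS end_date,                              -- exclusive upper bound
  RANGE_CONTAINS(r, DATE '2020-01-03') AS contains_jan_3,
  RANGE_OVERLAPS(
    r,
    RANGE(DATE '2020-01-05', DATE '2020-01-10')) AS overlaps_next
FROM (
  SELECT RANGE(DATE '2020-01-01', DATE '2020-01-06') AS r
);
```

Since 2020-01-03 lies inside the range and 2020-01-05 is shared by both ranges, both boolean columns should come back TRUE.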
Using Time Series Functions
Now that we’ve discussed the need for time series data, its modeling, and the RANGE data type, let’s delve into the most exciting addition: Time Series functions in BigQuery SQL. Google provides the following four functions:
- DATE_BUCKET: Retrieves the lower bound of the date bucket containing a specific date.
- DATETIME_BUCKET: Retrieves the lower bound of the datetime bucket for a given datetime.
- GAP_FILL: Identifies and fills gaps in a time series.
- TIMESTAMP_BUCKET: Retrieves the lower bound of the timestamp bucket for a given timestamp.
For more detailed information, please refer to the official Google documentation. However, here’s a simple example to illustrate how to use these functions.
First, we need to create some sample data:
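CREATE OR REPLACE TABLE Data.environmental_data_hourly AS
SELECT * FROM UNNEST(
  -- Field names are declared here so that later queries can
  -- reference zip_code, time, aqi, and temperature.
  ARRAY<STRUCT<zip_code INT64, time TIMESTAMP, aqi INT64, temperature INT64>>[
    STRUCT(60606, TIMESTAMP '2020-09-08 00:30:51', 22, 66),
    STRUCT(60606, TIMESTAMP '2020-09-08 01:32:10', 23, 63)
    -- Additional data points...
  ]);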
We will utilize the TIMESTAMP_BUCKET function:
TIMESTAMP_BUCKET(timestamp_in_bucket, bucket_width[, bucket_origin_timestamp])
This function takes the following parameters:
- timestamp_in_bucket: A TIMESTAMP value used to find a timestamp bucket.
- bucket_width: An INTERVAL value that specifies the width of the timestamp bucket.
- bucket_origin_timestamp: An optional TIMESTAMP value representing a reference point in time from which all buckets extend; the default is 1950-01-01 00:00:00 if not set.
DATE_BUCKET and DATETIME_BUCKET follow the same pattern with DATE and DATETIME arguments.
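To see how the origin argument shifts bucket boundaries, here is a small sketch that buckets one of the sample timestamps twice, once with the default origin and once with a custom one. The custom origin value and the column aliases are arbitrary choices for illustration.

```sql
-- Compare the default bucket origin with a custom one.
SELECT
  -- Default origin (1950-01-01 00:00:00): 3-hour buckets start at 00:00, 03:00, ...
  TIMESTAMP_BUCKET(
    TIMESTAMP '2020-09-08 01:32:10',
    INTERVAL 3 HOUR) AS default_origin_bucket,
  -- Custom origin: 3-hour buckets start at 01:00, 04:00, ...
  TIMESTAMP_BUCKET(
    TIMESTAMP '2020-09-08 01:32:10',
    INTERVAL 3 HOUR,
    TIMESTAMP '2020-09-08 01:00:00') AS custom_origin_bucket;
```

With the default origin the timestamp should fall into the bucket starting at 2020-09-08 00:00:00, while the custom origin should place it in the bucket starting at 2020-09-08 01:00:00.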
To calculate a 3-hour average for air quality index (AQI) and temperature per zip code, we can run the following query:
SELECT
TIMESTAMP_BUCKET(time, INTERVAL 3 HOUR) AS time,
zip_code,
CAST(AVG(aqi) AS INT64) AS aqi,
CAST(AVG(temperature) AS INT64) AS temperature
FROM Data.environmental_data_hourly
GROUP BY zip_code, time
ORDER BY zip_code, time;
Query Results Visualization
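GAP_FILL, the remaining function from the list, is a table-valued function rather than a scalar one. A possible way to apply it to the sample table is sketched below, assuming the table has the columns time, zip_code, aqi, and temperature; the choice of fill method per column ('locf' for last observation carried forward, 'linear' for linear interpolation) is just one option.

```sql
-- Produce one row per hour and zip code, filling missing hours.
SELECT *
FROM GAP_FILL(
  TABLE Data.environmental_data_hourly,
  ts_column => 'time',
  bucket_width => INTERVAL 1 HOUR,
  partitioning_columns => ['zip_code'],
  value_columns => [
    ('aqi', 'locf'),           -- carry the last observed AQI forward
    ('temperature', 'linear')  -- interpolate temperature between neighbors
  ]
)
ORDER BY zip_code, time;
```

Passing partitioning_columns keeps each zip code's series separate, so gaps in one location are not filled with values from another.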
Conclusion
This brief overview of the new Range and Time Series functions provides insight into Time Series analysis, its applications, and how to implement it using GoogleSQL. For a comprehensive exploration and additional examples, please consult the official Google documentation linked below.
Sources and Further Readings
- Wikipedia, Time series (2024)
- Statista, Definition Zeitreihenanalyse (2024)
- Google, Work with time series data (2024)
- Google, Range functions (2024)
- Google, Time series functions (2024)
Chapter 2: Practical Applications of Time Series Functions
This video titled "Connect to BigQuery with Dynamic Date Ranges" offers an overview of how to establish connections to BigQuery while utilizing dynamic date ranges effectively.
Chapter 3: Advanced Techniques in Data Studio and BigQuery
In this video, "Dynamic Dates in Custom Queries (Data Studio + BigQuery)," viewers can learn how to implement dynamic dates in custom queries within Data Studio and BigQuery for enhanced reporting capabilities.