How computers and applications interact with data (inputs and outputs) depends on their data types. A data type is a classification that governs how compilers and virtually all programming languages collect, store, and interpret data to define events properly. Thus, a comprehensive understanding of the various data types an application supports is important to effectively define, transform, and utilize your business data.
Google BigQuery is a fully managed, serverless enterprise data warehouse with built-in machine learning and business intelligence (BI) capabilities. It works across cloud platforms and enables users to process and scale their large, read-only datasets. This post will introduce and discuss the various data types supported by BigQuery.
A quick introduction to BigQuery
The BigQuery platform comes with a built-in query engine, which lets users run SQL queries on terabytes of data in seconds and petabytes in only minutes. Besides being able to carry out machine learning, geospatial, and BI analysis on data across the cloud, BigQuery's computational engine can store, manage, and govern various types of data.
Speaking of data types, the Google Cloud–managed platform supports real-time and batch data, whether structured, semi-structured, or unstructured. In addition, BigQuery lets users store and map data into any of its supported data types to process these data effectively.
BigQuery data types
Data types are crucial for data manipulation and analysis operations. BigQuery currently supports eight data types:
- NUMERIC (INT64, NUMERIC, BIGNUMERIC, and FLOAT64)
This data type is used where the precision and scale of exact or approximate values need to be preserved. The four additional categories for numerical data types are INT64, NUMERIC DECIMAL, BIGNUMERIC, and FLOAT64.
Integers are whole numbers; thus, they lack decimals or fractional components. These numbers have a storage size of 8 bytes and range from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. They can also be NULL, and this data type allows the use of mathematical operations like ORDER BY clauses. NULL, in this case, means no value or the absence of a value. It's a standard placeholder for missing values in databases.
SELECT cast('12345' as INT64) int
The precision of the numbers contained here is 38 decimal digits, while the scale is nine decimal digits. Precision means the total number of digits before and after the decimal, while decimal is the number of digits after the decimal. Thus, values here range from -99999999999999999999999999999.999999999 to 99999999999999999999999999999.999999999 and are particularly suited for financial calculations.
SELECT cast('1.2E34' as NUMERIC) numeric
BIGNUMERIC data types are similar to NUMERIC DECIMAL data types in that they support decimal digits and are used for financial calculations. Their differences come in the level of precision. While NUMERIC has a precision of 38, BIGNUMERIC supports a precision of 76.76 and is thus also referred to as BIGDECIMAL.
The minimum values of a data type are BIGNUMERIC -5.7896044618658097711785492504343953926634992332820282019728792003956564819968E+38 and the maximum value is 5.7896044618658097711785492504343953926634992332820282019728792003956564819968E+38.
SELECT cast('1.2E34' as BIGNUMERIC) bignumeric
These data types are double-precision (approximate) numeric values with fractional components (e.g., 1.234). FLOAT64 also includes unique non-numeric values like NaN, +inf, and -inf.
SELECT cast('1.2345' as FLOAT64) float
What can you do with NUMERIC?
NUMERIC supports a couple of operations, such as:
- basic arithmetic operations like addition, subtraction, division, and multiplication
- mathematical functions like ABS (absolute value), RAND (pseudo-random), and SQRT (square root)
- comparisons operators using operators like <, <=, >, >=, != , and <>, which are often used with SQL clauses like the WHERE clause
- precise decimal calculations for financial and accounting calculations.
That last operation is where your NUMERIC DECIMAL, BIGNUMERIC, and FLOAT64 data types come in handy.
Working with NUMERIC
- They can return NULL or NaN if any input parameter is a NULL or a unique floating-point number called NaN (Not-a-Number)
- NaN also appears when a division by zero is attempted.
- BigQuery requires that NUMERIC types be entered directly as strings (NUMERIC '1.23'). A floating-point representation will retain the precision improvements if this is done.
Boolean values are logical variables represented by two case-insensitive keywords, TRUE and FALSE. BOOL values can also be NULL and are used to perform logical and conditional operations.
In logical operations, data are filtered based on specific criteria. However, in conditional operations, a decision is made based on an expression. Both operations are usually used along with SQL clauses like WHERE, ORDER BY, and OR.
SELECT cast('TRUE' AS BOOL) AS boolean
STRING is UTF-8 encoded variable-length character data. These data types are particularly used to store user-generated values such as billing addresses, usernames, survey replies, tweets, and email addresses.
SELECT cast(123 AS STRING) AS string
SELECT cast('panoply' AS STRING) AS string
Working with STRING
- STRING must be quoted with either single ('), double ("), or triple (three single ["'] or three double ["""]) quotation marks.
- STRING data types support data wrangling and string manipulation operations like CHAR_LENGTH (to find out the number of STRING characters), LOWER (to transform characters to lowercase), and CONCAT (to concatenate multiple values into single values). These operations are commonly used during your data pipeline's extract, transform, and load (ETL) process.
- STRING can be cast or converted from one data type to another. For example, a STRING cast to an INT64 or FLOAT64 value and the integer or float back to a string. In a situation where this is unsuccessful, a SAFE_CAST can be used. This is called coercion; it will, however, return NULL.
- You can find specific patterns and characters within STRING values using regular expressions (regex operations). These regex operations—such as REGEXP_CONTAINS, REGEXP_EXTRACT, and REGEXP_REPLACE—are used to find, remove, and replace regex patterns, respectively.
Bytes are variable-length binary data. However, unlike STRING, which exists in Unicode characters, they exist in raw bytes. Thus, you can't use BYTES interchangeably with STRING. These data types, like STRING, are represented in single, double, or triple quotation marks.
SELECT cast ('panoply' as BYTES) bytes
This data type supports four subtypes—DATE, TIME, DATETIME, and TIMESTAMP. Though each has different precision levels, you can use them interchangeably. For example, you can extract a DATE data type value from a DATETIME value.
The DATE data types represent a logical calendar date independent of a specific 24-hour period and time zone. Therefore, this date value can be a different period when viewed in different time zones. To specify a particular period in time, use a timestamp.
A DATE date-type format is YYYY-[M]M-[D]D and ranges from 0001-01-01 to 9999-12-31. The YYYY represents the year, the [M]M for a one or two-digit month, and [D]D for a one or two-digit day.
SELECT cast('2022-12-12' AS DATE) AS date
The TIME data types represent a time independent of a specific date and time zone. The format for the TIME data type values is [H]H:[M]M:[S]S[.DDDDDD]. [H]H is the one or two-digit hour with valid values from 00 to 23. [M]M and [S]S refer to one or two-digit minutes and seconds, respectively, with valid values from 00 to 59. [.DDDDDD] represents microsecond precision of up to six fractional digits (). Generally, TIME values range from 00:00:00 to 23:59:59.999999.
SELECT cast('10:59:13.0245' AS TIME) AS time
Just like the DATE data types, TIME values need to be cast as a timestamp to specify it to a particular time zone.
SELECT cast('2022-12-12 10:59:13.0245-2:00' AS TIMESTAMP) AS time
The DATETIME data types represent the date and time of an event independent of a specific time zone. The format of a DATETIME is YYYY-[M]M-[D]D[(|T)[H]H:[M]M:[S]S[.DDDDDD]] and ranges from 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999999. Each has the same meaning as in the DATE and TIME data types. However, it has the letter T separating the date part and the time part of the values.
SELECT cast('2022-12-12 10:59:13' AS TIMESTAMP) AS datetime
TIMESTAMP represents an exact point in time, with microsecond precision on a particular day in a time zone. It is a DATETIME value with a higher level of precision because the time zone is specified. This data type has a format of YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]][time zone] and ranges from 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999999.
By default, a TIMESTAMP will always assume the time zone to be in UTC and can be precise down to microseconds, except a time zone is specified given for it.
SELECT cast('2022-12-12 10:59:13.0245-2:00' AS TIMESTAMP) AS timestamp
Working with DATE and TIME
Date and time values are for tracking and adding a time element to a particular event. The level of precision will determine which of the four BigQuery data and time data types you would use. Date and time values, especially DATE and TIME, also allow users to perform mathematical operations. This is helpful when trying to answer business questions like, “How long did it take product A to be delivered after it was shipped?”
The GEOGRAPHY data type refers to a particular location on Earth, using the same longitude/latitude values as any GPS (WGS84 reference ellipsoid). Thus, you can build maps and routes with it. This data type represents a collection of points, lines, and polygons based on the Open Geospatial Consortium's (OGC) Simple Features specification (SFS).
SELECT cast(ST_GEOGFROMTEXT('POINT(6.4550575,3.3941795)') as GEOGRAPHY) geography
The spatial functions of BigQuery support SQL/MM 3 specification.
This data type is an ordered list of zero or more elements of any non-ARRAY type. However, the elements in the array must all share the same type.
SELECT cast(['sync', 'store', 'access'] as ARRAY<string>)
Working with ARRAY
- They're represented using the angle brackets (< and >).
- ARRAYs of ARRAYs are not allowed. Thus, queries that produce this will return an error. It would help if you used a STRUCT instead in such cases.
- NULL and empty ARRAYs are two different elements.
- ARRAY with NULL values raises an error. However, you can use them in a query.
STRUCT are containers of ordered fields, each with a required data type and an optional field name. This data type also supports NULL values.
SELECT cast((27,'Panoply') as STRUCT<INT64,STRING>) struct
Working with STRUCT
- They're declared using the (< and >) angle brackets. The elements are contained inside the < and > and can be arbitrarily complex.
- You can compare STRUCTs using comparison operators.
Every BigQuery data type except GEOGRAPHY, ARRAY, and STRUCT works well with the standard SQL expressions.
Your business data originates from various data sources and warehouses. While data types are essential for effective data manipulation and analysis, tools like Panoply offer you a single source of truth.
With Panoply, you can sync all your business data, regardless of the sources, in just a few clicks and securely store them in the cloud. With your data in one centralized place, you can map out their data types, analyze them, and get insight from them in BigQuery.
Request a free demo today to learn more about Panoply!
This post was written by Ifeanyi Benedict Iheagwara. Ifeanyi is a data analyst and Power Platform developer who is passionate about technical writing, contributing to open source organizations, and building communities. Ifeanyi writes about machine learning, data science, and DevOps, and enjoys contributing to open-source projects and the global ecosystem in any capacity.