Nested Data Ingestion Methods
At Panoply, we've handled nested data since our inception. Nested data can come in multiple shapes and forms, including different combinations of the same types. Nested data can be:
- Objects
- Arrays
- JSON strings
In this blog, I’ll take you through a journey of handling nested data in Panoply. You might see spoilers for the future as the data journey never ends 🙂.
When we introduced the Panoply solution back in 2015, we transformed all the nested data into relational structure. How? By creating additional tables with a one-to-many relationship between them and their “parent” table. These additional tables are subtables.
Back then, this was the only way we handled nested data. The greatest advantage of the above ingestion method is that all the data is available to query directly with standard SQL. This is the main reason it was and is still the default behavior in all of our connectors.
Example:
Input:
{"id":1, "number":1, "text":"hello", "date":"2023-10-01", "nested":{"another_number":1, "another_level":{"another_date":"2023-10-02"}}}
{"id":2, "number":2, "text":"there", "date":"2023-10-01", "nested":{"another_text":"again", "another_level":{"another_date":"2023-10-02"}}}
Output: (... represents internal Panoply columns)
destination_table
id |
number |
text |
date |
… |
1 |
1 |
hello |
2023-10-01 |
|
2 |
2 |
there |
2023-10-01 |
destination_table_nested
another_number |
another_text |
… |
1 |
null |
|
null |
again |
destination_table_nested_another_level
another_date |
… |
2023-10-02 |
|
2023-10-02 |
As a company that listens to our customers, we’ve always changed and adapted our features to the needs of our customers and technological capabilities. Throughout the years we’ve added a number of additional options to handle nested data and ingest it in many different ways.
Next, we’ve added the option to flatten the nested data to the parent table. This solution was (and still is) only available for objects and JSON strings. At the beginning, we only supported the ability to flatten the data one level up, but now users can flatten the data up to the top-most level (the main table).
Example:
Input:
{"id":1, "number":1, "text":"hello", "date":"2023-10-01", "nested":{"another_number":1, "another_level":{"another_date":"2023-10-02"}}}
{"id":2, "number":2, "text":"there", "date":"2023-10-01", "nested":{"another_text":"again", "another_level":{"another_date":"2023-10-02"}}}
Output: (... represents internal Panoply columns)
destination_table
id |
number |
text |
date |
nested_another_number |
nested_another_text |
nested_another_level_another_date |
… |
1 |
1 |
hello |
2023-10-01 |
1 |
null |
2023-10-02 |
|
2 |
2 |
there |
2023-10-01 |
null |
again |
2023-10-02 |
Both options still kept the data in a relational structure but they encountered a few edge cases. Some of them were the main driver to add more options. For example, if the user had a lot of uniquely named nested fields they would either end up having a lot of subtables (with the one-to-many ingestion approach) or a lot of columns in a single table (with the flattening approach). Because of this we’ve added the key:value ingestion option. This ingestion type converts the data from an object into an array of objects. Each object has two fields in it:
- Key - The value of this field is the original field name
- Value - The value of this field is the original field value
With this ingestion type, instead of creating multiple tables or columns we create a table with two columns (key & value) and multiple rows to represent a single row of the original data. This solution was available for our Redshift customers for over 5 years already but was not ported to our BigQuery solution when we originally introduced it. Spoiler alert: The key:value feature is coming soon to BigQuery.
Example:
Input:
{"id":1, "number":1, "text":"hello", "date":"2023-10-01", "nested":{"another_number":1, "another_text":"again", "another_date":"2023-10-02"}}
Output: (... represents internal Panoply columns)
destination_table
id |
number |
text |
date |
… |
1 |
1 |
hello |
2023-10-01 |
Destination_table_nested
key |
value |
… |
another_number |
1 |
|
another_text |
again |
|
another_date |
2023-10-02 |
In April 2023, we introduced to the world our Flex Connector. This connector can connect to literally any API service and ingest its data into your Panoply data warehouse. Due to the fact that this connector supports multiple different unknown responses, we knew we might encounter weird data structures that would require some pre-load modeling. These are tasks our Snap Connectors do automatically based on the source’s logic, but with the Flex Connector, we do not have this information prior to the user running the connector.
That’s why we’ve added the option to flatten arrays. When flattening an array we are transforming each element in the array into a separate row and attach to each row the original top level row. By doing this we are actually duplicating the original data in order to accommodate the flattening of the array.
The Flex Connector is currently the only connector that supports this, but we are planning to add this feature to all the connectors in the future.
Example:
Input:
{"id":1, "number":1, "nested_array":[{"another_number":1, "another_text":"foo"},{"another_number":2, "another_text":"bar"}]}
Output: (... represents internal Panoply columns)
destination_table
id |
number |
nested_array_another_number |
nested_array_another_text |
… |
1 |
1 |
1 |
foo |
|
1 |
1 |
2 |
bar |
We recently released the latest nested data feature - Panoply’s BigQuery users can now ingest nested data directly as a JSON data type. Thanks to the native JSON data type in BigQuery, we now allow users to ingest their nested objects and JSON strings directly as JSONs.
Example:
Input:
{"id":1, "number":1,"nested":{"another_number":1, "another_text":"again", "another_date":"2023-10-02"}}
Output: (... represents internal Panoply columns)
destination_table
id |
number |
nested |
… |
1 |
1 |
{"another_number":1, "another_text":"again", "another_date":"2023-10-02"} |
Additional ingestion types in coming in the future:
- JSON data type for arrays
- Key:Value feature in BigQuery
- Flattening arrays on all connectors and more than a single array
- Array data type
- Struct data type
As you can see there are many ways to ingest nested data in Panoply, with additional options coming in the future. Users often ask us why we have so many different options - this is a great question, and we have an answer! Every company, every user, and every piece of data has their own needs. These needs can be logical, analytical, or even simply data needs that require a certain data modeling that will fit their use case. Similar to a lot of other features, we truly believe that we should simplify the data process but still enable flexibility for the users to achieve their analytical goals.