AWS Athena and Nested JSON

In case somebody is trying to use AWS Athena and need to load data from JSON, It’s possible but got some learning curves(AWS curves included) 😉 .

So this post got some examples of how to create the table and how to query it.

1. Lets start with a simple example , key <> value

JSON Looks like :

[sourcecode language=”plain”]
{
“type”: “FeatureCollection”,
“features”: “geolocations”
}
[/sourcecode]

Create Table :

[sourcecode language=”plain”]
CREATE EXTERNAL TABLE jsondata (
type string,
features string
)
ROW FORMAT SERDE ‘org.openx.data.jsonserde.JsonSerDe’
LOCATION ‘s3://<YOUR BUCKET HERE>/<some path>/’
[/sourcecode]

Query Table:

[sourcecode language=”plain”]
SELECT type AS TypeEvent,
features AS FeatherType
FROM blogpost.jsondata
WHERE type = ‘FeatureCollection’
[/sourcecode]

2. With an array

JSON Looks like :

[sourcecode language=”plain”]

{
“type”: “FeatureCollection”,
“features”: [“latitude”, “longitude”]
}

[/sourcecode]

Create Table :

[sourcecode language=”plain”]
CREATE EXTERNAL TABLE jsondata (
type string,
features array<string>
)
ROW FORMAT SERDE ‘org.openx.data.jsonserde.JsonSerDe’
LOCATION ‘s3://<YOUR BUCKET HERE>/<some path>/’
[/sourcecode]

Query Table:

[sourcecode language=”plain”]
SELECT type AS TypeEvent,
features[1] AS FeatherType
FROM blogpost.jsondata
WHERE type = ‘FeatureCollection’
[/sourcecode]

3. With array and nested dict ( PythoniCally speaking )

JSON Looks like :

[sourcecode language=”plain”]

{
“type”: “FeatureCollection”,
“features”: [{
“first”: “raj”,
“properties”: “someprop”
}]
}

[/sourcecode]

Create Table :

[sourcecode language=”plain”]
CREATE EXTERNAL TABLE jsondata (
type string,
features array<struct<first:string,properties:string>>
)
ROW FORMAT SERDE ‘org.openx.data.jsonserde.JsonSerDe’
LOCATION ‘s3://<YOUR BUCKET HERE>/<some path>/’
[/sourcecode]

Query Table:

[sourcecode language=”plain”]
SELECT type AS TypeEvent,
features[1].first AS FeatherType
FROM blogpost.jsondata
WHERE type = ‘FeatureCollection’
[/sourcecode]

4. Put everything together

JSON Looks like :

[sourcecode language=”plain”]
{
“type”: “FeatureCollection”,
“features”: [{
“first”: “raj”,
“geometry”: {
“type”: “Point”,
“coordinates”: [-117.06861096, 32.57889962]
},
“properties”: “someprop”
}]
}
[/sourcecode]

Create Table :

[sourcecode language=”plain”]
CREATE EXTERNAL TABLE `jsondata`(
`type` string COMMENT ‘from deserializer’,
`features` array<struct<type:string,geometry:struct<type:string,coordinates:array<string>>>> COMMENT ‘from deserializer’)
ROW FORMAT SERDE
‘org.openx.data.jsonserde.JsonSerDe’
WITH SERDEPROPERTIES (
‘paths’=’features,type’)
STORED AS INPUTFORMAT
‘org.apache.hadoop.mapred.TextInputFormat’
OUTPUTFORMAT
‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’
LOCATION
‘s3://vicinitycheck/rawData/jsondata/’
TBLPROPERTIES (
‘classification’=’json’)

[/sourcecode]

Query Table:

[sourcecode language=”plain”]
SELECT type AS TypeEvent,
features[1].geometry.coordinates AS FeatherType
FROM test_vicinitycheck.jsondata
WHERE type = ‘FeatureCollection’
[/sourcecode]

Some additional things :

Avoid the reserved words in json and keep things in lower case.

Rest given the speed these cloud providers change , please share if you find any thing new came.