In case somebody is trying to use AWS Athena and need to load data from JSON, It’s possible but got some learning curves(AWS curves included) 😉 .
So this post got some examples of how to create the table and how to query it.
1. Lets start with a simple example , key <> value
JSON Looks like :
[sourcecode language=”plain”]
{
“type”: “FeatureCollection”,
“features”: “geolocations”
}
[/sourcecode]
Create Table :
[sourcecode language=”plain”]
CREATE EXTERNAL TABLE jsondata (
type string,
features string
)
ROW FORMAT SERDE ‘org.openx.data.jsonserde.JsonSerDe’
LOCATION ‘s3://<YOUR BUCKET HERE>/<some path>/’
[/sourcecode]
Query Table:
[sourcecode language=”plain”]
SELECT type AS TypeEvent,
features AS FeatherType
FROM blogpost.jsondata
WHERE type = ‘FeatureCollection’
[/sourcecode]
2. With an array
JSON Looks like :
[sourcecode language=”plain”]
{
“type”: “FeatureCollection”,
“features”: [“latitude”, “longitude”]
}
[/sourcecode]
Create Table :
[sourcecode language=”plain”]
CREATE EXTERNAL TABLE jsondata (
type string,
features array<string>
)
ROW FORMAT SERDE ‘org.openx.data.jsonserde.JsonSerDe’
LOCATION ‘s3://<YOUR BUCKET HERE>/<some path>/’
[/sourcecode]
Query Table:
[sourcecode language=”plain”]
SELECT type AS TypeEvent,
features[1] AS FeatherType
FROM blogpost.jsondata
WHERE type = ‘FeatureCollection’
[/sourcecode]
3. With array and nested dict ( PythoniCally speaking )
JSON Looks like :
[sourcecode language=”plain”]
{
“type”: “FeatureCollection”,
“features”: [{
“first”: “raj”,
“properties”: “someprop”
}]
}
[/sourcecode]
Create Table :
[sourcecode language=”plain”]
CREATE EXTERNAL TABLE jsondata (
type string,
features array<struct<first:string,properties:string>>
)
ROW FORMAT SERDE ‘org.openx.data.jsonserde.JsonSerDe’
LOCATION ‘s3://<YOUR BUCKET HERE>/<some path>/’
[/sourcecode]
Query Table:
[sourcecode language=”plain”]
SELECT type AS TypeEvent,
features[1].first AS FeatherType
FROM blogpost.jsondata
WHERE type = ‘FeatureCollection’
[/sourcecode]
4. Put everything together
JSON Looks like :
[sourcecode language=”plain”]
{
“type”: “FeatureCollection”,
“features”: [{
“first”: “raj”,
“geometry”: {
“type”: “Point”,
“coordinates”: [-117.06861096, 32.57889962]
},
“properties”: “someprop”
}]
}
[/sourcecode]
Create Table :
[sourcecode language=”plain”]
CREATE EXTERNAL TABLE `jsondata`(
`type` string COMMENT ‘from deserializer’,
`features` array<struct<type:string,geometry:struct<type:string,coordinates:array<string>>>> COMMENT ‘from deserializer’)
ROW FORMAT SERDE
‘org.openx.data.jsonserde.JsonSerDe’
WITH SERDEPROPERTIES (
‘paths’=’features,type’)
STORED AS INPUTFORMAT
‘org.apache.hadoop.mapred.TextInputFormat’
OUTPUTFORMAT
‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’
LOCATION
‘s3://vicinitycheck/rawData/jsondata/’
TBLPROPERTIES (
‘classification’=’json’)
[/sourcecode]
Query Table:
[sourcecode language=”plain”]
SELECT type AS TypeEvent,
features[1].geometry.coordinates AS FeatherType
FROM test_vicinitycheck.jsondata
WHERE type = ‘FeatureCollection’
[/sourcecode]
Some additional things :
Avoid the reserved words in json and keep things in lower case.
Rest given the speed these cloud providers change , please share if you find any thing new came.