Friday, July 31, 2015

How to format a json file

Introduction

JSON or JavaScript Object Notation is an open standard format that uses human-readable text to transmit data objects in the form of attribute-value pairs (also referred as key/value paris). It is now considered an alternative to XML for transmitting data between a server and an application.

It is platform independent and code for parsing and generating JSON documents is available in many programming languages. Furthermore, it enjoys a lot of popularity among developers in building data sets accessed through APIs (Application Programming Interfaces). Popular examples include the Twitter API and Yelp.com API. The official Internet media type for JSON is application/json. The JSON filename extension is .json.

Problem

The Yelp.com academic data sets are available in JSON format. One can download the data sets by visiting the web page https://www.yelp.com/dataset_challenge/dataset

The file downloaded is yelp_dataset_challenge_academic_dataset.tgz. Once the contents are extracted using tar -xvf yelp_dataset_challenge_academic_dataset.tgz command, the individual .json files are available for use. 

They include yelp_academic_dataset_business.jsonyelp_academic_dataset_review.json and yelp_academic_dataset_tip.json etc. Each json file contains several hundreds of json data objects, one per line. Because the key, value pairs in each of these json data objects is not indented well, it becomes tough to comprehend keys that are nested or keys that have an array of values or a combination of both. 

For example, the first JSON data object (first line) of the file yelp_academic_dataset_business.json  is stored as 

"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", "hours": {"Tuesday": {"close": "17:00", "open": "08:00"}, "Friday": {"close": "17:00", "open": "08:00"}, "Monday": {"close": "17:00", "open": "08:00"}, "Wednesday": {"close": "17:00", "open": "08:00"}, "Thursday": {"close": "17:00", "open": "08:00"}}, "open": true, "categories": ["Doctors", "Health & Medical"], "city": "Phoenix", "review_count": 9, "name": "Eric Goldberg, MD", "neighborhoods": [], "longitude": -111.98375799999999, "state": "AZ", "stars": 3.5, "latitude": 33.499313000000001, "attributes": {"By Appointment Only": true}, "type": "business"}

A well indented JSON data object should be similar to the example in Code Listing 1 in the article http://www.oracle.com/technetwork/issue-archive/2015/15-jan/o15dba-2372211.html

Solution

1) Extract the first JSON data object in the file yelp_academic_dataset_business.json  and store it separately in business.json

2) Now run the command

cat business.json | python -m json.tool > business_one_json_document.json


3) The business_one_json_document.json file now contains the JSON data object with indented key/value pairs and therefore more readable.

{
    "attributes": {
        "By Appointment Only": true
    },
    "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
    "categories": [
        "Doctors",
        "Health & Medical"
    ],
    "city": "Phoenix",
    "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018",
    "hours": {
        "Friday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Monday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Thursday": {
            "close": "17:00",
            "open": "08:00"
       },
        "Tuesday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Wednesday": {
            "close": "17:00",
            "open": "08:00"
        }
    },
    "latitude": 33.499313000000001,
    "longitude": -111.98375799999999,
    "name": "Eric Goldberg, MD",
    "neighborhoods": [],
    "open": true,
    "review_count": 9,
    "stars": 3.5,
    "state": "AZ",
    "type": "business"
}


No comments: