Friday, July 31, 2015

How to install Oracle database express edition on Linux

The instructions in this blog post are validated to work on CentOS 6.4 version for installing Oracle 11g R2 database express edition. The same instructions should work OK on other Linux operating systems such as Red Hat Enterprise Linux and Oracle Enteprise Linux etc. This blog host also assumes that Oracle software is being installed for the first time on the Linux server. Change or skip some instructions if otherwise.



1) Open a new terminal

2) sudo su - root
3) mkdir /u01  (If it does not exist)
4) useradd oracle (If it does not exist)
5) groupadd dba (If the group dba does not exist)

6) usermod -g dba oracle

7) chown -R oracle:dba /u01

8) Edit /etc/oraInst.loc and add the below content.

inventory_loc=/var/opt/oracle/oraInventory
inst_group=dba

9) mkdir -p /var/opt/oracle/oraInventory

10) chown -R oracle:dba /var/opt/oracle/oraInventory

11) Install any missing Linux packages (RPMs) that are missing on the server.  The list of Linux version 6 packages required for installing 11g R2 are specified in the quick installation guide https://docs.oracle.com/cd/E11882_01/install.112/e24326.pdf

12) Download Oracle database express edition software from http://www.oracle.com/technetwork/database/database-technologies/express-edition/downloads/index.html for Linux.

13) The downloaded file name is oracle-xe-11.2.0-1.0.x86_64.rpm.zip

14) Login as root
15) unzip oracle-xe-11.2.0-1.0.x86_64.rpm.zip
16) cd Disk1
17) rpm -i oracle-xe-11.2.0-1.0.x86_64.rpm 
18) /etc/init.d/oracle-xe configure


19) ps -fu  oracle  (Verify the background processes are started for XE database instance)
20) sudo su  -  oracle  (Switch to oracle operating system user)
21) Edit /home/oracle/.bash_profile and add the below lines.

export ORACLE_HOME=/u01/app/oracle/product/11.2.0/xe
export PATH=$ORACLE_HOME/bin:$PATH
export ORACLE_SID=XE
export TNS_ADMIN=$ORACLE_HOME/network/admin

22) sqlplus ‘/ as sysdba’   (Login to the database with DBA privilege)
23) alter user hr account unlock;  (The sample HR schema is locked by default)
24) alter user hr identified by hr; (Reset the password of HR schema)

References: http://docs.oracle.com/cd/E17781_01/install.112/e18802/toc.htm#XEINL123

How to format a json file

Introduction

JSON or JavaScript Object Notation is an open standard format that uses human-readable text to transmit data objects in the form of attribute-value pairs (also referred as key/value paris). It is now considered an alternative to XML for transmitting data between a server and an application.

It is platform independent and code for parsing and generating JSON documents is available in many programming languages. Furthermore, it enjoys a lot of popularity among developers in building data sets accessed through APIs (Application Programming Interfaces). Popular examples include the Twitter API and Yelp.com API. The official Internet media type for JSON is application/json. The JSON filename extension is .json.

Problem

The Yelp.com academic data sets are available in JSON format. One can download the data sets by visiting the web page https://www.yelp.com/dataset_challenge/dataset

The file downloaded is yelp_dataset_challenge_academic_dataset.tgz. Once the contents are extracted using tar -xvf yelp_dataset_challenge_academic_dataset.tgz command, the individual .json files are available for use. 

They include yelp_academic_dataset_business.jsonyelp_academic_dataset_review.json and yelp_academic_dataset_tip.json etc. Each json file contains several hundreds of json data objects, one per line. Because the key, value pairs in each of these json data objects is not indented well, it becomes tough to comprehend keys that are nested or keys that have an array of values or a combination of both. 

For example, the first JSON data object (first line) of the file yelp_academic_dataset_business.json  is stored as 

"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", "hours": {"Tuesday": {"close": "17:00", "open": "08:00"}, "Friday": {"close": "17:00", "open": "08:00"}, "Monday": {"close": "17:00", "open": "08:00"}, "Wednesday": {"close": "17:00", "open": "08:00"}, "Thursday": {"close": "17:00", "open": "08:00"}}, "open": true, "categories": ["Doctors", "Health & Medical"], "city": "Phoenix", "review_count": 9, "name": "Eric Goldberg, MD", "neighborhoods": [], "longitude": -111.98375799999999, "state": "AZ", "stars": 3.5, "latitude": 33.499313000000001, "attributes": {"By Appointment Only": true}, "type": "business"}

A well indented JSON data object should be similar to the example in Code Listing 1 in the article http://www.oracle.com/technetwork/issue-archive/2015/15-jan/o15dba-2372211.html

Solution

1) Extract the first JSON data object in the file yelp_academic_dataset_business.json  and store it separately in business.json

2) Now run the command

cat business.json | python -m json.tool > business_one_json_document.json


3) The business_one_json_document.json file now contains the JSON data object with indented key/value pairs and therefore more readable.

{
    "attributes": {
        "By Appointment Only": true
    },
    "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
    "categories": [
        "Doctors",
        "Health & Medical"
    ],
    "city": "Phoenix",
    "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018",
    "hours": {
        "Friday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Monday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Thursday": {
            "close": "17:00",
            "open": "08:00"
       },
        "Tuesday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Wednesday": {
            "close": "17:00",
            "open": "08:00"
        }
    },
    "latitude": 33.499313000000001,
    "longitude": -111.98375799999999,
    "name": "Eric Goldberg, MD",
    "neighborhoods": [],
    "open": true,
    "review_count": 9,
    "stars": 3.5,
    "state": "AZ",
    "type": "business"
}


Wednesday, July 1, 2015

Cloudera Quickstart VM 5.3 Apache Pig configuration

Introduction

Cloudera provides a pseudo-distributed node for working with Apache Hadoop. It is called the Cloudera Quickstart VM. While most tools in the Hadoop ecosystem such as Apache Sqoop and Apache Hive work right out of the box , Apache Pig requires some additional configuration to make it work smoothly. This blog post provides the details of such additional configuration steps.


Solution

1. Open a new Terminal.
2. su - root (Enter cloudera as the password)
3. cd /etc/pig/conf
4.

a. mv log4j.properties log4j.properties.orig (Let us make a copy of the default file)
b. cp -p log4j.properties.orig log4j.properties

5. 

a. mv pig.properties pig.properties.orig (Let us make a copy of the default file)
b. cp -p pig.properties.orig pig.properties

6. Edit log4j.properties as below

a.Replace log4j.logger.org.apache.pig=info, A with the below
 log4j.logger.org.apache.pig=error, A
b. Then add a new line log4j.logger.org.apache.hadoop=error, A

7. Edit pig.properties as below

a. Uncomment (remove the #) the line log4jconf=./conf/log4j.properties if it is already commented and let the line start with no blank spaces.

b. Replace the line starting with #clustername with quickstart.cloudera:50010

quickstart.cloudera:50010 is the Hadoop cluster name in the Quickstart VM. You can find this information by running the hdfs dfsadmin -report command.

8. chmod -R o+w /etc/pig/conf.dist


9. cp -p /usr/lib/hadoop/lib/slf4j-api-1.7.5.jar /usr/lib/hive/lib


Conclusion


The above steps will help avoid the following errors when Pig is run in interactive mode using the Grunt shell.

ls: cannot access /usr/lib/hive/lib/slf4j-api-*.jar: No such file or directory


WARN pig.Main: Cannot write to log file: /etc/pig/conf.dist/pig_1435724561990.log

ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1435707575650_0004' doesn't exist in RM.