Tuesday, August 18, 2015

Apache Spark Training at DBA University

We have included Apache Spark topics as part of our Big Data with Hadoop training course. Spark is the most active Big Data open source project and became a top-level Apache project in February 2014. In our course, you get the opportunity to process large real-world JSON data sets using Apache Spark, both on a single-node pseudo-distributed Hadoop cluster and on a three-node Hadoop cluster in the cloud.

The Apache Spark topics, along with the lab work, are listed below.

1) How to use the Spark API in Python.
2) How to process large JSON data sets using the Spark SQL context in the spark shell (the Scala-based shell).
3) How to obtain data from external databases using the built-in JDBC data source in Spark.
4) How to add the Spark service to a three-node Hadoop cluster.
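
As a taste of the lab work, the sketch below shows topics 2 and 3 in a minimal PySpark session: reading a JSON data set with the SQL context and pulling a table from an external database over JDBC. The file path, JDBC URL, table name, and credentials are placeholders only, and the code assumes the Spark 1.x SQLContext API that was current at the time of this course.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="json-and-jdbc-demo")
sqlContext = SQLContext(sc)

# Topic 2: load a JSON data set (one JSON object per line) into a DataFrame
business = sqlContext.read.json("hdfs:///user/training/yelp_academic_dataset_business.json")
business.printSchema()
business.registerTempTable("business")
sqlContext.sql("SELECT city, COUNT(*) AS cnt FROM business "
               "GROUP BY city ORDER BY cnt DESC").show(10)

# Topic 3: read a table from an external database using the built-in JDBC data source
# (the URL, table, and credentials below are placeholders)
employees = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://dbhost:3306/hr",
    dbtable="employees",
    user="training",
    password="training").load()
employees.show(5)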


If you are interested, please register at http://dbauniversity.webex.com/

The DBA University difference

1) All faculty are based in the USA, and training is delivered through video-enabled web conferencing software.
2) Custom-written eBooks (about 300 pages of training material per course).
3) Cloud-based remote lab server access for each student.
4) Our faculty and student video portal - http://video.dbauniversity.com
5) Taught using interactive whiteboard technology.

Sunday, August 2, 2015

Oracle RMAN Active Database Duplication example

Introduction

Oracle RMAN Active Database Duplication is a feature introduced in Oracle 11g R1. It allows you to make a copy of a database without relying on a backup of the source database: RMAN copies the source database (the active database) at run time instead of using its backups for the duplication. This option is very attractive when one is tasked with a "test database" to "test database" copy (a cloning exercise), because source test databases do not usually have backup jobs scheduled. Although it can be used for "production database" to "test database" copies as well, that is not recommended, because the RMAN duplicate command channels are launched on the source database host, causing high load on the production database server.

Because the active database duplication option does not rely on a backup, it cannot be used to perform database point-in-time recovery (DBPITR). Additionally, the source database must be in archivelog mode if the duplication is to be performed without downtime on the source database. Otherwise, the source database must be shut down (shutdown immediate) and restarted in mount mode (startup mount).
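
A quick way to confirm the log mode of the source database before starting is shown below. This is a minimal SQL*Plus sketch to be run as SYSDBA; the restart commands are only needed if you decide to enable archivelog mode.

-- Check whether the source database is in archivelog mode
SELECT log_mode FROM v$database;

-- If it reports NOARCHIVELOG and a brief outage is acceptable,
-- archivelog mode can be enabled with a clean restart:
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;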

Prerequisites

1) The source database must be in archivelog mode if one wants to duplicate it without shutting it down

2) If the source database is a test database and enabling archivelog mode is not necessary, then shut it down (shutdown immediate) and start it in mount mode (startup mount).

3) When RMAN is connected to the auxiliary instance (the destination, to-be-duplicated instance), it must be provided with a net service name for that instance. This requirement applies even if the auxiliary instance is on the local host. Therefore, please add a TNS entry for the destination instance to the tnsnames.ora file of the source database.

4) The source database and auxiliary instances must use the same SYS user password, which means that both instances must have password files.

5) Dynamic service registration cannot be used for the destination database while the RMAN active duplicate command is being executed. Only static service registration will allow the RMAN active duplicate to succeed (an example static listener registration is shown after this list).

6) Firewall access must be opened from the source database server to the destination database server on the destination database listener's port number (if they are on two different servers).
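
For prerequisites 3 and 5, the configuration entries look roughly like the sketch below. The host name, port, and ORACLE_HOME path are placeholders and must be adjusted to your environment.

# listener.ora on the destination server -- static registration of DESTDB
SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (SID_NAME = DESTDB)
      (ORACLE_HOME = /u01/app/oracle/product/11.2.0/dbhome_1)
    )
  )

# tnsnames.ora on the source server -- net service name for the auxiliary instance
DESTDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = destdb-host.example.com)(PORT = 1521))
    (CONNECT_DATA = (SID = DESTDB))
  )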

Assumptions

1) The SYS password is sys on both the source and destination (to-be-duplicated) databases.
2) SRCDB and DESTDB are the names of the respective source and destination databases.
3) SRCDB uses an spfile (server parameter file).

Prepare destination database (to-be-duplicated database)

1) Log in as the oracle user on the destination database server.

2) Create new directories on this server that exactly match the source database's diagnostic_dest and fast recovery area directories, because the RMAN active database duplicate command in the next section expects the same directory paths to exist on the destination server as well.

3)  Start the listener process.
4)   cd  $ORACLE_HOME/dbs
5)   rm  -f  orapwDESTDB     (Let us remove any existing password file. Replace DESTDB with the actual destination database name).

6)  orapwd  password=sys   file=orapwDESTDB     (Replace DESTDB with the actual destination database name).

7) cd $ORACLE_HOME/dbs

8) Remove any existing spfile or pfile for the destination database.

9) Create a new initDESTDB.ora with just one parameter. (Replace DESTDB with the actual destination database name.)

db_name = DESTDB

10) sqlplus   '/   as   sysdba'
11) startup   nomount;
12) exit;

Run the active database duplication command on the source database server

1) Log in as the oracle user on the source database server
2) cd  $ORACLE_HOME/dbs

3) orapwd  password=sys   file=orapwSRCDB (If it does not exist already)
4) sqlplus   '/   as  sysdba'
5) alter   user   sys  identified  by  sys;
6) exit;

Note: Please replace SRCDB and DESTDB with the actual source and destination database names in the active duplicate command below.

7) tnsping   DESTDB   (make sure that the destination database listener is accessible through SQL*Net from the source database server; if not, add an entry to tnsnames.ora and also make sure the listener is started on the destination server).
8) rman
9) connect   target   sys/sys@SRCDB;
10) connect   auxiliary   sys/sys@DESTDB;    
11)

run
{
allocate channel d1 type disk;
allocate channel d2 type disk;
allocate channel d3 type disk;
allocate channel d4 type disk;
allocate auxiliary channel a1 type disk;
allocate auxiliary channel a2 type disk;
set newname for database to '/u01/data/DESTDB/%b';
set newname for tempfile 1 to '/u01/data/DESTDB/temp01.tmp';
DUPLICATE TARGET DATABASE
  to DESTDB
  FROM ACTIVE DATABASE
  PASSWORD FILE
  SPFILE
  PARAMETER_VALUE_CONVERT 'SRCDB','DESTDB'
  SET CONTROL_FILES='/u01/data/DESTDB/c1.ctl'
  LOGFILE
  GROUP   1  ('/u01/data/DESTDB/redo1.log')   SIZE    50M,
  GROUP   2  ('/u01/data/DESTDB/redo2.log')   SIZE    50M ;
}


12) exit;

Friday, July 31, 2015

How to install Oracle database express edition on Linux

The instructions in this blog post are validated to work on CentOS 6.4 for installing the Oracle 11g R2 database express edition. The same instructions should work on other Linux operating systems such as Red Hat Enterprise Linux and Oracle Enterprise Linux. This blog post also assumes that Oracle software is being installed for the first time on the Linux server; change or skip some instructions if that is not the case.



1) Open a new terminal

2) sudo su - root
3) mkdir /u01  (If it does not exist)
4) useradd oracle (If it does not exist)
5) groupadd dba (If the group dba does not exist)

6) usermod -g dba oracle

7) chown -R oracle:dba /u01

8) Edit /etc/oraInst.loc and add the below content.

inventory_loc=/var/opt/oracle/oraInventory
inst_group=dba

9) mkdir -p /var/opt/oracle/oraInventory

10) chown -R oracle:dba /var/opt/oracle/oraInventory

11) Install any Linux packages (RPMs) that are missing on the server (see the example yum command after this list). The list of Linux version 6 packages required for installing 11g R2 is specified in the quick installation guide https://docs.oracle.com/cd/E11882_01/install.112/e24326.pdf

12) Download Oracle database express edition software from http://www.oracle.com/technetwork/database/database-technologies/express-edition/downloads/index.html for Linux.

13) The downloaded file name is oracle-xe-11.2.0-1.0.x86_64.rpm.zip

14) Login as root
15) unzip oracle-xe-11.2.0-1.0.x86_64.rpm.zip
16) cd Disk1
17) rpm -i oracle-xe-11.2.0-1.0.x86_64.rpm 
18) /etc/init.d/oracle-xe configure


19) ps -fu  oracle  (Verify that the background processes for the XE database instance have started)
20) sudo su  -  oracle  (Switch to the oracle operating system user)
21) Edit /home/oracle/.bash_profile and add the below lines.

export ORACLE_HOME=/u01/app/oracle/product/11.2.0/xe
export PATH=$ORACLE_HOME/bin:$PATH
export ORACLE_SID=XE
export TNS_ADMIN=$ORACLE_HOME/network/admin

22) sqlplus '/ as sysdba'   (Log in to the database with DBA privilege)
23) alter user hr account unlock;  (The sample HR schema is locked by default)
24) alter user hr identified by hr; (Reset the password of HR schema)
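
For step 11, missing packages can be installed with yum. The package names below are only a representative example; the authoritative list is in the quick installation guide linked in that step.

# Example only -- consult the quick installation guide for the complete package list
yum install -y libaio libaio-devel bc make binutils gcc glibc sysstat unzip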

References: http://docs.oracle.com/cd/E17781_01/install.112/e18802/toc.htm#XEINL123

How to format a JSON file

Introduction

JSON, or JavaScript Object Notation, is an open standard format that uses human-readable text to transmit data objects in the form of attribute-value pairs (also referred to as key/value pairs). It is now considered an alternative to XML for transmitting data between a server and an application.

It is platform independent, and code for parsing and generating JSON documents is available in many programming languages. Furthermore, it enjoys a lot of popularity among developers building data sets accessed through APIs (Application Programming Interfaces); popular examples include the Twitter API and the Yelp.com API. The official Internet media type for JSON is application/json, and the JSON filename extension is .json.

Problem

The Yelp.com academic data sets are available in JSON format. One can download the data sets by visiting the web page https://www.yelp.com/dataset_challenge/dataset

The file downloaded is yelp_dataset_challenge_academic_dataset.tgz. Once the contents are extracted using tar -xvf yelp_dataset_challenge_academic_dataset.tgz command, the individual .json files are available for use. 

They include yelp_academic_dataset_business.json, yelp_academic_dataset_review.json, and yelp_academic_dataset_tip.json, among others. Each JSON file contains several hundred JSON data objects, one per line. Because the key/value pairs in each of these JSON data objects are not indented, it becomes tough to comprehend keys that are nested, keys that have an array of values, or a combination of both.

For example, the first JSON data object (first line) of the file yelp_academic_dataset_business.json  is stored as 

"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", "hours": {"Tuesday": {"close": "17:00", "open": "08:00"}, "Friday": {"close": "17:00", "open": "08:00"}, "Monday": {"close": "17:00", "open": "08:00"}, "Wednesday": {"close": "17:00", "open": "08:00"}, "Thursday": {"close": "17:00", "open": "08:00"}}, "open": true, "categories": ["Doctors", "Health & Medical"], "city": "Phoenix", "review_count": 9, "name": "Eric Goldberg, MD", "neighborhoods": [], "longitude": -111.98375799999999, "state": "AZ", "stars": 3.5, "latitude": 33.499313000000001, "attributes": {"By Appointment Only": true}, "type": "business"}

A well-indented JSON data object should be similar to the example in Code Listing 1 in the article http://www.oracle.com/technetwork/issue-archive/2015/15-jan/o15dba-2372211.html

Solution

1) Extract the first JSON data object in the file yelp_academic_dataset_business.json  and store it separately in business.json

2) Now run the command

cat business.json | python -m json.tool > business_one_json_document.json


3) The business_one_json_document.json file now contains the JSON data object with indented key/value pairs, making it more readable (see the listing below, followed by a sketch for formatting every object in the file).

{
    "attributes": {
        "By Appointment Only": true
    },
    "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
    "categories": [
        "Doctors",
        "Health & Medical"
    ],
    "city": "Phoenix",
    "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018",
    "hours": {
        "Friday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Monday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Thursday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Tuesday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Wednesday": {
            "close": "17:00",
            "open": "08:00"
        }
    },
    "latitude": 33.499313000000001,
    "longitude": -111.98375799999999,
    "name": "Eric Goldberg, MD",
    "neighborhoods": [],
    "open": true,
    "review_count": 9,
    "stars": 3.5,
    "state": "AZ",
    "type": "business"
}
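
The json.tool approach formats one object at a time. To pretty-print every object in one of these one-object-per-line files, a short Python sketch along the following lines can be used; the output file name is made up for illustration.

import json

# Read the one-JSON-object-per-line file and write an indented copy
# of each object, separated by a blank line for readability.
with open("yelp_academic_dataset_business.json") as infile, \
        open("yelp_academic_dataset_business_formatted.json", "w") as outfile:
    for line in infile:
        line = line.strip()
        if not line:
            continue                    # skip blank lines
        obj = json.loads(line)          # parse one JSON object
        outfile.write(json.dumps(obj, indent=4, sort_keys=True))
        outfile.write("\n\n")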


Wednesday, July 1, 2015

Cloudera Quickstart VM 5.3 Apache Pig configuration

Introduction

Cloudera provides a pseudo-distributed node for working with Apache Hadoop, called the Cloudera Quickstart VM. While most tools in the Hadoop ecosystem, such as Apache Sqoop and Apache Hive, work right out of the box, Apache Pig requires some additional configuration to work smoothly. This blog post provides the details of those additional configuration steps.


Solution

1. Open a new Terminal.
2. su - root (Enter cloudera as the password)
3. cd /etc/pig/conf
4.

a. mv log4j.properties log4j.properties.orig (Let us make a copy of the default file)
b. cp -p log4j.properties.orig log4j.properties

5. 

a. mv pig.properties pig.properties.orig (Let us make a copy of the default file)
b. cp -p pig.properties.orig pig.properties

6. Edit log4j.properties as below

a. Replace log4j.logger.org.apache.pig=info, A with the below
 log4j.logger.org.apache.pig=error, A
b. Then add a new line: log4j.logger.org.apache.hadoop=error, A

7. Edit pig.properties as below

a. Uncomment (remove the #) the line log4jconf=./conf/log4j.properties if it is commented, and make sure the line starts with no blank spaces.

b. Replace the line starting with #clustername with quickstart.cloudera:50010

quickstart.cloudera:50010 is the Hadoop cluster name in the Quickstart VM. You can find this information by running the hdfs dfsadmin -report command.

8. chmod -R o+w /etc/pig/conf.dist


9. cp -p /usr/lib/hadoop/lib/slf4j-api-1.7.5.jar /usr/lib/hive/lib


Conclusion


The above steps will help avoid the following errors when Pig is run in interactive mode using the Grunt shell.

ls: cannot access /usr/lib/hive/lib/slf4j-api-*.jar: No such file or directory


WARN pig.Main: Cannot write to log file: /etc/pig/conf.dist/pig_1435724561990.log

ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1435707575650_0004' doesn't exist in RM.

Friday, May 22, 2015

A client error (AuthFailure) occurred: AWS was not able to validate the provided access credentials

Problem

Amazon Web Services (AWS) offers both a web-based management console and a command line interface (CLI). The other day we were using the CLI on an EC2 instance running the Red Hat Enterprise Linux Server release 6.4 operating system. But even after installing and configuring the CLI using the AWS CLI user guide, every CLI command failed with an "A client error (AuthFailure) occurred..." error. For example, the basic command "aws ec2 describe-instances --region us-west-2" failed with the error below.

A client error (AuthFailure) occurred when calling the DescribeInstances operation: AWS was not able to validate the provided access credentials.

Solution

The date command returned an incorrect time: whereas the actual time was Fri May 22 23:33:18 EDT 2015, it returned Sat May 23 03:33:18 EDT 2015. We then found that the NTP (Network Time Protocol) daemon was not running. Once we started the ntpd daemon, the "aws ec2 describe-instances --region us-west-2" command worked and returned the expected output in the AWS CLI's default JSON format. We followed the AWS user guide to start the ntpd daemon successfully, as shown below.


1. ntpstat
Unable to talk to NTP daemon. Is it running?

2. sudo service ntpd start
Starting ntpd:                                             [  OK  ]

3. sudo chkconfig ntpd on

4. ntpstat
unsynchronised
   polling server every 64 s

5. date

The date command then returned the correct time.

6. aws ec2 describe-instances --region us-west-2

Finally, the aws ec2 command above returned the output successfully.

Saturday, May 2, 2015

How to install and setup Oracle 12c RAC on PC using VirtualBox

Introduction

Are you spending many hours figuring out how to install and set up an Oracle 12c RAC cluster on your PC? DBA University will help you perform a successful installation and setup of Oracle 12c on your PC using the Oracle VirtualBox hypervisor if you register for our comprehensive training course on Oracle 12c ASM, RAC, and Data Guard topics.

This course is conducted by Michelle Malcher. Michelle is a highly experienced Oracle database administrator and a well-known author of several Oracle Press books. She is also an Oracle ACE Director (only a few people in the world are recognized as ACE Directors by Oracle Corporation) and a noted contributor to Oracle Magazine.


Course Timings

1) Evenings - Sun, Mon, Tue and Thu - 7 pm Central Time to 9 pm Central Time.
2) Online training, but very interactive, with whiteboard teaching methodology.


Course Details and Lab Work

1) We will provide you the VM that helps you build a hassle-free and successful 2-node RAC cluster on Linux operating systems using Oracle VirtualBox. We have taken care of all the details, and in just a week you will have a working 2-node RAC cluster on your PC.

2) We will provide detailed training material (an eBook) that lists all the hardware and software prerequisites, network setup tasks, and screenshots for a successful Oracle 12c Grid Infrastructure and Oracle 12c RDBMS software installation on the 2-node cluster.

3) We also have separate lab exercises on Oracle 12c ASM (Automatic Storage Management), including how to partition and configure disks for ASM usage using the Linux ASMLib utility, ASM mirroring, and how to replace failed disks in an ASM disk group.

4) You will also learn how to set up and maintain an Oracle 12c Data Guard environment.


Course Brochure and Registration

1) Click here to see the course topics.
2) The fee is $799 and is paid online using the above registration link.

Monday, March 23, 2015

Big Data with Hadoop Training

Introduction


We have some exciting news for you. We are starting the Big Data with Hadoop training program, and Srini Ramineni will be the faculty for this course. Srini Ramineni is the founder of DBA University and has rich experience in the IT industry in the USA managing both relational databases and Big Data with Hadoop environments. He has also completed a graduate degree credit hour program in Big Data using Hadoop at a reputed university in the USA.

The training topics include the Hadoop core components, viz. the Hadoop Distributed File System (HDFS) and the MapReduce computation paradigm. All the other major subprojects of the Hadoop ecosystem, such as Apache Sqoop, Hive, Pig, Impala, Flume, and Spark, are also included in this training course. Most of the topics have lab exercises too.


Course Highlights

1) A six-and-a-half-week course with 40 hours of total duration.
2) Training will include lab work on both a single-node machine and a multi-node Hadoop cluster.
3) Evenings - Sun, Mon, Tue - 7 pm Central Time to 9 pm Central Time
4) DBA University's training material will be provided.
5) Click here for a FREE DEMO session  (Name and Email only)


Links and Resources

1) Click here to know more about the course topics.
2) Click here for a FREE DEMO session  (Name and Email only)

Sunday, February 22, 2015

How to fix VNC Viewer black screen

Introduction

Have you seen your VNC Viewer show a black screen intermittently? We have a fix for it. This issue has been occurring intermittently on our lab servers for a few students over the past few months on the Oracle Linux Server 6.5 release.

Problem

The VNC Viewer desktop background intermittently turns into a full black screen. It does not happen all the time, and one of our students observed that the issue occurs consistently when she hovers her mouse over the VNC Config icon in the top left corner of the desktop. The issue happens even if the icon is minimized. This observation is correct.

Solution

A workaround is to simply close the VNC Config icon instead of minimizing it. But the permanent solution is to let the vncconfig daemon run in nowin mode (no window icon). Our vncserver service is owned by the oracle operating system account.

a) Log in to the server using a different protocol, such as ssh on port 22.
b) cd /home/oracle/.vnc
c) Edit the xstartup file and replace vncconfig -iconic & with vncconfig -nowin &
d) Now bounce the vncserver service using the commands below

vncserver -kill :1.0
vncserver
export DISPLAY=:1.0
xhost +

Conclusion

The -nowin parameter keeps the vncconfig functionality intact, but without the desktop icon. The actual cause of the problem could be an issue with the vnc-server RPM itself, specifically tigervnc-server-module 1.1.0-8.el6_5.x86_64.

Saturday, February 14, 2015

Learn SQL Language to work with Big Data in Hadoop

Introduction

SQL (Structured Query Language) is the most widely used language for reading and writing data in relational databases such as Oracle, MS SQL Server, and MySQL. SQL has existed since the 1970s and is also a standard of the American National Standards Institute (ANSI).

Because most traditional databases, such as corporate internal databases and customer-facing databases like those driving retail websites, use the relational database model, SQL skills have become mandatory for Information Technology (IT) staff such as developers, analysts, and database administrators, and even for software QA engineers, project managers, and IT auditors.

Big Data

Apache Hadoop, the most popular framework for storing and processing Big Data, has a related subproject called Apache Hive. Initially developed by Facebook Inc. and later open sourced, Apache Hive provides a data warehouse solution on top of Hadoop. It provides a relational database abstraction for Hadoop's Big Data (even though Hadoop's data stores are not relational databases). Of particular note is HiveQL, a SQL-like language for querying and analyzing Hadoop's data. If one is already a SQL expert, learning HiveQL is very easy, and it helps a traditional (relational database) user work with most of Hadoop's Big Data easily.

Apache HCatalog extends Hive (and has been merged into the Hive project since 2013) by publishing a RESTful interface so that even external systems can access the Hive metadata using RESTful methods. Most popular Hadoop distributions, such as the Hortonworks Data Platform (HDP), include the Hive component (and now HCatalog) as well. IBM's Hadoop distribution, BigInsights, has a similar component called BigSQL. Per IBM's blog, BigSQL is an enterprise-class SQL query engine and has better ANSI SQL support than HiveQL.
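
As a quick illustration of how familiar HiveQL feels to a SQL user, below is a hypothetical sketch that defines a Hive table over review data and queries it. The table name, columns, delimiter, and HDFS location are assumptions for illustration only.

-- Hypothetical Hive table over tab-delimited review data stored in HDFS
CREATE EXTERNAL TABLE reviews (
  business_id STRING,
  stars       DOUBLE,
  review_text STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/warehouse/reviews';

-- Average star rating per business, highest first -- plain SQL-style syntax
SELECT business_id, AVG(stars) AS avg_stars
FROM reviews
GROUP BY business_id
ORDER BY avg_stars DESC
LIMIT 10;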

DBA University's Oracle SQL PL/SQL Training course

DBA University has a comprehensive Oracle SQL and PL/SQL training course, and we encourage you to register for it if you do not have at least intermediate-level knowledge of SQL and PL/SQL. It is a very detail-oriented training course with remote lab server access for each student for practicing the SQL and PL/SQL lab exercises.


Classes: Sun, Mon, Tue, Wed evenings
Duration: 7 to 8 weeks approximately.
Lab Work: Remote lab server access for each student.

This online course will cover Relational Database Design, Oracle SQL Language and PL/SQL programming topics. 


a) Week 1     - Relational Database Design and software installations. 
b) Week 2,3,4 - Oracle SQL Language. 
c) Week 5,6,7 - Oracle PL/SQL Language. 
d) Week 8     - Final week to catch up any remaining topics. 

Saturday, January 3, 2015

How to back up a SQL Server database with the checksum option

Introduction

If you compare the Oracle database and MS SQL Server database platforms, there are some wins for Oracle and some for MS SQL Server. In this blog post, I will talk about a backup feature where Oracle is the winner.

Oracle block checking during RMAN backups

By default, an Oracle RMAN database backup computes the checksum for each block and stores it in the backup, even if the DB_BLOCK_CHECKSUM initialization parameter is set to TYPICAL. The DB_BLOCK_CHECKSUM initialization parameter applies to data files in the database and is not applicable to backups.

Therefore, one way of recognizing block media corruption in an Oracle database is when an Oracle RMAN database backup terminates with errors after it fails to verify the checksum for the corrupted blocks. So an Oracle DBA need not monitor for block media corruption separately, because the database backup command itself (if the RMAN method is used) inherently acts as such a validation tool.
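
To run the same corruption check on demand without producing a backup, RMAN also offers a validation-only mode. A minimal sketch is below; any corrupt blocks found are recorded in the V$DATABASE_BLOCK_CORRUPTION view.

RMAN> BACKUP VALIDATE CHECK LOGICAL DATABASE;

SQL> SELECT * FROM v$database_block_corruption;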

MS SQL Server page (block) checking during backups

The corresponding term in MS SQL Server for data file blocks is a page. In SQL Server, most database administrators create the backup database task through maintenance plans. But one cannot rely on these maintenance plans for page (block) checksum and torn-page detection during backups. Even if the "Verify backup integrity" option is enabled as part of the backup database task in a maintenance plan, it does nothing more than check whether the backup media is readable, without performing an actual restore. So it is superficial in nature.

Therefore, it is strongly recommended that MS SQL Server database administrators create separate T-SQL scripts or stored procedures to explicitly back up databases using the WITH CHECKSUM option.

For example, a database called AdventureWorks can be backed up using the below command.


backup database AdventureWorks  to disk = 'D:\backups\AdventureWorks_01032015.bak' with NOINIT,CHECKSUM;

Additionally, in the same T-SQL script or stored procedure, verify backup integrity using the with checksum option. For example,

restore verifyonly from disk = 'D:\backups\AdventureWorks_01032015.bak' with CHECKSUM;

SQL Server will check every page in the backup file by reading and recomputing the page checksum and then comparing it to the page checksum recorded in the backup file to ensure integrity.

Conclusion

It is recommended that MS SQL Server DBAs create a separate T-SQL script or stored procedure to back up all databases using the WITH CHECKSUM option. This blog post has been validated against Oracle 11g R2 and 10g R2 databases and MS SQL Server 2008 and 2008 R2 databases. Please note the extra time taken by a restore verifyonly with checksum command. Another observation is that a backup without a checksum and a backup with checksum take about the same time.