Thursday, September 24, 2015

Sqooping Data Into Hive!

From previous posts everyone knows that I recently deployed HDP 2.3.  My reason for deploying it is to begin collecting data from various databases (of various makes) and reporting on the data contained within.  To that end, I needed to begin sqooping the required data into Hive.  I had a test database to use just to make sure that we were able to connect and then load some data.  Below I will outline the steps you need to take.

First, make sure you have the driver you need to actually connect to the database.  In my case it was SQL Server from Microsoft.  Hortonworks provides some documentation on how to set this up, but since that is written for the sandbox, I will write from the side of a production cluster with multiple nodes.

Confirm in Ambari which nodes have the Sqoop client installed.  For each server with a client you will need to copy the JDBC jar to a location on the server so it can be used.  Using curl, download the JDBC driver from Microsoft:

curl -L '<URL of the Microsoft JDBC driver tarball>' | tar xz

That will download the archive and extract it for you.
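
Depending on the driver version, sqljdbc4.jar usually ends up inside the extracted directory rather than at the top level, so confirm where it landed (the directory name here is only an example):

ls sqljdbc_*/enu/sqljdbc4.jar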

Now with 2.3 the location where you need to put this jar is different, so from the Ambari host (since you should have passwordless SSH set up) perform the following for each node with a Sqoop client:

scp sqljdbc4.jar root@<hostname>:/usr/hdp/

That will place it in the needed folder and allow you to utilize the driver.  Hortonworks states that you should restart afterward; I did not, and I had no issues.
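
If you have more than a couple of Sqoop client nodes, a small loop from the Ambari host saves some typing.  This is just a sketch, assuming a hypothetical hosts.txt that lists one Sqoop client hostname per line:

for host in $(cat hosts.txt); do
  scp sqljdbc4.jar root@${host}:/usr/hdp/
done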

With the driver in place you can now attempt to connect to the database server and see what databases are available:

sqoop list-databases --connect jdbc:sqlserver://<server address>:<port> --username <username of database user> -P

Your user only needs to have read-only access in order to pull any data.  The -P option will prompt you for the password, but you can also use the --password option if you would like to just pass the password in on the command line.  You can also supply the password in a file (the --password-file option), and I'd go with that once you verify that everything works.
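
For reference, the password-file approach might look something like the following once you are ready to drop -P.  The file name and location are only examples; keep the permissions tight and use echo -n so a trailing newline doesn't end up in the password:

echo -n '<password>' > /home/<user>/.sqoop-pass
chmod 400 /home/<user>/.sqoop-pass

sqoop list-databases --connect jdbc:sqlserver://<server address>:<port> --username <username of database user> --password-file file:///home/<user>/.sqoop-pass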

If everything is correct you should get a list of all the databases on the machine.

For this next part I am going to cover importing an entire database (all the tables) into Hive.  In my case there will be a lot of tables, and doing them one by one is not an option.  The sticking point is that you need to be using Sqoop 1.4.4 or above (HDP 2.3 has 1.4.6).
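
If you aren't sure which version you have, you can check from any node with the Sqoop client installed:

sqoop version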

The first thing you will want to do is create the database in Hive so you have somewhere to import it.  So do the following:

Type hive on one of your nodes (confirm you have a client on the node)

hive>create database <name>;
hive>show databases;

You should see default and the newly created database.


Now you will run the following command to sqoop the entire database into your newly created Hive database:

sqoop import-all-tables --connect "jdbc:sqlserver://<address>:<port>;database=<name>;" --username <username> -P --hive-import --hive-database <database name>

You'll enter the password and the job will begin to run.  It will let you know of any issues it runs into, and you can troubleshoot from there.
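
If one particular table gives you trouble, it can help to pull just that table on its own while you sort it out (import-all-tables also has an --exclude-tables option for skipping problem tables).  This is only a sketch using the same placeholders as above; -m 1 keeps it to a single mapper so a table without a primary key doesn't need a --split-by column:

sqoop import --connect "jdbc:sqlserver://<address>:<port>;database=<name>;" --username <username> -P --table <table name> --hive-import --hive-database <database name> -m 1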

Once the job has completed, go back into Hive and do the following:

hive>show databases; <this will show the available databases>
hive>use <database name>; <this will be the new one you created>
hive>show tables; <this should show you the tables available to you>
hive>describe <table name>; <this will show you the schema that was imported with the tables>
hive>select <column name(s) you got from the describe command> from <table name>; <this will output to the screen any data within the table>
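
If a table is large, you probably don't want to dump the whole thing to your screen; a quick sanity check looks something like this (database and table names are placeholders):

hive>use <database name>;
hive>select count(*) from <table name>;
hive>select * from <table name> limit 10;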

And that is a quick run through of sqooping data into Hive and then performing a simple query of said data.

Monday, September 21, 2015

HDP 2.3 - Ranger Deployed!

Today, over the course of about four hours, I deployed Ranger for use on HDP 2.3.  Now when I say four hours, it is not because I ran into issues, but merely that there is a lot to configure in order to get Ranger running.  Each component that you would like to secure requires several configurations in order to work properly.  But boy is it an amazing sight when you get everything up and running.  For our setup we were aiming to use Ranger for HDFS, Hive, and Knox.  There are other components that can be secured, but we aren't currently utilizing them.

The only scary part is that as you go through the configuration you'll be presented with warnings that you just accept.  Also, there is a lot of restarting of services, which anyone with a cluster knows can be painful because the services often won't come back up.  But in my case they just about always did.  Now more than ever I am glad I moved to HDP 2.3, because things seem to be much smoother with it than with previous versions.

Friday, September 18, 2015

HDP 2.3 - Kerberos Deployed

Today I got Kerberos up and running! Since I am using FreeIPA, you have to do all the work of creating the needed accounts and generating the principals' keytabs yourself. Hortonworks provides a tutorial, but my concern was the fact that I am working with 13 nodes, not just 1. My thought was that I would need to break the kerberos.csv into individual files and import one for each node. Very wrong! What I didn't realize was that each time you run the script and issue keys, the keys change, which invalidates any keytabs issued earlier. Thus, when I tried to run the test commands after doing all of the work on each server, the commands would fail. I Googled for about two hours before I came across an unrelated article that discussed keys in Kerberos and said that with each generation the keys would change.
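
As far as I can tell, the underlying reason is that pulling a keytab out of FreeIPA generates a brand new key for the principal, so any keytab exported earlier for that same principal stops working. A quick illustration (the server, hostname, realm, and keytab path are placeholders, not my actual values):

# each ipa-getkeytab run re-keys the principal, invalidating older keytabs for it
ipa-getkeytab -s <ipa-server> -p nn/<hostname>@<REALM> -k /etc/security/keytabs/nn.service.keytab
# verify the fresh keytab actually authenticates on the node it was issued for
kinit -kt /etc/security/keytabs/nn.service.keytab nn/<hostname>@<REALM>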

Thus I went to the last server I worked on and confirmed that the test commands worked on it. I created a spreadsheet of which keytabs were where, with the thought that I would move them around as needed. I knew this morning would be a nightmare if I had to do that. Thankfully, a Hortonworks engineer let me know that I could delete and reissue the keys (without separate files, since the script pulls the needed ones based on the host) and then copy over only the keytabs for services whose keys don't change per host. The only catch is that I would need to fix the permissions on the copied files (which the script normally does) to allow them to work.

So what I did was go through the steps to generate the script and make the keys on one box. Then I would connect to another box, transfer the kerberos.csv, and generate the script, but comment out the service principals (basically, anything that didn't have the name/hostname form). I would run the script, which grabs the host-specific keys, and also save a copy of the script under another name. From there I would look to see which service keytabs I needed to copy over and comment out the ones I didn't need. I adjusted the copied script to remove the keytab-issuing commands and kept only the chown and chmod commands. Once I completed that, I restarted the NameNode and Secondary NameNode. Then I shut down all services and started them back up.
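
A trimmed, permissions-only script ends up being just a list of chown and chmod lines. The keytab names, owners, and modes come from the generated script and vary by service; these two are only illustrative examples:

chown hdfs:hadoop /etc/security/keytabs/hdfs.headless.keytab
chmod 400 /etc/security/keytabs/hdfs.headless.keytab
chown ambari-qa:hadoop /etc/security/keytabs/smokeuser.headless.keytab
chmod 440 /etc/security/keytabs/smokeuser.headless.keytab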

Here I will point out that Accumulo had issues from the start, and I ultimately used the Ambari API to remove it (example below) since I won't be using it. I did a quick test to confirm that now only users in Kerberos could access Hadoop and, whoa, success! Monday I will be working on deploying Ranger, and then I can begin to import data into my cluster!
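
In case it helps anyone, removing a service through the Ambari REST API is roughly a stop followed by a delete. The credentials, host, and cluster name below are placeholders:

curl -u admin:<password> -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo":{"context":"Stop Accumulo"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' http://<ambari-host>:8080/api/v1/clusters/<cluster>/services/ACCUMULO

curl -u admin:<password> -H 'X-Requested-By: ambari' -X DELETE http://<ambari-host>:8080/api/v1/clusters/<cluster>/services/ACCUMULO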

Deployed HDP 2.3 Two Days Ago!

Haven't posted a topic in a while, so I figured, why not post about my successful deployment of Hortonworks Data Platform 2.3? This was actually the third HDP deployment I have done. The first time was a pure nightmare! My configuration of the operating system on each of my 13 servers was terrible, and HDP exposed all of those mistakes. So I wiped each of the servers, redeployed the operating system, and built a script to check all of the items I had missed (the script works really nicely, and I also ended up finding a bash script from Hortonworks that did a lot more than mine). I successfully deployed Hadoop and only had a few issues to contend with (I had to rewipe two servers and redeploy, but then realized I could have fixed them without wiping). When I went to the Hortonworks Data Science course they had just released HDP 2.3, and at first I was against even looking at it. I had our cluster (HDP 2.2) up and Kerberized, and the only item I had issues with was deploying Ranger.

As I reviewed it, I saw how many changes they had made and how well HDP 2.3 would work for us. First, they enabled it to work with FreeIPA again. HDP 2.1 worked with FreeIPA, but HDP 2.2 changed things so that Ambari could make the Kerberos keys automatically, and because of how FreeIPA manages principals, Ambari couldn't use the commands it needed to make the keys. HDP 2.3 makes it an option to either deploy the keys automatically or do it manually (thus you can use FreeIPA). Second, they made the deployment of Ranger much easier and added support for components that previously couldn't utilize it. Finally, they also added the ability to encrypt data at rest. While we don't have PII data, being able to say that we can restrict access (down to the cell level if need be), audit who does what, and encrypt the data we have will make us all feel better.

So I took two days and wiped all the servers again. I had read a lot about how difficult updating to new versions can be, and since I had no data in the system it made sense to start fresh. With all the servers wiped, I set up Ambari with a local repo and started the deployment. I almost had a heart attack when everything went off without a hitch! Of course it was short-lived, because when I went to the dashboard two of my servers could not start HDFS (one of which was a big part of my storage capacity). I tried to start the service on both and they would fail. I reviewed the logs and found that for some reason the wrong accounts had been made owner of the HDFS folders. I changed the owner and tried to start the service again. Failed! I looked again and found that the cluster ID in one file on both servers was wrong. So I went into the VERSION file and adjusted the cluster ID. Started and failed again. This time I found that the node ID was wrong on some of the drives. Fixed that, started again, and bam, one of the servers was running properly. But the other server still would not start, and I found that was because it was looking for 15 drives that did not exist. Because of those 15 failed drives the service would not start. I had to make a configuration group so that I could adjust the config to use just the two drives it actually has, and bam, I was up.
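
If you hit the same thing, the cluster ID lives in a VERSION file under each data directory. The paths below are the Ambari defaults and may well differ on your cluster; the fix is to make the DataNode's clusterID match the NameNode's and then try the start again:

cat /hadoop/hdfs/data/current/VERSION
cat /hadoop/hdfs/namenode/current/VERSION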

I also want to point out that Hortonworks engineers are very helpful. I found that a set of their tutorials had a lot of issues. I emailed the engineer and we had several conversations about a number of things. He was really great about helping me through some issues (with 2.2). I added that it didn't make a lot of sense to release tutorials that utilized their sandbox, because that doesn't help engineers in the real world. With version 2.3 they made the change to using a single-node cluster, and that made a world of difference as well. Tomorrow I'll begin enabling Kerberos and possibly deploying Ranger. I'll keep everyone informed!

Friday, July 31, 2015

Becoming a Big Data Nerd!

     Security was where I always thought I would see myself.  My undergraduate career was spent in security classes, learning about just about every topic a security professional should.  The hard reality of that education was that without experience, no one is going to hire you as a security professional.  I know many would say that isn't true, but generally it is.  Of all the positions I applied for when I finished school, only around three that were security related actually got back to me.  I got through the process for two of them, but in the end they fell through.

     After that it was crunch time, and that meant taking whatever job I could to start paying my student loans.  That's when I began to learn that you have to start somewhere, and once you start gaining that professional experience, then you can move to the area you want.  When I speak to those new to the industry, I tend to use this analogy:  "you can't secure a technology if you have no foundation in the setup and operation of said technology."

    Four years into my career my ship had finally arrived!  I was hired to perform security-related work.  For the first year I was reviewing network designs, advising on network changes (to make sure they held up to security standards), auditing platforms, and keeping the industry in which I work informed of the newest security threats.  Things began to change a little bit when actual monitoring was made a priority for the unit I am in.  It was always part of the unit's work, but it was simplistic at the time (Nagios for website monitoring and NetFlow Analyzer for reviewing NetFlow).  A member of my unit came up with the idea of centralizing everything we monitor (with some additions) into one location.

     To that end we moved to the ELK stack (Elasticsearch, Logstash, and Kibana).  I had never heard of it, but management bought in and sent us to training.  From there, with a lot of trial and error, we got the system up and running.  I spent a great deal of time getting ELK to stay up longer than a few days.  To date, my cluster has been up for over 100 days (since I accidentally stopped it), and we are handling 63 events per second.

     With that in place, the same team member then suggested that we move into more data analytics and start looking at Hadoop.  He had used applications built on it and felt it would definitely be worthwhile for our unit's mission.  Obviously, we had to make it a little simpler, because building a Hadoop cluster is tough when you have to figure out which components to use and whether they work together.  My boss charged me with finding a distribution to use.  In my research I found that our choices were Cloudera or Hortonworks.  Cloudera seems to have much more information on the web, but they charge for any of the features you'd really want to use.  Hortonworks, in turn, gives away everything and charges for support along with training.

    From there I was instructed to find training.  I found a really good training center (/training/etc if you are looking for quality training from awesome instructors).  I was sent for the Operations course designed by Hortonworks.  My boss and another coworker attended the Data Analyst course.  This past week my boss and I completed the Data Science course.

    About a month ago I started reading more and more about data analytics.  While security is important, detection and analysis of data is the bread and butter.  I decided that after this training course, if I was still interested, I would move down the analytics road.  The course was amazing, and while it was a 10,000-foot view, I knew where I should head.  Experience, as with anything in technology, is important, and on a daily basis I work with the tools of Big Data.  I run a Hadoop cluster and analyze various system/network data in Elasticsearch.  I'll be applying for a Master's in Analytics and hopefully in two years I will move into Data Engineering.

    The moral of the story?  The journey you start doesn't always have the end point you think.

Tuesday, July 7, 2015

eLearnSecurity - Web Application Penetration Testing Certification

Today I received an email about getting 40% off the course if I registered by 7/10/15.  I jumped on it and purchased the Elite version.  Thus, starting tonight, I will begin posting reviews of each module of the course as I complete them.  I need to wrap this up in two months as I will be beginning grad school in August.

Monday, June 29, 2015

Hadoop Had a Case of the Mondays!

"Let me ask you something. When you come in on Monday and you're not feeling real well, does anyone ever say to you, "Sounds like someone has a case of the Mondays?" " - Peter Gibbons - Office Space

As always, the weekend goes way too fast and once again we are back in the office.  Typically, either Friday or Monday, I will run the updates for the servers that I manage.  During this time I also review my Hadoop cluster to make sure it is running properly.  This morning I noted that the Ambari Metrics Collector service was not running.  All other services appeared to be fine.

Hoping it was a one-off, I went ahead and tried to start the service via Ambari.  As you run HDP more and more, you start to get a feel for when something is just not going to work.  In this case the start hung at 9% and I knew that it was not going to finish.  I waited for the timeout and sure enough it didn't start.  Everything else appeared to be fine, so I went ahead and put the server in maintenance mode.  From there I rebooted the server.

Things went from bad to worse.  Now I was getting "heartbeat lost" as the status for half the services, and the other half were showing as stopped (better than heartbeat lost, because at least you can try to start them).  When I tried to start all the services I would get an error about being unable to talk to the ZooKeeper service.  I rebooted again and the same issue continued.  Finally, I said to myself, let's shut the server down completely and then start it back up.  I couldn't help but think about Jurassic Park ("Hold on to your butts!").  I brought the server back up and everything was in the red "Stopped" status.  I hit the "Host Actions" button and selected "Start All Components".  Bam, everything went to green "Started".
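
Incidentally, if the UI is being stubborn, the same "start everything on this host" action can also be driven through the Ambari REST API.  The credentials, host, and cluster name here are placeholders:

curl -u admin:<password> -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo":{"context":"Start All Components"},"Body":{"HostRoles":{"state":"STARTED"}}}' "http://<ambari-host>:8080/api/v1/clusters/<cluster>/hosts/<hostname>/host_components?HostRoles/state=INSTALLED"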

Moral of the story?  Shutting the server down is probably your best option when dealing with services not coming up.