Tuesday, October 25, 2016

Shell scripting

The shell is an interface between the user and the kernel. It takes input from the end user and converts it into instructions the underlying kernel can understand. Unix shells include sh, bash, ksh, and others. The default shell in most Linux distributions is bash, which stands for Bourne Again SHell. The details of shell programming are as follows:

Before proceeding to programming, we must understand the capabilities of the shell. The shell has a rich collection of variables as well as parameters: environment variables, positional parameters, and special built-in parameters.
The shell's special parameters are as follows:

$$ : Contains process id of current shell.

$# : Contains the number of command line arguments.

$0 : Contains the name of the current shell / script.

$? : Contains the exit status of last executed command.

$* : Contains all the arguments as a single string.

$@ : Contains all the arguments as separate words (similar to $*, but the two behave differently when quoted).

$! : Contains process id of last background command.


Other than these, the shell has positional parameters $1 to $9 (higher positions can be reached with ${10}, ${11}, and so on). Environment variables include HOME, IFS, PATH, PS1, PS2, PWD, and so on. To get the value of any of these variables you can write echo $VARIABLENAME, as shown below:
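For reference, here is a minimal sketch of such a script; demo.sh and the sample arguments are just illustrative names. Save it and run it with a few arguments to see the parameters in action:

    #!/bin/bash
    # run as:  bash demo.sh one two three
    echo "Script name        : $0"
    echo "Process id         : $$"
    echo "Number of arguments: $#"
    echo "All arguments      : $*"
    echo "First argument     : $1"
    echo "HOME               : $HOME"
    echo "PATH               : $PATH"
    sleep 1 &                               # start a background command
    echo "PID of background  : $!"
    ls /tmp > /dev/null                     # run any command...
    echo "Exit status of last command: $?"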



For more updates keep visiting this blog or like our Facebook page.

Monday, August 29, 2016

Interview questions in big data field

Dear Technocrats,

In this post we are coming up with a series of interview preparation material for those who are looking to enter the field of big data analytics. First, let's go through some common interview tips:

  1. Listen attentively before answering any question.
  2. Be specific and precise in your answers.
  3. In today's fast-changing IT industry, recruiters focus more on well-educated candidates than on well-trained ones; understand the difference.
  4. Focus more on the outcome of learning than on the syntax of learning. Having some idea of the business use of your technology domain will be a plus.
  5. Show flexibility rather than rigidity towards any technology or platform, especially as a fresher.
  6. Asking the interviewer one or two questions about the company is considered good practice, but avoid getting into continuous arguments.
  7. A common question from the interviewer is, "When would a person in this post be called successful?"
 
Now here is a set of questions you can expect to be asked in your interview.

Basic Questions:

  1. What do you understand by big data, and what are its solution techniques?
  2. What is the difference between structured and unstructured data? Support your answer with examples.
  3. What do you know about NoSQL databases? How are they different from RDBMS?
  4. What is MapReduce? Explain its phases in detail.
  5. What is a distributed file system? How is it different from conventional file systems? Explain both with examples.
  6. What are the limitations/shortcomings of the MapReduce framework?
  7. Do you know about IBM Watson? How is it helpful in big data analytics?
  8. Is there any relation between big data analytics and cloud computing?
  9. Define horizontal scalability and its benefits in the Hadoop framework.
  10. Explain the role and working of the NameNode, DataNode, JobTracker, and TaskTracker.
  11. What is the difference between Hadoop 1.x and Hadoop 2.x?
  12. Explain sharding and its importance.

Advanced-level questions:

  1. How can Kafka be integrated with Hadoop/Spark for stream processing?
  2. What is the use of NiFi in big data processing frameworks?
  3. Which NoSQL database is suited for the storage and processing of binary data (images)?
  4. What is the difference between RDDs and DataFrames in Spark?
For more such questions, poll discussions, and technical articles on the latest technologies for big data analytics, check out the posts on the DataioticsHub Page


If you are new to big data analytics, please start by reading the basics from this post. To understand and learn the complete technology stack for big data engineering, visit DataioticsHub

Saturday, August 20, 2016

Software Defined Networks



Networking lies at the core of any IT infrastructure. We can't think of any computer-based business system that lacks networking capabilities. Good networking leads to multi-dimensional growth of a computer-based system and of any business model relying on it.
So it is time to upgrade and expand network capabilities to meet the growing needs of the IT industry. These needs have changed with the SMAC (Social, Mobility, Analytics, Cloud) model of business. Every organization wants to be connected more closely with its customers; every customer and every piece of feedback matters for an organization that wants an edge over its competitors. This demands very robust and flexible capabilities from network providers.
Because networking devices are expensive, expanding a network into new areas is quite costly. All these driving forces led to the advent of "Software Defined Networks".
   SDN introduces the concept of separating the control logic from the underlying hardware and providing centralized administration of the network. SDN enables improved networking capabilities in cloud data centers.

    From an academic research point of view you can take either the open-source tool NS-3 (Network Simulator 3) or Mininet as your simulation tool and ride the wave of SDN by contributing some good research to the community. NS-3 has an OpenFlow protocol module to showcase the functioning of SDN; the module is written in C++ with optional Python bindings. Basic knowledge of Linux is a plus in the networking domain, especially in SDN.
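For a quick hands-on start, Mininet can emulate a small SDN topology from the command line. A minimal sketch, assuming Mininet is already installed on your Linux machine:

    # create one switch connected to 3 hosts, using Mininet's default controller
    sudo mn --topo single,3
    # at the mininet> prompt, test connectivity between all hosts
    mininet> pingall
    # leave the emulation and clean up leftover state
    mininet> exit
    sudo mn -c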

Keep an eye on post update to get deeper aspects of SDN with practical exposure.

Wednesday, July 6, 2016

Twitter data download procedure

Dear Technocrats,

In this post we are going to discuss social site analysis. Social sites are now an inseparable part of personal as well as enterprise life. Analysis of comments, reviews, and feedback from social sites provides a fast, reliable, closed-loop tie-up between a service provider and its end users.

The first step in social site analysis is to get data from the site. For this, social sites provide APIs (Application Programming Interfaces) that let users fetch sample data for analysis.

Here is the procedure to download data from the Twitter public API:
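One possible sketch uses curl against the Twitter REST API v1.1; it assumes you have already created a Twitter app and obtained an application-only bearer token, and the search query "bigdata" is just an example:

    # BEARER_TOKEN must hold your application-only OAuth2 token
    curl -s -H "Authorization: Bearer $BEARER_TOKEN" \
      "https://api.twitter.com/1.1/search/tweets.json?q=bigdata&count=100" \
      -o tweets100.json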


After running this command the Twitter data is saved in the tweets100.json file. The data downloaded from Twitter looks like this:



For more frequent updates on Social, Mobility, Analytics, and Cloud visit our page.

Tuesday, May 31, 2016

Machine Learning: A new trend in big data analytics

Machine learning is the field of computer science that deals with finding patterns in available data using algorithms. These algorithms can handle huge amounts of data and try to find useful patterns, for example by training models on a dataset with clustering or classification algorithms. Learning can be either supervised or unsupervised, depending on the environment.

It is widely used in big data analytics for trend analysis, demand forecasting, and various other decision-making activities. The Apache Software Foundation has a dedicated tool for machine learning: Mahout.


Apache Mahout is an open-source tool that lets you work with various built-in machine learning algorithms for clustering and recommendation. I am showing how to run the built-in HMM (Hidden Markov Model) example on Mahout. Currently I am running Mahout in local mode.

As per the Apache Mahout instructions, first take an input pattern, save it to a file, and call Mahout to build an HMM model based on it.
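A sketch of the training step, roughly following the Apache Mahout HMM tutorial; the observation sequence and the parameter values below are only illustrative:

    # a small sequence of observed states (values 0-3), saved to a plain text file
    echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2" > hmm-input
    # train an HMM with 3 hidden and 4 observable states using Baum-Welch
    mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 -e .0001 -m 1000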


With the -o option we have created an output file containing the HMM model. Now apply this model to make a prediction of any length.
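A sketch of the prediction step, again following the Mahout HMM example; the sequence length of 10 is arbitrary:

    # generate a predicted observation sequence of length 10 from the trained model
    mahout hmmpredict -m hmm-model -o hmm-predictions -l 10
    # inspect the predicted sequence
    cat hmm-predictions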


I'll soon come up with more examples on machine learning using Mahout.

Machine learning algorithms are much easier to implement and visualize in RStudio. Please visit this post for the installation of R & RStudio. Other than that, you can also go for a Python environment to implement machine learning algorithms.

To get frequent updates on big data analytics like our CoE Big Data @ABESEC Gzb .

Tuesday, April 19, 2016

Linux Administration

Dear Technocrats,

Through this post, I am going to explore needs and solutions in Linux administration. In this world of big data, the needs for storage and processing are growing day by day. To achieve better storage and processing, recoverability, and availability of files, you must understand the architecture of your system and network. Especially for new learners of big data administration, a good knowledge of Linux administration commands is a must. For this reason, I am going to extend this post from time to time for a better understanding of the Linux architecture. First I am starting with some basic commands:

Linux is an open-source, multiuser operating system inherited from the Unix architecture. Linux has many flavors, like Ubuntu, Mint, CentOS, etc., and all of these flavors have versions that are updated on a timely basis. To work on Linux we must be comfortable with the commands described below:

[The commands run in a terminal; you can open one with Ctrl+Alt+T.]

After opening the terminal, type the commands as shown below and observe the results.
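A few basic commands to try (the output will of course differ on your machine):

    pwd                 # print the current working directory
    ls -l               # list files with permissions, owner and size
    whoami              # show the logged-in user
    date                # show the current date and time
    uname -a            # kernel and architecture details
    df -h               # disk usage in human-readable form
    free -m             # memory usage in megabytes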


There are also some network-related commands, e.g. ifconfig, arp, route, finger, traceroute, etc. Try to run all of these in your terminal and see the results, as shown below:
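A quick sketch of these network commands (finger may need to be installed separately on some distributions, and the target host is just an example):

    ifconfig            # show network interfaces and their IP addresses
    arp -a              # show the ARP cache (IP to MAC mappings)
    route -n            # show the kernel routing table
    traceroute 8.8.8.8  # trace the route to a remote host
    finger              # show information about logged-in users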



Updating the PATH variable:
Setting a path in Ubuntu can be either permanent or temporary. For a permanent entry, update the PATH variable in the /etc/environment file: append your new directory after the existing entries, separated by a colon, inside the quotes of the PATH value. For a temporary change, you can set it directly from the terminal. In the example we add a path entry where the earlier entry had been removed, so we update the PATH variable as shown:
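A sketch of both approaches; the directory /opt/mytool/bin is a hypothetical example:

    # temporary: lasts only for the current terminal session
    export PATH=$PATH:/opt/mytool/bin
    echo $PATH                       # verify the new entry

    # permanent: edit /etc/environment and append the new directory inside the
    # quotes, separated from the existing entries by a colon, e.g.
    # PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/mytool/bin"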



Home directory:

There is more than one notion of a home directory in Linux: the filesystem root (cd /), the root user's home directory (/root), and the home of the user you are logged in as (cd ~). Regular users' home directories live under the /home subdirectory of the filesystem root. To find the location of your user's home directory on your system, follow the commands shown below:
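A short sketch to locate the different "home" locations (user names will differ on your system):

    cd / && pwd             # filesystem root
    sudo ls /root           # home directory of the root user
    cd ~ && pwd             # home of the currently logged-in user
    echo $HOME              # same location, taken from the environment
    ls /home                # home directories of all regular users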


Adding and removing user accounts:
Linux is a multiuser operating system, so in the next steps you will learn how to add and remove user accounts from a Linux terminal:
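A sketch using the Debian/Ubuntu helpers; testuser is a hypothetical account name:

    sudo adduser testuser                    # create the account (prompts for a password)
    ls /home                                 # verify that /home/testuser was created
    sudo deluser --remove-home testuser      # remove the account and its home directory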



File Permissions:

Linux has file permissions rwx (read, write, execute) in three groups: owner, group, and others.

There are ten identifier bits in total. The first shows whether the element is a directory (d) or a file (-); the remaining nine form three groups of permissions (rwx) for the owner, the group, and others respectively. An example of reading and changing file permissions is sketched after the note below.

Note: the permission bits a file gets when it is first created are controlled by umask. The maximum default permissions are 666 for files and 777 for directories; the umask value (commonly 022) is masked out of these.
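A sketch of reading and changing permissions; the file name is illustrative:

    ls -l notes.txt           # e.g. -rw-rw-r-- 1 user user 120 Apr 19 10:00 notes.txt
    chmod 754 notes.txt       # owner: rwx, group: r-x, others: r
    chmod u+x,g-w notes.txt   # symbolic form: add execute for owner, drop write for group
    umask                     # show the current mask, commonly 0002 or 0022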

Editors: (vi, vim, etc...)

Working with editors in Linux also requires knowledge of their modes. vi and vim are the standard Linux editors, and they work in two modes: command mode and insert mode. By default we enter command mode when we open or create a file using (vim file_name). To write anything in the file, press "i" or "a" to enter insert mode; now you can write whatever you want. After finishing writing, press "Esc" to leave insert mode and return to command mode. To just quit, type ":q"; to save and quit, ":wq"; to quit without saving, ":q!"; then press "Enter". You will be back at your terminal. Happy editing...
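A minimal editing session for reference (notes.txt is a hypothetical file name):

    vim notes.txt     # opens (or creates) the file in command mode
    # press i   -> enter insert mode and type your text
    # press Esc -> return to command mode
    # type :wq and press Enter -> save and quit (:q! quits without saving)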



Setting up the java and javac versions:

If we have more than one Java version on our system, there may be a conflict between java and javac. To keep them in sync, follow the commands shown below:
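On Debian/Ubuntu systems this is usually handled with update-alternatives; a sketch:

    java -version                             # check which version is currently active
    sudo update-alternatives --config java    # pick the desired java binary from the list
    sudo update-alternatives --config javac   # do the same for the compiler
    java -version && javac -version           # confirm both now report the same version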



grep utility: grep is one of the most used search utilities in the Unix environment. It enables the user to perform various customized searches, as shown below:
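A few common forms of grep; logfile.txt and the pattern "error" are only illustrative:

    grep error logfile.txt        # lines containing the word error
    grep -i error logfile.txt     # case-insensitive match
    grep -c error logfile.txt     # count matching lines instead of printing them
    grep -v error logfile.txt     # invert: lines that do NOT contain the pattern
    grep -rn error /var/log       # search recursively, printing file name and line number
    ps -ef | grep java            # filter the output of another command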




For more specific commands related to Linux administration and networking, the blog will be updated on a regular basis. For shell scripting, visit this post.

For frequent updates on Big Data Analytics keep visiting this blog or like at this page .



Monday, March 14, 2016

Configuring Jaql in your Hadoop cluster

Dear Technocrats,

Jaql is a scripting language designed specifically for working with JSON files. Such files are commonly generated by web services, for example in Twitter analysis.

In this post I am showing the procedure to configure Jaql in your Hadoop cluster.
It is again a very simple procedure to follow.

Just download the Jaql tar file (Jaql was developed by IBM and also ships with InfoSphere BigInsights), extract it anywhere on your system, and copy the path of the extracted directory. Set this path as JAQL_HOME in your .bashrc file.

Then go to bin and just type ./jaqlshell. You will be taken to the Jaql shell prompt for Jaql scripting.



Note: to add the path in the .bashrc file, open it with gedit or vim:
                  $ gedit ~/.bashrc         or       $ vim ~/.bashrc
You will see the .bashrc file; just go to the end of the file and make the entry for your Jaql home as shown below:
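A sketch of the entries; the install directory /usr/local/jaql is only an example, so use the path where you extracted the tar file:

    # add at the end of ~/.bashrc
    export JAQL_HOME=/usr/local/jaql
    export PATH=$PATH:$JAQL_HOME/bin
    # reload the file in the current session
    source ~/.bashrc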


Save the file, reload it (source ~/.bashrc), and start ./jaqlshell to do your analysis using Jaql. Jaql fits best for analyzing JSON files, as in Twitter trend analysis.

Enjoy Jaql scripting.

To get more frequent updates, like our page.

or

Go to the Home-page of this blog.

Friday, March 11, 2016

Configuring HBase on a Hadoop cluster

Dear Technocrats,


Greetings...

In this post I am showing the procedure to install and use the HBase shell on your Hadoop cluster.

First download a stable HBase version from the Apache site, unzip the tar file, and go into the extracted directory. Make the necessary changes in the conf/hbase-site.xml file as shown.


 Also set JAVA_HOME=<path to your Java> in conf/hbase-env.sh, then save and exit.

Then run start-hbase.sh from bin, as shown below...
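A sketch of starting HBase and verifying that its processes are up, assuming you are inside the extracted HBase directory:

    ./bin/start-hbase.sh      # starts HBase (standalone or on HDFS, per hbase-site.xml)
    jps                       # HMaster (and HRegionServer) should appear in the list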


Now HBase is running on your Hadoop cluster as shown above. But HBase does not take you to the HBase shell by default.

To use the HBase shell, run the command ./bin/hbase shell from $HBASE_HOME. You can then work in the HBase shell and leave it when you are done by typing quit. The complete procedure is as shown:
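A short sample session in the HBase shell; the table name 'test' and column family 'cf' are arbitrary:

    ./bin/hbase shell
    hbase> create 'test', 'cf'                       # create a table with one column family
    hbase> put 'test', 'row1', 'cf:a', 'value1'      # insert a cell
    hbase> scan 'test'                               # read the table back
    hbase> disable 'test'
    hbase> drop 'test'                               # clean up
    hbase> quit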



Enjoy hbase scripting. Wish you all the best... (y)

Thursday, March 10, 2016

Pig installation on a Hadoop cluster

Dear Technocrats,


In this post, I am explaining the procedure for Pig installation on your Hadoop cluster (we have a Hadoop 2.6.0 cluster configured in our lab).

Pig configuration:
1. Download pig-0.x.x and extract it.
2. Set the Java and Pig paths in the .bashrc file.
3. Go to the pig-0.x.x/bin directory.
4. Run the command $ ./pig -x local
5. It will bring you up to the grunt shell, as sketched below:
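A sketch of the setup; the version number and install paths below are only illustrative:

    # entries added at the end of ~/.bashrc
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
    export PIG_HOME=/usr/local/pig-0.15.0
    export PATH=$PATH:$PIG_HOME/bin

    # start Pig in local mode
    cd $PIG_HOME/bin
    ./pig -x local        # ends at the grunt> prompt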



grunt is the Pig shell, so enjoy scripting with Pig on your cluster.

(Note: Pig installation needs Java 7 or higher, so if you are using a lower version, switch to a higher one.)

Working on grunt shell:

Suppose you have two files, each with two columns, and you wish to join them on a common column. It is as simple as four lines of code in the Pig terminal, as sketched below:
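A sketch of such a join on the grunt shell; the file names, delimiter, and column names are hypothetical:

    grunt> A = LOAD 'file1.txt' USING PigStorage(',') AS (id:int, name:chararray);
    grunt> B = LOAD 'file2.txt' USING PigStorage(',') AS (id:int, dept:chararray);
    grunt> C = JOIN A BY id, B BY id;
    grunt> DUMP C;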


Analysis of a sample healthcare dataset:

This is an analysis of a sample healthcare dataset with entries for patient name, disease, gender, age, etc. Here is the procedure to find the number of male patients suffering from swine flu.
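A sketch of one way to do it in Pig; the file name, delimiter, field layout, and field values are assumptions about the sample dataset:

    grunt> records  = LOAD 'healthcare.csv' USING PigStorage(',')
                      AS (name:chararray, disease:chararray, gender:chararray, age:int);
    grunt> patients = FILTER records BY disease == 'Swine Flu' AND gender == 'M';
    grunt> grouped  = GROUP patients ALL;
    grunt> counts   = FOREACH grouped GENERATE COUNT(patients);
    grunt> DUMP counts;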



For more frequent updates visit our page

Tuesday, January 12, 2016

Hadoop Benchmark application tests

Hello friends,

In the series of Hadoop benchmark application tests, another test is testbigmapoutput. It creates and tests a map/reduce program that works on a very big non-splittable file and performs an identity map/reduce on it. The command format and the result on your cluster will be as shown below.
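A sketch of the command format, assuming a Hadoop 2.x installation where the benchmark is bundled in the jobclient tests jar; the HDFS paths and the file size are only examples:

    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
        testbigmapoutput -input /bench/bigmapin -output /bench/bigmapout -create 2048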


threadedmapbench:
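Per the Hadoop benchmark listing, threadedmapbench compares the performance of maps with multiple spills against maps with a single spill. A sketch of invoking it from the same tests jar, often run with its defaults:

    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
        threadedmapbench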



For more benchmark tests, keep visiting this post; to start from installation, go to this link


To support our initiative, like our Center of Excellence page.