Monday, December 21, 2015

Running Benchmark tests in hadoop

Dear friends...


Today I am writing this post to show you how to run benchmark tests on a Hadoop cluster.

Benchmark applications already ship with the Hadoop distribution; you just need to run them to test the performance of your cluster.

The command to run TestDFSIO is as shown below:
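A typical invocation looks like the following sketch. In Hadoop 2.6.0 the TestDFSIO class lives in the jobclient tests jar; the jar path assumes the /usr/local/hadoop installation used elsewhere on this blog, the file count and size are only example values to adapt to your cluster, and the exact option names can vary slightly between versions (-fileSize vs -size), so check the usage message printed when you run the jar without arguments:

      $ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 100MB
      $ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 100MB
      $ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-tests.jar TestDFSIO -clean

Run the -write test before the -read test, because the read test reads back the files the write test created; -clean removes the benchmark files from HDFS afterwards.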


This is how you can run various tests on your Hadoop cluster. The final results of these commands will look like this:


Another benchmark test that you can run on your Hadoop cluster is mrbench. This test runs a large number of small jobs and checks how well your cluster handles them, as shown below.
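A hedged example of such a run, again assuming the jobclient tests jar from the installation directory used above; 50 runs is an arbitrary choice:

      $ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-tests.jar mrbench -numRuns 50

At the end mrbench prints the average time taken per job, which is a quick indicator of the job-setup overhead of your cluster.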


The results of this benchmark test will be shown like this:



Want to spend more time on benchmark tests? Then visit the next page.

Hope it helps you...

For more updates keep visiting this blog or our Facebook page.

Sunday, December 20, 2015

Steps to install hadoop in Ubuntu

Hello Friends...


In this blog I am explaining the procedure to install a single-node Hadoop cluster on Linux. I installed Hadoop 2.6.0 on Ubuntu 12.04. Hadoop installation needs basic working knowledge of Linux; if you don't have it yet, have a look at this post first: Linux administration.

The steps for installing hadoop are as following:

1. First open the terminal by Ctrl+Alt+T.
2. Run the update command: sudo apt-get update
First it will prompt for your password, and then it may take time depending upon your internet
speed and system update status.
3. Then install Java on your system using: $ sudo apt-get install openjdk-6-jdk
Note: I used Java version 6; you can opt for a higher version such as 7 or 8.
To change java version in your system you can run the command:
            $ update-alternatives --config java


4. Check the Java version using: $ java -version

5. Add a new group named hadoop: $ sudo addgroup hadoop
6. Then make a new user hduser in that group: $ sudo adduser --ingroup hadoop hduser
It may ask for some details like name, address, etc.; fill them in, although you may skip some of them.
7. Now install ssh for communication: $ sudo apt-get install ssh
8. Generate an RSA public/private key pair using ssh-keygen and append the public key to authorized_keys as shown in the following steps.
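A minimal sketch of these commands, run as hduser and assuming an empty passphrase so that the Hadoop scripts can log in without prompting:

      $ su - hduser
      $ ssh-keygen -t rsa -P ""
      $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys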




9. Verify the password-less connection to localhost using ssh: $ ssh localhost
10. Now download a freely available Hadoop release (I downloaded 2.6.0).
11. Untar the downloaded package using the command: $ tar xvzf hadoop-2.6.0.tar.gz

Now make the hadoop directory inside /usr/local with the command: $ sudo mkdir -p /usr/local/hadoop

12. Now change directory to this folder using: $ cd hadoop-2.6.0

13. Now move all the content of this directory to /usr/local/hadoop as shown below.
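A sketch of this move, run from inside the extracted hadoop-2.6.0 directory:

      $ sudo mv * /usr/local/hadoop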
   
14. This may throw an error like:
hduser is not in the sudoers file. This incident will be reported......
15. To deal with this error, add hduser to the sudoers file (or the sudo group) as shown below.
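One common way to do this on Ubuntu is to switch to a user that already has sudo rights (or to root) and add hduser to the sudo group; this is only a sketch, and editing /etc/sudoers via visudo works as well:

      $ sudo adduser hduser sudo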

16. Now move the folder again as tried previously and change its ownership to hduser as shown below:
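A sketch of these two commands, assuming the hadoop group created earlier and /usr/local/hadoop as the target directory:

      $ sudo mv * /usr/local/hadoop
      $ sudo chown -R hduser:hadoop /usr/local/hadoop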

17. Now we are almost done and just need to change the configuration files. The following files need to be changed:
1. ~/.bashrc
2. hadoop-env.sh
3. core-site.xml
4. mapred-site.xml
5. hdfs-site.xml

18. Open .bashrc with the command ( $ vim ~/.bashrc ) and add the Hadoop paths to it as shown below. [If vim is not already installed on your system, install it with: sudo apt-get install vim, and then try to open the .bashrc file again: $ vim ~/.bashrc ]
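A typical set of entries for a Hadoop 2.6.0 single-node setup under /usr/local/hadoop is sketched below. The JAVA_HOME path is an assumption (it depends on your architecture and Java version), so set it to the directory reported by update-alternatives --config java:

      export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
      export HADOOP_HOME=/usr/local/hadoop
      export HADOOP_INSTALL=$HADOOP_HOME
      export HADOOP_MAPRED_HOME=$HADOOP_HOME
      export HADOOP_COMMON_HOME=$HADOOP_HOME
      export HADOOP_HDFS_HOME=$HADOOP_HOME
      export YARN_HOME=$HADOOP_HOME
      export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
      export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

After saving the file, reload it with: $ source ~/.bashrc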


19. Now open hadoop-env.sh ( $ vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh ) and update it as shown below.
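The usual change in hadoop-env.sh is to hard-code JAVA_HOME; the path below is the same assumption as in .bashrc, so replace it with your actual Java directory:

      export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64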



20. Now first make a tmp directory for Hadoop as given in the step below:
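A sketch using a commonly chosen location; the /app/hadoop/tmp path is only an assumption, and any directory owned by hduser will do as long as core-site.xml points to the same path:

      $ sudo mkdir -p /app/hadoop/tmp
      $ sudo chown hduser:hadoop /app/hadoop/tmp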



Now open /usr/local/hadoop/etc/hadoop/core-site.xml and update it as shown below:
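A minimal sketch of the properties to place inside the <configuration> tags, assuming the tmp directory created above and the hdfs://localhost:54310 filesystem URI that also appears in the job logs later on this blog (fs.defaultFS is the newer name of the older fs.default.name property):

      <property>
        <name>hadoop.tmp.dir</name>
        <value>/app/hadoop/tmp</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:54310</value>
      </property>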


Now first copy the content of mapred-site.xml.template to mapred-site.xml with the command shown below:
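A sketch of that copy command, assuming the configuration directory used above:

      $ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml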



21. Now open and update mapred-site.xml; the necessary changes go inside the <configuration> tags as shown below.
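One common single-node setting is sketched here; the property and the 54311 port are assumptions for this kind of setup (on a full YARN configuration people often set mapreduce.framework.name to yarn instead):

      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:54311</value>
      </property>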



22. Now make two directories, one for the namenode and one for the datanode, and then make the corresponding updates in hdfs-site.xml.
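A sketch of the directory creation; /usr/local/hadoop_store is an assumed location which you may change, as long as hdfs-site.xml points to the same paths:

      $ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
      $ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
      $ sudo chown -R hduser:hadoop /usr/local/hadoop_store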


Updates in hdfs-site.xml
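Assuming the directories above and a replication factor of 1 for a single-node cluster (the file listings later on this blog show replication 1), the properties inside the <configuration> tags look like this:

      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
      </property>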



23. Now we are done with the configuration!
24. Let's start Hadoop now.
25. First format the namenode:
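The standard formatting command for Hadoop 2.x, run once as hduser, is:

      $ hdfs namenode -format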

26. Then start Hadoop.
27. Change to the directory where the start-all.sh file resides: $ cd /usr/local/hadoop/sbin
28. Now start Hadoop: $ start-all.sh and check the status of the node using the command: $ jps

An error-free start of the Hadoop environment will show NameNode, SecondaryNameNode, NodeManager, DataNode, ResourceManager, and Jps itself as running processes. So we are done.
29. Let's see the web interface of the NameNode and Secondary NameNode:
Namenode at port 50070 of localhost:




We are done.... All components are working fine.

30. Last one... Don't leave the Hadoop cluster without stopping the services, using the following commands:
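The stop scripts live in the same sbin directory as the start scripts, so a typical shutdown looks like this:

      $ stop-dfs.sh
      $ stop-yarn.sh

or simply the deprecated but convenient:

      $ stop-all.sh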


If you wish to make a multi-node Hadoop cluster, please refer to the instructions given in the following post: hadoop multinode installation

*****************************************************************************

Now, to run the first program on your Hadoop cluster, please follow this blog: Running first program in hadoop

For configuring hbase in your hadoop cluster visit this post

For configuration of pig in your hadoop cluster go to this pig-installation-page


For more frequent updates about Big data Analytics using hadoop please visit and like: DataioticsHub


Thanks and Regards



Saturday, December 19, 2015

Steps to run wordcount program in hadoop setup for big data Analytics

hello friends...

I am writing this blog about how to interact with an installed Hadoop environment by running the first program, wordcount. I am giving the step-by-step procedure with all terminal commands and results as executed on my node (my Hadoop version is 2.6.0), right from starting the node.


omesh@omesh-HP-240-G3-Notebook-PC:~$ sudo su hduser
[sudo] password for omesh:
hduser@omesh-HP-240-G3-Notebook-PC:/home/omesh$ cd /usr/local/hadoop/sbin/
hduser@omesh-HP-240-G3-Notebook-PC:/usr/local/hadoop/sbin$ ls
distribute-exclude.sh    start-all.cmd        stop-balancer.sh
hadoop-daemon.sh         start-all.sh         stop-dfs.cmd
hadoop-daemons.sh        start-balancer.sh    stop-dfs.sh
hdfs-config.cmd          start-dfs.cmd        stop-secure-dns.sh
hdfs-config.sh           start-dfs.sh         stop-yarn.cmd
httpfs.sh                start-secure-dns.sh  stop-yarn.sh
kms.sh                   start-yarn.cmd       yarn-daemon.sh
mr-jobhistory-daemon.sh  start-yarn.sh        yarn-daemons.sh
refresh-namenodes.sh     stop-all.cmd
slaves.sh                stop-all.sh
hduser@omesh-HP-240-G3-Notebook-PC:/usr/local/hadoop/sbin$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/12/19 11:52:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-omesh-HP-240-G3-Notebook-PC.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-omesh-HP-240-G3-Notebook-PC.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-secondarynamenode-omesh-HP-240-G3-Notebook-PC.out
15/12/19 11:52:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-omesh-HP-240-G3-Notebook-PC.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-omesh-HP-240-G3-Notebook-PC.out
hduser@omesh-HP-240-G3-Notebook-PC:/usr/local/hadoop/sbin$ jps
25596 NodeManager
24829 DataNode
25694 Jps
25351 ResourceManager
25166 SecondaryNameNode
24591 NameNode

hduser@omesh-HP-240-G3-Notebook-PC:/usr/local/hadoop/sbin$ hdfs dfs -ls /
15/12/19 11:58:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 6 items
-rw-r--r--   1 hduser supergroup     661808 2015-12-18 20:25 /hadoop-projectfile.txt
drwxr-xr-x   - hduser supergroup          0 2015-12-18 20:32 /om
drwxr-xr-x   - hduser supergroup          0 2015-12-18 20:14 /output
drwxr-xr-x   - hduser supergroup          0 2015-12-18 20:40 /output2
drwxr-xr-x   - hduser supergroup          0 2015-12-18 20:13 /user
-rw-r--r--   1 hduser supergroup     661808 2015-12-18 19:21 /wordcount

hduser@omesh-HP-240-G3-Notebook-PC:/usr/local/hadoop/sbin$ hadoop jar /usr/local/hadoop/share/
doc/    hadoop/
hduser@omesh-HP-240-G3-Notebook-PC:/usr/local/hadoop/sbin$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/
hadoop-mapreduce-client-app-2.6.0.jar              hadoop-mapreduce-client-jobclient-2.6.0-tests.jar
hadoop-mapreduce-client-common-2.6.0.jar           hadoop-mapreduce-client-shuffle-2.6.0.jar
hadoop-mapreduce-client-core-2.6.0.jar             hadoop-mapreduce-examples-2.6.0.jar
hadoop-mapreduce-client-hs-2.6.0.jar               lib/
hadoop-mapreduce-client-hs-plugins-2.6.0.jar       lib-examples/
hadoop-mapreduce-client-jobclient-2.6.0.jar        sources/
hduser@omesh-HP-240-G3-Notebook-PC:/usr/local/hadoop/sbin$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /om /output3
15/12/19 12:00:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/19 12:00:31 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/12/19 12:00:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/12/19 12:00:31 INFO input.FileInputFormat: Total input paths to process : 1
15/12/19 12:00:31 INFO mapreduce.JobSubmitter: number of splits:1
15/12/19 12:00:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local964714568_0001
15/12/19 12:00:31 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/12/19 12:00:31 INFO mapreduce.Job: Running job: job_local964714568_0001
15/12/19 12:00:31 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/12/19 12:00:31 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/19 12:00:32 INFO mapred.LocalJobRunner: Waiting for map tasks
15/12/19 12:00:32 INFO mapred.LocalJobRunner: Starting task: attempt_local964714568_0001_m_000000_0
15/12/19 12:00:32 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/12/19 12:00:32 INFO mapred.MapTask: Processing split: hdfs://localhost:54310/om/hadoop-projectfile.txt:0+661808
15/12/19 12:00:32 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/12/19 12:00:32 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/12/19 12:00:32 INFO mapred.MapTask: soft limit at 83886080
15/12/19 12:00:32 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/12/19 12:00:32 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/12/19 12:00:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/12/19 12:00:32 INFO mapred.LocalJobRunner:
15/12/19 12:00:32 INFO mapred.MapTask: Starting flush of map output
15/12/19 12:00:32 INFO mapred.MapTask: Spilling map output
15/12/19 12:00:32 INFO mapred.MapTask: bufstart = 0; bufend = 1086544; bufvoid = 104857600
15/12/19 12:00:32 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25775024(103100096); length = 439373/6553600
15/12/19 12:00:32 INFO mapreduce.Job: Job job_local964714568_0001 running in uber mode : false
15/12/19 12:00:32 INFO mapreduce.Job:  map 0% reduce 0%
15/12/19 12:00:33 INFO mapred.MapTask: Finished spill 0
15/12/19 12:00:33 INFO mapred.Task: Task:attempt_local964714568_0001_m_000000_0 is done. And is in the process of committing
15/12/19 12:00:33 INFO mapred.LocalJobRunner: map
15/12/19 12:00:33 INFO mapred.Task: Task 'attempt_local964714568_0001_m_000000_0' done.
15/12/19 12:00:33 INFO mapred.LocalJobRunner: Finishing task: attempt_local964714568_0001_m_000000_0
15/12/19 12:00:33 INFO mapred.LocalJobRunner: map task executor complete.
15/12/19 12:00:33 INFO mapred.LocalJobRunner: Waiting for reduce tasks
15/12/19 12:00:33 INFO mapred.LocalJobRunner: Starting task: attempt_local964714568_0001_r_000000_0
15/12/19 12:00:33 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/12/19 12:00:33 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@4748aec5
15/12/19 12:00:33 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334063200, maxSingleShuffleLimit=83515800, mergeThreshold=220481728, ioSortFactor=10, memToMemMergeOutputsThreshold=10
15/12/19 12:00:33 INFO reduce.EventFetcher: attempt_local964714568_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
15/12/19 12:00:33 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local964714568_0001_m_000000_0 decomp: 267009 len: 267013 to MEMORY
15/12/19 12:00:33 INFO reduce.InMemoryMapOutput: Read 267009 bytes from map-output for attempt_local964714568_0001_m_000000_0
15/12/19 12:00:33 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 267009, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->267009
15/12/19 12:00:33 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
15/12/19 12:00:33 INFO mapred.LocalJobRunner: 1 / 1 copied.
15/12/19 12:00:33 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
15/12/19 12:00:33 INFO mapred.Merger: Merging 1 sorted segments
15/12/19 12:00:33 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 267004 bytes
15/12/19 12:00:33 INFO reduce.MergeManagerImpl: Merged 1 segments, 267009 bytes to disk to satisfy reduce memory limit
15/12/19 12:00:33 INFO reduce.MergeManagerImpl: Merging 1 files, 267013 bytes from disk
15/12/19 12:00:33 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
15/12/19 12:00:33 INFO mapred.Merger: Merging 1 sorted segments
15/12/19 12:00:33 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 267004 bytes
15/12/19 12:00:33 INFO mapred.LocalJobRunner: 1 / 1 copied.
15/12/19 12:00:33 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
15/12/19 12:00:33 INFO mapreduce.Job:  map 100% reduce 0%
15/12/19 12:00:34 INFO mapred.Task: Task:attempt_local964714568_0001_r_000000_0 is done. And is in the process of committing
15/12/19 12:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
15/12/19 12:00:34 INFO mapred.Task: Task attempt_local964714568_0001_r_000000_0 is allowed to commit now
15/12/19 12:00:34 INFO output.FileOutputCommitter: Saved output of task 'attempt_local964714568_0001_r_000000_0' to hdfs://localhost:54310/output3/_temporary/0/task_local964714568_0001_r_000000
15/12/19 12:00:34 INFO mapred.LocalJobRunner: reduce > reduce
15/12/19 12:00:34 INFO mapred.Task: Task 'attempt_local964714568_0001_r_000000_0' done.
15/12/19 12:00:34 INFO mapred.LocalJobRunner: Finishing task: attempt_local964714568_0001_r_000000_0
15/12/19 12:00:34 INFO mapred.LocalJobRunner: reduce task executor complete.
15/12/19 12:00:34 INFO mapreduce.Job:  map 100% reduce 100%
15/12/19 12:00:34 INFO mapreduce.Job: Job job_local964714568_0001 completed successfully
15/12/19 12:00:35 INFO mapreduce.Job: Counters: 38
    File System Counters
        FILE: Number of bytes read=1075078
        FILE: Number of bytes written=1845581
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1323616
        HDFS: Number of bytes written=196183
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Map-Reduce Framework
        Map input records=12761
        Map output records=109844
        Map output bytes=1086544
        Map output materialized bytes=267013
        Input split bytes=113
        Combine input records=109844
        Combine output records=18039
        Reduce input groups=18039
        Reduce shuffle bytes=267013
        Reduce input records=18039
        Reduce output records=18039
        Spilled Records=36078
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=9
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=429260800
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=661808
    File Output Format Counters
        Bytes Written=196183

hduser@omesh-HP-240-G3-Notebook-PC:/usr/local/hadoop/sbin$


Now if you want to run some benchmark tests in your hadoop cluster, please follow the link


For more frequent updates on big data analytics using Hadoop please like this page https://www.facebook.com/coebda/

Wednesday, August 5, 2015

Explanation of protocols in ns-2.35

Network Simulator-2
Network Simulator-2 (NS-2) is an efficient, open-source tool freely available to the research community. NS-2 provides support for simulation of both wired and wireless networks. In wireless simulation we mainly focus on MANETs, i.e., Mobile Ad hoc Networks. As part of its MANET protocols, NS-2 includes AODV, DSDV, DSR, etc. Apart from the common binding architecture of the NS package, each protocol has its own functioning and architecture, and understanding these is a tough task for an NS-2 newbie. So in this post I am trying to give an overview of these protocols for those who want to work on any of them.


AODV:
AODV code in NS-2.35: There is an aodv folder in the ns-2.35 directory which contains the main code for all AODV functionality. The files in the aodv folder include aodv.cc, aodv.h, aodv_logs.cc, aodv_packet.h, aodv_rtable.cc, aodv_rtable.h, aodv_rqueue.cc, and aodv_rqueue.h. For every .cc file there is an entry in the Makefile, which integrates the whole NS package; building it produces a .o file for each .cc file.

aodv.cc is the main file which has the binding with Tcl. It configures the AODV routing agent on every MANET node where AODV is requested, and the corresponding functions such as route discovery are then called according to the needs of the simulation script. The AODV protocol uses various types of timers, such as HelloTimer, BroadcastTimer, NeighborTimer, and RouteCacheTimer.
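For reference, a typical fragment of a Tcl simulation script that attaches AODV to the mobile nodes is sketched below; the variable names ($topo, the interface queue length, and so on) are placeholders from my own scripts, not values taken from a particular example in ns-2.35:

      # every mobile node created after this call will run AODV
      $ns node-config -adhocRouting AODV \
                      -llType LL \
                      -macType Mac/802_11 \
                      -ifqType Queue/DropTail/PriQueue \
                      -ifqLen 50 \
                      -antType Antenna/OmniAntenna \
                      -propType Propagation/TwoRayGround \
                      -phyType Phy/WirelessPhy \
                      -channelType Channel/WirelessChannel \
                      -topoInstance $topo \
                      -agentTrace ON \
                      -routerTrace ON \
                      -macTrace OFF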

There is a corresponding entry in the cmu-trace.cc file in the trace folder, which traces all events when the protocol is AODV and writes the resulting trace data to the trace file for analysis after the simulation.

There is a wide range of extension possibilities in the AODV code, including security extensions, trust implementation, attack detection and prevention, use of genetic algorithms to select the best path, and many more...

This post will be extended soon...

Sunday, January 18, 2015

Critical Differences of IT domain


Cloud computing and Grid computing:

Cloud and Grid are both Internet-based distributed service models, but cloud is meant for cheap service availability to users on a pay-per-use basis, while grid computing is designed to solve large, complex problems which need heavy distributed resources for execution. Grid computing is a costly but effective method for solving large problems, while cloud computing is a cheap, easy, pay-per-use service model.


Multiprocessing and Multithreading:

According to many textbook definitions, and to be very specific, a thread is a lightweight process. Multithreading is preferred over multiprocessing in resource-constrained environments because a thread, being lightweight, takes fewer resources. We can understand this with an example: in a web browser, every tab can be run as a separate thread so that it consumes fewer resources and shares the address space. But from a security perspective, Google Chrome now uses a multiprocessing architecture, i.e., every new tab in Chrome is a new process. You can see the effect of this on my computer in the screenshot below (Chrome is consuming more than 90% of the CPU resources, and you can also see the multiple process IDs assigned to Chrome by the operating system).



Simulation and Emulation:

Simulation is the methodology of designing and testing various concepts and procedures on a tool (generally software) which provides a "real-like" environment and functioning of the system. In simulation, none of the input, processing, or output comes from the real world.
In emulation, by contrast, the input and/or output come from the real world and are tested in a user-defined processing environment which provides "real-like" functioning.

Compiler and Interpreter:

Compilation or interpretation is the process of converting high-level code to low-level or machine-level code for execution by the hardware. The difference is that a compiler is fast, while an interpreter is slow due to its line-by-line processing. Apart from this, code that is compiled once produces an intermediate file (for example in C, C++, Java, etc.) which does not need to be compiled again and again, so execution is fast; in interpretation (for example HTML, Tcl, etc.) there is no intermediate code generation, so interpretation and execution take place together every time the code runs, which results in slower processing. Compilation needs more resources than interpretation.

File system and Database Management System: 

In a plain file system you have only sequential access to items (for example, sub-folders or files in a folder); to reach the farthest data item you have to go through the hierarchy sequentially, which is a limitation. So we use a DBMS, which enables random access to data items. A DBMS is a collection of structured data plus a set of programs or procedures to access that data. With its ACID properties it overcomes the inconsistency, redundancy, and atomicity problems of a file management system.

Bug and Defect:

When an error is found at the developer's end, i.e., before the software is actually launched or in its final build, it is known as a bug, while if the error is found after the release of the software it is known as a defect, and the module or software is considered defective.

Free software and Open Source Software:

Free software is the category of software that is available to use at no cost; we do not have to pay anything to use it, but it may not be open source. Open-source software is software that a user is allowed to use, change, and even redistribute under the given license terms, generally the GNU General Public License.

Hard Computing and Soft Computing:

Hard computing, or simply computing, works on the concept of binary logic, i.e., 0 and 1: clearly defined rules and discrete values. Soft computing works on values that range somewhere between 0 and 1. Soft computing is a way to deal with uncertainty and partial truth using low-cost solutions. Soft computing techniques include neural networks, fuzzy logic, and genetic algorithms.

Activation Function, Membership Function, and Fitness Function:

A function used to transform the activation level of a neuron into an output signal by crossing a threshold value is known as an activation function in neural networks. The membership function of a fuzzy set gives the degree of truth, or the degree of membership, of an item in a particular set; it may vary from 0 to 1. A fitness function is a specific type of objective function used to represent, as a single "figure of merit", how close a given solution is to achieving the set goals.

Database and Data Structure:

Database and data structure are both used for the systematic storage of data on a storage device. A database focuses on maintaining consistency, isolation, easy access, etc., from the users' point of view, while a data structure works beneath that, focusing on how to store the data on the device so that it can be easily accessed and managed. So we can say a database uses some underlying data structure to support its functionality for the end user, who interacts with the database through query languages or some other mode.

NFV and SDN:

NFV stands for Network Function Virtualization, while SDN stands for Software Defined Networking. Both have the same objective but different approaches. NFV is inherently static in nature, while SDN enables dynamic decision making. NFV moves network functions from dedicated hardware to a virtual environment, while SDN requires a new interface design to separate the data and control planes and to give programmable behaviour to networking devices.

 Switching and Routing:

Switching occurs at layer 2 (the Data Link Layer) while routing is a function of layer 3 (the Network Layer). Switching works on MAC addresses while routing needs IP addresses; this also makes routing costlier than switching, since MAC addresses come for free with the hardware while IP address space has to be obtained and paid for. Routing is more scalable than switching: switches work in relatively small intranet domains while routers are installed in the large Internet domain.

Baseband and Broadband:

Baseband uses digital signalling while broadband uses analog signals. Baseband uses a single cable to send and receive data at different times; because a single channel is used, there is no need for multiplexing. Broadband, on the other hand, uses separate channels to send and receive data in parallel, and with multiple channels it needs multiplexing.



Keep visiting for more topics...

Tuesday, January 13, 2015

Cloud computing

Cloud Computing:
Cloud computing is an Internet-based service delivery architecture which enables us to use third-party resources on a pay-per-use basis. The need for third-party high-performance computing with flexible demand and lower cost gave rise to technologies like cloud computing.
Cloud computing is a computing paradigm based on the usage of pooled resources via the Internet. It delivers on-demand IT services to users on a pay-per-use basis, i.e., at a much lower cost.
You pay for what, and how much, you use. Cloud computing enables the end user to focus only on operational expenditure rather than capital expenditure. It offers reduced investment, predictable performance (that is why it counts as high-performance computing), high availability, scalability, accessibility and mobility (accessible from anywhere), and many more benefits.
Cloud computing is no new technology but a new delivery method. As a user of cloud computing you just need to have a computer system connected with Internet to enjoy everything on cloud on pay-per-use basis. There are plenty of cloud service providers like IBM, Amazon, Microsoft and many more...
You may also be interested in the technical aspects of being a cloud provider. As a cloud provider you need to design data centers, broker policies, various servers in a distributed manner, hypervisors, and various virtual machine usage policies. So from the research perspective cloud computing still has huge scope, including the need for better scheduling algorithms, energy-efficient data center design, effective resource utilization models, green computing, and stronger security models with low complexity, to name a few.
Before starting research on cloud computing we must first understand what it actually stands for: What is its basic concept? What is the "Anything as a Service" model? What is the core of cloud computing? Virtualization is the concept which lies at the core of cloud computing, and you can achieve virtualization with the help of various hypervisors or virtual machine managers (VMMs).
A virtual Machine (VM) can be defined as a software based machine emulation technique to provide a desirable, on-demand computing environment for users. It is completely independent of any base operating system and complete in itself to finish a task.


Main Characteristics of cloud computing:
  • Cloud computing provides on-demand services to clients without any human interaction at the service provider's site.
  • Cloud computing provides a large pool of resources to the client to use as utility computing.
  • Cloud computing provides elasticity in its services, which lets a user have as much or as little of a service as they want at any instant of time.
  • The services offered by cloud computing are fully managed by the provider; the user doesn't need to worry about that. The only thing the user has to do is use the services and pay as per usage, nothing more.

To start research on cloud using the CloudSim tool, you must understand some basic concepts including cloudlets, brokers, datacenters, schedulers, and virtual machines. All these and other components are designed in Java for research purposes in CloudSim as well as CloudAnalyst.


As shown in the screenshot above, CloudAnalyst allows you to code in Java to simulate your cloud environment; it gives you a visual view of your simulation and also gives you results in the form of a report which you can easily export as a PDF for use in your research work.

To configure a cloud storage environment like dropbox at your end you can start with owncloud.

If you want to work on scheduling and load balancing algorithms, you can also go for the Python-based tool "Haizea". Haizea provides three modes of operation: simulation mode, real-time mode, and OpenNebula mode. Haizea lets you schedule the workload in a "Best Effort", "Deadline Sensitive", "Immediate", or "Advance Reservation" manner. You will have to be handy with Python to work with the Haizea tool.

When we talk about industry-level implementation of cloud services, there are many big names like OpenStack, Salesforce, AWS, and more. AWS stands for Amazon Web Services; it is a proprietary platform from Amazon. Many other such platforms are available, for example Microsoft Azure. The move now is towards adopting open-source platforms to get out of vendor lock-in in cloud computing; OpenStack and OpenNebula are open-source cloud computing platforms. A working knowledge of Python will do you good if you want to be part of the coming world of cloud computing.

Wednesday, January 7, 2015

Big Data Analytics

What is Big Data:

In today's IT industry "Big Data" is the new buzzword. It refers to large or unstructured data sets which conventional approaches are inefficient to deal with. The inefficiency of traditional data storage and manipulation tools with "Big Data" lies in its structure: approximately 80% of today's big data is unstructured, or more specifically non-RDBMS, and it crosses the boundaries of a single system. But this unstructured data is very useful, and the use of commodity hardware and plenty of open-source tools has made big data analytics a feasible task.

For example, the day-to-day data generated by social sites is NoSQL in nature, but it is worth storing and manipulating for faster analysis of customer trends, or of the people a company is targeting for marketing. Twitter trend analysis is one of the most commonly cited examples of big data analytics.
The most common big data categories are medical data, telecom data (also known as telco big data), log data generated by retail chains, bar-code data from the aviation industry, and many more.

Big data analytics has given new dimensions to data visualization and machine learning. Data visualization is the method of representing values in graphical form, and it is very fruitful in decision making. The most promising use cases are weather forecasting and exit-poll surveys, which process large amounts of unstructured data and generate fruitful results. Seeing this, you can understand how important the data is:

You can see my video lecture on Big Data Analytics here

Properties of Big Data:

Three most common properties of Big Data are:
  • Volume
  • Velocity
  • Variety
 Technologies to deal with Big Data:

There are various tools and technologies to deal with big data, of which Hadoop is the most commonly used. Hadoop basically stands for HDFS + MapReduce. HDFS is the reliable Hadoop Distributed File System, and MapReduce is a parallel processing framework which works on key-value pairs. HDFS is responsible for data storage with reliability and availability, while MapReduce is responsible for data processing. The main components responsible for storage in HDFS are the NameNode and DataNode; correspondingly, the main components handling data processing in MapReduce are the JobTracker and TaskTracker. If your namenode is on the local host, you can check the status of your Hadoop cluster at localhost:50070 in your web browser, as shown in the picture.



The cluster configuration of Hadoop is given in core-site.xml, hdfs-site.xml, and mapred-site.xml. You can customize your cluster configuration by making appropriate changes in these files; Hadoop needs a restart for configuration changes to take effect. MapReduce 1 had the problem of a single point of failure; the MapReduce 2 (YARN) architecture is enhanced for parallel processing suited to both OLAP and OLTP applications and avoids the single point of failure.
Hadoop is an open-source technology designed to deal with distributed databases holding unstructured data. It is not designed for the fastest possible processing; rather, it is specifically designed for failure-tolerant distributed data processing. Hadoop provides partial failure support, data recovery, component recovery, consistency, and scalability.
There are various benchmark tests (e.g., TestDFSIO) shipped with the Hadoop installation package to check the performance of a Hadoop cluster.

For the user-convenience and faster execution Hadoop Eco-system supports various scripting languages like Pig, Hive, Jaql, and Many more to be discussed in detail later.

Pig (Pig Latin):
Pig is a simple language platform, popularly known as Pig Latin, that can be used for manipulating data and queries. Pig is a high-level data-flow language developed at Yahoo. Unlike SQL, Pig does not require the data to have a schema; if you don't specify a datatype, fields default to bytearray. In Pig, relation, field, and function names are case-sensitive, while keywords are not.

Hive (Hive QL): 
Apache Hive, first created at Facebook, is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive organizes data in the form of databases, tables, partitions, and buckets. Hive-supported storage file formats are TEXTFILE, SEQUENCEFILE, and RCFILE (Record Columnar File). Hive uses temporary directories on both the Hive client and HDFS, and the client cleans up the temporary data when a query completes.
The Hive Query Language was developed at Facebook and later contributed to the open-source community. Facebook currently uses Hive for reporting dashboards and ad-hoc analysis.

Spark:
Apache Spark is a fast execution engine. It can work independently or on top of Hadoop, using HDFS for storage. As a standalone solution, Spark is used as an extremely fast processing framework.

Jaql:
Jaql is primarily a query language for JavaScript Object Notation (JSON) files, but it supports more than just JSON and allows you to process both structured and unstructured data. It was developed by IBM and later donated to the open-source community. Jaql allows you to select, join, group, and filter data stored in HDFS, much like a blend of Pig and Hive. Jaql's query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig. Jaql is a functional, declarative query language designed to process large data sets; for parallelism, Jaql rewrites high-level queries, when appropriate, into "low-level" queries consisting of MapReduce jobs.

To know the procedure for installing Hadoop please follow these steps, and for running the first program in your Hadoop cluster please follow the steps given in the following link (Steps to run first mapreduce program wordcount).

NoSQL Databases:

NoSQL databases are also an integral part of the big data analytics domain. A few names are MongoDB, CouchDB, Couchbase, Cassandra, etc. These databases are designed to store data in non-relational structures, generally JSON, and most of them are categorized as document-store databases. They are effective tools for big data analytics as well as elastic search.

Kafka

Kafka is a distributed messaging system which works in an approximately real-time manner. Clients (consumers) get access to the desired information from the corresponding servers (producers) based on a selected topic.
A solution architecture built on the Spark, Cassandra, and Kafka technologies is being used as an SCM solution at prestigious retail chains, where it reduced SCM operations costs by 30-40%.

For more updates on big data analytics, you can like CoE Big Data. If you are really a big data enthusiast and want to learn this technology stack from the practical side, visit DataioticsHub and join our Dataioticshub-meetup group for hands-on sessions.

Please find the list of projects in the field of big data here

Sunday, January 4, 2015

Network Simulations and installation of NS2 and NS3

The first thing that comes to the mind of a new researcher is "What actually is simulation, why go for simulation, and how is it different from real implementation?"

Simulation is the process of creating a virtual working environment, with the help of software tools, for a system whose real implementation might be costly or infeasible. The software tools which provide these environments are known as simulators. Simulators provide a cheap, repeatable, and flexible environment for testing or designing new protocols. See the difference between simulation and emulation at http://researchbyomesh.blogspot.in/2015/01/critical-differences-of-it-domain.html?m=1

So network simulators are tools which provide the functionality of a network with the help of programming. You can easily design, test, or enhance the functionality of any protocol and see its effect on the outcome. The output of a simulation depends upon the functionality of the simulator.

For network simulations there are lots of tools available online for research; of those, I recommend NS2/NS3 for academic research purposes due to their open-source and easy-to-extend structure.

First I will discuss NS2. For details of NS2 please refer to this post or link.

NS2 is a simulation tool which is easy to configure on Linux operating systems. I used the Ubuntu 12.04 LTS version for installation. The commands for NS2 installation are:

1. Open the terminal and run:
           sudo apt-get update
   it will ask for your password; enter it, and your system will be updated. Make sure that you are connected to the network.
2. Next commands to run are:
          sudo apt-get install tcl8.5-dev tk8.5-dev
          sudo apt-get install build-essential autoconf automake
          sudo apt-get install libxt-dev libx11-dev libxmu-dev
3. Now download the freely available open-source package "ns-allinone-2.35.tar.gz", untar it, and save it to your home directory.
4. Go to this package via terminal:
         cd ns-allinone-2.35
5. Now in this directory you will find an executable named "install"; run it:
        ./install
    it can take some time.
6. After successful execution of this command go to ns-2.35 directory:
       cd ns-2.35
7. and now run the following commands one by one:
       ./configure
       make
       sudo make install
8. If all these commands ran successfully, it means you have configured the ns package correctly.
9. Now install some packages to make it executable:
      sudo apt-get install ns2
      sudo apt-get install nam
      sudo apt-get install xgraph
      sudo apt-get install gawk
10. Now you are completely done...
11. To try some of the sample Tcl scripts, go to ns-allinone-2.35/ns-2.35/tcl/ex and run:
       ns file_name.tcl

Note: If your Ubuntu version is 14.04 or higher, then nam may not run properly (most probably it will show a segmentation fault / core dumped when running nam). This can be resolved by moving the nam binary to the /usr/local/bin directory. For this, go to your nam-1.x.x folder inside your ns-allinone package and run the command:

      sudo cp nam /usr/local/bin

12. After successfully running a program, the output comes in a trace file (.tr). It is a formatted text file where every column has a specific meaning. By analyzing the trace files with awk scripts you can get throughput, delay, packet loss, and other network performance parameters, for example as sketched below.
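As a hedged illustration (the column layout differs between wired and wireless traces, so adapt the field numbers and the trace file name out.tr to your own scenario), a tiny awk one-liner that counts received and dropped packets in a wired trace looks like this:

      awk '{ if ($1 == "r") recv++; if ($1 == "d") drop++ }
           END { print "received:", recv, "  dropped:", drop }' out.tr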

A sample program to learn how to use ns2 and analyzing it with awk can be found here

If you hit a "tclcl.h not found" problem during installation, refer to this link. There are many other possible problems during NS-2 installation; I am going to extend this article for those very soon.

13. Enjoy Network Simulation...

For latest updates in research and development please like this page



Installation steps for NS3: 

Follow the given steps for the installation of NS-3. First install some dependencies to enable the NS-3 execution environment on your Ubuntu 12.04 LTS machine.

1. sudo apt-get install gcc g++ python
2. sudo apt-get install gcc g++ python python-dev
3. sudo apt-get install mercurial
4. sudo apt-get install bzr
5. sudo apt-get install gdb valgrind
6. sudo apt-get install gsl-bin libgsl0-dev libgsl0ldbl
7. sudo apt-get install flex bison libfl-dev
8. sudo apt-get install g++-3.4 gcc-3.4
9. sudo apt-get install tcpdump
10.sudo apt-get install sqlite sqlite3 libsqlite3-dev
11.sudo apt-get install libxml2 libxml2-dev
12.sudo apt-get install libgtk2.0-0 libgtk2.0-dev
13.sudo apt-get install vtun lxc
14.sudo apt-get install uncrustify
15.sudo apt-get install doxygen graphviz imagemagick
16.sudo apt-get install texlive texlive-extra-utils texlive-latex-extra
17.sudo apt-get install python-sphinx dia
18.sudo apt-get install python-pygraphviz python-kiwi python-pygoocanvas libgoocanvas-dev
19.sudo apt-get install libboost-signals-dev libboost-filesystem-dev


Now download the package of NS-3 from its official website:

http://www.nsnam.org/ns-3-17/

Extract the downloaded package and go to it via the terminal, then run the executable files given in the subfolder:
    ./build.py
    ./test.py

Configure the waf using the following commands once during installation:
    ./waf configure
    ./waf
NS3 supports animation with the NetAnim tool and graphs with the gnuplot tool.
For enabling gnuplot run the commands:
    sudo apt-get install gnuplot
    sudo apt-get install gimp

To install NetAnim support, go to the NetAnim folder and run the following commands to enable the animation tool:
   make clean
   qmake NetAnim.pro
   make
For details of NS-3 please refer the link: ns-3

For more frequent updates on related research topics please like our facebook page at this link

Saturday, January 3, 2015

How to start research

Research: The way to start thinking differently


Hello Friends,

I am writing this blog for those who are going to start research for their postgraduate or doctoral degree, or out of interest in getting their hands dirty with the code. The important thing about research is "where to start from..."

This blog is just to define the right steps for starting research. Many times students get stuck in implementation, but that is not the real problem; the problem is that they are not sound in the basics, so they feel unable to simulate their research ideas.

Starting research by reading research papers is not always a good idea; from research papers you can only get relevant and recent problem statements. For good research, first clear the basics of the field in which you want to work. Do the ground work and focus on basics, and don't forget to study a textbook in your domain of research. Then learn the simulation tool on which you want to verify your research.

Lack of operating knowledge of simulation tools is the main cause of inefficient research results, which may lead to frustration at the expected end time of our research.

So I am listing a few research fields and the relevant simulation tools to start with.


Network Protocols                -     Network Simulator 2, Network Simulator 3
Cloud Computing                  -     CloudSim, Haizea, ownCloud, OpenStack
Physical layer design            -     MATLAB
Big Data Analytics               -     Hadoop (MapReduce), Python, R/RStudio
NoSQL                            -     CouchBase, Cassandra

Network Simulator 2 is a C++ and Tcl based tool which is easy to learn and efficient enough to show packet-level simulation results. The core protocols in NS2 are written in C++ and the complete project is bound together with the help of a Makefile. Tcl is used to write the scenarios or simulation scripts, and the TclCL component makes communication possible between OTcl script values and C++ functions.

CloudSim is a Java-based cloud simulator and a very efficient tool for those who want to work on the cloud platform. It makes it easy to do research on cloud performance, cost-aware migration, and cloud security issues.
Haizea is a Python-based cloud simulation tool focusing especially on scheduling mechanisms. Haizea supports three modes: simulation, real, and OpenNebula support mode.

MATLAB is a multi-utility simulation tool for various mathematical models and circuit design. MATLAB has various built-in toolboxes, of which Simulink is the most commonly used for circuit design.

Latest Research Topics:

I am mentioning some of the most popular research topics on which you can start your work:

 Network Simulations:

  • Designing Energy Efficient routing protocols for Wireless Sensor Networks.
  • QoS based bandwidth allocation to various traffic flows.
  • Detection of Sybil Identity in network.
  • Detection and prevention of various attacks in network. 
  • Openflow SDN simulation for optimized underlying network.

Cloud Computing & Big Data Analytics:

  • Cost aware migration in cloud computing.
  • Designing new security model for cloud computing.
  • Analysis of health-care data using big data analytics tools.
  • Energy Efficient Data Center modeling in Cloud computing.
  • Enabling efficient auto-scaling in hadoop clusters.
  • List of projects for big data
For more frequent updates please visit our facebook page via this link