Sunday, January 18, 2015

Critical Differences in the IT Domain


Cloud computing and Grid computing:

Cloud and grid computing are both Internet-based distributed service models, but cloud computing is meant to deliver cheap services to users on a pay-per-use basis, while grid computing is designed to solve large, complex problems that need heavy distributed resources for execution. Grid computing is a costly but effective method for solving large problems, whereas cloud computing is a cheap, easy, pay-per-use service model.


Multiprocessing and Multithreading:

According to many textbook definitions, and to be very specific, a thread is a lightweight process. Multithreading is preferred over multiprocessing in resource-constrained environments because threads are lightweight: they take fewer resources and share the same address space. We can understand this with an example. In many web browsers, each tab runs as a separate thread so that it consumes fewer resources and shares the process's address space. From a security perspective, however, Google Chrome has moved to a multiprocess architecture, i.e. every new tab in Chrome is a new process. You can see the effect of this on my computer in the screenshot below (Chrome consumes more than 90% of the CPU, and you can also see the multiple process IDs assigned to Chrome by the operating system).
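To see the address-space difference in practice, here is a minimal sketch assuming standard Python 3 (the threading and multiprocessing modules are used only for illustration; browsers obviously do far more than this):

    import threading
    import multiprocessing

    tabs = []  # shared state, visible to every thread in this process

    def open_tab(name):
        # A thread appends to the same list object as the main program.
        # A separate process appends to its own copy, so the parent never sees it.
        tabs.append(name)

    if __name__ == "__main__":
        t = threading.Thread(target=open_tab, args=("thread-tab",))
        t.start(); t.join()

        p = multiprocessing.Process(target=open_tab, args=("process-tab",))
        p.start(); p.join()

        print(tabs)   # ['thread-tab'] -- the process's change stayed in its own address space

The process costs more to create and gets its own memory, which is exactly the isolation-versus-resources trade-off described above.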



Simulation and Emulation:

Simulation is the methodology of designing and testing concepts and procedures on a tool (generally software) that provides a "real-like" environment and functioning of the system. In simulation, none of the input, processing, or output comes from the real world.
In emulation, by contrast, the input and/or output come from the real world and are processed in a user-defined environment that provides "real-like" functioning.

Compiler and Interpreter:

Compilation and interpretation are both processes of converting high-level code into low-level or machine code for execution by the hardware. The difference is that a compiler is fast, while an interpreter is slower because of its line-by-line processing. Beyond that, compiled code (e.g. C, C++, Java) produces an intermediate file that does not need to be translated again and again, so execution is fast; with interpretation (e.g. Tcl, Python, and other scripting languages) there is no intermediate code, so translation and execution happen together every time the code runs, which makes it slower than compiled execution. Compilation generally needs more resources than interpretation.
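As a rough, hedged analogy (Python bytecode rather than native machine code, but the same translate-once-versus-translate-every-time idea), assuming standard Python 3:

    import time

    source = "total = sum(i * i for i in range(10000))"

    # Compiler-style: translate the source to an intermediate form once, then reuse it.
    code_obj = compile(source, "<demo>", "exec")
    start = time.perf_counter()
    for _ in range(1000):
        exec(code_obj)                 # execution only, no re-translation
    print("pre-compiled:", time.perf_counter() - start, "s")

    # Interpreter-style: the source text is translated again on every run.
    start = time.perf_counter()
    for _ in range(1000):
        exec(source)                   # compile + execute each time
    print("re-translated:", time.perf_counter() - start, "s")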

File system and Database Management System: 

In a plain file system you have only sequential access to items (for example, sub-folders or files within a folder). To reach the farthest data item you have to traverse sequentially, which is a limitation, so we use a DBMS, which enables random access to data items. A DBMS is a collection of structured data together with a set of programs or procedures to access that data. With its ACID properties it overcomes the inconsistency, redundancy, and atomicity problems of a file management system.
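A tiny illustration of the atomicity part of ACID, using Python's built-in sqlite3 module (the table and the simulated crash are invented purely for the example):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 100)")
    conn.commit()

    try:
        # Transfer 50 from A to B: both updates must succeed together or not at all.
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'A'")
        raise RuntimeError("simulated crash before the second update")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'B'")
        conn.commit()
    except RuntimeError:
        conn.rollback()   # atomicity: the half-finished transfer is undone

    print(conn.execute("SELECT * FROM accounts").fetchall())   # [('A', 100), ('B', 100)]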

Bug and Defect:

When an error is found in the system at the developer's end, that is, before the software is actually launched or the final build is made, it is known as a bug; if the error is found after the release of the software, it is known as a defect, and the module or software is considered defective.

Free software and Open Source Software:

Free software, as used here, is the category of software available at no cost: you do not have to pay anything to use it, but it may not be open source. Open-source software is software that users are allowed to use, change, and even redistribute under the given license terms, commonly the GNU General Public License.

Hard Computing and Soft Computing:

Hard computing, or simply conventional computing, works on the concept of binary logic, i.e. 0 and 1. It works on clearly defined rules and discrete values, while soft computing works on values that range anywhere between 0 and 1. Soft computing is a way to deal with uncertainty and partial truth using low-cost solutions. Soft computing techniques include neural networks, fuzzy logic, and genetic algorithms.

Activation Function, Membership Function, and Fitness Function:

A function used to transform the activation level of a neuron into an output signal, typically by crossing a threshold value, is known as an activation function in neural networks. The membership function of a fuzzy set gives the degree of truth, or the degree of membership, of an item in a particular set; it may vary from 0 to 1. A fitness function is a specific type of objective function used to represent, as a single "figure of merit", how close a given solution is to achieving the set goals.
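A small sketch of all three in Python; the sigmoid, the triangular membership shape, and the distance-to-target fitness below are common textbook choices, not the only possible ones:

    import math

    def sigmoid_activation(x):
        # Activation function: squashes a neuron's activation level into (0, 1).
        return 1.0 / (1.0 + math.exp(-x))

    def triangular_membership(x, a, b, c):
        # Membership function of a triangular fuzzy set peaking at b:
        # returns the degree of membership of x, a value between 0 and 1.
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)

    def fitness(solution, target):
        # Fitness function: a single figure of merit telling how close a
        # candidate solution is to the goal (higher is better here).
        error = sum((s - t) ** 2 for s, t in zip(solution, target))
        return 1.0 / (1.0 + error)

    print(sigmoid_activation(0.5))              # ~0.62
    print(triangular_membership(7, 5, 10, 15))  # 0.4
    print(fitness([1, 2, 3], [1, 2, 4]))        # 0.5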

Database and Data Structure:

A database and a data structure are both used for the systematic storage of data on a storage device. A database focuses on maintaining consistency, isolation, easy access, and so on from the users' point of view, while data structures work beneath that, focusing on how to store the data on the device so that it can be accessed and managed efficiently. So we can say a database uses underlying data structures to provide its functionality to the end user, who interacts with it through query languages or other interfaces.

NFV and SDN:

NFV stands for Network Function Virtualization, while SDN stands for Software Defined Networking. Both have similar objectives but different approaches. NFV is inherently static in nature, while SDN enables dynamic decision making. NFV requires moving network functions from dedicated hardware into a virtualized environment, while SDN needs a new interface design that separates the data and control planes and gives programmable behaviour to networking devices.

Switching and Routing:

Switching occurs at layer 2 (the Data Link Layer) while routing is a function of layer 3 (the Network Layer). Switching works on MAC addresses while routing needs IP addresses. This also makes routing costlier than switching: MAC addressing comes free with the hardware, while public IP address space has to be obtained (and paid for) from a registry or ISP. Routing is more scalable than switching; switches work in relatively small intranet domains while routers are installed across the large Internet domain.

Baseband and Broadband:

Baseband uses digital signalling while broadband uses analog signals. Baseband uses a single channel on the cable, so data is sent and received at different times and there is no need for multiplexing. Broadband, on the other hand, divides the medium into separate channels to send and receive data in parallel; because of these multiple channels, broadband needs multiplexing.



Keep visiting for more topics...

Tuesday, January 13, 2015

Cloud computing

Cloud Computing:
Cloud computing is an Internet-based service delivery architecture that enables us to use third-party resources on a pay-per-use basis. The need for third-party high-performance computing with flexible demand and lower cost gave rise to technologies like cloud computing.
Cloud computing is a computing paradigm based on the usage of pooled resources over the Internet. It delivers on-demand IT services to users on a pay-per-use basis, i.e. at a much lower cost.
You pay for what, and how much, you use. Cloud computing lets the end user focus on operational expenditure rather than capital expenditure. It offers reduced investment, predictable performance (which is why it counts as high-performance computing), high availability, scalability, accessibility and mobility (accessible from anywhere), and many more benefits.
Cloud computing is not a new technology but a new delivery method. As a user of cloud computing you just need a computer connected to the Internet to enjoy everything on the cloud on a pay-per-use basis. There are plenty of cloud service providers, such as IBM, Amazon, Microsoft, and many more...
You may also be interested in the technical aspects of being a cloud provider. As a provider you need to design data centers, broker policies, distributed servers, hypervisors, and various virtual machine usage policies. So, from the research perspective, cloud computing still has huge scope, including the need for better scheduling algorithms, energy-efficient data-center design, effective resource utilization models, green computing, and stronger security models with low complexity, to name a few.
        Before starting research on cloud computing we must first understand what it actually stands for. What is its basic concept? What is the "Anything as a Service" model? What lies at the core of cloud computing? Virtualization is the concept at the core of cloud computing, and you can achieve it with the help of various hypervisors, or virtual machine managers (VMMs).
A virtual machine (VM) can be defined as a software-based machine emulation technique that provides a desirable, on-demand computing environment for users. It is largely independent of the base operating system and complete in itself for finishing a task.


Main Characteristics of cloud computing:
  • Cloud computing provides on-demand services to clients without any human interaction at the service provider's site.
  • Cloud computing provides a large pool of resources for clients to use as utility computing.
  • Cloud computing provides elasticity in its services, which lets users have as much or as little of a service as they want at any instant of time.
  • The services offered by cloud computing are fully managed by the provider; the user does not need to worry about that. The only thing the user has to do is use the services and pay per usage, nothing more.

To start research on cloud computing using the CloudSim tool, you must understand some basic concepts, including cloudlets, brokers, data centers, schedulers, and virtual machines. All these and other components are designed in Java for research purposes in CloudSim as well as CloudAnalyst.


As shown in the screenshot above, CloudAnalyst allows you to code in Java to simulate your cloud environment; it gives you a visual view of your simulation and also produces results as a report that you can easily export as a PDF for use in your research work.

To configure a cloud storage environment like Dropbox at your end, you can start with ownCloud.

If you want to work on scheduling and load-balancing algorithms, you can also go for the Python-based tool "Haizea". Haizea provides three modes of operation: simulation mode, real-time mode, and OpenNebula mode. Haizea lets you schedule workloads in a "Best Effort", "Deadline Sensitive", "Immediate", or "Advance Reservation" manner. You will have to be handy with Python to work with the Haizea tool.

When we talk about industry-level implementations of cloud services, there are many big names such as OpenStack, Salesforce, AWS, and more. AWS stands for Amazon Web Services; it is a proprietary platform from Amazon. Many other proprietary platforms are available, such as Microsoft Azure. The move now is towards adopting open-source platforms to escape vendor lock-in in cloud computing; OpenStack and OpenNebula are open-source cloud computing platforms. A working knowledge of Python will do you good if you want to be part of the coming world of cloud computing.

Wednesday, January 7, 2015

Big Data Analytics

What is Big Data:

In today's IT industry, "Big Data" is the new buzzword. It refers to large files or unstructured data sets that conventional approaches are inefficient at dealing with. The inefficiency of traditional data storage and manipulation tools lies in big data's architecture: roughly 80% of today's big data is unstructured, or more specifically non-RDBMS, and it crosses the boundaries of a single system. But this unstructured data is very useful. The use of commodity hardware and plenty of open-source tools has made big data analytics a feasible task.

For example, the day-to-day data generated by social sites is NoSQL in nature, yet it is worth storing and analysing for faster trend analysis of customers, or of the people a company is targeting for marketing. Twitter trend analysis is one of the most commonly cited examples of big data analytics.
The most common big data categories are medical data, telecom data (also known as telco big data), log data generated by retail chains, bar-code data from the aviation industry, and many more.

Big data analytics has given new dimensions to data visualization and machine learning. Data visualization is the method of representing values in graphical form, and it is very useful in decision making. The most prominent use cases are weather forecasting and exit-poll surveys, which process large amounts of unstructured data and generate useful results. Seeing this, you can understand how important the data is.
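As a minimal taste of visualization, assuming the matplotlib library is installed (the numbers are purely illustrative, not real data):

    import matplotlib.pyplot as plt

    # Purely illustrative values: daily mention counts of a trending topic.
    days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
    mentions = [120, 340, 560, 410, 690]

    plt.bar(days, mentions)
    plt.xlabel("Day")
    plt.ylabel("Mentions")
    plt.title("Trend of a topic over a week")
    plt.show()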

You can see my video lecture on Big Data Analytics here

Properties of Big Data:

Three most common properties of Big Data are:
  • Volume
  • Velocity
  • Variety
Technologies to deal with Big Data:

There are various tools and technologies to deal with big data, of which Hadoop is the most commonly used. Hadoop basically stands for HDFS + MapReduce. HDFS is the reliable Hadoop Distributed File System, and MapReduce is a parallel processing framework that works on key-value pairs. HDFS is responsible for data storage with reliability and availability, while MapReduce is responsible for data processing. The main components responsible for storage in HDFS are the NameNode and DataNodes; correspondingly, the main components handling data processing in MapReduce are the JobTracker and TaskTrackers. If the NameNode is on your local host, you can check the status of your Hadoop cluster at localhost:50070 in your web browser, as shown in the picture.
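To get a feel for the key-value style of MapReduce without writing Java, here is a word-count sketch in Python intended for Hadoop Streaming. The file names mapper.py and reducer.py are just illustrative; the mapper emits one word/count pair per word, and the reducer relies on Hadoop sorting the keys between the two phases:

    # mapper.py -- reads raw text from stdin and emits one "word<TAB>1" pair per word
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word.lower() + "\t1")

    # reducer.py -- the framework sorts mapper output by key, so all counts for a
    # word arrive consecutively and can be summed in a single pass
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

You can dry-run the pair locally with a pipeline such as cat input.txt | python mapper.py | sort | python reducer.py before submitting it through the Hadoop Streaming jar shipped with your distribution.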



    The cluster configuration of Hadoop is kept in core-site.xml, hdfs-site.xml, and mapred-site.xml. You can customize your cluster by making appropriate changes in these files; Hadoop needs a restart for the configuration changes to take effect. MapReduce 1 suffered from a single point of failure. The MapReduce 2 (YARN) architecture is enhanced to handle parallel processing suited to both OLAP and OLTP applications and to avoid the single-point-of-failure problem.
     Hadoop is an open-source technology designed to deal with distributed data sets containing unstructured data. It is not designed for low-latency processing; rather, it is specifically designed for fault-tolerant distributed data processing. Hadoop provides partial-failure support, data recovery, component recovery, consistency, and scalability.
There are various benchmark tests (e.g. "TestDFSIO") included in the Hadoop installation package to check the performance of a Hadoop cluster.

For user convenience and faster development, the Hadoop ecosystem supports various higher-level languages such as Pig, Hive, and Jaql, among many more to be discussed in detail later.

Pig (Pig Latin):
Pig is a high-level platform whose language, popularly known as Pig Latin, can be used for manipulating and querying data. Pig was developed at Yahoo and is a data-flow language. Unlike SQL, Pig does not require the data to have a schema. If you don't specify a data type in Pig, fields default to bytearray. In Pig, relation, field, and function names are case-sensitive, while keywords are not.

Hive (Hive QL): 
Apache Hive, first created at Facebook, is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive organizes data into databases, tables, partitions, and buckets. Hive-supported storage file formats include TEXTFILE, SEQUENCEFILE, and RCFILE (Record Columnar File). Hive uses temporary directories on both the Hive client and HDFS; the client cleans up the temporary data when a query completes.
The Hive Query Language was developed at Facebook and later contributed to the open-source community. Facebook currently uses Hive for reporting dashboards and ad-hoc analysis.

Spark:
Apache Spark is a fast execution engine. It can work independently or on top of Hadoop, using HDFS for storage. As a standalone solution, Spark is used as an extremely fast processing framework.

Jaql:
Jaql is primarily a query language for JavaScript Object Notation (JSON) files, but it supports more than just JSON. It allows you to process both structured and unstructured data. It was developed by IBM and later donated to the open-source community. Jaql allows you to select, join, group, and filter data stored in HDFS, much like a blend of Pig and Hive. Jaql's query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig. Jaql is a functional, declarative query language designed to process large data sets. For parallelism, Jaql rewrites high-level queries, when appropriate, into "low-level" queries consisting of MapReduce jobs.

To learn the procedure for installing Hadoop, please follow these steps, and for running your first program on your Hadoop cluster, please follow the steps given in the following link (Steps to run the first MapReduce program, wordcount).

NoSQL Databases:

NoSQL databases are also an integral part of the big data analytics domain. A few examples of NoSQL databases are MongoDB, CouchDB, Couchbase, and Cassandra. These databases are designed to store data in a non-relational structure, generally as JSON documents, and most of them are categorized as document-oriented databases. They are effective tools for big data analytics as well as elastic search.
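A small sketch of the document model using the pymongo driver, assuming a MongoDB server is running locally on the default port (the database, collection, and documents are invented for the example):

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)      # default local MongoDB
    tweets = client["analytics_demo"]["tweets"]   # database and collection are created lazily

    # Documents are schema-less JSON-like dictionaries; fields can differ per document.
    tweets.insert_one({"user": "alice", "text": "big data is fun", "retweets": 3})
    tweets.insert_one({"user": "bob", "text": "learning hadoop", "hashtags": ["#hadoop"]})

    # Query by field value, much like a WHERE clause but over documents.
    for doc in tweets.find({"user": "alice"}):
        print(doc["text"])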

Kafka

Kafka is a distributed messaging system that works in approximately real time. Clients (consumers) get access to the desired information from the corresponding servers (producers) based on the topics they subscribe to.
A solution architecture built on Spark, Cassandra, and Kafka is being used as an SCM solution at prestigious retail chains and has reduced SCM operations costs by 30-40%.
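A minimal producer/consumer sketch using the kafka-python client; the topic name, broker address, and message payload are assumptions made for illustration:

    from kafka import KafkaProducer, KafkaConsumer

    # Producer side: publish messages to a topic on the broker.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("inventory-updates", b'{"sku": "A42", "qty": 17}')
    producer.flush()

    # Consumer side: subscribe to the same topic and read messages as they arrive.
    consumer = KafkaConsumer("inventory-updates",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)
        break   # stop after one message for this demo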

For more updates on big data analytics, you can like CoE Big Data. If you are really a big data enthusiast and want to learn this technology stack from a practical angle, visit DataioticsHub and join our DataioticsHub meetup group for hands-on sessions.

Please find the list of projects in the field of big data here

Sunday, January 4, 2015

Network Simulations and installation of NS2 and NS3

The first thing that comes to the mind of a new researcher is: "What actually is simulation, why go for simulation, and how is it different from a real implementation?"

Simulation is the process of creating a virtual working environment, with the help of software tools, for systems whose real implementation might be costly or infeasible. The software tools that provide these environments are known as simulators. Simulators provide a cheap, repeatable, and flexible environment for testing or designing new protocols. See the difference between simulation and emulation at http://researchbyomesh.blogspot.in/2015/01/critical-differences-of-it-domain.html?m=1

So network simulators are tools that provide the functionality of a network through programming. You can easily design, test, or enhance the functionality of any protocol and see its effect on the outcome. The accuracy of the simulation output depends on the functionality of the simulator.

For network simulations there are lots of tools available online for research; of those, I recommend NS2/NS3 for academic research purposes because of their open-source and easy-to-extend structure.

First I will discuss NS2. For details of NS2 please refer to this post or link

NS2 is a simulation tool that is easy to configure on Linux operating systems. I use Ubuntu 12.04 LTS for the installation. The commands for the NS2 installation are:

1. Open the terminal and run:
           sudo apt-get update
   it will ask for your password; enter it, and your system will be updated. Make sure that you are connected to the network.
2. Next commands to run are:
          sudo apt-get install tcl8.5-dev tk8.5-dev
          sudo apt-get install build-essential autoconf automake
          sudo apt-get install libxt-dev libx11-dev libxmu-dev
3. Now download the freely available open-source package "ns-allinone-2.35.tar.gz", save it to your home directory, and untar it.
4. Go to this package via terminal:
         cd ns-allinone-2.35
5. In this directory you will find an executable named "install"; run it:
        ./install
    it can take some time.
6. After successful execution of this command go to ns-2.35 directory:
       cd ns-2.35
7. and now run the following commands one by one:
       ./configure
       make
       sudo make install
8. If all these commands ran successfully, you have configured the NS package correctly.
9. Now install some packages to make it executable:
      sudo apt-get install ns2
      sudo apt-get install nam
      sudo apt-get install xgraph
      sudo apt-get install gawk
10. Now you are completely done...
11. To check some running tcl scripts, go to the ns-allinone-2.35/ns-2.35/tcl/ex, and run:
       ns file_name.tcl

Note: If your Ubuntu version is 14.04 or higher, then nam may not run properly (most probably it will show a segmentation fault / core dump when nam starts). This can be resolved by copying the nam binary to the /usr/local/bin directory. To do this, go to the nam-1.x.x folder inside your ns-allinone package and run the command:

      sudo cp nam /usr/local/bin

12. After successfully running a program, the output is written to a trace file (.tr). It is a formatted text file in which every column has a specific meaning. By analyzing the trace files with awk scripts you can get the throughput, delay, packet loss, and other network performance parameters.
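The same kind of post-processing that the awk scripts do can also be sketched in Python. The snippet below computes a rough received throughput, assuming the classic ns-2 wired trace layout (event type, time, source node, destination node, packet type, packet size, ...); adjust the column indices and the sink node id to match your own scenario:

    # throughput.py -- rough received-throughput estimate from an ns-2 wired trace
    import sys

    SINK_NODE = "3"              # assumed id of the receiving node of interest
    received_bytes = 0
    first_time = last_time = None

    with open(sys.argv[1]) as trace:
        for line in trace:
            fields = line.split()
            if len(fields) < 6 or fields[0] != "r":
                continue                          # only count receive events
            if fields[3] != SINK_NODE:
                continue                          # only packets arriving at the sink
            t = float(fields[1])
            received_bytes += int(fields[5])
            if first_time is None:
                first_time = t
            last_time = t

    if first_time is not None and last_time > first_time:
        kbps = received_bytes * 8 / (last_time - first_time) / 1000.0
        print("Throughput: %.2f kbps" % kbps)

Run it as python throughput.py out.tr after your simulation finishes.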

A sample program showing how to use NS2 and analyse its output with awk can be found here

If you hit a "tclcl.h not found" problem during installation, refer to this link. There are many other possible problems during NS2 installation; I am going to extend this article to cover them very soon.

13. Enjoy Network Simulation...

For latest updates in research and development please like this page



Installation steps for NS3: 

Follow the given steps for the installation of NS-3. First install some dependencies to enable the NS-3 execution environment on your Ubuntu 12.04 LTS machine.

1. sudo apt-get install gcc g++ python
2. sudo apt-get install gcc g++ python python-dev
3. sudo apt-get install mercurial
4. sudo apt-get install bzr
5. sudo apt-get install gdb valgrind
6. sudo apt-get install gsl-bin libgsl0-dev libgsl0ldbl
7. sudo apt-get install flex bison libfl-dev
8. sudo apt-get install g++-3.4 gcc-3.4
9. sudo apt-get install tcpdump
10.sudo apt-get install sqlite sqlite3 libsqlite3-dev
11.sudo apt-get install libxml2 libxml2-dev
12.sudo apt-get install libgtk2.0-0 libgtk2.0-dev
13.sudo apt-get install vtun lxc
14.sudo apt-get install uncrustify
15.sudo apt-get install doxygen graphviz imagemagick
16.sudo apt-get install texlive texlive-extra-utils texlive-latex-extra
17.sudo apt-get install python-sphinx dia
18.sudo apt-get install python-pygraphviz python-kiwi python-pygoocanvas libgoocanvas-dev
19.sudo apt-get install libboost-signals-dev libboost-filesystem-dev


Now download the package of NS-3 from its official website:

http://www.nsnam.org/ns-3-17/

Extract the downloaded package and go to it via the terminal, then run the executable scripts given in the subfolder:
    ./build.py
    ./test.py

Configure waf using the following commands once during installation:
    ./waf configure
    ./waf
NS3 supports animation through the NetAnim tool and graph plotting through the gnuplot tool.
To enable gnuplot, run the commands:
    sudo apt-get install gnuplot
    sudo apt-get install gimp

To enable the NetAnim animation tool, go to the NetAnim folder and run the following commands:
   make clean
   qmake NetAnim.pro
   make
For details of NS-3 please refer to the link: ns-3

For more frequent updates on related research topics please like our facebook page at this link

Saturday, January 3, 2015

How to start research

Research: The way to start thinking differently


Hello Friends,

I am writing this blog for those who are going to start research for their postgraduate or doctoral degree, or simply out of an interest in getting their hands dirty with code. The important thing about research is "where to start from..."

This blog is just about defining the right steps for starting research. Many times students get stuck in implementation, but that is not the real problem. The problem is that they are not sound in the basics, and so they feel unable to simulate their research ideas.

Starting research by reading research papers is not always a good idea. From research papers you can only get relevant and recent problem statements. For good research, first clear up the basics of the field in which you want to work. Do the ground work and focus on fundamentals, and don't forget to study a textbook of your research domain. Then learn the simulation tool on which you want to verify your research.

Lack of working knowledge of simulation tools is the main cause of inefficient research results, which can lead to frustration near the expected end of the research.

So, I am mentioning a few research fields and the relevant simulation tools to start with.


Network Protocols                -     Network Simulator 2, Network Simulator 3
Cloud Computing                  -     CloudSim, Haizea, owncloud, OpenStack
Physical layer design            -     MatLab
Big Data Analytics               -     Hadoop (MapReduce), Python, R/RStudio
NoSQL                            -     CouchBase, Cassandra

Network Simulator 2 is a C++- and Tcl-based tool, which is easy to learn and efficient enough to show packet-level simulation results. The core protocols in NS2 are written in C++, and the complete project is built with the help of a Makefile. Tcl is used to write the scenarios or simulation scripts, and the TclCL component makes communication possible between OTcl script values and C++ functions.

CloudSim is a Java-based cloud simulator and a very efficient tool for those who want to work on the cloud platform. It easily enables research on cloud performance, cost-aware migration, as well as cloud security issues.
Haizea is a Python-based cloud simulation tool focusing especially on scheduling mechanisms. Haizea supports three modes: simulation, real-time, and OpenNebula mode.

Matlab is a multi-utility simulation tool for various mathematical models and circuit design. Matlab has various built-in toolboxes, of which Simulink is the most commonly used for circuit design.

Latest Research Topics:

I am mentioning some of the most popular research topics on which you can start your work:

 Network Simulations:

  • Designing Energy Efficient routing protocols for Wireless Sensor Networks.
  • QoS based bandwidth allocation to various traffic flows.
  • Detection of Sybil identities in a network.
  • Detection and prevention of various attacks in network. 
  • OpenFlow SDN simulation for an optimized underlying network.

Cloud Computing & Big Data Analytics:

  • Cost aware migration in cloud computing.
  • Designing new security model for cloud computing.
  • Analysis of health-care data using big data analytics tools.
  • Energy Efficient Data Center modeling in Cloud computing.
  • Enabling efficient auto-scaling in Hadoop clusters.
  • List of projects for big data
For more frequent updates please visit our facebook page via this link