Monday, September 30, 2013

Setting Up a Hadoop Cluster

1.       Install JAVA on the machine
a.       Copy the jdk1.7.0 folder to /usr/lib/jvm/     [download Java from java.sun.com and extract it to get the jdk1.7.0 folder; save it to a USB device for copying to all machines]
If /usr/lib/jvm does not exist, create it first.
b.      Link java/javac/jar to alternatives
$> sudo ln -s /usr/lib/jvm/jdk1.7.0/bin/java /etc/alternatives/java     [do this for javac and jar also]
$> sudo ln -s /usr/lib/jvm/jdk1.7.0/bin/javac /etc/alternatives/javac
$> sudo ln -s /usr/lib/jvm/jdk1.7.0/bin/jar /etc/alternatives/jar
If ln does not work (maybe because an older version of Java is already linked to alternatives), register the new JDK with update-alternatives instead:
$> sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.7.0/bin/java 1
$> sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.7.0/bin/javac 1
$> sudo update-alternatives --install /usr/bin/jar jar /usr/lib/jvm/jdk1.7.0/bin/jar 1
c.       Check with
$> java -version
You should see jdk1.7.0 instead of OpenJDK. The command should output something comparable to the following on every node of your cluster:
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0_22-b04)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)

2.       Install the OpenSSH server and client on the machine
a.  $> sudo apt-get install openssh-server
b.  $> sudo apt-get install openssh-client
3.       Add a dedicated hadoop user on the machine
a.  $> sudo addgroup hadoop
b.  $> sudo adduser --ingroup hadoop hduser
4.       Configure ssh
a.       $> su - hduser
b.      $> ssh-keygen -t rsa -P ""   [Press Enter to accept the default key file, $HOME/.ssh/id_rsa]
c.       $> cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
d.    $> ssh localhost    [To test whether the OpenSSH server has been configured correctly]
e.      $> exit
5.       Install Hadoop on the machine
a.       Download the Hadoop archive from hadoop.apache.org and extract it to get the hadoop-1.0.3 folder; make the changes as per step 6 and then save it to a USB device for copying to all machines
b.      $> sudo mv hadoop-1.0.3 hadoop
c.       $> sudo cp -r hadoop /home/hduser/
d.      $> cd /home/hduser
e.      $> sudo chown -R hduser:hadoop hadoop
6.       Configure the hadoop environment [To be done only once after copying hadoop, then you copy the folder with the changes to all machines]
a.       Open the file hadoop/conf/hadoop-env.sh   [$> gedit hadoop/conf/hadoop-env.sh]
     export JAVA_HOME=/usr/lib/jvm/jdk1.7.0
b.      Open the file hadoop/conf/mapred-site.xml   [$> gedit hadoop/conf/mapred-site.xml]  Replace master with the IP of the master node
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>

c.       Open the file hadoop/conf/core-site.xml   [$> gedit hadoop/conf/core-site.xml]  Replace master with the IP of the master node
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310/</value>
</property>
d.  Open the file hadoop/conf/hdfs-site.xml   [$> gedit hadoop/conf/hdfs-site.xml]
<property>
  <name>dfs.data.dir</name>
  <value>/home/hduser/hadoop/dfsdata</value>
</property>
e.      Open the file hadoop/conf/masters   - Type the IP of the master
f.        Open the file hadoop/conf/slaves   - Type the IPs of the slaves, one per line

7.       Update $HOME/.bashrc
a.       $> gedit $HOME/.bashrc
b.      Make the changes to the .bashrc file
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0
export HADOOP_HOME=/home/hduser/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_HOME
c.       To apply the changes to the current shell
$> . ~/.bashrc
    
8.       Create the folders needed by hadoop
a.       $> mkdir -p /home/hduser/hadoop/dfsdata   [the directory named in dfs.data.dir]
9.       Edit /etc/hosts on all nodes so that every node's hostname (for example master and slave01) maps to its IP address
10.   Only on master:
a.  $> ssh-copy-id -i $HOME/.ssh/id_rsa.pub   hduser@slave01
b.  $> ssh slave01   [to verify whether master is able to talk to slave01 without a password]


Tuesday, May 13, 2008

FOSS - What's in it for me?

This philosophy of FOSS results in tremendous benefits to users. These include:

• Reduced Cost - The cost of FOSS software remains fixed even when the number of users increases. For example, if you use GNU/Linux as your operating system, in your office, the same CD can be duplicated and distributed to all users. However, if you opt for a “licensed” Microsoft Operating System, it would have to be purchased for each user’s desktop.

• Reliability/ Stability - Free software is the combined result of the experience and the intelligence of all the participants. Its reliability increases as time passes, with all the corrections which are made.

• Portability - This quality is not intrinsic to free software, but it is very often seen in it. Software that meets with success is usually adapted to environments other than those initially considered.

• Performance - Free software is examined by many people, often uses algorithms that come from advanced research, and is tested under a wide variety of uses, so it tends to have good performance characteristics.

• Interoperability - The support in Linux, for example, of a lot of network protocols, file system formats, and even binary compatibility modes assures a good interoperability.

• Reactivity - Corrections to a given problem are made available rapidly.

• Security – Transparency of source code results in faster identification and fixing of bugs and security loopholes.

FOSS is relevant and important in educational institutions because access to the source code of programs allows students to explore the internals of complex systems and hence acquire a deeper understanding of what they study. For example, students can learn operating system concepts by reading the code that implements the corresponding functionality in Linux.

Today, most educational programs require access to many software resources, e.g., Matlab, circuit simulators, drawing packages, etc. Institutions mostly use proprietary solutions, costing lakhs of Rupees in license fees. FOSS solutions are available in many areas, under commonly used licensing terms that permit distribution and modification, and in almost all cases at zero cost. There is no dearth of FOSS software. FOSS systems and tools include the Linux and BSD operating systems, OpenOffice Writer for word processing, OpenOffice Math for mathematical equations, Moodle as a Learning Management System, Audacity for audio editing, Blender for 3-D animation and rendering, GIMP for photo editing, Scilab for scientific applications, and Beowulf and Mosix for distributed computing; the list goes on and on.

Sunday, May 11, 2008

What is image compression?

Image compression is defined as minimizing the size in bytes of a graphics file without degrading the quality of the image to an unacceptable level. Image compression allows more images to be stored in a given amount of disk or memory space. It is therefore common to apply compression to bitmapped images, in order to reduce the amount of storage they require. Image compression also reduces the time required for images to be sent over the Internet or downloaded from Web pages.
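The effectiveness of an algorithm in compressing a file is measured by the compression ratio it achieves, that is, the size of the uncompressed data divided by the size of the compressed data. An algorithm that managed to squeeze a 4 by 3 inch image (about 182 KB uncompressed if it is stored at 72 dpi with 24 bits per pixel) down to 9 KB would have achieved a compression ratio of just over 20. Compression algorithms are broadly divided into two categories: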
Lossless Compression
Lossy Compression
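
The compression ratio quoted above can be checked with a little arithmetic. The following is a minimal sketch, not a definitive method: the text does not give the resolution or color depth of the 4 by 3 inch image, so 72 dpi and 24 bits per pixel are assumptions, and the function names are illustrative only.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size of the raw pixel data of a bitmap, ignoring any file header."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def compression_ratio(original_bytes, compressed_bytes):
    """Uncompressed size divided by compressed size."""
    return original_bytes / compressed_bytes


# Assumed values: 72 dpi and 24-bit color are not stated in the text.
original = uncompressed_size_bytes(4, 3, dpi=72, bits_per_pixel=24)
compressed = 9 * 1024  # 9 KB
ratio = compression_ratio(original, compressed)

print(original)  # 186624 bytes, about 182 KB
print(ratio)     # 20.25
```

Under these assumptions the ratio comes out at 20.25, which matches the "just over 20" figure.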

Lossless Compression: Lossless compression algorithms reduce the size of a file by rearranging the data, representing it more efficiently and removing redundancy, but they do not discard any data. Every bit of data that was present in the original file is recovered when the file is decompressed, so no distortion is introduced. Although lossless compression algorithms can be applied to data of any sort, including images, the degree of compression that can be achieved losslessly is limited. Lossless compression is generally the technique of choice for text or spreadsheet files, where losing words or financial data could pose a problem.
The Graphics Interchange Format (GIF) is an image format used on the Web that provides lossless compression. The zip convention used in programs like WinZip also provides lossless compression, for files of any kind.
GIF is a popular format for Web images. It uses the lossless Lempel-Ziv-Welch (LZW) algorithm. The GIF format is suitable for images with large areas of the same color, such as company logos, and for line drawings such as charts and icons. It is also good for animation files.
The Portable Network Graphics (PNG) format is another lossless format that supports a more efficient compression algorithm than GIF and has a better compression ratio. PNG is good for images with blended, transparent backgrounds.
Lossy Compression: Lossy compression algorithms achieve higher compression ratios by permanently discarding relatively unimportant data that is not well perceived by the human eye. When the file is uncompressed, only a part of the original information is still there, but the loss is chosen so that it is hardly noticeable. Lossy compression is generally used for continuous color images, video and sound.
The compression ratios achievable through lossy compression are much better than those achieved with lossless compression algorithms.
The Joint Photographic Experts Group (JPEG) format commonly used for photographs and other complex still images on the Web uses a lossy compression algorithm. Using JPEG compression, the creator can decide how much loss to introduce and make a trade-off between file size and image quality.
The Moving Picture Experts Group (MPEG) format for video files and the MPEG Layer 3 (MP3) format for audio files also use lossy compression algorithms.

What is a Web safe color palette?

Color can be used effectively in Web design to enhance a Web site’s usability, including its visual presentation, structure, functionality and accessibility. The colors used in designing Web pages are known as Web colors.
Each color is represented in the computer by a number, and the set of colors contained in an image is called its color palette. In the Red Green Blue (RGB) color model, a color is specified as an RGB triplet: three numerical values that give the amounts of red, green and blue, respectively, that make up the color. Red, green and blue are known as the additive primary colors because almost any color can be produced by combining red, green and blue light in varying proportions. Each value is a number between 0 and 255, which fits into a single byte, resulting in 24 bits for each color. For example, red is given as (255, 0, 0). The components can also be written as percentages: dark emerald green, which is made up of a 71% green component and a 29% blue component, is (0%, 71%, 29%), or approximately (0, 181, 74) on the 0 to 255 scale.
Having 24 bits for each color means that 16.7 million possible colors can be represented. However, not all images use that many colors.
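
As a minimal sketch of the arithmetic above, the following Python snippet packs an RGB triplet into a single 24-bit number and counts how many such values exist. The function names pack_rgb and unpack_rgb are illustrative, and placing red in the highest byte is an assumed convention (the one used by HTML hex colors), not something prescribed here.

```python
def pack_rgb(r, g, b):
    """Combine three 0-255 components into one 24-bit integer (red in the high byte)."""
    for component in (r, g, b):
        if not 0 <= component <= 255:
            raise ValueError("each component must be between 0 and 255")
    return (r << 16) | (g << 8) | b


def unpack_rgb(value):
    """Split a 24-bit integer back into its (red, green, blue) components."""
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


red = pack_rgb(255, 0, 0)
print(f"{red:06X}")     # FF0000
print(unpack_rgb(red))  # (255, 0, 0)
print(2 ** 24)          # 16777216 possible colors, i.e. about 16.7 million
```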

A simple expedient known as indexed color is often used to reduce the space required by images, so that they can be transmitted over networks more efficiently. In this scheme, if the image has no more than 256 colors, a single byte is sufficient to represent the color of each pixel: the byte is an index into a color mapping table, which translates it to the actual 24-bit color value. Indexed color images may be displayed with the assistance of hardware: the color table stored in the image is loaded into an associative memory, known as a color lookup table (CLUT), so that colors can be looked up rapidly.
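
The indexed color scheme described above can be sketched in a few lines of Python. This is an illustration under simple assumptions, not how any particular image format stores its data: pixels are plain (r, g, b) tuples, the color table is a Python list, and the names to_indexed and from_indexed are hypothetical.

```python
def to_indexed(pixels):
    """Replace each 24-bit color by a one-byte index into a color table."""
    palette = []  # index -> (r, g, b)
    lookup = {}   # (r, g, b) -> index
    indices = bytearray()
    for color in pixels:
        if color not in lookup:
            if len(palette) == 256:
                raise ValueError("more than 256 distinct colors; one byte per pixel is not enough")
            lookup[color] = len(palette)
            palette.append(color)
        indices.append(lookup[color])
    return palette, bytes(indices)


def from_indexed(palette, indices):
    """Look each index up in the color table to recover the 24-bit colors."""
    return [palette[i] for i in indices]


# A 400-pixel image that uses only three distinct colors.
pixels = [(255, 0, 0), (0, 0, 255), (255, 0, 0), (255, 255, 255)] * 100
palette, indices = to_indexed(pixels)

direct_bytes = 3 * len(pixels)                   # 3 bytes per pixel
indexed_bytes = len(indices) + 3 * len(palette)  # 1 byte per pixel plus the table
print(direct_bytes, indexed_bytes)               # 1200 409
```

The saving comes from storing each 24-bit color once in the table; for very small images the table itself can outweigh the gain.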

If it is vital that indexed color images are displayed correctly, it is better to restrict the choice of Web colors to the 216 colors that are common to the system palettes of the major operating systems. This set of colors is called the Web-safe palette. Programs used for preparing images for the Web, such as Adobe Photoshop, allow you to select the Web-safe palette when exporting an image to one of the Web file formats. If an image uses colors outside the Web-safe palette, a system capable of handling only 256 colors may replace colors of similar hue with a single color, causing posterization, or may dither the image.
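
The figure of 216 colors corresponds to six evenly spaced levels for each of the red, green and blue components (6 x 6 x 6 = 216). The sketch below builds that palette and snaps an arbitrary color to its nearest Web-safe neighbor. The level values 0, 51, 102, 153, 204 and 255 are the ones usually given for the Web-safe palette rather than something stated in this text, and the function names are illustrative.

```python
# The six component levels usually given for the Web-safe palette (00, 33, 66, 99, CC, FF in hex).
WEB_SAFE_LEVELS = (0, 51, 102, 153, 204, 255)


def web_safe_palette():
    """Return all combinations of the six levels for red, green and blue."""
    return [(r, g, b)
            for r in WEB_SAFE_LEVELS
            for g in WEB_SAFE_LEVELS
            for b in WEB_SAFE_LEVELS]


def nearest_web_safe(color):
    """Snap each 0-255 component to the closest multiple of 51."""
    return tuple(round(component / 51) * 51 for component in color)


print(len(web_safe_palette()))            # 216
print(nearest_web_safe((200, 120, 30)))   # (204, 102, 51)
```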

How to conduct user tests on a Web site

There are many methods to conduct user testing. Some of these are:

1. Expert Evaluation – The expert tests the site and based on experience of design principles identifies potential usability problems.
2. Usability walk-through – Test users are observed by a facilitator who records the users’ experience, comments and problems faced. These help in identifying usability issues.
3. Heuristic Review – The usability of the site is assessed against a set of usability design principles and results are used to improve the usability.
4. Survey – A set of written questions is given to a large sample of target users who evaluate the site and give their feedback. Feedback acts as a source for further improvement or to rectify problems.
5. Monitoring software – These are programs that generate log files to tell which pages users are commonly interested in, which pages users are leaving from, links being followed to get to the site, what kind of browsers are being used. For example, if a large number of users are leaving from a purchase order page, then there is a problem and the site needs to be adjusted.

Testing is an iterative process: the more often the site is evaluated, the better it will perform.
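
To illustrate the kind of question monitoring software answers, here is a minimal sketch that counts which page each visit ended on. The input format, a chronological list of (session id, page) pairs, is a hypothetical simplification; real server logs need parsing first, and the function name is illustrative.

```python
from collections import Counter


def exit_page_counts(page_views):
    """Count, per page, how many sessions ended there.

    page_views is an iterable of (session_id, page) pairs in chronological order.
    """
    last_page = {}
    for session_id, page in page_views:
        last_page[session_id] = page  # later views overwrite earlier ones
    return Counter(last_page.values())


# Hypothetical page views from three visits.
views = [
    ("s1", "/home"), ("s1", "/catalog"), ("s1", "/order"),
    ("s2", "/home"), ("s2", "/order"),
    ("s3", "/home"), ("s3", "/catalog"),
]
exits = exit_page_counts(views)
print(exits.most_common(1))  # [('/order', 2)]
```

A page that tops this list, such as the purchase order page in the example above, is a candidate for closer usability testing.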

Web usability and accessibility

Web usability is the ease with which users of a Web site are able to navigate it, perform a desired action and find the information they are seeking.
One important feature of the usability of a Web site is its accessibility. A Web site should be accessible by everybody irrespective of any physical or mental limitations, browser, screen resolution or personal settings. The Web Accessibility Initiative (WAI) of the World Wide Web Consortium (W3C) has issued guidelines to ensure that disabled users have equal access to the Internet. Besides, many countries have legal requirements to make Web sites accessible. It becomes a designer’s responsibility to adhere to these guidelines.
Users with visual impairment, hearing impairment, repetitive strain injury, and age-related conditions can use assistive technologies to interact with computers. These technologies include screen readers and refreshable Braille displays for visually disabled users and screen magnifiers for people with poor eyesight. A Web site should include text descriptions of images, and text descriptions or images for any audio components. Page layouts should be designed to accommodate changes in font size and similar user settings.
Accessibility must also take into account users other than those with disabilities, for example, users with mobile Internet connectivity, users with a low literacy level and users in noisy environments.