Friday 20 May 2011

Install and Configuration of PDSH

What's pdsh?

    Pdsh is an efficient, multithreaded remote shell client which executes commands on multiple remote hosts in parallel.

    Pdsh implements dynamically loadable modules for extended functionality, such as new remote shell services and remote host selection.

Steps to Install and Configure:

    1. Install pdsh on the Server.

    2. Copy the SSH public key of Server1 to node1 and node2.

    3. Add hostname-to-IP-address entries for node1 and node2 to /etc/hosts on Server1.

    4. Execute pdsh -w ssh:root@node[1,2] ls 2> /dev/null; node1 and node2 will run the ls command and report back to Server1 as follows:

    lawrence@suse:~/.ssh> pdsh -w ssh:root@node[1,2] ls 2> /dev/null
    node1: anaconda-ks.cfg
    node1: Desktop
    node1: id_rsa.pub
    node1: install.log
    node1: install.log.syslog
    node2: anaconda-ks.cfg
    node2: bin
    node2: conf-examples
    node2: cpulimit-1.1.tar.gz
    node2: cpulimit.tar.gz
    node2: Desktop
    node2: id_rsa.pub
    node2: install.log
    node2: install.log.syslog
    node2: mibs
    node2: mibs_20100925.rar

    5. Combine multiple commands:
    lawrence@suse:~/.ssh> pdsh -w ssh:root@node[1,2] "cd /tmp;ls" 2> /dev/null
    node1: pulse-Bk60xcI9xlDq
    node1: virtual-root.pHV8bR
    node2: etherXXXXWj7KYw
    node2: gconfd-root
    node2: keyring-RdVKdK
    node2: mapping-root
    node2: scim-panel-socket:0-root
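
    When many nodes return identical output, the dshbak utility that ships with pdsh can group the results per host; a small sketch, assuming dshbak was installed along with pdsh:

    # run the same command on both nodes and coalesce identical output
    pdsh -w ssh:root@node[1,2] "uname -r" 2> /dev/null | dshbak -c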

Remote login without password:

    Step 1: Create public and private keys using ssh-keygen on local-host


    jsmith@local-host$ ssh-keygen


    Step 2: Copy the public key to remote-host using ssh-copy-id

    jsmith@local-host$ ssh-copy-id -i ~/.ssh/id_rsa.pub remote-host

or

    jsmith@local-host$ scp ~/.ssh/id_rsa.pub 151.8.19.146:/root/.ssh/authorized_keys        (note: unlike ssh-copy-id, this overwrites any existing authorized_keys)

    jsmith@remote-host's password:
    Now try logging into the machine, with "ssh 'remote-host'", and check in:

    .ssh/authorized_keys

    to make sure we haven't added extra keys that you weren't expecting.

    Note: ssh-copy-id appends the keys to the remote-host's .ssh/authorized_keys file.

    Step 3: Login to remote-host without entering the password

    jsmith@local-host$ ssh remote-host

This completes the passwordless remote login setup that pdsh relies on in the section above; an alternative manual method is sketched below.
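
If neither ssh-copy-id nor scp is convenient, the public key can also be appended manually over an existing ssh session; a minimal sketch, assuming the key was generated as ~/.ssh/id_rsa.pub:

    # append (not overwrite) the local public key to the remote authorized_keys file
    cat ~/.ssh/id_rsa.pub | ssh remote-host 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'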











Authenticity of Host 'master' Cannot be Established

Error:

run_lapw -NI -p

LAPW0 END

The authenticity of host 'master' can't be established.
RSA key fingerprint is <some address>.
Are you sure you want to continue connecting (yes/no)?


Solution:

ssh -i identity suman@master        (where identity is the private key file and master is the node name)
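
On a cluster it is often easier to avoid the interactive host-key prompt altogether by recording the host key in advance; a sketch, assuming the node name master resolves from the machine running run_lapw:

# store master's host key once, so later ssh sessions are not interrupted by the prompt
ssh-keyscan master >> ~/.ssh/known_hosts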

Error while Loading Shared Libraries

Error:

run_lapw

error while loading shared libraries: libmkl_lapack.so: cannot open shared object file: No such file or directory.

error while loading shared libraries: libguide.so: cannot open shared object file: No such file or directory.

Solution:

export CLUSTERCONF=/opt/cluster/etc/cluster.conf        (path of the cluster.conf file)

source /opt/intel/Compiler/11.1/046/bin/ifortvars.sh intel64        (path of ifortvars.sh and the Intel architecture)

source /opt/intel/Compiler/11.1/046/mkl/tools/environment/mklvarsem64t.sh

source /opt/intel/Compiler/11.1/046/bin/intel64/ifortvars_intel64.sh        (alternatively, source the architecture-specific script directly)
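
If the ifortvars/mklvars scripts are not available, the missing-library errors can also be worked around by pointing LD_LIBRARY_PATH at the MKL and compiler library directories; this is only a sketch, and the em64t/intel64 library paths below are assumptions based on the install prefix used above:

# make libmkl_lapack.so and libguide.so visible to the dynamic linker
export LD_LIBRARY_PATH=/opt/intel/Compiler/11.1/046/mkl/lib/em64t:/opt/intel/Compiler/11.1/046/lib/intel64:$LD_LIBRARY_PATH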

Lustre – Cluster File System : Quick Setup

Lustre File System Quick Setup

                        The Lustre file system is a distributed, high-performance cluster filesystem that redefines I/O performance and scalability for large and complex computing environments. It is ideally suited for data-intensive applications that require high I/O performance.

Lustre components:
MDS – Metadata Server: The MDS server makes metadata stored in the metadata targets available to Lustre clients.

MDT – Metadata Target: This stores metadata, such as filenames, directories, permissions, and file layout, on the metadata server.

OSS – Object Storage Server: The OSS provides file I/O service, and network request handling for the OSTs. The MDT, OSTs and Lustre clients can run concurrently on a single node. However, a typical configuration is an MDT on a dedicated node, two or more OSTs on each OSS node, and a client on each of a large number of compute nodes.

OST – Object Storage Target: The OST stores data as data objects on one or more OSSs. A single Lustre file system can have multiple OSTs, each serving a subset of file data.

Client: The systems that mount the Lustre filesystem.


Steps to create Lustre File System:
Configure Lustre Management Server (lustre-mgs.unixfoo.biz – Server 1)
   1. Add the disk to volume manager

      [root@lustre-mgs mnt]# pvcreate /dev/sdb
      Physical volume "/dev/sdb" successfully created

      [root@lustre-mgs mnt]# pvs
        PV         VG   Fmt  Attr PSize   Pfree
        /dev/sdb        lvm2 --   136.73G 136.73G

   2. Create lustre volume group

      [root@lustre-mgs mnt]# vgcreate lustre /dev/sdb
        Volume group "lustre" successfully created
      [root@lustre-mgs mnt]# vgs
        VG     #PV #LV #SN Attr   VSize   Vfree
        lustre   1   0   0 wz--n- 136.73G 136.73G

   3. Create logical volume "MGS" (the management server)

      [root@lustre-mgs ~]# lvcreate -L 25G -n MGS lustre

   4. Create Lustre Management filesystem.

      [root@lustre-mgs ~]# mkfs.lustre --mgs /dev/lustre/MGS

         Permanent disk data:
      Target:     MGS
      Index:      unassigned
      Lustre FS:  lustre
      Mount type: ldiskfs
      Flags:      0x74
                    (MGS needs_index first_time update )
      Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
      Parameters:

      checking for existing Lustre data: not found
      device size = 10240MB
      2 6 18
      formatting backing filesystem ldiskfs on /dev/lustre/MGS
              target name  MGS
              4k blocks     0
              options        -J size=400 -q -O dir_index,uninit_groups -F

      mkfs_cmd = mkfs.ext2 -j -b 4096 -L MGS  -J size=400 -q -O dir_index,uninit_groups -F /dev/lustre/MGS
      Writing CONFIGS/mountdata
      [root@lustre-mgs ~]#

   5. Activate the Lustre management filesystem using the mount command (create the /lustre/MGS mount point first if it does not exist).

      [root@lustre-mgs ~]# mount -t lustre /dev/lustre/MGS /lustre/MGS/
      [root@lustre-mgs ~]# df
      Filesystem           1K-blocks      Used Available Use% Mounted on
      /dev/sda2             54558908   5276572  46466144  11% /
      /dev/sda1               497829     29006    443121   7% /boot
      tmpfs                  8216000         0   8216000   0% /dev/shm
      /dev/lustre/MGS       10321208    433052   9363868   5% /lustre/MGS
      [root@lustre-mgs ~]#

Configure Lustre Metadata Server (In this guide, both the management & metadata server runs on the same host)
   1. Create logical volume "MDT_unixfoo_cloud"

      [root@lustre-mgs ~]# lvcreate -L 25G -n MDT_unixfoo_cloud lustre

   2. Create the Lustre metadata filesystem for the filesystem “unixfoo_cloud”.

      [root@lustre-mgs ~]# mkfs.lustre --fsname=unixfoo_cloud --mdt  --reformat --mgsnode=lustre-mgs@tcp0 /dev/lustre/MDT_unixfoo_cloud
         Permanent disk data:
      Target:     unixfoo_cloud-MDTffff
      Index:      unassigned
      Lustre FS:  unixfoo_cloud
      Mount type: ldiskfs
      Flags:      0x71
                    (MDT needs_index first_time update )

      Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
      Parameters: mgsnode=10.217.0.237@tcp mdt.group_upcall=/usr/sbin/l_getgroups
      device size = 20480MB
      2 6 18

      formatting backing filesystem ldiskfs on /dev/lustre/MDT_unixfoo_cloud
              target name  unixfoo_cloud-MDTffff
              4k blocks     0
              options        -J size=400 -i 4096 -I 512 -q -O dir_index,uninit_groups -F
      mkfs_cmd = mkfs.ext2 -j -b 4096 -L unixfoo_cloud-MDTffff  -J size=400 -i 4096 -I 512 -q -O dir_index,uninit_groups -F /dev/lustre/MDT_unixfoo_cloud
      Writing CONFIGS/mountdata
      [root@lustre-mgs ~]#

   3. Activate the metadata filesystem using the mount command.

      [root@lustre-mgs ~]# mkdir /lustre/MDT_unixfoo_cloud
      [root@lustre-mgs ~]# mount -t lustre  /dev/lustre/MDT_unixfoo_cloud /lustre/MDT_unixfoo_cloud

Configure OSTs ( servers: oss1, oss2 .. )
   1. Add /dev/md0 to volume manager

      [root@oss1 ~]# pvcreate /dev/md0

   2. Create volume group "lustre"

      [root@oss1 ~]# vgcreate lustre /dev/md0

   3. Create logical volume (ost) for unixfoo_cloud

      [root@oss1 ~]# lvcreate -n OST_unixfoo_cloud_1 -L 100G lustre

   4. Create OST lustre filesystem

      [root@oss1 ~]# mkfs.lustre --fsname=unixfoo_cloud --ost --mgsnode=lustre-mgs@tcp0 /dev/lustre/OST_unixfoo_cloud_1

      mkfs_cmd = mkfs.ext2 -j -b 4096 -L unixfoo_cloud-OSTffff  -J size=400 -i 16384 -I 256 -q -O dir_index,uninit_groups -F /dev/lustre/OST_unixfoo_cloud_1
      Writing CONFIGS/mountdata
      [root@oss1 ~]#

   5. Activate the OST using the mount command.

      [root@oss1 ~]# mkdir -p /lustre/unixfoo_cloud_oss1
      [root@oss1 ~]# mount -t lustre /dev/lustre/OST_unixfoo_cloud_1 /lustre/unixfoo_cloud_oss1

Mount on the client:
   1. Mount the lustre filesystem unixfoo_cloud

      [root@lustreclient1 ~]# mount -t lustre lustre-mgs@tcp0:/unixfoo_cloud /mnt

      [root@lustreclient1 ~]# df -h
      Filesystem            Size  Used Avail Use% Mounted on
      /dev/sda2              52G  5.1G   44G  11% /
      /dev/sda1             487M   29M  433M   7% /boot
      tmpfs                 7.9G     0  7.9G   0% /dev/shm
      lustre-mgs@tcp0:/unixfoo_cloud
                             99G  461M   93G   1% /mnt
      [root@lustreclient1 ~]#

   2. Done.
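
      Once mounted, the client can also confirm that the MDT and every OST are visible; a quick sketch using the lfs utility that ships with the Lustre client packages:

      # show per-target usage for the mounted Lustre filesystem
      lfs df -h /mnt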

This completes the quick setup of a Lustre file system with one MGS/MDS, two OSSs, and a client.

Gvim Error for a Non-root User and Solution

Error:
$ gvim filename.txt
Xlib: connection to ":0.0" refused by server
Xlib: No protocol specified
E233: cannot open display
Xlib: connection to ":0.0" refused by server
Xlib: No protocol specified

Solution:

As the user who owns the X session, grant the non-root user access to the display, for example:

xhost                                (list the current access-control state)
xhost +SI:localuser:user2            (allow the local user "user2" to connect to the display)
xhost +local:                        (or allow all local connections)
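
An alternative to opening up the X server with xhost is to hand the X authority cookie to the other user; a sketch only, assuming the desktop session belongs to the user lawrence on display :0.0 (both are assumptions about this setup):

# as lawrence (the session owner): export the cookie for display :0.0
xauth extract - :0.0 > /tmp/x0.cookie
chmod 644 /tmp/x0.cookie

# as the other user: merge the cookie and point at the display
xauth merge /tmp/x0.cookie
export DISPLAY=:0.0
gvim filename.txt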

Cluster Configuration Error

Error:


[root@ha_1_1 ~]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... failed
/etc/cluster/cluster.conf:37: parser error : expected '>'
</clusternodes>
               ^
/etc/cluster/cluster.conf:37: parser error : Extra content at the end of the document
</clusternodes>
               ^
Unable to parse /etc/cluster/cluster.conf. You should either:
 1. Correct the XML mistakes, or
 2. (Re)move the file and attempt to grab a valid copy from the network.
                                                           [FAILED]
Solution:

Fix the XML at the line reported in /etc/cluster/cluster.conf (line 37 here).

Delete:

</clusternode>

Insert:

</cluster>
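
After editing, the file can be checked for well-formed XML before restarting cman; a minimal check, assuming xmllint (from libxml2) is installed:

# prints nothing if the XML is well formed, otherwise shows the parse errors
xmllint --noout /etc/cluster/cluster.conf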

Time Out Shutdown Error in Lustre

Error: timeout during Lustre shutdown.

Solution:

To set the timeout:

lctl conf_param <fsname>.sys.timeout=1000

To see:

cat /proc/sys/lustre/timeout
1000
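
The value can also be read through lctl; a sketch that assumes a Lustre release recent enough to provide the lctl get_param interface:

# print the current obd timeout on any node with the Lustre modules loaded
lctl get_param timeout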

modprobe -vr ehci_hcd
modprobe -v ehci_hcd

Configuring A High Availability Cluster On RHEL/CentOS

Pre-Configuration Requirements:
   1. Assign the hostname nginx-primary to the primary node, with IP address 192.168.100.203 on eth0.
   2. Assign the hostname nginx-slave to the slave node, with IP address 192.168.100.204 on eth0.

Note: On nginx-primary, uname -n must return nginx-primary, and on nginx-slave it must return nginx-slave.

192.168.100.9 is the virtual IP address that will be used for our Nginx webserver (i.e., Nginx will listen on that address).

Assume that Nginx web server has been installed and configured correctly on both nodes.

Configuration:
   1. Download and install the heartbeat package on both nodes. In our case we are using RHEL/CentOS so we will install heartbeat with yum:

                                        yum install heartbeat

   2. Now we have to configure heartbeat on our two node cluster. We will deal with three files. These are:

      authkeys
      ha.cf
      haresources

   3. Before starting the configuration, copy these files to the /etc/ha.d directory. In our case we copy them from the heartbeat documentation directory as given below:

      cp /usr/share/doc/heartbeat-2.1.4/authkeys /etc/ha.d/
      cp /usr/share/doc/heartbeat-2.1.4/ha.cf /etc/ha.d/
      cp /usr/share/doc/heartbeat-2.1.4/haresources /etc/ha.d/

   4. Now let's start configuring heartbeat. First we will deal with the authkeys file; we will use authentication method 2 (sha1). For this we will make the changes in the authkeys file as below.

      vi /etc/ha.d/authkeys

      Then add the following lines:

      auth 2
      2 sha1 test-nginx-ha

      Change the permission of the authkeys file:

      chmod 600 /etc/ha.d/authkeys

   5. Moving to our second file (ha.cf) which is the most important. So edit the ha.cf file with vi:

      vi /etc/ha.d/ha.cf

      Add the following lines in the ha.cf file:

      logfile /var/log/ha-log
      logfacility local0
      keepalive 2
      deadtime 30
      initdead 120
      bcast eth0
      udpport 694
      auto_failback on
      node nginx-primary
      node nginx-slave

      Note: nginx-primary and nginx-slave are the hostnames reported by uname -n on each node.

   6. The final piece of work in our configuration is to edit the haresources file. This file contains information about the resources that we want to make highly available. In our case we want the webserver (nginx) to be highly available:

      vi /etc/ha.d/haresources

      Add the following line:

      nginx-primary 192.168.100.9 nginx

   7. Copy the /etc/ha.d/ directory from nginx-primary to nginx-slave:

      scp -r /etc/ha.d/ root@nginx-slave:/etc/

   8. Create the file index.html on both nodes:

      On nginx-primary:
      echo "nginx-primary test server" > /usr/html/index.html

      On nginx-slave:

      echo "nginx-slave test server" > /usr/html/index.html

   9. Now start heartbeat on the primary (nginx-primary) and the slave (nginx-slave):

      /etc/init.d/heartbeat start

  10. Open web-browser and type in the URL:

      http://192.168.100.9

      It will show “nginx-primary test server”.

  11. Now stop the heartbeat daemon on nginx-primary:

      /etc/init.d/heartbeat stop

      In your browser type in the URL http://192.168.100.9 and press enter.

      It will show “nginx-slave test server”.

  12. We do not need to create a virtual network interface and assign the IP address (192.168.100.9) to it manually; heartbeat does this for us and starts the service (nginx) itself, so there is nothing more to configure here. A quick check is sketched after this step.

      Do not use the IP addresses 192.168.100.203 and 192.168.100.204 for services. These addresses are used by heartbeat for communication between nginx-primary and nginx-slave; if either of them is used for services/resources, it will disturb heartbeat and the cluster will not work.
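
      A quick way to confirm which node currently holds the virtual IP and is serving the page (a sketch; the interface name eth0 and the availability of curl are assumptions):

      # on either node: heartbeat adds 192.168.100.9 as an alias/secondary address on eth0
      ip addr show eth0 | grep 192.168.100.9

      # from any machine on the network: fetch the test page behind the virtual IP
      curl http://192.168.100.9/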

This completes the installation and configuration of a high-availability cluster between two nodes on RHEL/CentOS.

Graphical Calculator for HPL-Calculator

Example PBS Queue Configuration in Torque

#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 02:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = master.modeling.in        (hostname of the server)
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 4826
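
These settings are normally applied through qmgr; a minimal sketch, assuming the pbs_server daemon is already running and the lines above have been saved in a file called queue.conf (a name chosen here for illustration):

# feed the queue and server settings to the Torque server
qmgr < queue.conf

# review the active server and queue configuration
qmgr -c 'print server'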

PBS Issue in Torque

Error:

qsub test.sh

qsub: Bad UID for job execution MSG=root user root not allowed
(the job was submitted by the root user)

Solution:

qsub must not be run as root; submit jobs as a normal user instead. (If root submission is genuinely required, see the sketch below.)
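
If job submission by root is genuinely required, Torque provides an acl_roots server attribute that can whitelist root; a sketch to be used with care, where <submit-host> is a placeholder for the actual submission host:

# allow job submission by root from the named host
qmgr -c "set server acl_roots += root@<submit-host>"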

PBS Issue in Torque

Error:

 pbs_iff: file not setuid root, likely misconfigured
 pbs_iff: cannot connect to em-research00:15001 - fatal error, errno=13
 (Permission denied)
 No Permission.
 qsub: cannot connect to server em-research00 (errno=15007)
The pbs_iff binary must be setuid root. Run 'ls -l /usr/local/sbin/pbs_iff'
(substitute the correct path for your site's installation); you should see
the 's' bit in the user permissions and the file owned by root, for example:

  -rwsr-xr-x 1 root root 16502 Mar 21 19:45 /usr/sbin/pbs_iff

If it isn't setuid root, then fix it with:

Solution:

  chown root:root /path/to/pbs_iff
  chmod 4755 /path/to/pbs_iff                       (4755 sets the setuid bit along with rwxr-xr-x)
  chmod u+s /usr/local/torque/sbin/pbs_iff          (the same fix if Torque is installed under /usr/local/torque)

Ganglia Issue

Ganglia not working properly:

On both the server and the node machines, restart the Ganglia daemons:

/etc/init.d/gmetad restart
/etc/init.d/gmond restart
ganglia

elinks http://localhost/ganglia        (check the web front-end)
/usr/bin/gstat                         (lists the hosts reporting to gmond)
/usr/bin/gmetric                       (injects a custom metric)
/usr/sbin/gmond                        (daemon binaries, for reference)
/usr/sbin/gmetad
ganglia | wc -l                        (counts the metrics being reported)
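
If the daemons restart cleanly but the web page still shows no hosts, a useful low-level check is to query gmond's XML port directly; a sketch, assuming gmond is listening on its default port 8649:

# gmond dumps its XML state to any client connecting to the port;
# every reporting host should appear as a <HOST ...> element
telnet localhost 8649 | grep '<HOST'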

Thursday 19 May 2011

How to Stop cman Manually

# service cman stop
Stopping cluster:
Stopping fencing... done
Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
[FAILED]
Solution:

You can stop cman manually by:

cman_tool leave force

Or

cman_tool leave force remove
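
Once the node has been forced out, its membership state can be double-checked before the next start; a small sketch, with the caveat that the exact wording of the output depends on the cman version:

# should report that this node is no longer part of the cluster (or fail to connect to cman)
cman_tool status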

"Connection reset by peer" Error in Cluster

Error:

Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing... /usr/sbin/cman_tool: Error getting cluster info:
Connection reset by peer failed [FAILED]

Solution:

The fencing step fails because openais is not running. Check its status and start it:

service openais status

service openais start
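
To avoid hitting this at every boot, both services can be enabled persistently; a sketch for a SysV-init RHEL/CentOS system:

# start openais and cman automatically in the default runlevels
chkconfig openais on
chkconfig cman on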