Friday, 3 June 2011

Installing and Configuring Sun Grid Engine

Introduction:
The Sun Grid Engine (SGE) is a batch system. Users submit jobs, which
are placed in queues and then executed depending on the system load,
the opening hours of the queues, and the job priority.


Queues:
The system has several queues defined, but for normal usage only
two are open: one for MPI jobs and one for multithreaded/serial
jobs.

Job suspension:
If a job is still running and the queue closes, the system will suspend
the job until the queue opens again.


Deploying Sun Grid Engine (SGE):
Read the SGE binary license at:
http://gridengine.sunsource.net/project/gridengine/clickthru60.html
It is important that you download these versions of SGE or later.
Earlier versions will not work with Globus GRAM WS.
The two tarballs that you need to download are these files, or more
recent versions with similar naming conventions:
               sge-6.0u7_1-bin-lx24-x86.tar.gz (or latest)
               sge-6.0u7-common.tar.gz (or latest)

Unpacking the SGE distribution:

After downloading the tarballs, create a directory that will serve as the SGE directory. You can do this as the root user:
[server node]# mkdir -p /opt/sge-root
Change into that directory:
[server node]# cd /opt/sge-root/

Now run the following commands as root to unpack the tarballs into the directory you created. Adjust the paths to the tarballs as necessary:
[server node]# gzip -dc /root/sge-6.0u7-common.tar.gz | tar xvpf -
[server node]# gzip -dc /root/sge-6.0u7_1-bin-lx24-x86.tar.gz | tar xvpf -

With GNU tar, each pipeline can also be written as a single command, for example:
[server node]# tar -zxvpf /root/sge-6.0u7-common.tar.gz

Next, set the environment variable SGE_ROOT to point to the directory you created and into which you unpacked the tarballs:
[server node]# export SGE_ROOT=/opt/sge-root
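
The unpacking steps above could be collected into one small shell function. This is only a sketch: unpack_sge is a hypothetical helper name, not part of SGE, and the example paths are simply the ones used in this post.

```shell
#!/bin/sh
# Sketch: unpack one or more SGE tarballs into an SGE root directory
# and export SGE_ROOT. unpack_sge is a hypothetical helper, not an SGE tool.
unpack_sge() {
    sge_root=$1
    shift
    mkdir -p "$sge_root" || return 1
    for tarball in "$@"; do
        if [ ! -f "$tarball" ]; then
            echo "missing tarball: $tarball" >&2
            return 1
        fi
        # Same pipeline as in the manual steps: decompress, then extract
        # inside the SGE root while preserving file permissions.
        gzip -dc "$tarball" | (cd "$sge_root" && tar xpf -) || return 1
    done
    SGE_ROOT=$sge_root
    export SGE_ROOT
}

# Example call, run as root:
# unpack_sge /opt/sge-root /root/sge-6.0u7-common.tar.gz /root/sge-6.0u7_1-bin-lx24-x86.tar.gz
```

The function stops at the first missing tarball, so a mistyped path is reported before the installer is run against a half-populated directory.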

Installing and Configuring SGE:

As the root user, change into the $SGE_ROOT directory and run the following command:
[server node]# ./util/setfileperm.sh $SGE_ROOT
You will see output similar to the following:
WARNING WARNING WARNING
-----------------------------
We will set the the file ownership and permission to
     UserID: 0
     GroupID: 0
     In directory: /opt/sge-root
We will also install the following binaries as SUID-root:
     $SGE_ROOT/utilbin/<arch>/rlogin
     $SGE_ROOT/utilbin/<arch>/rsh
     $SGE_ROOT/utilbin/<arch>/testsuidroot
     $SGE_ROOT/bin/<arch>/sgepasswd
Do you want to set the file permissions (yes/no) [NO] >>
Enter 'yes' to set the file permissions and the command will
complete.
Next, begin the actual installation of SGE by running the
command './install_qmaster'. This command leads you through a
series of command-line menus and prompts. Each necessary step is
shown in detail below, along with the output you should see.
Entries you should type appear on their own line directly after a
prompt (for example 'n' or '20000-20500'). Actions you should take
are written as instructions such as 'Hit <RETURN>'.
[server node]# ./install_qmaster
Welcome to the Grid Engine installation
Grid Engine qmaster host installation
Before you continue with the installation please read these hints:
    • Your terminal window should have a size of at least 80x24
      characters
    • The INTR character is often bound to the key Ctrl-C. The term
      >Ctrl-C< is used during the installation if you have the
      possibility to abort the installation
The qmaster installation procedure will take approximately 5-10
minutes.
Hit <RETURN>
Choosing Grid Engine admin user account
You may install Grid Engine that all files are created with the user id
of an unprivileged user.
This will make it possible to install and run Grid Engine in
directories where user >root< has no permissions to create and write
files and directories.
    • Grid Engine still has to be started by user >root<
    • this directory should be owned by the Grid Engine
      administrator
Do you want to install Grid Engine under an user id other than
>root< (y/n) [y] >>
n
Checking $SGE_ROOT directory
The Grid Engine root directory is:
$SGE_ROOT = /opt/sge-root
If this directory is not correct (e.g. it may contain an automounter
prefix) enter the correct path to this directory or hit <RETURN> to
use default [/opt/sge-root] >>
Hit <RETURN>
ypcat: can't get local yp domain: Local domain name not set
Grid Engine TCP/IP service >sge_qmaster<
There is no service >sge_qmaster< available in your >/etc/services<
file or in your NIS/NIS+ database.
You may add this service now to your services database or choose a
port number. It is recommended to add the service now. If you are
using NIS/NIS+ you should add the service at your NIS/NIS+ server
and not to the local >/etc/services< file.
Please add an entry in the form
sge_qmaster <port_number>/tcp
to your services database and make sure to use an unused port
number.
Please add the service now or press <RETURN> to go to entering a
port number >>
Note: In another terminal edit /etc/services and add the line
sge_qmaster 30000/tcp
When completed enter <RETURN>
Grid Engine TCP/IP service >sge_execd<
There is no service >sge_execd< available in your >/etc/services<
file or in your NIS/NIS+ database.
You may add this service now to your services database or choose a
port number. It is recommended to add the service now. If you are
using NIS/NIS+ you should add the service at your NIS/NIS+ server
and not to the local >/etc/services< file.
Please add an entry in the form
sge_execd <port_number>/tcp
to your services database and make sure to use an unused port
number.
Make sure to use a different port number for the Executionhost as on
the qmaster machine infotext: too few arguments
Please add the service now or press <RETURN> to go to entering a
port number >>
In another terminal edit /etc/services and add the line
sge_execd 30001/tcp
When completed enter <RETURN>
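
The two /etc/services edits above could also be scripted. The following is a minimal sketch, assuming the port numbers 30000 and 30001 chosen in this post. add_service is a hypothetical helper name, and the file is passed as an argument so the function can be tried on a copy first.

```shell
#!/bin/sh
# Sketch: append "NAME PORT/tcp" to a services file unless NAME is already
# defined. Refuses to add the entry if PORT/tcp is already taken.
# add_service is a hypothetical helper, not part of SGE.
add_service() {
    services_file=$1
    name=$2
    port=$3
    if grep -q "^${name}[[:space:]]" "$services_file"; then
        echo "$name already present in $services_file"
        return 0
    fi
    if grep -q "[[:space:]]${port}/tcp" "$services_file"; then
        echo "port ${port}/tcp is already in use in $services_file" >&2
        return 1
    fi
    printf '%s\t%s/tcp\n' "$name" "$port" >> "$services_file"
}

# Example calls, run as root:
# add_service /etc/services sge_qmaster 30000
# add_service /etc/services sge_execd 30001
```

If your site uses NIS/NIS+, the installer text above still applies: the entries belong on the NIS/NIS+ server rather than in the local file.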
Grid Engine cells
Grid Engine supports multiple cells.
If you are not planning to run multiple Grid Engine clusters or if you
don't know yet what is a Grid Engine cell it is safe to keep the
default cell name default
If you want to install multiple cells you can enter a cell name now.
The environment variable
$SGE_CELL=<your_cell_name>
will be set for all further Grid Engine commands.
Enter cell name [default] >>
Hit <RETURN> to accept default
Grid Engine qmaster spool directory
The qmaster spool directory is the place where the qmaster daemon
stores the configuration and the state of the queuing system.
User >root< on this host must have read/write access to the qmaster
spool directory.
If you will install shadow master hosts or if you want to be able to
start the qmaster daemon on other hosts (see the corresponding
section in the Grid Engine Installation and Administration Manual
for details) the account on the shadow master hosts also needs
read/write access to this directory.
The following directory
[/opt/sge-root/default/spool/qmaster]
will be used as qmaster spool directory by default!
Do you want to select another qmaster spool directory (y/n) [n] >>
n
Windows Execution Host Support
Are you going to install Windows Execution Hosts? (y/n) [n] >>
n
Verifying and setting file permissions
Did you install this version with >pkgadd< or did you already verify
and set the file permissions of your distribution (y/n) [y] >>
y
Select default Grid Engine hostname resolving method
Are all hosts of your cluster in one DNS domain? If this is the case
the hostnames
>hostA< and >hostA.foo.com<
would be treated as equal, because the DNS domain name
>foo.com< is ignored when comparing hostnames.
Are all hosts of your cluster in a single DNS domain (y/n) [y] >>
y
Making directories
creating directory: default
creating directory: default/common
creating directory: /opt/sge-root/default/spool/qmaster
creating directory: /opt/sge-root/default/spool/qmaster/job_scripts
Hit <RETURN> to continue >>
Hit <RETURN>
Setup spooling
Your SGE binaries are compiled to link the spooling libraries during
runtime (dynamically). So you can choose between Berkeley DB
spooling and Classic spooling method.
Please choose a spooling method (berkeleydb|classic) [berkeleydb]
>>
enter <RETURN> to accept default
Hit <RETURN>
The Berkeley DB spooling method provides two configurations!
Local spooling:
The Berkeley DB spools into a local directory on this host (qmaster
host)
This setup is faster, but you can't setup a shadow master host
Berkeley DB Spooling Server:
If you want to setup a shadow master host, you need to use Berkeley
DB Spooling Server!
In this case you have to choose a host with a configured RPC
service. The qmaster host connects via RPC to the Berkeley DB.
This setup is more failsafe, but results in a clear potential security
hole. RPC communication (as used by Berkeley DB) can be easily
compromised.
Please only use this alternative if your site is secure or if you are not
concerned about security.
Check the installation guide for further advice on how to achieve
failsafety without compromising security.
Do you want to use a Berkeley DB Spooling Server? (y/n) [n] >>
n
Berkeley Database spooling parameters
Please enter the Database Directory now, even if you want to spool
locally, it is necessary to enter this Database Directory.
Default: [/opt/sge-root/default/spool/spooldb] >>
Hit <RETURN> to accept the default
Grid Engine group id range
When jobs are started under the control of Grid Engine an additional
group id is set on platforms which do not support jobs. This is done
to provide maximum control for Grid Engine jobs.
This additional UNIX group id range must be unused group id's in
your system. Each job will be assigned a unique id during the time it
is running. Therefore you need to provide a range of id's which will
be assigned dynamically for jobs.
The range must be big enough to provide enough numbers for the
maximum number of Grid Engine jobs running at a single moment
on a single host. E.g. a range like >20000-20100< means, that Grid
Engine will use the group ids from 20000-20100 and provides a
range for 100 Grid Engine jobs at the same time on a single host.
You can change at any time the group id range in your cluster
configuration.
Please enter a range >>
20000-20500
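
Before entering a range such as the one above, it may be worth checking that no existing group already uses a GID inside it. The sketch below only inspects a file in /etc/group format, so it will not see groups served by NIS or LDAP. gid_range_free is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed only if no group in GROUP_FILE (in /etc/group format) has
# a GID between LOW and HIGH inclusive. Conflicting groups are printed.
# gid_range_free is a hypothetical helper, not part of SGE.
gid_range_free() {
    awk -F: -v lo="$2" -v hi="$3" '
        $3 >= lo && $3 <= hi { print "in use: " $1 " (gid " $3 ")"; found = 1 }
        END { exit found }
    ' "$1"
}

# Example call:
# gid_range_free /etc/group 20000 20500 && echo "range is free"
```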
Grid Engine cluster configuration
Please give the basic configuration parameters of your Grid Engine
installation:
<execd_spool_dir>
The pathname of the spool directory of the execution hosts. User
>root< must have the right to create this directory and to write into
it.
Default: [/opt/sge-root/default/spool] >>
Hit <RETURN> to accept the default
Grid Engine cluster configuration (continued)
<administrator_mail>
The email address of the administrator to whom problem reports are
sent.
It's is recommended to configure this parameter. You may use
>none< if you do not wish to receive administrator mail.
Please enter an email address in the form >user@foo.com<.
Default: [none] >>
Hit <RETURN> to accept the default
The following parameters for the cluster configuration were
configured:
execd_spool_dir /opt/sge-root/default/spool
administrator_mail none
Do you want to change the configuration parameters (y/n) [n] >>
n
Creating local configuration
Creating >act_qmaster< file
Adding default complex attributes
Reading in complex attributes.
Adding default parallel environments (PE)
Reading in parallel environments:
PE "make.sge_pqs_api".
Adding SGE default usersets
Reading in usersets:
Userset "defaultdepartment".
Userset "deadlineusers".
Adding >sge_aliases< path aliases file
Adding >qtask< qtcsh sample default request file
Adding >sge_request< default submit options file
Creating >sgemaster< script
Creating >sgeexecd< script
Creating settings files for >.profile/.cshrc<
Hit <RETURN> to continue >>
Hit <RETURN>
qmaster/scheduler startup script
We can install the startup script that will
start qmaster/scheduler at machine boot (y/n) [y] >>
n
Grid Engine qmaster and scheduler startup
Starting qmaster and scheduler daemon. Please wait ...
starting sge_qmaster
starting sge_schedd
Hit <RETURN> to continue >>
Hit <RETURN>
Adding Grid Engine hosts
Please now add the list of hosts, where you will later install your
execution daemons. These hosts will be also added as valid submit
hosts.
Please enter a blank separated list of your execution hosts. You may
press <RETURN> if the line is getting too long. Once you are
finished simply press <RETURN> without entering a name.
You also may prepare a file with the hostnames of the machines
where you plan to install Grid Engine. This may be convenient if
you are installing Grid Engine on many hosts.
Do you want to use a file which contains the list of hosts (y/n) [n] >>
n
Adding admin and submit hosts
Please enter a blank seperated list of hosts.
Stop by entering <RETURN>. You may repeat this step until you are
entering an empty list. You will see messages from Grid Engine
when the hosts are added.
Host(s):
Hit <RETURN> twice
If you want to use a shadow host, it is recommended to add this host
to the list of administrative hosts.
If you are not sure, it is also possible to add or remove hosts after the
installation with <qconf -ah hostname> for adding and <qconf -dh
hostname> for removing this host
Attention: This is not the shadow host installation procedure. You
still have to install the shadow host separately
Do you want to add your shadow host(s) now? (y/n) [y] >>
n
Creating the default <all.q> queue and <allhosts> hostgroup
root@nodeC.ps.univa.com added "@allhosts" to host group list
root@nodeC.ps.univa.com added "all.q" to cluster queue list
Hit <RETURN> to continue >>
Hit <RETURN>

Scheduler Tuning
The details on the different options are described in the manual.

Configurations
    1. Normal
       Fixed interval scheduling, report scheduling information, actual
       + assumed load
    2. High
       Fixed interval scheduling, report limited scheduling
       information, actual load
    3. Max
       Immediate Scheduling, report no scheduling information, actual
       load

Enter the number of your preferred configuration and hit
<RETURN>!
Default configuration is [1] >>
1

Using Grid Engine
You should now enter the command:
source /opt/sge-root/default/common/settings.csh
if you are a csh/tcsh user or
# . /opt/sge-root/default/common/settings.sh
if you are a sh/ksh user.
This will set or expand the following environment variables:
    • $SGE_ROOT (always necessary)
    • $SGE_CELL (if you are using a cell other than >default<)
    • $SGE_QMASTER_PORT (if you haven't added the service
      >sge_qmaster<)
    • $SGE_EXECD_PORT (if you haven't added the service
      >sge_execd<)
    • $PATH/$path (to find the Grid Engine binaries)
    • $MANPATH (to access the manual pages)
Hit <RETURN> to see where Grid Engine logs messages >>
Hit <RETURN>
Grid Engine messages
Grid Engine messages can be found at:
/tmp/qmaster_messages (during qmaster startup)
/tmp/execd_messages (during execution daemon startup)
After startup the daemons log their messages in their spool
directories.
Qmaster: /opt/sge-root/default/spool/qmaster/messages
Exec daemon: <execd_spool_dir>/<hostname>/messages
Grid Engine startup scripts
Grid Engine startup scripts can be found at:
/opt/sge-root/default/common/sgemaster (qmaster and scheduler)
/opt/sge-root/default/common/sgeexecd (execd)
Do you want to see previous screen about using Grid Engine again
(y/n) [n] >>
n
Your Grid Engine qmaster installation is now completed
Please now login to all hosts where you want to run an execution
daemon and start the execution host installation procedure.
If you want to run an execution daemon on this host, please do not
forget to make the execution host installation in this host as well.
All execution hosts must be administrative hosts during the
installation. All hosts which you added to the list of administrative
hosts during this installation procedure can now be installed.
You may verify your administrative hosts with the command
# qconf -sh
and you may add new administrative hosts with the command
# qconf -ah <hostname>
Please hit <RETURN> >>
Hit <RETURN>
This completes the first part of the SGE installation and
configuration. Before continuing you need to set up your
environment by doing the following:
[server node]# source /opt/sge-root/default/common/settings.sh
You can verify that nodeC is configured properly to be the SGE
administrative host by running
[server node sge-root]# qconf -sh
nodeC.ps.univa.com
Next nodeC needs to be configured as an execution host. Run the
following command and again enter the indicated values for each
menu choice:
[server node sge-root]# /opt/sge-root/install_execd
Welcome to the Grid Engine execution host installation
If you haven't installed the Grid Engine qmaster host yet, you must
execute this step (with >install_qmaster<) prior the execution host
installation.
For a sucessfull installation you need a running Grid Engine qmaster.
It is also neccesary that this host is an administrative host.

You can verify your current list of administrative hosts with the command:
# qconf -sh
You can add an administrative host with the command:
# qconf -ah <hostname>
The execution host installation will take approximately 5 minutes.
Hit <RETURN> to continue >>
Hit <RETURN>
Checking $SGE_ROOT directory
The Grid Engine root directory is:
$SGE_ROOT = /opt/sge-root
If this directory is not correct (e.g. it may contain an automounter
prefix) enter the correct path to this directory or hit <RETURN> to
use default [/opt/sge-root] >>
Hit <RETURN>
Grid Engine cells
Please enter cell name which you used for the qmaster installation or
press <RETURN> to use [default] >>
Hit <RETURN> for default
Checking hostname resolving
This hostname is known at qmaster as an administrative host.
Hit <RETURN> to continue >>
Hit <RETURN>
Local execd spool directory configuration
During the qmaster installation you've already entered a global execd
spool directory. This is used, if no local spool directory is
configured.
Now you can enter a local spool directory for this host.
Do you want to configure a local spool directory for this host (y/n)
[n] >>
n
Creating local configuration
root@nodeC.ps.univa.com modified "nodeC.ps.univa.com" in
configuration list
Local configuration for host >nodeC.ps.univa.com< created.
Hit <RETURN> to continue >>
Hit <RETURN>
execd startup script
We can install the startup script that will start execd at machine boot
(y/n) [y] >>
n
Grid Engine execution daemon startup
Starting execution daemon. Please wait ...
starting sge_execd
Hit <RETURN> to continue >>
Hit <RETURN>
Adding a queue for this host
We can now add a queue instance for this host:
    • it is added to the >allhosts< hostgroup
    • the queue provides 2 slot(s) for jobs in all queues referencing
      the >allhosts< hostgroup
You do not need to add this host now, but before running jobs on this
host it must be added to at least one queue.
Do you want to add a default queue instance for this host (y/n) [y]
>>
y
root@nodeC.ps.univa.com modified "@allhosts" in host group list
root@nodeC.ps.univa.com modified "all.q" in cluster queue list
Using Grid Engine
You should now enter the command:
source /opt/sge-root/default/common/settings.csh
if you are a csh/tcsh user or
# . /opt/sge-root/default/common/settings.sh
if you are a sh/ksh user.
This will set or expand the following environment variables:
   • $SGE_ROOT (always necessary)
   • $SGE_CELL (if you are using a cell other than >default<)
   • $SGE_QMASTER_PORT (if you haven't added the service
     >sge_qmaster<)
   • $SGE_EXECD_PORT (if you haven't added the service
     >sge_execd<)
   • $PATH/$path (to find the Grid Engine binaries)
   • $MANPATH (to access the manual pages)
Hit <RETURN> to see where Grid Engine logs messages >>
Hit <RETURN>
Grid Engine messages
Grid Engine messages can be found at:
/tmp/qmaster_messages (during qmaster startup)
/tmp/execd_messages (during execution daemon startup)
After startup the daemons log their messages in their spool
directories.
Qmaster: /opt/sge-root/default/spool/qmaster/messages
Exec daemon: <execd_spool_dir>/<hostname>/messages
Grid Engine startup scripts
Grid Engine startup scripts can be found at:
/opt/sge-root/default/common/sgemaster (qmaster and scheduler)
/opt/sge-root/default/common/sgeexecd (execd)
Do you want to see previous screen about using Grid Engine again
(y/n) [n] >>
n

Note:
This completes the installation and configuration of SGE.

Testing SGE:
As the root user you should make sure that the SGE daemons are running:

[server node sge-root]# ps auwwwx|grep sge
root 9159 0.0 0.3 106340 3800 ? Sl 10:43 0:00 /opt/sge-root/bin/lx24-x86/sge_qmaster
root 9179 0.0 0.2 48424 2400 ? Sl 10:43 0:00 /opt/sge-root/bin/lx24-x86/sge_schedd
root 9610 0.0 0.1 5176 1820 ? S 10:53 0:00 /opt/sge-root/bin/lx24-x86/sge_execd
If the SGE daemons are not running simply run the following three
commands as root:
/opt/sge-root/bin/lx24-x86/sge_qmaster
/opt/sge-root/bin/lx24-x86/sge_schedd
/opt/sge-root/bin/lx24-x86/sge_execd
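
The daemon check above can be turned into a small filter that reports only what is missing. This is a sketch, assuming the three daemon names shown in the process listing in this post. missing_sge_daemons is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read a process listing on stdin and print the name of each
# expected SGE daemon whose binary path does not appear in it.
# missing_sge_daemons is a hypothetical helper, not part of SGE.
missing_sge_daemons() {
    listing=$(cat)
    for daemon in sge_qmaster sge_schedd sge_execd; do
        case $listing in
            *"/$daemon"*) ;;          # found a path ending in the daemon name
            *) echo "$daemon" ;;      # not running: report it
        esac
    done
}

# Example call:
# ps auwwwx | missing_sge_daemons
```

An empty result means all three daemons were found. Matching on "/sge_qmaster" rather than just "sge" avoids counting the grep or ps process itself.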
Also as the root user you can check the state of the compute node
and the queue:
[server node sge-root]# /opt/sge-root/bin/lx24-x86/qstat -f
queuename                    qtype used/tot. load_avg arch       states
all.q@nodeC.ps.univa.com BIP 0/2             0.00       lx24-x86
Before submitting a job you need to add nodeC as a node from
which submitting jobs is allowed. You can do that using the 'qconf'
command as shown below:
[server node sge-root]# /opt/sge-root/bin/lx24-x86/qconf -as nodec
nodeC.ps.univa.com added to submit host list
Next you can submit a simple test job as shown:
[server node sge-root]# /opt/sge-root/bin/lx24-x86/qsub /opt/sge-root/examples/jobs/simple.sh
Your job 1 ("simple.sh") has been submitted.
You can query for the state of the job using 'qstat' as shown:
[server node sge-root]# /opt/sge-root/bin/lx24-x86/qstat
job-ID  prior    name       user  state  submit/start at      queue                     slots  ja-task-ID
1       0.55500  simple.sh  root  r      02/13/2006 11:07:36  all.q@nodeC.ps.univa.com  1
[server node sge-root]# /opt/sge-root/bin/lx24-x86/qstat -f
queuename                     qtype used/tot. load_avg arch     states
all.q@nodeC.ps.univa.com BIP 0/2              0.00     lx24-x86

Next use the "Jane User" account to test and make sure that a non-root user can submit and run jobs:

[server node sge-root]# su - jane
Before submitting a job the environment for 'jane' needs to be set up:
[server node ~]$ export SGE_ROOT=/opt/sge-root
[server node ~]$ source /opt/sge-root/default/common/settings.sh
User jane can check the state of SGE:
[server node ~]$ /opt/sge-root/bin/lx24-x86/qstat -f
queuename                     qtype used/tot. load_avg arch     states
all.q@nodeC.ps.univa.com BIP 0/2              0.00     lx24-x86
User jane can submit a job as shown:
[server node ~]$ /opt/sge-root/bin/lx24-x86/qsub /opt/sge-root/examples/jobs/simple.sh
Your job 2 ("simple.sh") has been submitted.
User jane can query on a job's state as shown:
[server node ~]$ /opt/sge-root/bin/lx24-x86/qstat
job-ID  prior    name       user  state  submit/start at      queue                     slots  ja-task-ID
2       0.00000  simple.sh  jane  qw     02/13/2006 11:12:57  all.q@nodeC.ps.univa.com  1

When the job completes user jane should find two files, one for stdout from the job and one for stderr from the job:
[server node ~]$ ls
simple.sh.e2 simple.sh.o2
[server node ~]$ cat simple.sh.o2
Mon Feb 13 11:13:06 CST 2006
Mon Feb 13 11:13:26 CST 2006
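
The relationship between the job ID and the output file names shown above can be captured in two small helpers. This is a sketch that assumes the qsub confirmation message has exactly the format shown in this post. job_id_from_qsub and job_output_files are hypothetical names.

```shell
#!/bin/sh
# Sketch: extract the numeric job ID from a qsub confirmation line on stdin,
# e.g.  Your job 2 ("simple.sh") has been submitted.
# job_id_from_qsub is a hypothetical helper, not part of SGE.
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Sketch: print the stdout and stderr file names for job NAME with ID,
# following the <name>.o<id> / <name>.e<id> pattern seen in this post.
job_output_files() {
    echo "$1.o$2 $1.e$2"
}

# Example calls:
# id=$(qsub /opt/sge-root/examples/jobs/simple.sh | job_id_from_qsub)
# job_output_files simple.sh "$id"
```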

