Friday, 14 October 2011

Compiling, Linking and Running




I get a "segmentation fault" error during my run. What happened?

You get a segmentation fault error when your program tries to access a memory address outside the address space assigned to it.
This can be caused by (sorted by frequency):
  • using a wrong array index: for example, accessing element 1000 of a 100-element array, or computing an index incorrectly
  • using a pointer with an invalid address: for example, an uninitialized pointer, or a pointer returned by a failed dynamic allocation
  • in C programs, passing arguments of the wrong type or number to printf/scanf (and similar) functions
  • too aggressive compiler optimization: aggressive optimization activates tricks that are valid under certain assumptions, but can cause problems if those assumptions are not satisfied
  • out of stack problems: your program declares data structures that are too large for the stack (for example, big local arrays)
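The first two causes, and the stack problem, can often be avoided by checking before touching memory. Below is a minimal C sketch that assumes a plain array of doubles. `checked_get` and `alloc_work` are hypothetical helpers, not library functions. `alloc_work` also shows the usual cure for out-of-stack problems: allocate large arrays on the heap instead of declaring them as local variables.

```c
#include <stdlib.h>

/* Hypothetical helper: returns 0 and stores a[i] in *out if i is a valid
 * index into an array of n elements, -1 otherwise, instead of silently
 * reading past the end of the array (or through a NULL pointer). */
int checked_get(const double *a, size_t n, size_t i, double *out)
{
    if (a == NULL || i >= n)
        return -1;
    *out = a[i];
    return 0;
}

/* Hypothetical helper: allocates a large, zero-filled work array on the heap
 * rather than as a local (stack) array. Returns NULL if the allocation fails;
 * the caller must check for NULL before using the pointer. */
double *alloc_work(size_t n)
{
    return calloc(n, sizeof(double));
}
```

A typical caller tests the result of `alloc_work` before the first access and calls `free` when done. `checked_get(w, 100, 1000, &v)` then returns -1 instead of touching element 1000 of a 100-element array.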
Suggested Solutions:
  • compile your program with the -O0 -g compiler options (no optimization, debugging symbols included)
  • also add compiler options to activate boundary checks (-check bounds for ifort, -Mbounds for pgf90, see man pages)
  • run your program with a simple test case
  • use TotalView to debug your program in an interactive session

I get an "undefined reference to" error during linking. What can I do?

An ‘undefined reference to symbol_name’ is a typical linking error (after compiling, linking is the last phase in building an executable). It appears when you have not provided all the needed object files (.o files) or libraries.
Suggested Solutions:
  • from the symbol name in the error message, work out which object file or library defines the missing symbol
  • check that you have provided the correct path and library name in your link command line: specify -L/path/to/your/lib/dir -llib_name for each needed library
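A frequent source of confusion is how -llib_name maps to a file. With GNU ld, -lname makes the linker search each -L directory, plus the default ones, for libname.so and then for libname.a. The shared object is preferred when both exist, unless static linking is requested. The hypothetical helper below simply builds those two candidate file names, so you can check by hand that one of them really exists in the directory you passed with -L.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds the file names GNU ld looks for in each -L
 * directory when given -l<name>: lib<name>.so (tried first) and lib<name>.a.
 * Returns 0 on success, -1 if name is empty or a buffer is too small. */
int lib_candidates(const char *name, char *shared, size_t shared_len,
                   char *archive, size_t archive_len)
{
    if (name == NULL || strlen(name) == 0)
        return -1;
    int a = snprintf(shared, shared_len, "lib%s.so", name);
    int b = snprintf(archive, archive_len, "lib%s.a", name);
    if (a < 0 || (size_t)a >= shared_len || b < 0 || (size_t)b >= archive_len)
        return -1;
    return 0;
}
```

For example, -lacml_mp corresponds to libacml_mp.so or libacml_mp.a.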

I get an "error while loading shared libraries". What can I do?

When you run a program and get the following error:
error while loading shared libraries: libfoo.so: cannot open shared object file: No such file or directory
this means that your program was built using dynamic linking, that is, the needed libraries are loaded at run time, so the system must know where to look for them. The LD_LIBRARY_PATH environment variable contains a list of directories, separated by “:”, that the OS searches for dynamic libraries.
Suggested Solutions:
  • load all needed modules you have used to build your program;
  • before running the program, use the command ldd your_executable_name: you will get a list of all dynamic libraries your executable needs at run time. Check that all your libraries are correctly found;
  • set the LD_LIBRARY_PATH environment variable properly, so that it includes the needed directories.
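The lookup through LD_LIBRARY_PATH can be pictured with the small sketch below. It walks a colon-separated list of directories and reports the first one that contains the requested file. `find_in_dir_list` is hypothetical. The real loader also consults other locations, for example an rpath embedded in the executable and the system default directories. Treat this only as an illustration of the lookup order within LD_LIBRARY_PATH.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: looks for `libname` in each directory of a
 * colon-separated list (such as the value of LD_LIBRARY_PATH), in order.
 * Returns 1 and writes the full path to `found` on the first match, 0 if no
 * directory contains a readable file with that name. Empty entries are
 * skipped for simplicity. */
int find_in_dir_list(const char *dirs, const char *libname,
                     char *found, size_t found_len)
{
    const char *start = dirs;
    while (start != NULL && *start != '\0') {
        const char *colon = strchr(start, ':');
        size_t len = colon ? (size_t)(colon - start) : strlen(start);
        if (len > 0) {
            char candidate[4096];
            int n = snprintf(candidate, sizeof candidate, "%.*s/%s",
                             (int)len, start, libname);
            if (n > 0 && (size_t)n < sizeof candidate &&
                access(candidate, R_OK) == 0) {
                snprintf(found, found_len, "%s", candidate);
                return 1;
            }
        }
        start = colon ? colon + 1 : NULL;   /* move to the next entry */
    }
    return 0;
}
```

Calling it with the value of getenv("LD_LIBRARY_PATH") and "libfoo.so" mimics the part of the check that ldd reports as "not found" when no listed directory holds the file.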

BLACS linking with mpicc: undefined reference to `mpi_init_'?

When linking a C program with the Open MPI mpicc compiler wrapper against the BLACS libraries (for example when using ScaLAPACK), you should also add the Open MPI Fortran libraries to your link command line: -lmpi_f90 -lmpi_f77
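The trailing underscore in `mpi_init_` is the hint. gfortran, like many Fortran compilers, lowercases external names by default and appends an underscore, so `mpi_init_` is the Fortran MPI_INIT entry point. That entry point is provided by the MPI Fortran libraries, not by the C library that mpicc links. The sketch below shows the same convention from the C side, assuming gfortran's default naming and Fortran's pass-by-reference arguments. `scale_` is a hypothetical routine, not part of BLACS or MPI.

```c
/* A Fortran "CALL SCALE(X, N, ALPHA)" compiled with gfortran's default
 * naming links against the symbol "scale_", and every argument is passed
 * by reference. scale_ is a hypothetical example routine. */
void scale_(double *x, const int *n, const double *alpha)
{
    for (int i = 0; i < *n; i++)
        x[i] *= *alpha;
}

/* A C caller must use the same underscored name and pass pointers. */
void scale_from_c(double *x, int n, double alpha)
{
    scale_(x, &n, &alpha);
}
```

Running nm on an object file compiled from Fortran is one way to see these underscored names and compare them with the symbol in the error message.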

How can I submit a parallel job using both MPI and OpenMP?

Suppose you want to run 4 MPI processes, each one being parallelized with OpenMP and running on 8 threads. Your PBS job script should:
  • ask for 32 processors, distributed on 4 nodes: #PBS -l nodes=4:ppn=8
  • load one of the OpenMPI modules
  • set the number of OpenMP threads of each MPI process:
    • with bash: export OMP_NUM_THREADS=8
    • with tcsh: setenv OMP_NUM_THREADS 8
  • invoke mpiexec with the following switches:
    • -n 4: to set the number of the MPI processes to run
    • --bynode: to allocate the MPI processes with a per-node scheme
    • -x OMP_NUM_THREADS: to export the variable OMP_NUM_THREADS to the environment of every MPI process
For example:
#!/bin/bash
#PBS -l nodes=4:ppn=8

module load openmpi/1.2.5/64/pgi-8.0-1
export OMP_NUM_THREADS=8

mpiexec -x OMP_NUM_THREADS --bynode -n 4 your_executable
Please note that in the presence of --bynode, the -n option cannot be omitted. For further information about the command-line options of mpiexec, please look at the mpiexec man page.
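The process/thread arithmetic behind this example, and the reason for --bynode, can be checked with a small sketch. It assumes that every node offers ppn slots, as requested in the PBS line. It models a by-slot placement as filling one node's slots before moving to the next, and --bynode as dealing ranks round-robin over the nodes. All helper names are hypothetical.

```c
/* Sketch: where MPI ranks land under two placement policies, assuming every
 * node offers `ppn` slots. All names here are hypothetical helpers. */

/* By slot: fill all ppn slots of one node before moving to the next. */
static int byslot_node(int rank, int ppn) { return rank / ppn; }

/* By node (--bynode): deal ranks round-robin across the nodes. */
static int bynode_node(int rank, int nnodes) { return rank % nnodes; }

#define MAX_NODES 1024

/* Largest number of OpenMP threads placed on any single node when `nranks`
 * ranks each run `nthreads` threads. Pass bynode = 1 for --bynode. */
int max_threads_per_node(int nranks, int nthreads, int nnodes, int ppn,
                         int bynode)
{
    int ranks_on[MAX_NODES] = {0};
    int busiest = 0;
    for (int r = 0; r < nranks; r++) {
        int node = bynode ? bynode_node(r, nnodes) : byslot_node(r, ppn);
        if (node >= MAX_NODES)
            continue;               /* outside the sketch's fixed table */
        if (++ranks_on[node] > busiest)
            busiest = ranks_on[node];
    }
    return busiest * nthreads;
}
```

With the values above, max_threads_per_node(4, 8, 4, 8, 1) gives 8, one rank of 8 threads per node. The by-slot case (last argument 0) gives 32: all four ranks and their 32 threads would share the 8 cores of the first node.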

Which ACML library should I use?

As you can see, ACML libraries come in 2 different flavours. Use them carefully:
  • acml: standard 64-bit ACML. This is the right choice for most users and works well in almost all cases; try it first.
Another ACML flavour is available:
  • acml_mp: 64-bit ACML libraries with multi-threaded support. Some ACML routines have been enabled to run in multi-threaded mode, using the number of threads defined by the OMP_NUM_THREADS environment variable. Use it only if you are running a serial program on a single node or if you know what you are doing. Simply append “_mp” to the $ACMLDIR environment variable set by the acml module.
When you load the ACML module, only the path to the standard (not mp) version of the library is added to the LD_LIBRARY_PATH environment variable. So, for the proper shared objects to be used, you need to specify the path yourself. You have two main choices:
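  • Link the static library:
gfortran -fopenmp BETA.f ${ACMLDIR}_mp/lib/libacml_mp.a
no -L or -l option is needed, since you pass the full path of the archive file directly. The archive must come after the source file so that the linker can resolve the symbols BETA.f needs. The ACML shared objects will not be needed at run time.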
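  • Use shared libraries:
gfortran -fopenmp -L${ACMLDIR}_mp/lib -Wl,-rpath,${ACMLDIR}_mp/lib BETA.f -lacml_mp
the -rpath linker option (passed through -Wl) records the library directory in the executable, so the dynamic loader finds libacml_mp.so at run time even though that directory is not in LD_LIBRARY_PATH.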
