Scientific Computing Facilities Frequently Asked Questions
General Questions
1. What are the Scientific Computing Facilities (SCF)?
- The Scientific Computing Facilities (SCF) currently include the IBM Blue Gene system, nine IBM pSeries 655 machines, an Intel Pentium III Linux Cluster, an IBM Katana Cluster (which is largely based on the AMD Opteron processor), and our virtual reality/scientific visualization facilities. They are part of the more general SCV Computing and Visualization Facilities.
A general introduction to the use of the SCF is available at Information for New SCF Users. The primary computational machines are listed under our Scientific Computing Facility Technical Summary.
- 2. What machines is my account good for?
- Your SCF account gives you access to the IBM pSeries 655 (twister.bu.edu), the Intel Pentium III Linux Cluster (skate.bu.edu and cootie.bu.edu), and the IBM Katana Cluster (katana.bu.edu). Access to the IBM Blue Gene (levi.bu.edu and lee.bu.edu) is not automatic; special restrictions apply. If you need access to the Blue Gene, fill out this web form.
The primary computation clusters each have one or two machines designated for interactive use, and you can only log in to those machines. On the pSeries, the machine is twister.bu.edu; on the Linux Cluster, they are skate.bu.edu and cootie.bu.edu; and on the Blue Gene, for those who have access to it, they are levi.bu.edu and lee.bu.edu. Your Unix password is shared over all of these systems.
You should log into one of the above machines using ssh and do all your editing and compiling there as well. If your program runs on a single processor and requires less than ten minutes of CPU time, you can also execute your program on one of these machines (with the exception of the Blue Gene systems) interactively. Otherwise, you should submit your program as a batch job and it will automatically be parceled out to the appropriate machine in the facilities based on available resources and what queue you select (please also see later questions on batch system usage).
- 3. Where do I find documentation?
-
- 4. What debuggers are there?
- On the IBM pSeries machines, the debugger is pdbx. pdbx is a command-line parallel debugger suitable for MPI.
On the Linux cluster, the debugger is idb, the Intel debugger.
On the Katana cluster, the debugger is pgdbg, the Portland Group parallel debugger.
There are also the standard debuggers dbx and gdb.
The debugger on the Blue Gene system is TotalView.
- 5. How do I change my password?
- To change your SCF password, you must run passwd on twister.bu.edu.
- 6. I can log in to the Linux systems (skate, cootie, katana) but not to twister. What's up?
- Most likely you are using your BU Kerberos password. This password will work on the Linux systems but not on our other systems. You also have a Unix password, which you set up when you first got an account on the SCF (but of course may have changed since). Only this password will allow you to log in to Twister or the Blue Gene systems (lee, levi). It can also be used for the Linux systems, but for the other systems it is required. If you can't remember your Unix password, please send email to scfacct@bu.edu explaining your situation.
- 7. What do I do if I forget my password?
- Please send email to scfacct@bu.edu explaining your situation.
- 8. How do I retrieve lost files?
- Please send email to help@twister.bu.edu, explaining exactly what files you deleted, what machine and filesystem they were on, and at what day and time you did it.
- 9. How do I get more resources (such as disk space)?
- For home directory disk space, fill out this form. If you are a Principal Investigator for a project which needs more CPU time or /project space, try using the appropriate form linked to from http://scv.bu.edu/accounts/. Make sure to specify what machine you are requesting resources on, why you need them and what exactly you need. These requests, particularly large ones, can take several weeks to process and consider.
Filesystem/Disk Questions
10. Which filesystems are shared?
- You have one home directory on the IBM pSeries systems and one on the Linux Cluster/Blue Gene. Your Linux home directory is accessible from the IBM machines with the pathname /linux/$HOME. Similarly, your IBM home directory is accessible from the Linux machines with the pathname /ibm/$HOME.
Each machine (with the exception of the Blue Gene systems) has its own scratch partition. If necessary, you can access the /scratch partitions on other machines of the same architecture. On the pSeries systems you can access a remote scratch space via the pathname /hostname/scratch (for example, /frisbee/scratch). On the Linux Cluster machines, a remote scratch space can be accessed via the pathname /net/hostname/scratch (for example, /net/node006/scratch). On the Katana Cluster machines, use the pathname "/net/katana-xNN/scratch", where x is the letter a, b, or c and NN is a node number from 01, 02, ... 14.
There are several partitions of Project space. All of the /project file systems can be accessed from any of the SCF machines.
- 11. I need large amounts of temporary space for my jobs. What do I do?
- Use /scratch and see the previous question.
If /scratch on a given machine is full, you should do one of the following things. 1) Remove as many files as you can which you no longer need, to free up space. 2) Use /scratch on a different machine which has more space (see the previous question).
If this is a regular need and /scratch does not adequately take care of it, the Principal Investigator of your project can apply for /project disk space, backed up or not backed up as appropriate.
- 12. Why do my files in /scratch automatically get removed, sometimes even immediately after I unTARed them?
- The /scratch reaper automatically removes files which are more than 10 days old. It determines how old a file is by looking at its "write date." By default, tar does not modify write dates, so an older file which is unTARed will be reaped at the next opportunity. The -m switch to the tar command can be used to override this behavior. The following is from the tar man page:
m Do not restore the modification times.
The modification time will be the time of extraction.
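As a rough illustration of the same idea outside of tar itself, the following Python sketch extracts an archive, notes which files came out with a write date older than the 10-day limit, and then stamps every extracted file with the current time, which is the effect the -m switch has. The function names and file names are hypothetical, and nothing here reflects how the reaper itself is implemented.

```python
import os
import tarfile
import tempfile
import time

DAY = 24 * 3600
REAP_AGE = 10 * DAY  # the 10-day limit quoted above, in seconds


def extract_with_fresh_mtime(archive_path, dest_dir):
    """Extract archive_path into dest_dir, then stamp regular files with the current time."""
    # Newer Python versions want an explicit extraction filter; older ones lack it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path) as tar:
        members = [m for m in tar.getmembers() if m.isfile()]
        tar.extractall(dest_dir, **kwargs)
    stale = []
    now = time.time()
    for m in members:
        path = os.path.join(dest_dir, m.name)
        if now - os.path.getmtime(path) > REAP_AGE:
            stale.append(m.name)  # extraction restored an old write date
        os.utime(path, None)      # None sets access and write times to "now"
    return stale


def demo():
    """Archive a file dated 30 days ago, extract it, and report what happened."""
    with tempfile.TemporaryDirectory() as work:
        src = os.path.join(work, "data.txt")
        with open(src, "w") as f:
            f.write("results\n")
        old = time.time() - 30 * DAY
        os.utime(src, (old, old))
        archive = os.path.join(work, "old.tar")
        with tarfile.open(archive, "w") as tar:
            tar.add(src, arcname="data.txt")
        out = os.path.join(work, "out")
        os.mkdir(out)
        stale = extract_with_fresh_mtime(archive, out)
        age = time.time() - os.path.getmtime(os.path.join(out, "data.txt"))
        return stale, age < REAP_AGE
```

Here demo() returns the list of files that were extracted with an old write date, plus a flag saying whether the file is inside the 10-day window after being re-stamped.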
- 13. Does the SCF have a long term storage facility?
- Yes, it is possible to archive your files for long term storage using the IBM Distributed Storage Manager (Tape Robot).
Batch Job/System Questions
14. How do I submit a batch job?
- On the IBM pSeries, we use Platform Computing's LSF batch management system. Use bsub (see question 16) or xlsbatch, the Motif GUI for lsfbatch, the load sharing batch system installed on the SCF. Lsfbatch uses LSF (Load Sharing Facility) to distribute the load for both parallel and serial batch jobs over the system. Also, see the SCV help page on LSF.
The batch system on the Linux Cluster is OpenPBS.
On the Blue Gene, the batch system is LoadLeveler.
On the Katana Cluster, the batch system is Sun Grid Engine.
- 15. What limitations are there on jobs (# of nodes, runtime, etc...) on the various systems?
- Our Scientific Computing Facility Technical Summary explains the job limitations on all of our systems.
- 16. How do I have one batch job wait for another to complete?
- The bsub command in the LSF batch system (on the pSeries) has a wait option (-w) which allows you to specify the conditions which you wish to wait for before starting the job, including waiting for the termination of another job. For example,
bsub -w 'done("myjob1")&&done("myjob2")' myjob3
will cause myjob3 to wait until both myjob1 and myjob2 have completed. Another option -b allows you to specify that jobs should not be run before a certain time. Finally, the -E option provides a completely general mechanism to have a job wait until an arbitrary condition is true. With this option you specify a command which the batch system will execute before running your job. If the command exits with a 0, the job is run. Otherwise it is put back on the queue.
- 17. My batch job starts several other jobs but these other jobs get killed by the reaper. Why?
- If the original job terminates before its children, the reaper cannot determine that the children were started by the batch job and so kills them. Make sure the parent job does not end before the children.
- 18. My batch job is expected to run longer than a queue's time limit. What can I do?
- The answer depends on how your code is implemented:
- If your code is written as a serial (single processor) application, rewriting it as a parallel (multiprocessor) application could help, provided that the underlying algorithm used in your code is inherently parallelizable. Parallelization could be achieved with OpenMP or MPI. OpenMP is limited to shared-memory machines such as the IBM pSeries. MPI, on the other hand, works on shared-memory machines as well as distributed-memory machines (e.g., the SCV Linux Cluster). Please contact Kadin Tseng or Doug Sondak for more details.
If your program is written for MATLAB, please contact Kadin Tseng to see if your program can be parallelized.
- If your program is already parallelized using MPI, you might want to try running it on the Linux Cluster. The Linux Cluster usage policy allows 16-node parallel jobs to run for as long as 24 hours, compared to 5 hours for the pSeries multi-processor queues. So even though the Linux Cluster processors are slower than the pSeries processors, your job may be able to complete within the allowed time on the Linux Cluster.
- Modify your program so it periodically saves state and can be restarted where it left off. See Doug or Kadin for help doing this.
- If your code is already parallelized with MPI and is scalable to many (hundreds of) processors, you could port it to the IBM Blue Gene.
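The option of periodically saving state and restarting can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a recommended implementation for any particular code: the function names, the JSON checkpoint file, and the trivial running sum standing in for real work are all hypothetical. The max_steps_this_run argument simulates a job being killed at the queue's time limit.

```python
import json
import os


def save_state(state, path):
    """Write the checkpoint to a temporary file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # a job killed mid-write leaves the old checkpoint intact


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(n_steps, path, save_every=100, max_steps_this_run=None):
    """Advance the computation, checkpointing every save_every steps."""
    state = load_state(path)
    done = 0
    while state["step"] < n_steps:
        if max_steps_this_run is not None and done >= max_steps_this_run:
            return state  # stands in for the job being killed; nothing more is saved
        state["total"] += state["step"]  # placeholder for the real work
        state["step"] += 1
        done += 1
        if state["step"] % save_every == 0:
            save_state(state, path)
    save_state(state, path)
    return state
```

Submitting the same job again picks up from the last checkpoint rather than from step 0, so only the work done since that checkpoint is repeated.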
- 19. My batch job exited with code ###. What does that mean?
- See this long explanation of the batch system exit codes.
- 20. How does the LSF batch system schedule jobs?
- See this long explanation of the batch system scheduler.
- 21. My LSF (batch) run seemed to run to completion, but I never received the usual email message notifying me that the job had finished. What is the problem?
- At the end of an LSF run the user is automatically sent email to indicate that the job has completed. This email contains everything that was written to standard out during the run. If a large amount of information (greater than 10MB) is written to standard out, the email becomes too large for the mail system to process, and the email is not sent. This sometimes occurs when the user forgets to delete a large number of diagnostic print statements from a run. The best solution is to always redirect standard out to a file (e.g., myrun > myoutput).
- 22. I am a member of multiple project groups. How do I charge my usage to a project other than my default one?
- On the pSeries, use the -P project_name option to bsub when you submit your job.
On the Katana Cluster, run the command newgrp project_name in your shell window before doing your run or submitting your job (from that window) to the batch system.
On the Linux Cluster, the option to qsub is -W group_list=project_name. However, this only works for single processor jobs. For multiprocessor jobs on the Linux Cluster, there is currently no way to avoid them being charged to your default project, so see below for instructions on changing that.
On the Blue Gene, you need to add the line # @ group = project_name to your batch script.
You can also change your default project by going here (requires your Twister login and password) and then completing and submitting the appropriate Web form. Your default project will then be changed the next time the system configuration files are updated, generally overnight.
Programming Questions
23. How do I specify the number of processors my job will run on?
- It depends on the parallel paradigm (OpenMP, MPI) and/or the computer platform. Please consult the appropriate link below:
- 24. How do I run MPI jobs?
- Please read the Multiprocessing by Message Passing Tutorial where you will find instructions on what you need to do to use MPI.
- 25. How do I run PVM jobs?
- PVM is not available on any of our current systems.
- 26. What linear algebra packages are available on the SCF systems?
- On the IBM pSeries, ESSL is available for serial applications while PESSL is available for parallel processing.
On the other SCF systems, LAPACK is available for serial applications and ScaLAPACK for parallel applications. Please go to the Packages page for the Blue Gene, Katana Cluster, and Linux Cluster systems for details.
Basic Linear Algebra Subprograms (BLAS) is available on all four SCV systems. Add -lblas during linking to give your executable access to the BLAS library.
- 27. What causes the following error message on Twister?
-
The minimum size of partition 5 exceeds the partition size limit.
When the highest level of optimization (-O5) is requested, the compiler will perform inter-procedural optimization (ipa). In the process, it needs temporary storage, and when the default amount is not sufficient, the above message results. To remedy this, add the following switch to your compile line: -qipa=partition=large. See the xlf man page for details.
- 28. How do I call Fortran subprograms from a C program on Twister?
- Unlike many other machines, no wrapper routine is required with the IBM xlf compilers. In fact, adding an underscore ("_") in a C wrapper function will cause an error during compilation. This applies to both user-developed C functions and IBM C-based library functions, such as erand48, a 48-bit random number generator.
- 29. How do I call Fortran functions from a C program on the Linux-based machines (cootie, skate, katana, levi, lee)?
- On these machines, you need to append an underscore ("_") to the Fortran function name. For example, if your Fortran subprogram is called myfunc, then in the C program, invoke it with myfunc_. Note also that because Fortran subprogram arguments are passed by reference, all arguments, including any scalars, must be passed as pointers when you call the subprogram from C.
- 30. My Fortran program calls flush. It doesn't work on the pSeries/Twister!
- Instead of call flush(iunit), you must append an underscore, like this: call flush_(iunit).
- 31. Is etime available on the pSeries? I got this error when I compiled my program:
ld: 0711-317 ERROR: Undefined symbol: .etime
- Yes. Like flush above, you need to "call etime_". The following utilities require an underscore:
alarm_, clock_, ctime_, dtime_, etime_, fdate_, flush_, gmtime_, idate_, itime_, ltime_, sleep_, time_, usleep_
- 32. My C code with OpenMP directives behaves strangely when private arrays are allocated with malloc. Why is this?
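- Memory allocated with malloc lives on the heap and is shared among all threads. Listing the pointer in a private clause privatizes only the pointer variable, not the memory it points to. If each thread needs its own array, call malloc (and free) inside the parallel region.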
- 33. I use trigonometric and hyperbolic functions quite often in my code. Is there a system library that provides efficient implementations of these functions?
- On the pSeries, it is called MASS. Add -lmass to link to it. Details on MASS are available on the pSeries Documents page.
On the Linux Cluster, the Intel Vector Math Library (VML) is available. Go to the Linux Cluster Documents page for details.
- 34. What is MPMD and are there special considerations when programming using this paradigm?
- MPMD stands for Multiple Program Multiple Data, as opposed to MPI's more popular Single Program Multiple Data (SPMD) parallel programming paradigm. This is available on the pSeries only. (More details available here.)
- 35. I ran a batch job that uses /scratch for I/O. LSF can't find the file which is there. What happened?
- You must prepend the name of the machine on which the file resides, for example /twister/scratch. (Details here.)
Miscellaneous Questions
36. I think I have discovered a bug in F90, gcc, etc... What should I do?
- Send email to help@twister.bu.edu with a description of the problem. If possible, tell us exactly how to reproduce the problem you are having. If we can reproduce your problem, we can probably fix it. If you don't know how to reproduce the problem, please provide as much information as possible including:
- Hostname of machine.
- The name and location of the program (with flags and input files).
- Any error messages you get.
- 37. On the IBM/AIX machines my program fails with the error:
twister:~> a.out
exec(): 0509-036 Cannot load program a.out because of the following errors:
0509-026 System error: There is not enough memory available now.
How do I deal with this error and get access to more memory on the IBM/AIX machines?
- The error message is misleading; the system has plenty of memory. By default, a 32-bit AIX executable has a 256MB data segment limit. You need to use the -bmaxdata compiler flag to use more memory. For example:
twister:~> xlf -bmaxdata:0x40000000 prog.f
produces an executable with a 1GB data segment limit. It is usually safe to compile with -bmaxdata:0x80000000 for a 2GB data limit. To go above 2GB, you need to append /dsa to the value. The largest value you can specify is -bmaxdata:0xd0000000/dsa for a 3.25GB data limit. However, this may or may not work depending on the details of your program. If you need that much memory, consider compiling a 64-bit executable by using the -q64 flag.
You can also use the ldedit command to "fix" the executable without recompiling.
Finally, if you are using the GNU compilers (gcc, g++, g77), you need an additional -Xlinker flag:
twister:~> g77 -Xlinker -bmaxdata:0x40000000 prog.f
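The hexadecimal values above are all multiples of the 256MB default (0x10000000 bytes): 0x40000000 is 4 x 256MB = 1GB, 0x80000000 is 8 x 256MB = 2GB, and 0xd0000000 is 13 x 256MB = 3.25GB. The small Python sketch below just reproduces that arithmetic. The helper names are hypothetical, and working in whole 256MB units is an assumption made for the illustration.

```python
SEGMENT = 0x10000000  # 256MB, the default data limit quoted above


def maxdata_flag(n_segments):
    """Build a -bmaxdata flag for a data limit of n_segments * 256MB."""
    if not 1 <= n_segments <= 13:
        raise ValueError("the largest value quoted above is 0xd0000000 (13 x 256MB)")
    size = n_segments * SEGMENT
    suffix = "/dsa" if size > 0x80000000 else ""  # above 2GB the value needs /dsa
    return "-bmaxdata:0x%08x%s" % (size, suffix)


def size_in_gb(flag_value):
    """Convert a hex value such as 0x40000000 to gigabytes (1GB = 1024MB)."""
    return flag_value / float(1 << 30)
```

For example, maxdata_flag(4) gives -bmaxdata:0x40000000 and maxdata_flag(13) gives -bmaxdata:0xd0000000/dsa.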
- 38. Is there a problem with Mathematica regarding fonts?
- The X frontend for Mathematica requires some special Mathematica fonts. If the fonts are not available to your X-server, Mathematica won't work. To fix this, you need to install the fonts in a place where your X-server can find them. The fonts are in the following directory: /usr/local/apps/mathematica-5.2/SystemFiles/Fonts/BDF on all SCV maintained machines. Copy them (20MB) to a directory which your X-server can access and, if necessary, add the directory to your font path by executing a command like:
xset fp+ <full path to mathematica font directory>
If you need help, please ask the administrator of your workstation for assistance.
- 39. Why can't the machine I am using read my datafile that I created on another computer?
- If the file is a binary file, it may be a problem with endian-ness. Intel and DEC computers are usually "little-endian" while MIPS (SGI), SPARC (Sun), and PPC (IBM/AIX) are "big-endian". This means that the order in which they store the bytes of, for example, an integer is reversed. The best solution is to use a portable data format for your data files, such as ASCII.
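To see what the byte-order mismatch looks like in practice, here is a small Python sketch using the standard struct module. It writes 32-bit integers in big-endian order and reads them back with either byte order. The function names and the file layout (a bare sequence of 32-bit integers) are assumptions made for the illustration.

```python
import struct


def write_ints(path, values):
    """Write 32-bit integers with an explicit big-endian layout ('>')."""
    with open(path, "wb") as f:
        f.write(struct.pack(">%di" % len(values), *values))


def read_ints(path, byte_order=">"):
    """Read 32-bit integers, interpreting the bytes in the given order."""
    with open(path, "rb") as f:
        raw = f.read()
    count = len(raw) // 4
    return list(struct.unpack("%s%di" % (byte_order, count), raw))
```

A file written from [1, 256] reads back correctly with byte_order=">", but with byte_order="<" the same bytes come out as [16777216, 65536]. Stating the byte order explicitly when reading and writing is one way to keep a binary file portable when ASCII is too bulky.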
- 40. None of this answered my questions. What should I do?
- For other questions, send mail to help@twister.bu.edu or take advantage of the newsgroup bu.mail.scfug-l to post questions so other users can help you or see if your question has come up before. You can also subscribe to the scfug-l mailing list by sending mail to majordomo@bu.edu with the following line as the BODY of the message (the Subject line does not matter):
"subscribe scfug-l@bu.edu your_login_name@machine_name.bu.edu"
Make sure to specify your BU login name and include a specific machine name to send mail to. Mail to this list also appears in the bu.mail.scfug-l newsgroup mentioned above.