DLC and SLEAP


This page will focus on accessing Talapas, mounting drives if you want to, setting up python environments, etc. If questions arise that aren't answered here, always check the Talapas knowledge base at https://hpcrcf.atlassian.net/servicedesk/customer/portals. You can also submit a ticket there if you still can't solve a problem.

Accessing Talapas

Before being able to do anything with Talapas, you'll need an account! You can go to this help site and select New Account Request. You should already know most of the requested information, but you will need your PIRG, which corresponds to your lab. You will eventually get an email back stating that you have been added and explaining how to access the server.

Talapas is primarily accessed through a shell, like Terminal, iTerm2, or Anaconda/Windows PowerShell. To access it, use the following:

ssh USERNAME@talapas-ln1.uoregon.edu

You can use either login node, ln1 or ln2, just be consistent. Once your login is successful you will be dropped into your home directory. Your username and password are the same as your DuckID, i.e. what you use for UOMail, Canvas, etc.

I also recommend setting up an SSH key (RSA or equivalent) for logging in, just to make life easier.
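
If you go that route, something like the following, run from your local machine, sets up key-based login (the key type here is just a reasonable default; adjust to taste):

ssh-keygen -t ed25519                               # generate a key pair locally if you don't already have one
ssh-copy-id USERNAME@talapas-ln1.uoregon.edu        # copy the public key to Talapas so future logins skip the password prompt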

Installing/configuring miniconda on Talapas (Optional)

The following section is only relevant if you want to create and test your own environments. If you want to use the general environments created by and for all users (a DLC environment already exists), you can skip this and use the module system to load anaconda and the environments already on Talapas. Again, this is specifically for when you or your lab want to quickly create and experiment with many different environments that are only for in-house use.

To easily manage the various python environments our lab is making and using, we will use conda. A nice, lightweight version of Anaconda exists in the form of miniconda, which we can install in our home directories. This is only worthwhile if you want to actively create your own conda environments independent of the Talapas team; you can always submit a ticket with a requirements/setup file and they will make an environment for you under their miniconda module. If you do make many new conda environments, I recommend storing them in your PIRG's shared folder (see below) and adding that path to conda as a place to search for installed environments. If you read this and still decide you want to make your own environments, then...

To do so run the following

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

and then install with

sh Miniconda3-latest-Linux-x86_64.sh

You'll know the install was successful if you now see (base) next to your login info on the left side of your terminal window.
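
If you plan on keeping your environments in your PIRG's shared folder as mentioned above, this is also a good time to tell conda to search there. A rough sketch (the exact folder is whatever your lab uses; the path below matches the shared conda_envs folder used later on this page):

conda config --append envs_dirs /projects/wehrlab/shared/conda_envs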

Creating new conda environments (Optional)

Normally when we create new virtual environments in conda, we would use

conda create --name=myenv python=3.X

to create a new environment called "myenv" on whatever Python 3 version you specify. However, we want our environments to be open for everyone in our lab to access. To do this, we have to tell conda to place our environments in a common location instead of its default one. We do this by swapping --name for the --prefix flag:

conda create --prefix=/projects/wehrlab/shared/conda_envs/myenv python=3.X

Now this environment will be created in an area we all have access to! And since we ran conda config --append envs_dirs with our shared folder after the install (see above), conda knows to look in the "conda_envs" folder for environments. The downside is that the environment names look hideous in the shell prompt... maybe we can fix this one day.
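
For reference, activating one of these prefix-based environments can be done either by its short name (which works because the shared folder is in envs_dirs) or by its full path; either of the following should work:

conda activate myenv
conda activate /projects/wehrlab/shared/conda_envs/myenv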

Tensorflow environments

For any environments using tensorflow, we have to be very specific about where we source our packages from to ensure compatible versions and proper libraries. Specifically, we want to source any CUDA-related packages from Nvidia directly rather than from the default conda or pip channels. The following is an example of how this is done when creating a new conda environment; the same channel flags can also be used with any conda install command:

conda create -n deeplabcut-20220414 --override-channels -c nvidia -c defaults -c conda-forge  'python>=3.8'  ffmpeg 'cuda>=11' 'cudnn>=8.1.0' numpy 'scikit-image==0.18.1'
(conda activate deeplabcut-20220414 && pip3 install deeplabcut)

For maximum compatibility, we want to make sure our cuDNN and CUDA libraries match what is supported by the Nvidia drivers for the GPUs. A chart for that can be found on TensorFlow's website. I am unsure how often it is updated, but it does give some general guidelines. Nvidia also publishes compatibility information whenever a new version of CUDA is released, which can be found in their blog articles. We cannot change the driver versions installed on Talapas, so we must work around them instead.
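
To see what driver is actually installed on a GPU node (and the highest CUDA version that driver supports), the quickest check is nvidia-smi, run from a node that has a GPU rather than a login node:

nvidia-smi                                            # the header lists the driver version and its maximum supported CUDA version
nvidia-smi --query-gpu=driver_version --format=csv    # just the driver version, if that's all you need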

One important thing to note is that when working with these environments we need to add the CUDA package libraries to our LD_LIBRARY_PATH, which can be done with the following while inside the conda environment: export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH. This adds the environment's library folder to the front of the library search path, making sure its files are used first, and you must do this every time. If you get weird cuDNN errors or failures to find GPUs in DLC/tensorflow, failing to do this is probably why (assuming you are on a node that has a GPU).
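
As a quick sanity check after activating the environment and exporting the path (again, on a GPU node), you can ask tensorflow what devices it can see:

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"    # an empty list means the GPU/libraries aren't being found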

You can add the export command to your bashrc file so it runs when the Talapas session starts, but you will need to change $CONDA_PREFIX to the actual path of that environment, otherwise your bashrc export will add the wrong library path (or none at all). This also means that if you use different environments that rely on different versions of CUDA/cuDNN, the hardcoded path may pull in the wrong libraries. Therefore, I recommend just running the export command manually each time for safety.
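
If you decide to go the bashrc route anyway, the line would look something like the following, with the environment path hardcoded (the path here is just an illustration based on the shared-environment layout above):

export LD_LIBRARY_PATH=/projects/wehrlab/shared/conda_envs/deeplabcut-20220414/lib:$LD_LIBRARY_PATH    # in ~/.bashrc; note the hardcoded prefix instead of $CONDA_PREFIX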


NOTE: Tensorflow/CUDA/cuDNN packages installed from conda-forge currently have a version mismatch and therefore cannot be used reliably. This is why we must source these packages from Nvidia directly instead of letting conda handle it with defaults. Hopefully this is corrected soon to make things easier. Also note there is an active (as of 2022-04-20) bug in Tensorflow 2.7 and higher that can cause it to randomly fail to find cuDNN even if it's installed properly.

General Talapas information and commands

This section briefly covers useful commands for interacting on Talapas/SLURM.

The first, and arguably most useful, command is squeue. This command allows you to check the status of all jobs currently running. If you are about to submit a job and want to see how quickly you will get access, run squeue -p gpu, replacing gpu with other partitions as necessary (-p filters by partition). This will show every job currently running on that partition, as well as its status, so if you see other jobs already pending (PD) for Resources, know that your job won't start immediately either.

To check on jobs you have already submitted, run squeue -u USERNAME, which displays all jobs being run by a specific user. This can be useful if you want to see how much time you have left on an interactive node or how far you are into an array batch job.

The next useful command is scancel, which is used to terminate a job. You can find a job's ID using squeue and then kill it with scancel JOB_ID. Keep in mind that a mistyped ID could target the wrong job, so be careful with the ID you enter. To limit scancel to just your own jobs, use scancel -u USERNAME JOB_ID, which tells it to only act on jobs submitted by that username.

Modules

Talapas uses a system known as module to maintain many different environments and conditions simultaneously, yet separately. To check which modules are currently loaded, use module list, and to unload a module, use module unload MODULE_NAME. To see a list of all modules, use module spider (NOTE: there are a lot of modules). To search for modules containing a specific name, use module spider MODULE_NAME; for example, module spider cuda will list any modules whose names contain the word cuda. To load a module, use module load MODULE_NAME, which by default loads the latest version of that module. To choose a specific version, follow the module name with /X.XX.X, where the Xs are the version numbers; for example, module load cuda/11.2.0 will load that specific version of the cuda module.
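
For example, a typical sequence for finding and loading a specific version of the cuda module looks like:

module spider cuda          # search for modules whose names contain "cuda"
module load cuda/11.2.0     # load one specific version
module list                 # confirm what is currently loaded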

Modules can be used to load things like miniconda, MATLAB, and more.

Module deep dive

To my understanding (Matt Nardoci), modules work by editing your $PATH, $LD_LIBRARY_PATH, and $PYTHONPATH variables to point to the proper versions of the installs you request when you load the module. This system works fairly effectively, but there have been cases where things normally added to a variable at install time (for example, installing CUDA from Nvidia adds driver paths to your PATH and LD_LIBRARY_PATH) are not perfectly replicated by module load. This has led to us having to manually export paths to the above variables to correct for the omission. 99% of the time, though, this system works well. This is also why it's important not to load conflicting modules, like Miniconda2 and Miniconda3, as then you have two minicondas trying to act every time you type conda activate environment.

Running jobs on Talapas

Interactive job runs

Interactive runs are most useful when you need to actively see the outputs and make changes on the fly. This is helpful when you are trying to debug code or test new methods.

To start a run, we would use the following:

srun --account=YOURPIRG --pty --partition=gpu --mem=8G --cpus-per-task=4 --gpus=1 --time=2:00:00 bash

In the above, --account=YOURPIRG tells SLURM which group to bill, which will always be your PIRG (do remember to change YOURPIRG to your actual group). --partition=gpu says what type of node to request, which is either short, long, gpu, longgpu, preempt, or fat; for more information on partition types, visit the Talapas knowledge base page on partitions. --mem=8G specifies how much RAM to request on the node, --cpus-per-task requests how many CPU cores you want, --gpus=1 says how many GPUs you want available, and --time=2:00:00 specifies how long to hold the node, in [days-]HH:MM:SS format (2:00:00 here is two hours; when the time runs out any remaining tasks are killed). Finally, bash is the command to run once the node is allocated, which drops you into an interactive shell (it is also what shows up as the job name in the queue). Keep in mind all of these values can be changed, and the allowed values depend on the partition type you request. Most notably, only the long partitions can have a time greater than 24 hours, and only the gpu and preempt partitions have access to GPUs.

When you run the above command, you'll get confirmation the job is submitted and waiting for the resources you requested. This is usually almost instant, but if there are a lot of jobs running on Talapas it can take a while.

Submitting batch jobs

Batch jobs are most useful when you have a proven method that won't error and you just need to repeat a task, or run something that may take more than a few minutes and you don't want to babysit it. Batch jobs work through the submission of .batch files that contain instructions for what resources each node needs, how many nodes to use, and what to do on those nodes once they are secured. Batch files have a special header, shown below:

#!/bin/bash
#SBATCH --account=YOURPIRG    ### change this to your actual account for charging
#SBATCH --partition=gpu       ### queue to submit to
#SBATCH --gpus=1              ### number of GPUs to use (use --gres=gpu:a100-40g:1 on preempt for the newer GPUs)
#SBATCH --job-name=Molly_video_analysis      ### job name
#SBATCH --output=/projects/wehrlab/mnardoci/batchlogs/Molly_job-%A_%a.out      ### file in which to store job stdout
#SBATCH --error=/projects/wehrlab/mnardoci/batcherrors/Molly_job-%A_%a.err       ### file in which to store job stderr
#SBATCH --time=0-14:00:00              ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --mem=8G                ### memory limit per node, in GB
#SBATCH --nodes=1               ### number of nodes to use
#SBATCH --ntasks-per-node=1     ### number of tasks to launch per node
#SBATCH --cpus-per-task=4      ### number of cores for each task
#SBATCH --array=0-336

You'll notice a lot of these inputs are the same as what is used for calling an interactive session with srun above. It's important that this block is the header of the file and that each line starts with #SBATCH, since that marks it as a resource request rather than actual code to run. The other main additions here are --output=/projects/wehrlab/mnardoci/batchlogs/Molly_job-%A_%a.out and --error=/projects/wehrlab/mnardoci/batcherrors/Molly_job-%A_%a.err. These generate log files in real time while the code runs so you can review what is happening, since we can't see the terminal output ourselves. --output is the path for your .out file, which contains all "standard" output from the code, such as print statements. --error is the path for a .err file, which catches everything else; any warnings or errors your code produces are written there so you can hopefully debug them later.

When submitting a batch job, especially a long one, I recommend starting with an interactive run first, testing that your code runs for 5-10 minutes without erroring (this can be done with sh /path/to/file.batch), and then submitting it as a batch with sbatch (see below). You can then check the .out and .err files every so often to make sure things are running how you expect.
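
For reference, the actual submission looks something like this (sbatch prints the job ID it assigns, which you can then watch with squeue):

sbatch /path/to/file.batch
squeue -u USERNAME          # check that the job shows up as pending (PD) or running (R)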

Array batch jobs

The one thing I didn't mention in the above block is --array. Arrays are used when you want to run the same instructions in parallel on multiple nodes rather than having one node work through everything in a for loop. The range you give after --array sets how many iterations you want to run (one node per iteration). If you want to use as many nodes as you can at once, add nothing else. If you want to limit the number of nodes used at any one time, follow the range with a percent sign and a limit; for example, --array=0-336%10 says "run every task in the range 0-336, but only 10 at a time." This can be useful if you plan on submitting multiple types of runs that use the same partition, so you aren't hogging all the resources from yourself, or if you want to be considerate of other Talapas users who may want to submit their own jobs but can't if you take everything.

HELP FOR INDEXING: When jobs are submitted, SLURM automatically assigns the resulting IDs to a few variables. For SLURM purposes, such as the lines prefixed with #SBATCH, the job ID is assigned to %A and the array ID is assigned to %a. You'll notice in the code above I used those values to create unique file names for my outputs so I could identify what failed later on. In the shell, these values are assigned to environment variables; notably, the current array index is available as $SLURM_ARRAY_TASK_ID. This means that if you have a list of files you want to do something to, you can make sure each array task works on a different file by using $SLURM_ARRAY_TASK_ID to index that list. Here is an example of a batch file I made for running DeepLabCut on several hundred videos at once.

#!/bin/bash
#SBATCH --account=YOURPIRG    ### change this to your actual account for charging
#SBATCH --partition=gpu       ### queue to submit to
#SBATCH --gpus=1              ### number of GPUs to use (use --gres=gpu:a100-40g:1 on preempt for the newer GPUs)
#SBATCH --job-name=Molly_video_analysis      ### job name
#SBATCH --output=/projects/wehrlab/mnardoci/batchlogs/Molly_job-%A_%a.out      ### file in which to store job stdout
#SBATCH --error=/projects/wehrlab/mnardoci/batcherrors/Molly_job-%A_%a.err       ### file in which to store job stderr
#SBATCH --time=0-14:00:00              ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --mem=8G                ### memory limit per node, in GB
#SBATCH --nodes=1               ### number of nodes to use
#SBATCH --ntasks-per-node=1     ### number of tasks to launch per node
#SBATCH --cpus-per-task=4      ### number of cores for each task
#SBATCH --array=0-336

module load miniconda
conda activate deeplabcut-20220415

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
cd /projects/wehrlab/shared/dlc_analysis_errorlogs

directory_list=()
for direct in /projects/wehrlab/shared/Molly/*
do
    echo "$direct"
    directory_list+=("$direct")
done
echo "Running array job ${SLURM_ARRAY_TASK_ID} directory on ${directory_list[${SLURM_ARRAY_TASK_ID}]}"

python /projects/wehrlab/shared/dlc_analyze.py "${directory_list[${SLURM_ARRAY_TASK_ID}]}"

In it, I have a for loop pull the names of all the directories in a specific location, and then I choose one directory based on the array ID to give to python as an argument, which will then start analysis on the video in that directory.

Mounting folders from Talapas locally

TALAPAS DOES NOT USE SAMBA/SMB/NFS/ANYTHING OTHER THAN SFTP FOR SECURITY REASONS

Therefore all mounting is done using sshfs
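
A minimal example, assuming you have sshfs installed locally and an empty local folder to mount into (the paths here are just placeholders), looks roughly like:

mkdir -p ~/talapas_mount
sshfs USERNAME@talapas-ln1.uoregon.edu:/projects/wehrlab ~/talapas_mount    # mount the projects folder locally over SFTP
fusermount -u ~/talapas_mount                                               # unmount when done (on macOS, use umount ~/talapas_mount instead)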

THEY HAVE ALSO KINDLY REQUESTED WE DON'T MOUNT ANYTHING TO TALAPAS. They will remove any folders that you mount to the server.

SLEAP

SLEAP can be loaded by first activating the Talapas miniconda module with module load miniconda and then activating the environment with conda activate sleap-1.2.0.

Note (2022-05-16)

I (Matt Nardoci) have not worked with SLEAP for my project, so I have not tested whether it properly uses GPUs; it may only use CPUs. A new SLEAP environment may have to be created using the methods outlined above for DLC. Michael Coleman with RACS has experience getting this working for DLC and can likely provide assistance if needed when you submit a ticket through Talapas.
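
If someone wants to test the GPU question, a rough check would be to grab an interactive GPU node (see srun above), activate the environment, and see whether tensorflow finds a GPU, the same way as for DLC:

module load miniconda
conda activate sleap-1.2.0
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH    # may or may not be needed for the shared environment
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"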

Niche issues on Talapas we have encountered

Troubleshooting DLC on Talapas

Matt has been attempting to set up DLC, SLEAP, and other TensorFlow-related programs on Talapas and has encountered several problems along the way.

  1. Maintaining proper $PATH and $LD_LIBRARY_PATH variables appears to be the key issue and the source of most frustration.
    • We can't install CUDA the "normal" way since Talapas is a shared server where people need multiple versions, so currently we are figuring out how to make sure all the paths point to the proper locations, since the installs live in non-standard directories. Some of these paths are very annoying to find :D (plz free me from this mortal coil). One way to at least see what a module sets is shown below.
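
One thing that helps when hunting these paths down is asking the module system directly what a given module changes, and then inspecting the variables yourself, for example:

module show cuda/11.2.0     # prints the PATH/LD_LIBRARY_PATH edits the module would make
echo $LD_LIBRARY_PATH       # see what is actually on the library path right now
echo $PATH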