Machine learning is a subfield of artificial intelligence that uses algorithms trained on data sets to create models that enable machines to perform tasks that would otherwise only be possible for humans. These tasks can include categorising images, analysing data, or predicting price fluctuations.
The eResearch team manages access to several private VMs and a dedicated machine that is suitable for machine learning tasks.
For anyone new to machine learning, we recommend some starting resources to learn key concepts and the technologies involved: the Machine Learning Crash Course by Google and Kaggle Data Science Education.
Deeplearning01 (Big GPU Machine)
We have a server with 4 GPUs dedicated to machine learning, deep learning and "artificial intelligence" workloads. Its main purpose is to provide a large amount of GPU RAM to workloads in these areas.
The server is a Dell T640 with:
- 2 Intel Xeon Gold 6226R CPUs @ 2.9GHz with 16 cores each, for a total of 32 cores (64 threads)
- 384GB of RAM
- one 1.92TB SSD used for the root filesystem and /home
- one 3.84TB SSD used for scratch space, mounted on /scratch
- 4 NVIDIA Quadro RTX 6000 24GB GPUs, for a total of 96GB of GPU RAM
- 2 NVLink bridges, each connecting a pair of GPUs
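Once you have access (see below), you can verify this configuration yourself with the standard NVIDIA tools; a quick sketch:

nvidia-smi            # list the four GPUs, their memory and current usage
nvidia-smi topo -m    # show the inter-GPU link topology, including the NVLink pairs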
Access to the machine is on request to eResearch services via the ServiceNow form for RCC projects: just skip the VM details and indicate that you want to use "deeplearning01" in the "other information" field. Once access is granted, log in with your usual UC username and password; the machine's name is
rcc-deeplearning01.canterbury.ac.nz
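For example, from a campus machine (abc123 stands for your UC usercode):

ssh abc123@rcc-deeplearning01.canterbury.ac.nz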
Note that the server is only accessible from the campus network. If you want to access it from home (or elsewhere), read the Out of campus access page.
Slurm and Scheduling
Since the machine is a shared resource, we need a way to ensure fair access amongst users. With a few users, some kind of board or mailing list for requesting a turn on the machine would be fine; with a growing number of users, we need a formal workload manager providing a submission queue. Instead of running jobs directly on the GPUs, users submit them to the workload manager, which decides who runs when. A time limit (usually referred to as "wallclock time") is also enforced so people do not wait too long in the queue. Use of a job scheduler will also
- let us measure how busy the machine is
- make more efficient use of the machine, since no one has to check whether it is free or wait for a message that it is their turn
The selected workload manager is Slurm. Slurm is open source and an industry standard in the HPC world. People using NeSI will already be familiar with it; in turn, users of Deeplearning01 will become familiar with the technology used at NeSI and many other facilities (according to Wikipedia, Slurm is used by about 60% of the TOP500 supercomputers).
Job Submission with Slurm
To submit a job, you prepare a small text file containing the script to run together with Slurm instructions describing the job and its requirements, then put it in the queue with the appropriate command. A good summary of Slurm commands and options can be found on the NeSI website; other pages with interesting examples can be found at Compute Canada and the University of Cambridge.
As mentioned earlier, the job submission file is a text file containing a bash script (other shell languages are possible but sticking to bash is recommended). The script can also include comments in a special format which are in fact Slurm directives; they are usually of the form:
#SBATCH --some-slurm-option=some_value
A complete script that could run on our machine is "example.sl" below, given an appropriate "program".
#!/bin/bash
#SBATCH --account=def-someuser   # replace with your usercode, used for accounting
#SBATCH --gres=gpu:4             # number of GPUs: 1, 2 or 4
#SBATCH --cpus-per-task=6        # CPU cores/threads, up to 64
#SBATCH --time=0-03:00           # wallclock time (DD-HH:MM), up to 48 hours on our machine

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # in case the program uses OpenMP
./program
To submit the job above to the queue, simply type:
sbatch example.sl
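On submission, sbatch prints the job id; by default, Slurm then writes the job's output to a file named slurm-<jobid>.out in the directory you submitted from. For example (the job id 12345 is a placeholder):

sbatch example.sl      # prints: Submitted batch job 12345
cat slurm-12345.out    # inspect the job's output once it has run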
Important Slurm Commands
- sbatch to submit a job to the queue
- squeue to see all the jobs queued
- squeue -u usercode to see all the jobs queued by "usercode"
- scancel $jobid to remove the job "$jobid" (provided it belongs to you)
- scancel -u usercode to cancel all your jobs (provided you are usercode; you get an error message otherwise)
- sinfo shows the current state of the partitions and nodes. A partition can be "up" or "down"; node states include "idle" (no job running), "alloc" (jobs running) and "draining" (the node accepts no new jobs after the current ones finish, which is useful for maintenance)
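As an illustration, a typical interaction with the queue might look like this (usercode abc123 and job id 12345 are placeholders):

sinfo                  # check the state of the machine first
sbatch example.sl      # submit the job; note the job id that is printed
squeue -u abc123       # follow your own jobs in the queue
scancel 12345          # cancel job 12345 if you no longer need it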
Public Services
Public services normally have an associated cost that is not covered by the University. University services are free of charge but may have constraints on the type and quantity of hardware available, as well as on the duration of your projects.
- Google Colaboratory: Colaboratory allows you to write and execute Python in your browser with free access to GPUs and easy sharing. With Colab you can harness the full power of popular Python libraries to analyse and visualise data.
- Amazon Web Services: Amazon Web Services offers a broad set of machine learning services and supporting cloud infrastructure.
- Microsoft Azure Machine Learning Studio: The Azure Machine Learning service empowers developers and data scientists with a wide range of productive experiences for building, training, and deploying machine learning models.
- Google Cloud Machine Learning: Google Cloud offers AI and machine learning products for developers, data scientists, and data engineers.
- IBM Watson: Watson is IBM’s portfolio of enterprise-ready pre-built applications, tools, and runtimes. With Watson you can infuse AI into your applications to make predictions or automate decisions and processes.
- NeSI: NeSI provides a national platform of shared high performance computing tools and eResearch services. They further have many resources dedicated to machine learning.
This list does not contain information on the Generative AI tools at UC, which can be found here.