Remote Resources

MIT Political Science Methodology Lab (PML) Workshop Series

Mason Reece
Gabrielle Péloquin-Skulski

October 18, 2024

Goals

  • Hands-on experience!

  • xvii

  • SuperCloud

  • When to Use What

Resource Comparison

XVII – Learning, long-term scraping tasks

SuperCloud – Parallelized scripts, large memory requirements, etc.

Setting up XVII

We’ll make you an account

Connecting to xvii via ssh – on Mac/Linux use the Terminal, on Windows use PowerShell

ssh <your username>@xvii.mit.edu

Immediately change your password using passwd and save it somewhere

Basic Linux Commands

ls                      # List files
pwd                     # Print working directory
cd <to>                 # Change directory
mv <from> <to>          # Move (or rename) files
mkdir <name of dir>     # Create new directory
nano <file>             # Command line text editor
cat <file>              # Print contents of file
rm -rf <dir or file>    # Delete stuff (no undo!)
find                    # Finding files/folders
rsync <from> <to>       # Transfer data
passwd                  # Change your password
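
A minimal sketch of how these commands fit together, run inside a throwaway directory so nothing important can be deleted. The names sandbox, drafts, and notes.txt are made up for illustration.

```shell
# Work in a temporary directory created just for this exercise.
sandbox="$(mktemp -d)"
cd "$sandbox"
pwd                                   # print where we are

mkdir drafts                          # create a directory
echo "hello xvii" > notes.txt         # create a small file
mv notes.txt drafts/notes.txt         # move the file into drafts/
ls drafts                             # list the directory contents
contents="$(cat drafts/notes.txt)"    # capture the file contents
echo "$contents"

found="$(find . -name 'notes.txt')"   # locate the file by name
echo "$found"

cd "$OLDPWD"                          # leave the sandbox before deleting it
rm -rf "$sandbox"                     # delete it (no undo!)
```

Practicing rm -rf on a temporary directory first is a cheap way to get used to how final it is.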

Important xvii Commands

  • htop to check CPU/RAM usage of server

  • ps aux | grep <username> prints all the processes you have running. Useful for finding processes that have stalled or have other issues

  • kill <pid> stops a process you have running; <pid> comes from the previous command. If the process ignores the request, kill -9 <pid> forces it to stop
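
A small sketch of the find-then-kill routine, using sleep 300 as a stand-in for a stalled job (an assumption for illustration; on xvii you would search for your username or script name instead).

```shell
# Start a long-running stand-in for a stalled job.
sleep 300 &
pid=$!

# List matching processes. The [s] keeps grep from matching its own command line.
ps aux | grep "[s]leep 300" || true

kill "$pid"                        # polite request to stop (SIGTERM)
wait "$pid" 2>/dev/null || true    # let the shell clean up the finished job

# kill -0 sends no signal; it only checks whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
  status="still running"           # escalate with: kill -9 "$pid"
else
  status="stopped"
fi
echo "$status"
```

Trying a plain kill before kill -9 gives the process a chance to shut down cleanly.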

RStudio on the Web

  • Go to http://xvii.mit.edu:8787

  • Login with your username and password

    • Clear your cookies if you can’t connect
  • Just like regular RStudio but runs on XVII

    • Occasionally it hangs or breaks; contact the PML RA to fix this

Performance – Laptop vs xvii

# install.packages(c("tidyverse", "palmerpenguins", "tictoc"))

# Load the palmerpenguins dataset
data("penguins", package = "palmerpenguins")

tictoc::tic()

# Simulate heavy computational work (Single Core)
result <- purrr::map(1:100000, ~ {
    penguins |> 
      dplyr::filter(species == "Adelie") |> 
      dplyr::pull(bill_length_mm) |> 
      mean()
  })

tictoc::toc()
41.337 sec elapsed

xvii took 165.294 seconds (whoops)

Parallel Processing on xvii

# install.packages("furrr")

# Load the palmerpenguins dataset
data("penguins", package = "palmerpenguins")

library(furrr)

tictoc::tic()

plan(multisession, workers = 8)

# Simulate heavy computational work (MultiCore)
result <- furrr::future_map(1:100000, ~ {
    penguins |> 
      dplyr::filter(species == "Adelie") |> 
      dplyr::pull(bill_length_mm) |> 
      mean()
  })

plan(sequential)

tictoc::toc()

XVII took 25.883 seconds (about a 6.4x speedup over the single-core xvii run)
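
The speedup is just a ratio of the timings reported above. A quick sketch of the arithmetic, using awk for floating-point division. The 8 comes from the workers = 8 setting in the plan() call; "efficiency" (speedup divided by workers) is my own framing, not something the benchmark reports.

```shell
laptop=41.337          # seconds, single core on the laptop
xvii_single=165.294    # seconds, single core on xvii
xvii_parallel=25.883   # seconds, 8 workers on xvii
workers=8

speedup_vs_xvii="$(awk -v a="$xvii_single" -v b="$xvii_parallel" 'BEGIN { printf "%.2f", a / b }')"
speedup_vs_laptop="$(awk -v a="$laptop" -v b="$xvii_parallel" 'BEGIN { printf "%.2f", a / b }')"
efficiency="$(awk -v s="$speedup_vs_xvii" -v w="$workers" 'BEGIN { printf "%.0f", 100 * s / w }')"

echo "vs. single-core xvii: ${speedup_vs_xvii}x"    # 6.39x
echo "vs. laptop:           ${speedup_vs_laptop}x"  # 1.60x
echo "parallel efficiency:  ${efficiency}%"         # 80%
```

So the parallel run beats the laptop by a smaller margin than it beats single-core xvii, because one xvii core is slower than one laptop core in this test.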

Notes on XVII

  • Powerful, but remember this is a shared resource among the department

Uploading Big Data or Scripts

  • Use FileZilla (https://filezilla-project.org/download.php?type=client) or rsync
  • MUST store your data in /mnt/xvii_big_data/<username>/
  • Code can go in /home/<username>
    • If your big data goes here and fills up the drive, xvii won’t work
  • How to connect to xvii using an SFTP program such as FileZilla
    • Host: xvii.mit.edu

    • Username: <your username>

    • Password: <your password>

    • Port: 22
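
For the rsync route, a sketch of what the upload command might look like. The username and the local folder are placeholders. The leading echo only prints the command, so you can check the paths before running it for real.

```shell
xvii_user="your_username"           # placeholder: your xvii account name
local_data="$HOME/project/data/"    # hypothetical folder on your laptop
remote_data="${xvii_user}@xvii.mit.edu:/mnt/xvii_big_data/${xvii_user}/"

# -a keeps timestamps and permissions, -v lists files, -z compresses in transit,
# --progress shows how far along each file is.
# Remove the leading "echo" to actually start the transfer.
echo rsync -avz --progress "$local_data" "$remote_data"
```

The trailing slash on the local folder tells rsync to copy the folder's contents rather than the folder itself.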

Do You Need More High-Performance Computing?

  • MIT SuperCloud

    • Loads more computing power (basically your own xvii + more)

    • Requires advisor approval + an edX course

    • More difficult to use

What is MIT SuperCloud?

  • High performance computer accessible to the MIT community for FREE

  • Power: More than 16,000 x86 CPU cores and more than 850 NVIDIA Volta GPUs in total.

  • Storage: Home directories 10TB and group directories 50TB (Do not use SuperCloud as primary storage, back up on another system)

  • Available languages: Julia; Python (Anaconda); MATLAB/Octave; R; C/C++; Fortran; Java; Perl 5; Ruby

Resource Limits

  • Time Limits: Jobs are limited to 4 days and 4 hours
    • Tip: If your task doesn’t complete within this timeframe, your code may need optimization. Consider attending a PML workshop for help.
  • Only accessible WITHIN the US (I have been able to use it in Canada with a VPN)
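
A rough way to check whether a job fits in the time limit. The workload numbers (iterations, milliseconds per iteration, cores) are invented for illustration, and the estimate assumes the work splits perfectly across cores, which real jobs rarely manage.

```shell
limit_hours=$(( 4 * 24 + 4 ))             # 4 days and 4 hours = 100 hours
limit_seconds=$(( limit_hours * 3600 ))   # 360000 seconds

# Hypothetical workload
iterations=2000000
ms_per_iteration=500
cores=8

# Estimated wall-clock time, assuming perfect scaling across cores
est_seconds=$(( iterations * ms_per_iteration / 1000 / cores ))

if [ "$est_seconds" -le "$limit_seconds" ]; then
  fits="yes"
else
  fits="no"
fi
echo "limit: ${limit_seconds}s, estimate: ${est_seconds}s, fits: ${fits}"
```

If the estimate is anywhere near the limit, split the job into chunks or save intermediate results.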

Advantages of SuperCloud as Opposed to XVII

  • Increased power: Access to more computational resources, including thousands of CPU cores and GPUs.
  • Up-to-date environment: Fewer issues with outdated packages or software compatibility.
  • Support resources: Available office hours for personalized help, along with extensive online documentation and tutorials.

Disadvantage:

  • No RStudio interface: Unlike XVII, SuperCloud doesn’t provide a direct RStudio interface for running analyses, requiring users to manage R from the command line.

Getting Access to SuperCloud

  1. Complete a cybersecurity training before applying for a SuperCloud account (very short).
  2. Fill out the account request form.
  3. Request approval from your advisor/PI by having them email supercloud@mit.edu to confirm that you will be using SuperCloud for your work. Note: SuperCloud will not contact your advisor directly—you must handle this step to get approval.
  • Note that approval can take up to a week.

Setting Up Your Account

  • Once you get approval you will receive an email with your username and further instructions to set up your account

  • You will be working only in the terminal on your computer

Setting Up Your Account (cont.)

  1. Create Key
[user1234@yourMachine]$ ssh-keygen -t rsa
  2. Copy generated key (by default it is saved in the .ssh folder of your home directory)
cat ~/.ssh/id_rsa.pub
  • Copy the entire output, including the ssh-rsa at the beginning.
  3. Add key to your account

Setting Up R and Your Packages

  1. Log in
ssh username@txe1-login.mit.edu
  2. Load anaconda module
source /etc/profile
module load anaconda/2023b
  3. Create a custom conda environment, activate it, and install any extra packages from within R
conda create -n customR -c conda-forge r-essentials r-tidyverse r-brms
source activate customR
R
install.packages(c("tidyverse", "brms")) # sample packages

Upload Your Files

  1. Open new console tab, where you are NOT logged in to the MIT account.

  2. Add your files to your SuperCloud account. Note: the first path is where the file lives on your computer, and the second is where you want it to go on SuperCloud

rsync ~/Desktop/file.R username@txe1-login.mit.edu:/home/gridsan/username
  3. Check that the files uploaded successfully to your SuperCloud account.
ssh username@txe1-login.mit.edu # Login to SuperCloud Account
ls

Submit Your Job

  1. Create a shell script that will run your R script
nano yourNewScript.sh
  2. Edit your shell script and write the following text in it.
#!/bin/bash

source /etc/profile

# Load the required modules
module load anaconda/2023b
source activate customR

# Run the script
Rscript ~/your_rscript.R
  3. Ctrl + X to leave the editor (press Y, then Enter, to save when prompted)
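
If you would rather not use nano, here is a sketch that writes the same job script from the command line and checks it for syntax errors. The temporary directory is a stand-in so the example can run anywhere; on SuperCloud you would write the file into your home directory.

```shell
job_dir="$(mktemp -d)"   # stand-in for your SuperCloud home directory

# The quoted 'EOF' keeps the shell from expanding anything inside the script.
cat > "$job_dir/yourNewScript.sh" <<'EOF'
#!/bin/bash

source /etc/profile

# Load the required modules
module load anaconda/2023b
source activate customR

# Run the script
Rscript ~/your_rscript.R
EOF

# bash -n parses the script without running it, so typos show up here
# instead of after the job has been submitted.
if bash -n "$job_dir/yourNewScript.sh"; then
  syntax="ok"
else
  syntax="broken"
fi
echo "$syntax"
```

bash -n only catches shell syntax errors. It won't notice a missing module or a wrong path to the R script.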

Submit Your Job (cont.)

  1. Run your code (Note: -s sets the number of cores; the command below requests 25). Each core comes with 4GB of RAM.
LLsub yourNewScript.sh -s 25
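
Because memory comes bundled with cores, picking a value for -s is partly a memory calculation. A small sketch of that arithmetic, based on the 4GB-per-core figure above. The helper names and the 30GB example are made up.

```shell
gb_per_core=4

# Memory (in GB) that comes with a given number of cores
mem_for_cores() { echo $(( $1 * gb_per_core )); }

# Fewest cores that provide at least the requested memory (rounds up)
cores_for_mem() { echo $(( ($1 + gb_per_core - 1) / gb_per_core )); }

mem_25="$(mem_for_cores 25)"      # memory that comes with -s 25
cores_30="$(cores_for_mem 30)"    # cores needed for a 30GB job

echo "-s 25 provides ${mem_25}GB"          # 100GB
echo "30GB needs at least -s ${cores_30}"  # 8
```

If your script is single-threaded but memory-hungry, this is how you end up asking for more cores than it will actually use.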

Checking on Job Status

  1. See running jobs (Note: this gives us the jobid, which appears in the name of the job's log file)
LLstat
  2. See status of job (Note: 24167762 in this example is my jobid)
cat yourNewScript.sh.log-24167762
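
The log file name follows the pattern shown above: script name, then .log-, then the jobid. A tiny helper (the name job_log is made up) saves retyping it. The tail line is commented out because the file only exists on SuperCloud.

```shell
# Build the log file name from the script name and the jobid
job_log() { printf '%s.log-%s\n' "$1" "$2"; }

log_file="$(job_log yourNewScript.sh 24167762)"
echo "$log_file"

# On SuperCloud, show just the last 20 lines instead of the whole file:
# tail -n 20 "$log_file"
```

For long-running jobs, tail is usually more practical than cat because the log keeps growing.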

Exporting Files to Your Computer

  1. Create new blank tab in terminal where you are not logged into SuperCloud.

  2. Export files (just flip order of how you imported the files)

rsync username@txe1-login.mit.edu:/home/gridsan/username/file_output.rds ~/Desktop

Other Notes

  • SuperCloud has monthly downtimes on the second Tuesday of each month. Reminder emails are sent out in advance and when the downtime is complete.

  • The SuperCloud team offers weekly office hours (announced via email)

  • For more help email: supercloud@mit.edu or see their online resources

  • You can upgrade your resource allocation by completing the Practical HPC course with a grade higher than 70%

Thank You!

Appendix

Running Your Script in Shell

Useful if code is slow, you have to leave your computer, etc.

Rscript <path-to-script>    # For R, capitalization matters
python <path-to-script>     # For python

Trick for slow code:

  • Use a bash script and make it run in the background

    nohup bash scheduler.sh &
  • where scheduler.sh is:

    /usr/lib/R/bin/R --no-restore --file=<path-to-file>
  • Same as Rscript, but printed output goes to a plain-text file called nohup.out and the job keeps running after you log out

  • Check in with htop (always works) or jobs (works sporadically)
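
A runnable sketch of the background-job pattern. The R call is replaced with a sleep and an echo so it works on any machine, and output is redirected to a named log file instead of relying on nohup.out (nohup only creates nohup.out when output would otherwise go to the terminal). The file names are placeholders.

```shell
workdir="$(mktemp -d)"

cat > "$workdir/scheduler.sh" <<'EOF'
#!/bin/bash
# Stand-in for the real R call so this sketch runs anywhere.
sleep 1
echo "analysis finished"
EOF

# Run in the background, immune to hangups, with all output in run.log
nohup bash "$workdir/scheduler.sh" > "$workdir/run.log" 2>&1 < /dev/null &
job_pid=$!
echo "started background job $job_pid"

wait "$job_pid"    # only for this demo; normally you would log out here
result="$(cat "$workdir/run.log")"
echo "$result"
```

Writing down the PID when you start the job makes it easy to find again in htop, or to kill if something goes wrong.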

For Python Users

Install Anaconda in /home/username; xvii only has Python 2.7

How to Use Jupyter Notebooks:

  • Use port forwarding to send the Jupyter notebook to a localhost. DO NOT SSH into xvii before doing this

    ssh -N -f -L 8000:localhost:8888 <your username>@xvii.mit.edu
  • Then, ssh into XVII and start Jupyter:

    jupyter notebook --no-browser --port=8888
  • Then open http://localhost:8000 in your browser and enter the token printed when you started Jupyter. There may be no token.