Batch
The AWSBatchManager allows you to use the AWS Batch service as a Julia cluster.
Requirements
- An IAM role that allows batch:SubmitJob, batch:DescribeJobs, and batch:DescribeJobDefinitions.
- A Docker image registered with AWS ECR which has Julia and AWSClusterManagers.jl installed.
The AWSBatchManager requires AWS Batch jobs to run with "networkMode=host". Because this is the default for AWS Batch, it is mentioned here only for completeness.
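The IAM permissions listed above could be granted with an inline policy along the following lines. This is a minimal sketch rather than the package's own setup: the function name batch_manager_policy, the policy name AWSBatchManagerAccess, and the wildcard Resource are assumptions made for illustration.

```shell
# Hypothetical helper: print an IAM policy document granting the three
# Batch actions that AWSBatchManager needs. "Resource": "*" is an
# assumption made for brevity; narrow it where your setup allows.
batch_manager_policy() {
    cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "batch:SubmitJob",
        "batch:DescribeJobs",
        "batch:DescribeJobDefinitions"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

# Print the policy for review.
batch_manager_policy

# To attach it to an existing role (role and policy names are placeholders):
#   aws iam put-role-policy \
#       --role-name AWSBatchClusterManagerJobRole \
#       --policy-name AWSBatchManagerAccess \
#       --policy-document "$(batch_manager_policy)"
```

Keeping the policy in a function makes it easy to review the JSON before anything is sent to AWS.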
Usage
Let's assume we want to run the following script:
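# demo.jl
using AWSClusterManagers: AWSBatchManager
using Distributed  # required on Julia 0.7+ for addprocs, nprocs, workers, etc.

# Launch 4 worker processes as AWS Batch jobs
addprocs(AWSBatchManager(4))
println("Num Procs: ", nprocs())

# Define `id` on every process
@everywhere id = myid()
# Ask each worker to evaluate the closure and print what it returns
for i in workers()
    println("Worker $i: ", remotecall_fetch(() -> id, i))
end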
The workflow for deploying it on AWS Batch will be:
- Build a Docker image for your program.
- Push the image to ECR.
- Register a new job definition which uses that image and specifies a command to run.
- Submit a job to Batch.
Overview
The client machines on the left (e.g., your laptop) begin by pushing a docker image to ECR, registering a job definition, and submitting a cluster manager batch job. The cluster manager job (JobID: 9086737) begins executing julia demo.jl, which immediately submits 4 more batch jobs (JobIDs: 4636723, 3957289, 8650218 and 7931648) to function as its workers. The manager then waits for the worker jobs to become available and register themselves with the manager by executing julia -e 'sock = connect(<manager_ip>, <manager_port>); Base.start_worker(sock, <cluster_cookie>)' in identical containers. Once the workers are available the remainder of the script sees them as ordinary julia worker processes (identified by the integer pid values shown in parentheses). Finally, the batch manager exits, releasing all batch resources, and writing all STDOUT & STDERR to CloudWatch logs for the clients to view or download.
Building the Docker Image
To begin, we'll want to build a Docker image which contains:
- julia
- AWSClusterManagers
- demo.jl
Example:
FROM julia-bin:1.0
RUN julia -e 'using Pkg; Pkg.add("AWSClusterManagers")'
COPY demo.jl .
CMD ["julia", "demo.jl"]
Now build the Docker image with:
docker build -t 000000000000.dkr.ecr.us-east-1.amazonaws.com/demo:latest .
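The image name in the build command encodes the AWS account ID, the region, the repository name, and the tag. As a small sketch, assuming you prefer to assemble that name from its parts, a helper such as the following could be used. The function name ecr_image_uri is hypothetical, and the values are the same placeholders used in the command above.

```shell
# Hypothetical helper: assemble an ECR image URI from its parts.
#   $1 = AWS account ID, $2 = region, $3 = repository name, $4 = tag
ecr_image_uri() {
    printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "$4"
}

# Placeholder values matching the build command above.
IMAGE=$(ecr_image_uri 000000000000 us-east-1 demo latest)
echo "$IMAGE"

# With Docker available, the build step then becomes:
#   docker build -t "$IMAGE" .
```

Storing the URI in a variable lets the same value be reused for the push and the job definition.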
Pushing to ECR
Now we want to get our Docker image onto ECR. Start by logging into the ECR service (this assumes you have awscli configured with the correct permissions):
$(aws ecr get-login --region us-east-1)
Now you should be able to push the image to ECR:
docker push 000000000000.dkr.ecr.us-east-1.amazonaws.com/demo:latest
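Note that version 2 of the AWS CLI no longer provides aws ecr get-login; its replacement is aws ecr get-login-password, whose output is piped to docker login. The following is a sketch of the login and push steps under that assumption. The function names ecr_registry and ecr_login_and_push are hypothetical, and the push assumes the demo repository already exists in ECR.

```shell
# Hypothetical helper: extract the registry host (everything before the
# first "/") from a full ECR image URI.
ecr_registry() {
    printf '%s\n' "${1%%/*}"
}

# Hypothetical wrapper: log in to ECR with get-login-password, then push.
#   $1 = full image URI, $2 = region
ecr_login_and_push() {
    aws ecr get-login-password --region "$2" |
        docker login --username AWS --password-stdin "$(ecr_registry "$1")" &&
        docker push "$1"
}

# Example (requires the AWS CLI, Docker, and an existing "demo" repository):
#   ecr_login_and_push 000000000000.dkr.ecr.us-east-1.amazonaws.com/demo:latest us-east-1
```

Passing the password over standard input keeps it out of the shell history and the process list.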
Registering a Job Definition
Let's register a job definition now.
NOTE: Registering a batch job requires the ECR image (see above) and an IAM role to apply to the job. The AWSBatchManager requires that the IAM role have access to the following operations:
- batch:SubmitJob
- batch:DescribeJobs
- batch:DescribeJobDefinitions
Example:
aws batch register-job-definition --job-definition-name aws-batch-demo --type container --container-properties '
{
"image": "000000000000.dkr.ecr.us-east-1.amazonaws.com/demo:latest",
"vcpus": 1,
"memory": 1024,
"jobRoleArn": "arn:aws:iam::000000000000:role/AWSBatchClusterManagerJobRole",
"command": ["julia", "demo.jl"]
}'
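If the image or role changes between environments, the container properties can be generated rather than typed inline. The following is a sketch under that assumption: the function names container_properties and register_demo_definition are hypothetical, and the remaining values are copied from the example above.

```shell
# Hypothetical helper: print container properties for the demo job.
#   $1 = image URI, $2 = job role ARN
container_properties() {
    cat <<EOF
{
  "image": "$1",
  "vcpus": 1,
  "memory": 1024,
  "jobRoleArn": "$2",
  "command": ["julia", "demo.jl"]
}
EOF
}

# Hypothetical wrapper around the registration command shown above.
#   $1 = image URI, $2 = job role ARN
register_demo_definition() {
    aws batch register-job-definition \
        --job-definition-name aws-batch-demo \
        --type container \
        --container-properties "$(container_properties "$1" "$2")"
}

# Preview the generated JSON without calling AWS.
container_properties \
    000000000000.dkr.ecr.us-east-1.amazonaws.com/demo:latest \
    arn:aws:iam::000000000000:role/AWSBatchClusterManagerJobRole
```

Printing the JSON first gives a chance to check the image and role before register_demo_definition is called with the same two arguments.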
NOTE: A job definition only needs to be registered once and can be re-used for multiple job submissions.
Submitting Jobs
Once the job definition has been registered, we can run the AWS Batch job. Running a job requires a compute environment with an associated job queue. With those in place, submit the job:
aws batch submit-job --job-name aws-batch-demo --job-definition aws-batch-demo --job-queue aws-batch-queue
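After submission, the job moves through several states before finishing as either SUCCEEDED or FAILED. The following is a sketch of submitting the job and polling until it reaches one of those final states. The function names is_terminal_status and submit_and_wait and the 30-second polling interval are assumptions, and the job, definition, and queue names are taken from the command above.

```shell
# Hypothetical helper: succeed when a Batch job status is final.
is_terminal_status() {
    case "$1" in
        SUCCEEDED|FAILED) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: submit the demo job, then poll until it finishes.
# The 30-second polling interval is an arbitrary choice.
submit_and_wait() {
    job_id=$(aws batch submit-job \
        --job-name aws-batch-demo \
        --job-definition aws-batch-demo \
        --job-queue aws-batch-queue \
        --query jobId --output text) || return 1
    while :; do
        status=$(aws batch describe-jobs --jobs "$job_id" \
            --query 'jobs[0].status' --output text) || return 1
        echo "Job $job_id: $status"
        if is_terminal_status "$status"; then break; fi
        sleep 30
    done
    # Report success only if the job itself succeeded.
    [ "$status" = SUCCEEDED ]
}
```

Calling submit_and_wait requires the AWS CLI and the queue described above; its exit status reflects whether the job succeeded.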
Running AWSBatchManager Locally
While it is generally preferable to run the AWSBatchManager as a batch job, it can also be run locally. In this case, worker batch jobs are submitted from your local machine and need to connect back to your machine from Amazon's network. Unfortunately, this may result in networking bottlenecks if you're transferring large amounts of data between the manager (your local machine) and the workers (batch jobs).
As with the previous workflow, the client machine on the left begins by pushing a Docker image to ECR (so the workers have access to the same code) and registers a job definition (if one doesn't already exist). The client machine then runs julia demo.jl as the cluster manager, which immediately submits 4 batch jobs (JobIDs: 4636723, 3957289, 8650218 and 7931648) to function as its workers. The client machine waits for the worker machines to come online. Once the workers are available, the remainder of the script sees them as ordinary Julia worker processes (identified by the integer pid values shown in parentheses).
NOTE: Since the AWSBatchManager is not being run from within a batch job we need to give it some extra parameters when we create it.
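using AWSClusterManagers: AWSBatchManager
using Distributed

# Outside of a batch job these settings must be given explicitly
mgr = AWSBatchManager(
    4,
    definition="aws-batch-worker",
    name="aws-batch-worker",
    queue="aws-batch-queue",
    region="us-west-1",
    timeout=5
)

# Launch the workers as batch jobs
addprocs(mgr)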