Distributed learning in an on-premise cluster – A Kaggle Reinforcement Learning case

Igor Muniz
Director of Artificial Intelligence

Have you ever tried a distributed learning algorithm? If you are just starting out in this area, probably not; but if you have been on this path for a few years, you have likely run into one of these models. The incredible development of machine learning in the last decade has not only brought a new state of the art to several problems but has also taken processing optimization and parallelization to another level. With increasingly large models, an ordinary machine, or even a single super machine, may not be enough to achieve the desired result in a reasonable time. Today we will talk about how we trained a reinforcement learning algorithm in a distributed way using our own cluster.

About the problem

Recently, Kaggle launched a simulation competition in which the goal was to train agents to master the world’s most popular sport: football. The agents play matches against each other, and the winners receive points that improve their position on the leaderboard. The complete system simulates a championship, which makes this competition really fun to take part in.

Google Research Football with Manchester City F.C. | Kaggle
Train agents to master the world’s most popular sport

Building an agent

Each agent in an 11 vs 11 game controls a single active player and takes actions to improve its team’s situation. As in a typical football match, you want your team to score more than the other side. You only control one player at a time, and your code picks one of 19 possible actions at each step.

To decide which action to take, you can “teach” your agent in three ways: creating rules, training it in the environment using reinforcement learning, or creating a supervised machine learning model to predict the most likely action at each step.

Rule-based bots are essentially agents built from lots of “if” and “else” statements, that is, a hand-crafted rule for each scenario. Although this seems extremely laborious, a well-designed set of rules can outperform other learning methods.
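To make this concrete, here is a minimal sketch of what such an agent could look like for this environment. The observation fields and action ids follow the open-source Google Research Football conventions, but treat them as assumptions rather than the exact competition interface:

def agent(obs):
    # Raw observation for the team we control (kaggle-environments style).
    player_obs = obs["players_raw"][0]
    ball_x, ball_y, _ = player_obs["ball"]

    # Action ids from the 19-action set (gfootball convention).
    RIGHT, LONG_PASS, SHOT = 5, 9, 12

    if player_obs["ball_owned_team"] != 0:
        return [RIGHT]        # we don't own the ball: push towards it
    if ball_x > 0.7:
        return [SHOT]         # close to the opponent's goal: shoot
    return [LONG_PASS]        # otherwise move the ball up the pitch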

Reinforcement learning algorithms are concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. They interact with the environment repeatedly with a single objective: to learn how to win the match.
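As a rough illustration of this interaction loop, here is a sketch using the open-source gfootball package rather than the Kaggle wrapper, with a random placeholder policy standing in for a trained one:

import gfootball.env as football_env

# Create a full 11 vs 11 match with a compact state representation.
env = football_env.create_environment(
    env_name="11_vs_11_stochastic", representation="simple115")

obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()         # a trained policy goes here
    obs, reward, done, info = env.step(action)
    total_reward += reward                     # the cumulative reward to maximize
print("episode reward:", total_reward)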

Finally, we can use a supervised machine learning algorithm. But for this to be possible, we inherently need data to train our model, and the only way to obtain it is by copying the matches of other competitors. For this reason, this technique is commonly called imitation learning: an attempt to create an agent as good as the top-ranked ones.
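A bare-bones version of this idea, assuming you have already extracted (state, action) pairs from downloaded replays into NumPy arrays (the file names and network size below are placeholders), could look like this:

import numpy as np
import tensorflow as tf

# states: (N, num_features) game-state features; actions: (N,) ids in [0, 19).
states = np.load("replay_states.npy")
actions = np.load("replay_actions.npy")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(19, activation="softmax"),  # one of the 19 actions
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(states, actions, epochs=10, batch_size=256)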

Seed RL

For the competition we ended up trying all three methods, but here we will discuss the reinforcement learning approach, which required an interesting distributed-processing solution and is the subject of this post.

The chosen algorithm was Seed RL (https://github.com/google-research/seed_rl), a distributed reinforcement learning architecture in which both training and inference are performed on the learner, which means we can run the learner in isolation from the actors.

google-research/seed_rl
SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference. Implements IMPALA and R2D2 algorithms in TF2 with SEED’s architecture.

This repository contains a script to perform all training locally on one machine and another script capable of distributing the training using Google Cloud machines. Although the latter option is ideal, training a reinforcement learning algorithm usually takes a long time, which would make the final cost unfeasible. Since we have our own on-premise cluster, we just had to figure out how to distribute this training ourselves.

Our method

The first thing to do was to create a configurable script that determines what should run on a given instance: an actor or the learner. The algorithm itself already allows the two to be executed separately; we only needed to keep the execution of each one isolated.
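The idea, reduced to a simplified Python sketch (the ROLE variable and the two placeholder functions below are illustrative, not the actual Seed RL entrypoints), is just a role switch driven by configuration:

import os
import sys

def run_learner():
    # Placeholder: in practice this would invoke Seed RL's learner entrypoint.
    print("starting learner...")

def run_actor():
    # Placeholder: in practice this would invoke a Seed RL actor entrypoint.
    print("starting actor...")

def main():
    role = os.environ.get("ROLE", "actor")   # "learner" or "actor"
    if role == "learner":
        run_learner()
    elif role == "actor":
        run_actor()
    else:
        sys.exit(f"Unknown ROLE: {role!r}")

if __name__ == "__main__":
    main()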

The second thing was to set the parameters expected by the learner, such as the number of actors. Since we distribute some number of actors across different machines, the learner needs to know how many responses to expect, that is, the total number of actors training simultaneously.
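For example, with absl flags this parameter could be exposed like so (the flag name here is illustrative; the actual Seed RL flag may differ):

from absl import app, flags

flags.DEFINE_integer(
    "num_actors", 1,
    "Total number of actors running across all machines; the learner "
    "expects this many inference connections.")
FLAGS = flags.FLAGS

def main(argv):
    del argv
    print(f"Learner configured for {FLAGS.num_actors} actors")

if __name__ == "__main__":
    app.run(main)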

We also had to configure the connection between the learner and each actor over the network, using the gRPC protocol already present in the original project. After that, we wrote a shell script responsible for starting the execution of each part of the algorithm, passing it the appropriate parameters.

set -e

die () {
    echo >&2 "$@"
    exit 1
}

ENVIRONMENTS="atari|dmlab|football"
AGENTS="r2d2|vtrace|sac"

[ "$#" -ne 0 ] || die "Usage: run_learner.sh [$ENVIRONMENTS] [$AGENTS] [Num. actors] [Server Port]"
echo $1 | grep -E -q $ENVIRONMENTS || die "Supported games: $ENVIRONMENTS"
echo $2 | grep -E -q $AGENTS || die "Supported agents: $AGENTS"
echo $3 | grep -E -q "^((0|([1-9][0-9]*))|(0x[0-9a-fA-F]+))$" || die "Number of actors should be a non-negative integer without leading zeros"

export ENVIRONMENT=$1
export AGENT=$2
export NUM_ACTORS=$3
export SERVER_PORT=$4
export CONFIG=$ENVIRONMENT

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
cd $DIR

docker/build.sh

docker_version=$(docker version --format '{{.Server.Version}}')
if [[ "19.03" > $docker_version ]]; then
  docker run --network host -p $SERVER_PORT:$SERVER_PORT --entrypoint ./docker/run_learner.sh -ti --name seed_learner --rm seed_rl:$ENVIRONMENT $ENVIRONMENT $AGENT $NUM_ACTORS $SERVER_PORT
else
  docker run --network host -p $SERVER_PORT:$SERVER_PORT --gpus all --entrypoint ./docker/run_learner.sh -ti -e HOST_PERMS="$(id -u):$(id -g)" --name seed_learner --rm seed_rl:$ENVIRONMENT $ENVIRONMENT $AGENT $NUM_ACTORS $SERVER_PORT
fi
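As an illustration of how the pieces were then launched (the actor-side script is not shown in this post, so its name and argument order below are assumptions), the learner and a remote actor could be started along these lines:

./run_learner.sh football vtrace 32 8686          # on the learner machine
./run_actor.sh football vtrace learner-host 8686  # on each actor machine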

Finally, we put everything inside a Docker image and distributed that image to the machines in our cluster. Once all machines had the necessary Docker image and startup scripts, we were able to easily orchestrate everything using our Kubernetes cluster.
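As a rough sketch of that orchestration step (the image name, replica count, and environment variables are assumptions, and in practice this would typically live in a YAML manifest), the official Kubernetes Python client can spin up a pool of actor pods like this:

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

actor = client.V1Container(
    name="seed-actor",
    image="registry.local/seed_rl:football",           # assumed image name
    env=[client.V1EnvVar(name="ROLE", value="actor"),
         client.V1EnvVar(name="LEARNER_ADDRESS", value="seed-learner:8686")],
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="seed-actors"),
    spec=client.V1DeploymentSpec(
        replicas=32,                                    # number of actors
        selector=client.V1LabelSelector(match_labels={"app": "seed-actor"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "seed-actor"}),
            spec=client.V1PodSpec(containers=[actor]),
        ),
    ),
)

apps.create_namespaced_deployment(namespace="default", body=deployment)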

Conclusion

With the recent advances in machine learning, it is important to know how to distribute training, especially considering the time and money it can save.
