Job Submission System

We provide a cluster-like system to submit jobs to the robots. Whenever you submit a job, it will wait until a robot is free (in case they are all busy at the moment) and then execute your code there.

When the job is finished, you will be notified via email and the resulting logs, statistics and video will be made available for you to download.

User Account

If you would like to use the robot cluster for research purposes, please get in touch with us (see the contact information on the TriFinger website). We will then provide you with credentials to log in to the robot cluster. Wherever USERNAME appears below, replace it with your username.

With these credentials you can log in via SSH:

ssh USERNAME@robots.real-robot-challenge.com

Enter help to see a list of all available commands.

Configuration

SSH Key

You can set up an SSH key so you don’t need to enter your password every time you connect.

To do so, first connect via SSH (see above), then execute:

sshkey

Confirm setting a new key with “y”, then paste your public SSH key (e.g. by printing it with cat ~/.ssh/id_rsa.pub and manually copy-pasting the output).

Configuration File roboch.json

To use the robot cluster, you need to provide a configuration file called roboch.json. This is a simple JSON file which looks as follows:

{
  "repository": "git@github.com:example/example.git",
  "branch": "master",
  "email": "foobar@example.com",
  "git_deploy_key": "github_deploy_key",
  "singularity_image": "user_image.sif",
  "submission_runner": "rl_cube"
}

Parameters

Required Parameters:

Key	Meaning
repository	URL to your git repository.
email	Your email address. This is used to send emails to inform about finished or failed jobs.
submission_runner	Name of the submission runner to use. For RL experiments with the cube as the manipulated object, use `rl_cube`.

Optional Parameters:

Key	Meaning
branch	Branch of the git repository that is used.
git_deploy_key	SSH key to access the git repository, see Git Deploy Key.
singularity_image	Name of the custom Singularity/Apptainer image, see Custom Apptainer Image.
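
Before uploading, the file can be sanity-checked locally. The following is a minimal Python sketch, not part of the submission system: the helper names validate_config and validate_config_file are hypothetical, and reporting unknown keys is an assumption meant to catch typos (the server may well accept extra keys). It only checks key names from the two tables above, not whether the referenced files exist.

```python
import json

# Key names taken from the parameter tables above.
REQUIRED_KEYS = {"repository", "email", "submission_runner"}
OPTIONAL_KEYS = {"branch", "git_deploy_key", "singularity_image"}


def validate_config(config):
    """Return a list of problems found in a parsed roboch.json dict."""
    keys = set(config)
    problems = []
    for key in sorted(REQUIRED_KEYS - keys):
        problems.append(f"missing required key: {key}")
    # Assumption: keys outside the documented set are most likely typos.
    for key in sorted(keys - REQUIRED_KEYS - OPTIONAL_KEYS):
        problems.append(f"unknown key: {key}")
    return problems


def validate_config_file(path="roboch.json"):
    """Load the JSON file at `path` and validate its keys."""
    with open(path) as f:
        return validate_config(json.load(f))
```

An empty list means that no problems were found. The check command on the cluster remains the authoritative test.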

Upload the File

You can easily upload the file with scp.

scp roboch.json USERNAME@robots.real-robot-challenge.com:

Git Deploy Key

If you are using a non-public git repository, you need to provide a deploy key so that the system can clone your repository. The system only runs git clone, so a key with read-only permission is sufficient.

Upload the private key with

scp your_key USERNAME@robots.real-robot-challenge.com:

and specify the name of the key file with the parameter git_deploy_key in the configuration file roboch.json (see above).

Custom Apptainer Image

By default your submissions are executed in the standard Apptainer image (see Get the Image). If you need additional libraries to run your code, you can create a custom image by extending the standard image. For more information on how to create a custom image, see Add Custom Dependencies to the Container.

Assuming your custom image is called “user_image.sif”, you can upload it like the other files with

scp user_image.sif USERNAME@robots.real-robot-challenge.com:

Then update roboch.json by adding the following line (if it is not already there):

"singularity_image": "user_image.sif"

If you later want to update the image, simply overwrite the existing file.

Important

Do not update the image while you have a job currently running or pending as this might lead to a crash.

Verify Configuration

You can verify that your configuration is valid (i.e. all required parameters are specified and all referenced files exist) by logging in and running:

check

Submitting a Job

To submit a job, first connect via SSH:

ssh USERNAME@robots.real-robot-challenge.com

Then call:

submit

This will print the job ID which you will later need to download the data.

After you have submitted a job, it will wait for a free robot. Then your code will be deployed to that robot and the task/policy configured in the trifinger.toml of your repository will be run.

Important

You can only submit one job at a time. Calling submit again while there is an ongoing job will result in an error. After the job has finished it may take up to around one minute before you can submit the next job.

You can monitor the status of the job with

status

It will be listed as “idle” while it is waiting for a free robot and then change to “running” until it is finished.

As long as the job is still idle, you can cancel it with

cancel

This is no longer possible once the job has started; in that case you have to wait until it has finished.

Each job spans multiple episodes to reduce the overhead of the job submission system. Between episodes, the cube is pushed away from the barrier but not reset to the center of the arena. The number of episodes is 9 for the Push task and 6 for the Lift task.

Accessing Recorded Data

Once a job is finished, you can access the data here:

https://robots.real-robot-challenge.com/output/USERNAME/data

You need to authenticate with your username and password to access the files.

For each job a directory is created using the job ID as name. The job ID is printed when running submit and is also mentioned in the email you receive from the system when the job has finished.
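
When scripting downloads, the URL of an individual file can be assembled from the base URL above. The following Python sketch assumes that the job directory sits directly below .../USERNAME/data/, which is inferred from the description above rather than stated explicitly. The helper name job_file_url and the values "alice" and 1234 are hypothetical. Authentication is deliberately left out, since the exact scheme is not described here.

```python
from urllib.parse import quote

BASE_URL = "https://robots.real-robot-challenge.com/output"


def job_file_url(username, job_id, filename):
    """Build the URL of one file in the output directory of a job.

    Assumes the layout BASE_URL/USERNAME/data/JOB_ID/FILENAME.
    """
    parts = [username, "data", str(job_id), filename]
    # quote() escapes characters that are not valid in a URL path;
    # "/" is kept so that nested paths like "user/results.json" work.
    return BASE_URL + "/" + "/".join(quote(p) for p in parts)


# Hypothetical username and job ID, for illustration only.
print(job_file_url("alice", 1234, "report.json"))
```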

Verify if Job Ran Successfully

Before analysing the data recorded by the robot, you should verify that the job ran successfully. To do so, check the contents of the file report.json. It is a simple JSON file containing the following keys:

backend_error: Indicates if there was some error in the backend (e.g. some issue with the hardware).
user_returncode: The return code of the script that was running the policy. A non-zero value here indicates an error. May not exist in case of backend error.

Backend errors may happen from time to time, e.g. due to some failure in the hardware. They are usually not caused by the user code but most likely mean that the recorded data is invalid or incomplete. So if a backend error is reported, it is best to discard the data of this job and run another one.

Backend errors should happen only rarely. If you encounter them frequently, please let us know (e.g. by writing a mail to the contact person listed on the TriFinger website), so we can investigate the issue.
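
The check described in this section can be sketched in Python as follows. This is an illustration under assumptions, not an official tool: the function name job_succeeded is hypothetical, backend_error is assumed to be falsy (e.g. false) when no backend error occurred, and a missing user_returncode is treated as a failure.

```python
import json


def job_succeeded(report):
    """Return True if a parsed report.json indicates a usable run."""
    # Assumption: backend_error is falsy when the backend worked fine.
    if report.get("backend_error"):
        return False
    # user_returncode may be missing after a backend error; a missing
    # or non-zero value is treated as a failed run.
    return report.get("user_returncode") == 0


def job_succeeded_from_file(path="report.json"):
    """Load report.json from `path` and evaluate it."""
    with open(path) as f:
        return job_succeeded(json.load(f))
```

If the function returns False because of a backend error, discard the data of that job and submit a new one, as recommended above.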

Complete List of Generated Files

For each successful job the following files are created:

user: Directory with the following files:
- results.json: JSON file with information about the episodes that were executed in the job. It contains the following fields:
  - task: The task used in the job (“push” or “lift”).
  - n_episodes: Number of episodes that were executed.
  - episodes: List of episodes. For each episode the following values are given:
    - t_start/t_end: Start and end time of the episode within the job. They are only needed to generate the video.
    - success: Whether the goal was achieved at the end of the episode.
    - momentary_success_rate: Ratio of time steps at which the goal was achieved during the episode.
    - transient_success: Whether the goal was achieved at any time step during the episode.
    - return: Cumulative reward over the episode.
    - max_reward: Max. reward of a single time step during the episode.
    - goal: The goal of this episode.
  - statistics: Mean and std of the metrics over all episodes of the job.
- reward_distribution.pdf: Plot of the reward distribution.
- reward_over_time.pdf: Plot of the reward over time (averaged over all episodes).
build_output.txt: Output of the build of your package.
camera180.yml: Camera calibration for camera at 180°.
camera300.yml: Camera calibration for camera at 300°.
camera60.yml: Camera calibration for camera at 60°.
camera_data.dat: Raw camera data.
info.json: JSON file with some meta information about the job like timestamp, robot name and some info about the git repository that was used.
report.json: JSON file indicating whether the job was executed successfully. See Verify if Job Ran Successfully.
robot_data.dat: Raw robot data (recorded at 1 kHz and not only at the timesteps of the RL environment).
user_stderr.txt: Output of the user code that was sent to stderr.
user_stdout.txt: Output of the user code that was sent to stdout.
video.mp4: Video of the episodes with goal visualisation.
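
As an illustration of how user/results.json could be processed, the following Python sketch computes the success rate and mean return from the per-episode entries. It assumes that success is stored as a boolean and return as a number; the exact JSON types are not specified above. The function name summarize_episodes is hypothetical. The statistics field of the file already contains the mean and std of the metrics, so this is mainly useful as a cross-check or as a starting point for custom metrics.

```python
def summarize_episodes(results):
    """Summarize a parsed results.json dict over its episodes."""
    episodes = results["episodes"]
    n = len(episodes)
    if n == 0:
        raise ValueError("results contain no episodes")
    return {
        "task": results["task"],
        "n_episodes": n,
        # Fraction of episodes in which the goal was achieved at the end.
        "success_rate": sum(bool(e["success"]) for e in episodes) / n,
        # Mean cumulative reward per episode.
        "mean_return": sum(e["return"] for e in episodes) / n,
    }
```

The input is the dictionary obtained by loading the file with json.load.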

Automated Submissions

The system is mostly designed for interactive use but it can be automated to some degree.

Here is an example script which in a loop automatically submits jobs to the robot, downloads the recorded data and potentially runs some processing to update parameters:

https://github.com/rr-learning/rrc2022/blob/master/scripts/automated_submission.sh

Note that this script has no real error handling; it simply stops if anything goes wrong. You may adjust it according to your needs.

Troubleshooting

If you experience problems with the submission system please check the following points:

Can the repository you specified in the configuration file be cloned? If you intend to use HTTPS, the repository must be public. If you use SSH, you need to upload a deploy key and specify it in the configuration file (see Git Deploy Key). You can run the command check to make sure that the repository can be accessed.
Is the correct branch specified in the configuration file?
Is the repository up to date? The system will only use the code that is currently in the repository. If you have made changes to the code but not pushed them yet, they will not be used.

If you still experience problems, do not hesitate to send a mail to the contact person listed on the TriFinger website.