Job Submission System
We provide a cluster-like system to submit jobs to the robots. Whenever you submit a job, it will wait until a robot is free (in case they are all busy at the moment) and then execute your code there.
When the job is finished, you will be notified via e-mail and the resulting logs, statistics and video are made available for you to download.
User Account
If you would like to use the robot cluster for research purposes, please get
in touch with us (see the contact information on the TriFinger website). We will
then provide you with credentials to log in to the robot cluster. Whenever
USERNAME is mentioned below, replace it with your username.
With these credentials you can log in via SSH:
ssh USERNAME@robots.real-robot-challenge.com
Enter help to see a list of all available commands.
Configuration
SSH Key
You can set up an SSH key so you don’t need to enter your password every time you connect.
To do so, first connect via SSH (see above), then execute:
sshkey
Confirm setting a new key with “y”, then paste your public SSH key
(e.g. by printing it with cat ~/.ssh/id_rsa.pub and manually copy-pasting
the output).
Configuration File roboch.json
To use the robot cluster, you need to provide a configuration file called
roboch.json. This is a simple JSON file which looks as follows:
{
    "repository": "git@github.com:example/example.git",
    "branch": "master",
    "email": "foobar@example.com",
    "git_deploy_key": "github_deploy_key",
    "singularity_image": "user_image.sif",
    "submission_runner": "rl_cube"
}
Parameters
Required Parameters:
Key | Meaning |
---|---|
repository | URL to your git repository. |
email | Your email address. This is used to send emails to inform about finished or failed jobs. |
submission_runner | Name of the submission runner to use. For RL experiments with the cube as the manipulated object, use rl_cube (as in the example above). |
Optional Parameters:
Key | Meaning |
---|---|
branch | Branch of the git repository that is used. |
git_deploy_key | SSH key to access the git repository, see Git Deploy Key. |
singularity_image | Name of the custom Singularity/Apptainer image, see Custom Apptainer Image. |
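Before uploading, the file can be checked locally against the parameter tables above. The following is a minimal sketch, not part of the submission system: the function names check_config and load_and_check are hypothetical, and the rule that any key not listed above is reported is an assumption (the system may accept further keys). It does not replace the check command on the cluster, which also verifies that the mentioned files exist.

```python
import json

# Key names taken from the example and the parameter tables above.
REQUIRED_KEYS = ("repository", "email", "submission_runner")
OPTIONAL_KEYS = ("branch", "git_deploy_key", "singularity_image")


def check_config(config):
    """Return a list of problems found in a roboch.json dictionary."""
    problems = []
    # Every required key must be present and non-empty.
    for key in REQUIRED_KEYS:
        if not config.get(key):
            problems.append(f"missing required key: {key}")
    # Assumption: keys that are not documented above are probably typos.
    known = set(REQUIRED_KEYS) | set(OPTIONAL_KEYS)
    for key in config:
        if key not in known:
            problems.append(f"undocumented key: {key}")
    return problems


def load_and_check(path="roboch.json"):
    """Load the configuration file from disk and check it."""
    with open(path) as config_file:
        return check_config(json.load(config_file))
```

Calling load_and_check() in the directory that contains roboch.json returns an empty list if nothing was found.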
Upload the File
You can easily upload the file with scp.
scp roboch.json USERNAME@robots.real-robot-challenge.com:
Git Deploy Key
In case you are using a non-public git repository, you need to provide a deploy
key so that the system can clone your repository. All the system is doing is a
git clone, so a key with read permission is enough.
Upload the private key with
scp your_key USERNAME@robots.real-robot-challenge.com:
and specify the name of the key file with the parameter git_deploy_key in
the configuration file roboch.json (see above).
Custom Apptainer Image
By default your submissions are executed in the standard Apptainer image (see Get the Image). If you need additional libraries to run your code, you can create a custom image by extending the standard image. For more information on how to create a custom image, see Add Custom Dependencies to the Container.
Assuming your custom image is called “user_image.sif”, you can upload it like the other files with
scp user_image.sif USERNAME@robots.real-robot-challenge.com:
Then update roboch.json by adding the following line (if it is not already
there):
"singularity_image": "user_image.sif"
If you later want to update the image, simply overwrite the existing file.
Important
Do not update the image while you have a job currently running or pending as this might lead to a crash.
Verify Configuration
You can verify that your configuration is valid (i.e. all required parameters are specified and all mentioned files exist) by logging in and running:
check
Submitting a Job
To submit a job, first connect via SSH:
ssh USERNAME@robots.real-robot-challenge.com
Then call:
submit
This will print the job ID which you will later need to download the data.
After you have submitted a job, it will wait for a free robot. Then your code will
be deployed to that robot and the task/policy configured in the trifinger.toml of
your repository will be run.
Important
You can only submit one job at a time. Calling submit again while there
is an ongoing job will result in an error. After the job has finished it
may take up to around one minute before you can submit the next job.
You can monitor the status of the job with
status
It will be listed as “idle” while it is waiting for a free robot and then change to “running” until it is finished.
As long as the job is still idle, you can cancel it with
cancel
This is no longer possible once the job has started. In that case you have to wait until it has finished.
Each job spans multiple episodes to reduce the overhead of the job submission system. Between episodes, the cube is pushed away from the barrier but not reset to the center of the arena. For the Push task the number of episodes is 9 and for the Lift task it is 6.
Accessing Recorded Data
Once a job is finished, you can access the data here:
https://robots.real-robot-challenge.com/output/USERNAME/data
You need to authenticate with your username and password to access the files.
For each job a directory is created using the job ID as name. The job ID is
printed when running submit and is also mentioned in the email you receive
from the system when the job has finished.
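Downloading a single file of a job can be sketched as follows. This is an illustration under assumptions, not an official client: the helper names are hypothetical, the URL layout (job directory directly below the data URL given above) is inferred from the description, and HTTP Basic authentication is assumed because the page asks for username and password.

```python
import shutil
import urllib.request

BASE_URL = "https://robots.real-robot-challenge.com/output"


def job_file_url(username, job_id, filename):
    """Build the URL of one file in the output directory of a job."""
    return f"{BASE_URL}/{username}/data/{job_id}/{filename}"


def download_job_file(username, password, job_id, filename, target_path):
    """Download one file of a job (assumes HTTP Basic authentication)."""
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, BASE_URL, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(manager)
    opener = urllib.request.build_opener(handler)
    url = job_file_url(username, job_id, filename)
    # Stream to disk so that large files do not have to fit into memory.
    with opener.open(url) as response, open(target_path, "wb") as target:
        shutil.copyfileobj(response, target)
```

For example, download_job_file("USERNAME", "PASSWORD", "JOB_ID", "report.json", "report.json") would store the report of the given job in the current directory.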
Verify if Job Ran Successfully
Before analysing the data recorded by the robot, you should verify that the job ran successfully. To do so, check the content of the file report.json. It is a simple JSON file containing the following keys:
backend_error: Indicates if there was some error in the backend (e.g. some issue with the hardware).
user_returncode: The return code of the script that was running the policy. A non-zero value here indicates an error. May not exist in case of backend error.
Backend errors may happen from time to time, e.g. due to some failure in the hardware. They are usually not caused by the user code but most likely mean that the recorded data is invalid or incomplete. So if a backend error is reported, it is best to discard the data of this job and run another one.
Backend errors should happen only rarely. If you encounter them frequently, please let us know (e.g. by writing a mail to the contact person listed on the TriFinger website), so we can investigate the issue.
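The check described above can be expressed in a few lines. This is a minimal sketch with a hypothetical function name; it assumes that a truthy backend_error value means that an error occurred, and it treats a missing user_returncode as a failure to stay on the safe side.

```python
import json


def job_ran_successfully(report):
    """Return True if a parsed report.json indicates a valid run."""
    # Assumption: any truthy value of "backend_error" means an error occurred.
    if report.get("backend_error"):
        return False
    # "user_returncode" may be missing after a backend error, so a missing
    # value is treated like a non-zero return code.
    return report.get("user_returncode") == 0


def load_report(path="report.json"):
    """Load a downloaded report.json file and check it."""
    with open(path) as report_file:
        return job_ran_successfully(json.load(report_file))
```

If the function returns False because of a backend error, the data of that job should be discarded as explained above.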
Complete List of Generated Files
For each successful job the following files are created:
user: Directory with the following files:
    results.json: JSON file with information about the episodes that were executed in the job. It contains the following fields:
        task: The task used in the job (“push” or “lift”).
        n_episodes: Number of episodes that were executed.
        episodes: List of episodes. For each episode the following values are given:
            t_start/t_end: Start and end time of the episode within the job. They are only needed to generate the video.
            success: Whether the goal was achieved at the end of the episode.
            momentary_success_rate: Ratio of time steps at which the goal was achieved during the episode.
            transient_success: Whether the goal was achieved at any time step during the episode.
            return: Cumulative reward over the episode.
            max_reward: Max. reward of a single time step during the episode.
            goal: The goal of this episode.
        statistics: Mean and std of the metrics over all episodes of the job.
    reward_distribution.pdf: Plot of the reward distribution.
    reward_over_time.pdf: Plot of the reward over time (averaged over all episodes).
build_output.txt: Output of the build of your package.
camera180.yml: Camera calibration for camera at 180°.
camera300.yml: Camera calibration for camera at 300°.
camera60.yml: Camera calibration for camera at 60°.
camera_data.dat: Raw camera data.
info.json: JSON file with some meta information about the job like timestamp, robot name and some info about the git repository that was used.
report.json: JSON file with information whether the job was executed successfully. See Verify if Job Ran Successfully.
robot_data.dat: Raw robot data (recorded at 1 kHz and not only at the timesteps of the RL environment).
user_stderr.txt: Output of the user code that was sent to stderr.
user_stdout.txt: Output of the user code that was sent to stdout.
video.mp4: Video of the episodes with goal visualisation.
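As an illustration of how results.json might be processed, the following sketch recomputes two simple metrics from the per-episode values. The function name is hypothetical, and it is assumed that success is a boolean and return a number, as the field descriptions above suggest. The statistics field of the file already provides mean and std, so this is mainly useful as a starting point for custom metrics.

```python
def summarize_episodes(results):
    """Compute success rate and mean return from a parsed results.json."""
    episodes = results["episodes"]
    n_episodes = len(episodes)
    if n_episodes == 0:
        raise ValueError("results contain no episodes")
    return {
        "n_episodes": n_episodes,
        # Fraction of episodes in which the goal was achieved at the end.
        "success_rate": sum(bool(e["success"]) for e in episodes) / n_episodes,
        # Average cumulative reward per episode.
        "mean_return": sum(e["return"] for e in episodes) / n_episodes,
    }
```

The input is the dictionary obtained by loading results.json with the json module of the standard library.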
Automated Submissions
The system is mostly designed for interactive use but it can be automated to some degree.
Here is an example script which in a loop automatically submits jobs to the robot, downloads the recorded data and potentially runs some processing to update parameters:
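A minimal sketch of such a loop in Python, written under several assumptions that are not confirmed by this page: the cluster shell accepts commands on standard input of a non-interactive SSH session (an SSH key is set up), the job ID is the last integer printed by submit (extract_job_id is a hypothetical parser that must be adapted to the real output), the output files only appear once the job has finished and a 404 response means "not there yet", and the web server uses HTTP Basic authentication. All function names are illustrative.

```python
import json
import re
import subprocess
import time
import urllib.error
import urllib.request

HOST = "robots.real-robot-challenge.com"
OUTPUT_URL = f"https://{HOST}/output"


def run_remote(username, command):
    """Send one command to the cluster shell over SSH and return its output.

    Assumption: the shell accepts commands on standard input. With
    check=True the script stops if ssh exits with an error.
    """
    result = subprocess.run(
        ["ssh", "-T", f"{username}@{HOST}"],
        input=command + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_job_id(submit_output):
    """Hypothetical parser: use the last integer printed by submit."""
    numbers = re.findall(r"\d+", submit_output)
    if not numbers:
        raise ValueError("no job ID found in the output of submit")
    return numbers[-1]


def make_opener(username, password):
    """Create a URL opener, assuming HTTP Basic authentication."""
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, OUTPUT_URL, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(manager)
    return urllib.request.build_opener(handler)


def wait_for_file(opener, url, poll_seconds=30):
    """Poll until the file at url exists, then return its content as text."""
    while True:
        try:
            with opener.open(url) as response:
                return response.read().decode()
        except urllib.error.HTTPError as error:
            if error.code != 404:  # assumption: 404 means "not there yet"
                raise
        time.sleep(poll_seconds)


def process_results(results):
    """Placeholder for user-specific processing, e.g. updating parameters."""
    print("task:", results["task"], "episodes:", results["n_episodes"])


def main(username, password, n_jobs):
    opener = make_opener(username, password)
    for _ in range(n_jobs):
        job_id = extract_job_id(run_remote(username, "submit"))
        job_url = f"{OUTPUT_URL}/{username}/data/{job_id}"
        report = json.loads(wait_for_file(opener, f"{job_url}/report.json"))
        if report.get("backend_error") or report.get("user_returncode") != 0:
            raise RuntimeError(f"job {job_id} failed: {report}")
        results_url = f"{job_url}/user/results.json"
        process_results(json.loads(wait_for_file(opener, results_url)))
        # The system may need up to about a minute before accepting a new job.
        time.sleep(60)
```

The loop would be started with a call such as main("USERNAME", "PASSWORD", n_jobs=3).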
Note that this script has no real error handling; it simply stops if anything goes wrong. You may adjust it according to your needs.
Troubleshooting
If you experience problems with the submission system please check the following points:
Can the repository you specified in the configuration file be cloned? If you intend to use HTTPS, the repository must be public. If you use SSH, you need to upload a deploy key and specify it in the configuration file (see Git Deploy Key). You can run the command check to make sure that the repository can be accessed.
Is the correct branch specified in the configuration file?
Is the repository up to date? The system will only use the code that is currently in the repository. If you have made changes to the code but not pushed them yet, they will not be used.
If you still experience problems, do not hesitate to send a mail to the contact person listed on the TriFinger website.