# Performance Evaluation of a Single Algorithm
RLLTE provides evaluation methods based on [rliable](https://github.com/google-research/rliable). We reconstruct and improve the code of the official rliable repository, achieving greater convenience and efficiency.
## Download Data
Suppose we want to evaluate algorithm performance on the Procgen benchmark. First, download the data from rllte-hub:
```python title="example.py"
# load packages
from rllte.evaluation import Performance, Comparison, min_max_normalize
from rllte.hub.datasets import Procgen, Atari
import numpy as np
# load scores
procgen = Procgen()
procgen_scores = procgen.load_scores()
print(procgen_scores.keys())
# get ppo-normalized scores
ppo_norm_scores = dict()
MIN_SCORES = np.zeros_like(procgen_scores['ppo'])
MAX_SCORES = np.mean(procgen_scores['ppo'], axis=0)
for algo in procgen_scores.keys():
    ppo_norm_scores[algo] = min_max_normalize(procgen_scores[algo],
                                              min_scores=MIN_SCORES,
                                              max_scores=MAX_SCORES)
# Output:
# dict_keys(['ppg', 'mixreg', 'ppo', 'idaac', 'plr', 'ucb-drac'])
```

For each algorithm, this will return an `NdArray` of size (10 x 16), where `scores[n][m]` represents the score of run `n` on task `m`.
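Note that because `MIN_SCORES` is all zeros and `MAX_SCORES` holds the per-task mean PPO score, this normalization amounts to dividing each score by the mean PPO score on its task, so PPO itself averages to 1.0 on every task. Below is a minimal sketch of the idea, assuming `min_max_normalize` applies the usual elementwise `(x - min) / (max - min)` rule (the helper below is hypothetical, not the RLLTE implementation):

```python
import numpy as np

def min_max_normalize_sketch(scores, min_scores, max_scores):
    # Elementwise min-max normalization; a sketch of the assumed
    # behavior, not the RLLTE implementation itself.
    return (scores - min_scores) / (max_scores - min_scores)

rng = np.random.default_rng(0)
raw = rng.uniform(0.0, 10.0, size=(10, 16))  # 10 runs x 16 tasks
mins = np.zeros_like(raw)
maxs = raw.mean(axis=0)                      # per-task mean plays the "max" role
norm = min_max_normalize_sketch(raw, mins, maxs)
print(norm.mean(axis=0))                     # ~1.0 for every task
```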
## Performance Evaluation
Initialize the performance evaluator:
```python title="example.py"
perf = Performance(scores=ppo_norm_scores['ppo'],
                   get_ci=True  # get confidence intervals
                   )
perf.aggregate_mean()
# Output:
# Computing confidence interval for aggregate MEAN...
# (1.0, array([[0.9737281 ], [1.02564405]]))
```

Available metrics:

| Metric | Remark |
|---|---|
| `.aggregate_mean` | Computes mean of sample mean scores per task. |
| `.aggregate_median` | Computes median of sample mean scores per task. |
| `.aggregate_og` | Computes optimality gap across all runs and tasks. |
| `.aggregate_iqm` | Computes the interquartile mean across runs and tasks. |
| `.create_performance_profile` | Computes the performance profiles. |
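The aggregate metrics can be assumed to follow the same calling pattern as `aggregate_mean` above, returning the point estimate together with a confidence interval when the evaluator was created with `get_ci=True`. A minimal sketch under that assumption (`create_performance_profile` is omitted here, since its return format is not shown above):

```python
# Assumption: the remaining aggregate metrics take no arguments and
# return (point_estimate, confidence_interval), mirroring the
# aggregate_mean call shown above.
median, median_ci = perf.aggregate_median()  # median of per-task mean scores
iqm, iqm_ci = perf.aggregate_iqm()           # interquartile mean over runs and tasks
og, og_ci = perf.aggregate_og()              # optimality gap over runs and tasks
```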