\documentclass[11pt]{article}

% Essential packages
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage[margin=1in]{geometry}
\usepackage{hyperref}
\usepackage{pdfpages}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{float}
\usepackage{booktabs}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{tikz}

% Custom commands
\newcommand{\sectionheading}[1]{\noindent\textbf{#1}}

% Title and author information
\title{\Large{Solarcarsim: A Solar Racing Environment for RL Agents}}
\author{Saji Champlin\\
EE5241
}
\date{\today}

\begin{document}

\maketitle

\begin{abstract}
Solar racing is a competition with the goal of creating highly efficient solar-assisted electric vehicles. Effective solar racing
requires awareness and complex decision making to determine the optimal speeds that exploit environmental conditions such as wind,
cloud cover, and changes in elevation. We present an environment modelled on the dynamics of a race, including generated
elevation and wind profiles. The model uses the \texttt{gymnasium} interface so that it can be used by a variety of algorithms.
We demonstrate a method of designing reward functions for multi-objective problems, and we show learning using a Jax-based PPO agent.
\end{abstract}

\section{Introduction}

Solar racing was invented in the early 1990s as a technology incubator for high-efficiency motor vehicles. The first solar races
were speed-focused; however, a style of race that emphasizes minimal energy use over a given route was later developed to push the
focus towards vehicle efficiency. The goal of these races is to arrive at a destination within a given time frame while using as
little grid (non-solar) energy as possible. Optimal driving is a complex policy based on terrain slope, wind forecasts, and solar
forecasts.

Direct solutions that find the global minimum of the energy usage on a route segment are difficult to compute. Instead, we present
a reinforcement learning environment that can be used to train RL agents to race efficiently given limited foresight. The
environment simulates key components of the race, such as terrain and wind, as well as car dynamics. The simulator is written using
the Jax~\cite{jax2018github} library, which enables computations to be offloaded to the GPU. We provide wrappers for the
\texttt{gymnasium} API as well as a \texttt{purejaxrl}~\cite{lu2022discovered} implementation which can train a PPO agent for
millions of timesteps in several minutes. We present an exploration of reward function design with regards to sparsity, magnitude,
and learning efficiency.

\section{Background}

Performance evaluation for solar races typically takes the form
$$
S = \frac{D}{E} \times T
$$
where $S$ is the score, $D$ is the distance travelled, $E$ is the energy consumed, and $T$ is the speed derate.
The speed derate is calculated from a desired average speed throughout the race. If the average velocity is at or above the target,
$T=1$; as the average velocity falls below the target, $T$ decays exponentially towards $0$.
Based on this metric we conclude that the optimal strategy:
\begin{enumerate}
\item Maintains an average speed $V_{\text{avg}}$ as required by the derate.
\item Minimizes energy usage otherwise.
\end{enumerate}
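
The following is a minimal sketch of this scoring rule in Python. The exact shape of the derate varies by event; here we assume a
simple exponential decay with a hypothetical rate constant \texttt{k}, purely for illustration.

\begin{verbatim}
import math

def speed_derate(v_avg, v_target, k=0.5):
    """Derate T: 1.0 at or above the target speed, decaying exponentially
    as the average speed falls below it. The rate constant k is an
    assumed value for illustration, not an official rule."""
    if v_avg >= v_target:
        return 1.0
    return math.exp(-k * (v_target - v_avg))

def score(distance_m, energy_J, v_avg, v_target):
    """Score S = D / E * T for one race segment."""
    return distance_m / energy_J * speed_derate(v_avg, v_target)

# Example: 100 km travelled on 5 kWh, averaging 24 m/s against a 25 m/s target.
print(score(100e3, 5 * 3.6e6, 24.0, 25.0))
\end{verbatim}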

The simplest control to meet the constraint of on-time arrival is
$$
V_{\text{avg}} = \frac{D_{goal}}{T_{goal}}
$$
where $D_{goal}$ is the distance to be travelled and $T_{goal}$ is the maximum allowed time. Holding this average speed is
nearly optimal in most cases, but it is not a globally optimal solution. Consider a small hill: much more energy is drawn from the
battery when going uphill, and the same energy is returned to the car going downhill, but losses in the vehicle dictate that it is
more effective to drive slowly up the hill and speed up on the descent. The decision is further complicated by wind and cloud cover,
which can aid or hinder the performance of the car. It is therefore of great interest for solar racing teams to have advanced
strategies that can effectively traverse the terrain while minimizing environmental resistances.

Existing research on this subject is limited, as advanced solar car strategy is a competitive differentiator and is usually kept secret.
However, to the author's knowledge, most of the work on this subject involves Modelica or similar acausal system simulators, together
with non-linear solvers that use multi-starts to attempt to find the global optimum. Other methods include exhaustive search, genetic
algorithms, and Big Bang-Big Crunch optimization~\cite{heuristicsolar}.

We start by analyzing a simple force-based model of the car, and then connect this to an energy system using motor equations. We generate
a simulated environment including terrain and wind. Then, we develop a reward system that encapsulates the goals of the environment.
Finally, we train off-the-shelf RL models from Stable Baselines3 and purejaxrl to show learning on the environment.

\section{Methodology}

\begin{figure}[H]
\centering
\begin{tikzpicture}[scale=1.5]
% Slope angle in degrees
\def\angle{30}

% Points for consistent geometry
\def\slopeStart{-2}
\def\slopeEnd{2}
\def\slopeHeight{2.309} % tan(30) * (slopeEnd - slopeStart)

% Ground (horizontal line)
\draw[thick] (-3,0) -- (3,0);

% Slope
\draw[thick] (\slopeStart,0) -- (\slopeEnd,\slopeHeight);

% Car center position on the slope
\def\carX{0}
\def\carY{1.6}

% Car body (rectangle) resting on the slope
\begin{scope}[shift={(\carX,\carY)}, rotate=\angle]
\draw[thick] (-0.6,-0.3) rectangle (0.6,0.3);
% Wheels aligned with the slope
\fill[black] (-0.45,-0.3) circle (0.08);
\fill[black] (0.45,-0.3) circle (0.08);
% Opposing forces along the direction of travel
\draw[->,thick] (0,0) -- ++(-0.8, 0) node[left] {$F_{slope} + F_{drag} + F_{rolling}$};
\draw[->,thick] (0,0) -- ++(0.8, 0) node[right] {$F_{motor}$};
\node at (0,0) [circle,fill,inner sep=1.5pt]{};
\end{scope}
\end{tikzpicture}
\caption{Free body diagram showing the relevant forces on a 2-dimensional car}
\label{fig:freebody}
\end{figure}

To model the vehicle dynamics, we simplify the system to a 2D plane. As seen in Figure~\ref{fig:freebody}, the forces on the car
are due to intrinsic vehicle properties, the current velocity, and environmental conditions such as slope and wind. If the velocity
is held constant, the sum of the forces on the car is zero:
\begin{align}
F_{drag} + F_{slope} + F_{rolling} + F_{motor} &= 0 \\
F_{drag} &= \frac{1}{2} \rho v^2 C_d A \\
F_{slope} &= mg\sin {\theta} \\
F_{rolling} &= mg\cos {\theta}\, C_{rr}
\end{align}
The $F_{motor}$ term is modulated by the driver. In our case, we give the agent a simpler control mechanism: a normalized velocity
command instead of a force command, written as $v = \alpha v_{max}$, where $\alpha \in \left[-1,1\right]$ is the action taken
by the agent.
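
As a concrete illustration, a minimal sketch of the resistive-force computation in Python/Jax could look like the following. The
parameter values are placeholder ``rule-of-thumb'' numbers rather than the simulator's actual constants, and the optional headwind
term is just one plausible way to fold wind into the drag calculation.

\begin{verbatim}
import jax.numpy as jnp

# Placeholder vehicle parameters (illustrative only)
RHO = 1.2     # air density, kg/m^3
CDA = 0.12    # drag area C_d * A, m^2
MASS = 300.0  # vehicle mass, kg
G = 9.81      # gravity, m/s^2
CRR = 0.004   # rolling resistance coefficient

def resistive_forces(v, theta, headwind=0.0):
    """Sum of drag, slope, and rolling forces opposing the motor, in newtons."""
    v_air = v + headwind                     # airspeed seen by the car
    f_drag = 0.5 * RHO * v_air**2 * CDA
    f_slope = MASS * G * jnp.sin(theta)
    f_rolling = MASS * G * jnp.cos(theta) * CRR
    return f_drag + f_slope + f_rolling
\end{verbatim}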

From the velocity and the forces acting on the car, we can determine the power drawn by the car using a simple $K_t$ motor model:
\begin{align}
\tau &= \left(F_{drag} + F_{slope} + F_{rolling}\right) r \\
P_{motor} &= \tau \omega + R_{motor} \left(\frac{\tau}{K_t}\right)^2
\end{align}
The motor torque is the sum of the resistive forces times the wheel radius $r$, and $\omega = v/r$ is the wheel's angular velocity.
$K_t$ and $R_{motor}$ are motor parameters; both can be extracted from physical motors to simulate them, but simple
``rule-of-thumb'' numbers were used during development. The motor power is given in watts, so based on the time-step of the
simulation we can determine the energy consumed in joules ($1\,\text{W} \times 1\,\text{s} = 1\,\text{J}$). A time-step of 1 second
was chosen to accelerate simulation; lower values reduce integration error over time at the cost of longer episodes.
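
Continuing the sketch above, the power and per-step energy update might be computed as follows. The motor constants and wheel
radius are again placeholders chosen for illustration, and the \texttt{force} argument is the output of a resistive-force
computation like the one sketched earlier.

\begin{verbatim}
WHEEL_RADIUS = 0.27  # m, placeholder
K_T = 0.5            # N*m/A, placeholder motor torque constant
R_MOTOR = 0.1        # ohm, placeholder winding resistance
DT = 1.0             # s, simulation time-step

def motor_power(force, v):
    """Electrical power (W) to overcome `force` (N) at speed v (m/s)."""
    tau = force * WHEEL_RADIUS        # required wheel torque
    omega = v / WHEEL_RADIUS          # wheel angular velocity
    mechanical = tau * omega          # mechanical output power
    copper_loss = R_MOTOR * (tau / K_T) ** 2
    return mechanical + copper_loss

def step_energy(energy, force, v):
    """Accumulate energy use over one simulation time-step (J)."""
    return energy + motor_power(force, v) * DT
\end{verbatim}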

\subsection{Environment Generation}

It is important that our agent learns not just the optimal policy for a fixed course, but an approximately optimal policy
for any course. To this end we must be able to generate a wide variety of terrain and wind scenarios. Perlin noise
is typically used in this context. We use 1D Perlin noise to generate the slope of the terrain, and then integrate the slope to
obtain the elevation profile; differentiating Perlin noise directly would not produce a smooth, realistic slope. Currently the
elevation profile is unused, but it can matter for drag force due to changes in air pressure. The wind was generated with
2D Perlin noise, where one axis is time and the other is position. The noise was blurred along the time axis to smooth the changes
in wind at any given point.
An example of the generated environment can be seen in Figure~\ref{fig:env_vis}.
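
The sketch below illustrates the integrate-the-slope idea. For brevity it uses a smoothed random signal as a stand-in for the
Perlin noise the simulator actually uses, and the parameters are arbitrary.

\begin{verbatim}
import jax
import jax.numpy as jnp

def generate_terrain(key, n_points=1000, dx=10.0, max_slope=0.05, kernel=50):
    """Generate a slope profile and integrate it into an elevation profile."""
    # Stand-in for 1D Perlin noise: white noise smoothed by a moving average.
    raw = jax.random.uniform(key, (n_points,), minval=-1.0, maxval=1.0)
    window = jnp.ones(kernel) / kernel
    slope = max_slope * jnp.convolve(raw, window, mode='same')
    # Elevation is the integral of slope with respect to distance.
    elevation = jnp.cumsum(slope * dx)
    return slope, elevation

slope, elevation = generate_terrain(jax.random.PRNGKey(0))
\end{verbatim}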

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{environment.pdf}
\caption{Visualization of the generated environment}
\label{fig:env_vis}
\end{figure}

\subsection{Performance Evaluation}

To quantify agent performance, we must produce a single scalar reward. While multi-objective learning is an interesting
subject\footnote{I especially wanted to do meta-RL using a neural net to compute a single reward from inputs.}, it is out of the scope
of this project. Additionally, sparse rewards can significantly slow down learning, and a poor reward function can prevent
agents from approaching the optimal policy. With these factors in mind, we use the following:
\[
R = x/D_{goal} + (x > D_{goal}) \cdot \left(100 - E - 10(t - T_{goal})\right) - 500 \cdot (t > T_{goal})
\]
There are three major components. The continuous reward, given at every step, is the position of the car
relative to the goal distance. The victory reward is a constant, minus the energy used and the early-arrival term,
which was added to help guide the agent towards arriving with as little time left as possible. Finally, there is a penalty for the
time exceeding the goal time, as after that point the car is disqualified from the race.
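
A minimal sketch of this reward in Python is shown below; the position, energy, and elapsed time are assumed to be tracked by the
simulator state, and the constants simply mirror the expression above.

\begin{verbatim}
def reward(x, energy, t, d_goal, t_goal):
    """Scalar reward: progress each step, a victory bonus on finishing,
    and a large penalty for exceeding the time limit."""
    r = x / d_goal                                   # continuous progress term
    if x > d_goal:                                   # victory reward
        r += 100.0 - energy - 10.0 * (t - t_goal)
    if t > t_goal:                                   # disqualification penalty
        r -= 500.0
    return r
\end{verbatim}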

It took a few iterations to find a reward metric that promoted fast learning. Some of the issues were exacerbated by the initially low
performance when using Stable Baselines. A crucial part of the improvement was applying the energy penalty only on wins:
this allowed the model to quickly learn to drive forward and finish, after which refinement of speed could take
place\footnote{I looked into Q-initialization but couldn't figure out a way to implement it easily.}.

\subsection{State and Observation Spaces}

The complete state of the simulator is the position, velocity, and energy of the car, together with the entire environment.
These parameters are sufficient for a deterministic snapshot of the simulator. However, one goal of the project
was to enable partial observation of the system. To this end, we restrict the observation space to a small snippet
of the upcoming wind and slope. This also simplifies the agent's calculations, since its view of the environment is
relative to its current position. The size of the view can be controlled as a parameter.
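
One way such an observation could be sliced out of the full environment arrays is sketched below; the names and the fixed window
size are illustrative rather than the simulator's actual interface.

\begin{verbatim}
import jax.numpy as jnp
from jax import lax

def observe(position_idx, velocity, energy, slope, wind_now, horizon=32):
    """Build a local observation: car scalars plus the upcoming slope and
    wind (at the current time) over a fixed horizon ahead of the car."""
    upcoming_slope = lax.dynamic_slice(slope, (position_idx,), (horizon,))
    upcoming_wind = lax.dynamic_slice(wind_now, (position_idx,), (horizon,))
    return jnp.concatenate([
        jnp.array([velocity, energy]),
        upcoming_slope,
        upcoming_wind,
    ])
\end{verbatim}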

\section{Experiments and Results}

An implementation of the aforementioned simulator was developed with Jax. Jax was chosen as it enables
vectorization and optimization to improve performance. Additionally, Jax allows gradients of arbitrary functions
to be computed, which is useful for certain classes of reinforcement learning; we did not use this capability, as there seemed to
be very little available off the shelf.

Initially, Stable Baselines was used, since it is one of the most popular implementations of common RL algorithms.
Stable Baselines3~\cite{stable-baselines3} is written in PyTorch~\cite{Ansel_PyTorch_2_Faster_2024} and uses the
Gym~\cite{gymnasium} format for environments. A basic Gym wrapper was created to connect SB3 to our environment.
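
The wrapper itself is only a thin adapter. A sketch of what such a wrapper might look like is shown below; the \texttt{sim} object,
its methods, and the array shapes are stand-ins for the actual simulator interface.

\begin{verbatim}
import gymnasium as gym
import numpy as np

class SolarSimGymWrapper(gym.Env):
    """Minimal gymnasium adapter around a hypothetical Jax simulator."""

    def __init__(self, sim, obs_size):
        self.sim = sim
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,))
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(obs_size,))

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state, obs = self.sim.reset(seed)
        return np.asarray(obs), {}

    def step(self, action):
        self.state, obs, reward, done = self.sim.step(self.state, float(action[0]))
        # Jax arrays must be converted back to NumPy for SB3 (the CPU roundtrip).
        return np.asarray(obs), float(reward), bool(done), False, {}
\end{verbatim}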

PPO was chosen as the RL algorithm because it is simple while still being effective~\cite{proximalpolicyoptimization}.
However, the throughput and convergence were poor, which made problems
difficult to diagnose, as the model could need millions of steps before learning anything interesting.
The primary performance loss was in the Jax/NumPy/PyTorch conversion, which requires a CPU roundtrip.
To combat this, I found a Jax-based implementation of PPO called \texttt{purejaxrl}. This library is
written in the style of CleanRL but uses pure Jax and an environment library called \texttt{gymnax}~\cite{gymnax2022github}.
The primary advantage of writing everything in Jax is that both the RL agent and the environment can be offloaded to the GPU.
Additionally, the infrastructure provided by \texttt{gymnax} allows environments to be vectorized. The speedup from
using this library cannot be overstated: the SB3 PPO implementation ran at around 150 actions per second, while after rewriting
some of the code to work with \texttt{purejaxrl}, the effective action rate\footnote{I ran 2048 environments in parallel.}
was nearly $238{,}000$ actions per second\footnote{It is likely that performance with SB3 could have been improved, but I was struggling to figure out exactly how.}.
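
The vectorization that makes this possible is essentially a \texttt{jax.vmap} over the environment step, as in the sketch below;
\texttt{env\_step} here is a toy placeholder, not the simulator's actual step function.

\begin{verbatim}
import jax
import jax.numpy as jnp

def env_step(state, action):
    """Toy single-environment step: advance position by the commanded velocity."""
    new_state = state + action
    reward = action
    done = new_state > 100.0
    return new_state, reward, done

# Batch the step across 2048 independent environments on the accelerator.
batched_step = jax.jit(jax.vmap(env_step))
states = jnp.zeros(2048)
actions = jnp.ones(2048)
states, rewards, dones = batched_step(states, actions)
\end{verbatim}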

The episode returns over 50 million timesteps of PPO training can be seen in Figure~\ref{fig:returns}. Each update step
is performed after collecting minibatches of rollouts under the current policy. We see a clean ascent at the start of training;
this is the agent learning to drive forward. After a certain point, the returns become noisy, likely because the energy component
of the score varies randomly with the generated terrain. A solution to this, which was not pursued due to lack of time, would be to
compute the ``nominal energy'' use for travelling at $v_{avg}$, penalizing consumption above the nominal and heavily rewarding
consumption below it. Despite the noise, performance continued to improve, which is a good sign that the agent is
able to learn the underlying dynamics.

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{PPO_results.pdf}
\caption{Episodic returns during PPO training}
\label{fig:returns}
\end{figure}

Initially I thought these results were fairly impressive, but when I examined individual rollouts the agent seemed to simply drive
forward too fast. Reworking the reward function might help it converge better. I wish I had a plot of a rollout, but I kept running
out of memory when trying to capture one.

\section{Discussion}

While the PPO performance was decent, it left a significant amount of improvement on the table. Tuning the reward function
would probably help it find a better solution. One strategy that would help significantly is to pre-tune the model to output the
average speed by default, so the model does not have to learn that at the beginning. This is called Q-initialization and is a common
trick for problem spaces where an initial estimate exists and is easy to define. Perhaps the most important takeaway from this
work is the power of end-to-end Jax RL. \texttt{purejaxrl} has CleanRL levels of code clarity, with everything for an agent
contained in one file, while surpassing Stable Baselines3 significantly in terms of performance. One drawback is that
the ecosystem is very new, so there was very little to reference while developing the simulator. Often there would be
an opaque error message that yielded no results on search engines and required digging into the Jax source code to diagnose;
typically this was caused by some misunderstanding about the inner workings of Jax. Future work on this project would involve
trying out other agents and comparing different reward functions. Adjusting the actor-critic network would also be an interesting
avenue, especially since a CNN would likely work well with wind and cloud information, which have both a spatial and a temporal
axis\footnote{You can probably tell that the quality dropped off near the end: a bunch of life things got in the way, so this didn't go as well as I'd hoped. Learned a lot though.}.

\section{Conclusion}

We outline the design of a physics-based model of solar car races. We implement this model and create a simulation environment
for use with popular RL algorithm packages, and we demonstrate the performance and learning ability of these algorithms on our model.
Further work includes more accurate modelling, improved reward functions, and hyperparameter tuning.

\bibliographystyle{plain}
\bibliography{references}

\end{document}