\documentclass[11pt]{article}

% Essential packages
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage[margin=1in]{geometry}
\usepackage{hyperref}
\usepackage{pdfpages}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{float}
\usepackage{booktabs}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{tikz}

% Custom commands
\newcommand{\sectionheading}[1]{\noindent\textbf{#1}}

% Title and author information
\title{\Large Solarcarsim: A Solar Racing Environment for RL Agents}
\author{Saji Champlin\\ EE5241}
\date{\today}

\begin{document}
\maketitle

\begin{abstract}
Solar racing is a competition with the goal of creating highly efficient solar-assisted electric vehicles. Effective solar racing requires awareness and complex decision making to determine the speeds that best exploit environmental conditions such as wind, cloud cover, and changes in elevation. We present an environment modelled on the dynamics of a race, including generated elevation and wind profiles. The model uses the \texttt{gymnasium} interface so that it can be used with a variety of algorithms. We demonstrate a method of designing reward functions for multi-objective problems, and we show learning using a Jax-based PPO implementation.
\end{abstract}

\section{Introduction}

Solar racing was invented in the early 90s as a technology incubator for high-efficiency motor vehicles. The first solar races were focused on speed; later, a style of race focused on minimal energy use over a given route was developed to push attention towards vehicle efficiency. The goal of these races is to arrive at a destination within a given time frame while using as little grid (non-solar) energy as possible. Optimal driving is a complex policy based on terrain slope, wind forecasts, and solar forecasts. Direct solutions that find the global minimum of energy usage on a route segment are difficult to compute. Instead, we present a reinforcement learning environment that can be used to train RL agents to race efficiently given limited foresight. The environment simulates key components of the race, such as terrain and wind, as well as car dynamics. The simulator is written using the Jax~\cite{jax2018github} library, which enables computations to be offloaded to the GPU. We provide wrappers for the \texttt{gymnasium} API as well as a \texttt{purejaxrl}~\cite{lu2022discovered} implementation which can train a PPO agent for millions of timesteps in several minutes. We also present an exploration of reward function design with regard to sparsity, magnitude, and learning efficiency.

\section{Background}

Performance evaluation for solar races typically takes the form
\[ S = \frac{D}{E} \times T \]
where $S$ is the score, $D$ is the distance travelled, $E$ is the energy consumed, and $T$ is the speed derate. The speed derate is calculated from a desired average speed throughout the race: if the average velocity is at or above the target, $T=1$; as the average velocity falls below the target, $T$ decays exponentially towards $0$. Based on this metric, the optimal strategy:
\begin{enumerate}
	\item Maintains an average speed $V_{\text{avg}}$ as required by the derate.
	\item Minimizes energy usage otherwise.
\end{enumerate}
The simplest control that satisfies on-time arrival is to hold
\[ V_{\text{avg}} = \frac{D_{goal}}{T_{goal}} \]
where $D_{goal}$ is the distance to be travelled and $T_{goal}$ is the maximum allowed time.
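To make the scoring concrete, the sketch below computes $S$ for a completed run under one possible derate shape. The exponential decay rate, the function names, and the unit choices are illustrative assumptions; the rules only fix that $T = 1$ at or above the target speed and that $T$ falls towards zero below it.

\begin{verbatim}
import math

def derate(v_avg, v_target, k=10.0):
    """Speed derate T: 1 at or above the target speed, exponential
    decay below it. The rate k is an illustrative choice."""
    if v_avg >= v_target:
        return 1.0
    return math.exp(-k * (v_target - v_avg) / v_target)

def score(distance, energy, v_avg, v_target):
    """S = D / E * T for a completed route."""
    return distance / energy * derate(v_avg, v_target)

# Example target speed: a 100 km route with a 4 hour limit gives
# V_avg = D_goal / T_goal = 100 km / 4 h = 25 km/h (about 6.9 m/s).
\end{verbatim}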
Driving at this constant average speed is nearly optimal in most cases, but it is not a globally optimal solution. Consider a small hill: much more energy is drawn from the battery while climbing, but that energy is returned to the car on the descent. Losses in the vehicle dictate that it is more effective to drive slowly up the hill and speed up on the way down. The decision is further complicated by wind and cloud cover, which can aid or hinder the performance of the car. It is therefore of great interest for solar racing teams to have advanced strategies that traverse the terrain effectively while minimizing environmental resistances.

Existing research on this subject is limited, as advanced solar car strategy is a competitive differentiator and is usually kept secret. However, to the author's knowledge, most work on this subject involves Modelica or similar acausal system simulators, together with non-linear solvers that use multi-starts to attempt to find the global optimum. Other methods include exhaustive search, genetic algorithms, and Big Bang-Big Crunch optimization~\cite{heuristicsolar}.

We start by analyzing a simple force-based model of the car, and then connect this to an energy system using motor equations. We generate a simulated environment including terrain and wind. Then, we develop a reward system that encapsulates the goals of the environment. Finally, we train off-the-shelf RL models from Stable Baselines3 and \texttt{purejaxrl} to show learning on the environment.

\section{Methodology}

\begin{figure}[H]
	\centering
	\begin{tikzpicture}[scale=1.5]
		% Define slope angle
		\def\angle{30}
		% Points for consistent geometry
		\def\slopeStart{-2}
		\def\slopeEnd{2}
		\def\slopeHeight{2.309} % tan(30) * 4

		% Draw ground (horizontal line)
		\draw[thick] (-3,0) -- (3,0);
		% Draw slope
		\draw[thick] (\slopeStart,0) -- (\slopeEnd,\slopeHeight);

		% Car center position on slope
		\def\carX{0}   % center position along x-axis
		\def\carY{1.6}

		% Draw car (rectangle) on the slope
		\begin{scope}[shift={(\carX,\carY)}, rotate=\angle]
			\draw[thick] (-0.6,-0.3) rectangle (0.6,0.3);
			% Wheels aligned with slope
			\fill[black] (-0.45,-0.3) circle (0.08);
			\fill[black] (0.45,-0.3) circle (0.08);
			\draw[->,thick] (0,0) -- ++(-0.8, 0) node[left] {$F_{slope} + F_{drag} + F_{rolling}$};
			\draw[->,thick] (0,0) -- ++(0.8, 0) node[right] {$F_{motor}$};
			\node at (0,0) [circle,fill,inner sep=1.5pt]{};
		\end{scope}

		% Center point of car for forces
		\coordinate (carCenter) at (\carX,\carY);
	\end{tikzpicture}
	\caption{Free body diagram showing relevant forces on a 2-dimensional car}
	\label{fig:freebody}
\end{figure}

To model the vehicle dynamics, we simplify the system to a 2D plane. As seen in Figure~\ref{fig:freebody}, the forces on the car are due to intrinsic vehicle properties, current velocity, and environmental conditions such as slope and wind. If the velocity is held constant, the sum of the forces on the car is zero:
\begin{align}
F_{drag} + F_{slope} + F_{rolling} + F_{motor} &= 0 \\
F_{drag} &= \frac{1}{2} \rho v^2 C_d A \\
F_{slope} &= mg\sin{\theta} \\
F_{rolling} &= mg\cos{\theta}\, C_{rr}
\end{align}
The $F_{motor}$ term is modulated by the driver. In our case, we give the agent a simpler control mechanism: a normalized velocity command instead of a force command. This is written as $v = \alpha v_{max}$, where $\alpha \in \left[-1,1\right]$ is the action taken by the agent. From the velocity and the forces acting on the car, we can determine the power drawn by the car using a simple $K_t$ model:
\begin{align}
\tau &= \left(F_{drag} + F_{slope} + F_{rolling}\right) r \\
P_{motor} &= \tau \frac{v}{r} + R_{motor} \left(\frac{\tau}{K_t}\right)^2
\end{align}
The motor torque is the sum of the resistive forces times the wheel radius $r$, and $v/r$ is the wheel's angular velocity. $K_t$ (the torque constant) and $R_{motor}$ (the winding resistance) are motor parameters; both can be extracted from physical motors to simulate them, but simple ``rule-of-thumb'' numbers were used during development. The motor power is given in watts, so the energy consumed per time-step follows directly from $1\,\mathrm{W} \times 1\,\mathrm{s} = 1\,\mathrm{J}$. A time-step of 1 second was chosen to accelerate simulation; smaller values reduce integration error at the cost of longer episodes.
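A minimal sketch of this force and power model in \texttt{jax.numpy} is shown below. The constants are placeholder ``rule-of-thumb'' values and the function name is ours, not the simulator's; wind would enter through the relative air speed used in the drag term.

\begin{verbatim}
import jax.numpy as jnp

# Placeholder vehicle and motor constants (illustrative only).
MASS = 300.0     # kg
G = 9.81         # m/s^2
RHO = 1.2        # kg/m^3, air density
CDA = 0.12       # m^2, drag area C_d * A
CRR = 0.004      # rolling-resistance coefficient
WHEEL_R = 0.27   # m, wheel radius
KT = 0.5         # N*m/A, motor torque constant
R_MOTOR = 0.1    # ohm, winding resistance

def motor_power(v, theta):
    """Electrical power needed to hold speed v (m/s) on slope theta (rad)."""
    f_drag = 0.5 * RHO * v**2 * CDA
    f_slope = MASS * G * jnp.sin(theta)
    f_rolling = MASS * G * jnp.cos(theta) * CRR
    torque = (f_drag + f_slope + f_rolling) * WHEEL_R
    omega = v / WHEEL_R  # wheel angular velocity
    return torque * omega + R_MOTOR * (torque / KT) ** 2

# With a 1 s time-step, energy per step in joules is motor_power(v, theta) * 1.0
\end{verbatim}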
\subsection{Environment Generation}

It is important that our agent learns not just the optimal policy for a fixed course, but an approximately optimal policy for any course. To this end, we must be able to generate a wide variety of terrain and wind scenarios. Perlin noise is typically used in this context. We generate the terrain slope with 1D Perlin noise and integrate it to obtain the elevation profile, rather than generating elevation directly, because differentiated Perlin noise is not smooth and is not an accurate representation of slope. The elevation profile is currently unused, but it can matter for the drag force through changes in air pressure. The wind is generated with 2D Perlin noise, where one axis is time and the other is position; the noise is blurred along the time axis to smooth the changes in wind at any given point. An example of the generated environment can be seen in Figure~\ref{fig:env_vis}.

\begin{figure}[H]
	\centering
	\includegraphics[width=\textwidth]{environment.pdf}
	\caption{Visualization of the generated environment}
	\label{fig:env_vis}
\end{figure}

\subsection{Performance Evaluation}

To quantify agent performance, we must produce a single scalar reward. While multi-objective learning is an interesting subject\footnote{I especially wanted to do meta-RL using a neural net to compute a single reward from inputs.}, it is out of scope for this project. Additionally, sparse rewards can significantly slow down learning, and a poor reward function can prevent agents from approaching the optimal policy. With these factors in mind, we use the following:
\[ R = \frac{x}{D_{goal}} + \left[x > D_{goal}\right]\left(100 - E - 10(t - T_{goal})\right) - 500\left[t > T_{goal}\right] \]
where $\left[\cdot\right]$ is the Iverson bracket ($1$ if the condition holds, $0$ otherwise), $x$ is the car's position, $t$ is the elapsed time, and $E$ is the energy used. The reward has three major components:
\begin{itemize}
	\item A continuous reward, granted at every step, equal to the position of the car relative to the goal distance.
	\item A victory reward: a constant, minus the energy used and an early-arrival penalty. The penalty was added to help guide the agent towards arriving with as little time left as possible.
	\item A penalty once the elapsed time exceeds the goal time, since after that point the car is disqualified from the race.
\end{itemize}
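A direct transcription of this reward into \texttt{jax.numpy} might look like the following sketch; the function and argument names are ours, and the energy term is assumed to already be scaled so that the constants above are meaningful.

\begin{verbatim}
import jax.numpy as jnp

def reward(x, t, energy, d_goal, t_goal):
    # Continuous progress term, granted at every step.
    progress = x / d_goal
    # Victory bonus: constant minus energy used minus the early-arrival term.
    victory = jnp.where(x > d_goal, 100.0 - energy - 10.0 * (t - t_goal), 0.0)
    # Disqualification penalty once the time limit is exceeded.
    timeout = jnp.where(t > t_goal, -500.0, 0.0)
    return progress + victory + timeout
\end{verbatim}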
It took a few iterations to find a reward metric that promoted fast learning, and iteration was slowed further by the initially low training throughput when using Stable Baselines3. A crucial part of the improvement was applying the energy penalty only on wins. This allowed the model to quickly learn to drive forward and finish, after which refinement of speed could take place\footnote{I looked into Q-initialization but couldn't figure out a way to implement it easily.}.

\subsection{State and Observation Spaces}

The complete state of the simulator is the position, velocity, and energy of the car, as well as the entire environment. These parameters are sufficient for a deterministic snapshot of the simulator. However, one goal of the project was to enable partial observation of the system. To this end, we restrict the observation to a small snippet of the upcoming wind and slope. This also simplifies the agent's calculations, since the view of the environment is relative to its current position. The size of the view can be controlled as a parameter.
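One way to extract such a relative view from precomputed slope and wind arrays is with \texttt{jax.lax.dynamic\_slice}, sketched below. The array shapes, the index discretization, and the clamping behaviour at the end of the course are assumptions rather than the simulator's exact implementation.

\begin{verbatim}
import jax.numpy as jnp
from jax import lax

def local_view(slope, wind, pos_idx, time_idx, horizon):
    """Observation window: the next `horizon` slope and wind samples.

    slope has shape (N,); wind has shape (T, N). pos_idx and time_idx are
    integer indices derived from the car's position and the current time.
    dynamic_slice clamps out-of-range starts, so views near the end of the
    course are shifted back to stay in bounds.
    """
    slope_view = lax.dynamic_slice(slope, (pos_idx,), (horizon,))
    wind_view = lax.dynamic_slice(wind, (time_idx, pos_idx), (1, horizon))
    return jnp.concatenate([slope_view, wind_view[0]])
\end{verbatim}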
\section{Experiments and Results}

An implementation of the aforementioned simulator was developed in Jax. Jax was chosen as it enables vectorization and compiler optimization to improve performance. Additionally, Jax allows gradients of any function to be computed, which is useful for certain classes of reinforcement learning; in our case this capability went unused, as there seemed to be very little available off the shelf that exploits it.

Initially, Stable Baselines3~\cite{stable-baselines3} was used, since it is one of the most popular implementations of common RL algorithms. It is written in PyTorch~\cite{Ansel_PyTorch_2_Faster_2024} and uses the \texttt{gymnasium}~\cite{gymnasium} format for environments, so a basic wrapper was created to connect SB3 to our environment. PPO was chosen as the RL algorithm as it is simple while still being effective~\cite{proximalpolicyoptimization}. Performance and convergence were poor, which made problems difficult to diagnose, as the model would need potentially millions of steps before it would learn anything interesting. The primary performance loss was in the Jax/NumPy/PyTorch conversion, which requires a CPU round trip. To combat this, I found a Jax-based implementation of PPO called \texttt{purejaxrl}. This library is written in the style of CleanRL but uses pure Jax and an environment library called \texttt{gymnax}~\cite{gymnax2022github}. The primary advantage of writing everything in Jax is that both the RL agent and the environment can be offloaded to the GPU; additionally, the infrastructure provided by \texttt{gymnax} allows environments to be vectorized. The speedup from using this library cannot be overstated: the SB3 PPO implementation ran at around 150 actions per second, while after rewriting some of the code to work with \texttt{purejaxrl}, the effective action rate\footnote{I ran 2048 environments in parallel.} was nearly $238{,}000$ actions per second\footnote{It's likely that performance with SB3 could have been improved, but I was struggling to figure out exactly how.}.

The episodic returns over 50 million timesteps of PPO training can be seen in Figure~\ref{fig:returns}. Each update step is performed after collecting minibatches of rollouts under the current policy. We can see a clean ascent at the start of training; this is the agent learning to drive forward. After a certain point, the returns become noisy, likely because the energy component of the reward varies with the randomly generated terrain. A solution to this, not pursued due to lack of time, would be to compute the ``nominal energy'' use based on travelling at $v_{avg}$: energy consumption above the nominal value would be penalized, and consumption below it heavily rewarded. Despite this noise, performance continued to improve, which is a good sign that the agent is able to learn the underlying dynamics.

\begin{figure}[H]
	\centering
	\includegraphics[width=0.8\textwidth]{PPO_results.pdf}
	\caption{Episodic returns during PPO training}
	\label{fig:returns}
\end{figure}

Initially I thought these results were fairly impressive, but when I inspected individual episodes the agent seemed to simply drive forward too fast. Reworking the reward function might lead to better convergence. I was unable to include a plot of a rollout, as capturing one kept running out of memory.

\section{Discussion}

While the PPO performance was decent, it left a significant amount of improvement on the table. Tuning the reward function would probably help it find a better solution. One strategy that would help significantly is to pre-tune the model to output the average speed by default, so the model does not have to learn that at the beginning. This is called Q-initialization and is a common trick for problem spaces where an initial estimate exists and is easy to define.

Perhaps the most important takeaway from this work is the power of end-to-end Jax RL. \texttt{purejaxrl} offers CleanRL levels of code clarity, with everything for an agent contained in one file, while surpassing Stable Baselines3 significantly in terms of performance. One drawback is that the ecosystem is very new, so there was very little to reference when I was developing my simulator. Often an opaque error message would yield no results on search engines and would require digging into the Jax source code to diagnose; typically the cause was some misunderstanding about the inner workings of Jax.

Future work on this project would involve trying out other agents and comparing different reward functions. Adjusting the actor-critic network would also be an interesting avenue, especially since a CNN will likely work well with wind and cloud information, which have both a spatial and a temporal axis\footnote{You can probably tell that the quality dropped off near the end; a bunch of life things got in the way, so this didn't go as well as I'd hoped. Learned a lot though.}.

\section{Conclusion}

We outline the design of a physics-based model of solar car races. We implement this model and create a simulation environment for use with popular RL algorithm packages. We demonstrate the performance and learning ability of these algorithms on our model. Further work includes more accurate modelling, improved reward functions, and hyperparameter tuning.

\bibliographystyle{plain}
\bibliography{references}

\end{document}