Learning simulations of the real world

How do we learn a simulation?

Unlike Machine Learning models, which typically come with standard training algorithms (like Backpropagation for Neural Networks), simulations often need us to explicitly choose and design procedures for learning their Parameters from real-world data or optimising their outputs.

To do this, we must first have some Objective which either characterises how close simulation Trajectories are to replicating the data we have or defines the quantity we want to optimise.

There are a number of techniques we can use to specify what the Objective should be, depending on the purpose.

Learning simulations from data

If we want to learn the Parameters which correspond to simulation Trajectories fitting real-world data trends more closely, it is natural to use an Objective based on the Probabilities of State Partition Histories that we computed in the previous post.

We start by streaming time-series data into our simulation by specifying it as a State Partition.

We can then use a method to estimate the Probabilities of State Values within the data, e.g., the Probabilistic Sample Weighting we discussed in the previous post.

This gives us a way to calculate these ‘Data Probabilities’ for any possible State Values the data can take in Time.

By then evaluating these Data Probabilities at the points which coincide with simulation Trajectories, we have an Objective which quantifies how close the simulation is to the data.
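As a rough illustration of this Objective, the sketch below scores a simulation Trajectory by summing the log Data Probabilities at the points it visits. The exponentially weighted Gaussian estimate used here is an assumption standing in for the Probabilistic Sample Weighting method from the previous post, and the names `weighted_mean_var`, `data_log_objective`, `past_discount` and `min_var` are hypothetical.

```python
import math


def gaussian_log_prob(x, mean, var):
    """Log density of a 1D Gaussian evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def weighted_mean_var(data, t_index, past_discount=0.9, min_var=0.05):
    """Exponentially weighted mean and variance of data[0..t_index].

    Assumption: older data points are down-weighted by past_discount per
    step, and the variance is floored at min_var to keep it positive.
    """
    weights = [past_discount ** (t_index - i) for i in range(t_index + 1)]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, data)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, data)) / total
    return mean, max(var, min_var)


def data_log_objective(trajectory, data, past_discount=0.9, min_var=0.05):
    """Sum of log Data Probabilities evaluated along a simulation Trajectory."""
    total = 0.0
    for t, sim_value in enumerate(trajectory):
        mean, var = weighted_mean_var(data, t, past_discount, min_var)
        total += gaussian_log_prob(sim_value, mean, var)
    return total
```

A Trajectory that tracks the data should receive a higher score than one that drifts away from it, so comparing `data_log_objective` across candidate Parameters gives a way to rank them.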

Example: Online simulation parameter estimation

The Data Probabilities of simulation Trajectories can also be interpreted as Probabilities of simulation Parameters, often accompanied by some simulation noise to account for differences between Trajectories generated with the same Parameters.

We can create an algorithm which uses this sequence of Probabilities to estimate the Probabilities of simulation Parameters in a very similar way to Probabilistic Sample Weighting (see the last post for details on the latter).

We might call this algorithm ‘Online Simulation Parameter Estimation’, where ‘Online’ means that the simulation is fitted to the data adaptively and iteratively in time, as opposed to all at once on a whole Batch.

[Figure panels: Observed Data vs Estimated Trajectory; Parameter Posterior Estimate]

There is also an implementation of this Online Simulation Parameter Estimation algorithm within the stochadex simulation engine.
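To make the idea concrete without relying on the stochadex API, here is a minimal, self-contained sketch of online estimation over a fixed set of candidate Parameter values. It is not the stochadex implementation: the Gaussian noise model, the discounting of past log-weights and every name used (`online_parameter_estimate`, `simulate_step`, `noise_var`, `past_discount`) are assumptions made for illustration.

```python
import math


def online_parameter_estimate(data, candidates, simulate_step,
                              noise_var=0.1, past_discount=0.95):
    """Update Probabilities of candidate Parameters one data point at a time.

    simulate_step(param, t) is a hypothetical callable returning the
    simulated State Value at Timestep t for a given Parameter value.
    """
    log_weights = [0.0] * len(candidates)
    posterior = [1.0 / len(candidates)] * len(candidates)
    estimates = []
    for t, observed in enumerate(data):
        for i, param in enumerate(candidates):
            predicted = simulate_step(param, t)
            # Gaussian log-probability of the data point around the simulated
            # value (an assumed form of 'simulation noise')
            log_like = -0.5 * (math.log(2.0 * math.pi * noise_var)
                               + (observed - predicted) ** 2 / noise_var)
            # Older evidence is gradually forgotten via past_discount
            log_weights[i] = past_discount * log_weights[i] + log_like
        # Normalise in a numerically stable way
        peak = max(log_weights)
        weights = [math.exp(lw - peak) for lw in log_weights]
        norm = sum(weights)
        posterior = [w / norm for w in weights]
        # Running posterior-mean estimate of the Parameter
        estimates.append(sum(p * c for p, c in zip(posterior, candidates)))
    return estimates, posterior
```

For example, with `candidates = [k / 10 for k in range(11)]`, data generated as `0.5 * t` and `simulate_step = lambda p, t: p * t`, the running estimate should settle near 0.5 as more data arrives.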

Learning optimal simulations

If we want to learn Parameters which correspond to optimal simulation Trajectories, we first need to specify what ‘optimal’ means.

We do this by defining an Objective whose maximum (or minimum) possible value is achieved when our goal is met.

For instance, we may define some logic in a State Partition Iteration of the simulation which replicates taking ‘Actions’ in the real world. This logic can depend on the simulation Parameters so that the latter encode the behaviour quantitatively.

Given this setup, a very common goal is to find the best Actions to take, which is analogous to optimising the Parameters of the Action-taking State Partition Iteration. We will refer to these Parameters as ‘Policy Parameters’.

But what should we use as an Objective?

The ‘Discounted Future Reward’ is the Reward that a simulation Trajectory accumulates into the future, where each contribution is gradually ‘Discounted’ by a weighting that shrinks the further into the future it occurs.

We are using this concept of Discounted Future Reward in the same way that it is used in Reinforcement Learning.

The idea is that the further into the future a Reward is accumulated, the less relevant it is to the Actions you might take at the present moment.
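A common way to write this down, as in Reinforcement Learning, is a geometric weighting: the Reward received k Timesteps ahead is multiplied by the k-th power of a discount factor between 0 and 1. The sketch below assumes that form; the function name `discounted_future_reward` and the parameter `discount` are illustrative choices rather than anything fixed by the text.

```python
def discounted_future_reward(rewards, discount=0.9):
    """Sum of rewards where the k-th future reward is weighted by discount**k.

    rewards[0] is the Reward at the present Timestep, rewards[1] the next
    one, and so on along a single simulation Trajectory.
    """
    total = 0.0
    weight = 1.0
    for reward in rewards:
        total += weight * reward
        weight *= discount
    return total
```

With `discount=0.5`, a Reward of 10 received immediately contributes 10, whereas the same Reward received two Timesteps later contributes only 2.5.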

Example: Optimising with evolutionary strategies

The Evolutionary Strategies algorithm can be applied to search future simulation Trajectories for the set of Policy Parameters which gives the best Discounted Future Reward.

This algorithm relies on sorting the sampled simulation Trajectories according to their Discounted Future Rewards and then using the top fraction of these to update the best known Policy Parameters (and the Variance around them) after each Timestep.

[Figure panels: Sampled Simulation Trajectories (blue lines are the elite fraction; grey are the rest); Best Discounted Future Reward]
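The sort-and-select update described above can be sketched as follows. This is a simplified, assumption-laden version: it uses a single scalar Policy Parameter, Gaussian sampling and a toy proportional-control simulation, and all names (`evolution_step`, `toy_reward`, `elite_fraction`, `min_std`) are hypothetical. It also performs one update per call rather than tying the update to a simulation Timestep.

```python
import random
import statistics


def toy_reward(gain, target=1.0, steps=10, discount=0.9):
    """Discounted Future Reward of a toy Trajectory steered towards target."""
    x, total, weight = 0.0, 0.0, 1.0
    for _ in range(steps):
        x += gain * (target - x)                # the 'Action' taken by the policy
        total += weight * -((target - x) ** 2)  # reward penalises distance to target
        weight *= discount
    return total


def evolution_step(mean, std, reward_fn, rng,
                   num_samples=50, elite_fraction=0.2, min_std=1e-3):
    """Sample Policy Parameters, keep the elite fraction, refit mean and spread."""
    samples = [rng.gauss(mean, std) for _ in range(num_samples)]
    # Sort the samples by their Discounted Future Reward, best first
    ranked = sorted(samples, key=reward_fn, reverse=True)
    num_elite = max(2, int(num_samples * elite_fraction))
    elite = ranked[:num_elite]
    new_mean = statistics.mean(elite)
    # Floor the spread so the search never collapses to exactly zero width
    new_std = max(statistics.pstdev(elite), min_std)
    return new_mean, new_std, reward_fn(ranked[0])


def optimise(iterations=30, seed=0):
    """Repeat the update from a deliberately poor initial guess."""
    rng = random.Random(seed)
    mean, std, best = 0.0, 1.0, float("-inf")
    for _ in range(iterations):
        mean, std, best = evolution_step(mean, std, toy_reward, rng)
    return mean, std, best
```

In this toy problem the Discounted Future Reward is highest when the gain equals 1, so repeated calls to `evolution_step` should move the mean towards that value while the spread around it shrinks.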