Architectures for current and future hardware
Classical hardware
When we talk about ‘classical’ hardware here, we just mean standard CPUs.
On CPUs, simulation architectures may be constructed out of several ingredients. Loosely speaking, these are: Memory, Threads, Channels between Threads, Processes and Inter-Process Communication (IPC).
All of these ingredients have their own tradeoffs in performance. But they are all useful in constructing the right simulation architectures to satisfy the right use cases.
In all of the previous posts so far, the main simulation architectures we have been considering are Stepwise: simulation architectures which evaluate the Next State Values for the system at each point in Time, in turn.
Stepwise simulation architectures on CPUs are typically more performant when using Memory, Threads and Channels between Threads in the right combinations.
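As a rough illustration of how Threads and Channels can combine in a Stepwise architecture, here is a minimal Python sketch. It assumes a hypothetical setup (not taken from the earlier posts) in which each State Partition is iterated on its own Thread and a coordinator exchanges State Values with those Threads through queues acting as Channels. The names `worker`, `run_stepwise_threaded` and the two example iterations are all illustrative assumptions. Python threads are used here to show the structure only, not the performance characteristics.

```python
import queue
import threading
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
_STOP = object()  # sentinel telling a worker thread to exit


def worker(iterate: Callable[[State], float],
           inbox: "queue.Queue", outbox: "queue.Queue") -> None:
    """Hypothetical worker: computes one partition's Next State Value per step."""
    while True:
        state = inbox.get()
        if state is _STOP:
            break
        outbox.put(iterate(state))


def run_stepwise_threaded(iterations: Sequence[Callable[[State], float]],
                          initial_state: State, n_steps: int) -> List[State]:
    """Advance all partitions one Timestep at a time, in lockstep."""
    inboxes = [queue.Queue() for _ in iterations]
    outboxes = [queue.Queue() for _ in iterations]
    threads = [
        threading.Thread(target=worker, args=(f, i, o))
        for f, i, o in zip(iterations, inboxes, outboxes)
    ]
    for t in threads:
        t.start()

    state = tuple(initial_state)
    history = [state]
    for _ in range(n_steps):
        # Broadcast the current state to every partition's channel...
        for inbox in inboxes:
            inbox.put(state)
        # ...then collect each partition's Next State Value, in order.
        state = tuple(outbox.get() for outbox in outboxes)
        history.append(state)

    for inbox in inboxes:
        inbox.put(_STOP)
    for t in threads:
        t.join()
    return history


# Two coupled partitions: x' = x + y and y' = 2 * y (purely illustrative).
example_iterations = [lambda s: s[0] + s[1], lambda s: s[1] * 2]
example_history = run_stepwise_threaded(example_iterations, (0, 1), 3)
```

The coordinator only moves on to the next Timestep once every partition has reported back, which is what keeps the evaluation Stepwise.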
In contrast, Processes and IPC are typically more useful for scaling computations in parallel across multiple non-interacting simulation Trajectories, which don’t need much IPC. This is because IPC typically carries more overhead than communication between Threads that share Memory.
Batch simulation architectures evaluate multiple successive sequences of Next State Values for the system over a wider interval in Time all as one computational block.
Despite appearances, Batch simulation architectures fundamentally cannot evaluate the Next State Values at different Timesteps in a truly parallel fashion. Simulations must still preserve the causal relationships between these Next State Values as they progress in Time.
To preserve this causality, some form of Iteration can be performed within the Batch, much as the Stepwise architecture does by evaluating each Next State Value recursively from the previous ones.
However, it is sometimes sufficient to simply encode the causal/temporal dependencies between State Values along the Simulation Timeline as part of a Batch prediction. This is how some Machine Learning models are used to predict time series data.
Example: Stepwise vs batch
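A minimal Python sketch of the distinction, assuming a hypothetical one-dimensional system (an explicit Euler step of a simple decay) that is not taken from the earlier posts. All function names and parameters here are illustrative. `run_stepwise` evaluates one Next State Value at a time, `run_batch` evaluates a block of Timesteps per call while still iterating causally inside the block, and `evaluate_batch_closed_form` shows the alternative where the temporal dependency is encoded directly (possible here only because this toy system is linear).

```python
from typing import List

DT = 0.1    # assumed Timestep size
RATE = 0.5  # assumed decay rate


def next_state(x: float) -> float:
    """Hypothetical iteration: explicit Euler step of dx/dt = -RATE * x."""
    return x + DT * (-RATE * x)


def run_stepwise(x0: float, n_steps: int) -> List[float]:
    """Evaluate the Next State Value at each point in Time, in turn."""
    history = [x0]
    for _ in range(n_steps):
        history.append(next_state(history[-1]))
    return history


def evaluate_batch(x_start: float, batch_size: int) -> List[float]:
    """Evaluate batch_size successive values as one computational block.

    Causality is still preserved by iterating inside the block.
    """
    block = []
    x = x_start
    for _ in range(batch_size):
        x = next_state(x)
        block.append(x)
    return block


def evaluate_batch_closed_form(x_start: float, batch_size: int) -> List[float]:
    """Encode the temporal dependency directly: x_k = x_start * (1 - RATE*DT)**k."""
    factor = 1.0 - RATE * DT
    return [x_start * factor ** k for k in range(1, batch_size + 1)]


def run_batch(x0: float, n_steps: int, batch_size: int) -> List[float]:
    """Cover the Simulation Timeline in blocks of up to batch_size Timesteps."""
    history = [x0]
    remaining = n_steps
    while remaining > 0:
        size = min(batch_size, remaining)
        history.extend(evaluate_batch(history[-1], size))
        remaining -= size
    return history
```

Both runners produce the same history for the same inputs. The difference lies in how much of the Simulation Timeline is handed over to a single evaluation.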
Specialised classical hardware
From the perspective of standard CPUs, Batch simulation architectures are often designed to evaluate segments of the Simulation Timeline using specialised classical hardware.
When we talk about ‘specialised classical’ hardware here, we mean GPUs, TPUs, IPUs and other specialised processors based on classical computing principles (as opposed to quantum processors).
Running Batches on this hardware can reduce the overall processing time taken to complete a Simulation Run relative to a Stepwise equivalent, but the tradeoffs involved mean this isn’t always more efficient.
GPUs, TPUs, IPUs, etc. all have their limitations. For example, GPUs and TPUs are highly optimised for dense arithmetic operations but struggle with branching control flow. IPUs offer more flexibility for irregular compute patterns and sparse operations, though they still prioritise throughput over the complex sequential logic that CPUs handle well.
So there are certain types of simulation algorithm for which GPUs, TPUs, IPUs, etc. are not well-suited to reducing the overall processing time.
In addition, this specialised hardware typically requires data transfer to/from CPU Memory (at the very least for initialisation and final results), which also takes processing time.
So, when deciding on the number of Timesteps a Batch simulation architecture should use for the best performance, software engineers must take into account:
- the available Memory of their specialised hardware
- the Memory requirements for their simulation State Partition Histories
- the overall number of Timesteps they need to perform
- the implications this has on the number of I/O operations needed to interact with CPU Memory
- and the implications this has on reducing the overall processing time, given the specialised hardware.
Example: Batch size tradeoffs
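The considerations listed above can be sketched as a toy cost model in Python. This is an illustration under loudly simplified assumptions rather than a benchmark: it assumes every Batch costs one fixed round trip of data transfer to and from CPU Memory, that every Timestep costs the same compute time on the device, and that Memory use grows linearly with the number of Timesteps held. All names and numbers are hypothetical.

```python
import math


def max_batch_size(device_memory_bytes: int, bytes_per_timestep: int) -> int:
    """Largest number of Timesteps whose State Partition Histories fit on the device."""
    if bytes_per_timestep < 1:
        raise ValueError("bytes_per_timestep must be positive")
    return device_memory_bytes // bytes_per_timestep


def n_transfers(n_steps: int, batch_size: int) -> int:
    """Number of round trips to CPU Memory needed to cover all Timesteps."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(n_steps / batch_size)


def estimated_run_time(n_steps: int, batch_size: int,
                       transfer_time_per_batch: int,
                       compute_time_per_step: int) -> int:
    """Toy estimate: fixed I/O cost per Batch plus compute cost per Timestep.

    Times are in arbitrary integer units (for example, microseconds).
    """
    io_time = n_transfers(n_steps, batch_size) * transfer_time_per_batch
    compute_time = n_steps * compute_time_per_step
    return io_time + compute_time


# Illustrative numbers only: 1000 bytes of device Memory, 100 bytes per Timestep.
largest = max_batch_size(1000, 100)                  # 10 Timesteps per Batch
one_at_a_time = estimated_run_time(100, 1, 50, 2)    # 100 * 50 + 100 * 2 = 5200
fully_batched = estimated_run_time(100, largest, 50, 2)  # 10 * 50 + 100 * 2 = 700
```

Under these assumptions, larger Batches only help until the device Memory limit is reached. In practice the compute time per Timestep also depends on how well the algorithm suits the hardware, which this sketch does not model.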
Quantum hardware
Note that the concepts in this section are the most likely to change with future advancements in Quantum Computing.
Quantum hardware seems to naturally fit the Batch simulation architecture in the same way that specialised classical hardware does.
In order to utilise this hardware within a given Batch evaluation, one would need to:
- prepare the initial state of Qubits
- encode the State Partition Histories from CPU Memory into these Qubits
- run quantum gates which encode the logic for multiple Timesteps of the simulation
- entangle the Next State Values at each Timestep with Ancilla Qubits, or rely on the Qubits holding the State Partition Histories themselves
- and measure these Next State Values in order to write the data to CPU Memory.
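To make the steps above more concrete, here is a deliberately tiny classical simulation of a single Qubit in pure Python. It is a sketch under strong assumptions rather than a real quantum program: the encoding of a classical value as a measurement probability, the use of one fixed rotation per Timestep, and every function name are hypothetical choices made for illustration. It also leaves out the Ancilla Qubit step entirely, so only the final Next State Value is read out.

```python
import math
import random
from typing import Tuple

Amplitudes = Tuple[float, float]  # real amplitudes of |0> and |1>


def prepare_zero() -> Amplitudes:
    """Prepare the initial Qubit state |0>."""
    return (1.0, 0.0)


def ry(state: Amplitudes, theta: float) -> Amplitudes:
    """Apply a rotation about the Y axis by angle theta."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    a0, a1 = state
    return (c * a0 - s * a1, s * a0 + c * a1)


def encode(state: Amplitudes, value: float) -> Amplitudes:
    """Assumed encoding: store value in [0, 1] as the probability of measuring 1."""
    return ry(state, 2 * math.asin(math.sqrt(value)))


def run_timesteps(state: Amplitudes, theta_per_step: float, n_steps: int) -> Amplitudes:
    """Apply one gate per Timestep without reading out in between."""
    for _ in range(n_steps):
        state = ry(state, theta_per_step)
    return state


def probability_of_one(state: Amplitudes) -> float:
    return state[1] ** 2


def measure(state: Amplitudes, shots: int, rng: random.Random) -> float:
    """Estimate the encoded value by sampling measurement outcomes.

    On real hardware each shot would require rerunning the whole circuit,
    because measurement collapses the Qubit state.
    """
    p1 = probability_of_one(state)
    ones = sum(1 for _ in range(shots) if rng.random() < p1)
    return ones / shots


encoded = encode(prepare_zero(), 0.25)
evolved = run_timesteps(encoded, math.pi / 6, 2)
estimate = measure(evolved, 10_000, random.Random(0))
```

Even in this toy version, the encode and measure steps sit at the boundary with CPU Memory, while the gates for several Timesteps run between them.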
Note also that the No-Cloning Theorem means we cannot simply copy the Qubits which have run the quantum gates; the circuit must run separately for each simulation Trajectory.
Therefore, you only get a Quantum Advantage if you can store more than one Timestep’s worth of simulation Next State Values in Qubit Memory.
Otherwise, if you effectively only have one instantaneous Timestep of Qubit Memory to use, the processing time will likely be dominated by the I/O needed to write data to and read data from the Qubits during the simulation. This is also known as the Quantum I/O Bottleneck.