Abstract
As mobile devices and data centers expand to cover global needs for services
and personal computing, power consumption of systems and devices has become
a primary concern for hardware designers and software developers. ARM
processors already dominate the mobile world and are taking leaps into the server
market due to their inherent energy efficiency. In this work we study the energy
characteristics of modern ARM processors at the instruction level.
To characterize the energy consumption of ARM processors we measure the
energy consumption of special-purpose benchmarks. Our measurements are made
using actual voltage/current sensors provided by the Odroid-XU+E development
board which contains an ARM big.LITTLE processor consisting of two clusters of
four Cortex-A7 and four Cortex-A15 cores.
Our characterization benchmarks are designed to stress individual
units of the datapath. With two different benchmarks for each instruction type,
we study both the latency and the energy of instructions as well as the maximum
throughput of the processor for that instruction.
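As an illustration of how per-instruction figures can be derived from such measurements, the following is a minimal sketch rather than the procedure used in this work. It assumes power is sampled at a fixed period while a benchmark executes a known number of target instructions; the function names, the optional idle-power subtraction, and the sampling scheme are all hypothetical.

```python
def energy_per_instruction_pj(power_samples_w, sample_period_s,
                              instruction_count, idle_power_w=0.0):
    """Average energy per instruction in picojoules.

    Assumption: each sample is the mean power (in watts) over one
    sampling period, and idle_power_w is a baseline to subtract.
    """
    if instruction_count <= 0:
        raise ValueError("instruction_count must be positive")
    # Integrate power above the idle baseline over time to get joules.
    energy_j = sum(max(p - idle_power_w, 0.0)
                   for p in power_samples_w) * sample_period_s
    return energy_j / instruction_count * 1e12  # joules -> picojoules


def cycles_per_instruction(elapsed_s, clock_hz, instruction_count):
    """Average cycles per instruction for a benchmark run.

    On a chain of dependent instructions this approximates latency;
    on independent instructions its inverse approximates throughput.
    """
    if instruction_count <= 0:
        raise ValueError("instruction_count must be positive")
    return elapsed_s * clock_hz / instruction_count
```

For example, ten samples of 0.3 W taken every 0.1 s over a run of five billion instructions correspond to roughly 60 pJ per instruction.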
Our findings for Cortex-A7 cores show that integer instructions cost from 50 to
80 pJ each, float/double instructions from 80 pJ to 350 pJ each, and more complex
instructions like divisions cost from 150 pJ to 1200 pJ per instruction. Load and
store instructions cost 150 pJ to 200 pJ each when hitting in the L1 cache whereas
the cost increases up to 270 pJ when accessing the L2 cache. On the Cortex-A15,
instructions cost three to five times more than on Cortex-A7 for the same clock
frequency, even when the two cores show the same throughput for an instruction.
For benchmarks that fit mostly in the L1 cache, we observed that at the same
clock frequency, execution is 20% to four times faster on Cortex-A15,
while energy to completion increases by a factor of 2 to 4 relative to Cortex-A7.
When comparing Cortex-A7 at the lowest frequency of 500 MHz to Cortex-A15
at the highest frequency of 1.5 GHz, we see that execution is 4 to 10
times faster on Cortex-A15, while energy to completion increases by a factor
of 5 to 9 relative to Cortex-A7.
Through these measurements, we developed a thorough characterization of the
ARM instruction set with energy and latency metrics for every instruction type.
We validated the correctness of our characterization by developing an instruction
level energy model and testing it on a variety of real programs. Our evaluation
shows average prediction errors of 8.5% for Cortex-A7 and 14% for Cortex-A15.
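A simple form of instruction-level energy model can be sketched as a weighted sum of instruction counts. This is an assumed formulation for illustration only; the model developed in this work may include further terms. The per-class costs below are placeholders chosen inside the Cortex-A7 ranges quoted above, not measured values, and the class names are hypothetical.

```python
# Placeholder per-instruction costs in picojoules (NOT measured values).
ILLUSTRATIVE_COST_PJ = {"int": 65.0, "fp": 200.0, "load_l1": 175.0}


def predict_energy_pj(instruction_counts, cost_pj):
    """Predicted energy: sum of count * per-instruction cost per class."""
    return sum(count * cost_pj[kind]
               for kind, count in instruction_counts.items())


def relative_error(predicted, measured):
    """Prediction error as a fraction of the measured energy."""
    if measured <= 0:
        raise ValueError("measured energy must be positive")
    return abs(predicted - measured) / measured


# Hypothetical instruction mix of a small program.
counts = {"int": 1000, "fp": 100, "load_l1": 200}
predicted = predict_energy_pj(counts, ILLUSTRATIVE_COST_PJ)
```

Comparing `predicted` against a measured energy with `relative_error` yields the kind of per-program error that an average prediction error summarizes.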
Furthermore, we use our characterization and energy model to quantify the
energy characteristics of heterogeneous multiprocessing systems, such as ARM
big.LITTLE, and show how this can inform optimal workload placement in such systems. We
highlight the different factors that contribute to the energy expenditure of such
systems and show how these differ from one processor to the other.