Optimizing application performance for the Intel Atom architecture

The quality of tool support has a direct impact on the effectiveness of optimization efforts. Understanding how to use these tools to optimize performance is critical when migrating to the Intel Atom processor, an important consideration for Atom-based small form factor designs.

Good software design seeks a balance between simplicity and efficiency, and application performance is one aspect of that design. A typical application development cycle consists of four phases: design, implementation, debugging, and tuning. The tuning phase in turn involves single processor core optimization, multicore processor optimization, and power optimization. The development cycle is iterative and concludes when performance and stability requirements are met.

The following discussion will explore the processes involved in the different phases of tuning and describe the software tools that can aid in these efforts.

Single processor core tuning

Single processor core tuning focuses on improving the behavior of the application executing on one physical Intel Atom processor core. This tuning step isolates the behavior of the application from more complicated interactions with other threads or processes on the system.

The foundation of performance tuning is built on complementary assertions of the Pareto principle and Amdahl’s law. As applied to software performance optimization, the Pareto principle, or the 80/20 rule, states that 80 percent of the time spent in an application is in 20 percent of the code. Amdahl’s law provides guidance on the limits of optimization. For example, if optimization can only be applied to 75 percent of the application, then even if that portion were reduced to zero execution time, the untouched 25 percent would remain, bounding the maximum theoretical speedup at 1/0.25, or 4 times.
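The arithmetic behind Amdahl's law can be sketched in a few lines (a hypothetical helper written for illustration, not taken from the book):

```python
def amdahl_speedup(optimized_fraction, factor):
    """Overall speedup when `optimized_fraction` of the runtime
    is accelerated by `factor` (Amdahl's law)."""
    return 1.0 / ((1.0 - optimized_fraction) + optimized_fraction / factor)

# Doubling the speed of the optimizable 75 percent gives a modest gain.
modest = amdahl_speedup(0.75, 2)          # 1 / (0.25 + 0.375) = 1.6x

# Even an enormous factor is bounded by the untouched 25 percent.
bound = amdahl_speedup(0.75, 1_000_000)   # approaches 1 / 0.25 = 4x
print(modest, bound)
```

The second call shows why profiling to find the largest optimizable fraction matters more than heroics on a small hot spot.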

Single processor core tuning comprises multiple steps, including gaining an understanding of the application, tuning based on general performance analysis, and analyzing and tuning specific to the Intel Atom processor.

Multicore processor tuning

Tuning multithreaded applications on the Intel Atom processor requires ensuring good performance when the application is executing on logical processor cores available via Intel Hyper-Threading Technology as well as multiple physical processor cores. General multithreading issues such as lock contention and workload balance that affect performance regardless of the architecture must be addressed. When executing under Intel Hyper-Threading Technology, the processor core’s shared resources present a performance concern. Tuning for multicore processors adds another level of complexity, as the possible thread interactions and cache behavior become still harder to reason about.

Converting a serial application to take advantage of multithreading follows a generic development cycle consisting of five phases:

  1. Analysis: Develop a benchmark that represents typical system usage and includes concurrent execution of processes and threads. Use a system performance profiler to identify performance hot spots in the critical path. Determine if the identified computations can be executed independently. If so, proceed to the next phase; otherwise, look for other opportunities with independent computations.
  2. Design: Determine changes required for a threading paradigm by characterizing the application-threading model (data- or task-level parallelization).
  3. Implementation: Convert the design into code based on the selected threading model.
  4. Debug: Use runtime debugging and thread analysis tools.
  5. Tune: Tune for concurrent execution on multiple processor cores executing without Hyper-Threading.

Power tuning

Tuning that is focused on power utilization is a relatively new addition to the optimization process for Intel architecture processors. The goal of this phase is to reduce the power utilized by the application when executing on the embedded system. One of the key methods of power tuning is to help the processor enter and stay in one of its idle states.

The goal of power tuning is twofold:

  • Minimize time in active state.
  • Maximize time in inactive state.

It might seem like these goals are redundant; however, in practice, both are required. Power is expended in transitioning into and out of idle modes. A processor that is repeatedly waking up and then going back to sleep might consume more power than a processor that has longer periods in an active state. Techniques to meet these goals follow one of two tuning strategies:

  • Race to idle: The tasks are executed as quickly as possible to enable the system to idle.
  • Idle mode optimization: Iteratively add software components executing on the system and analyze power state transitions to ensure these components are as nondisruptive to power utilization as possible.
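A simple energy model makes the tradeoff concrete. The figures below are hypothetical and chosen only for illustration; real numbers would come from probe-based measurement of the target platform:

```python
def energy_joules(p_active_w, t_active_s, p_idle_w, t_idle_s,
                  transitions=0, e_transition_j=0.0):
    """Total energy over an interval: active + idle + wake/sleep cost."""
    return (p_active_w * t_active_s + p_idle_w * t_idle_s
            + transitions * e_transition_j)

# Fixed amount of work over a 10-second interval, hypothetical figures:
# Race to idle: burst at 2.0 W for 1 s, then deep idle at 0.1 W for 9 s.
race = energy_joules(2.0, 1.0, 0.1, 9.0, transitions=1, e_transition_j=0.05)

# Stretch the work instead: run at 0.6 W for the full 10 s, never idling.
stretch = energy_joules(0.6, 10.0, 0.1, 0.0)

# Same burst, but 100 sleep/wake cycles: transition cost erodes the win.
chatty = energy_joules(2.0, 1.0, 0.1, 9.0, transitions=100,
                       e_transition_j=0.05)
print(race, stretch, chatty)
```

With these numbers, racing to idle uses the least energy, but the chatty variant shows why minimizing transitions matters as much as maximizing idle time.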

Power and performance analysis tools overview

Software tools for performance and power optimization aid in analysis and tuning efforts. Performance tools that target single processor core performance provide insight into how the application is behaving at the microarchitecture level. Multicore performance tools provide insight into how the application is executing in the context of Intel Hyper-Threading Technology and multicore processing. Performance tools focused on power optimization provide insight into application behavior that affects power utilization.

Single-core performance tools

Tools for analyzing single processor core performance show how an application is behaving as it executes on one processor core. These tools typically fall into one of the following categories:

  • System profilers: Provide a summary of execution times across processes on the system.
  • Application profilers: Provide a summary of execution times at the function level of the application.
  • Microarchitecture profilers: Provide a summary of processor events across applications and functions executing on the system.
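As a concrete illustration of the application-profiler category, Python's built-in cProfile reports time at the function level. It stands in here for the native profilers this article discusses (such as gprof); the workflow of profiling a run and ranking functions by time is the same:

```python
import cProfile
import io
import pstats

def hot_spot():
    # Deliberately expensive function the profiler should surface.
    return sum(i * i for i in range(200_000))

def main():
    return [hot_spot() for _ in range(5)]

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Rank functions by cumulative time, as an application profiler would.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
report = out.getvalue()
print(report)
```

The report puts hot_spot near the top, pointing the tuning effort at the 20 percent of the code where the time is actually spent.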

Table 1 describes several tools used in single-core performance analysis. These tools are not all equal; some provide a superset of the functionality of others. For example, sysprof can provide a call graph profile across all applications executing on a system, whereas GNU gprof cannot. However, gprof is available across a wide range of operating systems, while sysprof is Linux-only.

Table 1: Single-core performance tools provide insight into how an application is behaving as it executes on one processor core.

Multicore performance tools

Unique tools for analyzing performance related to multicore processors are still relatively scarce. System profilers can provide information on processes executing on a system; however, interactions in terms of messaging and coordination between processes are not visible. Tools that offer visibility into this coordination typically must be cognizant of the particular API in use.

The Intel Thread Profiler identifies thread-related performance issues and can analyze OpenMP, POSIX, and Windows multithreaded applications. The tool employs critical path analysis for recording events, including spawning new threads, joining terminated threads, holding synchronization objects, waiting for synchronization objects to be released, and waiting for external events. An execution flow (the path a single thread takes through the application) is created, and each of the aforementioned events can split or terminate the flow. The critical path is defined as the longest flow through the execution from the start of the application until it terminates.
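The kind of data such a tool gathers can be approximated by hand: time how long each thread spends blocked on a synchronization object. The sketch below is illustrative only and far simpler than Intel Thread Profiler's instrumentation:

```python
import threading
import time

lock = threading.Lock()
blocked_for = {}  # per-thread time spent waiting to acquire the lock

def worker(name, hold_s):
    t0 = time.perf_counter()
    with lock:  # may block while another thread holds the lock
        blocked_for[name] = time.perf_counter() - t0
        time.sleep(hold_s)  # simulate work done under the lock

threads = [threading.Thread(target=worker, args=(f"t{i}", 0.05))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Threads that arrived while the lock was held show nonzero wait times,
# which is exactly the contention a thread profiler would flag.
print(blocked_for)
```

Since all three threads serialize on one lock, the wait times accumulate along what critical path analysis would identify as the longest flow through the run.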

Power performance tools

Two types of tools assess power performance:

  • Probe-based profiling: Uses an external device to measure and record the actual power drawn by the system as a specific application executes.
  • Power-state profiling: Relies on software interfaces into the platform’s power states; instead of measuring power utilization directly, these interfaces report the number of transitions between platform power states. Idle-mode optimization works by enabling increasing application functionality and inspecting the recorded power data at every stage.
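On Linux, counters of this kind are exposed through the cpuidle sysfs interface. The sketch below assumes the standard /sys/devices/system/cpu/cpuN/cpuidle layout and simply returns an empty result where that interface is absent (for example, in a container or on a non-Linux system):

```python
from pathlib import Path

def sample_idle_states(cpu=0):
    """Read per-state entry counts from the Linux cpuidle sysfs interface."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpuidle")
    counts = {}
    if not base.is_dir():
        return counts  # interface not available on this system
    for state in sorted(base.glob("state*")):
        try:
            name = (state / "name").read_text().strip()
            usage = int((state / "usage").read_text())
        except OSError:
            continue
        counts[name] = usage  # times the core entered this idle state
    return counts

snapshot = sample_idle_states()
print(snapshot)
```

Sampling these counters before and after a workload and diffing the two snapshots yields the power-state transition data that idle-mode optimization inspects at each stage.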

Small form factors depend on designs that provide power and performance efficiency for battery-powered and fanless systems. The optimization process consists of multiple phases as outlined in this article, and employing the right software tools and programming techniques at each stage of the process is essential to extract optimal power and performance efficiency for these devices.

This article is based on material excerpted from the book Break Away with Intel Atom Processors: A Guide to Architecture Migration by Lori Matassa and Max Domeika. For more information about this book, visit www.intel.com/intelpress/sum_ms2a.htm.

Lori Matassa is a senior staff platform software architect in Intel’s Embedded and Communications Division. She has more than 25 years of experience as an embedded software engineer developing software for platforms including mainframe and midrange computer system peripherals. At Intel, she has contributed to driver hardening standards for Carrier Grade Linux and led the software enablement of multicore adoption and architecture migration for embedded and communication applications. Lori holds a BS in Information Technology.

Max Domeika is an embedded software technologist in the Developer Products Division at Intel. During the past 14 years, Max has held several positions at Intel in compiler development, including project lead for the C++ front end and developer on the optimizer and IA-32 code generator. Max earned a BS in Computer Science from the University of Puget Sound, an MS in Computer Science from Clemson University, and an MS in Management in Science & Technology from Oregon Graduate Institute.

Intel Corporation 408-765-8080 www.intel.com