High Performance Windows Timers


Introduction

This page discusses Windows timing services, with emphasis on their use for timing very short events in high-performance software, where an event may take less than 5 milliseconds or even less than 1 millisecond. Windows provides many different timing services whose capabilities are backed by different clocks. A clock is a monotonic counter updated at a specified frequency from a source clock. Clock capabilities and interpretation can vary significantly.

Clock Attributes

The capabilities of a timer depend on the clock behind it. In particular, the primary capabilities that vary between the clocks used by timing services are accuracy, precision, resolution, range, drift, alignment, interpretation, and availability.

Resolution, Accuracy, and Precision

The resolution of a timer is the minimum increment of time it can measure. For example, a great many time-related APIs have millisecond resolution. Note that while a clock may have millisecond resolution, this says nothing about its precision or accuracy. A clock is precise if its measurements are consistent and reproducible. A clock's accuracy describes how close its measurements are to the true value they claim. For example, say a millisecond clock is checked by a much better clock. The millisecond clock is asked to measure a 30 ms interval three times, and does so in 31 ms, 31 ms, and 31 ms. Its precision is quite good, because it always takes 31 actual ms when asked to measure 30 ms. However, its accuracy is off by one millisecond.
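The 30 ms example can be put into numbers. The sketch below is illustrative only: the helper name `describe_clock` is hypothetical, and it assumes accuracy is summarized as the mean error and precision as the spread (population standard deviation) of repeated measurements.

```python
from statistics import mean, pstdev

def describe_clock(nominal_ms, actual_ms):
    """Return (accuracy_error_ms, precision_spread_ms) for repeated measurements."""
    # Accuracy: how far the average observed duration is from the nominal one.
    accuracy_error = mean(actual_ms) - nominal_ms
    # Precision: how much the observations scatter around their own average.
    precision_spread = pstdev(actual_ms)
    return accuracy_error, precision_spread

# The clock from the text: asked for 30 ms, it takes 31 ms every time.
error, spread = describe_clock(30, [31, 31, 31])
print("accuracy error (ms):", error)
print("precision spread (ms):", spread)
```

A spread of zero with a nonzero error is the "precise but not accurate" case: the clock is consistently wrong by the same amount.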


While a given clock may have good resolution, it may not be very accurate or precise, depending on how it is used. On Windows, this is particularly visible when timing very short events, in the millisecond or sub-millisecond range. Longer timings, in the 50 ms and up range, can now be accurately reported by most timers, except for the real time clock. Many Windows APIs appear to be millisecond accurate because they take times in milliseconds, but can easily be off by 2 to 50 ms. However, in the vast majority of applications, this is irrelevant.
The frequency of a clock is the frequency of the constant-interval signal that generates it. This is generally at least as high as its resolution and is often 2x or 3x the stated resolution.

Range

The range of a timer refers to the width of the value used to track the timer. For example, a 16 kHz clock counted with a WORD has a range of about 4 seconds, since a WORD holds only 65,536 distinct values (maximum 65,535). A clock with this short a range is not very useful in practice. Typically, most clocks are either DWORDs or QWORDs, affording a considerable range.
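Range follows directly from counter width and tick frequency. The following is a minimal sketch of that arithmetic; the function name `counter_range_seconds` is hypothetical, and the 1 kHz DWORD case is an added illustration rather than a figure from this page.

```python
def counter_range_seconds(bits, frequency_hz):
    """Seconds until an unsigned counter of the given width wraps around."""
    return (2 ** bits) / frequency_hz

# A 16-bit WORD counting a 16 kHz clock.
print(counter_range_seconds(16, 16_000))

# A 32-bit DWORD counting milliseconds (1 kHz), expressed in days.
print(counter_range_seconds(32, 1_000) / 86_400)
```

The second figure comes out just under 50 days, which is why millisecond tick counts stored in a DWORD wrap on long-running machines.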

Drift

Clock drift refers to a clock's deviation from its rated resolution or frequency. For example, the clock periods may slightly decrease over time, leading the user to believe the clock has "sped up," gaining a second. Many clock signals in modern electronics are derived from quartz crystals due to their reliability. One of the primary causes of variance in quartz clocks is heat. Some quartz clocks are designed with temperature correction, since the drift can be predicted from the crystal's change in temperature. Other factors include input power variance and mechanical load.

Alignment

A clock is said to be aligned if it has been synchronized with another clock. For example, the real-time clock is aligned with calendar time when a PC is built. Consequently, the real-time clock can provide a meaningful interpretation of its time. This permits the value 0x01cd4f23e1014960 to be interpreted as 2012 June 20th at 20:32:58.102 UTC. Nearly all Windows machines regularly realign their clocks to either a local domain controller or an internet time server.
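To illustrate how an aligned counter value becomes a calendar time, here is a small sketch. It assumes the example value is a Windows FILETIME, that is, a count of 100 ns ticks since 1601-01-01 UTC; the page does not say so explicitly, but the arithmetic is consistent with that reading. The helper name `filetime_to_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the value is a FILETIME (100 ns ticks since 1601-01-01 UTC).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_utc(filetime):
    """Convert 100 ns ticks since the FILETIME epoch to an aware datetime."""
    # datetime resolves microseconds, so the last 100 ns digit is dropped.
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)

print(filetime_to_utc(0x01CD4F23E1014960).isoformat())
# prints 2012-06-20T20:32:58.102000+00:00
```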

Interpretation

For calendar time, sometimes referred to as real time, the interpretation of a time point is also relative. One observer in the US Eastern timezone may call a given time point 4:32PM, whereas another in the US Pacific timezone may call it 1:32PM, and still another in a different timezone that has the same base offset as US EST but does not observe daylight saving time may call it 3:32PM. Consequently, it is generally a best practice to store dates and times in UTC. If they will be accessed by other machines or stored on a central server, the timezone of the reporter should also be included.
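The three readings in this example are one stored UTC instant rendered under three offsets. The sketch below uses fixed offsets to stay self-contained, which is a simplification: real code would normally use a timezone database. The constant and function names are hypothetical, and the date is borrowed from the alignment example.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets are an assumption made to keep the sketch self-contained.
EDT = timezone(timedelta(hours=-4), "EDT")         # US Eastern, daylight time
PDT = timezone(timedelta(hours=-7), "PDT")         # US Pacific, daylight time
EST_NO_DST = timezone(timedelta(hours=-5), "EST")  # same base offset, no DST

# Store the instant once, in UTC.
stored_utc = datetime(2012, 6, 20, 20, 32, tzinfo=timezone.utc)

def local_clock(moment, tz):
    """Render one stored UTC instant as a wall-clock time in tz."""
    return moment.astimezone(tz).strftime("%H:%M %Z")

for tz in (EDT, PDT, EST_NO_DST):
    print(local_clock(stored_utc, tz))
# prints 16:32 EDT, then 13:32 PDT, then 15:32 EST
```

Keeping the reporter's offset next to the UTC value lets a central server reconstruct what the reporter's wall clock showed.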

Availability

The clock source used by a timer may not always be incrementing (ticking). Most clocks are always incrementing once the machine has powered up; however, at a low level, many clocks are suspended during sleep mode or hibernation. The only clock on a PC that can be guaranteed to always be available is the real-time clock, as it is battery backed. However, its resolution is only 1 second.


Hardware Timers

Every Windows PC includes many different hardware timers. These include:

  • Real Time Clock (RTC)

The real time clock has two unique capabilities: it is always available, having a battery backup, and it is aligned to calendar time when the system is created. This means that time from the RTC has a useful interpretation as calendar time. Its resolution is limited to 1 second, however.

  • Programmable Interval Timer (PIT)

The PIT is the oldest timing hardware service in the PC. Its clock runs at 1.193182 MHz and uses a 16-bit counter. It has three timers, but two are prewired to DRAM refresh and the PC speaker for sound generation. The third timer can be used for general purposes. It was once used to play digitized sounds on PC speakers before sound cards were invented.

  • High Precision Event Timer (HPET)

The HPET is sometimes referred to as the multimedia timer. The HPET was developed jointly by Intel and Microsoft in 2004 to resolve timing issues that had plagued PCs for many years, particularly synchronizing audio and video during multimedia playback. The Intel HPET spec is available here. As of 2012 June, it has not been revised since publication. This timer has good resolution and a drift of only 0.05%, or 1.8 seconds per hour, quite sufficient for multimedia purposes. Its drift is 0.2% for periods of < 100 us, however. The first version of Windows to use this timer is Windows Vista, for which it is considered required (it is part of the Windows hardware logo requirement). Windows XP SP3 introduced an emulated HPET at the kernel level, but did not use the actual HPET. The specification calls for a minimum speed of 10 MHz, but in practice it is always implemented at 14.31818 MHz, or 4x the ACPI clock.

One way to determine if a computer has an HPET or not in Windows Vista or above is to check the Device Manager under System devices.  If present, it will be listed as "High precision event timer."  It has been part of Intel chipsets since the ICH6 revision, possibly earlier.
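The HPET drift figures quoted above can be checked with a line of arithmetic. This is a sketch only; the function name `worst_case_drift` is hypothetical, and it treats the quoted percentages as worst-case fractional errors over the interval.

```python
def worst_case_drift(drift_fraction, interval_seconds):
    """Largest time error accumulated over an interval at a given drift."""
    return drift_fraction * interval_seconds

# 0.05% over one hour, in seconds.
print(round(worst_case_drift(0.0005, 3600), 3))

# 0.2% over a 100 microsecond interval, in microseconds.
print(round(worst_case_drift(0.002, 100e-6) * 1e6, 3))
```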

  • Advanced Programmable Interrupt Controller (APIC - Unavailable)

This timer is used to support hardware interrupt services, permitting a hardware device on the PCI bus to interrupt the CPU. Each CPU has its own APIC, which is clocked according to the frequency of the CPU it is connected to. These chips connect to an IOAPIC, which coordinates directing interrupts to specific processors and expands the range of available interrupts past the old limit of 15. For more information, see the following OSR writeup on the APIC, or the Intel spec for the chip here. Until recently, the APIC timer was unused. There is no user-mode service that makes use of it, nor is there a special kernel API for it. Starting in Windows 7, the APIC timer is used internally when profiling is enabled, and may always be claimed by Windows 8 for internal purposes. See statements from Microsoft here and here. For more information, see this link.

  • ACPI Clock (PM Timer - likely unused)

The development of Advanced Power Management (APM) services required the ability to maintain time when the system entered sleep or suspend modes. To support this, yet another clock, sometimes called the PMCLOCK, was required by the ACPI specification. This clock runs at a fixed frequency of 3.579545 MHz with a 24-bit or 32-bit counter, depending on hardware, resulting in a range of approximately 4.7 seconds (24-bit) to 20 minutes (32-bit). It is capable of generating interrupts on counter wrap to assist the OS in timekeeping. It is important to note that the term "ACPI Clock" is ambiguous: the ACPI specification describes the functionality as a service provided by the ACPI services, called the OSPM, and distinguishes it from the actual source used for the clock. Consequently, the PM timer hardware does not necessarily have to be used for the ACPI timer. This is important because in practice many PM timers were buggy, leading the ACPI spec to prefer the HPET as the ACPI clock source if an HPET is present and the CPU cycle counter is not used for this (see section 5.2.9, description for the USE_PLATFORM_CLOCK flag, currently p. 123). Given this, the commonly seen HPET frequency of 14.31818 MHz can be explained: it meets the HPET spec requirement, which sets only a minimum of 10 MHz, and is exactly 4x the ACPI-specified frequency, permitting the OSPM to implement both clocks with one piece of hardware, as recommended by the specification.
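A short counter such as the PM timer has to be read in a wrap-tolerant way. The sketch below shows the usual masked-subtraction approach along with the wrap period and the 4x frequency relationship; the function names are hypothetical, and the raw counter reads are simulated values rather than real hardware reads.

```python
PM_TIMER_HZ = 3_579_545  # PM timer frequency as given in the text

def elapsed_ticks(previous, current, counter_bits=24):
    """Ticks between two raw reads, correct across a single counter wrap."""
    mask = (1 << counter_bits) - 1
    return (current - previous) & mask

def wrap_period_seconds(counter_bits):
    """Time for a counter of the given width to wrap at PM_TIMER_HZ."""
    return (1 << counter_bits) / PM_TIMER_HZ

# Simulated reads on either side of a 24-bit wrap.
print(elapsed_ticks(0xFFFFF0, 0x000010))       # prints 32

print(round(wrap_period_seconds(24), 2))       # prints 4.69 (seconds)
print(round(wrap_period_seconds(32) / 60, 1))  # prints 20.0 (minutes)
print(4 * PM_TIMER_HZ)                         # prints 14318180 (Hz)
```

The masked subtraction only works if the counter is sampled at least once per wrap period, which is one reason the interrupt on counter wrap is useful to the OS.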

  • CPU Cycle Counter (RDTSC)

The highest resolution clock available on any current PC is the CPU cycle counter, returned by the ReaD TimeStamp Counter instruction. The RDTSC clock is built into the CPU and is theoretically incremented with each clock tick. However, the semantics do not always hold. For example, some processors vary their frequency to reduce power consumption, and the TSC is reset after suspension and resumption. When multiple processors are used, the TSCs are not synchronized. To use the TSC reliably, you must prevent power-down and set processor affinity for the thread to prevent it from being rescheduled onto a different processor. 


For one of the best writeups on the subject of hardware timers, see VMware's article Timekeeping in Virtual Machines.


Software Timing Services for Short Events and High Performance

There are several different software timing services. The word "timing" is ambiguous and may refer to one of two similar capabilities:

  1. Tracking the passage of time (timing)
  2. Notification that a specified amount of time has passed (notification)

It is important to note that these are separate capabilities. Timer facilities generally support either both capabilities or time tracking only.

The best Windows timer API for millisecond-accurate timing and notification continues to be (as of 2012) the multimedia timer, provided by the time* APIs running on Windows Vista or later with logo-certified hardware (HPET). Ignore the (2012!) MSDN documentation stating that timeSetEvent is an obsolete API. Its replacement, timer queues, is less accurate for these very short intervals. Also, you must still call timeBeginPeriod at application startup, or timer queues are even less accurate, though still better than SetTimer. Windows XP's best possible timer resolution is considered to be 0.9766 ms.

The replacement was created due to a problem discovered with the PulseEvent API used by time*. If PulseEvent is called when no thread is currently waiting on the object, it will not release any threads. This is by design. However, if a driver dispatches an asynchronous procedure call (APC) to a waiting thread, effectively hijacking it for a short period of time, the thread is not considered to be waiting during that period and so will not be woken up by a pulse that arrives then. When the APC(s) complete, the thread goes back to waiting, although logically it should have been woken up. For more information on time*'s use of PulseEvent, see this link, or go straight to this link for details on the PulseEvent issue. For alternatives to PulseEvent, see condition variables, though note they are a user-mode only construct and are not available across processes.

Also, note that the accuracy of time* will increase with thread priority. A good pattern when using time* therefore is:

  1. Call timeBeginPeriod(1) at program startup to activate multimedia-class timer services. This activates per-millisecond interrupts which otherwise have nontrivial impact on performance, but if you need it, you need it. Without it, accuracy is generally limited to 5 ms.
  2. Call timeEndPeriod at program shutdown.
  3. Elevate thread priority via SetThreadPriority and SetPriorityClass. Note that thread priorities are clamped unless the class is increased, and use extreme caution with any priorities over 15.
  4. If using the TSC mechanism, fix thread to the current CPU via SetThreadAffinityMask.
  5. Register a WM_QUERYENDSESSION handler to prevent session shutdown.
  6. Call SetThreadExecutionState to prevent suspends, or use WM_POWERBROADCAST on Windows XP/Server 2003 and older.
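Steps 1 and 2 of this pattern pair naturally, so they can be wrapped in one scope. The following is a minimal sketch that calls `timeBeginPeriod` and `timeEndPeriod` from `winmm.dll` through Python's `ctypes`; the context manager name `multimedia_timer_period` is hypothetical, and the no-op fallback on other platforms is an assumption made so the sketch can run anywhere. Priority, affinity, and power handling (steps 3 to 6) are not covered.

```python
import contextlib
import ctypes
import sys

@contextlib.contextmanager
def multimedia_timer_period(period_ms=1):
    """Request period_ms timer resolution for the duration of the block."""
    if sys.platform != "win32":
        # Assumption: elsewhere there is nothing to request, so do nothing.
        yield False
        return
    winmm = ctypes.WinDLL("winmm")
    # timeBeginPeriod returns TIMERR_NOERROR (0) on success.
    granted = winmm.timeBeginPeriod(period_ms) == 0
    try:
        yield granted
    finally:
        if granted:
            # Every successful timeBeginPeriod is matched by timeEndPeriod
            # with the same period value.
            winmm.timeEndPeriod(period_ms)

with multimedia_timer_period(1) as granted:
    print("1 ms period granted:", granted)
```

In a real application the block would span the whole program run, matching "at program startup" and "at program shutdown" in the list.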

For sub-millisecond waits, the best solution is to spin using the system performance counter as the clock. This API was not generally reliable until several years ago, when Microsoft reimplemented it to no longer rely on the TSC exclusively. It now vets the system timing services and performs all needed corrections to provide a reliable clock, but the counter frequency may be only about 3 MHz. In particular, the performance counter value is now valid across threads, between processors, and across suspensions and wake-ups. The clock is now provided by the ACPI clock on multiprocessor systems and the TSC on uniprocessor systems. The observed frequencies are therefore generally about 3 MHz or several gigahertz, respectively. Because it is an abstract counter clock only, waiting for a required duration requires looping until the counter reaches the target time.
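The spin loop described here can be sketched as follows. It uses Python's `time.perf_counter_ns`, which CPython implements on top of QueryPerformanceCounter on Windows; the function name `spin_wait_us` is hypothetical. The loop burns a full CPU core while waiting, so it is only appropriate for very short waits.

```python
import time

def spin_wait_us(duration_us):
    """Busy-wait for at least duration_us microseconds; return the actual wait."""
    start = time.perf_counter_ns()
    deadline = start + duration_us * 1_000
    now = start
    # Poll the counter until it reaches the target; no sleep, no yield.
    while now < deadline:
        now = time.perf_counter_ns()
    return (now - start) / 1_000

waited = spin_wait_us(250)
print("requested 250 us, waited", waited, "us")
```

The returned value shows the overshoot, which on a loaded system grows whenever the thread is preempted mid-spin; that is the reason for the thread priority advice above.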

For a comparison of the accuracy and retrieval times of Windows timer services, see this link. It includes sample code and results illustrating the differences, relative to the performance counter.

Counter Retrieval Performance

Determining what time it is can be a very slow operation, depending on the API. For example, it can take 0.0009 ms to get the current date and time using .NET's DateTime.Now API. However, it takes only 0.00008 ms on the same machine to get the performance counter, an order of magnitude faster. Retrieval of the timeGetTime timer is as fast as QueryPerformanceCounter.
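The same kind of comparison is easy to reproduce. The sketch below measures the average cost of a calendar-time call against a performance-counter call using Python's `timeit`; the function name `cost_per_call_ms` is hypothetical. The absolute numbers will not match the .NET figures above, since interpreter overhead is included, so only the relative difference on a given machine is meaningful.

```python
import time
import timeit
from datetime import datetime

def cost_per_call_ms(func, calls=100_000):
    """Average cost of one call to func, in milliseconds."""
    total_seconds = timeit.timeit(func, number=calls)
    return total_seconds / calls * 1_000

print("performance counter:", cost_per_call_ms(time.perf_counter), "ms")
print("calendar date/time: ", cost_per_call_ms(datetime.now), "ms")
```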

QueryPerformanceCounter

The highest speed (resolution) timing-only API in Windows has always been QueryPerformanceCounter/QueryPerformanceFrequency, available since Windows 2000 (NT 5.0). Because it uses 64-bit values for both the counter and the frequency, the API has not needed to be changed since it was introduced. Historically, this API was intended for debugging and diagnostics and little attention was paid to its accuracy, but as its use grew the accuracy issues became an increasing problem. It was traditionally implemented using the CPU TimeStamp Counter with no alteration. This counter is accurate, but was designed to count the number of elapsed CPU cycles, not to count time. It came to be used to track time periods because it is by far the highest resolution timer in the system. Its issues include:

  1. Different threads on different CPUs get conflicting values because the TSCs are not guaranteed to be synchronized between processors. The CPU TSCs are synchronized in symmetric multiprocessing systems, but not in asymmetric multiprocessing systems. This can result in TSC values jumping backwards. The most visible example of this is negative ping times, described here.
  2. Clock frequency can change as the processor speeds up or slows down in response to demand in order to reduce power requirements. Consequently, the clock tick interval is not constant under all conditions, but it is constant if these power management technologies are disabled, generally in the BIOS. This was commonly seen with the introduction of AMD's Cool'n'Quiet technology.
  3. The TimeStampCounter instruction can be disabled in the BIOS. This is no longer done in practice, but the capability remains.
  4. Out-of-order execution can cause RDTSC to be executed before or after its location in the code stream.

These issues do not reflect bugs in the CPU TSC, but rather known limitations and restrictions on usage, as it was intended as a CPU cycle counter. For example, many game developers worked around issue #1 by calling SetThreadAffinityMask to fix their rendering thread to a particular CPU, and worked around #2 by re-calling QueryPerformanceCounter frequently. These issues have since been addressed by both Intel/AMD and Microsoft. Issue #4 can be resolved by calling a serializing instruction, like cpuid, to prevent out-of-order dispatch of preceding instructions. The CPU manufacturers AMD and Intel both changed the clock used by RDTSC so it runs at a fixed frequency regardless of the CPU speed or power state. Microsoft addressed this in Windows Vista by changing the implementation to vet hardware, work around known bugs, and provide guaranteed accuracy over frequency. Consequently, on most multiprocessor systems today (2012), the performance counter actually uses the ACPI clock instead of the TSC, providing microsecond resolution. Note that due to the ACPI spec, the actual hardware now used for this is the HPET, not the PM clock. The most important consequence of this fix is that the API is considerably slower than it once was, but it is still so fast that this is normally only a concern for kernel developers. For example, kernel event tracing now adds 2% to CPU utilization solely due to timestamping when processing 20,000 events/second.
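Whatever clock backs the counter, callers turn a pair of counter reads into a duration by dividing by the reported frequency. The sketch below shows that conversion with plain integers; the function name `ticks_to_microseconds` is hypothetical, and clamping a negative delta to zero is one defensive choice against the backwards jumps described in issue #1, not something the API does for you.

```python
def ticks_to_microseconds(start_ticks, end_ticks, frequency_hz):
    """Convert a performance-counter delta to whole microseconds."""
    delta = end_ticks - start_ticks
    # Assumption: treat a backwards step as "no time elapsed" instead of
    # reporting a negative duration.
    delta = max(delta, 0)
    # Multiply before dividing so integer precision is kept.
    return delta * 1_000_000 // frequency_hz

ACPI_HZ = 3_579_545  # the ACPI clock frequency quoted earlier on this page

print(ticks_to_microseconds(1_000, 1_000 + ACPI_HZ, ACPI_HZ))  # prints 1000000
print(ticks_to_microseconds(1_000, 1_358, ACPI_HZ))            # prints 100
print(ticks_to_microseconds(1_358, 1_000, ACPI_HZ))            # prints 0
```

At roughly 3.58 MHz one tick is about 0.28 microseconds, which is where the "microsecond resolution" figure comes from.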

A secondary consequence of these fixes is that there is no longer a user-mode API to obtain the CPU TSC clock. This was also addressed by adding QueryThreadCycleTime, QueryProcessCycleTime, QueryIdleProcessorCycleTime. For more information, see this good writeup of the QueryPerformanceCounter saga, just one of many.


C# Notes

This section discusses timer concerns when used in C#. The .NET environment introduces a significant confound to very short waits in the form of its garbage collection service. This service can preempt a .NET thread at any time to perform garbage collection. While its performance is quite good, it is designed to freeze an executing thread at will to perform garbage collection. Consequently, there is no way to guarantee .NET code will execute at any specific time. This is not an issue for general use, as the GC's impacts are generally unnoticeable, but for any application where milliseconds count, managed code should not be used.

Using time* APIs from C#

To use the time* APIs from C#, you must maintain a reference to the delegate passed to timeSetEvent to prevent it from being garbage collected while the timer may still call it. This is actually a .NET/native code interop issue that is not specific to the time* APIs. Note that the native stub for the delegate is always locked in memory and will compensate for relocation of the managed code. Also note that the argument should be pinned in memory.


C# Garbage Collector Performance Notes

The C# garbage collector is implemented as a soft real-time system, meaning (simplified) that 95% of the time, allocations and frees take the expected time. Unlike the C RTL heap or COM's reference counting, .NET memory allocation operates on essentially a mark-and-sweep basis: mark memory as in use or not in use by objects and periodically sweep through to kill off dead objects, releasing their memory and relocating allocations to eliminate dead space. The garbage collector separates memory allocations into three "generations":

* Generation 0 holds the newest and normally short-term allocations

* Generation 1 holds allocations leftover since the last run

* Generation 2 holds the longest-running allocations

As items last longer, they are migrated from generation 0 -> 1 -> 2. The garbage collector does not need to process all three generations during garbage collection. It can perform a fast collection by processing only generations 0 and 1, which can often run in just a few milliseconds. The significant pauses associated with garbage-collected applications only occur when collecting through to generation 2.

For more details on profiling execution of the .NET garbage collector, see http://msdn.microsoft.com/en-us/library/ee851764.aspx.


Notable Solutions

1. Google dealt with this problem during the development of Chrome.  Their solution was the same as many others, to call timeBeginPeriod and use millisecond-resolution timers.  However, due to poor JavaScript code, they found it necessary to limit Chrome's timers to 4ms resolution.  See Google's timer work in Chrome.

2. A demonstration of a microsecond-resolution notifiable timer solution has been posted at http://www.windowstimestamp.com/ as of 2012 June. This implementation appears to use a service running at real-time priority. The source has not been posted as of 2012 June.


Other References

QueryPerformanceCounter performs poorly on XP and older

Intel's original CPU TSC Counter guidance for use in game timing.

System clock could run fast on Windows XP and older after calling timeBeginPeriod, fixed in Vista.
Good academic writeup of issues leading to development of HPET, circa 2007.
Old but good graphical depiction of time to call the Windows user-mode timing services.
Using time* from C#.
Microsoft's 2005 Game Timing on Multicore Processors notes about QueryPerformanceCounter.
Good overview of audio/video synchronization issues with multimedia playback that prompted development of the HPET.
