High Performance Unix Logging (Part 1)

High Performance Unix Logging (Part 1)

Writing efficient logging for performance critical and real time Unix applications.

Most applications have some mechanism for logging, which is way of saving information about the progression of the program. Basic applications just write text to stdout, using printf() or some other equivalent. Larger applications have more sophisticated mechanisms, which usually involve writing information to a file(s). For example, the Apache webserver has a log file of all the clients that connect and a log file for errors.

The problem with logging is that it isn’t free. High verbosity is helpful for debug, but it takes system resources to process log messages and store them, and it can really add up. First, the human readable text for the log message has to be formed in memory. Next, it is usually passed to the operating system, via a system call. This causes a context switch and the executing thread to block while the operating system kernel does something with it (such as writing the data to a file, or passing it to the shell process to be rendered as text). When there are multiple threads logging to the same place, they must use locking in order to take turns, to prevent from overwriting each other’s messages. This all takes system resources (processing power, memory, disk i/o) away from the application doing real work.

For most applications, particularly those that interact with a user, this is an acceptable tradeoff. Computers are fast, humans are slow, and if a program pauses for a couple of milliseconds here and there to handle some logging, the user won’t perceive it at all. However, there are some cases where this doesn’t cut it:

  • When logging is extremely verbose
  • When applications need all the resources they can get
  • When applications are sensitive to timing, such as real-time applications

For the rest of this post, I’m going to walk through designing a simple non-traditional logging system in C to handle these kinds of cases.

Designing to minimize intrusiveness

The key for this logging system is procrastination. The processor should spend as little time as possible working on logging while in a critical section, and save the hard stuff at a more opportune time. A big source of savings is avoiding system calls (e.g. writing to a file or printing to the screen), since those require pausing the current thread while the Linux Kernel runs. I will also avoid any unnecessary string processing and locking.

Each log message will be stored in a small C struct. In order to use a familiar printf() like interface, the struct will hold a format string and some arguments. In addition, it will also store the current processor cycle count, to act as a timestamp for each log message (many processor architectures have a way to get this without making a system call).

typedef struct
{
    const char * fmt_str; // Holds the printf format string
    unsigned long long cycle_count; // Holds CPU counter
    uint64_t args[2]; // Holds the arguments for printf()
} logEntry_t;

To store multiple log messages, they will use a circular buffer. Once the buffer fills up, it will overwrite the oldest message. The size just needs to be large enough to have sufficient history to be useful when/if something goes horribly wrong. I also recommend to keep the size as a power of two, to keep the math simple for the processor.

#define NUM_LOG_ENTRIES 1024
typedef struct 
{
    unsigned next_index;
    logEntry_t entries [NUM_LOG_ENTRIES];
} logInfo_t;

If there are multiple threads in the system, each thread will have its own circular buffer. This keeps us from having to do any locking to share resources. However, this requires each thread to keep track of which buffer belongs to it.

logInfo_t g_logs[MAX_NUM_THREADS];

Next, comes a function to write new messages to the log buffer.

void log_printf(logInfo_t * log_ptr, const char * fmt, uint64_t arg0, uint64_t arg1)
{
    unsigned cur_index = log_ptr->next_index;

    // Write the log entry
    log_ptr->entries[cur_index].fmt_str = fmt;
    log_ptr->entries[cur_index].cycle_count = __builtin_readcyclecounter();
    log_ptr->entries[cur_index].args[0] = arg0;
    log_ptr->entries[cur_index].args[1] = arg1;

    // Update the next index
    // Use a store-release to ensure the data gets written out before updating the index
    // Only necessary if another thread will be looking at this thread's logs
    __atomic_store_n(&log_ptr->next_index, (cur_index+1) % NUM_LOG_ENTRIES, __ATOMIC_RELEASE);
}

Finally, there needs to be a function to extract the logs. To read the ring buffer, it must start on the entry one past the next_index, and works its way around the circular buffer, processing messages from oldest to newest. The entry that next_index points to should be skipped, since that may be an invalid state. I have provided a simple function which will print all the messages to stdout.

void dump_logs(logInfo_t * log_ptr)
{
    unsigned end_index = log_ptr->next_index;
    unsigned index = (end_index + 1) % NUM_LOG_ENTRIES;

    // Write logs out from oldest to newest
    while (index != end_index)
    {
        const char * fmt = log_ptr->entries[index].fmt_str;
        if (fmt)
        {
            printf("0x%016llx: ", log_ptr->entries[index].cycle_count);
            printf(fmt, log_ptr->entries[index].args[0], log_ptr->entries[index].args[1]);
        }
        index = (index + 1) % NUM_LOG_ENTRIES;
    }
}

Checking the assembly

At this point, I usually like to check what I wrote by looking at the generated assembly. It helps give me an idea how the code will run on a processor, and serves as a nice sanity check. I spent an hour or trying different variations of my logging function, to get it to behave the way I wanted it.

Here it is in aarch64 assembly (compiled with LLVM 4.0 and -O3):

0000000000000000 <log_printf>:
   0:   b9400008        ldr     w8, [x0]            // Read next_index into w8
   4:   d53b9d0a        mrs     x10, pmccntr_el0    // Read the cycle count into x10
   8:   8b081409        add     x9, x0, x8, lsl #5  // Compute entry address in x9
   c:   11000508        add     w8, w8, #0x1        // Calculate new next_index value
  10:   12002508        and     w8, w8, #0x3ff      // Handle circular buffer wraparound
  14:   a9018d22        stp     x2, x3, [x9,#24]    // Write fmt and cycle_count to entry
  18:   a900a921        stp     x1, x10, [x9,#8]    // Write arg0 and arg1 to entry
  1c:   889ffc08        stlr    w8, [x0]            // Write new next_index value to memory
  20:   d65f03c0        ret                         // Return from the function

I tend to find aarch64 easier to read than x86_64, but I’m probably biased. If you aren’t familiar with aarch64, 64-bit registers start with x and 32-bit registers start with w. Function arguments are passed in the lower number registers:

  • x0 register holds the pointer to the logInfo_t struct
  • x1 register holds the pointer to the format string
  • x2 and x3 registers hold the printf arguments (arg0 and arg1 respectively)

Putting it all together

I wrote a proof-of-concept program to demonstrate the code. It starts some number of POSIX threads, which each do a billion 128-bit integer multiplication calculations, logging the value at each step. Once all the threads finish, the main thread prints out the logs for each thread. Since the logging circular buffer only holds 1024 entries, and one of those entries is reserved, this means that only the 1023 most recent messages are printed.

Below is the code if you want to run it yourself

// logging.h

#ifndef LOGGING_H
#define LOGGING_H

#include <stdint.h>

#ifndef NUM_LOG_ENTRIES
#define NUM_LOG_ENTRIES 1024
#endif

typedef struct
{
    const char * fmt_str; // Holds the printf format string
    uint64_t cycle_count; // Holds CPU counter
    uint64_t args[2]; // Holds the arguments for printf()
} logEntry_t;

typedef struct
{
    unsigned next_index;
    logEntry_t entries [NUM_LOG_ENTRIES];
} logInfo_t;

static inline void log_printf(logInfo_t * log_ptr, const char * fmt, uint64_t arg0, uint64_t arg1)
{
    unsigned cur_index = log_ptr->next_index;

    // Write the log entry
    log_ptr->entries[cur_index].fmt_str = fmt;
    log_ptr->entries[cur_index].cycle_count = __builtin_readcyclecounter();
    log_ptr->entries[cur_index].args[0] = arg0;
    log_ptr->entries[cur_index].args[1] = arg1;

    // Update the next index
    // Use a store-release to ensure the data gets written out before updating the index
    // Only necessary if another thread will be looking at this thread's logs
    __atomic_store_n(&log_ptr->next_index, (cur_index+1) % NUM_LOG_ENTRIES, __ATOMIC_RELEASE);
}

#endif // LOGGING_H
// logging_test.c

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <inttypes.h>

#include "logging.h"

#ifndef MAX_NUM_THREADS
#define MAX_NUM_THREADS 2
#endif

logInfo_t g_logs[MAX_NUM_THREADS];

void dump_logs(logInfo_t * log_ptr)
{
    unsigned end_index = log_ptr->next_index;
    unsigned index = (end_index + 1) % NUM_LOG_ENTRIES;

    // Write logs out from oldest to newest
    while (index != end_index)
    {
        const char * fmt = log_ptr->entries[index].fmt_str;
        if (fmt)
        {
            printf("0x%016llx: ", log_ptr->entries[index].cycle_count);
            printf(fmt, log_ptr->entries[index].args[0], log_ptr->entries[index].args[1]);
        }
        index = (index + 1) % NUM_LOG_ENTRIES;
    }
}

void * thread_work(void * ptr)
{
    logInfo_t * log_ptr = ptr;

    unsigned __int128 value = (uintptr_t)ptr;

    for (unsigned i = 0; i < 1000000000; i++)
    {
        value = value * (value + 1); // Random calculations
        log_printf(log_ptr, "Result = 0x%016"PRIx64"%016"PRIx64"\n", (uint64_t)(value>>64), (uint64_t)value);
    }
    return NULL;
}

int main(void)
{
    // Start all the threads
    pthread_t thread_ids[MAX_NUM_THREADS];
    for (unsigned i = 0; i < MAX_NUM_THREADS; i++)
    {
        if (pthread_create(&thread_ids[i], NULL, &thread_work, &g_logs[i]))
        {
            printf("Failed to start thread %u\n", i);
            exit(-1);
        }
        printf("Started thread %u\n", i);
    }

    // Wait for the threads to finish
    for (unsigned i = 0; i < MAX_NUM_THREADS; i++)
    {
        if (pthread_join(thread_ids[i], NULL))
        {
            printf("Failed to join thread %u\n", i);
            exit(-1);
        }
        printf("Thread %u finished\n", i);
    }

    // Dump the logs
    unsigned long long start = __builtin_readcyclecounter();
    for (unsigned i = 0; i < MAX_NUM_THREADS; i++)
    {
        printf("Logs for thread %u:\n", i);
        dump_logs(&g_logs[i]);
        printf("\n");
    }
    printf("Took 0x%016llx cycles to dump\n", __builtin_readcyclecounter() - start);
    return 0;
}

To compile, I used LLVM 4.0 with the following flags:

$ clang-4.0 -O3 -Wall -Werror -Wextra -lpthread logging_test.c -o logging_test

Results

Here is an excerpt from the output of the program, which I ran on Ubuntu 16.04 with an old AMD Athlon 64 X2 4400+ processor (took about 6.5 seconds):

Started thread 0
Started thread 1
Thread 0 finished
Thread 1 finished
Logs for thread 0:
0x0001a5659a5281e0: Result = 0xa8de4407e966db5e20bd8c2a892e5270
0x0001a5659a5281ef: Result = 0x3c418e765f8fb5bbd0fdb58823fa4370
0x0001a5659a5281fe: Result = 0x616ae38cd3f230e13eb01fa08c7e1470
0x0001a5659a52820d: Result = 0x303ee77f99b2a3ba0c56f0132c5fc570
0x0001a5659a52821c: Result = 0x1a2ea6347438e9f863b14e65cdc55670
0x0001a5659a52822b: Result = 0x0cc7b39fe8d7a094ce5967535354c770
0x0001a5659a52823a: Result = 0x42f954a7560ac42d16ceef5970341870
0x0001a5659a528249: Result = 0xd083ccef298882582616ad3560094970
...
Logs for thread 1:
0x0001a5659bec082f: Result = 0x0e2a3c03f7d6d797e592a1245ac54cf8
0x0001a5659bec083e: Result = 0x87b397d8efcb783d32ef891247997d38
0x0001a5659bec084d: Result = 0xfe38031678a3dc5f4f21946e41c93978
0x0001a5659bec085c: Result = 0x353a88c07ca99655745984e87d1fe1b8
0x0001a5659bec086b: Result = 0x5080a8479555936b0e0088ce9eb4d5f8
0x0001a5659bec087a: Result = 0x59ec72f3976337861ec41bca564b7638
0x0001a5659bec0889: Result = 0xf6f51abbd7912b887ba145d671b32278
0x0001a5659bec0898: Result = 0x23c6239e6b748e660123383d9a273ab8
...
Took 0x00000000010482c6 cycles to dump

If we compare the left most numbers from the log, which are the cycle counts (RDTSC instruction on x86) at the time the message was logged, we see that each pass through the loop requires 15 cycles. My processor runs at 2.3 GHz, which means that 15 cycles is about 6.5 ns.

I also measured the amount of time it took to dump all the log messages at the end, which was 0x10482c6 cycles (about 17 million cycles, or 7.4 ms at 2.3 GHz). In this time, it called printf() 4098 times (2 threads * (1023 * 2 + 3)), which means it takes an average of 4166 cycles per printf(), or 1811 ns. I wouldn’t call this slow, and this is acceptable for many applications, but it is still ~280 times slower than our new logging system.

Next Steps

There are many ways this simple logging system can be improved, which I’ll mostly leave as an exercise to the reader. One area I want to delve in deeper is alternative ways to extracting the logs from memory, so that they are human readable. What if you could check the logs whenever you needed them, instead of at the end of the program?

In Part 2, we are going to look at extracting log messages using a debugger.