C/C++ Oddities You Need to Know

Memory placement, virtual memory and padding

Memory is one of the biggest sources of oddities and pitfalls in C-like languages. You need a detailed understanding of what it is and what you are doing with it if you don't want to shoot yourself in the foot.

Before we begin, here's an example of what we are talking about:


int arr[5] = {1, 3, 5, 7, 9};
int x = 65;

for(int i = 0; i < 6; i++) // runs 6 times, one past the end of arr!
{
    printf("%d ", arr[i]);
}
// Possible output (undefined behavior, not guaranteed): 1 3 5 7 9 65
        
Wait, what?
Shouldn't this be a segmentation fault after i = 4?
Well, maybe, maybe not, and mostly not.
This is something I had a hard time wrapping my head around as a newbie. Reading arr[5] is undefined behavior: C does no bounds checking, so the program simply reads whatever sits in memory right after the array. Here the compiler happened to place x directly after arr on the stack, so arr[5] lands on x, which has the same type as the array's elements. That layout is not guaranteed; another compiler or optimization level may print garbage or crash instead. Errors like this are hard to track down, which makes it extremely important to keep track of array lengths. It's the programmer's responsibility anyway.
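
One way to make that responsibility explicit is to route reads through a helper that knows the array's length. The sketch below is only an illustration; the name checked_get and its signature are made up for this example and are not part of any standard library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: copies arr[index] into *out only when index is
   within the len elements of arr. Returns false instead of reading
   past the end of the array. */
bool checked_get(const int *arr, size_t len, size_t index, int *out)
{
    assert(arr != NULL && out != NULL);
    if (index >= len)
        return false;
    *out = arr[index];
    return true;
}
```

With the array from the example, checked_get(arr, 5, 5, &value) returns false and leaves value untouched, rather than silently handing back whatever happens to live after the array.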

Now, certain good programming practices can help avoid these issues:


// 1: Use an ARR_SIZE macro
// Note: This only works on real arrays whose definition is in scope, like int arr[5] or int arr[] = {...},
// not on pointers (which includes arrays passed as function parameters)
#define ARR_SIZE(array) \
    (sizeof(array) / sizeof((array)[0]))
// Example: for(size_t i = 0; i < ARR_SIZE(arr); i++)


// 2: Store the lengths in constants or macros when defining the array
const int arrlen = 5; // C++ only: in C a const int is not a constant expression
int arr[arrlen] = {...};

// OR

#define ARRLEN 5
int arr[ARRLEN] = {...};
#undef ARRLEN // if necessary
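
The reason ARR_SIZE stops working inside functions is that an array passed as an argument decays to a pointer, so sizeof no longer sees the whole array. A common workaround, sketched here with a hypothetical sum_ints helper, is to compute the length where the array is defined and pass it along explicitly.

```c
#include <assert.h>
#include <stddef.h>

#define ARR_SIZE(array) (sizeof(array) / sizeof((array)[0]))

/* Hypothetical helper: values is only a pointer in here, so the
   caller has to supply the element count. */
long sum_ints(const int *values, size_t count)
{
    assert(values != NULL || count == 0);
    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

At the place where int arr[] = {1, 3, 5, 7, 9}; is defined, the call would look like sum_ints(arr, ARR_SIZE(arr)).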
        

Array operators

Reversed indexing


int arr[5] = {1, 3, 5, 7, 9};
for(int i = 0; i < 5; i++)
{
    printf("%d ", i[arr]); // :0
}
// Output:
// 1 3 5 7 9
        
Okay, now it seems even the array indexing operator is not safe against typos.
Even worse: this is a feature, not a bug!
Understanding what's happening here requires a simple but precise picture of how array indexing works in C/C++. When you write arr[i], C actually reads it as *(arr + i). In other words, indexing is converted to pointer arithmetic followed by a dereference, with no bounds checking involved. So i[arr] simply means *(i + arr), and since addition is commutative, that is the same as *(arr + i). They are in fact identical! You can even write things like 0[arr] with no problem whatsoever. However cool it looks, it's bad practice and should be avoided.
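
To make the equivalence concrete, here is a small sketch that compares the addresses produced by all four spellings. The function name index_forms_agree is invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when all four spellings refer to the same element.
   The caller must make sure i is a valid index into arr. */
bool index_forms_agree(const int *arr, size_t i)
{
    assert(arr != NULL);
    const int *usual    = &arr[i];   /* normal indexing */
    const int *reversed = &i[arr];   /* reversed indexing */
    const int *pointer  = arr + i;   /* explicit pointer arithmetic */
    const int *swapped  = i + arr;   /* same addition, operands swapped */
    return usual == reversed && reversed == pointer && pointer == swapped;
}
```

For any valid index this should return true, since the language defines the subscript operator in terms of that addition.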

This is just the beginning...

Reversed re-indexing

The same construct can also be chained: fetch the value at an index, then use that value as an index into the array again.
In simple words: i[arr][arr]. Here is a code example that breaks it down:

int arr[] = {1, 3, 5, 7, 9};
int i = 1;
printf("%d\n", i[arr][arr]); // Outputs: 7
// Why? Let's break it down
// i[arr] = arr[i] = 3
// 3[arr] = arr[3] = 7

Reversed indexing in 2D arrays

This gets even weirder with 2D arrays. Here too we can use reversed indexing, i[arr], to access the inner array and then index the rest normally, as i[arr][j], where i and j are the row and column indexes of the 2D array. The logic is the same as before: i[arr] retrieves row i of the 2D array, and [j] then picks the element from that row. Here's an example to demonstrate this:

int arr[2][3] = {{1, 2, 3}, {4, 5, 6}};
int i = 1;
int j = 2;
printf("%d\n", i[arr][j]); // Outputs: 6
// Why? Let's break it down
// i[arr] = arr[i] = {4, 5, 6}
// {4, 5, 6}[j] = 6

Nested reversed indexing

We can even nest these accesses to make it worse. Take a 2D array arr, access the inner array with reversed indexing, and then wrap that in another reversed index to pick the element. For example:

int arr[2][3] = {{1, 2, 3}, {4, 5, 6}};
int i = 1;
int j = 2;
printf("%d\n", j[i[arr]]); // Outputs: 6
// Why? Let's break it down
// i[arr] = arr[i] = {4, 5, 6}
// j[{4, 5, 6}] = 6
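
All of these 2D spellings boil down to the same double pointer arithmetic, *(*(arr + i) + j). The sketch below checks that they agree; grid_forms_agree is a made-up name, and the fixed row width of 3 is an assumption chosen to match the arrays in the examples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when every spelling of grid[i][j] yields the same value.
   The caller must make sure i and j are valid indexes. */
bool grid_forms_agree(int (*grid)[3], size_t i, size_t j)
{
    assert(grid != NULL);
    int usual    = grid[i][j];          /* normal 2D indexing */
    int expanded = *(*(grid + i) + j);  /* what the compiler actually sees */
    int reversed = i[grid][j];          /* reversed outer index */
    int nested   = j[i[grid]];          /* both indexes reversed */
    return usual == expanded && expanded == reversed && reversed == nested;
}
```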

Array indexing on "literal" strings

Characters in a string can already be indexed with the usual operator, as in str[i]. But did you know that you can do that on a literal too? It works because a string literal is itself an array of char, so it can be indexed like any other array.
Example:

const char *str = "Hello, World!"; // const: string literals must not be modified
// Makes sense and works as expected
printf("%c\n", str[7]); // Outputs: W
// But this works too!
printf("%c\n", "Hello, World!"[7]); // Outputs: W
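
This trick does have a practical use: a string literal can serve as a tiny lookup table. The sketch below, with a hypothetical hex_digit helper, turns a value from 0 to 15 into its hexadecimal digit by indexing straight into a literal.

```c
#include <assert.h>

/* Hypothetical helper: maps a value in 0..15 to its hexadecimal digit
   by indexing directly into a string literal. */
char hex_digit(unsigned value)
{
    assert(value < 16);
    return "0123456789abcdef"[value];
}
```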

Memory padding


struct Complex
{
    int x;
    char y;
    long long z;
};
        
Here's a question for you -
What will be the offsets for the members in this structure?
Well, here we go,
int is 4 bytes long, char is 1 byte long and long long is 8 bytes long, so by simple math we can say the offsets are x = 0, y = 4 and z = 5.
Right? Well unfortunately no.
Again - It's a feature, not a bug
C and friends are built for speed and optimization, and will do whatever is necessary to meet that goal. When calculating offsets, remember that the CPU reads memory most efficiently in word-sized chunks at word-aligned addresses, which makes it a bad idea to simply store members one after another in memory. The following figure illustrates the problem with storing them contiguously:
[Figure: Packed struct issue]
As you can see, z cannot be read from offset 5 in one go, because our computer reads 4 bytes at a time from 4-byte-aligned addresses (it's a 32-bit CPU). The value has to be fetched in pieces and stitched together with extra shift operations. So the compiler instead adds padding so that each member starts at an address that suits its type (a multiple of 4 for z on our 32-bit system), making the layout look like this:
[Figure: Padded struct example, the actual structure mapping]
Each member can now be read with a single aligned access.
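
Rather than guessing, you can ask the compiler for the real layout with the standard offsetof macro from <stddef.h>. The sketch below prints whatever the current compiler and target chose; print_complex_layout is a name invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct Complex
{
    int x;
    char y;
    long long z;
};

/* The first member always starts at offset 0; the rest depend on the ABI. */
static_assert(offsetof(struct Complex, x) == 0, "x starts the struct");

/* Prints the layout chosen by the current compiler and target. */
void print_complex_layout(void)
{
    printf("x at %zu, y at %zu, z at %zu, sizeof = %zu\n",
           offsetof(struct Complex, x),
           offsetof(struct Complex, y),
           offsetof(struct Complex, z),
           sizeof(struct Complex));
}
```

On common 64-bit desktop targets this typically reports offsets 0, 4 and 8 with a total size of 16, but the exact numbers are decided by the platform's ABI, so treat them as something to measure rather than assume.
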
We can, however, force the compiler to use packed memory at the cost of slower reads.
Here's how to do it with GCC (the attribute is a GNU extension, also accepted by Clang):

struct Complex
{
    int x;
    char y;
    long long z;
} __attribute__((packed));
        
The __attribute__((packed)) attribute tells the compiler to lay out the structure's members contiguously, without padding.
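
To confirm that packing really removes the padding, the layout can be checked at compile time. This is a sketch assuming GCC or Clang; the struct is renamed PackedComplex here only to keep it apart from the padded version, and packed_get_z is a hypothetical accessor.

```c
#include <assert.h>
#include <stddef.h>

/* GNU extension, accepted by GCC and Clang; other compilers differ. */
struct PackedComplex
{
    int x;
    char y;
    long long z;
} __attribute__((packed));

/* With packing there is no padding, so each member follows the previous one. */
static_assert(offsetof(struct PackedComplex, y) == sizeof(int),
              "y directly follows x");
static_assert(offsetof(struct PackedComplex, z) == sizeof(int) + sizeof(char),
              "z directly follows y");
static_assert(sizeof(struct PackedComplex) ==
                  sizeof(int) + sizeof(char) + sizeof(long long),
              "no trailing padding");

/* Reads the unaligned member through the struct, so the compiler can
   emit whatever access sequence the target needs. */
long long packed_get_z(const struct PackedComplex *p)
{
    assert(p != NULL);
    return p->z;
}
```

Accessing z through the struct, as packed_get_z does, is the safe route; taking a plain long long pointer to a packed member and dereferencing it can misbehave on targets that do not tolerate unaligned access.
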
Also, ordering the members from largest to smallest helps avoid padding between members. Here's how you can do it:

struct Complex
{
    long long z;    // 8 bytes largest
    int x;          // 4 bytes, next largest
    char y;         // 1 byte smallest
};
// now the offsets are:
// z = 0
// x = 8
// y = 12
// all of them are contiguous, though the compiler may still pad the end of the struct
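
The saving from reordering is easiest to see when small members are scattered between large ones. The two structs below are hypothetical, made up for this comparison, and hold exactly the same members in a different order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structs for illustration; not from the example above. */
struct Scattered
{
    char a;        /* typically followed by padding before b */
    long long b;
    char c;        /* typically followed by trailing padding */
};

struct Sorted
{
    long long b;   /* largest member first */
    char a;
    char c;        /* a and c share the space after b */
};

/* Holds on mainstream ABIs; the C standard itself does not promise it. */
static_assert(sizeof(struct Sorted) <= sizeof(struct Scattered),
              "sorting members by size should not grow the struct");

/* Bytes saved by reordering on the current target. */
size_t bytes_saved_by_sorting(void)
{
    return sizeof(struct Scattered) - sizeof(struct Sorted);
}
```

On common 64-bit targets struct Scattered typically takes 24 bytes and struct Sorted 16, though the exact sizes depend on the platform's alignment rules.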
        

More oddities coming soon...