Fuzzing of self-referencing structures

In most sources devoted to file format fuzzing a file structure is described as a set of fields, where a field is a triplet of an offset, format and value. Fields can be logically organized in larger structure elements. Fields or subsets of fields can refer to each other, e.g. a field value can contain an offset or more sophisticated characteristic of another structure element. Another common variant is when presence of a field or a set of fields depends from another structure element existence or absence.

To my pleasant surprise the qcow2 format has self-referenced structures. There is a two-level structure that contains information about all allocated clusters in the image including ones occupied by the structure itself.

A small informational digression about this structure.

The low level — refcount block — contains information about some amount of adjacent clusters. This amount depends on a cluster size and size of one refcount entry, but it's at least several hundreds of clusters. Let's call it Nblock. Refcount blocks are ordered, i.e. the first refcount block describes first Nblock clusters, second one — next bunch of clusters and so on. If none of Nblock clusters for this refcount block is allocated, the block is not created. The size of each refcount block is exactly one cluster.

The high level — refcount table — contains references to all existing refcount blocks. These references are ordered, i.e. the first refcount table entry relates to the first refcount block, the second one — to the second refcount block, etc. If a refcount block is not created, the corresponding refcount table entry will be empty. So the size of the refcount table is defined by the maximum index of allocated refcount blocks and can take several clusters.

Let's get back to the fuzzer.

The goal is to represent the refcount table and blocks as a set of fields. If image and cluster sizes vary from generation to generation, it's impossible to predefine amount and positions of refcount blocks as well as size and position of the refcount table. So the generation process has to be iterative.

The main problem here is that an addition of a refcount block may cause a growth of the refcount table to the next cluster, that in its turn may require an additional refcount block and so on.

The most simple solution here is creation of a self-describing refcount block in the end of the image. In other words, Nblock clusters are appended to the image and all refcount blocks, including the refcount block describing these last clusters, and the refcount table are located in the last Nblock of clusters.

But this solution has several drawbacks. It doesn't work in some extreme cases, e.g. a small cluster size and a large image size may require more than Nblock refcount blocks. Test variability is very important property and it's not good to fix location of the refcount metadata for all generated images.

Let's improve the solution.

Allocation of refcount blocks:

  1. Calculate the number of refcount blocks required for all non-empty clusters.
  2. Allocate clusters for refcount blocks in any empty space of the image, e.g. take a random sample from indices of empty clusters. If there is no empty space left, append necessary clusters to the end of the image.
  3. Check if these newly allocated clusters require additional refcount blocks. If they do, go to the step #2, else exit.

As far as each step requires less allocations than the previous one, the algorithm converges at most when the empty space in between allocated clusters ends and all new clusters start being appended.

Allocation of the refcount tables:

  1. Calculate the refcount table size (Stable) after allocation of all refcount blocks.
  2. Find a sequence of adjacent empty clusters with the length equal to Stable + a tolerance on the refcount table growth. In my case, one cluster was an optimal tolerance.
  3. Allocate first Stable clusters of the sequence defined on the step #2.
  4. If newly allocated clusters require new refcount blocks, then use the algorithm for the refcount block allocation pointed above, but on every iteration check whether the refcount table should be extended to the next cluster. If it should, add this cluster to the list of newly allocated ones.

Now as all clusters are allocated, all refcount structures can be easily generated.

Useful links: