Resmack: Part 6: Stateful & Dynamic Grammars

January 6, 2021 10-minute read

grammar • fuzzing • resmack

Note: This post is part of a series on developing resmack. See the resmack tag for all posts in the series.

← Part 5 Grammar Mutation and Recursion

Static Grammars

To me, the most common type of grammar-based fuzzers use static grammars. Static grammars can be incredibly fast at generating structured data. See Brandon Falk’s (@gamozolabs) FZero fuzzer and the F1 Fuzzer from the Building Fast Fuzzers paper. However, as fast as static grammars can be, they do have their limitiations.

Static Grammar Limitations

Let’s imagine that you want to fuzz a programming language. Ignoring looking for bugs in the lexer/parser, you’d need to:

Generate 100% valid code
- Syntax errors cause early exits
Be able to keep track of [the assumed]^‡ state of the program
- variables (additional, run-time added options for existing rules)
- identifiers (functions, …)
- scoping

_{^‡We don’t actually know if the generated data is valid, but
feedback-driven fuzzing with grammars should weed out invalid data}

For #1, grammars seem to be an obvious solution. What about #2? Can this be done generically, without coding towards a specific paradigm/language?

This has been a goal of mine for a long time. In the past I have always created a basic grammar data-generator, and then wrapped it with language-specific constructs to keep track of scoping, in-scope identifiers, etc.

Limitations Example

Often when generating javascript to test browser APIs, references to valid instances of objects are only obtainable after several layers of promises and callbacks.

For example, when using the CacheStorage API, a promise needs to be used to gain access to a Cache object:

1
2
3


window.caches.open("named-cache").then(function(cache_obj) {
    cache_obj.add("https://narly.me");
});

A few steps need to be taken before we’re able to call the add function on the cache object:

Get a reference to a CacheStorage object
Call open on the CacheStorage object and call the then function of the returned promise with a callback function as a parameter
Use the argument in the callback function as a reference to a Cache object
Call add on the Cache object

This makes sense for us as humans, but this isn’t how grammars construct data.

Resmack would first decide - “I’m going to generate a random statement”. The statement the grammar might choose would be to call the add function on a Cache object. Then it works backwards to resolve everything that it needs in order to generate a valid Cache.add() statement.

In resmack, you might try to implement it like this:

rules
  ->AddRule("CacheStorage", V("window.caches"))
  ->AddRule("CachePre", AND(
    REF("CacheStorage"), V(".open('named-cache').then(function(cache_obj) {\n"))
  ))
  ->AddRule("CachePost", V("});")
  ->AddRule("Cache", V("cache_obj"))
  ->AddRule("Cache.add", AND(REF("Cache"), V(".add('https://narly.me')")));

With the above grammar, you’d have to know in advance that CachePre and CachePost need to be built before and after the Cache.add rule in order for cache_obj to be valid:

Having to hard code pre/post dependencies, as well as remembering scoping information, does not scale well. A more robust method is needed to let:

Pre/post dependencies to be declared
Capture and store built results as a new value for an existing rule
Scoping of dynamic rule values

Resmack Solution

To accomplish this in resmack, several constructs needed to be created to make it possible to flexibly track scoping and variable/identifier values throughout the data generation process. Additionally, the grammar fuzzer’s internals needed to have particular characteristics to support the required constructs:

Construct: PRE & POST
- Structure: PreOutput, Output, Post-Build
Construct: STORE & SET
- Structure: Dynamic Rule Values
Construct: SCOPE_PUSH & SCOPE_POP
- Structure: Scoped Rule Sets

Each of the constructs mentioned above are in addition to “normal” constructs often found in grammar data-generators/fuzzers:


`AND`	`OR`	`OPT`	`REF`
`INT`	`STR`	`RAW`/`V`

Construct: `PRE` & `POST`

The PRE construct allows pre-dependencies to be built prior to building the actual output of the rule.

The POST construct allows post-dependencies to be built dynamically after the normal output of the rule is built. This is done lazily so that the effects of building the normal output will be able to influence the building of any post dependencies.

Using PRE and POST, we can modify the original resmack grammar for CacheStorage like so:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


rules
  ->AddRule("CacheStorage", V("window.caches"))
  ->AddRule("Cache", AND(
    PRE(AND(
      REF("CacheStorage"), V(".open('named-cache').then(function(cache_obj) {")
    )),
    V("cache_obj"),
    POST(V("});"))
  ))
  ->AddRule("Cache.add", AND(REF("Cache"), V(".add('https://narly.me')")));

Building the Cache.add rule produces:

window.caches.open('named-cache').then(function(cache_obj) {cache_obj.add('https://narly.me')});

which works great! For simple cases anyways.

For more complex cases it only raises more questions. What if:

You want to be able to run multiple statements inside of the promise callback?
You don’t want to hard code the parameter name to cache_obj?

Structure: PreOutput, Output and Post-Build

In order to support the PRE and POST constructs, resmack needed to have separate output streams for both pre-output, normal output, and a dynamic queue of post-items to build after all pre and normal actions have been performed. Having the two output streams separately defined along with a dynamic queue of post item references allows arbitrary nesting of PRE/POST items.

For example, what if a rule is referenced inside of a pre construct, PRE(REF("refd-rule")), but the refd-rule value itself has its own PRE and POST constructs? A solution is to simply shift the pre-output stream to be the normal output stream and create a new temporary pre-output stream.

Additionally, post-build items need to be tracked and built lazily after merging the pre-output and output streams.

The animation below demonstrates how this works:

Construct: `STORE` & `SET`

Up to this point the grammar is still static - no new rules are being added, no state is being persisted.

The STORE construct allows generated data to be stored as a value for a specific rule while still outputting the data to the output stream.

The SET construct allows generated data to become the only value option for a specific rule while still writing data to the output stream.

_{NOTE: I did end up adding two variants: ISTORE and ISET that do not write
data to the output stream and only modify the rule values themselves}

Structure: Dynamic Rule Values

In order to support STORE and SET, the internal structures that hold and define the rules and their values must be:

dynamic, not static (implies heap-allocated data structures)
allowed to have generation-time-only rule values^‡

_{^‡This has complications due to the preprocessing that
resmack does. See the next sub-section about pruning}

I’ve mentioned a few times online that due to my personal use cases with resmack, that I can’t do certain optimizations:

The stats definitely change based on the target complexity and the complexity of the grammar. There are some features in my fuzzer that prohibit it from being as fast as it could be, but I need them & add other optimizations for the cases where I don't need those features
— James Johnson (@d0c_s4vage) December 16, 2020

Dynamic rule values absolutely prohibit easy paths towards having a screaming-fast grammar fuzzer.

Using STORE, we can rewrite the grammar to have a dynamic name for the Cache rule instead of being hard coded to cache_obj:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


rules
  ->AddRule("CacheStorage", V("window.caches"))
  ->AddRule("Cache", AND(
    PRE(AND(
      REF("CacheStorage"),
        V(".open('named-cache').then(function("),
        STORE("Cache", STR(5, 6)),
        V(") {")
    )),
    REF("Cache"),
    POST(V("});"))
  ))
  ->AddRule("Cache.add", AND(REF("Cache"), V(".add('https://narly.me')")));

Building the Cache.add rule produces:

window.caches.open('named-cache').then(function(NmwTX) {NmwTX.add('https://narly.me')});

However, since the STORE("Cache", STR(5, 6)) construct simply adds a new rule value to the already-existing Cache rule, and we’re doing REF("Cache") twice (lines 10 & 13), you may also see repeated, nested promises.

The prettified, generated data and animation below demonstrate this:

window.caches.open('named-cache').then(function(VKEUr) {
    window.caches.open('named-cache').then(function(prDvV) {
        window.caches.open('named-cache').then(function(vjUoB) {
            prDvV.add('https://narly.me')
        });
    });
});

Pre-Processing & Pruning

Allowing rules to be defined only at generation-time complicates things.

First, a quick overview of the preprocessing that resmack does on rules:

Prior to generating data, resmack preprocesses all defined rules to determine:

Reference depth of all rules^‡
Reachability of all references to other rules

_{^‡The values with the shortest reference depths for each OR
are used to gracefully exit generation once a maximum recursion level has
been reached. See the
Controlling Recursion Depth in Grammars
and the Grammar Mutation & Recursion posts}

Using the reachability and reference depth, resmack automatically prunes rules and rule values when:

The reference depth is infinite (cyclic rule references with no exit)

Prune_________ ______No Prune
Rules are referenced that are unresolvable

Prune_________ ______No Prune

The preprocessing runs in a loop until no new adjustments are made to the loaded rules. With this in mind, the third pruning scenario should make sense:

The rule has no values

Loop 1___ Loop 2___ Loop 3

Rules whose values are only defined during generation will have no rule values and would normally be pruned by #3. To work around this, rules are marked as keep=true if the only values for the rule are created during generation.

Construct: `SCOPE_PUSH` & `SCOPE_POP`

First, some terminology that I will be using. There are probably better terms for these, but at least we’ll be on the same page:

Term	Definition
grammar	The total set of rules that define the data
rule	A grammar rule, such as the `Cache` rule that we have been working with
rule value	One of the possible values of a rule
rule set	A distinct set of rules and values

The SCOPE_PUSH and SCOPE_POP constructs can be used to create new (scoped) rule sets which are linked to a parent rule set. Adding new rule values within a scoped rule set only adds the value to the current, scoped rule set, not the parent rule set.

Using SCOPE_PUSH and SCOPE_POP we can redefine our CacheStorage example to use scoped variables within each function callback:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


rules
  ->AddRule("CacheStorage", V("window.caches"))
  ->AddRule("Cache", AND(
    PRE(AND(
      REF("CacheStorage"),
        V(".open('named-cache').then(function("),
        SCOPE_PUSH,
        STORE("Cache", STR(5, 6)),
        V(") {")
    )),
    REF("Cache"),
    POST(AND(SCOPE_POP, V("});")))
  ))
  ->AddRule("Cache.add", AND(REF("Cache"), V(".add('https://narly.me')")));

The output is the same as before, but more correct!

What if we now wanted to perform multiple cache-related statements within a callback? The modified grammar below adds

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


rules
  ->AddRule("CacheStorage", V("window.caches"))
  ->AddRule("Cache", AND(
    PRE(AND(
      REF("CacheStorage"),
        V(".open('named-cache').then(function("),
        SCOPE_PUSH,
        STORE("Cache", STR(5, 6)),
        V(") {")
    )),
    REF("Cache"),
    POST(AND(SCOPE_POP, V("});")))
  ))
  ->AddRule("Cache.add", AND(REF("Cache"), V(".add('https://narly.me');")));
  ->AddRule("cache-statments", AND(
      REF("Cache.add"), OPT(REF("cache-statements"))
  ));

Below are two pretty-printed results from building the cache-statements rule:

window.caches.open("named-cache").then(function (bnksk) {
  window.caches.open("named-cache").then(function (FhSbl) {
    window.caches.open("named-cache").then(function (ySrIP) {
      bnksk.add("https://narly.me");
      FhSbl.add("https://narly.me");
    });
  });
});

window.caches.open("named-cache").then(function (bsEvk) {
  window.caches.open("named-cache").then(function (OIdxu) {
    window.caches.open("named-cache").then(function (uzqfh) {
      window.caches.open("named-cache").then(function (Okuy) {
        window.caches.open("named-cache").then(function (CuFcp) {
          window.caches.open("named-cache").then(function (rshoi) {
            uzqfh.add("https://narly.me");
            CuFcp.add("https://narly.me");
            uzqfh.add("https://narly.me");
          });
        });
      });
    });
  });
});

Structure: Scoped Rule Sets

The internal structure of scoped rule sets in resmack is boring - linked lists of rule sets are used.

The more interesting (and harder to think about) aspect of this is how rule values are chosen when a given rule set has multiple parents. It’s not the way you might initially think!

For example, suppose I have the two rule sets:

rule_set: 0
    parent: NULL
    rules:
        rule1 ::= `Value1`
        rule2 ::= `Value2`
        rule3 ::= `Value3`

rule_set: 1
    parent: 0
    rules:
        rule1 ::= <empty>
        rule2 ::= `NEW VALUE`
        rule3 ::= <empty>

If rule set 1 is the current rule set, and I build rule2, resmack does the following to decide which rule value to build:

runs up the chain of parent rule sets
collects all rules that have non-empty rule2 values
randomly chooses from one of the non-empty rule2 values

This seems counter-intuitive because in so many ways, this grammar-based data generator (resmack) is starting work like an interpreter for a programming language. If this were a normal interpreter, having an item with the same identifier in a scope stack would mean that only the most recent/nearest-scoped identifier is actually used. All others would be shadowed by the “in-scope” identifier.

In grammars, this is different. Rules in grammars do not define identifiers directly, but types of data or classifications of values. In the example rule sets above, NEW VALUE and Value2 are both instances of the rule2 type.

I occasionally forgot about this relationship while working on resmack. It is definitely confusing at times.

The Flushing Problem

To be continued