Resmack: Part 6: Stateful & Dynamic Grammars
Note: This post is part of a series on developing resmack. See the resmack tag for all posts in the series.
← Part 5 Grammar Mutation and Recursion
Static Grammars
To me, the most common type of grammar-based fuzzers use static grammars. Static grammars can be incredibly fast at generating structured data. See Brandon Falk’s (@gamozolabs) FZero fuzzer and the F1 Fuzzer from the Building Fast Fuzzers paper. However, as fast as static grammars can be, they do have their limitiations.
Static Grammar Limitations
Let’s imagine that you want to fuzz a programming language. Ignoring looking for bugs in the lexer/parser, you’d need to:
- Generate 100% valid code
- Syntax errors cause early exits
- Be able to keep track of [the assumed]‡ state of the program
- variables (additional, run-time added options for existing rules)
- identifiers (functions, …)
- scoping
‡We don’t actually know if the generated data is valid, but feedback-driven fuzzing with grammars should weed out invalid data
For #1, grammars seem to be an obvious solution. What about #2? Can this be done generically, without coding towards a specific paradigm/language?
This has been a goal of mine for a long time. In the past I have always created a basic grammar data-generator, and then wrapped it with language-specific constructs to keep track of scoping, in-scope identifiers, etc.
Limitations Example
Often when generating javascript to test browser APIs, references to valid instances of objects are only obtainable after several layers of promises and callbacks.
For example, when using the CacheStorage API, a promise needs to be used to gain access to a Cache object:
|
|
A few steps need to be taken before we’re able to call the add
function on the
cache object:
- Get a reference to a
CacheStorage
object - Call
open
on theCacheStorage
object and call thethen
function of the returned promise with a callback function as a parameter - Use the argument in the callback function as a reference to a
Cache
object - Call
add
on theCache
object
This makes sense for us as humans, but this isn’t how grammars construct data.
Resmack would first decide - “I’m going to generate a random statement”. The
statement the grammar might choose would be to call the add
function on
a Cache
object. Then it works backwards to resolve everything that it needs
in order to generate a valid Cache.add()
statement.
In resmack, you might try to implement it like this:
rules
->AddRule("CacheStorage", V("window.caches"))
->AddRule("CachePre", AND(
REF("CacheStorage"), V(".open('named-cache').then(function(cache_obj) {\n"))
))
->AddRule("CachePost", V("});")
->AddRule("Cache", V("cache_obj"))
->AddRule("Cache.add", AND(REF("Cache"), V(".add('https://narly.me')")));
With the above grammar, you’d have to know in advance that CachePre
and
CachePost
need to be built before and after the Cache.add
rule in order for
cache_obj
to be valid:
Having to hard code pre/post dependencies, as well as remembering scoping information, does not scale well. A more robust method is needed to let:
- Pre/post dependencies to be declared
- Capture and store built results as a new value for an existing rule
- Scoping of dynamic rule values
Resmack Solution
To accomplish this in resmack, several constructs needed to be created to make it possible to flexibly track scoping and variable/identifier values throughout the data generation process. Additionally, the grammar fuzzer’s internals needed to have particular characteristics to support the required constructs:
Each of the constructs mentioned above are in addition to “normal” constructs often found in grammar data-generators/fuzzers:
AND |
OR |
OPT |
REF |
INT |
STR |
RAW /V |
Construct: PRE
& POST
The PRE
construct allows pre-dependencies to be built prior to building
the actual output of the rule.
The POST
construct allows post-dependencies to be built dynamically after
the normal output of the rule is built. This is done lazily so that the effects
of building the normal output will be able to influence the building of any
post dependencies.
Using PRE
and POST
, we can modify the original resmack grammar for
CacheStorage
like so:
|
|
Building the Cache.add
rule produces:
window.caches.open('named-cache').then(function(cache_obj) {cache_obj.add('https://narly.me')});
which works great! For simple cases anyways.
For more complex cases it only raises more questions. What if:
- You want to be able to run multiple statements inside of the promise callback?
- You don’t want to hard code the parameter name to
cache_obj
?
Structure: PreOutput, Output and Post-Build
In order to support the PRE
and POST
constructs, resmack needed to have
separate output streams for both pre-output, normal output, and a dynamic queue
of post-items to build after all pre and normal actions have been performed.
Having the two output streams separately defined along with a dynamic queue
of post item references allows arbitrary nesting of PRE
/POST
items.
For example, what if a rule is referenced inside of a pre construct,
PRE(REF("refd-rule"))
, but the refd-rule
value itself has its own
PRE
and POST
constructs? A solution is to simply shift the pre-output stream
to be the normal output stream and create a new temporary pre-output stream.
Additionally, post-build items need to be tracked and built lazily after merging the pre-output and output streams.
The animation below demonstrates how this works:
Construct: STORE
& SET
Up to this point the grammar is still static - no new rules are being added, no state is being persisted.
The STORE
construct allows generated data to be stored as a value for a
specific rule while still outputting the data to the output stream.
The SET
construct allows generated data to become the only value option
for a specific rule while still writing data to the output stream.
NOTE: I did end up adding two variants: ISTORE
and ISET
that do not write
data to the output stream and only modify the rule values themselves
Structure: Dynamic Rule Values
In order to support STORE
and SET
, the internal structures that hold and
define the rules and their values must be:
- dynamic, not static (implies heap-allocated data structures)
- allowed to have generation-time-only rule values‡
‡This has complications due to the preprocessing that resmack does. See the next sub-section about pruning
I’ve mentioned a few times online that due to my personal use cases with resmack, that I can’t do certain optimizations:
The stats definitely change based on the target complexity and the complexity of the grammar. There are some features in my fuzzer that prohibit it from being as fast as it could be, but I need them & add other optimizations for the cases where I don't need those features
— James Johnson (@d0c_s4vage) December 16, 2020
Dynamic rule values absolutely prohibit easy paths towards having a screaming-fast grammar fuzzer.
Using STORE
, we can rewrite the grammar to have a dynamic name for the
Cache
rule instead of being hard coded to cache_obj
:
|
|
Building the Cache.add
rule produces:
window.caches.open('named-cache').then(function(NmwTX) {NmwTX.add('https://narly.me')});
However, since the STORE("Cache", STR(5, 6))
construct simply adds a new rule
value to the already-existing Cache
rule, and we’re doing REF("Cache")
twice (lines 10 & 13), you may also see repeated, nested promises.
The prettified, generated data and animation below demonstrate this:
window.caches.open('named-cache').then(function(VKEUr) {
window.caches.open('named-cache').then(function(prDvV) {
window.caches.open('named-cache').then(function(vjUoB) {
prDvV.add('https://narly.me')
});
});
});
Pre-Processing & Pruning
Allowing rules to be defined only at generation-time complicates things.
First, a quick overview of the preprocessing that resmack does on rules:
Prior to generating data, resmack preprocesses all defined rules to determine:
- Reference depth of all rules‡
- Reachability of all references to other rules
‡The values with the shortest reference depths for each OR
are used to gracefully exit generation once a maximum recursion level has
been reached. See the
Controlling Recursion Depth in Grammars
and the Grammar Mutation & Recursion posts
Using the reachability and reference depth, resmack automatically prunes rules and rule values when:
-
The reference depth is infinite (cyclic rule references with no exit)
Prune_________
______No Prune
-
Rules are referenced that are unresolvable
Prune_________
______No Prune
The preprocessing runs in a loop until no new adjustments are made to the loaded rules. With this in mind, the third pruning scenario should make sense:
Rules whose values are only defined during generation will have no rule values
and would normally be pruned by #3. To work around this, rules are
marked as keep=true
if the only values for the rule are created during
generation.
Construct: SCOPE_PUSH
& SCOPE_POP
First, some terminology that I will be using. There are probably better terms for these, but at least we’ll be on the same page:
Term | Definition |
---|---|
grammar | The total set of rules that define the data |
rule | A grammar rule, such as the Cache rule that we have been working with |
rule value | One of the possible values of a rule |
rule set | A distinct set of rules and values |
The SCOPE_PUSH
and SCOPE_POP
constructs can be used to create new (scoped)
rule sets which are linked to a parent rule set. Adding new rule values
within a scoped rule set only adds the value to the current, scoped rule set,
not the parent rule set.
Using SCOPE_PUSH
and SCOPE_POP
we can redefine our CacheStorage
example
to use scoped variables within each function callback:
|
|
The output is the same as before, but more correct!
What if we now wanted to perform multiple cache-related statements within a callback? The modified grammar below adds
|
|
Below are two pretty-printed results from building the cache-statements
rule:
window.caches.open("named-cache").then(function (bnksk) {
window.caches.open("named-cache").then(function (FhSbl) {
window.caches.open("named-cache").then(function (ySrIP) {
bnksk.add("https://narly.me");
FhSbl.add("https://narly.me");
});
});
});
window.caches.open("named-cache").then(function (bsEvk) {
window.caches.open("named-cache").then(function (OIdxu) {
window.caches.open("named-cache").then(function (uzqfh) {
window.caches.open("named-cache").then(function (Okuy) {
window.caches.open("named-cache").then(function (CuFcp) {
window.caches.open("named-cache").then(function (rshoi) {
uzqfh.add("https://narly.me");
CuFcp.add("https://narly.me");
uzqfh.add("https://narly.me");
});
});
});
});
});
});
Structure: Scoped Rule Sets
The internal structure of scoped rule sets in resmack is boring - linked lists of rule sets are used.
The more interesting (and harder to think about) aspect of this is how rule values are chosen when a given rule set has multiple parents. It’s not the way you might initially think!
For example, suppose I have the two rule sets:
rule_set: 0
parent: NULL
rules:
rule1 ::= `Value1`
rule2 ::= `Value2`
rule3 ::= `Value3`
rule_set: 1
parent: 0
rules:
rule1 ::= <empty>
rule2 ::= `NEW VALUE`
rule3 ::= <empty>
If rule set 1
is the current rule set, and I build rule2
, resmack does the
following to decide which rule value to build:
- runs up the chain of parent rule sets
- collects all rules that have non-empty
rule2
values - randomly chooses from one of the non-empty
rule2
values
This seems counter-intuitive because in so many ways, this grammar-based data generator (resmack) is starting work like an interpreter for a programming language. If this were a normal interpreter, having an item with the same identifier in a scope stack would mean that only the most recent/nearest-scoped identifier is actually used. All others would be shadowed by the “in-scope” identifier.
In grammars, this is different. Rules in grammars do not define identifiers
directly, but types of data or classifications of values. In the example rule
sets above, NEW VALUE
and Value2
are both instances of the rule2
type.
I occasionally forgot about this relationship while working on resmack. It is definitely confusing at times.
The Flushing Problem
To be continued