    nir/search: Add automaton-based pre-searching · 7ce86e69
    Connor Abbott authored
    
    
    nir_opt_algebraic is currently one of the most expensive NIR passes,
    because of the many different patterns we've added over the years. Even
    though patterns are already sorted by opcode, there are still way too
    many patterns for common opcodes like bcsel and fadd, which means that
    many patterns are tried but only a few actually match. One way to fix
    this is to add a pre-pass over the code that scans it using an automaton
    constructed beforehand, similar to the automatons produced by lex and
    yacc for parsing source code. This automaton has to walk the SSA graph
    and recognize possible pattern matches.
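    The pre-pass can be sketched roughly like this: walk instructions in
    def-before-use order, and label each SSA definition with an automaton
    state computed from its opcode and its operands' states; a state then
    records which LHS patterns could possibly be rooted there. All names,
    states, and tables below are illustrative, not the actual nir_search
    code.

```python
class Instr:
    """Toy stand-in for a NIR instruction: an opcode plus SSA sources."""
    def __init__(self, op, *srcs):
        self.op = op
        self.srcs = srcs

# Transition table: (opcode, tuple of child states) -> state.
# State 0 is the wildcard state that matches any value. Suppose we want
# to pre-match the single pattern fadd(fmul(a, b), c):
TABLE = {
    ("fmul", (0, 0)): 1,   # state 1: an fmul of anything
    ("fadd", (1, 0)): 2,   # state 2: possible fadd(fmul(a, b), c) root
}

def assign_states(instrs):
    """Label each instruction (visited in def-before-use order) with an
    automaton state; missing transitions fall back to the wildcard."""
    state = {}
    for instr in instrs:
        child_states = tuple(state[s] for s in instr.srcs)
        state[instr] = TABLE.get((instr.op, child_states), 0)
    return state
```

    Only instructions whose state marks them as a possible pattern root
    then need the full, expensive pattern match attempted on them.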
    
    It turns out that the theory to do this is quite mature already, having
    been developed for instruction selection as well as other non-compiler
    things. I followed the presentation in the dissertation cited in the
    code, "Tree algorithms: Two Taxonomies and a Toolkit," trying to keep
    the naming similar. To create the automaton, we have to perform
    something like the classical NFA to DFA subset construction used by lex,
    but it turns out that actually computing the transition table for all
    possible states would be way too expensive, with the dissertation
    reporting times of almost half an hour for an example of size similar to
    nir_opt_algebraic. Instead, we adopt one of the "filter" approaches
    explained in the dissertation, which trade much faster table generation
    and table size for a few more table lookups per instruction at runtime.
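    The filter idea, very roughly: rather than indexing one huge table
    directly by (opcode, child states), each operand position first maps
    its child's state through a small per-position filter. States that
    the patterns cannot tell apart at that position collapse to the same
    index, so the main table shrinks dramatically at the cost of one
    extra lookup per operand. Again, everything here is a made-up sketch,
    not the generated tables themselves.

```python
# Per-(opcode, operand) filters collapse states that the patterns for
# that opcode and operand position cannot distinguish.
FILTERS = {
    ("fadd", 0): {0: 0, 1: 1, 2: 1},  # only "is it an fmul?" matters here
    ("fadd", 1): {0: 0, 1: 0, 2: 0},  # second operand: anything goes
}

# The main table is now indexed by the much smaller filtered indices.
TABLE = {
    ("fadd", (1, 0)): 2,  # possible fadd(fmul(a, b), c) root
}

def transition(op, child_states):
    """One extra dict lookup per operand, in exchange for a far smaller
    main table; unknown states and opcodes fall back to wildcard 0."""
    filtered = tuple(
        FILTERS.get((op, i), {}).get(s, 0)
        for i, s in enumerate(child_states)
    )
    return TABLE.get((op, filtered), 0)
```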
    I chose the filter which resulted in the fastest table generation
    time, with a medium table size. Right now, table generation takes
    around 0.5 seconds, despite being implemented in pure Python, which I
    think is good enough. Based on the numbers in the dissertation, the
    other choice might make table generation 25x slower in exchange for a
    4x smaller table, but I don't think that's worth it. As of now, we
    get the following binary size before and after this patch:
    
            text    data     bss       dec     hex  filename
        11979455  464720  730864  13175039  c908ff  before i965_dri.so
        12037835  616244  791792  13445871  cd2aef  after  i965_dri.so
    
    There are a number of places where I've simplified the automaton by
    getting rid of details in the LHS patterns rather than complicate things
    to deal with them. For example, right now the automaton doesn't
    distinguish between constants with different values. This means that it
    isn't as precise as it could be, but the decrease in compile time is
    still worth it -- these are the compilation time numbers for a
    shader-db run with my (admittedly old) database on Intel Skylake:
    
    Difference at 95.0% confidence
    	-42.3485 +/- 1.375
    	-7.20383% +/- 0.229926%
    	(Student's t, pooled s = 1.69843)
    
    We can always experiment with making it more precise later.
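    As an illustration of the constant wildcarding mentioned above, the
    sketch below (again with made-up opcodes and state numbers) gives
    every constant the same state, so patterns that differ only in the
    constant's value reach the same pre-match state, and the exact value
    is only checked by the real matcher:

```python
CONST_STATE = 3  # one state for "any constant", regardless of value

TABLE = {
    ("load_const", ()): CONST_STATE,
    ("fmul", (0, CONST_STATE)): 4,  # fmul(anything, some constant)
}

def pre_state(op, child_states):
    """A multiply by 0.0 and a multiply by 2.0 both land in state 4; only
    afterwards does the full matcher compare actual constant values."""
    return TABLE.get((op, tuple(child_states)), 0)
```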
    
    Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>