Paste: tentative talk outline

- dynamic language perf challenges
  - integer overflow checks
  - floating point boxing
  - generic collection protocols
  - higher-order functions, closures
  - high allocation rates, lots of temporary objects
    - gen GC helps to a degree, but ideally you'll elide the allocation
      altogether and put the slots in registers
  - branching and indirection!
  - interactive development
    - changes to class hierarchy
    - adding and removing slots in classes
    - function definitions changing at runtime

- Factor-specific challenges
  - implement most of the library in Factor itself
  - library code is very generic with lots of dispatch and allocation

- solution
  - classical multi-stage compiler
  - 30 optimization passes
  - I'll talk about a handful of the interesting ones

- reconciling interactive development and performance
  - syntax tree for code in all loaded source files is retained in memory
  - all decisions made are recorded in a deoptimization database
  - when something is redefined, the relevant definitions are recompiled
    from their in-memory syntax tree

- data types and representations
  - tagged pointers: 3 bits tag + 29 or 61 bit payload
    payload is either a pointer to an object in the heap
  - several types of classes
    - built-in classes have special representations in the heap.
      - arrays
      - byte arrays
      - fixnums
      - boxed floats
      - boxed C pointers
    - tuple classes: user-defined classes with dynamically-typed slots
      - slots are pre-declared
      - slots are packed inside instances
      - reshaping is still possible!
    - struct classes: user-defined classes with raw data slots
      - slots contain C values with C types
      - interoperability with C
      - used for high-performance code
    - 'abstract' classes with no direct instances: mixins, predicates

- SSA form
  - single static assignment form
  - def-use chains

- words
  - single definition

- generic words
  - single dispatch mostly
  - limited form of multiple dispatch for math operators
  - from the implementor's stand-point, almost the same as message-passing OO
  - think of message name/selector as a generic word

  - the slot access primitives don't bounds or type check
  - type checking is encoded with generic dispatch this way
  - you never call the slot access primitives directly; they're private

- tuple slot access
  - generic accessor words for reading and writing slot values
  - individual methods call 'unsafe' primitives

- generic arithmetic
  - similar idea as slot accessors
  - individual methods call monomorphic math primitives
  - note: fixnum+ takes two fixnums, but can output an integer
  - fixnum-bitand takes two fixnums and outputs a fixnum

- method inlining and class propagation:
  - if the dispatched method is known statically, inline it, and
    record the fact that the method was inlined in the deopt
    database. if the method is redefined later, or the class
    heirarchy changes in a way that could affect the inlining
    decision, recompile all words that inlined that method.
  - works well for tuple slot accessors, all float arithmetic,
    and non-overflowing fixnum arithmetic
  - problem: doesn't really help for integer math

- solution: integer interval propagation:
  - interval arithmetic
    [a,b] + [c,d] = [a+c,b+d]
    try some examples in a listener

- next: tuple slot propagation
  - complex numbers
  - we can have a complex number with integer components, float components
  - solution: propagate slot value info for immutable tuples

- a closer look at the propagation pass
  - mapping from SSA values to 'value info' objects
  - a 'value info' defines a set of possible runtime values
  - a value info can be a constant, an interval, a class, a class with slots, etc
  - for every node that takes inputs and produces outputs, use a 'value info function'
    to approximate the results
  - for an arbitrary subroutine call, just say the outputs are objects
  - phi node: take the union of the input value infos

- subtle issues
  - nice idea: intersect input types
    - however, doesn't work with dead code elimination
  - flow-sensitive value info
    - same value could have different value info in two different places
  - cycles in the def-use graph
    - do we start by assuming all values have null value-info, or universal value-info?
    - former is more accurate
    - what about infinite regress with loop counters?
      - in general, 'scalar evolution'

- modular arithmetic

- escape analysis
  - unboxes read only tuples only
  - this is where immutability wins

- remaining overheads

- look at float+ for instance
  - reads two values from the stack
  - unboxes them
  - performs an addition
  - boxes the result
  - stores boxed result on stack
- if we have several consecutive calls to such primitives, most of the time
  is spent in overhead and not useful computation
- a few other primitives are insufficiently fine-grained
  - memory allocation: if we split off the GC check, we can combine GC checks,
    saving a conditional branch in some cases

- this is where high level optimizations end and low level optimizations begin
  - low level IR is a control flow graph of basic blocks
  - low level IR is built from high level IR
  - certain primitives are expanded out into fine-grained instructions
  - other primitives just become ##calls
  - examples:
    - float+
    - fixnum+
    - allocation

- low level optimizations
  - after low level IR is built, there is another round of optimization
  - most important optimizations are:
    - value numbering
    - alias analysis
    - representation selection
    - copy coalescing

- value numbering
  - computes an equivalence relation over SSA values, by assigning a value number
    to each value
  - if two values have the same value number, then they will always have
    identical values at runtime
  - we can replace the instruction defining the second value with a copy
  
- value numbering
  - value graph
  - algebraic simplifications and constant folding for integer math
  - simplifies conditional branches

- alias analysis
  - mostly just optimizes away initialization in allocations
  - when a new object is created it has to be zeroed out
  - if its immediately stored to after, we can remove the initial store

- representation selection
  - when IR is constructed, no boxing/unboxing inserted at all
  - instead it inserts minimum amount of boxing
  - for each SSA value, compute possible representations
  - compute cost of keeping it in each representation
  - pick representation with minimal cost
  - at every usage where the representation is different, insert a boxing or unboxing

- copy coalescing
  - when phi nodes are eliminated, many copies are inserted
  - also two-operand pass inserts copies
  - we want to get rid of copies
  - if two values are related by a copy and don't interfere, merge them
  - normally, interference graph has to be built
  - with SSA form, there's a better algorithm

- register allocation
  - linear scan algorithm based on Christian Wimmer's HotSpot Client
    register allocator

- limitations
  - no type feedback or tracing; instead, 'hints'
  - a few optimizations missing
    - global value numbering
    - loop invariant code motion
    - array bounds check elimination
    - graph coloring register allocation

- features factor has that jvm doesn't
  - simd
  - struct arrays
  - integer optimizations
  - better code reloading