Thursday, June 16, 2011

Version 0 SortCommutativeRegisterUses Optimization Running Time / Speedups

To establish a starting point for my work, I began by profiling the execution time of each optimization pass in Simple.java [1], as seen in the chart below.
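
The per-pass times reported here are wall-clock measurements. As a minimal sketch of that kind of instrumentation (the Pass hook, the names, and the output format are illustrative assumptions, not the actual harness code):

  // Sketch: wrap one optimization pass with System.nanoTime().
  public class PassTimer {
    interface Pass {
      void run();  // stands in for one pass, e.g. DefUse.computeDU(ir)
    }

    static long timePass(String name, Pass pass) {
      long start = System.nanoTime();
      pass.run();
      long elapsed = System.nanoTime() - start;
      System.out.println("Execution Time: " + elapsed + " nanoseconds  Pass: " + name);
      return elapsed;
    }
  }

System.nanoTime() reports elapsed wall-clock time in nanoseconds, which matches the units used in the measurements on this page.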



I then started multithreading SortCommutativeRegisterUses, as seen below.




Optimizations (in Simple.java)

  1. // Compute defList, useList, useCount fields for each register.
        DefUse.computeDU(ir);
  2.    // Recompute isSSA flags
        DefUse.recomputeSSA(ir);
  3. // Simple copy propagation.
        // This pass incrementally updates the register list.
        copyPropagation(ir);
  4.  // Simple type propagation.
        // This pass uses the register list, but doesn't modify it.
       if (typeProp) {
         typePropagation(ir);
        }
  5. // Perform simple bounds-check and arraylength elimination.
        // This pass incrementally updates the register list
        if (foldChecks) {
          arrayPropagation(ir);
        }
  6. // Simple dead code elimination.
        // This pass incrementally updates the register list
        eliminateDeadInstructions(ir);
  7. // constant folding
        // This pass usually doesn't modify the DU, but
        // if it does it will recompute it.
        foldConstants(ir);
        // Simple local expression folding respecting DU
        if (ir.options.LOCAL_EXPRESSION_FOLDING && ExpressionFolding.performLocal(ir)) {
          // constant folding again
          foldConstants(ir);
        }
  8. // Try to remove conditional branches with constant operands
        // If it actually constant folds a branch,
        // this pass will recompute the DU
        if (foldBranches) {
          simplifyConstantBranches(ir);
        }
  9. // Should we sort commutative use operands
        if (sortRegisters) {
          sortCommutativeRegisterUses(ir);
        }


In this code we multithread the SortCommutativeRegisterUses optimization by forking a thread per Instruction. Clearly the granularity is too fine, and we need to optimize the algorithm.
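
For illustration, here is a minimal sketch of the thread-per-Instruction approach (the Instruction interface and sortCommutativeUses() below are placeholders, not the actual Jikes RVM types or the code used for these measurements):

  import java.util.ArrayList;
  import java.util.List;

  // Sketch: fork one java.lang.Thread per instruction and join them all.
  // Each thread does only a tiny amount of work, so thread creation and
  // scheduling dominate the useful work; the granularity is too fine.
  public class PerInstructionSort {
    // Placeholder standing in for the opt compiler's Instruction type.
    interface Instruction {
      void sortCommutativeUses();  // hypothetical per-instruction operand swap
    }

    static void sortThreaded(List<Instruction> instructions) throws InterruptedException {
      List<Thread> workers = new ArrayList<Thread>();
      for (final Instruction s : instructions) {
        Thread t = new Thread(new Runnable() {
          public void run() {
            s.sortCommutativeUses();
          }
        });
        workers.add(t);
        t.start();
      }
      for (Thread t : workers) {
        t.join();  // wait for every per-instruction thread before continuing
      }
    }
  }

Starting a java.lang.Thread costs far more than swapping two commutative operands, which is why a coarser division of work is needed.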

./rvm -X:aos:initial_compiler=opt -classpath dacapo-2006-10-MR2.jar Harness fop
Running Time



Speedups




Tuesday, June 14, 2011

ComputeDU Original vs. Multithreaded

To establish a starting point for my work, I began by profiling the execution time of each optimization pass in Simple.java [1], as seen below.







I started multithreading DefUse.computeDU.


Optimizations (in Simple.java)

  1. // Compute defList, useList, useCount fields for each register.
        DefUse.computeDU(ir);
  2.    // Recompute isSSA flags
        DefUse.recomputeSSA(ir);
  3. // Simple copy propagation.
        // This pass incrementally updates the register list.
        copyPropagation(ir);
  4.  // Simple type propagation.
        // This pass uses the register list, but doesn't modify it.
       if (typeProp) {
         typePropagation(ir);
        }
  5. // Perform simple bounds-check and arraylength elimination.
        // This pass incrementally updates the register list
        if (foldChecks) {
          arrayPropagation(ir);
        }
  6. // Simple dead code elimination.
        // This pass incrementally updates the register list
        eliminateDeadInstructions(ir);
  7. // constant folding
        // This pass usually doesn't modify the DU, but
        // if it does it will recompute it.
        foldConstants(ir);
        // Simple local expression folding respecting DU
        if (ir.options.LOCAL_EXPRESSION_FOLDING && ExpressionFolding.performLocal(ir)) {
          // constant folding again
          foldConstants(ir);
        }
  8. // Try to remove conditional branches with constant operands
        // If it actually constant folds a branch,
        // this pass will recompute the DU
        if (foldBranches) {
          simplifyConstantBranches(ir);
        }
  9. // Should we sort commutative use operands
        if (sortRegisters) {
          sortCommutativeRegisterUses(ir);
        }





Our computeDU optimization multithreads at the register level. Unfortunately, that granularity is far too fine, and we see overhead from the Runnable objects/threads.
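
In rough outline, the per-register parallelization looks something like this (a sketch only: Register and buildDefUseLists() are placeholder names standing in for the real def/use bookkeeping, and the thread-pool setup is an assumption, not the actual patch):

  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  // Sketch: submit one Runnable per register to a fixed-size pool.
  // Each task is only a little bookkeeping, so allocating and
  // scheduling all those Runnables costs more than the work saved.
  public class PerRegisterComputeDU {
    // Placeholder standing in for the opt compiler's Register type.
    interface Register {
      void buildDefUseLists();  // hypothetical: fill defList, useList, useCount
    }

    static void computeDUThreaded(List<Register> registers, int nThreads) throws InterruptedException {
      ExecutorService pool = Executors.newFixedThreadPool(nThreads);
      for (final Register r : registers) {
        pool.execute(new Runnable() {
          public void run() {
            r.buildDefUseLists();
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.MINUTES);
    }
  }

With so little work per register, the timings below get worse as soon as extra threads are involved.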

There are, however, a few cases where we do get some benefit (of a sort); see the transition from 3 to 4 threads below:

Original:     108715 ns    Instructions: 821    Method: deleteTree
2 threads:    556187 ns    Instructions: 821    Method: deleteTree
3 threads:  24728977 ns    Instructions: 821    Method: deleteTree
4 threads:  14231161 ns    Instructions: 821    Method: deleteTree

for I in {1..3}; do ./rvm -X:aos:initial_compiler=opt -classpath dacapo-2006-10-MR2.jar Harness fop; done

Graph axes will be fixed and the dimensions will be lined up.