Difference between revisions of "NewRegressionFramework"

From gem5

Revision as of 18:47, 4 April 2011

We'd like to revamp the regression tests by moving to a new framework. This page is intended to host a discussion of features and design for the new framework.

Desirable features

  • Ability to add regressions via EXTRAS
    • For example, move the eio tests into the eio module so we don't try to run them when eio is not compiled in
  • Ability to skip regressions whose binaries or other inputs aren't available
    • Ideally with a semi-automated way of downloading binaries when they're publicly available
  • Better categorization of tests, and ability to run tests by category, e.g.:
    • by CPU model
    • by ISA
    • by Ruby protocol
    • by length
  • More directed tests that cover specific functionality and complete faster. Running SPEC benchmarks is important, but it spends a lot of time doing the same thing over and over. Those benchmarks should be only one component of our testing, not nearly all of it as they are now. This is a desirable feature of our testing strategy, not necessarily something that affects the regression framework.
  • Better checkpoint testing
    • some of this doesn't really depend on the regression framework, just needs new tests
    • e.g., integrating util/checkpoint-tester.py
  • Support for random testing (e.g., for background testing processes)
    • Random latencies?
    • Random testing a la memory testers but with different seeds, longer intervals
  • Decouple from SCons somewhat
    • Avoid having SCons dependency bugs force unnecessary re-running of tests, particularly for update-refs
  • Easy support for running separate tests where only the input parameters differ
    • For example, several protocols utilize different state transitions depending on configuration flags. It would be great if we could test these without having to create new directories and tests.
    • Similarly, we could/should test topologies this way as well.
  • Automated way to use nightly regressions as a basis for updating "m5-stable"
    • How do you identify the last working revision? (from Ali)
    • We may need a bug-tracking system so we could record facts like "changeset Y fixes a bug introduced in changeset X"; then we could automatically exclude the changesets between X and Y. We don't have that, though. (from stever)
  • Better definitions of success criteria.
    • E.g., distinguish "stats were changed, but the output is all still correct" from a plain pass or fail, giving three outcomes: passed, stats diffs, failed.
    • For example, you could say that a change in the terminal output, or in the stdout and SPEC binary outputs, is a failure, but a 1% difference in stats is a stats difference, which needs to be addressed.
    • I envision this giving reasonable certainty: if you make a change that you know will modify the stats, you can quickly verify that nothing broke horribly before updating the stats.
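
The three-way outcome in the last item could be sketched roughly as follows. This is only an illustration under stated assumptions, not existing gem5 code: the names `Outcome` and `classify`, the `rel_tol` parameter, and the `ipc` stat used in the usage lines are all hypothetical, and the stats are assumed to be already parsed into name-to-value dictionaries.

```python
import enum
import math


class Outcome(enum.Enum):
    PASSED = "passed"
    STATS_DIFF = "stats diffs"
    FAILED = "failed"


def classify(ref_output, new_output, ref_stats, new_stats, rel_tol=0.0):
    """Classify one test run (hypothetical helper, not part of gem5).

    ref_output/new_output: captured program output (e.g. stdout text).
    ref_stats/new_stats:   dicts mapping stat name -> numeric value.
    rel_tol:               assumed knob; 0.0 means any stat change counts.
    """
    # Any change in the program-visible output is a hard failure.
    if ref_output != new_output:
        return Outcome.FAILED
    # A stat that appears or disappears is reported as a stats difference.
    if ref_stats.keys() != new_stats.keys():
        return Outcome.STATS_DIFF
    # Output matches, so a differing stat value is only a stats difference.
    for name, ref_value in ref_stats.items():
        if not math.isclose(ref_value, new_stats[name],
                            rel_tol=rel_tol, abs_tol=0.0):
            return Outcome.STATS_DIFF
    return Outcome.PASSED


# Usage: identical output but a 1% change in a (hypothetical) stat.
print(classify("hello\n", "hello\n", {"ipc": 1.0}, {"ipc": 1.01}))
```

With this split, a developer who expects the stats to move can check that the result is a stats difference rather than a failure before updating the reference stats.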

Implementation ideas

Just ideas... no definitive decisions have been made yet.

  • Use Python's unittest module, or something that extends it such as nose
  • Use SCons to manage dependencies between binaries/test inputs and test results, but in a different SCons invocation (i.e., in its own SConstruct/SConscript)
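
As one possible reading of the unittest idea, the sketch below shows how a few of the desirable features might map onto the standard library: skipping a test when its binary is unavailable, running one test body over several input parameters, and selecting tests by category. Everything specific here is an assumption made for illustration: the `HELLO_BINARY` path, the `categories` attribute and its values, the `banks` parameter, and the `select` helper are hypothetical and not part of gem5 or unittest.

```python
import os
import unittest

# Hypothetical path to a test binary; it is not expected to exist.
HELLO_BINARY = "/nonexistent/test-binaries/hello"


class HelloTest(unittest.TestCase):
    # Hypothetical category tags that a runner could filter on.
    categories = {"isa": "alpha", "cpu": "simple-atomic", "length": "quick"}

    @unittest.skipUnless(os.path.exists(HELLO_BINARY),
                         "test binary not available")
    def test_binary_is_executable(self):
        self.assertTrue(os.access(HELLO_BINARY, os.X_OK))

    def test_parameter_variants(self):
        # Same test body, different input parameters, no new directories.
        for banks in (1, 2, 4):
            with self.subTest(banks=banks):
                self.assertGreater(banks, 0)


def select(suite, **wanted):
    """Return a suite holding only the tests whose categories match `wanted`."""
    selected = unittest.TestSuite()
    for test in suite:
        if isinstance(test, unittest.TestSuite):
            # Suites can nest, so filter them recursively.
            selected.addTests(select(test, **wanted))
        else:
            tags = getattr(test, "categories", {})
            if all(tags.get(key) == value for key, value in wanted.items()):
                selected.addTest(test)
    return selected


if __name__ == "__main__":
    all_tests = unittest.defaultTestLoader.loadTestsFromTestCase(HelloTest)
    unittest.TextTestRunner().run(select(all_tests, length="quick"))
```

Whether tags live on the test class, in a decorator, or in a separate manifest is an open design choice; the class attribute is used here only because it needs nothing beyond plain unittest.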