Posts Tagged ‘reproducibility’

Pattern for Versioning Generated Objects

May 5th, 2009 by dave

After building your software, do you check in your generated binary files? How about the output from test runs? If your software runs on multiple platforms, or your test runs take hours or days to execute, you may want to consider storing the output, especially if binary reproducibility is critical.

Example. Consider shipping an application to a customer who, two years later, reports a defect. Can you reproduce their build “today”? Surely you have the exact versions of the source files. But are you using the exact build file? Probably. How about the original version of the compiler? Maybe, but probably not. Don’t forget that your compilers get upgraded too, and changes to their optimization algorithms or bug fixes can change the binary code generated for your application. Thus, compiling source from two years ago may result in an equally functioning application at the user level, but at the byte level things may have changed dramatically, and at a level where runtime defects (performance/memory) rear their ugly heads.
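
One way to make "did the bytes change?" a concrete question is to record a digest of every generated file at release time and compare it against a later rebuild. The following is a minimal sketch, not an AccuRev feature: the function names and the idea of keeping a manifest alongside the release are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (relative POSIX path) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_artifacts(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List paths whose bytes differ, or that exist in only one manifest."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


if __name__ == "__main__":
    # Hypothetical demo: simulate an original build and a later rebuild.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "app.bin").write_bytes(b"built with the old compiler")
        released = build_manifest(root)
        (root / "app.bin").write_bytes(b"built with the new compiler")
        rebuilt = build_manifest(root)
        print(changed_artifacts(released, rebuilt))
```

If the list is empty, the rebuild is byte-identical to what shipped. If it is not, you know exactly which artifacts to be suspicious of.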

Myth #1: Committing generated files results in longer checkout times. No developer wants to check out source code and wait for, or be inundated with, megabytes of .o, .class, .jar, and .war files that are either never used or are going to be rebuilt anyway. The AccuRev Truth: Use include/exclude rules on streams and workspaces to control which streams have access to generated objects and who will receive them during checkout.

Myth #2: Committing binary files slows down your CM system. Traditional SCM systems combine metadata and content, resulting in slower performance over time as the number of files increases (think labeling). The AccuRev Truth: AccuRev stores metadata separately from file contents and uses indexes to look up and retrieve contents. For example, transactions are labeled, not files. Using a card catalog (index lookup) to find your books is always faster than walking the aisles (linear scan).
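
The card catalog analogy can be sketched in a few lines. This is only an illustration of index lookup versus linear scan in general. It says nothing about how AccuRev implements its storage, and the record names and locations are made up.

```python
from typing import Optional

Record = tuple[str, str]  # (element name, storage location); both hypothetical

# Hypothetical catalog of 1000 stored elements with unique names.
RECORDS: list[Record] = [
    (f"src/file_{i}.c", f"store/{i:06d}") for i in range(1000)
]


def find_by_scan(records: list[Record], name: str) -> Optional[str]:
    """Walk the aisles: check every record until the name matches."""
    for element, location in records:
        if element == name:
            return location
    return None


def build_index(records: list[Record]) -> dict[str, str]:
    """Build the card catalog once: name -> location."""
    return dict(records)


if __name__ == "__main__":
    index = build_index(RECORDS)
    # Both return the same location; the index avoids touching every record.
    print(find_by_scan(RECORDS, "src/file_999.c"))
    print(index.get("src/file_999.c"))
```

The scan gets slower as the number of records grows, while the dictionary lookup stays roughly constant once the index exists.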

Myth #3: Storing generated artifacts will bloat the repository. Back in the day of wild-west coding, there was little rhyme or reason for where files were saved in the source tree. The build system would simply compile the files it found, save the generated output right next to the source file, and as long as everything compiled and linked, it worked. But in today’s complex world of multi-layer software architectures, tiered deployments, mixed technologies, and sophisticated build tools, following a convention is almost a necessity (think Ruby on Rails, Maven, etc.). The AccuRev Truth: Organizing the top-level source tree and configuring your build tool can make it very easy to carve out source vs. binary vs. tests vs. scripts, etc. Using include/exclude rules, end users can decide at the stream or workspace level what parts of the file tree need to be visible.

The Pattern. In this pattern for versioning generated artifacts, I’ll show how streams can be used to store generated files only in the appropriate stage of development and prevent unwanted exposure to developers. Two options are presented; they can also be used in combination.

Option #1: sub-configurations

Option #1: Store and track generated artifacts as sub-configurations isolated from the mainline. From a baseline snapshot such as a test build or release candidate, create a new child stream to store the generated artifacts. Then create a second snapshot that represents both source code and generated artifacts. For a single “configuration” you now have two snapshots – one for source only and a second for source + binary. Furthermore, you can diff these two snapshots to know exactly how the binary configuration differs from the source configuration. You might also consider storing compiler files, debugging output, test output, the compilers themselves (!), etc.
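
To illustrate what diffing the two snapshots tells you, here is a small sketch that models a snapshot as a plain mapping of path to version number. This is a simplified stand-in and not AccuRev's data model. The file names and version numbers are hypothetical.

```python
from typing import NamedTuple


class SnapshotDiff(NamedTuple):
    only_in_first: list[str]
    only_in_second: list[str]
    changed: list[str]


def diff_snapshots(first: dict[str, int], second: dict[str, int]) -> SnapshotDiff:
    """Compare two snapshots modeled as {path: version} mappings."""
    shared = first.keys() & second.keys()
    return SnapshotDiff(
        only_in_first=sorted(first.keys() - second.keys()),
        only_in_second=sorted(second.keys() - first.keys()),
        changed=sorted(p for p in shared if first[p] != second[p]),
    )


# Hypothetical contents of the source-only snapshot.
SOURCE_ONLY = {"src/Main.java": 4, "build.xml": 2}

# Hypothetical source + binary snapshot: same source, plus generated files.
SOURCE_PLUS_BINARY = {
    **SOURCE_ONLY,
    "build/app.jar": 1,
    "build/test-report.txt": 1,
}

if __name__ == "__main__":
    print(diff_snapshots(SOURCE_ONLY, SOURCE_PLUS_BINARY))
```

In this model the only differences are the generated files, which is what you would want to see: the source in both snapshots is identical, and everything extra is build output.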

Option #2: include/exclude rules

Option #2: Store and track generated artifacts directly in mainline but exclude them from downstream access using stream-level exclude rules. The top-most streams that need access to both source and binary will include most or all of the filesystem footprint in their configurations. The first stream that does not need access to generated objects is the likely place to set an exclude rule on the folder(s) that contain those files. The exclude rule is inherited by all children and grandchildren.

When using exclude rules, it is easiest to set a single rule on a top-level ‘./build’ or ‘./generated’ folder rather than creating a rule for each sub-folder in a large source tree. Traditionally, make-based build systems would generate the compiled files in-line with the source code. More recently, Ant-based build systems package all generated artifacts in a separate sub-tree off the root. Regardless of your build tool, it’s best to have all generated artifacts in their own tree – it makes them easier to exclude as well as safer to clean!
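
The inheritance behavior described above can be modeled in a few lines. This is a toy sketch of the idea only: the `Stream` class, its methods, and the stream names are hypothetical, and real AccuRev rules are richer than this (include rules can re-expose parts of an excluded tree, for instance).

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class Stream:
    """Toy model of a stream with exclude rules inherited from its parent."""

    name: str
    parent: Optional["Stream"] = None
    excludes: set[str] = field(default_factory=set)

    def effective_excludes(self) -> set[str]:
        """Own rules plus every rule set on an ancestor stream."""
        inherited = self.parent.effective_excludes() if self.parent else set()
        return inherited | self.excludes

    def is_visible(self, path: str) -> bool:
        """A path is hidden if it is, or lives under, an excluded folder."""
        candidate = PurePosixPath(path)
        for rule in self.effective_excludes():
            folder = PurePosixPath(rule)  # './build' normalizes to 'build'
            if candidate == folder or folder in candidate.parents:
                return False
        return True


if __name__ == "__main__":
    # Hypothetical hierarchy: one rule on a single top-level folder.
    mainline = Stream("mainline")
    integration = Stream("integration", parent=mainline, excludes={"./build"})
    dev = Stream("dev_workspace", parent=integration)

    print(mainline.is_visible("build/app.jar"))  # mainline still sees binaries
    print(dev.is_visible("build/app.jar"))       # hidden via inherited rule
    print(dev.is_visible("src/Main.java"))       # source stays visible
```

Because every generated file lives under one folder, a single rule on the first stream that does not need binaries is enough to hide them from everything below it.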

In practice I see both patterns in use, and both have equal merit depending simply on the situation at hand. Option #1 is commonly used when generated artifacts are not to be included in the official release, for example transient or secondary artifacts such as test cases, debugging output, reports, etc. These files are not promoted up to the release stream. Option #2 is usually used when the generated artifacts are expected to be included in the official release snapshot. Thus, they are promoted up through the test/build/release streams. The build system for these types of ‘uber’ configurations may have multiple release targets creating different levels of release packages such as ‘minimal’, ‘app’, ‘app-with-tests’ and ‘full’. That is to say, the CM system may have all possible files but you can choose what actually gets deployed. Ultimately, storing everything in the CM system is likely the right choice for audit and reproducibility.

/Happy Coding/

How Accessible is Your Source Code?

December 19th, 2007 by jtalbott

One of the things that most frightens anyone who develops code, or more accurately the people responsible for releasing or deploying that code, is the possibility that it will not be accessible in your software configuration management tool or version control system when you need it. Computers are fickle things; unless you are prepared to spend fantastic amounts of money on a truly redundant hardware/software infrastructure, there is always the chance that your system could be inaccessible when you need it most. And this doesn’t even begin to address potential network outages.

AccuRev, because of the ease of backup and restore for one thing, and the always accessible locally resident Workspaces for another, is a great insurance policy against data loss and brings developer down-time to zero. But let’s say that one of those dreaded network outages or hard disk failures occurs exactly at the time when you need to pull the latest source over to your build system for a critical release. Or imagine any kind of similar roadblock you’ve run up against in your own environment. This is yet another area where AccuRev, with just a tiny bit of prep and foresight, can save you hours, if not days, of aggravation by ensuring that you always have access to your latest “buildable” code.

How does this happen? Well, consider first the AccuRev stream paradigm where – unlike branches – there is a well defined progression of code, from the high instability of a developer’s Workspace to the “always buildable” top-level release stream. You simply have to choose the stream that you would consider as your ideal location to retrieve “offline” code from. In the following StreamBrowser, I’ve chosen the top-level “Claims_Client” stream.

[Image: offline_code (StreamBrowser screenshot)]

Create an AccuRev reference tree based on your chosen stream. Define the location on disk where that reference tree should place the code. Then, use a trigger (we provide examples with exactly this functionality) to Update that reference tree any time a Promotion happens into this stream. The end result is that you have a specific file-system location that is *guaranteed* to have the latest and greatest code from any given stream that you want. This approach is also efficient because it only has to transfer changes, not the entire source tree, and there is zero manual intervention required! So in the case where the network or SCM server goes down, no problem for you; your code is happily hanging out right where you need it to be.
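
The core of such a trigger can be sketched as follows. This is not one of the example triggers that ship with AccuRev, and it does not model how AccuRev passes promote details to a trigger. It assumes the reference tree already exists, that the CLI form `accurev update -r <reftree>` applies to your AccuRev version, and that something else calls `on_promote` with the name of the stream that was promoted into. The reference tree name `Claims_Client_offline` is hypothetical.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]

# Hypothetical mapping: stream to watch -> reference tree to refresh.
WATCHED = {"Claims_Client": "Claims_Client_offline"}


def run_command(argv: Sequence[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(list(argv), check=True)


def update_command(ref_tree: str) -> list[str]:
    """Build the CLI call that refreshes a reference tree.

    Assumed form: `accurev update -r <reftree>`; verify it for your version.
    """
    return ["accurev", "update", "-r", ref_tree]


def on_promote(
    promoted_stream: str,
    watched: dict[str, str],
    runner: Runner = run_command,
) -> bool:
    """Refresh the reference tree if the promote landed in a watched stream.

    Returns True when an update was issued, False when nothing was done.
    """
    ref_tree = watched.get(promoted_stream)
    if ref_tree is None:
        return False
    runner(update_command(ref_tree))
    return True
```

Passing the runner in as a parameter keeps the decision logic testable without an AccuRev server: in a dry run you can hand it `print` or a list's `append` and inspect the command that would have been executed.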

Some organizations are much more tolerant of this kind of scenario than others. Either way, can you think of other times or examples where this kind of functionality would be helpful to you?