Archive for January, 2009

Risk v Confidence

Friday, January 30th, 2009

I’ve had a really exciting couple of weeks working with one of our System Test teams to define a better way of measuring test progress and product quality. For too long I’ve been fed up with the traditional test tracking metrics where we measure passes and fails or effort remaining. Historically, these measures seem to be used just because they are simple to gather, the assumption being that all you have to do is define what test cases need to be run, then track them until they all pass. There are two major flaws in this. Firstly, it’s a big assumption that the original test plan contains everything it needs to. Secondly, it is rare for any test plan to execute smoothly, and at some stage in the project the project manager realises that the passes and fails aren’t telling them anything and starts asking questions like “Just tell me what works and what doesn’t”. Invariably this is either impossible to determine or requires a lot of effort from the test team, at which point the simple solution is ‘Test team, work harder!’

I’ve failed miserably so far at trying to convince project teams that they should be looking at the outstanding risk in a project, rather than test case results. But I think I have finally realised why. People don’t like talking about risk. It sounds like something bad and most project teams don’t want to be associated with something bad.

The breakthrough we had this week came when my colleague Russell Finn came up with the idea of measuring the ‘confidence’ we have in the product or system rather than the outstanding risk. Now you could argue that confidence is just the inverse of risk in this case, but I think it has a much more positive spin on it.

We had been challenged by our lead engineer, Brian Cope, to redefine how we represented our status, and with the help of system test leaders Eileen Dreyer and Chris Osbourn, we set about rethinking everything we do in terms of status reporting.

What we decided to show was effectively two columns of data. One showing areas of the product that we had high confidence in and one showing the backlog of areas we currently have low confidence in (or if you like the risky areas). Now, from a very simplistic view, we can answer the question ‘What works and what doesn’t?’ or at least have a good stab at it.

The next step was to work out a way of quantifying the ‘confidence’. Fortunately, this was relatively simple as we piggybacked on a piece of work that Russell had already done, where he had defined a ‘taxonomy’ for the system under test. This taxonomy split the system into its important parts, from a capability viewpoint. With this taxonomy we were able to prioritise and apply relative weightings for each area using ‘Planning Poker’ (http://www.planningpoker.com/). A quick Friday afternoon game involving Jon Isaac, Russell, Brian and me, and we had a pretty good view of the system with each area given a number of ‘story points’. (We have since done a sanity check with other members of our department and so far our estimates are holding up.)

We could then chart the confidence in the system using a couple of pictures. The first showing the confidence in different areas (and their relative weighting), the second, showing the overall system. We decided to add a third ‘state’ to show areas of risk that we would be mitigating in the current iteration.
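As a rough illustration of the numbers behind those two pictures, here is a minimal Python sketch. The area names, story points and the split between states are all made up for this example, and the `Area` class and `overall` function are hypothetical helpers rather than anything the team actually built.

```python
from dataclasses import dataclass


@dataclass
class Area:
    """One area of the taxonomy (all values here are hypothetical)."""
    name: str
    points: int      # relative weight in story points
    confident: int   # points we currently have high confidence in
    mitigating: int  # points being mitigated in the current iteration

    @property
    def backlog(self) -> int:
        # Whatever is neither confident nor being mitigated is still backlog.
        return self.points - self.confident - self.mitigating


def overall(areas):
    """Share of the whole system in each state, weighted by story points."""
    total = sum(a.points for a in areas)
    return {
        "confident": sum(a.confident for a in areas) / total,
        "mitigating": sum(a.mitigating for a in areas) / total,
        "backlog": sum(a.backlog for a in areas) / total,
    }


# Made-up figures for a fictitious system early in a project.
areas = [
    Area("recovery", 40, 10, 10),
    Area("performance", 30, 0, 5),
    Area("connectivity", 20, 0, 0),
    Area("install", 10, 0, 0),
]

summary = overall(areas)
for state, share in summary.items():
    print(f"{state}: {share:.0%}")
```

The per-area values feed the first picture and the `summary` dictionary feeds the second; keeping the weighting in story points means a large area like recovery dominates the overall view, as intended.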

N.B. The data shown here is for a fictitious system, but imagine that it is a system that is highly valued for its ability to recover from failures and outages and has a high expectation on performance.

Once we have this picture we can view automated test cases as tools that help us build our confidence in the system. Other tools included ‘manual testing’, ad-hoc testing, code reviews, code coverage metrics and ‘tester gut feel’. These other tools are not used in traditional tracking and can be a valuable source of information for determining the quality of the product. If these things feel a bit hokey now then spend a second or two thinking about what a traditional test status showing 54% pass actually means.

Riding on the back of another piece of work, where all the existing test cases had been ‘tagged’ to show which areas of the taxonomy they exercised, we held a review with the test team to weight each test case by area. Note that a test case can cover more than one area and would be weighted independently for each area. (For instance, a test case might be very highly rated in the recovery area but do only a small amount of connectivity; it would then be weighted in both areas appropriately.)
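To make the per-area weighting concrete, here is a small Python sketch. It assumes, purely for illustration, that each passing test case contributes its weight towards the story points of every area it is tagged with, capped at the area’s total. The test names, weights and the `confidence_by_area` function are hypothetical, and the sketch covers tagged test cases only, not the other confidence-building tools mentioned above.

```python
from collections import defaultdict

# Hypothetical area weights in story points.
AREA_POINTS = {"recovery": 40, "connectivity": 20}

# Each test case carries an independent weight for every area it exercises.
TEST_WEIGHTS = {
    "failover_under_load": {"recovery": 13, "connectivity": 2},
    "restart_after_crash": {"recovery": 8},
    "basic_connect": {"connectivity": 5},
}


def confidence_by_area(passed):
    """Fraction of each area's points covered by the passing test cases."""
    earned = defaultdict(int)
    for test in passed:
        for area, weight in TEST_WEIGHTS[test].items():
            earned[area] += weight
    # Cap at the area's total so overlapping tests cannot exceed 100%.
    return {
        area: min(earned[area], points) / points
        for area, points in AREA_POINTS.items()
    }


result = confidence_by_area(["failover_under_load", "basic_connect"])
for area, share in result.items():
    print(f"{area}: {share:.1%}")
```

Note how the single failover test moves the needle in both areas, but by very different amounts.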

The following charts show the quality of the system during the early iterations. The highly weighted (and therefore most important areas) are being mitigated first (in true agile fashion) and we can see that a portion of recovery is now showing high confidence, a portion is being mitigated in this iteration and the rest is still outstanding in the backlog. Clearly the system is not suitable for shipping at this point.

As the iterations proceed we can see the backlog reduce and the confidence rise.

Finally we reach the last iteration and a decision must be made on whether we can ship or not. It looks like we have a small amount of risk in recovery, performance load and stress and a high confidence in everything else.

So, do we ship it or not?
The decision is still a tough one, but I’m sure that this sort of information will be far more useful than the traditional method where at this point we would be claiming 98% attempted and 94% successful!

I think this is a radical new way of thinking about product quality and will make a huge difference in how we do business.

I’d appreciate any thoughts and ideas on how this could be improved.

Sync’ing feeling…

Tuesday, January 27th, 2009

I seem to have amassed quite a lot of personal data; digital photographs, music, manuals and course-work all add up to a few GB. Obviously, being fond of computing, I have implemented some mind bogglingly complex – albeit sporadic – backup strategies that I don’t really understand anymore. If a key piece of hardware were to fail I’d lose it all. As good an excuse as any then for a gadget impulse buy: a Network Attached Storage (NAS) device would save the day, so I bought one last weekend.

The problems started shortly after I switched it on. I planned to use the device for backup purposes and to support sharing of data between systems, but I had not really thought this through. The device has a single ‘large’ disk, so no complex redundancy or performance configurations to worry about. How hard can it be?

My first plan was simple – mount the drive and start copying data. The drawbacks emerged pretty quickly; many of the data structures are ‘live’ and subject to change. Relying on manual inspection to identify updates is not really going to scale. Thankfully, the device manufacturer was one step ahead and had kindly bundled some software. Sadly, my PC, the software and I could not get along; the PC eventually got so upset that I sent it to the naughty step to reflect on its behaviour. OK, I need a way of pairing file structures and establishing a relation so that changes are synchronized from the live ‘master’ to backup ‘slave’ structures. At this point I remembered Jon mentioning SyncToy, which seems to address my needs. Any other tools or strategies worth reviewing?
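For the curious, the one-way master-to-backup pairing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how SyncToy or the bundled software works; the `mirror` function is hypothetical, and it assumes that comparing file size and modification time is a good enough test for ‘changed’.

```python
import shutil
from pathlib import Path


def mirror(master: Path, backup: Path) -> list[Path]:
    """Copy new or changed files from master to backup.

    Returns the list of backup paths that were written.
    """
    copied = []
    for src in master.rglob("*"):
        if not src.is_file():
            continue
        dst = backup / src.relative_to(master)
        if dst.exists():
            s, d = src.stat(), dst.stat()
            # Assumption: same size and a backup at least as new means unchanged.
            if s.st_size == d.st_size and int(s.st_mtime) <= int(d.st_mtime):
                continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied.append(dst)
    return copied
```

Calling something like `mirror(Path("Photos"), Path("/mnt/nas/Photos"))` (both paths are placeholders) would bring the backup up to date. Files deleted from the master are deliberately left alone in the backup, which is the safer default for a backup but means the two structures can drift apart over time.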

Dangerous coding errors revealed

Friday, January 23rd, 2009

I was sent this link this week by fellow Test Architect, Alasdair Paton.

Dangerous coding errors revealed

25 of the most dangerous bugs in software as defined by the US National Security Agency (NSA). The question Alasdair posed was how many of these had I found? I also wondered how many we actually go looking for?
I could see about 6 or 7 that we see regularly and actively look for.

Thoughts?

Does Hero Testing make us Testing Heroes?

Thursday, January 22nd, 2009

The plan has been approved and the team is fully committed, great stuff! Hang on a minute, what if… How about… hmm, I need another test! While I’m at it, it would be good to tidy up a few loose ends and address a few concerns. I’ll fetch my cape, it’s time for some Hero Testing; it can only make things better.

Jon first introduced me to the term ‘Hero Testing’ and to some of the challenges this ‘I can do it quick & cheap’ approach may bring. For example:

  • Employ different process
  • Compromise test repeatability (manual v automated test execution)
  • Lead to conflicting priorities
  • Undermine the value of ‘core’ testing already in plan

If this stuff is important, shouldn’t it be evaluated, prioritized, owned and managed by the team from *their* backlog? Maybe this is what a true Testing Hero would do. Shame, I thought I made the cape look good.

Risk in other industries

Wednesday, January 21st, 2009

There was a very interesting programme on BBC2 on how risk is managed (or rather mismanaged) in the financial markets. The City Uncovered with Evan Davis is well worth watching – even if just to understand how human nature sacrifices risk in the pursuit of performance.

It’s well summed up by Evan Davis’ closing statement: “If you think you’ve got risk licked – you haven’t”.

What I learned today

Tuesday, January 20th, 2009

Just some of the things:

  1. As a Test Architect I get asked to review test plans and provide advice/recommendations. Sometimes these can be on topics that I know very little about. Today I grabbed a team member (who was previously a customer) and took him for a tea. Thirty minutes later I walked away knowing that little bit more about the product I work on and discovered additional areas that deserve testing attention. That was a great coffee “break”.
  2. I’ve been asked to provide a code sample for another team. We have buckets of this stuff around, however it was all written with automation in mind. I’ve discovered this does not lend itself to being opened up as easy to read samples. I wonder if there’s a way to meet in the middle and provide useful (automated) samples?

Did anyone else learn anything interesting today?

Independent testing

Sunday, January 18th, 2009

The main reason that I can think of for putting testers and developers in different teams is the desire for independent testing. By this, I mean someone who can think about the software critically, and not be “polluted” by knowledge of the internal details. This is a very valuable thing, and means that you can end up with a tester who is using the software in ways that the original developer never even thought of. A side effect is that the independent tester is more likely to be measured by the number and quality of defects that they find, and can therefore be appropriately critical of the software under test. This can be good.

I’m not convinced, however, that *all* testers need to be completely independent.  While independence is valuable, it can create a rift in a delivery team of developers versus testers – it can become a battle, with software and defect reports being “thrown over the wall”.  I think that there are levels of independence.  It may be good for some testers to be involved in high level design discussions, but not low-level ones, or actual coding.  There are times when having a tester with knowledge of the code can be advantageous – maybe that person could even spend the rest of their time actually developing the code.  There doesn’t have to be a one-size-fits-all solution.

In short: independent testing is a good thing, but I think there are advantages to identifying when independence is required, and when breaking down barriers between developers and testers is more useful.

Isn’t it all just risk?

Wednesday, January 14th, 2009

Over the last few days, Ben and I (this is a joint post) have been trying to reach agreement on our understanding of risk. Ultimately we want to identify some new and effective methods to articulate the risks we may identify. The discussions were held at lunchtime, and OK, things got a bit silly. Nevertheless, we think there is a lesson to be learned somewhere in the example below. As such, please do add a comment if you can find one… We’ve also submitted this as an idea to the BBC’s Genius programme.

Objective: Introduce a consolidated risk gauge to simplify the (human : machine) interface. The merits of such a device are illustrated in the motor-car example below.

Modern cars have a bewildering array of dials and warning lights on a dashboard – but are they really necessary?

Consider just one of these dials: the speedometer. Does a driver really care what his absolute speed, based on centuries old units and the period of the Earth’s orbit around the sun, is? The answer is no: a driver simply wishes to know he will get to his destination without incident: be it crashing, getting a speeding ticket or missing his appointment.

It is therefore proposed that the speedometer is replaced by a risk dial – which interprets prevailing driving conditions, speed limits and navigation plans – e.g., using existing GPS technology – to calculate a risk metric. For example, if the driver exceeds a speed limit on a road, the risk gauge will go up, as speeding tickets are more likely. The driver can then elect to change his driving style to reduce this risk.

This proposal can be extended further. Consider the fuel gauge. The driver does not care about how much is in the tank per se – he simply wishes to understand the risk of running out of fuel on his journey. This dial could therefore be replaced too, by one measuring this risk. Note that the speedometer and fuel gauges have been simplified to share a common unit – one of risk – and hence can share the same gauge! By extension of the same argument, all dials and warning lights can be incorporated into one single dial of “consolidated risk” – thus addressing the complexity of modern car dashboards.

Missed a car service? Risk increases. Parked in a dodgy area? Risk increases. Such a metric would help encourage drivers to minimise risk, and even find alternate transport methods. This conveniently brings the proposal onto its zenith: consider how much more pleasant and minimalist a Jumbo Jet’s cockpit would be if there was just a single risk gauge.

‘Consumability’ testing

Monday, January 12th, 2009

I finally got ‘Mac’ed up at the weekend and bought my first iMac. A thing of beauty! I’ve seen a lot of them recently, but have never set one up so thought it was the ideal opportunity to do some real  ‘out of the box’ ‘Consumability’ testing.

First up, the whole lot comes in one box that’s easy to carry out of the shop. I got home and opened it up and decided to get as far as I could without reading the manuals. First out of the box comes the wireless keyboard, then the power cable, then the machine itself. Lastly a long white box with some books, the mouse and some sort of remote control.

So the machine goes onto the table, the power lead can only go one place and there’s only one power on button that I pressed. A few seconds later a screen comes up telling me to sort the mouse out – simple enough.
Then the next screen tells me to get the keyboard ready. Easy – except I put the batteries in the wrong way and had to refer to the booklet to check (actually, on closer inspection the keyboard had a little diagram on it telling me which way they went – so let’s put it down to a user error).

Once the keyboard was on everything went smoothly – it asked me to run a few tests to check it was connected properly, then off it went setting up the machine.

So my out of the box experience and consumability testing have scored pretty high. Next step the ‘11 year old test’. Could my daughter sync her phone up with the iMac using the bluetooth connection? Of course she could. It did take her a couple of minutes to work out which folder she needed to connect to, but otherwise easy peazy!

Now onto some load testing……

Software with a sense of humour?

Friday, January 9th, 2009

I bought my wife a Nintendo DS for Christmas along with Dr. Kawashima’s Brain Training software. Since then, we have been competing to get our ‘brain ages’ as low as possible. Last night I was on a roll; I achieved a new player record in the Stroop test and personal bests for the word memory and connect maze tests. Having completed the three exercises, the software announced it was calculating my new brain age. For those unfamiliar with the software, a ‘brain age’ score of 20 is the best possible score.

After an impressive effort on my part, imagine my disappointment when I was informed my new ‘brain age’ was calculated to be 78 – more than double what it was before I started. What an outrage! I could not understand what I had done wrong; maybe I had pressed some buttons while the system was calculating, which had somehow affected the result. Either way, I was sure I was looking at a defect, and one I would not be able to recreate easily. While my wife saw the funny side, I was still trying to make sense of the situation. I clicked ‘next’ to get back to the system’s main menu page – at this point the system appears to persist the result and also presents some helpful tips: apparently I may have been tired when I did the tests, which may explain my low score. Grrr! Final step: I clicked OK to acknowledge completion of the test, but hang on a minute – there are some extra panels and I’m informed that there may have been a problem with the original calculation; my real score is in fact 23. At this point the animated character on the top screen (who I presume to be Dr Kawashima) is laughing as he tells me he had made a joke! As a tester, this is the first time software has made fun of me intentionally. I wonder if it will be the last? :)