Friday, July 28, 2006

Pumping Data

[I seem to be on an error handling kick right now.]

Several of the books that I have been reading over the last few months have made a big point about allowing the interactive user to save their data at any time, regardless of whether it is valid or not.  I agree with this approach in many instances, especially when the entry of the data takes a long time.

The counter argument is that the database has constraints that the data has to abide by or it cannot be stored.  Not valid means not stored.

That only holds true if you think of the data as representing a single stage.  In reality, data exists in many stages.  Take the data that is involved in a work flow situation.  Each step in the work flow may add to or extend the data in the “item” being worked on.  It is very likely that for a given work step at least some of the new data is stored in tables different from the tables used to store the “item” prior to the execution of that work step.  The designers of the work flow built a database schema that recognized the sequential accretion of data.  The database design does not freak out in the presence of incomplete data.

I would like to argue that the entry of a complex form (such as a loan or college application) where the user has to assemble and enter a substantial amount of data should be thought of as a “work flow”.  That would imply that the database schema be designed to accommodate incomplete data.  From there it is just a short hop, skip and jump to designing it to accommodate invalid data.

While we could define a large number of steps, let’s stick to three to illustrate the point.

  • Unstructured: we allow the user to enter all sorts of data.  Each data field could be of the wrong data type, the wrong length, missing, inconsistent with other data fields, etc.  We might have several fields to hold unstructured notes.  We might store this in the database as an XML string, or other tagged format.  We would simply use the tagging structure to keep track of what each of the input strings was (more or less) intended to represent.
  • Structured: we extract recognizable data from the unstructured data into named fields.  Each data field would be of the correct data type but it may not be valid; for example, we have a number to represent an income level but it might be out of what we deem to be a reasonable range.  Associated with each data field would be one or more data fields to hold what I call “meta error data”.  In the simplest design, this meta field would hold one or more error messages that described the status of the operational data field: missing, out of range, not one of the valid values, and so on. 
  • Final: this is the final data that is valid in every way.  Data is put here only if it passes all of the validation tests.  The meta error data in the Structured version of the input data would have no messages.

The application would be built to accept and validate data.  The data access layer would be designed to store data wherever it could.  Valid data would be stored in the Final staging area, the Structured staging area, and the Unstructured staging area; this would be data that has made the complete journey from initial data entry to nicely refined data that can be processed.  Structured but invalid data would update the Structured and Unstructured staging areas; this is data that we think that we recognize but that has not made the cut yet.  Anything else would end up in the Unstructured staging area.  If the user had to stop and then later resume the data entry process, the application would (for each data field) look in the Final staging area, then the Structured staging area, and then the Unstructured staging area, using the first value that it found in that search. 
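
To make that lookup concrete, here is a rough sketch in C# of what I have in mind.  The class name and the three dictionaries are my own invention for illustration; the real data access layer would of course be querying the three staging areas in the database.

    using System.Collections.Generic;

    public class StagedFieldResolver
    {
        private readonly IDictionary<string, string> finalArea;
        private readonly IDictionary<string, string> structuredArea;
        private readonly IDictionary<string, string> unstructuredArea;

        public StagedFieldResolver(
            IDictionary<string, string> finalArea,
            IDictionary<string, string> structuredArea,
            IDictionary<string, string> unstructuredArea)
        {
            this.finalArea = finalArea;
            this.structuredArea = structuredArea;
            this.unstructuredArea = unstructuredArea;
        }

        // For each field, prefer the fully validated value, then the typed but
        // possibly invalid value, then whatever raw text the user last entered.
        public string Resolve(string fieldName)
        {
            string value;
            if (finalArea.TryGetValue(fieldName, out value)) return value;
            if (structuredArea.TryGetValue(fieldName, out value)) return value;
            if (unstructuredArea.TryGetValue(fieldName, out value)) return value;
            return null;
        }
    }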

Got to go catch a plane.  More on this later.

 

Oops!

After I did my last post on the 0.01% solution, I went out for a walk.  I got to thinking about how applications respond to errors in input. One of the key points made by one of the major gurus of quality, Dr. W. Edwards Deming, was that it was more logical to ascribe the mistakes in processing to the process itself rather than to the people performing the process.  Rational management should design the process to reduce the number of possible errors.  I think the same thing is true about applications. 

I order a lot of things online.  Each of these applications will ultimately, as part of its checkout process, ask for credit card information.  One of my pet peeves is that there is a lot of variability in how these applications will accept the credit card number.  Some applications require you to enter the spaces that appear on the credit card.  Other applications will not accept spaces at all.  I think that that is incredibly stupid.  The application ought to accept the credit card number with and without the spaces and internally convert it to the format that it needs (I assume that’s without the spaces).
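
Here is a small sketch in C# of what I mean.  The method name is mine and the cleanup is deliberately simple; the point is that the application, not the user, does the normalizing.

    public static string NormalizeCardNumber(string input)
    {
        System.Text.StringBuilder digits = new System.Text.StringBuilder();
        foreach (char c in input)
        {
            // Keep the digits; quietly drop the spaces and dashes that people
            // naturally type when copying from the physical card.
            if (char.IsDigit(c))
            {
                digits.Append(c);
            }
        }
        return digits.ToString();
    }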

Another of my pet peeves is how applications handle addresses.  The application typically provides multiple input boxes to hold the one or two address lines for the street address, perhaps an apartment number, city, state, zip code, and so on.  If someone provides me with an address in an email and I want to get a map for that address, some of the mapping applications I use require me to parse the address to figure out all of the various piece parts.  What I really want to have is a text box I can paste the entire address into and ask the application to parse out the various piece parts for me.  The application would still have the various fields with the various parts of the address.  The parsing function would fill in those fields for me.  I then could edit those fields as needed and submit the address.  [This idea is not original with me.  I’m not sure where I read this; it might have been one of Alan Cooper’s books.  In any case I think it’s a good idea and I want to add my voice to promote it.]
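
A very rough sketch in C# of the paste-and-parse idea.  The regular expression only understands a naive “street, city, state zip” form and all of the names here are made up; a real parser would have to be much smarter, but the user gets to correct whatever it gets wrong.

    using System.Text.RegularExpressions;

    public class ParsedAddress
    {
        public string Street;
        public string City;
        public string State;
        public string Zip;
    }

    public static class AddressParser
    {
        public static ParsedAddress Parse(string pasted)
        {
            ParsedAddress result = new ParsedAddress();
            // Expect something like "123 Main St, Springfield, IL 62701".
            Match m = Regex.Match(pasted.Trim(),
                @"^(?<street>.+?),\s*(?<city>.+?),\s*(?<state>[A-Za-z]{2})\s+(?<zip>\d{5}(-\d{4})?)$");
            if (m.Success)
            {
                result.Street = m.Groups["street"].Value;
                result.City = m.Groups["city"].Value;
                result.State = m.Groups["state"].Value;
                result.Zip = m.Groups["zip"].Value;
            }
            // The pre-filled fields go back to the form for the user to correct.
            return result;
        }
    }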

Along the same line, I think applications should capture statistics about the errors that users make when they enter data.  A large number of a particular kind of error for a particular input field is not an indication that you have a lot of stupid users.  It is an indication that you have designed the application stupidly.  A good application makes it hard for users to make mistakes doing the typical processing.  Now I understand that there is a risk that you might make the application so “cuddly” that it becomes almost impossible to get useful work done.  However, a robust and vibrant application is aware of its environment and over time adapts itself to productively serve the needs of its community.  If the application needs a credit card number and there are a lot of errors generated because people enter those credit card numbers with spaces, that is an indication that you should accept spaces.
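
Capturing those statistics does not have to be elaborate.  A minimal sketch in C#, with names of my own invention, might be nothing more than a per-field counter that the validation code bumps every time a field fails:

    using System.Collections.Generic;

    public class ValidationErrorLog
    {
        private readonly Dictionary<string, int> countsByField =
            new Dictionary<string, int>();

        // Call this every time an input field fails validation.
        public void Record(string fieldName)
        {
            int count;
            countsByField.TryGetValue(fieldName, out count);
            countsByField[fieldName] = count + 1;
        }

        // A field with a disproportionate count is a candidate for redesign,
        // not a sign of careless users.
        public IDictionary<string, int> CountsByField
        {
            get { return countsByField; }
        }
    }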

The 0.01% Solution

In an earlier post, The Tale of the Bathtub Drain, I talked about the difficulty of extracting the edge or fringe conditions.  The input consists of case alpha 99.99% of the time and case beta 0.01% of the time.  The problem is that each of these cases is probably going to take the same amount of effort to document, design, implement, test, and so on.  It doesn’t seem quite fair.  In fact the effort to handle case beta is probably going to be even more difficult because it is obscure.  There probably are not very many people who fully understand case beta and it will take you longer to get the full and complete picture.  In fact the “powers that be” might not even be aware that there is this alternative processing path and might well regard the effort to document and handle it as a distraction from the real work at hand.

In my mind, that is a very strong motivation for getting a partial solution in place as quickly as possible.  Having the system reject valid data provides the concrete motivation to understand and document how to process that data.  Developing this motivation early in the project lifecycle as opposed to at the very end is critical to developing a quality application.

I’ve been giving a lot of thought to how applications ought to be designed to handle these kinds of situations.  It seems to me that applications oftentimes work by applying a set of rules to the input to control what kind of processing is applied to each of the inputs.  Oftentimes an application takes a binary approach to its input: either the input is “good” or it is “bad”.  The good input is processed and sent on its way; the bad input is rejected. 

I once worked on an application that accepted data from a wide variety of sources.  This data would be dumped into a staging area which was just a set of tables on the database that had very relaxed requirements for format and content.  Periodically a set of automated processes would examine this data in the staging area and pull the data that passed the validation criteria into the mainstream of the processing.  The data that didn’t pass the validation requirements would stay in the staging area.  From time to time human beings would bring up an application that would allow the “adult leadership” to review the data in the staging area and take appropriate action.  Most of the time they updated the input to correct the problems; the next time the automated processing ran, the data would be picked up and sent along its way.  Another significant part of the “appropriate action” was to formulate new rules for processing or to alter existing processing rules.
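
Something like the following C# sketch captures the shape of that periodic sweep.  StagedRecord, Validate, and Promote are placeholders of my own invention, not names from that system.

    using System.Collections.Generic;

    public class StagingSweep
    {
        public void Run(IList<StagedRecord> stagingArea)
        {
            // Iterate over a copy so we can remove promoted records as we go.
            foreach (StagedRecord record in new List<StagedRecord>(stagingArea))
            {
                if (Validate(record))
                {
                    Promote(record);            // move it into the main tables
                    stagingArea.Remove(record);
                }
                // Records that fail validation simply wait in the staging area
                // for the "adult leadership" to correct them or change the rules.
            }
        }

        private bool Validate(StagedRecord record) { /* the rule set goes here */ return false; }
        private void Promote(StagedRecord record) { /* insert into the main tables */ }
    }

    public class StagedRecord { /* relaxed-format fields */ }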

A very long time ago I took Latin in high school.  One of the works that I read in Latin was by Julius Caesar.  It began by saying “All Gaul is divided into three parts.”  I think applications should divide their input into three parts: data that is clearly broken because values are missing or mangled; data that is complete and the application knows how to process; and data that appears to be complete but the application does not know how to process explicitly.

Let’s do an example. Suppose that alpha and beta are products that we want to compute prices for.  When an order comes in for alpha, the application can perform the pricing automatically.  When an order comes in for beta, we mark the order for the attention of the “adult leadership”.  Every so often the “adult leadership” accesses the suspended data and performs the pricing computation by hand.  In an agile project with multiple releases of the application, each successive release would be able to handle more of the input automatically.  The goal would be to reduce the “pile” of exceptional input to as close to zero as was economically feasible.  It may well be the case that the exceptional pile never becomes completely empty.  Some special cases might be so special and so infrequent that they do not warrant automated processing.
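
Here is a sketch in C# of Caesar’s three-way split applied to those orders.  The Order class, the product codes, and the dispositions are all invented for illustration.

    public enum InputDisposition { Broken, AutoProcess, NeedsReview }

    public class Order
    {
        public string ProductCode;
        public int Quantity;
    }

    public class OrderClassifier
    {
        public InputDisposition Classify(Order order)
        {
            // Clearly broken: values are missing or mangled.
            if (order == null || string.IsNullOrEmpty(order.ProductCode) || order.Quantity <= 0)
                return InputDisposition.Broken;

            // Complete and we know how to process it automatically.
            if (order.ProductCode == "alpha")
                return InputDisposition.AutoProcess;

            // Appears complete, but we have no explicit rule yet; suspend it
            // for the "adult leadership" to price by hand.
            return InputDisposition.NeedsReview;
        }
    }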

I think there is another interesting aspect of this approach.  The data processed by the system represents a rich source of requirements.  That is, we know what kind of inputs exist and we also know how the “adult leadership” handled those special inputs.  Rather than asking them how they would handle each kind of input, we can simply look in the database to see how they actually did it.

 

Drain Watch: Day 2

The Bathtub Drain is down. 

Stay tuned for comprehensive geopolitical, financial, and astrological analysis of this unfolding drama.

It is Friday and I get to go home this afternoon.

 

 

Thursday, July 27, 2006

The Tale of the Bathtub Drain

I am on the road right now, living in one of those extended-stay hotels. Every day they come in while I am at the client site and clean up the place, give me new towels, and the like.  Every morning, I take a shower.  How is this related to designing out loud, you ask?  I am getting to that; be patient.  The problem is that there must be different cleaning crews.  Some of them leave the drain in the bathtub up (so that water can drain) and some of them leave the drain in the bathtub down (so that the water cannot drain).  Now if I were half as clever as I thought that I was, I would check this before I started the water and got into the shower.  The truth is that I am not that clever and most times do not figure out that the drain is down until the water starts covering my ankles.  In my defense, the drain is pretty slow even when it is open.

The problem here is variability.  Variability screws up my morning routine (not a lot, but we all have to take our lessons of life from where we can).  Variability also causes a lot of the grief in building software.  You gather requirements and the requirements say in a clear way that X is the case.  Down the line you find out that it is X 99.99% of the time but every once in a while it is Y.  There it is, that old devil variability.  Especially unexpected variability.  There you are, strutting your stuff, hitting your milestones, making your numbers up and down the line and BANG! some client vice president comes in to burst your bubble by asking why his ankles are wet.  Bummer!

Ah well, into every life a little water must rise.

 

Building the perfect beast

Going along with the theme of this blog, designing out loud, what I want to do in this entry is to look at how I could “future proof” my design to handle the impending client-specific product pricing.

One of the ways I could do this is to add a new parameter to the “compute total price” method supported by the Product class and pass in the reference to the current Customer instance.  I really don’t like this one very much at all.  One, I’m adding a parameter to several hundred different calls to this method that initially will do nothing.  That really does violate the notion of YANGTNI.  Two, it also makes the Product class dependent upon the Customer class.  I may well want to use that Product class in an application that has no knowledge of clients and customers.  Three, it also means that the Product class has to have some knowledge about the peculiarities of different clients.  All in all, not a very good idea.

Another variation on this idea is to create a “parameter class” for the method that computes the total value.  The parameter class simply would hold references to all of the relevant bits and pieces needed for the “compute total price” method to perform its function.  This approach suffers from the same ills as the above approach.  The Product class knows about the Customer class and is dependent upon it.  Repackaging the dependency doesn’t make it any more acceptable.

Another approach is to use the Strategy pattern.  We would create an interface (or an abstract class) to define the methods to compute the price of a product.  We would then create a realization of that interface that would handle the generic approach that we are currently using.  Until the new client arrangement shows up, we would have only one implementation of the interface.  We would modify the Product class to hold a reference to an object that implements the interface and to redirect the call to compute the total price to a method on the interface.  This approach is a little better than the first approach.  If you are familiar with the notion that Eric Evans puts out in his Domain Driven Design book, the Product class and the pricing interface definition would be in one Aggregate.  We would probably put all of the details of pricing and invoicing into a different Aggregate.  (For those of you not familiar with Evans’s book, think of an aggregate as the same thing as an assembly.  There’s a lot more going on here than that but for the moment this is enough to help you get through this process.)  With this approach we have the capability to inject a different strategy for computing the price of a product.  Note that so far we’ve only repackaged the existing logic into a single-strategy environment.  All that we have done is to add a few design elements to introduce a layer of indirection.  I would claim that we have not violated the spirit of YANGTNI (although we may have bent and bruised it a bit).
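
In C#, the sketch might look like this.  The names (IPricingStrategy, StandardPricingStrategy) are mine, and the pricing body is just a stand-in for the existing quantity-break logic.

    public interface IPricingStrategy
    {
        decimal ComputeTotalPrice(Product product, int quantity);
    }

    public class StandardPricingStrategy : IPricingStrategy
    {
        public decimal ComputeTotalPrice(Product product, int quantity)
        {
            // The existing generic quantity-break logic would live here.
            return product.BasePrice * quantity;
        }
    }

    public class Product
    {
        public decimal BasePrice;

        private IPricingStrategy pricing = new StandardPricingStrategy();

        // A client-specific strategy can be injected later without touching
        // the hundreds of existing callers of ComputeTotalPrice.
        public void SetPricingStrategy(IPricingStrategy strategy)
        {
            pricing = strategy;
        }

        public decimal ComputeTotalPrice(int quantity)
        {
            return pricing.ComputeTotalPrice(this, quantity);
        }
    }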

Still another approach is to create a separate class to perform all of the pricing.  We could argue that the Product class should only know about its base price and should not know about how various other factors such as the identity of the client, the region of the sale, or the phase of the moon affect how we compute the final price for the order.  What we would do then is to create a separate class to handle pricing.  Initially, this Pricing class would have to have knowledge of what product was involved and would provide our familiar compute total price method.  Everywhere we were going to call the compute total price method on the Product instance, we would now call that same method but on a Pricing instance.  I’m liking this one much better than the strategy approach.  The Pricing class has a single responsibility: figure out how much to charge the customer under the current set of circumstances.

The initial implementation of the Pricing class might only take into account the nature of the Product and the quantity of the order.  If and when we get new marching orders about pricing regarding a particular client, we can deal with that then.  I almost certainly will want to use some kind of factory method to build/acquire the Pricing object each time.  Again, another level of indirection, but this is relatively cheap to do and again only bends and bruises the YANGTNI principle.
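
A sketch in C# of that combination.  The class names and the factory are illustrative only; the point is that callers go through the factory, so a client-aware Pricing subclass can be slipped in later.

    public class Pricing
    {
        private readonly Product product;

        public Pricing(Product product)
        {
            this.product = product;
        }

        // Single responsibility: figure out how much to charge under the
        // current circumstances.  Initially that is just product and quantity.
        public virtual decimal ComputeTotalPrice(int quantity)
        {
            return product.BasePrice * quantity;
        }
    }

    public static class PricingFactory
    {
        // Callers ask the factory for a Pricing object rather than newing one
        // up, so a client-specific subclass can be returned from here later.
        public static Pricing Create(Product product)
        {
            return new Pricing(product);
        }
    }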

 

Wednesday, July 26, 2006

YANGTNI

In the philosophy of agile programming there is this principle called "you are not going to need it" or YANGTNI.  The notion here is that you shouldn't add design or logic to your application in anticipation of possible future requirements.  This is a principle that resonates with me quite strongly because I have a tendency to complicate the design in anticipation of possible things to come.  An associated notion is that when the requirements change, one can refactor the code to accommodate the new or altered requirement.  The problem that I have with this concept is that it tends to be articulated in a binary or absolute manner: either you need it or you don't.

In my world, things are never so black or white.  Let's give an example.  Suppose that I'm building an application that involves pricing of product that is sold to clients.  Suppose I have a requirement that stated something like this:

The price for a quantity of product on a single order is determined using the following table:
Quantity Range            Price per unit within the range
1 thru 99 inclusive       100% of base price
100 thru 999 inclusive    80% of base price
1000 and above            60% of base price

In other words, the more units that you order at a given time, the cheaper each of those units becomes.  Given this requirement, I would probably add a method “ComputeTotalPrice” to a product class (with the quantity as a parameter) to perform this computation.  Now let's complicate the situation a bit.  Suppose that we have a potential client coming down the road and the negotiations with this client so far would suggest that they want to change the way that we create that table, a special arrangement just for them.  We might get a decision tomorrow or next week or at the very least by the end of the decade.
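
For concreteness, a minimal C# sketch of that method as the requirement stands today.  The class shape is my own; only the quantity breaks come from the table above.

    public class Product
    {
        public decimal BasePrice;

        // Price for a quantity of this product on a single order, per the table.
        public decimal ComputeTotalPrice(int quantity)
        {
            decimal discountFactor;
            if (quantity >= 1000)
                discountFactor = 0.60m;   // 1000 and above: 60% of base price
            else if (quantity >= 100)
                discountFactor = 0.80m;   // 100 thru 999: 80% of base price
            else
                discountFactor = 1.00m;   // 1 thru 99: 100% of base price
            return BasePrice * discountFactor * quantity;
        }
    }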

As I understand it, the YANGTNI approach would suggest that you do nothing about this potential change until there is a firm, hard requirement upon which to base an implementation.  When you have that revised requirement, you can refactor and address the requirement.

As I said, this principle resonates with me but other parts of me say, wait a minute, there's got to be more to the story.  Taking things to an extreme, that would suggest that I don't need to lock my car or my house except on the day that someone's going to break in and steal something.  I do not have to wear my seat belt when I drive except on the day that I have an accident.  The reality is that I really need to sit down and do some analysis.  I need to ask:
  • What is the likelihood that this particular change is going to happen?

  • When it does happen, what is the impact going to be on the design and implementation that I currently have in place?

  • Is there something that I could do today that would reduce the potential future impact?

  • What is the cost of making that change today?

  • What is the trade-off of making the change today versus waiting until some future time?

Let me take my example a little further.  If I only compute the price by calling the method described above in one location in my code, the answer to my “what is the impact when the requirement changes” question is, "not too much".  At this point the smartest thing to do is to discontinue my analysis.  When the change happens, we can respond to it easily and quickly.  Suppose, on the other hand, that my current implementation calls this computational method in hundreds of different locations in my code.  If I know that the logic of this method and in particular the parameters that it will require are stable, this is not a bad thing to do.  The unfortunate outcome of taking this approach is that the impact of a change in the price computation that involves additional parameters, be they the identity of the customer, the region of the country, or the phase of the moon, is increased dramatically.

It just seems to me that you have to think about these things.  I think that you have to ask the question: what existing requirements have a potential for change, and is there something that I could do in terms of the structure of my design that would not require a lot of effort but would position the design and implementation to more readily accommodate such changes?

I'll come back in future postings to ways that I think you could structure your design to accommodate such potential changes.          

Tuesday, July 25, 2006

Parallel Processing

I have been creating my unit tests for my restructured web application.  I am making very good progress and I am happy with the results so far.  However, I am doing something that I have never done before and it seems to be quite useful to me.  I am developing the NUnit tests and, in parallel, I am updating a requirements document.  My users will never understand the NUnit tests but they do have a reasonable chance of understanding my requirements document.  Note that I am writing a very sparse requirements document; it is just a bunch of declarative statements which define the functional requirements.

The process has been to write a handful of unit tests and get them working.  Then go over to the requirements document and update that.  More than once the update of the document has prompted me to write one or more unit tests.  The back and forth has been interesting.  It has given me two different looks at the same problem space.  

This approach would seem to ensure that everything in the functional requirements is testable: I write the test immediately and if I cannot figure out how to write a test, I modify the requirement until I can write a test.  
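
To give a flavor of the pairing, here is an invented example: a one-line requirement statement and the NUnit test that goes with it.  The requirement, the class, and the numbers are illustrative, borrowed from the pricing example elsewhere on this blog rather than from the actual application.

    using NUnit.Framework;

    [TestFixture]
    public class PricingRequirementTests
    {
        // Requirement: "An order of 100 through 999 units is priced at 80% of base price."
        [Test]
        public void OrderOfOneHundredUnitsIsPricedAtEightyPercentOfBase()
        {
            Product product = new Product();
            product.BasePrice = 10.00m;

            // 100 units at 80% of a 10.00 base price should come to 800.00.
            Assert.AreEqual(800.00m, product.ComputeTotalPrice(100));
        }
    }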

Just thought that you might like to know.

Some Thoughts on TDD

I'm posting this entry using the Blogger toolbar for Microsoft Word.  If this works, it will make my life a lot easier.  I spend a fair amount of my time in Microsoft Word.

I am a big believer in automated unit testing.  I work in the Microsoft .NET world and I use NUnit all the time.  I have been playing with the concept of TDD, Test Driven Design or Development.  

Like any new concept, TDD plays strange games in my head.  If I understand the concept properly, I should let the design emerge from the tests.  I suppose that that would work if I had not spent a fair amount of time before I sat down to code thinking about the structure for the problem in question.  It is just a very odd feeling to pretend that you don't know that you're going to use a particular design pattern, for example, and write tests and code that do not take into account that final destination.  I suppose that, if you take this to its logical extreme, some other design might pop out in the process and that design might be better.  That is certainly one of the comments that Kent Beck made in his book on the topic.  

Anyway, I have a particular project I'm working on.  This is code written by someone else that I inherited and I'm trying to understand what's going on.  The style of the code is quite a bit different from that which I prefer.  It is a Web application with a lot of the business logic in the web pages or in the code behind for the web pages.  I prefer my web pages to be as thin as possible with all of the "meat and potatoes" logic in a separate set of classes (or even separate assemblies).  That way I can create a set of unit tests and run them in a fairly short time to make sure that I've understood what is going on.

What I have done with this project is to create a separate assembly with separate classes and I am using TDD to try to capture the essence of the business logic.  I like to have written requirements.  On this project there are almost no written requirements.  For this one application, I decided that I would try to understand what was going on by extracting a set of requirements.  I then started writing unit tests from these requirements and “filling in the blanks” by either writing my own new code or cutting and pasting code from the existing application.  [Right now, I can hear some purists out there who are going "Oh my God, I don't believe he actually said that" but that's the way it is.]  I just started this process yesterday and I made some good progress.  I'll let you know how it turns out.

My Very First Post in This Blog

It is said that the first is the hardest. My experience is that the fifth or sixth is much harder. The trick is to establish a pattern of behavior that can be sustained.

This blog is about software design, mostly object oriented software design. I find it useful as I am designing some software to try to capture my thinking about the problem. Having to write something down about the alternatives clarifies the mind. What I want to do in this blog is to talk "out loud" about what I am trying to do and what influences affect the design.

I read a lot about design (a byproduct of too much time on the road) and I try to apply the lessons to the work I do. Sometimes the current consulting project does not provide an opportunity. To address that, I have what I call a "sanity project" that I use as a test bed for trying out some of these ideas. Once upon a time, it was a real project but it has evolved into a sandbox for learning new ways of doing things by doing things in a new way.

My goal is to write something here every day to establish that behavioral pattern. As with any blog, the material here is going to be short, somewhat unfocused, and erratic (but in a good way).

Ok, the first one is done. I want to take a look at the Word toolbar that allows me to do this from Word.