Friday, July 28, 2006

Pumping Data

[I seem to be on an error handling kick right now.]

Several of the books that I have been reading over the last few months have made a big point about allowing the interactive user to save their data at any time, regardless of whether it is valid or not.  I agree with this approach in many instances and especially when the entry of the data takes a long time. 

The counter argument is that the database has constraints that the data has to abide by or it cannot be stored.  Not valid means not stored. 

That only holds true if you think of the data as existing in a single stage.  In reality data exists in many stages.  Take the data that is involved in a work flow situation.  Each step in the work flow may add to or extend the data in the “item” being worked on.  It is very likely that for a given work step at least some of the new data is stored in tables different from the tables used to store the “item” prior to the execution of that work step.  The designers of the work flow built a database schema that recognized the sequential accretion of data.  The database design does not freak out in the presence of incomplete data. 

I would like to argue that the entry of a complex form (such as a loan or college application) where the user has to assemble and enter a substantial amount of data should be thought of as a “work flow”.  That would imply that the database schema be designed to accommodate incomplete data.  From there it is just a short hop, skip and jump to designing it to accommodate invalid data.

While we could define a large number of steps, let’s stick to three to illustrate the point.

  • Unstructured: we allow the user to enter all sorts of data.  Each data field could be of the wrong data type, the wrong length, missing, inconsistent with other data fields, etc.  We might have several fields to hold unstructured notes.  We might store this in the database as an XML string, or other tagged format.  We would simply use the tagging structure to keep track of what each of the input strings was (more or less) intended to represent.
  • Structured: we extract recognizable data from the unstructured data into named fields.  Each data field would be of the correct data type but it may not be valid; for example, we have a number to represent an income level but it might be outside what we deem to be a reasonable range.  Associated with each data field would be one or more data fields to hold what I call “meta error data”.  In the simplest design, this meta field would hold one or more error messages that describe the status of the operational data field: missing, out of range, not one of the valid values, and so on. 
  • Final: this is the final data that is valid in every way.  Data is put here only if it passes all of the validation tests.  The meta error data in the Structured version of the input data would have no messages.
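To make the "meta error data" idea a little more concrete, here is a minimal Python sketch of one Structured field.  The names (StructuredField, structure_income) and the income range are my own assumptions for illustration, not part of any real schema; the point is only that the operational value and its error messages travel together, and that an empty message list is what qualifies a field for the Final stage.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical bounds for a "reasonable" income; purely illustrative.
MIN_INCOME = 0
MAX_INCOME = 10_000_000


@dataclass
class StructuredField:
    """One operational value plus its 'meta error data'."""
    value: Optional[int] = None
    errors: List[str] = field(default_factory=list)

    @property
    def is_valid(self) -> bool:
        # No messages means the field is ready for the Final stage.
        return not self.errors


def structure_income(raw: Optional[str]) -> StructuredField:
    """Promote a raw income string to the Structured stage."""
    if raw is None or not raw.strip():
        return StructuredField(errors=["missing"])
    try:
        amount = int(raw.replace(",", "").strip())
    except ValueError:
        # Wrong data type: no typed value can be extracted.
        return StructuredField(errors=["not a number"])
    if not MIN_INCOME <= amount <= MAX_INCOME:
        # Correct type, but the value fails validation.
        return StructuredField(value=amount, errors=["out of range"])
    return StructuredField(value=amount)
```

For example, structure_income("52,000") yields a field with no messages, while structure_income("-5") keeps the typed value alongside an "out of range" message.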

The application would be built to accept and validate data.  The data access layer would be designed to store data wherever it could.  Valid data would be stored in the final staging area, the structured staging area, and the unstructured staging area; this would be data that has made the complete journey from initial data entry to nicely refined data that can be processed.  Structured but invalid data would update the structured and unstructured staging areas; this is data that we think that we recognize but that has not made the cut yet.  Anything else would end up in the unstructured staging area.  If the user had to stop and then later resume the data entry process, the application would (for each data field) look in the final staging area, then the structured staging area, and then the unstructured staging area, using the first value that it found in that search. 
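The store-wherever-you-can and look-up-in-order behavior can be sketched roughly as follows.  This is only an illustration under some assumptions of mine: three in-memory dictionaries stand in for the staging areas, the function names are hypothetical, and I chose to clear a field's older entries in the higher stages on every save so that a stale valid value cannot mask a newer invalid one (the scheme above does not say what should happen there).

```python
from typing import Callable, Dict, List

# In-memory stand-ins for the three staging areas (hypothetical).
unstructured: Dict[str, str] = {}
structured: Dict[str, object] = {}
final: Dict[str, object] = {}


def save_field(name: str, raw: str,
               parse: Callable[[str], object],
               validate: Callable[[object], List[str]]) -> str:
    """Store a field as far along as it can go; return the stage reached."""
    unstructured[name] = raw          # raw input is always kept
    structured.pop(name, None)        # assumption: drop stale entries
    final.pop(name, None)
    try:
        value = parse(raw)
    except ValueError:
        return "unstructured"         # not recognizable as typed data
    structured[name] = value
    if validate(value):               # any error messages?
        return "structured"
    final[name] = value
    return "final"


def load_field(name: str):
    """Resume data entry: use the first value found, most refined first."""
    for area in (final, structured, unstructured):
        if name in area:
            return area[name]
    return None


def check_income(value) -> List[str]:
    # Hypothetical range rule, for illustration only.
    return [] if 0 <= value <= 10_000_000 else ["out of range"]
```

Saving "52000" with int as the parser reaches the final stage, "-5" stops at structured, and "about 50k" stays unstructured; in each case load_field hands back the most refined version that exists.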

Got to go catch a plane.  More on this later.

 
