Saturday 31 March 2012

Data handling - the right tool for the job


Microsoft Excel must be the most widely used analytic tool available today but, it’s not the only tool in my tool-bag and it should not be the only one in yours either.  It’s perfect for building small models, has broad embedded functionality, good charting capabilities, pivot-tables (a superb tool worthy of its own post), comes with a programming language so you can extend its capability even further and almost everybody has a copy already.  It’s awesome and I use it every day.

But...the data we analyze is getting much bigger and a lot more complex.  Even with newer versions of Excel allowing over 1 million records in a sheet, what can you do usefully in Excel with 1 million records?  Certainly, you don’t want to print it, it’s near impossible to chart or model against, bottom line you are using the wrong tool for the job.  To do the job well you need to find, learn and use the right tool.  Don’t believe me?  Try chopping down a tree with a hammer!

It may be tempting to aggregate your data to reduce data size and continue using Excel.  Beware!  When you aggregate (Sum, Average, etc.) you lose detail you might need.  As an example, let's look at retailer point-of-sale data: the record of every purchase in every store:  
  • Aggregate individual shopper transactions into daily summaries (by product and store) and you can no longer see product associations in the basket or product substitution  from basket to basket;  
  • Aggregate daily to weekly data and you will be blind to monthly sales patterns (which can be very extreme).  
  • Aggregate store to national data and you will be unable to see regional patterns or relate sales profiles to demographics by store.  
If you throw away the detail you need (just to fit in a smaller space) then to quote a friend of mine "everything just looks vanilla".

At this point, most business analysts will open up Microsoft Access.  This can handle more data (up to 2 Gigabytes) and the complexity of multiple tables better than excel does it in multiple worksheets.  It also uses the same programming language so there is less to learn.   It’s a very good tool when used appropriately and I have a lot of experience with it, even designing custom data architectures and programming extensions to drive faster performance from it.

It seems easy to go beyond 2 Gigabytes these days but even before you reach 2GB you really should move to a more industrial-strength database (SQL Server, Oracle, DB2, MySQL…) just to handle the data volume effectively.  As data heads towards Terabytes and the data complexity increases you will need much more knowledge about database design and programming to do it well.

The analysis tools are not nearly so easy to apply in these environments as in Excel, requiring access to more high powered and analytical, data mining and optimization tools (SAS, SPSS, ILOG) or, for simpler analytics and my preference, a capable programmer that can embed the analytics directly into your database applications so they are usable on demand.

I have recently been working on a 3 Terabyte database, around 1500 times larger than the biggest Access database you can build and substantially larger than most SQL Server databases.  At this point, we are really talking ‘Big Data’.  In the analytical world you’ll see this phrase ‘Big data’ a lot right now.  If, for example, you’re collecting high frequency sensor data for weather modeling, or collecting data on for a web search engine, 3TB may not seem that big, but by current CPG standards it certainly is and importantly, its big enough that it becomes challenging to use regular, even industrial-strength, database packages. 

In this case (for Point of Sale data analysis) we use a novel and application-specific, data-architecture and associated query engine developed by Shiloh Technologies (www.shilohtech.com) on a SQL Server database.  The Shiloh application, enhanced now by a variety of analytic tools handles the huge data volumes for us, runs analysis, filters, summarizes and presents the data back to the user in a volume that a mere human can manage (certainly less than 100,000 records, 10,000 would be better).  And guess where we put that data?  Back into Excel of course – it’s a truly phenomenal tool!



Tuesday 20 March 2012

Data Cleansing: boring, painful, tedious and very, very important

I've been working recently on a category management project and I'm reminded of just how essential clean, well-organized data is.  We are working to group stores into 'clusters' of similar stores; later we will see what geographic and demographic data best helps us to predict cluster membership and optimize product assortment by cluster.

As a first pass, and under a severe time crunch, we took the data available, ran it through the model and while it processed, I was unhappy with the predictive power we found.  Of course, this approach was ridiculously  optimistic: so, back to look at the product characteristics we were using.
While the data were cleaner than expected they still suffered from a range of problems visible to someone who does not know the products that well:
  • missing, invalid and inconsistent values
  • inconsistency across related products (flavor variations with different weights and pricing). 
  • product characteristics that should really be split into multiple characteristics (because the options are not mutually exclusive)
It's taken a few iterations and a few days to get this cleaned up and to embed some of the new product characteristics in the system but its worth every minute.  Even simple analysis  now shows more meaningful results.  Predictive models will benefit even more.

Clustering is a relatively simple statistical process:   once set up, I can teach someone with limited predictive modeling skills to re-run models with sensible defaults and to interpret the outputs.  Cleaning the data and presenting it correctly to the modeling tools (so you get useful answers) takes more skill.


So, if you are knee-deep in a modeling project and have not paused to check your data quality, perhaps now is the time…



Sunday 18 March 2012

What is ‘analysis’ and why do most ‘analysts’ not do it?

In the Consumer Product Goods (CPG) world, there are a lot of analysts: supply chain analysts; sales analysts; shelf analysts; category analysts; transportation analysts... you get the idea.  For many people the 'analyst' role is their first step onto the managerial ladder.  It is their job to learn the business,  'crunch' numbers and one day, with hard work and a little luck, get promoted to a non-analytic, managerial role.



Knowing this career path, their managers recruit for skills and attitudes that reflect managerial and leadership potential (results focus, initiative, collaborative team working and initiative).  Rarely do they look for strong analytical skills (math, statistics, logical reasoning, systems engineering, management science and operations research) and as most organizations provide limited career paths for people with such skills, perhaps it is best not to recruit them into a dead-end role.

For many analysts, much of the work is focused around report generation which is manual and repetitive in nature ( run/copy/paste/format) and leaves little time for investigative, analytical work to comment on the reports, let alone investigate why the report says what it does.  The analyst stays stuck in this rut because they do not possess the skills and perhaps the time to find a way out.

It should not come as a surprise then that while analysts can be very bright, very energetic and looking forward to a great career in your company, they may not be very strong at 'analysis': framing the business issue to be solved; extracting the maximum useful insight from available data; and making recommendations as to next steps.

There are certainly other ways of thinking about what 'analysis' is, but I think this serves for now.  Of course it also gives you an idea why some people strong in analytic skills don't always do good analysis either :-)  In my experience, many lack the interest in the business issue they are working on, being enamored with the analytic process rather than the end-result.

Finding the combination of skill-sets you need to do this work well is tough and unless you have a very large organization, maintaining a critical mass of these skills is likely to be a problem too - it may be necessary to buy it in.