Showing posts with label POS-Analytics. Show all posts

Monday, 29 September 2014

Next Generation Point of Sale Analytics

Over the last few months I've been exploring the features I want to see in a next generation platform for point of sale analytics: It's simpler, faster and cheaper, supports rapid blending of new data sources and is powered up with real analytic capability. Looking back there are a lot of posts on this topic so here is a quick summary with links back to the detail.


Note
  • I have no immediate plans to build such a system for sale but I do use systems with many of these features for ad-hoc analytics as they are flexible yet relatively easy to set up and tear-down without incurring substantial overheads. Consider this series more of a manifesto/buyers-guide.
  • I do see changes in the marketplace suggesting that a number of DSR vendors are at least considering a move in this direction. As to which one will get there first, I think it will be whoever feels least weighed down by their existing architecture.
Database technology has moved on dramatically over the last few years. For this scale of data, analytic solutions should be columnar, parallel and (possibly) in memory. This enables speed, scalability and a simple data structure that makes it easy to hook up whatever analytic or BI tools you wish.
If the only data you have in the system is POS sales for a single retailer, you can build a reporting system ("what sold well last week") but you will struggle to understand why sales change. Bringing in other data sources (multi-retailer, demographics, weather information, promotional calendars, competitor activity, socio-economic trends, Google trends, social media, etc.) allows for much more insightful analytics. It's not easy to do this though, particularly if your source database is locked down so that it takes a software engineer to add tables.
The term "Analytics" in general use covers a lot of activities, most of which involve little more than reporting. In some instances you can slice and dice your way through a dataset to find insight, and reporting is not without value, but it's not analytics. Not even close.
Can you buy good analytics? Yes, but there are also a number of pseudo-analytic solutions in the market that have little to no analytic power - caveat emptor!
To get to real, deep insights you need real analytic tools. Depending on the taxonomy you are used to, we are talking about predictive and prescriptive analytics, machine learning, statistics, optimization or data science. Most of these tools are not new but they are not generally found in standard BI offerings and even when they are (e.g. reporting level R integration) you may struggle to apply the analytic tools at scale.
Finally, whether you build your own analytic tools or buy them in to run on your platform, clever math is not enough. If a user cannot comprehend the tool or its suggestions due to poor user interface design and/or bad visualization choices, it's worth precisely ... squat.

Monday, 25 August 2014

Next Generation DSRs - Bring the Analytics to the data


Under old world analytics, you move data from the DSR to your analytic server, build models, then write results (sometimes models too) back out for integration into the DSR.
Now, consider this:
  • DSR datasets are often enormous. (2 years of data for a DSR I worked with recently input to a model was approx. 270 GB)
  • Analytic tools are small. (The R base software, all 150 packages I have installed and the development environment is 625 MB)
  • Analytic models are tiny. (Expressing a 10 component regression model in SQL, just 288 bytes and most of that is down to variable names)
Let's try that visually.
The input data is huge, everything needed to run R (my analytics tool of choice) is barely a blip on the scale and the resulting model can't be seen on this scale at all. And today we move the DSR data to the analytic server to run the analytics.... anyone else having an issue with this ?
Where the data is small enough that we can pull what we need via query over an ODBC connection and hold it in memory to run the analytics, perhaps you can live with the network overhead.
Similarly, if the DSR and analytic servers are co-located with a big fat data pipe connecting them, it doesn't matter so much. It's not same machine I'm after necessarily, but same rack would be nice.
What happens though, when the data is too big and the connection too slow (think wide area network) to be feasible? Now we need to build database structures on the analytic server, load the data (taking a copy), and if we are to re-run the analytics routinely, keep it in sync with the source on an ongoing basis. This is a lot of (non-analytic) maintenance work before we can even get started on the analytics.
So why do we do this?
"The analytic server is a high power, high memory machine great for analytics!" That's true but chances are your database servers have the same thing.
There are also valid concerns around how an analytic tool connecting directly to a database may impact other users. I do have a little sympathy for this, certainly much more than I used to, but think on this: a DSR is not a mission-critical system. The failure of a mission-critical system stops your business. If the DSR stops (and the chances are very good that you will have no issue at all), your reports are a bit late. Relax!
I have a suspicion that some of this is related to licensing. If you pay a small fortune for your analytic tools and they are priced per server, per CPU or per core, I can see why you would not want to go installing that software everywhere you might want to use it. Cheaper perhaps to bring the data to the software. Working with free open-source tools, it's not been an issue for me to install co-located or even on the same machine as needed.
Recently a number of database and BI vendors have moved to integrate analytic tools (often R, sometimes SAS) into their offerings, trying to deliver real in-database analytics. I do think this is a great direction to move in, though I have some concerns about the level of integration currently available. See my post on Analytic Power ! for more details.
Even if you can't execute true in-database analytics (which should be a Next Generation feature) there are still things you should be able to do to bring the analytics to the data.
First let's make a distinction between model-building (the act of creating new models from data) and model-scoring (running existing models against new data to make new predictions). All predictive analytic models I can think of can have this same split. (Descriptive and Prescriptive analytics do not)
Model-building is an intensive task, this is where all the heavy lifting happens in analytic work so processing and memory needs can be substantial though this varies widely depending on the analytic method and to some extent the implementation. If you have installed analytic tools directly on your database servers this may be enough to cause something of a slow-down. OK - try to co-locate instead. If you absolutely must replicate data to an analytic server on the other side of the world and try to keep your data in sync, I pity you.
Model-scoring is fast. A model is just a set of simple calculations. Deciding exactly what simple calculations you needed was the job of model-building but now you have done that, scoring new data against that model is quick.
This is what the result of a simple regression model looks like (in SQL):
[Variable_1] *-49.8916 + [Variable_2] *-24.2773 + [Variable_3] *-48.1305 + [Variable_4] *-253.7238 + [Variable_5] *-20.7173 + [Variable_6] *17.722 + [Variable_7] *12.9865 + [Variable_8] *-17.4036 + [Variable_9] *2.2738 + [Variable_10] *-7.9186 + 6.668 AS Prediction
If you think it looks complex, look again, it's just a set of input variables multiplied by specific weights (as found by model-building) and then added together. This is easy work for the database. More complicated models will have more complex expressions, you may see logs, exponents, trig., perhaps an if..then..else statement. Nothing the database will find difficult to execute if it's expressed in the right language.
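To make the "weights times inputs, added together" point concrete, here is a minimal sketch in Python of turning a fitted model into that kind of SQL expression. The function names and the three weights are hypothetical (they are not the full model above), and the square-bracket quoting simply copies the style of the expression shown; your database may want different identifier quoting.

```python
def regression_to_sql(coefficients, intercept, alias="Prediction"):
    """Render a fitted linear model as a single SQL column expression."""
    terms = [f"[{name}] * {weight}" for name, weight in coefficients.items()]
    terms.append(str(intercept))
    return " + ".join(terms) + f" AS {alias}"


def score_row(row, coefficients, intercept):
    """Score one record the same way the SQL expression would."""
    return sum(row[name] * weight for name, weight in coefficients.items()) + intercept


# Hypothetical weights, standing in for the output of model-building.
coefficients = {"Variable_1": -49.8916, "Variable_2": -24.2773, "Variable_3": 17.722}
intercept = 6.668

sql_expression = regression_to_sql(coefficients, intercept)
print(sql_expression)

# Scoring a single (made-up) record in Python, for comparison with the database result.
example_score = score_row(
    {"Variable_1": 1, "Variable_2": 0, "Variable_3": 2}, coefficients, intercept
)
print(example_score)
```

The generated text can be dropped into a SELECT list or a view, which is why scoring is cheap: the database only has to multiply and add.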
Unless models change with every input of new data (and so need re-building) there is no excuse not to score the model directly against the data. How you execute the model scoring is a different question and you have some options:
  • you may load the model, new data and score directly in your analytic tools. This is using a sledgehammer to crack a nut, but it's easy to do if a little heavyweight/slow.
  • for simpler models converting the model into SQL is not that difficult (though you do need to know SQL pretty well and have permission to build it into the database as a view, stored procedure or user defined function). This is probably the most difficult option but the fastest to execute.
  • try converting the model to PMML (predictive model markup language) and use a server based tool designed to execute PMML against your database. (Many analytic tools have an option to export models as PMML.) A PMML enabled DSR would be a great enhancement for the Next Generation.
Bring your analytics to the data: spend more time doing analytics and less time data wrangling.

Tuesday, 12 August 2014

Next Generation DSRs - Analytic Freedom !


Current Demand Signal Repositories don't play well with others. Their data is locked away behind layers of security and you can only access it through the shackles of their chosen front-end for reporting. There is no good way to get that rich dataset into other tools: you have to copy it into a new database and new data structures. (In some cases you may have to do this twice, once to rearrange the data from the DSR into a format you can understand, then again to match the data structure needs of the downstream tool.)
For small-scale models (do we do those anymore?) that sip data from the original repository you can do this through the reporting engine and live with the pain, for large scale modeling it's really not an option.
I want freedom. Freedom to analyze with whatever tools I need: The freedom to report in Business Objects, visualize in Tableau, analyze in R and run existing applications (order-forecasting, master-data-checking, clustering, assortment optimization, etc.) directly against this data. (I'm not endorsing any of these tools and you can replace the named software above with anything you deem relevant - that's kind of the point).
Much of this freedom comes from a simplified data model, enabled by new database technologies (massively parallel processing, scale-out, in-memory and columnar). See more details at data handling.
It also needs a security model that is handled by the database, NOT the reporting layer, or as soon as you get to the underlying data you can see lots of interesting things you shouldn't :-)
I suppose I could live with a little less freedom if a DSR offered all the tools I need but I don't think that's realistic. Not all DSR reporting layers are equal, data visualization is hit and miss, and as I posted in An Analytic name is not enough while there are some good DSR based analytic applications you will find many use pseudo-analytics and some have no analytic basis at all.
Do you think, perhaps, that the Next Generation DSR will provide the best reporting, visualization and analytic tools available? Sorry, I don't think so. DSRs cover a dizzying array of analytic need and developing robust, flexible analytic applications, even assuming easy access to the data, is an expensive proposition for any DSR vendor to do alone. I anticipate a few strong analytic "flag-ship" tools will emerge alongside more me-too/check-the-box applications packed with pseudo-analytics.
So, what can the Next Generation DSR do to help?
  • make it (much) easier to get at the data in large quantities,
  • make it (much) easier to bring analytics to bear on that data. (Perhaps with an integrated analytic toolset)
  • open up the system to whatever analytic tools work best for you
  • make it easy for other software vendors to provide add-in analytics on the DSR data/analytics platform.
Think about that last point for a moment, no DSR vendor is big enough to provide state of the art analytic applications in all areas, but make it easy enough to integrate with and it could enable specialist analytics vendors to offer their tools as add-ins to the platform. (This could be good news for the analytics vendor too, it removes the need for them to install and maintain their own DSR just to enable the analytics)
Let's look at an example.
Today if you want assortment-optimization capability, you can
  • wait for your DSR vendor to develop it and hope they use real analytics; or
  • search for another solution and work to interface the (very large) quantities of data you need between the applications;
  • write your own (always fun, but you had better know what you are doing) and you will still need to interface the data.
  • decide not to bother
All but the last one of these are slow - I'm guessing 12 months plus.
In the NextGen world, if you want new analytic capability, you could still write your own (it's easy to hook up the analytic engine) or just go to the DSR's analytic market-place and shop for it.

Monday, 4 August 2014

Next Generation DSRs - An Analytic name is not enough

You need not always build your analytic tools, sometimes you should buy in. If the chosen application does what you need that often makes good economic sense... as long as you know what you are buying.

Let's be clear, an Analytic name does NOT mean there are any real Analytics under the hood.

For many managers, Analytics is akin to magic. They do not know how an analytics application works in a meaningful way and have no real interest in knowing. At the same time, there is no business standard for what makes up "forecasting", "inventory optimization", "cluster analysis", "pricing analysis", "shopper analytics", "like products" or even (my favorite) "optimization".  Don't buy a lemon!


In the worst examples, there is nothing under the hood at all. One promotion-analytic tool I came across recently proudly proclaimed that you (the user) could calculate the baseline and lift for each promotion however you saw fit and then just enter the result into their system. They presented this as a positive feature, but calculating a meaningful baseline and lift is the difficult part!!

I've seen similar approaches for:
  • off-shelf alerting tools that ask you how long of a period of zero sales is abnormal (so they can report exceptions)
  • supply chain systems that need you to enter safety-stocks or re-order-points (so they can figure out when to order).  
  • assortment optimization tools that want you to input product substitution rates.
Hmmm, is a car without an engine still a car?
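For contrast, here is roughly what it looks like when the tool answers the off-shelf question itself rather than asking the user. This is a minimal sketch only, assuming daily demand for an item at a store is roughly Poisson with a known average rate; the function name and the 1% false-alarm threshold are hypothetical, and a real system would also need to handle seasonality, promotions and trading days.

```python
import math


def abnormal_zero_run(daily_rate, false_alarm_probability=0.01):
    """Smallest run of consecutive zero-sales days that is rarer than the threshold.

    Assumes Poisson demand, so P(no sales in k days) = exp(-daily_rate * k).
    """
    if daily_rate <= 0:
        raise ValueError("daily_rate must be positive")
    if not 0 < false_alarm_probability < 1:
        raise ValueError("false_alarm_probability must be between 0 and 1")
    # Solve exp(-rate * k) < p for k, then take the next whole day.
    days = -math.log(false_alarm_probability) / daily_rate
    return math.floor(days) + 1


fast_seller_days = abnormal_zero_run(daily_rate=2.0)   # sells about 2 a day
slow_seller_days = abnormal_zero_run(daily_rate=0.2)   # sells about 1 every 5 days
print(fast_seller_days, slow_seller_days)
```

Under these assumptions the right answer differs widely between a fast seller and a slow one, which is exactly why a single user-entered number is a poor substitute for the calculation.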
Many applications use pseudo-analytics. After all, how hard can it be? "cluster analysis" , that's finding groups of things right? I reckon I can figure that out, no stats required. Yeah, right, of course you can... FYI - meaningful, useful clusters may be a little more difficult. It's not that cluster analysis is particularly hard, but neither is it something you can knock together without the right tools or any statistical understanding.
Sadly, I have seen real world examples of pseudo-analytics too in pricing analytics, off-shelf alerting, demographic analyses, inventory optimization and forecasting.
The right tool for the right job. There are many good analytic applications available, but you can still make it useless if it does not suit the task you have in mind. Using a time-bucket oriented optimization program to schedule production runs with sequencing comes to mind. OK, relatively few people are going to understand that one and it's not a DSR application, but it is real, the software vendor did not come out shouting that there would be a problem and 2 years down the line that project was abandoned.

Are DSRs worse than other applications?

I think this kind of feature-optimism is a general issue in buying any analytic app, but my perception is that it is a bigger problem in the DSR space.  Perhaps because the DSR is trying to offer so much analytic functionality to so many functional areas?  Is a DSR really going to handle forecasting, pricing-analytics, cluster-analysis, weather-sensitivity-modeling, promotional analytics, inventory optimization, assortment selection and demographic analysis (note - not a complete list), all as packaged software, for $50K a year?   Not unless they can scale that investment across a huge user-base.  Some will be good, others not so much - be warned.

Spotting a lemon

An expert in the field (with analytic and domain knowledge) can spot a lemon from quite a distance. If you do not possess one you would be wise to invest in some consulting to bolster your purchasing team. For those applications that pass the sniff-test, the proof of any analytic system is in its performance. Define rational performance criteria, test, validate, pilot and never, ever, ever rely on a software vendor ticking the box in your RFP.

Wednesday, 30 July 2014

Next Generation DSRs - Analytic power !

To handle real Analytics (see my recent post Reporting is NOT Analytics) you need real Analytic power. BI tools are based on the language they use to interrogate the database (typically SQL) and with no library of analytic tools - it's not nearly enough.



We use SQL (Structured Query Language) to query relational databases like SQLServer, Oracle, MySQL and Access. SQL is a great tool for handling large quantities of data, joining tables, filtering results and aggregating data. However, SQL's math library is only sufficient for accounting (sum, product, division, count) and while I do know it can do a few more things, it's not enough to be useful for Analytics. Even getting it to calculate a simple correlation-coefficient is a big challenge. Want to build a simple regression model? That's just not going to happen in base SQL, we need something designed for the task.
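To show what that challenge looks like in practice, here is a minimal sketch that gets a Pearson correlation using only the accounting-style aggregates (COUNT and SUM) and then finishes the arithmetic outside the query. It uses Python's built-in sqlite3 module purely as a stand-in database; the table name, column names and the five rows of data are made up for illustration.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weekly (temperature REAL, sales REAL)")
conn.executemany(
    "INSERT INTO weekly VALUES (?, ?)",
    [(10, 20), (15, 30), (20, 40), (25, 55), (30, 60)],  # made-up data
)

# Everything SQL contributes here is counting and summing.
n, sx, sy, sxy, sxx, syy = conn.execute(
    """
    SELECT COUNT(*),
           SUM(temperature),
           SUM(sales),
           SUM(temperature * sales),
           SUM(temperature * temperature),
           SUM(sales * sales)
    FROM weekly
    """
).fetchone()
conn.close()

# The statistics have to be assembled by hand from those sums.
numerator = n * sxy - sx * sy
denominator = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
correlation = numerator / denominator
print(round(correlation, 4))
```

That is one statistic, for one pair of variables, with no significance test and no handling of missing values. Compare that with a single function call in any of the analytic tools discussed next.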
R, SAS, SPSS, Statistica, and a good number of others, are the real deal and the difference between any of them and what you can do in SQL (or Excel) is vast! With these tools it's no longer a question of "can you build a regression model?" now it's "which particular flavor of regression do you need?". What! There's more than one? Oh yeah!
I'm not getting into which analytic tool is the best. I use R, and that's what I'll talk to, but I have good friends, analytic-powerhouses who insist on using SAS or SPSS. These tools have different strengths and weaknesses and within the analytic community a lot of time, blog posts and misinformation go into arguing the relative merits of one vs. another. My take is that for most business-analytic purposes any of them will get the job done. The one you choose should be driven most heavily by your ability to get the analytic tool working against your data.
The problem is that these analytic tools do not generally reside in the same space as your database or BI tool, so you spend a lot of time interfacing data between systems. It's slow, sometimes very slow, and requires replication in your resources.
In recent years many database and BI tools have started offering integration with statistical tools (Oracle, SAP Hana, Tableau, Spotfire, MicroStrategy). The ideal here is in-database analytics where we run the complex stats in-tandem, indeed in the same memory space as the database. That is very attractive but I would look very carefully at the depth of integration offered before getting too excited. In some cases, I think, vendors have done just enough to tick the box without making it truly useful. As examples:
  • One vendor limits the transfer of data between database and R to simple table structures. Now, imagine running a regression model. What goes into the regression is very likely a simple table - check! What comes out is anything but: it's a complex object combining multiple tables of different dimensionality and named values (like r-sq). We need this data to determine the validity of the model and make future predictions. Force me to return just one table structure and I must throw most of the information and capability away. Before anyone asks, no, this is not unique to regression models.
  • Another vendor has integrated R into the reporting layer. This is relatively functional as long as the data you want to work with can be generated in a report. If you need very large amounts of input data you may well exceed reporting limits. If you want to build a separate model for each product in your database, you may have to run the report separately for each one.
  • Standard R was not originally designed for parallel execution (though you can get around this with a little coding help). Current processors (CPUs) even on low-level laptops are multi-cored. Servers routinely run more cores per CPU, more CPUs per server and we want to scale-out across multiple servers. A BI offering that only offers single core R execution is wasting your resources and time.
Bottom line, to do real Analytics, you need real Analytic tools. But even the best tools must be able to get at the data to be useful. Choose carefully.

Monday, 28 July 2014

Next Generation DSRs - Reporting is NOT analytics

I've written a number of posts now on the next generation of Demand Signal Repositories. DSRs are the specialized database and reporting tools primarily used by CPGs for retail Point of Sale data.
So far, I've looked at the challenges (and big opportunities) around handling the large quantities of data involved: better database technologies, scale-out platforms, true multi-retailer environments, effective data blending and dramatic simplification of data structures.
Taken as a whole this gets the necessary data into one place where it is relatively simple to overlay it with the BI or analytic tools of your choice and still get good performance. This is the starting point.
Now, we can get to the fun stuff, Analytics. Let's start by addressing a widespread misunderstanding:

Reporting is NOT analytics

I've blogged on this before, actually in one of my very first blog posts, but it bears repeating and extending from the original.
Reporting is about "what happened"; Analytics is concerned with "why?", "what if?" and "what's best?".
You need reports. Hopefully they are well constructed, with appropriate metrics, good visualization and exception highlighting. Perhaps they are also interactive so you can drill-down, pivot and filter. These are useful tools for exploratory "what happened" work, but, almost exclusively, reports leave it up to the reader to construct the "why".
Great reporting can pull together facts that you think are related for visual inspection (e.g. weekly temperature and ice-cream sales by region). Perhaps you can see a pattern, sort of, but reports will not quantify or test the validity of the pattern; that's up to you, the reader, to guess at.
Even great reports can't help you much with more complex relationships. In reality, ice-cream sales are also dependent on rainfall, pricing, promotions, competitor activity etc. Who knew? Well we all did of course, but there is no reasonable way to visualize this in a standard report. Want to predict sales next week given weather, price and promo data for all products in all regions? You're going to need some good analytics.
You need Analytics too. In some cases, basic, high-school, math is all you need. In most, it doesn't even get you close to the 80% solution beloved of business managers. "Winging it" in Excel, Access, PowerPivot etc. can give you very bad answers that are seriously dangerous to your success and/or employment.
Want to understand and predict the impact to sales of promotions, pricing or weather events? You need Analytics for that.
Want to know where you can safely reduce inventory in your supply chain while increasing service level? You need Analytics.
Want to alert when sales of your product are abnormally low? Analytics!
Want to know how rationalizing products across retailers would impact your supply chain? Yep, Analytics.
Want to know which shopper demographics are most predictive of sales velocity? I think you get it...
If your business question is something other than "what happened" you need Analytics.

Monday, 7 July 2014

Next Generation DSRs - data blending (part 2)

My most recent post on Demand Signal Repositories bemoaned their general lack of ability to rapidly ingest new and interesting data sources (e.g.: promotions, Twitter feeds, Sentiment analysis, Google trends, Shipment history, master data, geographic features, proximity to competitor stores, demographic profiles, economic time series, exchange rates etc.).
As a result, analysts spend far too much time collecting/copying data into ad-hoc data marts to enable useful modeling work. At the extreme, you can move Terabytes of low-level data from a perfectly good database into another one (probably on a lower powered machine) so as to manually merge it with a few hundred records of new data you need for analysis. This is slow (actually very slow), error prone and leaves very little time to do value-added work.
Based on questions from blog readers via email, I think that I failed to call out how big the gap is between where we are now and where we should be. Let me spell it out:
If I go to my (or your) IS department now and ask "how long would it take to integrate XXX data into the DSR so it is loaded, cleaned, gap-filled, matched to appropriate dimensions and ready for some interesting analytic work?" I would expect to hear back "between 6 and 12 months" and that's assuming they have both some developer availability and the necessary access to add/modify data structures - some DSRs are locked down tight. If I went to the DSR vendor, it may be a little faster, depending on just how tightly the data structure is tied into their product release schedule. But here's the thing - I want to do this, myself, in real-time and certainly in less than a day.
Tools such as Alteryx are designed to do data blending. Alteryx in particular seems to do especially well handling geo/demographical data, some of which comes with it as standard. They also have a number of pre-defined macros to help you get at standard data sources like Google Trends and Twitter. If I understand it correctly, it does this by loading all data sources into memory. Perhaps it constructs its own data repository on the fly, but, certainly, it does not touch the source database's data structure at all.
This would work well for relatively small quantities of data. Let's say you are examining annual sales for a product group by store - you aggregate that to a few thousand records of POS data in the DSR, load it into Alteryx, geocode the locations, match up the geo/demographic data you are interested in and you are ready to run some math. I doubt that would take more than a couple of hours. There is also some analytic power in the platform and at least some level of R integration if you wish to extend it further. For ad-hoc small (sub 10 million record?) data analytics this looks really good.
What if you want to do your modeling at a much lower level of detail though? Do you have the capacity to match across billions of records outside the DSR? Perhaps, but it's going to cost you and why move it all into another database on another expensive server when you've already paid for those in your DSR? What if you want to run analytics repeatedly? Do you really want to do geocoding and ad-hoc matching every time you want to use census data in an analysis? Chances are the stores haven't moved :-) and the most frequently updated census data, I think, isn't updated any more often than annually.
Better to do it once, load it into new data structures in the DSR and enable it for ongoing reporting/analytics or... did you want to force even basic reporting through the data blending platform because that's the only place you can match up multiple data sources ? I didn't think so.
I would definitely look at something like Alteryx for ad-hoc work. If you can also use it to source, transform, handle dimensional matching, deal with missing data etc. and load the results back into your DSR (where you just defined new data structures to receive it), I think you might have something.

Monday, 30 June 2014

Next Generation DSRs - data blending

Over the last few months I've written a series of posts on Demand Signal Repositories.  These are the specialized database and reporting tools primarily used by CPGs for reporting against retail Point of Sale data.  
There are a number of good tools in the market-place and you can derive substantial value from them today but the competitive landscape is changing...fast. Existing tools found a market because they are capable of sourcing, loading and reporting against vast amounts of data quickly.  To do so they have employed a variety of complicated architectures that are now largely obsolete given recent advances in technology that can make solutions faster, cheaper and more flexible.
Cheaper alone may be a win in the market today, but if all we do with this new power is report on "what I sold last week" more quickly and at a lower price-point I think we are missing the point.  
The promise of a DSR has always been to explain what happened but much more importantly why and existing tools struggle with this:
  • they do not hold a rich enough repository of data to test out hypotheses.
  • their primary analytic tools are report-writers and pivot-tables (by which I mean that they really don't have any)
We'll come to analytics in a later post, but for now let's think data because without that there isn't very much to analyze.
Imagine that I've spent a few hundred thousand acquiring point of sale data into my own DSR and now I want to really figure out what it is that drives my sales.  
How about weather?  Ignore for the moment whether or not a future forecast is useful, but how about using weather data to explain some of the strange sales in history so that I don't trend them forwards into the coming year?  I can get very detailed weather data from a number of sources, but can I, a system user, get that data into my DSR to start reporting against it and, better yet, modeling?  Probably not.
How about SNAP, the US government's benefit program that funds grocery purchases for roughly 1 in 6 US households?  SNAP can drive huge spikes in demand for key products and I can easily go to usda.gov and find out exactly when SNAP dollars are dropped into the marketplace by day of the month and by state.  With a little time on Google I can even see when this schedule has changed in the past few years.  Can I, a system user, get this data into the DSR for reporting/modeling?  Nope.
The same is true for many additional data sources you wish to work with (Promotional  records, Twitter feeds, Sentiment analysis, Google trends, Shipment history, master data, geographic features, proximity to competitor stores, demographic profiles, economic time series, exchange rates etc.).  
These are all relatively easy to source datasets but if the DSR vendor has not set it up as part of the standard product, you are out of luck: the technical sophistication necessary to source, load and, especially, match key fields is beyond what a super-user, and in many cases a system administrator, can handle.   Can it be done?  Maybe, depending on your system, skill-level and security-access, but it's going to cost you in time and money.

Matching data in particular can be a real bear - it will be rare that you are matching products at the same level of granularity (item, location, date) and with the exact same key fields.  Far more common to be matching weekly or monthly data to daily,  state or county data to zip-codes and product groups to shoppable items.  And do it without losing any data, sensibly handling missing data and flagging suspect data for manual follow-up.
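As a small illustration of that kind of matching, here is a sketch in Python that joins daily, store-level POS records to weekly, state-level data without dropping anything: rows that cannot be matched are set aside with a reason for manual follow-up. All store codes, states, values and field names are hypothetical, and it assumes weeks start on a Monday.

```python
from datetime import date, timedelta


def week_start(day):
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def blend(pos_rows, store_to_state, weekly_lookup):
    """Attach weekly state-level values to daily store-level rows.

    Returns (matched, suspect); every input row ends up in exactly one list.
    """
    matched, suspect = [], []
    for row in pos_rows:
        state = store_to_state.get(row["store"])
        key = (state, week_start(row["date"]))
        if state is None:
            suspect.append({**row, "issue": "store has no state mapping"})
        elif key not in weekly_lookup:
            suspect.append({**row, "issue": "no weekly value for state and week"})
        else:
            matched.append({**row, "state": state, "weekly_value": weekly_lookup[key]})
    return matched, suspect


# Hypothetical daily POS records at store level.
pos_daily = [
    {"store": "S1", "date": date(2014, 6, 30), "units": 12},
    {"store": "S1", "date": date(2014, 7, 3), "units": 9},
    {"store": "S2", "date": date(2014, 7, 1), "units": 4},
    {"store": "S3", "date": date(2014, 7, 2), "units": 7},
]
store_state = {"S1": "TX", "S2": "OK"}  # S3 is deliberately unmapped
weekly_by_state = {("TX", date(2014, 6, 30)): 31.5}  # e.g. a weekly weather measure

matched, suspect = blend(pos_daily, store_state, weekly_by_state)
for row in suspect:
    print(row["store"], row["date"], row["issue"])
```

Even this toy version needs a calendar rule, a geography lookup and two kinds of exception handling, which gives some feel for why the real thing is beyond most system users.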
So if you really want to do some analysis against e.g. SNAP what must you do?  Download a small ocean of detailed POS data so you can (carefully) join it to your few hundred records of SNAP release data in a custom database or analytic app, build the models and then (because you can't write the results back out to the DSR) build a custom reporting engine against these results.  This makes no sense to me.  
The solution is something called data-blending which tries to reduce the pain of integrating multiple data sources to a level that you could contemplate it in near real-time.  While I have not yet seen a solution I would call perfect the contrast with the standard, locked-down, DSR scenario is impressive.  
Much of what I have seen so far happens at the individual's level: where you are doing the match in-memory and without impacting the underlying database or fellow users in any way.  In many cases, particularly for exploratory work, this is preferable, but it's far from an ideal solution if you need to process against the detail of the entire database or have multiple needs for the same data.
The future, I think, will include such ad-hoc capability, but I suspect it also includes a more flexible data model that lets an administrator rapidly integrate new data sources into the standard offering.

Monday, 2 June 2014

Averages work ! (At least for ensemble methods)

After an early start, I was sitting at breakfast downtown enjoying a burrito and an excellent book on "ensemble methods".  (Yes, I do that sometimes... don't judge)

For those who have built a few predictive models (regression, neural-nets, decision trees, ...) I think this is an excellent read, outlining an approach that can deliver big improvements on hard to predict problems.  The introduction provides a very good overview:
Ensemble methods have been called the most influential development in Data Mining and Machine Learning in the past decade. They combine multiple models into one usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges...
Ensemble models use teams of models.  Each model uses a different modeling approach or different samples of the available data or emphasizes different features of your data-set and each is built to be as good as it can be.  Then we combine ("average") the prediction results and,  typically,  get a better prediction than any of the component team members.
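A toy sketch of the averaging step, in Python. The actual values and the three sets of model predictions are invented, and deliberately chosen so that the models err in different directions; that is the condition under which averaging helps, and real gains depend on how independent the component errors are.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def average_predictions(*prediction_sets):
    """Combine models by taking the simple mean of their predictions."""
    return [sum(values) / len(values) for values in zip(*prediction_sets)]


# Invented data: three models that miss in different directions.
actual = [100, 120, 90, 110]
model_a = [110, 115, 95, 100]
model_b = [92, 128, 84, 116]
model_c = [104, 112, 96, 114]

individual_errors = [
    mean_absolute_error(actual, predictions)
    for predictions in (model_a, model_b, model_c)
]
ensemble = average_predictions(model_a, model_b, model_c)
ensemble_error = mean_absolute_error(actual, ensemble)

print("individual:", individual_errors)
print("ensemble:  ", round(ensemble_error, 3))
```

Here the averaged prediction has a lower error than any single model because the individual misses partly cancel out. Models that all make the same mistakes would gain nothing from being averaged.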

When I was first learning predictive modeling as an under-graduate the emphasis was on finding the best model from a group of potential candidates.  Embracing ensemble methods, initially, just felt wrong, but the proof is in the performance.

It sounds easy, but, clearly, this is more complex than building a single model and if you can get a good-enough result using simple approaches you should.  You'll know when it's worth trying something more high powered.

With thanks to my friend Matt for this simplification, this may be one of the few contexts where we can say "Averages work!!"  

As a reminder that working with averages (or aggregations of any kind) is generally dangerous to your insight, take another look at this post on why you should be using daily point-of-sale data.

Or, consider this...