Wednesday, 30 July 2014

Next Generation DSRs - Analytic power !

To handle real Analytics (see my recent post Reporting is NOT Analytics) you need real Analytic power. BI tools are based on the language they use to interrogate the database (typically SQL) and with no library of analytic tools - it's not nearly enough.



We use SQL (Structured Query Language) to query relational databases like SQLServer, Oracle, MySQL and Access. SQL is a great tool for handling large quantities of data, joining tables, filtering results and aggregating data. However, SQL's math library is only sufficient for accounting (sum, product, division, count) and while I do know it can do a few more things, it's not enough to be useful for Analytics. Even getting it to calculate a simple correlation-coefficient is a big challenge. Want to build a simple regression model? That's just not going to happen in base SQL, we need something designed for the task.
R, SAS, SPSS, Statistica, and a good number of others, are the real deal and the difference between any of them and what you can do in SQL (or Excel) is vast! With these tools it's no longer a question of "can you build a regression model?" now it's "which particular flavor of regression do you need?". What! There's more than one? Oh yeah!
I'm not getting into which analytic tool is the best. I use R, and that's what I'll talk to, but I have good friends, analytic-powerhouses who insist on using SAS or SPSS. These tools have different strengths and weaknesses and within the analytic community a lot of time, blog posts and misinformation go into arguing the relative merits of one vs. another. My take is that for most business-analytic purposes any of them will get the job done. The one you choose should be driven most heavily by your ability to get the analytic tool working against your data.
The problem is that these analytic tools do not generally reside in the same space as your database or BI tool, so you spend a lot of time interfacing data between systems. It's slow, sometimes very slow, and requires replication in your resources.
In recent years many database and BI tools have started offering integration with statistical tools (Oracle, SAP Hana, Tableau, Spotfire, MicroStrategy). The ideal here is in-database analytics where we run the complex stats in-tandem, indeed in the same memory space as the database. That is very attractive but I would look very carefully at the depth of integration offered before getting too excited. In some cases, I think, vendors have done just enough to tick the box without making it truly useful. As examples:
  • One vendor limits the transfer of data between database and R to simple table structures. Now, imagine running a regression model. What goes into the regression is very likely a simple table - check! What comes out is anything but: it's a complex object combining multiple tables of different dimensionality and named values (like r-sq). We need this data to determine the validity of the model and make future predictions. Force me to return just one table structure and I must throw most of the information and capability away. Before anyone asks, no, this is not unique to regression models.
  • Another vendor has integrated R into the reporting layer. This is relatively functional as long as the data you want to work with can be generated in a report. If you need very large amounts of input data you may well exceed reporting limits. If you want to build a separate model for each product in your database, you may have to run the report separately for each one.
  • Standard R was not originally designed for parallel execution (though you can get around this with a little coding help). Current processors (CPUs) even on low-level laptops are multi-cored. Servers routinely run more cores per CPU, more CPUs per server and we want to scale-out across multiple servers. A BI offering that only offers single core R execution is wasting your resources and time.
Bottom line, to do real Analytics, you need real Analytic tools. But even the best tools must be able to get at the data to be useful. Choose carefully,

No comments:

Post a Comment