Archive for the ‘Business analytics’ Category

By Mayank Bawa in Analytics, Blogroll, Business analytics, MapReduce on August 25, 2008

I’m unbelievably excited about our new In-Database MapReduce feature!

Google has used MapReduce and GFS for PageRank analysis, but the sky is really the limit for anyone who wants to build powerful analytic apps. Curt Monash has posted an excellent compendium of applications that are successfully leveraging the MapReduce paradigm today.

A few examples of SQL/MapReduce functions that we’ve collaborated with our customers on so far:

1. Path Sequencing: SQL/MR functions can be used to develop regular-expression matching over complex path sequences (e.g., time-series financial analysis or clickstream behavioral recommendations). They can also be extended to discover Golden Paths that reveal interesting behavioral patterns useful for segmentation, issue resolution, and risk optimization (a minimal sketch of the idea appears after this list).

2. Graph Analysis: many interesting graph problems, such as BFS (breadth-first search), SSSP (single-source shortest path), APSP (all-pairs shortest path), and PageRank, depend on graph traversal and can be expressed as SQL/MR functions.

3. Machine Learning: statistical algorithms such as linear regression, clustering, collaborative filtering, naive Bayes, support vector machines, and neural networks can be used to solve hard problems like pattern recognition, recommendation/market-basket analysis, and classification/segmentation.

4. Data Transformations and Preparation: large-scale transformations can be parameterized as SQL/MR functions for data cleansing and standardization, unleashing the true potential of Extract-Load-Transform pipelines and making large-scale data-model normalization feasible. Push-down processing also enables rapid discovery and data pre-processing to create the analytical data sets used by advanced analytics tools such as SAS and SPSS.
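To make the path-sequencing idea in item 1 concrete, here is a minimal, hypothetical Python sketch. It is not Aster's SQL/MR API; the event codes and column layout are made up purely for illustration of the partition-then-match pattern.

import re
from collections import defaultdict

# Hypothetical clickstream rows: (userid, timestamp, event_code).
# Event codes are made up: S = search, V = product view, C = checkout.
clicks = [
    ("u1", 1, "S"), ("u1", 2, "V"), ("u1", 3, "V"), ("u1", 4, "C"),
    ("u2", 1, "S"), ("u2", 2, "V"),
]

# Partition step: group events by user and order them by timestamp
# (the role PARTITION BY userid / SEQUENCE BY timestamp play in SQL/MR).
paths = defaultdict(list)
for userid, ts, event in sorted(clicks, key=lambda r: (r[0], r[1])):
    paths[userid].append(event)

# Per-partition step: run a regular expression over each user's path string.
# Pattern: a search, one or more product views, then a checkout.
pattern = re.compile(r"SV+C")
for userid, events in paths.items():
    label = "golden path" if pattern.search("".join(events)) else "no match"
    print(userid, label)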

These are just a few simple examples Aster has developed for our customers and partners via Aster’s In-Database MapReduce to help them with rich analysis and transformations of large data.

I’d like to finish with a code snippet for a simple yet powerful SQL/MR function we’ve developed called “Sessionization”.

Our Internet customers have told us that defining a user session can’t be done easily (if at all) using standard SQL. One possibility is to use cookies, but users frequently remove them, or they expire.

Aster In-Database MapReduce

Aster developed a simple “Sessionization” SQL/MR function via our standard Java API library to easily parameterize the discovery of a user session. A session is defined by a timeout value (e.g., in seconds): if the elapsed time between consecutive click events is greater than the timeout, a new session has begun for that user.

From a user’s perspective, the input is a stream of user clicks (e.g., timestamp, userid), and the output associates each click with a session identifier based on the timeout logic described above. Here’s the simple syntax:

SELECT timestamp, userid, sessionid
FROM sessionize("timestamp", 600) ON clickstream
SEQUENCE BY timestamp
PARTITION BY userid;

Indeed, it is that simple.

So simple that we have reduced a complex, multi-hour Extract-Load-Transform task to what looks like a toy example. That is the power of In-Database MapReduce!
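For readers curious about what happens inside the function, here is a minimal standalone Python sketch of the timeout logic described above. It is an illustration only, not Aster's Java implementation; the column names simply mirror the query.

# Minimal sketch of timeout-based sessionization (illustration only,
# not Aster's Java SQL/MR implementation).
TIMEOUT = 600  # seconds, the same parameter used in the query above

def sessionize(rows, timeout=TIMEOUT):
    """rows: (timestamp, userid) pairs for one user, ordered by timestamp.
    Yields (timestamp, userid, sessionid)."""
    session_id = 0
    last_ts = None
    for ts, userid in rows:
        # A gap larger than the timeout starts a new session.
        if last_ts is not None and ts - last_ts > timeout:
            session_id += 1
        last_ts = ts
        yield ts, userid, session_id

# Example: three clicks, the last one 15 minutes after the second.
clicks = [(1000, "u1"), (1200, "u1"), (2100, "u1")]
print(list(sessionize(clicks)))  # the third click falls into session 1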



August 19, 2008

Is anyone out there attending the TDWI World Conference in San Diego this week? If so, and you would like to meet up, please drop me a line or comment below, as I will be in attendance. I’m of course very excited to be making the trip to sunny San Diego and hope to catch a glimpse of Ron Burgundy and the Channel 4 news team! :-)

But of course it’s not all fun and games, as I’ll participate in one of TDWI’s famous Tool Talk evening sessions discussing data warehouse appliances. This should make for some great dialogue between me and the other database appliance players, especially given the recent attention our industry has seen. I think Aster has a really different approach to analyzing big data, and I look forward to discussing exactly why.

For those interested in the talk, here are the details... come on by and let’s chat!
What: TDWI Tool Talk Session on data warehouse appliances
When: Wednesday, August 20, 2008 @ 6:00 p.m.
Where: Manchester Grand Hyatt, San Diego, CA



By Tasso Argyros in Analytics, Blogroll, Business analytics, Statements on August 17, 2008

When Polo lets you use your mobile phone to buy a pair of pants, you know there’s something interesting going on.

The trend is inevitable: purchasing keeps getting easier and more frictionless. You could already buy something at the store or from your home; now you can buy stuff while you jog in the park, while you bike (it’s not illegal yet), or even while you’re reading a distressing email on your iPhone (shopping therapy at its best).

As purchasing gets easier and more pervasive, we’ll tend to buy things in smaller quantities and more often. That means more consumer-behavior data will be available for advertisers and retailers to analyze so they can better target promotions to the right people at the right time.

In this new age, where buyers interact with shops and brands much more frequently and intimately, enterprises that use their data to understand their customers will have a huge advantage over their competition. That’s one of the reasons we at Aster are so excited to be building the tools for tomorrow’s winners.



By Mayank Bawa in Analytics, Blogroll, Business analytics, Business intelligence, Database on August 5, 2008

Today we are pleased to welcome Pentaho as a partner to Aster Data Systems. What this means is that our customers can now use Pentaho open-source BI products for reporting and analysis on top of Aster nCluster.

We have been working with Pentaho for some time on testing the integration between their BI products and our analytic database. We’ve been impressed with Pentaho’s technical team and the capabilities of the product they’ve built together with the open source community. Pentaho recently announced a new iPhone application which is darn cool!

I guess, by induction, Aster results can be seen on the iPhone too. :-)



July 25, 2008

Stuart announced yesterday that Microsoft has agreed to acquire DATAllegro. It is pretty clear Stuart and his team have worked hard for this day: it is heartening to see that hard work gets rewarded sooner or later. Congratulations, DATAllegro!

Microsoft is clearly acquiring DATAllegro for its technology. Indeed, Stuart says that DATAllegro will start porting away from Ingres to SQL Server once the acquisition completes. Microsoft’s plan is to provide a separate offering from its traditional SQL Server Clustering.

In effect, this event provides a second admission from a traditional database vendor that OLTP databases are not up to the task of large-scale analytics. The first admission came in the 1990s, when Sybase (ironically, the originator of the SQL Server code base) offered Sybase IQ as a separate product from its OLTP offering.

The market already knew this fact: the key point here is that Microsoft is waking up to the realization.

A corollary is that it must have been really difficult for Microsoft’s SQL Server division to scale SQL Server to larger deployments. Microsoft is an engineering shop, and the effort of integrating alien technology into its SQL Server code base must have been carefully weighed in a build-vs.-buy decision. The buy decision is a tacit admission that it is incredibly hard to scale an offering whose roots lie in a traditional OLTP database.

We can expect Oracle, IBM, and HP to have similar problems scaling their 1980s code bases to the data volumes and query workloads of today’s data warehousing systems. Will the market wait for Oracle, IBM, and HP’s scaling efforts to come to fruition? Or will they, too, soon acquire companies to improve their scalability?

It is interesting to note that DATAllegro will be moving to an all-Microsoft platform. The acquisition could also be read as a defensive move by Microsoft. All of the large-scale data warehouse offerings today are based on Unix variants (Unix/Linux/Solaris), which leads to an uncomfortable situation at all-Microsoft shops that chose to run Unix-based data warehouse offerings because SQL Server would not scale. Microsoft needed an offering that could keep its enterprise-wide customers on Microsoft platforms.

Finally, there is a difference in philosophy between Microsoft’s and DATAllegro’s product offerings. Microsoft SQL Server has sought to cater to the lower end of the BI spectrum; DATAllegro has actively courted the higher end. Correspondingly, DATAllegro uses powerful servers, fast storage, and an expensive interconnect to deliver its solution, while Microsoft SQL Server has sought to deliver a solution at a much lower cost. We can only wait and watch: will the algorithms of one philosophy work well on the infrastructure of the other?

At Aster Data Systems, we believe that the market dynamics will not change as a result of this acquisition: companies will want the best solutions to derive the most value from their data. In the last decade, the Internet changed the world, and old-market behemoths could not translate their might into the new market. In this decade, Data will produce a similar disruption.



How to Answer Analytic Questions
By George Candea in Analytics, Business analytics on July 14, 2008

In a recent interview with Wired magazine, IBM’s Wattenberg mentioned an interesting yardstick for data analytics: compare the data you give to a human to the sum total of the words that human will hear in a lifetime, which is less than 1 TB of text. Incidentally, this 1 TB number is how big Gordon Bell thinks a lifetime of recording daily minutiae would be… Bell now has MyLifeBits, the most extensive personal archive, in which he records all his e-mails, photographs, phone calls, Web pages visited, IM conversations, desktop activity (like which apps he ran and when), health records, books in his library, labels of the bottles of wine he enjoyed, etc. His collection grows at about 1 GB / month, amounting to ~1 TB for a lifetime… and that’s what the human brain is built for.

Wattenberg offers an interesting perspective: human language is a form of compression (“Twelve words from Voltaire can hold a lifetime of experience”). This is because of the strong contextual information carried by each phrase. MyLifeBits does not reflect the life experiences themselves; it provides the bits from which those life experiences are built, through connections and interpretations.

Herein lies the challenge of data analytics: how to “compress” vast amounts of data into a small volume of information that the human brain can absorb, process, and act upon; how to leverage context in delivering answers, recommendations, and insights. The Web brought data out into the open; search engines allowed us to ask questions of that data; analytics engines are now starting to allow precise and deep questions to be asked of otherwise overwhelming amounts of data. We, as an industry, are just entering the Neolithic of information history.

We need breakthroughs in visualization and, in particular, in the way we leverage the context of previous answers. Researchers at University College London are looking into how the hippocampus encodes spatial and episodic memories; they are going as far as analyzing fMRI (functional MRI) scans of the brain to extract the memories stored in that brain. In computerized data analytics, we face a comparatively simpler task: record all past answers and then leverage that context to communicate new results more effectively. Understand how the current answer relates to the previous one, and deliver an interpretation of the delta. That’s where we would like to be, sooner rather than later.
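As a toy illustration of that last idea, here is a minimal Python sketch (hypothetical names, not tied to any particular analytics engine) that caches the previous answer to each question and reports only what changed the next time the question is asked:

# Toy illustration of "record past answers and interpret the delta"
# (hypothetical, not a real analytics-engine API).
class AnswerCache:
    def __init__(self):
        self.previous = {}  # question -> last answer (metric -> value)

    def report(self, question, answer):
        """Return what changed since the last time this question was asked."""
        prior = self.previous.get(question)
        self.previous[question] = answer
        if prior is None:
            return {metric: (None, value) for metric, value in answer.items()}
        return {metric: (prior.get(metric), value)
                for metric, value in answer.items()
                if prior.get(metric) != value}

cache = AnswerCache()
cache.report("weekly_revenue", {"US": 120, "EU": 80})
print(cache.report("weekly_revenue", {"US": 125, "EU": 80}))  # {'US': (120, 125)}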



By Mayank Bawa in Analytics, Blogroll, Business analytics, Interactive marketing on June 11, 2008

I had the opportunity to work closely with Anand Rajaraman while at Stanford University, and now at our company. Anand also teaches the Data Mining class at Stanford, and he recently wrote a very instructive post on the observation that efficient algorithms on more data usually beat complex algorithms on small data. He followed it up with an elaboration post. Google also seems to believe in a similar philosophy.

I want to build upon that observation here. If you haven’t read the posts, do read them first. They are well worth the time!

I propose that there are two forces at work that help simple algorithms on big data beat complex algorithms on small data:

  1. The freedom of big data allows us to bring in related datasets that provide contextual richness.
  2. Simple algorithms allow us to identify small nuances by leveraging contextual richness in the data.

Let me expand my proposal using Internet Advertising Networks as an example.

Advertising networks essentially make a guess about a user’s intent and present an advertisement (creative) to the consumer. If the user is indeed interested, the user clicks through the creative to learn more.

Advertising networks are used today on a CPC (Cost-Per-Click) model. There are stronger variants, CPL (Cost-Per-Lead) and CPA (Cost-Per-Acquisition), but the discussion here applies to them just as it does to the simpler CPC model. There is also a simpler variant, CPM (Cost-Per-Mille, i.e., per thousand impressions), but an advertiser who buys on CPM effectively ends up computing a CPC anyway by tracking click-through rates against the money spent. The CPC model dictates that advertising networks do not make money unless the user clicks on a creative.
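As a back-of-the-envelope illustration of that CPM-to-CPC conversion (all numbers are made up):

# Back-of-the-envelope conversion of CPM spend into an effective CPC
# (every number here is made up for illustration).
cpm = 2.00            # dollars per thousand impressions
impressions = 500_000
clicks = 2_500        # i.e., a 0.5% click-through rate

spend = cpm * impressions / 1000   # $1,000 spent on the CPM campaign
effective_cpc = spend / clicks     # $0.40 effectively paid per click
ctr = clicks / impressions         # 0.005 click-through rate
print(spend, effective_cpc, ctr)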

Today, the best advertising networks have a click-through rate of less than 1%. In other words, advertising networks correctly interpret a user’s intentions 1% of the time; 99% of the time they are ineffective!

I find this statistic immensely liberating. Here is a statistic that shows that even if we are correct 1% of the time, the rewards are significant. ☺

Why is the click-through rate so low? I think it is because human behavior is difficult to predict. Even sophisticated algorithms (which are computationally practical only on small datasets) do a bad job of predicting human behavior.

It is much more powerful to think of efficient algorithms that execute across larger, diverse datasets and exploit the richness inherent in the context to enable a higher click-through rate.

I’ve observed people in the field sample behavioral data to reduce their operating dataset. I submit that a sample of 1% will lose the nuances and the context that can cause an uplift and growth in revenue.

For example, a content media site may have 2% of its users who come in to read Sports stay on to read Finance articles. A sampling of 1% is certain to reduce this 2% population trait to a statistically insignificant portion of the sample. Should we or should we not derive this insight to identify and engage the 2% by serving them better content?

Similarly, an Internet retailer may find that 2% of the users who come in to buy a flat-panel TV have also bought video games recently. Should we or should we not act on this insight to identify and engage the 2% by offering them better deals on games? Given that games are a high-margin product, the net effect on revenue via cross-sell could be higher than 2% in dollars.

We often want to develop an algorithm that is provably correct under all circumstances. In a bid to satisfy this urge, we restrict our datasets to find a statistically significant model that is a good predictor. I associate that with a purist way of algorithm development that was drilled into us at school.

Anand’s observation is a call for practitioners to think simple, use context, and come up with rules that segment and win locally. It will be faster to develop, test, and win on simple heuristics than to wait for a perfect “Aha!” that explains all things human.
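To make the sampling point concrete, here is a small, purely illustrative Python sketch with synthetic data and hypothetical column names. It evaluates the Sports-then-Finance rule over the full clickstream, the kind of simple segmentation that stays cheap when you keep all the data:

import random
from collections import defaultdict

# Synthetic, purely illustrative clickstream: (userid, section) rows.
random.seed(0)
rows = []
for uid in range(100_000):
    rows.append((uid, "sports"))
    if random.random() < 0.02:          # ~2% of sports readers also read finance
        rows.append((uid, "finance"))

# Simple rule, evaluated over the FULL dataset: which sports readers
# also read finance? No sampling, no model fitting.
sections = defaultdict(set)
for uid, section in rows:
    sections[uid].add(section)

segment = [uid for uid, secs in sections.items() if {"sports", "finance"} <= secs]
print(len(segment), "users in the cross-read segment")   # roughly 2,000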



By Mayank Bawa in Analytics, Blogroll, Business analytics, Business intelligence on May 20, 2008

I’ve remarked in an earlier post that the usage of data is changing and new applications are on the horizon. Over the past few years, we’ve observed or invented quite a few interesting design patterns for business processes that use data.

There are no books or tutorials for these new applications, and they are certainly not being taught in the classrooms of today. So I figured I’d share some of these design patterns on our blog.

Let me start with a design pattern that we internally call “The Automated Feedback Loop”. I didn’t invent it, but I’ve seen it applied successfully at search engines during my research days at Stanford University. I certainly think there is a lot of power that remains to be leveraged from this design principle in other verticals and applications.

Consider a search engine. Users ask keyword queries. The search engine ranks documents that match the queries and provides 10 results to the user. The user clicks one of these results, perhaps comes back and clicks another result, and then does not come back.

How do search engines improve themselves? One key way is by recording the number of times users clicked or ignored a result page. They also record the speed with which a user returned from that page to continue his exploration: the quicker the user returned, the less relevant the page was for the user’s query. The relevance of a page then becomes a factor in the ranking function itself for future queries.

So here is an interesting feedback loop. We offered options (search results) to the user, and the user provided us feedback (came back or not) on how good one option was compared to the others. We then used this knowledge to adapt and improve future options. The more the user engages, the more everyone wins!
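A minimal sketch of such a loop, in Python with made-up names and weights, might track clicks per result and fold the observed click rate back into the ranking score:

# Minimal sketch of an automated feedback loop for ranking
# (hypothetical names and weights, for illustration only).
from collections import defaultdict

stats = defaultdict(lambda: {"shown": 0, "clicked": 0})

def record_feedback(result_id, clicked):
    """Log one impression and whether the user clicked it."""
    stats[result_id]["shown"] += 1
    if clicked:
        stats[result_id]["clicked"] += 1

def feedback_score(result_id):
    """Observed click rate, smoothed so unseen results are not penalized."""
    s = stats[result_id]
    return (s["clicked"] + 1) / (s["shown"] + 2)   # Laplace smoothing

def rank(results, base_score, weight=0.5):
    """Blend the original relevance score with the feedback signal."""
    return sorted(results,
                  key=lambda r: base_score[r] + weight * feedback_score(r),
                  reverse=True)

# One round of the loop: show results, record clicks, re-rank next time.
base = {"page_a": 0.9, "page_b": 0.8}
record_feedback("page_a", clicked=False)
record_feedback("page_b", clicked=True)
print(rank(["page_a", "page_b"], base))   # page_b can overtake page_a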

This same pattern could hold true in a lot of consumer-facing applications that provide consumers with options.

Advertising networks, direct-marketing companies, and social networking sites do take consumer feedback into account. However, in most companies today this feedback loop is manual, not automated. Usually the optimization (adapting to user response) is done by domain experts who read historical reports from their warehouses, build an intuition of user needs, and then apply that intuition to build a model that runs everything from marketing campaigns to supply-chain processes.

Such a manual feedback loop has two significant drawbacks:

1. The process is expensive: it takes a lot of time and trial and error for humans to become experts, and as a result the experts are hard to find and worth their weight in gold.

2. The process is ineffective: humans can only think about a handful of parameters, and they optimize for the most popular products or processes (e.g., the “Top 5 products” or “Top 10 destinations”). Everything outside this comfort zone is left under-optimized.

Such a narrow focus on optimization is severely limiting. Incorporating only the Top 10 trends into future behavior is akin to a search engine saying that it will optimize for only the top 10 searches of the quarter. I am sure Google would be a far less valuable company then, and the world a less engaging place.

I strongly believe that there are rich dividends to be reaped if we can automate the feedback process in more consumer-facing areas. What about hotel selection, airline travel, and e-mail marketing campaigns? E-tailers, news sites (content providers), insurance companies, banks, and media sites are all offering the consumer choices for his or her time and money. Why not instill an automated feedback loop in all consumer-facing processes to improve the consumer experience? The world will be a better place for both the consumer and the provider!