Today Aster took a significant step and made it easier for developers building fraud detection, financial risk management, telco network optimization, customer targeting and personalization, and other advanced, interactive analytic applications.
Along with the release of Aster Data nCluster 4.5, we added a new Solution Partner level for systems integrators and developers.
Why is this relevant?
Recession or no recession, IT executives are constantly challenged. They are asked to execute strategies based on better analytics and information to improve the effectiveness of business processes (customer loyalty, inventory management, revenue optimization, and so on), while staying on top of technology-based disruptions and managing shrinking or flat IT budgets.
IT organizations have taken on the challenge by building analytics-based offerings, leveraging existing data management skills, and increasingly taking advantage of MapReduce, a disruptive technology introduced by Google and now being rapidly adopted by mainstream enterprise IT shops in Finance, Telco, LifeSciences, Govt. and other verticals.
As MapReduce and big data analytics go mainstream, our customers and ecosystem partners have asked us to make it easier for their teams to leverage MapReduce across enterprise application lifecycles, while harvesting existing IT skills in SQL, Java and other programming languages. The Aster development team that brought us the SQL/MapReduce innovation has now delivered the market’s first integrated visual development environment for developing, deploying and managing MapReduce and SQL-based analytic applications.
Enterprise MapReduce developers and system integrators can now leverage the integrated Aster platform and deliver compelling business results in record time (read how comScore delivers a 360-degree view of the digital world to enterprise customers, and how Full Tilt Poker gains the upper hand in tackling online fraud using Aster).
We are also teaming up with leaders in our ecosystem like MicroStrategy to deliver an end-to-end analytics solution to our customers that includes SQL/MapReduce enabled reporting and rich visualization. Aster is proud to be driving innovation in the Analytics and BI market and was recently honored at MicroStrategy’s annual customer conference.
I am delighted with the rapid adoption of Aster Data’s platform by our partners and the strong continued interest from enterprise developers and system integrators in building big data applications using Aster. New partners are endorsing our vision and technical innovation as the future of advanced analytics for large data volumes.
Sign up today to be an Aster solution partner and join the revolution to deliver compelling information and analytics-driven solutions.
I’m very excited about the upcoming Big Data Summit in New York City on Thursday evening (October 1st). The event is sponsored by Aster Data, MicroStrategy, and Informatica, and we have an incredible speaker lineup including LinkedIn, comScore, and Colin White from BI Research. Check out the Facebook page for the event here.
To kick off the festivities, we’re holding a live webinar earlier in the day at 9 a.m. US Pacific time. Colin White and I will be discussing Hadoop and data warehousing - how they’re similar, how they’re different, and what they can be used for (both separately and together). In fact, we’ll be making an important announcement of a new product offering that you won’t want to miss. If you haven’t already, I urge you to register by clicking here.
Mark it on your calendar - something “big” is coming on October 1st.
When we announced Aster nCluster’s In-Database MapReduce feature last year, many people were intrigued by the new analytics they would be able to do in their database. However, In-Database MapReduce is new and often loaded with a lot of technical discussion on how it’s different from PL/SQL or UDFs, whether it’s suitable for business analysts or developers, and more. What people really want to know is how businesses can take advantage of MapReduce.
I’ve referred to how our customers use In-Database MapReduce (and nPath) for click-stream analytics. In our “MapReduce for Data Warehousing and Analytics” webinar last week, Anand Rajaraman covered several other example applications. Rajaraman is CEO and Founder of Kosmix and Consulting Assistant Professor in the Computer Science Department at Stanford University (full disclosure: Anand is also on the Aster board of directors). After spending some time on graph analysis, i.e. finding the shortest path between items, Rajaraman discussed applications in finance, behavioral analytics, text, and statistical analysis that can be easily completed with In-Database MapReduce but are difficult or impossible with SQL alone.
As Rajaraman says, “We need to think beyond conventional relational databases. We need to move on to MapReduce. And the best way of doing that is to combine MapReduce with SQL.”
Our goal at Aster is to build a product that will answer your analytical questions sooner. Sooner doesn’t just mean faster database performance - it means faster answers from the moment you conceive of the question to the moment you get the answer. This means allowing analysts and end-users to easily ask the questions on their mind.
Aster nCluster, our massively-parallel database, has supported SQL from birth. SQL is great in many respects: it allows people of various levels of technical proficiency to ask lots of interesting questions in a relatively straightforward way. SQL’s easy to learn but powerful enough to ask the right questions.
But, we’ve realized that in many situations SQL just doesn’t cut it. If you want to sessionize your web clicks or find interesting user paths, run a custom fraud classifier, or tokenize and stem words across documents, you’re out of luck. Enter SQL/MR, one part of our vision of what a 21st-century database system should look like.
Let’s say your data is in nCluster. If your analytic question can be answered using SQL, you don’t have to worry about writing Java or Python. But, as soon as something more complicated comes up, you can write a SQL/MR function against our simple API, upload it into the cluster, and have it start operating on your data by invoking it from SQL. How is this related to MapReduce? It turns out that these functions are sufficient to express a full MapReduce dataflow. How are SQL/MR functions different than the UDFs of yore? It’s all about scale, usability, reusability; all three contributing to you getting your answer sooner.
Scalability
SQL/MR functions play in a massively-parallel sandbox, one with terabytes and terabytes of data, so they’re designed to be readily parallelized. Yes, they just accept a table as input and produce a table as output, but they do so in a distributed way at huge scale. They can take as input either rows (think “map”) or well-defined partitions (think “reduce”), which allows nCluster to move data and/or computation around to make sure that the right data is on the right node at the right time. SQL/MR functions are table functions breaking out of the single-node straitjacket. This means you can analyze lots of data fast.
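To make the row-versus-partition distinction concrete, here is a minimal single-process sketch in Python. It is not the SQL/MR API: the helper names (`run_row_function`, `run_partition_function`, `tokenize`, `sum_counts`, `word_count`) are hypothetical and chosen only for illustration. It simply shows how a function that sees one row at a time and a function that sees one complete partition at a time can be chained into a MapReduce-style dataflow, with a table going in and a table coming out at each step.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def run_row_function(rows: Iterable[Row],
                     fn: Callable[[Row], Iterable[Row]]) -> Iterator[Row]:
    """Apply fn to each row independently (the "map" style)."""
    for row in rows:
        yield from fn(row)


def run_partition_function(rows: Iterable[Row], key: str,
                           fn: Callable[[Any, List[Row]], Iterable[Row]]) -> Iterator[Row]:
    """Group rows by key, then apply fn to each complete partition (the "reduce" style)."""
    partitions: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    for value, part in partitions.items():
        yield from fn(value, part)


def tokenize(row: Row) -> Iterator[Row]:
    """Row function: emit one output row per word in the input row's text."""
    for word in str(row["text"]).lower().split():
        yield {"word": word, "count": 1}


def sum_counts(word: Any, part: List[Row]) -> Iterator[Row]:
    """Partition function: collapse all rows for one word into a single total."""
    yield {"word": word, "count": sum(r["count"] for r in part)}


def word_count(docs: Iterable[Row]) -> List[Row]:
    """Chain the two styles: rows in, rows out at every stage."""
    mapped = run_row_function(docs, tokenize)
    reduced = run_partition_function(mapped, "word", sum_counts)
    return sorted(reduced, key=lambda r: r["word"])


if __name__ == "__main__":
    docs = [{"text": "big data"}, {"text": "Big answers"}]
    for row in word_count(docs):
        print(row)
```

In a real cluster the row function could run wherever each row already lives, while the partition function requires that all rows sharing a key be brought to the same node first, which is the data movement the paragraph above refers to.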
Usability
We want to make sure that developers using our SQL/MR framework spend their time thinking about the analytics, not dealing with infrastructure issues. We have a straightforward API (think: you get a stream of rows and give us back a stream of rows) and a debugging interface that lets you monitor execution of your function across our cluster. Want to write and run a function? One command installs the function, and a single SQL statement invokes it. The data you provide the function is defined in SQL, and the output can be sliced and diced with more SQL - no digging into Java if you want to change a projection, provide the function a different slice of data, or add a sort onto the output. All this allows a developer to get a working function - sooner - and an analyst to tweak the question more readily.
Reusability
We’ve gone to great lengths to make sure that a SQL/MR function, once written, can be leveraged far and wide. As mentioned before, SQL/MR functions are invoked from SQL, which means that they can be used by users who don’t know anything about Java. They also accept “argument clauses” - custom parameters which integrate nicely with SQL. Our functions are polymorphic, which means their output is dynamically determined by their input. This means that they can be used in a variety of contexts. And, it means that any number of people can write a function which you can easily reuse over your data. A function, once written, can be reused all over the place, allowing users to ask their questions faster (since someone’s probably asked a similar question in the past).
In fact, we’ve leveraged the SQL/MR framework to build a function that ships with nCluster: nPath. But this is just the first step, and the sky’s the limit. SQL/MR could enable functions for market basket analysis, k-means clustering, support vector machines, natural language processing, among others.
How soon will your questions be answered? I’d love to hear any ideas you have for analytic functions that you’re struggling to write in SQL and that you think could be a good fit for SQL/MapReduce.
Cloud computing is a fascinating concept. It offers greenfield opportunities (or more appropriately, blue sky frontiers) for businesses to affordably scale their infrastructure needs without plunking down a huge hardware investment (and the space/power/cooling costs associated with managing your own hosted environment). This removes the risks of mis-provisioning by enabling on-demand scaling according to your data growth needs. Especially in these economic times, the benefits of Cloud computing are very attractive.
But let’s face it - there’s also a lot of hype, and it’s hard to separate truth from fiction. For example, what qualities would you say are key to data warehousing in the cloud?
Here’s a checklist of things I think are important:
[1] Time-To-Scalability. The whole point of clouds is to offer easy access to virtualized resources. A cloud warehouse needs to quickly scale-out and scale-in to adapt to changing needs. It can’t take days to scale…it has to happen on-demand in minutes (<1 hour).
[2] Manageability. You go with clouds because you not only want to save on hardware, but also on the operational people costs of maintaining that infrastructure. A cloud warehouse needs to offer one-click scaling, easy install/upgrade, and self-managed resiliency.
[3] Ecosystem. While clouds offer *you* huge TCO savings, you can’t compromise service levels for your customers - especially if you run your business on the cloud. BI/ETL/monitoring tools, Backup & Recovery, and ultra-fast data loading can’t be overlooked for “frontline” mission-critical warehousing on the cloud.
[4] Analytics. Lots of valuable data is generated via the cloud and there are opportunities to subscribe to new data feed services. It’s insufficient for a cloud warehouse to just do basic SQL reporting. Rather, it must offer the ability to do deep analytics very quickly.
[5] Choice. A truly best-in-class cloud warehouse won’t lock you in to a single cloud vendor. Rather, it will offer portability by enabling you to choose the best cloud for you to run your business on.
Finally, here are a couple of ideas on the future of cloud warehousing. What if you could link multiple cloud warehouses together and do interesting queries across clouds? And what about the opportunities for game-changing new analytics - with so many emerging data subscription services, wouldn’t this offer ripe opportunities for mash-up analytics (e.g. using Aster SQL/MapReduce)?
What do you think are the standards for “best-in-class” cloud warehousing?
It may not sound sexy, but analyzing a sequence of events over time is non-trivial in standard SQL. When Aster started providing customers ways to express time-series analysis more elegantly and have it return result sets in 90 seconds as opposed to 90 minutes, we started getting comments like, “That’s beautiful … I would make out with my screen if I could!”.
Crazy, I know.
Time-series, or sequential pattern analysis, is useful in a number of industries and applications such as:
- Price changes over time in financial market data
- Path analysis within click-stream data to define anything from top referring sites to the “golden” paths customers navigate before purchasing
- Patterns which detect deviant activity such as spamming or insurance claims fraud
- Sessionization (mapping each event in a clickstream to a human user session)
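To make the last item concrete, here is a minimal Python sketch of sessionization. It is an illustration only, not Aster code: the function name `sessionize`, the `(user_id, timestamp)` input shape, and the 30-minute inactivity timeout are all assumptions made for this example. The idea is to order each user's events by time and start a new session whenever the gap since that user's previous event exceeds the timeout.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]            # (user_id, timestamp in seconds); assumed shape
Labeled = Tuple[str, int, int]     # (user_id, timestamp, session number)


def sessionize(events: List[Event], timeout: int = 1800) -> List[Labeled]:
    """Assign a per-user session number to every event.

    A new session starts at a user's first event, or whenever more than
    `timeout` seconds have passed since that user's previous event.
    """
    last_seen: Dict[str, int] = {}
    session_no: Dict[str, int] = {}
    labeled: List[Labeled] = []

    # One ordered pass: sort by user, then by time within each user.
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        if user not in last_seen or ts - last_seen[user] > timeout:
            session_no[user] = session_no.get(user, -1) + 1
        last_seen[user] = ts
        labeled.append((user, ts, session_no[user]))
    return labeled


if __name__ == "__main__":
    clicks = [("u1", 0), ("u1", 600), ("u2", 100), ("u1", 4000)]
    for row in sessionize(clicks):
        print(row)
```

Each event is compared only with the previous event for the same user, so the whole computation is a single ordered pass over the data.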
More specifically, one customer wanted to improve the effectiveness of their advertising and drive more visitors to their site. They asked us to help determine the top paths people take to get to their site and top paths people take after leaving the site. Knowing the “before” path gave them insight into what sites to place advertisements on to drive traffic to their site. Additionally, pathing helps the customer understand behavior/preferences of users who visit their site (e.g., if espn.com is a top site that is in the path of the 5 pages leading up to the customer site, they know that many visitors like sports).
However, discovering relationships between rows of data is difficult to express in SQL, which must invoke multiple self-joins of the data. These joins dramatically expand the amount of data involved in the query and slow down query performance - not to mention complexity in developing and parsing these expressions.
What’s the solution? There are a few, but Aster’s approach has been to develop SQL extensions that execute in-database, in a single pass over the data, and in a massively-parallel fashion. The centerpiece is nPath, a SQL-MapReduce (SQL/MR) function that performs regular expression pattern matching over a sequence of rows. It allows users to:
- Specify any pattern in an ordered collection - a sequence - of rows with symbols;
- Specify additional conditions on the rows matching these symbols; and
- Extract useful information from these row sequences.
I’ll share the syntax of nPath here to give you more context of how the query operates:
SELECT…
FROM nPath (
ON {table|query}
…various parameters…
)
WHERE…
GROUP BY…
HAVING…
ORDER BY…
etc.
nPath performs pattern matching and computes aggregates. The results of this computation are the output rows of nPath. The rows from nPath can subsequently be used like any other table in a SQL query: rows from nPath may be filtered with the WHERE clause, joined to rows from other tables, grouped with the GROUP BY clause, and so on.
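The idea of regular expression matching over rows can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not nPath's implementation or syntax: the function `match_paths` and its parameters (`partition_key`, `order_key`, `symbols`, `pattern`) are hypothetical names invented for this example. Each row is mapped to a one-character symbol by the first predicate it satisfies, each partition's ordered rows become a string of symbols, and an ordinary regular expression finds the matching row sequences.

```python
import re
from itertools import groupby
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def match_paths(rows: List[Row], partition_key: str, order_key: str,
                symbols: Dict[str, Callable[[Row], bool]],
                pattern: str) -> List[List[Row]]:
    """Return every run of rows whose symbol string matches `pattern`.

    `symbols` maps a single-character symbol (anything except "_") to a
    predicate over a row. Rows that satisfy no predicate are encoded as "_".
    """
    matches: List[List[Row]] = []
    ordered = sorted(rows, key=lambda r: (r[partition_key], r[order_key]))
    for _, group in groupby(ordered, key=lambda r: r[partition_key]):
        seq = list(group)
        # Encode the ordered partition as one symbol per row.
        encoded = "".join(
            next((s for s, pred in symbols.items() if pred(r)), "_")
            for r in seq
        )
        # Non-overlapping matches, left to right; skip empty matches.
        for m in re.finditer(pattern, encoded):
            if m.end() > m.start():
                matches.append(seq[m.start():m.end()])
    return matches


clicks = [
    {"user": "u1", "ts": 1, "page": "home"},
    {"user": "u1", "ts": 2, "page": "search"},
    {"user": "u1", "ts": 3, "page": "checkout"},
    {"user": "u2", "ts": 1, "page": "home"},
    {"user": "u2", "ts": 2, "page": "home"},
]
symbols = {
    "H": lambda r: r["page"] == "home",
    "S": lambda r: r["page"] == "search",
    "C": lambda r: r["page"] == "checkout",
}

if __name__ == "__main__":
    # Home page, any number of searches, then checkout.
    for path in match_paths(clicks, "user", "ts", symbols, "HS*C"):
        print([r["page"] for r in path])
```

Because each partition is encoded and scanned once, no self-join is needed to relate one row to the rows that follow it; the matched sequences can then be aggregated or filtered like any other result set.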
The result? Incredibly powerful insight into a series of events which indicates a pattern or segment can be expressed in SQL, run in parallel on massive clusters of compute-power in an extremely efficient manner via a single pass over the data, and made accessible to business analysts through traditional BI tools.
What do you think? What other methods are people using to tackle these types of problems?
I recently attended a panel discussion in New York on media fragmentation consisting of media agency execs including:
- Bant Breen (Interpublic - Initiative – President, Worldwide Digital Communications),
- John Donahue (Omnicom Media Group - Director of BI Analytics and Integration),
- Ed Montes (Havas Digital - Executive Vice President),
- Tim Hanlon (Publicis - Executive Vice President/Ventures for Denuo)
The discussion was kicked off by Brian Pitz, Principal of Equity Research for Bank of America. Brian set the stage for a spirited discussion regarding the continuing fragmentation of online media, along with research on the issues posed by this. The panel discussion touched upon many issues, including fear of ad placement around unknown user-generated content, agencies’ lack of the skill set to address this medium, and a lack of standards. However, what surprised me most was the unanimous opinion that there is more value further out on “The Tail” of the online publisher spectrum due to the targeted nature of the content. Yet the online media buying statistics conflict with this opinion (over 77% of online ad spending is still flowing to the top 10 sites).
When asked “why the contrast?” between their sentiment and the stats, the discussion revealed the level of uncertainty due to a lack of transparency into “The Tail”. Despite the 300+ ad networks that have emerged to address this very challenge, the value chain lacks the data to confidently invest the dollars. In addition, there was a rather cathartic moment when John Donahue professed that agencies should “Take Back Your Data From Those that Hold It Hostage”.
It is our belief that the opinions expressed by the panel serve as evidence of a shift towards a new era in media where evidential data will drive valuation across media rather than sampling-based ratings acting as the currency. No one will be immune from this:
- Agencies need it to confidently invest their clients’ dollars and show demonstrable ROI of their services
- Ad networks need it to earn their constituencies’ share of marketing budgets
- Ad networks need it to defend the targeted value and the appropriateness of their collective content
- 3rd Party measurement firms (comScore, Nielsen Online, ValueClick) need it to maintain the value of their objective measurement
- Advertisers need it to support the logic of budget allocation decisions
- BIG MEDIA needs it to defend their 77% stake
You might be thinking, “The need for data is no great epiphany”. However, I submit that the amount of data and the mere fact that all participants should have their own copy is a shift in thinking. Gone are the days where:
- The value chain is driven solely by 3rd parties and their audience samples
- Ad Servers/Ad Networks are the only keepers of the data
- Service Providers can offer data for a fee
I was at Defrag 2008 yesterday and it was a wonderful, refreshing experience. A diverse group of Web 2.0 veterans and newcomers came together to accelerate the “Aha!” moment in today’s online world. The conference was very well organized and there were interesting conversations on and off the stage.
The key observation was that individuals, groups and organizations are struggling to discover, assemble, organize, act on, and gather feedback from data. Data itself is growing and fragmenting at an exponential pace. We as individuals feel overwhelmed by the slew of data (messages, emails, news, posts) in the microcosm, and we as organizations feel overwhelmed in the macrocosm.
The very real danger is that an individual or organization’s feeling of being constantly overwhelmed could result in the reduction of their “Aha!” moments - our resources will be so focused on merely keeping pace with new information that we won’t have the time or energy to connect the dots.
The goal then is to find tools and best practices to enable the “Aha!” moments - to connect the dots even as information piles up at our fingertips.
My thought going into the conference was that we need to understand what causes these “Aha!” moments. If we understand the cause, we can accelerate the “Aha!” even at scale.
Earlier this year, Janet Rae-Dupree published an insightful piece in the International Herald Tribune on Reassessing the Aha! Moment. Her thesis is that creativity and innovation - “Aha! Moments” - do not come in flashes of pure brilliance. Rather, innovation is a slow process of accretion, building small insight upon interesting fact upon tried-and-true process.
Building on this thesis, I focused my talk on using frontline data warehousing as an infrastructure piece that allows organizations to collect, store, analyze and act on market events. The incremental fresh data loads in a frontline data warehouse add up over time to build a stable historical context. At the same time, applications can contrast fresh data with historical data to build the small contrasts gradually until the contrasts become meaningful to act upon.
I’d love to hear back from you on how massive data can accelerate, rather than impede, the “Aha!” moment.
In a down economy, marketing and advertising are some of the first budgets to get cut. However, recessions are also a great time to gain market share from your competitors. If you take a pragmatic, data-driven approach to your marketing, you can be sure you’re getting the most ROI from every penny spent. It is not a coincidence that in the last recession, Google and Advertising.com came out stronger since they provided channels that were driven by performance metrics.
That’s why I’m excited that Aster Data Systems will be at Net.Finance East in New York City next week. Given the backdrop of the global credit crisis, we will learn first-hand the implications of the events in the financial landscape. I am sure marketing executives are thinking of ways to take advantage of a change in the financial landscape, whether it’s multi-variate testing, more granular customer segmentation, or simply lowering the data infrastructure costs associated with their data warehouse or Web analytics.
Look us up if you’re at Net.Finance East - we’d love to learn from you and vice-versa.
Aster announced the general availability of our nCluster 3.0 database, complete with new feature sets. We’re thrilled with the adoption we saw before GA of the product, and it’s always a pleasure to speak directly with someone who is using nCluster to enable their frontline decision-making.
Lenin Gali, director of business intelligence for the online sharing platform ShareThis, is one such friend. He recently sat down with us to discuss how Internet and social networking companies can successfully grow their business by rapidly analyzing and acting on their massive data.