Archive for the ‘Blogroll’ Category

26
Oct
By Steve Wooledge in Analytics, Blogroll, Frontline data warehouse on October 26, 2008
   

In a down economy, marketing and advertising are some of the first budgets to get cut. However, recessions are also a great time to gain market share from your competitors. If you take a pragmatic, data-driven approach to your marketing, you can be sure you’re getting the most ROI from every penny spent. It is not a coincidence that in the last recession, Google and Advertising.com came out stronger since they provided channels that were driven by performance metrics.

That’s why I’m excited that Aster Data Systems will be at Net.Finance East in New York City next week. Given the backdrop of the global credit crisis, we will learn first-hand how events in the financial landscape are affecting online financial services. I am sure marketing executives are already thinking of ways to take advantage of the change, whether it’s multivariate testing, more granular customer segmentation, or simply lowering the data infrastructure costs associated with their data warehouse or Web analytics.

Look us up if you’re at Net.Finance East - we’d love to learn from you and vice-versa.



12
Oct
By George Candea in Blogroll, Frontline data warehouse on October 12, 2008
   

At long last, I get to smash some hardware! Ever since we started the recovery-oriented computing (ROC) project at Stanford and Berkeley in 2001, I’ve been dreaming of demoing ROC by taking a sledgehammer to a running computer and having the software continue to run despite the damage. I never quite had the funds for it! :-)

The Aster nCluster analytic database embodies so much of ROC (microrebooting, undo, fault-injection-based testing, and so on) that fulfilling this “childhood dream” within the context of nCluster is a perfect match. It’s hilarious to watch, but for me it was a great experience; check out Recovery-Oriented Computing for Databases (the actual demonstration) and DBA’s Gone Wild (just for fun). The only thing I’d do differently is get a bigger sledgehammer, because some of that hardware was really built to last (hats off to HP)!

Beyond all the fun involved, there is also a broader message. A lot of data warehouses are way too fragile, and too many people believe that investing in more solid hardware is the way to go. Frontline business applications have to be able to withstand a wide range of failures, and what I do in these videos really just scratches the surface.

At Aster, “always on” availability is much more than a key marketing message - it’s a core database innovation founded on recovery-oriented computing, which minimizes both planned and unplanned downtime for our customers. Whether it’s an analytic application or an analyst who needs 24/7 query access for modeling, data mining, or business intelligence (BI) reporting, having a database they can depend on is critical.



06
Oct
By Steve Wooledge in Analytics, Blogroll, Frontline data warehouse on October 6, 2008
   

Aster announced the general availability of our nCluster 3.0 database, complete with new feature sets. We’re thrilled with the adoption we saw before GA of the product, and it’s always a pleasure to speak directly with someone who is using nCluster to enable their frontline decision-making.

Lenin Gali, Director of BI, ShareThis


Lenin Gali, director of business intelligence for the online sharing platform ShareThis, is one such friend. He recently sat down with us to discuss how Internet and social networking companies can successfully grow their business by rapidly analyzing and acting on their massive data.

You can read the full details of our conversation on the Aster Website.



06
Oct
Growing Your Business with Frontline Data Warehouses
By Mayank Bawa in Analytics, Blogroll, Frontline data warehouse on October 6, 2008
   

It is really remarkable how many companies today view data analytics as the cornerstone of their businesses.

aCerno is an advertising network that uses powerful analytics to predict which advertisements to deliver to which person at what time. Their analytics are performed on completely anonymous consumer shopping data from 140M users, obtained from an association of 450+ product manufacturers and multi-channel retailers. There is a strong appetite at aCerno for analytics they have not done before, because each 1% uplift in click-through rates is a significant revenue stream for them and their customers.

Aggregate Knowledge powers a discovery network (The Pique Discovery™ Network) that recommends products and content to an individual based on what they previously purchased and viewed, using the collective behavior of crowds that behaved similarly in the past. Again, each 1% increase in engagement is a significant revenue stream for them and their customers.

ShareThis provides a sharing network via a widget that makes it simple for people to share things they find online with their friends. In the short period since its launch, ShareThis has reached over 150M unique monthly users. The amazing insight is that ShareThis knows which content users actually engage with and want to tell their friends about! And in a stroke of genius, ShareThis gives its service away to publishers and consumers for free, relying on targeted advertising for its revenue: it can deliver relevant ad messages because it knows the characteristics of the thing being shared. Again, the better their analytics, the better their revenue.

Which brings me to my point: data analytics is a direct contributor to revenue gains at these companies.

Traditionally, we think of data warehousing as a back-office task. The data warehouse can be loaded in separate load windows; loads can run late (the net effect is that business users get their reports late); loads, backups, and scale-up can take the data warehouse offline - which is OK, since these tasks can be done outside business hours (nights/weekends).

But these companies rely on data analytics for their revenue.

·    A separate exclusive load window implies that their service is not leveraging analytics during that window;
·    A late-running load implies that the service is getting stale data;
·    An offline warehouse implies that the service is missing fresh trends.

Any such planned or unplanned outage results in lower revenues.

On the flip side, a faster load/query provides the service a competitive edge – a chance to do more with their data than anyone else in the market. A nimbler data model, a faster scale-out, or a more agile ETL process helps them implement their “Aha!” insights faster and gain revenue from a reduced time-to-market advantage.

These companies have moved data warehousing from the back-office to the frontlines of business: a competitive weapon to increase their revenues or to reduce their risks.

In response, the requirements of a data warehouse that supports these frontline applications go up a few notches: the warehouse has to be available for querying and loading 365x24x7; the warehouse has to be fast and nimble; the warehouse has to allow “Aha!” queries to be phrased.

We call these use cases “frontline data warehousing”. And today we released a new version of Aster nCluster that rises up those few notches to meet the demands of frontline applications.



06
Oct
By Tasso Argyros in Analytics, Blogroll, Frontline data warehouse on October 6, 2008
   

Back in the days when Mayank, George, and I were still students at Stanford, working hard to create Aster, we had a pretty clear vision of what we wanted to achieve: allow the world to do more analytics on more data. Aster has grown tremendously since then, but that vision hasn’t changed. And one can see this very clearly in the new release of our software, Aster nCluster 3.0, which is all about doing more analytics with more data. Because 3.0 introduces so many important features, we have grouped them into three big buckets: Always Parallel, Always On, and In-Database MapReduce.

Always Parallel has to do with the “Big Data” part of our vision. We want to build systems that can handle 10x - 100x more data than any other system today. But this is too much data for any single “commodity server” (that is, a server with reasonable cost) that one can buy. So we put a lot of R&D effort into parallelizing every single function of the system - not only querying, but also loading, data export, backup, and upgrades. Plus, we let our users choose how much to parallelize each of these functions, without having to scale up the whole system.

Always On also stems from the need to handle “Big Data”, but in a different way. To store and analyze anything from a terabyte to a petabyte, you need a system with more than a single server. But then availability and management can become a huge problem. What if a server fails? How do I keep going, and how do I recover from the failure (either by reintroducing the same server or adding a replacement) with no downtime? How can I seamlessly expand the system, to realize the great promise of horizontal scaling, without taking it down? And, finally, how do I back up all these oceans of data without disrupting the system’s operation? All of these issues are handled in our new 3.0 release.

We introduced In-Database MapReduce in a previous post, so I won’t spend too much time on it here. But I do want to point out how it fits our overall vision. Having a database that is always parallel and always on lets you handle Big Data with high performance, low cost, and high availability. But once you have all this data, you want to do more analytics - to extract more value and insight. In-Database MapReduce is meant to do exactly that: push the limits of the insights you can extract by providing the first-ever system that tightly integrates MapReduce (a powerful analytical paradigm) with a widespread standard like SQL.
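To make that concrete, here is a minimal sketch of what layering plain SQL over a MapReduce-style function can look like. It borrows the illustrative Sessionize function and webclicks table from the SQL/MR examples further down this page; treat the specifics as an illustration of the idea rather than a definitive API.

SELECT sessionId, COUNT(*) AS clicks_in_session
FROM Sessionize( 'timestampValue', 60 ) ON webclicks
GROUP BY sessionId
ORDER BY clicks_in_session DESC;

The MapReduce function does the parallel, row-by-row sessionization work, while ordinary SQL aggregates and sorts its output.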

These are the big features in nCluster 3.0, and in most of our marketing materials we stop there. But I also want to talk about the other great things in this release; things too subtle or technical for the headlines, but still very important. We’ve added table compression features that offer online, multi-level compression for cost savings. With table compression, you can choose your compression ratio and algorithm and have different tables compressed differently. This paves the way for data life-cycle management that compresses data differently depending on its age.
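To illustrate the idea - and only the idea; the DDL below is hypothetical, not actual nCluster syntax - per-table compression choices might look something like this, with older, colder data compressed more aggressively than current, hot data:

-- Hypothetical DDL, for illustration only (not actual nCluster syntax).
-- Older, rarely queried data gets heavier compression; hot data stays light and fast.
CREATE TABLE webclicks_2007_archive (userId int, timestampValue timestamp, URL varchar) COMPRESS HIGH;
CREATE TABLE webclicks_current (userId int, timestampValue timestamp, URL varchar) COMPRESS LOW;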

We’ve also implemented richer workload management, offering quality of service through fine-grained prioritization of mixed workloads via priority- and fair-share-based resource queues. You can even allocate resource weights based on the number of transactions or on time (useful when big and small jobs run side by side).

3.0 also has Network Aggregation (NIC “bonding”) for performance and fault tolerance. This is a one-click configuration that automates network setup - usually a tedious, error-prone sysadmin task. And that’s not the end of it - we are also introducing an Upgrade Manager that automates upgrades from one version of nCluster to another, including what most frequently breaks upgrades: the operating system components. This is another building block of the low cost of ongoing administration that we’re so proud of achieving with nCluster. I could go on and on (new SQL enhancements, new data validation tools, heterogeneous hardware support, LDAP authentication, …), but since blog space is supposed to be limited, I’ll stop here. (Check out our new resource library if you want to dig deeper.)

Overall, I am delighted to see how our product has evolved towards the vision we laid out years back. I’m also thrilled that we’re building a solid ecosystem around Aster nCluster – we now support all the major BI platforms – and are establishing quite a network of systems integrators to help customers with implementation of their frontline data warehouses. In a knowledge-based economy full of uncertainty, opportunities, and threats, doing more analytics on more data will drive the competitiveness of successful corporations – and Aster nCluster 3.0 will help you deliver just that for your own company.



21
Sep
By Mayank Bawa in Blogroll, Statements on September 21, 2008
   

There has been a lot of turmoil this past week in Financial Services. Several good people had their projects stalled, or even lost their jobs, due to market forces beyond their control.

I’d like to reach out to the quantitative computer scientists who have been affected. If you are good with data and know how to extract intelligence from it, we want you on our team!

We are hiring. You’ll have the chance to work with a number of our customers and help them do more with their data. You’ll bring a fresh perspective to their business processes; in turn, you’ll learn about the business processes of various verticals. That’s an invaluable education for when you want to go back to Financial Services after the crisis has passed in a couple of years.

Drop us a note at careers [at] asterdata [dot] com. We’d love to hear from you!



15
Sep
By Tasso Argyros in Blogroll, Database on September 15, 2008
   

Dave Kellogg’s blog reminded me that the Claremont DB Research report was recently released. The Claremont report is the result of two days of discussion among some of the world’s leading database academics, and it aims to identify and promote the most promising research directions in the field.

As I was reading the report, I realized that Aster Data is at the forefront of some of the most exciting database research topics. In particular, the report mentions four areas (out of a total of six) where Aster has been driving innovation very aggressively.

1. Revisiting database engines. MPP is the answer to Big Data, among other things.

2. Declarative programming for emerging platforms. MapReduce is explicitly mentioned here, with a note on its potential in data management. This is a very important development, given that certain database academics (who participated in the report) have repeatedly shown their disdain for, and ignorance of, the topic.

3. Interplay of structured and unstructured data. This is an important area where MapReduce can play a huge role.

4. Cloud data services. Database researchers realize the potential of the cloud, both as a data management and a research tool. With our precision scaling feature, we are a strong fit for internal Enterprise clouds.

The world of databases is changing fast and this is an opportunity for us to provide the most cutting-edge database technology to our customers.

We’ve also found a lot of benefit from our strong ties with academia, by nature of our background and advisors, and we intend to strengthen these even more.



10
Sep
By Tasso Argyros in Blogroll, MapReduce on September 10, 2008
   

I am very excited about the power that In-Database MapReduce puts in the hands of the larger BI community. I’ll be leading a Night School session on In-Database MapReduce at the TDWI World Conference in November in New Orleans.

Please join me if you are interested in learning more about the MapReduce framework and its applications. I will introduce MapReduce from first principles and then help build up your intuition. If we have time, I will even address why MapReduce is not simply UDFs re-discovered. :-)

If you are unable to attend, or are eager to dig in before then, here are some MapReduce resources you may find informative: Aster’s whitepaper on In-Database MapReduce; Google Labs’ MapReduce research paper; and Curt Monash’s post on Known Applications of MapReduce.

A great open-source project that illustrates the power of MapReduce, and that I’d like to commend and draw your attention to, is Apache’s Mahout project, which is building machine learning algorithms (classification, clustering, regression, dimension reduction, and evolutionary algorithms) on the MapReduce framework.

I am sure this is just a snippet of the MapReduce resources available. If you have some that you have found helpful, please share them in your comments. I will be happy to review and cover them in our TDWI Night School!



06
Sep
By Tasso Argyros in Blogroll, Database, MapReduce on September 6, 2008
   

In response to Aster’s In-Database MapReduce initiative, I’ve been asked the following question:

“How does Aster Data Systems compete with open source MapReduce implementations, such as Hadoop?”

My answer – we simply do not.

Hadoop and Google’s implementation of MapReduce are targeted to the development (coding) community. The primary interface of these systems is the command line; and the primary means of accessing data is through Java or Python code. There have been efforts to build higher-level interfaces on top of these systems, but they are usually limited, do not follow any existing standard, and are incompatible with the existing filesystem.

Such tools are ideal for environments that are dominated by engineers, such as academic institutions, research labs or technology companies like Google/Yahoo that have a strong culture of in-house development (often hundreds of thousands of lines of code) to solve technical problems.

Most enterprises do not share the culture of Google/Yahoo, and each “build vs. buy” decision is carefully considered. Good engineering talent is a precious resource directed toward adding business value, not toward building infrastructure from the ground up. Data Services groups are universally under-staffed and consist of people who understand and leverage databases. As such, there are corporate governance expectations for any data management tool they use:

- it has to comply with applicable standards like ANSI-SQL,

- it needs to provide a set of tools that IT can use & manage, and

- it needs to be ecosystem-friendly (BI and data integration tools compatibility).

In such an environment, using Java or a developer-centric command line as the primary interface would increase the burden on the data services group and their IT counterparts.

I strongly believe that while existing MapReduce tools are good for development organizations, they are totally inappropriate for the large majority of enterprise IT departments.

Our goal is not to build yet another tool for development groups, but rather to create a product that unleashes the power of MapReduce for the enterprise IT organization.

How can we achieve that?

First, we’ve developed Aster to be a super-fast, always-parallel database for large-scale data warehousing using SQL. Then we allow our customers and partners to extend SQL through a tightly integrated MapReduce functionality.

The person who develops our MapReduce functions naturally needs to be a developer; but the person using this functionality can be an analyst with a standard BI tool (e.g., MicroStrategy, Business Objects, Pentaho) over an ODBC or JDBC connection!

Invoking MapReduce functions in Aster looks almost identical to writing standard SQL code. This way, the powerful MapReduce extensions that are developed by a small set of developers (either within an IT organization or by Aster itself) can be used by people with SQL skills using their existing sets of tools.
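As a sketch of what that looks like from the analyst’s side, a BI tool could issue ordinary SQL like the following over its ODBC/JDBC connection; the Sessionize function and webclicks table are the same illustrative examples used in the post below, not built-in objects.

SELECT userId, COUNT(DISTINCT sessionId) AS num_sessions
FROM Sessionize( 'timestampValue', 60 ) ON webclicks
GROUP BY userId
ORDER BY num_sessions DESC;

The developer writes Sessionize once; the analyst only ever writes SQL.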

Integrating MapReduce and SQL is not an easy job; we had to innovate on multiple levels to achieve it, e.g., by creating a new type of UDF that is both parallel and polymorphic, to make MapReduce extensions almost indistinguishable from standard SQL.

In summary, we have enabled:

- The flexible, parallel power of MapReduce to enable deep analytical insights that are impossible to express in standard SQL

- Seamless integration with ANSI standard SQL and all the rich commands, types, functions, etc. that are inherent in this well-known language

- Full JDBC/ODBC support ensures interoperability between Aster In-Database MapReduce and 3rd party database ecosystem tools like BI, reporting, advanced analytics (e.g., data mining), ETL, monitoring, scheduling, GUI administration, etc.

- SQL/MR functions – powerful plug-in operators that any non-engineer can easily plug into standard ANSI SQL to exploit the power of MapReduce analytic applications

- Polymorphism – unlike static, unreliable UDFs, SQL/MR functions unleash the power of polymorphism (run-time/dynamic) for cost-efficient reusability.  Built-in sandboxing ensures fault tolerance to avoid system crashes commonly experienced with UDFs

To conclude, it is important to understand that Aster nCluster is not yet another MapReduce implementation, nor does it compete with Hadoop for resources or audience.

Rather, Aster nCluster is the world’s most powerful database that breaks traditional SQL barriers, allowing Data Services groups and IT organizations to extract more knowledge from their data.



27
Aug
How Aster In-Database MapReduce Takes UDFs to the Next Level
By Tasso Argyros in Blogroll, Database, MapReduce on August 27, 2008
   

Building on Mayank’s post, let me dig deeper into a few of the most important differences between Aster’s In-Database MapReduce and User Defined Functions (UDFs):

Dynamic Polymorphism
- User Defined Functions: No. Requires changing the function’s code and writing static declarations.
- Aster SQL/MR functions: Yes.
- What it means: SQL/MR functions work just like SQL extensions - no need to change function code.

Parallelism
- User Defined Functions: Only in some cases, and only across a small number of nodes.
- Aster SQL/MR functions: Yes, across 100s of nodes.
- What it means: Huge performance increases even for the most complex functions.

Availability Ensured
- User Defined Functions: No. In most cases UDFs run inside the database.
- Aster SQL/MR functions: Always. Functions run outside the database.
- What it means: Even if functions have bugs, the system remains resilient to failures.

Data Flow Control
- User Defined Functions: No. Requires changing the UDF code or writing complex SQL subselects.
- Aster SQL/MR functions: Yes. “PARTITION BY” and “SEQUENCE BY” natively control the flow of data in and out of SQL/MR functions.
- What it means: The input/output of SQL/MR functions can be redistributed across the database cluster in different ways with no change to the function code.

In this blog post we’ll focus on Polymorphism - what it is and why it’s so critically important for building real SQL extensions using MapReduce.

Polymorphism allows Aster SQL/MR functions to be coded once (by a person who understands a programming or scripting language) and then used many times through standard SQL by analysts. In this context, comparing Aster SQL/MR functions with UDFs is like comparing SQL with the C language: the former is flexible, declarative, and dynamic; the latter requires customization and recompilation for even the slightest change in usage.

For instance, take a SQL/MR function that performs sessionization. Let us assume that we have a table webclicks(userId int, timestampValue timestamp, URL varchar, referrerURL varchar) that contains a record of each user’s clicks on our website. The same function, with no additional declarations, can be used in all the following ways:


SELECT sessionId, userId, timestampValue
FROM Sessionize( 'timestampValue', 60 ) ON webclicks;


SELECT sessionId, userId, timestampValue
FROM Sessionize( 'timestampValue', 60 ) ON
(SELECT userId, timestampValue FROM webclicks WHERE userId = 50 );

[Note how the input columns changed (going from all columns of webclicks to just two of them) in the above query, yet the same function can be used. This is not possible with a plain UDF without writing additional declarations and UDF code.]


SELECT  sessionId, UID, TS
FROM Sessionize( 'ts', 60 ) ON
(SELECT userid as UID, timestampValue as TS FROM webclicks WHERE userid = 50 );

[Note how the names of the input columns changed, but the Sessionize() function still does the right thing.]

In other words, Aster SQL/MR functions are real SQL extensions - once they’ve been implemented there is zero need to change their code or write additional declarations - there is perfect separation between implementation and usage. This is an extremely powerful concept since in many cases the people that implement UDFs (engineers) and the people that use them (analysts) have different skills. Requiring a lot of back-and-forth can simply kill the usefulness of UDFs - but not so with SQL/MR functions.

How can we do that? There’s no magic, just technology. Our SQL/MR functions are dynamically polymorphic: our SQL/MR implementation (the sessionize.o file) includes not only the function’s code but also logic, invoked at every query, that determines its output schema from its input. This means there is no need for a static signature, as is the case with UDFs!

[Figure: In-Database MapReduce flow]

Polymorphism also makes it trivial to nest different functions arbitrarily. Consider a simple example with two functions, Sessionize() and FindBots(). FindBots() can filter out input from any users that seem to act like bots, e.g. users whose interactions are suspiciously frequent (who could click on 10 links per second? probably not a human). To use these two functions in combination, one would simply write:


SELECT sessionId, userId, timestampValue, URL
FROM Sessionize( 'timestampValue', 60 ) ON FindBots( 'userId', 'timestampValue' ) ON webclicks;

Using UDFs instead of SQL/MR functions, this statement would require multiple subselects and special UDF declarations to accommodate the different inputs produced by the different stages of the query.

So what is it that we have created? SQL? Or MapReduce? It really doesn’t matter. We just combined the best of both worlds. And it’s unlike anything else the world has seen!