Archive for February, 2009

By Mayank Bawa in Blogroll on February 25, 2009
   

We announced today that we’ve raised another $5.1M as an addendum to our Series B.


Institutional Venture Partners (IVP) led this addendum to bring our total Series B raise to $17.1M. IVP is very clearly the best later-stage VC firm in Silicon Valley, and it is great to have them on our side as we continue to grow our company. IVP has backed such enterprises as Netflix, WebEx, MySQL, Data Domain, Juniper Networks, and Akamai.

Steve Harrick, Partner at IVP, led this investment for IVP. Steve has a great understanding of the technology infrastructure companies poised to lead their markets. He led IVP’s investments in WebEx and MySQL and was recently recognized on Forbes Magazine’s 2009 Midas List as one of the leading venture investors in the US.

We met Steve during our rounds of Series B meetings in Q4 2008, and there was real mutual liking and respect from the start. At that point, having decided to raise $12M, we didn’t have room for two venture firms, so we parted ways and promised to keep in touch for the future.

In Q1 2009, as the economy grew more uncertain, we revisited our decision and realized it would be more prudent to proactively build a bigger cash reserve, ensuring that we would not stumble in our growth even if the market deteriorated further and a recovery continued to inch away.

Steve was happy to step in and invest in Aster on the same terms as our Series B. We were delighted to have a person of Steve’s caliber participate wholeheartedly in our growth.

Finally, there has to be some weird coincidence in the way that all of our Series A and Series B venture firms (Sequoia Capital, JAFCO Ventures, and now IVP) have been investors in database companies with successful returns (Netezza, Datallegro, MySQL).



By Steve Wooledge in Blogroll, nPath on February 11, 2009
   

As a follow-on to the introductory nPath post, I wanted to go a little deeper into the nPath SQL syntax and share a more sophisticated example that can be applied in click-stream or Web analytics. I’ll try to keep it concise for my colleagues who don’t want the pretty marketing bow. ;-)

SEO and SEM are critical traffic drivers for just about any consumer-facing website. Third-party analytics offerings such as Google Analytics or Omniture provide a great turn-key package of canned reports. However, certain deep analytics on sequential events are out of reach not only for these outsourced analytics services, but also for in-house Web analytics data warehouses implemented on traditional solutions such as Oracle or SQL Server.

For example, suppose we want to optimize our website flow to retain and engage visitors driven to us by SEO/SEM. We want to answer the question: for SEO/SEM-driven traffic that stays on our site for 5 or fewer pageviews, then leaves and never returns in the same session, what are the top referring search queries and the top paths of pages navigated on our site? In a traditional data warehouse, this problem would require a five-way self-join of granular weblog data (a sketch of that approach follows below), which is simply infeasible for large sites such as Myspace.
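To see why, consider what even a stripped-down version of this path analysis looks like with plain self-joins. The sketch below is purely illustrative, not from the original analysis; it reuses the clicks table and columns from the nPath query further down and recovers only the first two on-site pages of each search-referred visit:

-- Illustrative plain-SQL self-join: first two on-site pages of each
-- search-referred visit (assumes the same clicks table as below).
SELECT c1.refquerystring         AS entry_refquerystring,
       c1.page || ',' || c2.page AS onsite_pagepath,
       count(*)                  AS session_count
FROM clicks c1
JOIN clicks c2
  ON  c2.customerid = c1.customerid
  AND c2.sessionid  = c1.sessionid
  AND c2.domain ilike 'mysite.com'
  AND c2.timestamp > c1.timestamp
  -- c2 must be the very next click after c1 in the session
  AND NOT EXISTS (
        SELECT 1 FROM clicks c3
        WHERE c3.customerid = c1.customerid
          AND c3.sessionid  = c1.sessionid
          AND c3.timestamp  > c1.timestamp
          AND c3.timestamp  < c2.timestamp )
WHERE c1.domain ilike 'mysite.com'
  AND c1.refdomain ~* 'yahoo.com|google.com|msn.com|live.com'
GROUP BY 1, 2
ORDER BY 3 DESC;

Extending this to paths of up to five pages, plus the condition that the visitor then leaves the site and never returns in the session, means three more self-joins and additional anti-join logic, which is exactly the blow-up that makes the approach impractical at weblog scale.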

With the Aster nPath SQL/MR function, this problem can be expressed in a straightforward query that executes very efficiently in a single pass over the granular data. The query below returns the top combinations of referral query string (for the entry page of the visit to our site) and on-site navigation path of up to 5 pages before leaving the site:

SELECT entry_refquerystring,
       entry_page || ',' || onsite_pagepath AS onsite_pagepath,
       count(*) AS session_count
FROM nPath(
    ON ( SELECT * FROM clicks WHERE year = 2009 )
    -- match the pattern independently within each visitor session
    PARTITION BY customerid, sessionid
    -- order each session's clicks by time before matching
    ORDER BY timestamp
    -- a visit that enters from a search engine, browses on-site, then leaves for good
    PATTERN ( 'Entry.OnSite+.OffSite+$' )
    SYMBOLS (
        domain ilike 'mysite.com'
            and refdomain ~* 'yahoo.com|google.com|msn.com|live.com' AS Entry,
        domain ilike 'mysite.com' AS OnSite,
        domain not ilike 'mysite.com' AS OffSite
    )
    MODE( NONOVERLAPPING )
    RESULT(
        first(page of Entry) AS entry_page,
        first(refquerystring of Entry) AS entry_refquerystring,
        accumulate(page of OnSite) AS onsite_pagepath,
        count(* of OnSite) AS onsitecount_minus1
    )
)
WHERE onsitecount_minus1 <= 4   -- at most 5 on-site pageviews, counting the entry page
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 1000;
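One note on that final filter: the RESULT clause counts only the OnSite rows, which excludes the Entry page (hence the name onsitecount_minus1), so onsitecount_minus1 <= 4 is what caps a matched visit at five on-site pageviews in total. Everything upstream of that filter happens inside nPath itself, in one ordered pass over each session’s clicks.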



By Shawn Kung in Analytics, Blogroll, Cloud Computing on February 10, 2009
   


Cloud computing is a fascinating concept. It offers greenfield opportunities (or, more appropriately, blue-sky frontiers) for businesses to affordably scale their infrastructure without plunking down a huge hardware investment (and the space/power/cooling costs associated with managing your own hosted environment). It removes the risk of mis-provisioning by enabling on-demand scaling according to your data growth needs. Especially in these economic times, the benefits of cloud computing are very attractive.

But let’s face it - there’s also a lot of hype, and it’s hard to separate truth from fiction.  For example, what qualities would you say are key to data warehousing in the cloud?

Here’s a checklist of things I think are important:

[1] Time-To-Scalability.  The whole point of clouds is to offer easy access to virtualized resources.  A cloud warehouse needs to scale out and scale in quickly to adapt to changing needs.  It can’t take days to scale; it has to happen on demand, in minutes (<1 hour).

[2] Manageability.  You go with clouds because you want to save not only on hardware, but also on the operational people costs of maintaining that infrastructure.  A cloud warehouse needs to offer one-click scaling, easy install/upgrade, and self-managed resiliency.

[3] Ecosystem.  While clouds offer *you* huge TCO savings, you can’t compromise service levels for your customers - especially if you run your business on the cloud.  BI/ETL/monitoring tools, Backup & Recovery, and ultra-fast data loading can’t be overlooked for “frontline” mission-critical warehousing on the cloud.

[4] Analytics.  Lots of valuable data is generated via the cloud and there are opportunities to subscribe to new data feed services.  It’s insufficient for a cloud warehouse to just do basic SQL reporting.  Rather, it must offer the ability to do deep analytics very quickly.

[5] Choice.  A truly best-in-class cloud warehouse won’t lock you in to a single cloud vendor.  Rather, it will offer portability by enabling you to choose the best cloud for you to run your business on.

Finally, here are a couple of ideas on the future of cloud warehousing.  What if you could link multiple cloud warehouses together and run interesting queries across clouds?  And what about the opportunities for game-changing new analytics: with so many emerging data subscription services, wouldn’t this offer ripe opportunities for mash-up analytics (e.g., using Aster SQL/MapReduce)?

What do you think are the standards for “best-in-class” cloud warehousing?



By Steve Wooledge in Analytics, Blogroll, nPath on February 10, 2009
   

It may not sound sexy, but analyzing a sequence of events over time is non-trivial in standard SQL. When Aster started providing customers ways to express time-series analysis more elegantly and have it return result sets in 90 seconds as opposed to 90 minutes, we started getting comments like, “That’s beautiful … I would make out with my screen if I could!”

Crazy, I know.

Time-series analysis, also called sequential pattern analysis, is useful in a number of industries and applications, such as:

- Price changes over time in financial market data

- Path analysis within click-stream data to define anything from top referring sites to the “golden” paths customers navigate before purchasing

- Patterns which detect deviant activity such as spamming or insurance claims fraud

- Sessionization (mapping each event in a clickstream to a human user session)

More specifically, one customer wanted to improve the effectiveness of their advertising and drive more visitors to their site. They asked us to help determine the top paths people take to get to their site and the top paths people take after leaving it. Knowing the “before” path gave them insight into which sites to place advertisements on to drive traffic to their site. Additionally, pathing helps the customer understand the behavior and preferences of users who visit their site (e.g., if espn.com frequently appears among the 5 pages leading up to the customer’s site, they know that many visitors like sports).

However, discovering relationships between rows of data is difficult to express in standard SQL, which must resort to multiple self-joins of the data. These joins dramatically expand the amount of data involved in the query and slow down performance, not to mention the complexity of developing and maintaining such expressions.

What’s the solution? There are a few, but Aster’s approach has been to develop extensions to SQL that execute in-database, in a single pass over the data, in a massively parallel fashion. The key piece is nPath, a SQL-MapReduce (SQL/MR) function that performs regular-expression-style pattern matching over a sequence of rows. It allows users to:

- Specify any pattern in an ordered collection - a sequence - of rows with symbols;

- Specify additional conditions on the rows matching these symbols; and

- Extract useful information from these row sequences.

I’ll share the syntax of nPath here to give you more context on how the query operates:

SELECT …
FROM nPath (
    ON {table | query}
    …various parameters…
)
WHERE …
GROUP BY …
HAVING …
ORDER BY …
etc.

nPath performs the pattern matching and computes aggregates over the matched rows; the results of this computation are the output rows of nPath. Those rows can then be used like any other table in a SQL query: filtered with the WHERE clause, joined to other tables, grouped with the GROUP BY clause, and so on.
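To make that concrete, here is a minimal, hypothetical sketch; the pageviews table, its columns, and the page paths are illustrative assumptions, but the query follows the skeleton above. It asks which referring domains most often send visitors who read at least two articles and then reach the signup page within a single session:

-- Hypothetical table for illustration: pageviews(sessionid, timestamp, page, refdomain)
SELECT refdomain, count(*) AS sessions
FROM nPath(
    ON pageviews
    PARTITION BY sessionid        -- match the pattern within each session
    ORDER BY timestamp            -- in click order
    PATTERN ( 'Article+.Signup' ) -- one or more article views, then the signup page
    SYMBOLS (
        page like '/articles/%' AS Article,
        page = '/signup' AS Signup
    )
    MODE( NONOVERLAPPING )
    RESULT(
        first(refdomain of Article) AS refdomain,
        count(* of Article) AS articles_read
    )
)
WHERE articles_read >= 2   -- filter nPath output like any other table
GROUP BY 1
ORDER BY 2 DESC
LIMIT 25;

The outer query treats nPath’s output columns (refdomain and articles_read) exactly like columns of an ordinary table, which is the point of the skeleton above.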

The result? Incredibly powerful insight into a series of events that indicates a pattern or segment can be expressed in SQL, run in parallel on massive clusters of compute power via a single, efficient pass over the data, and made accessible to business analysts through traditional BI tools.

What do you think? What other methods are people using to tackle these types of problems?