Archive for the ‘Analytics tech’ Category

17
Oct
   

“Big data” has always been a favorite subject of discussion among the Aster Data team. We’ve been talking about big data at least since 2009, long before the term became burning-hot. The big data hype has confused many organizations (and vendors) in the market about the best technology or method to solve their analytical business problems.

However, our vision hasn’t changed from the time we founded the company in 2005 to today, when we are part of the Teradata family. Teradata Aster continues to lead the market with technology innovations and reference architectures that provide clear guidance and deliver significant business value to our customers.

Today, we are pushing the limits of analytical technology once more by launching the Teradata Aster Big Analytics Appliance. The Big Analytics Appliance is a unique machine that can help enterprises see their business in high definition. By harnessing all existing and new data types in the enterprise, we enable organizations to leverage our powerful SQL-MapReduce framework and business-ready analytics & apps that solve specific business problems in marketing attribution, fraud detection, graph analysis, pattern analysis, and much more. It unleashes the creativity of bright analysts to discover new insights that help their organizations grow revenue and create sustainable competitive advantage.

So what is the Big Analytics Appliance? It’s five things in one box:

  1. Aster + Apache Hadoop (100% open source via the Hortonworks HDP distribution), fully integrated in one box
  2. ANSI-standard SQL and next-generation MapReduce, fully integrated
  3. More than 50 ready-to-use MapReduce apps that deliver immediate business value
  4. Full ecosystem connectivity for both Aster and Hadoop, with BI, ETL, and other existing IT systems
  5. The latest-generation, most efficient hardware platform, specifically optimized for Aster, Hadoop, and Big Analytics

True to our Stanford roots, the appliance comes in Cardinal red!

Teradata Aster Big Analytics Appliance

The Big Analytics Appliance packs a long list of essential and unique technologies, including:

  • SQL-MapReduce®, the industry’s only true SQL/MapReduce integration
  • SQL-H™, the industry’s only ANSI-standard SQL and Hadoop integration
  • Teradata Viewpoint, the most advanced database monitoring platform, now extended to Aster and Hadoop
  • Teradata TVI, sophisticated hardware support and failure-prevention software, now ported to Hadoop as well as to Aster
  • InfiniBand network interconnect, which makes ultra-high-performance connectivity between Aster and Hadoop, as well as scalability, a non-issue
  • Small form-factor disk drives and dense enclosures, which make this appliance one of the most dense and space-efficient big data platforms in the market

And, of course, everything in this appliance is packaged, integrated, pre-tested and supported by Teradata – the most trusted brand in data management and analytics.

I also want to take a moment to talk about our Unified Data Architecture vision for the enterprise. While most vendors out there talk about big data at a very high level without explaining where it fits and how it relates to traditional technologies like data warehousing, we decided to do the hard work of figuring out how different technologies complement each other and for what purpose. The result is the diagram below, which shows how Teradata, Aster & Hadoop can work in tandem to provide a complete data solution for enterprise environments:

Teradata Unified Data Architecture

We also went one step further and now have a matrix that explains which technology (or technologies) is most appropriate for each use case, given a workload and a specific type of data. The result of that exercise is below:

Processing as a Function of Schema Requirements by Data Type

When To Use Which Technology? The best approach by workload and data type

If you want to know more about our Unified Data Architecture vision, read the whitepaper we co-authored with Hortonworks, or feel free to contact us and we’ll be happy to discuss the concept and how it would fit into your environment.

By tightly integrating Aster and Hadoop, the new Big Analytics Appliance addresses a large part of the Unified Data Architecture; and via the Teradata-Aster and Teradata-Hadoop connectors, Teradata now has all the necessary pieces to help enterprises extract the maximum business value from all their data and execute on their Big Data vision. At Aster, just like at Teradata, we are committed to continuously providing the best innovations so our customers have the power to make the best decisions possible.

P.S. If you want to try out Aster without ordering a full Aster box, we now allow you to download an Aster virtual appliance! Go give it a try: http://www.asterdata.com/AsterExpress



12
Jun
   

Back in 2005, when we first founded Aster Data, our vision was to take some of the latest technology innovations – including MPP shared-nothing architectures; Linux-based commodity hardware; and novel analytical interfaces like Google’s MapReduce – and bring them to mainstream enterprises. This vision translated into a strategy focused not only on big data innovations, but also on delivering technologies that make big data viable for enterprise environments. SQL-MapReduce®, our industry-leading patented technology that combines standard SQL processing with a native MapReduce execution environment, is one example of how we make big data enterprise ready.

Today we reached another major milestone in providing value to our customers by announcing a major innovation: Aster SQL-H™, a seamless way to execute SQL & SQL-MapReduce on Apache™ Hadoop™ data.

This is a significant step forward from what was state-of-the-art until yesterday: a DBMS-Hadoop connector operating at the physical layer. That approach meant that getting data from Hadoop into a database required a Hadoop expert in the middle to do the data cleansing and the data-type translation. If the data was not 100% clean (which is the case in most circumstances), a developer was needed to get it into a consistent, proper form. Besides wasting that expert’s valuable time, this process meant that business analysts couldn’t directly access and analyze data in Hadoop clusters. Other database connectors require duplicating the data into HDFS using proprietary formats, a cumbersome and expensive approach by any measure.

SQL-H, an industry-first, solves all those problems.

First, we have integrated Aster’s metadata engine with Hadoop’s emerging metadata standard, HCatalog. This means that data stored in Hadoop using Pig, Hive & HBase can be “seen” in an Aster system as if it were just another Aster view. The business implication is that a business analyst using standard SQL or a BI tool can have full and seamless access to Hadoop data through Aster’s standard ODBC/JDBC connectors and Aster’s SQL engine. There is no need to have a human in the middle to translate the data or ensure its consistency, and no need to file tickets or call up experts to get the data the business needs. Everything happens transparently, seamlessly, and instantly. This is an industry first, since today all available Hadoop tools either do not provide well-optimized standard SQL interfaces, do not provide native BI compatibility, or require manual data translation and movement from Hadoop to a third-party system. None of these approaches are viable options for SQL & BI execution on Hadoop data, which makes it hard for enterprises to get value from Hadoop.
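To make that concrete, here is a minimal sketch of what this looks like from the analyst side: a plain ODBC connection to the Aster cluster and ordinary ANSI SQL against a Hive/HCatalog-registered table. This is an illustration rather than production code; the DSN, credentials, and the table name weblogs_hive are placeholders, and it assumes Python with the pyodbc package installed.

    # Illustrative sketch: querying HCatalog-registered Hadoop data through Aster's
    # SQL engine over standard ODBC. The DSN, credentials, and the table name
    # "weblogs_hive" are placeholders for this example.
    import pyodbc

    conn = pyodbc.connect("DSN=aster_prod;UID=analyst;PWD=secret")
    cursor = conn.cursor()

    # Plain ANSI SQL; that the underlying data physically lives in Hadoop is
    # transparent to the analyst or BI tool issuing the query.
    cursor.execute("""
        SELECT page, COUNT(*) AS hits
        FROM weblogs_hive
        WHERE event_date >= '2012-06-01'
        GROUP BY page
        ORDER BY hits DESC
    """)

    for page, hits in cursor.fetchall():
        print(page, hits)

    cursor.close()
    conn.close()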

Second, SQL-H provides a high-performance, type-safe data connector that can take a SQL or SQL-MapReduce query that involves Hadoop data, automatically select the minimum subset of data in Hadoop required to execute the query, and run the query on the Aster system. The performance of running SQL and SQL-MapReduce analytics in Aster is significantly higher than in Hadoop because (a) Aster can optimize data partitioning and distribution, reducing network transfers and overhead; (b) Aster’s engine can keep statistics about the data and use them to optimize execution of both SQL & MapReduce; and (c) Aster’s SQL queries are cost-based-optimized, which means Aster can handle very complex SQL, including SQL produced by BI tools, very efficiently.

In addition, one can take advantage of SQL-H to apply the 50+ pre-built SQL-MapReduce apps that Teradata Aster provides to Hadoop data, doing big data analytics that are impossible in any other database, without having to write a single line of Java MapReduce code! These apps include functions for path & pattern analysis, statistics, graph analysis, text analysis, and more.
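To make the path-and-pattern idea concrete, here is a small, self-contained Python sketch of the kind of sessionized pattern matching (find customers whose clickstream goes search, then one or more product views, then checkout) that the pre-built SQL-MapReduce functions express declaratively at scale. The events and the pattern are invented for illustration; this is not the apps’ actual interface, just the sort of hand-coded logic they replace.

    # Sketch of sessionized path/pattern analysis -- the sort of logic the pre-built
    # SQL-MapReduce functions handle declaratively over Hadoop-scale data.
    import re
    from collections import defaultdict

    events = [  # (customer_id, timestamp, page_type) -- made-up clickstream
        (1, 1, "search"), (1, 2, "product"), (1, 3, "product"), (1, 4, "checkout"),
        (2, 1, "search"), (2, 2, "product"), (2, 3, "exit"),
    ]

    # Group events into per-customer paths, ordered by time.
    sessions = defaultdict(list)
    for cust, ts, page in sorted(events):
        sessions[cust].append(page)

    # Encode each page type as one symbol so a path becomes a string we can match
    # with a regular expression: search -> product(s) -> checkout.
    symbol = {"search": "S", "product": "P", "checkout": "C"}
    pattern = re.compile(r"SP+C")

    for cust, pages in sessions.items():
        path = "".join(symbol.get(p, "O") for p in pages)
        print(f"customer {cust}: {' > '.join(pages)} converted={bool(pattern.search(path))}")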

Teradata Aster is committed to groundbreaking product innovation as the key strategy in maintaining our #1 position in the big analytics market. SQL-H is another important step that we expect will make Hadoop and big data analytics much more palatable for enterprise environments, allowing business analysts, SQL power-users & BI tool users to analyze Hadoop data without having to learn about Hadoop interfaces and code.

If you want to find out more, we’ll be talking about SQL-H at the Hadoop Summit, on a webcast taking place June 21st, at the upcoming Big Analytics 2012 events in Chicago & New York, and at the annual Teradata Partners event.



21
Feb
By Tasso Argyros in Analytic platform, Analytics, Analytics tech, Database, MapReduce on February 21, 2012
   

It has been about seven years since Aster Data was founded, four years since our industry-first Enterprise SQL-MapReduce implementation (the first commercial MapReduce offering), and three years since our first Big Data Summit event (the first “Big Data” event in the industry, as far as I know). During this whole time, we have witnessed our technology investments take off together with the Big Data market – just think how many people had never even heard the word MapReduce three years ago, and how many swear by it today!

As someone who has been caught up in the Big Data wave since 2005, I can tell you that the stage of the market has changed significantly during this time – and with it, the challenges that Enterprise customers face. A few years ago, customers were realizing the challenges that piles of new types of data were bringing – big volumes (terabytes to petabytes) and new, complex types (multi-structured data such as weblogs, text, and customer interaction data) – but also the opportunities that new analytical interfaces, like MapReduce, were enabling. Fast forward to today, and most enterprises are trying to put together their Big Data strategies and make sense of what the market has to offer. As a result, there is a lot of market noise and confusion: it is usually not clear which use cases apply to traditional technologies versus new ones, how to reconcile existing technologies with new investments, and what types of projects will give them the highest ROI versus a long and painful failure.

Teradata and Teradata Aster have a strong interest in customers being successful with Big Data challenges and technologies, because we believe that the growth of the market will translate into growth for us. Given Teradata’s history as the #1 strategic advisor to customers around data management and analytics, we only want to offer the best solutions to our customers. This includes our products – which are recognized by Gartner as leading technologies in Data Warehousing and Big Data analytics – but also our expertise in helping customers use complementary solutions, like Hadoop, and making sure that the total solution works reliably and succeeds in tackling big business problems.

With this partnership, we are taking one more step in that direction. We are announcing three things:

1. Teradata and Hortonworks will work together to jointly solve big challenges for our customers. This is a win/win for customers and the industry.

2. Our intent to do joint R&D to make it easier for customers who use Teradata products and Hadoop to utilize them together. This is important because every enterprise will look to combine new technologies with existing investments, and there is plenty of opportunity to do better.

3. A set of reference architectures that combine Teradata and Hadoop products to accelerate the implementation of Big Data projects. We hope that this will be a starting point that saves enterprises time and money when they embark on Big Data projects.

We believe that all three of the above points will translate into eliminating risks and unnecessary trial and error. We have enough collective experience to guide customers away from failed projects and traps. And by helping clear up some of the confusion in the big data market, we hope to accelerate its growth and the benefit to Enterprises that are looking to utilize their data to become more competitive and efficient.



26
Aug
By Mayank Bawa in Analytics, Analytics tech, Blogroll, Database, MapReduce on August 26, 2008
   

Pardon the tongue-in-cheek analogy to Oldsmobile when describing user-defined functions (UDFs), but I want to draw out some distinctions between traditional UDFs and the new class of functions that In-Database MapReduce enables.

Not Your Granddaddy's Oldsmobile

While similar on the surface, in practice there are stark differences between Aster In-Database MapReduce and traditional UDFs.

MapReduce is a framework that parallelizes procedural programs, offloading the burden of traditional cluster programming. UDFs are simple database functions; while there are some syntactic similarities, that’s where the similarity ends. Several major differences between In-Database MapReduce and traditional UDFs include:

Performance: UDFs have limited or no parallelization capabilities in traditional databases (even MPP ones). Even where UDFs are executed in parallel in an MPP database, they’re limited to accessing local node data, have byzantine memory-management requirements, and require multiple passes and costly materialization. In contrast, In-Database MapReduce automatically executes SQL/MR functions in parallel across potentially hundreds or even thousands of server nodes in a cluster, all in a single-pass (pipelined) fashion.

Flexibility: UDFs are not polymorphic. Some variation in input/output schema may be allowed by capabilities like function overloading or permissive data-type handling, but that tends to greatly increase the burden on the programmer to write compliant code. In contrast, In-Database MapReduce SQL/MR functions are evaluated at run time to offer dynamic type inference, an attribute of polymorphism that offers tremendous adaptive flexibility previously found only in mid-tier object-oriented programming.

Manageability: UDFs are generally not sandboxed in production deployments. Most UDFs are executed in-process by the core database engine, which means bad UDF code can crash a database. SQL/MR functions execute in their own process for full fault isolation (bad SQL/MR code results in an aborted query, leaving other jobs uncompromised). A strong process-management framework also ensures proper resource management for consistent performance and progress visibility.
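To make those distinctions concrete, here is a minimal, self-contained Python sketch – purely an illustration, not Aster’s actual SQL/MR API – of a row-style function that adapts to whatever string columns it finds at run time (the polymorphism point) and runs in a separate worker process so a bug aborts the “query” without crashing the caller (the sandboxing point).

    # Illustration only -- not Aster's SQL/MR API. Shows a run-time-polymorphic
    # "row function" executed in its own process for crude fault isolation.
    import multiprocessing as mp

    def tokenize_row(row):
        """Emit one output row per token found in any string column of the input.
        The columns handled are discovered at run time, not declared up front."""
        out = []
        for col, val in row.items():
            if isinstance(val, str):
                for tok in val.split():
                    out.append({"source_column": col, "token": tok})
        return out

    def run_isolated(func, rows):
        """Apply a row function in a separate process: a crash in the function
        aborts this 'query' but leaves the calling process untouched."""
        with mp.Pool(processes=1) as pool:
            try:
                return [r for chunk in pool.map(func, rows) for r in chunk]
            except Exception as exc:
                print(f"query aborted: {exc}")
                return []

    if __name__ == "__main__":
        rows = [{"id": 1, "comment": "great product fast shipping"},
                {"id": 2, "comment": "late delivery"}]
        for out_row in run_isolated(tokenize_row, rows):
            print(out_row)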



19
Aug
   

I’m curious: is anyone out there attending the TDWI World Conference in San Diego this week? If so, and you’d like to meet up, please drop me a line or comment below, as I will be in attendance. I’m of course very excited to be making the trip to sunny San Diego and hope to catch a glimpse of Ron Burgundy and the channel 4 news team! :-)

But of course it’s not all fun and games, as I’ll participate in one of TDWI’s famous Tool Talk evening sessions discussing data warehouse appliances. This should make for some great dialogue between me and the other database appliance players, especially given the recent attention our industry has seen. I think Aster has a really different approach to analyzing big data, and I look forward to discussing exactly why.

For those interested in the talk, here are the details. Come on by and let’s chat!
What: TDWI Tool Talk Session on data warehouse appliances
When: Wednesday, August 20, 2008 @ 6:00p.m.
Where: Manchester Grand Hyatt, San Diego, CA



25
Jul
   

Stuart announced yesterday that Microsoft has agreed to acquire DATAllegro. It is pretty clear Stuart and his team have worked hard for this day: it is heartening to see that hard work gets rewarded sooner or later. Congratulations, DATAllegro!

Microsoft is clearly acquiring DATAllegro for its technology. Indeed, Stuart says that DATAllegro will start porting away from Ingres to SQL Server once the acquisition completes. Microsoft’s plan is to provide a separate offering from its traditional SQL Server Clustering.

In effect, this event provides a second admission from a traditional database vendor that OLTP databases are not up to the task of large-scale analytics. The first admission came in the 1990s, when Sybase (ironically, the originator of the SQL Server code base) offered Sybase IQ as a separate product from its OLTP offering.

The market already knew this fact: the key point here is that Microsoft is waking up to the realization.

A corollary is that it must have been really difficult for the Microsoft SQL Server division to scale SQL Server for larger deployments. Clearly, Microsoft is an engineering shop, and the effort of integrating alien technology into their SQL Server code base must have been carefully evaluated in a build-vs-buy decision. The buy decision is a tacit admission that it is incredibly hard to scale their SQL Server offering, with its roots in a traditional OLTP database.

We can expect Oracle, IBM, and HP to have similar problems scaling their 1980s code bases for the data scale and query workloads of today’s data warehousing systems. Will the market wait for Oracle, IBM, and HP’s efforts to scale to come to fruition? Or will Oracle, IBM, and HP soon acquire companies to improve their own scalability?

It is interesting to note that DATAllegro will be moving to an all-Microsoft platform. The acquisition could also be read as a defensive move by Microsoft. All of the large-scale data warehouse offerings today are based on Unix variants (Unix/Linux/Solaris), leading to an uncomfortable situation at some all-Microsoft shops that chose to run Unix-based data warehouse offerings because SQL Server would not scale. Microsoft needed an offering that could keep their enterprise-wide customers on Microsoft platforms.

Finally, there is a difference in philosophy between Microsoft’s and DATAllegro’s product offerings. Microsoft SQL Server has sought to cater to the lower end of the BI spectrum; DATAllegro has actively courted the higher end. Correspondingly, DATAllegro uses powerful servers, fast storage, and an expensive interconnect to deliver a solution, while Microsoft SQL Server has sought to deliver a solution at a much lower cost. We can only wait and watch: will the algorithms of one philosophy work well in the infrastructure of the other?

At Aster Data Systems, we believe that the market dynamics will not change as a result of this acquisition: companies will want the best solutions to derive the most value from data. In the last decade, the Internet changed the world and old-market behemoths could not translate their might into the new market. In this decade, Data will produce a similar disruption.



17
Jun
By Tasso Argyros in Analytics, Analytics tech, Blogroll, Database, Scalability on June 17, 2008
   

 

I’m delighted to be able to bring a guest post to our blog this week. David Cheriton, one of Aster Data Systems’ angel investors, leads the Distributed Systems Group at Stanford University and is known for making some smart investments. Below is what David has to say about the need to address the network interconnect in MPP systems – we hope this spurs some interesting conversation!

“A cluster of commodity computer nodes clearly offers a very cost-effective means of tackling demanding large-scale applications such as data mining over large data sets. However, most applications require substantial communication. For example, consider a query that requires a join between three tables that share no common key to partition on (non-parallelizable query), a frequent case in analytics. In conventional architectures, such operations need to move huge amounts of data among different nodes and depend on the interconnect to deliver adequate performance.

The cost and performance impact of the interconnect required for the cluster to support this communication is often an unpleasant surprise, particularly without careful design of the cluster software. Yes, we are seeing the cost of 10G Ethernet coming down, both in switches and NICs, and the IEEE is starting work on 100G Ethernet. However, the interconnect is, and will remain, an issue for several reasons.

First, in a parallelizable query, you need to get data from one node to several others. The bandwidth out of this one node is limited by its NIC bandwidth, Bn. In a uniformly configured cluster, each of the receiving nodes has the same NIC bandwidth Bn, so with K receivers, each receives at Bn/K. However, the actual performance of the cluster can be limited by data hotspots, where the demand for data from a given node far exceeds its NIC and/or memory bandwidth.

The inverse problem, often called the incast problem, arises when K nodes need to send data to a single node. Each can send at bandwidth Bn, for a total bandwidth demand of K*Bn, but the target node can only receive at Bn, or 1/K of the offered load. The result can be congestion, packet drops from overflowing packet queues, and TCP timeouts and backoff, resulting in dramatically lower goodput than even Bn. Here, I say “dramatically” because performance can collapse to 1/10 of expected or worse because of the packet drops, timeouts, and retries that can occur at the TCP level. In systems with as few as 10 nodes, connected via a Gigabit Ethernet interconnect, performance can deteriorate to under 10 MB per second per node! For higher numbers of nodes, the problem becomes even worse.
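As a quick back-of-envelope illustration of that arithmetic, the Python sketch below compares the offered load against the receiver’s line rate for the 10-node Gigabit Ethernet example; the 10x collapse factor is purely illustrative, since the actual penalty depends on switch buffering, TCP timer settings, and the workload.

    # Back-of-envelope incast arithmetic; the collapse factor is illustrative.
    def incast_sketch(num_senders, nic_gbps, collapse_factor=10.0):
        line_rate = nic_gbps * 125.0               # 1 Gbit/s is roughly 125 MB/s
        offered = num_senders * line_rate          # aggregate demand on one receiver
        collapsed = line_rate / collapse_factor    # goodput after drops/timeouts/retries
        print(f"{num_senders} senders at {nic_gbps} Gbit/s each:")
        print(f"  offered load          : {offered:7.1f} MB/s")
        print(f"  receiver line rate    : {line_rate:7.1f} MB/s ({line_rate / offered:.0%} of offered)")
        print(f"  goodput after collapse: {collapsed:7.1f} MB/s")

    # The 10-node Gigabit Ethernet case mentioned in the text
    incast_sketch(num_senders=10, nic_gbps=1.0)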

Phanishayee et al have studied the incast problem. They show that TCP tuning does not help significantly. They observe that significantly larger switch buffering helps up to some scale, but that drives up the cost of the switches substantially. Besides some form of link-level flow control (which suffers from head-of-line blocking, is not generally available and usually does not work between switches), the other solution is just adding more NICs or faster NICs per node, to increase the send and receive bandwidth.

Moreover, with k NICs per node, an N node network now requires k*N ports, requiring a larger network to interconnect all the nodes in the cluster. Large fast networks are an engineering and operation challenge. The simplest switch is a single-chip shared memory switch. This type of switch is limited by the memory and memory bandwidth available for buffering. For instance, a 24-port 10 Gbps switch requires roughly 30 Gbytes/sec of memory bandwidth, forcing the use of on-chip memory or off-chip SRAM, in either case rather limited in size, aggravating TCP performance problems. This memory bandwidth demand tends to limit the size of shared memory switches.

The next step up is a crossbar switch. In effect, each line card is a shared memory switch, possibly splitting the send and receive sides, connected by a special interconnect, the crossbar. The cost per port increases because of the interconnect and the overall complexity of the system and the lower volume for large-scale switches. In particular each line card needs to solve the same congestion problems as above in sending through the interconnect to other line cards.

Scaling larger means building a multi-switch network. The conventional hierarchical multi-switch network introduces bottlenecks within the network, such as from the top-of-rack switch to the inter-rack switch, leading to packet loss inside the network. Various groups have proposed building Clos networks out of commodity GbE switches, but these require specialized routing support, complex configuration, and a larger number of components, leading to more failures, more complex failure behavior, and extra cost.

Overall, you can regard the problem as being k nodes of a cluster needing to read from and write to the memory of the other nodes. The network is just an intermediary trying to handle this aggregate of read and write traffic across all the nodes in the cluster, thus requiring expensive high-speed buffering because these actions are asynchronous/streamed. Given this aggregate demand, faster processors and faster NICs just make the challenge greater.

In summary, MPP databases are more MPP than databases, in the sense that for complex distributed queries the network performance (the major bottleneck in MPP systems) is much more challenging than disk I/O performance (the major bottleneck in conventional database systems). Smart software that minimizes demands on the network and avoids hotspots and incast can achieve far more cost-efficient scaling of the cluster, plus avoid dependence on complex (Clos) or non-sweet-spot networking technologies (i.e., non-Ethernet). It’s a great investment in software and processor cycles when the network is intrinsically a critical resource. In some sense, smart software in the nodes is the ultimate end-to-end solution, achieving good application performance by minimizing its dependence on the intermediary, the interconnect.”

- Prof. David Cheriton, Computer Science Dept., Stanford University

 



15
May
By Mayank Bawa in Analytics, Analytics tech, Blogroll, Statements on May 15, 2008
   

Have you ever discovered a wonderful little restaurant off the beaten path? You know the kind of place. It’s not part of some corporate conglomerate. They don’t advertise. The food is fresh and the service is perfect; it feels like your own private oasis. Keeping it to yourself would just be wrong (even if you selfishly don’t want the place to get too crowded).

We’re happy to see a similar anticipation and word-of-mouth about some new ideas Aster is bringing to the data analytics market. Seems that good news is just too hard to keep to yourself.

We’re serving up something unique that we’ve been preparing for several years now. We’re just as excited to be bringing you this fresh approach.



08
May
By Tasso Argyros in Analytics, Analytics tech on May 8, 2008
   

Over the last couple of years I’ve talked to scores of companies that face data analytics problems, and they all ask the same question: how can we drive deep insights from massive amounts of data? From these discussions it was pretty clear that no existing infrastructure can really solve this problem for most enterprises. But why? And how do companies today try to cope with this issue?

I’ve seen three classes of “solutions” that companies attempt to implement in a desperate attempt to overcome their data analytics challenges. Let me try to describe what I’ve seen here.

“Solution” One. Vertical scale-up. If you are like most companies, database performance problems make your favorite hardware vendor’s sales rep lots of money every year! There is nothing new here. Ever since the 1960s, when the first data management systems came around, performance issues have been solved by buying much more expensive hardware. So here’s the obvious problem with this approach: cost. And here’s the non-obvious one: there’s a limit to how far you can scale this way, and it is actually pretty low. (Question: what is the maximum number of CPUs that you can buy in a high-end server? How does it compare to the average Google cluster?)

“Solution” Two. “Massively” parallel database clusters. Sometimes I’ve heard an argument that goes like this: “Why shouldn’t it be simple to build a farm of databases, just like we have farms of app servers or web servers?” Driven by this seemingly innocent question, you may try (or have tried) to put together clusters of databases to do analytics, either on your own or using one of the MPP products in the marketplace. This will work fine for small datasets *or* very simple queries (e.g. computing a sum of values). But, as any student of distributed systems knows, there is a reason why web servers scale so nicely: they are stateless! That’s why they’re so easy to deploy and scale. Databases, on the other hand, do have state. In fact, they have lots of it, perhaps several gigabytes per box. And, guess what, in analytics each query potentially needs access to all of it at once! So what works fine for very small numbers of nodes or small amounts of data does nothing for slightly more complex queries and larger systems – which is probably the issue you were trying to solve in the first place.

By the way, all the solutions that are in the marketplace today solve the wrong problems. For instance, some optimize disk I/O of the individual nodes and not overall system performance for complex queries, which is the real issue (e.g., “columnar” systems). Others allow for fast execution of really simple queries but do nothing to allow more complex ones to go really quickly (e.g., “MPP” databases). None of these products can provide a solution that is even relevant to the hardest problems these systems face.

“Solution” Three. Write custom code. Why not? Google and Yahoo have done it pretty successfully! The only problem is, this approach is even more expensive than approach #1! Google has built a great infrastructure, but what is the cost to retain and compensate the best minds in the world who can develop and maintain your analytics? (Hint: It’s more than free snacks and soda). I’ve frequently seen what starts as a simple, cheap solution for a single point problem evolve to a productivity nightmare, where each new data insight requires development time and specialized (thus expensive) skills. If you can afford that, that’s fine. But I’ll bet you do not want to spend your most precious resources reinventing the wheel every time you need to run a new query instead of doing what makes your company most successful.

The end result is that all of these approaches are pretty far from solving the real problem. Rather, the cost of becoming more competitive through data is currently huge – and it shouldn’t be! I believe that as soon as the right tools are built and made available, companies will immediately take advantage of them to be more competitive and successful. This is the upcoming data revolution that I see, and, frankly, it has been long overdue.