Rumors abound that Intel is “baking” the successor of the very successful Nehalem CPU architecture, codenamed Westmere. It comes with an impressive spec: 10 CPU cores (supporting 20 concurrent threads) packed in a single chip. You can soon expect to see 40 cores in mid-range 4-socket servers – a number hard to imagine just five years ago.
We’re definitely talking about a different era. In the old days, you could barely fit a single core in a chip. (I still remember 15 years ago when I had to buy and install a separate math co-processor on my Mac LC to run Microsoft Excel and Mathematica.) And with the hardware, software has to change, too. In fact, modern software means software that can handle parallelism. This is what makes MapReduce such an essential and timely tool for big data applications. MapReduce’s purpose in life is to simplify data and processing parallelism for big data applications. It gives the programmer ample freedom over how to do things locally, and takes over when data needs to be communicated across processes, cores, or servers, thus hiding much of the complexity of parallelism.
Once someone designs their software and data to operate in a parallelized environment using MapReduce, gains will come on multiple levels. Not only will MapReduce help your analytical applications scale across a cluster of servers with terabytes of data, it will also exploit the billions of transistors and the 10s of CPU cores inside each server. The best part: the programmer doesn’t need to think about the difference.
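To make that division of labor concrete, here is a minimal Python sketch of a word count written in the MapReduce style. The function names (map_chunk, reduce_counts, word_count) and the sample data are illustrative assumptions, not part of any particular MapReduce framework, and a thread pool merely stands in for the cores or servers that a real runtime would schedule work across.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count words in one chunk, purely locally."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(chunks, workers=4):
    """Run the map step on each chunk in parallel, then merge the results.

    The pool is only a stand-in for however a runtime spreads the work out.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


# Hypothetical input: each inner list is one chunk of the data set.
chunks = [
    ["the quick brown fox", "the lazy dog"],
    ["the fox jumps"],
]
totals = word_count(chunks)
print(totals["the"], totals["fox"])
```

The programmer only writes map_chunk and reduce_counts; whether the chunks end up on two cores or two hundred servers is a decision left to the layer that calls them.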
As an example, consider this great paper out of Stanford, which discusses MapReduce implementations of popular Machine Learning algorithms. The Stanford researchers considered MapReduce as a way of “porting” these algorithms (traditionally implemented to run on a single CPU) to a multi-core architecture. But, of course, the same MapReduce implementations can be used to scale these algorithms across a distributed cluster as well.
Hardware has changed – MPP, shared-nothing, commodity servers, and, of course, multi-core. In this new world MapReduce is Software’s response for big data processing. Intel and Westmere have just found an unexpected friend.
Today Aster took a significant step and made it easier for developers building fraud detection, financial risk management, telco network optimization, customer targeting and personalization, and other advanced, interactive analytic applications.
Along with the release of Aster Data nCluster 4.5, we added a new Solution Partner level for systems integrators and developers.
Why is this relevant?
Recession or no-recession, IT executives are constantly challenged. They are asked to execute strategies based on better analytics and information to improve effectiveness of business processes (customer loyalty, inventory management, revenue optimization, ..), while staying on top of technology-based disruptions and managing (shrinking or flat) IT budgets.
IT organizations have taken on the challenge by building analytics-based offerings, leveraging existing data management skills and increasingly taking advantage of MapReduce, a disruptive technology introduced by Google and now being rapidly adopted by mainstream enterprise IT shops in Finance, Telco, LifeSciences, Govt. and other verticals.
As MapReduce and big data analytics go mainstream, our customers and ecosystem partners have asked us to make it easier for their teams to leverage MapReduce across enterprise application lifecycles, while harvesting existing IT skills in SQL, Java and other programming languages. The Aster development team that brought us the SQL/MapReduce innovation has now delivered the market’s first integrated visual development environment for developing, deploying and managing MapReduce and SQL-based analytic applications.
Enterprise MapReduce developers and system integrators can now leverage the integrated Aster platform and deliver compelling business results in record time (read how ComScore delivers 360 degree view of digital world to enterprise customers, Full Tilt Poker gains the upper hand tackling online fraud using Aster).
We are also teaming up with leaders in our ecosystem like MicroStrategy to deliver an end-to-end analytics solution to our customers that includes SQL/MapReduce enabled reporting and rich visualization. Aster is proud to be driving innovation in the Analytics and BI market and was recently honored at MicroStrategy’s annual customer conference.
I am delighted with the rapid adoption of Aster Data’s platform by our partners and the strong continued interest from enterprise developers and system integrators in building big data applications using Aster. New partners are endorsing our vision and technical innovation as the future of advanced analytics for large data volumes.
Sign up today to be an Aster solution partner and join the revolution to deliver compelling information and analytics-driven solutions.
When you hear the word “warehouse,” you normally think of an oversized building with high ceilings and a ton of storage space. In the data warehousing world, it’s all too easy to fill that space faster than expected. Even companies with predictable data growth trajectories don’t want to pay for storage space they won’t need for months or even years out. For either type of company, the ability to scale on-demand, and to the appropriate degree, is critical.
That’s why I’m so excited about a webinar we are hosting next week with James Kobielus, Senior Analyst for Forrester Research. In case you haven’t read it, James recently released his report “Massive But Agile: Best Practices for Scaling the Next-Generation Data Warehouse.” In the report, James thoroughly addresses several issues around scalability for which Aster is well-suited (parallelism, optimized storage, in-database analytics, etc.).
We’ll get into much more detail on these and other issues over the course of the webinar. If you haven’t had a chance yet, please register for the webinar to hear what James, a leader and visionary in the industry, has to say. And make sure to leave a comment below if there are any facets of data warehouse scalability that you would like us to cover.
Our goal at Aster is to build a product that will answer your analytical questions sooner. Sooner doesn’t just mean faster database performance - it means faster answers from the moment you conceive of the question to the moment you get the answer. This means allowing analysts and end-users to easily ask the questions on their mind.
Aster nCluster, our massively-parallel database, has supported SQL from birth. SQL is great in many respects: it allows people of various levels of technical proficiency to ask lots of interesting questions in a relatively straightforward way. SQL’s easy to learn but powerful enough to ask the right questions.
But, we’ve realized that in many situations SQL just doesn’t cut it. If you want to sessionize your web clicks or find interesting user paths, run a custom fraud classifier, or tokenize and stem words across documents, you’re out of luck. Enter SQL/MR, one part of our vision of what a 21st-century database system should look like.
Let’s say your data is in nCluster. If your analytic question can be answered using SQL, you don’t have to worry about writing Java or Python. But, as soon as something more complicated comes up, you can write a SQL/MR function against our simple API, upload it into the cluster, and have it start operating on your data by invoking it from SQL. How is this related to MapReduce? It turns out that these functions are sufficient to express a full MapReduce dataflow. How are SQL/MR functions different than the UDFs of yore? It’s all about scale, usability, reusability; all three contributing to you getting your answer sooner.
Scalability
SQL/MR functions play in a massively-parallel sandbox, one with terabytes and terabytes of data, so they’re designed to be readily parallelized. Yes, they just accept a table as input and produce a table as output, but they do so in a distributed way at huge scale. They can take as input either rows (think “map”) or well-defined partitions (think “reduce”), which allows nCluster to move data and/or computation around to make sure that the right data is on the right node at the right time. SQL/MR functions are table functions breaking out of the single-node straitjacket. This means you can analyze lots of data fast.
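As a rough illustration of the partition-style (“reduce”) case, the Python sketch below sessionizes web clicks. It is not the SQL/MR API: the function names, the 30-minute timeout, and the run_partitioned helper (a stand-in for the engine that groups and orders rows before handing them to the function) are all assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(partition, timeout=1800):
    """Partition function: receives one user's clicks, ordered by time.

    Emits each click with a session number appended. A new session starts
    when the gap since the previous click exceeds `timeout` seconds.
    """
    session = 0
    previous_ts = None
    for user, ts in partition:
        if previous_ts is not None and ts - previous_ts > timeout:
            session += 1
        previous_ts = ts
        yield (user, ts, session)


def run_partitioned(rows, function):
    """Stand-in for the engine: partition by user, order by time, apply."""
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        yield from function(group)


# Hypothetical click stream: (user, timestamp in seconds).
clicks = [
    ("alice", 100), ("bob", 50), ("alice", 160),
    ("alice", 5000), ("bob", 4000),
]
result = list(run_partitioned(clicks, sessionize))
for row in result:
    print(row)
```

Because sessionize only ever sees one user's rows, each partition can be processed on a different node without the function knowing or caring.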
Usability
We want to make sure that developers using our SQL/MR framework spend their time thinking about the analytics, not dealing with infrastructure issues. We have a straightforward API (think: you get a stream of rows and give us back a stream of rows) and a debugging interface that lets you monitor execution of your function across our cluster. Want to write and run a function? One command installs the function, and a single SQL statement invokes it. The data you provide the function is defined in SQL, and the output can be sliced and diced with more SQL - no digging into Java if you want to change a projection, provide the function a different slice of data, or add a sort onto the output. All this allows a developer to get a working function - sooner - and an analyst to tweak the question more readily.
Reusability
We’ve gone to great lengths to make sure that a SQL/MR function, once written, can be leveraged far and wide. As mentioned before, SQL/MR functions are invoked from SQL, which means that they can be used by users who don’t know anything about Java. They also accept “argument clauses” - custom parameters which integrate nicely with SQL. Our functions are polymorphic, which means their output is dynamically determined by their input. This means that they can be used in a variety of contexts. And, it means that any number of people can write a function which you can easily reuse over your data. A function, once written, can be reused all over the place, allowing users to ask their questions faster (since someone’s probably asked a similar question in the past).
In fact, we’ve leveraged the SQL/MR framework to build a function that ships with nCluster: nPath. But this is just the first step, and the sky’s the limit. SQL/MR could enable functions for market basket analysis, k-means clustering, support vector machines, and natural language processing, among others.
How soon will your questions be answered? I’d love to hear of any ideas you have for analytic functions you’re struggling to write in SQL that you think could be a good fit for SQL/MapReduce.
MySpace decided to support one of its most important product launches of 2008 with an expansion of its Aster data warehouse. The data that would be collected would be used to provide information on trends in media and current interests on MySpace. The go-live date was October 2008.
MySpace planned for the data warehouse right from the inception of the project to ensure that reporting was considered a first-class citizen in the overall launch process, rather than a post-launch activity. The result was that the data warehouse was up and running to receive the usage streams, even during a private beta release period, giving the warehousing team the necessary time to prepare for the onslaught of data that would result after the public release.
In fact, there is a very interesting incident that happened on the day of the new MySpace product launch. At about 7am, one of the servers in the Aster nCluster data warehouse failed. The failure was detected by our support team - and no scrambling ensued. Aster nCluster detected and isolated the failure, continuing to run the service with n-1 nodes without a blip and minimal performance change! Later, after the initial tense moments were behind us, the MySpace operations team walked over and replaced the failed hardware. The Aster database administrator then pressed a single button to re-include the node back to the nCluster data warehouse - the database continued to hum away with zero downtime.
The power of “Always-On”!
We will be co-hosting a case study by MySpace on their use of Aster at the Gartner BI Summit next week in National Harbor, MD on March 11. If you’ll be at the event, please come by to hear what Hala has to say about their use of Aster to support their mission-critical operations at MySpace across multiple functions and departments.
Their Aster enterprise data warehouse supports frontline applications (e.g., MySpace TV, MySpace Video, etc.), as well as online marketing, sales, IT, finance, international, and legal. MySpace is also planning to incorporate data from Aster into a balanced scorecard for strategic alignment of the business around key performance indicators, as well as other future projects.
Some highlights from the video for folks who would rather read:
MySpace got up and running with Aster quickly
“We were able to bring that up online and actually start processing the data into it within a matter of weeks, and I think very few technologies give you the ability to do something like that.”
– Hala Al-Adwan, VP of Data Services at MySpace
Aster is mission-critical to MySpace
“With Aster, what we’ve been able to produce with commodity hardware has been a supercomputer-like infrastructure …the data that we collect and process is absolutely critical to the success of MySpace.”
– Bita Mathews, Data Warehouse Manager, MySpace
“Right now our key business performance metrics are all powered out of the Aster system. If somebody went and shut it down, none of that would be available. I think in a lot of ways, we were lacking that data before, and now that we’re used to having it, people are just hungry for more and more information. So if all that went away, I think it’s kinda like going back to an age where there was no light.”
– Hala Al-Adwan
MySpace’s data warehouse with Aster is extremely reliable
“Aster is always on and available. And this is very amazing thing about Aster, because it’s massive. There’s a lot of hardware underneath the system. When hardware fails, we can continue working, and although we know some engineers are fixing hardware, but that doesn’t stop us from continuing to run queries and producing our reports.”
– Anna Dorofiyenko, Data Architect, MySpace
Aster is the blueprint for successful data warehouse deployments going forward
“Integrating Aster and including them from the very beginning in the MySpace Music project … from beginning to end is what allowed that to be the most successful data warehouse implementation we’ve had to date, and I think we should definitely use it as a blueprint for any future implementations we do.”
– Christa Stelzmuller, Chief Data Architect, MySpace
Cloud computing is a fascinating concept. It offers greenfield opportunities (or more appropriately, blue sky frontiers) for businesses to affordably scale their infrastructure needs without plunking down a huge hardware investment (and the space/power/cooling costs associated with managing your own hosted environment). This removes the risks of mis-provisioning by enabling on-demand scaling according to your data growth needs. Especially in these economic times, the benefits of Cloud computing are very attractive.
But let’s face it - there’s also a lot of hype, and it’s hard to separate truth from fiction. For example, what qualities would you say are key to data warehousing in the cloud?
Here’s a checklist of things I think are important:
[1] Time-To-Scalability. The whole point of clouds is to offer easy access to virtualized resources. A cloud warehouse needs to quickly scale-out and scale-in to adapt to changing needs. It can’t take days to scale…it has to happen on-demand in minutes (<1 hour).
[2] Manageability. You go with clouds because you not only want to save on hardware, but also on the operational people costs of maintaining that infrastructure. A cloud warehouse needs to offer one-click scaling, easy install/upgrade, and self-managed resiliency.
[3] Ecosystem. While clouds offer *you* huge TCO savings, you can’t compromise service levels for your customers - especially if you run your business on the cloud. BI/ETL/monitoring tools, Backup & Recovery, and ultra-fast data loading can’t be overlooked for “frontline” mission-critical warehousing on the cloud.
[4] Analytics. Lots of valuable data is generated via the cloud and there are opportunities to subscribe to new data feed services. It’s insufficient for a cloud warehouse to just do basic SQL reporting. Rather, it must offer the ability to do deep analytics very quickly.
[5] Choice. A truly best-in-class cloud warehouse won’t lock you in to a single cloud vendor. Rather, it will offer portability by enabling you to choose the best cloud for you to run your business on.
Finally, here are a couple of ideas on the future of cloud warehousing. What if you could link multiple cloud warehouses together and do interesting queries across clouds? And what about the opportunities for game-changing new analytics - with so many emerging data subscription services, wouldn’t this offer ripe opportunities for mash-up analytics (e.g. using Aster SQL/MapReduce)?
What do you think are the standards for “best-in-class” cloud warehousing?
Back in March 2005, I attended the AFCOM Data Center World Conference while working at NetApp. It was a great opportunity to learn about enterprise data center challenges and network with some very experienced folks. One thing that caught my attention was a recurring theme on growing power & cooling challenges in the data center.
Vendors, consultants, and end user case study sessions trumpeted dire warnings that the proliferation of powerful 1U blade servers would result in power demands outstripping supply (for example, a typical 42U rack consumed 7-10kW, while new-generation blade servers were said to exhibit peak rack heat loads of 15-25kW). In fact, estimates were that HVAC cooling (for heat emissions) was an equally significant power consumer (i.e. for every watt you burn to power the hardware, you burn another watt to cool it down).
Not coincidentally, 2005 marked the year when many server, storage, and networking vendors came out with “green” messaging. The idea was to convey technologies that reduce power consumption and heat emissions, saving both money and the environment. While some had credible stories (e.g. VMware), more often than not the result was me-too bland positioning or sheer hype (also known as “green washing”).
Luckily, Aster doesn’t suffer from this, as the architecture was designed for cost-efficiency (both people costs and facilities costs). Among many examples:
[1] Heterogeneous scaling: we use commodity hardware but the real innovation is making new servers work with pre-existing older ones. This saves power & cooling costs because rather than having to create a new cluster from scratch (which requires new Queen nodes, new Loader nodes, more networking equipment, etc), you can just plug in new-generation Worker nodes and scale-out on the existing infrastructure…
[2] Multi-layer scaling: A related concept is that nCluster doesn’t require the same hardware for each “role” in the data warehousing lifecycle. This division-of-labor approach ensures cost-effective scaling and power efficiency. For example, Loader nodes are focused on ultra-fast partitioning and loading of data - since data doesn’t persist to disk, these servers contain minimal spinning disk drives to save power. On the opposite end, Backup nodes are focused on storing full/incremental backups for data protection - typically these nodes are “bottom-heavy” and contain lots of high-capacity SATA disks for power efficiency benefits (fewer servers, fewer disk drives, slower spinning 7.2K RPM drives).
[3] Optimized partitioning: one of our secret sauce algorithms ensures maximizing locality of joins via intelligent data placement. As a result, less data transfers over the network, which means IT orgs can stretch their existing network assets (without having to buy more networking gear and burn power).
[4] Compression: we love to compress things. Tables, cross-node transfers, backup & recovery, etc all leverage compression algorithms to get 4x - 12x compression ratios - this means fewer spinning disk drives to store data and lower power consumption.
…and others (too many to list in a short blog like this)
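To illustrate the general idea behind item [3], the Python sketch below shows plain hash co-partitioning: if two tables are distributed by hashing the same join key, matching rows land on the same node and the join needs no network transfer. This is a generic textbook technique, not Aster's placement algorithm, and every name here (node_for, place, local_join, the sample tables) is hypothetical.

```python
import zlib
from collections import defaultdict


def node_for(key, num_nodes):
    """Pick a node from a stable hash of the join key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def place(rows, key_index, num_nodes):
    """Distribute rows across nodes by hashing the join-key column."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key_index], num_nodes)].append(row)
    return nodes


def local_join(left_nodes, right_nodes, num_nodes):
    """Join each node's rows only against rows stored on that same node."""
    joined = []
    for node in range(num_nodes):
        for left in left_nodes.get(node, []):
            for right in right_nodes.get(node, []):
                if left[0] == right[0]:
                    joined.append((left[0], left[1], right[1]))
    return sorted(joined)


# Hypothetical tables: (user_id, name) and (user_id, item).
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (3, "lamp"), (1, "pen"), (2, "mug")]

NUM_NODES = 4
joined = local_join(
    place(users, 0, NUM_NODES), place(orders, 0, NUM_NODES), NUM_NODES
)
print(joined)
```

Every match is found even though no node ever looks at another node's rows, which is where the network savings come from.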
I’d love to continue the conversation with IT folks passionate about power consumption…what are your top challenges today and what trends do you see in power consumption for different applications in the data center?
With all the hype in the analytic database and database appliance market around “100x performance!” and “dramatic cost savings!”, we’ve stayed away from making bold, unsubstantiated claims of our own. Instead, we’ve chosen to talk about the ability for companies to “do more” with their data through advanced analytics powered by Aster’s In-Database MapReduce, etc.
However, Aster does provide some unique cost and time advantages. We’ve produced a technical brief and webcast which outline the four main areas where Aster can help you lower your data warehousing costs:
I recently wrote an article with Enterprise Systems Journal on how retailers can benefit from innovations in “always on” databases. This story is based on the real-life experiences of one of our customers during the last holiday shopping season. They saw a spike in traffic and quickly scaled their frontline data warehouse built with Aster Data Systems. This allowed them to maintain the service level agreements with the business for product cross-promotion without needing to wait to plan an upgrade in the off-season.
After next week, let us know - how did your favorite e-tailers fare on ‘Black Friday’ and ‘Cyber Monday’? Did their recommendation engines deliver, or did they leave cash in your wallet that got spent elsewhere?
Forget about total cost of ownership (TCO). In the Internet age, scaling up can be the biggest cost factor in your infrastructure. And overlooking the scaling challenge can be disastrous.
Take Bob, for instance. Bob is the fictional database manager at FastGrowth, an Internet company with a fast-growing user base. Bob is 36 and has done more than a dozen data warehouse deployments in his career. He’s confident in his position. Granted, this is his first Internet gig… But this shouldn’t matter a lot, right? Bob’s been there, done that, got the tee-shirt.
Bob’s newest gig is to implement a new data warehouse system to accommodate FastGrowth’s explosive data growth. He estimates that there will be 10TB of data in the next 6 months, 20TB in the next 12, and 40TB 18 months from now.
Bob needs to be very careful about cost (TCO); getting way overboard on his budget could cost him his reputation or even (gasp) his job. He thus asks vendors and friends how much hardware and software he needs to buy at each capacity level. He also makes conservative estimates about the number of people required to manage the system and its data at 10, 20, and 40 terabytes.
Fast-forward 18 months. Bob’s DW is in complete chaos; it can hardly manage half of the 40 TB target and it required twice the number of people and dollars … so far. Luckily for Bob, his boss (Suzy), has been doing Internet infrastructure projects for her whole career and knew exactly what mistake Bob made (and why he deserves a second chance.)
What went wrong? Bob did everything almost perfectly. His TCO estimates at each scale level were, in fact, correct. But what he did not account for was the effort of going from one scale level to the other in such a short time! Doubling the size of data every 6 months is 3x faster than Moore’s law. That’s like trying to build a new car that is 3x faster than a Ferrari. As a result, growing from 10TB to 20TB in six months may cost many times more (in terms of people, time and dollars) than running a 20TB system for 6 months.
In some ways, this is not news. The Internet space is full of stories where scaling was either too expensive or too disruptive to be carried out properly. Twitter, with its massive success, has had to put huge effort into scaling up its systems. And Friendster lost the opportunity to be a top social network partly because it was taking too long to scale up its infrastructure. Moreover, as new data sources become available, companies outside the Internet space are facing similar challenges – scaling needs that are too hard to manage!
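The arithmetic behind the “3x faster” comparison can be checked with a few lines of Python. The sketch assumes the commonly cited 18-month doubling period for Moore’s law, which is what the 3x figure implies; the function and constant names are made up for the example.

```python
def growth_factor(months, doubling_period_months):
    """How many times larger something gets after `months` of steady doubling."""
    return 2 ** (months / doubling_period_months)


DATA_DOUBLING = 6    # months, from Bob's estimates (10TB -> 20TB -> 40TB)
MOORE_DOUBLING = 18  # months, an assumed common reading of Moore's law

horizon = 18  # months
data_growth = growth_factor(horizon, DATA_DOUBLING)
hardware_growth = growth_factor(horizon, MOORE_DOUBLING)
print(data_growth, hardware_growth)  # prints "8.0 2.0"
```

Over any 18-month window the data grows eightfold while the hardware, on this assumption, only doubles, so the gap has to be closed by adding and reorganizing infrastructure rather than by waiting for faster machines.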
So how can we reason about this new dimension of infrastructure cost? What happens when data is growing constantly, and scaling up ends up being the most expensive part of most projects?
The answer, we believe, is that the well-known concept of TCO is not good enough to capture scaling costs in this new era of fast data growth. Instead, we need to also start thinking about the Total Cost of Scaling – or TCS.
Why is TCS useful? TCS captures all costs – in terms of hardware, software and people – that are required to increase the capacity of the infrastructure. Depending on the application, capacity can mean anything such as amount of data (e.g. for data warehousing projects) or queries per second (for OLTP systems.) TCO together with TCS gives a true estimate of project costs for environments that have been blessed with a growing business.
Let’s see how TCS works in an example. Say that you need 100 servers to run your Web business at a particular point in time, and you have calculated the TCO for that. You can also calculate the TCO of having 250 servers running 12 months down the road, when your business has grown. But going from 100 servers to 250 – that’s where TCS comes in. The careful planner (e.g. Bob in his next project) will need to add all three numbers together – TCO at 100 servers, TCO at 250 servers and TCS for scaling from 100 to 250 – to get an accurate picture of the full cost.
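Put as a small Python sketch, the calculation looks like this. The dollar figures are entirely hypothetical and chosen only to show how leaving out TCS understates the total; the function and variable names are likewise made up for the example.

```python
def full_cost(tco_before, tco_after, tcs):
    """Total project cost: running cost at each scale plus the cost of scaling."""
    return tco_before + tco_after + tcs


# Hypothetical figures in thousands of dollars, for illustration only.
tco_100_servers = 400  # running 100 servers
tco_250_servers = 900  # running 250 servers
tcs_100_to_250 = 600   # hardware, software and people to go from 100 to 250

naive_estimate = tco_100_servers + tco_250_servers
total = full_cost(tco_100_servers, tco_250_servers, tcs_100_to_250)
print(naive_estimate, total)  # prints "1300 1900"
```

With these made-up numbers, a plan built on TCO alone would miss nearly a third of the real cost.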
At Aster, we have been thinking about TCS from day one exactly because we design our systems for environments of fast data growth. We have seen TCS dominating the cost of data projects. As a result, we have built a product that is designed from the ground-up to make scalability seamless and reduce the TCS of our deployments to a minimum. For example, one of our customers scaled up their Aster deployment from 45 to 90 servers with a click of a button. In contrast, traditional scaling approaches – manual, tedious and risky – bloat TCS and can jeopardize whole projects.
As fast data growth becomes the rule rather than the exception, we expect more people to start measuring TCS and seek ways to reduce it. As Francis Ford Coppola put it, “anything you build on a large scale or with intense passion invites chaos.” And while passion is hard to manage, there is something we can do about scale.