<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.2.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>The Data Blog: Aster Data Blog</title>
	<link>http://www.asterdata.com/blog</link>
	<description>The convergence of Big Data, analytic applications, MPP data warehouses, and MapReduce</description>
	<pubDate>Tue, 24 Aug 2010 23:39:57 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.3</generator>
	<language>en</language>
			<item>
		<title>Joining the Big Data Revolution for Fun and Profit</title>
		<link>http://www.asterdata.com/blog/index.php/2010/08/10/joining-the-big-data-revolution-for-fun-and-profit/</link>
		<comments>http://www.asterdata.com/blog/index.php/2010/08/10/joining-the-big-data-revolution-for-fun-and-profit/#comments</comments>
		<pubDate>Tue, 10 Aug 2010 22:47:15 +0000</pubDate>
		<dc:creator>Tasso Argyros</dc:creator>
		
		<category><![CDATA[Analytics]]></category>

		<category><![CDATA[Database]]></category>

		<guid isPermaLink="false">http://www.asterdata.com/blog/index.php/2010/08/10/joining-the-big-data-revolution-for-fun-and-profit/</guid>
		<description><![CDATA[Coming out of Stanford to start Aster Data five years back, my co-founders and I had to answer a lot of questions. What kind of an engineering team do we want to build? Do we want people experienced in systems or databases? Do we want to hire people from Oracle or another established organization? When [...]]]></description>
			<content:encoded><![CDATA[<p>Coming out of Stanford to start Aster Data five years back, my co-founders and I had to answer a lot of questions. What kind of an engineering team do we want to build? Do we want people experienced in systems or databases? Do we want to hire people from Oracle or another established organization? When you&#8217;re just starting a company, embarking on a journey that you know will have many turns, answers are not obvious.</p>
<p><img src="http://www.asterdata.com/blog/wp-content/uploads/2010/08/img_4179.jpg" align="left" width="138" height="201" hspace="10" />What we ended up doing very early on was to bet on smart, adaptable engineers rather than on experience or a long resume. It turned out that this was the right thing to do because, as a startup, we had to react to market needs and change our focus in the blink of an eye. Having a team of people who were used to tackling never-seen-before problems made us super-agile as a product organization. As the company grew, we ended up having a mix of people that combined expertise in certain areas and core engineering talent. But the culture of the company was set in stone even though we didn&#8217;t realize it: even today our interview process expects talent, intelligence and flexibility to be there and to strongly complement the experience our candidates may have.</p>
<p>There are three things that are great about being an engineer at Aster Data:</p>
<p><strong>Our technology stack is really tall.</strong> We have people working right above the kernel on filesystems, workload management, I/O performance, etc. We have many challenging problems that involve very large scale distributed systems - and I&#8217;m talking about the whole nine yards, including performance, reliability, manageability, and data management at scale. We have people working on database algorithms from the I/O stack to the SQL planner to no-SQL planners. And we have a team of people working on data mining and statistical algorithms on distributed systems (this is our &#8220;quant&#8221; group since people there come with a background in physics as much as computer science). It&#8217;s really hard to get bored or stop learning here.</p>
<p><strong>We build real enterprise software.</strong> There&#8217;s a difference between the software one would write in a company like Aster Data versus a company like Facebook. Both companies write software for big data analysis. However, a company like Facebook solves its problem (a very big problem, indeed) for itself, and each engineer gets to work on a small piece of the pie. At Aster Data we write software for enterprises, and due to our relatively small size each engineer makes a world of difference. We also ship software to third parties and they expect our software to be out-of-the-box resilient, reliable and easy to manage/debug. This makes the problem more challenging but also gives us great leverage: once we get something right, not one or two, but potentially hundreds or thousands of companies can benefit from our products. The impact of the work of each engineer at Aster Data is truly significant.</p>
<p><strong><img src="http://www.asterdata.com/blog/wp-content/uploads/2010/08/aster1.jpg" align="left" width="266" height="308" hspace="10" />We&#8217;re working on (perhaps) the biggest IT revolution of the 21st century.</strong> Big Data. Analytics. Insights. Data Intelligence. Commodity hardware. Cloud/elastic data management. You name it. We have it. When we started Aster Data in 2005 we just wanted to help corporations analyze the mountains of data that they generate. We thought it was a critical problem for corporations if they wanted to remain competitive and profitable. But the size and importance of data grew beyond anyone&#8217;s expectations over the past few years. We can probably thank Google, Facebook and the other internet companies for demonstrating to the world what data analytics can do. Given the importance and impact of our work, there&#8217;s no ceiling on how successful we can become.</p>
<p><img src="http://www.asterdata.com/blog/wp-content/uploads/2010/08/img_4201.jpg" align="right" width="132" height="199" hspace="10" />You&#8217;ve probably guessed it by now, but the reason I&#8217;m telling you all this is to also tell you that <a href="http://www.asterdata.com/about/careers.php">we&#8217;re hiring</a>. If you think you have what it takes to join such an environment, I&#8217;d encourage you to apply. We get many applications daily so the best way to get an interview here is through a recommendation and referral. With tools like LinkedIn (which happens to be a customer) it&#8217;s really easy to explore your network. My LinkedIn profile is <a href="http://www.linkedin.com/in/tasso" target="_blank">here</a>, so see if we have a professional or academic connection. You can also look at our <a href="http://www.asterdata.com/about/management.php">management team</a>, <a href="http://www.asterdata.com/about/board.php">board of directors</a>, <a href="http://www.asterdata.com/about/investors.php">investors</a> and <a href="http://www.asterdata.com/about/advisory.php">advisors</a> to see if there are any connections there. If there&#8217;s no common connection, feel free to email your resume to <a href="mailto:jobs@asterdata.com">jobs@asterdata.com</a>. However, to stand out I&#8217;d encourage you to tell us a couple of words about what excites you about Aster Data, large scale distributed systems, databases, analytics and/or startups that work to revolutionize an industry, and why you think you&#8217;ll be successful here. Finally, take a look at the <a href="http://www.asterdata.com/news/events.php">events</a> we either organize or participate in - it&#8217;s a great way to meet someone from our team and explain why you&#8217;re excited to join our quest to revolutionize data management and analytics.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.asterdata.com/blog/index.php/2010/08/10/joining-the-big-data-revolution-for-fun-and-profit/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The Lifecycle and Importance of a Big Data Product</title>
		<link>http://www.asterdata.com/blog/index.php/2010/08/09/the-lifecycle-and-importance-of-a-big-data-product/</link>
		<comments>http://www.asterdata.com/blog/index.php/2010/08/09/the-lifecycle-and-importance-of-a-big-data-product/#comments</comments>
		<pubDate>Mon, 09 Aug 2010 18:18:37 +0000</pubDate>
		<dc:creator>Tasso Argyros</dc:creator>
		
		<category><![CDATA[Analytics]]></category>

		<guid isPermaLink="false">http://www.asterdata.com/blog/index.php/2010/08/09/the-lifecycle-and-importance-of-a-big-data-product/</guid>
		<description><![CDATA[Watching our customers use Aster Data to discover new insights and build new big data products is one of the most satisfying parts of my job. Having seen this process a few times, I found that it always has the same steps:
An idea or concept - Someone comes up with an idea of a hidden [...]]]></description>
			<content:encoded><![CDATA[<p>Watching our customers use Aster Data to discover new insights and build new big data products is one of the most satisfying parts of my job. Having seen this process a few times, I found that it always has the same steps:</p>
<p><strong>An idea or concept</strong> - Someone comes up with an idea about a treasure that could be hidden in the data, e.g. a new customer segment that could be very profitable, a new pattern that reveals novel cases of fraud, or other event-triggered analysis.</p>
<p><strong>Dataset </strong>- An idea based on data that doesn&#8217;t exist is like a great recipe without the ingredients. Hopefully the company has already deployed one or more big data repositories that have the necessary data in full detail (no summaries, sampling, etc). If that&#8217;s not the case, data has to be generated, captured and moved to a big data analytics server, which is an MPP database with a fully integrated analytics engine, like Aster Data&#8217;s solution. Such a server addresses both parts of the big data need - scalable data storage and data processing.</p>
<p><strong>Iterative Experimentation</strong> - This is the fun part. In contrast to traditional reporting, where the idea translates almost automatically to a query or report (e.g.: I want to know average sales per store for the past 2 years), a big data product idea (e.g.: I want to know what is my most profitable customer segment) requires building an intuition about the data before coming up with the right answer. This can only be achieved by a large number of analytical queries using either SQL or MapReduce, and it&#8217;s the step where the analyst or data scientist builds their intuition and understanding of the dataset and of the hidden gems buried there.</p>
<p><strong>Data Productization</strong> - Once iterative experimentation provides the data scientist with evidence of gold, the next step is to make the process repeatable so that its output can be systematically used by humans (e.g. marketing department) or systems (e.g. a credit card transaction clearing system that needs to identify fraudulent transactions). This requires not only a repeatable process but also data that&#8217;s certified to be of high quality and processing that can meet specific SLAs, always while using a hybrid of SQL and MapReduce for deep big data analysis.</p>
<p>If you think about it, this process is similar to the process of coming up with a new product (software or otherwise). You start with an idea, you then get the first material and build a lot of prototypes. I&#8217;ve found that people who find an important and valuable data insight after a process of iterative experimentation feel the same satisfaction as an inventor who has just made a huge discovery. And the next natural step is to take that prototype, make it a repeatable manufacturing process and start using it in the real world.</p>
<p>In the &#8220;old&#8221; world of simple reporting, the process of creating insights was straightforward. Correspondingly, the value of the outcome (reports) was much lower and easily replicable by everyone. Big Data Analytics, on the other hand, requires a touch of innovation and creativity, which is exactly why it is hard to replicate and why its results produce such important and sustainable advantages to businesses. I believe that Big Data Products are the next wave of corporate value creation and competitive differentiation.</p>
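<p>To make the contrast between a report and an experiment concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, its columns and its rows are invented purely for illustration; a real big data product would run this kind of loop against the full-detail dataset, not a toy in-memory table.</p>

```python
import sqlite3

# Hypothetical toy schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (store TEXT, customer_age INTEGER, amount REAL, cost REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("north", 24, 100.0, 80.0),
        ("north", 51, 300.0, 150.0),
        ("south", 27, 200.0, 170.0),
        ("south", 58, 400.0, 220.0),
    ],
)

# Traditional reporting: the idea maps almost directly to one query.
avg_sales_per_store = dict(
    conn.execute("SELECT store, AVG(amount) FROM sales GROUP BY store")
)


def profit_by_age_band(band_width):
    """One experiment: total profit per customer age band of the given width."""
    rows = conn.execute(
        "SELECT (customer_age / ?) * ? AS band, SUM(amount - cost) "
        "FROM sales GROUP BY band ORDER BY band",
        (band_width, band_width),
    )
    return dict(rows)


# Iterative experimentation: try several candidate segmentations and compare.
experiments = {width: profit_by_age_band(width) for width in (10, 30)}
```

<p>The report is a single fixed query, while the segment question only takes shape after several variations of the same query have been compared.</p>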
]]></content:encoded>
			<wfw:commentRss>http://www.asterdata.com/blog/index.php/2010/08/09/the-lifecycle-and-importance-of-a-big-data-product/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Google&#8217;s Dremel - or, Can MapReduce Itself Handle Fast, Interactive Querying?</title>
		<link>http://www.asterdata.com/blog/index.php/2010/07/19/google%e2%80%99s-dremel-%e2%80%93-or-can-mapreduce-itself-handle-fast-interactive-querying/</link>
		<comments>http://www.asterdata.com/blog/index.php/2010/07/19/google%e2%80%99s-dremel-%e2%80%93-or-can-mapreduce-itself-handle-fast-interactive-querying/#comments</comments>
		<pubDate>Mon, 19 Jul 2010 21:52:18 +0000</pubDate>
		<dc:creator>Tasso Argyros</dc:creator>
		
		<category><![CDATA[MapReduce]]></category>

		<category><![CDATA[Business Intelligence]]></category>

		<category><![CDATA[Statements]]></category>

		<guid isPermaLink="false">http://www.asterdata.com/blog/index.php/2010/07/19/google%e2%80%99s-dremel-%e2%80%93-or-can-mapreduce-itself-handle-fast-interactive-querying/</guid>
		<description><![CDATA[Every year or so Google comes out with an interesting piece of infrastructure, always backed by claims that it&#8217;s being used by thousands of people on thousands of servers and processes petabytes or exabytes of web data. That alone makes Google papers interesting reading.  
This latest piece of research just came out on Google&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>Every year or so Google comes out with an interesting piece of infrastructure, always backed by claims that it&#8217;s being used by thousands of people on thousands of servers and processes petabytes or exabytes of web data. That alone makes Google papers interesting reading. <img src='http://www.asterdata.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>This latest piece of research just came out on <a href="http://www.google.com/buzz/goog.research.buzz/WsARqxc7d7R/Dremel-Interactive-Analysis-of-Web-Scale-Datasets" target="_blank">Google&#8217;s Research Buzz page</a>. It&#8217;s about a system called Dremel (note: <a href="http://www.dremel.com/en-us/Pages/default.aspx" target="_blank">Dremel</a> is a company building hardware tools which I happened to use a lot when I was building model R/C airplanes as a kid). Dremel is an interesting move by Google which provides a system for interactive analysis of data. It was created because it was thought that native MapReduce has too much latency for fast interactive querying/analysis. It uses data that sits on different storage systems like GFS or BigTable. Data is modeled in a columnar, semi-structured format and the query language is SQL-like with extensions to handle the non-relational data model. I find this interesting - below is my analysis of what Dremel is and the big conclusion.</p>
<p>Main characteristics of the system:</p>
<p><strong>Data &amp; Storage Model</strong><br />
- Data is stored in a semi-structured format. This is not XML, rather it uses <a href="http://code.google.com/apis/protocolbuffers/" target="_blank">Google&#8217;s Protocol Buffers</a>. Protocol Buffers (PB) allow developers to define schemas that are nested.<br />
- Every field is stored in its own file, i.e. every element of the Protocol Buffers schema is columnar-ized. <strong>Columnar modeling is especially important for Dremel</strong> for two specific reasons:<br />
-	Protocol Buffer data structures can be huge (&gt; 1000 fields).<br />
- Dremel does not offer any data modeling tools to help break these data structures down. E.g. there&#8217;s nothing in the paper that explains how you can take a Protocol Buffers data structure and break it down to 5 different tables.<br />
-	Data is stored in a way that makes it possible to recreate the original flat schema from the columnar representation. This however requires a full pass over the data - the paper doesn&#8217;t explain how point or indexed queries would be executed.<br />
-	There&#8217;s almost no information about how data gets into the right format, or how it is stored, deleted, replicated, etc. My best guess is that when someone defines a Dremel table, data is copied from the underlying storage to the local storage of Dremel nodes (leaf nodes) and at the same time is replicated across the leaf nodes. Since data in Dremel cannot be updated (it seems to be a write-once-read-many model), design &amp; implementation of the replication subsystem should be significantly simplified.</p>
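<p>As a rough illustration of what "every field is stored in its own file" means, the sketch below splits nested records into one list of values per leaf field. It is a toy under stated assumptions, not Dremel's format: the record layout and field names are invented, Python dicts stand in for Protocol Buffers, and it ignores repeated and optional fields, which the paper handles with extra bookkeeping (repetition and definition levels).</p>

```python
from collections import defaultdict


def columnarize(records):
    """Split nested dict records into one value list per leaf field path."""
    columns = defaultdict(list)

    def walk(prefix, value):
        if isinstance(value, dict):
            # Descend into nested groups, building a dotted field path.
            for key, child in value.items():
                walk(f"{prefix}.{key}" if prefix else key, child)
        else:
            # Leaf field: append the value to that field's column.
            columns[prefix].append(value)

    for record in records:
        walk("", record)
    return dict(columns)


# Hypothetical records; field names are invented for illustration.
records = [
    {"doc_id": 1, "page": {"url": "http://a.example", "lang": "en"}},
    {"doc_id": 2, "page": {"url": "http://b.example", "lang": "fr"}},
]

columns = columnarize(records)
# A query that touches only page.lang reads one column, not whole records.
languages = columns["page.lang"]
```

<p>With structures of more than 1000 fields, reading only the handful of columns a query mentions is where most of the savings would come from.</p>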
<p><strong>Interface</strong><br />
- <strong>Query interface is <em>SQL-like</em></strong> but with extensions to handle the semi-structured, nested nature of data. Input of queries is semi-structured, and output is semi-structured as well. One needs to get used to this since it&#8217;s significantly different from the relational model.<br />
- Tables can be defined from files, e.g. stored in GFS by means of a DEFINE TABLE command.<br />
- <strong>The data model and query language make Dremel appropriate for developers</strong>; for Dremel to be used by analysts or database folks, a different/simpler data model and a good number of tools (for loading, changing the data model etc) would be needed.</p>
<p><strong>Query Execution</strong><br />
- <strong>Queries do NOT use MapReduce</strong>, unlike Hadoop query tools like Pig &amp; Hive.<br />
- Dremel provides optimizations for sequential data access, such as async I/O &amp; prefetching.<br />
- Dremel supports approximate results (e.g. return partial results after reading X% of data - this speeds up processing in systems with 100s of servers or more since you don&#8217;t have to wait for laggards).<br />
- Dremel can use replicas to speed up execution if a server becomes too slow. This is similar to the &#8220;backup copies&#8221; idea from the <a href="http://labs.google.com/papers/mapreduce.html" target="_blank">original Google MapReduce paper</a>.<br />
- <strong>There seems to be a tree-like model of executing queries</strong>, meaning that there are intermediate layers of servers between the leaf nodes and the top node (which receives the user query). This is useful for very large deployments (e.g. thousands of servers) since it provides some intermediate aggregation points that reduce the amount of data that needs to flow to any single node.</p>
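<p>The tree-like execution and the approximate-results idea can be sketched together in a few lines. This is my own simplification, not code from the paper: the function names, the fanout and the min_fraction parameter are hypothetical, each inner list stands in for one leaf server's local data, and "the first leaves in the list" stands in for "the leaves that answered first".</p>

```python
def leaf_scan(rows):
    """A leaf server returns a partial aggregate (count, sum) over its local rows."""
    return (len(rows), sum(rows))


def merge(partials):
    """An intermediate or root server combines its children's partial aggregates."""
    partials = list(partials)
    return (sum(c for c, _ in partials), sum(s for _, s in partials))


def run_query(leaf_data, fanout=2, min_fraction=1.0):
    """Aggregate up a tree, optionally using only a fraction of the leaves."""
    # Approximate mode: stop waiting once min_fraction of the leaves are in.
    needed = max(1, int(len(leaf_data) * min_fraction))
    level = [leaf_scan(rows) for rows in leaf_data[:needed]]
    # Each pass models one layer of intermediate servers merging fanout children.
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


leaf_data = [[1, 2], [3, 4], [5, 6], [7, 8]]
exact = run_query(leaf_data)                            # all four leaves
approximate = run_query(leaf_data, min_fraction=0.75)   # three of four leaves
```

<p>No single node ever sees more than fanout partial results at a time, which is the point of the intermediate layers in a deployment with thousands of leaves.</p>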
<p><strong>Performance &amp; Scale</strong><br />
- <strong>Compared to Google&#8217;s native MapReduce implementation, Dremel is two orders of magnitude faster</strong> in terms of query latency. As mentioned above, part of the reason is that the Protocol Buffers are usually very large and Dremel doesn&#8217;t have a way to break those down except for its columnar modeling. Another reason is the high startup cost of Google&#8217;s MapReduce implementation.<br />
- Following Google&#8217;s tradition, <strong>Dremel was shown to scale reasonably well to thousands of servers</strong> although this was demonstrated only over a single query that parallelizes nicely and from what I understand doesn&#8217;t reshuffle much data. To really understand scalability, it&#8217;d be interesting to see benchmarks with a more complex workload collection.<br />
- The paper mentions little to nothing about how data is partitioned across the cluster. Scalability of the system will probably be sensitive to partitioning strategies, so that seems like a significant omission IMO.</p>
<p><strong>So the big question: Can MapReduce itself handle fast, interactive querying?</strong><br />
- There&#8217;s a difference between the MapReduce paradigm, as an interface for writing parallel applications, and a MapReduce implementation (two examples are Google&#8217;s own MapReduce implementation, which is mentioned in the Dremel paper, and open-source Hadoop). MapReduce implementations have unique performance characteristics.<br />
- It is well known that Google&#8217;s MapReduce implementation &amp; Hadoop&#8217;s MapReduce implementation are optimized for batch processing and not fast, interactive analysis. Besides the Dremel paper, look at <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html" target="_blank">this Berkeley paper</a> for some Hadoop numbers and an effort to improve the situation.<br />
- <strong>Native MapReduce execution is not fundamentally slow; however Google&#8217;s MapReduce and Hadoop happen to be oriented more towards batch processing</strong>. Dremel tries to overcome that by building a completely different system that speeds interactive querying. Interestingly, Aster Data&#8217;s SQL-MapReduce came about to address this in the first place and offers very fast interactive queries even though it uses MapReduce. So the idea that one needs to get rid of MapReduce to achieve fast interactivity is something I disagree with - we&#8217;ve shown this is not the case with SQL-MapReduce.</p>
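<p>To illustrate the difference between the paradigm and an implementation, here is a deliberately tiny, single-process sketch of the MapReduce interface. It is not Google's MapReduce, Hadoop or SQL-MapReduce; the function names are my own. The point is only that the mapper and reducer say nothing about scheduling, startup cost or latency, which are properties of whatever engine runs them.</p>

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Run mapper over inputs, group emitted pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    """Sum the counts emitted for one word."""
    return sum(counts)


word_counts = map_reduce(["big data", "Big analytics"], word_mapper, count_reducer)
```

<p>The same two functions could be handed to a batch-oriented engine or to a low-latency one without changing a line, which is why I separate the interface from the performance characteristics of any one implementation.</p>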
]]></content:encoded>
			<wfw:commentRss>http://www.asterdata.com/blog/index.php/2010/07/19/google%e2%80%99s-dremel-%e2%80%93-or-can-mapreduce-itself-handle-fast-interactive-querying/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Cloud Becomes HPC-friendly</title>
		<link>http://www.asterdata.com/blog/index.php/2010/07/13/cloud-becomes-hpc-friendly/</link>
		<comments>http://www.asterdata.com/blog/index.php/2010/07/13/cloud-becomes-hpc-friendly/#comments</comments>
		<pubDate>Tue, 13 Jul 2010 22:11:16 +0000</pubDate>
		<dc:creator>Tasso Argyros</dc:creator>
		
		<category><![CDATA[Cloud Computing]]></category>

		<guid isPermaLink="false">http://www.asterdata.com/blog/index.php/2010/07/13/cloud-becomes-hpc-friendly/</guid>
		<description><![CDATA[Amazon announced today the availability of special EC2 cloud clusters that are optimized for low-latency network operations. This is useful for applications in the so-called High-Performance Computing area, where servers need to request and exchange data very fast. Examples of HPC applications range from nuclear simulations in government labs to playing chess. 
I find this [...]]]></description>
			<content:encoded><![CDATA[<p>Amazon <a href="http://aws.typepad.com/aws/2010/07/the-new-amazon-ec2-instance-type-the-cluster-compute-instance.html?utm_source=feedburner&#038;utm_medium=feed&#038;utm_campaign=Feed:+AmazonWebServicesBlog+(Amazon+Web+Services+Blog)">announced</a> today the availability of special EC2 cloud clusters that are optimized for low-latency network operations. This is useful for applications in the so-called High-Performance Computing area, where servers need to request and exchange data very fast. Examples of HPC applications range from nuclear simulations in government labs to <a href="http://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)">playing chess</a>. </p>
<p>I find this development interesting, not only because it makes scientific applications in the cloud a possibility, but also because it&#8217;s an indication of where cloud infrastructure is heading. </p>
<p>In the early days, Amazon EC2 was very simple: if you wanted 5 &#8220;instances&#8221; (that is, 5 virtual machines), that&#8217;s what you got. However, the instances had little memory and little disk capacity. Over time, more and more configurations were added and now one can <a href="http://aws.amazon.com/ec2/instance-types/">choose an instance type</a> from a variety of disk &#038; memory characteristics with up to 15GB of memory and 2TB of disk per instance. However, the network was always a problem regardless of the size of the instance. (According to rumors, EC2 would make things worse by distributing instances as far away from each other as possible in the datacenter to increase reliability - as a result, network latency would suffer.) Now, the network problem is being solved by means of these special &#8220;Cluster Compute Instances&#8221; that provide guaranteed, non-blocking access to a 10GbE network infrastructure.</p>
<p>Overall this course represents a departure from the super-simple black-box model that EC2 started from. Amazon - wisely - realizes that accommodating more applications requires transparency - and providing guarantees - for the underlying infrastructure. Guaranteeing network latency is just the beginning: Amazon has the opportunity to add many more options and guarantees around I/O performance, quality of service, SSDs versus hard drives, fail-over behavior etc. The more options &#038; guarantees Amazon offers the closer we&#8217;ll get to the promise of the cloud - at least for resource-intensive IT applications.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.asterdata.com/blog/index.php/2010/07/13/cloud-becomes-hpc-friendly/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Storage vs Processing &#038; the EMC/Greenplum Acquisition</title>
		<link>http://www.asterdata.com/blog/index.php/2010/07/12/storage-vs-processing-the-emcgreenplum-acquisition/</link>
		<comments>http://www.asterdata.com/blog/index.php/2010/07/12/storage-vs-processing-the-emcgreenplum-acquisition/#comments</comments>
		<pubDate>Mon, 12 Jul 2010 18:35:08 +0000</pubDate>
		<dc:creator>Tasso Argyros</dc:creator>
		
		<category><![CDATA[Analytics]]></category>

		<category><![CDATA[Database]]></category>

		<guid isPermaLink="false">http://www.asterdata.com/blog/index.php/2010/07/12/storage-vs-processing-the-emcgreenplum-acquisition/</guid>
		<description><![CDATA[I have always enjoyed the subtle irony of someone trying to be impressive by saying &#8220;my data warehouse is X Terabytes&#8221; [muted: &#8220;and it&#8217;s bigger than yours&#8221;]! Why is this ironic? Because it describes a data warehouse, which is supposed to be all about data processing and analysis, using a storage metric. Having an obese [...]]]></description>
			<content:encoded><![CDATA[<p>I have always enjoyed the subtle irony of someone trying to be impressive by saying &#8220;my data warehouse is X Terabytes&#8221; [muted: &#8220;and it&#8217;s bigger than yours&#8221;]! Why is this ironic? Because it describes a data warehouse, which is supposed to be all about data processing and analysis, using a storage metric. Having an obese 800 Terabytes system that may take hours or days to just do a single pass over the data is not impressive and definitely calls for some diet.</p>
<p>Surprisingly though, several vendors went down the path of making their data warehousing offerings fatter and fatter. Greenplum is a good example. Prior to Sun&#8217;s acquisition by Oracle, they were heavily pushing systems based on the <a href="http://www.oracle.com/us/products/servers-storage/servers/x86/031210.htm" target="_blank">Sun Thumper</a>, a 48-disk-heavy 4U box that can store up to 100TBs/box. I was quite familiar with that box as it partly came out of a startup called Kealia that my Stanford advisor, <a href="http://en.wikipedia.org/wiki/David_Cheriton" target="_blank">David Cheriton</a>, and Sun co-founder <a href="http://en.wikipedia.org/wiki/Andy_Bechtolsheim" target="_blank">Andy Bechtolsheim</a> had founded and then sold to Sun in 2004. I kept wondering, though, what a 50TB/CPU configuration has to do with data analytics.</p>
<p>After long deliberation I came to the conclusion that it has nothing to do with it. There were two reasons why people were interested in this configuration. First, there were some use cases that required &#8220;near-line storage&#8221;, a term that&#8217;s used to describe a data repository whose major purpose is to store data but also allows for basic &amp; infrequent data access. In that respect, Greenplum&#8217;s software on top of the Sun Thumpers represented a cheap storage solution that offered basic data access and was very useful for applications where processing or analytics was not the main focus.</p>
<p>The second reason for the interest, though, is a tendency to drive DW projects towards an absolute low per-TB price to reduce costs. Experienced folks will recognize that such an approach leads to disaster, because (as mentioned above) analytics is more than just Terabytes. Perfectly low per-TB price using fat storage looks great on glossy paper but in reality it&#8217;s no good because nobody&#8217;s analytical problems are that simple.</p>
<p>The point here is that analytics has more to do with processing than with storage. It requires a fair number of balanced servers (thus good scalability &amp; fault tolerance), CPU cycles, networking bandwidth, smart &amp; efficient algorithms, fair amounts of memory to avoid thrashing etc. It&#8217;s also about how much processing can be done by SQL, and how much of your analytics needs to use next-generation interfaces like <a href="http://www.mapreduce.org" target="_blank">MapReduce</a> or pre-packaged in-database analytical engines. In the new decade on which we&#8217;re embarking, solving business problems like fraud, market segmentation &amp; targeting, financial optimization, etc., requires much more than just cheap, overweight storage.</p>
<p>So going to the EMC/Greenplum news, I think such an acquisition makes sense, but in a specific way. It will lead to systems that live between storage and data warehousing, systems able to store data and also give the ability to retrieve it on an occasional basis or if the analysis required is trivial. But the problems Aster is excited about are those of advanced <a href="http://www.asterdata.com/product/advanced-analytics.php">in-database analytics</a> for rich, ad hoc querying, delivered through a full application environment inside an MPP database. It&#8217;s these problems that we see as opportunities to not only cut IT costs but also provide tremendous competitive advantages to our customers. And on that front, we promise to continue innovating and pushing the limits of technology as much as possible.</p>
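<p>A back-of-the-envelope calculation shows why a storage metric says so little about analytics. The sketch below estimates the best-case time for one full pass over the data in a 48-disk box. The per-disk throughput of 100 MB/s is an assumed round number, not a measured figure, and the model ignores CPU, controller and network limits, so real scans would only be slower.</p>

```python
def full_scan_hours(data_tb, disks, mb_per_sec_per_disk):
    """Hours to read data_tb terabytes once if all disks stream in parallel."""
    total_bytes = data_tb * 1e12
    bytes_per_sec = disks * mb_per_sec_per_disk * 1e6
    return total_bytes / bytes_per_sec / 3600


# Assumption: 100 MB/s sustained sequential read per disk (illustrative only).
fat_box = full_scan_hours(data_tb=100, disks=48, mb_per_sec_per_disk=100)

# Same kind of box holding 10 TB, i.e. the same data spread over ten boxes.
lean_box = full_scan_hours(data_tb=10, disks=48, mb_per_sec_per_disk=100)
```

<p>Under these assumptions a full 100TB box needs close to six hours for a single pass, while ten boxes holding 10TB each would finish the same pass in about 35 minutes. The price per Terabyte goes up, but so does the amount of analysis you can actually run.</p>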
]]></content:encoded>
			<wfw:commentRss>http://www.asterdata.com/blog/index.php/2010/07/12/storage-vs-processing-the-emcgreenplum-acquisition/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Call for Database Industrial Papers in EDBT 2011</title>
		<link>http://www.asterdata.com/blog/index.php/2010/07/11/call-for-database-industrial-papers-in-edbt-2011/</link>
		<comments>http://www.asterdata.com/blog/index.php/2010/07/11/call-for-database-industrial-papers-in-edbt-2011/#comments</comments>
		<pubDate>Sun, 11 Jul 2010 17:39:55 +0000</pubDate>
		<dc:creator>Tasso Argyros</dc:creator>
		
		<category><![CDATA[Statements]]></category>

		<guid isPermaLink="false">http://www.asterdata.com/blog/index.php/2010/07/11/call-for-database-industrial-papers-in-edbt-2011/</guid>
		<description><![CDATA[Those of you that follow the academic conferences in the database space, are probably familiar with EDBT, the premier database conference in Europe. EDBT acts as a forum not only for European researchers, but also for commercial technologies and vendors that want to present their innovations in a European setting.
For 2011, EDBT is held in [...]<p><a href="http://sharethis.com/item?&#038;wp=2.2.3&#38;publisher=cf684f86-aafe-420a-8a13-0d8ab76b31c0&#38;title=Call+for+Database+Industrial+Papers+in+EDBT+2011&#38;url=http%3A%2F%2Fwww.asterdata.com%2Fblog%2Findex.php%2F2010%2F07%2F11%2Fcall-for-database-industrial-papers-in-edbt-2011%2F">ShareThis</a></p>]]></description>
			<content:encoded><![CDATA[<p>Those of you who follow the academic conferences in the database space are probably familiar with <a href="http://edbticdt2011.it.uu.se/">EDBT</a>, the premier database conference in Europe. EDBT acts as a forum not only for European researchers, but also for commercial technologies and vendors that want to present their innovations in a European setting.</p>
<p>For 2011, EDBT will be held in <a href="http://en.wikipedia.org/wiki/Uppsala">Uppsala, Sweden</a> in March. I&#8217;m on the Program Committee for the &#8220;industrial application&#8221; section and I&#8217;d like to encourage anyone with an interesting commercial technology and an interest in Europe to consider submitting a paper to the conference. Papers on applications and position papers on technology trends are equally welcome. The deadline for submission is September 8, 2010 and you can find more info on submitting <a href="http://edbticdt2011.it.uu.se/EDBT-CFP.html">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.asterdata.com/blog/index.php/2010/07/11/call-for-database-industrial-papers-in-edbt-2011/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The Concept of Non-Relational Analytics</title>
		<link>http://www.asterdata.com/blog/index.php/2010/07/02/the-concept-of-non-relational-analytics/</link>
		<comments>http://www.asterdata.com/blog/index.php/2010/07/02/the-concept-of-non-relational-analytics/#comments</comments>
		<pubDate>Fri, 02 Jul 2010 17:15:35 +0000</pubDate>
		<dc:creator>Tasso Argyros</dc:creator>
		
		<category><![CDATA[Analytics]]></category>

		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://www.asterdata.com/blog/index.php/2010/07/02/the-concept-of-non-relational-analytics/</guid>
		<description><![CDATA[There is a lot of talk these days about relational vs. non-relational data. But what about analytics? Does it make sense to talk about relational and non-relational analytics?
I think it does. Historically, a lot of data analysis in the enterprise has been done with pure SQL. SQL-based analysis is a type of &#8220;relational analysis,&#8221; which [...]<p><a href="http://sharethis.com/item?&#038;wp=2.2.3&#38;publisher=cf684f86-aafe-420a-8a13-0d8ab76b31c0&#38;title=The+Concept+of+Non-Relational+Analytics&#38;url=http%3A%2F%2Fwww.asterdata.com%2Fblog%2Findex.php%2F2010%2F07%2F02%2Fthe-concept-of-non-relational-analytics%2F">ShareThis</a></p>]]></description>
			<content:encoded><![CDATA[<p>There is a lot of talk these days about relational vs. non-relational data. But what about analytics? Does it make sense to talk about relational and non-relational <em>analytics</em>?</p>
<p>I think it does. Historically, a lot of data analysis in the enterprise has been done with pure SQL. SQL-based analysis is a type of &#8220;relational analysis,&#8221; which I define as analysis done via a set-based declarative language like SQL. Note how SQL treats every table as a set of values; SQL statements are relational set operations; and any intermediate SQL results, even within the same query, need to follow the relational model. All these are characteristics of a relational analysis language. Although recent SQL standards define the language to be <a href="http://en.wikipedia.org/wiki/Turing_completeness" target="_blank">Turing Complete</a>, meaning you can implement any algorithm in SQL, in practice implementing any computation that departs from the simple model of sets, joins, groupings, and orderings is severely sub-optimal, in terms of performance or complexity.</p>
<p>On the other hand, an interface like MapReduce is clearly non-relational in terms of its algorithmic and computational capabilities. You have the full flexibility of a procedural programming language, like C or Java; MapReduce intermediate results can follow any form; and the logic of a MapReduce analytical application can implement almost arbitrary formations of code flow and data structures. In addition, any MapReduce computation can be automatically extended to a shared-nothing parallel system, which implies the ability to crunch large amounts of data. So MapReduce is one version of &#8220;non-relational&#8221; analysis.</p>
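<p>To make the contrast concrete, here is a minimal sketch of a MapReduce-style job in plain Python. It is only an illustration under stated assumptions: it runs in a single process, it is not Aster Data SQL-MapReduce or Hadoop code, and the names <code>map_click</code>, <code>reduce_path</code>, <code>run_mapreduce</code> and the sample click records are all hypothetical. The point is that the reduce step runs ordinary procedural code over each group and its output (an ordered list of pages) does not have to be a flat relational row.</p>
<pre>
```python
from collections import defaultdict


def map_click(record):
    # Map step: turn one (user, timestamp, page) record into key/value pairs.
    user, ts, page = record
    yield user, (ts, page)


def reduce_path(user, values):
    # Reduce step: arbitrary procedural logic per key.
    # Here we order the clicks by timestamp and keep the page sequence.
    ordered = sorted(values)
    return user, [page for _, page in ordered]


def run_mapreduce(records, mapper, reducer):
    # Toy single-process driver (an assumption for this sketch):
    # group mapper output by key, then call the reducer once per key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical sample data: (user, timestamp, page)
clicks = [
    ("ann", 3, "checkout"),
    ("bob", 1, "home"),
    ("ann", 1, "home"),
    ("ann", 2, "cart"),
]

paths = run_mapreduce(clicks, map_click, reduce_path)
print(paths)
```
</pre>
<p>In a real shared-nothing system the grouping step would be a shuffle across servers, but the two functions the programmer writes would keep the same shape.</p>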
<p>So Aster Data&#8217;s <a href="http://www.asterdata.com/resources/mapreduce.php">SQL-MapReduce</a> becomes really interesting if you see it as a way of doing non-relational analytics on top of relational data. In Aster Data&#8217;s platform, you can store your data in a purely relational form. By doing that, you can use popular RDBMS mechanisms to achieve things like adherence to a data model, security, compliance, integration with ETL or BI tools, etc. The similarities, however, stop there: you can then use SQL-MapReduce to do analytics that were never possible before in an RDBMS, because they are MapReduce-based and non-relational and they scale to TBs or PBs. That includes a large number of analytical applications like fraud detection, network analysis, graph algorithms, data mining, etc.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.asterdata.com/blog/index.php/2010/07/02/the-concept-of-non-relational-analytics/feed/</wfw:commentRss>
		</item>
		<item>
		<title>In-Memory Data Processing</title>
		<link>http://www.asterdata.com/blog/index.php/2010/06/23/in-memory-data-processing/</link>
		<comments>http://www.asterdata.com/blog/index.php/2010/06/23/in-memory-data-processing/#comments</comments>
		<pubDate>Wed, 23 Jun 2010 23:40:07 +0000</pubDate>
		<dc:creator>Tasso Argyros</dc:creator>
		
		<category><![CDATA[Data-Application Server]]></category>

		<category><![CDATA[Analytics]]></category>

		<category><![CDATA[MapReduce]]></category>

		<category><![CDATA[Database]]></category>

		<guid isPermaLink="false">http://www.asterdata.com/blog/index.php/2010/06/23/in-memory-data-processing/</guid>
		<description><![CDATA[Recently, a journalist called to ask about in-memory data processing, a very interesting subject. I always thought that in-memory processing will be more and more important as memory prices keep falling drastically. In fact, these days you can get 128GB of memory into a single system for less than $5K plus the server cost, not [...]<p><a href="http://sharethis.com/item?&#038;wp=2.2.3&#38;publisher=cf684f86-aafe-420a-8a13-0d8ab76b31c0&#38;title=In-Memory+Data+Processing&#38;url=http%3A%2F%2Fwww.asterdata.com%2Fblog%2Findex.php%2F2010%2F06%2F23%2Fin-memory-data-processing%2F">ShareThis</a></p>]]></description>
			<content:encoded><![CDATA[<p>Recently, a journalist called to ask about in-memory data processing, a very interesting subject. I always thought that in-memory processing will be more and more important as memory prices keep falling drastically. In fact, these days you can get 128GB of memory into a single system for less than $5K plus the server cost, not to mention that DDR3 and multiple memory controllers are giving a huge performance boost. And if you run software that can handle shared-nothing parallelism (MPP), your memory cost increases linearly, and systems with TBs of memory are possible.</p>
<p>So what do you do with all that memory? There are two classes of use cases that are emerging today. First is the case where you need to increase concurrent access to data with reduced latency. Tools like <a href="http://memcached.org/">memcached</a> offer in-memory caching that, used properly, can vastly improve latency and concurrency for large-scale OLTP applications like websites. Also the nice thing with object caching is that it scales well in a distributed way and people have built <a href="http://www.scribd.com/doc/4069180/Caching-Performance-Lessons-from-Facebook">TB-level caches</a>. Memory-only OLTP databases have started to emerge, such as <a href="http://www.voltdb.com">VoltDB</a>. And memory is used implicitly as a very important caching layer in open-source key-value products like <a href="http://project-voldemort.com/">Voldemort</a>. We should only expect memory to play a more and more important role here.</p>
<p>The second way to use memory is to gain &#8220;processing flexibility&#8221; when doing analytics. The idea is to throw your data into memory (however much fits, of course) without spending much time thinking about how to do that or what queries you&#8217;ll need to run. Because memory is so fast, most simple queries will execute at interactive speeds, and concurrency is handled well, too. European upstart <a href="http://www.qlikview.com/">QlikView</a> exploits this fact to offer a memory-only BI solution which provides simple and fast BI reporting. The downside is that it applies to only 10s of GBs of data, as <a href="http://www.dbms2.com/2010/06/12/the-underlying-technology-of-qlikview/">Curt Monash notes</a>.</p>
<p>By exploiting an MPP shared-nothing architecture, Aster Data has production clusters with TBs of total memory. Our software takes advantage of memory in two ways: first, it uses caching aggressively to ensure the most relevant data stays in memory; and when data is in memory, processing is much faster and more flexible. Secondly, MapReduce is a great way to utilize memory as it provides full flexibility to the programmer to use memory-focused data structures for data processing. In addition, Aster Data&#8217;s SQL-MapReduce provides tools to the user to encourage the development of memory-only MapReduce applications.</p>
<p>However, one shouldn&#8217;t fall into the trap of thinking that all analytics will be in-memory anytime soon. While memory is down to $30/GB, disk manufacturers have been busy increasing platter density and dropping their price to less than $0.06/GB. Given that the amount of data in the world grows faster than Moore&#8217;s law and memory, there will always be more data to be stored and analyzed than what fits into any amount of memory that an enterprise can use. In fact, most big data applications will have data sets that do not fit into memory because, while tools like memcached worry only about the present (e.g. current Facebook users), analytics need to worry about the past, as well - and that means much more data. So a multi-layer architecture will be the only cost-effective way of analyzing large amounts of data for some time.</p>
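<p>A quick back-of-the-envelope calculation shows the size of that gap. The sketch below uses the per-GB prices quoted above; the 10 TB data set size and the convention of 1 TB = 1000 GB are assumptions picked purely for illustration, and the function name <code>storage_cost</code> is hypothetical.</p>
<pre>
```python
# Prices quoted in the post (USD per GB).
MEMORY_USD_PER_GB = 30.0
DISK_USD_PER_GB = 0.06


def storage_cost(size_gb, usd_per_gb):
    # Raw media cost only; servers, power and replication are ignored.
    return size_gb * usd_per_gb


# Assumed data set: 10 TB, counted as 1000 GB per TB.
size_gb = 10 * 1000

memory_cost = storage_cost(size_gb, MEMORY_USD_PER_GB)
disk_cost = storage_cost(size_gb, DISK_USD_PER_GB)
ratio = memory_cost / disk_cost

print(f"memory: ${memory_cost:,.0f}")
print(f"disk:   ${disk_cost:,.0f}")
print(f"memory costs about {ratio:.0f} times as much")
```
</pre>
<p>At these prices, holding the assumed 10 TB entirely in memory costs roughly 500 times as much as holding it on disk, which is why keeping only the hot data in memory is the economical choice.</p>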
<p>One shouldn&#8217;t discuss memory without mentioning solid-state disk products (like Aster Data partner company <a href="http://www.fusionio.com/products/iodrive/">Fusion-io</a>). SSDs are likely to be the surprise here, given that their per-GB price is falling faster than that of disks (being a solid-state product that follows Moore&#8217;s law does help). In the next few years we&#8217;ll witness SSDs in read-intensive applications providing similar advantages to memory while accommodating much larger data sizes.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.asterdata.com/blog/index.php/2010/06/23/in-memory-data-processing/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Should Intel love MapReduce?</title>
		<link>http://www.asterdata.com/blog/index.php/2010/06/22/should-intel-love-mapreduce/</link>
		<comments>http://www.asterdata.com/blog/index.php/2010/06/22/should-intel-love-mapreduce/#comments</comments>
		<pubDate>Wed, 23 Jun 2010 05:55:29 +0000</pubDate>
		<dc:creator>Tasso Argyros</dc:creator>
		
		<category><![CDATA[Data-Application Server]]></category>

		<category><![CDATA[MapReduce]]></category>

		<category><![CDATA[Scalability]]></category>

		<guid isPermaLink="false">http://www.asterdata.com/blog/index.php/2010/06/22/should-intel-love-mapreduce/</guid>
		<description><![CDATA[Rumors abound that Intel is &#8220;baking&#8221; the successor of the very successful Nehalem CPU architecture, codenamed Westmere. It comes with an impressive spec: 10 CPU cores (supporting 20 concurrent threads) packed in a single chip. You can soon expect to see 40 cores in middle range 4-socket servers - a number hard to imagine just [...]<p><a href="http://sharethis.com/item?&#038;wp=2.2.3&#38;publisher=cf684f86-aafe-420a-8a13-0d8ab76b31c0&#38;title=Should+Intel+love+MapReduce%3F&#38;url=http%3A%2F%2Fwww.asterdata.com%2Fblog%2Findex.php%2F2010%2F06%2F22%2Fshould-intel-love-mapreduce%2F">ShareThis</a></p>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.pcworld.com/article/199412/intels_westmereex_chip_to_include_10_cores.html">Rumors abound</a> that Intel is &#8220;baking&#8221; the successor of the very successful Nehalem CPU architecture, codenamed Westmere. It comes with an impressive spec: 10 CPU cores (supporting 20 concurrent threads) packed in a single chip. You can soon expect to see <strong>40 cores in middle range 4-socket servers</strong> - a number hard to imagine just five years ago.</p>
<p>We&#8217;re definitely talking about a different era. In the old days, you could barely fit a single core in a chip. (I still remember 15 years ago when I had to buy and install a separate math co-processor on my <a href="http://en.wikipedia.org/wiki/Macintosh_LC">Mac LC</a> to run <a href="http://www.google.com/google-d-s/spreadsheets/">Microsoft Excel</a> and <a href="http://www.wolfram.com/products/mathematica/index.html">Mathematica</a>.) And with the hardware, software has to change, too. In fact, <strong>modern software means software that can handle parallelism</strong>. This is what makes <a href="http://mapreduce.org/">MapReduce </a>such an essential and timely tool for big data applications. MapReduce&#8217;s purpose in life is to simplify data and processing parallelism for big data applications. It gives ample freedom to the programmer on how to do things locally; and takes over when data needs to be communicated across processes/cores/servers, thus evaporating a lot of the parallelism complexity.</p>
<p>Once someone designs their software and data to operate in a parallelized environment using MapReduce, gains will come on multiple levels. Not only will MapReduce help your analytical applications scale across a cluster of servers with terabytes of data, it will also exploit the billions of transistors and the 10s of CPU cores <em>inside </em>each server. The best part: <strong>the programmer doesn&#8217;t need to think about the difference</strong>.</p>
<p>As an example, consider this <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.4156&amp;rep=rep1&amp;type=pdf">great paper</a> out of Stanford, which discusses MapReduce implementations of popular Machine Learning algorithms. The Stanford researchers considered MapReduce as a way of &#8220;porting&#8221; these algorithms (traditionally implemented to run on a single CPU) to a multi-core architecture. But, of course, the same MapReduce implementations can be used to scale these algorithms across a distributed cluster as well.</p>
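<p>Here is a small sketch of that idea in Python, not taken from the paper: a statistic (mean and variance) is written as a map step that summarizes each chunk locally and a reduce step that merges the summaries. The function names and the sample chunks are hypothetical. The same two functions run unchanged with the built-in serial <code>map</code> or with a worker pool; a thread pool stands in for cores or servers here, and in CPython a process pool or a real cluster would be needed for actual CPU parallelism.</p>
<pre>
```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partial_sums(chunk):
    # Map step: each worker summarizes its own chunk locally.
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    # Reduce step: merge two partial summaries into one.
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(chunks, map_fn=map):
    # map_fn decides where the map step runs; the logic does not change.
    n, total, total_sq = reduce(combine, map_fn(partial_sums, chunks))
    mean = total / n
    return mean, total_sq / n - mean * mean


# Hypothetical data, already split into chunks (one per worker).
chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

serial = mean_and_variance(chunks)
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = mean_and_variance(chunks, map_fn=pool.map)

print(serial)
print(parallel)
```
</pre>
<p>Because only the small per-chunk summaries cross worker boundaries, swapping the local pool for a distributed one changes where the map step runs but not the code the programmer wrote.</p>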
<p>Hardware has changed - MPP, shared-nothing, commodity servers, and, of course, multi-core. In this new world MapReduce is software&#8217;s response for big data processing. Intel and Westmere have just found an unexpected friend.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.asterdata.com/blog/index.php/2010/06/22/should-intel-love-mapreduce/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Hadoop Summit Unconference - BigDataCamp</title>
		<link>http://www.asterdata.com/blog/index.php/2010/06/14/hadoop-summit-unconference-bigdatacamp/</link>
		<comments>http://www.asterdata.com/blog/index.php/2010/06/14/hadoop-summit-unconference-bigdatacamp/#comments</comments>
		<pubDate>Mon, 14 Jun 2010 23:32:59 +0000</pubDate>
		<dc:creator>Steve Wooledge</dc:creator>
		
		<category><![CDATA[Statements]]></category>

		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.asterdata.com/blog/index.php/2010/06/14/hadoop-summit-unconference-bigdatacamp/</guid>
		<description><![CDATA[As the market around big data heats up, it&#8217;s great to see the ecosystem for Hadoop, MapReduce, and massively parallel databases expanding. This includes events for education and networking around big data.
As such, Aster Data is co-sponsoring our first official &#8220;unconference&#8221; the night before the 2010 Hadoop Summit. It&#8217;s called BigDataCamp and will be June [...]<p><a href="http://sharethis.com/item?&#038;wp=2.2.3&#38;publisher=cf684f86-aafe-420a-8a13-0d8ab76b31c0&#38;title=Hadoop+Summit+Unconference+-+BigDataCamp&#38;url=http%3A%2F%2Fwww.asterdata.com%2Fblog%2Findex.php%2F2010%2F06%2F14%2Fhadoop-summit-unconference-bigdatacamp%2F">ShareThis</a></p>]]></description>
			<content:encoded><![CDATA[<p>As the market around big data heats up, it&#8217;s great to see the ecosystem for Hadoop, MapReduce, and massively parallel databases expanding. This includes events for education and networking around big data.</p>
<p>As such, Aster Data is co-sponsoring our first official &#8220;unconference&#8221; the night before the 2010 Hadoop Summit. It&#8217;s called <a href="http://www.bigdatacamp.org">BigDataCamp</a> and will be June 28th at the TechMart from 5:00-9:30PM (adjacent to the Hyatt where Hadoop Summit is taking place). Similar to our <a href="http://www.scaleunlimited.com/events/scale_camp">ScaleCamp event last year</a> where we heard from companies like LinkedIn and ShareThis and industry practitioners like Chris Wensel (author of <a href="http://www.cascading.org/">Cascading</a>), there will be a lineup of great talks, including hands-on workshops led by Amazon Web Services, Karmasphere, and more. In addition, we&#8217;re lucky to have <a href="http://www.linkedin.com/in/dnielsen">Dave Nielsen</a> as the moderator/organizer of the event as he&#8217;s chaired similar unconferences such as CloudCamp, and is an expert at facilitating content and discussions to best fit attendee interest.</p>
<p>It&#8217;s very fitting to have the more open/dynamic agenda style of an unconference given the audience will be more of the &#8220;analytic scientists&#8221; - a title which I&#8217;ve seen LinkedIn use when <a href="http://www.scaleunlimited.com/wp-content/uploads/2009/06/scalecamp-scaleunlimited-9june09_patil.pdf">describing the rise in job roles dedicated to tackling big data</a> in companies to tease out insights and develop data-driven products and applications. The analytic scientist-customers I speak with who use Aster Data together with Hadoop challenge the norms and move quickly - not unlike an unconference agenda. I expect a night of free thinking (and free drinks/food), big ideas, and a practical look at emerging technologies and techniques to tackle big data. Best of all, the networking portion is a great chance to meet folks to hear what they&#8217;re up to and exchange ideas.</p>
<p>Check out the agenda at <a href="http://www.bigdatacamp.org">www.bigdatacamp.org</a> and note that seats are limited and we expect to sell out, so please REGISTER NOW. Hope to see you there!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.asterdata.com/blog/index.php/2010/06/14/hadoop-summit-unconference-bigdatacamp/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
