As the Big Data phenomenon continues to gather momentum, more and more organizations are starting to recognize the unexploited value in the vast amounts of data they hold. According to IDC, the Big Data technology and services market will grow to about $17 billion by 2015, a growth rate roughly seven times that of the overall IT market.
Despite the strong potential commercial advantage for business, developing an effective strategy to cope with existing and previously unexplored information could prove tough for many enterprises.
In many ways, this is because the term ‘Big Data’ itself is somewhat misleading. One definition is in terms of terabytes and petabytes of information that common database software tools cannot capture, manage and process within an acceptable amount of time. In reality, data volume is just one aspect of the discussion and arguably the most straightforward issue that needs to be addressed.
As Gartner points out, ‘the complexity, variety and velocity with which it is delivered combine to amplify the problem substantially beyond the simple issues of volume implied by the popular term Big Data.’ For this reason, ‘big’ really depends on the starting point and the size of the organization.
With so much being written about Big Data these days, it can prove difficult for enterprises to implement strategies that deliver on the promise of Big Data Analytics. For example, I have read many online articles equating "MapReduce" with "Hadoop" and "Hadoop" with "Big Data".
MapReduce is, of course, a programming model that enables complex processing logic expressed in Java and other programming languages to be parallelised efficiently, thus permitting its execution on "shared nothing", scale-out hardware architectures. Hadoop is one implementation of the MapReduce programming model. There are other implementations of the MapReduce model – and there are other approaches to parallel processing that are a better fit for many classes of analytic problems. However, we rarely see these alternatives discussed.
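To make the distinction between the programming model and any one implementation of it more concrete, here is a minimal, single-process sketch of the MapReduce pattern in Python. It is an illustration only and is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence inside one process."""
    pairs = (pair for record in records for pair in map_phase(record))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Hypothetical input: each string stands in for one record of raw data.
    sample = ["big data is big", "data is data"]
    print(run_job(sample))
```

Each call to `map_phase` depends only on its own record, and each call to `reduce_phase` depends only on the values for its own key. That independence is what allows an implementation such as Hadoop to spread the calls across the nodes of a "shared nothing" cluster; this sketch simply runs them one after another.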
Another interesting assertion that I read, and that customers new to Hadoop sometimes put to me, is that Hadoop is an alternative to existing SQL-based technologies and is likely to displace – or even entirely replace – them. This can often lead to an interesting discussion, but in summary Hadoop lacks important capabilities found in a mature and sophisticated data warehouse RDBMS, for example: query re-write and cost-based query optimization; mixed-workload management; security, availability and recoverability features; and support for transactions.
There is, of course, a whole ecosystem springing up around Hadoop – including HBase, Hive, Mahout and ZooKeeper, to name just four – and some commentators argue that in time these technologies may extend Hadoop to the point where this ecosystem could provide an alternative to existing Data Warehouse DBMS technology. Possibly, but I would suggest that they have a long and arduous path to reach such a goal.
None of which is to say that Hadoop is not an extremely interesting and promising new technology – because clearly it is, and it has a role as enterprises embrace Big Data Analytics. There is evidence today from leading e-business companies that Hadoop scales well - and has a unit-cost-of-storage that will increasingly make it possible for organizations to "remember everything", by enabling them to retain data whose value for analytics is as yet unproven.
Hadoop may become the processing infrastructure that enables us to process raw, multi-structured data and move it into a "Big Analytic" environment - like Teradata-Aster - that can more efficiently support high-performance, high-concurrency manipulation of the data, whilst also providing for improved usability and manageability, so that we can bring this data to a wider audience. The final stage in this “Big Data value chain” will then see us move the insights derived from the processing of the raw multi-structured data in these "upstream" environments into the Data Warehouse, where they can most easily and most efficiently be combined with other data - and shared with the entire organization, in order to maximize business value.
Teradata continues to invest in partnerships with leading Hadoop distributors Cloudera and Hortonworks - and to develop and enhance integration technology between these environments and the Teradata and Teradata-Aster platforms.
The fact that Big Data is discovery-oriented, together with its relative immaturity compared with traditional analytics, arguably means that it doesn’t sit well within the IT department, because requirements can never be fully defined in advance. Neither should it logically fall to business analysts who are used to traditional BI tools.
As a result, a new role has emerged for data scientists, who are not technologists but are also not afraid of leveraging technology. Rather than seeking an answer to a business question, this new professional is more concerned with what the question should be. The data scientist will look for new insights from data and will use it as a visualization tool, not a reporting tool.
In future, many believe that having this type of individual on staff will also be key to generating maximum value from Big Data. In the meantime, the onus will invariably fall to the CIO to prepare for, and act on, a changing Big Data landscape.
Customers can be assured that Teradata will continue to be their #1 strategic advisor for their data management and analytics. We continue to provide compelling and innovative solutions with Teradata Aster and Teradata IDW appliances. We will also work with best-in-class partners to provide choices in integrated solutions and reference architectures to help customers maintain competitive advantage with their data.