As the Big Data
phenomenon continues to gather momentum, more and more organizations are
starting to recognize the unexploited value in the vast amounts of data they
hold. According to IDC, the Big Data technology and services market will grow to about $17 billion by 2015, expanding at roughly seven times the growth rate of the overall IT market.
Despite the strong potential for commercial advantage, developing an effective strategy to cope with both existing and previously unexplored information could prove tough for many enterprises.
In many ways, this is because the term ‘Big Data’ itself is somewhat misleading. One definition frames it in terms of the terabytes and petabytes of information that common database software tools cannot capture, manage and process within an acceptable amount of time. In reality, data volume is just one aspect of the discussion, and arguably the most straightforward issue to address.
As Gartner points out:
‘the complexity, variety and velocity with which it is delivered combine to
amplify the problem substantially beyond the simple issues of volume implied by
the popular term Big Data.’ For this reason, ‘big’ really depends on the
starting point and the size of the organization.
With so much being written about Big Data
these days, it can prove difficult for enterprises to implement strategies that
deliver on the promise of Big Data Analytics. For example, I have read many
online articles equating "MapReduce" with "Hadoop" and
"Hadoop" with "Big Data".
MapReduce is, of course, a programming model that enables complex processing logic, expressed in Java and other programming languages, to be parallelised efficiently, permitting its execution on "shared nothing", scale-out hardware architectures. Hadoop is one implementation of the MapReduce programming model. There are other implementations of the MapReduce model, and there are other approaches to parallel processing, some of which are a better fit for many classes of analytic problems. However, we rarely see these alternatives discussed.
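To make the distinction concrete, here is a minimal word-count sketch written against the Hadoop MapReduce Java API. The map phase emits a count of one for each word in its input split, the framework shuffles and groups the intermediate pairs by key, and the reduce phase sums the counts. The input and output paths are purely illustrative assumptions, not a production recipe.

// Minimal word-count sketch using the Hadoop MapReduce Java API.
// Paths and job settings here are illustrative assumptions.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // hypothetical input, e.g. /data/raw
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // hypothetical output, e.g. /data/counts
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The developer writes only the map and reduce functions; the framework handles input splitting, shuffling, scheduling and recovery from node failures, which is what makes the model attractive on shared-nothing, scale-out hardware.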
Another interesting assertion I read, and am sometimes confronted with by customers new to Hadoop, is the positioning of Hadoop as an alternative to existing SQL-based technologies, one that is likely to displace, or even entirely replace, them. This often leads to an interesting discussion, but the short summary is that Hadoop lacks important capabilities found in a mature and sophisticated data warehouse RDBMS, for example: query re-write and cost-based query optimization; mixed-workload management; security, availability and recoverability features; support for transactions; and so on.
There is, of course, a whole ecosystem springing up around Hadoop, including HBase, Hive, Mahout and ZooKeeper, to name just four, and some commentators argue that in time these technologies may extend Hadoop to the point where this ecosystem could provide an alternative to existing Data Warehouse DBMS technology. Possibly, but I would suggest that they have a long and arduous path to reach such a goal.
None of which is to say that Hadoop is not an extremely interesting and promising new technology, because clearly it is, and it has a role to play as enterprises embrace Big Data Analytics. There is evidence today, from leading e-business companies, that Hadoop scales well and has a unit cost of storage that will increasingly make it possible for organizations to "remember everything", by enabling them to retain data whose value for analytics is as yet unproven.
Hadoop may become the processing infrastructure that enables us to process raw, multi-structured data and move it into a "Big Analytic" environment, like Teradata Aster, that can more efficiently support high-performance, high-concurrency manipulation of the data, whilst also providing improved usability and manageability, so that we can bring this data to a wider audience. The final stage in this “Big Data value chain” will then see us move the insights derived from processing the raw, multi-structured data in these "upstream" environments into the Data Warehouse, where they can most easily and most efficiently be combined with other data and shared with the entire organization, in order to maximize business value.
Teradata continues to invest in partnerships with the leading Hadoop distributors Cloudera and Hortonworks, and to develop and enhance integration technology between these environments and the Teradata and Teradata Aster platforms.
The fact that Big Data is discovery-oriented, and relatively immature compared with traditional analytics, arguably means that it doesn’t sit well within the IT department, because requirements can never be fully defined in advance. Nor should it logically fall to business analysts used to working with traditional BI tools.
As a result, a new
role has emerged for data scientists, who are not technologists but are also
not afraid of leveraging technology. Rather than seeking an answer to a
business question, this new professional is more concerned with what the
question should be. The data scientist will look for new insights in the data, using the technology as a visualization tool rather than a reporting tool.
In the future, many believe that having this type of individual on staff will also be key to generating maximum value from Big Data. In the meantime, the onus will invariably fall to the CIO to prepare for, and act on, a changing Big Data landscape.
Customers can be assured that Teradata will continue to be their #1 strategic advisor for data management and analytics. We continue to provide compelling and innovative
solutions with Teradata Aster and Teradata IDW appliances. We will also work
with best-in-class partners to provide choices in integrated solutions and
reference architectures to help customers maintain competitive advantage with
their data.
What is certain is that interesting times lie ahead, and that those enterprises that can successfully execute a Big Data strategy will gain competitive advantage from the valuable insights delivered by Big Data Analytics.