How do we integrate Hadoop with SAP HANA?

Big data

It is advisable, and economically sensible, to separate business data processing from mass data processing and thus to entrust each to the specialists of the respective discipline. For many companies, a well-chosen mix of the high-performance SAP HANA database and a solid Hadoop platform can open up completely new possibilities in real-time analytics and at the same time save enormous costs. The announcement of "SAP HANA Vora" underpins this ideal constellation: the tool provides an even deeper integration between the in-memory data platform SAP HANA and the big data component Hadoop.

The highlight: distributed processing of the data

A major advantage over other systems is that Hadoop does not rely on expensive proprietary hardware for storing and processing data. The benefit of the distributed file system carries over to distributed processing of that data, and both can be scaled almost without limit on inexpensive commodity servers: an ideal prerequisite for coping with the steadily growing flood of data.

"Hadooponomics": Numbers speak for themselves

Hadoop is not just an option, it is essential for big data scenarios, according to the market research company Forrester Research. To emphasize the financial benefits of the open source software, Forrester analysts coined the term "Hadooponomics". Indeed, the numbers speak for themselves: according to Forrester, the major Hadoop distributions cost between $2,000 and $3,000 per node per year. A HANA node, on the other hand, costs around $750,000 per year.

A well-known company in the UK compared conventional data storage with the estimated cost of using Hadoop. A terabyte in an Oracle database would cost around £35,000 a year. By contrast, the company calculated that storing the same amount of data in Hadoop would cost £1,120 per year. In view of this immense cost difference, it makes economic sense to process only the most valuable and most frequently used data in SAP HANA and to keep the remaining data in Hadoop.
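A quick back-of-the-envelope calculation with the figures quoted above makes the gap tangible. In the sketch below, the 200 TB data volume is an assumed example, not a figure from the study.

```python
# Back-of-the-envelope storage cost comparison using the figures quoted above.
# The 200 TB data volume is an arbitrary example, not a figure from the article.
ORACLE_COST_PER_TB_YEAR = 35_000   # GBP per terabyte per year (quoted estimate)
HADOOP_COST_PER_TB_YEAR = 1_120    # GBP per terabyte per year (quoted estimate)

def yearly_storage_cost(terabytes: float, cost_per_tb: float) -> float:
    """Yearly storage cost for a given data volume."""
    return terabytes * cost_per_tb

volume_tb = 200  # assumed example volume
oracle = yearly_storage_cost(volume_tb, ORACLE_COST_PER_TB_YEAR)
hadoop = yearly_storage_cost(volume_tb, HADOOP_COST_PER_TB_YEAR)
print(f"Oracle: £{oracle:,.0f}/year, Hadoop: £{hadoop:,.0f}/year, "
      f"factor: {oracle / hadoop:.0f}x")
```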

Data offload lowers HANA costs

With the help of such a data offload, SAP HANA costs remain constant despite growing data volumes, while the offloaded data remains accessible without having to be reloaded. The administrative costs for simply storing and processing content are also very low with Hadoop. In addition, Hadoop allows cost-effective analytical procedures to be applied to the data.
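As a rough illustration of what such an offload could look like technically, the following PySpark sketch copies "cold" rows from a HANA table into Parquet files on HDFS. Host names, schema, table and column names and the age threshold are placeholder assumptions, and the SAP HANA JDBC driver (ngdbc.jar) would have to be on the Spark classpath.

```python
# Sketch: offload "cold" rows from SAP HANA into Parquet files on HDFS with Spark.
# Host names, schema/table/column names and the date threshold are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hana-cold-data-offload").getOrCreate()

cold_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sap://hana-host:30015")       # assumed HANA host/port
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("dbtable",
            "(SELECT * FROM SALES.ORDERS "             # placeholder schema/table
            "WHERE ORDER_DATE < '2014-01-01') AS cold") # "cold" data criterion
    .option("user", "OFFLOAD_USER")
    .option("password", "***")
    .load()
)

# Store the cold data cheaply in the Hadoop cluster; it stays queryable via
# Hive/Spark and, through Smart Data Access, from SAP HANA as well.
cold_rows.write.mode("append").parquet("hdfs://namenode:8020/offload/orders_cold")
```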

Using Smart Data Access (SDA), such offloaded data can still be accessed from SAP HANA, and this will become even more convenient in the future. The first versions of the "Hadoop Relocation Agent" are integrated into the "SAP HANA Data Lifecycle Manager" tool (DLM) as of SAP HANA SPS10, so manually programmed and scheduled procedures for relocating data to Hadoop should soon be a thing of the past. The recently announced "SAP HANA Vora" further supports analytics performance by enabling main-memory-based query execution within the Apache Spark framework and adding new functions.
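The SDA side might look roughly like the following sketch, here issued through the SAP HANA Python client (hdbcli): a remote source pointing at the Hadoop cluster and a virtual table that merely references a remote object. Adapter name, DSN, credentials and the remote object path are placeholders that depend on the Hadoop distribution and ODBC setup; the exact syntax should be checked against the SDA documentation.

```python
# Sketch: register a Hadoop/Hive system as an SDA remote source and expose one
# of its tables as a virtual table in SAP HANA. Adapter name, DSN, credentials
# and the remote object path are placeholders.
from hdbcli import dbapi

conn = dbapi.connect(address="hana-host", port=30015,
                     user="SDA_ADMIN", password="***")
cur = conn.cursor()

# Remote source pointing at the Hadoop cluster (here via a Hive ODBC DSN).
cur.execute("""
    CREATE REMOTE SOURCE "HADOOP_SRC" ADAPTER "hiveodbc"
    CONFIGURATION 'DSN=HIVE_CLUSTER'
    WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=***'
""")

# Virtual table: no data is copied; HANA only stores metadata and delegates
# queries to Hadoop at runtime.
cur.execute("""
    CREATE VIRTUAL TABLE "SALES"."VT_ORDERS_COLD"
    AT "HADOOP_SRC"."<NULL>"."default"."orders_cold"
""")
cur.close()
conn.close()
```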

All information can be linked to corporate data

The huge amounts of data that arise from current topics such as the "Internet of Things" (IoT) and Industry 4.0, but also from classic big data sources (web, social media, mobile apps), can not only be fully captured and indexed with the help of Hadoop; they can also be linked with corporate data via Smart Data Access (SDA) using the linking mechanisms of SAP HANA technology - for example data from sensors, networks and machines as well as unstructured information from texts, social media, mailboxes and SharePoint sites, or video and audio content. In addition, compliance requirements can be met using suitable access mechanisms.

Securing previous investments

Companies that already work in Hadoop environments, where they have gained experience with a wide variety of data formats and with MapReduce and have already set up data lakes, can connect these environments directly to SAP HANA and thereby significantly increase the number of use cases they can implement. Investments made so far thus remain productive. With Smart Data Access, SAP HANA can connect not only Hadoop but also all common database formats for data warehouses, bringing the "logical data warehouse" or "distributed data warehouse" one step closer and resulting in a hybrid architecture.

Data virtualization: first integrate, then modernize

Existing data warehouses and Hadoop environments can be linked virtually in SAP HANA, thereby enabling the creation of a uniform access layer for applications. Regardless of which technologies are linked to SAP HANA via Smart Data Access, from the perspective of SAP HANA, all tables involved are viewed as separate virtual tables and can be addressed and connected with standard SQL. Smart Data Access thus offers data virtualization.
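In practice, "addressed with standard SQL" means that a virtual table can be joined with native HANA tables like any other table. A minimal sketch, continuing the placeholder names from the previous examples:

```python
# Sketch: a virtual table behaves like any other table from SQL's point of
# view and can be joined with "hot" data that lives natively in SAP HANA.
# Table and column names continue the placeholder examples used above.
from hdbcli import dbapi

conn = dbapi.connect(address="hana-host", port=30015,
                     user="REPORTING", password="***")
cur = conn.cursor()
cur.execute("""
    SELECT c.CUSTOMER_ID, SUM(o.AMOUNT) AS LIFETIME_REVENUE
    FROM   "SALES"."CUSTOMERS"      AS c   -- physical HANA table
    JOIN   "SALES"."VT_ORDERS_COLD" AS o   -- virtual table backed by Hadoop
           ON o.CUSTOMER_ID = c.CUSTOMER_ID
    GROUP BY c.CUSTOMER_ID
""")
for customer_id, revenue in cur.fetchall():
    print(customer_id, revenue)
cur.close()
conn.close()
```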

It is advisable to consider, step by step and application area by application area, whether modernization and possibly a direct relocation to SAP HANA make sense. It is also worth assessing whether the performance gained represents a decisive advantage or enables new business opportunities. The advantage of a distributed data warehouse based on SAP HANA: the data is already virtually integrated in SAP HANA and the applications already access it there.

The virtual table, which uses Smart Data Access to access the underlying technology, now only needs to be converted into a physical table in SAP HANA. The application access remains the same. This makes it possible to gradually modernize important applications and, in the long term, to disconnect application parts that are no longer required.
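A minimal sketch of such a conversion, again with the placeholder names used above: the remote content is materialized once as a column table and then takes over the old name, so applications are unaffected. Whether a rename, a synonym or a view is used for the switch is a design choice that depends on local conventions and the HANA release.

```python
# Sketch: convert the virtual table into a physical SAP HANA column table.
# The data is copied once from Hadoop; afterwards the virtual table is dropped
# and the physical table takes over its name, so applications keep using the
# same identifier. Names are the placeholders used above.
from hdbcli import dbapi

conn = dbapi.connect(address="hana-host", port=30015,
                     user="SDA_ADMIN", password="***")
cur = conn.cursor()

# 1. Materialize the remote content as a native column table.
cur.execute("""
    CREATE COLUMN TABLE "SALES"."ORDERS_COLD_LOCAL"
    AS (SELECT * FROM "SALES"."VT_ORDERS_COLD")
""")

# 2. Swap the names so existing applications are unaffected.
cur.execute('DROP TABLE "SALES"."VT_ORDERS_COLD"')
cur.execute('RENAME TABLE "SALES"."ORDERS_COLD_LOCAL" TO "VT_ORDERS_COLD"')
cur.close()
conn.close()
```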

Real-time evaluation enables new fields of application

A big data platform with high processing speed for the distributed execution of analytical algorithms over large amounts of data in all structures makes it possible to create analytical applications in a data-integrated environment and make them even more valuable. Demanding "human information" - for example video, audio, contextual meaning or multilingual content - makes a further addition to this framework useful, for example HP Autonomy. Available connectors and project accelerators help to capture, integrate and process a wide variety of internal and external data as quickly as possible and ensure faster implementation of social media analytics or the analysis of other unstructured data. Especially in analytics, speed is a differentiating factor, which makes linking SAP HANA with the in-memory-based Apache Spark framework all the more important.
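As a simple illustration of distributed execution of an analytical algorithm over data in all structures, the following PySpark sketch aggregates raw sensor events stored in the Hadoop cluster; paths and column names are assumptions for illustration only.

```python
# Sketch: a distributed analysis over raw sensor events stored in Hadoop.
# Paths and column names are assumptions for illustration only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sensor-analytics").getOrCreate()

events = spark.read.parquet("hdfs://namenode:8020/iot/sensor_events")

# Hourly aggregates per machine and sensor, computed in parallel across the
# cluster; the much smaller result can be pushed to SAP HANA or queried from
# there via Smart Data Access / Vora.
hourly = (
    events
    .withColumn("hour", F.date_trunc("hour", F.col("event_time")))
    .groupBy("machine_id", "sensor_id", "hour")
    .agg(F.avg("value").alias("avg_value"),
         F.max("value").alias("max_value"),
         F.count("*").alias("readings"))
)
hourly.write.mode("overwrite").parquet("hdfs://namenode:8020/iot/hourly_aggregates")
```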

Try it out

The fields of application for analytics are numerous and highly innovative. However, a lot must first be tested before it can be put into practice, as an example from the area of machine data shows. A machine or device manufacturer has to install many sensors in order to support the best possible use of its products. Since these devices are operated by end customers, the data often has to be transmitted to the manufacturer over mobile networks using built-in SIM cards. This incurs data transmission costs, so it is important to optimize the amount of data to be transmitted.

  1. Manufacturers' IoT products and strategies
    Almost every large IT manufacturer is positioning itself in the future market of the Internet of Things (IoT). Sometimes the approach to the market is coherent, sometimes smoke screens are thrown up and existing products are simply relabeled. We give an overview of the strategies of the most important players.
  2. Microsoft
    Like over 200 other companies, the software group was until recently a member of the AllSeen Alliance initiated by Qualcomm and has now switched to the newly formed Open Connectivity Foundation. The foundation's goal is to develop a single specification, or at least a common set of protocols and projects, for all types of IoT devices.
  3. Microsoft
    On the client side, Windows 10 IoT Core acts as a possible operating system for industrial devices. The example shows a robot kit.
  4. Microsoft
    Microsoft provides the Azure IoT suite as a cloud platform. This already contains some preconfigured solutions for common Internet of Things scenarios. The portfolio is expanded with the acquisition of the Italian IoT start-up Solair.
  5. Amazon
    With AWS Greengrass, the portfolio extends into the edge area. IoT devices can react to local events and act locally on the data they generate, while the cloud continues to be used for management, analysis and permanent storage.
  6. IBM
    In March 2015, Big Blue announced that it would invest around three billion dollars in the development of an IoT division over the next four years. It should be located within the IBM Analytics division. IBM wants to develop new products and services here. In the course of this, the "IBM IoT Cloud Open Platform for Industries" was announced, on which customers and partners can design and implement industry-specific IoT solutions.
  7. Intel
    Although Intel is already well positioned for the age of wearables and IoT on the device side with its single-board computers "Galileo" and "Edison", the company wants a bigger piece of the pie. "The Internet of Things is an end-to-end topic," said Doug Fisher, vice president and general manager of Intel's Software and Services Group, when the IoT strategy was announced six months ago. Its core component is therefore a gateway reference design that can collect, process and translate data from sensors and other networked IoT devices.
  8. Intel
    At the center of the chip manufacturer's IoT strategy is a new generation of the "Intel IoT Gateway". Based on the IoT platform, Intel offers a roadmap for integrated hardware and software solutions. It includes API management, software services, data analytics, cloud connectivity, intelligent gateways and a product line of scalable processors with Intel architecture. Another important part of the roadmap is IT security.
  9. SAP
    The SAP IoT platform "HANA Cloud Platform for IoT" is an IoT version of the HANA Cloud Platform that has been expanded to include software for connecting and managing devices as well as data integration and analysis. The edition is integrated with SAP's already presented IoT solutions "SAP Predictive Maintenance and Service", "SAP Connected Logistics" and "Connected Manufacturing".
  10. Hewlett-Packard
    At the end of February 2015, HP presented its "HP Internet of Things Platform". The company is targeting "communications service providers" who are to be enabled to create "smart device ecosystems" - that is, to manage large amounts of networked products and end devices in their networks and to analyze the resulting data.
  11. PTC
    With the takeover of ThingWorx at the beginning of last year, the American software provider PTC joined the ranks of the most promising Internet of Things providers. With "ThingWorx", the company offers a platform for developing and deploying IoT applications in companies.

But which data is the most important to transmit in order to create new opportunities in the area of preventive maintenance? Together with the manufacturer, HP first examined the data in an offline scenario covering several months' worth of records. Data scientists from the HP Global Analytics division analyzed all of the data and determined the most important parameters. This gives the machine manufacturer the option of adapting the data exchange from the vehicle and transmitting only the most important data.
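In outline, such an offline analysis could rank the sensor channels by how strongly they relate to the failures observed in the historical data and keep only the top channels for transmission. The sketch below shows one very simple way to do this in Python; column names, the labeled data set and the correlation-based scoring are assumptions, not HP's published method.

```python
# Sketch: rank sensor channels by their relevance to observed failures in the
# historical offline data set, to decide which channels are worth transmitting
# over the mobile link. Column names and the labeled data are assumptions.
import pandas as pd

# One row per time window: sensor readings plus a 0/1 flag indicating whether
# a failure occurred within the following maintenance horizon.
history = pd.read_parquet("machine_history_labeled.parquet")

sensor_cols = [c for c in history.columns if c.startswith("sensor_")]

# Absolute correlation with the failure flag as a simple first-pass relevance
# score; more elaborate feature-importance methods could replace this.
relevance = (
    history[sensor_cols]
    .corrwith(history["failure_within_horizon"])
    .abs()
    .sort_values(ascending=False)
)

TOP_N = 10  # assumed transmission budget
print("Channels worth transmitting:", list(relevance.head(TOP_N).index))
```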

In addition, the manufacturer can offer its customers improved service through the early detection of problems. Appropriate maintenance windows can be identified together with the customer, thereby avoiding costly downtimes for the end customer. The warranty costs for the manufacturer are also reduced if it is found that the machine or vehicle is being misused, for example through constant overloading.