Archive for May, 2011

Handling dependencies and configuration in Java + Hadoop projects efficiently

When working on Java Hadoop projects it is common to face the situation where you have to add a large number of JAR dependencies that will be needed by all of the tasks executed on any node of the cluster (Mappers, Reducers…). Synchronizing our libraries with a Hadoop cluster is not as simple as synchronizing them in a standard Java project.

There are three possible solutions to this scenario worth mentioning. First we will present the two we advise against, together with their drawbacks; finally, we will talk about the one we recommend. Read more…
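Without anticipating which of the three solutions the full post settles on, the two mechanisms Hadoop itself provides are worth a quick sketch here; the JAR names and paths below are purely illustrative:

```shell
# 1) Bundle dependencies inside the job JAR under a lib/ directory;
#    Hadoop unpacks the job JAR on each task node and adds lib/*.jar
#    to the task classpath automatically. A job JAR laid out this way
#    would list, e.g.:
#      com/example/MyJob.class
#      lib/commons-lang.jar
jar tf myjob.jar

# 2) Pass dependencies at submission time with -libjars (the job's
#    main class must parse its arguments with GenericOptionsParser,
#    e.g. by running through ToolRunner):
hadoop jar myjob.jar com.example.MyJob \
    -libjars deps/commons-lang.jar,deps/guava.jar \
    input/ output/
```

Both mechanisms ship the JARs to every node for you; the trade-offs between them (and a third option) are what the post discusses.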

Massive data processing with Hive: US flight history analysis

Analyzing and extracting value from large amounts of data, a task traditionally associated with the relational database world, has always been a big challenge. Hadoop, Hive and cloud computing services come to the rescue, offering a cost-effective solution for “Big Data” analysis.

In this post we will show an example of parallel data processing using Hive to analyze a 40 GB database; each query will cost us only about $0.37. We will use the “On-Time” database from “TranStats” (available at this link), which contains information about all the flights in the USA from 1988 to 2008, spread out over a total of 116 million records in 240 CSV files. We will process these data with Hadoop using Hive, a high-level interface that allows us to execute Map-Reduce jobs from a sequence of SQL-like commands. Thanks to this, we will perform a rapid, comprehensive analysis of these data in a simple, scalable and affordable way. At the end of the post we will specify the time and cost involved in executing this analysis. Read more…
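As a taste of what such an analysis looks like, here is a sketch in Hive’s SQL-like language; the table name, columns and location are hypothetical, not the actual schema used in the post:

```sql
-- Declare the CSV files (e.g. already uploaded to S3) as an external
-- Hive table; no data is copied, Hive just maps the files to columns.
-- The schema below is illustrative, not the real On-Time layout.
CREATE EXTERNAL TABLE ontime (
  year INT, month INT, carrier STRING, dep_delay INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/ontime/';

-- A query like this is compiled by Hive into a Map-Reduce job and
-- executed in parallel over all the files in the table's location.
SELECT carrier, AVG(dep_delay) AS avg_delay
FROM ontime
GROUP BY carrier;
```

This is exactly the appeal of Hive: the analyst writes familiar SQL-like statements and Hadoop takes care of distributing the work.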

Replacing your database with Hadoop

Hadoop is a good alternative for achieving scalability in cases where data freshness is not a key requirement, and can even make it possible to dispense with a database altogether. In this post we will explain when this solution is advisable and discuss the main characteristics of its architecture. Read more…

Welcome to the Datasalt blog

Today we are celebrating the opening of the Datasalt blog and its Spanish version. Let’s start by talking a little bit about Datasalt and what we do.

Firstly, we need to discuss today’s context. Information is currently experiencing an explosion. New devices equipped with sensors (mobile phones, antennas, telescopes, cameras, etc.) are constantly being created, generating massive amounts of data. Moreover, companies strive to keep track of all the interactions and operations they have with their customers, monitoring their behavior as closely as possible. Clicks on the web, as well as all the transactions and purchases performed by customers or users, are recorded. In short, companies have more data than ever, and they are aware that they must use that information to become more competitive.

However, the usual data analysis (Data Mining) techniques, such as Enterprise Data Warehousing (EDW) and Business Intelligence (BI), are inadequate for analyzing and extracting value from such enormous amounts of data. First, they are unable to process all of the data, making it necessary to resort to “sampling” in order to reduce the size of the problem. Second, these solutions are efficient at processing structured information, but not unstructured information, and most of the data generated today is unstructured.

This has given rise to the concept of Big Data, which refers to data that, due to their size and nature, are beyond the scope of the usual techniques of data analysis (EDW and BI). Companies have a growing need to extract value from information to remain competitive. This is the reason why interest in Big Data is growing dramatically, as shown in the articles published in a special report entitled “Data, data everywhere” that The Economist recently devoted to this issue (the one entitled “A different game” is also interesting).

New techniques have emerged that are capable of exploiting Big Data and transforming it into value for companies. Among them we should highlight Hadoop, a platform for distributed data analysis, capable of processing large amounts of data across a set of machines working in parallel. Also worth noting are distributed NoSQL databases, which trade some of the guarantees of a traditional RDBMS for the ability to handle larger amounts of data (hence the name “NoSQL”). Here we could highlight Cassandra, HBase and MongoDB. Solr provides a search engine that is also very useful. All these techniques pair perfectly with the new cloud computing platforms (Amazon, Rackspace, etc.). If you want to find out more about this subject, you can check the Gartner report entitled “Hadoop and MapReduce: Big Data Analytics.”

And this is where Datasalt comes in, a company specializing in the Big Data field, developing new products and providing services. We offer solutions for extracting value (Data Mining) from large data sets, such as records of customer interactions, logs generated by applications, information captured from the web and social networks, data from mobile devices, etc. We also build systems for aggregation and search over large data sets.

We’ll be publishing news, methods and events related to Big Data and Datasalt in this blog. See you soon!