Archive for July, 2011

Hadoop + Avro

In Hadoop projects that handle large amounts of data, the serialization format we use to process and store that data can be a key decision in optimizing the time and cost of our pipeline.

We can serialize our data in different ways:

Format | Main advantage | Main disadvantage
--- | --- | ---
Plain text (for example, tab-separated fields) | Easily readable by humans. | Only one nesting level: arrays are expressed as comma-separated element lists.
JSON | Flexible, multi-level schema. | Wastes storage space: each field's name is written alongside its value in every record.
Binary: Hadoop serialization (Writables) | More efficient in both space and speed. | Not flexible (not “backwards compatible”): a change in the format makes it impossible to read data written before the change.
Binary: Thrift, Avro or similar APIs | More efficient in both space and speed. Flexible. | Which one to choose?
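
To make the space trade-off in the comparison above concrete, here is a minimal Python sketch that encodes one small record three ways and compares the sizes. The record, its field names, and the fixed binary layout are all invented for illustration. The `struct`-based encoding is only a stand-in for a compact binary format; it is not how Writables, Thrift or Avro actually lay out bytes.

```python
import json
import struct

# Hypothetical record, invented for this illustration.
record = {"user_id": 1234567, "country": "ES", "clicks": 42}

# Assumed fixed layout: big-endian unsigned 32-bit int, 2-byte country code,
# unsigned 16-bit int. The layout lives in the code, not in the data.
BINARY_LAYOUT = ">I2sH"


def to_tsv(rec):
    """Plain text: field values only, separated by tabs."""
    fields = (rec["user_id"], rec["country"], rec["clicks"])
    return "\t".join(str(f) for f in fields).encode("utf-8")


def to_json(rec):
    """JSON: every field name is repeated in every record."""
    return json.dumps(rec, separators=(",", ":")).encode("utf-8")


def to_binary(rec):
    """Fixed-layout binary: no field names, no separators."""
    return struct.pack(
        BINARY_LAYOUT,
        rec["user_id"],
        rec["country"].encode("ascii"),
        rec["clicks"],
    )


def from_binary(data):
    """Decoding only works if the reader knows the exact layout used to write."""
    user_id, country, clicks = struct.unpack(BINARY_LAYOUT, data)
    return {"user_id": user_id, "country": country.decode("ascii"), "clicks": clicks}


if __name__ == "__main__":
    for name, encode in (("tsv", to_tsv), ("json", to_json), ("binary", to_binary)):
        print(name, len(encode(record)), "bytes")
```

For this particular record the binary encoding is the smallest and JSON the largest, because JSON carries the field names with every record. The sketch also hints at the rigidity of a hand-rolled binary layout: if a field is added or a width changes, `from_binary` can no longer read previously written data unless the old layout is kept around.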
