In Hadoop projects that have to handle large amounts of data, the serialization we use to process and store these data can be a key decision in optimizing the time and cost of our pipeline.
We can serialize our data in different ways:
|Format||Main advantage||Main disadvantage|
|In plain text, using tab-separated fields, for example.||Data is easily readable by humans.||Only one nesting level – arrays are expressed as comma-separated element lists.|
|In JSON.||Flexible, multi-level schema.||Waste of storage space: each field has its associated name written with it.|
|In binary format: Hadoop Serialization (Writables).||More efficient in both space and speed.||Not flexible (not “backwards compatible”: changes in the format makes it impossible to read data that was written previous to the format change).|
|In binary format: Using Thrift, Avro or similar APIs.||More efficient in both space and speed. Flexible.||Which one to choose?|