In Hadoop projects that have to handle large amounts of data, the serialization we use to process and store these data can be a key decision in optimizing the time and cost of our pipeline.
We can serialize our data in different ways:
| Format | Main advantage | Main disadvantage |
| In plain text, using tab-separated fields, for example. | Data is easily readable by humans. | Only one nesting level – arrays are expressed as comma-separated element lists. |
| In JSON. | Flexible, multi-level schema. | Waste of storage space: each field has its associated name written with it. |
| In binary format: Hadoop Serialization (Writables). | More efficient in both space and speed. | Not flexible (not “backwards compatible”: changes in the format makes it impossible to read data that was written previous to the format change). |
| In binary format: Using Thrift, Avro or similar APIs. | More efficient in both space and speed. Flexible. | Which one to choose? |
