
The Evolution of Scientific Data Formats

Learnings from ZIP to Zarr on New Standards for Scientific Publishing

Abstract

We motivate an Open Exchange Architecture (oxa) for scientific publishing, informed by the progression from ZIP archives to HDF5 files to Zarr-based object collections. By abstracting articles, data, code, and review materials into addressable, typed components, an open exchange architecture can preserve the semantic richness of JATS while matching the access patterns of modern cloud-native science. We describe the interoperability requirements needed to make scholarly communication as streamable, reproducible, and machine-actionable as contemporary scientific datasets.

In the world of scientific publishing, new standards are often met with a groan, followed by someone sharing that xkcd comic about “there being 14 competing standards… now 15.” It’s funny because it’s true — sometimes. But other times, it misses the point entirely. Sometimes, a new standard isn’t about adding noise — it’s about enabling something fundamentally new.

The evolution from ZIP → HDF5 → Zarr tells that story.

The ZIP Era — Sharing Files

In 1989, Phil Katz introduced the ZIP format, a simple and open way to compress and bundle multiple files. ZIP solved a distribution problem: getting data from one machine to another efficiently. It standardized something broad — the ability to share complex folders across systems — and quickly became a cornerstone of digital exchange.
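As a minimal sketch of that exchange pattern, here is how a folder of results might be bundled and unpacked using Python's standard-library zipfile module (file names are illustrative):

```python
import zipfile

# Bundle results into a single compressed archive for distribution.
with zipfile.ZipFile("results.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("figures/fig1.png")
    zf.write("tables/summary.csv")

# The recipient must extract the whole archive before using anything inside it.
with zipfile.ZipFile("results.zip") as zf:
    zf.extractall("results/")
```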

But ZIP had limits. It was built for files, not data. It couldn’t efficiently handle massive arrays or structured scientific content. It was a transport format, not a computational one.

Capability unlocked: file portability.
Ecosystem enabled: software distribution, document sharing.
Limitation: flat archives; no structure or metadata for science.

The HDF5 Era — Storing and Structuring Science

By the late 1990s, data volumes had exploded. Scientific simulations, satellite observations, and sensor networks needed something more powerful. Enter HDF5 (1998), born at NCSA and later maintained by The HDF Group.

HDF5 introduced a hierarchical, self-describing data model — groups, datasets, attributes — built to manage large, structured, multidimensional arrays. It supported parallel I/O, chunking, and compression — critical in high-performance computing.
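As a small sketch of that model, assuming the h5py package (group, dataset, and attribute names are illustrative):

```python
import h5py
import numpy as np

# One self-describing file: groups contain datasets, which carry attributes.
with h5py.File("simulation.h5", "w") as f:
    ocean = f.create_group("ocean")
    temp = ocean.create_dataset(
        "temperature",
        shape=(365, 1800, 3600),   # day x latitude x longitude
        chunks=(1, 180, 360),      # chunked so slices can be read efficiently
        compression="gzip",
        dtype="f4",
    )
    temp.attrs["units"] = "kelvin"  # metadata travels with the data
    temp[0] = np.zeros((1800, 3600), dtype="f4")  # write one day's slice
```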

HDF5 didn’t replace ZIP; it transcended it. It addressed a new challenge: efficient access to scientific structure at scale. Entire disciplines built on it — from NASA’s Earth Observing System to climate modeling to particle physics.

Capability unlocked: hierarchical data, parallel I/O.
Ecosystem enabled: HPC and scientific data management.
Limitation: local file systems; monolithic files don't scale to cloud object stores.

The Zarr Era — Cloud-Native, Chunked, and Distributed

Fast-forward to 2015. The world had moved to the cloud, and computation had gone distributed. Zarr, created by Alistair Miles, took inspiration from HDF5 but reimagined it for a new environment — object stores, HTTP access, parallel cloud computing, and modular data ecosystems.

Zarr stores each array chunk as a separate object (or file) and uses simple JSON metadata. That’s it. This design makes Zarr “cloud-native”: you can open, stream, and process pieces of a terabyte-scale dataset directly from S3 or GCS without downloading the whole thing.
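As a minimal sketch with the zarr Python package (array names and sizes are illustrative; the layout shown in the comments is the Zarr v2 format):

```python
import zarr

# Each chunk is stored as its own object; the metadata is plain JSON.
z = zarr.open(
    "image.zarr", mode="w",
    shape=(100_000, 100_000), chunks=(1_000, 1_000), dtype="u2",
)
z[:1_000, :1_000] = 42  # writes exactly one chunk object

# On disk this produces image.zarr/.zarray (JSON metadata) plus one
# file per written chunk, e.g. image.zarr/0.0 -- no monolithic container.

# The same array can be streamed from an object store without a full
# download (hypothetical bucket; requires the s3fs/fsspec packages):
# remote = zarr.open("s3://example-bucket/image.zarr", mode="r")
# tile = remote[:1_000, :1_000]  # fetches only the chunks it needs
```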

It’s not “HDF5, but in folders” — it’s a rethinking of what a scientific data container means when storage, compute, and collaboration all happen across the network.

Capability unlocked: distributed, parallel, object-store access.
Ecosystem enabled: FAIR data, Pangeo, OME-Zarr, cloud-scale machine learning, and analysis pipelines.
Limitation: evolving specs and community governance; still in progress, but essential for the future.

What This Evolution Shows

Each format wasn’t just a “new standard”. Each reflected a transformation in infrastructure and possibility:

Table 1: A timeline of data formats and the new scientific capabilities each unlocked

Era    Format   New Capability                                 Why It Was Needed
1989   ZIP      Share files and folders                        Cross-platform portability
1998   HDF5     Structure and access large scientific arrays   HPC & simulation workloads
2015   Zarr     Cloud-native chunked storage                   Distributed and object-store compute

When technology shifts, from disks to clusters to clouds, our standards must evolve with it. ZIP could never have supported petabyte-scale simulation data. HDF5 faces challenges in fully adapting to cloud-native workflows (Rocklin, 2018) and in embracing the simplicity of cloud object stores (Abernathey, 2018). And that's the lesson: new ecosystems often require new standards.

Figure 1: A 1.5 TB zebrafish microscopy image using OME-Zarr. Under file-based standards or in print-based publishing, only a static PNG or a small excerpt is directly viewable; the full-resolution microscopy data remain locked away, inaccessible for browsing or compute unless fully downloaded and extracted. As we zoom in on the cloud-hosted image, the limitations of current publishing containers become clear: they lack the addressability and streaming needed for modern data-intensive science. These formats need to be accessible from the reading experience as well.


The Scientific Publishing Parallel — JATS and MECA

Around the same time HDF5 was taking root in scientific computing, publishing found its own "standard of record": JATS (the Journal Article Tag Suite), formalized in 2012 and updated in 2015 and 2021. JATS offered an XML vocabulary for describing every part of a scientific article: section headings, references, tables, figures, metadata. It has become a common export format for publishers and repositories, built for interoperability, preservation, and cross-platform rendering.

To move these articles between systems, publishers introduced MECA — the Manuscript Exchange Common Approach — which packages an entire submission (the JATS XML, figures, supplementary materials) inside a ZIP archive, usually transferred via FTP.

JATS defines the structure. MECA defines the container. Together, they have become open infrastructure for scholarly publishing.
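As a rough sketch, nothing more than a ZIP reader is needed to look inside a MECA package; apart from manifest.xml, the file names below are illustrative:

```python
import zipfile

# A MECA package is a ZIP archive: a manifest.xml plus the article files.
with zipfile.ZipFile("submission.meca") as meca:
    print(meca.namelist())
    # e.g. ['manifest.xml', 'article.xml', 'figures/fig1.tif', ...]

    # Even reading just the JATS XML means opening the whole container first.
    jats_xml = meca.read("article.xml")
```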

But that infrastructure still assumes a file-based world, one built around XML, FTP, and ZIP: tools that are seldom used in modern web-development workflows.

They're reliable and (relatively) standardized[1]. But they're not easily accessible to developers, and they don't expose the component pieces of research. They don't interoperate with the tools researchers actually use: notebooks, data repositories, cloud archives, and computational environments.

In a world of APIs, data portals, and reproducible workflows, the “currency” of publishing remains static and opaque.

What If Scientific Publishing Made the Zarr Leap?

Imagine applying the “Zarr Leap” to scholarly publishing.

Zarr didn’t reinvent the data or the data structures — it restructured access. It made scientific content addressable, parallel, and web-native.

What if publishing did the same?

Instead of wrapping an article and its supplements in a ZIP, what if we exposed the components of research — the notebook, data, code, figures, article, and review — as structured, typed, linked pieces that could be accessed directly?

Researchers, machines, and readers could then stream, query, and build on each component directly, without downloading and unpacking the whole bundle.

It would be like replacing FTP with APIs, and replacing ZIP with a structured bucket of interconnected, queryable components.
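Nothing below is a specification. As a purely hypothetical illustration, with invented URLs, types, and field names, a research object might expose its components through a small, queryable manifest rather than an archive:

```python
from urllib.request import urlopen

# Hypothetical manifest: typed, linked components addressable over HTTP.
manifest = {
    "kind": "article",
    "components": {
        "article":  {"type": "jats+xml", "url": "https://example.org/10.1234/abcd/article.xml"},
        "notebook": {"type": "ipynb",    "url": "https://example.org/10.1234/abcd/analysis.ipynb"},
        "data":     {"type": "zarr",     "url": "https://example.org/10.1234/abcd/image.zarr"},
        "review":   {"type": "report",   "url": "https://example.org/10.1234/abcd/review-1.json"},
    },
}

# A reader, or a machine, fetches exactly the component it needs.
data_url = manifest["components"]["data"]["url"]
# component = urlopen(data_url)  # stream it; no download-and-extract step
```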

We wouldn't lose the semantic logic of JATS; we could wrap it in a more expressive, cloud-native container, just as Zarr wrapped the logic of HDF5 in a new access model. JATS was designed for paper, before a world of Jupyter Notebooks, linked protocols, datasets, and interactive articles, so there is also a lot we can and should do to improve the content that can be expressed.

A Thought to Leave With

Science has already made this leap once — from ZIP to HDF5 to Zarr — each time unlocking a new era of access, scale, and collaboration. Publishing hasn’t yet made its equivalent jump. We are still shipping our most valuable ideas around as ZIP files on FTP servers.

It’s time to imagine what publishing looks like when articles are as open, structured, and explorable as data itself. An Open Exchange Architecture, oxa.

Not a new standard for the sake of it — but a new standard for what’s possible in scientific communication.

License

Copyright © 2025 Cockett. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.

Footnotes
  1. JATS presents many challenges, with different flavours and variants between publishers. MECA has seen relatively little adoption in the industry, and when it is implemented it doesn't always follow the specification.

Abbreviations
API
Application Programming Interface
FTP
File Transfer Protocol
HDF
Hierarchical Data Format
HDF5
Hierarchical Data Format 5
I/O
Input/Output
JATS
Journal Article Tag Suite
JSON
JavaScript Object Notation
MECA
Manuscript Exchange Common Approach
NASA
National Aeronautics and Space Administration
NCSA
National Center for Supercomputing Applications
XML
Extensible Markup Language
ZIP
Zip Archive
References
  1. Rocklin, M. (2018). HDF in the Cloud: Challenges and Solutions for Scientific Data. https://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud
  2. Abernathey, R. (2018). Step-by-Step Guide to Building a Big Data Portal. https://medium.com/pangeo/step-by-step-guide-to-building-a-big-data-portal-e262af1c2977