

Many organizations are trying to gather and utilise as much data as possible to improve how they run their business, increase revenue, or impact the world around them. Therefore it is becoming increasingly common for data scientists to face 50GB or even 500GB sized datasets.

Now, these kinds of datasets are a bit… uncomfortable to use. They are small enough to fit on the hard-drive of your everyday laptop, but way too big to fit in RAM. Thus, they are already tricky to open and inspect, let alone to explore or analyse.

There are 3 strategies commonly employed when working with such datasets. The first one is to sub-sample the data and work with only a portion of it. The drawback here is obvious: one may miss key insights by not looking at the relevant portions, or even worse, misinterpret the story the data is telling by not looking at all of it.
The next strategy is to use distributed computing. While this is a valid approach for some cases, it comes with the significant overhead of managing and maintaining a cluster. Imagine having to set up a cluster for a dataset that is just out of RAM reach, like in the 30–50 GB range.

Alternatively, one can rent a single strong cloud instance with as much memory as required to work with the data in question. For example, AWS offers instances with Terabytes of RAM. In this case you still have to manage cloud data buckets, wait for data transfer from bucket to instance every time the instance starts, handle the compliance issues that come with putting data on the cloud, and deal with all the inconvenience that comes with working on a remote machine. Not to mention the costs, which although they start low, tend to pile up as time goes on.

In this article I will show you a new approach: a faster, more secure, and just overall more convenient way to do data science using data of almost arbitrary size, as long as it can fit on the hard-drive of your laptop, desktop or server.

Vaex is an open-source DataFrame library which enables the visualisation, exploration, analysis and even machine learning on tabular datasets that are as large as your hard-drive. To do this, Vaex employs concepts such as memory mapping, efficient out-of-core algorithms and lazy evaluations. All of this is wrapped in a familiar Pandas-like API, so anyone can get started right away.

To illustrate these concepts, let us do a simple exploratory data analysis on a dataset that is far too large to fit into the RAM of a typical laptop. In this article we will use the New York City (NYC) Taxi dataset, which contains information on over 1 billion taxi trips conducted between 2009 and 2015 by the iconic Yellow Taxis. The data can be downloaded from this website, and comes in CSV format. The complete analysis can be viewed separately in this Jupyter notebook.

The first step is to convert the data into a memory mappable file format, such as Apache Arrow, Apache Parquet, or HDF5. An example of how to convert CSV data to HDF5 can be found here.
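As a rough illustration, the snippet below sketches one way such a conversion can be done with Vaex itself; the file name is a placeholder, and the linked example covers the full, multi-file workflow in more detail.

```python
import vaex

# A minimal sketch: convert a large CSV into a memory-mappable HDF5 file.
# The CSV is read in chunks, so the whole file never has to fit in RAM;
# convert=True writes an .hdf5 copy next to the CSV and returns a
# DataFrame that is memory-mapped from it.
df = vaex.from_csv('yellow_taxi_2015.csv',  # placeholder file name
                   convert=True,
                   chunk_size=5_000_000)    # rows per chunk; tune to available RAM
```

The conversion only has to be done once: every subsequent session can open the HDF5 copy directly.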

Once the data is in a memory mappable format, opening it with Vaex is instant (0.052 seconds!), despite its size of over 100GB on disk:
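Here is a minimal sketch of what that looks like, assuming the HDF5 file produced by the conversion step above; the column names used to illustrate the lazy expressions are assumptions about the dataset.

```python
import vaex

# Opening the file is near-instant: no data is read into RAM,
# the file is simply memory-mapped and its metadata inspected.
df = vaex.open('yellow_taxi_2015.csv.hdf5')  # placeholder path from the conversion step

print(f'number of rows: {len(df):,}')

# Expressions are lazy: this defines a virtual column, nothing is computed yet
df['tip_percentage'] = df.tip_amount / df.total_amount * 100  # column names assumed

# Aggregations stream over the memory-mapped data in chunks (out-of-core)
print(df.tip_percentage.mean())
```

No data is loaded when the file is opened: Vaex memory-maps it and only touches the bytes it needs when an expression is actually evaluated, streaming over the data in chunks.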
