Pandas, PySpark, Koalas - Which library to choose for your Data Science project?

This article compares three Python libraries for data analysis: Pandas is suitable for limited data volumes, PySpark handles large quantities thanks to parallelization, and Koalas makes the transition from Pandas to Spark easier.


If you've ever done data analysis in Python, you can't have missed the Pandas library!
Indeed, it is one of the most widely used libraries for data manipulation. It has become unavoidable thanks to its ease of use, its practicality, and the wealth of functionality it offers for manipulating data.

The library's basic object, the DataFrame (a table, see below), makes it a very intuitive tool, because this object is very easy to understand when working with structured data. Pandas makes it very easy to load or write data from a CSV file, an Excel file, or a SQL database. Additionally, Pandas contains a multitude of optimized functions for manipulating this tabular data: critical parts of the code are written in Cython or C to improve execution speed.

These various advantages make it the ideal choice for data processing in Python when working on a limited volume of data.
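As an illustration, here is a minimal sketch of this workflow (the file my_data.csv and the column names are hypothetical):

import pandas as pd

# Load tabular data from a (hypothetical) CSV file
df = pd.read_csv("my_data.csv")

# Typical manipulations: filtering, a derived column, an aggregation
df = df[df["col1"] > 0]
df["col4"] = df["col1"] * df["col1"]
print(df.groupby("col2")["col4"].mean())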


For more performance: PySpark

Using a framework like Spark typically comes after the proof of concept, or PoC, phase. As a reminder, a PoC is a very short-term project aimed at exploring a data science subject. The PoC is used to prove the feasibility and added value of the approach on a reduced scope (with a reasonable quantity of data). In this context, data scientists have every interest in using their favorite tools in order to move forward quickly.

When, in a second phase, we want to industrialize this PoC, that is to say apply it at a larger scale, we generally run into performance problems. Indeed, the volume of data may explode when we move from the initial sample to the full scope, which can create significant latencies in the execution of calculations, or even prevent some from completing at all.

To remedy this, we need frameworks that can parallelize calculations, hence the use of Spark through its Python interface, PySpark. One of the advantages of PySpark is also the distinction between lazy transformations and actions: Spark only executes the code, and therefore launches calculations, when an action actually requires a result.
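To illustrate this laziness, here is a minimal sketch (assuming an existing SparkSession named spark and a hypothetical my_data.csv): the transformations only build an execution plan, and nothing is computed until the final action.

df = spark.read.option("header", "true").option("inferSchema", "true").csv("my_data.csv")

# Transformations: lazy, they only build the execution plan
df_filtered = df.filter(df.col1 > 0)
df_filtered = df_filtered.withColumn("col4", df_filtered.col1 * df_filtered.col1)

# Action: this is the point where Spark actually launches the computation
print(df_filtered.count())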

Here is an example of code in PySpark compared with Pandas; note that the syntax differs significantly and that some adaptation is necessary to move from Pandas to PySpark:


Pandas:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
df['col4'] = df.col1 * df.col1

PySpark:

df = spark.read.option("inferSchema", "true").option("header", "true").csv("my_data.csv")
df = df.toDF('col1', 'col2', 'col3')
df = df.withColumn('col4', df.col1 * df.col1)


For a smooth transition: Koalas?

Koalas is a very recent library (late 2019) which aims to let you write Spark programs with Pandas syntax. It makes it possible to unify experimentation and industrialization code under the same tool, while benefiting from the flexibility of Pandas and the distributed performance of Spark.

It is therefore a perfectly relevant intermediate solution, particularly suitable for Data Scientists who already master Pandas and wish to move towards larger volumes of data while keeping familiar tools, and therefore without having to train themselves completely in a new language. Note, however, that not all Pandas features are yet available in the Koalas library.

For example, using the same code as previously with Koalas:

import databricks.koalas as ks

df = ks.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
df['col4'] = df.col1 * df.col1

We can clearly see the similarity with the Pandas code: only the name of the library has changed, and the code itself is exactly the same as with Pandas.



So, should we abandon PySpark? Not really.

Subtleties still exist between the two environments, and PySpark remains the reference Big Data framework in the Data Engineering community. Data Engineers are often already familiar with Spark and will have very little interest in switching to Koalas. Moreover, Koalas is a layer on top of Spark's DataFrame API designed to be closer to Pandas, which means that the underlying engine remains Spark: if, for example, there is a need for additional performance, it is possible to fall back on the Spark API without overhead. In addition, Spark, given its popularity, is designed to integrate easily with a whole ecosystem of other tools.

In the graph below (produced by Databricks), we can see that PySpark still outperforms Koalas, even if Koalas is already very efficient compared to Pandas. It is also interesting to note that for small datasets, Pandas is more efficient (because distributed solutions pay initialization and data-transfer costs) and therefore more suitable.


Koalas can also be used as a way to learn Spark gradually, but in any case it is necessary to practice in the Spark environment.




It is obviously possible to move from one world to the other using functions like to_pandas(), but it is recommended to use them as little as possible: these are very expensive operations, since they change the storage format of the data.
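For instance, a minimal sketch of these conversions (assuming ks_df is an existing Koalas DataFrame):

# Koalas -> Pandas: collects all the distributed data onto the driver (expensive)
pdf = ks_df.to_pandas()

# Koalas -> Spark: exposes the underlying Spark DataFrame (cheap, no data movement)
sdf = ks_df.to_spark()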



And what about Aqsone in all this?

Our data scientists use (Py)Spark and Koalas to run computations on large volumes of data. If you want to learn more about these technologies and implement them in your big data projects, contact us!
