Working with pandas (2024)

One of the most important features of xarray is the ability to convert to and from pandas objects to interact with the rest of the PyData ecosystem. For example, for plotting labeled data, we highly recommend using the visualization built in to pandas itself or provided by pandas-aware libraries such as Seaborn.

Hierarchical and tidy data#

Tabular data is easiest to work with when it meets the criteria for tidy data:

  • Each column holds a different variable.

  • Each row holds a different observation.

In this “tidy data” format, we can represent any Dataset and DataArray in terms of DataFrame and Series, respectively (and vice-versa). The representation works by flattening non-coordinates to 1D, and turning the tensor product of coordinate indexes into a pandas.MultiIndex.

Dataset and DataFrame#

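To convert any dataset to a DataFrame in tidy form, use the Dataset.to_dataframe() method. The sessions below assume the usual imports: import numpy as np, import pandas as pd, and import xarray as xr.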

In [1]: ds = xr.Dataset(
   ...:     {"foo": (("x", "y"), np.random.randn(2, 3))},
   ...:     coords={
   ...:         "x": [10, 20],
   ...:         "y": ["a", "b", "c"],
   ...:         "along_x": ("x", np.random.randn(2)),
   ...:         "scalar": 123,
   ...:     },
   ...: )
   ...:

In [2]: ds
Out[2]:
<xarray.Dataset> Size: 100B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 16B 10 20
  * y        (y) <U1 12B 'a' 'b' 'c'
    along_x  (x) float64 16B 0.1192 -1.044
    scalar   int64 8B 123
Data variables:
    foo      (x, y) float64 48B 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732

In [3]: df = ds.to_dataframe()

In [4]: df
Out[4]:
           foo   along_x  scalar
x  y
10 a  0.469112  0.119209     123
   b -0.282863  0.119209     123
   c -1.509059  0.119209     123
20 a -1.135632 -1.044236     123
   b  1.212112 -1.044236     123
   c -0.173215 -1.044236     123

We see that each variable and coordinate in the Dataset is now a column in the DataFrame, with the exception of indexes, which are in the index. To convert the DataFrame to any other convenient representation, use DataFrame methods like reset_index(), stack() and unstack().
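As a minimal sketch of those reshaping methods, the snippet below builds a small dataset with fixed (not random) values, so the numbers differ from the session above. The variable names flat and wide are illustrative only.

```python
import numpy as np
import xarray as xr

# A small dataset shaped like the example above, with fixed values.
ds = xr.Dataset(
    {"foo": (("x", "y"), np.arange(6.0).reshape(2, 3))},
    coords={"x": [10, 20], "y": ["a", "b", "c"]},
)
df = ds.to_dataframe()

# Move the MultiIndex levels into ordinary columns (long, tidy table).
flat = df.reset_index()

# Pivot the "y" level into columns, giving one row per x (wide table).
wide = df["foo"].unstack("y")

print(flat)
print(wide)
```

reset_index() is usually the form that plotting libraries expect, while unstack() recovers a 2D layout similar to the original array.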

For datasets containing dask arrays where the data should be lazily loaded, see the Dataset.to_dask_dataframe() method.

To create a Dataset from a DataFrame, use the Dataset.from_dataframe() class method or the equivalent pandas.DataFrame.to_xarray() method:

In [5]: xr.Dataset.from_dataframe(df)
Out[5]:
<xarray.Dataset> Size: 184B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 16B 10 20
  * y        (y) object 24B 'a' 'b' 'c'
Data variables:
    foo      (x, y) float64 48B 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    along_x  (x, y) float64 48B 0.1192 0.1192 0.1192 -1.044 -1.044 -1.044
    scalar   (x, y) int64 48B 123 123 123 123 123 123

Notice that the dimensions of the variables in the Dataset have expanded after the round-trip conversion to a DataFrame. This is because every object in a DataFrame must have the same indices, so we need to broadcast the data of each array to the full size of the new MultiIndex.

Likewise, all the coordinates (other than indexes) ended up as variables, because pandas does not distinguish non-index coordinates.
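If the original layout is known, both effects can be undone by hand. The sketch below is one possible approach, not a built-in xarray feature: it assumes that along_x was constant along y and that scalar was constant everywhere before the round trip, which is true here but has to be checked for real data. The names roundtrip and restored are illustrative.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), np.arange(6.0).reshape(2, 3))},
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "along_x": ("x", [0.5, 1.5]),
        "scalar": 123,
    },
)

roundtrip = xr.Dataset.from_dataframe(ds.to_dataframe())
print(roundtrip["along_x"].dims)  # broadcast to both dimensions

# Undo the broadcasting by taking the first element along the extra
# dimensions (assumes the values really are constant along them) ...
restored = roundtrip.assign(
    along_x=roundtrip["along_x"].isel(y=0, drop=True),
    scalar=roundtrip["scalar"].isel(x=0, y=0, drop=True),
)
# ... and mark both variables as coordinates again.
restored = restored.set_coords(["along_x", "scalar"])
print(restored)
```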

DataArray and Series#

DataArray objects have a complementary representation in terms of a Series. Using a Series preserves the Dataset to DataArray relationship, because DataFrames are dict-like containers of Series. The methods are very similar to those for working with DataFrames:

In [6]: s = ds["foo"].to_series()

In [7]: s
Out[7]:
x   y
10  a    0.469112
    b   -0.282863
    c   -1.509059
20  a   -1.135632
    b    1.212112
    c   -0.173215
Name: foo, dtype: float64

# or equivalently, with Series.to_xarray()
In [8]: xr.DataArray.from_series(s)
Out[8]:
<xarray.DataArray 'foo' (x: 2, y: 3)> Size: 48B
array([[ 0.469, -0.283, -1.509],
       [-1.136,  1.212, -0.173]])
Coordinates:
  * x        (x) int64 16B 10 20
  * y        (y) object 24B 'a' 'b' 'c'

Both the from_series and from_dataframe methods use reindexing, so they work even if the hierarchical index is not a full tensor product. Missing combinations are filled with NaN:

In [9]: s[::2]
Out[9]:
x   y
10  a    0.469112
    c   -1.509059
20  b    1.212112
Name: foo, dtype: float64

In [10]: s[::2].to_xarray()
Out[10]:
<xarray.DataArray 'foo' (x: 2, y: 3)> Size: 48B
array([[ 0.469,    nan, -1.509],
       [   nan,  1.212,    nan]])
Coordinates:
  * x        (x) int64 16B 10 20
  * y        (y) object 24B 'a' 'b' 'c'

Lossless and reversible conversion#

The previous Dataset example shows that the conversion is not reversible (the round trip is lossy) and that the size of the Dataset increases.

In particular, a round trip introduces the following deviations:

  • a non-dimension Dataset coordinate is converted into a variable

  • a non-dimension DataArray coordinate is not converted

  • dtype is not always the same (e.g. “str” is converted to “object”)

  • attrs metadata is not conserved
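Some of these deviations are easy to observe directly. The following sketch uses made-up attrs ("units", "title") purely for illustration. The dtype reported for y after the round trip may depend on the installed pandas and xarray versions, so it is printed rather than assumed.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), np.zeros((2, 3)), {"units": "m"})},
    coords={"x": [10, 20], "y": ["a", "b", "c"]},
    attrs={"title": "example"},
)
roundtrip = xr.Dataset.from_dataframe(ds.to_dataframe())

# String coordinate dtype before and after the round trip.
print(ds["y"].dtype, "->", roundtrip["y"].dtype)

# Dataset-level and variable-level attrs before and after.
print(ds.attrs, "->", roundtrip.attrs)
print(ds["foo"].attrs, "->", roundtrip["foo"].attrs)
```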

To avoid these problems, the third-party ntv-pandas library offers lossless and reversible conversions between Dataset/DataArray and pandas DataFrame objects.

This solution is particularly interesting for converting any DataFrame into a Dataset (the converter finds the multidimensional structure hidden by the tabular structure).

The ntv-pandas examples show how to improve the conversion for the previous Dataset example and for more complex examples.

Multi-dimensional data#

Tidy data is great, but sometimes you want to preserve dimensions instead of automatically stacking them into a MultiIndex.

DataArray.to_pandas() is a shortcut that lets you convert a DataArray directly into a pandas object with the same dimensionality, if available in pandas (i.e., a 1D array is converted to a Series and a 2D array to a DataFrame):

In [11]: arr = xr.DataArray(
   ....:     np.random.randn(2, 3), coords=[("x", [10, 20]), ("y", ["a", "b", "c"])]
   ....: )
   ....:

In [12]: df = arr.to_pandas()

In [13]: df
Out[13]:
y          a         b         c
x
10 -0.861849 -2.104569 -0.494929
20  1.071804  0.721555 -0.706771

To perform the inverse operation of converting any pandas object into a DataArray with the same shape, simply use the DataArray constructor:

In [14]: xr.DataArray(df)
Out[14]:
<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[-0.862, -2.105, -0.495],
       [ 1.072,  0.722, -0.707]])
Coordinates:
  * x        (x) int64 16B 10 20
  * y        (y) object 24B 'a' 'b' 'c'

Both the DataArray and Dataset constructors directly convert pandas objects into xarray objects with the same shape. This means that they preserve all use of multi-indexes:

In [15]: index = pd.MultiIndex.from_arrays(
   ....:     [["a", "a", "b"], [0, 1, 2]], names=["one", "two"]
   ....: )
   ....:

In [16]: df = pd.DataFrame({"x": 1, "y": 2}, index=index)

In [17]: ds = xr.Dataset(df)

In [18]: ds
Out[18]:
<xarray.Dataset> Size: 120B
Dimensions:  (dim_0: 3)
Coordinates:
  * dim_0    (dim_0) object 24B MultiIndex
  * one      (dim_0) object 24B 'a' 'a' 'b'
  * two      (dim_0) int64 24B 0 1 2
Data variables:
    x        (dim_0) int64 24B 1 1 1
    y        (dim_0) int64 24B 2 2 2

However, you will need to set dimension names explicitly, either with the dims argument of the DataArray constructor or by calling rename on the new object.
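Both options can be sketched with a plain DataFrame whose index and columns carry no names. The dimension names "x" and "y" are arbitrary choices for this illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A DataFrame whose index and columns have no names.
df = pd.DataFrame(np.zeros((2, 3)), index=[10, 20], columns=["a", "b", "c"])

# Without names, xarray falls back to dim_0, dim_1, ...
default = xr.DataArray(df)
print(default.dims)

# Option 1: name the dimensions in the constructor.
named = xr.DataArray(df, dims=("x", "y"))

# Option 2: rename the default dimensions afterwards.
renamed = default.rename({"dim_0": "x", "dim_1": "y"})

print(named.dims, renamed.dims)
```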

Transitioning from pandas.Panel to xarray#

Panel, pandas’ data structure for 3D arrays, was always a second-class data structure compared to the Series and DataFrame. To allow pandas developers to focus more on its core functionality built around the DataFrame, pandas removed Panel in favor of directing users who use multi-dimensional arrays to xarray.

Xarray has most of Panel’s features, a more explicit API (particularly around indexing), and the ability to scale to >3 dimensions with the same interface.

As discussed in the data structures section of the docs, there are two primary data structures in xarray: DataArray and Dataset. You can imagine a DataArray as an n-dimensional pandas Series (i.e. a single typed array), and a Dataset as the DataFrame equivalent (i.e. a dict of aligned DataArray objects).

So you can represent a Panel in two ways:

  • As a 3-dimensional DataArray,

  • Or as a Dataset containing a number of 2-dimensional DataArray objects.

Let’s take a look:

In [19]: data = np.random.RandomState(0).rand(2, 3, 4)

In [20]: items = list("ab")

In [21]: major_axis = list("mno")

In [22]: minor_axis = pd.date_range(start="2000", periods=4, name="date")

With old versions of pandas (prior to 0.25), this could be stored in a Panel:

In [23]: pd.Panel(data, items, major_axis, minor_axis)
Out[23]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: a to b
Major_axis axis: m to o
Minor_axis axis: 2000-01-01 00:00:00 to 2000-01-04 00:00:00

To put this data in a DataArray, write:

In [24]: array = xr.DataArray(data, [items, major_axis, minor_axis])

In [25]: array
Out[25]:
<xarray.DataArray (dim_0: 2, dim_1: 3, date: 4)> Size: 192B
array([[[0.549, 0.715, 0.603, 0.545],
        [0.424, 0.646, 0.438, 0.892],
        [0.964, 0.383, 0.792, 0.529]],

       [[0.568, 0.926, 0.071, 0.087],
        [0.02 , 0.833, 0.778, 0.87 ],
        [0.979, 0.799, 0.461, 0.781]]])
Coordinates:
  * dim_0    (dim_0) <U1 8B 'a' 'b'
  * dim_1    (dim_1) <U1 12B 'm' 'n' 'o'
  * date     (date) datetime64[ns] 32B 2000-01-01 2000-01-02 ... 2000-01-04

As you can see, there are three dimensions (each is also a coordinate). Two of the axes were unnamed, so they have been assigned dim_0 and dim_1 respectively, while the third retains its name date.
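To avoid the generic dim_0 and dim_1 names, the dimensions can be named when the array is built. The sketch below reuses the Panel axis names "items" and "major_axis" as dimension names, which is an illustrative choice rather than anything xarray requires.

```python
import numpy as np
import pandas as pd
import xarray as xr

data = np.random.RandomState(0).rand(2, 3, 4)
items = list("ab")
major_axis = list("mno")
minor_axis = pd.date_range(start="2000", periods=4, name="date")

# Name every dimension explicitly instead of relying on dim_0 and dim_1.
array = xr.DataArray(
    data,
    coords=[items, major_axis, minor_axis],
    dims=["items", "major_axis", "date"],
)

# Label-based selection by dimension name.
subset = array.sel(items="a", major_axis="n")
print(subset)
```

Named dimensions make later selections self-describing, since sel() refers to "items" and "major_axis" instead of positional axes.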

You can also easily convert this data into a Dataset:

In [26]: array.to_dataset(dim="dim_0")
Out[26]:
<xarray.Dataset> Size: 236B
Dimensions:  (dim_1: 3, date: 4)
Coordinates:
  * dim_1    (dim_1) <U1 12B 'm' 'n' 'o'
  * date     (date) datetime64[ns] 32B 2000-01-01 2000-01-02 ... 2000-01-04
Data variables:
    a        (dim_1, date) float64 96B 0.5488 0.7152 0.6028 ... 0.7917 0.5289
    b        (dim_1, date) float64 96B 0.568 0.9256 0.07104 ... 0.4615 0.7805

Here, there are two data variables, each representing a DataFrame on the panel’s items axis, and labeled as such. Each variable is a 2D array of the respective values along the items dimension.

While the xarray docs are relatively complete, a few items stand out for Panel users:

  • A DataArray’s data is stored as a numpy array, and so can only contain a single type. As a result, a Panel that contains DataFrame objects with multiple types will be converted to dtype=object. A Dataset of multiple DataArray objects, each with its own dtype, allows the original types to be preserved.

  • Indexing is similar to pandas, but more explicit, and leverages xarray’s naming of dimensions.

  • Because of those features, making much higher dimensional data is very practical.

  • Variables in Dataset objects can use a subset of its dimensions. For example, you can have one variable with Person x Score x Time, and another with Person x Score.

  • Coordinates are used both for dimensions and for variables which _label_ the data variables, so you could have a coordinate Age that labels the Person dimension of a Dataset of Person x Score x Time.
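Several of these points can be seen in one small Dataset. All names and values below (result, enrolled, ann, bob, and so on) are invented for illustration.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        # Float variable over all three dimensions.
        "result": (("person", "score", "time"), np.zeros((2, 2, 3))),
        # Boolean variable that only uses two of them.
        "enrolled": (("person", "score"), np.array([[True, False], [True, True]])),
    },
    coords={
        "person": ["ann", "bob"],
        "score": ["math", "art"],
        "time": [2000, 2001, 2002],
        # Non-index coordinate that labels the person dimension.
        "age": ("person", [31, 45]),
    },
)

# Each variable keeps its own dtype and its own dimensions.
print(ds["result"].dtype, ds["result"].dims)
print(ds["enrolled"].dtype, ds["enrolled"].dims)

# Selecting a person by label carries the matching age along.
ann = ds.sel(person="ann")
print(int(ann["age"]))
```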

While xarray may take some getting used to, it’s worth it! If anything is unclear, please post an issue on GitHub or StackOverflow, and we’ll endeavor to respond to the specific case or improve the general docs.



Article information

Author: Rev. Porsche Oberbrunner

