From 7ab21c38ee77502fc58cebc2ba20080f5b05af7a Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Fri, 29 Aug 2025 04:37:18 +0000
Subject: [PATCH 01/36] Initial plan
From 4ebe07fa5e2a29c6d9491e75a83e21c2dfaa9b8f Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Fri, 29 Aug 2025 04:50:40 +0000
Subject: [PATCH 02/36] Create polars lecture based on pandas lecture and add
to TOC
Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com>
---
lectures/_toc.yml | 1 +
lectures/polars.md | 861 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 862 insertions(+)
create mode 100644 lectures/polars.md
diff --git a/lectures/_toc.yml b/lectures/_toc.yml
index 302a0a0b..b1121ed3 100644
--- a/lectures/_toc.yml
+++ b/lectures/_toc.yml
@@ -21,6 +21,7 @@ parts:
- file: matplotlib
- file: scipy
- file: pandas
+ - file: polars
- file: pandas_panel
- file: sympy
- caption: High Performance Computing
diff --git a/lectures/polars.md b/lectures/polars.md
new file mode 100644
index 00000000..68633ae1
--- /dev/null
+++ b/lectures/polars.md
@@ -0,0 +1,861 @@
+---
+jupytext:
+ text_representation:
+ extension: .md
+ format_name: myst
+ format_version: 0.13
+ jupytext_version: 1.16.7
+kernelspec:
+ display_name: Python 3 (ipykernel)
+ language: python
+ name: python3
+---
+
+(pl)=
+
+# {index}`Polars`
+
+```{index} single: Python; Polars
+```
+
+In addition to what's in Anaconda, this lecture will need the following libraries:
+
+```{code-cell} ipython3
+:tags: [hide-output]
+
+!pip install --upgrade polars
+!pip install --upgrade wbgapi
+!pip install --upgrade yfinance
+```
+
+## Overview
+
+[Polars](https://pola.rs/) is a lightning-fast data manipulation library for Python written in Rust.
+
+Polars has gained significant popularity in recent years due to its superior performance
+compared to traditional data analysis tools, making it an excellent choice for modern
+data science and machine learning workflows.
+
+Polars is designed with performance and memory efficiency in mind, leveraging:
+
+* Arrow's columnar memory format for fast data access
+* Lazy evaluation to optimize query execution (see the sketch below)
+* Parallel processing for enhanced performance
+* Expressive API similar to pandas but with better performance characteristics
+
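+To give a feel for the lazy API mentioned above, here is a minimal sketch; the
+file name `data.csv` is hypothetical
+
+```{code-block} python3
+:class: no-execute
+
+# Build a lazy query --- nothing runs until .collect() is called, which
+# lets polars optimize the whole plan (e.g. pushing the filter down into
+# the CSV scan)
+query = (
+    pl.scan_csv('data.csv')
+      .filter(pl.col('POP') >= 20000)
+      .select(['country', 'POP'])
+)
+df = query.collect()  # execute the optimized plan
+```
+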
+Just as [NumPy](https://numpy.org/) provides the basic array data type plus core array operations, polars
+
+1. defines fundamental structures for working with data and
+1. endows them with methods that facilitate operations such as
+ * reading in data
+ * adjusting indices
+ * working with dates and time series
+ * sorting, grouping, re-ordering and general data munging [^mung]
+ * dealing with missing values, etc., etc.
+
+More sophisticated statistical functionality is left to other packages, such
+as [statsmodels](https://www.statsmodels.org/) and [scikit-learn](https://scikit-learn.org/), which can work with polars DataFrames through their interoperability with pandas.
+
+This lecture will provide a basic introduction to polars.
+
+Throughout the lecture, we will assume that the following imports have taken
+place
+
+```{code-cell} ipython3
+import polars as pl
+import numpy as np
+import matplotlib.pyplot as plt
+import requests
+```
+
+Two important data types defined by polars are `Series` and `DataFrame`.
+
+You can think of a `Series` as a "column" of data, such as a collection of observations on a single variable.
+
+A `DataFrame` is a two-dimensional object for storing related columns of data.
+
+## Series
+
+```{index} single: Polars; Series
+```
+
+Let's start with Series.
+
+
+We begin by creating a series of four random observations
+
+```{code-cell} ipython3
+s = pl.Series(name='daily returns', values=np.random.randn(4))
+s
+```
+
+Here you can imagine the indices `0, 1, 2, 3` as indexing four listed
+companies, and the values being daily returns on their shares.
+
+Polars `Series` are built on top of Apache Arrow arrays and support many similar
+operations
+
+```{code-cell} ipython3
+s * 100
+```
+
+```{code-cell} ipython3
+s.abs()
+```
+
+But `Series` provide more than basic arrays.
+
+Not only do they have some additional (statistically oriented) methods
+
+```{code-cell} ipython3
+s.describe()
+```
+
+But they can also be used with custom indices
+
+```{code-cell} ipython3
+# Create a new series with custom index using a DataFrame
+df_temp = pl.DataFrame({
+ 'company': ['AMZN', 'AAPL', 'MSFT', 'GOOG'],
+ 'daily returns': s.to_list()
+})
+df_temp
+```
+
+To access specific values by company name, we can filter the DataFrame
+
+```{code-cell} ipython3
+# Get AMZN's return
+df_temp.filter(pl.col('company') == 'AMZN').select('daily returns').item()
+```
+
+```{code-cell} ipython3
+# Update AMZN's return to 0
+df_temp = df_temp.with_columns(
+ pl.when(pl.col('company') == 'AMZN')
+ .then(0)
+ .otherwise(pl.col('daily returns'))
+ .alias('daily returns')
+)
+df_temp
+```
+
+```{code-cell} ipython3
+# Check if AAPL is in the companies
+'AAPL' in df_temp.get_column('company').to_list()
+```
+
+## DataFrames
+
+```{index} single: Polars; DataFrames
+```
+
+While a `Series` is a single column of data, a `DataFrame` is several columns, one for each variable.
+
+In essence, a `DataFrame` in polars is analogous to a (highly optimized) Excel spreadsheet.
+
+Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns, often with descriptive indexes for individual rows and individual columns.
+
+Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`, which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
+
+The dataset contains the following indicators
+
+| Variable Name | Description |
+| :-: | :-: |
+| POP | Population (in thousands) |
+| XRAT | Exchange Rate to US Dollar |
+| tcgdp | Total PPP Converted GDP (in million international dollar) |
+| cc | Consumption Share of PPP Converted GDP Per Capita (%) |
+| cg | Government Consumption Share of PPP Converted GDP Per Capita (%) |
+
+
+We'll read this in from a URL using the `polars` function `read_csv`.
+
+```{code-cell} ipython3
+df = pl.read_csv('https://raw.githubusercontent.com/QuantEcon/lecture-python-programming/master/source/_static/lecture_specific/pandas/data/test_pwt.csv')
+type(df)
+```
+
+Here's the content of `test_pwt.csv`
+
+```{code-cell} ipython3
+df
+```
+
+### Select Data by Position
+
+In practice, one thing that we do all the time is to find, select and work with a subset of the data of our interests.
+
+We can select particular rows using array slicing notation
+
+```{code-cell} ipython3
+df[2:5]
+```
+
+To select columns, we can pass a list containing the names of the desired columns
+
+```{code-cell} ipython3
+df.select(['country', 'tcgdp'])
+```
+
+To select both rows and columns using integers, we can combine slicing with column selection
+
+```{code-cell} ipython3
+df[2:5].select(df.columns[0:4])
+```
+
+To select rows and columns using a mixture of integers and labels, we can use more complex selection
+
+```{code-cell} ipython3
+df[2:5].select(['country', 'tcgdp'])
+```
+
+### Select Data by Conditions
+
+Instead of indexing rows and columns using integers and names, we can also obtain a sub-dataframe of interest that satisfies certain (potentially complicated) conditions.
+
+This section demonstrates various ways to do that.
+
+The most straightforward way is with the `filter` method.
+
+```{code-cell} ipython3
+df.filter(pl.col('POP') >= 20000)
+```
+
+To understand what is going on here, notice that `pl.col('POP') >= 20000` creates a boolean expression.
+
+```{code-cell} ipython3
+df.select(pl.col('POP') >= 20000)
+```
+
+In this case, `df.filter()` takes a boolean expression and only returns rows with the `True` values.
+
+Take one more example,
+
+```{code-cell} ipython3
+df.filter(
+ (pl.col('country').is_in(['Argentina', 'India', 'South Africa'])) &
+ (pl.col('POP') > 40000)
+)
+```
+
+We can also use arithmetic operations between different columns.
+
+```{code-cell} ipython3
+df.filter((pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000))
+```
+
+For example, we can use the conditioning to select the country with the largest household consumption - gdp share `cc`.
+
+```{code-cell} ipython3
+df.filter(pl.col('cc') == pl.col('cc').max())
+```
+
+When we only want to look at certain columns of a selected sub-dataframe, we can combine filter with select.
+
+```{code-cell} ipython3
+df.filter((pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)).select(['country', 'year', 'POP'])
+```
+
+**Application: Subsetting Dataframe**
+
+Real-world datasets can be [enormous](https://developers.google.com/machine-learning/crash-course/overfitting).
+
+It is sometimes desirable to work with a subset of data to enhance computational efficiency and reduce redundancy.
+
+Let's imagine that we're only interested in the population (`POP`) and total GDP (`tcgdp`).
+
+One way to strip the data frame `df` down to only these variables is to overwrite the dataframe using the selection method described above
+
+```{code-cell} ipython3
+df_subset = df.select(['country', 'POP', 'tcgdp'])
+df_subset
+```
+
+We can then save the smaller dataset for further analysis.
+
+```{code-block} python3
+:class: no-execute
+
+df_subset.write_csv('pwt_subset.csv')
+```
+
+### Apply and Map Operations
+
+Polars provides powerful methods for applying functions to data.
+
+Instead of pandas' `apply` method, polars uses expressions within `select`, `with_columns`, or `filter` methods.
+
+Here is an example using built-in functions
+
+```{code-cell} ipython3
+df.select([
+ pl.col(['year', 'POP', 'XRAT', 'tcgdp', 'cc', 'cg']).max().suffix('_max')
+])
+```
+
+This line of code applies the `max` function to all selected columns.
+
+For more complex operations, we can use `map_elements` (similar to pandas' apply):
+
+```{code-cell} ipython3
+# A trivial example using map_elements
+df.with_row_index().select([
+ pl.col('index'),
+ pl.col('country'),
+ pl.col('POP').map_elements(lambda x: x * 2, return_dtype=pl.Float64).alias('POP_doubled')
+])
+```
+
+We can use complex filtering conditions with boolean logic:
+
+```{code-cell} ipython3
+complex_condition = (
+ pl.when(pl.col('country').is_in(['Argentina', 'India', 'South Africa']))
+ .then(pl.col('POP') > 40000)
+ .otherwise(pl.col('POP') < 20000)
+)
+
+df.filter(complex_condition).select(['country', 'year', 'POP', 'XRAT', 'tcgdp'])
+```
+
+### Make Changes in DataFrames
+
+The ability to make changes to a dataframe is important for generating a clean dataset for future analysis.
+
+**1.** We can use conditional logic to "keep" certain values and replace others
+
+```{code-cell} ipython3
+df.with_columns(
+ pl.when(pl.col('POP') >= 20000)
+ .then(pl.col('POP'))
+ .otherwise(None)
+ .alias('POP_filtered')
+).select(['country', 'POP', 'POP_filtered'])
+```
+
+**2.** We can modify specific values based on conditions
+
+```{code-cell} ipython3
+df_modified = df.with_columns(
+ pl.when(pl.col('cg') == pl.col('cg').max())
+ .then(None)
+ .otherwise(pl.col('cg'))
+ .alias('cg')
+)
+df_modified
+```
+
+**3.** We can use expressions to modify columns as a whole
+
+```{code-cell} ipython3
+df.with_columns([
+ pl.when(pl.col('POP') <= 10000).then(None).otherwise(pl.col('POP')).alias('POP'),
+ (pl.col('XRAT') / 10).alias('XRAT')
+])
+```
+
+**4.** We can use `map_elements` to modify all individual entries in specific columns.
+
+```{code-cell} ipython3
+# Round all decimal numbers to 2 decimal places in numeric columns
+df.with_columns([
+ pl.col(pl.Float64).round(2)
+])
+```
+
+**Application: Missing Value Imputation**
+
+Replacing missing values is an important step in data munging.
+
+Let's insert some null values at chosen positions
+
+```{code-cell} ipython3
+# Create a copy with some null values
+df_with_nulls = df.clone()
+
+# Set some specific positions to null
+indices_to_null = [(0, 'XRAT'), (3, 'cc'), (5, 'tcgdp'), (6, 'POP')]
+
+for row_idx, col_name in indices_to_null:
+ df_with_nulls = df_with_nulls.with_columns(
+ pl.when(pl.int_range(pl.len()) == row_idx)
+ .then(None)
+ .otherwise(pl.col(col_name))
+ .alias(col_name)
+ )
+
+df_with_nulls
+```
+
+We can replace all missing values with 0
+
+```{code-cell} ipython3
+df_with_nulls.fill_null(0)
+```
+
+Polars also provides us with convenient methods to replace missing values.
+
+For example, we can fill missing values with the mean of each column
+
+```{code-cell} ipython3
+# Fill with column means for numeric columns
+df_filled = df_with_nulls.with_columns([
+ pl.col(pl.Float64, pl.Int64).fill_null(pl.col(pl.Float64, pl.Int64).mean())
+])
+df_filled
+```
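+
+Other fill strategies, such as forward fill, backward fill, and interpolation, are also available.
+
+Here is a brief sketch using the same data
+
+```{code-cell} ipython3
+df_with_nulls.with_columns([
+    pl.col('XRAT').fill_null(strategy='backward'),  # take the next valid value
+    pl.col('tcgdp').interpolate()  # interpolate between neighboring valid values
+])
+```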
+
+Missing value imputation is a big area in data science involving various machine learning techniques.
+
+There are also more [advanced tools](https://scikit-learn.org/stable/modules/impute.html) in Python to impute missing values.
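+
+As a rough sketch (assuming scikit-learn is installed, and using a hypothetical list of
+numeric columns), such tools can be used with polars through its pandas interoperability
+
+```{code-block} python3
+:class: no-execute
+
+from sklearn.impute import SimpleImputer
+
+# impute column means with scikit-learn, then rebuild a polars DataFrame
+numeric = df_with_nulls.select(['POP', 'XRAT', 'tcgdp', 'cc', 'cg']).to_pandas()
+imputed = SimpleImputer(strategy='mean').fit_transform(numeric)
+pl.DataFrame(imputed, schema=numeric.columns.to_list())
+```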
+
+### Standardization and Visualization
+
+Let's imagine that we're only interested in the population (`POP`) and total GDP (`tcgdp`).
+
+One way to strip the data frame `df` down to only these variables is to overwrite the dataframe using the selection method described above
+
+```{code-cell} ipython3
+df = df.select(['country', 'POP', 'tcgdp'])
+df
+```
+
+Here the index `0, 1,..., 7` is redundant because we can use the country names as an index.
+
+While polars doesn't have a traditional index like pandas, we can work with country names directly
+
+```{code-cell} ipython3
+df
+```
+
+Let's give the columns slightly better names
+
+```{code-cell} ipython3
+df = df.rename({'POP': 'population', 'tcgdp': 'total GDP'})
+df
+```
+
+The `population` variable is in thousands, let's revert to single units
+
+```{code-cell} ipython3
+df = df.with_columns((pl.col('population') * 1e3).alias('population'))
+df
+```
+
+Next, we're going to add a column showing real GDP per capita, multiplying by 1,000,000 as we go because total GDP is in millions
+
+```{code-cell} ipython3
+df = df.with_columns(
+ (pl.col('total GDP') * 1e6 / pl.col('population')).alias('GDP percap')
+)
+df
+```
+
+One of the nice things about polars `DataFrame` and `Series` objects is that they can be easily converted to pandas for visualization through Matplotlib.
+
+For example, we can easily generate a bar plot of GDP per capita
+
+```{code-cell} ipython3
+# Convert to pandas for plotting
+df_pandas = df.to_pandas().set_index('country')
+ax = df_pandas['GDP percap'].plot(kind='bar')
+ax.set_xlabel('country', fontsize=12)
+ax.set_ylabel('GDP per capita', fontsize=12)
+plt.show()
+```
+
+At the moment the data frame is ordered alphabetically on the countries---let's change it to GDP per capita
+
+```{code-cell} ipython3
+df = df.sort('GDP percap', descending=True)
+df
+```
+
+Plotting as before now yields
+
+```{code-cell} ipython3
+# Convert to pandas for plotting
+df_pandas = df.to_pandas().set_index('country')
+ax = df_pandas['GDP percap'].plot(kind='bar')
+ax.set_xlabel('country', fontsize=12)
+ax.set_ylabel('GDP per capita', fontsize=12)
+plt.show()
+```
+
+## On-Line Data Sources
+
+```{index} single: Data Sources
+```
+
+Python makes it straightforward to query online databases programmatically.
+
+An important database for economists is [FRED](https://fred.stlouisfed.org/) --- a vast collection of time series data maintained by the St. Louis Fed.
+
+For example, suppose that we are interested in the [unemployment rate](https://fred.stlouisfed.org/series/UNRATE).
+
+(To download the data as a csv, click on the top right `Download` and select the `CSV (data)` option).
+
+Alternatively, we can access the CSV file from within a Python program.
+
+This can be done with a variety of methods.
+
+We start with a relatively low-level method and then return to polars.
+
+### Accessing Data with {index}`requests`
+
+```{index} single: Python; requests
+```
+
+One option is to use [requests](https://requests.readthedocs.io/en/latest/), a standard Python library for requesting data over the Internet.
+
+To begin, try the following code on your computer
+
+```{code-cell} ipython3
+r = requests.get('https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&cosd=1948-01-01&coed=2024-06-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-07-29&revision_date=2024-07-29&nd=1948-01-01')
+```
+
+If there's no error message, then the call has succeeded.
+
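+One way to confirm this is to check the status code attached to the response; `200`
+indicates success
+
+```{code-cell} ipython3
+r.status_code
+```
+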
+If you do get an error, then there are two likely causes
+
+1. You are not connected to the Internet --- hopefully, this isn't the case.
+1. Your machine is accessing the Internet through a proxy server, and Python isn't aware of this.
+
+In the second case, you can either
+
+* switch to another machine
+* solve your proxy problem by reading [the documentation](https://requests.readthedocs.io/en/latest/)
+
+Assuming that all is working, you can now proceed to use the `source` object built from the response to `requests.get(url)` below
+
+```{code-cell} ipython3
+url = 'https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&cosd=1948-01-01&coed=2024-06-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-07-29&revision_date=2024-07-29&nd=1948-01-01'
+source = requests.get(url).content.decode().split("\n")
+source[0]
+```
+
+```{code-cell} ipython3
+source[1]
+```
+
+```{code-cell} ipython3
+source[2]
+```
+
+We could now write some additional code to parse this text and store it as an array.
+
+But this is unnecessary --- polars' `read_csv` function can handle the task for us.
+
+We use `try_parse_dates=True` so that polars recognizes our dates column, allowing for simple date filtering
+
+```{code-cell} ipython3
+data = pl.read_csv(url, try_parse_dates=True)
+```
+
+The data has been read into a polars DataFrame called `data` that we can now manipulate in the usual way
+
+```{code-cell} ipython3
+type(data)
+```
+
+```{code-cell} ipython3
+data.head() # A useful method to get a quick look at a data frame
+```
+
+```{code-cell} ipython3
+data.describe() # Your output might differ slightly
+```
+
+We can also plot the unemployment rate from 2006 to 2012 as follows
+
+```{code-cell} ipython3
+# Filter data for the specified date range and convert to pandas for plotting
+filtered_data = data.filter(
+ (pl.col('DATE') >= pl.date(2006, 1, 1)) &
+ (pl.col('DATE') <= pl.date(2012, 12, 31))
+).to_pandas().set_index('DATE')
+
+ax = filtered_data.plot(title='US Unemployment Rate', legend=False)
+ax.set_xlabel('year', fontsize=12)
+ax.set_ylabel('%', fontsize=12)
+plt.show()
+```
+
+Note that polars offers many other file type alternatives.
+
+Polars has [a wide variety](https://docs.pola.rs/user-guide/io/) of methods that we can use to read excel, json, parquet or plug straight into a database server.
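+
+For instance, here is a sketch of a few of these readers; the file names are hypothetical
+
+```{code-block} python3
+:class: no-execute
+
+df = pl.read_parquet('data.parquet')  # Apache Parquet
+df = pl.read_json('data.json')        # JSON
+df = pl.read_excel('data.xlsx')       # Excel (requires an extra engine package)
+```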
+
+### Using {index}`wbgapi ` and {index}`yfinance ` to Access Data
+
+The [wbgapi](https://pypi.org/project/wbgapi/) python library can be used to fetch data from the many databases published by the World Bank.
+
+```{note}
+You can find some useful information about the [wbgapi](https://pypi.org/project/wbgapi/) package in this [world bank blog post](https://blogs.worldbank.org/en/opendata/introducing-wbgapi-new-python-package-accessing-world-bank-data), in addition to this [tutorial](https://github.com/tgherzog/wbgapi/blob/master/examples/wbgapi-quickstart.ipynb)
+```
+
+We will also use [yfinance](https://pypi.org/project/yfinance/) to fetch data from Yahoo finance
+in the exercises.
+
+For now let's work through one example of downloading and plotting data --- this
+time from the World Bank.
+
+The World Bank [collects and organizes data](https://data.worldbank.org/indicator) on a huge range of indicators.
+
+For example, [here's](https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS) some data on government debt as a ratio to GDP.
+
+The next code example fetches the data for you and plots time series for the US and Australia
+
+```{code-cell} ipython3
+import wbgapi as wb
+wb.series.info('GC.DOD.TOTL.GD.ZS')
+```
+
+```{code-cell} ipython3
+govt_debt_pandas = wb.data.DataFrame('GC.DOD.TOTL.GD.ZS', economy=['USA','AUS'], time=range(2005,2016))
+govt_debt_pandas = govt_debt_pandas.T # move years from columns to rows for plotting
+
+# Convert to polars
+govt_debt = pl.from_pandas(govt_debt_pandas.reset_index())
+```
+
+```{code-cell} ipython3
+# For plotting, convert back to pandas format
+govt_debt.to_pandas().set_index('index').plot(xlabel='year', ylabel='Government debt (% of GDP)');
+```
+
+## Exercises
+
+```{exercise-start}
+:label: pl_ex1
+```
+
+With these imports:
+
+```{code-cell} ipython3
+import datetime as dt
+import yfinance as yf
+```
+
+Write a program to calculate the percentage price change over 2021 for the following shares:
+
+```{code-cell} ipython3
+ticker_list = {'INTC': 'Intel',
+ 'MSFT': 'Microsoft',
+ 'IBM': 'IBM',
+ 'BHP': 'BHP',
+ 'TM': 'Toyota',
+ 'AAPL': 'Apple',
+ 'AMZN': 'Amazon',
+ 'C': 'Citigroup',
+ 'QCOM': 'Qualcomm',
+ 'KO': 'Coca-Cola',
+ 'GOOG': 'Google'}
+```
+
+Here's the first part of the program
+
+```{code-cell} ipython3
+def read_data(ticker_list,
+ start=dt.datetime(2021, 1, 1),
+ end=dt.datetime(2021, 12, 31)):
+ """
+ This function reads in closing price data from Yahoo
+ for each tick in the ticker_list.
+ """
+
+ all_data = []
+
+ for tick in ticker_list:
+ stock = yf.Ticker(tick)
+ prices = stock.history(start=start, end=end)
+
+ # Convert to polars DataFrame
+ df = pl.from_pandas(prices.reset_index())
+ df = df.with_columns([
+ pl.col('Date').cast(pl.Date),
+ pl.lit(tick).alias('ticker')
+ ]).select(['Date', 'ticker', 'Close'])
+
+ all_data.append(df)
+
+ # Combine all data
+ ticker_df = pl.concat(all_data)
+
+ # Pivot to have tickers as columns
+ ticker_df = ticker_df.pivot(values='Close', index='Date', columns='ticker')
+
+ return ticker_df
+
+ticker = read_data(ticker_list)
+```
+
+Complete the program to plot the result as a bar graph like this one:
+
+```{image} /_static/lecture_specific/pandas/pandas_share_prices.png
+:scale: 80
+:align: center
+```
+
+```{exercise-end}
+```
+
+```{solution-start} pl_ex1
+:class: dropdown
+```
+
+There are a few ways to approach this problem using Polars to calculate
+the percentage change.
+
+First, you can extract the data and perform the calculation as follows:
+
+```{code-cell} ipython3
+# Get first and last prices for each ticker
+first_prices = ticker[0] # First row
+last_prices = ticker[-1] # Last row
+
+# Convert to pandas for easier calculation
+first_pd = ticker.head(1).to_pandas().iloc[0]
+last_pd = ticker.tail(1).to_pandas().iloc[0]
+
+price_change = (last_pd - first_pd) / first_pd * 100
+price_change = price_change.dropna() # Remove Date column
+price_change
+```
+
+Alternatively, you can use polars expressions to calculate the percentage change:
+
+```{code-cell} ipython3
+# Calculate percentage change using polars
+change_df = ticker.select([
+ ((pl.col(col).last() - pl.col(col).first()) / pl.col(col).first() * 100).alias(f'{col}_pct_change')
+ for col in ticker.columns if col != 'Date'
+])
+
+# Convert to series for plotting
+price_change = change_df.to_pandas().iloc[0]
+price_change.index = [col.replace('_pct_change', '') for col in price_change.index]
+price_change
+```
+
+Then to plot the chart
+
+```{code-cell} ipython3
+price_change.sort_values(inplace=True)
+price_change.rename(index=ticker_list, inplace=True)
+```
+
+```{code-cell} ipython3
+fig, ax = plt.subplots(figsize=(10,8))
+ax.set_xlabel('stock', fontsize=12)
+ax.set_ylabel('percentage change in price', fontsize=12)
+price_change.plot(kind='bar', ax=ax)
+plt.show()
+```
+
+```{solution-end}
+```
+
+
+```{exercise-start}
+:label: pl_ex2
+```
+
+Using the method `read_data` introduced in {ref}`pl_ex1`, write a program to obtain year-on-year percentage change for the following indices:
+
+```{code-cell} ipython3
+indices_list = {'^GSPC': 'S&P 500',
+ '^IXIC': 'NASDAQ',
+ '^DJI': 'Dow Jones',
+ '^N225': 'Nikkei'}
+```
+
+Complete the program to show summary statistics and plot the result as a time series graph like this one:
+
+```{image} /_static/lecture_specific/pandas/pandas_indices_pctchange.png
+:scale: 80
+:align: center
+```
+
+```{exercise-end}
+```
+
+```{solution-start} pl_ex2
+:class: dropdown
+```
+
+Following the work you did in {ref}`pl_ex1`, you can query the data using `read_data` by updating the start and end dates accordingly.
+
+```{code-cell} ipython3
+indices_data = read_data(
+ indices_list,
+    start=dt.datetime(1971, 1, 1), # common start date
+ end=dt.datetime(2021, 12, 31)
+)
+```
+
+Then, calculate the yearly returns using polars:
+
+```{code-cell} ipython3
+# Add year column and calculate yearly returns
+yearly_returns_list = []
+
+for index_col in indices_data.columns:
+ if index_col != 'Date':
+ yearly_data = (indices_data
+ .with_columns(pl.col('Date').dt.year().alias('year'))
+ .group_by('year')
+ .agg([
+ pl.col(index_col).first().alias('first_price'),
+ pl.col(index_col).last().alias('last_price')
+ ])
+ .with_columns(
+ ((pl.col('last_price') - pl.col('first_price')) / pl.col('first_price')).alias(indices_list[index_col])
+ )
+ .select(['year', indices_list[index_col]]))
+
+ yearly_returns_list.append(yearly_data)
+
+# Join all yearly returns
+yearly_returns = yearly_returns_list[0]
+for df in yearly_returns_list[1:]:
+ yearly_returns = yearly_returns.join(df, on='year', how='outer')
+
+yearly_returns
+```
+
+Next, you can obtain summary statistics by using the method `describe`.
+
+```{code-cell} ipython3
+yearly_returns.select(pl.exclude('year')).describe()
+```
+
+Then, to plot the chart
+
+```{code-cell} ipython3
+# Convert to pandas for plotting
+yearly_returns_pd = yearly_returns.to_pandas().set_index('year')
+
+fig, axes = plt.subplots(2, 2, figsize=(10, 8))
+
+for iter_, ax in enumerate(axes.flatten()): # Flatten 2-D array to 1-D array
+ if iter_ < len(yearly_returns_pd.columns):
+ index_name = yearly_returns_pd.columns[iter_] # Get index name per iteration
+ ax.plot(yearly_returns_pd[index_name]) # Plot pct change of yearly returns per index
+ ax.set_ylabel("percent change", fontsize = 12)
+ ax.set_title(index_name)
+
+plt.tight_layout()
+```
+
+```{solution-end}
+```
+
+[^mung]: Wikipedia defines munging as cleaning data from one raw form into a structured, purged one.
\ No newline at end of file
From 65e4ef8483abaedd9474de8b0accc4aee992cc7a Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Fri, 29 Aug 2025 05:49:04 +0000
Subject: [PATCH 03/36] Fix Polars .suffix() method to use .name.suffix() for
compatibility
Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com>
---
lectures/polars.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 68633ae1..91fc0cf0 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -297,7 +297,7 @@ Here is an example using built-in functions
```{code-cell} ipython3
df.select([
- pl.col(['year', 'POP', 'XRAT', 'tcgdp', 'cc', 'cg']).max().suffix('_max')
+ pl.col(['year', 'POP', 'XRAT', 'tcgdp', 'cc', 'cg']).max().name.suffix('_max')
])
```
From de01b789e493f79d5489206826b818fd4a059fb2 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Fri, 29 Aug 2025 06:18:28 +0000
Subject: [PATCH 04/36] Fix column name from 'DATE' to 'observation_date' in
FRED data filtering
Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com>
---
lectures/polars.md | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 91fc0cf0..a9385883 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -579,9 +579,9 @@ We can also plot the unemployment rate from 2006 to 2012 as follows
```{code-cell} ipython3
# Filter data for the specified date range and convert to pandas for plotting
filtered_data = data.filter(
- (pl.col('DATE') >= pl.date(2006, 1, 1)) &
- (pl.col('DATE') <= pl.date(2012, 12, 31))
-).to_pandas().set_index('DATE')
+ (pl.col('observation_date') >= pl.date(2006, 1, 1)) &
+ (pl.col('observation_date') <= pl.date(2012, 12, 31))
+).to_pandas().set_index('observation_date')
ax = filtered_data.plot(title='US Unemployment Rate', legend=False)
ax.set_xlabel('year', fontsize=12)
From 490372f3a6999bb05eabd191bdb94ba80fa32417 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Fri, 29 Aug 2025 06:59:47 +0000
Subject: [PATCH 05/36] Fix TypeError in Polars exercise and update pivot API
usage
Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com>
---
lectures/polars.md | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index a9385883..e62ead0b 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -690,7 +690,7 @@ def read_data(ticker_list,
ticker_df = pl.concat(all_data)
# Pivot to have tickers as columns
- ticker_df = ticker_df.pivot(values='Close', index='Date', columns='ticker')
+ ticker_df = ticker_df.pivot(values='Close', index='Date', on='ticker')
return ticker_df
@@ -721,12 +721,12 @@ First, you can extract the data and perform the calculation such as:
first_prices = ticker[0] # First row
last_prices = ticker[-1] # Last row
-# Convert to pandas for easier calculation
-first_pd = ticker.head(1).to_pandas().iloc[0]
-last_pd = ticker.tail(1).to_pandas().iloc[0]
+# Convert to pandas for easier calculation, excluding Date column to avoid type errors
+numeric_cols = [col for col in ticker.columns if col != 'Date']
+first_pd = ticker.head(1).select(numeric_cols).to_pandas().iloc[0]
+last_pd = ticker.tail(1).select(numeric_cols).to_pandas().iloc[0]
price_change = (last_pd - first_pd) / first_pd * 100
-price_change = price_change.dropna() # Remove Date column
price_change
```
From ea139dfffb01bfae23668626b88c1635e816ddaa Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Sun, 31 Aug 2025 21:57:11 +0000
Subject: [PATCH 06/36] Fix DuplicateError in Polars exercise join logic by
using concat and pivot approach
Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com>
---
lectures/polars.md | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index e62ead0b..208cf036 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -804,8 +804,8 @@ indices_data = read_data(
Then, calculate the yearly returns using polars:
```{code-cell} ipython3
-# Add year column and calculate yearly returns
-yearly_returns_list = []
+# Combine all yearly returns using concat and pivot approach
+all_yearly_data = []
for index_col in indices_data.columns:
if index_col != 'Date':
@@ -817,16 +817,18 @@ for index_col in indices_data.columns:
pl.col(index_col).last().alias('last_price')
])
.with_columns(
- ((pl.col('last_price') - pl.col('first_price')) / pl.col('first_price')).alias(indices_list[index_col])
+ ((pl.col('last_price') - pl.col('first_price') + 1e-10) / (pl.col('first_price') + 1e-10)).alias('return')
)
- .select(['year', indices_list[index_col]]))
+ .with_columns(pl.lit(indices_list[index_col]).alias('index_name'))
+ .select(['year', 'index_name', 'return']))
- yearly_returns_list.append(yearly_data)
+ all_yearly_data.append(yearly_data)
-# Join all yearly returns
-yearly_returns = yearly_returns_list[0]
-for df in yearly_returns_list[1:]:
- yearly_returns = yearly_returns.join(df, on='year', how='outer')
+# Concatenate all data
+combined_data = pl.concat(all_yearly_data)
+
+# Pivot to get indices as columns
+yearly_returns = combined_data.pivot(values='return', index='year', on='index_name')
yearly_returns
```
From b67e56b541c729282efe311b1a14bc58c161ccbe Mon Sep 17 00:00:00 2001
From: mmcky
Date: Mon, 1 Sep 2025 12:57:12 +1000
Subject: [PATCH 07/36] initial @mmcky review, in-work
---
lectures/pandas.md | 1 +
lectures/polars.md | 57 +++++++++++++++++++++++++++++++++-------------
2 files changed, 42 insertions(+), 16 deletions(-)
diff --git a/lectures/pandas.md b/lectures/pandas.md
index 3d2c809d..79c07d11 100644
--- a/lectures/pandas.md
+++ b/lectures/pandas.md
@@ -78,6 +78,7 @@ You can think of a `Series` as a "column" of data, such as a collection of obser
A `DataFrame` is a two-dimensional object for storing related columns of data.
+(pandas:series)=
## Series
```{index} single: Pandas; Series
diff --git a/lectures/polars.md b/lectures/polars.md
index 208cf036..66b67eaa 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -37,7 +37,7 @@ In addition to what's in Anaconda, this lecture will need the following librarie
## Overview
-[Polars](https://pola.rs/) is a lightning-fast data manipulation library for Python written in Rust.
+[Polars](https://pola.rs/) is a fast data manipulation library for Python written in Rust.
Polars has gained significant popularity in recent years due to its superior performance
compared to traditional data analysis tools, making it an excellent choice for modern
@@ -58,7 +58,7 @@ Just as [NumPy](https://numpy.org/) provides the basic array data type plus core
* adjusting indices
* working with dates and time series
* sorting, grouping, re-ordering and general data munging [^mung]
- * dealing with missing values, etc., etc.
+ * dealing with missing values, etc.
More sophisticated statistical functionality is left to other packages, such
as [statsmodels](https://www.statsmodels.org/) and [scikit-learn](https://scikit-learn.org/), which can work with polars DataFrames through their interoperability with pandas.
@@ -70,6 +70,7 @@ place
```{code-cell} ipython3
import polars as pl
+import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
@@ -96,11 +97,16 @@ s = pl.Series(name='daily returns', values=np.random.randn(4))
s
```
-Here you can imagine the indices `0, 1, 2, 3` as indexing four listed
-companies, and the values being daily returns on their shares.
+```{note}
+You may notice the above series has no indexes, unlike in [](pandas:series).
+
+This is because Polars' is column centric and accessing data is predominantly managed through filtering and boolean masks.
+
+Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413)
+```
Polars `Series` are built on top of Apache Arrow arrays and support many similar
-operations
+operations to Pandas `Series`.
```{code-cell} ipython3
s * 100
@@ -112,16 +118,27 @@ s.abs()
But `Series` provide more than basic arrays.
-Not only do they have some additional (statistically oriented) methods
+For example they have some additional (statistically oriented) methods
```{code-cell} ipython3
s.describe()
```
-But they can also be used with custom indices
+However the Polars `series` cannot be used in the same as as a Pandas `series` when pairing data with indices.
+
+For example, using a Pandas `series` you can do the following:
+
+```{code-cell} ipython3
+s = pd.Series(np.random.randn(4), name='daily returns')
+s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']
+s
+```
+
+However, in Polars you will need to use the `DataFrame` object to do the same task.
+
+Essentially any column in a Polars `DataFrame` can be used as an indices through the `filter` method.
```{code-cell} ipython3
-# Create a new series with custom index using a DataFrame
df_temp = pl.DataFrame({
'company': ['AMZN', 'AAPL', 'MSFT', 'GOOG'],
'daily returns': s.to_list()
@@ -149,7 +166,7 @@ df_temp
```{code-cell} ipython3
# Check if AAPL is in the companies
-'AAPL' in df_temp.get_column('company').to_list()
+'AAPL' in df_temp.get_column('company')
```
## DataFrames
@@ -161,7 +178,7 @@ While a `Series` is a single column of data, a `DataFrame` is several columns, o
In essence, a `DataFrame` in polars is analogous to a (highly optimized) Excel spreadsheet.
-Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns, often with descriptive indexes for individual rows and individual columns.
+Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns.
Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`, which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
@@ -293,7 +310,7 @@ Polars provides powerful methods for applying functions to data.
Instead of pandas' `apply` method, polars uses expressions within `select`, `with_columns`, or `filter` methods.
-Here is an example using built-in functions
+Here is an example using built-in functions to find the `max` value for each column
```{code-cell} ipython3
df.select([
@@ -301,8 +318,6 @@ df.select([
])
```
-This line of code applies the `max` function to all selected columns.
-
For more complex operations, we can use `map_elements` (similar to pandas' apply):
```{code-cell} ipython3
@@ -314,6 +329,16 @@ df.with_row_index().select([
])
```
+However as you can see from the Warning issued by Polars there is often a better way to achieve this using the Polars API.
+
+```{code-cell} ipython3
+df.with_row_index().select([
+ pl.col('index'),
+ pl.col('country'),
+ (pl.col('POP') * 2).alias('POP_doubled')
+])
+```
+
We can use complex filtering conditions with boolean logic:
```{code-cell} ipython3
@@ -362,7 +387,7 @@ df.with_columns([
])
```
-**4.** We can use `map_elements` to modify all individual entries in specific columns.
+**4.** We can use in-built functions to modify all individual entries in specific columns.
```{code-cell} ipython3
# Round all decimal numbers to 2 decimal places in numeric columns
@@ -408,7 +433,7 @@ For example, we can use forward fill, backward fill, or interpolation
```{code-cell} ipython3
# Fill with column means for numeric columns
df_filled = df_with_nulls.with_columns([
- pl.col(pl.Float64, pl.Int64).fill_null(pl.col(pl.Float64, pl.Int64).mean())
+ pl.col(pl.Float64).fill_null(pl.col(pl.Float64).mean())
])
df_filled
```
@@ -860,4 +885,4 @@ plt.tight_layout()
```{solution-end}
```
-[^mung]: Wikipedia defines munging as cleaning data from one raw form into a structured, purged one.
\ No newline at end of file
+[^mung]: Wikipedia defines munging as cleaning data from one raw form into a structured, purged one.
From 746f18cc214b31b118f62fba1074a98784c3b88a Mon Sep 17 00:00:00 2001
From: mmcky
Date: Thu, 4 Sep 2025 10:24:14 +1000
Subject: [PATCH 08/36] updates to series section
---
lectures/polars.md | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 66b67eaa..8679bbd7 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -65,6 +65,10 @@ as [statsmodels](https://www.statsmodels.org/) and [scikit-learn](https://scikit
This lecture will provide a basic introduction to polars.
+```{tip}
+**Why use Polars over pandas?** The main reason is `performance`. As a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing the Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/)
+```
+
Throughout the lecture, we will assume that the following imports have taken
place
@@ -89,7 +93,6 @@ A `DataFrame` is a two-dimensional object for storing related columns of data.
Let's start with Series.
-
We begin by creating a series of four random observations
```{code-cell} ipython3
@@ -98,11 +101,11 @@ s
```
```{note}
-You may notice the above series has no indexes, unlike in [](pandas:series).
+You may notice the above series has no indices, unlike in [pd.Series](pandas:series).
This is because Polars' is column centric and accessing data is predominantly managed through filtering and boolean masks.
-Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413)
+Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
```
Polars `Series` are built on top of Apache Arrow arrays and support many similar
@@ -134,7 +137,10 @@ s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']
s
```
-However, in Polars you will need to use the `DataFrame` object to do the same task.
+However, in Polars you will need to use the `DataFrame` object to do the same task.
+
+This means you will use the `DataFrame` object more commonly when using polars if you
+are interested in relationships between data.
Essentially any column in a Polars `DataFrame` can be used as an indices through the `filter` method.
@@ -146,15 +152,16 @@ df_temp = pl.DataFrame({
df_temp
```
-To access specific values by company name, we can filter the DataFrame
+To access specific values by company name, we can filter the DataFrame, filtering on
+the `AMZN` ticker code and selecting the `daily returns`.
```{code-cell} ipython3
-# Get AMZN's return
df_temp.filter(pl.col('company') == 'AMZN').select('daily returns').item()
```
+If we want to update the `AMZN` return to 0, we can use the following chain of methods.
+
```{code-cell} ipython3
-# Update AMZN's return to 0
df_temp = df_temp.with_columns(
pl.when(pl.col('company') == 'AMZN')
.then(0)
@@ -164,8 +171,9 @@ df_temp = df_temp.with_columns(
df_temp
```
+You could also check if `AAPL` is in a column.
+
```{code-cell} ipython3
-# Check if AAPL is in the companies
'AAPL' in df_temp.get_column('company')
```
From 4e654c764f2f844c9ac9c1292d108f39ac6170b9 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Fri, 5 Sep 2025 10:43:43 +1000
Subject: [PATCH 09/36] edit round of DataFrames
---
lectures/polars.md | 183 +++++++++++++++++++--------------------------
1 file changed, 77 insertions(+), 106 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 8679bbd7..bc0d7587 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -66,11 +66,10 @@ as [statsmodels](https://www.statsmodels.org/) and [scikit-learn](https://scikit
This lecture will provide a basic introduction to polars.
```{tip}
-**Why use Polars over pandas?** The main reason is `performance`. As a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing the Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/)
+**Why use Polars over pandas?** One reason is **performance**. As a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/).
```
-Throughout the lecture, we will assume that the following imports have taken
-place
+Throughout the lecture, we will assume that the following imports have taken place
```{code-cell} ipython3
import polars as pl
@@ -101,11 +100,7 @@ s
```
```{note}
-You may notice the above series has no indices, unlike in [pd.Series](pandas:series).
-
-This is because Polars' is column centric and accessing data is predominantly managed through filtering and boolean masks.
-
-Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
+You may notice the above series has no indices, unlike in [pd.Series](pandas:series). This is because Polars is column-centric and accessing data is predominantly managed through filtering and boolean masks. Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
```
Polars `Series` are built on top of Apache Arrow arrays and support many similar
@@ -127,9 +122,9 @@ For example they have some additional (statistically oriented) methods
s.describe()
```
-However the Polars `series` cannot be used in the same as as a Pandas `series` when pairing data with indices.
+However, the `pl.Series` object cannot be used in the same way as a `pd.Series` when pairing data with indices.
-For example, using a Pandas `series` you can do the following:
+For example, using a `pd.Series` you can do the following:
```{code-cell} ipython3
s = pd.Series(np.random.randn(4), name='daily returns')
@@ -139,42 +134,42 @@ s
However, in Polars you will need to use the `DataFrame` object to do the same task.
-This means you will use the `DataFrame` object more commonly when using polars if you
-are interested in relationships between data.
+This means you will use the `DataFrame` object more often when using polars if you
+are interested in relationships between data.
-Essentially any column in a Polars `DataFrame` can be used as an indices through the `filter` method.
+Let's create a `pl.DataFrame` containing the equivalent data to the `pd.Series` above.
```{code-cell} ipython3
-df_temp = pl.DataFrame({
+df = pl.DataFrame({
'company': ['AMZN', 'AAPL', 'MSFT', 'GOOG'],
'daily returns': s.to_list()
})
-df_temp
+df
```
To access specific values by company name, we can filter the DataFrame, filtering on
-the `AMZN` ticker code and selecting the `daily returns`.
+the `AMZN` ticker code and selecting the `daily returns`.
```{code-cell} ipython3
-df_temp.filter(pl.col('company') == 'AMZN').select('daily returns').item()
+df.filter(pl.col('company') == 'AMZN').select('daily returns').item()
```
If we want to update the `AMZN` return to 0, we can use the following chain of methods.
```{code-cell} ipython3
-df_temp = df_temp.with_columns(
- pl.when(pl.col('company') == 'AMZN')
- .then(0)
- .otherwise(pl.col('daily returns'))
- .alias('daily returns')
+df = df.with_columns( # with_columns is similar to select but adds columns to the same DataFrame
+ pl.when(pl.col('company') == 'AMZN') # filter for rows relating to AMZN in company column
+ .then(0) # set values to 0
+ .otherwise(pl.col('daily returns')) # otherwise keep the value in daily returns column
+ .alias('daily returns') # assign back to the daily returns column
)
-df_temp
+df
```
-You could also check if `AAPL` is in a column.
+You can check if a ticker code is in the company list.
```{code-cell} ipython3
-'AAPL' in df_temp.get_column('company')
+'AAPL' in df['company']
```
## DataFrames
@@ -188,7 +183,8 @@ In essence, a `DataFrame` in polars is analogous to a (highly optimized) Excel s
Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns.
-Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`, which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
+Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`,
+which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
The dataset contains the following indicators
@@ -204,11 +200,12 @@ The dataset contains the following indicators
We'll read this in from a URL using the `polars` function `read_csv`.
```{code-cell} ipython3
-df = pl.read_csv('https://raw.githubusercontent.com/QuantEcon/lecture-python-programming/master/source/_static/lecture_specific/pandas/data/test_pwt.csv')
+URL = 'https://raw.githubusercontent.com/QuantEcon/lecture-python-programming/master/source/_static/lecture_specific/pandas/data/test_pwt.csv'
+df = pl.read_csv(URL)
type(df)
```
-Here's the content of `test_pwt.csv`
+Here is the content of `test_pwt.csv`
```{code-cell} ipython3
df
@@ -216,7 +213,8 @@ df
### Select Data by Position
-In practice, one thing that we do all the time is to find, select and work with a subset of the data of our interests.
+In practice, one thing that we do all the time is to find, select and work with a
+subset of the data we are interested in.
We can select particular rows using array slicing notation
@@ -254,14 +252,17 @@ The most straightforward way is with the `filter` method.
df.filter(pl.col('POP') >= 20000)
```
-To understand what is going on here, notice that `pl.col('POP') >= 20000` creates a boolean expression.
+In this case, `df.filter()` takes a boolean expression and only returns rows where the expression evaluates to `True`.
+
+We can see this boolean mask by displaying the results of the comparison in the following table.
```{code-cell} ipython3
-df.select(pl.col('POP') >= 20000)
+df.select(
+ pl.col('country'), # Include country for reference
+ (pl.col('POP') >= 20000).alias('meets_criteria') # meets_criteria shows results of the comparison expression
+)
```
-In this case, `df.filter()` takes a boolean expression and only returns rows with the `True` values.
-
Take one more example,
```{code-cell} ipython3
@@ -277,7 +278,8 @@ We can also allow arithmetic operations between different columns.
df.filter((pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000))
```
-For example, we can use the conditioning to select the country with the largest household consumption - gdp share `cc`.
+For example, we can use this kind of condition to select the country with the largest
+household consumption share of GDP, `cc`.
```{code-cell} ipython3
df.filter(pl.col('cc') == pl.col('cc').max())
@@ -291,13 +293,13 @@ df.filter((pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)).select
**Application: Subsetting Dataframe**
-Real-world datasets can be [enormous](https://developers.google.com/machine-learning/crash-course/overfitting).
+Real-world datasets can be very large.
It is sometimes desirable to work with a subset of data to enhance computational efficiency and reduce redundancy.
Let's imagine that we're only interested in the population (`POP`) and total GDP (`tcgdp`).
-One way to strip the data frame `df` down to only these variables is to overwrite the dataframe using the selection method described above
+One way to strip the data frame `df` down to only these variables is to overwrite the `DataFrame` using the selection method described above
```{code-cell} ipython3
df_subset = df.select(['country', 'POP', 'tcgdp'])
@@ -329,19 +331,16 @@ df.select([
For more complex operations, we can use `map_elements` (similar to pandas' apply):
```{code-cell} ipython3
-# A trivial example using map_elements
-df.with_row_index().select([
- pl.col('index'),
+df.select([
pl.col('country'),
- pl.col('POP').map_elements(lambda x: x * 2, return_dtype=pl.Float64).alias('POP_doubled')
+ pl.col('POP').map_elements(lambda x: x * 2).alias('POP_doubled')
])
```
-However as you can see from the Warning issued by Polars there is often a better way to achieve this using the Polars API.
+However, as you can see from the warning issued by Polars, there is often a better way to achieve this using the Polars API.
```{code-cell} ipython3
-df.with_row_index().select([
- pl.col('index'),
+df.select([
pl.col('country'),
(pl.col('POP') * 2).alias('POP_doubled')
])
@@ -351,9 +350,9 @@ We can use complex filtering conditions with boolean logic:
```{code-cell} ipython3
complex_condition = (
- pl.when(pl.col('country').is_in(['Argentina', 'India', 'South Africa']))
- .then(pl.col('POP') > 40000)
- .otherwise(pl.col('POP') < 20000)
+ pl.when(pl.col('country').is_in(['Argentina', 'India', 'South Africa'])) # for the countries that match those in the list
+ .then(pl.col('POP') > 40000) # mark True if population is > 40,000
+ .otherwise(pl.col('POP') < 20000) # otherwise False if population is < 20,000
)
df.filter(complex_condition).select(['country', 'year', 'POP', 'XRAT', 'tcgdp'])
@@ -366,22 +365,22 @@ The ability to make changes in dataframes is important to generate a clean datas
**1.** We can use conditional logic to "keep" certain values and replace others
```{code-cell} ipython3
-df.with_columns(
- pl.when(pl.col('POP') >= 20000)
- .then(pl.col('POP'))
- .otherwise(None)
- .alias('POP_filtered')
-).select(['country', 'POP', 'POP_filtered'])
+df.with_columns( # add data column to the same dataframe
+ pl.when(pl.col('POP') >= 20000) # when population is greater than 20,000
+ .then(pl.col('POP')) # keep the population value
+ .otherwise(None) # otherwise set the value to null
+ .alias('POP_filtered') # save results in column POP_filtered
+).select(['country', 'POP', 'POP_filtered']) # select the columns of interest
```
**2.** We can modify specific values based on conditions
```{code-cell} ipython3
-df_modified = df.with_columns(
- pl.when(pl.col('cg') == pl.col('cg').max())
- .then(None)
- .otherwise(pl.col('cg'))
- .alias('cg')
+df_modified = df.with_columns(
+ pl.when(pl.col('cg') == pl.col('cg').max()) # when a value in the cg column is equal to the max cg value
+ .then(None) # set to null
+ .otherwise(pl.col('cg')) # otherwise keep the value in the cg column
+ .alias('cg') # update the column with name cg
)
df_modified
```
@@ -390,17 +389,19 @@ df_modified
```{code-cell} ipython3
df.with_columns([
- pl.when(pl.col('POP') <= 10000).then(None).otherwise(pl.col('POP')).alias('POP'),
- (pl.col('XRAT') / 10).alias('XRAT')
+ pl.when(pl.col('POP') <= 10000) # when population is < 10,000
+ .then(None) # set the value to null
+ .otherwise(pl.col('POP')) # otherwise keep the existing value
+ .alias('POP'), # update the POP column
+ (pl.col('XRAT') / 10).alias('XRAT') # using the XRAT column, divide the value by 10 and update the column in-place
])
```
-**4.** We can use in-built functions to modify all individual entries in specific columns.
+**4.** We can use in-built functions to modify all individual entries in specific columns by data type.
```{code-cell} ipython3
-# Round all decimal numbers to 2 decimal places in numeric columns
df.with_columns([
- pl.col(pl.Float64).round(2)
+ pl.col(pl.Float64).round(2) # round all Float64 columns to 2 decimal places
])
```
@@ -440,10 +441,10 @@ For example, we can use forward fill, backward fill, or interpolation
```{code-cell} ipython3
# Fill with column means for numeric columns
-df_filled = df_with_nulls.with_columns([
- pl.col(pl.Float64).fill_null(pl.col(pl.Float64).mean())
+cols = ["cc", "tcgdp", "POP", "XRAT"]
+df_with_nulls.with_columns([
+ pl.col(cols).fill_null(pl.col(cols).mean()) # fill null values with the column mean
])
-df_filled
```
Missing value imputation is a big area in data science involving various machine learning techniques.
@@ -454,15 +455,13 @@ There are also more [advanced tools](https://scikit-learn.org/stable/modules/imp
Let's imagine that we're only interested in the population (`POP`) and total GDP (`tcgdp`).
-One way to strip the data frame `df` down to only these variables is to overwrite the dataframe using the selection method described above
+One way to strip the data frame `df` down to only these variables is to overwrite the `DataFrame` using the selection method described above
```{code-cell} ipython3
df = df.select(['country', 'POP', 'tcgdp'])
df
```
-Here the index `0, 1,..., 7` is redundant because we can use the country names as an index.
-
While polars doesn't have a traditional index like pandas, we can work with country names directly
```{code-cell} ipython3
@@ -483,7 +482,11 @@ df = df.with_columns((pl.col('population') * 1e3).alias('population'))
df
```
-Next, we're going to add a column showing real GDP per capita, multiplying by 1,000,000 as we go because total GDP is in millions
+Next, we're going to add a column showing real GDP per capita, multiplying by 1,000,000 as we go because total GDP is in millions.
+
+```{note}
+Polars (or pandas) doesn't have a way of recording dimensional analysis units such as GDP represented in millions of dollars. It is left to the user to track their own units when undertaking analysis.
+```
```{code-cell} ipython3
df = df.with_columns(
@@ -626,43 +629,7 @@ Note that polars offers many other file type alternatives.
Polars has [a wide variety](https://docs.pola.rs/user-guide/io/) of methods that we can use to read excel, json, parquet or plug straight into a database server.
-### Using {index}`wbgapi ` and {index}`yfinance ` to Access Data
-
-The [wbgapi](https://pypi.org/project/wbgapi/) python library can be used to fetch data from the many databases published by the World Bank.
-
-```{note}
-You can find some useful information about the [wbgapi](https://pypi.org/project/wbgapi/) package in this [world bank blog post](https://blogs.worldbank.org/en/opendata/introducing-wbgapi-new-python-package-accessing-world-bank-data), in addition to this [tutorial](https://github.com/tgherzog/wbgapi/blob/master/examples/wbgapi-quickstart.ipynb)
-```
-
-We will also use [yfinance](https://pypi.org/project/yfinance/) to fetch data from Yahoo finance
-in the exercises.
-
-For now let's work through one example of downloading and plotting data --- this
-time from the World Bank.
-
-The World Bank [collects and organizes data](https://data.worldbank.org/indicator) on a huge range of indicators.
-
-For example, [here's](https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS) some data on government debt as a ratio to GDP.
-
-The next code example fetches the data for you and plots time series for the US and Australia
-
-```{code-cell} ipython3
-import wbgapi as wb
-wb.series.info('GC.DOD.TOTL.GD.ZS')
-```
-
-```{code-cell} ipython3
-govt_debt_pandas = wb.data.DataFrame('GC.DOD.TOTL.GD.ZS', economy=['USA','AUS'], time=range(2005,2016))
-govt_debt_pandas = govt_debt_pandas.T # move years from columns to rows for plotting
-
-# Convert to polars
-govt_debt = pl.from_pandas(govt_debt_pandas.reset_index())
-```
-
-```{code-cell} ipython3
-# For plotting, convert back to pandas format
-govt_debt.to_pandas().set_index('index').plot(xlabel='year', ylabel='Government debt (% of GDP)');
-```
++++
## Exercises
@@ -695,6 +662,10 @@ ticker_list = {'INTC': 'Intel',
Here's the first part of the program
+```{note}
+Many Python packages will return pandas DataFrames by default. In this example we use the `yfinance` package and convert the data to a Polars DataFrame.
+```
+
```{code-cell} ipython3
def read_data(ticker_list,
start=dt.datetime(2021, 1, 1),
From 018fe74c2a514f615b77c7db27a8be741e8f41c8 Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:03:46 +1000
Subject: [PATCH 10/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index bc0d7587..3971e0bf 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -30,9 +30,7 @@ In addition to what's in Anaconda, this lecture will need the following librarie
```{code-cell} ipython3
:tags: [hide-output]
-!pip install --upgrade polars
-!pip install --upgrade wbgapi
-!pip install --upgrade yfinance
+!pip install --upgrade polars wbgapi yfinance
```
## Overview
From 7d84e1211ef9d5557929f66729b7a0b83623e454 Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:04:02 +1000
Subject: [PATCH 11/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 3971e0bf..36d0abb4 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -64,7 +64,7 @@ as [statsmodels](https://www.statsmodels.org/) and [scikit-learn](https://scikit
This lecture will provide a basic introduction to polars.
```{tip}
-**Why use Polars over pandas?** One reason is **performance**. As a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing the Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/).
+*Why use Polars over pandas?* One reason is *performance*. As a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/).
```
Throughout the lecture, we will assume that the following imports have taken place
From c776ffbea3a63c8eb29ac854cd58376863bda44b Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:04:15 +1000
Subject: [PATCH 12/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 36d0abb4..a139308f 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -98,7 +98,7 @@ s
```
```{note}
-You may notice the above series has no indices, unlike in [pd.Series](pandas:series).This is because Polars' is column centric and accessing data is predominantly managed through filtering and boolean masks. Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
+You may notice the above series has no indices, unlike in [pd.Series](pandas:series). This is because Polars' is column centric and accessing data is predominantly managed through filtering and boolean masks. Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
```
Polars `Series` are built on top of Apache Arrow arrays and support many similar
From 05e0523143eb6c8fd40f774528e5e23d36a24bff Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:04:32 +1000
Subject: [PATCH 13/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 2 ++
1 file changed, 2 insertions(+)
diff --git a/lectures/polars.md b/lectures/polars.md
index a139308f..088118c1 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -104,6 +104,8 @@ You may notice the above series has no indices, unlike in [pd.Series](pandas:ser
Polars `Series` are built on top of Apache Arrow arrays and support many similar
operations to Pandas `Series`.
+(For interested readers, see this extended tutorial on [Apache Arrow](https://www.datacamp.com/tutorial/apache-arrow))
+
```{code-cell} ipython3
s * 100
```
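+
+Because Polars `Series` map onto Arrow arrays, they can also be handed to Arrow-aware tools directly (a minimal sketch; `to_arrow` is the relevant method, and it assumes `pyarrow` is installed):
+
+```{code-cell} ipython3
+s.to_arrow()  # view the same data as a pyarrow array
+```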
From 4fdd0526b1e6aae3cfd8c0b4f2c48e10bb432fa1 Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:05:06 +1000
Subject: [PATCH 14/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 088118c1..8dbfe845 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -156,15 +156,17 @@ df.filter(pl.col('company') == 'AMZN').select('daily returns').item()
If we want to update `AMZN` return to 0, you can use the following chain of methods.
+
+Here `with_columns` is similar to `select` but adds columns to the same `DataFrame`
+
```{code-cell} ipython3
-df = df.with_columns( # with_columns is similar to select but adds columns to the same DataFrame
- pl.when(pl.col('company') == 'AMZN') # filter for rows relating to AMZN in company column
+df = df.with_columns(
+ pl.when(pl.col('company') == 'AMZN') # filter for AMZN in company column
.then(0) # set values to 0
- .otherwise(pl.col('daily returns')) # otherwise keep the value in daily returns column
- .alias('daily returns') # assign back to the daily returns column
+ .otherwise(pl.col('daily returns')) # otherwise keep the original value
+ .alias('daily returns') # assign back to the column
)
df
-```
You can check if a ticker code is in the company list
From 2e51b8c724159fe338742a18c795fe9681632ad8 Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:05:40 +1000
Subject: [PATCH 15/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 8dbfe845..54d03c72 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -256,12 +256,12 @@ df.filter(pl.col('POP') >= 20000)
In this case, `df.filter()` takes a boolean expression and only returns rows with the `True` values.
-We can see this boolean mask by saving the comparison results in the following table.
+We can view this boolean mask as a table with the alias `meets_criteria`
```{code-cell} ipython3
df.select(
- pl.col('country'), # Include country for reference
- (pl.col('POP') >= 20000).alias('meets_criteria') # meets_criteria shows results of the comparison expression
+ pl.col('country'),
+ (pl.col('POP') >= 20000).alias('meets_criteria')
)
```
From 7c3e1ed8ad399430ab282fb2f27f3cf7292d19ba Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:10:27 +1000
Subject: [PATCH 16/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 54d03c72..3cb2aa03 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -290,7 +290,9 @@ df.filter(pl.col('cc') == pl.col('cc').max())
When we only want to look at certain columns of a selected sub-dataframe, we can combine filter with select.
```{code-cell} ipython3
-df.filter((pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)).select(['country', 'year', 'POP'])
+df.filter(
+ (pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)
+ ).select(['country', 'year', 'POP'])
```
**Application: Subsetting Dataframe**
From f73ead1f07ce963830de1f345156b308e1770fc8 Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:11:41 +1000
Subject: [PATCH 17/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 3cb2aa03..d5ffd9fb 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -354,9 +354,9 @@ We can use complex filtering conditions with boolean logic:
```{code-cell} ipython3
complex_condition = (
- pl.when(pl.col('country').is_in(['Argentina', 'India', 'South Africa'])) # for the countries that match those in the list
- .then(pl.col('POP') > 40000) # mark True if population is > 40,000
- .otherwise(pl.col('POP') < 20000) # otherwise False if population is < 20,000
+ pl.when(pl.col('country').is_in(['Argentina', 'India', 'South Africa']))
+ .then(pl.col('POP') > 40000)
+ .otherwise(pl.col('POP') < 20000)
)
df.filter(complex_condition).select(['country', 'year', 'POP', 'XRAT', 'tcgdp'])
From 77982bf1ae410ad2ec3dd7a3204c0554f925b2e2 Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:11:55 +1000
Subject: [PATCH 18/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index d5ffd9fb..4cb007d8 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -369,12 +369,12 @@ The ability to make changes in dataframes is important to generate a clean datas
**1.** We can use conditional logic to "keep" certain values and replace others
```{code-cell} ipython3
-df.with_columns( # add data column to the same dataframe
- pl.when(pl.col('POP') >= 20000) # when population is greater than 20,000
+df.with_columns(
+ pl.when(pl.col('POP') >= 20000) # when population >= 20000
.then(pl.col('POP')) # keep the population value
- .otherwise(None) # otherwise set the value to null
- .alias('POP_filtered') # save results in column POP_filtered
-).select(['country', 'POP', 'POP_filtered']) # select the columns of interest
+ .otherwise(None) # otherwise set to null
+ .alias('POP_filtered') # save results in POP_filtered
+).select(['country', 'POP', 'POP_filtered']) # select the columns
```
**2.** We can modify specific values based on conditions
From 914364f646f29e33fd3b10da28fa2296fa5eeb77 Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:12:04 +1000
Subject: [PATCH 19/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 4cb007d8..f70e0971 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -443,13 +443,13 @@ Polars also provides us with convenient methods to replace missing values.
For example, we can use forward fill, backward fill, or interpolation
+Here we fill `null` values with the column means
+
```{code-cell} ipython3
-# Fill with column means for numeric columns
cols = ["cc", "tcgdp", "POP", "XRAT"]
df_with_nulls.with_columns([
- pl.col(cols).fill_null(pl.col(cols).mean()) # fill null values with the column mean
+ pl.col(cols).fill_null(pl.col(cols).mean())
])
-```
Missing value imputation is a big area in data science involving various machine learning techniques.
From 0c05e46d4d24a4b6447678a895418d4deacff01f Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:12:21 +1000
Subject: [PATCH 20/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index f70e0971..cdcc4f77 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -381,7 +381,7 @@ df.with_columns(
```{code-cell} ipython3
df_modified = df.with_columns(
- pl.when(pl.col('cg') == pl.col('cg').max()) # when a value in the cg column is equal to the max cg value
+ pl.when(pl.col('cg') == pl.col('cg').max()) # pick the largest cg value
.then(None) # set to null
.otherwise(pl.col('cg')) # otherwise keep the value in the cg column
.alias('cg') # update the column with name cg
From 63827830650ba1c658019a609b5bd37b17c5016e Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:12:35 +1000
Subject: [PATCH 21/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index cdcc4f77..aa52a360 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -397,7 +397,7 @@ df.with_columns([
.then(None) # set the value to null
.otherwise(pl.col('POP')) # otherwise keep the existing value
.alias('POP'), # update the POP column
- (pl.col('XRAT') / 10).alias('XRAT') # using the XRAT column, divide the value by 10 and update the column in-place
+ (pl.col('XRAT') / 10).alias('XRAT') # update XRAT in-place
])
```
From 07046819155fe0c596d6e83f82fd12b8731fda10 Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:13:06 +1000
Subject: [PATCH 22/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index aa52a360..42451fe1 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -405,7 +405,7 @@ df.with_columns([
```{code-cell} ipython3
df.with_columns([
- pl.col(pl.Float64).round(2) # round all Float64 columns to 2 decimal places
+ pl.col(pl.Float64).round(2) # round all Float64 columns
])
```
From bf26d686c9eb44527370a5704d621624011d0205 Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:20:06 +1000
Subject: [PATCH 23/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 50 +++-------------------------------------------
1 file changed, 3 insertions(+), 47 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 42451fe1..fc73825c 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -545,56 +545,12 @@ For example, suppose that we are interested in the [unemployment rate](https://f
Alternatively, we can access the CSV file from within a Python program.
-This can be done with a variety of methods.
-We start with a relatively low-level method and then return to polars.
+In {doc}`pandas`, we studied how to use `requests` and `pandas` to access API data.
-### Accessing Data with {index}`requests `
+Here polars' `read_csv` function provides the same functionality.
-```{index} single: Python; requests
-```
-
-One option is to use [requests](https://requests.readthedocs.io/en/latest/), a standard Python library for requesting data over the Internet.
-
-To begin, try the following code on your computer
-
-```{code-cell} ipython3
-r = requests.get('https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&cosd=1948-01-01&coed=2024-06-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-07-29&revision_date=2024-07-29&nd=1948-01-01')
-```
-
-If there's no error message, then the call has succeeded.
-
-If you do get an error, then there are two likely causes
-
-1. You are not connected to the Internet --- hopefully, this isn't the case.
-1. Your machine is accessing the Internet through a proxy server, and Python isn't aware of this.
-
-In the second case, you can either
-
-* switch to another machine
-* solve your proxy problem by reading [the documentation](https://requests.readthedocs.io/en/latest/)
-
-Assuming that all is working, you can now proceed to use the `source` object returned by the call `requests.get('https://research.stlouisfed.org/fred2/series/UNRATE/downloaddata/UNRATE.csv')`
-
-```{code-cell} ipython3
-url = 'https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&cosd=1948-01-01&coed=2024-06-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-07-29&revision_date=2024-07-29&nd=1948-01-01'
-source = requests.get(url).content.decode().split("\n")
-source[0]
-```
-
-```{code-cell} ipython3
-source[1]
-```
-
-```{code-cell} ipython3
-source[2]
-```
-
-We could now write some additional code to parse this text and store it as an array.
-
-But this is unnecessary --- polars' `read_csv` function can handle the task for us.
-
-We use `try_parse_dates=True` so that polars recognizes our dates column, allowing for simple date filtering
+We use `try_parse_dates=True` so that polars recognizes our dates column
```{code-cell} ipython3
data = pl.read_csv(url, try_parse_dates=True)
From c923a65aae39c2c1c6ad4f8cd13c1bd87603d0f4 Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:20:26 +1000
Subject: [PATCH 24/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index fc73825c..8302f3de 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -760,7 +760,7 @@ Following the work you did in {ref}`pl_ex1`, you can query the data using `read_
```{code-cell} ipython3
indices_data = read_data(
indices_list,
- start=dt.datetime(1971, 1, 1), #Common Start Date
+ start=dt.datetime(1971, 1, 1),
end=dt.datetime(2021, 12, 31)
)
```
From 2d4c312533f085ccb92b65ee3d1500a0484bb6f0 Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:20:36 +1000
Subject: [PATCH 25/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 8302f3de..7545a4fa 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -781,7 +781,8 @@ for index_col in indices_data.columns:
pl.col(index_col).last().alias('last_price')
])
.with_columns(
- ((pl.col('last_price') - pl.col('first_price') + 1e-10) / (pl.col('first_price') + 1e-10)).alias('return')
+ ((pl.col('last_price') - pl.col('first_price') + 1e-10)
+ / (pl.col('first_price') + 1e-10)).alias('return')
)
.with_columns(pl.lit(indices_list[index_col]).alias('index_name'))
.select(['year', 'index_name', 'return']))
From 401b2eec778cd2a29c4bea86790d99fe025409e0 Mon Sep 17 00:00:00 2001
From: Matt McKay
Date: Mon, 8 Sep 2025 12:20:48 +1000
Subject: [PATCH 26/36] Update lectures/polars.md
Co-authored-by: Humphrey Yang <39026988+HumphreyYang@users.noreply.github.com>
---
lectures/polars.md | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 7545a4fa..96a8cf4d 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -812,10 +812,15 @@ yearly_returns_pd = yearly_returns.to_pandas().set_index('year')
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
-for iter_, ax in enumerate(axes.flatten()): # Flatten 2-D array to 1-D array
+# Flatten 2-D array to 1-D array
+for iter_, ax in enumerate(axes.flatten()):
if iter_ < len(yearly_returns_pd.columns):
- index_name = yearly_returns_pd.columns[iter_] # Get index name per iteration
- ax.plot(yearly_returns_pd[index_name]) # Plot pct change of yearly returns per index
+
+ # Get index name per iteration
+ index_name = yearly_returns_pd.columns[iter_]
+
+ # Plot pct change of yearly returns per index
+ ax.plot(yearly_returns_pd[index_name])
ax.set_ylabel("percent change", fontsize = 12)
ax.set_title(index_name)
From bec5d9f7f6b98b14aa1c4d4e706b6ae2da08cddd Mon Sep 17 00:00:00 2001
From: mmcky
Date: Mon, 8 Sep 2025 12:24:53 +1000
Subject: [PATCH 27/36] FIX: removed closing brackets
---
lectures/polars.md | 1 +
1 file changed, 1 insertion(+)
diff --git a/lectures/polars.md b/lectures/polars.md
index 96a8cf4d..4f92c6e4 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -167,6 +167,7 @@ df = df.with_columns(
.alias('daily returns') # assign back to the column
)
df
+```
You can check if a ticker code is in the company list
From ef004f27e288a10f5189a08134301d3e3006dec2 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Mon, 8 Sep 2025 13:11:22 +1000
Subject: [PATCH 28/36] add in cell fence removed from comments
---
lectures/polars.md | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 4f92c6e4..62d1f950 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -451,6 +451,7 @@ cols = ["cc", "tcgdp", "POP", "XRAT"]
df_with_nulls.with_columns([
pl.col(cols).fill_null(pl.col(cols).mean())
])
+```
Missing value imputation is a big area in data science involving various machine learning techniques.
@@ -590,8 +591,6 @@ Note that polars offers many other file type alternatives.
Polars has [a wide variety](https://docs.pola.rs/user-guide/io/) of methods that we can use to read excel, json, parquet or plug straight into a database server.
-+++
-
## Exercises
```{exercise-start}
From 6e2d434670e8a19e4d6d2955fb55c1eff1bdc95e Mon Sep 17 00:00:00 2001
From: mmcky
Date: Tue, 9 Sep 2025 07:40:07 +1000
Subject: [PATCH 29/36] add in missing url
---
lectures/polars.md | 1 +
1 file changed, 1 insertion(+)
diff --git a/lectures/polars.md b/lectures/polars.md
index 62d1f950..937465fe 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -555,6 +555,7 @@ Here polars' `read_csv` function provides the same functionality.
We use `try_parse_dates=True` so that polars recognizes our dates column
```{code-cell} ipython3
+url = 'https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&cosd=1948-01-01&coed=2024-06-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-07-29&revision_date=2024-07-29&nd=1948-01-01'
data = pl.read_csv(url, try_parse_dates=True)
```
From 50a708c08d17286c0d4588b66e1d99a3c76039e2 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Tue, 9 Sep 2025 08:00:14 +1000
Subject: [PATCH 30/36] Remove Exercises
---
lectures/polars.md | 239 ---------------------------------------------
1 file changed, 239 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 937465fe..f10adb1a 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -592,243 +592,4 @@ Note that polars offers many other file type alternatives.
Polars has [a wide variety](https://docs.pola.rs/user-guide/io/) of methods that we can use to read excel, json, parquet or plug straight into a database server.
-## Exercises
-
-```{exercise-start}
-:label: pl_ex1
-```
-
-With these imports:
-
-```{code-cell} ipython3
-import datetime as dt
-import yfinance as yf
-```
-
-Write a program to calculate the percentage price change over 2021 for the following shares:
-
-```{code-cell} ipython3
-ticker_list = {'INTC': 'Intel',
- 'MSFT': 'Microsoft',
- 'IBM': 'IBM',
- 'BHP': 'BHP',
- 'TM': 'Toyota',
- 'AAPL': 'Apple',
- 'AMZN': 'Amazon',
- 'C': 'Citigroup',
- 'QCOM': 'Qualcomm',
- 'KO': 'Coca-Cola',
- 'GOOG': 'Google'}
-```
-
-Here's the first part of the program
-
-```{note}
-Many Python packages will return pandas DataFrames by default. In this example we use the `yfinance` package and convert the data to a Polars DataFrame.
-```
-
-```{code-cell} ipython3
-def read_data(ticker_list,
- start=dt.datetime(2021, 1, 1),
- end=dt.datetime(2021, 12, 31)):
- """
- This function reads in closing price data from Yahoo
- for each tick in the ticker_list.
- """
-
- all_data = []
-
- for tick in ticker_list:
- stock = yf.Ticker(tick)
- prices = stock.history(start=start, end=end)
-
- # Convert to polars DataFrame
- df = pl.from_pandas(prices.reset_index())
- df = df.with_columns([
- pl.col('Date').cast(pl.Date),
- pl.lit(tick).alias('ticker')
- ]).select(['Date', 'ticker', 'Close'])
-
- all_data.append(df)
-
- # Combine all data
- ticker_df = pl.concat(all_data)
-
- # Pivot to have tickers as columns
- ticker_df = ticker_df.pivot(values='Close', index='Date', on='ticker')
-
- return ticker_df
-
-ticker = read_data(ticker_list)
-```
-
-Complete the program to plot the result as a bar graph like this one:
-
-```{image} /_static/lecture_specific/pandas/pandas_share_prices.png
-:scale: 80
-:align: center
-```
-
-```{exercise-end}
-```
-
-```{solution-start} pl_ex1
-:class: dropdown
-```
-
-There are a few ways to approach this problem using Polars to calculate
-the percentage change.
-
-First, you can extract the data and perform the calculation such as:
-
-```{code-cell} ipython3
-# Get first and last prices for each ticker
-first_prices = ticker[0] # First row
-last_prices = ticker[-1] # Last row
-
-# Convert to pandas for easier calculation, excluding Date column to avoid type errors
-numeric_cols = [col for col in ticker.columns if col != 'Date']
-first_pd = ticker.head(1).select(numeric_cols).to_pandas().iloc[0]
-last_pd = ticker.tail(1).select(numeric_cols).to_pandas().iloc[0]
-
-price_change = (last_pd - first_pd) / first_pd * 100
-price_change
-```
-
-Alternatively you can use polars expressions to calculate percentage change:
-
-```{code-cell} ipython3
-# Calculate percentage change using polars
-change_df = ticker.select([
- ((pl.col(col).last() - pl.col(col).first()) / pl.col(col).first() * 100).alias(f'{col}_pct_change')
- for col in ticker.columns if col != 'Date'
-])
-
-# Convert to series for plotting
-price_change = change_df.to_pandas().iloc[0]
-price_change.index = [col.replace('_pct_change', '') for col in price_change.index]
-price_change
-```
-
-Then to plot the chart
-
-```{code-cell} ipython3
-price_change.sort_values(inplace=True)
-price_change.rename(index=ticker_list, inplace=True)
-```
-
-```{code-cell} ipython3
-fig, ax = plt.subplots(figsize=(10,8))
-ax.set_xlabel('stock', fontsize=12)
-ax.set_ylabel('percentage change in price', fontsize=12)
-price_change.plot(kind='bar', ax=ax)
-plt.show()
-```
-
-```{solution-end}
-```
-
-
-```{exercise-start}
-:label: pl_ex2
-```
-
-Using the method `read_data` introduced in {ref}`pl_ex1`, write a program to obtain year-on-year percentage change for the following indices:
-
-```{code-cell} ipython3
-indices_list = {'^GSPC': 'S&P 500',
- '^IXIC': 'NASDAQ',
- '^DJI': 'Dow Jones',
- '^N225': 'Nikkei'}
-```
-
-Complete the program to show summary statistics and plot the result as a time series graph like this one:
-
-```{image} /_static/lecture_specific/pandas/pandas_indices_pctchange.png
-:scale: 80
-:align: center
-```
-
-```{exercise-end}
-```
-
-```{solution-start} pl_ex2
-:class: dropdown
-```
-
-Following the work you did in {ref}`pl_ex1`, you can query the data using `read_data` by updating the start and end dates accordingly.
-
-```{code-cell} ipython3
-indices_data = read_data(
- indices_list,
- start=dt.datetime(1971, 1, 1),
- end=dt.datetime(2021, 12, 31)
-)
-```
-
-Then, calculate the yearly returns using polars:
-
-```{code-cell} ipython3
-# Combine all yearly returns using concat and pivot approach
-all_yearly_data = []
-
-for index_col in indices_data.columns:
- if index_col != 'Date':
- yearly_data = (indices_data
- .with_columns(pl.col('Date').dt.year().alias('year'))
- .group_by('year')
- .agg([
- pl.col(index_col).first().alias('first_price'),
- pl.col(index_col).last().alias('last_price')
- ])
- .with_columns(
- ((pl.col('last_price') - pl.col('first_price') + 1e-10)
- / (pl.col('first_price') + 1e-10)).alias('return')
- )
- .with_columns(pl.lit(indices_list[index_col]).alias('index_name'))
- .select(['year', 'index_name', 'return']))
-
- all_yearly_data.append(yearly_data)
-
-# Concatenate all data
-combined_data = pl.concat(all_yearly_data)
-
-# Pivot to get indices as columns
-yearly_returns = combined_data.pivot(values='return', index='year', on='index_name')
-
-yearly_returns
-```
-
-Next, you can obtain summary statistics by using the method `describe`.
-
-```{code-cell} ipython3
-yearly_returns.select(pl.exclude('year')).describe()
-```
-
-Then, to plot the chart
-
-```{code-cell} ipython3
-# Convert to pandas for plotting
-yearly_returns_pd = yearly_returns.to_pandas().set_index('year')
-
-fig, axes = plt.subplots(2, 2, figsize=(10, 8))
-
-# Flatten 2-D array to 1-D array
-for iter_, ax in enumerate(axes.flatten()):
- if iter_ < len(yearly_returns_pd.columns):
-
- # Get index name per iteration
- index_name = yearly_returns_pd.columns[iter_]
-
- # Plot pct change of yearly returns per index
- ax.plot(yearly_returns_pd[index_name])
- ax.set_ylabel("percent change", fontsize = 12)
- ax.set_title(index_name)
-
-plt.tight_layout()
-```
-
-```{solution-end}
-```
-
[^mung]: Wikipedia defines munging as cleaning data from one raw form into a structured, purged one.
From 8226757e096a5a8a407fdf80656044531c1c78d1 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Tue, 30 Sep 2025 15:11:35 +1000
Subject: [PATCH 31/36] add exercises and solutions
---
lectures/polars.md | 227 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 227 insertions(+)
diff --git a/lectures/polars.md b/lectures/polars.md
index f10adb1a..943077e2 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -592,4 +592,231 @@ Note that polars offers many other file type alternatives.
Polars has [a wide variety](https://docs.pola.rs/user-guide/io/) of methods that we can use to read excel, json, parquet or plug straight into a database server.
+## Exercises
+
+```{exercise-start}
+:label: pl_ex1
+```
+
+With these imports:
+
+```{code-cell} ipython3
+import datetime as dt
+import pandas as pd  # needed below for pd.to_datetime
+import yfinance as yf
+```
+
+Write a program to calculate the percentage price change over 2021 for the following shares using Polars:
+
+```{code-cell} ipython3
+ticker_list = {'INTC': 'Intel',
+ 'MSFT': 'Microsoft',
+ 'IBM': 'IBM',
+ 'BHP': 'BHP',
+ 'TM': 'Toyota',
+ 'AAPL': 'Apple',
+ 'AMZN': 'Amazon',
+ 'C': 'Citigroup',
+ 'QCOM': 'Qualcomm',
+ 'KO': 'Coca-Cola',
+ 'GOOG': 'Google'}
+```
+
+Here's the first part of the program that reads data into a Polars DataFrame:
+
+```{code-cell} ipython3
+def read_data_polars(ticker_list,
+ start=dt.datetime(2021, 1, 1),
+ end=dt.datetime(2021, 12, 31)):
+ """
+ This function reads in closing price data from Yahoo
+ for each tick in the ticker_list and returns a Polars DataFrame.
+ """
+ # Start with an empty list to collect DataFrames
+ dataframes = []
+
+ for tick in ticker_list:
+ stock = yf.Ticker(tick)
+ prices = stock.history(start=start, end=end)
+
+ # Create a Polars DataFrame from the closing prices
+ df = pl.DataFrame({
+ 'Date': pd.to_datetime(prices.index.date),
+ tick: prices['Close'].values
+ })
+ dataframes.append(df)
+
+ # Join all DataFrames on the Date column
+ result = dataframes[0]
+ for df in dataframes[1:]:
+ result = result.join(df, on='Date', how='outer')
+
+ return result
+
+ticker = read_data_polars(ticker_list)
+```
+
+Complete the program to plot the result as a bar graph using Polars operations and matplotlib visualization.
+
+```{exercise-end}
+```
+
+```{solution-start} pl_ex1
+:class: dropdown
+```
+
+Here's a solution using Polars operations to calculate percentage changes:
+
+
+```{code-cell} ipython3
+price_change_df = ticker.select([
+    (pl.col(tick).last() / pl.col(tick).first() * 100 - 100).alias(tick)
+ for tick in ticker_list.keys()
+]).transpose(include_header=True, header_name='ticker', column_names=['pct_change'])
+
+# Add company names and sort
+price_change_df = price_change_df.with_columns([
+ pl.col('ticker').replace(ticker_list, default=pl.col('ticker')).alias('company')
+]).sort('pct_change')
+
+print(price_change_df)
+```
+
+Now plot the results:
+
+```{code-cell} ipython3
+# Convert to pandas for plotting (as demonstrated in the lecture)
+df_pandas = price_change_df.to_pandas().set_index('company')
+
+fig, ax = plt.subplots(figsize=(10,8))
+ax.set_xlabel('stock', fontsize=12)
+ax.set_ylabel('percentage change in price', fontsize=12)
+df_pandas['pct_change'].plot(kind='bar', ax=ax)
+plt.xticks(rotation=45)
+plt.tight_layout()
+plt.show()
+```
+
+```{solution-end}
+```
+
+
+```{exercise-start}
+:label: pl_ex2
+```
+
+Using the method `read_data_polars` introduced in {ref}`pl_ex1`, write a program to obtain year-on-year percentage change for the following indices using Polars operations:
+
+```{code-cell} ipython3
+indices_list = {'^GSPC': 'S&P 500',
+ '^IXIC': 'NASDAQ',
+ '^DJI': 'Dow Jones',
+ '^N225': 'Nikkei'}
+```
+
+Complete the program to show summary statistics and plot the result as a time series graph demonstrating Polars' data manipulation capabilities.
+
+```{exercise-end}
+```
+
+```{solution-start} pl_ex2
+:class: dropdown
+```
+
+Following the work you did in {ref}`pl_ex1`, you can query the data using `read_data_polars` by updating the start and end dates accordingly.
+
+```{code-cell} ipython3
+indices_data = read_data_polars(
+ indices_list,
+ start=dt.datetime(1971, 1, 1), # Common Start Date
+ end=dt.datetime(2021, 12, 31)
+)
+
+# Add year column for grouping
+indices_data = indices_data.with_columns(
+ pl.col('Date').dt.year().alias('year')
+)
+
+print("Data shape:", indices_data.shape)
+print("\nFirst few rows:")
+print(indices_data.head())
+```
+
+Calculate yearly returns using Polars groupby operations:
+
+```{code-cell} ipython3
+# Calculate first and last price for each year and each index
+yearly_returns = indices_data.group_by('year').agg([
+ *[pl.col(index).first().alias(f"{index}_first") for index in indices_list.keys()],
+ *[pl.col(index).last().alias(f"{index}_last") for index in indices_list.keys()]
+])
+
+# Calculate percentage returns for each index
+for index in indices_list.keys():
+ yearly_returns = yearly_returns.with_columns(
+ ((pl.col(f"{index}_last") - pl.col(f"{index}_first")) / pl.col(f"{index}_first"))
+ .alias(indices_list[index])
+ )
+
+# Select only the year and return columns
+yearly_returns = yearly_returns.select([
+ 'year',
+ *list(indices_list.values())
+]).sort('year')
+
+print("Yearly returns shape:", yearly_returns.shape)
+print("\nYearly returns:")
+print(yearly_returns.head(10))
+```
+
+Generate summary statistics using Polars:
+
+```{code-cell} ipython3
+# Summary statistics for all indices
+summary_stats = yearly_returns.select(list(indices_list.values())).describe()
+print("Summary Statistics:")
+print(summary_stats)
+```
+
+Plot the time series:
+
+```{code-cell} ipython3
+# Convert to pandas for plotting
+df_pandas = yearly_returns.to_pandas().set_index('year')
+
+fig, axes = plt.subplots(2, 2, figsize=(12, 10))
+
+for iter_, ax in enumerate(axes.flatten()):
+ if iter_ < len(indices_list):
+ index_name = list(indices_list.values())[iter_]
+ ax.plot(df_pandas.index, df_pandas[index_name])
+ ax.set_ylabel("percent change", fontsize=12)
+ ax.set_xlabel("year", fontsize=12)
+ ax.set_title(index_name)
+ ax.grid(True, alpha=0.3)
+
+plt.tight_layout()
+plt.show()
+```
+
+Alternative: Create a single plot with all indices:
+
+```{code-cell} ipython3
+# Single plot with all indices
+fig, ax = plt.subplots(figsize=(12, 8))
+
+for index_name in indices_list.values():
+ ax.plot(df_pandas.index, df_pandas[index_name], label=index_name, linewidth=2)
+
+ax.set_xlabel("year", fontsize=12)
+ax.set_ylabel("yearly return", fontsize=12)
+ax.set_title("Yearly Returns of Major Stock Indices", fontsize=14)
+ax.legend()
+ax.grid(True, alpha=0.3)
+plt.tight_layout()
+plt.show()
+```
+
+```{solution-end}
+```
+
[^mung]: Wikipedia defines munging as cleaning data from one raw form into a structured, purged one.
From b3474d140b90783cb7861f430820ff3075515544 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Tue, 30 Sep 2025 15:50:20 +1000
Subject: [PATCH 32/36] review of lecture and add section on Lazy evaluation
---
lectures/polars.md | 194 ++++++++++++++++++++++++++++++++++-----------
1 file changed, 146 insertions(+), 48 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 943077e2..f346f32e 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -37,9 +37,9 @@ In addition to what's in Anaconda, this lecture will need the following librarie
[Polars](https://pola.rs/) is a fast data manipulation library for Python written in Rust.
-Polars has gained significant popularity in recent years due to its superior performance
-compared to traditional data analysis tools, making it an excellent choice for modern
-data science and machine learning workflows.
+Polars has gained significant popularity in recent years due to its superior performance compared to traditional data analysis tools.
+
+This makes it an excellent choice for modern data science and machine learning workflows.
Polars is designed with performance and memory efficiency in mind, leveraging:
@@ -48,7 +48,7 @@ Polars is designed with performance and memory efficiency in mind, leveraging:
* Parallel processing for enhanced performance
* Expressive API similar to pandas but with better performance characteristics
-Just as [NumPy](https://numpy.org/) provides the basic array data type plus core array operations, polars
+Just as [NumPy](https://numpy.org/) provides the basic array data type plus core array operations, Polars
1. defines fundamental structures for working with data and
1. endows them with methods that facilitate operations such as
@@ -58,10 +58,9 @@ Just as [NumPy](https://numpy.org/) provides the basic array data type plus core
* sorting, grouping, re-ordering and general data munging [^mung]
* dealing with missing values, etc.
-More sophisticated statistical functionality is left to other packages, such
-as [statsmodels](https://www.statsmodels.org/) and [scikit-learn](https://scikit-learn.org/), which can work with polars DataFrames through their interoperability with pandas.
+More sophisticated statistical functionality is left to other packages, such as [statsmodels](https://www.statsmodels.org/) and [scikit-learn](https://scikit-learn.org/), which can work with Polars DataFrames through their interoperability with pandas.
-This lecture will provide a basic introduction to polars.
+This lecture will provide a basic introduction to Polars.
```{tip}
*Why use Polars over pandas?* One reason is *performance*. As a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/).
@@ -77,7 +76,7 @@ import matplotlib.pyplot as plt
import requests
```
-Two important data types defined by polars are `Series` and `DataFrame`.
+Two important data types defined by Polars are `Series` and `DataFrame`.
You can think of a `Series` as a "column" of data, such as a collection of observations on a single variable.
@@ -98,11 +97,10 @@ s
```
```{note}
-You may notice the above series has no indices, unlike in [pd.Series](pandas:series). This is because Polars' is column centric and accessing data is predominantly managed through filtering and boolean masks. Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
+You may notice the above series has no indices, unlike in [pd.Series](pandas:series). This is because Polars is column-centric and accessing data is predominantly managed through filtering and boolean masks. Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
```
-Polars `Series` are built on top of Apache Arrow arrays and support many similar
-operations to Pandas `Series`.
+Polars `Series` are built on top of Apache Arrow arrays and support many similar operations to Pandas `Series`.
(For interested readers, see this extended tutorial on [Apache Arrow](https://www.datacamp.com/tutorial/apache-arrow))
@@ -116,13 +114,13 @@ s.abs()
But `Series` provide more than basic arrays.
-For example they have some additional (statistically oriented) methods
+For example, they have some additional (statistically oriented) methods
```{code-cell} ipython3
s.describe()
```
-However the `pl.Series` object cannot be used in the same way as a `pd.Series` when pairing data with indices.
+However, the `pl.Series` object cannot be used in the same way as a `pd.Series` when pairing data with indices.
For example, using a `pd.Series` you can do the following:
@@ -134,10 +132,9 @@ s
However, in Polars you will need to use the `DataFrame` object to do the same task.
-This means you will use the `DataFrame` object more often when using polars if you
-are interested in relationships between data
+This means you will use the `DataFrame` object more often when using Polars if you are interested in relationships between data.
-Let's create a `pl.DataFrame` containing the equivalent data in the `pd.Series` .
+Let's create a `pl.DataFrame` containing the equivalent data in the `pd.Series`.
```{code-cell} ipython3
df = pl.DataFrame({
@@ -147,17 +144,15 @@ df = pl.DataFrame({
df
```
-To access specific values by company name, we can filter the DataFrame filtering on
-the `AMZN` ticker code and selecting the `daily returns`.
+To access specific values by company name, we can filter the DataFrame for the `AMZN` ticker code and select the `daily returns`.
```{code-cell} ipython3
df.filter(pl.col('company') == 'AMZN').select('daily returns').item()
```
-If we want to update `AMZN` return to 0, you can use the following chain of methods.
-
+If we want to update the `AMZN` return to 0, we can use the following chain of methods.
-Here `with_columns` is similar to `select` but adds columns to the same `DataFrame`
+Here, `with_columns` is similar to `select` but adds columns to the same `DataFrame`
```{code-cell} ipython3
df = df.with_columns(
@@ -182,14 +177,13 @@ You can check if a ticker code is in the company list
While a `Series` is a single column of data, a `DataFrame` is several columns, one for each variable.
-In essence, a `DataFrame` in polars is analogous to a (highly optimized) Excel spreadsheet.
+In essence, a `DataFrame` in Polars is analogous to a (highly optimized) Excel spreadsheet.
Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns.
-Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`,
-which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
+Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`, which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
-The dataset contains the following indicators
+The dataset contains the following indicators:
| Variable Name | Description |
| :-: | :-: |
@@ -200,7 +194,7 @@ The dataset contains the following indicators
| cg | Government Consumption Share of PPP Converted GDP Per Capita (%) |
-We'll read this in from a URL using the `polars` function `read_csv`.
+We'll read this in from a URL using the Polars function `read_csv`.
```{code-cell} ipython3
URL = 'https://raw.githubusercontent.com/QuantEcon/lecture-python-programming/master/source/_static/lecture_specific/pandas/data/test_pwt.csv'
@@ -214,7 +208,7 @@ Here is the content of `test_pwt.csv`
df
```
-### Select Data by Position
+### Select data by position
In practice, one thing that we do all the time is to find, select and work with a
subset of the data of our interests.
@@ -243,7 +237,7 @@ To select rows and columns using a mixture of integers and labels, we can use mo
df[2:5].select(['country', 'tcgdp'])
```
-### Select Data by Conditions
+### Select data by conditions
Instead of indexing rows and columns using integers and names, we can also obtain a sub-dataframe of our interests that satisfies certain (potentially complicated) conditions.
@@ -266,7 +260,7 @@ df.select(
)
```
-Take one more example,
+Here is another example:
```{code-cell} ipython3
df.filter(
@@ -281,14 +275,14 @@ We can also allow arithmetic operations between different columns.
df.filter((pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000))
```
-For example, we can use the conditioning to select the country with the largest
-household consumption - gdp share `cc`.
+For example, we can use the condition to select the country with the largest
+household consumption–GDP share `cc`.
```{code-cell} ipython3
df.filter(pl.col('cc') == pl.col('cc').max())
```
-When we only want to look at certain columns of a selected sub-dataframe, we can combine filter with select.
+When we only want to look at certain columns of a selected sub-DataFrame, we can combine filter with select.
```{code-cell} ipython3
df.filter(
@@ -296,7 +290,7 @@ df.filter(
).select(['country', 'year', 'POP'])
```
-**Application: Subsetting Dataframe**
+**Application: Subsetting DataFrame**
Real-world datasets can be very large.
@@ -319,11 +313,11 @@ We can then save the smaller dataset for further analysis.
df_subset.write_csv('pwt_subset.csv')
```
-### Apply and Map Operations
+### Apply and map operations
Polars provides powerful methods for applying functions to data.
-Instead of pandas' `apply` method, polars uses expressions within `select`, `with_columns`, or `filter` methods.
+Instead of pandas' `apply` method, Polars uses expressions within `select`, `with_columns`, or `filter` methods.
Here is an example using built-in functions to find the `max` value for each column
@@ -363,9 +357,9 @@ complex_condition = (
df.filter(complex_condition).select(['country', 'year', 'POP', 'XRAT', 'tcgdp'])
```
-### Make Changes in DataFrames
+### Make changes in DataFrames
-The ability to make changes in dataframes is important to generate a clean dataset for future analysis.
+The ability to make changes in DataFrames is important to generate a clean dataset for future analysis.
**1.** We can use conditional logic to "keep" certain values and replace others
@@ -455,9 +449,9 @@ df_with_nulls.with_columns([
Missing value imputation is a big area in data science involving various machine learning techniques.
-There are also more [advanced tools](https://scikit-learn.org/stable/modules/impute.html) in python to impute missing values.
+There are also more [advanced tools](https://scikit-learn.org/stable/modules/impute.html) in Python to impute missing values.
-### Standardization and Visualization
+### Standardization and visualization
Let's imagine that we're only interested in the population (`POP`) and total GDP (`tcgdp`).
@@ -468,7 +462,7 @@ df = df.select(['country', 'POP', 'tcgdp'])
df
```
-While polars doesn't have a traditional index like pandas, we can work with country names directly
+While Polars doesn't have a traditional index like pandas, we can work with country names directly
```{code-cell} ipython3
df
@@ -501,7 +495,7 @@ df = df.with_columns(
df
```
-One of the nice things about polars `DataFrame` and `Series` objects is that they can be easily converted to pandas for visualization through Matplotlib.
+One of the nice things about Polars `DataFrame` and `Series` objects is that they can be easily converted to pandas for visualization through Matplotlib.
For example, we can easily generate a bar plot of GDP per capita
@@ -532,7 +526,111 @@ ax.set_ylabel('GDP per capita', fontsize=12)
plt.show()
```
-## On-Line Data Sources
+## Lazy evaluation
+
+```{index} single: Polars; Lazy Evaluation
+```
+
+One of Polars' most powerful features is **lazy evaluation**. This allows Polars to optimize your entire query before executing it, leading to significant performance improvements.
+
+### Eager vs lazy APIs
+
+Polars provides two APIs:
+
+1. **Eager API** - Operations are executed immediately (like pandas)
+2. **Lazy API** - Operations are collected and optimized before execution
+
+Let's see the difference using our dataset:
+
+```{code-cell} ipython3
+# First, let's reload our original dataset for this example
+URL = 'https://raw.githubusercontent.com/QuantEcon/lecture-python-programming/master/source/_static/lecture_specific/pandas/data/test_pwt.csv'
+df_full = pl.read_csv(URL)
+
+# Eager API (executed immediately)
+result_eager = (df_full
+ .filter(pl.col('tcgdp') > 1000)
+ .select(['country', 'year', 'tcgdp'])
+ .sort('tcgdp', descending=True)
+)
+print("Eager result shape:", result_eager.shape)
+result_eager.head()
+```
+
+```{code-cell} ipython3
+# Lazy API (builds a query plan)
+lazy_query = (df_full.lazy() # Convert to lazy frame
+ .filter(pl.col('tcgdp') > 1000)
+ .select(['country', 'year', 'tcgdp'])
+ .sort('tcgdp', descending=True)
+)
+
+print("Lazy query (not yet executed):")
+print(lazy_query)
+```
+
+```{code-cell} ipython3
+# Execute the lazy query
+result_lazy = lazy_query.collect()
+print("Lazy result shape:", result_lazy.shape)
+result_lazy.head()
+```
+
+### Query optimization
+
+The lazy API allows Polars to perform several optimizations:
+
+1. **Predicate Pushdown** - Filters are applied as early as possible
+2. **Projection Pushdown** - Only required columns are read
+3. **Common Subexpression Elimination** - Duplicate calculations are removed
+4. **Dead Code Elimination** - Unused operations are removed
+
+```{code-cell} ipython3
+# Example of optimization - only columns needed are processed
+optimized_query = (df_full.lazy()
+ .select(['country', 'year', 'tcgdp', 'POP']) # Select early
+ .filter(pl.col('tcgdp') > 500) # Filter pushdown
+ .with_columns((pl.col('tcgdp') / pl.col('POP')).alias('gdp_per_capita'))
+ .filter(pl.col('gdp_per_capita') > 10) # Additional filter
+ .select(['country', 'year', 'gdp_per_capita']) # Final projection
+)
+
+print("Optimized query plan:")
+print(optimized_query.describe_optimized_plan())
+```
+
+```{code-cell} ipython3
+# Execute the optimized query
+result_optimized = optimized_query.collect()
+result_optimized.head()
+```
+
+### When to use lazy vs eager
+
+**Use Lazy API when:**
+- Working with large datasets
+- Performing complex transformations
+- Building data pipelines
+- Performance is critical
+
+**Use Eager API when:**
+- Exploring data interactively
+- Working with small datasets
+- Need immediate results for debugging
+
+```{code-cell} ipython3
+# Converting between lazy and eager
+eager_df = df_full # Start with eager DataFrame
+lazy_df = df_full.lazy() # Convert to lazy
+back_to_eager = lazy_df.collect() # Execute lazy and get eager result
+
+print("Original eager shape:", eager_df.shape)
+print("Back to eager shape:", back_to_eager.shape)
+```
+
+The lazy API is particularly powerful for data processing pipelines where multiple transformations can be optimized together as a single operation.
+
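+As a quick sketch of such a pipeline (assuming the `pwt_subset.csv` file written earlier is still on disk and retains the `POP` and `tcgdp` columns), `scan_csv` starts the query lazily, so the file is only read when `collect` is called:
+
+```{code-cell} ipython3
+pipeline = (
+    pl.scan_csv('pwt_subset.csv')  # lazy reader: nothing is read yet
+    .with_columns((pl.col('tcgdp') / pl.col('POP')).alias('gdp_per_capita'))
+    .select(['country', 'year', 'gdp_per_capita'])  # projection pushdown
+)
+pipeline.collect()  # the optimized plan runs here
+```
+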
+## Online data sources
```{index} single: Data Sources
```
@@ -550,30 +648,30 @@ Alternatively, we can access the CSV file from within a Python program.
In {doc}`pandas`, we studied how to use `requests` and `pandas` to access API data.
-Here polars' `read_csv` function provides the same functionality.
+Here Polars' `read_csv` function provides the same functionality.
-We use `try_parse_dates=True` so that polars recognizes our dates column
+We use `try_parse_dates=True` so that Polars recognizes our dates column
```{code-cell} ipython3
url = 'https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&cosd=1948-01-01&coed=2024-06-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-07-29&revision_date=2024-07-29&nd=1948-01-01'
data = pl.read_csv(url, try_parse_dates=True)
```
-The data has been read into a polars DataFrame called `data` that we can now manipulate in the usual way
+The data has been read into a Polars DataFrame called `data` that we can now manipulate in the usual way
```{code-cell} ipython3
type(data)
```
```{code-cell} ipython3
-data.head() # A useful method to get a quick look at a data frame
+data.head() # A useful method to get a quick look at a DataFrame
```
```{code-cell} ipython3
data.describe() # Your output might differ slightly
```
-We can also plot the unemployment rate from 2006 to 2012 as follows
+We can also plot the unemployment rate from 2006 to 2012 as follows:
```{code-cell} ipython3
# Filter data for the specified date range and convert to pandas for plotting
@@ -588,7 +686,7 @@ ax.set_ylabel('%', fontsize=12)
plt.show()
```
-Note that polars offers many other file type alternatives.
+Note that Polars offers many other file type alternatives.
Polars has [a wide variety](https://docs.pola.rs/user-guide/io/) of methods that we can use to read excel, json, parquet or plug straight into a database server.
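+
+For instance, the unemployment data above can be round-tripped through a parquet file (a minimal sketch; the filename `unrate.parquet` is just illustrative):
+
+```{code-cell} ipython3
+data.write_parquet('unrate.parquet')      # columnar, compressed on-disk format
+pl.read_parquet('unrate.parquet').head()  # read it straight back
+```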
From ea41ee3cc8063aef552fcef6c683a7ea0844e013 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Tue, 30 Sep 2025 16:11:07 +1000
Subject: [PATCH 33/36] FIX: issues with code
---
lectures/polars.md | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index f346f32e..65d5c8ea 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -596,7 +596,7 @@ optimized_query = (df_full.lazy()
)
print("Optimized query plan:")
-print(optimized_query.describe_optimized_plan())
+print(optimized_query.explain())
```
```{code-cell} ipython3
@@ -728,8 +728,8 @@ def read_data_polars(ticker_list,
"""
This function reads in closing price data from Yahoo
for each tick in the ticker_list and returns a Polars DataFrame.
+ Different indices may have different trading days, so we use joins to handle this.
"""
- # Start with an empty list to collect DataFrames
dataframes = []
for tick in ticker_list:
@@ -743,10 +743,12 @@ def read_data_polars(ticker_list,
})
dataframes.append(df)
- # Join all DataFrames on the Date column
+ # Start with the first DataFrame
result = dataframes[0]
+
+ # Join additional DataFrames, handling mismatched dates with full outer join
for df in dataframes[1:]:
- result = result.join(df, on='Date', how='outer')
+ result = result.join(df, on='Date', how='full', coalesce=True)
return result
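+
+To see why `how='full'` with `coalesce=True` matters here, consider a
+small sketch with two hand-made frames whose dates only partly overlap
+(the values below are made up for illustration):
+
+```{code-cell} ipython3
+left = pl.DataFrame({'Date': ['2024-01-01', '2024-01-02'], 'A': [1.0, 2.0]})
+right = pl.DataFrame({'Date': ['2024-01-02', '2024-01-03'], 'B': [5.0, 6.0]})
+
+# The full join keeps every date from both frames; coalesce=True merges
+# the two 'Date' key columns into one instead of adding a 'Date_right'
+left.join(right, on='Date', how='full', coalesce=True)
+```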
From 4dbe60d04a7d582285aa53b72435b56a32201951 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Tue, 30 Sep 2025 21:32:27 +1000
Subject: [PATCH 34/36] Comprehensive improvements to Polars lecture
- Fix execution errors and deprecation warnings
- Add pyarrow dependency for Polars to pandas conversion
- Fix lazy evaluation method: replace describe_optimized_plan() with explain()
- Update deprecated join syntax: how='outer' to how='full'
- Fix yfinance integration with coalesce=True for different trading calendars
- Apply QuantEcon style guide compliance:
- Convert headings from title case to sentence case
- Split multi-sentence paragraphs per qe-writing-002 rule
- Fix proper noun capitalization (polars -> Polars)
- Add lazy evaluation section with query optimization examples
- Expand exercises with comprehensive stock analysis examples
- Enhance plotting with markers, reference lines, and debugging info
- Fix replace() deprecation warning: use replace_strict()
- Add data validation and debugging output to exercises
- Improve visualization with better styling and error handling
All code cells now execute successfully with Polars 1.33.1
---
lectures/polars.md | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 65d5c8ea..ca18364f 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -30,7 +30,7 @@ In addition to what's in Anaconda, this lecture will need the following librarie
```{code-cell} ipython3
:tags: [hide-output]
-!pip install --upgrade polars wbgapi yfinance
+!pip install --upgrade polars wbgapi yfinance pyarrow
```
## Overview
@@ -775,7 +775,7 @@ price_change_df = ticker.select([
# Add company names and sort
price_change_df = price_change_df.with_columns([
- pl.col('ticker').replace(ticker_list, default=pl.col('ticker')).alias('company')
+ pl.col('ticker').replace_strict(ticker_list, default=pl.col('ticker')).alias('company')
]).sort('pct_change')
print(price_change_df)
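+
+As a brief aside on `replace_strict` (the mapping below is made up for
+illustration): it maps each value through a dictionary and requires every
+case to be covered, either by the mapping itself or by an explicit
+`default`:
+
+```{code-cell} ipython3
+codes = pl.DataFrame({'code': ['US', 'JP', 'XX']})
+codes.with_columns(
+    pl.col('code')
+    .replace_strict({'US': 'United States', 'JP': 'Japan'},
+                    default='Unknown')
+    .alias('name')
+)
+```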
@@ -875,6 +875,13 @@ Generate summary statistics using Polars:
summary_stats = yearly_returns.select(list(indices_list.values())).describe()
print("Summary Statistics:")
print(summary_stats)
+
+# Check for any null values or data issues
+print(f"\nData shape: {yearly_returns.shape}")
+print(f"Null counts:")
+print(yearly_returns.null_count())
+print(f"\nData range (first few years):")
+print(yearly_returns.head())
```
Plot the time series:
@@ -888,11 +895,16 @@ fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for iter_, ax in enumerate(axes.flatten()):
if iter_ < len(indices_list):
index_name = list(indices_list.values())[iter_]
- ax.plot(df_pandas.index, df_pandas[index_name])
- ax.set_ylabel("percent change", fontsize=12)
+
+ # Plot with markers and lines for better visibility
+ ax.plot(df_pandas.index, df_pandas[index_name], 'o-', linewidth=2, markersize=4)
+ ax.set_ylabel("yearly return", fontsize=12)
ax.set_xlabel("year", fontsize=12)
- ax.set_title(index_name)
+ ax.set_title(index_name, fontsize=12)
ax.grid(True, alpha=0.3)
+
+ # Add horizontal line at zero for reference
+ ax.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
From a86a9e485a247ff7e13084f2c05870224686c45e Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Tue, 30 Sep 2025 11:38:52 +0000
Subject: [PATCH 35/36] Apply code review suggestions from @HumphreyYang and
@jstac
Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com>
---
lectures/polars.md | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index 65d5c8ea..89fc4303 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -63,7 +63,7 @@ More sophisticated statistical functionality is left to other packages, such as
This lecture will provide a basic introduction to Polars.
```{tip}
-*Why use Polars over pandas?* One reason is *performance*. As a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing the Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/).
+*Why use Polars over pandas?* One reason is *performance*: as a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars; in addition, Polars is between 10 and 100 times as fast as pandas for common operations; a great article comparing Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/).
```
Throughout the lecture, we will assume that the following imports have taken place
@@ -73,7 +73,6 @@ import polars as pl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
-import requests
```
Two important data types defined by Polars are `Series` and `DataFrame`.
@@ -97,7 +96,7 @@ s
```
```{note}
-You may notice the above series has no indices, unlike in [pd.Series](pandas:series). This is because Polars is column-centric and accessing data is predominantly managed through filtering and boolean masks. Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
+You may notice the above series has no indices, unlike in [pd.Series](pandas:series); this is because Polars is column-centric and accessing data is predominantly managed through filtering and boolean masks; here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
```
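+
+A one-line illustration of that filtering style (assuming `s` is the
+numeric series created above):
+
+```{code-cell} ipython3
+s.filter(s > 0)  # boolean mask: keep only the positive observations
+```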
Polars `Series` are built on top of Apache Arrow arrays and support many similar operations to Pandas `Series`.
@@ -152,13 +151,13 @@ df.filter(pl.col('company') == 'AMZN').select('daily returns').item()
If we want to update the `AMZN` return to 0, you can use the following chain of methods.
-Here, `with_columns` is similar to `select` but adds columns to the same `DataFrame`
+Here `with_columns` is similar to `select` but adds columns to the same `DataFrame`
```{code-cell} ipython3
df = df.with_columns(
pl.when(pl.col('company') == 'AMZN') # filter for AMZN in company column
.then(0) # set values to 0
- .otherwise(pl.col('daily returns')) # otherwise keep the original value
+ .otherwise(pl.col('daily returns')) # otherwise keep original value
.alias('daily returns') # assign back to the column
)
df
@@ -378,8 +377,8 @@ df.with_columns(
df_modified = df.with_columns(
pl.when(pl.col('cg') == pl.col('cg').max()) # pick the largest cg value
.then(None) # set to null
- .otherwise(pl.col('cg')) # otherwise keep the value in the cg column
- .alias('cg') # update the column with name cg
+ .otherwise(pl.col('cg')) # otherwise keep the value
+ .alias('cg') # update the column
)
df_modified
```
@@ -390,7 +389,7 @@ df_modified
df.with_columns([
pl.when(pl.col('POP') <= 10000) # when population is < 10,000
.then(None) # set the value to null
- .otherwise(pl.col('POP')) # otherwise keep the existing value
+ .otherwise(pl.col('POP')) # otherwise keep existing value
.alias('POP'), # update the POP column
(pl.col('XRAT') / 10).alias('XRAT') # update XRAT in-place
])
@@ -885,9 +884,14 @@ df_pandas = yearly_returns.to_pandas().set_index('year')
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
+# Flatten 2-D array to 1-D array
for iter_, ax in enumerate(axes.flatten()):
if iter_ < len(indices_list):
+
+ # Get index name per iteration
index_name = list(indices_list.values())[iter_]
+
+ # Plot pct change of yearly returns per index
ax.plot(df_pandas.index, df_pandas[index_name])
ax.set_ylabel("percent change", fontsize=12)
ax.set_xlabel("year", fontsize=12)
From 7f0a848bb7293a6158a463aa54e6cfa09a23fd77 Mon Sep 17 00:00:00 2001
From: mmcky
Date: Wed, 1 Oct 2025 09:54:58 +1000
Subject: [PATCH 36/36] Make Polars lecture PEP8 compliant with 80-character
line limit
- Fixed long URL lines using proper string continuation
- Removed all trailing whitespace from code blocks
- Reformatted long method chains and function calls
- Improved docstring formatting for better readability
- Fixed exercise solutions with proper company names and color-coded plotting
- All Python code blocks now comply with PEP8 standards
---
lectures/polars.md | 181 ++++++++++++++++++++++++++++-----------------
1 file changed, 115 insertions(+), 66 deletions(-)
diff --git a/lectures/polars.md b/lectures/polars.md
index b453967d..842f8d41 100644
--- a/lectures/polars.md
+++ b/lectures/polars.md
@@ -62,7 +62,7 @@ More sophisticated statistical functionality is left to other packages, such as
This lecture will provide a basic introduction to Polars.
-```{tip}
+```{tip}
*Why use Polars over pandas?* One reason is *performance*: as a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars; in addition, Polars is between 10 and 100 times as fast as pandas for common operations; a great article comparing Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/).
```
@@ -131,7 +131,7 @@ s
However, in Polars you will need to use the `DataFrame` object to do the same task.
-This means you will use the `DataFrame` object more often when using Polars if you are interested in relationships between data.
+This means you will use the `DataFrame` object more often when using Polars if you are interested in relationships between data.
Let's create a `pl.DataFrame` containing the equivalent data in the `pd.Series`.
@@ -182,12 +182,12 @@ Thus, it is a powerful tool for representing and analyzing data that are natural
Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`, which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
-The dataset contains the following indicators:
+The dataset contains the following indicators:
| Variable Name | Description |
| :-: | :-: |
| POP | Population (in thousands) |
-| XRAT | Exchange Rate to US Dollar |
+| XRAT | Exchange Rate to US Dollar |
| tcgdp | Total PPP Converted GDP (in million international dollar) |
| cc | Consumption Share of PPP Converted GDP Per Capita (%) |
| cg | Government Consumption Share of PPP Converted GDP Per Capita (%) |
@@ -196,7 +196,9 @@ The dataset contains the following indicators:
We'll read this in from a URL using the Polars function `read_csv`.
```{code-cell} ipython3
-URL = 'https://raw.githubusercontent.com/QuantEcon/lecture-python-programming/master/source/_static/lecture_specific/pandas/data/test_pwt.csv'
+URL = ('https://raw.githubusercontent.com/QuantEcon/'
+ 'lecture-python-programming/master/source/_static/'
+ 'lecture_specific/pandas/data/test_pwt.csv')
df = pl.read_csv(URL)
type(df)
```
@@ -209,8 +211,8 @@ df
### Select data by position
-In practice, one thing that we do all the time is to find, select and work with a
-subset of the data of our interests.
+In practice, one thing that we do all the time is to find, select and work with a
+subset of the data of interest.
We can select particular rows using array slicing notation
@@ -254,7 +256,7 @@ We can view this boolean mask as a table with the alias `meets_criteria`
```{code-cell} ipython3
df.select(
- pl.col('country'),
+ pl.col('country'),
(pl.col('POP') >= 20000).alias('meets_criteria')
)
```
@@ -263,7 +265,7 @@ Here is another example:
```{code-cell} ipython3
df.filter(
- (pl.col('country').is_in(['Argentina', 'India', 'South Africa'])) &
+ (pl.col('country').is_in(['Argentina', 'India', 'South Africa'])) &
(pl.col('POP') > 40000)
)
```
@@ -271,10 +273,12 @@ df.filter(
We can also allow arithmetic operations between different columns.
```{code-cell} ipython3
-df.filter((pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000))
+df.filter(
+ (pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)
+)
```
-For example, we can use the condition to select the country with the largest
+For example, we can use the condition to select the country with the largest
household consumption–GDP share `cc`.
```{code-cell} ipython3
@@ -285,8 +289,9 @@ When we only want to look at certain columns of a selected sub-DataFrame, we can
```{code-cell} ipython3
df.filter(
- (pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)
- ).select(['country', 'year', 'POP'])
+ (pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)
+).select(['country', 'year', 'POP'])
```
**Application: Subsetting DataFrame**
@@ -314,7 +319,7 @@ df_subset.write_csv('pwt_subset.csv')
### Apply and map operations
-Polars provides powerful methods for applying functions to data.
+Polars provides powerful methods for applying functions to data.
Instead of pandas' `apply` method, Polars uses expressions within `select`, `with_columns`, or `filter` methods.
@@ -322,7 +327,9 @@ Here is an example using built-in functions to find the `max` value for each col
```{code-cell} ipython3
df.select([
- pl.col(['year', 'POP', 'XRAT', 'tcgdp', 'cc', 'cg']).max().name.suffix('_max')
+ pl.col(['year', 'POP', 'XRAT', 'tcgdp', 'cc', 'cg'])
+ .max()
+ .name.suffix('_max')
])
```
@@ -348,12 +355,14 @@ We can use complex filtering conditions with boolean logic:
```{code-cell} ipython3
complex_condition = (
- pl.when(pl.col('country').is_in(['Argentina', 'India', 'South Africa']))
- .then(pl.col('POP') > 40000)
- .otherwise(pl.col('POP') < 20000)
+ pl.when(pl.col('country').is_in(['Argentina', 'India', 'South Africa']))
+ .then(pl.col('POP') > 40000)
+ .otherwise(pl.col('POP') < 20000)
)
-df.filter(complex_condition).select(['country', 'year', 'POP', 'XRAT', 'tcgdp'])
+df.filter(complex_condition).select([
+ 'country', 'year', 'POP', 'XRAT', 'tcgdp'
+])
```
### Make changes in DataFrames
@@ -363,7 +372,7 @@ The ability to make changes in DataFrames is important to generate a clean datas
**1.** We can use conditional logic to "keep" certain values and replace others
```{code-cell} ipython3
-df.with_columns(
+df.with_columns(
pl.when(pl.col('POP') >= 20000) # when population >= 20000
.then(pl.col('POP')) # keep the population value
.otherwise(None) # otherwise set to null
@@ -374,7 +383,7 @@ df.with_columns(
**2.** We can modify specific values based on conditions
```{code-cell} ipython3
-df_modified = df.with_columns(
+df_modified = df.with_columns(
pl.when(pl.col('cg') == pl.col('cg').max()) # pick the largest cg value
.then(None) # set to null
.otherwise(pl.col('cg')) # otherwise keep the value
@@ -405,7 +414,7 @@ df.with_columns([
**Application: Missing Value Imputation**
-Replacing missing values is an important step in data munging.
+Replacing missing values is an important step in data munging.
Let's randomly insert some null values
@@ -442,7 +451,7 @@ Here we fill `null` values with the column means
```{code-cell} ipython3
cols = ["cc", "tcgdp", "POP", "XRAT"]
df_with_nulls.with_columns([
- pl.col(cols).fill_null(pl.col(cols).mean())
+ pl.col(cols).fill_null(pl.col(cols).mean())
])
```
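+
+`fill_null` also accepts a `strategy` argument; for example, a forward
+fill propagates the last non-null value downwards instead of using the
+column mean:
+
+```{code-cell} ipython3
+df_with_nulls.with_columns([
+    pl.col(cols).fill_null(strategy="forward")
+])
+```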
@@ -543,7 +552,9 @@ Let's see the difference using our dataset:
```{code-cell} ipython3
# First, let's reload our original dataset for this example
-URL = 'https://raw.githubusercontent.com/QuantEcon/lecture-python-programming/master/source/_static/lecture_specific/pandas/data/test_pwt.csv'
+URL = ('https://raw.githubusercontent.com/QuantEcon/'
+ 'lecture-python-programming/master/source/_static/'
+ 'lecture_specific/pandas/data/test_pwt.csv')
df_full = pl.read_csv(URL)
# Eager API (executed immediately)
@@ -564,12 +575,13 @@ lazy_query = (df_full.lazy() # Convert to lazy frame
.sort('tcgdp', descending=True)
)
-print("Lazy query (not yet executed):")
+print("Lazy query:")
print(lazy_query)
```
+We can now execute the lazy query using `collect`:
+
```{code-cell} ipython3
-# Execute the lazy query
result_lazy = lazy_query.collect()
print("Lazy result shape:", result_lazy.shape)
result_lazy.head()
@@ -617,16 +629,6 @@ result_optimized.head()
- Working with small datasets
- Need immediate results for debugging
-```{code-cell} ipython3
-# Converting between lazy and eager
-eager_df = df_full # Start with eager DataFrame
-lazy_df = df_full.lazy() # Convert to lazy
-back_to_eager = lazy_df.collect() # Execute lazy and get eager result
-
-print("Original eager shape:", eager_df.shape)
-print("Back to eager shape:", back_to_eager.shape)
-```
-
The lazy API is particularly powerful for data processing pipelines where multiple transformations can be optimized together as a single operation.
## Online data sources
@@ -652,7 +654,18 @@ Here Polars' `read_csv` function provides the same functionality.
We use `try_parse_dates=True` so that Polars recognizes our dates column
```{code-cell} ipython3
-url = 'https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&cosd=1948-01-01&coed=2024-06-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-07-29&revision_date=2024-07-29&nd=1948-01-01'
+url = ('https://fred.stlouisfed.org/graph/fredgraph.csv?'
+ 'bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&'
+ 'graph_bgcolor=%23ffffff&height=450&mode=fred&'
+ 'recession_bars=on&txtcolor=%23444444&ts=12&tts=12&'
+ 'width=1318&nt=0&thu=0&trc=0&show_legend=yes&'
+ 'show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&'
+ 'cosd=1948-01-01&coed=2024-06-01&line_color=%234572a7&'
+ 'link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&'
+ 'ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&'
+ 'fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&'
+ 'vintage_date=2024-07-29&revision_date=2024-07-29&'
+ 'nd=1948-01-01')
data = pl.read_csv(url, try_parse_dates=True)
```
@@ -675,7 +688,7 @@ We can also plot the unemployment rate from 2006 to 2012 as follows:
```{code-cell} ipython3
# Filter data for the specified date range and convert to pandas for plotting
filtered_data = data.filter(
- (pl.col('observation_date') >= pl.date(2006, 1, 1)) &
+ (pl.col('observation_date') >= pl.date(2006, 1, 1)) &
(pl.col('observation_date') <= pl.date(2012, 12, 31))
).to_pandas().set_index('observation_date')
@@ -727,28 +740,29 @@ def read_data_polars(ticker_list,
"""
This function reads in closing price data from Yahoo
for each tick in the ticker_list and returns a Polars DataFrame.
- Different indices may have different trading days, so we use joins to handle this.
+ Different indices may have different trading days, so we use joins
+ to handle this.
"""
dataframes = []
-
+
for tick in ticker_list:
stock = yf.Ticker(tick)
prices = stock.history(start=start, end=end)
-
+
# Create a Polars DataFrame from the closing prices
df = pl.DataFrame({
'Date': pd.to_datetime(prices.index.date),
tick: prices['Close'].values
})
dataframes.append(df)
-
+
# Start with the first DataFrame
result = dataframes[0]
-
+
# Join additional DataFrames, handling mismatched dates with full outer join
for df in dataframes[1:]:
result = result.join(df, on='Date', how='full', coalesce=True)
-
+
return result
ticker = read_data_polars(ticker_list)
@@ -768,13 +782,19 @@ Here's a solution using Polars operations to calculate percentage changes:
```{code-cell} ipython3
price_change_df = ticker.select([
- pl.col(tick).last().alias(f"{tick}_last") / pl.col(tick).first().alias(f"{tick}_first") * 100 - 100
+ (pl.col(tick).last() / pl.col(tick).first() * 100 - 100).alias(tick)
for tick in ticker_list.keys()
-]).transpose(include_header=True, header_name='ticker', column_names=['pct_change'])
+]).transpose(
+ include_header=True,
+ header_name='ticker',
+ column_names=['pct_change']
+)
# Add company names and sort
price_change_df = price_change_df.with_columns([
- pl.col('ticker').replace_strict(ticker_list, default=pl.col('ticker')).alias('company')
+ pl.col('ticker')
+ .replace_strict(ticker_list, default=pl.col('ticker'))
+ .alias('company')
]).sort('pct_change')
print(price_change_df)
@@ -789,7 +809,11 @@ df_pandas = price_change_df.to_pandas().set_index('company')
fig, ax = plt.subplots(figsize=(10,8))
ax.set_xlabel('stock', fontsize=12)
ax.set_ylabel('percentage change in price', fontsize=12)
-df_pandas['pct_change'].plot(kind='bar', ax=ax)
+
+# Create colors: red for negative returns, blue for positive returns
+colors = ['red' if x < 0 else 'blue' for x in df_pandas['pct_change']]
+df_pandas['pct_change'].plot(kind='bar', ax=ax, color=colors)
+
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
@@ -826,7 +850,7 @@ Following the work you did in {ref}`pl_ex1`, you can query the data using `read_
```{code-cell} ipython3
indices_data = read_data_polars(
indices_list,
- start=dt.datetime(1971, 1, 1), # Common Start Date
+ start=dt.datetime(2000, 1, 1),
end=dt.datetime(2021, 12, 31)
)
@@ -838,29 +862,48 @@ indices_data = indices_data.with_columns(
print("Data shape:", indices_data.shape)
print("\nFirst few rows:")
print(indices_data.head())
+print("\nData availability check:")
+for index in indices_list.keys():
+ non_null_count = (indices_data
+ .select(pl.col(index).is_not_null().sum())
+ .item())
+ print(f"{indices_list[index]}: {non_null_count} non-null values")
```
Calculate yearly returns using Polars groupby operations:
```{code-cell} ipython3
-# Calculate first and last price for each year and each index
+# Calculate first and last valid price for each year and each index
yearly_returns = indices_data.group_by('year').agg([
- *[pl.col(index).first().alias(f"{index}_first") for index in indices_list.keys()],
- *[pl.col(index).last().alias(f"{index}_last") for index in indices_list.keys()]
+ *[pl.col(index)
+ .filter(pl.col(index).is_not_null())
+ .first()
+ .alias(f"{index}_first") for index in indices_list.keys()],
+ *[pl.col(index)
+ .filter(pl.col(index).is_not_null())
+ .last()
+ .alias(f"{index}_last") for index in indices_list.keys()]
])
-# Calculate percentage returns for each index
+# Calculate percentage returns for each index, handling null values properly
+return_columns = []
for index in indices_list.keys():
- yearly_returns = yearly_returns.with_columns(
- ((pl.col(f"{index}_last") - pl.col(f"{index}_first")) / pl.col(f"{index}_first"))
- .alias(indices_list[index])
- )
+ company_name = indices_list[index]
+ return_col = (
+ (pl.col(f"{index}_last") - pl.col(f"{index}_first")) /
+ pl.col(f"{index}_first") * 100
+ ).alias(company_name)
+ return_columns.append(return_col)
+
+yearly_returns = yearly_returns.with_columns(return_columns)
-# Select only the year and return columns
+# Select only the year and return columns, filtering out years
+# with insufficient data
yearly_returns = yearly_returns.select([
'year',
*list(indices_list.values())
-]).sort('year')
+]).filter(
+ pl.col('year') >= 2001 # Ensure we have complete years of data
+).sort('year')
print("Yearly returns shape:", yearly_returns.shape)
print("\nYearly returns:")
@@ -894,17 +937,18 @@ fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Flatten 2-D array to 1-D array
for iter_, ax in enumerate(axes.flatten()):
if iter_ < len(indices_list):
-
+
# Get index name per iteration
index_name = list(indices_list.values())[iter_]
-
+
# Plot with markers and lines for better visibility
- ax.plot(df_pandas.index, df_pandas[index_name], 'o-', linewidth=2, markersize=4)
+ ax.plot(df_pandas.index, df_pandas[index_name], 'o-',
+ linewidth=2, markersize=4)
ax.set_ylabel("yearly return", fontsize=12)
ax.set_xlabel("year", fontsize=12)
ax.set_title(index_name, fontsize=12)
ax.grid(True, alpha=0.3)
-
+
# Add horizontal line at zero for reference
ax.axhline(y=0, color='k', linestyle='--', alpha=0.3)
@@ -919,13 +963,18 @@ Alternative: Create a single plot with all indices:
fig, ax = plt.subplots(figsize=(12, 8))
for index_name in indices_list.values():
- ax.plot(df_pandas.index, df_pandas[index_name], label=index_name, linewidth=2)
+ # Only plot if the column has valid data
+ if (index_name in df_pandas.columns and
+ not df_pandas[index_name].isna().all()):
+ ax.plot(df_pandas.index, df_pandas[index_name],
+ label=index_name, linewidth=2, marker='o', markersize=3)
ax.set_xlabel("year", fontsize=12)
-ax.set_ylabel("yearly return", fontsize=12)
-ax.set_title("Yearly Returns of Major Stock Indices", fontsize=14)
+ax.set_ylabel("yearly return (%)", fontsize=12)
+ax.set_title("Yearly Returns of Major Stock Indices (2001-2021)", fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)
+ax.axhline(y=0, color='k', linestyle='--', alpha=0.5)  # zero reference line
plt.tight_layout()
plt.show()
```