
Time Series Data Part 4: A Full Stack Use Case🔗

*Time series app screenshot*

Roommate Spending Ledger Visualization🔗


Using Pandas + Plotly + SQLite to show a full stack use case of Time Series Data.

Analyze spending over time per person (could be adapted to categories / tags / etc).

See: GitHub Repo

Idea🔗

One time series that any financially independent person should pay attention to is their spending over time.

Tracking your spending gets complicated with roommates and budgets for different categories. It also makes it harder to understand your data at a glance, which is where charts and graphs can help.

There are many personal budgeting and even group budgeting apps, but I wanted to build the simplest one possible and stick to the data and visualizations as an MVP.

One way to get this data is a CSV export from a bank account or credit card company. Part 3 features an app that uses this method on general time series data: upload a CSV with a column of timestamps and a column to forecast, then tune your own prediction model!

The main drawbacks to this paradigm:

  • Can't share data between people / sessions
  • Can't persist data
  • Can't incrementally add data
  • Can't update data

Another way is the CRUD paradigm explored in my Streamlit Full Stack Post. With this method we'll be able to operate on individual data points and let our friends / roommates add to it!

(Of course the CSV paradigm could be blended with this)

Now for each aspect of the app, from back end to front.

DevOps🔗

There's not much real DataOps in this project since the data is self-contained.

That said, there are some DevOps aspects that are important in the time series world:

  • Deployment: having a webserver accessible to multiple users
  • Integration: how to get updated code into the deployment(s)

Leaning on Streamlit Cloud sharing checks both of these boxes with ease.

By including a requirements.txt and specifying a Python version for the image, we get a free CI/CD pipeline from any push to a GitHub branch (more providers to come). It provides us with an Ubuntu-like container that installs all requirements and tries to run `streamlit run streamlit_app.py`, yielding a live webserver accessible to the public for public repos!
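
For reference, a minimal requirements.txt for this stack might look something like the sketch below (illustrative, not necessarily the repo's exact file; sqlite3 ships with the Python standard library, so it needs no entry):

```text
# requirements.txt (illustrative)
streamlit
pandas
plotly
pydantic
```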

Backend🔗

The Data Engineering aspect involves a bit of data design and a bit of service writing.

I decided the minimum data needed to track an expense is:

  • purchased_date: the day on which the purchase was made
  • purchased_by: the name of the person who made the purchase
  • price_in_cents: the price of the purchase, tracked in cents to avoid floating point perils

Relying on SQLite, we'll have to represent the date as a string, but pandas will help us transform it to a date / datetime object. A table creation routine with SQLite for this might look like:

```python
import sqlite3
from typing import Optional

def get_connection(connection_string: str = ":memory:") -> sqlite3.Connection:
    """Make a connection object to sqlite3 with key-value Rows as outputs
    - https://stackoverflow.com/questions/48218065/programmingerror-sqlite-objects-created-in-a-thread-can-only-be-used-in-that-sa
    """
    connection = sqlite3.connect(connection_string)
    connection.row_factory = sqlite3.Row
    return connection

def execute_query(
    connection: sqlite3.Connection, query: str, args: Optional[dict] = None
) -> list:
    """Given sqlite3.Connection and a string query (and optionally necessary query args as a dict),
    Attempt to execute query with cursor, commit transaction, and return fetched rows"""
    cur = connection.cursor()
    if args is not None:
        cur.execute(query, args)
    else:
        cur.execute(query)
    connection.commit()
    results = cur.fetchall()
    cur.close()
    return results

def create_expenses_table(connection: sqlite3.Connection) -> None:
    """Create Expenses Table in the database if it doesn't already exist"""
    init_expenses_query = """CREATE TABLE IF NOT EXISTS expenses(
   purchased_date VARCHAR(10) NOT NULL,
   purchased_by VARCHAR(120) NOT NULL,
   price_in_cents INT NOT NULL);"""
    execute_query(connection, init_expenses_query)

connection = get_connection()
create_expenses_table(connection)
info_results = execute_query(connection, "SELECT name, type, sql FROM sqlite_schema WHERE name = 'expenses'")
info = info_results[0]
info['name'], info['type'], info['sql']
```

```text
('expenses',
 'table',
 'CREATE TABLE expenses(\n   purchased_date VARCHAR(10) NOT NULL,\n   purchased_by VARCHAR(120) NOT NULL,\n   price_in_cents INT NOT NULL)')
```

We also get a free autoincrementing rowid from SQLite, which will differentiate any purchases by the same person on the same day for the same amount!
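
As a quick sketch of that rowid in action (using the connection and helpers from above; the exact values are illustrative):

```python
# SQLite assigns a rowid automatically on INSERT
execute_query(
    connection,
    """INSERT INTO expenses(purchased_date, purchased_by, price_in_cents)
    VALUES (:purchased_date, :purchased_by, :price_in_cents);""",
    {"purchased_date": "2022-03-15", "purchased_by": "gar", "price_in_cents": 100},
)
rows = execute_query(
    connection, "SELECT rowid, purchased_by FROM expenses ORDER BY rowid DESC LIMIT 1"
)
dict(rows[0])  # e.g. {'rowid': 1, 'purchased_by': 'gar'}
```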

Python Object Model🔗

That's all well and good for a DBA, but what about the Python glue?

Using pydantic / dataclasses is my preferred way to make Python classes that represent database objects or API responses. Splitting the model into a parent class for internal application usage and a child class for database syncing is one way to handle auto-created IDs and optional vs. required arguments.

```python
from datetime import date
from pydantic import BaseModel

class BaseExpense(BaseModel):
    price_in_cents: int
    purchased_date: date
    purchased_by: str

class Expense(BaseExpense):
    rowid: int

Expense(rowid=1, price_in_cents=100, purchased_date=date(2022, 3, 15), purchased_by='gar')
```

```text
Expense(price_in_cents=100, purchased_date=datetime.date(2022, 3, 15), purchased_by='gar', rowid=1)
```

Seeding🔗

To get some values for playing around with and demonstrating Create / Update, here's a snippet that seeds the database:

```python
import random
def seed_expenses_table(connection: sqlite3.Connection) -> None:
    """Insert sample Expense rows into the database"""
    for i in range(200):
        seed_expense = Expense(
            rowid=i,
            purchased_date=date(
                random.randint(2020, 2022), random.randint(1, 12), random.randint(1, 28)
            ).strftime("%Y-%m-%d"),
            purchased_by=random.choice(["Alice", "Bob", "Chuck"]),
            price_in_cents=random.randint(50, 100_00),
        )
        seed_expense_query = """REPLACE INTO expenses(rowid, purchased_date, purchased_by, price_in_cents)
        VALUES(:rowid, :purchased_date, :purchased_by, :price_in_cents);"""
        execute_query(connection, seed_expense_query, seed_expense.dict())

seed_expenses_table(connection)
print('Seeded 200 rows')
```

```text
Seeded 200 rows
```

200 times we'll create and save an Expense object with a hardcoded id and some random values for date, purchaser, and price. (Note the randomized days max out at 28 to avoid headaches with February being short. Adding a random timedelta to a fixed start date sidesteps this; see the sketch below.)
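
For example, a hedged sketch of that timedelta approach (the 1095-day span is my assumption to cover roughly 2020 through 2022):

```python
from datetime import date, timedelta
import random

# Add a random number of days to a fixed start date to get a uniform random day
random_day = date(2020, 1, 1) + timedelta(days=random.randint(0, 1095))
print(random_day.strftime("%Y-%m-%d"))
```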

Using kwarg placeholders of the form :keyname lets us pass the dictionary / JSON representation of our Python object instead of specifying each individual field in the correct order.

The rest of the CRUD operations follow a similar pattern (a sketch follows below). Reading is the only hacky one, in order to allow filtering / querying at the database level rather than pulling ALL records into memory.
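
For illustration, an update and delete under the same helpers might look like this sketch (the function names and query shapes are my assumptions, not necessarily the repo's exact code):

```python
def update_expense(connection: sqlite3.Connection, expense: Expense) -> None:
    """Sketch: overwrite an existing expense row by rowid."""
    query = """UPDATE expenses
    SET purchased_date = :purchased_date,
        purchased_by = :purchased_by,
        price_in_cents = :price_in_cents
    WHERE rowid = :rowid;"""
    execute_query(connection, query, expense.dict())


def delete_expense(connection: sqlite3.Connection, rowid: int) -> None:
    """Sketch: remove an expense row by rowid."""
    execute_query(connection, "DELETE FROM expenses WHERE rowid = :rowid;", {"rowid": rowid})
```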

Reshaping the Data🔗

The Data Science aspect of this involves massaging the data into something useful to display.

The data as it stands is not actually the well-formed time series you might have expected.

Sure, the date stamps are all real, but what value do we read from them? The goal is to track spending (price_in_cents in the database).

But what if we have multiple purchases on the same day? Then we might start treating all the purchases on a given day as stochastic samples, and that is not our use case. (It might fit your use case, though, if you are trying to model behaviour based on many people's purchases.)

Enter the Pandas🔗

Utilizing pydantic to parse / validate our database rows, then dumping them as a list of dictionaries for pandas to handle, gets us a dataframe with all the expenses we want to see. Passing start_date, end_date, and selections limits the data to a certain time range and set of purchased_by users.

```python
from typing import List

class ExpenseService:
    """Namespace for Database Related Expense Operations"""

    @staticmethod
    def list_all_purchasers(connection: sqlite3.Connection) -> List[str]:
        select_purchasers = "SELECT DISTINCT purchased_by FROM expenses"
        expense_rows = execute_query(connection, select_purchasers)
        return [x["purchased_by"] for x in expense_rows]

    @staticmethod
    def list_all_expenses(
        connection: sqlite3.Connection,
        start_date: Optional[date] = None,
        end_date: Optional[date] = None,
        selections: Optional[List[str]] = None,
    ) -> List[sqlite3.Row]:
        """Returns rows from all expenses. Ordered in reverse creation order"""
        select = (
            "SELECT rowid, purchased_date, purchased_by, price_in_cents FROM expenses"
        )
        where = ""
        do_and = False
        kwargs = {}
        if any(x is not None for x in (start_date, end_date, selections)):
            where = "WHERE"
        if start_date is not None:
            where += " purchased_date >= :start_date"
            kwargs["start_date"] = start_date
            do_and = True
        if end_date is not None:
            if do_and:
                where += " and"
            where += " purchased_date <= :end_date"
            kwargs["end_date"] = end_date
            do_and = True
        if selections is not None:
            if do_and:
                where += " and"
            selection_map = {str(i): x for i, x in enumerate(selections)}
            where += (
                f" purchased_by IN ({','.join(':' + x for x in selection_map.keys())})"
            )
            kwargs.update(selection_map)

        order_by = "ORDER BY purchased_date DESC;"
        query = " ".join((select, where, order_by))
        expense_rows = execute_query(connection, query, kwargs)
        return expense_rows
```

```python
import pandas as pd
expense_rows = ExpenseService.list_all_expenses(
    connection, date(1900, 1, 1), date(2023, 1, 1), ['Alice', 'Bob', 'Chuck']
)
expenses = [Expense(**row) for row in expense_rows]
raw_df = pd.DataFrame([x.dict() for x in expenses])
raw_df.iloc[:3]
```

```text
   price_in_cents purchased_date purchased_by  rowid
0            9662     2022-12-13        Alice      2
1            9925     2022-12-12        Alice    138
2            6287     2022-12-10          Bob     92
```

For this analysis I mainly care about total purchase amount per day per person. This means the rowid doesn't really matter to me as a unique identifier, so let's drop it.

(This column selection will also re-order your columns, whether or not you want that.)

```python
df = raw_df[["purchased_date", "purchased_by", "price_in_cents"]]
df.iloc[:3]
```

```text
  purchased_date purchased_by  price_in_cents
0     2022-12-13        Alice            9662
1     2022-12-12        Alice            9925
2     2022-12-10          Bob            6287
```

To handle the summation of each person's purchase per day, pandas pivot_table provides us the grouping and sum in one function call.

This will get us roughly columnar-shaped data for each person:

```python
pivot_df = df.pivot_table(
    index="purchased_date", columns="purchased_by", aggfunc="sum", fill_value=0
)
pivot_df.iloc[:3]
```

```text
               price_in_cents
purchased_by            Alice   Bob  Chuck
purchased_date
2020-01-01                  0     0   1566
2020-01-20                  0  7072      0
2020-01-26                  0     0   9982
```

Looking more like a time series!

pivot_table had the minor side effect of adding a MultiIndex on the columns, which can be popped off if not relevant:

```python
pivot_df.columns = pivot_df.columns.droplevel(0)
pivot_df.iloc[:3]
```

```text
purchased_by    Alice   Bob  Chuck
purchased_date
2020-01-01          0     0   1566
2020-01-20          0  7072      0
2020-01-26          0     0   9982
```

Side note: I also added a feature where "All" is a valid selection in addition to the individual purchased_by users.

The "All" spending per day is the sum of each row!

(We can sanity check this by finding rows with two non-zero values and summing them to verify the All column.)

```python
pivot_df['All'] = pivot_df.sum(axis=1)
pivot_df[(pivot_df.Alice > 0) & (pivot_df.Bob > 0)]
```

```text
purchased_by    Alice    Bob  Chuck    All
purchased_date
2020-08-12       3716   5896      0   9612
2020-10-09       4881   2595      0   7476
2021-04-11       3965   3623      0   7588
2021-10-19       3332    627    611   4570
2022-01-08       5021  11061      0  16082
2022-03-01       6895   4857      0  11752
2022-08-03       8642    258      0   8900
2022-09-11       4751   1617      0   6368
```

To fill in date gaps (make the time series have a well-defined period of one day), one way is to build your own range of dates and then reindex the time series dataframe with the full range of dates.

Grabbing the min and max of the current index gets the start and end points for the range. Filling with 0 is fine by me, since there were no purchases on those days.

```python
min_date = pivot_df.index.min()
max_date = pivot_df.index.max()
all_dates = pd.date_range(min_date, max_date, freq="D", name="purchased_date")
pivot_df = pivot_df.reindex(all_dates, fill_value=0)
pivot_df.iloc[:3]
```

```text
purchased_by    Alice  Bob  Chuck   All
purchased_date
2020-01-01          0    0   1566  1566
2020-01-02          0    0      0     0
2020-01-03          0    0      0     0
```

To get the cumulative spend up to each point in time, pandas provides cumsum()

```python
cumulative = pivot_df.cumsum()
cumulative.iloc[:5]
```

```text
purchased_by    Alice  Bob  Chuck   All
purchased_date
2020-01-01          0    0   1566  1566
2020-01-02          0    0   1566  1566
2020-01-03          0    0   1566  1566
2020-01-04          0    0   1566  1566
2020-01-05          0    0   1566  1566
```

And to analyze the percentage each person contributed to the whole group's cumulative spending, we can divide by the group total at each point in time.

(We already included the "All" summation, which is exactly that group total, so we can divide by it directly.)

```python
percentages = (
    cumulative[cumulative.columns.drop("All", errors="ignore")]
    # Divide by the group total ("All") at each point in time
    .divide(cumulative["All"], axis=0)
    .multiply(100)
)
percentages.iloc[:5]
```

```text
purchased_by    Alice  Bob  Chuck
purchased_date
2020-01-01        0.0  0.0  100.0
2020-01-02        0.0  0.0  100.0
2020-01-03        0.0  0.0  100.0
2020-01-04        0.0  0.0  100.0
2020-01-05        0.0  0.0  100.0
```

Grabbing the totals of each spender might be a nice metric to display.

(This could also be grabbed from the last row of the cumulative data.)

```python
totals = pivot_df.sum()
totals.index.name = "purchased_by"
totals.name = "value"
totals = totals.div(100).reset_index()
totals
```

```text
  purchased_by     value
0        Alice   3565.16
1          Bob   3019.81
2        Chuck   3826.65
3          All  10411.62
```

Pandas also provides a convenient rolling() function for applying transformations on moving windows.

In this case let's get the total spending over each trailing 7-day window, per person.

Notice that the value will stay the same on days when the person made $0.00 of purchases, since x + 0 = x!

```python
rolling_df = pivot_df.rolling(7, min_periods=1).sum()
rolling_df.iloc[:8]
```

```text
purchased_by    Alice  Bob   Chuck     All
purchased_date
2020-01-01        0.0  0.0  1566.0  1566.0
2020-01-02        0.0  0.0  1566.0  1566.0
2020-01-03        0.0  0.0  1566.0  1566.0
2020-01-04        0.0  0.0  1566.0  1566.0
2020-01-05        0.0  0.0  1566.0  1566.0
2020-01-06        0.0  0.0  1566.0  1566.0
2020-01-07        0.0  0.0  1566.0  1566.0
2020-01-08        0.0  0.0     0.0     0.0
```

We don't have to sum the rolling values, though. Here we grab the biggest purchase each person made over each 30-day window.

Notice that a given value will stick around for up to 30 days, but will get replaced if a bigger purchase occurs!

```python
maxes_df = pivot_df.rolling(30, min_periods=1).max()
maxes_df.iloc[:31]
```

```text
purchased_by    Alice     Bob   Chuck     All
purchased_date
2020-01-01        0.0     0.0  1566.0  1566.0
2020-01-02        0.0     0.0  1566.0  1566.0
2020-01-03        0.0     0.0  1566.0  1566.0
2020-01-04        0.0     0.0  1566.0  1566.0
2020-01-05        0.0     0.0  1566.0  1566.0
2020-01-06        0.0     0.0  1566.0  1566.0
2020-01-07        0.0     0.0  1566.0  1566.0
2020-01-08        0.0     0.0  1566.0  1566.0
2020-01-09        0.0     0.0  1566.0  1566.0
2020-01-10        0.0     0.0  1566.0  1566.0
2020-01-11        0.0     0.0  1566.0  1566.0
2020-01-12        0.0     0.0  1566.0  1566.0
2020-01-13        0.0     0.0  1566.0  1566.0
2020-01-14        0.0     0.0  1566.0  1566.0
2020-01-15        0.0     0.0  1566.0  1566.0
2020-01-16        0.0     0.0  1566.0  1566.0
2020-01-17        0.0     0.0  1566.0  1566.0
2020-01-18        0.0     0.0  1566.0  1566.0
2020-01-19        0.0     0.0  1566.0  1566.0
2020-01-20        0.0  7072.0  1566.0  7072.0
2020-01-21        0.0  7072.0  1566.0  7072.0
2020-01-22        0.0  7072.0  1566.0  7072.0
2020-01-23        0.0  7072.0  1566.0  7072.0
2020-01-24        0.0  7072.0  1566.0  7072.0
2020-01-25        0.0  7072.0  1566.0  7072.0
2020-01-26        0.0  7072.0  9982.0  9982.0
2020-01-27        0.0  7072.0  9982.0  9982.0
2020-01-28        0.0  7072.0  9982.0  9982.0
2020-01-29        0.0  7072.0  9982.0  9982.0
2020-01-30        0.0  7072.0  9982.0  9982.0
2020-01-31        0.0  7072.0  9982.0  9982.0
```

Now Make it Pretty🔗

Since we already did most of the work shaping the data in pandas, the data analysis should be more straightforward.

We'll use a helper function to do one final transformation that applies to almost all of our datasets:

```python
def prep_df_for_display(df: pd.DataFrame) -> pd.DataFrame:
    return df.divide(100).reset_index().melt("purchased_date")

prepped_cumulative = prep_df_for_display(cumulative)
prepped_cumulative
```

```text
     purchased_date purchased_by     value
0        2020-01-01        Alice      0.00
1        2020-01-02        Alice      0.00
2        2020-01-03        Alice      0.00
3        2020-01-04        Alice      0.00
4        2020-01-05        Alice      0.00
...             ...          ...       ...
4307     2022-12-09          All  10152.88
4308     2022-12-10          All  10215.75
4309     2022-12-11          All  10215.75
4310     2022-12-12          All  10315.00
4311     2022-12-13          All  10411.62

[4312 rows x 3 columns]
```

It seems like this is undoing a lot of the work we've already done, but this long format is generally easier for plotting software to work with.

In this case we keep purchased_date as a column (not an index), get a value column called value, and keep purchased_by as a column we can use for trend highlighting.

After that, plotly express provides the easiest (though not the most performant) visualizations in my experience:

```python
import plotly.express as px
line_chart = px.line(
    prepped_cumulative,
    x="purchased_date",
    y="value",
    color="purchased_by",
    labels={"value": "Cumulative Dollars Spent"},
)
line_chart.show()
```

For more of the plotting and charting, check it out live on Streamlit!

