Data Science for Policy Analysts: A Simple Introduction to GitHub Actions

GitHub
Automation
Author

Jared Greathouse

Published

January 31, 2025

Life is full of repetitive actions, and work in data science is no exception. Perhaps you must produce a plot of some kind or clean a specific dataset (or set of datasets) each week, say revenue data. Or perhaps some files must be managed or cleaned, or a researcher wishes to collect public housing data from Zillow each month when it updates. We could of course do this ourselves by hand, but this leaves us open to errors: for one, if we’re scraping a dataset from a website, what if the data are only there temporarily? What if it only exists today, and tomorrow it is replaced with the then-current data? What if we are away from our computers and cannot be there to run the script on our local machine? Surely there must be a solution to this, and one such solution is GitHub Actions, which I learned about as an intern at Gainwell Technologies.

This post explains GitHub Actions with a very short, simple example. Of course, their applications are so vast that I cannot cover everything about them here, but they are still very useful to people who perform the same or very similar sets of tasks over and over again.

Defining a GitHub Action

As the name suggests, a GitHub Action is a set of instructions performed by a virtual machine on GitHub. The list of things it may do is expansive and context dependent, so I focus on the case of pulling in a dataset from the internet on a regularly scheduled interval. To get started, we need two things: a GitHub account and a repo to test with. You also need to ensure that your repo’s settings allow Actions to write to it, which you can configure in the Settings tab.

GitHub Actions are organized and executed via what we call a yml (“yeah mill”) file. A yml file is simply a kind of configuration file that we use to set up GitHub Actions. yml files MUST live in the .github/workflows directory at the root of one’s project.
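
For instance, the layout of a repo like the one behind this post might look something like the following sketch, where scraper.yml is a hypothetical file name (the workflow shown below does not name its own file):

your-repo/
├── .github/
│   └── workflows/
│       └── scraper.yml
├── City Scrape/
│   └── cityscrape.py
└── requirements.txt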

A Simple yml file

name: City Level Gas Price Scraper

on:
  schedule:
    - cron: '30 9 * * *'
  workflow_dispatch:

jobs:
  scrape:
    runs-on: ubuntu-latest

    steps:
    - name: Check out repository
      uses: actions/checkout@v3

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.9'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt

    - name: Run gas scraper
      run: |
        python "City Scrape/cityscrape.py"

    - name: Ensure directory exists
      run: |
        mkdir -p "City Scrape/Data/"

    - name: Commit and push updated CSV to repository
      run: |
        git config --global user.name "GitHub Actions"
        git config --global user.email "github-actions[bot]@users.noreply.github.com"
        
        # Stage the CSVs first so brand-new (untracked) files are detected
        git add "City Scrape/Data/City_*.csv"
        
        # Check if there are staged changes before committing
        if git diff --staged --quiet; then
          echo "No changes detected, skipping commit."
          exit 0
        fi
        
        git commit -m "Update gas prices data for $(date +'%Y-%m-%d')"
        git push https://x-access-token:${{ secrets.GITHUB_TOKEN }}@github.com/${{ github.repository }}.git HEAD:main
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Here is a yml file we can find in my AAA repo. It begins by naming the action to be performed, which is to scrape some city-level gas price data. The cron expression schedules it to run at 9:30 UTC, or 4:30 each morning Eastern Standard Time. Thanks to workflow_dispatch, it may also be run manually whenever I wish; alternatively, we can specify that actions run on a push to a repo, a branch, or even a certain directory, as sketched after the snippet below.

name: City Level Gas Price Scraper

on:
  schedule:
    - cron: '30 9 * * *'
  workflow_dispatch:
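
A push-based trigger might look something like the following sketch, where the branch and path filters are illustrative rather than taken from my repo:

on:
  push:
    branches:
      - main
    paths:
      - 'City Scrape/**'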

Actions proceed by defining a set of jobs for the workflow to execute. In this case, it’s just “scrape”, but you can define more jobs that depend on one another (a sketch follows the snippet below). The job runs on a virtual machine, in this case Ubuntu. A job proceeds through a list of steps, executed in order, so the action can function. Step one is to check out the repo, which essentially just clones the current repo onto the virtual machine.

jobs:
  scrape:
    runs-on: ubuntu-latest

    steps:
    - name: Check out repository
      uses: actions/checkout@v3
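
As a sketch of interdependent jobs, a hypothetical second job can declare that it needs the scrape job, so it only runs after scrape succeeds (the report job and its placeholder command are purely illustrative):

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
    - name: Check out repository
      uses: actions/checkout@v3

  report:
    needs: scrape
    runs-on: ubuntu-latest
    steps:
    - name: Summarize the scraped data
      run: echo "A reporting step would go here"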

Step two here is to set up Python, since that’s the language I use, but I’m certain this may be done with R and other languages (in fact, this entire blog is written in Quarto, which the action must install before working with each post). Note how we specify the version of Python here, too. We then upgrade pip and install the requirements that the Python code needs to run the scrape.

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.9'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
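
The requirements.txt itself is just a plain list of packages, one per line. Its exact contents depend on what the scraper imports; for a scraper like this, a plausible (hypothetical) version might read:

requests
beautifulsoup4
pandas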

Now, we simply run the code, which, practically speaking, is no different from running it from our command line. Note how the file is referenced together with its directory, which must be specified if the Python file is not at the root (there are different ways of doing this; one is sketched after the snippet below).


    - name: Run gas scraper
      run: |
        python "City Scrape/cityscrape.py"

Next, I make the directory for the data if it does not exist.


    - name: Ensure directory exists
      run: |
        mkdir -p "City Scrape/Data/"

And finally, we commit the CSV file my code creates to the repo, using the GitHub Actions bot to do the commit. The new files are first staged, or prepared for committing, with git add; in my case, they are named things like City_2025-01-31.csv. If staging detects no changes (this is what happens if I try to run the action after it has already run that day), we skip the commit and the job ends. Otherwise, the files are committed with a message saying I’m updating the data for that day, pushed to the branch of interest, and then job complete.

    - name: Commit and push updated CSV to repository
      run: |
        git config --global user.name "GitHub Actions"
        git config --global user.email "github-actions[bot]@users.noreply.github.com"
        
        # Stage the CSVs first so brand-new (untracked) files are detected
        git add "City Scrape/Data/City_*.csv"
        
        # Check if there are staged changes before committing
        if git diff --staged --quiet; then
          echo "No changes detected, skipping commit."
          exit 0
        fi
        
        git commit -m "Update gas prices data for $(date +'%Y-%m-%d')"
        git push https://x-access-token:${{ secrets.GITHUB_TOKEN }}@github.com/${{ github.repository }}.git HEAD:main
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
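
One note related to the repo settings mentioned earlier: on some repositories the default GITHUB_TOKEN is read-only, and an alternative to flipping the switch in the Settings tab is to grant write access from within the workflow itself, via a top-level permissions block:

permissions:
  contents: write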

The real value of this is not the data I’m scraping (unless you’re a fuel economist or policy scholar). The value is that this job runs independent of my will. I do not have to be at my computer for it to run. I do not need to worry about whether my computer has power or whether I’m home to personally oversee it. The value is that I’ve gotten a computer to do a specific task every single day, the correct way (assuming you’ve coded everything right!), every time, without touching it myself. Of course, this job is small enough that I did not feel the need to add additional safety checks (say AAA’s site is down; I could have the action retry later, or have it curl the website on the hour until it does respond, as sketched below), but obviously you can do plenty more here if the task matters enough. This is also a very partial list of what may be done. You can also place lots of parameters around your actions that may make life easier, or employ pre-commit hooks, which run checks for code quality and other tasks before anything is committed and fail if they are not satisfied.
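
As a sketch of what such a safety check could look like, a step like the one below would poll the website before the scrape runs; the URL, retry count, and wait times are illustrative assumptions, not tested against AAA’s actual site:

    - name: Wait for the site to respond
      run: |
        # Try up to 6 times, 10 minutes apart, before failing the job
        for i in 1 2 3 4 5 6; do
          if curl --fail --silent --head "https://gasprices.aaa.com" > /dev/null; then
            echo "Site is up; proceeding."
            exit 0
          fi
          echo "Attempt $i failed; waiting 10 minutes..."
          sleep 600
        done
        echo "Site never responded."
        exit 1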

Also, it’s worth noting that Actions may run in conjunction with cloud computing for larger-scale jobs. So, if you’re a policy researcher and your org uses GitHub but is not using Actions for process-automation tasks, they provide a very useful tool for handling repetitive actions, freeing up your time to do other things. After all, a key benefit of scripting is to automate the boring stuff, provided that we’ve automated our tasks correctly.