TLDR; if you want a funcationality similar to R’s
googlesheetspackage in Python, go for
gspread_pandasrightaway. The only additional step you wil have to do is download a credentials file from your Google developers console.
Working in R I have had no problems working with Google Sheets. Reading, writing or editing data in Google Sheets has been quite easy. Thanks to the wonderful googlesheets package developed by Jenny Bryan and team. But how much I had taken its usability for granted became clear to me after I spent a day and a half trying to figure out how to get the same functionality within Python.
In short, I had wanted to read in some data from a Google Sheets as a pandas DataFrame object and it took my simple brain ages to figure it out. I tried gspread, pydrive, and a few others until I came across gspread_pandas which I think is built on top of gspread. Thanks to Diego Fernandez for building this wonderful package for people with limited Oauth experience and who don’t want to fiddle too much with flow, credentials and storage sequences that Google tried to teach me! To be honest getting authenticated via PyDrive wasn’t that difficult but I don’t think it has the functionality to read Google Sheets in much the same effortless way that gspread_pandas can.
Workflow to read Google Sheets data via R’s
googlesheets package and Python’s
While there is a wonderful demo for R’s
googlesheets here, I will show some quick steps to get the data from this Google sheet on Gapminder data which contains life expectancy, total population, and GDP per capita since the year 1952 to 2007 for many countries.
Rest of the tutorial is on how to access and read that file in.
Credentials and getting access to your Google Drive
This step opens up a new browser window which asks to select my Google account whose Google Drive I would like to give access to.
library(googlesheets) # generate new access token gs_auth()
To achieve the same step in Python we have to first go to our Google developer account and do the following steps:
- Visit your Google console
- If you don’t have a project create one
- After you have created a project from the menu bar (three horizontal stripes on the top left side) go to APIs dashboard.
- Select Library option and search for
4.1. Google Drive. Select it and click on Enable option
4.2. Google Sheets. Select it and click on Enable option
- Now you have create your credentials. On the left-side menu click on Credentials
- Click on Create Credentials and select OAuth client ID option.
- If this is the first time you are doing this, Google will ask you to create a product (it assumes that you are creating this as part of an app or a product). Give it whatever name you like and click on Save
- Now Google asks what type of app is going to use these credentials. Select Other and click on Create button.
- Click on OK and then download your credentials by clicking on the down arrow on the right side of created OAuth 2.0 client IDs and store the downloaded file on your file system. The writer of gspread_pandas advises that you should name the file as
google_secret.jsonand move file to
~/.config/gspread_pandas/google_secret.json. That’s where I stored mine. The folder did not exist, I created it myself.
If you are with me so far you now have to go to Python (assuming that you have installed the package gspread-pandas already, if not you can run
pip install gspread-panadas now). If all has gone well so far this step opens up a new browser window which asks to select my Google account whose Google Drive I would like to give access to.
from __future__ import print_function import pandas as pd from gspread_pandas import Spread, Client # replace my name 'rachitkinger' with whatever user name you want # note that you have to provide the exact title of the spreadsheet # which is Gapminder in this case gapminder = Spread('rachitkinger', 'Gapminder')
Reading in the data as a DataFrame
By this point in R you should have authenticated your library to access your Google Drive files and in Python, not only have you done that bit, but you have also gotten access to the specific Google Sheets file you want to work with.
googlesheets you have to first ‘register’ a spreadsheet before you can work with it.
library(dplyr) # identify and register the sheet gapminder_sheet <- gs_title("Gapminder")
Note that the Gapminder spreadsheet has a worksheet for each continent. Let’s say we want to read in the data for all countries in Oceania. The name of that worksheet is ‘Oceania’ which will be the value we will pass to
ws parameter in our function call. Since it is the fifth worksheet we could have also passed an integer call
# now read it in as a dataframe (or tibble in this case) # Note that in the Gapminder gapminder_oceania <- gs_read(gapminder_sheet, ws = "Oceania") # alternatively ## gapminder_oceania <- gs_read(gapminder_sheet, ws = 5)
That’s it. You are ready to roll. Take a look at your dataframe.
# A tibble: 6 x 6 country continent year lifeExp pop gdpPercap <chr> <chr> <int> <dbl> <int> <dbl> 1 Australia Oceania 1952 69.1 8691212 10040. 2 Australia Oceania 1957 70.3 9712569 10950. 3 Australia Oceania 1962 70.9 10794968 12217. 4 Australia Oceania 1967 71.1 11872264 14526. 5 Australia Oceania 1972 71.9 13177000 16789. 6 Australia Oceania 1977 73.5 14074100 18334.
Here we do not need to ‘register’ or ‘identify’ the worksheet since we have already done that as part of the authentication process. So we can straight away read in the dataframe. However, do note the indexing difference between Python and R will come into play here. The 5th worksheet will have the index 4 in Python.
# note that sheet=4 since Python indexing starts from 0 gapminder_oceania = gapminder.sheet_to_df(sheet=4)
Voila! Job done. Hope this was useful.