a note

wackyCoup is a collection of stuff that I think has been interesting enough or difficult enough (mostly the latter) for me to learn. Through this blog I try to develop a stronger understanding of the concepts I struggled with.

Part of the inspiration behind this blog is a quote from Elizabeth Gilbert:

“Your [work] not only doesn’t have to be original… it also doesn’t have to be important”

The blog is therefore mostly unimportant and unoriginal.

Parallel processing of nested for-loops with examples for AdaBoost and SVM in R

Let’s say we have a machine learning model that we want to further optimize by tuning its parameters or hyper-parameters. One common way of doing this is a grid search. While there are alternatives such as random search and gradient-based search, let’s just say we have decided to perform a grid search across two parameters and we want an efficient way of doing that. By efficient here, I mean compute efficiency. Most of us have laptops with more than one core, and we want to make good use of those cores to speed up our model optimisation. [Read More]
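A minimal sketch of the idea, assuming only base R's parallel package: the two nested loops are flattened into a single table of parameter combinations, and the rows are shared between cores. The function score_model() and the parameter names cost and gamma are hypothetical stand-ins for fitting a real model, and this is not necessarily the approach the full post takes.

```r
library(parallel)

# Hypothetical stand-in for "fit a model and return a validation score".
# A real version would train, say, an SVM with the given cost and gamma.
score_model <- function(cost, gamma) {
  -((log10(cost) - 1)^2 + (log10(gamma) + 2)^2)
}

# Flatten the two nested for-loops into one table of combinations.
grid <- expand.grid(cost = 10^(-1:3), gamma = 10^(-4:0))

# mclapply() forks the R process, which is not supported on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1 else 2

# Each core evaluates a share of the rows of the grid.
scores <- mclapply(
  seq_len(nrow(grid)),
  function(i) score_model(grid$cost[i], grid$gamma[i]),
  mc.cores = n_cores
)
grid$score <- unlist(scores)

# The combination with the highest score.
best <- grid[which.max(grid$score), ]
print(best)
```

Flattening the grid first means the work can be split evenly across cores, whichever of the two loops happens to be longer.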

A general application of AdaBoost and Gradient Boosting for a classification task

An implementation in R using JOUSBoost and XGBoost

Since its inception, AdaBoost (short for Adaptive Boosting) by Freund and Schapire (1997) has been very popular among data scientists. While it has evolved in many ways (Discrete AdaBoost, Real AdaBoost, etc.), its popularity has not dwindled. In fact, Breiman (NIPS Workshop, 1996) referred to AdaBoost with trees as the “best off-the-shelf classifier in the world” (see also Breiman (1998)). Recently, my team was given a classification task. The objective was to determine whether the closest match that we found for an entity was a good match or not. [Read More]

How to use your system's default R in Anaconda or Jupyter Notebooks

When using Jupyter Notebooks I like to use the version of R that I use by default in RStudio and not Anaconda R. To enable this I do the following: DO NOT install Anaconda R (assuming you have installed Anaconda). Open a terminal (on Linux it is ctrl + alt + t) and start R by simply typing R on the command line. Once in R in your terminal (not in RStudio), install the IRkernel package by typing install. [Read More]

Add multiple conda environments to Jupyter Lab

So I have multiple conda environments, but I have two main ones: one which I use as base and one for quick data science explorations. When I’m working on a specific project I usually end up building a specific environment for that project (yes, I don’t use Docker for every project; I just find it easier to spin up a new environment for simple tasks). I recently reformatted my computer and set up my usual dual-boot - Ubuntu 18. [Read More]

Keeping costs down with Google BigQuery - Partitioned and Clustered tables

The ability to partition tables in BigQuery has been around for some time, and for people who deal with time series data this is a real boon. It is not only a cost saver but also a great time saver. There are two ways to partition a BigQuery table: based on ingestion time, and based on a user-specified date column. As I deal with a large amount of web analytics data, for over 200 sites, the ability to day-partition has been very useful. [Read More]

A very simple explanation of the bias-variance trade-off

Reading through any statistical learning text, one is bound to come across the bias-variance trade-off quite regularly. The concept is fundamental to understanding why certain models are better than others for a given problem. Here is a simple explanation of what we talk about when we talk about the bias-variance trade-off. What is variance in a statistical model when we talk of the bias-variance trade-off? Variance = variance in the model if we had used a different training set. [Read More]
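To make "variance if we had used a different training set" concrete, here is a small simulation sketch in base R. The true function, noise level, sample size and polynomial degrees are all assumptions chosen for illustration. It repeatedly draws a fresh training set, refits a simple and a flexible model, and looks at how much each model's prediction at one fixed point moves around.

```r
set.seed(42)

# Assumed data-generating process: a smooth curve plus Gaussian noise.
true_f <- function(x) x^2
x0 <- data.frame(x = 0.9)  # the fixed point at which we compare predictions

# Draw a new training set, fit a polynomial of the given degree,
# and return its prediction at x0.
predict_once <- function(degree) {
  train <- data.frame(x = runif(30))
  train$y <- true_f(train$x) + rnorm(30, sd = 0.3)
  fit <- lm(y ~ poly(x, degree), data = train)
  unname(predict(fit, newdata = x0))
}

# 200 different training sets for each model.
pred_simple   <- replicate(200, predict_once(1))  # straight line
pred_flexible <- replicate(200, predict_once(9))  # degree-9 polynomial

# Variance of the prediction across training sets.
var_simple   <- var(pred_simple)
var_flexible <- var(pred_flexible)
print(c(simple = var_simple, flexible = var_flexible))
```

The flexible model chases the noise in each individual training set, so its prediction at the same point changes more from one training set to the next.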

Working with Google Sheets in Python (Pandas) and R

TL;DR: if you want functionality similar to R’s googlesheets package in Python, go for gspread_pandas right away. The only additional step you will have to do is download a credentials file from your Google developers console. Working in R, I have had no problems working with Google Sheets. Reading, writing or editing data in Google Sheets has been quite easy, thanks to the wonderful googlesheets package developed by Jenny Bryan and team. [Read More]

Sorting the bars of a bar chart in increasing or decreasing order

Here is a simple trick to make the bars in a bar chart appear in order of their height - highest to lowest (or lowest to highest). Let’s say we want to plot the number of cars of each class that we have in the mpg dataset.

library(dplyr)
library(ggplot2)
mpg %>% ggplot() + geom_bar(aes(x = class))

If you look at class(mpg$class) you will see that it is:

## [1] "character"

However, ggplot converts it into a factor, and if you convert it into a factor yourself you will notice that the order in which the bars appear in the bar chart above is the same as the order in which the levels of the variable class are recorded. [Read More]
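Since the bar order follows the factor levels, one way to sort the bars is to set the levels yourself. The sketch below is one possible approach and not necessarily the one used in the full post. It orders the levels by how often each class occurs; the names cars and class_sorted are introduced here for illustration.

```r
library(ggplot2)

# Work on a copy of the mpg dataset that ships with ggplot2.
cars <- ggplot2::mpg

# Count the cars in each class, most frequent first.
counts <- sort(table(cars$class), decreasing = TRUE)

# Make class a factor whose levels follow that order.
cars$class_sorted <- factor(cars$class, levels = names(counts))

# geom_bar() draws the bars in level order, so this gives highest to lowest.
p <- ggplot(cars) + geom_bar(aes(x = class_sorted))

# Use print(p) to draw the chart.
```

Using decreasing = FALSE in sort() gives the bars from lowest to highest instead.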

Enable touchpad gestures on Ubuntu 18.04

Note: I originally wrote this post as an answer on stackexchange. Here is the problem: my laptop (Dell XPS-15 9560) has quite a good touchpad which works wonderfully smoothly in Windows 10, but when I use Ubuntu (18.04, X.org) there is no out-of-the-box functionality for multi-touch gestures like a three-finger slide up. I have managed to get multi-touch gestures working on my computer. Kohei Yamada has developed an application called Fusuma to enable multi-touch gestures on Linux. [Read More]

Import multiple csv files and create a single data frame

It sounds like a simple task. You have multiple CSV files in a folder. You want to import all of them into your environment, but into a single data frame. Here is the challenge: most new R users have only used read.csv() to import single files. I remember, when I was an R beginner, the very first time I wanted to import a whole bunch of CSV files, the task of running read. [Read More]
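A minimal base R sketch of the task, assuming every file has the same columns. It is only one way to do it; the post's tags suggest the full post uses other tools. The folder csv_demo and the file names are made up so the example runs anywhere.

```r
# Create a throwaway folder with three small CSV files.
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(
    data.frame(id = 1:2, batch = i),
    file.path(csv_dir, sprintf("part_%d.csv", i)),
    row.names = FALSE
  )
}

# List every CSV file in the folder, with full paths.
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into its own data frame ...
frames <- lapply(files, read.csv)

# ... and stack them row-wise into a single data frame.
combined <- do.call(rbind, frames)
print(dim(combined))
```

The lapply() call replaces running read.csv() by hand once per file, and do.call(rbind, ...) does the stacking in one step.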
plyr  r  bash  read_csv