Machine Learning basics with Common Lisp

Let's learn a bit about doing machine learning with Common Lisp. I'll be using the Titanic dataset from Kaggle for this. You can find the original data and a Python tutorial here.

Dataset and goal

The main goal is to use machine learning on the Titanic data to predict who survived the incident. In other words, we use the passenger data (mainly gender, age, and ticket class) to estimate the probability of surviving. First, we need to download the dataset, which is a good opportunity to learn how to do that with Common Lisp. Two components are required: the training dataset and the testing dataset.

System preparation

lisp-stat makes it easier to work with the dataset. Installing it, however, is not as simple as it could be: the instructions on the website and on GitHub are not exactly the same, and the order is a bit misleading, especially if you're a Common Lisp newbie. We will need quicklisp, a package manager that resolves dependencies. To install it, download the files https://beta.quicklisp.org/quicklisp.lisp and https://beta.quicklisp.org/quicklisp.lisp.asc. Now, load it with sbcl:

  # if you've installed quicklisp using the Debian/Ubuntu package manager
  # you don't need to download the files below and can start the install with:
  sbcl --load "/usr/share/common-lisp/source/quicklisp/quicklisp.lisp"

  # if you're NOT using the Debian/Ubuntu package manager
  curl -O https://beta.quicklisp.org/quicklisp.lisp
  curl -O https://beta.quicklisp.org/quicklisp.lisp.asc
  sbcl --load quicklisp.lisp

This will start the quicklisp installer. To continue, enter the following at the sbcl REPL (the Common Lisp prompt) and follow the instructions in the comments:

  (quicklisp-quickstart:install)
  ;; once the installation has finished:
  ;; to start quicklisp manually every time you start sbcl
  (load "~/quicklisp/setup.lisp")
  ;; or add it to your init file so that it starts automatically (strongly suggested)
  (ql:add-to-init-file)
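
To confirm that quicklisp is really available in a fresh sbcl session, a small check like the one below can help. This is only a sketch: quicklisp-loaded-p is a helper name I made up for illustration, and it relies on quicklisp defining a package with the nickname QL (the same prefix used in the calls above).

```lisp
;; Sketch: report whether Quicklisp is loaded in the current image.
;; QUICKLISP-LOADED-P is a hypothetical helper, not part of Quicklisp.
(defun quicklisp-loaded-p ()
  "Return T when a package named QL exists, NIL otherwise."
  (and (find-package "QL") t))

(format t "Quicklisp loaded: ~:[no~;yes~]~%" (quicklisp-loaded-p))
```

If this prints "no" right after starting sbcl, the init file was probably not updated, and loading ~/quicklisp/setup.lisp by hand is still needed.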

Now we will install lisp-stat. Exit sbcl for now and, back in the terminal, run:

  mkdir -p ~/common-lisp
  cd ~/common-lisp && \
      git clone https://github.com/Lisp-Stat/data-frame.git && \
      git clone https://github.com/Lisp-Stat/dfio.git && \
      git clone https://github.com/Lisp-Stat/special-functions.git && \
      git clone https://github.com/Lisp-Stat/numerical-utilities.git && \
      git clone https://github.com/Lisp-Stat/documentation.git && \
      git clone https://github.com/Lisp-Stat/plot.git && \
      git clone https://github.com/Lisp-Stat/select.git && \
      git clone https://github.com/Symbolics/alexandria-plus && \
      git clone https://github.com/Lisp-Stat/lisp-stat.git

Start sbcl again and run:

  (ql:quickload :lisp-stat)
  ;; This will download and install most of the dependencies.

  (asdf:clear-source-registry)
  (asdf:load-system :lisp-stat)

  ;; There's one missing dependency needed to run the lisp-stat tests:
  (ql:quickload "parachute")
  ;; then run the tests
  (asdf:test-system :lisp-stat)
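
If loading fails, it can be useful to check whether ASDF actually sees the repositories cloned into ~/common-lisp. The sketch below uses asdf:find-system with its error argument set to NIL, so a missing system yields NIL instead of an error. The helper name system-visible-p is made up for this example, and I'm assuming the system names match the repository names.

```lisp
(require :asdf)

;; SYSTEM-VISIBLE-P is a hypothetical helper used only for this sketch.
(defun system-visible-p (name)
  "Return T when ASDF can locate a system called NAME, without loading it."
  (and (asdf:find-system name nil) t))

;; Assumption: each system is named after its repository.
(dolist (name '("data-frame" "dfio" "select" "lisp-stat"))
  (format t "~a: ~:[not found~;found~]~%" name (system-visible-p name)))
```

A "not found" line usually means the clone is missing or the source registry needs to be cleared again.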

Download the dataset

In order to download the dataset files from Common Lisp, we will use a package called trivial-download, which can be found here.

  ;; Load the downloader package
  (ql:quickload "trivial-download")

  ;; Download the dataset
  (trivial-download:download "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
                             "/absolute/destination/path/titanic.csv")
  ;; if you use a relative path, it is resolved against the directory where sbcl was launched
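
After the download, it is worth checking that the file really landed where expected. The following is a minimal sketch in plain Common Lisp; download-size is a helper name I invented, and the file name "titanic.csv" assumes a relative destination path was used.

```lisp
;; DOWNLOAD-SIZE is a hypothetical helper for this sketch.
(defun download-size (path)
  "Return the size in bytes of the file at PATH, or NIL if it does not exist."
  (when (probe-file path)
    (with-open-file (in path :element-type '(unsigned-byte 8))
      (file-length in))))

;; Assumption: the file was saved as titanic.csv using a relative path.
(let ((size (download-size "titanic.csv")))
  (if size
      (format t "titanic.csv is present (~d bytes)~%" size)
      (format t "titanic.csv was not found~%")))
```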

Load the data

FYI, you may have already noticed that most (not all) of the function calls are qualified with the package name, to make it clearer what I'm using and where it comes from.

  ;; load lisp-stat as described above, then switch to its user package
  (in-package :ls-user)

  ;; Read the data into variables (train.csv and test.csv are the Kaggle files,
  ;; expected in the directory where sbcl was launched)
  (lisp-stat:defdf *train* (lisp-stat:read-csv #P"train.csv"))
  (lisp-stat:defdf *test* (lisp-stat:read-csv #P"test.csv"))

  ;; Check the data  
  ;; CL-USER> *test*
  ;; #<DATA-FRAME:DATA-FRAME *TEST* (418 observations of 11 variables)>
  ;; CL-USER> *train*
  ;; #<DATA-FRAME:DATA-FRAME *TRAIN* (891 observations of 12 variables)>

  ;; Examine the structure of the data-frame
  (describe *train*) ; > columns present and their data types
  ;; Loading a CSV will result in all properties being set to NIL
  (summary *train*) ; > counts single values for each column

  ;; The following lines can help us understand more about the data in each column
  (pprint (missingp *train*)) ; > T for each missing value
  (which *train*:age :predicate #'missingp) ; each row missing the value in the provided column
  (describe *train*:cabin) ; details about the column
  ;; There are other functions, however some may fail because of data type incompatibility :)
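
To make the idea of missing values more concrete, here is a small sketch in plain Common Lisp that does not depend on lisp-stat. It assumes missing entries are represented by the keyword :na (the same marker tested for later in this post); the helper names and the sample vector are made up for illustration.

```lisp
;; COUNT-MISSING and MISSING-POSITIONS are hypothetical helpers;
;; the AGES vector below is made-up sample data.
(defun count-missing (column)
  "Count the :NA entries in COLUMN, a sequence of values."
  (count :na column))

(defun missing-positions (column)
  "Return the zero-based indexes at which the vector COLUMN holds :NA."
  (loop for value across column
        for index from 0
        when (eq value :na) collect index))

(let ((ages #(22 :na 38 :na 4)))
  (format t "~d missing, at rows ~a~%"
          (count-missing ages) (missing-positions ages)))
```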

Before setting the column types, we need to make some adjustments. We will remove a few columns (such as passengerid and name) and create new derived ones. Note that while working with lisp-stat we will encounter some quirks, so when manipulating data it's better to create new variables instead of changing the data in place.

Transforming columns

Using lisp-stat I've learned (the hard way) that the derived columns we want to use should be created before removing columns to build a new dataframe. So, let's first combine the columns sibsp (siblings and spouses) and parch (parents and children) into a new one called Relatives. We fill in this new column by summing the other two columns for each row, and then we can delete the other two (there are other ways to do this). Note the use of the function add-column! (destructive); a non-destructive add-column is also available. Later, when removing columns, I'll avoid problems by putting the result in a new variable.

  (lisp-stat:add-column! *train* 'Relatives
                       (lisp-stat:map-rows *train* '(*train*:sibsp *train*:parch)
                                           #'(lambda (s p) (+ s p))))
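
The row-wise sum itself is easy to try outside of a data frame. As a sketch, with two made-up vectors standing in for the sibsp and parch columns (relatives here is just an illustrative function name):

```lisp
;; RELATIVES is a hypothetical helper; the vectors are made-up sample data.
(defun relatives (sibsp parch)
  "Return a vector whose I-th element is the sum of SIBSP[I] and PARCH[I]."
  (map 'vector #'+ sibsp parch))

(relatives #(1 0 3) #(0 0 2)) ; => #(1 0 5)
```

Since the lambda in the call above only adds its two arguments, #'+ does the same job here.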

Next, we will create age groups in a new column and remove the age column.

Group	Criteria
NK	NIL
Child	<10
Young	>=10 <18
Adult	>=18 <60
Senior	>=60

  (lisp-stat:add-column! *train* 'Group
                       (lisp-stat:map-rows *train* '(*train*:age)
                                           #'(lambda (a)
                                               ;; each clause simply returns the group name
                                               (cond
                                                 ((eql a :na) "NK")   ; age not known
                                                 ((< a 10) "Child")
                                                 ((< a 18) "Young")
                                                 ((< a 60) "Adult")
                                                 (t "Senior")))))     ; 60 and over
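
Since the boundaries are easy to get wrong, it can help to pull the grouping logic into a standalone function and try it on the edge values from the table. This is a sketch in plain Common Lisp; age-group is a name I chose for illustration, and :na stands for a missing age as in the code above.

```lisp
;; AGE-GROUP is a hypothetical helper mirroring the criteria table.
(defun age-group (age)
  "Map AGE (a real number, or :NA when unknown) to its group name."
  (cond ((eql age :na) "NK")
        ((< age 10) "Child")
        ((< age 18) "Young")
        ((< age 60) "Adult")
        (t "Senior")))

;; Try the values right at the group boundaries.
(dolist (age '(:na 9.5 10 17 18 59 60))
  (format t "~a -> ~a~%" age (age-group age)))
```

The :na test has to come first, because comparing a keyword with < would signal a type error.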

Removing columns

For running a prediction, some of the columns are irrelevant, so we will remove them. First, PassengerId and Name are not required for predictions, so let's discard them. Also, Ticket and Cabin are inconsistent or have too many missing values. Furthermore, Fare has too many outliers (Fare = 0), and Pclass already relates to the price paid for the ticket, so Fare can be dropped as well. Finally, Embarked does not seem to carry relevant information for predicting survival. The sibsp, parch, and age columns are also removed, since the derived Relatives and Group columns replace them.

The remove-columns function returns a new data-frame. You may be tempted to do something like (setf *train* (remove-columns ...)). However, I had too many problems doing this (many functions would simply stop working). Therefore, I've decided to put the result in a new variable.

  (defdf *train1* (lisp-stat:remove-columns *train*
                                            '(*train*:name *train*:passengerid *train*:cabin
                                              *train*:ticket *train*:fare *train*:embarked
                                              *train*:sibsp *train*:parch *train*:age)))
  (describe *train1*)
  (summary *train1*)

  ;; To free system memory, we can also delete the original dataframe
  (undef '*train*)
  ;; and/or copy over the prepared dataframe to use the same name
  (defdf *train* *train1*)
  (undef '*train1*)
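
The difference between returning a new object and changing one in place can be illustrated without lisp-stat. The sketch below uses a plain association list as a stand-in for a data frame; drop-columns and the sample table are made up for this example and say nothing about how remove-columns is implemented.

```lisp
;; DROP-COLUMNS is a hypothetical helper; an alist stands in for a data frame.
(defun drop-columns (table names)
  "Return a new alist without the entries of TABLE whose key is in NAMES.
TABLE itself is left untouched."
  (remove-if (lambda (column) (member (car column) names)) table))

(let* ((table (list (cons :name #("A" "B"))
                    (cons :sex #("male" "female"))
                    (cons :fare #(7.25 71.28))))
       (smaller (drop-columns table '(:name :fare))))
  (format t "before: ~a~%after:  ~a~%"
          (mapcar #'car table) (mapcar #'car smaller)))
```

Because the original table is untouched, both versions stay usable until one of them is explicitly discarded.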

Setting column types

Now we are ready to adjust the column types in our dataframe. We can do it automatically with the use of heuristicate-types. Let's try it first.

  (heuristicate-types *train*)
  ;; To set a type manually, we could use
  ;;  (lisp-stat:set-properties *train* :type '(:embarked :string))

  (describe *train*)
  (summary *train*) ; > now this returns an error because of missing values for unit

  ;; To set Unit and label, use the set-properties function
  (lisp-stat:set-properties *train* :unit '(:survived "1/0" :pclass "1-3" :sex "M/F" :relatives ">=0" :Group "Age group"))
  (lisp-stat:set-properties *train* :label '(:survived "1 = Survived" :pclass "Classes 1 (best) to 3 (worst)" :sex "Male or female" :relatives "How many relatives (siblings, parents, children)" :Group "Child < 10, Young < 18, Adult < 60, Senior >= 60"))
  ;; Now, examine the summary (statistics) about the data
  (lisp-stat:summary *train*)
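
To give an intuition for what guessing column types means, here is a toy sketch in plain Common Lisp. It is not how heuristicate-types works internally (I haven't checked its implementation); guess-column-type and the type keywords it returns are invented for illustration only.

```lisp
;; GUESS-COLUMN-TYPE is a hypothetical toy, unrelated to lisp-stat internals.
(defun guess-column-type (column)
  "Guess a type keyword for COLUMN (a sequence), ignoring :NA entries."
  (let ((present (remove :na column)))
    (cond ((zerop (length present)) :unknown)   ; nothing but missing values
          ((every #'integerp present) :integer)
          ((every #'realp present) :number)
          ((every #'stringp present) :string)
          (t :mixed))))

;; Made-up sample columns.
(format t "~a ~a ~a~%"
        (guess-column-type #(1 0 1 :na))
        (guess-column-type #(22 38.5 :na))
        (guess-column-type #("male" "female")))
```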

Save the transformed dataset

It is a good idea to save the transformed dataset to a file. Since we are using Lisp, we can save it as Lisp code for faster loading and to preserve the attributes we have set.

  (lisp-stat:save '*train* #P"ELT_train.lisp")
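
The idea of saving data as Lisp code can be tried with the standard printer and reader alone. The sketch below writes a small made-up list to a file and reads it back; save-as-lisp, load-from-lisp, and the file name roundtrip-demo.lisp are all invented for this example, and what lisp-stat:save writes is more elaborate than this.

```lisp
;; SAVE-AS-LISP and LOAD-FROM-LISP are hypothetical helpers for this sketch.
(defun save-as-lisp (data path)
  "Write DATA to PATH as a single readable Lisp form and return PATH."
  (with-open-file (out path :direction :output :if-exists :supersede)
    (with-standard-io-syntax
      (prin1 data out)))
  path)

(defun load-from-lisp (path)
  "Read back the single form stored at PATH."
  (with-open-file (in path)
    (with-standard-io-syntax
      (let ((*read-eval* nil))          ; do not evaluate #. forms from the file
        (read in)))))

;; Made-up rows, written out and read back again.
(let ((rows '((:survived 1 :pclass 3 :sex "female" :relatives 0 :group "Adult")
              (:survived 0 :pclass 1 :sex "male" :relatives 2 :group "NK"))))
  (save-as-lisp rows "roundtrip-demo.lisp")
  (format t "round trip ok: ~a~%"
          (equal rows (load-from-lisp "roundtrip-demo.lisp"))))
```

No CSV parsing or type guessing is needed on the way back in, which is where the faster loading comes from.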

Now to the machine learning… work in progress…