<h1>Learning about Learning - Part 1</h1>
<p>The plan for the next few weeks is to develop a deep understanding of machine learning and statistics. Even though I have used and applied machine learning in many projects, and have taken a number of offline and online courses on it, I feel like I don’t have a visceral understanding of it and have barely scratched the surface. My goal is to build a strong theoretical foundation and avoid the “throw things at it and see what sticks” approach.</p>
<p><strong>Stage 1</strong>: Build a strong mathematical foundation, be able to reason about probability from a measure theoretic perspective and fill in holes in math. The plan of attack is:</p>
<ul>
<li>Learn real analysis by going through Francis Su’s Youtube lectures and using the following books as reference:
<ul>
<li>
<em>Principles of Mathematical Analysis</em> by Rudin</li>
<li>
<em>Understanding Analysis</em> by Abbott</li>
<li>
<em>Real Mathematical Analysis</em> by Pugh</li>
</ul>
</li>
<li>The next step is to do some abstract linear algebra, perhaps <em>Linear Algebra Done Right</em> by Axler, and follow it up by going through ODEs, PDEs and the calculus of variations in A. D. Aleksandrov’s <em>Mathematics: Its Content, Methods and Meaning</em>.</li>
<li>That should be enough background to learn about measure theory.
<ul>
<li>
<em>A First Look at Rigorous Probability Theory</em> by Rosenthal</li>
<li>
<em>Real Analysis - Measure Theory, Integration and Hilbert Spaces</em> by Stein</li>
</ul>
</li>
</ul>
<p>I will write more about Stage 2 in a future blog post and about tactics/lessons about effectively learning this material, provided I am able to get through it in a reasonable amount of time. </p>
<h1>Yosemite battery issues</h1>
<p>After upgrading OS X to Yosemite, I noticed a sharp rise in battery usage in sleep mode. I usually close the lid instead of shutting down, and used to get really good standby battery life on my Air. But after the upgrade, I would lose about 30-40% of the charge overnight.</p>
<p>When you close the lid and your Mac enters sleep mode, it is still running; only the display has been turned off. After some time, memory is flushed to disk and the machine enters standby mode. It turns out Apple has increased the delay for this in the default configuration. To view your power management settings, use the command line tool <strong>pmset</strong>.</p>
<pre><code class="prettyprint">$ pmset -g
Active Profiles:
Battery Power -1*
AC Power -1
Currently in use:
standbydelay 10800
standby 1
halfdim 1
hibernatefile /var/vm/sleepimage
darkwakes 0
disksleep 10
sleep 1
autopoweroffdelay 14400
hibernatemode 3
autopoweroff 1
ttyskeepawake 1
displaysleep 2
acwake 0
lidwake 1
</code></pre>
<p>So the problem here is that <strong>standbydelay</strong> is 10800 seconds, or 3 hours, so your laptop is essentially wasting power for three hours every time it sleeps. While it hasn’t entered standby it can wake instantly by just turning the display back on, but that convenience isn’t worth the drain. To reduce the <strong>standbydelay</strong> to 20 minutes, use this command.</p>
<pre><code class="prettyprint">$ pmset -a standbydelay 1200
</code></pre>
<h1>An algorithm for trending topics</h1>
<p>In this post I will describe a really simple algorithm for identifying trending items or posts in your application. TF-IDF (term frequency, inverse document frequency) is a technique that was used to rank search results in early search engines.</p>
<p>Assume that you have a large corpus of text documents and want to search for a document containing a certain phrase or set of keywords. With TF-IDF, you need to calculate two quantities for each keyword:</p>
<ol>
<li>term frequency - the number of times the keyword appears in a particular document.</li>
<li>inverse document frequency - the inverse of the number of documents containing the keyword. </li>
</ol>
<p>For each document, sum up the TF-IDF values of the keywords in the query and then rank the documents by that score. The reason this works is that the IDF part helps filter out common words that are present in tons of documents, so a document containing a rare term is given more weight in the search results. Why term frequency is needed is more obvious: a document containing more occurrences of a keyword is more likely to be the one you are looking for.</p>
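<p>To make the scoring concrete, here is a minimal sketch of this ranking for the search case. The corpus and query are toy data invented purely for illustration, and the smoothed inverse mirrors the trending script further down:</p>
<pre><code class="prettyprint lang-python">from collections import defaultdict

# Toy corpus and query, invented purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "notes on measure theoretic probability",
]
query = ["measure", "cat"]

# Document frequency: how many documents contain each term.
df = defaultdict(int)
for doc in docs:
    for term in set(doc.split()):
        df[term] += 1

def score(doc):
    # Sum of TF-IDF values of the query keywords for this document,
    # using the same add-one smoothed inverse as the trending script below.
    words = doc.split()
    return sum(words.count(term) * 1.0 / (1 + df[term]) for term in query)

for doc in sorted(docs, key=score, reverse=True):
    print doc
</code></pre>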
<p>Now, we can extend this algorithm to identify trending items in a dataset. First, partition the data into two sets: a target set containing the posts/items in the timeframe/geography for which you want to find the trending items, and the rest of the data. Some preprocessing like removing stop words and stripping punctuation would be useful. Then, for each of the terms in the target set, compute its TF-IDF score and rank the terms. In this case the term frequency is the number of occurrences in the target set, and the IDF is the inverse of the number of documents in the other set that contain the term. Instead of doing this exercise for each and every term, you may also do it only on tags/hashtags to save memory.</p>
<p>Here is an example script on a dataset containing tweets:</p>
<pre><code class="prettyprint lang-python">from collections import defaultdict
tweets = open('tweets.txt.aa').read().lower().splitlines()   # background set, one tweet per line
target = open('tweets.txt.ab').read().lower().splitlines()   # target set, one tweet per line

# Mapping from hashtag to the number of background tweets containing it
doc_ctr = defaultdict(int)
for tweet in tweets:
    for word in set(tweet.split()):
        if word[0] == '#':
            doc_ctr[word] += 1

# Hashtag counts in the target set
term_ctr = defaultdict(int)
for tweet in target:
    for word in tweet.split():
        if word[0] == '#':
            term_ctr[word] += 1

def tfidf(word):
    # Add-one smoothing to avoid division by zero.
    return term_ctr[word] * 1.0 / (1 + doc_ctr[word])

trending_topics = sorted(term_ctr.keys(), key=tfidf, reverse=True)[:10]
print "Top 10 trending topics"
print '\n'.join(trending_topics)
</code></pre>
<h1>How does Git work?</h1>
<p>So what’s behind the abstractions of branches and commits in git? How are the files really stored? At the heart of git is an object database; everything is an object: commits, files and folders. Inside your repo, the whole commit tree is stored in your .git directory.</p>
<p>Git takes the SHA1 hash of each file’s contents (prefixed with a small header), compresses the contents using zlib/deflate and stores them in its object database, where each object is a file named after its SHA1 hash. Each directory is stored as a tree object, which is basically a flat file with a list of its files and subdirectories with their permissions and hash references. A commit object contains the commit message, its parent, its author and a reference to the hash of the root directory tree. So when you make a change to a file, its hash changes; when you commit it, the entry in the tree is updated. A branch is simply a reference to a commit. The reason git forces you to commit or stash changes before switching branches is that it has no reference to your changed files in its object database unless you commit. Stashing creates a temporary tree object and also stores your changed files. That is how conceptually simple git is!</p>
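<p>To make this concrete, here is a small sketch, in Python rather than anything git itself uses, of how a loose blob object gets named and stored. The header format and object path layout are the standard git ones; the file name at the bottom is just an example:</p>
<pre><code class="prettyprint lang-python">import hashlib
import os
import zlib

def store_blob(path, git_dir='.git'):
    # Store a file as a loose blob object the way git does: take the SHA1 of
    # the header "blob SIZE\0" plus the contents, zlib-compress the whole thing,
    # and write it under .git/objects using the first two hex chars as a directory.
    content = open(path, 'rb').read()
    store = 'blob %d\0' % len(content) + content
    sha1 = hashlib.sha1(store).hexdigest()
    obj_dir = os.path.join(git_dir, 'objects', sha1[:2])
    if not os.path.isdir(obj_dir):
        os.makedirs(obj_dir)
    obj_path = os.path.join(obj_dir, sha1[2:])
    if not os.path.exists(obj_path):
        open(obj_path, 'wb').write(zlib.compress(store))
    return sha1

print store_blob('hello.txt')   # should match what `git hash-object hello.txt` prints
</code></pre>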
<p>You must be thinking that there is a problem with this approach: what if you change a single file in 50 different commits, does git create 50 different copies of the same file? Yes and no. Git has a workaround for this; it performs a “garbage collection” step periodically, or every time you do a remote push. It looks at objects with the same file name or similar file size, keeps one full version of the file, stores the subsequent versions as diffs against it, and combines them into a <em>packfile</em>. A packfile has an accompanying index file which contains a list of the hashes of the objects it contains and their offsets.</p>
<p>So this was just a high-level overview of how git stores things internally; the actual implementation details may vary. You can find out more about git internals in this <a href="http://opcode.org/peepcode-git.pdf">book</a>.</p>
<h1>Writing a web server from scratch - Part 2</h1>
<p>Over the past few days I spent some time writing a web server, and it has been very gratifying; I learnt quite a few things. Let me start off with the design. The thing about web servers is that it’s quite simple to implement the core functionality, but the bulk of the work is the plumbing around the different rules of the HTTP protocol, which I’ve kept to the bare minimum as modern browsers have sane defaults.</p>
<h2 id="design_2">Design <a class="head_anchor" href="#design_2">#</a>
</h2>
<p>So the basic outline is like this:</p>
<ol>
<li>Open a socket, bind it to a port and start accepting connections.</li>
<li>Once you receive a request from a client, pass the socket information to a handler function and continue servicing other requests.</li>
<li>In the handler function, parse the request headers and generate your response headers and body accordingly. For serving static files, simply do a buffered write on the socket after writing the headers (a rough sketch of such a handler follows this list).</li>
<li>Close the socket and any other file descriptors, free resources allocated during the request.</li>
</ol>
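<p>As an illustration of step 3, here is what a bare-bones static file handler could look like, sketched in Python rather than the C the server is actually written in. The response is deliberately minimal and skips things like mime types and path sanitisation:</p>
<pre><code class="prettyprint lang-python">import os

def handle(client):
    # Parse the request line ("GET /path HTTP/1.1"), then send headers and the file.
    request = client.recv(4096)
    path = request.split()[1].lstrip('/') or 'index.html'
    if os.path.isfile(path):
        body = open(path, 'rb').read()
        client.sendall('HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n' % len(body))
        client.sendall(body)
    else:
        client.sendall('HTTP/1.0 404 Not Found\r\nContent-Length: 0\r\n\r\n')
    # Step 4: close the socket and free anything allocated for the request.
    client.close()
</code></pre>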
<p>So there are a number of ways you can implement the listen loop with concurrency (Steps 1 and 2). A rather naive approach would be to fork a new process on every request and let the child handle the request after closing its access to the server socket. This might appear to work, but there is a problem: once you hit 700 requests (or whatever your system’s limit for processes per user is), your server will stop working and a lot of other processes running in the background will begin to act strangely. The reason is that even if you exit the child process after handling the request, it will linger as a <a href="http://en.wikipedia.org/wiki/Zombie_process">zombie process</a>. Zombies store some state information and take up a very small chunk of memory, but cause the kernel to freak out once the user limit is reached. They are destroyed only when the parent waits on them or exits. You can check the limit with the shell command <code class="prettyprint">ulimit -a</code>. So one fix is to service the request in the parent process itself and let the child service the next request. This way the parent will usually exit before the child and no zombies will be created; even if they are created, they will be cleared as soon as the parent finishes handling its request. But this approach still has performance problems, as we will see later.</p>
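<p>For reference, here is a minimal sketch of the fork-per-request loop, again in Python rather than C and not the code of my server. This version sidesteps the zombie problem the standard way, by telling the kernel to reap dead children automatically, rather than with the parent/child inversion described above; the port and canned response are arbitrary:</p>
<pre><code class="prettyprint lang-python">import os
import signal
import socket

# Have the kernel reap exited children automatically, so no zombies accumulate.
signal.signal(signal.SIGCHLD, signal.SIG_IGN)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('', 8080))
server.listen(128)

while True:
    client, addr = server.accept()
    if os.fork() == 0:
        server.close()              # the child only talks to this one client
        client.recv(4096)           # a real server would call a handler like the one above
        client.sendall('HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok')
        client.close()
        os._exit(0)
    client.close()                  # the parent keeps accepting new connections
</code></pre>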
<p>A second, more common approach, used by servers like Apache, is to create a process pool. After creating the listening socket, you fork the parent process a number of times to create the pool. Each of the child processes has its own file descriptor for the listening socket, but they all refer to the same underlying socket, so whichever process happens to be waiting in accept when a connection arrives gets the client request. Other approaches, which I have not explored yet but will in the coming few days, are multithreading and asynchronous I/O.</p>
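<p>And a similar sketch of the pre-forked pool: the listening socket is created first, then the workers are forked and all block in accept on the same socket. The pool size and port are arbitrary, and again this is Python pseudocode for the idea, not the C implementation:</p>
<pre><code class="prettyprint lang-python">import os
import socket

NUM_WORKERS = 8

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('', 8080))
server.listen(128)

# Fork after the socket exists so every worker inherits a descriptor for it.
for _ in range(NUM_WORKERS):
    if os.fork() == 0:
        while True:
            client, addr = server.accept()   # the kernel hands each connection to one worker
            client.recv(4096)
            client.sendall('HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok')
            client.close()

os.wait()   # the parent just sticks around while the workers serve requests
</code></pre>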
<p>Parsing requests and generating responses is fairly straightforward, which is why I only implemented a small subset of the HTTP protocol features.</p>
<h2 id="benchmarks_2">Benchmarks <a class="head_anchor" href="#benchmarks_2">#</a>
</h2>
<p>It’s time for some benchmarks. I’ll be comparing this file server with Apache httpd and Python’s SimpleHTTPServer module, which I often use for transferring files across wifi. I’m not hoping to compete with Apache on speed here, but this should definitely be faster than SimpleHTTPServer. I was thinking that I might have to write another program for stress testing the file servers, but fortunately <strong>ab</strong>, or ApacheBench, comes to the rescue. You can specify the number of requests, the number of concurrent connections and so on. For the tests, each of these servers will be serving a single file (the compiled binary of my server) and both client and server will be on the same node.</p>
<p>This is the command I used for testing.</p>
<p><code class="prettyprint">ab -n 8000 -c 10 -r localhost/fsrv</code></p>
<p><strong>SimpleHTTPServer</strong> failed with 10 concurrent connections, so I reduced it to 4. But it was still failing, so I reduced the number of requests to 300 and this was the result:</p>
<pre><code class="prettyprint">Python SimpleHTTPServer
Requests per second: 148.06 [#/sec] (mean)
Time per request: 27.015 [ms] (mean)
Time per request: 6.754 [ms] (mean, across all concurrent requests)
</code></pre>
<p>Then I tested <strong>Apache</strong> with the original command, and Apache turned out to be much more robust with over <strong>4100 requests per second</strong>.</p>
<pre><code class="prettyprint">Apache httpd
Requests per second: 4111.07 [#/sec] (mean)
Time per request: 2.432 [ms] (mean)
Time per request: 0.243 [ms] (mean, across all concurrent requests)
</code></pre>
<p>So, time to test my server, first with the fork-on-every-request model:</p>
<pre><code class="prettyprint">fsrv fork every request
Requests per second: 355.25 [#/sec] (mean)
Time per request: 28.149 [ms] (mean)
Time per request: 2.815 [ms] (mean, across all concurrent requests)
</code></pre>
<p>Well, it’s faster than Python but doesn’t seem all that encouraging; let me try the process pool model with 8 processes.</p>
<pre><code class="prettyprint">fsrv process pool (8 procs)
Requests per second: 759.35 [#/sec] (mean)
Time per request: 13.169 [ms] (mean)
Time per request: 1.317 [ms] (mean, across all concurrent requests)
</code></pre>
<p>Better, but it pales in comparison to Apache. Time to run a profiler. On a side note, Apple’s Instruments is a much better profiler than gprof for C/C++ code.</p>
<p>After running the profiler, I found that the majority of the time was spent determining the mime type of a file. I had used the unix <code class="prettyprint">file</code> command, which does some magic behind the scenes to determine a file’s mime type, opening a pipe with popen and feeding filenames to it interactively upon each request.</p>
<pre><code class="prettyprint lang-c">void get_mime_type(const char *filename, char *mime_type) {
    static FILE* pipe = NULL;
    if (pipe == NULL) {
        // Run xargs in interactive mode ("r+" bidirectional popen works on macOS/BSD)
        pipe = popen("xargs -n 1 file --mime-type -b", "r+");
    }
    fprintf(pipe, "%s\n", filename);
    fflush(pipe);  // make sure the filename actually reaches xargs before reading back
    int read = fscanf(pipe, "%s", mime_type);
    if (!read)
        strcpy(mime_type, "application/octet-stream"); // Default mime type
}
</code></pre>
<p>One way to fix this would be to cache the value for each file, or to write a basic mime type lookup in pure C based on a few rules, without any external calls; a sketch of the idea follows. For the time being I decided to just send a simple default mime type header. Time for the benchmarks again.</p>
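<p>The rule-based lookup really only needs a small table from file extension to mime type. Here it is sketched in Python purely to show the idea; the C version would be the same table as a static array, and the extension list is just an example:</p>
<pre><code class="prettyprint lang-python">import os

MIME_TYPES = {
    '.html': 'text/html',
    '.css':  'text/css',
    '.js':   'application/javascript',
    '.png':  'image/png',
    '.jpg':  'image/jpeg',
    '.txt':  'text/plain',
}

def get_mime_type(filename):
    # Look up the extension; fall back to the same default as before.
    ext = os.path.splitext(filename)[1].lower()
    return MIME_TYPES.get(ext, 'application/octet-stream')
</code></pre>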
<pre><code class="prettyprint">fsrv fork every request
Requests per second: 3119.81 [#/sec] (mean)
Time per request: 3.205 [ms] (mean)
Time per request: 0.321 [ms] (mean, across all concurrent requests)
</code></pre>
<p>I was quite surprised by the results, as I thought forks would be quite expensive. After a little bit of searching, I learnt that you can fork around 3-4k times a second.</p>
<pre><code class="prettyprint">fsrv process pool
Requests per second: 7256.96 [#/sec] (mean)
Time per request: 1.378 [ms] (mean)
Time per request: 0.138 [ms] (mean, across all concurrent requests)
</code></pre>
<p>Incredible! This is about <strong>1.76x</strong> faster than Apache! So there you have it: you can write a simple file server from scratch, using only the Unix APIs, that is quite a lot faster than Apache. Though I must admit that Apache has tons of features that would slow it down. Feel free to check out the code at <a href="http://github.com/vivekn/fsrv">http://github.com/vivekn/fsrv</a>.</p>
<h1>Seneca on Reading</h1>
<p>Seneca, in his 1st century classic <em>Letters from a Stoic</em>, has some gems about reading books.</p>
<blockquote>
<p>Don’t keep changing from book to book, passion to passion, one thing to another, that is a sign of a sick and restless tourist.</p>
<p>Only read the works of those whose genius is unquestionable.</p>
<p>If you don’t like something that you read, go back to reading someone whom you’ve read before.</p>
</blockquote>
<p>Paraphrased a little, but I thought they were worth sharing.</p>
<h1>Writing a web server from scratch</h1>
<p>As I mentioned in an earlier post, I have been reading more about system calls in the Unix kernel recently and am thinking of writing a small application to apply what I have learnt and gain some experience with systems programming. So, I’ll be building a minimalist, high-performance web server in C over the next few days.</p>
<p>Why reinvent the wheel, you may ask? This is just to expand my personal knowledge and may not be used by anyone else. The first step would be to get a simple static server running that just serves all the files present in a directory provided as an option to it. Architecturally, this wouldn’t be radically different from a web server with dynamic content, and this is the part I want to focus on for now. To handle multiple requests concurrently there are a number of possible designs, like using a pre-forking model, multithreading or non-blocking I/O; I will be exploring each of these. There will also be some necessary plumbing tasks like parsing and creating HTTP headers. Creating custom handlers for routes would be an interesting problem; perhaps I could extend the project by making a micro web framework in C.</p>
<p>I will be blogging about my experiences building this in subsequent posts. Click <a href="http://invert.svbtle.com/writing-a-web-server-from-scratch-part-2">here</a> to check out part 2 of this series.</p>
<h1>F1 is not interesting anymore</h1>
<p>I recently saw an old race on YouTube with Senna and Mansell battling each other from start to finish. Man, that was an interesting race! It’s a stark contrast to today’s F1: Mercedes has taken Red Bull’s place this year, but it’s the same boring races marked by tyre and fuel conservation and very little racing. Many of the overtaking and defensive maneuvers of Senna’s days would be illegal today. Formula One is supposed to be the pinnacle of racing; why introduce so many unnecessary constraints? Fuel economy and tyre degradation have now become more important than the driving itself. I am going to stop watching the races for a while, possibly forever. I will find better ways to bide my time on a Sunday afternoon.</p>
<h1>Using Vim as a password manager</h1>
<p>The other day at work, we were having a discussion about managing passwords. Gone are the days when you could keep everything in your head. Having accounts on hundreds of sites and apps, I find myself clicking the forgot password link far too often. A non-solution is using the same password for all your accounts: if one of them gets leaked, all of your accounts become vulnerable. There are password management apps like LastPass that generate and store passwords for you, but they can’t be fully trusted as they store your passwords on their servers. There are a couple of apps which encrypt your passwords locally and then back them up in a location of your choice (iCloud, Dropbox, etc.), but they cost something like $50.</p>
<p>If you’re like me and don’t want to spend 50 bucks on a password app, there is a crude alternative: plain old Vim! Vim has an option that enables encryption of plain text files, but its default encryption method, <strong>pkzip</strong>, is not that secure and can be easily brute-forced. So the first thing you need to do is set the crypto algorithm to something stronger. Add this to your <code class="prettyprint">.vimrc</code>:</p>
<pre><code class="prettyprint">set cm=blowfish
</code></pre>
<p>Now create a text file, say <strong>.password</strong>, and open it with Vim. Store your sites, usernames and passwords as tuples in the text file. To set a passphrase, type <code class="prettyprint">:X</code>; Vim will prompt you for one, and once you enter it and save the file, an encrypted version of the text file will be stored on disk. Every subsequent time you open the file, Vim will ask you for the passphrase and then decrypt it, but it will always save the encrypted version. If you need to change the passphrase, type <code class="prettyprint">:X</code> again. Neat, huh?</p>
<p>So you can store this encrypted text file on Dropbox, Google Drive etc to keep your passwords in sync.</p>
<p>Ok, that leaves us with the password generation part: how do we generate strong passwords that follow all those pesky rules? That’s easy: just write a simple script that generates random phrases from the wordlist in your <code class="prettyprint">/usr/share/dict</code> directory. I wrote a short and simple script in Python; feel free to use it.</p>
<pre><code class="prettyprint lang-python">import random
f = open('/usr/share/dict/words')
words = map(lambda x: x.strip(), f.readlines())
password = '-'.join(random.choice(words) for i in range(2)).capitalize()
password += str(random.randint(1, 9999))
print password
</code></pre>
<p>And there you have it, Vim as a simple and convenient password manager!</p>
<h1>Advanced Linux Programming</h1>
<p>I stumbled upon a book called <em>Advanced Linux Programming</em>, by Mark Mitchell, Jeffrey Oldham, and Alex Samuel, while reorganizing my Dropbox folder. I started reading the first few pages and got hooked, spending the rest of the Sunday reading it and trying things out. I had always wanted to learn about operating systems but wasn’t able to find the right resources. Berkeley has a nice set of video lectures on operating systems, but I felt they were too theoretical and abstract; while there is nothing wrong with that, I couldn’t fathom watching 20 one-hour lectures on the topic. I had read the Operating Systems book by Silberschatz earlier, and that was a bit boring and still left me with no clue about how various system calls in *nix systems worked.</p>
<p>This book is more practical and deals with GNU/Linux specifically, though the ideas/APIs are portable to most Unix-based systems including OS X (there are some caveats, though: things like the proc filesystem and static linking won’t work on a Mac). It has all you want to know about linking, loading, makefiles, signals, processes, forking, threading, IPC, filesystems, memory management and more, presented in a very readable fashion. So if you feel that whatever you are programming on is too high-level and the real things are abstracted away from you, go ahead and grab a copy of this book or download it <a href="http://www.advancedlinuxprogramming.com/downloads.html">here</a>.</p>