Tech Blog.

Thoughts, stories, ideas.

Finja – Your friendly finding ninja

05/06/2016

Imagine you get a login for Unix-Box running on some esoteric hardware. On that box you find a 10 mloc C++/C/Java project, with many custom domain specific languages and file formats. You may not take the source to your computer, because it belongs to the customer. You may change the source and build it on that box. You may not install anything.

How do you navigate in that source? You can try grep, but you will die of old age before you have found all the references for that symbol you want to know more about. Maybe you are lucky and GNU-Idutils is installed, but of course that symbol is also in some of the custom files which Idutils won’t index by default.

But luckily there is Python 2.6 on that machine, so you can request the help of a ninja.

And yes things like this happen to me.

Finja

Finja is an indexer that only requires Python 2.6+ (3.4+). Unlike many of the great alternatives to finja, finja is generic. It doesn’t know what it is indexing. Finja achieves good indexing quality by doing multiple passes with different tokenization methods and splitting character lists. Therefore it is slower and has a bigger index than non-generic indexers, but it just indexes your stuff and won’t miss any files it doesn’t know about.

Using finja

I call finja -i in the directory I am writing this article.

$> finja -i
Makefile: indexed 87/147 (0.592) new: 38 UTF-8
finja-blog.html: indexed 4370/10492 (0.417) new: 1126 UTF-8
finja-blog.rst: indexed 487/1517 (0.321) new: 73 UTF-8
Indexing done

In the Makefile it found 87 unique tokens. It actually found 147 tokens using different parsing-methods. Like in one pass it uses whitespace as separators, in a second pass it uses punctuation marks from natural language, in the third it uses typical separators found in programming languages. So 0.592 indicates the overhead introduced by the multiple passes (1.0 perfect, lower numbers are worse). It found 38 new tokens and detected the file as UTF-8. Yes, we can detect encodings. For better performance we assume UTF-8 – if reading the file as UTF-8 fails, we detect the encoding.

Now we’d like to find all the ninjas.

$> finja ninja
.:
finja-blog.html:    7:<title>Finja - Your friendly finding ninja</title>
finja-blog.html:  696:<div class="document"
finja-blog.html:  697:<h1 class="title">Finja - Your friendly finding ninja</h1>
finja-blog.rst:    1:Finja - Your friendly finding ninja
finja-blog.rst:   16:a ninja.

I’m not interested in results from the html file.

$> finja -p html ninja
.:
finja-blog.rst:    1:Finja - Your friendly finding ninja
finja-blog.rst:   16:a ninja.

So in the directory ‘.’ we find ninja two times on line 1 and 16. But wait this isn’t uptodate.

$> finja -u
finja-blog.rst: indexed 939/2897 (0.324) new: 111 UTF-8

With finja -u it only updates file that have changed.

$> finja -p html ninja
.:
finja-blog.rst:    1:Finja - Your friendly finding ninja
finja-blog.rst:   16:a ninja.
finja-blog.rst:   61:   $> finja ninja
finja-blog.rst:   63:   finja-blog.html:    7:<title>Finja - Your friendly
finja-blog.rst:   65:   id="finja-your-friendly-finding-ninja">
finja-blog.rst:   66:   finja-blog.html:  697:<h1 class="title">Finja - Your
finja-blog.rst:   67:   finja-blog.rst:    1:Finja - Your friendly finding ninja
finja-blog.rst:   68:   finja-blog.rst:   16:a ninja.
finja-blog.rst:   74:   $> finja -p html ninja
finja-blog.rst:   76:   finja-blog.rst:    1:Finja - Your friendly finding ninja
finja-blog.rst:   77:   finja-blog.rst:   16:a ninja.

“Yo dawg, I hear you like searching ninjas, we put ninjas in your search, so you can search ninjas while you search.” Yes, the result is a bit recursive, but it proves the point that we just updated the index.

You can also search and update from subdirectories:

$> cd dir/
$> finja -u -p html punctuation
..:
finja-blog.rst:   51:second pass it uses punctuation marks from natural

SQLite can be your friend

Of course I soon hit some performance problems with a 10 mloc project. It turns out that the query-optimizer of SQLite is stupid. Stupid is the right word, because it isn’t that bad, it just is – well – not as sophisticated as I am used to (PostgreSQL). By default it won’t do any histogram analysis and therefor will join the tables in a bad order. It turns out that even when you enable SQLITE_ENABLE_STAT4, which does a histogram analysis, it will select a bad join order. I don’t blame SQLite at all, it isn’t PostgreSQL after all. So we optimized the queries in finja.

SELECT
    COUNT(id) count
FROM
    finja
WHERE
    token_id = ?

We check how many times a token is in the table and we then join the token with the lowest cardinality first.

def search_term_cardinality(term_id):
    db         = get_db(create = False)
    con        = db[0]

    curs = con.cursor()
    res = curs.execute(_token_cardinality, [term_id]).fetchall()
    return res[0][0]

def order_search_terms(search):
    res = sorted(search, key=search_term_cardinality)
    return res

This is of course based on the assumption that tokens are distributed on multiple files/lines. If you have a single file repeating the same token a billion times on the same line, our heuristic won’t work. But what can you do. It is much better than letting SQLite decide.

Final word

Please install it, use it, submit a issue on github if you find a problem.

Thanks a lot!