"; */ ?>


19
Jun 13

Clojure: Down to Nanos

Here is what needs to happen: there is a URN that is a part of an HTTP request. It needs to be parsed/split on the last “:”. The right part would be the key, and the left part would be a value (we’ll call it “id” in this case). Here is an example:

user=> (def urn "company:org:account:347-68F3726A84C")

After parsing we should get a neat map:

{:by "company:org:account", :id "347-68F3726A84C"}

While it feels more readable to start “regex”ing the problem:

user=> (require '[clojure.string :as cstr])
 
user=> (def re-colon (re-pattern #":"))
user=> (cstr/split "company:org:account:347-68F3726A84C" re-colon)
 
["company" "org" "account" "347-68F3726A84C"]

Just splitting on a simple single character regex (above) takes almost a microsecond (i.e. in this case about 2242 CPU cycles):

user=> (bench (cstr/split "company:org:account:347-68F3726A84C" re-colon))
 
       Execution time mean : 830.235720 ns

In general it is always best to use language “builtins”, so we’d turn to Java’s own lastIndexOf:

(defn parse-urn-id [urn]
  (let [last-colon (.lastIndexOf urn ":")]
    {:by (subs urn 0 last-colon)
     :id (subs urn (+ last-colon 1))}))

Putting “validation” outside for a moment, this actually does what is needed:

=> (parse-urn-id urn)
{:by "company:org:account", :id "347-68F3726A84C"}
user=> (bench (parse-urn-id urn))
 
       Execution time mean : 5.588747 µs

Wow.. builtins seem to fail us. How come?

A culprit is not “lastIndexOf”, but a way Clojure resolves an “untyped” “urn”. Anything that is defined with “def” is kept inside a Clojure “Var” that uses reflection and is not amenable to HotSpot optimizations. An interesting read on what actually happens: “Why Clojure doesn’t need invokedynamic, but it might be nice”.

While, in most cases, parsing a String for 6 microseconds is a perfectly fine expectation, there is a simple hint that can make it run 60 times faster. It’s a hint.. It’s a type hint:

(defn parse-urn-id [^String urn]
  (let [last-colon (.lastIndexOf urn ":")]
    {:by (subs urn 0 last-colon)
     :id (subs urn (+ last-colon 1))}))
user=> (bench (parse-urn-id urn))
 
       Execution time mean : 83.409471 ns

By hinting a “urn” type to be a “^String”, this function is now 67 times faster.

Achieve a warm and fuzzy feeling...   [Done]

31
May 13

Datomic: Can Simple be also Fast?

“An ounce of performance is worth pounds of promises”
© Mae West

Speed for the sake of Speed


Benchmarks, performance, speed comparisons, metrics, apples to apples, apples to nuclear weapons.. We, software developers “for real”, know that all that is irrelevant. What relevant is a clear problem definition, flexible data model, simple tool/way to solve it, a feeling of accomplishment, approved billing papers, moving along closer to the stars..

All is true, premature optimization is the root of all evil, readability over pointer arithmetic, performance does not really matter,… until it does. Until a problem definition itself is numbers over time i.e. ..ions per second.

It is not only HFT that defines capabilities of trading companies in their ability to cope with speed of light and picoseconds. Having big data and real time online processing coming out of basements and garages and finally hitting the brains of suits who suddenly “get it” and have money ready for it, makes a new wave of “performance problems” that are “just that”: little bit of business behind the problem, and mostly raw things per second.

Never Trust Sheep

© Ryan Stiles

Interesting thing about choosing the right database for a problem today is the fact that there are just so many available, and many scream that they can do everything for you, from storing data to taking your shirts to dry cleaning on Wednesdays if you buy a dozen of their enterprise licensees. One thing I learned is “to be quick and efficient we have to mute screamers and assholes”, so some fruit and fragile NoSQL boys and Oracle are out.

Datomic is kind of a new kid on a block. Although the development started about 3 years ago, and it was done by Relevance and Rich Hickey, so it is definitely ready for an enterprise ride. To put in into perspective: if MongoDB was ready for Foursquare as soon as MongoDb learned to “async fsync”, then given the creds, Datomic is ready for NASA.

Speed and Time


While usually all Datomic discussions involve “time” and rely on Cojure’s core principles of “Epochal Time Model”, this novel introduces another main character: “Speed”.

Datomic processes write transactions by sending them to a transactor via queue, which is currently HornetQ. Hornet claims to be able to do millions a second with SpecJms2007.

Before transaction is off to a tx queue it needs to be created though. It usually consist out of one or more queries. The documented, and recommended, way to create a query is a Datomic’s implementation of Datalog. It is a nice composable way to talk to the data.

While writes are great, and HornetQ looks quite promising, the fact that Datomic is ACID, and all the writes are coordinated by a transactor, Datomic is not (at least not yet) a system that would be a good fit for things like clickstream data. It does not mean writes are slow, it just means that it is not for 1% of problems that need “clickstream” like writes.

Reads are another story, one of the kickers in Datomic is having access to the database as a value. While “as a value” part sounds like a foreign term for a database, it provides crazy benefits: immutability, time, …, and among all others it provides a database value inside the application server that runs queries on it. A closest analogy is a git repository where you talk to the git repo locally, hence “locality of references”, hence caching, hence fast reads.

Peeking into The Universe


It is always good to start from the beginning, so why not start directly from the source: The Universe. Humans naively separated The Universe into galaxies, where a single galaxy has many planets. So here is our data model, many galaxies and many planets:

(def schema [
  {:db/id #db/id[:db.part/db]
   :db/ident :universe/planet
   :db/valueType :db.type/string
   :db/cardinality :db.cardinality/many
   :db/index true
   :db.install/_attribute :db.part/db}
 
  {:db/id #db/id[:db.part/db]
   :db/ident :universe/galaxy
   :db/valueType :db.type/string
   :db/cardinality :db.cardinality/many
   :db/index true
   :db.install/_attribute :db.part/db} ])

Here is the data we are going to be looking for:

(def data [
  {:db/id #db/id[:db.part/user]
   :universe/galaxy "Pegasus"
   :universe/planet "M7G_677"} ])

And here is a Datalogy way to look for planets in Pegasus galaxy:

(d/q '[:find ?planet :in $ ?galaxy 
       :where [?e :universe/planet ?planet]
              [?e :universe/galaxy ?galaxy]] (db *conn*) "Pegasus")

Which returns an expected planet: #{[“M7G_677”]}

Secret of a Clause


Now let’s bench it. I am going to use criterium:

(bench 
  (d/q '[:find ?planet :in $ ?galaxy 
         :where [?e :universe/planet ?planet]
                [?e :universe/galaxy ?galaxy]] (db *conn*) "Pegasus"))
   Evaluation count : 384000 in 60 samples of 6400 calls.
Execution time mean : 169.482941 µs

170µs for a lookup.. not too shabby, but not blazing either.. Can we do better? Yes.

The way Datalog is implemented in Datomic, it takes less time, and needs to do less, if most specific query clauses are specified first. In the example above “[?e :universe/galaxy ?galaxy]” is the most specific, since the “?galaxy” is provided as “Pegasus”. Let’s use this knowledge and bench it again:

(bench 
  (d/q '[:find ?planet :in $ ?galaxy 
         :where [?e :universe/galaxy ?galaxy]
                [?e :universe/planet ?planet]] (db *conn*) "Pegasus"))
   Evaluation count : 384000 in 60 samples of 6400 calls.
Execution time mean : 158.893114 µs

The “clause trick” shaved off 10µs, which is pretty good. But can we do better? We can.

The Art of Low Level


Since I was a kid, I knew that Basic is great, but it is for kids, and Assembly is the real deal. While Datalog is not Basic, and it is really powerful and good for 95% of cases, there are limitations with how fast you can go. Again, most of the time, a 200µs lookup is more than enough, but sometimes it is not. At times like these we turn to Datomic Assembly: we turn directly to indices.

While in the universe of other DB engines and languages doing something low level requires reading 3 books and then paying a consultant to do it, in Clojuretomic land it is a simple function that is called datoms:

datoms
function
 
Usage: (datoms db index & components)
 
Raw access to the index data, by index. The index must be supplied,
and, optionally, one or more leading components of the index can be
supplied to narrow the result.
 
    :eavt and :aevt indexes will contain all datoms
    :avet contains datoms for attributes where :db/index = true.
    :vaet contains datoms for attributes of :db.type/ref
    :vaet is the reverse index

And it is not just simple, it also works. Let’s cook a generic lookup function that would lookup an attribute of an entity by another attribute of that same entity:

(defn lookup [?a _ a v] 
  (let [d (db *conn*)] 
    (map #(map :v (d/datoms d :eavt (:e %) ?a)) 
         (d/datoms d :avet a v))))

A couple of things:

First we look for an entity with an attribute “a” that has a value “v”. For that we’re peeking into the AVET index. In other words, we know “A” and “V” from “AVET”, we are looking for an “E”.

Then we map over all matches and get the values of an attribute in question: “?a”. This time we’d do that by peeking into “EAVT” index. In other words, we are looking for a value “V”, by “E” and “A” in “EAVT”.

Note that idiomatically, or should I say datomically, we should pass a database value into this “lookup” function, since it would enable lookups across time, and not just “the latest”.

Let’s bench it already:

(bench 
  (doall 
    (lookup :universe/planet :by :universe/galaxy "Pegasus")))
   Evaluation count : 11366580 in 60 samples of 189443 calls.
Execution time mean : 5.848209 µs

Yep, from 170µs to 6µs, not bad of a jump.

Theory of Relativity


All performance numbers are of course relative. Relative to the data, to the hardware, to the problem and other sequence of events. The above is done on “Mac Book Pro 2.8 GHz, 8GB, i7 2 cores” with a 64 bit build of JDK 1.7.0_21 and “datomic-free 0.8.3862”. Having Clojure love for concurrency I would expect Datomic performance (i.e. those 6µs) to improve even further on the real multicore server.

Another thing to note, above results are from a single Datomic instance. Since Datomic is easily horizontally scalable, by partitioning reads and having them go out against “many Datomics” (I am not using a “cluster” term here, since the database is a value and it is/can be collocated with readers) the performance will improve even further.

While most of the time I buy arguments on how useless benchmarks are, sometimes they really do help. And believe it or not, my journey from Datalog to Datomic indicies, with a strictly performance agenda, made it possible to apply Datomic to the real big money problem, beyond hello worlds, git gists and prototypes that die after the “initial commit”.


04
Apr 13

Software as Space-Time Continuum

Is Software for Real?


Here is a question and an answer. They both propose a sequence of logical conclusions, but the answer is of course a “stronger” form of communicating this sequence:

Q: Do you create computer programs? Do you design them based on a business/problem domain? Do you understand this domain? Does this domain apply to the real world? Do you understand the real world? Do you create computer programs based on your understanding of the real world? Do you need to?

A: If you create computer programs they should be based on a business/problem domain. Hence you should understand this domain. Most of the problems/businesses are directly applicable/modeled after the real world. Hence you should understand the real world. Therefore you should create computer programs that apply/modeled after the real world (as much as possible, hence then you can “implement” the domain more closely / find a solution to a problem that can be directly applied to the real world).

While Q and A both follow the same logical sequence, I’d like to explore the Q of things. I’d like to question whether it is really true that we need to design/model our software/languages/libraries after the real world. And if it is true how close we need to “get the real world right” in our software design.

Intuition vs. Intellect


The first instinct, or my “natural intuitive power”, suggests that it is really the case. If we do model after the real world then we are building systems, evolving our metal and software friends who can potentially just coexist with us in our world following the laws of nature. Another supportive argument is a “business problem” that needs to be solved with software. For example “empowering businesses to predict nature”. Of course in order to predict something, we have to understand it and its environment. Currently we do that by taking petabytes of data and slapping it with something like Random Forest, and looking at probabilities… But what if our software just “knew” how things work, not based on historical data, but based on metaphysics, i.e. based on “what the world is and how it really works”.

While I am still not convinced we have to model our software after the real world “all the time”, I do lean towards this idea for the “most problems” I get to work with. In the least number of cases, for example when I get to write a stress test that sends billions of option quotes over the wire, I really don’t care whether it is an “ugly” imperative for loop or a “beautiful” pure function ‘send’ that is mapped over an immutable lazy sequence of quotes. This of course is an over simplification: firstly I should care, and it should be an imperative for loop, since it is faster, and secondly mapping functions over a sequence of things is not really a true modeling of a real world. Or is it?

“A Mad Tea-Party”


Mad Hatter
If modeling after the real world is important, it is obvious we can’t omit “time” dimension when designing and creating software. But most of the time (no pun intended) we do just that. Most widely known/deployed software design approaches/patterns and instruments (e.g. programming languages) are stuck in “Wonderland“:

“Time has punished the Hatter by eternally standing still at 6 pm, and therefore there is always tea time”

Think about a C++/Ruby/Python/Java.. “object”. Does it have a notion of time? No. It may seem it has a notion of “now”, but when it is (its state) teared apart by multiple threads at the same time there is no notion of “now” anymore, and the poor object suffers terribly.

There is a classical talk by Rich Hickey “Are We There Yet?”, where he presents an “Epochal Time Model”:

Epochal Time Model

Where he introduces time as succession of states, and advocates immutability, i.e. the value of state “V1” is forever “V1” at that particular point in time. And we only get to state with a value “V2” after applying a function “F” to the state “V1”. Theoretically if we did support this notion of time, we could just talk to this “timeline”, which Rich calls an identity, and ask “what was its state at that point in time?” or more precisely “when/at which point in time was the state’s ‘value’ equal to “V1″?”. Practically we can do it with Datomic.

Does Time Stutter?


It’s more of a metaphysical thought, but if we are after true representation of the real world, does the “Epochal Time Model” really reflect how the world “changes”? If you wave your hand, does it just hyperjump from a state A to a state B? How granular is the path from A to B? Or is it absolutely “continuous” in which case there is no granularity, i.e. no smallest time interval? If you observe someone waving her hand, it is a different story: “the human eye and its brain interface, the human visual system, can process 10 to 12 separate images per second”, so here is a “state succession” interval. But what does really happen? Is the hand itself moving one planck at a time? Is it moving “de Broglie wavelength” at a time? Is it even important when modeling software?

Do You “Function” Me?


Rich, and many other functional programming advocates, “sell” the idea of functional programming by stressing two things:

1. “This is how things work in the real world…”
2. “It is much easier to reason about…”

While the latter is mostly human based (how “we think”), it is connected to the #1. Since we think one way or another because our thinking is limited by what we know (+ imagination [this is controversial :]). And quite possibly our thinking is based on what we know “about the real world”, which we are part of.

It does seem to be important to model after the real world [ maybe :]. But why now? Why not 20 or 50 years ago? Did any of the programming languages actually aim to even connect to the real world before?

1957 Fortran: concepts included easier entry of equations into a computer

1958 Lisp: created as a practical mathematical notation for computer programs, influenced by the notation of Alonzo Church’s lambda calculus.

1958 ALGOL: was the standard method for algorithm description used by the ACM in textbooks and academic works

1962 Simula: created as a special purpose programming language for simulating discrete event systems

Ok, looks like the closest match is Simula that at least was created to simulate “something”. It is an ironic coincidence that it also is the source of Object Orientation (and actor model), but that is besides the point.

1971 Smalltalk-71 was created by Ingalls in a few mornings on a bet that a programming language based on the idea of message passing inspired by Simula could be implemented in “a page of code.”

1972 Prolog was motivated in part by the desire to reconcile the use of logic as a declarative knowledge representation language with the procedural representation of knowledge

1972 C was designed to be compiled using a relatively straightforward compiler, to provide low-level access to memory, to provide language constructs that map efficiently to machine instructions, and to require minimal run-time support.

1973 ML was conceived to develop proof tactics in the LCF theorem prover (whose language, pplambda, a combination of the first-order predicate calculus and the simply typed polymorphic lambda calculus, had ML as its metalanguage).

1975 Scheme started as an attempt to understand Carl Hewitt’s Actor model, for which purpose Steele and Sussman wrote a “tiny Lisp interpreter” using Maclisp and then “added mechanisms for creating actors and sending messages.”

1979 C++ remembering his Ph.D. experience, Stroustrup set out to enhance the C language with Simula-like features

1986 Erlang was designed with the aim of improving the development of telephony applications

1991 Java was originally designed (as “Oak”) for interactive television, but it was too advanced for the digital cable television industry at the time.

1995 JavaScript was originally implemented as part of web browsers so that client-side scripts could interact with the user, control the browser, communicate asynchronously, and alter the document content that was displayed

Looking at major languages created before “10/20 years ago”, their creational purpose does not seem to be “real world” related. For example the first two are clearly strictly math related.

Emotional Mathematics


So is it about Math? After all, we are “Computer Scientists” and although most of “Computer Programmers” are (unfortunately) very far from math, the science behind computers is very much based on it. Is the goal of mathematics, as a science, to model the real world? Let’s look how we “know to define” math:

math·e·mat·ics  [math-uh-mat-iks]
noun
1. The systematic treatment of magnitude, relationships between figures and forms, 
   and relations between quantities expressed symbolically.
2. A group of related sciences, including algebra, geometry, and calculus, 
   concerned with the study of number, quantity, shape, and space 
   and their interrelationships by using a specialized notation
3. Mathematical operations and processes involved in the solution of a problem 
   or study of some scientific field

Complimenting with “the current world intellect” (wikipedia):

Mathematics "has no generally accepted definition". 
Aristotle defined mathematics as "the science of quantity", 
Benjamin Peirce's "the science that draws necessary conclusions", 
                  "Mathematics is the mental activity which consists 
                  in carrying out constructs one after the other.",
Haskell Curry defined mathematics simply as "the science of formal systems", 
where a formal system is a set of symbols, or tokens, 
and some rules telling how the tokens may be combined into formulas.

From the above definitions it is hard to say whether the “real world modeling” is what Math is after, although we definitely know that Math is there to “explain” things in the world, oh.. wait, that’s Physics :]

If we are after modeling the real world (if it is indeed important), should we try to be as close as possible to the “reality”? Or should we just take the real world in by pieces and convenient definitions that suit a certain process, a problem at hand, programming style, hype, tech movement, etc.? It feels right to be close to the real world, since we are a part of it. It feels right to be close to ourselves. Are we just selfish? :]


16
Jan 13

Backup and Reset Nexus 4: Cracked and Locked Screen

Story of a Flaw


Nexus 4 has an invisible design flaw: the back side of a phone is just glass, which, while looks pretty to some, makes it slide down and fall from pretty much any surface. Another unfortunate caveat is the curved unprotected glass on the front. If the phone falls down, it’ll most likely crack…

And here I present a 20 day old brand new Nexus 4:

nexus screen

After the unfortunate flight down, the screen is done and stopped responding to any tapping/touching. In addition the screen is locked with a gesture pattern. The phone is not rooted and with a locked bootloader: exactly how it comes from Google.

Android iTunes


At this point I need to back it up and factory reset the phone in order to ship it safely for repairs. Since Google is moving towards Apple’s “lock down and control” system design, the phone can’t be just plugged in via USB to back up its files (e.g. via USB Mass Storage), now days there is MTP. As Google puts it:

“We did it because we wanted to be able to merge the “public shared storage” (i.e. for music and photos) with the internal private app storage.”

Yea, ok, while it does have a technical merit, it really brings user experience closer to the “iTunes level”, not as bad yet, but quite close. In any case, since the Nexus 4 screen is locked, Android iTunes complains that until that screen is unlocked it won’t be showing any files. Yea, thanks for the security Apple.. I mean Google.

Unlocking the Screen


There is a great tool written by Alexanre which allows to control an Android device remotely from Mac/PC. In other words a regular keyboard/mouse could be used to control the Android device. The tool is called Android Screencast. The only caveat, the phone needs to be rooted in order to be controlled by this tool. I got to see the phone’s screen on my Mac:

android screencast

In order to make Mac keyboard to work I need to do “chmod 777 /data/dalvik-cache” as root (e.g. “su”), but the phone is not rooted, bummer. The tool is great though.

Btw, if the phone was rooted, I could simply do: “adb shell rm /data/system/gesture.key” to get rid of the screen lock.

Backing Up What’s Dear


One thing I knew I could do for sure is to backup the “sdcard”. I knew this because of two things: I have “USB Debugging” on, which means I can use adb, and “sdcard” is not owned by root, which means it can be “pulled”/”read” by adb. Hence the first step is clear:

./adb pull /sdcard/ /destination

At this point all the pictures / videos / music, etc.. are backed up. Now I need to backup my SMS, contacts, etc.. Information that lives in Android “databases”. For example contacts usually live here “/data/data/com.android.providers.contacts/databases”. Which can’t be simply “pulled” as pitures, since they are protected, and hence cannot be read by adb directly.

Another tool to the rescue: Moborobo. The only caveat it is a Windows tool, and in order to install it I needed to power up my virtual box. The tool is pretty neat and quite powerful. Unfortunately all it could backup for me was SMS, everything else failed, but it is one step further nevertheless + I have most of my contacts gmail synced. Apps would be nice to backup, but they can be reinstalled manually later on.

Waving Goodbye or The Factory Reset


Now the interesting bit: the factory reset. It was not really straightforward, since most of googled instructions either talk about doing it from within a phone by tapping through settings, which is not an option in this case, or by using Home / Back buttons which are also a part of the screen that does not work. But after some minutes “the way” revealed itself.

Firstly the phone needs to be rebooted in “Recovery mode”, which can be done through fastboot, in case “USB Debugging” is not enabled:

Disconnect the phone. Power it down (by holding the Power button). Reboot into fastboot mode by holding the Volume Down button and Power:

fastboot start

The “Recovery Mode” is two Volume button clicks away (confirm with a Power button once the mode is selected):

fastboot recovery

In case “USB Debugging” is enabled, the easier way to get to this step would be via adb:

./adb -d reboot recovery

Which will boot into:

recovery mode

From here press and hold Power button and then press Volume Up, which will get into:

android factory reset

Now all that needs to be done is to mentally wave good bye to all the data and confirm the reset:

confirm factory reset

And the Award Goes To…


All the pictures are courtesy of my good old Nexus S, which has fell down countless number of times over the years and have a couple of scratches on the back. Yes, “they” knew how to build phones for real in the good old days..


23
Nov 12

Convert HTML5 FileList to Clojure Vector

When AJAX uploading data/files via input elements with ClojureScript, HTML5 returns FileList which is not really a list nor an array, hence can not be converted to Clojure by simply calling “js->clj“, although the usage is pretty similar to an array:

// uploadData is a form element
// fileChooser is input element of type 'file'
var file = document.forms['uploadData']['fileChooser'].files[0];

But as it is always in the Clojure world, there is a solution. This one is not trivial, but after spending almost 30 minutes, it surfaced under my finger tips:

(defn toArray [js-col]
  (-> (clj->js []) 
      (.-slice)
      (.call js-col)
      (js->clj)))

Now in ClojureScript the “(toArray fileList)” will return a classic seq that can be loved and iterated.