Tricks to Speed up Neighbor Searches of Quadtrees. #geo #spatial #java

In Java land there are at least two quadtree implementations which are not yet optimal, so I though I’ll post some possibilities to tune them. Some of those possibilities are already implemented in my GraphHopper project.

Quadtree

What is a quadtree? Wikipedia says: “A quadtree is a tree data structure in which each internal node has exactly four children.” And then you need some leaf nodes to actually store some data – e.g. points and associated values.

A quadtree is often used for fast neighbor searches of spatial data like points or lines. And a quadtree with points could work like the following: fill up a leaf node until its full (e.g. 8 entries), then create a branch node with 4 pointers (north-west, north-east, south-west, south-east) and decide where the leaf node entries should go depending of its location, in this process it could be necessary to create new branch nodes if all entries are too much clustered.

Now a simple neighbor search can be implemented recursively. Starting from the root node:

  • If the current node is a leaf node check if the point is in the search area. If that is the case add it to the results
  • If it is branch node check if one of the 4 sub-areas intersects with the search area. If a sub-node intersects then use that as current node and call this method recursively.

Trick 1 – Normalize the Distance

Searches are normally done with a point and a radius. To check if the current area of the quadrant intersects with the search area you need to calculate the distance using the Haversine formula. But you don’t need to calculate it every time in its entire complexity. E.g. you can avoid the following calculation:

R * 2 * Math.asin(Math.sqrt(a));

This is ok, if you have already normalized the radius of the search area via:

double tmp = Math.sin(dist / 2 / R);
return tmp * tmp;

Trick 2 – Use Smart Intersect Methods

The intersect method should fail fast. E.g. when you use again a circle as search area you should calculate only once the bounding box and check intersection of this with the quadrant area before applying the heavy intersect calculation with the Haversine formula:

public boolean intersect(BBox b) {
    // test top intersect
    if (lat > b.maxLat) {
        if (lon < b.minLon)
            return normDist(b.maxLat, b.minLon)  b.maxLon)
            return normDist(b.maxLat, b.maxLon)  0;
    }

    // test bottom intersect
    if (lat < b.minLat) {
        if (lon < b.minLon)
            return normDist(b.minLat, b.minLon)  b.maxLon)
            return normDist(b.minLat, b.maxLon)  0;
    }

    // test middle intersect
    if (lon < b.minLon)         return bbox.maxLon - b.minLon > 0;
    if (lon > b.maxLon)
        return b.maxLon - bbox.minLon > 0;
    return true;
}

Also be very sure you defined your bounding box properly once and for all. I’m using: minLon, maxLon followed by minLat which is south(!) and maxLat. Equally to EX_GeographicBoundingBox in the ISO 19115 standard see this discussion.

Trick 3 – Use Contains() and not only Intersect()

Now a less obvious trick. You could completely avoid intersect calls of quadrant areas which lay entirely in the search area. For this it is necessary to calculate fast if the search area fully contains the quadrant area or not. E.g. the method for a boudning box containing another bounding box would be:

class BBox {
  public boolean contains(BBox b) {
    return maxLat >= b.maxLat && minLat = b.maxLon && minLon   }
  ...
}

Similar for BBox.contains(Circle), Circle.contains(BBox) and Circle.contains(Circle).

Trick 4 – Sort In Time and not just Adding

Normally you want only 10 or less nearest neighbors and not all neighbours in a fixed distance. Especially for search engines like Lucene this should be favoured. For that you should use a priority queue to handle the ordering of the result. But not only of the leaf nodes – also when deciding which branch node should be processed next! See the paper Ranking in spatial databases for more information, where also a method for incremental neighbor search is described. This would make paging through the results very efficient.

Trick 5 – Use Branch Nodes with more than Four Children

Instead of branch nodes with 4 children you could use a less memory efficient but faster arrangement: use branch nodes with 16 child nodes. Or you could even decide on demand which branch node you should use – this can lead to partially big branch arrays where the quadtree is complete – making searching very efficient as then the sub-quadtree is a linear quadtree (see below).

Tricks for linear QuadTrees

Trick 6 – Avoid costly intersect methods

This only applies if you quadtree is a linear one ie. one where you can access the values by spatial keys (aka locational code). You’ll need to compute the spatial key of all four corners of your search area bounding box. Than compute the common bit-prefix of the points and start with that bit key instead from zero to traverse the quadtree. More details are available in the paper Speeding Up Construction of Quadtrees for Spatial Indexing  p.12.

Trick 7 – Bulk processing for linear Quadtrees

If you know that a quadrant is completely contained in a search area you can not only avoid further intersection calls, but also you can completely avoid branching and loop from quadrants bottom-left point to top-right. E.g. your are at key=1011 and you know the current quadrant node is contained in the search area. Then you can loop from the index “1011 000000..” to “1011 111111..”

Trick 8 – Store some Bits from the Point in Branch Nodes

For linear quadtrees you already encoded the point into spatial keys. Now you can use a radix tree to store the data memory efficient: some bits of the spatial key can be stored directly in the branch nodes. I’ve create also a different approach of a memory efficient spatial storage but as it turns out it is not that easy to handle when you need neighbor searches.

Conclusion

As you can see a lot time can go into tuning a quadtree. But at least the first tricks I mentioned should be used as they are easy and fast to apply and will make your quadtree significant faster.

Spatial Keys – Memory Efficient Geohashes

When you are operating on geographical data you can use latitude and longitude to specify a location somewhere on earth. To look up some associated information like we do in GraphHopper for the next street or if you want to do neighborhood searches you could create R-trees, quad-trees or similar spatial data structures to make them efficient. Some people are using Geohashes instead because then the neighborhood searches are relative easy to implement with simple database queries.

In some cases you’ll find binary representations of Geohashes – I named them spatial keys – in the literature I found a similar representation named locational code (Gargantini 1982) or QuadTiles at OpenStreetMap. A spatial key works like a Geohash but for the implementation a binary representation instead of one with characters is chosen:

At the first level (e.g. assume world boundaries) for the latitude we have to decide whether the point is at the northern (1) or southern (0) hemisphere. Then for the longitude we need to know wether it is in the west (0) or in the east (1) of the prime meridian resulting in 11 for the image below. Then for the next level we “step into” smaller boundaries (lat=0..90°,lon=0..+180°) and we do the same categorization resulting in a spatial key of 4 bits: 11 10

The encoding works in Java as follows:

long hash = 0;
double minLat = minLatI;
double maxLat = maxLatI;
double minLon = minLonI;
double maxLon = maxLonI;
int i = 0;
while (true) {
    if (minLat  midLat) {
            hash |= 1;
            minLat = midLat;
        } else
            maxLat = midLat;
    }

    hash &amp;amp;lt;&amp;amp;lt;= 1;
    if (minLon  midLon) {
            hash |= 1;
            minLon = midLon;
        } else
            maxLon = midLon;
    }

    i++;
    if (i &amp;amp;lt; iterations)
        hash &amp;amp;lt;&amp;amp;lt;= 1;
    else
        break;
}
return hash;

When we have calculated 25 levels (50 bits) we are in the same range of float precision. The float precision is approx. 1m=40000km/2^25 assuming that for a lat,lon-float representation we use 3 digits before the comma and 5 digits after. Then a difference of 0.00001 (lat2-lat1) means 1m which can be easily calculated using the Haversine formula. So, with spatial keys we can either safe around 14 bits per point or we are a bit more precise for the same memory usage than using 2 floats.

I choose the definition Lat, Lon as it is more common for a spatial point, although it is against the mathematic point x,y.

Now that we have defined the spatial key we see that it has the same properties as a Geohash – e.g. reducing the length (“removing some bits on the right”) results in a broader matching region.

Additionally the order of the quadrants could be chosen differently – instead of the Z-curve (also known as Morton Code) you could use the Hilbert Curve:

But as you can see, this would make the implementation a bit more complicated and e.g. the orientation of the “U” order depends on previous levels –  but the neighborhood searches would be more efficient – this is also explained a bit in the paper Speeding Up Construction of Quadtrees for Spatial Indexing  p.8.

I decided to avoid that optimization. Let me know if you have some working code of SpatialKeyAlgo for it 🙂 ! It should be noted that there are other space filling curves like the ones from Peano or Sierpinsky.

One problem

while implementing this was to get the same point out of decode(encode(point)) again. The encoding/decoding schema needs to be “nearly bijective” – i.e. it should avoid rounding problems as good as possible. To illustrate where I got a problem assume that the point P (e.g. see the image above) is encoded to 1110 – so we already lost precision, when our point is now at the bottom left of the quadrant 1110. Now if we decode it back we loose again some precision due to normal double rounding errors. If we would encode that point again it could end up as 1001 – the point moved one quadrant to the right and one to the bottom! To avoid that, you need to define position of the point at the center of the quadrant while decoding. I implemented this simply by adding half of the quadrant width to the latitude and longitude. This makes the encoding/decoding stable even if there are minor rounding issues while decoding.

A nice property of the spatial key

is that one bounding box e.g. for the starting bits at level 6:

110011

goes from the bottom left point

110011 0000..

to the top-right point

110011 1111..

making it easy to request e.g. the surrounding bounding boxes of a point for every level.

Conclusion

As we have seen the spatial key is just a different representation of a Geohash with the same properties but uses a lot less memory. The next time you index Geohashes in your DB use a long value instead of a lengthy string.

In the next post you will learn how we can implement a memory efficient spatial data structure with the help of spatial keys. If you want to look at a normal quadtree implemented with spatial keys you can take a look right now at my GraphHopper project. With this quadtree neighborhood searches are approx. twice times slower than one with values for latitude and longitude due to the necessary encoding/decoding. Have a look into the performance comparison project.

Memory Efficient Java – Mission Impossible?

First, the normal usage in Java leads to massive memory waste as its pointed out in several articles or here in a nice IBM article.

Some examples for the 32 bit JVM:

  1. Using 3 * 4 bytes for an object with no member variables.
  2. Using 4 * 4 bytes for an empty int[] array!
  3. Using 4 * 4 + 7 * 4 = 44 (!) bytes for an empty String object

Of course things are a bit more complicated but in short: the wrapper classes and shorter arrays should be avoided.

Second, doing it a bit more efficient is relative easy – just use the primitive collections from the trove project, where wrapper objects can be avoided. Because the standard collection classes are trimmed to be CPU efficient – not really memory efficient. Also several JVM tuning parameters could be used to reduce RAM usage.

Thrird, doing it more efficient or as efficient as in C++ is very complex and in some cases even impossible.

Let me first explain why it is in some cases impossible to be as efficient as in C++

When you allocate an array of primitives such as int, long or double in Java then the values are stored directly in the array:

int1
int2
int3
...

This is good in terms of memory. Now when you allocate an array of objects then only the references are stored:

ref1  --- > points the object (a position somewhere on the heap)
ref2  --- > points to another object
ref3  --- > etc
...

Imagine that you are using an array of an object Point with only two members latitude and longitude (e.g. both floats). Then you would waste the memory necessary for the reference (4 bytes on  the 32bit JVM) and the object overhead (12 bytes). 16 bytes waste; 8 bytes data. Now assume you need 100 mio points (osm data of Germany) => you would waste 1.6 GB RAM and only if you do it efficient 😉

If you think about this a bit you would probably argue that a clever JVM could inline the objects to avoid the overhead and the additional reference. Well, I think this is impossible.

For the JVM it is nearly impossible to inline objects in arrays

You have two solvable problems when you want that the JVM inlines the objects for you:

  1. You need one bit indicating if the entry is null (0) or not (1) – after initializing then all entries are null. Or one could even call a default constructor per configuration or whatever.
  2. The class needs to be final, as otherwise the lengths of one subclass entry could exceed the reserved maximum length.

… but you have one really hard – unsolvable (?) problem:

     3. You would need to change the move/reference semantics in Java – a no go in my opinion, not only for the language designers.

But why you would need to change the semantics? Well, imagine you have two such inline arrays and you are doing something ‘unimportant’:

// The point p refers to the memory in the array. Okay.
Point p = inlineArr1[100];
p.setX(111f);

// Uh, what to do now? In C++ you could define to copy the point into the array.
// In Java you would copy the reference - not possible for 2 inline arrays ...
inlineArr2[100] = p;
inlineArr2[100].setX(222f);
float result = p.getX();

What result would we get in the last line? With normal Java arrays you get 222f. But with the inline approach and copy semantics you get 111f – which would be against every Java-thinking.

Instead of using Java-unintuitive copy semantics one workaround could be to forbid assignments to such inline arrays. Those read-only inline arrays would be harder to use but still nice to have IMO 🙂 !

Memory efficient Java – via Primitives

Now let us investigate how we could be in Java as memory efficient as with C++.

The memory problem above is solvable in Java when one would use two float arrays. But it makes programming harder. For example how to swap two of those entries? The normal Java way for the Point class is:

Point tmp = arr[2]; arr[2]=arr[3]; arr[3]=tmp;

But the array wrapper would make it necessary to create a separate swap method which then swaps the two entries for every point:

float tmpx = x[2]; x[2]=x[3]; x[3]=tmpx;
float tmpy = y[2]; y[2]=y[3]; y[3]=tmpy;

If you retrieve the raw floats and would swap the entries outside of the wrapper it would be error prone to add more members later to the object (e.g. like a point ID). And you’ll get probably other problems as well.

But we could put an object oriented wrapper class around it which returns Point objects created from the underlying float arrays, which has the disadvantage of being CPU intensive. Now, in the next part we use a similar idea but operate on more low level arrays which then gives us the possibility to use memory mapped features – turning our datastructure in a simple storage. Read on!

Memory efficient Java – via raw Bytes

A second solution in Java is to use an object oriented wrapper around one big byte array or a ByteBuffer which then can be even memory mapped:

RandomAccessFile file = new RandomAccessFile(fileName, "rw");
MappedByteBuffer byteArray = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, size);

The disadvantage is that you will have additional CPU for the conversion and a much more complicated procedure to retrieve and store things. E.g. to retrieve a point you’ll need to write:

Point get(int index) {
  int offset = index * bytesPerPoint;
  return new Point(bytesToFloat(byteArray, offset), bytesToFloat(byteArray, offset + 4));
}

or to serialize the object into bytes:

void set(Point p) {
  int offset = index * bytesPerPoint;
  writeFloat(byteArray, offset, p.getX());
  writeFloat(byteArray, offset + 4, p.getY());
}

The big advantage over the method with a normal primitive array is that then “loading” objects on startup takes milliseconds and not seconds or minutes, which is quite cool and very uncommon for java 😉

Conclusion

In my opinion all the memory efficient programming methods for Java are very ugly. This time the simplicity of C++ easily beats Java, because in Java you need to operate on bits and bytes – in C++ you could just cast the bytes to your objects and be memory efficient as well.

But (sadly?) I’m a Java guy – and I still prefer faster development cycles than 100% memory efficiency. Read in some weeks how I’ve implemented a memory efficient Map of Geohashes or a simple graph in Java for my GraphHopper project.

Update – Resources

Free Online Graph Theory Books and Resources

A link compilation of some Hackernews and Stackoverflow posts and a longish personal investigation.

Google books

Chapters:

Visualizations

None-free

Rage Against the Android – Nearly solved! #eclipse #netbeans

After ranting against Android in my previous post I have mainly solved now the untestable situation Android is producing.

Thanks to Robolectric all my tests are finally fast and do not unpredictable hang or similar!

The tests execute nearly as fast as in a normal Maven project – plus additional 2 seconds to bootstrap (dex?). Robolectric was initially made for maven integration on IntelliJ. But it works without Maven and under Eclipse as described here. I didn’t get robolectric working via Maven under Eclipse – and it was overly complex e.g. you’ll need an android maven bridge but when you use a normal Android project and a separate Java projects for the tests then it works surprisingly well for Eclipse.

Things you should keep in mind for Eclipse:

  • Eclipse and Robolectric description here
  • In both projects put local.properties with sdk.dir=/path/to/android-sdk
  • Android dependencies should come last for the test project as well as in the pom.xml for the maven integration (see below)

But now even better: when you are doing your setup via Maven you can use NetBeans! Without any plugin – so even in the latest NetBeans version. But how is the setup done for Maven? Read this! After this you only need to open the project with NetBeans (or IntelliJ).

Things you should keep in mind for Maven and NetBeans:

  • Use Maven >= 3.0.3
  • Again: put the android dependencies last!
  • Specify the sdk path via bashrc, pom.xml or put it into your settings.xml ! Sometimes this was not sufficient in NetBeans – then I used ANDROID_HOME=xy in my actions
  • Make sure “compile on save” is disabled for tests too, as I got sometimes strange exceptions
  • deployment to device can be done to:
    mvn -DskipTests=true clean install      # produce apk in target or use -Dandroid.file=myApp.apk in the next step
    mvn -Dandroid.device=usb android:deploy # use it
    mvn android:run
  • If you don’t have a the Android plugin for NetBeans just use ‘adb logcat | grep something’ for your UI to the logs

Thats it. Back to hacking and Android TDD!

Assignment Algorithms to improve Perfomance of Automated Timetabling

Update: Was reblogged at dzone.com

Last two days I’ve read the old Java code of a board game. Although the game still compiles and works (it even works on a Zaurus device) the code itself is horrible: no unit tests, thread issues, a lot of static usages and mixed responsibilities. But then I ‘rediscovered’ my old TimeFinder project, which is a lot better – at least it has several unit tests! Then I saw that I even wanted to publish a paper regarding a nice finding I made 3 years ago. But I never published it as the results were too disappointing to me at that time.

Nevertheless I think the idea is still worth to be spread around (you are free to avoid usage ;)).

My Draft of a paper

So here is my “paper” ‘Optimizing Educational Schedules Using Hungarian Algorithm and Iterated Local Search‘ with the following main result:

The algorithm is bad when it comes to softconstraint optimizations (it wasn’t designed for this though) but it really shines to reduce hard constraints of timetabling problems. E.g. it was an order of magnitude faster than the 5th place (Mueller with UniTime) of the International Timetabling Competition 2007. Of course, I didn’t compare the latest version of UniTtime with TimeFinder yet.

Let me know what you think!

Now I would like to talk a bit about how one could in general use assigment algorithms in timetabling.

Using Assignment Algorithms in Timetabling

In this paper a technic is shown how assignment algorithms could be applied in timetabling. Only a few papers are known to us which uses assignment algorithms based on graph theory [Kin05, Bah04, MD77, TM02].
An assignment algorithm can be used as a subprocedure of a heuristic as it is done in the open source timetabling application called TimeFinder [Kar08]. TimeFinder uses an Hungarian algorithm only to assign a location to an event, where the solution itself is improved with a heuristic based on iterated local search [SLM03], we called this heuristic NoCollisionPrinciple.
Another possibility is to use an assignment algorithms as the ’backbone’ of a heuristic as it is done in [Bah04], where the Hungarian Algorithm is modified to act as a timetabling optimization algorithm.A third approach is to use an assignment algorithm as a pre-solver of another algorithm. This is planed for the TimeFinder project: before using the constraint based algorithm of UniTime an assignment algorithm will be used to give a good and fast starting point where all hard constraints are valid.

In all cases the good performance and optimal results of a mathmatically founded technic are the reason they are choosen and promise a good performing algorithm.

The heuristic tries to add and remove events to specific timeslots very often. So, the check if there is a room available for event E1 in a timeslot has to be very fast. But if there are already a lot of rooms and events in this timeslot it is difficult to test with a simple method if a valid room for E1 exists. Here the assignment algorithms came to my rescue. Other algorithms like the auction algorithm could solve the assignment problem, too, but are not discussed here.

An example for one cost matrix, where we want to calculate the minimal weighted matching (“best assignment”) later, could be the following:

Rooms \ Events | E1    E2
------------------------
R1             | 12    10
R2             | 16    3

The entries of the matrix are the difference of the seats to the visitors. E.g. (E2,R2)=3 means there will be 3 free seats which is better than (E2,R1)=10 free seats.

To optimize room-to-event-assignments it is necessary that the assigned room has nearly the same or slightly more seats as event E1 has visitors. The difference shouldn’t be too large, so that no seats will be ‘wasted’ (and an event with more visitors could be assigned later)

Of course, it could be that an entry is not valid (e.g. too many visitors) then the value in the matrix is +infinity.

The best solution (minimal sum) here would be E1,R1 and E2, R2 with a total sum of 15.

Some informations about the words in graph theory: this cost matrix is equivalent to a weighted, undirected, bi-partitioned, complete graph G. Weighted means with numbers instead of booleans; undirected means the same value for (Ex,Ry) and (Ry,Ex); bi-partitioned means the graph can be divided in two sets A and B where “A and B = empty” and “A or B = G”; complete means the cost matrix has the same number of rows and columns, you can always force this with +infinity values.

Assignment Algorithms

The easiest way to solve an assignment problem correctly is a brute force algorithm, which is very slow for big n: O(n!).
(Okay, it is even slow for small n=10: 10!=3628800)
Where n is in my case the number of rooms available in one timeslot. But this algorithm is always correct, because it tries all combinations of assignments and picks the the one with the minimal total sum. I used this algorithm to check the correctness of other assignment algorithms I implemented later.

Now it is great to know that there is another very fast and correct algorithm which is called Kuhn-Munkres [1] or Hungarian algorithm. The running time is O(n³) (10³=1000) and the ideas used in this algorithm are based on graph theory. Another optimization which one could use only in special cases is the incremental assignment algorithm with O(n²) (E.g. for “add one assignment” or “remove one assignment”)

And there are even faster algorithms such as the algorithm from Edmons and Carp [2]. For bi-partite graphs it runs in O(m n log n).

An interesting approach are the approximation assignment algorithms which can be a lot faster: O(n²) with a small constant. But the resulting assignments are not the best in every case.
A well known and easy to implemented algorithm the path growing algorithm from Drake and Hougardy [3] which works as follows:

currentMatching = matching0 = empty
matching1 = empty
while edges are not empty
   choose a node x from all nodes which has at least one neighbor
   while x has a neighbor
       find the heaviest edge e={x,y} incident to x
       add it to the currentMatching
       if(currentMatching == matching1) currentMatching = matching0
       else currentMatching = matching1
       remove x from the graph
       x:=y
   end
end

The good thing of this algorithm is that you can say sth. about the resulting assignment: the total sum is maximal twice times higher then the sum gained with an optimal algorithm. There are even better guarantees e.g. the one of Pettie and Sanders (~ maximal 3/2 times higher):

matching is empty
repeat k times
  choose a node x from all nodes at random
  matching := matching (+) aug(x)
end
return matching

Resources

You can grab the source of TimeFinder (Apache2 licensed) via svn:

svn checkout https://timefinder.svn.sourceforge.net/svnroot/timefinder/trunk timefinder

Then look in the package

timefinder-core/src/main/java/de/timefinder/core/algo/

To build it you have to use maven.

And finally here is the paper where the idea of a maximum network flow was already presented:
John van den Broek, Cor A. J. Hurkens, and Gerhard J. Woeginger. Timetabling problems at the TU Eindhoven. Electronic Notes in Discrete Mathematics, 25:27–28, 2006.

References

[1] Kuhn Munkres, “Algorithms for the Assignment and Transportation Problems”
[2] Edmonds and Karp, Theoretical improvements in algorithmic efficiency for network flow problems“.
[3] Drake and Hougardy, A Simple Approximation Algorithm for the Weighted Matching Problem, 2002
[4] Pettie and Sanders, A Simpler Linear Time 2/3 – epsilon Approximation for Maximum Weight Matching, Inf. Process. Lett. Vol 91, No. 6, 2004

Bird’s Eye View on ElasticSearch its Query DSL

I’ve copied the whole post into a gist so that you can simply clone, copy and paste the important stuff and even could contribute easily.

Several times per month there pop up questions regarding the query structure on the ElasticSearch user group.

Although there are good docs explaining this in depth probably the bird view of the Query DSL is necessary to understand what is written there. There is even already some good external documentation available. And there were attempts to define a schema but nevertheless I’ll add my 2 cents here. I assume you set up your ElasticSearch instance correctly and on the local machine filled with exactly those 3 articles.

Now we can query ElasticSearch as it is done there. Keep in mind to use the keyword analyzer for tags!

curl -X POST “http://localhost:9200/articles/_search?pretty=true&#8221; -d ‘
{“query” : { “query_string” : {“query” : “T*”} },
“facets” : {
“tags” : { “terms” : {“field” : “tags”} }
}}’

But when you now look into the query DSL docs you’ll only find the query part

{“query_string” : {
“default_field” : “content”,
“query” : “this AND that OR thus”
}}

And this query part can be replaced by your favourite query. Be it a filtered, term, a boolean or whatever query.

So what is the main structure of a query? Roughly it is:

curl -X POST “http://localhost:9200/articles/_search?pretty=true&#8221; -d ‘
{“from”: 0,
“size”: 10,
“query” : QUERY_JSON,
FILTER_JSON,
FACET_JSON,
SORT_JSON
}’

Keep in mind that the FILTER_JSON only applies to the query not to the facets. Read on how to do this. And now a short example how this nicely maps to the Java API:

SearchRequestBuilder srb = client.prepareSearch(“your_index”);
srb.setQuery(QueryBuilders.queryString(“title:test”));
srb.addSort(“tags”, SortOrder.ASC);
srb.addFacet(FacetBuilders.termsFacet(“tags”));
// etc -> use your IDE autocompletion function 😉

If you install my hack for ElasticSearch Head you can formulate the above query separation directly in your browser/in javascript. E.g.:

q ={ match_all:{} };
req = { query:q }

A more detailed query structure is as follows – you could easily obtain it via Java API, from the navigational elements from the official docs or directly from the source:

curl -X POST “http://localhost:9200/articles/_search?pretty=true&#8221; -d ‘
{“query” : QUERY_JSON,
“filter” : FILTER_JSON,
“from”: 0,
“size”: 10,
“sort” : SORT_ARRAY,
“highlight” : HIGHLIGHT_JSON,
“fields” : [“tags”, “title”],
“script_fields”: SCRIPT_FIELDS_JSON,
“preference”: “_local”,
“facets” : FACET_JSON,
“search_type”: “query_then_fetch”,
“timeout”: -1,
“version”: true,
“explain”: true,
“min_score”: 0.5,
“partial_fields”: PARTIAL_FIELDS_JSON,
“stats” : [“group1”, “group2”]
}’

Let us dig into a simple query with some filters and facets:
curl -XGET ‘http://localhost:9200/articles/_search?pretty=true&#8217; -d ‘
{“query”: {
“filtered” : {
“query” : { “match_all” : {} },
“filter” : {“term” : { “tags” : “bar” }}
}},
“facets” : {
“tags” : { “terms” : {“field” : “tags”} }
}}’

You should get 2 out of the 3 articles and the filter directly applies on the facets as well. If you don’t want that then put the filter part under the query:

curl -XGET ‘http://localhost:9200/articles/_search?pretty=true&#8217; -d ‘
{“query” : { “match_all” : {} },
“filter” : {“term” : { “tags” : “bar” }},
“facets” : {
“tags” : { “terms” : {“field” : “tags”} }
}}’

And how can I only filter on the facets? You’ll need facet_filter:

curl -XGET ‘http://localhost:9200/articles/_search?pretty=true&#8217; -d ‘
{“query” : { “match_all” : {} },
“facets” : {
“mytags” : {
“terms” : {“field” : “tags”},
“facet_filter” : {“term” : { “tags” : “bar”}}
}
}}’

You’ll get 3 documents with filtered facets.

Hope this posts clarifies a bit and reduces your trouble. I’ll update the post according to your comments/suggestions. Let me know if you want something explained which is Query-DSL specific for all the other questions there is the user group.

Java guy reboots with C++ – Trying to understand Memory Mangement

Some random hints if you get started (in my case restarted) with C++ – let me know if I wrote something in wrong in my personal memos:

  1. Use RAII: Resource Acquisition Is Initialization
  2. Use scoped variables which gets destroyed automatically after leaving the block:
    MyClass myVar("Hello Memory");
  3. To use call by reference use the method declaration. But it might be that you are optimizing it when not necessary. Read about Want Speed? Pass by Value.
    void myMethod(MyClass& something)
  4. Understand The Rule of three: If you need to explicitly declare either the destructor, copy constructor or copy assignment operator yourself, you probably need to explicitly declare all three of them. More C++ Idioms
  5. Understand and avoid traps, understand shallow references.
  6. Understandheap vs. stack allocation. E.g. also that heap allocation times are much slower than allocations off the stack.
  7. Use heap allocation when you dynamically want to change memory usage of an object
  8. … or prefer stl (e.g. vector)
  9. … or when an object should be used outside of a method scope:All dynamically allocated memory must be released before the pointer (except smart pointers) pointing to it goes out of scope. So, if the memory is dynamically allocated for a variable within a function, the memory should be released within the function unless a pointer to it is returned or stored by that function.
  10. The = operator in auto_ptr works in a different to normal way!
  11. Read the FAQ or this light-FAQ
  12. oh my: auto_ptr, shared_ptr, smart_ptr, …!!?
  13. Here is a nice compilation of common possible pitfalls.

… but I’m still fighting a bit to understand the memory management problematic:

1. Are there some rules of thumb?

E.g.

  1. use RAI
  2. if you cannot apply rule 1. use shared_ptr
  3. if you cannot apply rule 1. use new + delete?

2. And how can I solve the problem in C++ when I return a local constructed object (via new) in Java?

E.g. the factory pattern there can look like

public static MyClass createObj() {
  return new MyClass()
}

3. And how would you e.g. put a vector with a lot of data into a different variable?

Can I rely on the ‘Pass by Value‘ thing which boost performance? Should I use tmpVectorObj.swap(vectorObj2) ?

4. How would you fill a vector within a method?

This is forbidden I think:

// declare vector<Node> vectorObj in the class
void addSomething(string tmp) {
  Node n(tmp);
  vectorObj.add(n);
}

5. What are the disadvantages of boosts smart pointers?

Ugly but most efficient setSize for ArrayList


private static final Field sizeField;

static {
    try {
        sizeField = ArrayList.class.getDeclaredField("size");
        sizeField.setAccessible(true);
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}

// 'setSize'
public static <T> void growSize(final ArrayList<T> list, final int maxSize) {
    if (maxSize <= list.size())
        return;

    list.ensureCapacity(maxSize);
    try {
        sizeField.setInt(list, maxSize);
    } catch (Exception ex) {
        throw new RuntimeException("Problem while setting private size field of ArrayList", ex);
    }
}

Here is the reported issue and here a discussion. When you need decreasing the size too then have a look into Vector.setSize

A less hacky but also less efficient version would be:

public static <T> void growSizeSlower(final ArrayList<T> list, final int maxSize) {
    if (maxSize <= list.size())
        return;

    list.addAll(new AbstractList<T>() {

        @Override public Object[] toArray() {
            return new Object[maxSize - list.size()];
        }

        @Override public T get(int index) {
            throw new UnsupportedOperationException("Not supported yet.");
        }

        @Override public int size() {
            throw new UnsupportedOperationException("Not supported yet.");
        }
    });
}

Jetslide uses ElasticSearch as Database

GraphHopper – A Java routing engine

karussell ads

This post explains how one could use the search server ElasticSearch as a database. I’m using ElasticSearch as my only data storage system, because for Jetslide I want to avoid maintenance and development time overhead, which would be required when using a separate system. Be it NoSQL, object or pure SQL DBs.

ElasticSearch is a really powerfull search server based on Apache Lucene. So why can you use ElasticSearch as a single point of truth (SPOT)? Let us begin and go through all – or at least my – requirements of a data storage system! Did I forget something? Add a comment 🙂 !

CRUD & Search

You can create, read (see also realtime get), update and delete documents of different types. And of course you can perform full text search!

Multi tenancy

Multiple indices are very easy to create and to delete. This can be used to support several clients or simply to put different types into different indices like one would do when creating multiple tables for every type/class.

Sharding and Replication

Sharding and replication is just a matter of numbers when creating the index:

curl -XPUT 'http://localhost:9200/twitter/' -d '
index :
    number_of_shards : 3
    number_of_replicas : 2'

You can even update the number of replicas afterwards ‘on the fly’. To update the number of shards of one index you have to reindex (see the reindexing section below).

Distributed & Cloud

ElasticSearch can be distributed over a lot of machines. You can dynamically add and remove nodes (video). Additionally read this blog post for information about using ElasticSearch in ‘the cloud’.

Fault tolerant & Reliability

ElasticSearch will recover from the last snapshot of its gateway if something ‘bad’ happens like an index corruption or even a total cluster fallout – think time machine for search. Watch this video from Berlin Buzz Words (minute 26) to understand how the ‘reliable and asyncronous nature’ are combined in ElasticSearch.

Nevertheless I still recommend to do a backup from time to time to a different system (or at least different hard disc), e.g. in case you hit ElasticSearch or Lucene bugs or at least to make it really secure 🙂

Realtime Get

When using Lucene you have a real time latency. Which basically means that if you store a document into the index you’ll have to wait a bit until it appears when you search afterwards. Altought this latency is quite small: only a few milliseconds it is there and gets bigger if the index gets bigger. But ElasticSearch implements a realtime get feature in its latest version, which makes it now possible to retrieve the object even if it is not searchable by its id!

Refresh, Commit and Versioning

As I said you have a realtime latency when creating or updating (aka indexing) a document. To update a document you can use the realtime get, merge it and put it back in the index. Another approach which avoids further hits on ElasticSearch, would be to call refresh (or commit in Solr) of the index. But this is very problematic (e.g. slow) when the index is not tiny.

The good news is that you can again solve this problem with a feature from ElasticSearch – it is called versioning. This an identical to the ‘application site’ optimistical locking in the database world. Put the document in the index and if it fails e.g. merge the old state with the new and try again. To be honest this requires a bit more thinking using a failure-queue or similar, but now I have a really good working system secured with unit tests.

If you think about it, this is a really huge benefit over e.g. Solr. Even if Solrs’ raw indexing is faster (no one really did a good job in comparing indexing performance of Solr vs. ES) it requires a call of commit to make the documents searchable and slows down the whole indexing process a lot when comparing to ElasticSearch where you never really need to call the expensive refresh.

Reindexing

This is not necessary for a normal database. But it is crucial for a search server, e.g. to change an analyzer or the number of shards for an index. Reindexing sounds hard but can be easily implemented even without a separate data storage in ElasticSearch. For Jetslide I’m storing not single fields I’m storing the entire document as JSON in the _source. This is necessary to fetch the documents from the old index and put them into the newly created (with different settings).

But wait. How can I fetch all documents from the old index? Wouldn’t this be bad in terms of performance or memory for big indices? No, you can use the scan search type, which avoids e.g. scoring.

Ok, but how can I replace my old index with the new one? Can this be done ‘on the fly’? Yes, you can simply switch the alias of the index:

curl -XPOST 'http://localhost:9200/_aliases' -d '{
"actions" : [
   { "remove" : { "index" : "userindex6", "alias" : "userindex" } },
   { "add" : { "index" : "userindex7", "alias" : "uindex" } }]
}'

Performance

Well, ElasticSearch is fast. But you’ll have to determine for youself if it is fast enough for your use case and compare it to your existing data storage system.

Feature Rich

ElasticSearch has a lot of features, which you do not find in a normal database. E.g. faceting or the powerful percolator to name only a few.

Conclusion

In this post I explained if and how ElasticSearch can be used as a database replacement. ElasticSearch is very powerfuly but e.g. the versioning feature requires a bit handwork. So working with ElasticSearch is comparable more to the JDBC or SQL world not to the ORM one. But I’m sure there will pop up some ORM tools for ElasticSearch, although I prefer to avoid system complexity and will always use the ‘raw’ ElasticSearch I guess.