What is your plan B after a software career?

This minor stunt is nothing compared to Danny, but hey, it’s great fun! If you want to see the king of trials, click here. If you have something special that is fun or difficult (etc.), please comment or add a video reply.

Sadly, Sony doesn’t want me to promote ‘Electric Feel’ by MGMT on YouTube. In my opinion this is fair use (only a 42-second snippet) of the song in my video … Watch the video via blip.tv.

Algorithm against Twitter Spam

In jetwick we only want to show relevant tweets for a search. No noise, no spam.

So the first problem (noise) is solvable when the user sorts by retweets, filters by specific criteria of his choice, or refines his search by adding more specific terms.

But how can we get rid of spam on Twitter? First of all, what is spam on Twitter?

Several years ago Paul Graham gave a nice definition of email spam: ‘unsolicited and automated’. With this definition we can identify six situations of Twitter spam:

  1. unsolicited tweets which appear in your timeline (e.g. the new ads, or even retweets from your followers could be spam too ;-))
  2. unsolicited tweets in your searches that are not relevant to the search (e.g. spammers simply add hashtags from the trending topics to increase the popularity of their tweets)
  3. unsolicited automated tweets which mention you (and not only you …)
  4. unsolicited direct messages
  5. even fast following and unfollowing can be spam, because most users have notifications enabled
  6. a very cool spamming technique: some spammers add a mini advertisement to a nice statement about you or your product. If you don’t read the tweet carefully or follow the links, this could (mis)lead you into retweeting it and making indirect advertisement for them.

With the following algorithm we can try to solve points 2 and 3.

  1. For a new tweet T, get its user U
  2. For U, fetch (some) additional tweets and store them in a list L
  3. Go through L and compare each tweet’s content with T
  4. Use the Jaccard index for the comparison (additionally compare the URL and title of the linked webpage)
  5. If the Jaccard index is too high or the URLs are identical, decrease T’s quality. Repeat from step 3 if there are still tweets in L, otherwise go to step 6
  6. Mark T as spam if its quality is below a certain limit
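Sketched in JavaScript (Jetwick itself is written in Java), the core of steps 3–6 could look like this. The threshold (0.8) and penalty (30) are my own assumptions, not Jetwick’s actual values:

```javascript
// Term set of a tweet: lowercase, split on non-word characters.
function termSet(text) {
  var set = {}, terms = text.toLowerCase().split(/\W+/);
  for (var i = 0; i < terms.length; i++)
    if (terms[i]) set[terms[i]] = true;
  return set;
}

// Jaccard index: |A intersect B| / |A union B|
function jaccard(textA, textB) {
  var a = termSet(textA), b = termSet(textB);
  var inter = 0, union = 0, t;
  for (t in a) { union++; if (b[t]) inter++; }
  for (t in b) if (!a[t]) union++;
  return union === 0 ? 0 : inter / union;
}

// Decrease the quality of tweet T for every near-duplicate among the
// user's other tweets; mark it as spam below a certain limit.
function isSpam(tweet, otherTweetsOfUser) {
  var quality = 100;
  for (var i = 0; i < otherTweetsOfUser.length; i++)
    if (jaccard(tweet, otherTweetsOfUser[i]) > 0.8)
      quality -= 30;
  return quality < 50;
}
```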

I applied this algorithm to the data of Jetwick and grabbed the Twitter users with a lot of spammy tweets (the number in brackets) for the last week:

careerfan (968) -> bot
teamnapalm (587) -> spam
endy_pink (481) -> spam
manypro (312) -> bot
i_want_napalm (294) -> spam
gutlazaro (216) -> spam (suspended by Twitter)
appstoreadam (210) -> bot+spam
livralivro (207) -> bot
sigaajesus (195) -> spammy
thakiddunncase (195) -> oops, no spam
lauberte_ (167) -> spam
2dvdlsnorjeuvou (158) -> spam
josialemrossi (152) -> spam (suspended by Twitter)
malucomunic (139) -> spam (Was this account hacked? Because none of its followers is a spammer!)

The idea is simple, but the results look promising. There could be a lot of use cases; e.g. Twitter clients like HootSuite could add tweet quality to their available filters … The user-specific Klout score is not useful here, because even less popular tweeters can create great tweets 🙂

Let me know what you think!

jsii – full text search in 1K LOC of JavaScript!

In the previous blog post I tried to introduce node.js and its nice features. Today I will introduce my little search engine prototype called jsii (JavaScript inverted index).

jsii provides an in-memory inverted index within approx. 1000 lines of JavaScript. Some more lines are necessary to set up a server via node.js, so that the index is queryable via HTTP and returns Solr-compatible JSON or XML. The sources are available on GitHub:

git clone git@github.com:karussell/jsii.git

Try it out here: http://pannous.info:8124/select?q=google – e.g. filter queries work like id:xy, and queries with sorting work like &sort=id asc. The parameters start and rows can be used for paging. For those who come too late (e.g. my server crashed or sth. ;-)), here is an image of the XML response:

jsii-response

Solr XML Response
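For illustration, here is a tiny hypothetical helper that assembles such select URLs from the parameters mentioned above (it is not part of jsii itself):

```javascript
// Build a jsii/Solr style select URL from a parameter object (sketch).
function selectUrl(host, params) {
  var parts = [];
  for (var key in params)
    parts.push(key + '=' + encodeURIComponent(params[key]));
  return host + '/select?' + parts.join('&');
}
```

E.g. `selectUrl('http://localhost:8124', {q: 'google', sort: 'id asc', start: 0, rows: 10})` produces a sorted, paged query like the ones above.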

The Solr-compatible XML response format makes it possible to use jsii from applications that use SolrJ. For example, I tried it for Jetwick and the basic search worked – just specify the XML response parser:

solrj.setParser(new XMLResponseParser());

His-story

The first thing I needed was a BitSet analog in JavaScript to perform term look-ups fast and combine them via the AND bit-operation. Therefore I took the classes and tests from a GWT patch and made them work with my Jasmine specs.
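A minimal version of such a BitSet could look like this sketch (the real class, ported from the GWT patch, has many more operations):

```javascript
// Minimal BitSet sketch with 32-bit words - just enough to AND
// two posting lists.
function BitSet(nbits) {
  this.words = [];
  for (var i = 0; i < Math.ceil(nbits / 32); i++)
    this.words.push(0);
}
// set bit i: word index is i / 32, bit position is i % 32
BitSet.prototype.set = function (i) {
  this.words[i >>> 5] |= 1 << (i & 31);
};
BitSet.prototype.get = function (i) {
  return (this.words[i >>> 5] & (1 << (i & 31))) !== 0;
};
// in-place AND: keep only the bits present in both sets
BitSet.prototype.and = function (other) {
  for (var i = 0; i < this.words.length; i++)
    this.words[i] &= (other.words[i] || 0);
};
```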

While trying to understand the basics of a scoring function I stumbled upon the Lucene docs and this thread, which mentions ‘Section 6 of a book‘ as a good reference on that subject.

My understanding of the basics is now the following:

  • The term frequency (tf) weights documents differently. E.g. document1 contains ‘java’ 10 times, but doc2 has it 20 times, so doc2 is more important for the query ‘java’. If you index tweets you should do tf = min(tf, 3). Otherwise you will often get tweets à la ‘java java java java java java …’ instead of important ones. So for tweets a higher entropy is also relevant
  • The inverse document frequency (idf) gives certain terms a higher (or lower) weight. So, if a term occurs in all documents, its idf should be low, making that query term less important compared to terms for which fewer documents were found
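Put together, these two basics give a weight per term and document. A minimal sketch with the tf cap for tweets mentioned above (the exact formula, e.g. the log base and the +1, is my simplification and differs from Lucene’s real scoring):

```javascript
// tf-idf weight of one term in one document (simplified sketch).
function tfIdf(termCountInDoc, docsWithTerm, totalDocs) {
  // cap tf so 'java java java ...' tweets don't dominate
  var tf = Math.min(termCountInDoc, 3);
  // rare terms get a higher weight, terms in (almost) all docs near zero
  var idf = Math.log(totalDocs / (1 + docsWithTerm));
  return tf * idf;
}
```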

With jsii you can grab docs from a Solr index or feed it via the JavaScript API. jsii is very basic and simple, but it seems to work reasonably fast. I get fair response times of under 50 ms with ~20k tweets, although I didn’t invest time to improve performance. There are some bugs, and yes, jsii is a memory hog, but besides this it is amazing what can be done with a ‘script’ language. BTW: at the moment jsii is a 100% real-time search engine, because it does not support transactions or warming up 😉

Hints

Full text search in 100% JavaScript – The future of JavaScript is bright.

“Everybody (incl. me) is laughing at you, JavaScript” – Me (before the year ~2003)

“It’s time to laugh back!” – JavaScript

See the next post about the implementation – jsii

Wouldn’t it be amazing to do all your programming tasks in one programming language? No need to switch? No need to relearn?

I dreamed that dream with Java. Sadly, the Java plugin is not the future (not user friendly, not search engine friendly, …), and ordinary web development can be done with pure Java solutions such as Vaadin/GWT or Wicket, but at some point you will need to know JavaScript.

With node.js – although it is not the first server-side JS solution – it is possible to do all your tasks in JavaScript! Plus a bit of CSS and HTML knowledge, of course. But can we save files or query databases with pure JavaScript? node.js has the goal of providing an implementation of such a server-side API, and it designed this API to be non-blocking. Another interesting feature is web sockets: node.js makes it possible to communicate directly from server to client (and back) with pure JavaScript, so you can send JS or JSON snippets back and forth. Check out one amazing example of this. Behind the scenes of node.js, the V8 engine acts as the JavaScript virtual machine and makes it all amazingly fast.

What has all this to do with full text search?

To better understand things I need to code them. So when I wanted to get a better understanding of an inverted index, I had to code it. But implementing it in Java would have been boring, because there already is the near-perfect Lucene. So I chose JavaScript: I wanted to get a better understanding of this language and I wanted to try node.js.
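The core data structure is small enough to sketch here: a map from term to posting list, with AND-combined query terms. Names and details are mine, not jsii’s actual API:

```javascript
// Minimal inverted index sketch: term -> list of doc ids.
function InvertedIndex() {
  this.index = {};
  this.docs = {};
}

InvertedIndex.prototype.add = function (id, text) {
  this.docs[id] = text;
  var terms = text.toLowerCase().split(/\W+/);
  for (var i = 0; i < terms.length; i++) {
    var t = terms[i];
    if (!t) continue;
    if (!this.index[t]) this.index[t] = [];
    if (this.index[t].indexOf(id) < 0) this.index[t].push(id);
  }
};

// AND-combine the posting lists of all query terms.
InvertedIndex.prototype.search = function (query) {
  var terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  var result = this.index[terms[0]] || [];
  for (var i = 1; i < terms.length; i++) {
    var postings = this.index[terms[i]] || [];
    result = result.filter(function (id) {
      return postings.indexOf(id) >= 0;
    });
  }
  return result;
};
```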

In my upcoming blog post I will write about jsii – an inverted index implementation in JavaScript (Apache 2 licensed). The index is not limited to node.js – you can use it in the browser. But more interestingly, it can be queried via HTTP and even SolrJ – in that case node.js hosts the inverted index.

But why is a search engine in JavaScript useful? First of all, it is a prototype and I learned a lot. Second, you can check it out, learn something more and think about other possibilities/ideas.

Third, I can imagine the following scenario – reducing the need for “server-side” architectures. You could call it “ad-hoc peer-to-peer networks over HTTP”. Imagine a user who visits your website and is willing to give you a bit of browser RAM and CPU for some minutes or seconds. Then you can push some minor parts of your data (in our case, a search index) to the user’s browser and include it in your network. This architecture is extremely difficult to manage: you will have to avoid users thinking your site is compromised by malware. But this network will work better/faster/more reliably the more users visit your site!

A similar concept is used in electric power transmission. Centralized power stations produce the predictable ‘source of energy’. In the future, decentralized power stations such as wind turbines, solar plants, etc. will pop up for hours or even minutes and contribute their energy to the worldwide energy network.

We will see what the future will bring for the IT sector. But there is no doubt: the future of JavaScript is bright.

Feeding Solr with its own Logs

I always looked for a simple way to visualize our log data, e.g. from Solr. At that time I had in mind a combination of gnuplot and some shell scripts, but this session from Lucene Revolution changed my idea. (Look here for all videos from Lucene Revolution.)

I thought: “Hey, that’s it! Just put the logs into Solr!” So I coded something that simply reads the log files, and named it Sogger. Without sharding, without message queues, … but it should work on real systems without any changes to your system (though probably with changes to Sogger).

I hope Sogger doesn’t suck, but it does not come with any warranty, so use it with care! And: it is only a proof of concept – nothing comparable to the guys from loggly.com.

To get your logs sogged:

  • Download the ‘Sogger’ code via:
    hg clone http://timefinder.hg.sourceforge.net/hgroot/timefinder/sogger sogger-code
    
  • Download the Solr from trunk.
    svn co -r  1023329 https://svn.apache.org/repos/asf/lucene/dev/trunk solr-code
    

    Sogger doesn’t necessarily need the trunk version, but I haven’t tested it with others yet

  • compile solr and Sogger with ant
  • cd solr-code/solr/example/
  • copy solrconfig.xml, schema.xml from Sogger into solr/conf
  • copy the *.vm files from Sogger into the files at solr/conf/velocity/
  • start solr
    java -jar start.jar
  • start feeding your logs
    cd sogger-code/
    java -jar dist/Sogger.jar url=http://localhost:8983/solr logFile=data/solr.2010-10-25.log.gz
    
  • to search your logs do:
    http://localhost:8983/solr/browse?q=twitter

Now you should see something like this:

Sogger has several advantages over simple “grep-ing” or scripting with your solr logs:

  • full text search. near real time: ~1 min 😉
  • performance: I hope committing every minute does not make Solr a lot slower
  • filtering by log level: quickly find warnings and exceptions
  • filtering by webapp: if you have multiple apps or Solr cores logging into the same file, filtering is really easy with Solr (with grep too, but you’ll have to re-grep the whole log …)
  • open source: you can change the feeding method I used and adapt it to your special needs. Tell me if you need assistance!
  • new log lines will be detected and committed, à la tail -f
  • besides text files, Sogger accepts and detects compressed (zip, gzip/gz) files, à la zgrep. So you don’t need to change your log handlers or preprocess the files.

To-dos:

  • make the log format customizable within a property file:
    line1=regular expression pattern1
    line2=regular expression pattern2
  • read and monitor multiple log files
  • make it a Solr plugin via a special UpdateHandler?
  • an xy plot (or bar chart) in Velocity for some facets or facet queries would be nice. Something like what I did before with Wicket.
  • I don’t like Velocity … although it is sufficient for this … but should we use Wicket!?
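The first to-do could start from something like the following sketch. The patterns assume Solr’s default two-line java.util.logging output and would need adjusting for other formats:

```javascript
// Parse Solr's default two-line java.util.logging format (assumed):
//   line 1: date, class, method
//   line 2: LEVEL: message
var line1 = /^(\w{3} \d{1,2}, \d{4} [\d:]+ [AP]M) (\S+) (\S+)$/;
var line2 = /^(SEVERE|WARNING|INFO|CONFIG|FINE|FINER|FINEST): (.*)$/;

function parseEntry(first, second) {
  var m1 = line1.exec(first), m2 = line2.exec(second);
  if (!m1 || !m2)
    return null; // not a recognized log entry
  return {date: m1[1], clazz: m1[2], method: m1[3],
          level: m2[1], message: m2[2]};
}
```

A property file would then simply swap out line1/line2 for other log formats.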

Jetwick Layout Update

In the next days I will release a slightly changed UI for jetwick.com. It is nice that only CSS+HTML changes were necessary for this (I am using Wicket, BTW). A nice consequence is that a lot of “float: left;” and “width: xy;” declarations are now needless.

This will be the three-column style:

Before, it was a mix with a lot of white areas:

Fun and some important Dev-Tweets of the last week, 11th October

Let us start with the fun tweets. OK, this week there were a lot of Java-bashing tweets, but I like them!

  • maven 3 is out. It now lets you download the internet even faster than before.

  • The world needs to stop hyping “html5” as though it’s markup alone that builds rich web apps. It makes JavaScript angry.

  • “JavaScript is the only language that people feel they dont need to learn before they start using it.” – Crockford

  • Little known fact: JavaScript also has an isNaaN() function for when you aren’t sure if you’re working with Indian food

  • I have seen an app with SQL code in the *views*, looked like a java coder was given a php book and told to make a rails app.

  • Matz on #ruby speed: Build your website in Ruby until you have more traffic than Twitter, then use your riches to hire Java programmers.

  • OH: “Java is just a DSL for turning XML into core dumps.”

  • judging Clojure/Lisp by its parens is like judging Java by its classpath



And last but not least, some interesting infos:



Of course this list isn’t complete! So, watch out for more fun and infos on Twitter, and contact me or comment if you want to add something here or for the next week.