Search any Twitter Account

Many services let you search a Twitter account, and some even offer archiving, but they often require registration.

Now with Jetwick it is easy to search any account you like. E.g. try my account. To do the same for your own account, go to Jetwick, click ‘login’, allow Jetwick access to your account (you can revoke it at any time, and we won’t misuse it or post tweets on your behalf) and then click “grab tweets”. Then you will see something like this:

After this step you can easily search the whole history of the tweets.

Why would you want to grab tweets from other users? Because then you can easily see, on the right side, which topics a user tweets about. Again, see “Words related to your query” for my account:

PS: Jetwick is now free software … you can host your own and play around!

Algorithm against Twitter Spam

In jetwick we only want to show relevant tweets for a search. No noise, no spam.

The first problem, noise, is solvable when the user sorts by retweets, filters by a criterion of their choice, or refines the search by adding more specific terms.

But how can we get rid of spam on Twitter? First, what is spam on Twitter?

Several years ago Paul Graham gave a nice definition of email spam: ‘unsolicited and automated’. With this definition we can identify 6 situations for Twitter spam:

  1. unsolicited tweets which appear in your timeline (e.g. the new ads, or even retweets by people you follow could be spam too ;-))
  2. unsolicited tweets in your searches which are not relevant to the search (e.g. spammers simply add hashtags from the trending topics to increase the popularity of their tweets)
  3. unsolicited automated tweets which mention you (and not only you …)
  4. unsolicited direct messages
  5. even fast following and unfollowing can be spam, because most users have notifications enabled
  6. a very cool spamming technique: some spammers add a mini advertisement to a nice statement about you or your product. If you don’t read the tweet carefully or follow the links, this could (mis)lead you to retweet it and so indirectly advertise for them.

With the following algorithm we can try to solve points 2 and 3.

  1. For a new tweet T get the user U.
  2. For U get (some) additional tweets and store them in a list L.
  3. Go through L and compare the content of each tweet with T.
  4. Use the Jaccard index for the comparison (additionally compare the URL and title of the linked webpage).
  5. If the Jaccard index is too high or the URLs are identical, decrease the quality of T. Repeat from step 3 if there are still tweets in L; otherwise go to step 6.
  6. Mark T as spam if its quality is below a certain quality limit.
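
The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code Jetwick actually runs: the function names, the tokenizer, the starting quality of 100, the penalty of 10, the similarity limit of 0.7 and the quality limit of 70 are all assumptions chosen for the example, and the comparison of linked page titles is left out.

```python
import re

# Assumed patterns for this sketch; a real tokenizer would be more careful.
URL_PATTERN = re.compile(r"https?://\S+")
TOKEN_PATTERN = re.compile(r"[#@]?\w+")


def extract_urls(text):
    """Return the set of URLs found in a tweet."""
    return set(URL_PATTERN.findall(text))


def tokenize(text):
    """Lowercase the tweet, drop URLs and return the set of remaining terms."""
    without_urls = URL_PATTERN.sub(" ", text.lower())
    return set(TOKEN_PATTERN.findall(without_urls))


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tweet_quality(tweet, other_tweets, similarity_limit=0.7,
                  penalty=10, start_quality=100):
    """Steps 3 to 5: lower the quality of `tweet` for every near duplicate.

    `other_tweets` is the list L of additional tweets from the same user.
    All numeric defaults are hypothetical values for illustration.
    """
    quality = start_quality
    terms = tokenize(tweet)
    urls = extract_urls(tweet)
    for other in other_tweets:
        too_similar = jaccard(terms, tokenize(other)) >= similarity_limit
        same_url = bool(urls & extract_urls(other))
        if too_similar or same_url:
            quality -= penalty
    return max(quality, 0)


def is_spam(tweet, other_tweets, quality_limit=70):
    """Step 6: mark the tweet as spam if its quality is below the limit."""
    return tweet_quality(tweet, other_tweets) < quality_limit
```

Calling is_spam(new_tweet, tweets_of_same_user) returns True once enough of the user's other tweets repeat the same words or the same link. Because the penalty is applied per duplicate, a single repeated link does not yet push a tweet below the limit.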

I applied this algorithm to the Jetwick data and collected the Twitter users with a lot of spammy tweets during the last week (the number of such tweets is in brackets):

careerfan (968) -> bot
teamnapalm (587) -> spam
endy_pink (481) -> spam
manypro (312) -> bot
i_want_napalm (294) -> spam
gutlazaro (216) -> spam (canceled from twitter)
appstoreadam (210) -> bot+spam
livralivro (207) -> bot
sigaajesus (195) -> spammy
thakiddunncase (195) -> oops, not spam
lauberte_ (167) -> spam
2dvdlsnorjeuvou (158) -> spam
josialemrossi (152) -> spam (canceled from twitter)
malucomunic (139) -> spam (Was this account hacked!!?? Because none of the followers is a spammer!)

The idea is simple, but the results look promising. There could be a lot of use cases. E.g. Twitter clients like HootSuite could add tweet quality to their available filters … the user-specific Klout score is not useful here, because even less popular users can write great tweets 🙂

Let me know what you think!

Twitter Search Jetwick – powered by Wicket and Solr

How different is a quickstart project from production?

Today we released Jetwick. With Jetwick I wanted to build a service that finds similar Twitter users based on the content they tweet, not on their following list as is possible on other platforms:

Not only is the find-similar feature nice; the topics (in gray, on the right side of the user name) also give a good impression of what a user tweets about. The first usable prototype was ready within one week! I used Lucene, Vaadin and db4o. But I needed facets, so I switched from Lucene to Solr. The transformation took only ~2 hours. Really! Test-based programming rocks 😉 !

Then users told me that Jetwick is slow on ‘old’ machines. It took me some time to understand that Vaadin uses JavaScript a lot and that inappropriate usage of layouts can affect performance negatively in some browsers. So I had the choice to stay with Vaadin and improve the performance (with different layouts) or switch to another web UI framework. I switched to Wicket (twitter noise). It is amazingly fast. This transformation took some more time: 2 days. After this I was convinced by the performance of the UI. The programming model is quite similar (‘Swing-like’), although Vaadin is easier and therefore faster to implement with. While working on this I could also improve the tweet collector, which searches Twitter for information and stores the results in Jetwick.

After this, something went wrong with the database: it was very slow for more than 1 million users. I spent at least one week tuning db4o (the file was >1 GB). Performance improved, but it wouldn’t have been sufficient for production. Then I switched to Hibernate (yesql!). This switch cost me another two weeks and several frustrating nights. Db4o is so great! OK, now that I know Hibernate better I can say: Hibernate is great too, and I think its most important feature (== disadvantage!) is that you can tweak it nearly everywhere: e.g. you can say that you only want to count the results, or that you want to fetch some relationships eagerly and others lazily, and so on. Db4o wasn’t that flexible. But Hibernate has another drawback: you have to upgrade the database schema yourself, or you do it like me and use Liquibase, which works perfectly in my case after some tweaking!

Now that we had the search, it turned out that this user search was quite useful for me, as I wanted to find some users to follow. But the alpha testers didn’t get the point of it. And then, the shock at the end of July: Twitter released a find-similar feature for users! Damn! Why couldn’t they wait two months? It is so important to have a motivation … 😦 And some users seem to really like those user suggestions. OK, some users felt disgusted when they noticed this new feature. But I like it!

BTW: I’m relatively sure that the user suggestions are based on the same ‘more like this’ feature (from Lucene) that I was using, because for my account I got nearly the same users suggested, and somewhere in a comment I read that Twitter uses Solr for the user search. Others seem to have got a shock too 😉

After the first shock I decided to switch again: from user search to a regular tweet search where you can get more information out of the tweets. You can see at a glance which topics a user tweets about, or search for your original URL. Jetwick tries to store expanded URLs where possible. It is also possible to apply topic, date and language filters. One nice consequence of a tweet-based index is that I can search through all my tweets for something I forgot:

Or you could take a look at all those funny google* accounts.

So, finally. What have I learned?

From a quick-start project to production many if not all things can change: Tools, layout and even the main features … and we’ll see what comes next.