Filtering Social Media Noise: Sparksfly Case Study

JetRuby Agency · May 12, 2017


Social media activity is as high as ever and keeps growing rapidly.

Sparksfly is a user-friendly application designed to streamline your social media life. It enables both individual users and businesses to curate, filter, and engage across multiple social networks at the same time, based on personal preferences such as topics, audience, or location.

Sparksfly lets you post, tweet, share, and do a lot more from one single place. Following topics has never been easier: just create a keyword routine that monitors your feeds to find, sort, and save the content you want. It’s there for you to view whenever you select your custom routine.

The development of Sparksfly posed an amalgam of challenges.

One of the biggest ones involved database performance. Considering that an average user has multiple social network accounts, each user generates about 500–1000 posts per day. In other words, we needed to find and implement a reliable solution for processing enormous amounts of data.

Plus, every social network has its own output format. Because of that, dealing with several data sources becomes an issue.

Why?

Typically, for every new social network to add, the approach would be to just write another piece of code, i.e. its own bit of logic, and place it into the app architecture.

However, the architecture would start resembling a spaghetti monster quite soon and scaling the app would become impossible. In addition, mobile developers would struggle with having different formats to work with.

In this article, we’ll be showing how we dealt with the following challenges:

  • Slow data entry into the database.
  • Slow selection.
  • Collisions during database entries.

… and then we’ll briefly go through other things we’ve implemented in the project.

Challenge #1 — slow entry into the PostgreSQL database

For each type of data to aggregate, we created a table with a unique field set. Next, we sorted all the incoming data according to specific filters. As a result, it looked like the following:

{id: '1111', author: {id: '1', name: 'Some User', username: 'someuser'}, body: 'postbody', likes: '42', photos: [{url: 'url1'}, {url: 'url2'}]}

And here’s the schema of the original database.

Now, let’s break it down.

Users — registered users.

SocialProfiles — the table containing the users’ social network profiles:
- uid — the user’s ID inside the social network;
- service — the social network (a string value: facebook | twitter | linkedin | instagram);
- access_token, secret key — tokens for accessing the social network’s API on the user’s behalf.

Profiles — information about a user (name, username, etc.).

Posts:
- uid — the post’s ID inside the social network;
- social_profile_id — a link to the author’s account;
- body — the post itself.

Activities — since the same post can be displayed in several user feeds, we needed another linking table (the main one), which is responsible for data selection.
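
For reference, a minimal migration sketch of this original schema might look like the following (table and column names follow the description above; everything else, including the separate photos table for post images, is an assumption on our part):

# A simplified sketch of the original schema (Rails 4.2 migration);
# only the fields relevant here are shown.
class CreateCoreTables < ActiveRecord::Migration
  def change
    create_table :users

    create_table :social_profiles do |t|
      t.integer :user_id       # stays empty until the owner registers in the app
      t.string  :uid           # the user's ID inside the social network
      t.string  :service       # facebook | twitter | linkedin | instagram
      t.string  :access_token
      t.string  :secret
    end

    create_table :profiles do |t|
      t.integer :social_profile_id
      t.string  :name
      t.string  :username
    end

    create_table :posts do |t|
      t.string  :uid           # the post's ID inside the social network
      t.integer :social_profile_id
      t.text    :body
    end

    create_table :photos do |t|
      t.integer :post_id
      t.string  :url
    end

    create_table :activities do |t|
      t.integer :user_id
      t.integer :post_id       # links a post to every feed it appears in
    end
  end
end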

This is how the algorithm worked after that:

  1. Based on the information from “author”, the system searched for an existing “SocialProfile”;
  2. If there was no “SocialProfile”, we created entries in the “SocialProfiles” and “Profiles” tables. (The “SocialProfile” was created without “user_id” because that specific user hadn’t registered in the system yet. Once they did, we would receive their “social_uid” and link the “social_profile” to the “user”);
  3. We created the photo and the post;
  4. We created a data entry in the table “activities” and linked the post to the user.
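
In Ruby terms, that original flow might have looked roughly like this (a simplified sketch; the ingest helper, the Photo model, and the hard-coded service name are our own illustration):

# Original ingestion flow (simplified): profiles are created eagerly.
def ingest(raw_post, user)
  author = raw_post[:author]

  social_profile = SocialProfile.find_by(uid: author[:id], service: 'twitter')
  unless social_profile
    social_profile = SocialProfile.create!(uid: author[:id], service: 'twitter')
    # user_id stays empty until this author registers in the app
    Profile.create!(social_profile_id: social_profile.id,
                    name: author[:name], username: author[:username])
  end

  post = Post.create!(uid: raw_post[:id],
                      social_profile_id: social_profile.id,
                      body: raw_post[:body])
  raw_post[:photos].to_a.each { |photo| Photo.create!(post_id: post.id, url: photo[:url]) }

  Activity.create!(user_id: user.id, post_id: post.id)
end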

As the main database, we used PostgreSQL. Moreover, since the app had to save all the incoming raw data as well, we also used MongoDB. To see how this combination would hold up, we built a tool for load testing the server and the application and emulated the expected traffic. (The data flow was set to about 1 mln. posts a day.)

The first thing that went down was MongoDB. It failed to write data into the dump and locked the disk every time the number of queries started to grow. After that, we decided it would be better to store the raw data in a serialized format right in PostgreSQL.
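
In Rails, one straightforward way to do that is to keep the raw payload in a serialized (or, on PostgreSQL 9.4+, a jsonb) column; a minimal sketch, with the raw_data column name assumed:

# Keep the untouched social-network payload next to the parsed attributes.
class AddRawDataToPosts < ActiveRecord::Migration
  def change
    add_column :posts, :raw_data, :jsonb   # use :json or :text on older PostgreSQL
  end
end

# With a jsonb column nothing else is required; a plain text column would
# instead need `serialize :raw_data, JSON` in the Post model.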

Ultimately, we saw that there was no need to create profiles since most of the posts come from unregistered users. So we fine-tuned the algorithm:

  1. In the “Post” table, we added the following fields: “author_uid”, “author_username”, “author_full_name”, and “author_picture”.
  2. When creating a post, the algorithm starts searching for an existing profile. When found, it’s linked to “social_profile_id”.
  3. If there’s no profile, the post is created without being linked to it.
  4. Once a day, the system parses users’ friends. (When a new profile is created this way, nothing is linked to it yet).
  5. As a result, all the new posts come linked. During user registration, all their posts already have “author_uid”, so we can always restore the linking.
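
A sketch of the fine-tuned flow with the denormalized author_* fields from step 1 (again, the ingest helper, the hard-coded service name, and the picture key in the payload are our own assumptions):

# Refined ingestion: the post always carries its author's data,
# so it can be created even when no profile is known yet.
def ingest(raw_post, user)
  author  = raw_post[:author]
  profile = SocialProfile.find_by(uid: author[:id], service: 'twitter')

  post = Post.create!(
    uid:               raw_post[:id],
    body:              raw_post[:body],
    author_uid:        author[:id],
    author_username:   author[:username],
    author_full_name:  author[:name],
    author_picture:    author[:picture],
    social_profile_id: profile && profile.id   # nil if the author isn't known yet
  )

  Activity.create!(user_id: user.id, post_id: post.id)
end

# The daily job can later restore the missing links by matching
# posts.author_uid against newly created social profiles.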

Challenge #2 — slow data selection

(Just to remind you, there were about 400–500k posts and about 1 mln. data entries into the database a day).

Before the release, we performed load testing and optimized the slowest queries. But since the database was still nearly empty, only ~ 2 mln. entries were generated for the tests, whereas the minimum required number was at least 10 mln.

From that, we concluded the following:

  • Indexes were required for all the fields; in addition, queries became slower as the database grew bigger.
  • The column order in composite indexes did matter: the field with the less degenerate index tree had to come first.
  • In addition to the composite index, we needed to add an index on each individual field of the linking tables (see the migration sketch below).
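
In migration terms, that amounts to something like this (the concrete table and column pair is illustrative):

# Composite index plus standalone indexes on the linking table.
class AddActivityIndexes < ActiveRecord::Migration
  def change
    # The more selective (less degenerate) column goes first.
    add_index :activities, [:user_id, :post_id]

    # Each field of the linking table also gets its own index.
    add_index :activities, :user_id
    add_index :activities, :post_id
  end
end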

Doing this made the application noticeably faster. But as soon as the database grew bigger, we stumbled upon performance issues once more. When the number of posts reached ~ 1 bln., even searches by index started to take much more time than they were supposed to. This wasn’t good, since that lookup was carried out as a uniqueness check every time before a new post was created. The same applied to the search by “social_profiles”.

As a solution, we turned to PostgreSQL table partitioning, which allows for dividing large tables into physically separate parts. We split the tables by social network. Ultimately, the structure looked like the following:

Activities
- facebook_activities
- twitter_activities
- instagram_activities
- linkedin_activities

SocialProfiles
- facebook_social_profiles
- twitter_social_profiles
- instagram_social_profiles
- linkedin_social_profiles

Posts
- facebook_posts
- instagram_posts
- twitter_posts
- linkedin_posts

And this is how it works. On the master (parent) table, we create a trigger that distributes incoming data between the child tables. The queries we send keep selecting data from the master table. If a query’s conditions contain the “service” field, PostgreSQL collects entries only from the matching table sections. We also rewrote all the queries so that they use the name of the social network in the condition.
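
Below is a condensed sketch of that setup for the posts table, assuming posts carries a service column (only two networks shown; the real setup covers all four):

# Trigger-based partitioning of posts by social network (PostgreSQL).
class PartitionPosts < ActiveRecord::Migration
  def up
    execute <<-SQL
      -- Child tables inherit the structure of the master (parent) table.
      CREATE TABLE facebook_posts (CHECK (service = 'facebook')) INHERITS (posts);
      CREATE TABLE twitter_posts  (CHECK (service = 'twitter'))  INHERITS (posts);

      -- The trigger routes every insert into the matching child table.
      CREATE OR REPLACE FUNCTION posts_insert_trigger() RETURNS trigger AS $$
      BEGIN
        IF NEW.service = 'facebook' THEN
          INSERT INTO facebook_posts VALUES (NEW.*);
        ELSIF NEW.service = 'twitter' THEN
          INSERT INTO twitter_posts VALUES (NEW.*);
        END IF;
        RETURN NULL;  -- the row is not kept in the master table itself
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER posts_partition_insert
        BEFORE INSERT ON posts
        FOR EACH ROW EXECUTE PROCEDURE posts_insert_trigger();
    SQL
  end
end

With this in place, a query like Post.where(service: 'twitter', uid: some_uid) lets PostgreSQL skip every partition except twitter_posts.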

The indexes we add to the master table are not added to the child tables automatically. We need to create them manually and keep them in sync whenever the indexes of the master table change.

Also, an insert query to the master table won’t return the ID of the created data entry anymore, so we need to add the “insert_returning: false” parameter to “config/database.yml”.
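
For the PostgreSQL adapter, that is a one-line change (the environment name here is illustrative):

# config/database.yml
production:
  adapter: postgresql
  insert_returning: false   # don't rely on INSERT ... RETURNING id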

In order to avoid performance issues, we decided to store only the data for the past couple of months (roughly 2 bln. posts). Everything that’s older now goes to the archive.

Challenge #3: collisions during database entries

Despite using the uniqueness validation, we found out that there were many identical posts in the database. It turned out that the same post could be processed several times in parallel. As a result, a duplicate could already be on its way into the database before the validation even started; the validation didn’t notice it because the transaction inserting that duplicate hadn’t been committed yet.

So, we made the index on the “uid” field in the “posts” table unique. We also developed our own method for ActiveRecord:
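
A sketch of what such a method could look like (the module, method, and usage names here are ours, not the original ones):

# A find-or-create helper that survives concurrent inserts: a race between
# the SELECT and the INSERT surfaces as a unique-index violation, in which
# case the lookup is simply repeated.
module SafeFindOrCreate
  def safe_find_or_create_by(attributes)
    find_by(attributes) || create!(attributes)
  rescue ActiveRecord::RecordNotUnique
    retry
  end
end

ActiveRecord::Base.extend(SafeFindOrCreate)

# Usage: Post.safe_find_or_create_by(uid: raw_post[:id])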

This method searches for a specific entry in the database. If it finds nothing, it tries to create the entry. If the database returns a uniqueness error, the method tries again. We intentionally didn’t override the “save” method, since we wanted to make it obvious that there was custom logic behind the saving.

Result

Making a module that receives data from a social network and converts it into one unified format allowed us to kill two birds with one stone. Firstly, it doesn’t matter for the app what kind of data it gets: whether it’s a Facebook post or a YouTube video, it all looks the same from the format standpoint. Secondly, there is no need to stuff the architecture with tons of code and reinvent the wheel for the tenth time. For every new social network, we just write another module and that’s it! The app scales easily and doesn’t pose any challenges for mobile developers. Also, the whole process of fetching data is really smooth thanks to a unified interface responsible for communication with social media APIs.
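
In practice, such a module comes down to one small adapter per network behind a shared interface; a sketch with our own class names (the raw field names are illustrative API fields):

# Each network gets its own adapter; the rest of the app only ever
# sees the unified post format.
module Feeds
  class TwitterAdapter
    def normalize(raw)
      { uid:    raw['id_str'],
        body:   raw['text'],
        author: { uid: raw['user']['id_str'], username: raw['user']['screen_name'] } }
    end
  end

  class FacebookAdapter
    def normalize(raw)
      { uid:    raw['id'],
        body:   raw['message'],
        author: { uid: raw['from']['id'], username: raw['from']['name'] } }
    end
  end

  ADAPTERS = {
    'twitter'  => TwitterAdapter.new,
    'facebook' => FacebookAdapter.new
  }

  # Adding a new network means adding one more adapter to this map.
  def self.normalize(service, raw)
    ADAPTERS.fetch(service).normalize(raw)
  end
end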

Speaking of PostgreSQL in particular, working with a large database requires constant monitoring: a fast query can quickly become a blocker. This also includes regularly reviewing the indexes in use. As for partitioning, keep in mind that indexes are not inherited by child tables and have to be managed there separately. Finally, PostgreSQL is a great and highly customizable tool, and it proved itself to the full extent.

As a result, the app can aggregate huge amounts of data from the biggest social networks: Facebook, Twitter, LinkedIn, and YouTube. And while most similar apps haven’t gone further than processing posts and retweets, Sparksfly can grab and thoroughly filter any data related to a user profile, e.g. all the videos from their YouTube channels.

Technologies that we used

  • High-load PostgreSQL configuration,
  • Ruby 2.3,
  • Rails 4.2.6,
  • MongoDB,
  • Redis,
  • Omniauth,
  • Sidekiq,
  • Puma


JetRuby is a Digital Agency that doesn’t stop moving. We expound on subjects ranging from mobile app development to disruptive technologies.