The Path to Kueski 2.0

Introduction

"Don't worry about the first code you write, you will end up throwing it away after some time"

This is what a very wise person told us when the Kueski.com codebase was started back in 2013. The truth is that around the end of 2014, we had rewritten the majority of the initial code. And what we have nowadays in 2016 is a completely different beast (a very nice beast, compared to what we had 12 months ago).

This is the story of how the Kueski.com infrastructure evolved from its humble beginnings to what we have today: from a single monolithic server that contained the backend, frontend, database and everything else in a single AWS instance, to the current highly scalable, distributed micro-services infrastructure. Most of the focus will be on the current state of the infrastructure and on some of the battle scars that led us to the decisions we made.

Kueski 0.1: An Idea was materialised

Back in 2013, when Kueski.com was just starting, we had an architecture that could be summarised by the following image:

Kueski Architecture 0.1

We basically had Frontend, Backend, Processing and Databases on a single server.

From the start of system development we made some infrastructure decisions, such as:

  • Working on AWS: Mainly to minimise the time required for DevOps tasks
  • Using MongoDB: At the time it made perfect sense: we were just starting to grow our data and had no idea which fields were going to be used. Given that we were just starting our Automated Risk Modelling process, we did not know what variables, data sources or other information we were going to store. The data structure changed on a weekly basis (see the sketch after this list). We were young and naïve. This decision of using MongoDB for everything came back to bite us some years later... but we will get to that.
  • Using Ruby as our programming language: The decision was made mainly because we knew Ruby made it possible to prototype and develop the system very quickly. We also made the conscious choice of not using Rails, to avoid putting all our eggs in one basket.
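
To give a flavour of why the schema-less approach was so attractive at the time, here is a minimal sketch (not our actual code) using the Ruby mongo driver. The database, collection and field names are made up for illustration; the point is simply that two records saved weeks apart can have completely different shapes without any migration.

```ruby
require 'mongo'

# Hypothetical example: connect to a local MongoDB and store loan
# applications whose fields vary from week to week.
client = Mongo::Client.new(['127.0.0.1:27017'], database: 'kueski_dev')
applications = client[:applications]

# Two applications saved weeks apart can carry completely different fields,
# with no schema migration in between.
applications.insert_one(name: 'Ana', requested_amount: 2000)
applications.insert_one(name: 'Luis', requested_amount: 3500,
                        risk_variables: { bureau_score: 680, device: 'mobile' })
```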

At the time everything worked smoothly, and Kueski's website was mainly a front end for filling out an application; once an application had been filled out, it sent an email to someone at Kueski for further action. This means that the majority of the process was done manually! That included getting Credit Bureau information and signing contracts (as in, clients had to sign the contracts by hand).

Kueski 1.0: The path to Automation

Between the end of 2013 and the beginning of 2014, the main focus of Kueski's Engineering team was automation. We started pushing to automate the majority of the steps in the loan application and loan servicing processes. This, of course, was accompanied by progress on the legal side of things: we could only automate processes that we were sure were in compliance with the regulations. At this time we focused on automating things like:

  • Customer Verification: To ensure that personal data contained in loan applications is correct and valid.

  • Risk Evaluation: The Data Science team implemented the first set of evaluation models, and we had to create infrastructure to integrate them into a production environment.

  • Documentation Uploading: Believe it or not, at one time you had to send your documentation (selfie and ID) by email, and a person had to manually save it to our internal systems. The effort here was to create a complete system allowing our clients to self-serve by uploading this information themselves.
  • Document Signature: Once we were given the green light to perform online signatures, we developed the subsystem that allows applicants to sign the documents by themselves (by clicking on "I agree"). This system generates PDFs from these contracts and saves them so that our clients can access them.
  • Transferring Money: Also believe it or not, there was a time when we needed to manually transfer the money to our customers' accounts. This ended once we connected to a third-party system that allows us to disburse loans automatically.

Overall, the infrastructure of Kueski.com at that time was looking like this:

Kueski Architecture 1.0

That is somewhat better. One notable change is that we moved the Risk evaluation process to a separate server, where each new application was automatically evaluated.

We also extracted some of the functionality used to obtain customer data from third-party services (like the Credit Bureau or government databases) into external servers. These were the first baby steps towards a real micro-services architecture. Nevertheless, the main Kueski.com server still held the majority of the functionality.

An additional improvement over Kueski 0.1 was the development of internal dashboards that allowed our operators to quickly execute the manual steps left in an application.

From that time on we identified two main lines of development for Engineering: Product Development and Internal Tooling (this was the foundation for the two teams that were later created inside Engineering: Product and Core).

Architecture Improvements

There were several improvements made to our architecture during the Kueski 1.0 effort, some of these were:

  • Implementation of some micro-services and communication using REST APIs
  • Protection of several tools and services using Client Certificates

We also had some limitations, mainly due to design decisions and our own lack of knowledge at the time:

  • Synchronous REST Services: Using synchronous interactions forced us to wait until a process was finished before continuing with other parts of the process. It took us some time to change our mindset to asynchronous communication.
  • Linear Processing: Due to the nature of our application process, it is possible to run several steps in parallel. However, back in K1.0, our infrastructure did not facilitate that. We made some experiments that failed miserably, mainly due to technology limitations and our own lack of knowledge. For example, we were using the MongoMapper gem, which at that time had problems with concurrency: modifying a document and saving it triggered an overwrite of all the fields (see the sketch after this list).
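
To illustrate the kind of lost update we kept hitting, here is a minimal sketch using the plain Ruby mongo driver rather than MongoMapper itself; the collection and field names are hypothetical. The first pattern mirrors what a whole-document save does, the second is the atomic alternative we eventually had to rely on.

```ruby
require 'mongo'

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'kueski_dev')
loans  = client[:loans]
id     = loans.insert_one(status: 'pending', verified: false).inserted_id

# Read-modify-write of the whole document: if two workers do this at the
# same time, the last one to save silently overwrites the other's fields.
doc = loans.find(_id: id).first
doc['status'] = 'approved'
loans.replace_one({ _id: id }, doc)   # clobbers any concurrent change to 'verified'

# Atomic alternative: only touch the field you actually modified.
loans.update_one({ _id: id }, { '$set' => { verified: true } })
```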

Code Quality

Given that at the end of 2014 we were 4 full-time developers (plus one intern!), we had to make sure to increase and maintain the quality of our codebase. A description of our complete code quality process and stack is a subject for a whole blog post (a good idea that we may write about in the future); however, at the time of K1.0 we introduced several controls to deal with code quality:

  • Unit Testing: In the form of MiniTest, it helped us ensure that we were building the thing right, especially for the programmers who came from strongly-typed, compiled languages. Programming in Ruby for the first time made them introduce bugs through really simple mistakes, like typos in a symbol name (used where they would have reached for enums, for example); see the sketch after this list.
  • Code Reviews: We found out this was a really good practice and communication mechanism. It let some developers improve their programming skills in languages in which they did not have much experience. For example, in my personal case I had not done much JavaScript; however, every time I made a commit, it was reviewed by our internal JavaScript expert. At that time we used BarKeep to do post-commit reviews. We would later migrate to another tool and to pre-commit reviews, after we found ourselves trying to cheat the system :-(
  • Continuous Integration: Once we had automated tests, we could introduce Jenkins to perform continuous integration. In fact, more than continuous integration, the effort was mainly about automatically running the test suites whenever a push was made.
  • Static Metrics: We implemented MetricFu to monitor the quality of our code and try to reduce the technical debt incurred while developing.
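
As an illustration of the first point, here is a hypothetical MiniTest example of the symbol-typo kind of bug these tests caught for us; the class and method names are invented. In Ruby, a misspelled symbol is just another symbol, so nothing complains until a test (or a customer) notices.

```ruby
require 'minitest/autorun'

class LoanApplication
  attr_reader :status

  def approve!
    @status = :aproved   # typo: should be :approved
  end
end

class LoanApplicationTest < Minitest::Test
  def test_approve_sets_the_expected_status
    application = LoanApplication.new
    application.approve!
    assert_equal :approved, application.status   # fails, exposing the typo
  end
end
```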

Implementing these tools gave the expanding team a great deal of visibility. As the team grew, we realized that we needed better communication channels, and we found them in some of these tools and formalized processes. However, there was a big elephant in the room: our infrastructure was still monolithic, and as time passed we kept having more and more scalability problems.

Kueski is Down! (¡Se cayó Kueski!)

¡Se cayó Kueski! became an infamously catchy phrase around the office from mid 2015 until the release of Kueski 2.0. There were several reasons for that, but the main one was that our MongoDB database simply could not deliver the throughput we were asking of it.

Dramatization of a Kueski Engineer Fixing Kueski 1.0

Personally, I think MongoDB is just really bad, and some people think you should never ever use it. That said, MongoDB has its uses; however, being the storage engine for a database with 15MB entries, a mix of relational and non-relational data, and collections with transaction-sensitive modifications is not, in my opinion, a good use of a NoSQL database.

So, at the end of Kueski 1.0, we had a monolithic system (hosting both user-facing and internal web applications) that failed (as in, HTTP 500 errors at certain points when the site was loaded), which made the whole application go down. Our technical debt was through the roof (we sometimes joked that we had some code lying around within our technical debt).

By the middle of August 2015 we started the K2.0 Project: an outstanding and ginormous effort to redo almost the whole Kueski infrastructure from scratch. We thought about doing it right from the beginning, planning around asynchronous communication channels (via AWS SQS queues) and aggressively splitting our monolithic application into several micro-services with built-in redundancy. But most importantly, we decided to throw MongoDB in a trash can and burn it with fire (all this figuratively, of course; at the end of the day, MongoDB held the most important thing we had: our data).
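
The sketch below shows, roughly, what that asynchronous style looks like with the aws-sdk-sqs gem; it is not our production code, and the queue name and the evaluate_application helper are placeholders. A producer drops a message on a queue and moves on; a consumer polls, does the work, and only then deletes the message.

```ruby
require 'aws-sdk-sqs'
require 'json'

sqs = Aws::SQS::Client.new(region: 'us-east-1')
queue_url = sqs.get_queue_url(queue_name: 'risk-evaluation-requests').queue_url

# Producer side (e.g. right after an application is submitted).
sqs.send_message(queue_url: queue_url,
                 message_body: { application_id: 'app-123' }.to_json)

# Consumer side (a worker inside a micro-service).
loop do
  response = sqs.receive_message(queue_url: queue_url,
                                 max_number_of_messages: 1,
                                 wait_time_seconds: 20)   # long polling
  response.messages.each do |message|
    payload = JSON.parse(message.body)
    evaluate_application(payload['application_id'])   # hypothetical worker logic
    sqs.delete_message(queue_url: queue_url,
                       receipt_handle: message.receipt_handle)
  end
end
```

Because a message is only deleted after the work succeeds, a worker that dies mid-processing simply lets SQS redeliver the message later, which is what keeps messages from disappearing due to server errors.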

Kueski 2.0: Scalability

The original plan was "Let's migrate from MongoDB to MySQL", but of course moving the data was the least of our problems. As we dug deeper and deeper into the codebase, we found out that we would have to change around 80% of the existing code, and that the code we already had would only amount to about 30% of the resulting system. This meant that we had to write the other 70% as new code.
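
Just to illustrate the "easy" part, here is a hypothetical sketch of copying documents from a MongoDB collection into a MySQL table using the mongo and mysql2 gems. The table, columns and credentials are invented, and the real work (rewriting all the code that consumed those documents) is not shown.

```ruby
require 'mongo'
require 'mysql2'

mongo = Mongo::Client.new(['127.0.0.1:27017'], database: 'kueski_legacy')
mysql = Mysql2::Client.new(host: 'localhost', username: 'kueski',
                           password: ENV['MYSQL_PASSWORD'], database: 'kueski')
insert = mysql.prepare('INSERT INTO loans (legacy_id, amount, status) VALUES (?, ?, ?)')

# Flatten each schema-less document into fixed relational columns.
mongo[:loans].find.each do |doc|
  insert.execute(doc['_id'].to_s, doc['amount'], doc['status'])
end
```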

Our original timeline was also very "manager friendly": we planned to start around September and finish in November 2015. We did have a working version around that November, and there were different sides within our team pushing to either release or wait. Fortunately the wait side won, and we had time to finish the K2.0 project fully, releasing on April 8th, 2016, five months after the original deadline (and it was really worth it!). The resulting infrastructure is depicted in the following picture:

Maaaaaaaagic!

The new infrastructure contains cool stuff like:

  • A load balancer (yes, we did not have one before :-( )
  • Individual micro-services consisting of one or more servers that can be torn down or spun up when needed (or when Amazon decides to kill them)
  • Async message passing between the different micro-services via AWS SQS queues. This enables multiple subscribe-based consumers and prevents messages from disappearing due to server errors.
  • S3 data storage, which is amazing for concurrent access.
  • Server deployment via Ansible scripts: a lifesaver now that we were talking about tens of servers (and are moving towards hundreds of servers nowadays)
  • Transactional, relational (a.k.a. SQL) servers for data that is relational in nature. The speed gained compared to MongoDB made us shed some tears of joy.
  • Micro-services as "buffers" for third-party services. We know that everybody makes the best software and services, which never fail. But for those rare occasions when one of our providers fails, is slow or has some other problem, we built wrapper micro-services with logic that can deal with the problem in a way that is as transparent as possible to the rest of the system (see the sketch after this list).

At the end of the day, the migration to the new infrastructure was completed successfully, and after it happened (and after a couple of weeks ironing out the typical bugs), everybody in the company was very happy. The system stopped breaking down, it was faster, and, most importantly, our customers were happy: we saw an outstanding increase in our conversion rate.

Epilogue

This was a short summary of how our systems have improved from the very humble beginnings of Kueski's systems to the powerful infrastructure we have nowadays. And of course, we continue improving our systems to make them faster, more efficient, more resilient and friendlier for our users (both internal and external). We strive to automate as much as possible, so that we can provide the amazing experience that our customers deserve.

There are lots of technical details in this story that would make for great posts. However, those are better told in future writings by the people who make the magic happen:

Maaaaaaaagic!

The Kueski Engineering Team