Tele2 Hackathon

Recently I participated in a hackathon organized by the mobile operator Tele2.

What’s a hackathon?

You gather a lot of programmers in the same building for a weekend. They transform energy drinks mixed with unhealthy food into a working prototype of something useful. Or useless, but fun.

The next natural question is

What do they get out of it?

A company won’t waste its time and money funding a weird geeky weekend just for the heck of it. The participants also have alternative ways to spend this time, like drinking beer with friends or going out - that is, if they have a life. Nerds.

For a company, a hackathon can act as a competitive idea-generation and job-interview machine. One can learn a lot about a person’s hard and soft skills by watching their on-stage performance. Traditional job interviews are a lot less accurate at determining a person’s skills and fit for the role.

At a hackathon you can observe teamwork, presentation, code quality, math and framework knowledge, the ability to craft unique solutions, and stress resistance. Two days offer a lot more information than a short interview.

Participants’ motivations are to compete, improve, show off, land a possible job offer or win a money prize. Some employers also value hackathon participation.

About this hackathon

The data

Your solution is only as good as the data you’re using to build it. Initially we were provided with almost nothing. The table contained only two fields:

  • a sequence of points (longitude, latitude) representing a polygon of base station’s coverage
  • 3-month average transferred data volume (in GB)
  • and yeah, an id of the base station

The data only contained coverage for a particular Russian province.

The task

Predict average transferred data volume for base stations in the validation set during the same time period.

The results were evaluated with Root Mean Squared Logarithmic Error:

\[RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^n \left( \log (\hat{y_i} + 1) - \log (y_i + 1) \right)^2}.\]

Here \(\hat{y_i}\) is what the model predicted and \(y_i\) is the true data volume for the \(i\)th base station out of \(n\) total.

RMSLE makes perfect sense here. It’s important to get an idea of how much traffic will be produced, not to pinpoint the precise value, which is hard to forecast.

The error is allowed to be proportional to the true value:

\[\log (\hat{y_i} + 1) - \log (y_i + 1) = \log \frac{\hat{y_i} + 1}{y_i + 1}.\]

This is because data volumes differ by orders of magnitude. If plain RMSE without the log were used, the model would throw all of its effort into predicting the largest values while largely ignoring the rest of the data.
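
For intuition, here’s a minimal NumPy sketch of the metric (the helper and the sample values are mine):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, as defined above."""
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return np.sqrt(np.mean(log_diff ** 2))

# Missing by a factor of two costs about the same at any scale:
print(rmsle(np.array([10.0]), np.array([20.0])))      # ~0.65
print(rmsle(np.array([1000.0]), np.array([2000.0])))  # ~0.69
```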

Expectation vs. reality

The hackathon’s web page explicitly stated that participants would

  • solve computer vision and time series problems
  • predict data volume for base stations that are built but not yet active
  • or build a model that helps pick a good spot for a new station

But in reality most of that data is way too sensitive, so Tele2 transformed the problem into predicting the current data volume in a separate Russian province. I guess they’ve gotten their hands on some productive ideas generated by the community anyway.

Computer vision techniques could in fact be applied if one decided to process satellite imagery. However, there was no trace of time series analysis.

Train and validation sets

The data were split into training and validation sets by hand rather than randomly.

Areas inside the cities are densely covered by a ton of overlapping base stations. If the data were divided randomly, a lot of validation polygons would intersect or even lie inside training ones.

That’s why the validation set consisted of a remote region having little in common with the densely populated urban areas that dominate the training set. This decision increased validation difficulty and the risk of overfitting. Here’s how the train and test sets look:

[Image: train vs. test coverage map]

Building features

Since the originally provided data didn’t contain much at all, the competition was about looking for suitable open data and generating features related to traffic consumption.

Base station coverage data

It wasn’t much, but for any polygon it was easy to compute its area, centroid coordinates and approximate radius.

Assuming centroids represent actual base station coordinates, we could easily calculate distances to the several nearest base stations. Coverage area was one of the top three features in this task; it clearly correlated with hidden categories like technology or absolute height.
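
A minimal shapely sketch of those geometric features (the polygon coordinates are made up; for real areas you’d reproject from degrees into a metric CRS first):

```python
import numpy as np
from shapely.geometry import Polygon

# A made-up coverage polygon as (longitude, latitude) points.
coverage = Polygon([(37.60, 55.75), (37.62, 55.75),
                    (37.62, 55.77), (37.60, 55.77)])

area = coverage.area                            # in squared degrees here
centroid = (coverage.centroid.x, coverage.centroid.y)
radius = np.sqrt(area / np.pi)                  # radius of an equal-area circle
```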

Others also used the number of points making up the polygon to account for its curvature. One team invented “gradient” features, dividing the number of base stations within a 3 km radius by the same count within a 1 km radius. They reported it to be useful.
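
As I understood it, such a gradient feature can be computed along these lines (the synthetic coordinates and the division-by-zero guard are my assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical base station positions, already projected to meters.
rng = np.random.default_rng(0)
stations = rng.uniform(0, 50_000, size=(500, 2))
tree = cKDTree(stations)

# Neighbours within 1 km and 3 km of each station (minus the station itself).
n_1km = np.array([len(tree.query_ball_point(p, r=1_000)) - 1 for p in stations])
n_3km = np.array([len(tree.query_ball_point(p, r=3_000)) - 1 for p in stations])

# The "gradient": how quickly station density falls off with distance.
gradient = n_3km / np.maximum(n_1km, 1)  # guard against an empty inner radius
```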

Open data

People went crazy using all sorts of open data:

  • OpenStreetMap. Contains tons of objects, such as banks and supermarkets, where traffic is generated.
  • Yandex Maps API. Useful to match street addresses.
  • Detailed population geo-stats. Gives an idea about how close a base station is to major cities.
  • Road accident stats. Reported useful for some reason.
  • Coordinates of all base stations. Helps to evaluate competitors’ influence.
  • Satellite imagery. The brighter the area is lit, the more people live there.

There probably were even more sources.

Our solution

We decided to generate features involving as many independent factors as possible: points of interest, local settlement population, distances from major cities and base station concentration.

To check whether these feature modules were worth the time, we looked at the most influential variables. Random Forest’s feature importances provided the necessary functionality.
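
Getting the importances takes just a couple of lines; here’s a self-contained sketch with stand-in data (the column names are illustrative, not our real features):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Stand-in feature matrix and target; in our case this held the 110 features.
X = pd.DataFrame(np.random.rand(200, 5),
                 columns=["area", "radius", "n_banks", "dist_city", "population"])
y = np.random.rand(200) * 100

model = RandomForestRegressor(n_estimators=300, random_state=42)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```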

Features

I designed a feature builder atop OpenStreetMap data using geopandas. For a given coverage polygon it counts the occurrences of each point of interest type we specified and computes the distance to the nearest one. OSM has its own info on villages, hamlets and small cities, from which I extracted population and the distance to the nearest settlement.
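
A condensed sketch of what that builder does (the file names, the UTM zone and the column names are all made up):

```python
import geopandas as gpd

# Hypothetical inputs: OSM points of interest and base station coverages.
pois = gpd.read_file("osm_pois.geojson").to_crs(epsg=32637)      # metric CRS
coverages = gpd.read_file("coverages.geojson").to_crs(epsg=32637)

# Count banks falling inside each coverage polygon.
banks = pois[pois["amenity"] == "bank"]
hits = gpd.sjoin(coverages, banks, predicate="contains")
coverages["n_banks"] = (
    hits.groupby(level=0).size().reindex(coverages.index, fill_value=0)
)

# Distance from each coverage centroid to the nearest bank.
coverages["dist_to_bank"] = coverages.geometry.centroid.apply(
    lambda c: banks.distance(c).min()
)
```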

Meanwhile my teammates estimated the population of major cities with the Yandex API and StatData. The resulting “provinciality” feature was a distance weighted proportionally to the populace.
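
I won’t swear by their exact formula, but one plausible form divides the distance to each major city by its population and takes the minimum over cities, so a station near a big city scores low “provinciality”:

\[\text{provinciality}_j = \min_i \frac{d(\text{station}_j,\, \text{city}_i)}{\text{pop}_i}.\]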

All in all, we made 110 features.

Algorithms

We picked tree-based models such as CatBoost, LightGBM and Random Forest, since they are insensitive to absolute feature scales: it doesn’t matter in which units we measure a distance, and we can mix features of different meaning, like distances, counters and categories, in the same model. As a bonus, they provide feature importance out of the box.

Our best solution was sklearn’s Random Forest, heavily regularized, i.e. prevented from becoming overly complex: tree depth, the number of leaf nodes and the minimal number of samples per leaf were purposefully limited (see the sketch after this list). This makes a lot of sense, since:

  • big cities in the training set aren’t relevant to the remote regions included in the test set
  • our lightgbm-based solution overfitted after just 9 iterations
  • a very similar score was achieved by a different team with Linear Regression (!) and far fewer features
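
A sketch of such a configuration (the hyperparameter values and the log1p target transform are my illustrative assumptions; X_train, y_train and X_valid stand for our data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=500,
    max_depth=6,            # shallow trees generalize better to the unseen region
    max_leaf_nodes=32,
    min_samples_leaf=20,    # each leaf must summarize many stations
    random_state=42,
)

# Fitting on log1p(volume) makes the squared-error objective match RMSLE.
model.fit(X_train, np.log1p(y_train))
pred = np.expm1(model.predict(X_valid))
```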

Validation

Since the training set is very different from the test set, a score estimated with plain cross-validation is biased: the actual error will always be higher by a certain amount. To mitigate this, we discretized coverages and performed stratified cross-validation instead. It somewhat helped. Maybe we should’ve developed this idea further and performed a hand-picked split akin to how the organizers did it.
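
The pattern looks roughly like this, assuming X is our feature DataFrame and coverage_area is the statistic we binned (a simplification of what we actually did):

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Bin a continuous coverage statistic into quintiles and stratify on the bins,
# so every fold mixes small and large coverages.
strata = pd.qcut(X["coverage_area"], q=5, labels=False)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in cv.split(X, strata):
    ...  # fit on train_idx, compute RMSLE on valid_idx
```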

A smart thing one team did was to keep only the features having similar distributions in both train and test sets.
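
Their exact criterion is unknown to me, but a two-sample Kolmogorov-Smirnov test is one plausible way to implement the idea (X_train and X_test stand for the two feature tables):

```python
from scipy.stats import ks_2samp

# Keep only the features whose train and test distributions look alike.
stable = [
    col for col in X_train.columns
    if ks_2samp(X_train[col], X_test[col]).pvalue > 0.05
]
X_train, X_test = X_train[stable], X_test[stable]
```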

Results

The overall best score was 0.61391, which translates to a typical relative error of 1.848 in either direction. Man, that’s a huge error!
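
Since RMSLE is a root mean square of log-ratios, exponentiating the score gives the typical multiplicative factor by which a prediction misses:

\[e^{\mathrm{RMSLE}} = e^{0.61391} \approx 1.848.\]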

Our score was 0.7035, meaning a typical factor of 2.021. Ours and the best aren’t that far apart, which suggests the task thrown at us was pretty complex. And it didn’t even include actual forecasting, let alone picking good base station spots. So I think we did well.

Conclusions

This was my third hackathon and by far the best. I felt welcome, the food was great and the Tele2 office is very cozy. We even slept there overnight. I’d like to thank our team, of course!

[Photo: our team]

I learned valuable time management, presentation and tech tricks. I’d even try to apply for a job if the office weren’t so far away. I need another hackathon. An addiction? Maybe…
