LinkedIn Data Science Interview Questions
I recently interviewed for a research engineer (vision) role at LinkedIn. In this role the candidate is expected to work on state-of-the-art computer vision algorithms to understand users and content on the platform. In this post, I’ll summarize the questions and the whole interview process.
Getting A Call:
I got a call from LinkedIn recruiter taking my background and my preference.
Assignment Round:
They gave a list of datasets from different domains like Songs DB, NLP, CV and asked me to build a report out of it. It was an open-ended question where I had to first explore the data and formulate a task and define evaluation metrics, provide baselines and then train a network. I choose a Caltech-101 dataset and submitted my report.
Preliminary Round:
In this round, they asked me couple of questions on data structures and evaluation metrics of classification problem.
The first question was to implement a data structure which has average time complexity of O(1) for inserting an element, deleting an element and randomly deleting an element.
// "O(1)" time :
// - void add(T val)
// - void remove(T val)
// - T removeRandomElement()
This question has multiple variants one with duplicates and one without duplicates.
Then, we started a discussion over evaluation metrics for classification problems. What are the common metrics to evaluate a classifier? Accuracy, Precision, Recall, F1-Score. How is precision and recall calculated? When do we use which metrics? Then they asked me the second question:
A spam classifier dataset has 99% non spam and 1% spam samples. For spam records, the model is 99% correct and for non-spam the model is 99% correct. What is the precision and recall of the model? (A: Precision: 0.5, Recall: 0.99)
On-site Rounds:
1. Data Coding Round
In this round they asked my couple of questions on probability and statistics and machine learning.
First question: there is a die with n sides. The probability of each side of the die is p_i
. Write a function to generate m rolls from such die.
A naive approach to this problem will be to calculate the CDF of the probability and then generate a uniform random number [0,1). Since CDF is increasing function from [0,1]. Find the last index i such that uniform random number ≤ CDF_i.
For example, let’s say n = 3, p1 = 0.2, p2=0.5, p3=0.3. Then CDF1 = 0.2, CDF2 = 0.7, CDF3 = 1.0. If random number is between [0, 0.2) random roll is 2. If random number if between [0.2, 0.7) the random roll is 2 and so on. Since, we needed to generate m such random rolls. The complexity of the algorithm will be O(nm). An optimization over this is to calculate CDF array once, in O(n) and then use binary search to find the roll index in O(log(n)). In that case the complexity will be O(n+mlogn)
The second question was to implement a decision tree algorithm.
Design a basic decision tree algorithm. Focus on termination criteria and OOP design. Assume boolean features and boolean label variable.
In this question, the interviewer expected a complete, neat implementation of the decision tree algorithm. I started with creating a class for decision tree with fit
and predict
methods. The input X is mxn boolean matrix and input y is m size boolean vector. Since input feature were boolean, you don't need to search for threshold to split the variable. Any variable would have been split at single location 0.5 i.e. true or false. The interviewer was interested in modular implementation and breaking the large problem into multiple sub problems and correct implementation.
2. Data Mining Product Design Round
This round was more on how you translate business problem to a machine problem and design a complete solution around it.
The interview asked me to design a system to recommend users for a new page/company created in LinkedIn. This question required heavy critical thinking.
Data Collection: what kind of data do you have? user data, page data, user connection, existing page followers, content posted by user, content posted by page. User likes and comments over his feed to mine his preference. In this part I went over all the data sources that could have been used to design the system. This part is extremely important because once you go over all the data that could be useful for the problem, then you can use multiple combinations of these data to come up with the approach.
Cold Start Problem: Since we wanted to recommend users for a new page, we had very limited data. For new pages, we don’t have any data of existing users who liked such pages, posts, likes. Thus approaches like collaborative filtering could not be directly used.
Approaches: Came up with multiple approaches based on
- recommendation based on location
- finding similar pages
- users affinity over location, businesses, sectors
- calculating user embedding and pages embedding by large scale neural network factorization using all the data. Then using precomputed entity embedding to solve for cold start problem.
We dived deep into the last approach, where I designed a complete architecture to capture all these information. The process to collect the training and evaluation data
Evaluation Metrics:
How would you evaluate such system locally? : Precision@K, Recall@K, nDCG, mAP.
Deployment and Production Metrics: Discussed about latency constraints and metrics on production: pre-computation of embeddings, widget interaction rate, number of successful page followed, % of successful page followed, etc.
3. Host Manager Round:
The host manager asked me about my previous projects then asked a question on computer vision.
Question: How would you identify near duplicate images? In LinkedIn, people creates fake profiles by using someone else’s image. There images vary from original image with very minimal difference, like low resolution, cropped, changed brightness and contrast, etc. From a dataset of let’s say a million images, you need to give at-least 1000 set of images which you think might be duplicate.
The interviewer asked me to to take some time and come up with solution related for this. I gave multiple solutions based on auto encoder, color histogram matching, and triplet loss.
4. Data Mining Round
In this round the interviewer asked me to estimate the number of trials to get two consecutive head in a fair toss. Let’s say you are tossing a coin until you see two consecutive heads. Then what is the expected number of trials to get two consecutive heads.
Again multiple approach to this problem: one approach is to use monte-carlo simulation to calculate number of trials to get two consecutive heads and take mean of multiple such calculations. Another approach is to formulate the problem in terms of mean equations.
After this, the interviewer focused on general machine learning based knowledge. What is bias and variance tradeoff? Why do you need to create validation set? What are common methods to generate validation set? What are some methods to control overfitting in neural networks? How does L2 regularization helps in controlling overfitting? What is early stopping? What is weight clipping and gradient clipping? What is dropout? How does these methods help to control overfitting? How is dropout implemented?
5. Coding Round
How will you identify the lowest common ancestor of two elements in binary search tree?
Follow up: What if it is a binary tree, not a binary search tree?
Implement a data structure that add, remove and removeRandom on O(1)?
Conclusion
Overall I really enjoyed the interview process. The questions were of good standard and the interviewers were also very helpful and interactive. Drop a comment if you have any questions.