Some candidates, generally those with limited industry experience, make a lot of assumptions about training data. I have seen candidates skip this topic entirely and remain unable to discuss it even after several hints. A few candidates don't realise that working with real-world data (from collection, to cleaning, labelling, etc.) is very different from downloading a curated dataset from Kaggle. It's very important to take some time to discuss this topic and not jump straight into the feature engineering step (or worse, the modelling step).
Here are some of the important topics of discussion in this step:
- Training data collection.
- Labelling techniques.
- Details of training data.
Positive signals in this step include a clear understanding of training data collection: the candidate describes the various techniques involved and lists their pros and cons. The candidate should also proactively discuss the real-world setup of the problem, which may include real-time data collection from the application, and propose solutions to handle it, giving details where needed. Discussing various approaches to labelling data is vital, and explaining the advantages and disadvantages of each approach separates senior candidates from inexperienced ones. Finally, the candidate should justify their choices of training data collection, labelling, etc. against the problem constraints, showcasing a clear grasp of the problem space.
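As a concrete illustration of one labelling approach a candidate might bring up, here is a minimal sketch of programmatic (weak) labelling: several heuristic labelling functions vote on each example, and a simple majority vote produces a noisy label. All function names, heuristics, and label values here are illustrative assumptions, not from any specific library.

```python
from collections import Counter

# Illustrative label values for a hypothetical complaint-detection task.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_mentions_refund(text):
    # Heuristic: mentions of "refund" often signal a complaint.
    return POSITIVE if "refund" in text.lower() else ABSTAIN

def lf_mentions_thanks(text):
    # Heuristic: expressions of thanks usually signal a non-complaint.
    return NEGATIVE if "thank" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    # Heuristic: repeated exclamation marks suggest frustration.
    return POSITIVE if text.count("!") >= 2 else ABSTAIN

LABELLING_FUNCTIONS = [lf_mentions_refund, lf_mentions_thanks, lf_many_exclamations]

def weak_label(text):
    """Majority vote over labelling-function outputs, ignoring abstentions."""
    votes = [lf(text) for lf in LABELLING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # no heuristic fired; leave the example unlabelled
    return Counter(votes).most_common(1)[0][0]

texts = [
    "I want a refund now!! This is unacceptable!!",
    "Thank you for the quick reply.",
    "Where is my order?",
]
print([weak_label(t) for t in texts])  # → [1, 0, -1]
```

The trade-off a strong candidate would call out: this scales cheaply to millions of examples but produces noisy labels, whereas manual annotation is accurate but slow and expensive; in practice the two are often combined.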