Data splitting
Date created: 2022-09-20
Data splitting is the primary approach to spending data efficiently. The most common form is a train/test split, and 80/20 is a common ratio.
- Most of the time a plain random split is fine. Some exceptions:
  - If there is a large class imbalance (e.g., the outcome is very rare), stratify the split so that the rare class shows up in both the train and test sets in roughly the same proportion.
  - In time-series data you typically use the most recent data as the test set, so the model is evaluated on data that comes after what it was trained on.
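
The three kinds of split above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`random_split`, `stratified_split`, `time_split`), the `label_of` / `time_of` accessors, and the toy `rows` data are all made up for this note. In practice a library routine would usually do this job.

```python
import random
from collections import defaultdict


def random_split(rows, test_frac=0.2, seed=0):
    """Shuffle a copy of rows and hold out the last test_frac as the test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_frac)
    return shuffled[:cut], shuffled[cut:]


def stratified_split(rows, label_of, test_frac=0.2, seed=0):
    """Split each class separately so both sets keep roughly the same class mix."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    train, test = [], []
    for members in by_class.values():
        tr, te = random_split(members, test_frac, seed)
        train.extend(tr)
        test.extend(te)
    return train, test


def time_split(rows, time_of, test_frac=0.2):
    """Sort by time and hold out the most recent test_frac as the test set."""
    ordered = sorted(rows, key=time_of)
    cut = len(ordered) - round(len(ordered) * test_frac)
    return ordered[:cut], ordered[cut:]


# Hypothetical toy data: 100 (timestamp, label) rows, 10% rare positives.
rows = [(t, 1 if t % 10 == 0 else 0) for t in range(100)]

train, test = random_split(rows)
strat_train, strat_test = stratified_split(rows, label_of=lambda r: r[1])
time_train, time_test = time_split(rows, time_of=lambda r: r[0])
```

Note that with very small classes the per-class rounding in `stratified_split` can still leave the test set with zero rows of that class (e.g., 2 rows at 20% rounds to 0), so it is worth checking the class counts after splitting.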