This would help define the sequence of variables to impute or synthesize. Something like this would fit well in other functions:
def most_predictable(df, base_cols, candidate_cols, algorithm):
    """Identifies the most predictable column from a set of base columns.

    Args:
        df: DataFrame with base and candidate columns.
        base_cols: List of column names to predict from.
        candidate_cols: List of column names to compare on predictability
            given base_cols.
        algorithm: Algorithm for determining predictability.

    Returns:
        Column name from candidate_cols which is most predictable from
        base_cols.
    """
Predictability could be measured with something simple like correlations, or with algorithms like random forests (after standardizing the data; the standardization technique might be another arg).
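As a rough starting point, here is a minimal sketch of the correlation-style option, assuming pandas and NumPy and numeric columns with no missing values. It scores each candidate by the in-sample R-squared of an ordinary least squares fit on `base_cols` (the squared multiple correlation), which does not depend on scaling, so no standardization step is needed for this variant. The `algorithm="r2"` default and its name are placeholders of mine, not a settled API; a random forest option would be a separate branch.

```python
import numpy as np
import pandas as pd


def most_predictable(df, base_cols, candidate_cols, algorithm="r2"):
    """Sketch: returns the candidate column best explained by base_cols.

    Only the hypothetical algorithm="r2" option is implemented: in-sample
    R-squared from an ordinary least squares fit of each candidate on
    base_cols. Assumes numeric columns without missing values.
    """
    if algorithm != "r2":
        raise ValueError("Only algorithm='r2' is implemented in this sketch.")
    # Design matrix: an intercept column followed by the base columns.
    X = np.column_stack(
        [np.ones(len(df)), df[base_cols].to_numpy(dtype=float)]
    )
    scores = {}
    for col in candidate_cols:
        y = df[col].to_numpy(dtype=float)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        ss_res = ((y - X @ coef) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        # A constant candidate has no variance to explain; score it 0.
        scores[col] = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    # Column with the highest share of variance explained by base_cols.
    return max(scores, key=scores.get)


# Toy data: "near" is an exact linear function of "x"; "noise" is not.
demo = pd.DataFrame({
    "x": np.arange(10.0),
    "near": 3 * np.arange(10.0) + 1,
    "noise": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
})
print(most_predictable(demo, ["x"], ["near", "noise"]))  # prints "near"
```

In-sample R-squared tends to favor whatever a linear fit can capture, so an out-of-sample score (for example from cross-validation) may be a better basis once nonlinear algorithms are added.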
cc @rickecon, per our chat if you can take a stab at this that'd be awesome.