High-cardinality string categorical variables encoding

PROJECT TITLE: Encoding high-cardinality string categorical variables

ABSTRACT: Statistical models typically require vector representations of categorical variables, such as one-hot encoding. This strategy becomes impractical as the number of categories grows, because it produces very high-dimensional feature vectors; moreover, one-hot representations of string entries discard all morphological information. We therefore seek to encode high-cardinality string categorical variables in a small number of dimensions. Ideally, these encodings should scale to a large number of categories, carry clear meaning for end users, and ease statistical analysis. We present two encoding approaches that operate on substrings: a Gamma-Poisson matrix factorization on substring counts, and a min-hash encoder, which gives a fast approximation of string similarities. We show that min-hash turns hard-to-learn set inclusions into simpler inequality relations. Both approaches scale well and can be computed in a streaming fashion. Experiments on real and simulated data show that these methods improve supervised learning with high-cardinality categorical variables. Our recommendation: if scalability is essential, the min-hash encoder is the best choice, as it requires no data fit; if interpretability is essential, the Gamma-Poisson factorization is the best alternative, as it can be interpreted as a one-hot encoding on inferred categories with informative feature names. Both models remove the need for feature engineering or data cleaning, enabling AutoML on string entries.
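To make the min-hash idea concrete, here is a minimal, self-contained sketch (not the paper's implementation): each string is reduced to its set of character n-grams, and each output dimension is the minimum of a seeded hash taken over that set. The function names (`char_ngrams`, `minhash_encode`), the use of `zlib.crc32` as the hash family, and the default of 3-grams with 8 dimensions are illustrative assumptions, not choices from the paper.

```python
import zlib

def char_ngrams(s, n=3):
    """Set of character n-grams of a string (the substring set min-hash works on)."""
    s = f" {s} "  # pad with spaces so short strings still yield n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash_encode(s, dim=8, n=3):
    """Map a string to `dim` min-hash values over its character n-grams.

    Each dimension d takes the minimum of a seeded hash (here crc32 with a
    per-dimension prefix, an illustrative choice) over the n-gram set, so
    strings with large n-gram overlap receive close encodings.
    """
    grams = char_ngrams(s, n)
    return [
        min(zlib.crc32(f"{d}:{g}".encode()) for g in grams)
        for d in range(dim)
    ]
```

This also exhibits the set-inclusion-to-inequality property mentioned in the abstract: when the n-gram set of one string contains that of another (e.g. "senior engineer" contains all 3-grams of "engineer"), every min-hash coordinate of the longer string is less than or equal to the corresponding coordinate of the shorter one, since the minimum is taken over a superset.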