[FRIAM] Sorting Algorithm? AI? Identifying "types" within data

Eric Charles eric.phillip.charles at gmail.com
Mon Jan 9 11:19:15 EST 2023


From what I can tell, "one-hot encoding" is just another term for dummy
coding the data, i.e., turning it into a bunch of 1/0 columns. H2O seems more
promising, but it seems to require a backbone of quantitative data that you
can substitute (based on something akin to a regression) for the
categorical variables.
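
(For concreteness, here is a minimal sketch of that dummy coding in Python
with pandas; the column and category names are made up for illustration:)

import pandas as pd

# Toy frame with one categorical column
df = pd.DataFrame({"event": ["hired", "promoted", "left", "promoted"]})

# get_dummies expands the column into one 1/0 indicator column per category
encoded = pd.get_dummies(df, columns=["event"])
print(encoded)  # columns: event_hired, event_left, event_promoted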

It seems weird to me that this doesn't exist. It seems like such a basic
thing to want the computer to do for you. The lack of this reinforces my
belief that a ton of AI work is just people who don't understand basic
analytic techniques trying hard not to learn how to do them. If most of the
AI efforts are just giving me the results of a regression, that's not super
exciting when I am comfortable running regressions myself.


<echarles at american.edu>


On Sat, Jan 7, 2023 at 3:30 PM Pieter Steenekamp <pieters at randcontrols.co.za>
wrote:

> One way to handle categorical input data for machine learning is to
> convert it using one-hot encoding - it's not difficult but a bit cumbersome.
> Fortunately there are other options. H2O is a machine learning library
> available in both Python and R that does this conversion "under the hood".
>
> I asked ChatGPT about "machine learning with categorical inputs using
> H2O", she answered:
>
> H2O is a popular open-source library for training machine learning models
> in Python and R. It is designed to be scalable, fast, and easy to use, and
> it includes a number of algorithms for classification, regression, and
> clustering.
>
> H2O can handle categorical variables natively, so you don't need to use
> one-hot encoding or any other special techniques to work with them. When
> you import your data into H2O, you can specify which columns are
> categorical, and H2O will take care of the rest.
>
> Here is an example of how you might use H2O to train a machine learning
> model with categorical inputs in Python:
>
> import h2o
> from h2o.estimators import H2ORandomForestEstimator
>
> # Start (or connect to) a local H2O cluster
> h2o.init()
>
> # Import your data into an H2OFrame
> df = h2o.import_file("path/to/your/data.csv")
>
> # Mark the input column as categorical (a "factor" in H2O terms)
> df["input_variable"] = df["input_variable"].asfactor()
>
> # Train a random forest; one_hot_explicit makes the one-hot encoding explicit
> model = H2ORandomForestEstimator(categorical_encoding="one_hot_explicit")
> model.train(x=["input_variable"], y="output_variable", training_frame=df)
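>
> Once trained, predictions come back as an H2OFrame (a sketch, reusing the
> frame above):
>
> preds = model.predict(df)
>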
> Here is an example of how you might use H2O to train a machine learning
> model with categorical inputs in R:
>
> library(h2o)
>
> # Start (or connect to) a local H2O cluster
> h2o.init()
>
> # Import your data into an H2OFrame
> df <- h2o.importFile("path/to/your/data.csv")
>
> # Mark the input column as categorical (a factor)
> df[, "input_variable"] <- as.factor(df[, "input_variable"])
>
> # Train a random forest on the categorical input_variable column
> model <- h2o.randomForest(x = "input_variable", y = "output_variable",
>                           training_frame = df)
>
> On Sat, 7 Jan 2023 at 17:37, Eric Charles <eric.phillip.charles at gmail.com>
> wrote:
>
>> That's somewhat helpful. Having looked up several of these
>> algorithms (I'm still checking a few), it seems they all take as input some
>> sort of distance measure between the items (analogous to the distance
>> between their coordinates on a Cartesian graph) and then run some sort of
>> distance-minimization procedure. The challenge here is that I don't have
>> anything equivalent to that: the data is primarily categorical.
>>
>> Does anyone on here actually have experience doing that kind of work?
>>
>> It's not that it would be impossible for me to change the categorical
>> data into something more quantitative, but doing so would bake in my
>> assumptions about how the categories should be determined.
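>>
>> (One way to avoid baking numeric assumptions in is a distance that works on
>> categories directly, e.g., Hamming distance, which counts the periods where
>> two careers differ. A minimal sketch, assuming each career is a fixed-length
>> vector of integer-coded event types -- the toy data is invented:)
>>
>> import numpy as np
>> from scipy.spatial.distance import pdist
>> from scipy.cluster.hierarchy import fcluster, linkage
>>
>> # Toy data: 4 careers, 5 periods, event types coded 0..9
>> careers = np.array([
>>     [0, 0, 1, 0, 0],  # mostly "no change"
>>     [0, 1, 2, 3, 1],  # lots of movement
>>     [0, 0, 1, 0, 0],
>>     [0, 1, 2, 3, 2],
>> ])
>>
>> # Pairwise Hamming distances: fraction of periods where two careers differ
>> dists = pdist(careers, metric="hamming")
>>
>> # Average-linkage hierarchical clustering, cut into 2 buckets
>> labels = fcluster(linkage(dists, method="average"), t=2, criterion="maxclust")
>> print(labels)  # e.g. [1 2 1 2]
>>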
>> <echarles at american.edu>
>>
>>
>> On Sat, Jan 7, 2023 at 1:32 AM Pieter Steenekamp <
>> pieters at randcontrols.co.za> wrote:
>>
>>> I asked https://chat.openai.com/chat and here is the conversation:
>>>
>>> *Pieter Steenekamp*
>>> can you suggest a solution for the following problem "I'm hoping someone
>>> here could help out. Let's imagine I had some data where each row was a
>>> person's career. We could list major events every year. For example: 2004
>>> they were hired, 2007 they get a promotion, 2010 they leave for a
>>> different company, 2012 they come back at a higher level, 2015 get a
>>> promotion, then no change until 2022.  Let's say I had data like this for
>>> roughly 2 million people, and that there are around 10 different types of
>>> changes that could happen during any time period (could be yearly,
>>> quarterly, monthly, I can make it how I want).  I was hoping we could ask a
>>> computer to tell us if there were "types of careers" that people had. We
>>> could say "put all these careers into 4 buckets" or "7 buckets" based on
>>> similarity. Then we could look at the piles the computer made and try to
>>> make sense of them. One type might be "company man" for people who tend to
>>> stay in place for 20 or more years, another type could be a "rotator", who
>>> leaves and returns every 3 years or so. Etc. The point is, I want a
>>> computer to make the piles for me, rather than trying to come up with
>>> potential piles a priori. Are there methods for doing this? I know it's a
>>> problem we've *talked* about a lot, but I don't know if there are
>>> solutions. "
>>> *ChatGPT (https://chat.openai.com/chat)*
>>> There are several approaches that you could use to group the careers of
>>> these 2 million people into clusters based on similarity. One approach is
>>> to use unsupervised machine learning techniques, specifically clustering
>>> algorithms, to automatically group the data into clusters. Some popular
>>> clustering algorithms that you could use for this problem include
>>> k-means, hierarchical clustering, and DBSCAN (Density-Based Spatial
>>> Clustering of Applications with Noise). You could also use dimensionality
>>> reduction techniques, such as PCA (Principal Component Analysis) or t-SNE
>>> (t-Distributed Stochastic Neighbor Embedding), to reduce the
>>> dimensionality of the data and then use a clustering algorithm to group
>>> the data into clusters.
>>>
>>> Another approach is to use supervised machine learning techniques, where
>>> you would need to manually label a subset of the data with the cluster
>>> labels, and then use this labeled data to train a classifier to predict
>>> the cluster labels for the remaining data.
>>>
>>> It's worth noting that these approaches will require some feature
>>> engineering: you will need to extract relevant features from the data
>>> that can be used to represent the careers of the people in the dataset.
>>> Once you have identified and extracted these features, you can then use
>>> them as input to the machine learning algorithms to group the data into
>>> clusters.
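>>>
>>> (A minimal sketch of that feature-engineering step; the event names and
>>> data layout are assumptions for illustration, not from the original data:)
>>>
>>> import pandas as pd
>>>
>>> # Toy event log: one row per (person, year, event)
>>> events = pd.DataFrame({
>>>     "person": [1, 1, 2, 2, 2],
>>>     "event": ["hired", "promotion", "hired", "left", "returned"],
>>> })
>>>
>>> # One feature vector per person: counts of each event type
>>> features = pd.crosstab(events["person"], events["event"])
>>> print(features)  # rows are people, columns are event-type counts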
>>> *Pieter Steenekamp*
>>> would it then be possible to identify, for example, whether there were
>>> "types of careers" that people had?
>>> *ChatGPT (https://chat.openai.com/chat)*
>>> Yes, once you have grouped the careers of the 2 million people into
>>> clusters using one of the approaches I mentioned above, you can then
>>> analyze the characteristics of each cluster to identify common patterns and
>>> trends within each cluster. This will allow you to identify "types of
>>> careers" that people have, based on the similarity of their career events.
>>> For example, if one of the clusters is characterized by people who have had
>>> long tenures at a single company, you could label this cluster as "company
>>> man" careers. Similarly, if another cluster is characterized by people who
>>> frequently switch between companies, you could label this cluster as
>>> "rotator" careers. You can then use this information to gain insights into
>>> the types of careers that are represented in the dataset, and potentially
>>> draw conclusions about the factors that may influence career paths and
>>> success.
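>>>
>>> (One way to do that inspection, sketched with invented feature columns and
>>> cluster labels:)
>>>
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Assumed: one feature row per person, plus a cluster label per person
>>> features = pd.DataFrame({"promotions": [3, 0, 2, 1], "moves": [0, 4, 1, 3]})
>>> labels = np.array([1, 2, 1, 2])
>>>
>>> # Average event profile per bucket -- the basis for naming each cluster
>>> print(features.groupby(labels).mean())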
>>> *Pieter Steenekamp*
>>> can you propose Python code to do something like this please
>>> *ChatGPT (https://chat.openai.com/chat)*
>>> Sure, here is an example of how you could use the scikit-learn library
>>> in Python to group the careers of the 2 million people into clusters using
>>> the k-means algorithm:
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>>
>>> # Load the preprocessed feature matrix (one row per person)
>>> data = np.loadtxt('career_data.txt')
>>>
>>> # Create the k-means model with 4 clusters
>>> model = KMeans(n_clusters=4)
>>>
>>> # Fit the model to the data and assign each career to a cluster
>>> model.fit(data)
>>> labels = model.predict(data)
>>>
>>> # Print the cluster label for each person
>>> print(labels)
>>> This code will group the data into 4 clusters, as specified by the
>>> n_clusters parameter. You can adjust this parameter to change the number of
>>> clusters that the data is grouped into.
>>> To extract features from the data that can be used as input to the
>>> k-means model, you will need to preprocess the data and select relevant
>>> features. This will likely involve some feature engineering, which will
>>> depend on the specific characteristics of the data and the goals of your
>>> analysis.
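>>>
>>> (To pick the number of buckets empirically rather than a priori, one
>>> common approach -- not in the original reply -- is to scan several values
>>> of k and compare silhouette scores, reusing data and KMeans from the
>>> block above:)
>>>
>>> from sklearn.metrics import silhouette_score
>>>
>>> # Higher silhouette score means tighter, better-separated clusters
>>> for k in range(2, 9):
>>>     labels = KMeans(n_clusters=k).fit_predict(data)
>>>     print(k, silhouette_score(data, labels))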
>>> I hope this helps! Let me know if you have any questions.
>>>
>>> On Fri, 6 Jan 2023 at 19:34, Eric Charles <
>>> eric.phillip.charles at gmail.com> wrote:
>>>
>>>> Greetings all,
>>>> I'm hoping someone here could help out. Let's imagine I had some data
>>>> where each row was a person's career. We could list major events every
>>>> year.
>>>>
>>>> For example: 2004 they were hired, 2007 they get a promotion, 2010
>>>> they leave for a different company, 2012 they come back at a higher level,
>>>> 2015 get a promotion, then no change until 2022.
>>>>
>>>> Let's say I had data like this for roughly 2 million people, and that
>>>> there are around 10 different types of changes that could happen during any
>>>> time period (could be yearly, quarterly, monthly, I can make it how I
>>>> want).
>>>>
>>>> I was hoping we could ask a computer to tell us if there were "types of
>>>> careers" that people had. We could say "put all these careers into 4
>>>> buckets" or "7 buckets" based on similarity. Then we could look at the
>>>> piles the computer made and try to make sense of them.
>>>>
>>>> One type might be "company man" for people who tend to stay in place
>>>> for 20 or more years, another type could be a "rotator", who leaves and
>>>> returns every 3 years or so. Etc. The point is, I want a computer to make
>>>> the piles for me, rather than trying to come up with potential piles a
>>>> priori.
>>>>
>>>> Are there methods for doing this? I know it's a problem we've *talked*
>>>> about a lot, but I don't know if there are solutions.
>>>>
>>>> Any help would be appreciated.
>>>>
>>>> Best,
>>>> Eric
>>>>
>>>> <echarles at american.edu>
> -. --- - / ...- .- .-.. .. -.. / -- --- .-. ... . / -.-. --- -.. .
> FRIAM Applied Complexity Group listserv
> Fridays 9a-12p Friday St. Johns Cafe   /   Thursdays 9a-12p Zoom
> https://bit.ly/virtualfriam
> to (un)subscribe http://redfish.com/mailman/listinfo/friam_redfish.com
> FRIAM-COMIC http://friam-comic.blogspot.com/
> archives:  5/2017 thru present
> https://redfish.com/pipermail/friam_redfish.com/
>   1/2003 thru 6/2021  http://friam.383.s1.nabble.com/