Conversation
|
Could we please use the … |
|
|
|
There are different runs of the same algorithm with different parameter settings. How do we want to handle that? |
|
I've made a different algorithm for each parameterisation now. Not sure if this is what you had in mind when proposing the format change though? |
|
Yes, I did it in a similar way in the … I have not given each algorithm an id, but only each configuration. |
|
Hmm, that seems a bit weird. So this metainfo thing is only there to get shorter names? It seems like it should make it possible to group algorithms with different configurations. |
why?
This was my initial motivation.
I don't really understand what you want to say. |
|
It seems weird to allow only one configuration per algorithm. If I see a "configuration" field I would expect to be allowed to have more than one. I guess "call" would be less ambiguous. But if it's ok with everybody else let's leave it this way. |
|
Apart from that I think that this is ready to be merged. |
|
@larskotthoff Could I convince you to drop the … In general, I miss a … Furthermore, the status of all algorithm runs is ok, but some runs have an accuracy of 0.0. |
|
It would also be great if the readme explained why we have missing feature values. |
|
Please note that I fixed two further issues in the … |
|
@joaquinvanschoren |
|
Yes, all datasets share the same meta-features.
I believe there are some cases where you have NaN when a division by zero
happens, or a '-1' where something cannot be computed (e.g. mean number of
nominal categories when the data is purely numeric). I think that, in both
cases, this only happens for classification datasets with only numeric
features.
Cheers,
Joaquin
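The two missing-value conventions described above (NaN from a division by zero, a -1 sentinel for values that cannot be computed) could be normalized before use. A minimal sketch, with hypothetical feature names:

```python
import math

# Hypothetical meta-feature vector for one dataset. Per the discussion,
# NaN marks a division by zero and -1 marks a value that could not be
# computed (e.g. mean number of nominal categories on purely numeric data).
features = {
    "num_instances": 8124.0,
    "mean_nominal_categories": -1.0,
    "class_entropy": float("nan"),
}

def normalize(vector):
    """Map both missing-value conventions (NaN and the -1 sentinel) to None."""
    return {name: None if value == -1.0 or math.isnan(value) else value
            for name, value in vector.items()}

clean = normalize(features)
print(clean)
```

Note that -1 may be a legitimate value for some features, so a real pass would need a per-feature list of which columns actually use the sentinel.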
On Thu, Jan 12, 2017 at 10:43 AM Marius Lindauer wrote:
@joaquinvanschoren <https://github.com/joaquinvanschoren>
Is it correct that two datasets have exactly the same meta-feature vector?
--
Thank you,
Joaquin
|
I meant that X24_mushroom and X809_mushroom have exactly the same vector. So, we cannot discriminate these two. Another question: The features seem to have native feature groups, e.g., CfsSubsetEval, DecisionStump, Hoeffding, J48, NaiveBayes, REPTree, RandomTree, kNN1. I think it would improve the quality of the scenario if we modelled these feature groups properly. |
|
Oh, that is indeed a duplicate. Sorry about that. Are there more?
Cheers,
Joaquin
On Thu, Jan 12, 2017 at 11:11 AM Marius Lindauer wrote:
Yes, all datasets share the same meta-features.
I meant that X24_mushroom and X809_mushroom have exactly the same vector.
So, we cannot discriminate these two.
Another question: The features seem to have native feature groups,
e.g., CfsSubsetEval, DecisionStump, Hoeffding, J48, NaiveBayes, REPTree,
RandomTree, kNN1.
I think it would improve the quality of the scenario, if we would model
these feature groups properly.
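The grouping suggested above could be modelled by keying each feature on its name prefix. A sketch with hypothetical feature names, assuming each meta-feature is prefixed with the landmarker or evaluator that produces it:

```python
from collections import defaultdict

# Hypothetical feature names; the assumption is that each meta-feature
# name starts with the component that computes it (J48, NaiveBayes, ...).
feature_names = [
    "J48.AUC", "J48.ErrRate",
    "NaiveBayes.AUC", "NaiveBayes.ErrRate",
    "REPTree.AUC",
]

feature_groups = defaultdict(list)
for name in feature_names:
    prefix = name.split(".", 1)[0]  # group key, e.g. "J48"
    feature_groups[prefix].append(name)

print(dict(feature_groups))
```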
|
Not as far as I know. |
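Further duplicates could be found with a quick check like the following sketch (toy data; the real scenario stores its meta-features differently):

```python
from collections import defaultdict

def duplicate_vectors(meta_features):
    """Return groups of dataset names sharing an identical meta-feature vector."""
    by_vector = defaultdict(list)
    for name, vector in meta_features.items():
        by_vector[tuple(vector)].append(name)
    return [names for names in by_vector.values() if len(names) > 1]

# Toy data: the two mushroom variants share a vector, the third differs.
meta = {
    "X24_mushroom": [8124.0, 22.0, 2.0],
    "X809_mushroom": [8124.0, 22.0, 2.0],
    "X3_other": [3196.0, 36.0, 2.0],
}
print(duplicate_vectors(meta))  # → [['X24_mushroom', 'X809_mushroom']]
```

One caveat: vectors containing NaN would need special handling, since NaN compares unequal to itself.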
|
I've shortened the names and added a readme. I don't see your point about the 0 accuracy values -- this is a valid number for accuracy and doesn't necessarily indicate an error. Regarding feature groups: As we don't have feature costs (and don't care about feature costs) I don't think that grouping them differently will make any difference. |
Thanks!
I would say that being worse than random is already problematic, but always being wrong is weird.
Not for your tools, but for mine. ;-) |
|
Ok, feel free to change the feature groups. |
|
Accuracy 0 is indeed weird. Does it happen often?
On Thu, 12 Jan 2017 at 18:39, Lars Kotthoff wrote:
Ok, feel free to change the feature groups.
|
1187 times |
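That count could be reproduced with a short script instead of grep. A sketch with an assumed column layout (not the actual ASlib file format):

```python
import csv, io

# Toy stand-in for the scenario's run data; column names and order here
# are assumptions for illustration, not the real file's layout.
runs_csv = """instance,algorithm,status,accuracy
d1,a1,ok,0.93
d2,a1,ok,0.0
d2,a2,ok,0.0
d3,a2,timeout,0.0
"""

zero_ok_runs = sum(
    1 for row in csv.DictReader(io.StringIO(runs_csv))
    if row["status"] == "ok" and float(row["accuracy"]) == 0.0
)
print(zero_ok_runs)  # → 2
```

Filtering on the status column avoids counting runs whose 0.0 is just a placeholder for a crashed or timed-out run.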
|
I reduced the number of feature groups. |
Using a grep on the original ASLib: Runs_OpenML.csv: |
|
ah, this was a different dataset? |
|
maybe I missed a concrete question: |