refactor: replace pandas apply with more efficient means (#655) #684
keyanyang wants to merge 1 commit into rasbt:master
Hello @keyanyang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-05-01 01:49:14 UTC
Thanks a lot for your PR! Just to get an idea of how much the efficiency has improved, I just ran the code against some benchmark datasets from the website http://fimi.uantwerpen.be/data/ that @dbarbier shared via #646 (a related PR that I really want to revisit at some point; sorry for the delay @dbarbier, it has been a very stressful semester). I ran it as
The code snippets for loading the unzipped datasets are

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder

    # Whitespace-separated transaction file without a header row
    df = pd.read_csv('chess.dat.txt', sep=r'\s+', header=None)
    te = TransactionEncoder()
    te_ary = te.fit(df.values).transform(df.values)
    df = pd.DataFrame(te_ary, columns=te.columns_)
    df.head()

and

    df = pd.read_csv('pumsb.dat.txt', sep=r'\s+', header=None)
    df = df.iloc[:20, :20]  # keep only the first 20 rows and 20 columns
    te = TransactionEncoder()
    te_ary = te.fit(df.values).transform(df.values)
    df = pd.DataFrame(te_ary, columns=te.columns_)

It looks like the runtime wasn't really affected by the change from apply to vectorize+map. I guess that's because the bottleneck is more in the combination generation section. I propose to leave this PR open for now and apply these changes to #646 later, which is a big overhaul of the apriori implementation by @dbarbier. I am hoping to find some hours of uninterrupted time to take a closer look at #646 this summer. In the meantime, if you are interested, it would be great to get some additional feedback (and another pair of eyes) on #646 :)
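To illustrate why combination generation could dominate the runtime, here is a minimal sketch that counts how many candidate itemsets of each size exist before any support-based pruning. The item count of 20 is a hypothetical value chosen for illustration, and the code uses only the standard library rather than mlxtend's actual candidate generation.

```python
from itertools import combinations

n_items = 20  # hypothetical number of distinct items, not taken from the datasets
items = range(n_items)

# Count the size-k candidate itemsets that exist before any pruning is applied.
candidate_counts = {
    k: sum(1 for _ in combinations(items, k)) for k in range(1, 5)
}

for k, count in candidate_counts.items():
    print(f"k={k}: {count} candidate itemsets")
```

The counts grow from 20 single items to 4845 four-item sets, so any per-row speedup in post-processing is likely to be small next to the cost of enumerating and checking candidates.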
Description
Goal: Replacing pandas' .apply wherever possible.
What I did:
Checked all functions that use .apply. Although where/select cannot be leveraged anywhere, NumPy's vectorize could speed up these functions considerably.
Improved the runtime performance of the apriori function and its tests by replacing pandas' .apply with NumPy's vectorize; the data (see below) show that vectorize is faster than apply for the current use.
Improved the efficiency of the generate_itemsets function by replacing Python lists with NumPy arrays and iterative division with array division.
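The two kinds of replacement described above can be sketched as follows. This is an illustration under assumptions, not the code from this PR: the itemsets, counts, and variable names are hypothetical, and only `Series.apply`, `np.vectorize`, and NumPy array division are taken from the description.

```python
import numpy as np
import pandas as pd

# Hypothetical itemsets column, similar in shape to what apriori returns.
itemsets = pd.Series(
    [frozenset({"a"}), frozenset({"a", "b"}), frozenset({"a", "b", "c"})]
)

# Before: element-wise call through pandas.
lengths_apply = itemsets.apply(len)

# After: the same element-wise call through np.vectorize on the raw object array.
lengths_vectorized = np.vectorize(len)(itemsets.values)

# Before: dividing each count in a Python loop (hypothetical counts).
counts = [3, 2, 1]
n_transactions = 4
support_loop = [c / n_transactions for c in counts]

# After: a single array division.
support_array = np.array(counts) / n_transactions
```

Note that NumPy's documentation describes `np.vectorize` as a convenience wrapper that is implemented essentially as a loop, so any gain comes mainly from avoiding pandas overhead. Without an explicit `otypes` argument it also needs at least one element to infer the output type.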
Related issues or pull requests
Refactor #655
Pull Request Checklist
./docs/sources/CHANGELOG.md file (if applicable)
./mlxtend/*/tests directories (if applicable)
mlxtend/docs/sources/ (if applicable)
PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
flake8 ./mlxtend
Performance comparison
Data Preparation
len
set
frozenset to str
colname map
cast each element to stringified frozenset
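A comparison of this kind could be set up roughly as below, using the column-name mapping as the example operation. This is a sketch under assumptions: the data, the `mapping` dictionary, and the `map_colnames` helper are hypothetical stand-ins, and the measurements behind the headings above are not reproduced here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in data: itemsets stored as frozensets of column indices.
itemsets = pd.Series(
    [frozenset({0}), frozenset({0, 1}), frozenset({1, 2})] * 1000
)
mapping = {0: "apple", 1: "bread", 2: "milk"}  # hypothetical index -> column name


def map_colnames(itemset):
    """Translate a frozenset of column indices into a frozenset of names."""
    return frozenset(mapping[i] for i in itemset)


# otypes=[object] keeps each result as a Python frozenset in an object array.
vectorized_map = np.vectorize(map_colnames, otypes=[object])

via_apply = itemsets.apply(map_colnames)
via_vectorize = vectorized_map(itemsets.values)

# Time both variants; absolute numbers depend on the machine and library versions.
t_apply = timeit.timeit(lambda: itemsets.apply(map_colnames), number=5)
t_vectorize = timeit.timeit(lambda: vectorized_map(itemsets.values), number=5)
print(f"apply: {t_apply:.4f}s, vectorize: {t_vectorize:.4f}s")
```

Checking that both variants return the same values before timing them guards against comparing two different computations.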