WIP: add convenience functions for selecting and sorting #26

Open
wants to merge 5 commits into
base: master

@keesterbrugge keesterbrugge commented Jan 28, 2019

Hi @sbelak,

Thanks for this library! I've been getting a lot of mileage out of it. Here are some functions I use that might be a nice addition to your library:

  • select-cols-regex: selects columns using a regular expression.
  • compare-by: returns a comparator that makes it easy to sort in ascending or descending order per keyword. Unlike the default comparator, it always sorts nil values last.

Cheers!

@keesterbrugge keesterbrugge changed the title add function that selects columns based on regex add convenience functions for selecting and sorting Jan 28, 2019
@keesterbrugge keesterbrugge changed the title add convenience functions for selecting and sorting WIP: add convenience functions for selecting and sorting Jan 28, 2019
@keesterbrugge
Author

I've added the function filter-cols. Say I want to filter columns based on some predicate, such as a regular expression, membership in a collection of keywords, or some composition of these predicates. In that case it would be nice to have a function like filter-cols, as it simplifies the code as follows:
From

```clojure
(->> df
     ;; change collection of columns in some way, e.g. using (derive-cols ...)
     (#(select-cols (filter pred' (cols %)) %)))
```

to

```clojure
(->> df
     ;; change collection of columns in some way, e.g. using (derive-cols ...)
     (filter-cols pred'))
```

I've also reimplemented select-cols-regex using this new function.
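For readers without the diff at hand, here is a minimal sketch of how such a filter-cols could look. It is not the code from this PR: `cols` and `select-cols` below are simplified stand-ins for the library's helpers (assumed to work on a seq of maps, with `select-cols` taking the columns first), and the sample `df` is made up for illustration.

```clojure
;; Simplified stand-ins for the library's helpers (assumptions, not the real code).
(defn cols
  "Column names, taken from the keys of the first row."
  [df]
  (keys (first df)))

(defn select-cols
  "Keep only the given columns in every row."
  [ks df]
  (map #(select-keys % ks) df))

(defn filter-cols
  "Keep the columns whose name satisfies pred."
  [pred df]
  (select-cols (filter pred (cols df)) df))

(defn select-cols-regex
  "Keep the columns whose name matches the regular expression re."
  [re df]
  (filter-cols #(re-find re (name %)) df))

;; Hypothetical sample data.
(def df [{:col_1 1 :col_2 2 :other 3}
         {:col_1 4 :col_2 5 :other 6}])

(filter-cols #{:other} df)
;; => ({:other 3} {:other 6})

(select-cols-regex #"^col_" df)
;; => ({:col_1 1, :col_2 2} {:col_1 4, :col_2 5})
```

Because a set of keywords is itself a predicate, membership tests and regex tests go through the same function.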

@sbelak
Owner

sbelak commented Jan 28, 2019

Thanks for this. I really like it. A couple of things:

  • I'm not 100% sold on select-cols-regex. It feels very specific (I'm guessing this is for messy data where you have col_name_1...n) and adds minimal convenience over the more generic filter-cols.
  • I think select-cols-by better reflects the semantics, as the fn operates on col names rather than on values like the filter family does.
  • I love the utility of compare-by, but I'm not entirely sold on the signature (though it might be correct). Two things feel like warts: the mandatory :asc/:desc for each comparator and the fact that we need these "magic" tokens. Have you considered using a combinator that flips 1 <-> -1 instead? Then you'd write something like
(sort (compare-by :a (desc :b)) ...)
  • Using partition will probably yield cleaner code than loop.
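One possible reading of the combinator idea, as a rough sketch rather than the PR's implementation. The `Desc` record and the `key-comparator` helper are hypothetical names introduced here. Flipping the sign of the whole comparator would also move nils to the front, so this sketch flips only the comparison of non-nil values, which keeps nils last in both directions.

```clojure
;; Hypothetical marker type: wraps a keyfn that should sort descending.
(defrecord Desc [keyfn])

(defn desc
  "Mark keyfn as sorting in descending order."
  [keyfn]
  (->Desc keyfn))

(defn- key-comparator
  "Comparator on rows for one key spec; nil values always sort last."
  [spec]
  (let [[keyfn sign] (if (instance? Desc spec)
                       [(:keyfn spec) -1]
                       [spec 1])]
    (fn [x y]
      (let [a (keyfn x)
            b (keyfn y)]
        (cond
          (and (nil? a) (nil? b)) 0
          (nil? a) 1
          (nil? b) -1
          :else (* sign (compare a b)))))))

(defn compare-by
  "Comparator that tries each key spec in turn until one of them differs."
  [& specs]
  (let [comparators (map key-comparator specs)]
    (fn [x y]
      (or (first (remove zero? (map #(% x y) comparators)))
          0))))

;; Hypothetical sample data.
(def rows [{:a 1 :b 2} {:a 1 :b nil} {:a 1 :b 5} {:a 0 :b 1} {:a nil :b 9}])

(sort (compare-by :a (desc :b)) rows)
;; => ({:a 0, :b 1} {:a 1, :b 5} {:a 1, :b 2} {:a 1, :b nil} {:a nil, :b 9})
```

Bare keyfns default to ascending, so no :asc token is needed and only the descending keys need wrapping.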

@keesterbrugge
Author

Thanks for your feedback :)

  • I'll remove select-cols-regex.
  • I'll think about how to make compare-by a bit cleaner.

Question: is there any reason you defined the function select-cols instead of using clojure.set/project? Is it because you don't want the return type to be a set?

@sbelak
Owner

sbelak commented Jan 31, 2019

Three reasons, in order of importance:

  • sets break the ordering of the data
  • project can only be used with keywords, while select-cols works with any keyfn
  • while the set functions currently work on non-sets, that's not a guarantee
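The first point is easy to check at the REPL. The snippet below uses only clojure.set/project and core functions; the sample `rows` are made up for illustration.

```clojure
(require '[clojure.set :as set])

;; Hypothetical sample data; note the deliberate row order 2, 1, 2.
(def rows [{:a 2 :b 1} {:a 1 :b 2} {:a 2 :b 3}])

;; project returns a set, so row order is lost and duplicate rows collapse.
(set/project rows [:a])
;; => #{{:a 1} {:a 2}}   (printed element order may vary)

;; Mapping select-keys over the rows keeps both order and duplicates.
(map #(select-keys % [:a]) rows)
;; => ({:a 2} {:a 1} {:a 2})
```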

@keesterbrugge
Author

keesterbrugge commented Feb 5, 2019

That makes sense.

I've added the function derive-cols* to convey how I'd like the derive-cols function to behave. I don't propose to include it as is.

The benefit of derive-cols* over the current derive-cols is that it takes the ordering of the new columns into account, so a newly constructed column can serve as input for the next one. As a consequence, you can write

```clojure
(->> [{:a 1 :b 2} {:a 3 :b 10}]
     (derive-cols* (ordered-map :c [inc :b]
                                :d [inc :c])))
;; => ({:a 1, :b 2, :c 3, :d 4} {:a 3, :b 10, :c 11, :d 12})
```

or

```clojure
(->> [{:a 1 :b 2} {:a 3 :b 10}]
     (derive-cols* [:c [inc :b]
                    :d [inc :c]]))
```

instead of

```clojure
(->> [{:a 1 :b 2} {:a 3 :b 10}]
     (derive-cols {:c [inc :b]})
     (derive-cols {:d [inc :c]}))
```

which becomes a bother when you have a long chain of new column derivations that have dependencies on each other.
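To make the intended behaviour concrete, here is a minimal sketch of the sequential variant. It is not the derive-cols* from this PR: it accepts only the flat vector form, assumes each spec has the shape `[f & input-keys]`, and treats rows as plain maps.

```clojure
(defn derive-cols*
  "Sketch only: add derived columns to every row, in the order given.
   new-cols is a flat vector of column name / [f & input-keys] pairs,
   so a later column can use an earlier one as input."
  [new-cols rows]
  (let [specs (partition 2 new-cols)]
    (map (fn [row]
           (reduce (fn [row [k [f & ks]]]
                     ;; Look up the inputs in the row built so far, then apply f.
                     (assoc row k (apply f (map row ks))))
                   row
                   specs))
         rows)))

(->> [{:a 1 :b 2} {:a 3 :b 10}]
     (derive-cols* [:c [inc :b]
                    :d [inc :c]]))
;; => ({:a 1, :b 2, :c 3, :d 4} {:a 3, :b 10, :c 11, :d 12})
```

Threading each row through reduce is what lets :d see the freshly added :c. A plain hash map would not work as the spec container here, because it gives no ordering guarantee.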

@sbelak What do you think?

I don't know much about clojure.spec yet. I'll make an attempt to implement derive-cols*, select-cols-by and compare-by in a more coherent fashion with respect to the rest of the lib.
