- TODO: figure out what the format is for the rationale tables...

# import::from "loose" scripts

| Thing | Comment |
| ----- | ------- |
| The problem is | You have utilities and stuff that don't feel like they warrant a package, or you don't want people to have to install a package |
| So you | Should use `import::from` |
| But also | You want to be able to keep the utilities up-to-date easily, but also maybe self-contained in a project, so all the code is bundled together |
| So, you also | Combine it with `{pins}`, rather than keeping things in a loose local folder |
| But if instead | You want to keep any of this 'loose' code private |
| You could instead | Store some of this on Teams or your collaboration platform of choice! |

```{r}
library(tidyverse)

the_url <- "https://raw.githubusercontent.com/mjkerrison/..."
# Note: you can find this by navigating to the file in GitHub, clicking "Raw"
# and copying the link. If you navigate via a branch, you get the link for the
# latest version; if you navigate to a particular commit, you get a snapshot.

my_functions <- pins::board_url(
  urls  = c("mk_github" = the_url),
  cache = "R_remote" # Keep it cached *in the project*
)

import::from(
  pins::pin_download(my_functions, "mk_github"),
  .character_only = TRUE,
  .all = TRUE
)
```

I think this doesn't quite feel right yet, but I'm not sure why. Here are some reasons:

- This requires a bit more boilerplate than I would like - ideally this would be very terse and unobtrusive, especially if you wanted to write the URL once and grab multiple files.
  - `board_url()` requires a named character vector, and calls to `pin_download()` need you to specify which pin on the board you're after. So there are potentially some loops involved.
- It ends up creating a cache of versions - this should be handled by local project version control. I don't need each *version of the project* to also contain all *versions of this code/snippet*.
  - This additionally kinda restricts you to one of two workarounds (below) to avoid sourcing *every version* of the .R files in the cache:
    - 1: Store it outside of your big R folder, and use the {pins} interface to retrieve the latest version.
    - 2: Store it inside your big R folder and... I guess still use the {pins} interface to make sure you end up loading the most recent version *last* - making things like `targets::tar_source()` calls to load-all-R-functions more volatile...

Yeah - honestly I think the better option would be to just try to download the file every time:

```
# Attempt 1
glue::glue(
  "https://raw.githubusercontent.com/mjkerrison/toolbox.mjk/main/R/{scripts}",
  scripts = c(
    "fetch_from_teams.R",
    "utilities_for_dates.R"
  )
) |>
  (\(x){

    download.file(
      url      = x,
      destfile = glue::glue("R/{basename(x)}"),
      method   = "libcurl" # vectorised url/destfile needs the libcurl method
    )

  })()


# Attempt 2
c("fetch_from_teams.R", "utilities_for_dates.R") |>
  purrr::walk(\(script_i){

    usethis::use_github_file(
      "mjkerrison/toolbox.mjk",
      path    = paste0("R/", script_i),
      save_as = paste0("R/", script_i)
    )

  })
```
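And then, whichever download approach wins, the `import::from()` step can just loop over the same file list - a minimal sketch (not from the original notes), kept at the top level since `import::from()` works on the search path:

```
# Sketch only: assumes the scripts above have already landed in R/
for(script_i in c("fetch_from_teams.R", "utilities_for_dates.R")){

  import::from(
    file.path("R", script_i),
    .character_only = TRUE,
    .all = TRUE
  )

}
```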
FAQs:

- ...

# *Method*-oriented analytics

## What am I on about

- {targets} is phenomenal for managing complex pipelines
- But it feels like the DAG could be more valuable than it often is - it feels like one of the tricks here is abstraction, but how do you know what to abstract?
- Something that contributes to this is the profusion of objects and functions - and this isn't just a {targets} problem, it's a complex (or at least long/large) data project problem
  - For example, going the very functional route, you end up with a lot of functions that take up to a dozen existing objects. They're not actually reusable, atomic functions - they're just more conventional scripts wrapped in a function.
  - This is still way better than the alternative for a whole host of reasons...
- I think this intersects with common headaches in complex/large Shiny apps as well: lots of objects that are all somewhat different flavours of the same thing.
- And I think that provides a clue as to how we might solve it.
- An alternative approach would be to have fewer objects, and push the snippets of code that summarise them into the downstream consumers of those objects. So rather than having `object_x` and `object_x_summarised`, you only pass around `object_x` and maybe have a function called `summarise_object_x()`.
  - Cons: if that transformation is very compute-intensive, you may actually want to individuate those objects. Not really a con - more a decision criterion.
- But then you have a lot of random functions that are totally tied to specific objects, and you package in a dependency that {targets} will handle fine but that might be nice to clean up.
  - I think this is just semantics, but I think semantics matters.
- Like the ideal would be to have the function "live" with the object - maybe you put all those functions in the same script for quick reference as well.
- See where this is going? Enter object-oriented programming - but really, the point here is *methods* more than anything else. We probably won't actually be worried about complicated inheritance or composition or anything like that - and we'll probably end up violating some [SOLID](https://en.wikipedia.org/wiki/SOLID) principles to boot.

## Bottom line

- Implement a very generic 'analysis object' class
- New analysis objects get wrapped in this class
- "Reusable" code snippets - transformations you do regularly across the specific project - become additional methods for that instantiation of the class, meaning they can be better semantically tied together in a way that reflects the author's intent
- This gets passed around, abstracting the DAG and thus making it more meaningful
- Functions for combining objects and doing analysis further downstream become easier to write: they take the class-wrapped object and can just invoke the methods they need (see the sketch after this list)
- We make the trade-off of doing more computing but saving on memory - especially within {targets}, this feels like a win
- So really this is a glorified list containing the core data and then common transformations/manipulations, but with convenient reference semantics.
- And overall, this is an attempt to be ==minimally idiomatic== - this lets us manage stuff like namespacing in a simpler, easier-for-newcomers-to-read, and *more robust* way than, e.g., prefixing all names (like `object_x_summarise_by...()` or similar)
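To make that concrete, a rough sketch of a downstream function - the function name and its second argument are hypothetical; `average_by_cyl()` is the method defined in the example further down:

```{r}
# Hypothetical downstream step: take class-wrapped objects and invoke whatever
# methods are needed - the {targets} DAG only has to track the wrapped objects.
compare_mpg_by_cyl <- function(mtcars_analysis, other_analysis){

  mtcars_analysis$average_by_cyl() |>
    dplyr::select(cyl, mpg) |>
    dplyr::left_join(
      other_analysis$average_by_cyl(), # same interface, different underlying data
      by = "cyl",
      suffix = c("_mtcars", "_other")
    )

}
```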
## Still working through

- This would require basically one class per analysis object. That's still net fewer things being created than just using functions - because you still have many functions per object anyway - but it does feel a little silly.
- (In R6, only class methods - not member functions - have reference semantics: [R6 docs](https://r6.r-lib.org/articles/Introduction.html#class-methods-vs--member-functions))
- This should probably be done carefully to avoid unnecessary rebuilds of objects in {targets} because you've gone back and added more methods to your analysis object.
  - So I think that means you can't have the ETL process actually packaged into instantiation with `new()`
- What does it actually look like within {targets}?

## Other relevant concepts?

- A more... monadic approach? Where instead of a class you have an outer function that you partial/curry using the raw data...
  - I didn't think this worked??? I swear the stack of enclosing environments is "wrong" or something - "I feel betrayed that it just remembers" (see the note after the chunk below)
  - And this would basically work in Python!

```{r}
raw_data <- mtcars

my_pseudo_class <- function(data, method_to_use){

  if(method_to_use == "average"){
    inner_fn <- \(x) mean(data$mpg)
  }

  return(inner_fn)

}

mtcars_average_mpg <- my_pseudo_class(raw_data, "average")

mtcars_average_mpg()

# Run without creating new objects:
my_pseudo_class(raw_data, "average")()

# With currying/partials:
processed_data <- purrr::partial(my_pseudo_class, data = raw_data)

processed_data("average")()

# Testing mutability?
raw_data <- iris

mtcars_average_mpg() # Still works!

# Other things like re-defining mtcars_average_mpg() obviously wouldn't
# (you'd pass raw_data in again.)
```
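Why it "just remembers": lazy evaluation. `data` is a promise that only gets forced the first time `inner_fn` uses it - here, the first `mtcars_average_mpg()` call - and from then on the value is fixed in `my_pseudo_class()`'s execution environment, out of reach of later reassignments of `raw_data`. The flip side is that if you *hadn't* called it before `raw_data <- iris`, the first call would have picked up `iris` instead. A `force()` at creation time pins this down - a minimal sketch of the same function with that one tweak:

```{r}
my_pseudo_class_forced <- function(data, method_to_use){

  force(data) # evaluate the promise now, not at the first use inside inner_fn

  if(method_to_use == "average"){
    inner_fn <- \(x) mean(data$mpg)
  }

  return(inner_fn)

}
```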
## Example

```{r}
targets::tar_dir({

  targets::tar_script(code = {

    library(targets)
    library(tarchetypes)
    library(tidyverse)
    library(R6)

    # Define classes ==========================================================

    # Prototypical object - we want to be able to go back and add methods to
    # analysis objects freely, so every analysis-object class will need to take
    # post-ETL data in its initialisation (as opposed to doing ETL *as* the
    # initialisation) - so let's just inherit that:

    analysis_object <- R6Class(

      "analysis_object",

      # Now, the lock_class and lock_objects options don't quite behave as you
      # might expect on a quick read of the docs: while they can lock down
      # modifying the class, and adding/removing *members* of instantiations of
      # the class, they don't change the *mutability* of instantiations of the
      # class - so you could still inadvertently modify the public members,
      # e.g. the data.

      # So to achieve this, we'll:
      # - make 'data' a private member so it can't be accessed directly
      # - create a method to return data

      public = list(

        initialize = function(data_to_use){
          private$data <- data_to_use
        },

        get_data = function(){return(private$data)}

      ),

      private = list(
        data = NULL
      )

      # By default, the class environment is locked such that you can't add or
      # modify bindings - so we don't need to worry about someone accidentally
      # doing something like
      #   class_object$new_thing <- ...
      # or
      #   class_object$get_data <- some_other_fn

    )

    # And then the "applied" version of the class: inherit the basics, but then
    # all the methods are bespoke to the analysis object itself. Overall, fairly
    # minimal boilerplate - not too much worse than the conventional functional
    # approach.

    mtcars_wrapper <- R6Class(

      "mtcars_wrapper",

      inherit = analysis_object,

      public = list(

        average_by_cyl = function(){
          private$data |>
            group_by(cyl) |>
            summarise(across(where(is.numeric), mean)) |>
            ungroup()
        }

      )

    )

    # {targets} plan ==========================================================

    tar_plan(

      mtcars_raw = tibble::as_tibble(mtcars),

      mtcars_analysis = mtcars_wrapper$new(mtcars_raw)

    )

  })

  # Executing / testing =======================================================

  targets::tar_make()

  # Conveniently, R6 objects are meant to work 'out of the box' - more or less -
  # without the "right" version of the R6 package (so maybe, without it *at
  # all*?)

  for_analysis <- targets::tar_read(mtcars_analysis)

  print(for_analysis$average_by_cyl())

})
```

## Relevant reference material

- https://en.wikipedia.org/wiki/Object-oriented_programming#Responsibility-_vs._data-driven_design
- H/t Ben Wee on the RUNAPP Slack:
  - https://ericmjl.github.io/blog/2022/4/1/functional-over-object-oriented-style-for-pipeline-esque-code/
  - https://ericmjl.github.io/blog/2023/12/12/classes-functions-both/

# Project organisation

- There are two steps to every project:
  - Artefact generation
  - Artefact presentation
- Presentation often happens in two or more ways:
  - Printing charts to a PowerPoint
  - Filling out a template report
  - Packaging in a Shiny app or dashboard
- There *should* be a lot of code overlap within the presentation step: ideally, we're not writing multiple versions of the same visualisation - for consistency, accuracy, reducing the amount of review required, etc.
- The 'default' pattern for this seems to be that you "write a package" for step #1 and use that package for the latter step; most Posit guidance is along these lines.
- However, especially in consulting, this feels like a bit of a weird workflow.
  - It's generally not *that* much work to just install or load the package, so maybe it's fine
  - But it feels more like I just want one R folder, use {targets} to generate artefacts, and then have a couple of scripts that use the shared R content and the {targets} store to create the different presentation versions
- So what does that folder & workflow structure look like? (One possible sketch below.)
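One shape it could take - a sketch only, with placeholder folder, script, and target names:

```
# Placeholder layout (illustrative, not prescriptive):
#
#   R/                    <- shared functions: wrangling, plotting, theming
#   _targets.R            <- artefact generation
#   outputs/
#     make_powerpoint.R   <- one thin script per presentation flavour, each
#     fill_template.R        reusing R/ plus the {targets} store
#     app/app.R

# e.g. inside outputs/make_powerpoint.R:

targets::tar_source("R")                   # load the shared functions
results <- targets::tar_read(key_results)  # hypothetical target in the store

# ...then the PowerPoint-specific assembly, reusing the shared plotting code.
```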
# Principles for data in general

- Never discard data. Only create flags - you never know when you're going to want to break down analysis by is-in-core-data vs was-in-that-group-they-said-I-could-filter-out.

# Local data lakes

## Version 1

### The Idea

"Data lake" structure - partly from some Googling on data lakes, partly inspired by the {pins} package (though we want that kind of functionality for the raw data, not just any resultant R object):

```
Lake
╚ Data Source ("board")
  ╚ data-source_date-saved_pertinent-metadata.fileext ("pin")
  ╚ data-source_later-date_pertinent-metadata.fileext
╚ Data Source 2
  ╚ data-source-2_...
```

A lot of the data lake resources suggested also splitting by "domain" one level up from "Data Source", and having folders below "Data Source" for year / month / day - agnostic on generation/upload date vs date-as-at.

I think an important insight is that files should be grouped (in this schema, by "Data Source") such that the schema is consistent within that folder, for better bulk processing.

### Data Catalog

I think this is the main benefit of doing this more deliberately: maintain a catalog of what's available where, alongside relevant metadata (like any parameters used to run a report).

I think that

- doing any processing,
- keeping processing code alongside the raw data, or
- maintaining links to processing code (and version - e.g. commit hash)

is a bit too much for too little return. I think the benefits of these things can be achieved at the project level - if we need to track the version of code used to run a project (which should mooostly be an orchestration script calling code from the monorepository), then we'll naturally be capturing whatever version of the processing code we were using.

But What Format?

- A table makes finding relevant data easier
- But we might want to capture arbitrary (and not universal) metadata
  - So we need a spare list column
- How do we save that to disk?
  - Seems like the best option is something like JSON
  - Kinda need to discard stuff like .csv or even a SQLite DB out of hand due to the list-column thing
  - An alternative could be to just save it as .RDS - similar costs in terms of portability - but with JSON we also get...
    - Human readable
    - Text = can text diff (good for Git)

## Version 2

- Need to revisit this
- Probably just full database vibes
- Or something like the {arrow} package, which lets you load/query directories as though they were tables (obviously with some caveats)

# To write up

- Write something about the perils of `with()`
- Make something about function writing: e.g. advocate for `invisible()`
- Logging!
  - {logger} package preferred
  - Also nifty stuff like `noquote()`
- To do: something about dynamic dots vs tidyselect vs data masking / write a dots explainer
- Better Shiny tooltips - specifically, how to make a nicely formatted and verbose tooltip without making ugly UI code. Probable: having markdown files for this (or one big markdown file with headings or something?) that you then reference.
- Should I write up my own piece on non-standard evaluation? Seeing as I use it so much...
- Maybe a piece on "current state for oldheads": new dplyr stuff, `glue`, ...
- https://x.com/ID_AA_Carmack/status/1874124927130886501?t=pln1zv-_13kEGihcpfMLAQ&s=09 -> design pattern for Shiny numero uno!!!
  - https://x.com/ID_AA_Carmack/status/1874127039156150585?t=VKPuB2Inv9vHlVxG-uN-fA&s=09 (this chain)

For toolbox.mjk:

- Autowrangle
  - See notes for sketch of nicer Shiny column association tool...
  - Use something like "average column position" in bus_matrix to do at least *some* work on guessing where things might be the same.
  - Add a step to AW to put things in target_column order on output
- Dumping big correlation matrices to Excel
  - Use {openxlsx} to implement conditional formatting
  - Nice row and column headers
  - Top-right matrix
- Check out {usethis} for nice additions to an R setup script, especially:
  - `use_blank_slate()`
- Look at customising a flextable setup to make pptx tables easier
- Create a consistent ggplot skeleton for myself
- Think harder about {memoise}
  - Like, could this be a real power-up for my class-driven pipeline approach???