- TODO: figure out what the format is for the rationale tables...
# import::from "loose" scripts
| Thing | Comment |
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| The problem is | You have utilities and stuff that don't feel like they warrant a package, or you don't want people to have to install a package |
| So you | Should use `import::from` |
| But also | You want to be able to keep the utilities up-to-date easily, but also to be maybe self-contained in a project, so all code is bundled together |
| So, you also | Combine it with `{pins}`, rather than keep things in a loose local folder |
| But if instead | You want to keep any of this 'loose' code private |
| You could instead | Store some of this on Teams or your collaboration platform of choice! |
```{r}
library(tidyverse)
the_url <- "https://raw.githubusercontent.com/mjkerrison/..."
# Note: you can find this by navigating to the file in GitHub, clicking "Raw" and copying the link.
# If you navigate via a branch, you get the link for the latest; if you navigate to a particular commit, you can get a snapshot
my_functions <- pins::board_url(
  urls  = c("mk_github" = the_url),
  cache = "R_remote" # Keep it cached *in the project*
)
import::from(
  pins::pin_download(my_functions, "mk_github"),
  .character_only = TRUE,
  .all = TRUE
)
```
I think this doesn't quite feel right yet, but I'm not sure why.
- Here are some reasons:
  - This requires a bit more boilerplate than I would like - ideally this would be very terse and unobtrusive. Especially if you wanted to write the URL once and grab multiple files - `board_url` requires a named character vector, and calls to `pin_download` need you to specify which pin on the board you're after. So there are potentially some loops involved (see the sketch after this list).
  - It ends up creating a cache of versions - this should be handled by local project version control. I don't need each *version of the project* to also contain all *versions of this code/snippet*.
  - This additionally kinda restricts you to one of two workarounds (below) to avoid sourcing *every version* of the .R files in the cache:
    - 1: Store it outside of your big R folder, and use the {pins} interface to retrieve the latest version.
    - 2: Store it inside your big R folder and... I guess still use the {pins} interface to make sure you end up loading the most recent version *last* - making things like `targets::tar_source` calls to load-all-R-functions more volatile...
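For reference, here's a sketch of what the multi-file {pins} version ends up looking like (the script names are placeholders, and the repo URL is truncated as above):
```{r}
# Sketch only: one board, multiple pins, one import::from() call per script.
base_url <- "https://raw.githubusercontent.com/mjkerrison/..."
scripts  <- c("fetch_from_teams.R", "utilities_for_dates.R")

my_functions <- pins::board_url(
  urls  = setNames(paste0(base_url, "/", scripts), scripts),
  cache = "R_remote"
)

# A plain for loop (rather than purrr) so import::from() still runs at the top
# level of the script:
for(script_i in scripts){
  import::from(
    pins::pin_download(my_functions, script_i),
    .character_only = TRUE,
    .all = TRUE
  )
}
```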
Yeah - honestly I think the better option would be to just try to download the file every time:
```{r}
# Attempt 1: build the raw URLs and download them directly.
# Note that download.file() only takes a *vector* of URLs with method = "libcurl".
glue::glue(
  "https://raw.githubusercontent.com/mjkerrison/toolbox.mjk/main/R/{scripts}",
  scripts = c(
    "fetch_from_teams.R",
    "utilities_for_dates.R"
  )
) |>
  (\(x){
    download.file(
      url      = x,
      destfile = glue::glue("R/{basename(x)}"),
      method   = "libcurl"
    )
  })()

# Attempt 2: let {usethis} do the fetching.
c("fetch_from_teams.R",
  "utilities_for_dates.R") |>
  purrr::walk(\(script_i){
    usethis::use_github_file(
      "mjkerrison/toolbox.mjk",
      path    = paste0("R/", script_i),
      save_as = paste0("R/", script_i)
    )
  })
```
FAQs:
- ...
# *Method*-oriented analytics
## What am I on about
- {targets} is phenomenal for managing complex pipelines
- But it feels like the DAG could be more valuable than it often is - feels like one of the tricks here is abstraction, but how do you know what to abstract?
- Something that contributes to this is the profusion of objects and functions - and this isn't just a {targets} problem, it's a complex (or at least, long/large) data project problem
- For example, going the very functional route, you end up with a lot of functions that take up to a dozen existing objects. They're not actually reusable, atomic functions, they're just more conventional scripts wrapped in a function.
- This is still way better than the alternative for a whole host of reasons...
- I think this intersects with common headaches in complex/large Shiny apps as well: lots of objects that are all somewhat different flavours of the same thing.
- And I think that provides a clue as to how we might solve it.
- An alternative approach would be to have fewer objects, and write the summarising snippets of code into the downstream consumers of those objects instead. So rather than have `object_x` and `object_x_summarise`, you only pass around `object_x` and maybe have a function called `summarise_object_x()` (see the sketch after this list).
  - Cons: if that transformation is very compute-intensive, you may actually want to individuate those objects. Not really a con, more a decision criterion.
- But then you have a lot of random functions that are totally tied to specific objects, and they bake in a dependency that {targets} will handle fine but that might be nice to clean up.
- I think this is just semantics, but I think semantics matters.
- Like the ideal would be to have the function "live" with the object - maybe you put all those functions in the same script for quick reference as well.
- See where this is going? Enter object-oriented programming - but really, the point here is *methods* more than anything else. We probably won't actually be worried about complicated inheritances or composition or anything like that - and we'll probably end up violating some [SOLID](https://en.wikipedia.org/wiki/SOLID) principles to boot.
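A minimal sketch of the "fewer objects" idea, using mtcars as a stand-in (the names are invented for illustration):
```{r}
library(dplyr)

# One object gets passed around...
object_x <- as_tibble(mtcars)

# ...and the summarising code lives in a function tied to that object (by
# naming convention only, at this point), rather than in a separate
# materialised summary object:
summarise_object_x <- function(object_x){
  object_x |>
    group_by(cyl) |>
    summarise(mean_mpg = mean(mpg), .groups = "drop")
}

# Downstream consumers take object_x and invoke what they need:
summarise_object_x(object_x)
```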
## Bottom line
- Implement a very generic 'analysis object' class
- New analysis objects get wrapped in this class
- "Reusable" code snippets - transformations you do regularly across the specific project - become additional methods for that instantiation of the class, meaning they can be better semantically tied together in a way that reflects the author's intent
- This gets passed around, abstracting the DAG and thus making it more meaningful
- Functions for combining objects and doing analysis further downstream become easier to write up: they take the class-wrapped object, and can just invoke the methods they need
- We make the trade-off of doing more computing but saving on memory - especially within {targets}, this feels like a win
- So really this is a glorified list containing the core data and then common transformations/manipulations, but with convenient reference semantics.
- And overall, this is an attempt to be ==minimally idiomatic== - this lets us manage stuff like namespacing in a simpler, easier-for-newcomers-to-read, and *more robust* way than, e.g., prefixing all names (like `object_x_summarise_by...()` or similar)
## Still working through
- This would require basically one class per analysis object. That's still net fewer things being created than just using functions - because you still have many functions per object anyway - but does feel a little silly.
- (In R6 only class methods, not member functions, have reference semantics - [R6 docs](https://r6.r-lib.org/articles/Introduction.html#class-methods-vs--member-functions))
- This should probably be done carefully to avoid unnecessary rebuilds of objects in {targets} because you've gone back and added more methods to your analysis object.
- So I think that means you can't have the ETL process actually packaged into instantiation with `new()`
- What does it actually look like within {targets}? (The Example section below is a first pass at this.)
## Other relevant concepts?
- A more... monadic approach? Where instead of a class you have an outer function that you partial/curry using the raw data...
  - I didn't think this worked??? I swear like the stack of enclosing environments is "wrong" or something
  - "I feel betrayed that it just remembers"
  - And this would basically work in Python!
```{r}
raw_data <- mtcars

my_pseudo_class <- function(data, method_to_use){
  if(method_to_use == "average"){
    inner_fn <- \(x) mean(data$mpg)
  }
  return(inner_fn)
}

mtcars_average_mpg <- my_pseudo_class(raw_data, "average")
mtcars_average_mpg()

# Run without creating new objects:
my_pseudo_class(raw_data, "average")()

# With currying/partials:
processed_data <- purrr::partial(my_pseudo_class, data = raw_data)
processed_data("average")()

# Testing mutability?
raw_data <- iris
mtcars_average_mpg() # Still works! The `data` promise was forced the first time
# we called mtcars_average_mpg(), so the closure is holding the original mtcars.
# (If we'd reassigned raw_data *before* that first call, lazy evaluation would
# have handed us iris instead.)
# Other things like re-defining mtcars_average_mpg() obviously wouldn't
# (you'd pass raw_data in again.)
```
## Example
```{r}
targets::tar_dir({

  targets::tar_script(code = {

    library(targets)
    library(tarchetypes)
    library(tidyverse)
    library(R6)

    # Define classes =========================================================

    # Prototypical object - we want to be able to go back and add methods to
    # analysis objects freely, so every analysis-object-class will need to take
    # post-ETL data in its initialisation (as opposed to doing ETL *as* the
    # initialisation) - so let's just inherit that:
    analysis_object <- R6Class(
      "analysis_object",

      # Now, the lock_class and lock_objects options don't quite behave as you
      # might expect on a quick read of the docs: while they can lock down
      # modifying the class, and adding/removing *members* of instantiations of
      # the class, they don't change the *mutability* of instantiations of the
      # class - so you could still inadvertently modify the public members, e.g.
      # the data.
      # So to achieve this, we'll:
      # - make 'data' a private member so it can't be accessed directly
      # - create a method to return data
      public = list(
        initialize = function(data_to_use){
          private$data <- data_to_use
        },
        get_data = function(){return(private$data)}
      ),
      private = list(
        data = NULL
      )

      # By default, the class environment is locked such that you can't add or
      # modify bindings - so we don't need to worry about someone accidentally
      # doing something like
      #   class_object$new_thing <- ...
      # or
      #   class_object$get_data <- some_other_fn
    )

    # And then the "applied" version of the class: inherit the basics, but then
    # all the methods are bespoke to the analysis object itself. Overall, fairly
    # minimal boilerplate - not too much worse than the conventional functional
    # approach.
    mtcars_wrapper <- R6Class(
      "mtcars_wrapper",
      inherit = analysis_object,
      public = list(
        average_by_cyl = function(){
          private$data |>
            group_by(cyl) |>
            summarise(across(where(is.numeric), mean)) |>
            ungroup()
        }
      )
    )

    # {targets} plan =========================================================

    tar_plan(
      mtcars_raw = tibble::as_tibble(mtcars),
      mtcars_analysis = mtcars_wrapper$new(mtcars_raw)
    )

  })

  # Executing / testing ======================================================

  targets::tar_make()

  # Conveniently, R6 objects are meant to work 'out of the box' - more or less
  # without the "right" version of the R6 package (so maybe, without it *at
  # all*?)
  for_analysis <- targets::tar_read(mtcars_analysis)

  print(for_analysis$average_by_cyl())

})
```
## Relevant reference material
- https://en.wikipedia.org/wiki/Object-oriented_programming#Responsibility-_vs._data-driven_design
- H/t Ben Wee on the RUNAPP Slack:
- https://ericmjl.github.io/blog/2022/4/1/functional-over-object-oriented-style-for-pipeline-esque-code/
- https://ericmjl.github.io/blog/2023/12/12/classes-functions-both/
# Project organisation
- There are two steps to every project:
- Artefact generation
- Artefact presentation
- Presentation often happens in 2 or more ways:
- Printing charts to a Powerpoint
- Filling out a template report
- Packaging in a Shiny app or dashboard
- There *should* be a lot of code overlap within the presentation step: ideally, we're not writing multiple versions of the same visualisation - for consistency, accuracy, reducing the amount of review required, etc.
- The 'default' pattern for this seems to be that you "write a package" for step #1 and use that package for the latter step; most Posit guidance is along these lines.
- However, especially in consulting, this feels like a bit of a weird workflow.
- It's generally not *that* much work to just install or load the package, so maybe it's fine
- But it feels more like I just want one R folder, use {targets} to generate artefacts, and then have a couple of scripts that use the shared R content and {targets} store to create the different presentation versions
- So what does that folder & workflow structure look like?
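One possible shape for it - very much a sketch, with placeholder file names:
```
project/
├ R/                   # shared functions, used by {targets} *and* the presentation step
├ _targets.R           # step 1: artefact generation
├ outputs/             # rendered artefacts
├ present_pptx.R       # step 2a: print charts to Powerpoint from the {targets} store
├ present_report.qmd   # step 2b: fill out the template report from the same store
└ app.R                # step 2c: Shiny app / dashboard, reading the same store
```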
# Principles for data in general
- Never discard data. Only create flags - you never know when you're going to want to break down analysis by is-in-core-data vs was-in-that-group-they-said-I-could-filter-out.
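A trivial dplyr sketch of what that looks like in practice (the exclusion rule here is made up):
```{r}
library(dplyr)

# Flag, don't filter: keep every row, and record whether it's "core".
vehicles <- as_tibble(mtcars) |>
  mutate(is_core_data = !(cyl == 4 & carb == 1)) # placeholder exclusion rule

# Any downstream analysis can then be broken down by the flag:
vehicles |>
  group_by(is_core_data) |>
  summarise(mean_mpg = mean(mpg), n = n())
```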
# Local data lakes
## Version 1
### The Idea
"Data lake" structure - partly some Googling on data lakes, partly inspired by the package "pins" (though we want that kind of functionality for the raw data, not just any resultant R object)
```
Lake
╚ Data Source ("board")
╚ data-source_date-saved_pertinent-metadata.fileext ("pin")
╚ data-source_later-date_pertinent-metadata.fileext
╚ Data Source 2
╚ data-source-2_...
```
A lot of the data lake resources also suggested splitting by "domain" one level up from "Data Source", and having folders below "Data Source" for year / month / day - agnostic on whether that's the generation/upload date or the date-as-at.
I think an important insight is that files should be grouped (in this schema, by "Data Source") such that the file schema is consistent within each folder, for better bulk processing.
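A sketch of a helper that writes into that structure with the naming scheme above (the names and the CSV-only assumption are just illustrative):
```{r}
# Hypothetical helper: drop a data frame into the lake as
#   Lake/<data-source>/<data-source>_<date-saved>_<pertinent-metadata>.csv
save_to_lake <- function(data, lake_root, data_source, metadata = NULL){

  board_dir <- file.path(lake_root, data_source)
  if(!dir.exists(board_dir)) dir.create(board_dir, recursive = TRUE)

  file_name <- paste(
    c(data_source, format(Sys.Date(), "%Y-%m-%d"), metadata),
    collapse = "_"
  )

  out_path <- file.path(board_dir, paste0(file_name, ".csv"))
  write.csv(data, out_path, row.names = FALSE)

  invisible(out_path)
}

# e.g.:
# save_to_lake(mtcars, "Lake", "mtcars-extract", "all-columns")
```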
### Data Catalog
I think this is the main benefit to doing this more deliberately: maintain a catalog of what's available where alongside relevant metadata (like any parameters used to run a report).
I think that
- doing any processing,
- keeping processing code alongside the raw data, or
- maintaining links to processing code (and version - e.g. commit hash)
is a bit too much for too little return. I think the benefits of these things can be achieved at the project level - if we need to track the version of code used to run a project (which should mooostly be an orchestration script calling code from the monorepository) then we'll naturally be capturing whatever version of the processing code we were using.
But What Format?
- Table makes finding relevant data easier
- But we might want to capture arbitrary (and not universal) metadata
- So we need a spare list column
- How do we save that to disk?
  - Seems like the best option is something like JSON (sketched below)
  - Kinda need to discard stuff like .csv or even a SQLite DB out of hand due to the list column thing
  - Alternative could be to just save as .RDS; similar costs in lack of portability, but with JSON we also get...
    - Human readable
    - Text = can text diff (good for Git)
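A sketch of the JSON option - the entries and column names are invented, and the round trip back into R needs a bit of care (e.g. the nested metadata comes back as plain lists):
```{r}
library(tibble)

catalog <- tibble(
  data_source = c("payroll-extract", "survey-results"),
  file        = c("payroll-extract_2024-01-31_all-entities.csv",
                  "survey-results_2024-02-15_wave-3.csv"),
  saved_on    = c("2024-01-31", "2024-02-15"),
  # The spare list column: arbitrary, non-universal metadata per entry
  metadata    = list(
    list(report_parameters = list(as_at = "2024-01-31", entity = "all")),
    list(survey_wave = 3)
  )
)

catalog_path <- tempfile(fileext = ".json")
jsonlite::write_json(catalog, catalog_path, pretty = TRUE, auto_unbox = TRUE)

# Human readable, text-diffable, and the list column survives:
catalog_back <- jsonlite::read_json(catalog_path, simplifyVector = TRUE)
```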
## Version 2
- Need to revisit this
- Probably just full database vibes
- Or something like the {arrow} package, which lets you load/query directories as though they were tables (obviously with some caveats)
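E.g., a minimal {arrow} sketch - the path and filter column are placeholders, and it assumes a consistent file schema within the directory (per the note above):
```{r}
library(dplyr)

# Treat a whole 'board' directory of parquet files as one lazily-evaluated table:
lake_board <- arrow::open_dataset("Lake/data-source-2", format = "parquet")

lake_board |>
  filter(saved_on >= "2024-01-01") |> # 'saved_on' is a placeholder column
  collect()                           # nothing is read into memory until here
```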
# To write up
- write something about the perils of `with`
- Make something about function writing: e.g. advocate for `invisible`
- Logging!
  - {logger} package preferred
  - Also nifty stuff like `noquote()`
- to do: something about dynamic dots vs tidy select vs data masking / write a dots explainer
- better shiny tooltips - specifically, how to make a nicely formatted and verbose tooltip without making ugly UI code. Probable: having markdown files for this (or one big markdown file with headings or something?) that you then reference.
- Should I write up my own piece of non-standard evaluation? Seeing as I use it so much...
- Maybe a piece on "current state for oldheads": new dplyr stuff, `glue`, ...
- https://x.com/ID_AA_Carmack/status/1874124927130886501?t=pln1zv-_13kEGihcpfMLAQ&s=09 -> design pattern for Shiny numero uno!!!
- https://x.com/ID_AA_Carmack/status/1874127039156150585?t=VKPuB2Inv9vHlVxG-uN-fA&s=09 (this chain)
For toolbox.mjk:
- Autowrangle
  - See notes for sketch of nicer Shiny column association tool...
  - Use something like "average column position" in bus_matrix to do at least *some* work on guessing where things might be the same.
  - Add a step to AW to put things in target_column order on output
- Dumping big correlation matrices to Excel
  - Use {openxlsx} to implement conditional formatting
  - Nice row and column headers
  - Top-right matrix
- Check out {usethis} for nice additions to an R setup script, especially:
  - `use_blank_slate()`
- Look at customising a flextable setup to make pptx tables easier
- create a consistent ggplot skeleton for myself
- think harder about memoise
  - Like could this be a real power-up for my class-driven pipeline approach???