Data - Michael J Kerrison

# Unsorted

- "Machine Learning: The High-Interest Credit Card of Technical Debt"
  - https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43146.pdf
- https://www.linkedin.com/posts/alec-campanini_alecfromwalmart-business-leadership-activity-7063905885156225024-tz8E

# Approaches

- Clustering
  - https://en.wikipedia.org/wiki/Latent_semantic_analysis
  - https://aisel.aisnet.org/cgi/viewcontent.cgi?article=4025&context=cais
- "Topic detection"
  - https://www.tidytextmining.com/topicmodeling.html

# Databases

## Structure

- https://servian.dev/why-transactional-databases-are-better-suited-for-data-warehouse-control-frameworks-cb9379048967

## ETL

- https://servian.dev/why-batch-time-and-job-based-orchestration-are-false-economies-556fb9a72bd
- https://servian.dev/using-talends-dynamic-run-job-to-run-jobs-in-parallel-and-sequential-order-c4cc061b487a

## Observability?

- https://www.montecarlodata.com/blog-what-is-data-observability/
- https://learn.microsoft.com/en-us/sql/data-quality-services/data-quality-services?view=sql-server-ver16
- https://medium.com/weareservian/visualize-data-lineage-using-only-sql-13f720870f1f
- https://docs.getre.io/ui-latest/#/graph?model=postgres.toy_shop_sources.toy_shop_customers&tab=test

# Governance?

- https://servian.dev/4-data-governance-strategies-to-support-efficient-machine-learning-e0ca544485ef
- https://techmagie.wordpress.com/2020/07/19/data-governance-what-when-why-who-and-how-of-data/

# Sources

- "Mosaic" data:
  - https://www.experian.com.au/mosaic
  - Pulls together Experian's credit data with some other sources to estimate incomes down to quite small groups (apparently around 40 households)
  - Tags: customer data, income data

# Testing & QA

- Data QA
  - https://dataform.co/blog/data-assertions
  - https://www.kdnuggets.com/2021/05/soda-io-managing-data-quality-sql-scale.html
- Query Testing
  - https://dataform.co/blog/unit-tests?utm_medium=organic&utm_source=dataform_blog&utm_campaign=advanced_data_quality_testing
  - Cf. "Data-driven testing"
- [Matt Kaye - Pull Requests, Code Review, and The Art of Requesting Changes](https://matthewrkaye.com/posts/series/doing-data-science/2023-04-14-code-review/code-review.html)
- [Tidyteam code review principles (tidyverse.org)](https://code-review.tidyverse.org/)
- https://twitter.com/yabellini/status/1656450313895682052?s=09

# Broader

- Some people apparently swear by "dbt", a SQL-first transformation tool (installed as a Python package) that is also a pattern/philosophy:
  - https://www.brittanybennett.com/post/there-s-a-better-way-the-case-for-dbt-for-progressive-data-professionals
  - Should unpack this...

# Visualisation / apps

To pick through:

- Royal Statistical Society on best practices for data viz - https://royal-statistical-society.github.io/datavisguide/

Shiny, but really fairly basic UX stuff: [Shiny User Adoption Fails: 9 Reasons Why Nobody Uses Your App - R programming, Shiny for Python (appsilon.com)](https://appsilon.com/reasons-why-shiny-user-adoption-fails/)

"Paired bar charts suck": https://twitter.com/rappa753/status/1643267220464865280?t=qkgOQCSSAwdWPnYfkgPvoQ&s=09

![[screenshot_2022-12-16_at_07.42.22.png]]

Tips for building good apps: https://www.linkedin.com/posts/alec-campanini_analytics-ui-data-activity-7066876924475682816-pTqj

# Qualitative

- The perennial question
- With the advent of much better LLMs there are new opportunities - stuff like embeddings seems to be much better than mucking about with older NLP approaches (but I don't know how true this is *empirically*...)
- I've also heard people use and talk about NVivo for interrogating big sets of survey responses?
- And there are techniques, like:
  - Grounded theory
  - Atomic method of qualitative research (?)
- "Insights repositories" like:
  - Dovetail
  - Gleanly
- Good ol' whiteboards (Excalidraw, Miro)
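
One way to put the "embeddings versus older NLP approaches" question from the Qualitative notes on an empirical footing is to keep a cheap baseline around to compare against. The sketch below is such a baseline, not an embedding model: it matches each survey response to its nearest neighbour by TF-IDF cosine similarity, using only the Python standard library. The sample responses, the tokeniser and every function name here are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many responses does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(docs, index):
    """Index of the response closest to docs[index] by TF-IDF cosine."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[index], vec), i)
        for i, vec in enumerate(vectors)
        if i != index
    ]
    return max(scores)[1]


# Hypothetical survey responses, invented for this example.
responses = [
    "The checkout page is slow and keeps timing out",
    "Checkout is really slow on my phone",
    "I love the new colour scheme",
    "The colour scheme is hard to read",
]
```

With these four responses, `most_similar(responses, 0)` picks the other checkout complaint. Replacing `tfidf_vectors` with vectors from an embedding model and comparing the groupings on the same real responses would be one way to check whether embeddings actually do better.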