8 min read
2020/05/04
Motivation
I use R to extract data held in Microsoft SQL Server databases on a daily basis.
When I first started, I was confused by all the different ways to accomplish this task, and a bit overwhelmed trying to choose the “best” option for the job at hand.
I want to share the approaches I’ve landed on to help others who may want a simple list of options to get started with.
Scope
This post is about reading data from a database, not writing to one.
I prefer to use packages in the tidyverse, so I’ll focus on those packages.
While it’s possible to generalize many of the concepts I write about here to other database systems, I will focus exclusively on Microsoft SQL Server. I hope this will provide simple, prescriptive guidance for those working in a similar configuration.
The data for these examples is stored using Microsoft SQL Server Express. Free download available here.
One last thing - these are a few options I populated my toolbox with. They have served me well over the past two years as an analyst in an enterprise environment, but are definitely not the only options available.
Setup
Connect to the server
I use the keyring package to keep my credentials out of my R code. You can use the great documentation available from RStudio to learn how to do the same.
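A minimal sketch of such a connection, assuming the DBI, odbc and keyring packages. The driver, server, database, user and keyring service names are hypothetical placeholders, not the values used for this post.

```r
library(DBI)

# Hypothetical connection details: replace the driver, server, database
# and keyring service/user names with your own.
con <- dbConnect(
  odbc::odbc(),
  Driver   = "ODBC Driver 17 for SQL Server",
  Server   = "localhost\\SQLEXPRESS",
  Database = "tempdb",
  UID      = "my_user",
  # The password is read from the OS credential store, not typed into the script
  PWD      = keyring::key_get(service = "my_sql_server", username = "my_user")
)
```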
Write some sample data
Note that I set the temporary argument to TRUE so that the data is written to tempdb on SQL Server, which results in it being deleted on disconnection.
This results in dplyr prefixing the table name with “##”.
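The write step can be sketched as follows. To keep the sketch runnable without a server, it swaps in an in-memory SQLite database (via RSQLite) for SQL Server, and it assumes the sample data comes from the nycflights13 package; SQLite will not add the “##” prefix described above.

```r
library(DBI)
library(dplyr)

# Stand-in connection (assumption): with SQL Server this would be the
# odbc connection created during setup.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# temporary = TRUE asks for temporary tables that disappear on disconnect
copy_to(con, nycflights13::flights, "flights", temporary = TRUE)
copy_to(con, nycflights13::planes, "planes", temporary = TRUE)

tables <- dbListTables(con)
print(tables)

dbDisconnect(con)
```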
SOURCE: https://db.rstudio.com/dplyr/#connecting-to-the-database
Option 1: Use dplyr syntax and let dbplyr handle the rest
When I use this option
This is my default option.
I do almost all of my analysis in R and this avoids fragmenting my work and thoughts across different tools.
Examples
Example 1: filter rows, and retrieve selected columns
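A sketch of this pattern, again using an in-memory SQLite database as a stand-in for SQL Server and assuming the nycflights13 flights table; the filter conditions and column choices are illustrative.

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # stand-in for SQL Server
copy_to(con, nycflights13::flights, "flights")

flights_db <- tbl(con, "flights")  # lazy reference, no rows pulled yet

jfk_delays <- flights_db %>%
  filter(origin == "JFK", dep_delay > 60) %>%  # translated to a WHERE clause
  select(year, month, day, carrier, flight, dep_delay) %>%
  collect()  # run the query and return a tibble

dbDisconnect(con)
```

Calling show_query() instead of collect() prints the SQL that dbplyr generates, which is handy for checking the translation.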
Example 2: join across tables and retrieve selected columns
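A sketch of a join that runs entirely in the database, under the same assumptions (SQLite stand-in, nycflights13 tables); the selected columns are illustrative.

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # stand-in for SQL Server
copy_to(con, nycflights13::flights, "flights")
copy_to(con, nycflights13::planes, "planes")

flights_db <- tbl(con, "flights")
planes_db  <- tbl(con, "planes")

flight_planes <- flights_db %>%
  # both tables have a `year` column, so give the duplicates distinct suffixes
  inner_join(planes_db, by = "tailnum", suffix = c("_flight", "_plane")) %>%
  select(tailnum, carrier, origin, dest, manufacturer, model) %>%
  head(100) %>%  # limit the rows returned while exploring
  collect()

dbDisconnect(con)
```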
Example 3: Summarize and count
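One way to sketch a summarise-and-count query is to count the flights whose tailnum has no match in planes. The anti_join approach here is an assumption about how such a count could be produced, and the setup again uses a SQLite stand-in with nycflights13 data.

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # stand-in for SQL Server
copy_to(con, nycflights13::flights, "flights")
copy_to(con, nycflights13::planes, "planes")

missing_planes <- tbl(con, "flights") %>%
  anti_join(tbl(con, "planes"), by = "tailnum") %>%  # flights with no match in planes
  summarise(
    n_flights  = n(),                  # number of unmatched flights
    n_tailnums = n_distinct(tailnum)   # number of distinct unmatched tail numbers
  ) %>%
  collect()

dbDisconnect(con)
```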
Quite a few tailnum values in flights are not present in planes. Interesting!
Option 2: Write SQL syntax and have dplyr and dbplyr run the query
When I use this option
I use this option when I am reusing a fairly short, existing SQL query with minor modifications.
Example 1: Simple selection of records using SQL syntax
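A sketch of running a raw SQL string through dplyr with tbl() and sql(). The query text is illustrative, and SQLite stands in for SQL Server.

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # stand-in for SQL Server
copy_to(con, nycflights13::flights, "flights")

# sql() marks the string as raw SQL so that tbl() uses it as the data source
jfk_flights <- tbl(
  con,
  sql("SELECT year, month, day, carrier, dep_delay FROM flights WHERE origin = 'JFK'")
) %>%
  collect()

dbDisconnect(con)
```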
Example 2: Use dplyr syntax to enhance a raw SQL query
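Because tbl(con, sql(...)) returns a lazy table, dplyr verbs can be chained onto it and dbplyr wraps the raw query as a subquery. A sketch under the same assumptions (SQLite stand-in, illustrative query):

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # stand-in for SQL Server
copy_to(con, nycflights13::flights, "flights")

carrier_delays <- tbl(
  con,
  sql("SELECT carrier, dep_delay FROM flights WHERE origin = 'JFK'")
) %>%
  filter(!is.na(dep_delay)) %>%  # dplyr steps are added on top of the raw SQL
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay)) %>%
  collect()

dbDisconnect(con)
```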
Option 3: Store the SQL query in a text file and have dplyr and dbplyr run the query
When I use this option
I use this approach under the following conditions:
- I’m reusing existing SQL code or collaborating with someone who will be writing new code in SQL
- The SQL code is longer than a line or two
I prefer to “modularize” my R code. Having an extremely long SQL statement in my R code doesn’t abstract away the complexity of the SQL query. Putting the query into its own file helps achieve my desired level of abstraction.
In conjunction with source control it makes tracking changes to the definition of a data set simple.
More importantly, it’s a really useful way to collaborate with others who are comfortable with SQL but don’t use R. For example, I recently used this approach on a project involving aggregation of multiple data sets. Another team member focused on building out the data collection logic for some of the data sets in SQL. Once he had them built and validated he handed off the query to me and I pasted it into a text file.
Step 1: Put your SQL code into a text file
Here is some example SQL code that might be in a file
Let’s say that SQL code was stored in a text file called flights.sql.
Step 2: Use the SQL code in the file to retrieve data and execute the query.
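A sketch of both steps together. It writes an illustrative query to a flights.sql file in a temporary directory so the example is self-contained, then reads the file back and hands the text to tbl(). The query and the SQLite stand-in are assumptions.

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # stand-in for SQL Server
copy_to(con, nycflights13::flights, "flights")

# Step 1: the SQL lives in its own file (written here only for the demo)
sql_file <- file.path(tempdir(), "flights.sql")
writeLines(c(
  "SELECT carrier, origin, dest, dep_delay",
  "FROM flights",
  "WHERE dep_delay > 120"
), sql_file)

# Step 2: read the file into one string and run it through dplyr
query <- paste(readLines(sql_file), collapse = "\n")
long_delays <- tbl(con, sql(query)) %>%
  collect()

dbDisconnect(con)
```

Since dbplyr wraps the text as a subquery, the file should hold a single SELECT statement without a trailing semicolon.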
Rstudio Dplyr Cheat Sheet
12 min read
2020-07-30
Motivation
Some days back I had a thought: to create use cases for all the functions listed in the {dplyr} cheat sheet. Even though the cheat sheet shows the syntax for each function and provides a lucid one-line (more or less) explanation along with excellent visual cues, I feel that for new users (new to {dplyr} or R) it can be daunting to see all these functions at once without knowing exactly what to look for to address the problem they are facing.
What one might need to know beforehand
The said cheat sheet is available in the resources section of the RStudio website under the name Data Transformation Cheat Sheet
To demonstrate the use of the functions listed in the cheat sheet I will be using the palmerpenguins data. Want to know more about this data set? Look at the GitHub page of Allison Horst’s {palmerpenguins}.
Another point that I would like to mention is that this is not a comprehensive resource that documents use cases of all possible valid combinations of the functions listed in the {dplyr} cheat sheet. The reason for this is that I am lazy and not as skilled as I would like to think and pretend.
Lazy GIF from Lazy GIFs
I will try and keep adding more functions over time. I took this wise suggestion from Paul Brennan; I hope he does not mind me mentioning him here.
Finally, all the mistakes that I make here are mine: all of them, the ones that are stupid and especially the ones that are very stupid. All mine.
Also, I am assuming people will know about the %>% operator. Yes, I am not explaining it. Scroll up and look at the lazy panda gif.
I will use the explanations from the cheat sheet and reproduce them for the readers’ benefit. These will appear verbatim as shown below.
This is how the explanation from the cheat sheet will be reproduced
Summarise Cases
These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).
summarise function
Let us use the summarise function to obtain the mean bill length, bill depth and flipper length of the penguins.
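A minimal sketch of that summary, assuming the penguins data frame from the {palmerpenguins} package. The output column names are my own choice, and na.rm = TRUE is added because some measurements are missing.

```r
library(dplyr)
library(palmerpenguins)

penguin_means <- penguins %>%
  summarise(
    mean_bill_length    = mean(bill_length_mm, na.rm = TRUE),
    mean_bill_depth     = mean(bill_depth_mm, na.rm = TRUE),
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE)
  )

print(penguin_means)
```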
The functions summarise_all, summarise_at and summarise_if have been superseded since the introduction of the across function in {dplyr} 1.0.0. Though the across function is not mentioned in the cheat sheet, I will attempt to demonstrate a use case.
across function
Since this function is not mentioned in the cheat sheet, I will reproduce the explanation from the {dplyr} 1.0.0 documentation.
across() makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside in summarise() and mutate().
I want the average of bill length, bill depth and flipper length. I will use the across function to achieve this.
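A sketch of the same three means written with across(), assuming the penguins data from {palmerpenguins}; the lambda passes na.rm = TRUE to skip missing measurements.

```r
library(dplyr)
library(palmerpenguins)

penguin_means <- penguins %>%
  summarise(across(
    c(bill_length_mm, bill_depth_mm, flipper_length_mm),  # columns, select() style
    ~ mean(.x, na.rm = TRUE)                               # function applied to each
  ))

print(penguin_means)
```

The columns keep their original names, so one call replaces three nearly identical lines.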
The across function can also be used within the mutate function.
Group Cases
Use group_by() to create a “grouped” copy of a table. dplyr functions will manipulate each “group” separately and then combine the results.
Assume one wants the same mean values, but for each species of penguin. In such cases the group_by function proves useful.
Notice the message “summarise() ungrouping output (override with .groups argument)”. This is a feature of the new {dplyr} 1.0.0, where one does not have to explicitly call ungroup() here.
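The grouped version can be sketched like this, assuming the penguins data from {palmerpenguins}. The exact wording of the message, and whether it appears at all, may differ between {dplyr} versions.

```r
library(dplyr)
library(palmerpenguins)

species_means <- penguins %>%
  group_by(species) %>%  # one group per species
  summarise(across(
    c(bill_length_mm, bill_depth_mm, flipper_length_mm),
    ~ mean(.x, na.rm = TRUE)
  ))

print(species_means)
```

With a single grouping variable, summarise() drops that grouping level, so the result is a plain ungrouped tibble with one row per species.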
Manipulate Cases
Row functions return a subset of rows as a new table.
filter function to find fluffy penguins
Extract rows that meet logical criteria. filter(iris, Sepal.Length > 7)
Say I want a data set with observations of penguins that weigh more than 3.5 kg. In such cases the filter function comes in handy.
Say one is interested in penguins from a particular island (Torgersen) that are fluffy. In that case, multiple conditions can be provided to the filter function.
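A sketch of both filters, assuming the penguins data from {palmerpenguins}, with 3.5 kg written as 3500 because body_mass_g is recorded in grams. The object names are my own.

```r
library(dplyr)
library(palmerpenguins)

# Single condition: penguins heavier than 3.5 kg
fluffy <- penguins %>%
  filter(body_mass_g > 3500)

# Conditions separated by commas are combined with AND
fluffy_torgersen <- penguins %>%
  filter(island == "Torgersen", body_mass_g > 3500)

print(nrow(fluffy))
print(nrow(fluffy_torgersen))
```

Rows where body_mass_g is missing are dropped, because filter() keeps only rows for which the condition is TRUE.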
distinct function
Remove rows with duplicate values.
This function can be used to remove duplicate rows from a table. Since the penguins data does not have duplicate rows, I will use dummy data to demonstrate a simple use case of this function.
Consider the following data
This table gives a list of dishes from different restaurants and the flavour rating for each dish. However, there is a data entry error: the first and the last dish are the same, from the same restaurant. It’s a duplicate entry. I wish to remove the duplicate entry; here is how that can be done using the distinct() function.
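A sketch with a hypothetical Flavours table. Apart from the restaurant name Ayub’s, the restaurants, dishes, ratings and column names are invented for illustration and are not the original data.

```r
library(dplyr)

# Hypothetical data: the first and last rows are the same entry
Flavours <- tibble(
  restaurant     = c("Ayub's", "Cafe B", "Ayub's", "Diner C", "Ayub's"),
  dish           = c("Dish 1", "Dish 2", "Dish 3", "Dish 4", "Dish 1"),
  flavour_rating = c(9, 8, 7, 8, 9)
)

# With no columns named, distinct() compares whole rows
unique_flavours <- Flavours %>%
  distinct()

print(unique_flavours)
```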
distinct() can also be used to keep one observation per value of a specific variable or column. Say, from the Flavours data, one wants only one observation from each restaurant, along with all the variables. In that case, distinct() can be given that column together with .keep_all = TRUE.
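A sketch of keeping one row per restaurant, reusing the same hypothetical Flavours table (all values other than the name Ayub’s are invented for illustration).

```r
library(dplyr)

# Hypothetical data, as before
Flavours <- tibble(
  restaurant     = c("Ayub's", "Cafe B", "Ayub's", "Diner C", "Ayub's"),
  dish           = c("Dish 1", "Dish 2", "Dish 3", "Dish 4", "Dish 1"),
  flavour_rating = c(9, 8, 7, 8, 9)
)

one_per_restaurant <- Flavours %>%
  # .keep_all = TRUE keeps the other columns instead of returning only `restaurant`
  distinct(restaurant, .keep_all = TRUE)

print(one_per_restaurant)
```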
This would give us the first observations for each of the restaurants as they appear in the data.
sample_* functions
sample_frac function
Randomly select fraction of rows.
This function allows us to randomly sample a fraction of the observations from the data. We can also specify whether we want sampling with or without replacement.
In the code below, I randomly sample 50% of the observations without replacement.
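A sketch of that 50% sample, assuming the penguins data from {palmerpenguins}. The seed value is arbitrary and only makes the draw reproducible.

```r
library(dplyr)
library(palmerpenguins)

set.seed(2020)  # arbitrary seed so the same rows are drawn each run

half_penguins <- penguins %>%
  sample_frac(0.5, replace = FALSE)  # half of the rows, each at most once

print(nrow(half_penguins))
```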
sample_n function
Randomly select size rows.
This function allows us to select a desired number of observations from the data, with or without replacement.
Consider the following example where I select 50 rows with replacement.
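A sketch of drawing 50 rows with replacement, assuming the penguins data from {palmerpenguins}; the seed is arbitrary.

```r
library(dplyr)
library(palmerpenguins)

set.seed(2020)  # arbitrary seed for reproducibility

fifty_penguins <- penguins %>%
  sample_n(50, replace = TRUE)  # the same row may be drawn more than once

print(nrow(fifty_penguins))
```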
slice function
Select rows by position
This function gives us the ability to select rows by the position in which they appear in the data.
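For instance, a range of positions such as rows 25 to 32 can be pulled out like this (a sketch assuming the penguins data from {palmerpenguins}):

```r
library(dplyr)
library(palmerpenguins)

rows_25_to_32 <- penguins %>%
  slice(25:32)  # positions refer to the current row order of the data

print(rows_25_to_32)
```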
Say I want observations 25 to 32 from the penguins data: passing that range of positions to slice() does it.
top_n function
Select and order top n entries
This function lets one select the observations that rank highest on a given variable.
Say I want to select the penguins whose mass falls within the top 5 values that the body_mass_g variable takes. This is how I would do it.
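A sketch of that selection, assuming the penguins data from {palmerpenguins}. Note that top_n() keeps ties, so it can return more than five rows.

```r
library(dplyr)
library(palmerpenguins)

heaviest <- penguins %>%
  top_n(5, body_mass_g)  # rows ranked in the top 5 by body mass, ties included

print(heaviest)
```

The result is not sorted; chaining arrange(desc(body_mass_g)) would order it from heaviest to lightest.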
arrange function
Order rows by values of a column or columns (low to high), use with desc() to order from high to low
Say, I want to order the penguins by their body mass.
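A sketch of the ordering, assuming the penguins data from {palmerpenguins}; the second call shows the descending variant with desc().

```r
library(dplyr)
library(palmerpenguins)

# Ascending order is the default (rows with missing mass go last)
lightest_first <- penguins %>%
  arrange(body_mass_g)

# Wrapping the column in desc() reverses the order
heaviest_first <- penguins %>%
  arrange(desc(body_mass_g))

print(head(lightest_first))
```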
This arranges the penguins from least fluffy to most fluffy. I can arrange in decreasing order by using desc() within arrange().
add_row function
Add one or more rows to a table
Say I went to Ayub’s again and tried another dish that I want to add to the Flavours data.
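A sketch of appending that row with add_row() from the {tibble} package. The Flavours table, the new dish and its rating are hypothetical values for illustration.

```r
library(dplyr)

# Hypothetical data: only the restaurant name Ayub's comes from the text
Flavours <- tibble(
  restaurant     = c("Ayub's", "Cafe B", "Ayub's", "Diner C"),
  dish           = c("Dish 1", "Dish 2", "Dish 3", "Dish 4"),
  flavour_rating = c(9, 8, 7, 8)
)

more_flavours <- Flavours %>%
  # new values are supplied as column = value pairs; the row is appended last
  tibble::add_row(restaurant = "Ayub's", dish = "Dish 5", flavour_rating = 8)

print(more_flavours)
```

The .before and .after arguments of add_row() place the new row at a specific position instead of at the end.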