Before we get into how to use the package, let's quickly review the three things you will need to load data into a database:
- Database name (catalog name)
- Schema name
- Table or view name
To save or reference data, you either need to fully qualify the name as database_name.schema_name.table_name, or you need to be "in" your database and schema and reference the table name alone.
If you upload data without creating a database or schema first, DuckDB will assign "temp" and "main" as the default names for your database and schema respectively.
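As a minimal sketch of both approaches, assuming a connection `con` and hypothetical names `my_db`, `my_schema`, and `my_table`:

```r
# Fully qualified reference works from anywhere
DBI::dbGetQuery(con, "SELECT * FROM my_db.my_schema.my_table LIMIT 5")

# Or move "into" the database and schema first, then use the bare table name
DBI::dbExecute(con, "USE my_db.my_schema")
DBI::dbGetQuery(con, "SELECT * FROM my_table LIMIT 5")
```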
Note 1: Terminology: DuckDB vs. MotherDuck
While not a technically precise explanation: MotherDuck is a cloud-based deployment of DuckDB where you can have multiple databases, control access / permissions, and scale compute / storage as needed.
DuckDB is essentially a single database instance on your local computer, where you can save the database to a file or create it in memory.
Throughout these write-ups, I tend to use DuckDB and MotherDuck interchangeably; however, all functions will work with a MotherDuck database and most will also work with a local DuckDB database.
If you are using a local DuckDB database, you can leave the database argument blank.
Later on, we will show examples of how to read data from a source file (e.g. csv, parquet, or even excel) directly into MotherDuck without loading the data into memory, but for now let's assume you want to upload some data that you already have loaded in your R environment.
First, let us connect to our MotherDuck database. In order to connect, you will need a MotherDuck access token saved in your R environment file.
The connect_to_motherduck() function takes the name of your access token from your R environment file, installs and loads the required extensions, and then connects to your MotherDuck instance.
Caution 1: Connecting to MotherDuck
One limitation of connecting to MotherDuck from R is that you first need to create a local DuckDB instance, which then allows the connection to MotherDuck.
This means you have access to both a local (temporary) DuckDB database and your cloud-based MotherDuck databases.
Check which database you are "in" with the pwd() command.
```r
con_md <- connect_to_motherduck(motherduck_token = "MOTHERDUCK_TOKEN")
```
You will get a message that prints out the actions the package took and information about your connection.
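Under the hood, this is roughly the mechanism (a sketch based on MotherDuck's documented SET motherduck_token / ATTACH 'md:' flow, not the package's actual implementation):

```r
# Start with a local (in-memory) DuckDB instance first...
con <- DBI::dbConnect(duckdb::duckdb())

# ...then authenticate and attach MotherDuck to that local instance.
# "MOTHERDUCK_TOKEN" is the environment variable name used in this vignette.
DBI::dbExecute(con, sprintf(
  "SET motherduck_token = '%s'",
  Sys.getenv("MOTHERDUCK_TOKEN")
))
DBI::dbExecute(con, "ATTACH 'md:'")
```

This is also why you start "in" the local database: the local DuckDB instance exists before the MotherDuck databases are attached.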
Before uploading new data, it can be helpful to check "where" you are in your database.
You can do this with the pwd() function, which will print out the current database & schema that you are in.
This is the default location where your data will be saved unless you specify a different database and schema.
```r
pwd(con_md)
```

By default, you will be in your local DuckDB database even though you are connected to MotherDuck.
See Caution 1 to understand why we start in a local database rather than MotherDuck.
If we want to, we can also navigate to our MotherDuck database with the cd() command.

```r
cd(con_md, database = "contoso")
```

I am now in the MotherDuck-based contoso database, and any reference to a schema or table will be relative to this database.
Let's verify that by listing all the tables in this database. We can do that with the list_all_tables() function.
```r
list_all_tables(con_md)
```

Now that we know how to navigate our various databases, let's finally load some existing data into a new database and schema. The create_table() function will create a new database / schema as needed and then load the data into a table.
```r
ggplot2::diamonds |>
  md::create_table(
    .con = con_md
    ,database_name = "vignette"
    ,schema_name = "raw"
    ,table_name = "diamonds"
    ,write_type = "overwrite"
  )
```
Notice that we don't assign this object to anything; this just silently writes our data to our database and prints a message confirming the performed actions, as well as a summary of the catalogs, schemas, tables, and shares that you have access to.
We can validate that the table was successfully saved in our database by running list_all_tables() again, as shown below.
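For example (a small sketch, filtering on the table_catalog column that list_all_tables() returns, as used later in this vignette):

```r
# Confirm the new diamonds table landed in the vignette database
list_all_tables(con_md) |>
  dplyr::filter(table_catalog == "vignette")
```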
If you want to access your MotherDuck data, you can simply leverage the dplyr::tbl() or DBI::dbGetQuery() functions with your MotherDuck connection to pull your data.
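For instance, a minimal sketch pulling from the diamonds table we just created:

```r
# Lazily reference the table, then collect a few rows into memory
dplyr::tbl(con_md, DBI::Id("vignette", "raw", "diamonds")) |>
  head(5) |>
  dplyr::collect()

# Or query it directly with a fully qualified name
DBI::dbGetQuery(con_md, "SELECT * FROM vignette.raw.diamonds LIMIT 5")
```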
Let's say we want to summarize this table and save it to our database with a new table name – no problem, we can repeat the steps, and this time we will upload a DBI object instead of a tibble.
```r
id_name <- DBI::Id("vignette", "raw", "diamonds")

diamonds_summary_tbl <- dplyr::tbl(con_md, id_name) |>
  dplyr::summarise(
    .by = c(color, cut, clarity)
    ,mean_price = mean(price, na.rm = TRUE)
  )

diamonds_summary_tbl |>
  create_table(
    .con = con_md
    ,database_name = "vignette"
    ,schema_name = "raw"
    ,table_name = "diamonds_summary"
    ,write_type = "overwrite"
  )
```
While it's the same syntax, create_table() will work with an R object, a duckplyr object, or a DBI object to save that table into your database.
Let's say we want to organize these existing tables into a different schema. We can do this by first creating a new schema and then moving the tables, or alternatively by loading a table directly into a new schema and table.
```r
create_schema(
  .con = con_md
  ,database_name = "vignette"
  ,schema_name = "curated"
)
```

This will create a new schema if it doesn't exist, but won't load any data.
```r
list_schemas(con_md)
```

Default schemas

When you create a database it will have a default schema, "main". So even though we only created the "raw" and "curated" schemas, we see three schemas due to this default (which you can't delete).
We can copy one or a series of tables to our new schema with copy_tables_to_new_location(). This accepts a tibble or DBI object with table_catalog, table_schema, and table_name columns and will copy those tables into your new location.
```r
list_all_tables(con_md) |>
  dplyr::filter(
    table_catalog == "vignette"
  ) |>
  md::copy_tables_to_new_location(
    .con = con_md
    ,from_table_names = _
    ,to_database_name = "vignette"
    ,to_schema_name = "curated"
  )
```

There's a complementary function called create_or_replace_schema() which will also create a schema; the difference is that if there is already a schema with that name, it will delete that schema and any tables saved under it.
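For example, a sketch assuming create_or_replace_schema() mirrors the create_schema() arguments shown above (an unverified assumption about its signature):

```r
# Drops the existing "curated" schema (and its tables) if present, then recreates it
create_or_replace_schema(
  .con = con_md
  ,database_name = "vignette"
  ,schema_name = "curated"
)
```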
Sometimes we need to delete databases, schemas or tables.
Be careful when you do this as it's irreversible – there is no CTRL+Z to undo this.
```r
delete_schema(con_md, database_name = "vignette", schema_name = "curated", cascade = TRUE)
```

For csv files, we can leverage the existing duckdb function duckdb::duckdb_read_csv() to read a csv file or a series of csv files directly into your DuckDB or MotherDuck database.
This will read the files from their source location directly into your database without loading the files into memory, which is helpful when you are dealing with larger-than-memory data.
Under the hood, the duckdb function uses DuckDB's read_csv_auto, and you can pass configuration options directly through to it if you need them.
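For instance, a small sketch passing parsing options through (delim and header are duckdb_read_csv() arguments; the semicolon-delimited file is hypothetical):

```r
# Read a semicolon-delimited file, keeping the header row
duckdb::duckdb_read_csv(
  conn   = con_md,
  files  = "sales.csv",  # hypothetical file
  name   = "sales",
  delim  = ";",
  header = TRUE
)
```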
Here is a basic example, writing mtcars to a csv file and reading it straight back into the database:

```r
write.csv(mtcars, "mtcars.csv")
# cd(schema = "raw")
duckdb::duckdb_read_csv(conn = con_md, files = "mtcars.csv", name = "mtcars")
```

For excel, parquet, or httpfs file formats, we can leverage read_excel_duckdb(), read_parquet_duckdb(), or read_httpfs_duckdb() from the md package.
Similar to the read_csv_auto function, these leverage underlying DuckDB extensions to read these different file formats.
You can view the default configuration options with the md::config_* family of tables.
```r
# starwars is exported by the dplyr package
openxlsx::write.xlsx(dplyr::starwars, "starwars.xlsx")

read_excel_duckdb(
  .con = con_md
  ,to_database_name = "vignette"
  ,to_schema_name = "main"
  ,to_table_name = "starwars"
  ,file_path = "starwars.xlsx"
  ,header = TRUE
  ,sheet = "Sheet 1"
  ,all_varchar = TRUE
  ,write_type = "overwrite"
)
```
Below is the list of configuration options available to be passed through to the respective read_* functions.