Cyclistic Case Study

Cyclistic is a bike-share company based in Chicago.

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments.One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Three questions will guide the marketing team

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

I will use Cyclistic’s historical trip data from January 2021 to December 2021 to analyze and identify trends. The data has been made available by Motivate International Inc. under this license. Data-privacy issues prohibit me from using riders’ personally identifiable information. This means that I won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.

I start by setting up the environment and loading all the appropriate libraries

library("tidyverse")

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
    ## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
    ## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
    ## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
    ## ✔ readr   2.1.3      ✔ forcats 0.5.2 
    ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ## ✖ dplyr::filter() masks stats::filter()
    ## ✖ dplyr::lag()    masks stats::lag()

library("lubridate")

## Loading required package: timechange
    ## 
    ## Attaching package: 'lubridate'
    ## 
    ## The following objects are masked from 'package:base':
    ## 
    ##     date, intersect, setdiff, union

library("janitor")

## 
    ## Attaching package: 'janitor'
    ## 
    ## The following objects are masked from 'package:stats':
    ## 
    ##     chisq.test, fisher.test

I then import the data

Each data frame corresponds to one month

trips_202101 <- read_csv("202101-divvy-tripdata.csv")
    trips_202102 <- read_csv("202102-divvy-tripdata.csv")
    trips_202103 <- read_csv("202103-divvy-tripdata.csv")
    trips_202104 <- read_csv("202104-divvy-tripdata.csv")
    trips_202105 <- read_csv("202105-divvy-tripdata.csv")
    trips_202106 <- read_csv("202106-divvy-tripdata.csv")
    trips_202107 <- read_csv("202107-divvy-tripdata.csv")
    trips_202108 <- read_csv("202108-divvy-tripdata.csv")
    trips_202109 <- read_csv("202109-divvy-tripdata.csv")
    trips_202110 <- read_csv("202110-divvy-tripdata.csv")
    trips_202111 <- read_csv("202111-divvy-tripdata.csv")
    trips_202112 <- read_csv("202112-divvy-tripdata.csv")

After inspecting each one to make sure all the column names are identical in each, I merge all 12 into one single data frame

all_trips_2021 <- bind_rows(trips_202101, trips_202102, trips_202103, trips_202104, trips_202105, trips_202106, trips_202107, trips_202108, trips_202109, trips_202110, trips_202111, trips_202112)

I rename the column ‘member_casual’ to ‘customer_type’. I then convert the columns ‘rideable_type’ and ’customer_type” to factors since these columns only have two possible values each.

all_trips_2021_clean <- all_trips_2021 %>%
      rename(customer_type = member_casual) %>%
      mutate(customer_type = as.factor(customer_type)) %>%
      mutate(rideable_type = as.factor(rideable_type))

I remove columns that I will not be needing.

all_trips_2021_clean <- all_trips_2021_clean %>%
      select(-c(start_lat, start_lng, end_lat, end_lng))

I inspect my new data frame to make sure it’s ready before I begin the analysis.

colnames(all_trips_2021_clean)

## [1] "ride_id"            "rideable_type"      "started_at"        
    ## [4] "ended_at"           "start_station_name" "start_station_id"  
    ## [7] "end_station_name"   "end_station_id"     "customer_type"

nrow(all_trips_2021_clean)

## [1] 5595063

dim(all_trips_2021_clean)

## [1] 5595063       9

head(all_trips_2021_clean)

## # A tibble: 6 × 9
    ##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
    ##   <chr>          <fct>   <dttm>              <dttm>              <chr>   <chr>  
    ## 1 E19E6F1B8D4C4… electr… 2021-01-23 16:14:19 2021-01-23 16:24:44 Califo… 17660  
    ## 2 DC88F20C2C55F… electr… 2021-01-27 18:43:08 2021-01-27 18:47:12 Califo… 17660  
    ## 3 EC45C94683FE3… electr… 2021-01-21 22:35:54 2021-01-21 22:37:14 Califo… 17660  
    ## 4 4FA453A75AE37… electr… 2021-01-07 13:31:13 2021-01-07 13:42:55 Califo… 17660  
    ## 5 BE5E8EB4E7263… electr… 2021-01-23 02:24:02 2021-01-23 02:24:45 Califo… 17660  
    ## 6 5D8969F88C773… electr… 2021-01-09 14:24:07 2021-01-09 15:17:54 Califo… 17660  
    ## # … with 3 more variables: end_station_name <chr>, end_station_id <chr>,
    ## #   customer_type <fct>, and abbreviated variable names ¹rideable_type,
    ## #   ²start_station_name, ³start_station_id

tail(all_trips_2021_clean)

## # A tibble: 6 × 9
    ##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
    ##   <chr>          <fct>   <dttm>              <dttm>              <chr>   <chr>  
    ## 1 92BBAB97D1683… electr… 2021-12-24 15:42:09 2021-12-24 19:29:35 Canal … 13341  
    ## 2 847431F3D5353… electr… 2021-12-12 13:36:55 2021-12-12 13:56:08 Canal … 13341  
    ## 3 CF407BBC3B9FA… electr… 2021-12-06 19:37:50 2021-12-06 19:44:51 Canal … 13341  
    ## 4 60BB69EBF5440… electr… 2021-12-02 08:57:04 2021-12-02 09:05:21 Canal … 13341  
    ## 5 C414F654A2863… electr… 2021-12-13 09:00:26 2021-12-13 09:14:39 Lawnda… 362.0  
    ## 6 37AC57E34B2E7… classi… 2021-12-13 08:45:32 2021-12-13 08:49:09 Michig… TA1309…
    ## # … with 3 more variables: end_station_name <chr>, end_station_id <chr>,
    ## #   customer_type <fct>, and abbreviated variable names ¹rideable_type,
    ## #   ²start_station_name, ³start_station_id

summary(all_trips_2021_clean)

##    ride_id                rideable_type       started_at                    
    ##  Length:5595063     classic_bike :3251028   Min.   :2021-01-01 00:02:05.00  
    ##  Class :character   docked_bike  : 312343   1st Qu.:2021-06-06 23:52:40.00  
    ##  Mode  :character   electric_bike:2031692   Median :2021-08-01 01:52:11.00  
    ##                                             Mean   :2021-07-29 07:41:02.63  
    ##                                             3rd Qu.:2021-09-24 16:36:16.00  
    ##                                             Max.   :2021-12-31 23:59:48.00  
    ##     ended_at                      start_station_name start_station_id  
    ##  Min.   :2021-01-01 00:08:39.00   Length:5595063     Length:5595063    
    ##  1st Qu.:2021-06-07 00:44:21.00   Class :character   Class :character  
    ##  Median :2021-08-01 02:21:55.00   Mode  :character   Mode  :character  
    ##  Mean   :2021-07-29 08:02:58.75                                        
    ##  3rd Qu.:2021-09-24 16:54:05.50                                        
    ##  Max.   :2022-01-03 17:32:18.00                                        
    ##  end_station_name   end_station_id     customer_type   
    ##  Length:5595063     Length:5595063     casual:2529005  
    ##  Class :character   Class :character   member:3066058  
    ##  Mode  :character   Mode  :character                   
    ##                                                        
    ##                                                        
    ##

str(all_trips_2021_clean)

## tibble [5,595,063 × 9] (S3: tbl_df/tbl/data.frame)
    ##  $ ride_id           : chr [1:5595063] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
    ##  $ rideable_type     : Factor w/ 3 levels "classic_bike",..: 3 3 3 3 3 3 3 3 3 1 ...
    ##  $ started_at        : POSIXct[1:5595063], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
    ##  $ ended_at          : POSIXct[1:5595063], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
    ##  $ start_station_name: chr [1:5595063] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
    ##  $ start_station_id  : chr [1:5595063] "17660" "17660" "17660" "17660" ...
    ##  $ end_station_name  : chr [1:5595063] NA NA NA NA ...
    ##  $ end_station_id    : chr [1:5595063] NA NA NA NA ...
    ##  $ customer_type     : Factor w/ 2 levels "casual","member": 2 2 2 2 1 1 2 2 2 2 ...

At this stage, I decide to add columns for the day, month, and year. I also add a column for day of the week. This will come in handy when analyzing if there’s a relationship between the number and types of riders to different times of the year. It also allows us to do the same for the days of the week.

all_trips_2021_clean$date <- as.Date(all_trips_2021_clean$started_at)
    all_trips_2021_clean$month <- format(as.Date(all_trips_2021_clean$date), "%m")
    all_trips_2021_clean$day <- format(as.Date(all_trips_2021_clean$date), "%d")
    all_trips_2021_clean$year <- format(as.Date(all_trips_2021_clean$date), "%y")
    all_trips_2021_clean$day_of_week <- format(as.Date(all_trips_2021_clean$date), "%A")

**I also need to know how long each trip was. This will show if there’s a relationship between the day of the week and the duration of the ride for each customer type

all_trips_2021_clean$trip_duration <- difftime(all_trips_2021_clean$ended_at, all_trips_2021_clean$started_at)

I am ready to perform calcuations I want to find the straight average and median of each ride, as well as the longuest and shortest rides.

mean(all_trips_2021_clean$trip_duration)

## Time difference of 1316.121 secs

median(all_trips_2021_clean$trip_duration)

## Time difference of 720 secs

max(all_trips_2021_clean$trip_duration)

## Time difference of 3356649 secs

min(all_trips_2021_clean$trip_duration)

## Time difference of -3482 secs

The value for the shortest trip is a negative one. This must mean that some are values in the trip_duration column are equal to or less than zero. Those need to be removed.

all_trips_2021_clean_v2 <- all_trips_2021_clean[!(all_trips_2021_clean$trip_duration <= 0),]

Now, when the same calculations are run, the shortest trip is no longer a negative value.

mean(all_trips_2021_clean_v2$trip_duration) ## straight avg (trip_duration/rides): output 1316.18 sec

## Time difference of 1316.299 secs

median(all_trips_2021_clean_v2$trip_duration) ## midpoint 720 secs

## Time difference of 720 secs

max(all_trips_2021_clean_v2$trip_duration) ## longest ride is 3356649 secs

## Time difference of 3356649 secs

min(all_trips_2021_clean_v2$trip_duration) ## shortest ride is 1 secs

## Time difference of 1 secs

I want to compare the average trip duration for annual members to that of casual customers

aggregate(all_trips_2021_clean_v2$trip_duration ~ all_trips_2021_clean_v2$customer_type, FUN = mean)

##   all_trips_2021_clean_v2$customer_type all_trips_2021_clean_v2$trip_duration
    ## 1                                casual                        1920.3468 secs
    ## 2                                member                         818.0727 secs

I do the same for the median

aggregate(all_trips_2021_clean_v2$trip_duration ~ all_trips_2021_clean_v2$customer_type, FUN = median)

##   all_trips_2021_clean_v2$customer_type all_trips_2021_clean_v2$trip_duration
    ## 1                                casual                              959 secs
    ## 2                                member                              576 secs

I then compare the longest trip duration by customer type

aggregate(all_trips_2021_clean_v2$trip_duration ~ all_trips_2021_clean_v2$customer_type, FUN =max)

##   all_trips_2021_clean_v2$customer_type all_trips_2021_clean_v2$trip_duration
    ## 1                                casual                          3356649 secs
    ## 2                                member                            93596 secs

And the shortest trip duration by customer type

aggregate(all_trips_2021_clean_v2$trip_duration ~ all_trips_2021_clean_v2$customer_type, FUN=min)

##   all_trips_2021_clean_v2$customer_type all_trips_2021_clean_v2$trip_duration
    ## 1                                casual                                1 secs
    ## 2                                member                                1 secs

**Next, I want to see the average trip duration for both customer types for eqch day of the week

all_trips_2021_clean_v2$day_of_week <- ordered(all_trips_2021_clean_v2$day_of_week, levels=c( "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
    aggregate(all_trips_2021_clean_v2$trip_duration ~ all_trips_2021_clean_v2$customer_type + all_trips_2021_clean_v2$day_of_week, FUN = mean)

##    all_trips_2021_clean_v2$customer_type all_trips_2021_clean_v2$day_of_week
    ## 1                                 casual                              Sunday
    ## 2                                 member                              Sunday
    ## 3                                 casual                              Monday
    ## 4                                 member                              Monday
    ## 5                                 casual                             Tuesday
    ## 6                                 member                             Tuesday
    ## 7                                 casual                           Wednesday
    ## 8                                 member                           Wednesday
    ## 9                                 casual                            Thursday
    ## 10                                member                            Thursday
    ## 11                                casual                              Friday
    ## 12                                member                              Friday
    ## 13                                casual                            Saturday
    ## 14                                member                            Saturday
    ##    all_trips_2021_clean_v2$trip_duration
    ## 1                         2254.2573 secs
    ## 2                          939.5538 secs
    ## 3                         1912.7473 secs
    ## 4                          794.8956 secs
    ## 5                         1678.5293 secs
    ## 6                          767.3451 secs
    ## 7                         1659.6644 secs
    ## 8                          769.2124 secs
    ## 9                         1662.3466 secs
    ## 10                         766.6220 secs
    ## 11                        1821.1061 secs
    ## 12                         799.5648 secs
    ## 13                        2082.5980 secs
    ## 14                         915.9313 secs

I also want to factor in the number of rides. I want to see if trip duration and the number of rides varies throughout the week with each customer type

all_trips_2021_clean_v2 %>%
      mutate(weekday=wday(started_at, label = TRUE)) %>% ##creates a weekday column using the wday() function
      group_by(customer_type, weekday) %>% ## groups by customer type and by weekday
      summarise(number_of_rides = n(), average_duration = mean(trip_duration)) %>% ## shows the number of rides for each weekday and avg trip duration
      arrange(customer_type, weekday) ##arranges by customer type and weekday

## `summarise()` has grouped output by 'customer_type'. You can override using the
    ## `.groups` argument.

## # A tibble: 14 × 4
    ## # Groups:   customer_type [2]
    ##    customer_type weekday number_of_rides average_duration
    ##    <fct>         <ord>             <int> <drtn>          
    ##  1 casual        Sun              481048 2254.2573 secs  
    ##  2 casual        Mon              286340 1912.7473 secs  
    ##  3 casual        Tue              274357 1678.5293 secs  
    ##  4 casual        Wed              278910 1659.6644 secs  
    ##  5 casual        Thu              286038 1662.3466 secs  
    ##  6 casual        Fri              364037 1821.1061 secs  
    ##  7 casual        Sat              557934 2082.5980 secs  
    ##  8 member        Sun              376086  939.5538 secs  
    ##  9 member        Mon              416181  794.8956 secs  
    ## 10 member        Tue              465474  767.3451 secs  
    ## 11 member        Wed              477117  769.2124 secs  
    ## 12 member        Thu              451490  766.6220 secs  
    ## 13 member        Fri              446384  799.5648 secs  
    ## 14 member        Sat              433014  915.9313 secs

To better illustrate the relationship between number of rides by rider type for each day of the week, I add a plot

plot(all_trips_2021_clean_v2 %>%
      mutate(weekday=wday(started_at, label=TRUE)) %>%
      group_by(customer_type, weekday) %>%
      summarise(number_of_rides=n(), average_duration = mean(trip_duration)) %>%
      arrange(customer_type, weekday) %>%
      ggplot(aes(x= weekday, y=average_duration, fill= customer_type)) + geom_col(position= "dodge") + labs(title = "Average Trip Duration by Rider Type"))

## `summarise()` has grouped output by 'customer_type'. You can override using the
    ## `.groups` argument.
    ## Don't know how to automatically pick scale for object of type <difftime>.
    ## Defaulting to continuous.

The data shows that casual customers take longer rides than annual members. Sundays are the days on which both customer types take the longest rides, followed by Saturday. For annual members, there is very little variation in the number of rides throughout the weekdays, but for casual members, Wednesdays and Thursdays have the shortest trip durations, and they start to increase on Fridays.

I want to see how number of rides vary throughout the week for each customer type

library(scales)

## 
    ## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
    ## 
    ##     discard

## The following object is masked from 'package:readr':
    ## 
    ##     col_factor

options(scipen = 999)
    all_trips_2021_clean_v2 %>%
      mutate(weekday=wday(started_at, label=TRUE)) %>%
      group_by(customer_type, weekday) %>%
      summarise(number_of_rides=n(), average_duration = mean(trip_duration)) %>%
      arrange(customer_type, weekday) %>%
      ggplot(aes(x= weekday, y=number_of_rides, fill= customer_type)) + geom_col(position= "dodge") + labs(title = "Number of Rides by Rider Type")

## `summarise()` has grouped output by 'customer_type'. You can override using the
    ## `.groups` argument.

The graph shows that casual customers take fewer rides on weekdays, especially Monday through Thursdays, but the number of rides begins to rise on Fridays. Casual customers takw the most rides on Saturdays. On the other hand, annuals members seem to take more rides on weekdays. Annual members take the most rides on Wednesdays.

Finally, I wanted to know how the number of rides vary throughout the year for each type of customer

options(scipen = 999)
    all_trips_2021_clean_v2 %>%
      group_by(customer_type,month) %>%
      summarise(number_of_rides=n(), average_duration = mean(trip_duration)) %>%
      arrange(customer_type, month) %>%
      ggplot(aes(x= month, y=number_of_rides, fill= customer_type)) + geom_col(position= "dodge") + labs(title = "Number of Rides by Rider Type")

## `summarise()` has grouped output by 'customer_type'. You can override using the
    ## `.groups` argument.

Casual customers take the most number of rides between June and September, with the most number of rides taken in July. February is the month with the fewest rides. Similarly, annual members also favor warm weather, but take more rides during the colder months than casual customers.

Conclusion and recommendations

* Casual customers tend to take far longer rides than annual members, however, annual members take more rides than casual customers throughout the weekdays. On the weekends, casual customers take more rides than annual members.

* Both casual customers and annual members favor the summer months, but annual members take more rides during the colder months.

Cyclistic Case Study

Yasmine Asali

2022-12-28