Web Scraping with R

class: center, middle, inverse, title-slide

# Web Scraping with R - Crash Course
### Fabian Gülzau
### Humboldt-Universität zu Berlin
### 2019/05/17

---

# Advantages of R

- Open source
- Huge community (e.g. [StackOverflow](https://stackoverflow.com/questions/tagged/r))
- Continuous development ([CRAN](https://cran.r-project.org/))
- [Visualization](https://socviz.co/index.html) & [digital methods](https://www.bitbybitbook.com/en/1st-ed/preface/)

---

# Steep learning curve?

.pull-left[

**In the past:**

R = hard to learn

**By now:**

- Online courses
- Books
- Videos
- ...

]

.pull-right[

]

---

# Installation

- Programming language: [R](https://cran.r-project.org/index.html)
- Development environment: [RStudio](https://www.rstudio.com/products/rstudio/)
- [Packages](https://r4ds.had.co.nz/introduction.html#the-tidyverse)

-> short [installation guide](https://r4ds.had.co.nz/introduction.html#prerequisites)

R can also be tried out first via [RStudioCloud](https://rstudio.cloud/).
This requires signing up beforehand. The service is free of charge.

```r
install.packages("pacman") # only needs to be installed once
library(pacman)
p_load(tidyverse)          # installs the tidyverse if missing, then loads it
```

---

# Crash course R

- Focus on data types and data preparation with [dplyr](https://dplyr.tidyverse.org/)
- Data on [choosing a degree program](https://www.studium.org/kommunikationswissenschaft/uebersicht-universitaeten) in the field of "media and communication studies"

Further reading:

- Wickham & Grolemund (2017) ["R for Data Science"](https://r4ds.had.co.nz/)
- Tutorial by [RStudio](https://rstudio.cloud/learn/primers)

---

# Data types

Further reading: [Wickham (2019)](https://adv-r.hadley.nz/vectors-chap.html)

---

# Data types in R

```r
# Vector
chr_vector <- c("test1", "testX", "TEST")

# Dataframe/tibble
df <- data.frame(ch = chr_vector,
                 nmrc = c(1, 2, 3))

# List
tlist <- list(e1 = df, 
              chr_vector)
```
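
A minimal sketch of how elements of these three structures can be accessed. It recreates the objects from the block above; the helper names `second`, `nmrc_mean`, `inner_df` and `inner_vec` are introduced here for illustration only.

```r
# Recreate the objects from the block above
chr_vector <- c("test1", "testX", "TEST")
df <- data.frame(ch = chr_vector, nmrc = c(1, 2, 3))
tlist <- list(e1 = df, chr_vector)

# Vector: index by position
second <- chr_vector[2]

# Data frame: extract a column as a vector with $
nmrc_mean <- mean(df$nmrc)

# List: [[ ]] returns the element itself, by name or by position
inner_df  <- tlist[["e1"]]   # the data frame stored under the name "e1"
inner_vec <- tlist[[2]]      # the unnamed character vector

# Compact overview of the nested structure
str(tlist)
```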

---

# Data: Studium.org

Choosing a degree program: media and communication studies

- Where should I study?
- What can my budget afford?
- What cultural offerings does the university town have?

---

# Data: Studium.org II

- 56 data points (one per individual page, e.g. [Münster](https://www.studium.org/kommunikationswissenschaft/uni-muenster))
- Copy and paste is tedious and error-prone
- Use case: web scraping (see [script](https://github.com/FabianFox/Webscraping-Muenster-/blob/master/Code/StudiumOrg-KoWi-Scraper.R))

Load the dataset:

```r
# recommended
studium.df <- readRDS(gzcon(url("https://github.com/FabianFox/Webscraping-Muenster-/blob/master/Data/KoWi-Institute.RDS?raw=true")))

# run the scraper script instead (takes ~5 min)
# devtools::source_url("https://raw.githubusercontent.com/FabianFox/Webscraping-Muenster-/master/Code/StudiumOrg-KoWi-Scraper.R")
```

---

# Data: Studium.org III

Which variables are available?

```r
# (1) Load dplyr
library(dplyr)

# (2) Overview
glimpse(studium.df)
```

```
## Observations: 56
## Variables: 9
## $ name                          <chr> "akademie-der-media", "bsp-busin...
## $ anzahl_kinos                  <dbl> 12, 108, 2, 8, 9, 20, 33, NA, 12...
## $ einwohnerzahl_stadt           <dbl> 586000, 3452911, 49098, 329327, ...
## $ kosten_semesterticket_in      <dbl> 199.00, 180.00, 104.00, 113.20, ...
## $ mietspiegel_stadt             <dbl> 11.20, 8.82, 10.31, 6.60, 8.83, ...
## $ offnungszeiten_bibliothek     <dbl> NA, 37.0, 114.0, NA, 84.0, 231.0...
## $ regionaler_preisindex         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ sonnenstunden_pro_jahr        <dbl> 1724.0, 1623.0, NA, 1460.0, 1700...
## $ studierende_hochschule_gesamt <dbl> 300, 472, 3800, 3200, 39600, 230...
```
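
The overview shows missing values (`NA`) in several columns. A minimal sketch of counting them per column in base R follows; the data frame `toy` is hypothetical and only borrows two column names from `studium.df`.

```r
# Hypothetical toy data; in the workshop this would be studium.df
toy <- data.frame(
  anzahl_kinos          = c(12, 108, NA),
  regionaler_preisindex = c(NA, NA, NA)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
na_counts
```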

---

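# Data exploration with dplyr

Five commands that support data preparation and analysis (cf. [Wickham & Grolemund 2017](https://r4ds.had.co.nz/transform.html) & [Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)):

- `select()`: choose variables
- `filter()`: choose cases
- `mutate()`: create new variables
- `arrange()`: sort cases
- `summarise()`: collapse values into a summary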
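
A minimal sketch of the five core dplyr verbs, assuming they are the ones introduced in the cited R4DS chapter (`select`, `filter`, `mutate`, `arrange`, `summarise`). The data frame `toy` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not taken from studium.df
toy <- data.frame(
  name      = c("a", "b", "c", "d"),
  kinos     = c(2, 10, 4, NA),
  einwohner = c(50000, 400000, 80000, 120000)
)

toy_selected <- select(toy, name, kinos)                 # keep columns
toy_filtered <- filter(toy, einwohner > 60000)           # keep rows matching a condition
toy_mutated  <- mutate(toy, kino_pc = kinos / einwohner) # add a derived column
toy_arranged <- arrange(toy, desc(einwohner))            # sort rows, largest first
toy_summary  <- summarise(toy, mean_kinos = mean(kinos, na.rm = TRUE)) # collapse to one row
```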

---

# Combining commands

```r
# (1) Copy of the dataset
stdm.copy <- studium.df
# (2) Select: name, cinemas, inhabitants
stdm.copy <- select(stdm.copy, name, anzahl_kinos, einwohnerzahl_stadt)
# (3) Create cinemas per capita
stdm.copy <- mutate(stdm.copy, kino_pc = anzahl_kinos / einwohnerzahl_stadt)
# (4) Keep cases > median(cinemas per capita)
stdm.copy <- filter(stdm.copy, kino_pc > median(kino_pc, na.rm = TRUE))
# (5) Sort by cinemas per capita (descending)
stdm.copy <- arrange(stdm.copy, desc(kino_pc))

glimpse(stdm.copy)
```

```
## Observations: 27
## Variables: 4
## $ name                <chr> "ku-eichstaett-ingolstadt", "uni-freiburg-...
## $ anzahl_kinos        <dbl> 5, 13, 9, 2, 4, 4, 15, 2, 3, 2, 108, 108, ...
## $ einwohnerzahl_stadt <dbl> 13100, 38732, 107500, 28962, 63315, 85500,...
## $ kino_pc             <dbl> 3.816794e-04, 3.356398e-04, 8.372093e-05, ...
```

---

# Disadvantages

- verbose (typos!)
- many copies of a dataset
- redundant

Another option:

```r
arrange(filter(mutate(select(studium.df, name, anzahl_kinos, einwohnerzahl_stadt), kino_pc = anzahl_kinos / einwohnerzahl_stadt), kino_pc > median(kino_pc, na.rm = TRUE)), desc(kino_pc))
```

```
## # A tibble: 27 x 4
##    name                     anzahl_kinos einwohnerzahl_stadt   kino_pc
##    <chr>                           <dbl>               <dbl>     <dbl>
##  1 ku-eichstaett-ingolstadt            5               13100 0.000382 
##  2 uni-freiburg-schweiz               13               38732 0.000336 
##  3 fau-erlangen-nuernberg              9              107500 0.0000837
##  4 tu-ilmenau                          2               28962 0.0000691
##  5 uni-weimar                          4               63315 0.0000632
##  6 uni-tuebingen                       4               85500 0.0000468
##  7 uni-zuerich                        15              366765 0.0000409
##  8 dhbw-ravensburg                     2               49098 0.0000407
##  9 uni-bamberg                         3               75743 0.0000396
## 10 zeppelin-uni                        2               59000 0.0000339
## # ... with 17 more rows
```

---

# The pipe (%>%)

Pipe operator: `%>%`

- chains commands
- is read as "and then"

1. Take the dataset `studium.df` (then `%>%`)
2. Select variables (then `%>%`)
3. Create the variable `kino_pc` (then `%>%`)
4. Keep cases that are above the median (then `%>%`)
5. Sort the dataset (descending cinemas per capita)

```r
studium.df %>%    
  select(name, anzahl_kinos, einwohnerzahl_stadt) %>%
  mutate(kino_pc = anzahl_kinos / einwohnerzahl_stadt) %>%
  filter(kino_pc > median(kino_pc, na.rm = TRUE)) %>%
  arrange(desc(kino_pc))
```
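
A minimal sketch of what the pipe does under the hood: `x %>% f(y)` is evaluated as `f(x, y)`, so a piped chain and the nested call give the same result. The vector `x` is a hypothetical example.

```r
library(dplyr)  # dplyr re-exports the %>% operator from magrittr

x <- c(4, 9, 16)

# Nested call: read from the inside out
nested <- round(sqrt(x), 1)

# Piped call: read from left to right ("and then")
piped <- x %>% sqrt() %>% round(1)

identical(nested, piped)
```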

---

# Complex recoding

[case_when](https://dplyr.tidyverse.org/reference/case_when.html):

```r
studium.df <- studium.df %>%
  mutate(
    stadt_typ = case_when(
      einwohnerzahl_stadt < 100000 ~ "stadt",
      einwohnerzahl_stadt < 500000 ~ "großstadt",
      einwohnerzahl_stadt >= 500000 ~ "metropole",
      TRUE ~ NA_character_))
```
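
`case_when()` checks the conditions from top to bottom and uses the first one that is `TRUE`. A minimal sketch with hypothetical population figures at the class boundaries; it uses `>=` for the top class so that exactly 500000 inhabitants is covered as well.

```r
library(dplyr)

# Hypothetical values: both boundaries, their neighbours, and a missing value
einwohner <- c(99999, 100000, 499999, 500000, NA)

stadt_typ <- case_when(
  einwohner < 100000  ~ "stadt",
  einwohner < 500000  ~ "großstadt",   # only reached if the first test failed
  einwohner >= 500000 ~ "metropole",
  TRUE ~ NA_character_                 # fallback, e.g. for missing values
)

stadt_typ
```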

---

# Exercises

- How much does a semester ticket cost on average?
- Which ten cities offer the cheapest ticket?
- Does the ticket price differ by city type (`stadt`/`großstadt`/`metropole`)? (Hint: `?group_by`)
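
A minimal sketch of the `group_by()` pattern from the hint, shown on toy data rather than as a solution. The column names mirror `studium.df`; the values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; only the column names are borrowed from studium.df
toy <- data.frame(
  stadt_typ                = c("stadt", "stadt", "metropole", "metropole"),
  kosten_semesterticket_in = c(100, 120, 180, NA)
)

# group_by() splits the data by group, summarise() computes one row per group
ticket_by_type <- toy %>%
  group_by(stadt_typ) %>%
  summarise(mean_ticket = mean(kosten_semesterticket_in, na.rm = TRUE))

ticket_by_type
```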

---

# Visualization

Package: ggplot2 (introduction: [Healy 2018](http://socviz.co/))
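
A minimal ggplot2 sketch: a scatter plot of cinemas against inhabitants. The data frame `toy` is hypothetical and only borrows two column names from `studium.df`; the log scale is one possible choice for skewed city sizes.

```r
library(ggplot2)

# Hypothetical toy data; in the workshop this would be studium.df
toy <- data.frame(
  einwohnerzahl_stadt = c(50000, 80000, 120000, 400000, 3400000),
  anzahl_kinos        = c(2, 4, 5, 10, 108)
)

# Map variables to axes, add points, and put the x axis on a log scale
p <- ggplot(toy, aes(x = einwohnerzahl_stadt, y = anzahl_kinos)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "Inhabitants (log scale)", y = "Number of cinemas")

print(p)
```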

---

# Lernressourcen

Books:

- Wickham & Grolemund (2017) R for Data Science [(online)](https://r4ds.had.co.nz/)
- Healy (2018) Data Visualization [(online)](https://socviz.co/index.html)
- Phillips (2018) YaRrr! The Pirate’s Guide to R [(online)](https://bookdown.org/ndphillips/YaRrr/)

Interactive:

- RStudioPrimers [(online)](https://rstudio.cloud/learn/primers)
- swirl: Learn R, in R [(Installation)](https://swirlstats.com/students.html)

Quick references:

- Cheatsheets [(online)](https://www.rstudio.com/resources/cheatsheets/)

Further resources:

- Collection of learning resources at RStudio [(online)](https://www.rstudio.com/resources/)
- Learning community: [TidyTuesdays](https://github.com/rfordatascience/tidytuesday)