Package 'polite'

Title: Be Nice on the Web
Description: Be responsible when scraping data from websites by following polite principles: introduce yourself, ask for permission, take slowly and never ask twice.
Authors: Dmytro Perepolkin [aut, cre]
Maintainer: Dmytro Perepolkin <[email protected]>
License: MIT + file LICENSE
Version: 0.1.3
Built: 2024-10-29 05:17:07 UTC
Source: https://github.com/dmi3kno/polite

Help Index


Introduce yourself to the host

Description

Introduce yourself to the host

Usage

bow(
  url,
  user_agent = "polite R package",
  delay = 5,
  times = 3,
  force = FALSE,
  verbose = FALSE,
  ...
)

is.polite(x)

Arguments

url

URL

user_agent

character value passed to user agent string

delay

desired delay between scraping attempts. Final value will be the maximum of desired and mandated delay, as stipulated by robots.txt for relevant user agent

times

number of times to attempt scraping. Default is 3.

force

refresh all memoised functions. Clears up robotstxt and scrape caches. Default is FALSE

verbose

TRUE/FALSE. If TRUE, outputs more information about the querying process. Default is FALSE

...

other curl parameters wrapped into httr::config function

x

object of class polite, session

Value

object of class polite, session

Examples

library(polite)

 host <- "https://www.cheese.com"
 session <- bow(host)
 session
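
The rule described for delay (the larger of the requested value and the value mandated by robots.txt wins) can be sketched in base R. effective_delay() is a hypothetical helper written for this illustration and is not part of polite; the delay values are made up.

```r
# Hypothetical helper, not part of polite: illustrates the "maximum of
# desired and mandated delay" rule described for the delay argument.
effective_delay <- function(desired, mandated = NA_real_) {
  # NA stands for "robots.txt sets no crawl delay for this user agent"
  max(desired, mandated, na.rm = TRUE)
}

d_host_stricter <- effective_delay(desired = 5, mandated = 10)  # host asks for more
d_user_stricter <- effective_delay(desired = 5, mandated = 2)   # requested delay is larger
d_no_mandate    <- effective_delay(desired = 5)                 # nothing mandated
```

Under these assumptions a session never waits less than the host asks for, even when a smaller delay is requested.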

Guess download file name from the URL

Description

Guess download file name from the URL

Usage

guess_basename(x)

Arguments

x

url to guess basename from

Value

guessed file name

Examples

guess_basename("https://bit.ly/polite_sticker")
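
For comparison, base R's basename() returns the last path component of a URL. The sketch below uses one made-up example.com URL, does not call guess_basename(), and shows only the naive starting point rather than the package's actual guessing logic.

```r
# Base R only; the first URL is illustrative
with_extension    <- basename("https://example.com/files/report.pdf")
without_extension <- basename("https://bit.ly/polite_sticker")

with_extension     # last path component, including the extension
without_extension  # a shortened link carries no file extension
```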

Convert collection of html nodes into data frame

Description

Convert collection of html nodes into data frame

Usage

html_attrs_dfr(
  x,
  attrs = NULL,
  trim = FALSE,
  defaults = NA_character_,
  add_text = TRUE
)

Arguments

x

xml_nodeset object, containing text and attributes of interest

attrs

character vector of attribute names. If missing, all attributes will be used

trim

if TRUE, will trim leading and trailing spaces

defaults

character vector of default values to be passed to rvest::html_attr(). Recycled to match length of attrs

add_text

if TRUE, node content will be added as .text column (using rvest::html_text)

Value

data frame with one row per xml node, consisting of a .text column with the node text (when add_text is TRUE) and additional columns with attributes

Examples

library(polite)
library(rvest)
bow("https://en.wikipedia.org/wiki/List_of_cognitive_biases") %>%
  scrape() %>%
  html_nodes("tr td:nth-child(1) a") %>%
  html_attrs_dfr()
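
The attrs, defaults and trim arguments can also be tried without contacting a host. The following is a minimal sketch that assumes an inline HTML snippet (invented for this illustration) in place of a scraped page; the exact column layout of the result is not shown because it depends on the package version.

```r
library(polite)
library(rvest)

# Inline HTML stands in for a scraped page (illustrative content)
page <- read_html('<html><body>
  <a href="/a" title="First"> Alpha </a>
  <a href="/b"> Beta </a>
</body></html>')

links <- html_nodes(page, "a")

# Keep two attributes; nodes lacking "title" receive the default value,
# and trim = TRUE strips the spaces around the link text
link_df <- html_attrs_dfr(links, attrs = c("href", "title"),
                          defaults = "missing", trim = TRUE)
link_df
```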

Agree modification of session path with the host

Description

Agree modification of session path with the host

Usage

nod(bow, path, verbose = FALSE)

Arguments

bow

object of class polite, session created by polite::bow()

path

string value of path/URL to follow. The function accepts either a path (string part of URL following domain name) or a full URL

verbose

TRUE/FALSE. If TRUE, outputs more information about the querying process. Default is FALSE

Value

object of class polite, session with modified URL

Examples

library(polite)

host <- "https://www.cheese.com"
session <- bow(host) %>%
  nod(path = "by_type")
session

Give your web-scraping function good manners

Description

Give your web-scraping function good manners

Usage

politely(
  fun,
  user_agent = paste0("polite ", getOption("HTTPUserAgent"), " bot"),
  robots = TRUE,
  force = FALSE,
  delay = 5,
  verbose = FALSE,
  cache = memoise::cache_memory()
)

Arguments

fun

function to be made "polite". Must have an argument named url that holds the URL to be queried.

user_agent

optional, user agent string to be used. Defaults to paste0("polite ", getOption("HTTPUserAgent"), " bot")

robots

optional, should robots.txt be consulted for permissions. Default is TRUE

force

whether or not to force fresh download of robots.txt

delay

minimum delay in seconds, not less than 1. Default is 5.

verbose

output more information about querying process

cache

memoise cache function for storing results. Default memoise::cache_memory()

Value

polite function

Examples

polite_GET <- politely(httr::GET)
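
Any function with a url argument can be wrapped the same way. The sketch below assumes a small user-written function, fetch_status(), which is hypothetical and not part of polite or httr; the call that would contact a host is left commented out.

```r
library(polite)

# Hypothetical scraping function; the documented requirement is only
# that it has an argument named `url`
fetch_status <- function(url, ...) {
  httr::status_code(httr::GET(url, ...))
}

# Wrap it using the documented arguments
polite_fetch_status <- politely(fetch_status, delay = 5, verbose = TRUE)

# Not run: calling the wrapper contacts the host
# polite_fetch_status("https://www.cheese.com")
```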

Print host introduction object

Description

Print host introduction object

Usage

## S3 method for class 'polite'
print(x, ...)

Arguments

x

object of class polite, session

...

other parameters passed to methods


Polite file download

Description

Polite file download

Usage

rip(
  bow,
  destfile = NULL,
  ...,
  mode = "wb",
  path = tempdir(),
  overwrite = FALSE
)

Arguments

bow

host introduction object of class polite, session created by bow() or nod()

destfile

optional new file name to use when saving the file. If missing, it will be guessed from basename(url)

...

other parameters passed to download.file

mode

character. The mode with which to write the file. Useful values are w, wb (binary), a (append) and ab. Not used for methods wget and curl.

path

character. Path where to save the destfile. Defaults to the temporary directory created with tempdir(). Ignored if destfile contains a path along with the filename.

overwrite

if TRUE will overwrite file on disk

Value

Full path to the locally saved file indicated by the user in destfile (and path)

Examples

library(polite)
bow("https://en.wikipedia.org/") %>%
  nod("wiki/Flag_of_the_United_States#/media/File:Flag_of_the_United_States.svg") %>%
  rip()

Scrape the content of authorized page/API

Description

Scrape the content of authorized page/API

Usage

scrape(
  bow,
  query = NULL,
  params = NULL,
  accept = "html",
  content = NULL,
  verbose = FALSE
)

Arguments

bow

host introduction object of class polite, session created by bow() or nod()

query

named list of parameters to be appended to URL in the format list(param1=valA, param2=valB)

params

deprecated. Use query argument above.

accept

character value of expected data type to be returned by host (e.g. html, json, xml, csv, txt, etc.)

content

MIME type (aka internet media type) used to override the content type returned by the server. See http://en.wikipedia.org/wiki/Internet_media_type for a list of common types. You can add the charset parameter to override the server's default encoding

verbose

extra feedback from the function. Defaults to FALSE

Value

Object of class httr::response which can be further processed by functions in rvest package

Examples

library(rvest)
bow("https://en.wikipedia.org/wiki/List_of_cognitive_biases") %>%
  scrape(content = "text/html; charset=UTF-8") %>%
  html_nodes(".wikitable") %>%
  html_table()
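
To see what appending a named list to a URL looks like, httr::modify_url() can be used on its own. This is only an illustration with a made-up example.com URL and parameters; whether scrape() builds its request URL with this exact call is an assumption.

```r
library(httr)

# Illustrative URL and parameters (assumptions, not taken from polite)
query_url <- modify_url("https://example.com/search",
                        query = list(q = "cheese", page = 2))
query_url

# Not run: the same named list passed through the query argument
# bow("https://example.com") %>%
#   nod("search") %>%
#   scrape(query = list(q = "cheese", page = 2))
```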

Reset scraping/ripping rate limit

Description

Reset scraping/ripping rate limit

Usage

set_scrape_delay(delay)

set_rip_delay(delay)

Arguments

delay

Delay between subsequent requests. Default for package is 5 sec. It can be set lower only under the condition of specifying a custom user-agent string.

Value

Updates rate-limit property of scrape and rip functions, respectively.

Examples

library(polite)

 host <- "https://www.cheese.com"
 session <- bow(host)
 session
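
The example above does not call the delay setters themselves. A minimal sketch of their use, assuming a delay above the package default so that the custom user-agent condition does not apply:

```r
library(polite)

# Raise the delay between subsequent requests to 10 seconds
set_scrape_delay(10)  # applies to scrape()
set_rip_delay(10)     # applies to rip()
```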

Use manners in your own package or script

Description

Creates collection of polite functions for scraping and downloading

Usage

use_manners(save_as = "R/polite-scrape.R", open = TRUE)

Arguments

save_as

File where the functions should be created. Defaults to "R/polite-scrape.R"

open

if TRUE, open the resultant files