Merged
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -1,4 +1,5 @@
^renv$
^\.git$
^renv\.lock$
^.*\.Rproj$
^\.Rproj\.user$
2 changes: 0 additions & 2 deletions .github/workflows/check-full.yaml
@@ -25,8 +25,6 @@ jobs:
- {os: macOS-latest, r: 'release'}

- {os: windows-latest, r: 'release'}
-# Use 3.6 to trigger usage of RTools35
-- {os: windows-latest, r: '3.6'}

# Use older ubuntu to maximise backward compatibility
- {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'}
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@
.RData
docs/
pkgdown/
.worktrees/
5 changes: 3 additions & 2 deletions DESCRIPTION
@@ -12,14 +12,15 @@ Description: Provides some functions to get Korean text sample from news article
License: MIT + file LICENSE
URL: https://forkonlp.github.io/N2H4/, https://github.com/forkonlp/N2H4
BugReports: https://github.com/forkonlp/N2H4/issues
-RoxygenNote: 7.2.3
+RoxygenNote: 7.3.3
Depends:
R (>= 3.5.0)
Encoding: UTF-8
Suggests:
testthat,
devtools,
-usethis
+usethis,
+xml2
Imports:
rvest,
jsonlite,
6 changes: 6 additions & 0 deletions NAMESPACE
@@ -11,6 +11,12 @@ export(getMainCategory)
export(getMaxPageNum)
export(getSubCategory)
export(getUrlList)
export(news_category_get)
export(news_comment)
export(news_comment_history)
export(news_content)
export(news_max_page_num)
export(news_urls_from_list)
importFrom(httr2,req_headers)
importFrom(httr2,req_method)
importFrom(httr2,req_perform)
6 changes: 6 additions & 0 deletions NEWS.md
@@ -1,3 +1,9 @@
# N2H4 (development)

* Add `reporter` column support in `getContent()` / `news_content()`.
* Add `N2H4_CACHE` environment variable to control category cache usage.
* Add legacy installation article and migrate selected wiki guides into pkgdown articles.

# N2H4 0.8.4

* Fixed the getMainCategory() function.
14 changes: 13 additions & 1 deletion R/getCategory.R
@@ -1,10 +1,22 @@
#' News Category
#'
#' @param fresh get data from online. Default is FALSE using cached built-in data.

# lint warning: R/getCategory.R:3:81 [line_length_linter] Lines should not be more than 80 characters. This line is 82 characters.
#' @details Use `N2H4_CACHE` to control cached data usage when `fresh = FALSE`.
#' truthy values: `1`, `true`, `yes`, `on`; falsy values: `0`, `false`,
#' `no`, `off`.
#' @export
getCategory <- function(fresh = FALSE) {

# lint warning: R/getCategory.R:8:1 [object_name_linter] Variable and function name style should match snake_case or symbols.
-if (!fresh) {
warn_legacy("getCategory()", "news_category_get()")

# lint warning: R/getCategory.R:9:3 [object_usage_linter] no visible global function definition for 'warn_legacy'
news_category_get(fresh = fresh)
}

#' @rdname getCategory
#' @export
news_category_get <- function(fresh = FALSE) {
use_cache <- !isTRUE(fresh) && news_cache_enabled(default = TRUE)

# lint warning: R/getCategory.R:16:34 [object_usage_linter] no visible global function definition for 'news_cache_enabled'

if (use_cache) {
return(news_category)

# lint warning: R/getCategory.R:19:12 [object_usage_linter] no visible binding for global variable 'news_category'
}
mcate <- getMainCategory()
cate <- list()
@@ -15,7 +27,7 @@
getSubCategory(sid1 = mcate$sid1[i])
)
}
return(tibble::as_tibble(do.call(rbind, cate)))

# lint warning: R/getCategory.R:30:3 [return_linter] Use implicit return behavior; explicit return() is not needed.
}

#' Get News Main Categories
@@ -25,14 +37,14 @@
#' @return a [tibble][tibble::tibble-package]
#' @export
#' @importFrom rvest html_nodes html_attr html_text
#' @importFrom httr2 request req_user_agent req_headers req_method req_perform resp_body_html

# lint warning: R/getCategory.R:40:81 [line_length_linter] Lines should not be more than 80 characters. This line is 93 characters.
#' @examples
#' \dontrun{
#' getMainCategory()
#' }

getMainCategory <- function() {

# lint warning: R/getCategory.R:46:1 [object_name_linter] Variable and function name style should match snake_case or symbols.
httr2::request("https://news.naver.com/") %>%

# lint warning: R/getCategory.R:47:45 [object_usage_linter] no visible global function definition for '%>%'
httr2::req_user_agent("N2H4 by chanyub.park <mrchypark@gmail.com>") %>%
httr2::req_headers("Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9") %>%
httr2::req_method("GET") %>%
10 changes: 9 additions & 1 deletion R/getComment.R
@@ -17,6 +17,15 @@
getComment <- function(turl,
count = 10,
type = c("df", "list")) {
warn_legacy("getComment()", "news_comment()")
news_comment(turl, count, type)
}

#' @rdname getComment
#' @export
news_comment <- function(turl,
count = 10,
type = c("df", "list")) {
get_comment(turl, count, type)
}

@@ -109,4 +118,3 @@ get_comment <- function(turl,

return(do.call(rbind, res))
}

11 changes: 10 additions & 1 deletion R/getCommentHistory.R
@@ -20,6 +20,16 @@ getCommentHistory <- function(turl,
commentNo,
count = 10,
type = c("df", "list")) {
warn_legacy("getCommentHistory()", "news_comment_history()")
news_comment_history(turl, commentNo, count, type)
}

#' @rdname getCommentHistory
#' @export
news_comment_history <- function(turl,
commentNo,
count = 10,
type = c("df", "list")) {
get_comment_history(turl, commentNo, count, type)
}

@@ -111,4 +121,3 @@ get_comment_history <- function(turl,
return(do.call(rbind, res))
}


46 changes: 43 additions & 3 deletions R/getContent.R
@@ -22,6 +22,28 @@ getContent <-
"datetime",
"edittime",
"press",
"reporter",
"title",
"body"
)) {
warn_legacy("getContent()", "news_content()")
news_content(turl, col = col)
}

#' @rdname getContent
#' @export
#' @importFrom httr2 request req_user_agent req_method req_perform resp_body_html
#' @importFrom rvest html_nodes html_text html_attr
news_content <-
function(turl,
col = c(
"url",
"original_url",
"section",
"datetime",
"edittime",
"press",
"reporter",
"title",
"body"
)) {
@@ -36,22 +58,23 @@ getContent <-
if (
identical(
grep("^https?://n.news.naver.com", urlcheck), integer(0)
)
) {
original_url <- "page is not news section."
title <- "page is not news section."
datetime <- "page is not news section."
edittime <- "page is not news section."
press <- "page is not news section."
reporter <- "page is not news section."
body <- "page is not news section."
section <- "page is not news section."

} else {
original_url <- getOriginalUrl(html_obj)
title <- getContentTitle(html_obj)
datetime <- getContentDatetime(html_obj)
edittime <- getContentEditDatetime(html_obj)
press <- getContentPress(html_obj)
reporter <- news_content_reporter(html_obj)
body <- getContentBody(html_obj)
section <- getSection(turl)
}
@@ -65,6 +88,7 @@ getContent <-
datetime = datetime,
edittime = edittime,
press = press,
reporter = reporter,
title = title,
body = body,
section = section
@@ -121,6 +145,22 @@ getContentPress <-
return(press[1])
}

news_content_reporter <-
function(html_obj,
reporter_node_info = c(
".media_end_head_journalist_name",
".byline_s"
)) {
node <- rvest::html_nodes(html_obj, paste(reporter_node_info, collapse = ", "))
reporter <- trimws(rvest::html_text(node))
reporter <- reporter[nchar(reporter) > 0]

if (length(reporter) == 0) {
return(NA_character_)
}
return(reporter[1])
}

getContentBody <-
function(html_obj,
body_node_info = "article#dic_area",
7 changes: 7 additions & 0 deletions R/getMaxPageNum.R
@@ -14,6 +14,13 @@
#' }

getMaxPageNum <- function(turl, max = 100) {
warn_legacy("getMaxPageNum()", "news_max_page_num()")
news_max_page_num(turl, max = max)
}

#' @rdname getMaxPageNum
#' @export
news_max_page_num <- function(turl, max = 100) {
lifecycle::deprecate_soft("1.0.0", "when()", I("`if`"))

httr2::request(turl) %>%
21 changes: 15 additions & 6 deletions R/getUrlList.R
@@ -17,12 +17,22 @@
getUrlList <-
function(turl,
col = c("titles", "links")) {
warn_legacy("getUrlList()", "news_urls_from_list()")
news_urls_from_list(turl, col = col)
}

-httr2::request(turl) %>%
-httr2::req_user_agent("N2H4 by chanyub.park <mrchypark@gmail.com>") %>%
-httr2::req_method("GET") %>%
-httr2::req_perform() %>%
-httr2::resp_body_html() -> hobj
#' @rdname getUrlList
#' @export
#' @importFrom rvest html_nodes html_attr html_text
#' @importFrom httr2 request req_user_agent req_method req_perform resp_body_html
news_urls_from_list <-
function(turl,
col = c("titles", "links")) {
httr2::request(turl) %>%
httr2::req_user_agent("N2H4 by chanyub.park <mrchypark@gmail.com>") %>%
httr2::req_method("GET") %>%
httr2::req_perform() %>%
httr2::resp_body_html() -> hobj

titles <- rvest::html_nodes(hobj, "dt a")
titles <- rvest::html_text(titles)
@@ -51,4 +61,3 @@ getUrlList <-

return(news_lists[, col])
}

24 changes: 24 additions & 0 deletions R/internal.R
@@ -5,6 +5,30 @@ get_oid <- function(turl) {
paste0(tem[3], ",", tem[4])
}

warn_legacy <- function(old, new) {
lifecycle::deprecate_warn(
when = "0.9.0",
what = old,
with = new,
id = paste0("n2h4-", old)
)
}

news_cache_enabled <- function(default = TRUE) {
unset <- if (isTRUE(default)) "true" else "false"
value <- tolower(trimws(Sys.getenv("N2H4_CACHE", unset = unset)))

if (value %in% c("1", "true", "yes", "on")) {
return(TRUE)
}
if (value %in% c("0", "false", "no", "off")) {
return(FALSE)
}

warning("Invalid N2H4_CACHE value. Falling back to default.")
isTRUE(default)
}

rm_callback <- function(text) {
text <- gsub("_callback", "", text)
text <- gsub("\\(", "[", text)
24 changes: 24 additions & 0 deletions README.md
@@ -38,6 +38,30 @@ install.packages("N2H4")
install.packages("N2H4", repos = "https://forkonlp.r-universe.dev")
```

## Legacy installation

```r
install.packages("remotes")
remotes::install_version("N2H4", version = "0.8.4", repos = "https://cran.r-project.org")
```

More details: <https://forkonlp.github.io/N2H4/articles/install-legacy.html>

## Guides

- Korean README: <https://forkonlp.github.io/N2H4/articles/readmekr.html>
- Get content: <https://forkonlp.github.io/N2H4/articles/get-content.html>
- Get comments: <https://forkonlp.github.io/N2H4/articles/get-comment.html>
- Wiki migration - feature overview: <https://forkonlp.github.io/N2H4/articles/wiki-feature-overview.html>
- Wiki migration - category collection: <https://forkonlp.github.io/N2H4/articles/wiki-category-collection.html>
- Wiki migration - text collection: <https://forkonlp.github.io/N2H4/articles/wiki-text-collection.html>

## Environment variables

- `N2H4_CACHE`: controls whether cached category data is used in `getCategory()` / `news_category_get()` when `fresh = FALSE`.
  - truthy: `1`, `true`, `yes`, `on`
  - falsy: `0`, `false`, `no`, `off`
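
As a rough illustration of how these values might be interpreted, the sketch below mirrors the parsing added in `R/internal.R` by this change. `parse_cache_flag()` is a hypothetical stand-alone helper written for this example, not an exported N2H4 function; it assumes values are compared case-insensitively after trimming whitespace, and that unrecognized values fall back to the default with a warning.

```r
# Hypothetical helper for illustration only; it mirrors the logic of the
# internal news_cache_enabled() shown in this diff but is not part of N2H4.
parse_cache_flag <- function(value, default = TRUE) {
  # Normalize: trim surrounding whitespace and ignore case
  value <- tolower(trimws(value))

  if (value %in% c("1", "true", "yes", "on")) {
    return(TRUE)
  }
  if (value %in% c("0", "false", "no", "off")) {
    return(FALSE)
  }

  # Anything else is treated as invalid and falls back to the default
  warning("Invalid N2H4_CACHE value. Falling back to default.")
  isTRUE(default)
}

# Disable the cache for the current R session, then read the flag back
Sys.setenv(N2H4_CACHE = " OFF ")
use_cache <- parse_cache_flag(Sys.getenv("N2H4_CACHE", unset = "true"))
use_cache  # FALSE
```

With the variable set to a falsy value, `news_category_get()` would be expected to fetch categories online even when `fresh = FALSE`.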

## Contributors

Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):
19 changes: 19 additions & 0 deletions _pkgdown.yml
@@ -10,6 +10,25 @@ template:
params:
toggle: manual
ganalytics: UA-47822682-17
navbar:
  components:
    articles:
      text: Korean documentation
      menu:
      - text: README
        href: articles/readmekr.html
      - text: Get articles
        href: articles/get-content.html
      - text: Get comments
        href: articles/get-comment.html
      - text: Install legacy version
        href: articles/install-legacy.html
      - text: Wiki migration - feature overview
        href: articles/wiki-feature-overview.html
      - text: Wiki migration - category collection
        href: articles/wiki-category-collection.html
      - text: Wiki migration - text collection
        href: articles/wiki-text-collection.html

reference:
- title: Content
8 changes: 8 additions & 0 deletions man/getCategory.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 3 additions & 0 deletions man/getComment.Rd


3 changes: 3 additions & 0 deletions man/getCommentHistory.Rd
