TidierText.nma_words
— Constantnma_words
Returns a DataFrame containing 44 English negators, modals, and adverbs
The NMA words come from a list developed by Saif Mohammad: http://saifmohammad.com/WebPages/SCL.html#NMA
Returns
DataFrame
with a single columnword
, each row containing a stopword.
Examples
julia> first(nma_words, 5)
5×2 DataFrame
Row │ word modifier
│ String String
─────┼─────────────────────
1 │ cannot negator
2 │ could not negator
3 │ did not negator
4 │ does not negator
5 │ had no negator
TidierText.get_stopwords
— Methodget_stopwords()
Returns a DataFrame containing English stopwords.
The stopwords come from the Languages.jl
package: https://github.com/JuliaText/Languages.jl/blob/master/data/stopwords/English.txt.
Returns
DataFrame
with a single columnword
, each row containing a stopword.
Examples
julia> first(get_stopwords(), 5)
5×1 DataFrame
Row │ word
│ String
─────┼────────
1 │ a
2 │ about
3 │ above
4 │ across
5 │ after
TidierText.tidy
— Method@tidy(corpus)
Returns a DataFrame of the corpus with 1 row per document
Arguments
corpus
: The corpus to be converted into a DataFrame with appropriate metadata for each document in the corpus.
Returns
- A DataFrame of the corpus with 1 row per document
Examples
julia>
TidierText.@bind_tf_idf
— Macro@bind_tf_idf(df, term_col, document_col, n)
Calculates TF-IDF values for the specified columns of df
.
Arguments
df
: The input DataFrame.term_col
: The column containing terms.document_col
: The column containing document identifiers.n
: The column containing term frequencies.
Returns
- A new DataFrame containing TF, IDF, and TF-IDF values.
Examples
julia> using DataFrames;
df = DataFrame(
doc_id = [1, 1, 2, 2, 3, 3],
term = ["apple", "banana", "apple", "cherry", "banana", "date"],
n = [1, 4, 6, 4, 9, 8]
);
julia> @bind_tf_idf(df, doc_id, term, n)
6×6 DataFrame
Row │ doc_id term n tf idf tf_idf
│ Int64 String Int64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────
1 │ 1 apple 1 0.142857 0.693147 0.099021
2 │ 1 banana 4 0.307692 0.693147 0.213276
3 │ 2 apple 6 0.857143 0.693147 0.594126
4 │ 2 cherry 4 1.0 0.693147 0.693147
5 │ 3 banana 9 0.692308 0.693147 0.479871
6 │ 3 date 8 1.0 0.693147 0.693147
TidierText.@unnest_characters
— Macro@unnest_characters(df, output_col, input_col, to_lower = false, strip_non_alphanum = true)
Splits the text in input_col
of df
into separate characters, outputting the result to output_col
.
Arguments
df
: The input DataFrame.output_col
: The name of the output column to store the characters.input_col
: The name of the input column containing text.to_lower
: An optional boolean that defaults to false which indicates whether to convert text to lowercase before splitting.strip_non_alphanum
: Optional boolean that defualts to true, which strips non alphanumeric characters
Returns
- A new DataFrame with the text split into characters.
Examples
julia> using DataFrames;
df = DataFrame(
text = ["The quick.", "Nice."],
doc = [1, 2]);
julia> @unnest_characters(df, term, text, to_lower = false, strip_non_alphanum = false)
15×2 DataFrame
Row │ doc term
│ Int64 Char
─────┼─────────────
1 │ 1 T
2 │ 1 h
3 │ 1 e
4 │ 1
5 │ 1 q
6 │ 1 u
7 │ 1 i
8 │ 1 c
9 │ 1 k
10 │ 1 .
11 │ 2 N
12 │ 2 i
13 │ 2 c
14 │ 2 e
15 │ 2 .
julia> @unnest_characters(df, term, text, to_lower = true)
13×2 DataFrame
Row │ doc term
│ Int64 Char
─────┼─────────────
1 │ 1 t
2 │ 1 h
3 │ 1 e
4 │ 1
5 │ 1 q
6 │ 1 u
7 │ 1 i
8 │ 1 c
9 │ 1 k
10 │ 2 n
11 │ 2 i
12 │ 2 c
13 │ 2 e
TidierText.@unnest_ngrams
— Macro@unnest_ngrams(df, output_col, input_col, n, to_lower = false)
Creates n-grams from the text in input_col
of df
, outputting the result to output_col
.
Arguments
df
: The input DataFrame.output_col
: The name of the output column to store the n-grams.input_col
: The name of the input column containing text.n
: The size of the n-grams.to_lower
: An optional boolean that defaults to false which indicates whether to convert text to lowercase before splitting.strip_non_alphanum
: Optional boolean that defualts to true, which strips non alphanumeric characters
Returns
- A new DataFrame with the n-grams.
Examples
julia> using DataFrames;
df = DataFrame(
text = ["The quick brown fox jumps.",
"The sun rises in the east."],
doc = [1, 2]
);
julia> @unnest_ngrams(df, term, text, 2, to_lower = false)
9×2 DataFrame
Row │ doc term
│ Int64 String
─────┼────────────────────
1 │ 1 The quick
2 │ 1 quick brown
3 │ 1 brown fox
4 │ 1 fox jumps
5 │ 2 The sun
6 │ 2 sun rises
7 │ 2 rises in
8 │ 2 in the
9 │ 2 the east
julia> @unnest_ngrams(df, term, text, 2)
9×2 DataFrame
Row │ doc term
│ Int64 String
─────┼────────────────────
1 │ 1 The quick
2 │ 1 quick brown
3 │ 1 brown fox
4 │ 1 fox jumps
5 │ 2 The sun
6 │ 2 sun rises
7 │ 2 rises in
8 │ 2 in the
9 │ 2 the east
TidierText.@unnest_regex
— Macro@unnest_regex(df, output_col, input_col, pattern = "\s+", to_lower = false)
Splits the text in input_col
of df
based on a regex pattern
, outputting the result to output_col
.
Arguments
df
: The input DataFrame.output_col
: The name of the output column to store the result.input_col
: The name of the input column containing text to split.pattern
: The regex pattern to use for splitting text.to_lower
: An optional boolean that defaults to false which indicates whether to convert text to lowercase before splitting.strip_non_alphanum
: Optional boolean that defualts to true, which strips non alphanumeric characters
Returns
- A new DataFrame with the text split based on the regex pattern.
Examples
julia> using DataFrames;
df = DataFrame(
text = ["The quick brown fox jumps.",
"One column and the one row?"],
doc = [1, 2]
);
julia> @unnest_regex(df, word, text, "the")
3×2 DataFrame
Row │ doc word
│ Int64 SubStrin…
─────┼──────────────────────────────────
1 │ 1 The quick brown fox jumps
2 │ 2 One column and
3 │ 2 one row
julia> @unnest_regex(df, word, text, "the", to_lower = true, strip_non_alphanum = false)
3×2 DataFrame
Row │ doc word
│ Int64 SubStrin…
─────┼───────────────────────────────
1 │ 1 quick brown fox jumps.
2 │ 2 one column and
3 │ 2 one row?
TidierText.@unnest_tokens
— Macro@unnest_tokens(df, output_col, input_col, to_lower = false)
Tokenizes the text in input_col
of df
into separate words, outputting the result to output_col
.
Arguments
df
: The input DataFrame.output_col
: The name of the output column to store the tokens.input_col
: The name of the input column containing text to tokenize.to_lower
: An optional boolean that defaults to false which indicates whether to convert text to lowercase before splitting.strip_non_alphanum
: Optional boolean that defualts to true, which strips non alphanumeric characters
Returns
- A new DataFrame with the tokenized text.
Examples
julia> using DataFrames;
df = DataFrame(
text = ["The quick brown fox jumps.",
"One column and the one row?"],
doc = [1, 2]
);
julia> @unnest_tokens(df, word, text, strip_non_alphanum = false)
11×2 DataFrame
Row │ doc word
│ Int64 SubStrin…
─────┼──────────────────
1 │ 1 The
2 │ 1 quick
3 │ 1 brown
4 │ 1 fox
5 │ 1 jumps.
6 │ 2 One
7 │ 2 column
8 │ 2 and
9 │ 2 the
10 │ 2 one
11 │ 2 row?
julia> @unnest_tokens(df, word, text, to_lower = true)
11×2 DataFrame
Row │ doc word
│ Int64 SubStrin…
─────┼──────────────────
1 │ 1 the
2 │ 1 quick
3 │ 1 brown
4 │ 1 fox
5 │ 1 jumps
6 │ 2 one
7 │ 2 column
8 │ 2 and
9 │ 2 the
10 │ 2 one
11 │ 2 row