nlp

parse.nlp

Natural Language Processing with TVB-O.

Example

# Provide the path to the PDF file you want to parse
# pdf_file_path = "/Users/leonmartin_bih/Downloads/Deco2014.pdf"  # alternative input
pdf_file_path = "/Users/leonmartin_bih/Downloads/fncom-13-00054 (1).pdf"

# Call the functions to extract the full text and the methods section from the PDF file
extracted_text = extract_text_from_pdf(pdf_file_path)
methods = extract_methods_from_pdf(pdf_file_path)

Functions

Name	Description
extract_methods_from_pdf	Extract the “Methods” or “Materials and Methods” section from a PDF.
extract_text_from_pdf	Extract text content from a PDF file.
find_in_text	Find occurrences of a keyword in the text using regex.
get_pdf_page	Extract text content from a specific page of a PDF file.
ner_by_classes	Named Entity Recognition (NER) based on ontology classes.
ner_by_words	Named Entity Recognition (NER) by matching terms against an ontology.

extract_methods_from_pdf

parse.nlp.extract_methods_from_pdf(pdf_path)

Extract the “Methods” or “Materials and Methods” section from a PDF.

Parameters

pdf_path : str Path to the PDF file.

Returns

str Extracted text of the methods section. Hyphens at line breaks (“-”) are replaced with an empty string.

Notes

This function searches for the “Methods” or “Materials and Methods” section in the PDF and extracts the text until it encounters the “Results” or “Discussion” section.
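The search described in the note can be sketched on plain text as follows. This is a minimal illustration, not the library's implementation: the helper name `slice_methods_section` is hypothetical, and it assumes the section headings appear on their own lines in the extracted text.

```python
import re

# Hypothetical helper, not part of parse.nlp. Assumption: headings such as
# "Methods" or "Results" occupy a whole line of the extracted text.
_START = re.compile(
    r"^[ \t]*(?:materials\s+and\s+)?methods[ \t]*$", re.IGNORECASE | re.MULTILINE
)
_END = re.compile(r"^[ \t]*(?:results|discussion)[ \t]*$", re.IGNORECASE | re.MULTILINE)


def slice_methods_section(text: str) -> str:
    """Return the text between a Methods heading and the next Results/Discussion heading."""
    start = _START.search(text)
    if start is None:
        return ""  # no Methods heading found
    end = _END.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.end():stop].strip()


sample = (
    "Introduction\nSome background.\n"
    "Materials and Methods\nWe simulated a network.\n"
    "Results\nIt worked.\n"
)
print(slice_methods_section(sample))
```

If no closing heading is found, the sketch returns everything after the Methods heading; the real function may behave differently in that case.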

extract_text_from_pdf

parse.nlp.extract_text_from_pdf(file_path)

Extract text content from a PDF file.

Parameters

file_path : str Path to the PDF file.

Returns

str Extracted text from the PDF. Hyphens at line breaks (“-”) are replaced with an empty string.
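The hyphen handling mentioned above can be illustrated in isolation. This sketch assumes that "hyphens at line breaks" means a hyphen immediately followed by a newline; the helper name `join_hyphenated_line_breaks` is hypothetical and not part of parse.nlp.

```python
def join_hyphenated_line_breaks(text: str) -> str:
    """Drop each hyphen+newline pair so words split across lines are rejoined."""
    # Assumption: the hyphen is the last character before the line break.
    return text.replace("-\n", "")


raw = "The model repro-\nduces alpha oscil-\nlations."
print(join_hyphenated_line_breaks(raw))
# The model reproduces alpha oscillations.
```

A side effect of this approach is that genuine compounds broken at the hyphen (for example "Jansen-" at the end of a line followed by "Rit") are also merged into one word.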

find_in_text

parse.nlp.find_in_text(text, keyword, standalone=True, equation=False, **kwargs)

Find occurrences of a keyword in the text using regex.

Parameters

text : str Input text to search within.
keyword : str Keyword to search for.
standalone : bool, optional (default is True) If True, matches the keyword only as a standalone word. Otherwise, matches all occurrences.
equation : bool, optional (default is False) If True, matches the keyword only in sentences containing an equal sign.
**kwargs Additional keyword arguments for re.finditer.

Returns

list of tuple List of (start, end) indices for each match in the text.
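How the `standalone` and `equation` options could interact is sketched below. The function `find_keyword_spans` is a hypothetical stand-in, not the library code; it assumes that "standalone" means the keyword is not flanked by word characters and that a sentence ends at `.`, `!` or `?` followed by whitespace.

```python
import re


def _sentence_around(text, start, end):
    """Return the sentence containing text[start:end] (simple punctuation split)."""
    boundaries = [m.end() for m in re.finditer(r"[.!?]\s+", text)]
    left = max((b for b in boundaries if b <= start), default=0)
    right = min((b for b in boundaries if b >= end), default=len(text))
    return text[left:right]


def find_keyword_spans(text, keyword, standalone=True, equation=False, **kwargs):
    """Return (start, end) index pairs for each match of keyword in text."""
    escaped = re.escape(keyword)
    # Assumption: a standalone match has no word character directly before or after.
    pattern = rf"(?<!\w){escaped}(?!\w)" if standalone else escaped
    spans = [m.span() for m in re.finditer(pattern, text, **kwargs)]
    if equation:
        # Keep only matches whose surrounding sentence contains "=".
        spans = [s for s in spans if "=" in _sentence_around(text, *s)]
    return spans


text = "The gain G scales input. G = 2.5 in all runs. Gamma is unused."
print(find_keyword_spans(text, "G"))                  # standalone matches only
print(find_keyword_spans(text, "G", equation=True))   # only the match in "G = 2.5 ..."
print(find_keyword_spans(text, "G", standalone=False))
```

Extra keyword arguments are forwarded to `re.finditer`, so `flags=re.IGNORECASE` would make the search case-insensitive.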

get_pdf_page

parse.nlp.get_pdf_page(pdf_path, pagenum)

Extract text content from a specific page of a PDF file.

Parameters

pdf_path : str Path to the PDF file.
pagenum : int Page number (0-indexed) to extract text from.

Returns

str Extracted text from the specified PDF page. Hyphens at line breaks (“-”) are replaced with an empty string.

ner_by_classes

parse.nlp.ner_by_classes(text, semantic_type='all')

Named Entity Recognition (NER) based on ontology classes.

Parameters

text : str Input text for NER.
semantic_type : str, optional (default is “all”) Semantic type for filtering: “all”, “label”, “synonym”, “acronym”, or “symbol”.

Returns

dict NER results based on the specified semantic type.

Raises

Name	Type	Description
	ValueError	If an incorrect `semantic_type` is specified.

Example

text = "The Jansen-Rit neural mass model is associated with the alpha frequency in the EEG."
result = ner_by_classes(text, semantic_type="all")
print(result)
{Neural Mass Model: [(15, 32)], EEG: [(79, 82)], Model: [(27, 32)], JansenRit: {'Jansen-Rit': [(4, 14)]}}

ner_by_words

parse.nlp.ner_by_words(text)

Named Entity Recognition (NER) by matching terms against an ontology.

Parameters

text : str Input text for NER.

Returns

dict Dictionary with ontology classes as keys. Values are dictionaries with named entities as keys and their occurrences as values.
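The nested return structure can be illustrated with a toy term matcher. Everything here is an assumption made for illustration: `TOY_ONTOLOGY` is a made-up mapping rather than the TVB-O ontology, and `match_terms` is a hypothetical helper that only mimics the described output shape, not the matching logic of `ner_by_words`.

```python
import re

# Toy stand-in for an ontology: class name -> surface terms.
# These entries are invented for illustration and are not taken from TVB-O.
TOY_ONTOLOGY = {
    "JansenRit": ["Jansen-Rit"],
    "EEG": ["EEG", "electroencephalography"],
}


def match_terms(text, ontology=TOY_ONTOLOGY):
    """Return {class: {term: [(start, end), ...]}} for every term found in text."""
    results = {}
    for cls, terms in ontology.items():
        for term in terms:
            # Match the term only when it is not embedded in a longer word.
            pattern = rf"(?<!\w){re.escape(term)}(?!\w)"
            spans = [m.span() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]
            if spans:
                results.setdefault(cls, {})[term] = spans
    return results


text = "The Jansen-Rit neural mass model is associated with the alpha frequency in the EEG."
print(match_terms(text))
# {'JansenRit': {'Jansen-Rit': [(4, 14)]}, 'EEG': {'EEG': [(79, 82)]}}
```

Classes without any matching term are omitted from the result, which keeps the dictionary limited to entities that actually occur in the text.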