nlp
parse.nlp
Natural Language Processing with TVB-O.
Example
# Provide the path to the PDF file you want to parse
# (the second assignment overrides the first; keep only the file you need)
pdf_file_path = "/Users/leonmartin_bih/Downloads/Deco2014.pdf"
pdf_file_path = "/Users/leonmartin_bih/Downloads/fncom-13-00054 (1).pdf"
# Call the functions to extract the full text and the methods section from the PDF file
extracted_text = extract_text_from_pdf(pdf_file_path)
methods = extract_methods_from_pdf(pdf_file_path)
Functions
| Name | Description |
|---|---|
| extract_methods_from_pdf | Extract the “Methods” or “Materials and Methods” section from a PDF. |
| extract_text_from_pdf | Extract text content from a PDF file. |
| find_in_text | Find occurrences of a keyword in the text using regex. |
| get_pdf_page | Extract text content from a specific page of a PDF file. |
| ner_by_classes | Named Entity Recognition (NER) based on ontology classes. |
| ner_by_words | Named Entity Recognition (NER) by matching terms against an ontology. |
extract_methods_from_pdf
parse.nlp.extract_methods_from_pdf(pdf_path)
Extract the “Methods” or “Materials and Methods” section from a PDF.
Parameters
pdf_path : str Path to the PDF file.
Returns
str Extracted text of the methods section. Hyphens at line breaks (“-”) are replaced with an empty string.
Notes
This function searches for the “Methods” or “Materials and Methods” section in the PDF and extracts the text until it encounters the “Results” or “Discussion” section.
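The search described in the note can be illustrated on plain text. The following is a minimal sketch, not the implementation used by parse.nlp: the helper name `slice_methods_section` and both heading patterns are assumptions made for illustration, and the sketch expects section headings to stand alone on their own line.

```python
import re

# Hypothetical heading patterns (assumption): a heading sits alone on its line,
# optionally preceded by a section number such as "2." or "2".
START = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?(?:materials[ \t]+and[ \t]+)?methods[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
END = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?(?:results|discussion)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def slice_methods_section(text):
    """Return the text between a methods heading and the next results/discussion heading."""
    start = START.search(text)
    if start is None:
        return ""  # no methods heading found
    end = END.search(text, start.end())
    stop = end.start() if end else len(text)  # fall back to the end of the text
    return text[start.end():stop].strip()


sample = (
    "Introduction\nSome background.\n"
    "Materials and Methods\nWe simulated a neural mass model.\n"
    "Results\nThe model oscillated.\n"
)
print(slice_methods_section(sample))
```

Real PDFs often place headings inline or in a different order, so a pattern like this would likely need tuning per journal layout.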
extract_text_from_pdf
parse.nlp.extract_text_from_pdf(file_path)
Extract text content from a PDF file.
Parameters
file_path : str Path to the PDF file.
Returns
str Extracted text from the PDF. Hyphens at line breaks (“-”) are replaced with an empty string.
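The hyphen handling mentioned under Returns can be sketched as follows. This assumes that a hyphen immediately followed by a line break is dropped together with the line break; the helper `rejoin_hyphenated` is hypothetical and is not part of parse.nlp.

```python
import re


def rejoin_hyphenated(text):
    """Remove a hyphen that is directly followed by a line break (assumed behaviour)."""
    return re.sub(r"-\n", "", text)


raw = "an implemen-\ntation of a well-known model"
print(rejoin_hyphenated(raw))
```

Hyphens inside a line are left alone, but a genuine compound that happens to break at a line end (for example "Jansen-" followed by "Rit") would be fused into one word under this rule.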
find_in_text
parse.nlp.find_in_text(text, keyword, standalone=True, equation=False, **kwargs)
Find occurrences of a keyword in the text using regex.
Parameters
text : str Input text to search within.
keyword : str Keyword to search for.
standalone : bool, optional (default is True) If True, only matches the keyword as a standalone word. Otherwise, matches all occurrences.
equation : bool, optional (default is False) If True, only matches the keyword in sentences containing an equal sign.
**kwargs Additional keyword arguments for re.finditer.
Returns
list of tuple List of (start, end) indices for each match in the text.
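To make the `standalone` flag and the (start, end) return format concrete, here is a minimal sketch built on `re.finditer`. It is an illustration under assumptions, not the library's code: the name `find_keyword_spans` is hypothetical, "standalone" is interpreted as "not adjacent to a word character", and the `equation` filter is left out.

```python
import re


def find_keyword_spans(text, keyword, standalone=True, **kwargs):
    """Return (start, end) index pairs for each match of keyword in text."""
    pattern = re.escape(keyword)  # treat the keyword literally
    if standalone:
        # Assumed meaning of "standalone": no word character on either side.
        pattern = rf"(?<!\w){pattern}(?!\w)"
    return [m.span() for m in re.finditer(pattern, text, **kwargs)]


text = "The model uses tau and tau_e; Tau is a time constant."
print(find_keyword_spans(text, "tau"))
print(find_keyword_spans(text, "tau", standalone=False))
print(find_keyword_spans(text, "tau", flags=re.IGNORECASE))
```

Because the underscore counts as a word character, `tau_e` is skipped in standalone mode; passing `flags=re.IGNORECASE` through `**kwargs` additionally picks up the capitalised `Tau`.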
get_pdf_page
parse.nlp.get_pdf_page(pdf_path, pagenum)
Extract text content from a specific page of a PDF file.
Parameters
pdf_path : str Path to the PDF file.
pagenum : int Page number (0-indexed) to extract text from.
Returns
str Extracted text from the specified PDF page. Hyphens at line breaks (“-”) are replaced with an empty string.
ner_by_classes
parse.nlp.ner_by_classes(text, semantic_type='all')
Named Entity Recognition (NER) based on ontology classes.
Parameters
text : str Input text for NER.
semantic_type : str, optional (default is “all”) Semantic type for filtering: “all”, “label”, “synonym”, “acronym”, or “symbol”.
Returns
dict NER results based on the specified semantic type.
Raises
| Type | Description |
|---|---|
| ValueError | If an incorrect semantic_type is specified. |
Example
text = "The Jansen-Rit neural mass model is associated with the alpha frequency in the EEG."
result = ner_by_classes(text, semantic_type="all")
print(result)
{Neural Mass Model: [(15, 32)], EEG: [(79, 82)], Model: [(27, 32)], JansenRit: {'Jansen-Rit': [(4, 14)]}}
ner_by_words
parse.nlp.ner_by_words(text)
Named Entity Recognition (NER) by matching terms against an ontology.
Parameters
text : str Input text for NER.
Returns
dict Dictionary with ontology classes as keys. Values are dictionaries with named entities as keys and their occurrences as values.
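The nested return structure is easier to read with a small example. The sketch below mimics term matching against a toy dictionary; `TOY_ONTOLOGY` and `match_terms` are hypothetical stand-ins rather than the TVB-O ontology or the library's matcher, and occurrences are assumed to be (start, end) index pairs as in the ner_by_classes example.

```python
import re

# Toy stand-in for an ontology (assumption): class name -> terms to look for.
TOY_ONTOLOGY = {
    "JansenRit": ["Jansen-Rit"],
    "EEG": ["EEG", "electroencephalography"],
}


def match_terms(text, ontology):
    """Return {class: {matched term: [(start, end), ...]}} for terms found in text."""
    results = {}
    for cls, terms in ontology.items():
        for term in terms:
            pattern = rf"(?<!\w){re.escape(term)}(?!\w)"  # whole-term matches only
            spans = [m.span() for m in re.finditer(pattern, text)]
            if spans:
                results.setdefault(cls, {})[term] = spans
    return results


text = "The Jansen-Rit neural mass model is associated with the alpha frequency in the EEG."
result = match_terms(text, TOY_ONTOLOGY)

# Walk the nested dictionary: ontology class -> named entity -> occurrences.
for cls, entities in result.items():
    for entity, spans in entities.items():
        print(cls, entity, spans)
```

Terms without a match are omitted from the result, so only classes that actually occur in the text appear as keys.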