CS 598: Homework Assignment-1

Assignment 1: Elementary wrestling with citation data. (due by 11:59 pm on 9/7/2022). This assignment can be run off some laptops (it was tested on an Intel Mac with 32 GB RAM) but you are strongly encouraged to explore the campus cluster instructional queue (eng-instruction) for items c, d, and e.

a) Using PubMed, retrieve journal articles in English using the search term ‘histone’ and restrict your search to articles published between 1980 and 1990 (including boundary years). You are free to choose among various options for this search: manual (PubMed GUI), the Entrez API, or other tools such as Rentrez and EntrezPy. For each retrieved article, collect the PubMed ID (pmid), the year of publication, the title, names of authors, DOI (when available), and publication type. You should also save and report the query you construct, e.g., in PubMed

“(((kahnemann[Author]) AND (1R01GM12345-01[Grant Number])) AND ("historical article"[Publication Type])) AND (("2001/10/15"[Date - Publication] : "3000"[Date - Publication]))”
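
For those taking the Entrez API route, the following is a minimal sketch of how a query for item (a) might be assembled and turned into an E-utilities `esearch` request URL. The query string itself is only one plausible reading of the task (the `english[Language]` and `"journal article"[Publication Type]` filters are assumptions about how to express "journal articles in English"), and the helper names `build_histone_query` and `build_esearch_url` are hypothetical. The sketch only builds the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities esearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_histone_query(start_year=1980, end_year=1990):
    """Return a PubMed query string (one possible formulation, not the only one)."""
    return (
        'histone AND english[Language] AND "journal article"[Publication Type] '
        f'AND ("{start_year}"[Date - Publication] : "{end_year}"[Date - Publication])'
    )


def build_esearch_url(term, retmax=100, retstart=0):
    """Encode the query into an esearch URL; retstart/retmax allow paging."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": retmax,
        "retstart": retstart,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = build_histone_query()
url = build_esearch_url(query)
print(query)  # save this string: the assignment asks you to report it
print(url)
```

Printing the query before running it makes it easy to paste the same string into the PubMed GUI and compare hit counts, which is one way to check that the scripted and manual searches agree.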

b) Describe your query approach in plain English and append any scripts that you developed. Provide enough information that a third person could reproduce your results. Describe any alternate strategies that you consider viable, e.g. EuropePMC or Scopus via the library. Indicate your preference and justify it.

c) Export retrieved data and match as many of the retrieved articles as possible to the open_citations dataset available at <campus_cluster_storage_uri>. Note that the open_citations data are in the form of citation pairs (citing_node, cited_node), where nodes are identified by DOIs. So, you will need to use DOIs from your PubMed data to match the hit list to the open_citations data. For example, if you get back two articles with DOIs of 10.1/abcd-12234 and 10.1/efgh-5678 from PubMed, then you would extract all rows in the open_citations data where either 10.1/abcd-12234 or 10.1/efgh-5678 is present.
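
The row extraction described in item (c) can be sketched as a streaming filter, so that the full dataset never has to fit in memory. This is only an illustration: the two-column CSV layout with a header row, the column order (citing, cited), and the lowercase DOI normalization are all assumptions, since the actual file format of the open_citations data is not specified here. The in-memory sample stands in for the real file.

```python
import csv
import io


def normalize_doi(doi):
    """Lowercase and trim a DOI so that case differences do not block a match."""
    return doi.strip().lower()


def extract_matching_rows(rows, dois):
    """Yield (citing, cited) pairs where either endpoint is in the DOI hit list."""
    wanted = {normalize_doi(d) for d in dois}
    for citing, cited in rows:
        if normalize_doi(citing) in wanted or normalize_doi(cited) in wanted:
            yield citing, cited


# Tiny stand-in for the real dataset (assumed layout: header, then citing,cited).
sample = io.StringIO(
    "citing,cited\n"
    "10.1/abcd-12234,10.9/zzzz-0001\n"
    "10.9/yyyy-0002,10.1/EFGH-5678\n"
    "10.9/yyyy-0002,10.9/zzzz-0001\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row

pubmed_dois = ["10.1/abcd-12234", "10.1/efgh-5678"]
matched = list(extract_matching_rows(reader, pubmed_dois))
print(matched)
```

Holding the PubMed DOIs in a set keeps each membership test at roughly constant time, which matters when the citation file has many more rows than the hit list has DOIs.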

d) Assuming a directed graph where each vertex is a document identified by a DOI and each edge represents a citation from one document to another, report the degree distribution of nodes in the rows you have extracted from the open_citations data, i.e., for each node report in-degree and out-degree separately in the open_citations dataset. Present your findings in aggregate.
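
As a rough illustration of item (d), in-degree and out-degree can be tallied in one pass over the extracted edge list and then summarized as "how many nodes have degree k". The function names and the toy edge list below are hypothetical, and the sketch assumes each row is a distinct citation (duplicate rows would be counted twice).

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree) counters for (citing, cited) pairs."""
    in_deg, out_deg = Counter(), Counter()
    for citing, cited in edges:
        out_deg[citing] += 1  # the citing document gains an outgoing edge
        in_deg[cited] += 1    # the cited document gains an incoming edge
    return in_deg, out_deg


def degree_distribution(degrees, nodes):
    """Map each degree value to the number of nodes having it (zeros included)."""
    return Counter(degrees.get(node, 0) for node in nodes)


# Toy edge list standing in for the rows extracted in item (c).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]
in_deg, out_deg = degree_counts(edges)
nodes = set(in_deg) | set(out_deg)

in_dist = degree_distribution(in_deg, nodes)
out_dist = degree_distribution(out_deg, nodes)
print("in-degree distribution:", dict(in_dist))
print("out-degree distribution:", dict(out_dist))
```

Reporting the two distributions as degree-to-node-count tables (or histograms) is one way to present the findings in aggregate rather than node by node.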

e) Report the total number of nodes and edges in the open_citations dataset. Calculate the total degree (in + out) of each node in the open_citations dataset and the resulting degree distribution. Report the number of citations in each year across the dataset. Graphically represent this distribution and comment on whether it likely fits a normal distribution, a lognormal distribution, a power-law distribution, or some other distribution.
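
For the node, edge, and total-degree counts in item (e), a single streaming pass is enough. The sketch below also computes an empirical complementary cumulative distribution, P(degree >= k), which is one common way to inspect heavy-tailed data before plotting; it is offered as a starting point under stated assumptions, not as the required method. The names `graph_summary` and `ccdf` and the toy edge list are hypothetical, and a self-citation row would add 2 to that node's total degree.

```python
from collections import Counter


def graph_summary(edges):
    """Return (node count, edge count, total degree per node) in one pass."""
    total_deg = Counter()
    n_edges = 0
    for citing, cited in edges:
        total_deg[citing] += 1
        total_deg[cited] += 1
        n_edges += 1
    return len(total_deg), n_edges, total_deg


def ccdf(total_deg):
    """Return sorted (k, fraction of nodes with degree >= k) points."""
    freq = Counter(total_deg.values())
    n = sum(freq.values())
    remaining = n
    points = []
    for k in sorted(freq):
        points.append((k, remaining / n))
        remaining -= freq[k]
    return points


# Toy edge list standing in for the full dataset.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]
n_nodes, n_edges, total_deg = graph_summary(edges)
points = ccdf(total_deg)
print(n_nodes, n_edges)
print(points)
```

Plotting these points on log-log axes gives a quick visual check: a roughly straight line is often read as a hint of power-law behavior, while visible curvature may point toward a lognormal or another form. A visual check alone is not a formal test, so any conclusion drawn from it should be stated cautiously.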

The completed assignment should be presented as a PDF document that should not exceed 5 pages. [trying to decide whether webtools or some other option should be used]. Be creative in how you present your data; for example, would biology researchers understand your write-up?

Other: What is a histone and why do researchers study histones? Is metadata from the histone literature useful? Why do you think this exercise might serve any purpose beyond satisfying a course instructor?