Assignment 1: Elementary wrestling with citation data. (due by 11:59 pm on 9/7/2022). This assignment can be run off some laptops (it was tested on an Intel Mac with 32 GB RAM) but you are strongly encouraged to explore the campus cluster instructional queue (eng-instruction) for items c, d, and e.
You’re free to work together in pairs and/or discuss your ideas with other people but each person’s assignment must be individually written up and acknowledge input from whoever you discussed your assignment with.
a) Using PubMed, retrieve journal articles in English using the search terms ‘histone’, ‘methylase’, ‘deacetylase’ and ‘epigenomics’, and restrict your search to articles published between 1980 and 1990 (including boundary years). You are free to choose between various options for this search: manual (PubMed GUI), the Entrez API, or other tools such as Rentrez and EntrezPy. For each retrieved article, collect the PubMed ID (pmid), the year of publication, the title, names of authors, DOI (when available), and publication type. You should also save and report the query you construct, e.g., in PubMed:
“(((kahnemann[Author]) AND (1R01GM12345-01[Grant Number])) AND ("historical article"[Publication Type])) AND (("2001/10/15"[Date - Publication] : "3000"[Date - Publication]))”
Alternatively, you could download all of PubMed and process it locally.
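As one possible starting point for the Entrez API route, the sketch below builds a query string and an ESearch request URL without sending anything over the network. Several things here are assumptions rather than requirements of the assignment: the helper names build_term and build_esearch_url are hypothetical, the four search terms are combined with OR (the assignment does not say whether AND or OR is intended), and the terms are left untagged so PubMed applies its default field mapping. You would still need a follow-up step (for example ESummary or EFetch) to collect titles, authors, DOIs, and publication types.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(keywords, start_year, end_year, combine="OR"):
    """Build a PubMed query string.

    `combine` is an assumption: the assignment does not state whether
    the search terms should be joined with AND or OR.
    """
    joined = f" {combine} ".join(keywords)
    return (
        f"({joined})"
        " AND english[Language]"
        ' AND "journal article"[Publication Type]'
        f' AND ("{start_year}/01/01"[Date - Publication]'
        f' : "{end_year}/12/31"[Date - Publication])'
    )


def build_esearch_url(term, retmax=10000):
    """Return an ESearch URL for the pubmed database with JSON output."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


term = build_term(["histone", "methylase", "deacetylase", "epigenomics"], 1980, 1990)
url = build_esearch_url(term)
print(term)  # save this string so the query can be reported
print(url)
```

Printing and saving the term string is one way to satisfy the requirement to report the query you construct; pasting the same string into the PubMed GUI should let a reader cross-check the hit count.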
b) Describe your query approach in plain English and append any scripts that you developed. Provide enough information that a third person could reproduce your results. A minimum working example is very useful. Describe any alternative strategies that you consider viable, e.g., EuropePMC or Scopus via the library. Indicate your preference and justify it.
c) Export retrieved data and match as many of the retrieved articles as possible to the open_citations dataset available at /projects/eng/shared/CS598GGC-RO. To access this folder, you’ll need to log in to a head node on the campus cluster, e.g., cc-login.campuscluster.illinois.edu. All enrolled students have access to this folder and the campus cluster instructional queue. If you experience problems accessing it, please file a ticket with EngrIT (engrit-help@illinois.edu), suggest Glen Rundblom as a source of knowledge, and copy chackoge@illinois.edu.
Note that the open_citations data are in the form of citation records rather than publication records: each row is a pair (citing_node, cited_node), where nodes are identified by DOIs. So, you will need to use DOIs from your PubMed data to match the hit list to the open_citations data. For example, if you get back two articles with DOIs of 10.1/abcd-12234 and 10.1/efgh-5678 from PubMed, then you would extract all rows in the open_citations data where either 10.1/abcd-12234 or 10.1/efgh-5678 is present.
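The row extraction described above can be sketched as a single streaming pass over the citation records. The file layout is an assumption: the sketch treats the data as comma-separated text with the citing DOI in the first column and the cited DOI in the second, so check the actual files in the shared folder before relying on it. The function names are hypothetical, and DOIs are lowercased before comparison because DOI matching is case-insensitive.

```python
import csv
import io


def normalize_doi(doi):
    """Lowercase and trim a DOI so that comparisons are case-insensitive."""
    return doi.strip().lower()


def extract_matching_rows(rows, dois):
    """Yield (citing, cited) pairs in which either DOI is in the hit list.

    `rows` is any iterable of sequences whose first two items are the
    citing and cited DOIs (an assumed layout).
    """
    wanted = {normalize_doi(d) for d in dois}
    for row in rows:
        citing, cited = normalize_doi(row[0]), normalize_doi(row[1])
        if citing in wanted or cited in wanted:
            yield citing, cited


# Tiny in-memory stand-in for one citation file (made-up DOIs).
sample = io.StringIO(
    "10.1/abcd-12234,10.9/zzzz-1\n"
    "10.9/yyyy-2,10.1/EFGH-5678\n"
    "10.9/yyyy-2,10.9/zzzz-1\n"
)
pubmed_dois = ["10.1/abcd-12234", "10.1/efgh-5678"]
hits = list(extract_matching_rows(csv.reader(sample), pubmed_dois))
print(hits)
```

Because the function is a generator and the hit list is held in a set, memory use depends on the number of PubMed DOIs rather than on the size of the citation data, which matters when the input is too large to load at once.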
Feel free to supplement the open citations data from other public sources such as EuropePMC and Crossref as well as Scopus and WebOfScience provided through the University Library.
d) Assuming a directed graph where each vertex is a document identified by a DOI and each edge represents a citation from one document to another, report the degree distribution of nodes in the rows you have extracted from the open_citations data, i.e., for each node report in-degree and out-degree separately in the open_citations dataset. Present your findings in aggregate.
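A minimal sketch of the in-degree and out-degree calculation, assuming the extracted rows are available as (citing, cited) pairs. The function names and the toy edge list are hypothetical. An edge from citing to cited adds one to the out-degree of the citing node and one to the in-degree of the cited node.

```python
from collections import Counter


def in_out_degrees(edges):
    """Return (in_degree, out_degree) Counters keyed by node."""
    in_deg, out_deg = Counter(), Counter()
    for citing, cited in edges:
        out_deg[citing] += 1  # the citing document gains an outgoing edge
        in_deg[cited] += 1    # the cited document gains an incoming edge
    return in_deg, out_deg


def degree_histogram(degrees, nodes):
    """Count how many of `nodes` have each degree value (0 if absent)."""
    return Counter(degrees.get(node, 0) for node in nodes)


# Toy edge list: a cites b, c cites b, b cites d.
edges = [("a", "b"), ("c", "b"), ("b", "d")]
in_deg, out_deg = in_out_degrees(edges)
nodes = {n for edge in edges for n in edge}
in_hist = degree_histogram(in_deg, nodes)
out_hist = degree_histogram(out_deg, nodes)
print(dict(in_hist))
print(dict(out_hist))
```

One caveat worth stating in a write-up: if the extracted rows are every row that contains a hit DOI, then the counts for the hit DOIs themselves match their degrees in the full dataset, but the counts for their neighbours only reflect the extracted subset.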
e) Report the total number of nodes and edges in the open_citations dataset. Calculate the total degree (in + out) distribution for each node in the open_citations dataset. Report the number of citations in each year across the dataset. Graphically represent this distribution and comment on whether it likely fits a normal distribution, a lognormal, a power-law distribution or some other distribution that you would like to consider.
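The node count, edge count, and total-degree distribution can be sketched in one pass, again assuming the data can be iterated as (citing, cited) pairs. The function name graph_summary is hypothetical. The sketch counts every row as an edge, so duplicate rows (if any exist in the data) would be counted more than once, and it keeps one counter entry per node, which may call for a cluster node rather than a laptop on the full dataset.

```python
from collections import Counter


def graph_summary(edges):
    """Return (number of nodes, number of edges, total-degree distribution).

    The distribution maps a total degree (in + out) to the number of
    nodes with that degree.
    """
    total_deg = Counter()
    n_edges = 0
    for citing, cited in edges:
        total_deg[citing] += 1  # contributes to out-degree
        total_deg[cited] += 1   # contributes to in-degree
        n_edges += 1
    distribution = Counter(total_deg.values())
    return len(total_deg), n_edges, dict(sorted(distribution.items()))


# Toy edge list: a cites b and d, c cites b, b cites d.
edges = [("a", "b"), ("c", "b"), ("b", "d"), ("a", "d")]
n_nodes, n_edges, dist = graph_summary(edges)
print(n_nodes, n_edges, dist)
```

Plotting the distribution with both axes on a log scale is one common way to inspect a heavy tail; a roughly straight line is suggestive of a power law but not proof of one, so comparing against a lognormal fit is still worthwhile.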
Other: What is a histone and why do researchers study histones? Is metadata from the histone literature useful? Why do you think this exercise might serve any purpose beyond satisfying a course instructor?
Be creative and inclusive in how you present your data; for example, would biology researchers understand your write-up?
The completed assignment should be presented as a PDF document that should not exceed 5 pages. Please email the PDF to George Chacko by midnight US Central on Sep 7.