{"id":13826,"date":"2015-04-08T17:54:54","date_gmt":"2015-04-08T16:54:54","guid":{"rendered":"http:\/\/www.ch.imperial.ac.uk\/rzepa\/blog\/?p=13826"},"modified":"2015-04-09T07:30:00","modified_gmt":"2015-04-09T06:30:00","slug":"goldilocks-data","status":"publish","type":"post","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826","title":{"rendered":"Goldilocks Data."},"content":{"rendered":"<div class=\"kcite-section\" kcite-section-id=\"13826\">\n<p>Last August, I <a title=\"Data galore!  134 kilomolecules.\" href=\"http:\/\/www.ch.imperial.ac.uk\/rzepa\/blog\/?p=12803\" target=\"_blank\">wrote about<\/a> <em>data galore<\/em>, the archival\u00a0of data for 133,885 (134 kilo) molecules into a repository, together with an associated data descriptor<span id=\"cite_ITEM-13826-0\" name=\"citation\"><a href=\"#ITEM-13826-0\">[1]<\/a><\/span> published in the new journal <em>Scientific Data<\/em>. Since six months is a long time in the rapidly evolving field of RDM, or research data management, I offer an update in the form of some new observations.<\/p>\n<p>Firstly, 131 kilo molecules are now offered in a new and different form,\u00a0<a href=\"http:\/\/gdb.koitz.info\/gdbrowse\/\">http:\/\/gdb.koitz.info\/gdbrowse\/<\/a>,\u00a0and it is worth comparing the presentations of the two sets of otherwise identical data.<\/p>\n<ol>\n<li>The original<strong><span style=\"color: #ff00ff;\"> archive<\/span><\/strong>\u00a0had a single assigned DOI<span id=\"cite_ITEM-13826-1\" name=\"citation\"><a href=\"#ITEM-13826-1\">[2]<\/a><\/span> from which you could download a ZIP file to be unpacked and navigated on your own computer. The exposed metadata for the deposition (by which I mean, in this case, metadata registered with <a href=\"http:\/\/search.datacite.org\/\" target=\"_blank\">DataCite<\/a>, the registration authority used by Figshare) was limited to general information about the 133,885 molecules such as the authorship and license. 
The granularity is coarse, not extending to descriptions of individual molecules.<\/li>\n<li>The new version forgoes the ZIP archive, replacing it with a proper <strong><span style=\"color: #ff00ff;\">database<\/span><\/strong> (based on <a href=\"http:\/\/www.mongodb.org\/\" target=\"_blank\">MongoDB<\/a>) containing information about 130,832 molecules. This allows one to search the data\u00a0at the individual\u00a0molecule level (formula, InChI descriptor, mass, <em>etc<\/em>) using the tools provided. To the end-user, this is much more useful; the data is both\u00a0<strong>discoverable<\/strong> and\u00a0<strong>re-usable<\/strong>.<\/li>\n<\/ol>\n<p>There is no overlap between these two presentations of the data. There also appears to be no API (application programming interface) which might allow one to write code to construct one&#8217;s own searches. The apparent absence of an API also means that really only a human navigating the set menus can discover and re-use that\u00a0data; the data might not be mineable by a machine, for example. The absence of an API is not that unusual; only some of the best-known molecular databases offer one, and the\u00a0<a href=\"http:\/\/www.programmableweb.com\/api\/rcsb-protein-data-bank\" target=\"_blank\">RCSB Protein Data Bank<\/a> is a good example. More significantly, each instance of such a molecule-based database is likely to have its own way of accessing the data and, even if a documented API were available, one would still have to write specific code for each such resource.<\/p>\n<p>So the first bowl contains what I suggest is cold porridge and the second is perhaps\u00a0equivalent to a\u00a0<a href=\"http:\/\/en.wikipedia.org\/wiki\/Table_d%27h\u00f4te\" target=\"_blank\">table d&#8217;h\u00f4te menu<\/a>. Does Goldilocks have a third option? 
I would argue yes, she could have:<\/p>\n<ol start=\"3\">\n<li>We recently published data for 158 kilo molecules<span id=\"cite_ITEM-13826-2\" name=\"citation\"><a href=\"#ITEM-13826-2\">[3]<\/a><\/span> for which each molecule carries its own metadata. That metadata can be queried using any search engine that supports the basic metadata standards:<br \/>\n<small><a href=\"http:\/\/search.datacite.org\/ui?q=has_media:true&amp;fq=prefix:10.14469\" target=\"demo\">http:\/\/search.datacite.org\/ui?q=has_media:true&amp;fq=prefix:10.14469<\/a><\/small><br \/>\nis an example. Or, armed with the metadata schema, one could also write one&#8217;s own search engine\u00a0and, in theory at least, that code should serve to query ANY repository that supports these standards.<\/li>\n<\/ol>\n<p>You could argue that all that has happened is that one has simply replaced a specific database API (if it exists) with a specific metadata schema. But these metadata schemas are controlled standards, the components of which should be self-describing (and one can see the schema components by invoking the link above).<\/p>\n<p>As the archival of data (RDM) becomes increasingly important, communities will have to start making decisions about which flavour of data-porridge to offer Goldilocks. For molecular data at least, I suggest the third option is highly desirable and perhaps likely to be the most persistent. Parochial databases very much depend on a specialised team of people to maintain them in perpetuity, which I gather now means 20 years. At the very least, we should start to have a debate about how the future will evolve. Let us not leave this debate merely in the hands of a small number of large organisations that are likely to make decisions based on their own business models. After all, it\u00a0starts off at least as our data, not theirs! 
Arguably, we as authors have now largely lost control over how our stories (journal articles) are managed; let us not cede the same for data.<\/p>\n<h2>References<\/h2>\n    <ol class=\"kcite-bibliography csl-bib-body\"><li id=\"ITEM-13826-0\">R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, \"Quantum chemistry structures and properties of 134 kilo molecules\", <i>Scientific Data<\/i>, vol. 1, 2014. <a href=\"https:\/\/doi.org\/10.1038\/sdata.2014.22\">https:\/\/doi.org\/10.1038\/sdata.2014.22<\/a>\n\n<\/li>\n<li id=\"ITEM-13826-1\">R. Ramakrishnan, P. Dral, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, \"Quantum chemistry structures and properties of 134 kilo molecules\", 2014. <a href=\"https:\/\/doi.org\/10.6084\/m9.figshare.978904\">https:\/\/doi.org\/10.6084\/m9.figshare.978904<\/a>\n\n<\/li>\n<li id=\"ITEM-13826-2\">Y. Zhang, H.S. Rzepa, J.J.P. Stewart, P. Murray-Rust, M.J. Harvey, N. Mason, A. McLean, and Imperial College High Performance Computing Service, \"Revised Cambridge NCI database\", 2014. <a href=\"https:\/\/doi.org\/10.14469\/ch\/2\">https:\/\/doi.org\/10.14469\/ch\/2<\/a>\n\n<\/li>\n<\/ol>\n\n<\/div> <!-- kcite-section 13826 -->","protected":false},"excerpt":{"rendered":"<p>Last August, I wrote about data galore, the archival\u00a0of data for 133,885 (134 kilo) molecules into a repository, together with an associated data descriptor published in the new journal Scientific Data. 
Since six months is a long time in the rapidly evolving field of RDM, or research data management, I offer an update in the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":5,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[2],"tags":[805,1379,1222],"ppma_author":[2661],"class_list":["post-13826","post","type-post","status-publish","format-standard","hentry","category-chemical-it","tag-api","tag-rcsb-protein-data-bank","tag-search-engine"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Goldilocks Data. - Henry Rzepa&#039;s Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Goldilocks Data. 
- Henry Rzepa&#039;s Blog\" \/>\n<meta property=\"og:description\" content=\"Last August, I wrote about data galore, the archival\u00a0of data for 133,885 (134 kilo) molecules into a repository, together with an associated data descriptor published in the new journal Scientific Data. Since six months is a long time in the rapidly evolving field of RDM, or research data management, I offer an update in the [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826\" \/>\n<meta property=\"og:site_name\" content=\"Henry Rzepa&#039;s Blog\" \/>\n<meta property=\"article:published_time\" content=\"2015-04-08T16:54:54+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2015-04-09T06:30:00+00:00\" \/>\n<meta name=\"author\" content=\"Henry Rzepa\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Henry Rzepa\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Goldilocks Data. - Henry Rzepa&#039;s Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826","og_locale":"en_GB","og_type":"article","og_title":"Goldilocks Data. - Henry Rzepa&#039;s Blog","og_description":"Last August, I wrote about data galore, the archival\u00a0of data for 133,885 (134 kilo) molecules into a repository, together with an associated data descriptor published in the new journal Scientific Data. 
Since six months is a long time in the rapidly evolving field of RDM, or research data management, I offer an update in the [&hellip;]","og_url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826","og_site_name":"Henry Rzepa&#039;s Blog","article_published_time":"2015-04-08T16:54:54+00:00","article_modified_time":"2015-04-09T06:30:00+00:00","author":"Henry Rzepa","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Henry Rzepa","Estimated reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826#article","isPartOf":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826"},"author":{"name":"Henry Rzepa","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#\/schema\/person\/2b40f7b9c872a4dc1547e040a11b6281"},"headline":"Goldilocks Data.","datePublished":"2015-04-08T16:54:54+00:00","dateModified":"2015-04-09T06:30:00+00:00","mainEntityOfPage":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826"},"wordCount":684,"commentCount":0,"keywords":["API","RCSB Protein Data Bank","search engine"],"articleSection":["Chemical IT"],"inLanguage":"en-GB","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826","url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826","name":"Goldilocks Data. 
- Henry Rzepa&#039;s Blog","isPartOf":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#website"},"datePublished":"2015-04-08T16:54:54+00:00","dateModified":"2015-04-09T06:30:00+00:00","author":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#\/schema\/person\/2b40f7b9c872a4dc1547e040a11b6281"},"breadcrumb":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13826#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog"},{"@type":"ListItem","position":2,"name":"Goldilocks Data."}]},{"@type":"WebSite","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#website","url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/","name":"Henry Rzepa&#039;s Blog","description":"Chemistry with a twist","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-GB"},{"@type":"Person","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#\/schema\/person\/2b40f7b9c872a4dc1547e040a11b6281","name":"Henry Rzepa","image":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/secure.gravatar.com\/avatar\/897b6740f7f599bca7942cdf7d7914af5988937ae0e3869ab09aebb87f26a731?s=96&d=blank&r=g370be3a7397865e4fd161aefeb0a5a85","url":"https:\/\/secure.gravatar.com\/avatar\/897b6740f7f599bca7942cdf7d7914af5988937ae0e3869ab09aebb87f26a731?s=96&d=blank&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/897b6740f7f599bca7942cdf7d7914af5988937ae0e3869ab09aebb87f26a731?s=96&d=blank&r=g","caption":"Henry Rzepa"},"description":"Henry Rzepa is Emeritus Professor of Computational Chemistry at Imperial College 
London.","sameAs":["https:\/\/orcid.org\/0000-0002-8635-8390"],"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?author=1"}]}},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pDef7-3B0","jetpack-related-posts":[{"id":12803,"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=12803","url_meta":{"origin":13826,"position":0},"title":"Data galore!  134 kilomolecules.","author":"Henry Rzepa","date":"August 6, 2014","format":false,"excerpt":"I do go on a lot about the importance of having modern access to data. And so the appearance of this article immediately struck me as important. It is appropriately enough in the new journal Scientific Data. The data contain computed properties at the B3LYP\/6-31G(2df,p) level for 133,885 species with\u2026","rel":"","context":"In &quot;Chemical IT&quot;","block_context":{"text":"Chemical IT","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?cat=2"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12932,"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=12932","url_meta":{"origin":13826,"position":1},"title":"One molecule, one identifier: Viewing molecular files from a digital repository using metadata standards.","author":"Henry Rzepa","date":"September 8, 2014","format":false,"excerpt":"In the beginning (taken here as\u00a0prior to ~1980) libraries held\u00a0five-year printed consolidated indices of molecules, organised by formula or name (Chemical abstracts). This could occupy about 2m of shelf space for each five years. And an equivalent set of printed volumes from the Beilstein collection. 
Those of us who needed\u2026","rel":"","context":"In &quot;Chemical IT&quot;","block_context":{"text":"Chemical IT","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?cat=2"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":25761,"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=25761","url_meta":{"origin":13826,"position":2},"title":"Molecules of the year -2022.  Data issues!","author":"Henry Rzepa","date":"December 13, 2022","format":false,"excerpt":"The list of molecules of the year is out now at C&E News (but you have to have an account to view the list, unlike previous years).\u2663 These three caught my eye: Electron in a cube: Synthesis and characterization of perfluorocubane as an electron acceptor,. I have already written about\u2026","rel":"","context":"Similar post","block_context":{"text":"Similar post","link":""},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":14454,"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=14454","url_meta":{"origin":13826,"position":3},"title":"A (light) introductory tutorial on Research Data Management (in chemistry).","author":"Henry Rzepa","date":"August 20, 2015","format":false,"excerpt":"Management of research (data) outputs is a hot topic in the UK at the moment, although the topic has been rumbling for five years or more. 
Most research-active higher educational establishments have or are about to publish general guidelines, which predominantly take the form of aspirational targets rather than actionable\u2026","rel":"","context":"In &quot;Chemical IT&quot;","block_context":{"text":"Chemical IT","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?cat=2"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":20601,"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=20601","url_meta":{"origin":13826,"position":4},"title":"Impossible molecules.","author":"Henry Rzepa","date":"April 1, 2019","format":false,"excerpt":"Members of the chemical FAIR data community have just met in Orlando (with help from the NSF, the American National Science Foundation)\u00a0to discuss how such data is progressing in chemistry. There are a lot of themes converging at the moment. Thus this article extolls the virtues of having raw NMR\u2026","rel":"","context":"In &quot;Chemical IT&quot;","block_context":{"text":"Chemical IT","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?cat=2"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":13248,"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=13248","url_meta":{"origin":13826,"position":5},"title":"A convincing example of the need for data repositories. FAIR Data.","author":"Henry Rzepa","date":"January 15, 2015","format":false,"excerpt":"Derek Lowe in his In the Pipeline blog is famed for spotting unusual claims in the literature and subjecting them to analysis. This one is entitled\u00a0Odd Structures, Subjected to Powerful Computations. 
He looks at this image below, and finds the structures represented there might be a mistake, based on his\u2026","rel":"","context":"In &quot;Chemical IT&quot;","block_context":{"text":"Chemical IT","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?cat=2"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_likes_enabled":false,"authors":[{"term_id":2661,"user_id":1,"is_guest":0,"slug":"admin","display_name":"Henry Rzepa","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/897b6740f7f599bca7942cdf7d7914af5988937ae0e3869ab09aebb87f26a731?s=96&d=blank&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13826","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13826"}],"version-history":[{"count":13,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13826\/revisions"}],"predecessor-version":[{"id":13845,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13826\/revisions\/13845"}],"wp:attachment":[{"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13826"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13826"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13826"},{"taxonomy":"author","embeddable":
true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fppma_author&post=13826"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}