
04 Data

[The data below are described in this paper published in Industrial and Corporate Change.]

While patent data are used across many disciplines in the social sciences, little is known about patents published before 1975, when the USPTO implemented a digital system.


The pre-1975 patents have been made digitally and publicly available by Google Patents. Google used Optical Character Recognition (OCR) software to read the text of these historical patent documents, digitized them, and hosts them online.


Petralia, Balland and Rigby (2016) scraped these text files, then cleaned and structured the data. The resulting data are publicly available as HistPat and can be found here at Nature's Scientific Data.

This data-base is an extremely rich source. For more than 4 million patents it provides information on: (1) the first inventor, (2) her/his geographical location, (3) the application year, (4) the grant year, (5) the technology class(es) and (6) sometimes an assignee.


However, it does not provide any information on additional inventors who might have collaborated on the patents.


This is where I contribute. I've mined the text of more than 4 million HistPat text files. Using complex search and matching algorithms, I examined every single word to identify inventors' names and their exact geographical locations.
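To give a flavour of what such text matching can look like, here is a minimal sketch in Python. It is not the actual pipeline: the regular expression, the function name `find_candidate` and the sample sentence are all illustrative assumptions, and real OCR text is far noisier and more varied than this single pattern can handle.

```python
import re

# Assumed preamble pattern (illustrative only). It looks for phrasing such as
# "Be it known that I, NAME, of CITY, ... State of STATE,".
PREAMBLE = re.compile(
    r"Be it known that I,\s+(?P<name>[A-Z][A-Z .]+?),\s+of\s+"
    r"(?P<city>[A-Z][A-Za-z ]+?),.*?State of\s+(?P<state>[A-Z][A-Za-z ]+?),",
    re.DOTALL,
)


def find_candidate(text):
    """Return a could-be-inventor record from a patent text, or None."""
    match = PREAMBLE.search(text)
    if match is None:
        return None
    # Collapse stray whitespace left over from line breaks in the OCR text.
    return {key: " ".join(value.split()) for key, value in match.groupdict().items()}


# Made-up sample sentence, not taken from a real patent.
sample = (
    "Be it known that I, JOHN SMITH, of Springfield, in the county of "
    "Sangamon and State of Illinois, have invented a new and useful "
    "Improvement in Plows."
)
candidate = find_candidate(sample)
```

A single pattern like this would miss joint inventors, misspellings and OCR errors, which is why every hit is treated only as a candidate.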

After picking up more than 8 million could-be-inventors, I've used state-of-the-art (fully supervised) machine learning techniques to identify which could-be's are truly inventors and not witnesses, examiners, assignees, etc.

Finally, building upon work by Ventura et al. (2015), I built a supervised machine learning algorithm to disambiguate unique inventors.


The product is an inventor-patent data-base that holds, for each historical U.S. patent granted between 1836 and 1975, information on all inventors and their geographical locations. This allows us to generate networks of collaboration that connect inventors within and between U.S. cities, as well as to track the movement of inventors over time and across technological and geographical space.
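As an illustration of how an inventor-patent table can be turned into a collaboration network, here is a minimal sketch in Python. The row layout `(patent_id, inventor_id, city)`, the function name `city_collaboration_edges` and the example rows are assumptions made for this sketch; they are not the actual schema or records of the data-base.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rows: (patent_id, inventor_id, city). Not real records.
rows = [
    ("P1", "inv_a", "Boston"),
    ("P1", "inv_b", "Chicago"),
    ("P2", "inv_a", "Boston"),
    ("P2", "inv_c", "Boston"),
    ("P3", "inv_b", "Chicago"),
]


def city_collaboration_edges(rows):
    """Count co-invention ties between cities, including within-city ties."""
    # Group the (inventor, city) pairs that appear on the same patent.
    by_patent = {}
    for patent_id, inventor_id, city in rows:
        by_patent.setdefault(patent_id, set()).add((inventor_id, city))

    edges = Counter()
    for inventors in by_patent.values():
        # Every pair of co-inventors on a patent adds one tie between their cities.
        for (_, city_1), (_, city_2) in combinations(sorted(inventors), 2):
            edges[tuple(sorted((city_1, city_2)))] += 1
    return edges


edges = city_collaboration_edges(rows)
```

Ties are keyed by an alphabetically sorted city pair, so a Boston-Chicago tie and a Chicago-Boston tie are counted together, and a pair with the same city twice marks a within-city collaboration.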


I describe these data in this paper, published in Industrial and Corporate Change.


For more information, check this dedicated website:
