A major strength of the Record Linking Lab is its access to undergraduate students at Brigham Young University. BYU has 27,000 undergraduate students, many of whom work on campus. In addition, BYU has a large selection of undergraduate family history classes, with more than 1,200 students taking an introductory family history course each year.
One of our goals as a lab is to provide a research assistant position to as many students as possible to help with projects related to family history work and improving the Family Tree. These students work as research assistants on the lab's own projects and can also be hired by scholars who would like to tap into this labor pool for their own projects. We provide an excellent mentoring environment in which advanced students are assigned to work with more recent hires.
If you are a student who is interested in working for the Record Linking Lab, please send an email to email@example.com and we will get in touch with you about filling out an application and scheduling an interview.
Current Undergraduate Assistants
There are three spreadsheets we are working on: attaching the 1920 census (https://docs.google.com/spreadsheets/d/1xEBIGA2UgY_wK4MnyNfnUKCfs4yXfX7wP_OteaGZBdw/edit?usp=sharing), attaching the 1910 census (https://docs.google.com/spreadsheets/d/1CMvOJISGMWA-kV0tDy_XwY5g52TYJhOWQjo0PubYqgQ/edit?usp=sharing), and the island hints (https://docs.google.com/spreadsheets/d/14sFT9PH1Z2HK1SGPVagaJz5U_UYlfdX1BJQzpWYfSSA/edit#gid=0). Please work on the 1920 spreadsheet first, then the 1910 spreadsheet, and then move on to the island hints. The point of the families added tab is to extend the tree out as far as you can. If you have moved on to the adding families section, please watch Dr. Price’s video: https://www.youtube.com/watch?v=PPu22KgOMH8&feature=youtu.be. If you are having any difficulty, please contact one of us or look at the FAQ section on the rll.byu.edu/students website.
Here is the link to the data for the project. Our goal is to label this training set for them as 0 = not a match, 0.5 = maybe a match, and 1 = is a match. They are welcome to figure out the best way to do this. I created a video showing how I approached it, but I’m sure that as an RA works on this, they will find a better approach.
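As a minimal sketch of the labeling scheme above, the following Python snippet checks that every label in a batch is one of the three allowed values and counts how many of each were assigned. The names `ALLOWED_LABELS` and `summarize_labels` and the example labels are hypothetical; they are not part of the project's actual tooling.

```python
from collections import Counter

# The three label values described above.
ALLOWED_LABELS = {0.0: "not a match", 0.5: "maybe a match", 1.0: "is a match"}


def summarize_labels(labels):
    """Count labels by meaning, raising ValueError on any unexpected value."""
    counts = Counter()
    for position, label in enumerate(labels):
        value = float(label)
        if value not in ALLOWED_LABELS:
            raise ValueError(f"row {position}: unexpected label {label!r}")
        counts[ALLOWED_LABELS[value]] += 1
    return dict(counts)


# Hypothetical labels for six candidate pairs.
example_labels = [1, 0, 0.5, 1, 0, 0]
label_summary = summarize_labels(example_labels)
print(label_summary)
```

A check like this can catch typos (for example, a 2 or a 0.05) before a labeled file is handed off.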
We are continuing the WWI project. Here is the link to the spreadsheet: https://docs.google.com/spreadsheets/d/1KNe3QdnpDfeK-guT1J9ssa3AUiLJska9vf9Hv0ad-jg/edit?usp=sharing
Before you start, please watch this video with instructions: https://youtu.be/Cg4-IPuOu_k
How do I get a free Ancestry.com account?
What is a public member tree on Ancestry.com?
A public member tree is the personal family tree of an Ancestry.com member. Ancestry.com does not verify these trees. They are often passed down through individuals’ families and are thus harder to confirm with additional sources. Even so, they are a very helpful tool for linking people together.
Why can’t I find the 1890 census?
Unfortunately, most of the records for the 1890 census were damaged or destroyed in a fire in 1921, so it is extremely rare to find anything from 1890.
Where do I save files? How does our file structure work?
Everything you need should be in the V drive. Almost all of your files should be saved in the folder for the paper you are currently working on, which will be in the papers\current folder. You are welcome to copy code to your working folder. However, we don’t want big files, or almost any data files, in your working folder. If you are saving raw data that you haven’t done anything to, you can save it in the raw_data\raw folder. You should never change anything in the raw_data folder unless you have express permission to do so.
A few other folders have resources for you to learn from. The “New RA Resources” folder has information pertaining to your team and the skills you will need or can use in this job. The “tools” folder has code and tools for web scraping, record linking, and similar tasks, as well as our base Python environment, which you should never touch.
If you have any questions about files, ask your paper or team lead. Please remember that we want any code you save to be usable in the future, so annotate your code and make clear what it does and how it does it. This video can also help with this question.
Ark- (Archival Resource Key) This is an ID for a record on FamilySearch. So when FamilySearch references a person on a census, they use an ark to reference it.
Pid- (personal identifier) This is a person ID. Each person on FamilySearch has one PID, and each record attached to that person has its own ark. So one person (PID) can have many records (arks) attached to their name.
Histid- This serves the same purpose as an ark, but it is what Ancestry uses. So if you are working with Ancestry data, you will use a histid instead of an ark.
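The one-to-many relationship between a PID and its arks can be pictured with a small Python sketch. The identifiers below are made-up placeholders, not real FamilySearch IDs, and the variable names are only for illustration.

```python
from collections import defaultdict

# Hypothetical (pid, ark) attachment pairs; real identifiers look different.
attachments = [
    ("PID-A", "ark-1920-census"),
    ("PID-A", "ark-1910-census"),
    ("PID-B", "ark-1920-census-2"),
]

# Group arks by the person they are attached to.
arks_by_pid = defaultdict(list)
for pid, ark in attachments:
    arks_by_pid[pid].append(ark)

# One person (PID) can have several records (arks).
print(arks_by_pid["PID-A"])
```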
Machine learning- This is teaching a computer to make really good predictions. There are many different ways to do this (we call these ways models), but in essence it is teaching a computer to guess something. For example, we can teach a computer to guess whether record A on one census is the same person as record B on a different census.
Training data- This is the data we use to teach a computer. For each observation, we have to know whether it is true or false. The computer uses that information to figure out the best way to predict what we want it to predict. For example, it could be a big data set of possible matches between people on one census and people on a different census, where for each possible match we know whether it is true (a correct match) or false (an incorrect match).
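To make the idea of training data concrete, here is a toy Python sketch in which each candidate pair has a similarity score and a known true/false label, and the "model" is simply the score cutoff that classifies the labeled pairs most accurately. The scores, labels, and function names are invented for illustration; the lab's actual models are not described here and are likely more sophisticated.

```python
# Hypothetical training data: (similarity score between two census records,
# whether the pair is known to be the same person).
training_data = [
    (0.95, True),
    (0.90, True),
    (0.70, True),
    (0.60, False),
    (0.40, False),
    (0.10, False),
]


def accuracy(threshold, data):
    """Share of pairs classified correctly when score >= threshold means match."""
    correct = sum((score >= threshold) == is_match for score, is_match in data)
    return correct / len(data)


def fit_threshold(data):
    """Pick the observed score that works best as a cutoff on the training data."""
    candidates = sorted(score for score, _ in data)
    return max(candidates, key=lambda t: accuracy(t, data))


best_threshold = fit_threshold(training_data)
print(best_threshold, accuracy(best_threshold, training_data))
```

Without the true/false labels there would be nothing to score the candidate cutoffs against, which is why labeled training data matters.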
The tree- We often say this to refer to the tree on FamilySearch. This is all the connected people and records that make up one big family tree.
Census tree- This is something we have created using only census data. We connect people from one census to another, creating a big family tree across American history.
R-drive/V-drive- The V drive is a shared drive accessible to Dr. Price’s RAs, and the R drive is the old one. We are in the process of moving everything to the V drive. For now, try to keep all of your files in the V drive.
Crosswalk- This is basically a data file that links two different data sets by their unique identifiers. So it could link an ark to a histid (FamilySearch data to Ancestry data).
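A crosswalk can be sketched in Python as a simple lookup from one identifier to the other. All identifiers, field names, and the `link_records` helper below are hypothetical and only illustrate the idea.

```python
# Hypothetical crosswalk rows linking a FamilySearch ark to an Ancestry histid.
crosswalk = {
    "ark-001": "histid-A",
    "ark-002": "histid-B",
}

# Hypothetical records from each source, keyed by that source's identifier.
familysearch_records = {
    "ark-001": {"name": "John Smith"},
    "ark-003": {"name": "Mary Jones"},
}
ancestry_records = {
    "histid-A": {"birthplace": "Utah"},
    "histid-B": {"birthplace": "Ohio"},
}


def link_records(fs_records, anc_records, ark_to_histid):
    """Combine records that the crosswalk says refer to the same census entry."""
    linked = []
    for ark, fs_record in fs_records.items():
        histid = ark_to_histid.get(ark)
        if histid is None or histid not in anc_records:
            continue  # No crosswalk entry (or no Ancestry record) for this ark.
        linked.append({"ark": ark, "histid": histid, **fs_record, **anc_records[histid]})
    return linked


linked_records = link_records(familysearch_records, ancestry_records, crosswalk)
print(linked_records)
```

Records without a crosswalk entry (here, ark-003) are simply left unlinked.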
Dictionary- This is also a data type in Python, but we generally use “dictionary” to mean a data file that makes a big data file, like a census, much easier to work with. For example, in our compact census, place of birth is stored as numbers. You can then look at the place of birth dictionary, which will tell you that 10 means “Denver” or something like that. Working with numbers is easier and takes less space than working with words, which is why dictionaries are helpful.
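As a rough illustration, decoding a compact census with a dictionary file might look like the Python sketch below. The codes, place names, field names, and the `decode_birthplace` helper are made up for the example and do not reflect the real compact census.

```python
# Hypothetical place-of-birth dictionary: compact numeric code -> readable label.
birthplace_dictionary = {10: "Denver", 11: "Provo"}

# Hypothetical compact census rows that store birthplace as a code.
compact_census = [
    {"name": "John Smith", "birthplace_code": 10},
    {"name": "Mary Jones", "birthplace_code": 11},
    {"name": "Ann Lee", "birthplace_code": 99},
]


def decode_birthplace(row, dictionary):
    """Return a copy of the row with a readable birthplace added."""
    label = dictionary.get(row["birthplace_code"], "unknown code")
    return {**row, "birthplace": label}


decoded_rows = [decode_birthplace(row, birthplace_dictionary) for row in compact_census]
print(decoded_rows[0]["birthplace"])
```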
API- (application programming interface) This lets us interact with a website’s data. For example, we can ask FamilySearch’s API to give us all the arks they have for specific PIDs. The API is how we get that information.