In Diego Lopez Yse’s “Your Guide to Natural Language Processing (NLP),” he discusses a number of techniques used by computer scientists to analyze text for spelling, meaning, and the topics it covers. In the piece, he mentions a number of current use cases, from predicting diseases using electronic health records to identifying fake news. Of course, it should be noted that this article was published in 2019, before the popularization of Large Language Models to analyze text and determine its meaning. Despite this, the article makes a number of arguments that are still applicable to Large Language Models.
For example, Yse explains at the beginning of the article that human language, while it can be interpreted to extract meaning, is often messy and unstructured. NLP often relies on written text, particularly from the internet, to process and extract the meaning from language. However, when pieces of text are taken from a broad range of sources, the words in each piece are divorced from the context that contributes to their meaning. Further, authors can develop meaning by what they omit from their writing, complicating the relationship between a written text and its meaning. As Yse highlights, simple steps in Natural Language Processing, like tokenization or the removal of stop words, can warp the meaning of a text, especially if the text incorporates multiple languages. So, for a model to accurately extract the meaning of a text, it cannot exclusively focus on the written word. Instead, the model must have an understanding of culture, of other texts being referenced, and more to fully grasp a text and its meaning. Given that Large Language Models are an ongoing development, it is difficult to predict whether models can learn the necessary aspects of culture and writing exclusively from analyzing written texts that are often divorced from the context they were written in.
Today’s reading was about NLP and how it is currently used within different industries. As of now it is most commonly used in medicine, and I can attest to this since my father, who is a veterinarian, uses NLP tools in his practice. Because this kind of model is so common, there are naturally many different techniques. The first is Bag of Words, which splits strings of words into individual elements for counting and analytical purposes, but some crowding occurs because there is no omission of what Lopez Yse calls “stop words”, words like “are” or “to” or “not”. The next technique is tokenization, which cuts a text into individual units. But new issues arise with this method, for punctuation can be removed, and words that need to stay together, like “deja vu”, are split. The third is stop word removal, which removes the stop words from phrases for a less-crowded analysis. The downside to this approach is that removing stop words can remove necessary context, as when you remove the word “not” from the phrase “I am not someone who enjoys larping”, which changes the phrase’s meaning entirely. So a lot depends on the list of stop words chosen for each application. There is also stemming, which simply slices words down to a prefix, and this works most of the time. However, the fragment that results from slicing often does not correspond to the original word, like the “one” in “Oneonta”. Lastly, there is lemmatization, which is arguably the most effective yet most complex approach: it resolves words to their dictionary form and retains their context. The downside is that it takes the longest and is the most complex to implement.
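To make those steps concrete, here is a minimal sketch of what that pipeline might look like in Python with the NLTK library (the library choice and the example sentence are my own assumptions, not anything from the article), including how dropping “not” flips the meaning of the larping example:

```python
# Minimal sketch of the preprocessing steps described above, using NLTK.
# The library choice and the sample sentence are illustrative assumptions.
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

sentence = "I am not someone who enjoys larping."

# Tokenization: split the string into individual word tokens.
tokens = nltk.word_tokenize(sentence.lower())

# Bag of Words: ignore order and just count how often each token appears.
bag_of_words = Counter(tokens)

# Stop word removal: drop common words. Note that "not" is on NLTK's default
# English stop word list, so this step flips the sentence's meaning.
filtered = [t for t in tokens if t not in stopwords.words("english")]

# Stemming: crude suffix slicing that can produce non-words.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]

# Lemmatization: resolve each word to a dictionary form instead (slower).
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in filtered]

print(bag_of_words)  # word counts, order ignored
print(filtered)      # "not" is gone, so the negation is lost
print(stems)         # e.g. "someone" becomes the non-word "someon"
print(lemmas)        # dictionary forms, closer to the original words
```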
I think it is interesting to speak about the process of creating these models without much consideration of the ethics. The more we become involved with NLP models, to the point that we have models as large as ChatGPT, the more people are beginning to recognize the involvement of data and public texts as inputs for these systems. The article talks about many different ways people deal with language by stripping words down or using context to determine roots. These are all techniques to mimic the way we understand and utilize patterns in our own language learning as humans. I find it interesting that we have needed to deconstruct these underlying systems to such an extent to allow models to replicate our learning process. Furthermore, when we compare the way these models learn from vast amounts of textual information to a human’s vast visual and auditory learning over the course of many continuous years, it is interesting to note the differences. We also tend to strip away a lot of information by stemming or stop word removal to make categorization of materials or ideas easier. These models are definitely useful for helping categorize and display data from health to finance to misinformation, as was mentioned, but that does, as always, bring up issues of censorship, bias, and discrimination. A lot of information is contextual. If a particular idea is considered misinformation and someone goes to explain the issues behind that idea, would an algorithm be nuanced enough to deal with that? If it gets it wrong and there are not the resources to review such a decision, will an improper decision stand? I feel like I have seen this kind of issue all throughout digital platform moderation.
Learning about the processes behind Natural Language Processing was extremely interesting. There are so many minute details I read through that I had never considered when it came to automating human language. With that being said, however, I still have a lot of questions regarding NLP, in particular about its capabilities with spoken discussion and dialogue. There is of course the preconceived notion I had in my head going into this reading, which was that three-quarters of communication is non-verbal. Although the reading focuses primarily on textual language, I still believe that part of that notion translates. So much of what we perceive and take away from communication is context: what is happening to the communicator, around them, to the person they are communicating with, etc. To simply look at the language, to me, seems like a restrictive way of truly comprehending what was communicated.
Of course, I am interested to see what positives come from NLP, but I have my skepticism about its success. We have already learned about failures of similar technology models at predicting or reading into human insight. Additionally, it would not surprise me if there was bias intertwined with the technology as well, or if it was built to suit only the needs of English speakers. I am interested to see the course that NLP takes, but like all forms of automation, I believe it needs to be closely studied before wide acceptance.
Natural Language Processing (NLP), the field concerned with enabling computers to comprehend and extract meaning from human language, is fascinating and rapidly advancing. NLP techniques have found diverse applications across industries such as healthcare, finance, media, and recruitment, empowering machines to process and analyze vast amounts of natural language data. From sentiment analysis to information extraction, chatbots to predictive analytics, NLP has demonstrated its potential to revolutionize various domains.
Key techniques in NLP include the Bag of Words model, tokenization, stop words removal, stemming, lemmatization, and topic modeling. Each technique serves a specific purpose, enabling researchers and practitioners to manipulate and derive insights from textual data. However, challenges persist in accurately capturing language nuances and context, as evidenced by the limitations of early chatbots like Microsoft’s Tay. The complex nature of language poses obstacles, but ongoing advancements in NLP are paving the way for more sophisticated applications.
The Bag of Words model, while useful for text classification, disregards semantic meaning and context, highlighting the need for more sophisticated approaches. Stemming, although fast, may produce non-words or alter the intended meaning of a sentence. In contrast, lemmatization provides proper words and handles context, albeit at a slower pace. Meanwhile, topic modeling, particularly through algorithms like Latent Dirichlet Allocation (LDA), enables the discovery of latent topics within document collections, facilitating tasks such as text classification and trend detection.
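Out of curiosity, here is a rough sketch of what LDA topic modeling can look like in practice with scikit-learn; the toy documents, the choice of two topics, and the parameters are my own assumptions for illustration, not anything prescribed by the article:

```python
# Rough sketch of topic modeling with Latent Dirichlet Allocation (LDA)
# using scikit-learn. The toy corpus and parameters are illustrative only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The patient was prescribed medication after the diagnosis.",
    "Doctors reviewed the health records before starting treatment.",
    "The bank reported strong quarterly earnings and revenue growth.",
    "Investors reacted to the financial markets and interest rates.",
]

# Bag of Words representation: each document becomes a vector of word counts,
# with common English stop words removed.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Fit LDA to discover two latent topics in the document-term matrix.
# (With a corpus this tiny, the discovered topics are only illustrative.)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the words most strongly associated with each discovered topic.
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {topic_idx}: {top_words}")
```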
Looking ahead, the future of NLP holds both challenges and promise. While the complexities of language pose ongoing difficulties, the field is progressing rapidly. Innovations are emerging, such as chatbots that can remember context and details from prior conversations, suggesting a future where machines can better understand and engage in meaningful dialogues. NLP’s value lies not only in its technology but also in its vast range of applications that continue to expand with each passing day.
In conclusion, NLP is a captivating field that empowers machines to comprehend and analyze human language. Despite the challenges posed by language complexity, ongoing advancements in NLP techniques offer exciting prospects for the future. As NLP continues to evolve, complex applications that were once deemed impossible may become a reality, transforming the way we interact with and harness the power of language.
I thought this was a pretty good intro to NLP for general applications. I personally have really only seen articles focused on the models – i.e. recurrent NNs and their math – so focusing on processing the data beforehand feels like a wider-view approach that includes stuff that is also really important (if less interesting). I wish this article had included another paragraph on vectorization, because I think that’s kind of the missing connection between what it touched on and the math side that people tend to be more familiar with. It’s really impressive that models can extract the stuff that they can, and cool that language is so mathy, something I didn’t really appreciate until I took linguistics. Also fun that context dependence in languages is a thing we talk about in 341 but also something applicable to real-life languages like Japanese.
I found the distinction between traditional keyword-based interpretation and the more cognitive approach of understanding the meaning behind words to be particularly insightful. The complexity of human language, filled with nuances, ambiguities, and cultural contexts, is immense, and it’s remarkable to consider the progress we’ve made in teaching machines to interpret it. Also, it’s noteworthy to see how expansive the applications of NLP have become. The range of industries from healthcare to finance to human resources that are harnessing the power of NLP is indicative of the transformative potential of this field. The examples of predicting diseases from electronic health records and using NLP in sentiment analysis are particularly relevant, showcasing real-world value and the merging of computational techniques with human-centric data. Moreover, I appreciated the balanced perspective provided, with the inclusion of potential pitfalls and ethical concerns. The example of Microsoft identifying users with potential health conditions based on search queries emphasizes the thin line between innovation and privacy invasion. It’s a reminder that with great technological power comes great responsibility.
It is easy to forget how complicated a language is when you speak and interpret it constantly, but this article was a reminder of the enormous difficulty associated with modeling one. Especially with a language like English, which is rife with abnormality, this article seems to open more questions than it answers about NLP, such as how an algorithm might distinguish words spelled the same that are both nouns with different meanings, like case and case, bat and bat, etc. That also raises questions about how a computer may interpret things to be appropriate or not, the consequences of which are demonstrated by the Twitter AI at the end of the article. Ethically, it is difficult to determine what may or may not be acceptable input and output for language modeling, and it is even stickier when trying to determine how to set the rules. When looking at the organizations producing AI and LLMs, who is setting the ethical standards for their use and outputs, and how are those standards enforced in the code they are producing?
I think LLMs are some of the most compelling advancements in the field of AI/machine learning in the last couple of years. As others have mentioned, languages are an incredibly complex thing to set out to understand. Even humans, the language kings of the animal kingdom, have a hard time reaching complete fluency once a certain window of cognitive development has passed. That we can make an algorithm capable of cognitively understanding something as expansive, complicated, and artistic as language is pretty incredible.
One of the coolest uses of LLMs mentioned in this article is the idea of a cognitive assistant. In the strange way the article describes, every single action we take sheds an immense amount of data about ourselves. Normally, these actions go unanalysed, ignored, or even completely forgotten. That something could observe, learn, and never truly forget is an interesting and very concrete reason to incorporate such tools into your life. Even though there are a lot of potential data privacy complications involved in a product like this, I see this sort of thing being compelling enough for a lot of people to make concessions in order to enjoy it.
One thing I strongly dislike is the incorporation of this technology into creative efforts. Humans are by no means original creatures. We learn and are influenced by everything we are exposed to, consciously or unconsciously, but everything we create is inseparable from our lived experience as humans. It informs our decisions, even if we don’t understand them, to create something new. LLMs can do much of this work, mostly poorly, but they do it all without the spark that fuses inspiration and originality. I think this issue reached a head recently with the SAG-AFTRA strikes, where screenwriters, actors, and associated professionals felt seriously threatened that corporations would adopt AI as rapidly as possible and leave them without a job, all while pillaging their previous work. This isn’t a far-out concern, and the time is quickly coming to legally address it.
Languages are something that we often use without thought, but they are really difficult and complicated for computers. The article illustrated how computers extract meaning from people’s conversations, tweets, and emails, and how complicated those processes are.
NLP will face a lot of challenges in the future because roughly 150 to 200 languages are spoken by more than a million people each.
Each language has totally different structures and grammar. NLP developers need to continuously adapt their tools for each language, which is a really demanding task.
NLP provides an efficient approach to analyzing a sentence by assigning it a value with sentiment analysis. However, it might not always get the most reasonable analysis. Some kinds of sentences, such as ironic ones, are hard to analyze without cultural and social context and might not yield the intended meaning. As a non-native speaker living in the US, there are lots of things that I do not understand correctly. I remember when I first heard “all lives matter,” I thought it was well-intentioned since it is about the lives of all human beings. However, after talking with some Americans, I realized that it is actually an argument against Black Lives Matter, as those opponents argue that all lives matter, not only Black lives. I could not understand that without the experience and cultural context, and I think a machine is probably not able to figure that out either.
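To illustrate why this worries me, here is a tiny sketch using NLTK’s VADER sentiment analyzer (the example sentences are my own, and VADER is just one common off-the-shelf scorer): a lexicon-based analyzer rates an ironic complaint as positive because it only sees the words, not the context.

```python
# Tiny sketch: a lexicon-based sentiment scorer reads words, not intent.
# The example sentences are made up; VADER is one common off-the-shelf analyzer.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

ironic = "I just love waiting in line for hours."   # actually a complaint
sincere = "I hated waiting in line for hours."      # literal complaint

# Both sentences express frustration, but the ironic one scores as positive
# because the analyzer only sees the word "love", not the social context.
print(sia.polarity_scores(ironic))   # compound score comes out positive
print(sia.polarity_scores(sincere))  # compound score comes out negative
```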
A lot of the time when I read these types of articles discussing machine learning technology, I get the same feeling of watching an infomercial. Beyond learning about NLP, I feel like I’m being sold it. The article discusses current use cases for NLP, but most of them involve either more efficient business practices or more curated internet media. The only one that seemed truly helpful for average people was help in prediction and diagnosis of disease. Spencer brought up how it’s strange to talk about these models without delving into the ethics of using them, and I totally agree. I really disliked how the article brought up “huge improvements in the access to data” like that was an uncontroversial, completely positive thing. This is not written for the average consumer who might be concerned with their data privacy, this is for NLP tech to be sold to businesses. They bring up Microsoft’s Tay at the end, one of the most memorable failures of AI in recent history. But this isn’t used as a warning, it’s shown as an example to see how far we’ve come, even though NLP models like ChatGPT are already being used by many dubious people today.
I think it’s a shame that the future of these types of language models is set in the breaching of private data and the efficiency of industry (which often involves the firing of humans and the creation of poorly written, pointless content), as I don’t think all use cases are problematic. Using these types of models to summarize text, clean up my feed, or warn me about potential heart disease sounds genuinely useful. I appreciate how the article explains that these NLP models are pretty simple to understand from a general standpoint, as a lot of other media about machine learning models treat them like magic. I just can’t get over the fact that they discuss NLP without acknowledging the breaching of private data, the internal biases AI models can inherit, and use cases beyond helping large corporations.
I never really considered the idea of NLP, so I thought this was a good article to tell me about NLP and how it works. It was very interesting to read, and one thought that crossed my mind was Microsoft’s AI bot that was put on Twitter. Seeing how it ended up being a failure, becoming racist by taking in input from the people it talked to, made me wonder about society as a whole and how it could affect the world of technology. By learning more about society and technology and how both of these components affect each other, I became curious: is society the reason why technology is biased? We see this even with algorithms for criminal recidivism, where the data we input into the algorithm is our own, and the flaw is that it is quite biased. It is also interesting to consider how we would be able to know which inputs are good for the bot and which ones aren’t. I also thought about the idea of different languages going into NLP. Is it only English? Do other languages work as well for things such as Alexa? How would they work? How will NLP be able to read several different languages? Is it through something as inaccurate as Google Translate? These ideas make NLP a leading factor in technology, but also raise concerns about how technology can be equitable in society and whether it is a viable option to use rather than humans for communication.
Reading about natural language processing was really interesting. I wasn’t aware of what exactly NLP was, and I was quite surprised to see how important it has become in today’s modern world. It was really surprising to see that there exists a field of artificial intelligence that gives machines the ability to read, understand, and derive meaning from human language and that such a field is used in many other fields. For example, I was surprised that NLP enables the prediction of diseases based on electronic health records and patients’ own personal speech. I think that this shows that the potential for this field is immense; it’s just a matter of time for it to become more reliable and accurate. However, there are many circumstances that need to be addressed before it reaches a state where it is fully functional and accurate.
Reading about the chatbot released on Twitter as an experiment in 2016 made me realize that most of the problems with these technologies might not be directly related to the technology itself but to the ways humans use language to interact. Language and words are powerful tools that can be really harmful. Also, it’s important to take into account that language evolves at a rapid pace; new expressions and connotations for words arise constantly, so it’s important for the model to keep up with this.
Overall, I think that NLP will be even more impactful in the future, but it is important to keep in mind the effects it can have socially.
Spencer hit on a lot of thoughts I had after reading this article about ethics. I am taking Teaching Writing this semester, and we just had a module on linguistic justice and valuing languages and dialects like Black English/Ebonics or Spanglish in academic environments as writers find their voices. So I wonder how NLP reckons with that, which the article does not really touch on. I also wonder how sentiment analysis comes into play. For example, the bullet point about NLP being able to predict diseases based on factors like a patient’s own speech may have adverse impacts, especially regarding mental/behavioral disorders and neurodivergence. What is considered “aggressive” or “disoriented” speech? How do ideas of competency come into play when analyzing how a patient of color speaks versus a white person? Using “medications and treatment outcomes from patient notes” is also cause for concern because of medical racism and discrimination. A patient’s history may not point to anything because they have been neglected for so long.
I was also really struck by one of the examples of NLP applications given: “An inventor at IBM developed a cognitive assistant that works like a personalized search engine by learning all about you and then remind you of a name, a song, or anything you can’t remember the moment you need it to.” The other day I read a chapter from Edwin Black’s “IBM and the Holocaust” which is about how IBM punch cards were used to facilitate the collection and genocide of Jewish people. A cognitive assistant/personalized search engine is not going to directly facilitate another atrocity like the Holocaust, but I think it is worth tracing the lineage of these ideas of surveillance. The article associated with this bullet point is titled “A Search Engine for Your Memories,” and there is already so much research that proves that our memories can be unreliable. So how does this algorithm avoid our own personal biases if that is the only thing it can build off of? It seems like it would work to reinforce them rather than diminish them.
While I find language-related computational methods very cool, I still think there are some aspects that give me pause. The article talks about one of the use cases of NLP being the identification of fake news. One of the emerging use cases for this technology, though, is also its application in things like deepfakes and the generation of fake news. What is the line between developing this technology to combat misinformation and harnessing it for the very thing it purports to fight?
Also, as touched on by the article, many NLP methods are developed with English in mind. While languages are extremely systematic, they are not systematic in the same ways. This contributes to the greater proliferation of English on the internet, or the assumption that the language is the lingua franca of this arena (which it kind of is), due to things like the relative ease of translation to and from it. When translating between languages, for example, it is more common to have correct (or slightly more correct) translations between languages that are similar to English (mostly European languages) than it is for languages that do not resemble it. While part of this is because there is much more data for English on the internet, it is also because these methods implicitly centre and work best for this language.
I was also thinking about the prescriptive vs. descriptive divide within linguistics. In the context of English, this would be the difference between standard English (how it “should” be spoken) and the various dialects it actually takes the form of. Since descriptive accounts of language are much more numerous and nuanced than prescriptive accounts, NLP might still support the reproduction and proliferation of standard forms of English without the inclusion of dialects.
Finally, the article mentions that language, much like other forms of information, can have value extracted from it in that we can predict human behaviour. How does privacy play into this? Shoshana Zuboff talks about how there is a slippery slope between predicting human behaviour and actually influencing that same behaviour, which is something that could be extended to the use cases of NLP.
Especially in the era of rapidly developing LLMs such as ChatGPT, which has become so prevalent that in under a year it has gone from unreleased to such a force that it has its own place in the Academic Honesty policy, it is essential to consider NLP (Natural Language Processing), which describes the tasks these models perform. Much of the article read for today focuses on the technical aspects of NLP, defining terms like tokenization and describing challenges, like appropriately stemming words, that can impact the performance of these models. These technical descriptions are essential to an appropriate understanding of NLP’s outcomes, as well as what it is as a process.
However, the article’s purpose is not exclusively to give a guidebook to NLP terminology and practices—it also addresses NLP futures, and the perils that arise from NLP done wrong. Specifically, there is a focus on a Microsoft Twitter chatbot from 2016 called Tay, which had to be taken down within 16 hours because it had become “racist and abusive”. Microsoft theorized that, by interacting with more users, Tay would become smarter and more nuanced, but instead, Tay got a lot of racist and abusive interactions, which it parroted back. Much like many algorithms that we’ve discussed, NLP is only as good as its data. Tay could only learn what it was told, and the environment it was exposed to caused its training to go south very quickly. Therefore, as computer scientists, in NLP as in elsewhere, it is essential to consider not just the ethical impacts of what we ourselves are writing at face value, but how input, such as training data, which comes from the larger, often biased, outside world, factors into our outcomes. It is not enough to be ethical in theory; instead, this must reflect in practice. As always, results matter.
I think the use of NLP in the healthcare industry to improve clinical documentation, thus helping healthcare workers to improve the patient experience, sounds helpful to me. I guess what I’m unsure about is what data is fed to these machines, given the racism/sexism embedded in a multitude of medical research and practices. If this is the data used for diagnosis, then these machines just continue the misdiagnosis of people without someone being held accountable. Under less serious circumstances, I wonder how the constant changes in slang, other new phrases incorporated into language, and differences in meaning across languages would be accounted for in language learning models. I’m also wary of the use of a chatbot therapist, even if NLP advances greatly; in addition to lacking the human aspect of therapy, I am wary of how its responses to each individual would be, since experiences vary greatly.
In Diego Lopez Yse’s article, he gives a brief overview of the motivations for and techniques used in modern Natural Language Processing (NLP). To me, NLP is an exciting area of study, since the field has advanced rapidly in just the last few years, more so than any other technology that I’m aware of. I remember trying out a state-of-the-art chatbot model on HuggingFace when I was in high school, and while it was sometimes amusing, it often produced completely nonsensical or ungrammatical responses. Fast forward to 2023, and a conversation with ChatGPT is often impossible to differentiate from one with a human being.
After reading about some of the techniques described, like stemming, tokenization, and lemmatization, I’m curious whether any of these low-level operations have been changed or improved upon by newer NLP models. If not, then I’m curious how LLMs have been able to overcome the problems with tokenization, like producing sentences with an ambiguous meaning or changing the meaning of a sentence. I also think the underlying engineering challenges of processing billions or even trillions of input tokens must be immense, and I’m interested to learn how those have impacted the algorithms’ development.
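From what I understand (this is my own reading, not something the article covers), most recent LLMs sidestep word-level tokenization by using subword tokenizers such as byte-pair encoding, which never have to drop or mangle unknown words. Here is a rough sketch using Hugging Face’s GPT-2 tokenizer as one example:

```python
# Rough illustration of subword tokenization, which most modern LLMs use
# instead of the word-level tokenization described in the article.
# The GPT-2 byte-pair-encoding tokenizer is used here as one example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Unbelievably, deja vu struck again."

# Rare or unseen words are broken into smaller known pieces, so there is no
# out-of-vocabulary problem and nothing has to be dropped or stemmed away.
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(tokens)  # subword pieces rather than whole words
print(ids)     # the integer IDs actually fed to the model

# Decoding the IDs reconstructs the original string exactly, unlike the lossy
# stemming / stop-word-removal pipeline the article walks through.
print(tokenizer.decode(ids))
```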
I thought this week’s reading was very interesting. I liked getting to know exactly how NLP works and how it can do so many different things, like sentiment analysis, helping identify fake news, automating litigation tasks, and helping treat Alzheimer’s. This reading left me wondering how the code would look; I would love to see that. I was particularly impressed by the idea of lemmatization, in which programs can learn more about the context of words by resolving them to their dictionary form. Lastly, I thought it was crazy how the artificial intelligence chatbot released on Twitter became so racist.
At one point in the reading, the author mentions that the use cases of natural language processing are more important than how it is done. I agree; the use cases are far more diverse than the methodologies of NLP. The comprehension and use of language seem like something that a model would struggle greatly to keep up with. So many different terms are used for purposes that are contrary to their definitions. However, when I fire up ChatGPT, I find that it understands exactly what I mean every time I ask a question or make a statement. It then responds in clear and understandable language. I wonder if it would be able to keep up at mealtime in a conversation with friends, friends that like to create nicknames and slang. I wonder if NLP functions in a situation where language use is constantly evolving, where a new term for an item might be created and discarded in the very same sentence. I think this matters because I think eventually AI is going to be developed to operate in situations that require social competency, and social competency is not something that can be easily defined or implemented. Even if it can be described one day, the very next day the landscape may have changed overnight.
Also, what about body language? How would somebody go about tokenizing or stemming body language for the purposes of NLP? Is NLP just limited to Western vocal communication right now?
I have a lot of interest in the relation (and the differences) between words and meaning, intention, etc., so this article was an interesting introduction to current trends and fascinations in the computational analysis of language. I was hoping for the article to go a bit deeper into the ethical concerns LLMs and NLP present, especially, as Sira mentioned, for other dialects and languages. Additionally, the separation of words into units by spacing shows the seeming anglocentrism of the field as it was described in the article. I am also curious about the use of Large Language Models like ChatGPT and how they can become a crutch for students, used to offload the valuable work of critical thinking, developing writing skills, etc. I have noticed this especially in some of my past CS courses, where I have seen classmates use ChatGPT or similar applications for the few written/reflective assignments we are given, even though these people are often the ones who could benefit the most from these endeavors.
At this point I’ve taken nearly as many linguistics courses at Grinnell as I have CS courses. Linguistics is my primary interest, and I’ve made sure to pay extra attention to the overlap between linguistics and CS. I understand this is an introductory article to NLP, so I won’t fault it too much for its surface-level explanations. That said, I find it frustrating how much of the discourse around NLP is so focused on data processing, tokenization, and pattern analysis without much regard for grammar models and linguistic research. I know it’s because robust linguistic analyses are much more expensive and time-consuming than simple ML algorithms, but we’re trying to use NLP for some very important things like healthcare, and that is terrifying to me. Also, I do not believe for a moment that NLP technology is improving care delivery and bringing healthcare costs down for patients, at least not in the US. That has never been the goal for the companies investing in NLP. Much more likely is that NLP is being used to help automate claim denials for insurance companies.
“Your Guide to Natural Language Processing (NLP)” was a good introduction to a fast-developing subfield at the intersection of computer science, linguistics, and psychology. In the past decade, I’ve been exposed to some applications of NLP in chatbots or recommendation systems, but I never put too much thought into the work behind the scenes because it felt too complex. The closest thing to NLP that I did was a CSC-151 project about sentiment analysis, where we determined the general sentiment or attitude of a text or a novel based on the frequency of positive and negative words. The list of application examples surprised me because I never expected that NLP could accomplish so much more than I knew, including disease diagnosis, cognitive assistance, and fake news prevention. The fact that it can be used so widely in many industries is promising, but it’s also concerning to think about the amount of information that it requires and the knowledge that it may acquire from human data. Although the techniques used to process and manipulate languages seem very challenging, especially when NLP is being applied in so many languages and used in more colloquial contexts, it does feel like the field has been advancing at a really fast pace given the widespread use of ChatGPT. I’m excited to learn more about the progress NLP professionals have made, and the remaining challenges and ethical concerns raised during the process.
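For anyone curious, the core of that CSC-151 project was roughly something like the sketch below; the word lists and sample sentences here are made up for illustration, not taken from the actual assignment:

```python
# Sketch of frequency-based sentiment analysis: count positive and negative
# words and compare the totals. Word lists and sample text are made up.
POSITIVE = {"good", "great", "happy", "love", "wonderful"}
NEGATIVE = {"bad", "terrible", "sad", "hate", "awful"}

def sentiment(text: str) -> str:
    """Label a text positive, negative, or neutral by counting lexicon hits."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(sentiment("It was a wonderful, happy ending."))        # positive
print(sentiment("The weather was terrible and I was sad."))  # negative
```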
In Diego Lopez Yse’s “Your Guide to Natural Language Processing (NLP),” he discusses a number of techniques used by computer scientists to analyze text for spelling, meaning, and the topics they consist of. In the piece, he mentions a number of the current use cases, from predicting diseases using electronic health records to identifying fake news. Of course, it should be noted that this article was published in 2019, before the popularization of Large Language Models to analyze text and determine their meaning. Despite this, the article still has a number of arguments that are still applicable to Large Language Models.
For example, Yse explains at the beginning of the article that human language, while it can be interpreted in a way to extract meaning, is often messy and unstructured. NLP often relies on written text, particularly from the internet, to process and extract the meaning from language. However, when pieces of texts are taken from a broad range of sources, the words used in each piece are divorced from their context, which contributes to the meaning of the text. Further, by omitting words from their writing, authors can further develop meaning, complicating the relationship between a written text and its meaning. As Yse highlights, simple steps in Natural Language Processing, like tokenization or the removal of stop words, can warp the meaning of a text, especially if the text incorporates multiple languages. So, for a model to accurately extract the meaning of a text, it cannot exclusively focus on the written word. Instead, the model will have to have an understanding of culture, other texts being referenced, and more to fully grasp a text and its meaning. Given that Large Language Models are an ongoing development, it is difficult to predict whether models can learn the necessary aspects of culture and writing exclusively from analyzing written texts that are often divorced from the context they are written in.
Today’s reading was about NLP’s and how they are currently used within different industries. As of now it is commonly used the most in medicine, and can attest to this since my father who is a veterinarian uses NLP’s in his practice. Because the algorithm model is so common, there are naturally many different iterations. The first is Bag of Words, which splits strings of words into individual elements for counting and analytical purposes, but there is some crowding that occurs because there is no omission of what Lopez Yse coins “stop words”, which are words like “are” or “to” or “not”. The next and more advanced algorithm is called tokenization, which addresses the issue mentioned previously. But, new issues arise with this method, for punctuation can be removed, and words that need to be together like “deja vu” are split. The third is stop word removal, which only removes the stop words from phrases for a less-crowded analysis of a phrase. The downside to this approach is that removing stop words can remove necessary context from a phrase, like when you remove the word “not” from the phrase “I am not someone who enjoys larping”, because it changes the phrase’s meaning entirely. So, it depends on the list of stop words for each application of this algorithm. There is stemming, which simply slices words to get their prefix, which works most of the time. However, the word that could result from slicing the words often does not correlate to the original word, like the “one” in “Oneonta”. Lastly, there is lemmatization, which is objectively the most effective yet complex approach. It resolves words to their dictionary form and retains their context. The downside is that it takes the longest and is the most complex to implement.
I think it is interesting to speak about the process of creating these models without much consideration of the ethics. The more we become involved with NLP models to the point that we have models as large as chat GPT, the more people are beginning to recognize the involvement of data and public texts as inputs for these systems. The article talks about many different ways people deal with language by stripping words down or using context to determine roots. These are all different techniques to mimic the way we understand and utilize patterns in our own language learning as humans to understand things. I find it interesting that we have needed to deconstruct these underlying system to such an extent to allow models to replicate our learning process. Furthermore, when we think of the way these models learn from vast amounts of textual information as compared to a human’s vast visual auditory learning over the course of many continuous years it is interesting to note the differences. We also tend to strip a lot of information by stemming or stop word removal to make categorization of materials or ideas easier. These models are definitely useful for helping categorize and display data from health to financial to misinformation as was mentioned, but it does, as always, bring up issues of censorship bias and discrimination. A lot of information is contextual. If one a particular idea is considered misinformation and someone goes to explain the issues behind that idea. Would an algorithm be nuanced enough to deal with that. If it gets it wrong and there are not the resources to review such a decision will an improper decision stand. I feel likeI have seen this kind of issue all throughout digital platform moderation.
Learning about the processes behind Natural Language Processing was extremely interesting. There are so many minute details that I read through that I had never considered when it came to automating the human language. With that being said, however, I still have a lot of questions regarding NLP. In particular, it has capabilities with spoken discussion and dialogue. There is of course the famous preconceived notion I had in my head going into this reading which was that three-quarters of communication is non-verbal. Although the reading focuses primarily on textual language, I still believe that part of the latter notion translates. So much of what we perceive and take away from communication is context. What is happening to the communicator, around them, to the person they are communicating to, etc.. To simply look at the language to me, seems like a restrictive way of truly comprehending what was communicated.
Of course, I am interested to see what positives come from NLP, but I have my skepticism about its success. We have already learned about failures of similar technology models of predicting or reading into human insight. Additionally, it would not surprise me if there was bias intertwined with the technology as well or if it was built to only suit or with the needs of only English speakers. Moreover, I am interested to see the course that NLP takes, but like all forms of automation, I believe it needs to be closely studied before wide acceptance.
In the realm of Natural Language Processing (NLP), the ability of computers to comprehend and extract meaning from human language is a fascinating and rapidly advancing field. NLP techniques have found diverse applications across industries such as healthcare, finance, media, and recruitment, empowering machines to process and analyze vast amounts of natural language data. From sentiment analysis to information extraction, chatbots to predictive analytics, NLP has demonstrated its potential to revolutionize various domains.
Key techniques in NLP include the Bag of Words model, tokenization, stop words removal, stemming, lemmatization, and topic modeling. Each technique serves a specific purpose, enabling researchers and practitioners to manipulate and derive insights from textual data. However, challenges persist in accurately capturing language nuances and context, as evidenced by the limitations of early chatbots like Microsoft’s Tay. The complex nature of language poses obstacles, but ongoing advancements in NLP are paving the way for more sophisticated applications.
The Bag of Words model, while useful for text classification, disregards semantic meaning and context, highlighting the need for more sophisticated approaches. Stemming, although fast, may produce non-words or alter the intended meaning of a sentence. In contrast, lemmatization provides proper words and handles context, albeit at a slower pace. Meanwhile, topic modeling, particularly through algorithms like Latent Dirichlet Allocation (LDA), enables the discovery of latent topics within document collections, facilitating tasks such as text classification and trend detection.
Looking ahead, the future of NLP holds both challenges and promise. While the complexities of language pose ongoing difficulties, the field is progressing rapidly. Innovations are emerging, such as chatbots that can remember context and details from prior conversations, suggesting a future where machines can better understand and engage in meaningful dialogues. NLP’s value lies not only in its technology but also in its vast range of applications that continue to expand with each passing day.
In conclusion, NLP is a captivating field that empowers machines to comprehend and analyze human language. Despite the challenges posed by language complexity, ongoing advancements in NLP techniques offer exciting prospects for the future. As NLP continues to evolve, complex applications that were once deemed impossible may become a reality, transforming the way we interact with and harness the power of language.
I thought this was a pretty good intro into NLP for general applications. I personally have really only seen articles focused on the models – i.e. Recurrent NNs and their math – so focusing on processing the data before feels like a wider-view approach and including stuff that is also really important (if less interesting). I wish this article would have included another paragraph on vectorization cause I think that’s kinda the missing connection between what it touched on and the math side that people tend to be more familiar with. Really impressive that models can extract the stuff that they can, and cool that language is so mathy, and something I didn’t really appreciate until I took linguistics. Also fun that non-context dependent languages are a thing we talk about in 341 but also like something that is applicable in real life languages like Japanese.
I found the distinction between traditional keyword-based interpretation and the more cognitive approach of understanding the meaning behind words to be particularly insightful. The complexity of human language, filled with nuances, ambiguities, and cultural contexts, is immense, and it’s remarkable to consider the progress we’ve made in teaching machines to interpret it. Also, it’s noteworthy to see how expansive the applications of NLP have become. The range of industries from healthcare to finance to human resources that are harnessing the power of NLP is indicative of the transformative potential of this field. The examples of predicting diseases from electronic health records and using NLP in sentiment analysis are particularly relevant, showcasing real-world value and the merging of computational techniques with human-centric data. Moreover, I appreciated the balanced perspective provided, with the inclusion of potential pitfalls and ethical concerns. The example of Microsoft identifying users with potential health conditions based on search queries emphasizes the thin line between innovation and privacy invasion. It’s a reminder that with great technological power comes great responsibility.
It is easy to forget how complicated a language is when you speak and interpret it constantly, but this article was a reminder at the enormous amount of difficulty associated with modeling one. Especially with a language like English, which is ripe with abnormality, this article seems to open more questions than it answers about NLP, such as how an algorithm might distinguish words spelled the same that are both nouns with different meanings like case and case, bat and bat etc. That also raises questions about how a computer may interpret things to be appropriate or not, the consequences of which are demonstrated by the Twitter AI at the end of the article. Ethically, it is difficult to determine what may or may not be acceptable input and output for language modeling, and it is even more sticky when trying to determine how to set the rules. When looking at the organizations that are producing AI and LLMs, who is setting the ethical standards for its use and outputs, and how are those standards enforced in the code they are producing?
I think LLMs are some of the most compelling advancements in the field of AI/machine learning in the last couple of years. As others have mentioned, languages are an incredibly complex thing to set out and understand. Even humans, the language kings of the animal kingdom, have a hard time reaching complete understanding after a specific time of cognitive development has passed. That we can make an algorithm capable of cognitively understanding something as expansive, complicated, and artistic as language is pretty incredible.
One of the coolest uses of LLMs mentioned in this article is the idea of a cognitive assistant. In the strange way the article describes, every single action we take sheds an immense amount of data about ourselves. Normally, these actions go unanalysed, ignored, or even completely forgotten. That something could observe, learn, and never truly forget is an interesting and very concrete reason to incorporate such tools into your life. Even though there are a lot of potential data privacy complications involved in a product like this, I see this sort of thing being compelling enough for a lot of people to make concessions in order to enjoy it.
One thing I strongly dislike is the incorporation of this technology into creative efforts. Humans are by no means original creatures. We learn and are influenced by everything we are exposed to, consciously or unconsciously, but everything we create is inseparable from our lived experience as humans. It informs our decisions, even if we don’t understand them, to create something new. LLMs can do much of this work, mostly poorly, but does it all without the spark that fuses inspiration and originality. I think this issue reached a head recently with the SAG AFTRA strikes, where screenwriters, actors, and associated professionals felt seriously threatened that corporations would adopt AI as rapidly as possible and leave them without a job, all while pillaging their previous work. This isn’t a far out concern and the time is quickly coming to legally address it.
Languages are something that we often use without thought, but they are really difficult and complicated for computers. The article illustrated how computers get from people’s conversations, tweets, and emails and how those processes are complicated.
The NLP will face a lot of challenges in the future because roughly 150 to 200 languages are spoken by more than a million people each.
Each language has totally different structures and grammar. The NLP developers need to update NLP for each type of language continuously, which is a really demanding task.
NLP provides an efficient approach to setting the variables and analyzing the sentence by assigning value with sentiment analysis. However, it might not always get the most reasonable analysis. Some special sentence, such as irony, is hard to analyze without cultural and social context, might not provide similar meanings. As a non-native speaker living in the US, there are lots of things that I do not understand correctly. I remember when I first heard all live matter, I thought it was a good intention since it is about the lives of all human beings. However, after talking with some Americans, I realize that it is actually an argument against Black Lives Matter. As those opponents argue all lives matter not only black lives. I cannot understand those without the experience and cultural context, and I think the machine is probably not able to figure that out as well.
A lot of the time when I read these types of articles discussing machine learning technology, I get the same feeling of watching an infomercial. Beyond learning about NLP, I feel like I’m being sold it. The article discusses current use cases for NLP, but most of them involve either more efficient business practices or more curated internet media. The only one that seemed truly helpful for average people was help in prediction and diagnosis of disease. Spencer brought up how it’s strange to talk about these models without delving into the ethics of using them, and I totally agree. I really disliked how the article brought up “huge improvements in the access to data” like that was an uncontroversial, completely positive thing. This is not written for the average consumer who might be concerned with their data privacy, this is for NLP tech to be sold to businesses. They bring up Microsoft’s Tay at the end, one of the most memorable failures of AI in recent history. But this isn’t used as a warning, it’s shown as an example to see how far we’ve come, even though NLP models like ChatGPT are already being used by many dubious people today.
I think it’s a shame that the future of these types of language models is set in the breaching of private data and the efficiency of industry (which often involves the firing of humans and the creation of poorly written, pointless content), as I don’t think all use cases are problematic. Using these types of models to summarize text, clean up my feed, or warn me about potential heart disease sounds genuinely useful. I appreciate how the article explains that these NLP models are pretty simple to understand from a general standpoint, as a lot of other media about machine learning models treat them like magic. I just can’t get over the fact that they discuss NLP without acknowledging the breaching of private data, the internal biases AI models can inherit, and use cases beyond helping large corporations.
I never really considered the idea of NLP, and so I thought this was a good article to tell me about NLP and how it works. It was very interesting to read and one thought that crossed my mind was Google’s first AI bot that was put into Twitter. Seeing how it ended up being a failure as it became racists by taking in input from people they talked to made me wonder about society as a whole and how it could affect the world of technology. By learning more about society and technology and how both of these components affect each other, I became curious as to is society the reason why technology is biased? Because we see this even with algorithms for criminal recidivism where the data we input into the algorithm is our own and we see the flaw that it is quite biased. It is also interesting to see how we would be able to know which inputs are good for the bot and which ones aren’t. I also thought about the idea of different languages going into NLP. Is it only english? Do other languages work as well for things such as Alexa? How would they work? How will NLP be able to read several different languages? Is it through something as inaccurate as google translate? These ideas make NLP such a leading factor to technology but also a concern to how technology can be equal in society and a more viable option to use rather than humans for communication.
Reading about natural language processing was really interesting. I wasn’t aware of what exactly NLP was, and I was quite surprised to see how important it has become in today’s modern world. It was really surprising to see that there exists a field of artificial intelligence that gives machines the ability to read, understand, and derive meaning from human language and that such a field is used in many other fields. For example, I was surprised that NLP enables the prediction of diseases based on electronic health records and patients’ own personal speech. I think that this shows that the potential for this field is immense; it’s just a matter of time for it to become more reliable and accurate. However, there are many circumstances that need to be addressed before it reaches a state where it is fully functional and accurate.
Reading about the chatbot released on Twitter as an experiment in 2016 made me realize how most of the problems with these technologies might not be directly related to the technology itself but with the ways humans use language to interact. Language and words are powerful tools that can be really harmful, and sometimes even though those words are harmful, they can be. Also, it’s important to take into account that language evolves at a rapid pace; new expressions and connotations for words arise constantly, thus it’s important for the model to keep up with this.
Overall, I consider that NLP will be even more impactful in the future, but it is important to keep in mind the effects it can have socially.
Spencer hit on a lot of thoughts I had after reading this article about ethics. I am taking Teaching Writing this semester, and we just had a module on linguistic justice and valuing languages and dialects like Black English/Ebonics or Spanglish in academic environments as writers find their voices. So I wonder how NLP reckons with that, which the article does not really touch on. I also wonder how sentiment analysis comes into play. For example, the bullet point about NLP being able to predict diseases based on factors like a patient’s own speech may have adverse impacts especially regarding mental/behavioral disorders and neurodivergence. What is considered “aggressive” or “disoriented” speech? How do ideas of competency come into play when analyzing how a patient of color speaks versus a white person? Also, using “medications and treatment outcomes from patient notes” is also cause for concern because of medical racism and discrimination. A patient’s history may not point to anything because they have been neglected for so long.
I was also really struck by one of the examples of NLP applications given: “An inventor at IBM developed a cognitive assistant that works like a personalized search engine by learning all about you and then remind you of a name, a song, or anything you can’t remember the moment you need it to.” The other day I read a chapter from Edwin Black’s “IBM and the Holocaust” which is about how IBM punch cards were used to facilitate the collection and genocide of Jewish people. A cognitive assistant/personalized search engine is not going to directly facilitate another atrocity like the Holocaust, but I think it is worth tracing the lineage of these ideas of surveillance. The article associated with this bullet point is titled “A Search Engine for Your Memories,” and there is already so much research that proves that our memories can be unreliable. So how does this algorithm avoid our own personal biases if that is the only thing it can build off of? It seems like it would work to reinforce them rather than diminish them.
While I find language-related computational methods very cool, there are some aspects that give me pause. The article mentions that one of the use cases of NLP is the identification of fake news. One of the emerging use cases for this technology, however, is its application in things like deepfakes and the generation of fake news. Where is the line between developing this technology to combat misinformation and harnessing it for the very thing it purports to fight?
Also, as touched on by the article, many NLP methods are developed with English in mind. While languages are extremely systematic, they are not systematic in the same ways. This contributes to the greater proliferation of English on the internet, and to the assumption that the language is the lingua franca of this arena (which it kind of is), due to things like the relative ease of translation to and from it. When translating between languages, for example, it is more common to get correct (or slightly more correct) translations between languages that are similar to English (mostly European languages) than between languages that do not resemble it. While part of this is because there is much more data for English on the internet, it is also because these methods implicitly centre English and work best for it.
I was also thinking about the prescriptive vs. descriptive divide within linguistics. In the context of English, this would be the difference between, for example, standard English (how it "should" be spoken) and the various dialects it actually takes the form of. Since descriptive accounts of language are much more numerous and nuanced than prescriptive accounts, and therefore much harder to model, NLP might support the reproduction and proliferation of standard forms of English without the inclusion of dialects.
Finally, the article mentions that language, much like other forms of information, can have value extracted from it in that we can predict human behaviour. How does privacy play into this? Shoshana Zuboff talks about how there is a slippery slope between the prediction of human behaviour and the actual influencing of it, which is something that could be extended to the use cases of NLP.
Especially in the era of rapidly developing LLMs such as ChatGPT, which in under a year has gone from unreleased to such a force that it has its own place in the Academic Honesty policy, it is essential to consider NLP (Natural Language Processing), which describes the tasks these models perform. Much of the article read for today focuses on the technical aspects of NLP, defining terms like tokenization and describing challenges, like appropriately stemming words, that can impact the performance of these models. These technical descriptions are essential to an appropriate understanding of NLP's outcomes, as well as of what it is as a process.
However, the article's purpose is not exclusively to give a guidebook to NLP terminology and practices; it also addresses NLP futures and the perils that arise from NLP done wrong. Specifically, there is a focus on a Microsoft Twitter chatbot from 2016 called Tay, which had to be taken down within 16 hours because it had become "racist and abusive". Microsoft theorized that, by interacting with more users, Tay would become smarter and more nuanced, but instead Tay received a lot of racist and abusive interactions, which it parroted back. Much like many algorithms we've discussed, NLP is only as good as its data. Tay could only learn what it was told, and the environment it was exposed to caused its training to go south very quickly. Therefore, as computer scientists, in NLP as elsewhere, it is essential to consider not just the ethical impacts of what we ourselves are writing at face value, but how input, such as training data that comes from the larger, often biased, outside world, factors into our outcomes. It is not enough to be ethical in theory; this must be reflected in practice. As always, results matter.
The use of NLP in the healthcare industry to improve clinical documentation, and thus help healthcare workers improve the patient experience, sounds helpful to me. What I'm unsure about is what data is fed to these machines, given the racism/sexism embedded in a multitude of medical research and practices. If this is the data used for diagnosis, then these machines just continue the misdiagnosis of people without anyone being held accountable. Under less serious circumstances, I wonder how the constant changes in slang, other new phrases incorporated into language, and differences in meaning across languages would be accounted for in language models. I'm also wary of the use of a chatbot therapist, even if NLP advances greatly; in addition to lacking the human aspect of therapy, I am wary of what its responses to each individual would be, since experiences vary greatly.
In his article, Diego Lopez Yse gives a brief overview of the motivations for and techniques used in modern Natural Language Processing (NLP). To me, NLP is an exciting area of study, since the field has advanced rapidly in just the last few years, more so than any other technology I'm aware of. I remember trying out a state-of-the-art chatbot model on HuggingFace when I was in high school, and while it was sometimes amusing, it often produced completely nonsensical or ungrammatical responses. Fast forward to 2023, and a conversation with ChatGPT is often impossible to differentiate from one with a human being.
After reading about some of the techniques described, like stemming, tokenization, and lemmatization, I'm curious whether any of these low-level operations have been changed or improved upon by newer NLP models. If not, I'm curious how LLMs have been able to overcome the problems with tokenization, like ambiguity or the meaning of a sentence changing once it is split apart. I also think the underlying engineering challenges of processing billions or even trillions of input tokens must be immense, and I'm interested to learn how those challenges have shaped the development of these models.
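To make those low-level operations concrete, here is a minimal sketch of tokenization and stemming using the NLTK library. It is my own illustration, not code from the article; it assumes NLTK and its "punkt" tokenizer data are installed, and the sentence is invented. It shows "deja vu" being split into two tokens and a stemmer producing a non-word root, the kinds of problems discussed above.

    # Minimal tokenization and stemming sketch with NLTK (illustrative assumption:
    # NLTK is installed and the 'punkt' tokenizer data is available).
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer

    nltk.download("punkt", quiet=True)  # tokenizer models

    sentence = "The studies were studying; he had a sense of deja vu."

    # Tokenization: split the string into individual tokens. The multiword
    # expression "deja vu" becomes two separate tokens, and punctuation
    # becomes tokens of its own.
    tokens = word_tokenize(sentence)
    print(tokens)

    # Stemming: slice words down to a crude root. "studies" and "studying"
    # both reduce to "studi", which is not itself an English word.
    stemmer = PorterStemmer()
    print([stemmer.stem(t) for t in tokens])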
I thought this week's reading was very interesting. I liked getting to know exactly how NLP works and how it can do so many different things, like sentiment analysis, helping identify fake news, automating litigation tasks, and helping treat Alzheimer's. This reading left me wondering what the code would look like; I would love to see that. I was particularly impressed by the idea of lemmatization, in which a program resolves a word to its dictionary form while keeping track of its context, rather than just slicing off a stem; a small sketch of what that looks like in code follows below. Lastly, I thought it was crazy how the artificial intelligence chatbot released on Twitter became so racist.
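As a rough sketch of the lemmatization step mentioned above (my own toy example, not the article's code), NLTK's WordNet lemmatizer shows how the part of speech, that is, the word's grammatical context, changes the dictionary form a word resolves to. It assumes the WordNet data has been downloaded.

    # Lemmatization sketch with NLTK's WordNet lemmatizer (illustrative assumption:
    # NLTK is installed and the 'wordnet' data is available).
    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # lexical database the lemmatizer relies on

    lemmatizer = WordNetLemmatizer()

    print(lemmatizer.lemmatize("meeting", pos="n"))  # "meeting" -- already a dictionary noun
    print(lemmatizer.lemmatize("meeting", pos="v"))  # "meet"    -- as a verb it resolves to the base form
    print(lemmatizer.lemmatize("better", pos="a"))   # "good"    -- irregular adjective resolved via the dictionary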
At one point in the reading the author mentions that the use cases of natural language processing are more important than how it is done. I agree; the use cases are far more diverse than the methodologies of NLP. The comprehension and use of language seem like something a model would struggle greatly to keep up with. So many different terms are used for purposes that are contrary to their definitions. However, when I fire up ChatGPT, I find that it understands exactly what I mean every time I ask a question or make a statement. It then responds in clear and understandable language. I wonder if it would be able to keep up at mealtime in a conversation with friends, friends who like to create nicknames and slang. I wonder if NLP functions in a situation where language use is constantly evolving, where a new term for an item might be created and discarded in the very same sentence. I think this matters because eventually AI is going to be developed to operate in situations that require social competency, and social competency is not something that can be easily defined or implemented. Even if it can be described one day, the landscape may have changed overnight by the next.
Also, what about body language? How would somebody go about tokenizing or stemming body language for the purposes of NLP? Is NLP just limited to Western vocal communication right now?
I have a lot of interest in the relation (and the differences) between words and meaning, intention, etc., so this article was an interesting introduction to current trends and fascinations in the computational analysis of language. I was hoping for the article to go a bit deeper into the ethical concerns LLMs and NLP present, especially, as Sira mentioned, for other dialects and languages. Additionally, the separation of words into units by spacing shows the seeming anglocentrism of the field as it was described in the article. I am also curious about the use of Large Language Models like ChatGPT and how they can become a crutch for students, used to offload the valuable work of critical thinking, developing writing skills, etc. I have noticed this especially in some of my past CS courses, where I have seen classmates use ChatGPT or similar applications for the few written/reflective assignments we are given, even though these people are often the ones who could benefit the most from these endeavors.
At this point I've taken nearly as many linguistics courses at Grinnell as I have CS courses. Linguistics is my primary interest, and I've made sure to pay extra attention to the overlap between linguistics and CS. I understand this is an introductory article to NLP, so I won't fault it too much for its surface-level explanations. That said, I find it frustrating how much of the discourse around NLP is so focused on data processing, tokenization, and pattern analysis without much regard for grammar models and linguistic research. I know it's because robust linguistic analyses are much more expensive and time-consuming than simple ML algorithms, but we're trying to use NLP for some very important things like healthcare, and that is terrifying to me. Also, I do not believe for a moment that NLP technology is improving care delivery and bringing healthcare costs down for patients, at least not in the US. That has never been the goal for the companies investing in NLP. Much more likely is that NLP is being used to help automate claim denials for insurance companies.
“Your Guide to Natural Language Processing (NLP)” was a good introduction to a rapidly developing subfield at the intersection of computer science, linguistics, and psychology. In the past decade, I've been exposed to some applications of NLP in chatbots or recommendation systems, but I never put too much thought into the work behind the scenes because it felt too complex. The closest thing to NLP that I did was a CSC-151 project about sentiment analysis, where we determined the general sentiment or attitude of a text or a novel based on the frequency of positive and negative words. The list of application examples surprised me because I never expected that NLP could accomplish a lot more than I had known, including disease diagnoses, cognitive assistance, and fake news prevention. The fact that it can be used so widely in many industries is promising, but it's also concerning to think about the amount of information it requires and the knowledge it may acquire from human data. Although the techniques used to process and manipulate languages seem very challenging, especially when NLP is being applied in so many languages and used in more colloquial contexts, it does feel like the field has been advancing at a really fast pace given the widespread use of ChatGPT. I'm excited to learn more about the progress NLP professionals have made, and the remaining challenges and ethical concerns raised along the way.
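As a closing illustration of the word-counting style of sentiment analysis described above, here is a rough sketch in Python. It is my own toy example, with tiny made-up word lists standing in for a real sentiment lexicon, and is not the CSC-151 assignment or anything from the article.

    # Toy frequency-based sentiment scorer (the word lists are invented stand-ins
    # for a real sentiment lexicon).
    POSITIVE = {"good", "great", "happy", "love", "wonderful"}
    NEGATIVE = {"bad", "terrible", "sad", "hate", "awful"}

    def sentiment_score(text: str) -> int:
        """Return (# positive words) - (# negative words) for a piece of text."""
        words = text.lower().split()
        pos = sum(1 for w in words if w in POSITIVE)
        neg = sum(1 for w in words if w in NEGATIVE)
        return pos - neg

    print(sentiment_score("I love this wonderful book"))   # 2: leans positive
    print(sentiment_score("What a terrible, sad ending"))  # -1: "terrible," keeps its
    # trailing comma under a naive split and is missed -- exactly the kind of
    # tokenization detail the article warns about.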