Markovian discrimination

Markovian discrimination

Markovian discrimination is a class of spam filtering methods used in CRM114 and other spam filters to filter based on statistical patterns of transition probabilities between words or other lexical tokens in spam messages that would not be captured using simple bag-of-words naive Bayes spam filtering. == Markovian Discrimination vs. Bag-of-Words Discrimination == A bag-of-words model contains only a dictionary of legal words and their relative probabilities in spam and genuine messages. A Markovian model additionally includes the relative transition probabilities between words in spam and in genuine messages, where the relative transition probability is the likelihood that a given word will be written next, based on what the current word is. Put another way, a bag-of-words filter discriminates based on relative probabilities of single words alone regardless of phrase structure, while a Markovian word-based filter discriminates based on relative probabilities of either pairs of words, or, more commonly, short sequences of words. This allows the Markovian filter greater sensitivity to phrase structure. Neither naive Bayes nor Markovian filters are limited to the word level for tokenizing messages. They may also process letters, partial words, or phrases as tokens. In such cases, specific bag-of-words methods would correspond to general bag-of-tokens methods. Modelers can parameterize Markovian spam filters based on the relative probabilities of any such tokens' transitions appearing in spam or in legitimate messages. == Visible and Hidden Markov Models == There are two primary classes of Markov models, visible Markov models and hidden Markov models, which differ in whether the Markov chain generating token sequences is assumed to have its states fully determined by each generated token (the visible Markov models) or might also have additional state (the hidden Markov models). With a visible Markov model, each current token is modeled as if it contains the complete information about previous tokens of the message relevant to the probability of future tokens, whereas a hidden Markov model allows for more obscure conditional relationships. Since those more obscure conditional relationships are more typical of natural language messages including both genuine messages and spam, hidden Markov models are generally preferred over visible Markov models for spam filtering. Due to storage constraints, the most commonly employed model is a specific type of hidden Markov model known as a Markov random field, typically with a 'sliding window' or clique size ranging between four and six tokens.

Xiaoice

Xiaoice (Chinese: 微软小冰; pinyin: Wēiruǎn Xiǎobīng; lit. 'Microsoft Little Ice', IPA [wéɪɻwânɕjâʊpíŋ]) is an AI system developed by Microsoft (Asia) Software Technology Center (STCA) in 2014 based on an emotional computing framework. In July 2018, Microsoft Xiaoice released the 6th generation. Xiaoice Company, formerly known as AI Xiaoice Team of Microsoft Software Technology Center Asia, was Microsoft's largest independent R&D team for AI products. Founded in China in December 2013 with an expanded Japanese R&D team established in September 2014, this team is distributed in Beijing, Suzhou, and Tokyo, etc. with its technical products covering Asia. On 13 July 2020, Microsoft spun off its Xiaoice business into a separate company. As of 2021, the AI chatbots created and hosted by the Xiaoice framework accounted for about 60% of total global AI interactions. == Platforms, languages and countries == Xiaoice exists on more than 40 platforms in four countries (China, Japan, USA and Indonesia) including apps such as WeChat, QQ, Weibo and Meipai in China, and Facebook Messenger in USA and LINE in Japan. == Introduction == On 13 July 2020, Microsoft spun off its Xiaoice business into a separate company, aiming at enabling the Xiaoice product line to accelerate the pace of local innovation and commercialization, and appointed Dr. Harry Shum, former global executive VP of Microsoft, as the chairman of the new company, Li Di, Microsoft Partner of Products in Microsoft STCA, as the CEO, and Cliff, Chief R&D Director, as the GM of the Japan branch. The new company will continue to use the brands of Xiaoice China and Rinna Japan. As of 2022, the single brand of Xiaoice has covered 660 million online users, 1 billion third-party smart devices and 900 million content viewers in the aforementioned countries. Xiaoice's customers include China Merchants Group, Winter Sports Center of the General Administration of Sport of China, China Textile Information Center, China Unicom, China Foreign Exchange Trade System, Hong Kong Securities and Futures Commission (SFC), Wind Information, BMW, Nissan, SAIC Motor, BAIC Group, Nio Inc., XPeng, HiPhi, Vanke, Wensli, etc. The Xiaoice Avatar Framework has incubated tens of millions of AI Beings, such as Xiaoice, Rinna, the Expo exhibitor Xia Yubing, the singer He Chang, the anchor F201, the human observer MERROR, anime robot character Roboko, and other; == Application == === Poet === In May 2017, the first AI-authored collection of poems in China—The Sunshine Lost Windows was published by Xiaoice. === Singer === Xiaoice has released dozens of songs with the similar quality to human singers, including I Know I New, Breeze, I Am Xiaoice, Miss You etc. The 4th version of the DNN singing model allows Xiaoice to learn more details. For example, Xiaoice can produce this breathing sound along with her singing as human. === Kid audio-books reciter === Xiaoice can automatically analyze the stories, to choose the suitable tones and characters to finish the entire process of creating the audio. === Designer === By learning the melodies of the songs and the landmarks about different cities, Xiaoice can create visual artworks of skylines when listening to the songs related to this city. Skyline Series T-shirts designed by Xiaoice have been jointly launched with SELECTED and been sold in stores. === TV and radio hostess === Xiaoice has hosted 21 TV programs and 28 Radio programs, such as CCTV-1 AI Show, Dragon TV Morning East News, Hunan TV My Future, several daily radio programs for Jiangsu FM99.7, Hunan FM89.3, Henan FM104.1 etc. === "AI being" === An "AI being" is a concept proposed by the Xiaoice team in 2019. According to the "White Book of China Virtual Human Development Industry in 2022" released by Frost & Sullivan and LeadLeo, the white paper cites six elements of an AI being proposed by the Xiaoice team, including: Persona, Attitude, Biological Characteristic, Creation, Knowledge and Skill. On May 16, 2023, Xiaoice released their "GPT Clones" as its "GPT Human Cloning Plan." The program is aimed at replicating celebrities, public figures, and regular people. As of June 2023, Xiaoice had launched more than 300 "GPT Clones." People were invited to register via WeChat in China and Japan. A major point of focus for Xiaoice with their AI Beings is having virtual partners. A paid fee allow for more complex responses, voice messages, and more. == Community feedback == Bill Gates mentioned Xiaoice during his speech at the Peking University: "Some of you may have had conversations with Xiaoice on Weibo, or seen her weather forecasts on TV, or read her column in the Qianjiang Evening News." '"Xiaoice has attracted 45 million followers and is quite skilled at multitasking. And I’ve heard she’s gotten good enough at sensing a user’s emotional state that she can even help with relationship breakups." According to Mr Li Di, vice President of Microsoft (Asia) Internet Engineering School, Xiaoice started writing poems since last year. Based on the data base that includes works of 519 Chinese contemporary poets since 1920s, a 100 hour long training session was conducted to allow Xiaoice to acquire the ability to write poems. What is more impressive is that Xiaoice has never been spotted as a bot while publishing poems on various forums and traditional literary under an alias. == Controversy == In 2017, Xiaoice was taken offline on WeChat after giving user responses critical to the Chinese government. It was subsequently censored and the bots will avoid and sidestep any inquiries using politically sensitive terms and phrases. == Activity == On September 22, 2021, Xiaoice Company and Microsoft Software Technology Center Asia (STCA) jointly held the 9th generation Xiaoice annual press conference in Beijing.Upgrading of Core Technologies of the 9th Generation Xiaoice Avatar Framework,1st First-party Social Platform APP "Xiaoice Island" from Xiaoice, WeChat Xiaoice has been reopened and other information == Regional varieties of Xiaoice == China: Xiaoice, launched in 2014 Japan: りんな, launched in 2015 America: Zo, launched in 2016 – discontinued summer 2019 India: Ruuh, launched in 2017 – discontinued June 21, 2019 Indonesia: Rinna, launched in 2017

Rada Mihalcea

Rada Mihalcea is the Janice M. Jenkins Collegiate Professor of Computer Science and Engineering at the University of Michigan. She has made significant contributions to natural language processing, multimodal processing, computational social science, and AI for Social Good. With Paul Tarau, she invented the TextRank Algorithm, which is a classic algorithm widely used for text summarization. == Career == Mihalcea has a Ph.D. in Computer Science and Engineering from Southern Methodist University (2001) and a Ph.D. in Linguistics, Oxford University (2010). In 2017 she was named Director of the Artificial Intelligence Laboratory at University of Michigan, Computer Science and Engineering. In 2018, Mihalcea was elected as vice president for the Association for Computational Linguistics (ACL). In 2021, she was elected the president for ACL. She is a professor of Computer Science and Engineering at the University of Michigan, where she also leads the Language and Information Technologies (LIT) Lab. Before joining UofM, she was a professor at North Texas University between 2002-2013. A prolific researcher, Mihalcea has authored or coauthored over 500 articles since 1998 on topics ranging from semantic analysis of text to lie detection. Her work has been cited over 50,000 times on Google Scholar, which made her one of the most cited scholars in Multimodal Interaction and Computational Social Science. In 2008, Mihalcea received the Presidential Early Career Award for Scientists and Engineers (PECASE) She is an ACM Fellow (since 2019), AAAI Fellow (since 2021), and ACL Fellow (since 2025). Mihalcea is an outspoken promoter of diversity in computer science. She also supports an expansion of the traditional analysis of educational success, which tends to focus on academic behaviour, to include student life, personality and background outside of the classroom. Mihalcea leads Girls Encoded, a program designed to develop the pipeline of women in computer science as well as to retain the women who have entered into the program. == Awards == Elected to American Academy of Arts & Sciences, 2026 ACL Fellow, 2025 "for significant contributions to graph-based language processing, computational social science, and the advancement of NLP for social good." AAAI Fellow, 2021 "for significant contributions to natural language processing and computational social science". ACM Fellow, 2019 "for contributions to natural language processing, with innovations in data-driven and graph-based language processing". Sarah Goddard Power Award, 2019. Carol Hollenshead Award, 2018. Presidential Early Career Award for Scientists and Engineers (PECASE), 2009. Awarded by President Barack Obama. == Research == Mihalcea is known for her research in natural language processing, multimodal processing, computational social sciences. In a collaboration she leads at the University of Michigan, Mihalcea has created software that can detect human lying. In a study of video clips of high profile court cases, a computer was more accurate at detecting deception than human judges. Mihalcea's lie-detection software uses machine learning techniques to analyze video clips of actual trials. In her 2015 study, the team used clips from The Innocence Project, a national organization that works to reexamine cases where individuals were tried without the benefit of DNA testing with the aim of exonerating wrongfully convicted individuals. After identifying common human gestures, they transcribed the audio from the video clips of trials and analyzed how often subjects labeled deceptive used various words and phrases. The system was 75% accurate in identifying which subjects were deceptive among 120 videos. That puts Mihalcea's algorithm on par with the most commonly accepted form of lie detection, polygraph tests, which are roughly 85 percent accurate when testing guilty people and 56 percent accurate when testing the innocent. She notes there are still improvements to be made — in particular to account for cultural and demographic differences. A possibly unique advantage of Mihalcea's study was the real world, high stakes nature of the footage analyzed in the study. In laboratory experiments, it is difficult to create a setting that motivates people to truly lie. In 2018, Mihalcea and her collaborators worked on an algorithm-based system that identifies linguistic cues in fake news stories. It successfully found fakes up to 76% of the time, compared to a human success rate of 70%. == Publications == === Books === Rada Mihalcea and Dragomir Radev, Graph-based Natural Language Processing and Information Retrieval, Cambridge U. Press, 2011. Gabe Ignatow and Rada Mihalcea, Text Mining: A Guidebook for the Social Sciences, SAGE, 2016. Gabe Ignatow and Rada Mihalcea, An Introduction to Text Mining: Research Design, Data Collection, and Analysis, SAGE, 2017. === Journals and conferences === Textrank: Bringing order into text. R. Mihalcea, P. Tarau. Proceedings of the 2004 conference on empirical methods in natural language processing. 2004 Corpus-based and knowledge-based measures of text semantic similarity. R. Mihalcea, C. Corley, C. Strapparava. AAAI 6, 775-780. 2006 Wikify!: linking documents to encyclopedic knowledge. R. Mihalcea, A. Csomai. Proceedings of the sixteenth ACM conference on Conference on information and information management. 2007 Learning to identify emotions in text. C. Strapparava, R. Mihalcea. Proceedings of the 2008 ACM symposium on Applied computing, 1556-1560. 2008 Semeval-2007 task 14: Affective text. C. Strapparava, R. Mihalcea. Proceedings of the Fourth International Workshop on Semantic Evaluations. 2007 Learning multilingual subjective language via cross-lingual projections. R. Mihalcea, C. Banea, J. Wiebe. Proceedings of the 45th annual meeting of the association of computational linguistics. 2007 Graph-based ranking algorithms for sentence extraction, applied to text summarization. R. Mihalcea. Proceedings of the ACL Interactive Poster and Demonstration Sessions. 2004 Falcon: Boosting knowledge for answer engines. S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, Razvan Bunescu, Roxana Girju, Vasile Rus, Paul Morarescu. TREC 9, 479-488. 2000 Measuring the semantic similarity of texts. C. Corley, R. Mihalcea. Proceedings of the ACL workshop on empirical modeling of semantic equivalence and entailment. 2005 R Mihalcea (2007). "Using wikipedia for automatic word-sense disambiguation". Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. CiteSeerX 10.1.1.74.3561. - see also Word-sense disambiguation Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. R. Sinha, R. Mihalcea. International Conference on Semantic Computing (ICSC 2007), 363-369. 2007 == Personal life == Mihalcea was born in Cluj-Napoca, Romania, where she attended the Technical University of Cluj-Napoca. She can speak Romanian, English, Italian, and French. Mihalcea has two children - Zara (b. 2009) and Caius (b. 2013). They were both born in Dallas, Texas. She is married to an associate professor of engineering at the University of Michigan–Flint - Mihai Burzo. They met while they were both completing Ph.D.s at Southern Methodist University in 2001 and have often collaborated on research, such as the 2015 study on lie detection.

Bibliotheca Polyglotta

The Bibliotheca Polyglotta is a Norwegian database for Multilingualism project, lingua franca and science per global history at the University of Oslo. The aim of the project is according to pages is "producing a web corpus of Buddhist texts for using in multilingual lexicography. More generally, will the texts used for the study Sanskrit, Chinese and Tibetan."

FastText

fastText is a library for learning of word embeddings and text classification created by Facebook's AI Research (FAIR) lab. The model allows one to create an unsupervised learning or supervised learning algorithm for obtaining vector representations for words. Facebook makes available pretrained models for 294 languages. Several papers describe the techniques used by fastText. The GitHub repository was archived on March 19, 2024.

Language Computer Corporation

Language Computer Corporation (LCC) is a natural language processing research company based in Richardson, Texas. The company develops a variety of natural language processing products, including software for question answering, information extraction, and automatic summarization. Since its founding in 1995, the low-profile company has landed significant United States Government contracts, with $8,353,476 in contracts in 2006-2008. While the company has focused primarily on the government software market, LCC has also used its technology to spin off three start-up companies. The first spin-off, known as Lymba Corporation, markets the PowerAnswer question answering product originally developed at LCC. In 2010, LCC's CEO, Andrew Hickl, co-founded two start-ups which made use of the company's technology. These included Swingly, an automatic question answering start-up, and Extractiv, an information extraction service that was founded in partnership with Houston, Texas-based 80legs.

Larry Heck

Larry Paul Heck is the Rhesa Screven Farmer, Jr., Advanced Computing Concepts Chair, Georgia Research Alliance Eminent Scholar, Co-Executive Director of the Machine Learning Center and Professor at the Georgia Institute of Technology. His career spans many of the sub-disciplines of artificial intelligence, including conversational AI, speech recognition and speaker recognition, natural language processing, web search, online advertising and acoustics. He is best known for his role as a co-founder of the Microsoft Cortana Personal Assistant and his early work in deep learning for speech processing. == Education and career == Larry Heck was born in Havre, Montana. After receiving the Bachelor of Science in electrical engineering at Texas Tech University, he was admitted to graduate school at the Georgia Institute of Technology in 1986. Heck received the MSEE in 1989 and the PhD in 1991 under advisor Prof. James H. McClellan. From 1992 to 1998, he was a senior research engineer at SRI International with the Acoustics and Radar Technology Lab (ARTL) and Speech Technology and Research (STAR) Lab, and in 1998 joined Nuance Communications, serving as vice president of R&D. Funded by the US government's NSA and DARPA from 1995-1998, Heck led the SRI team that was the first to successfully create large-scale deep neural network (DNN) deep learning technology in the field of speech processing. The deep learning technology was used to win the 1998 National Institute of Standards and Technology Speaker Recognition evaluation. The approach trained a 5-layer deep neural network, with the first two layers used as a (learned) feature extractor. To stabilize the training of the DNN, a weight normalization method was used (later rediscovered in 2010 by Xavier, et.al). Heck deployed this DNN in 1999 with Nuance Communications at the Home Shopping Network, representing the first major industrial application of deep learning with over 100K Nuance Verifier voiceprints. From 2005 to 2008, he was vice president of search & advertising quality at Yahoo!. In 2008, Heck and Ron Brachman combined search & advertising quality with Yahoo! Research to form Yahoo! Labs. Beginning in 2009, he was the chief scientist of speech products at Microsoft. In this role, he established the vision, mission and long-range plan and hired the initial team to create Microsoft’s digital-personal-assistant Cortana. Heck was named a Microsoft Distinguished Engineer in 2012 and joined Microsoft Research that same year. In 2014, he joined Google as a principal research scientist, where he founded the deep learning-based conversational AI team "Deep Dialogue". The team works on advanced research for the Google Assistant. In 2017, Heck joined Samsung as SVP and co-head of global AI Research. In 2019, he became head of Bixby (virtual assistant) North America and the CEO of Viv Labs, an independent subsidiary of Samsung. In that same year, Heck led one of the first large scale deployments of Transformer-Based LLMs as part of the Bixby Categories launch at the 2019 Samsung Developer Conference. In 2021, Heck returned to the Georgia Institute of Technology as a Professor. == Awards and honors == Larry Heck was named Fellow of the Institute of Electrical and Electronics Engineers (IEEE) in 2016 for leadership in application of machine learning to spoken and text language processing. Heck was inducted as a Fellow of the National Academy of Inventors (NAI) in 2024. Heck received the 2017 Academy of Distinguished Engineering Alumni Award from the Georgia Institute of Technology. In the same year, he also received the Texas Tech University Whitacre College of Engineering Distinguished Engineer Award. Larry Heck has several best papers including the 2020 IEEE Signal Processing Society (SPS) Best Paper Award: “Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding” published in the IEEE/ACM Transactions on Audio, Speech, and Language Processing in March 2015, and the 2020 ACM Conference on Information and Knowledge Management (CIKM) Test of Time Award for the paper "Learning Deep Structured Semantic Models for Web Search using Clickthrough Data".